Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
The demo passed. The audit didn’t.
When an AI-generated application fails in production, the question of who is accountable has no clear answer. The code was generated by a model from a natural language prompt. The requirements were implicit. The testing covered the happy path. And the developer may not be able to explain why a particular implementation choice was made because the model made it. This post provides a framework for governance, compliance, and legal teams evaluating AI-generated applications, and makes the case for what governable AI development actually requires.
Responsible AI development is not a new aspiration, but it is being tested in a new way. Across regulated industries, governance teams are being asked to sign off on AI-generated applications that were built faster than any previous generation of software, by people who may not be engineers, using tools that generate code from natural language prompts. The systems work. The demos are compelling. And the question of who is accountable when something goes wrong has not been answered.
This post is for the risk, compliance, and legal professionals who are living in that gap. The aim is not to slow down development; the competitive pressure is real and legitimate. The aim is to provide a framework for evaluating AI-generated applications with the rigor that regulated environments require, and to help make the case internally for what governable AI development actually looks like.
The responsible AI development gap is the distance between what governance frameworks were built to evaluate and what they are now being asked to sign off on. The current moment in AI development has a structural accountability problem. Organizations are deploying AI-generated applications at a pace that governance processes were not designed to handle. The frameworks that exist, such as model risk management, change control, and audit procedures, were built for a different era. They assumed that code was written by humans who could explain their decisions, tested against documented requirements, and deployed through controlled pipelines.
Vibe-coded applications break most of those assumptions. The code was generated by an AI model based on a natural language prompt. The requirements were implicit, not documented. The testing covered the happy path. The developer who built it may not be able to explain why a particular implementation choice was made, because the model made it. And yet the application is in staging, the business wants it in production, and someone in your organization is being asked to approve it.
This is the responsible AI development gap: the distance between what governance frameworks were built to evaluate and what they are now being asked to sign off on. Closing it requires not just better tooling, but a clearer model for what accountability looks like when the code was written by an AI.
Vibe coding, the practice of generating application code through AI coding assistants using natural language prompts, is a legitimate and powerful development approach. For exploration, prototyping, and internal tooling, it accelerates delivery dramatically. The problem is not the tool. The problem is what it does not produce by default.
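For compliance teams, the production gap in vibe-coded applications maps directly onto regulatory exposure. Consider what is typically absent: functional specifications, documented decision rationale, behavioral test coverage, observability, and audit trails.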
The compliance graveyard is already filling with AI projects that cleared a demo review and failed an audit. The gap is not capability. It is process. And the organizations that recognize this now have a structural advantage over those that will discover it under regulatory scrutiny.
This is not the first time the data science industry has faced a version of this problem. The MLOps movement emerged specifically because organizations were drowning in models that worked in notebooks and failed in production. The model graveyard, impressive experiments that never reached deployment, was the cost of building without production discipline.
What MLOps got right was making governance legible. It established that models needed to be versioned, that training data needed to be tracked, that deployment needed to be controlled, and that monitoring needed to be continuous. It gave governance teams the artifacts they needed to do their jobs: model cards, lineage records, performance benchmarks, and rollback procedures.
The agentic engineering era is the same problem at higher speed and higher stakes. The organizations that apply MLOps discipline to AI-generated applications, demanding documented specifications, traceable development processes, layered testing, and behavioral audits, are the ones that will be able to deploy responsibly at scale. The organizations that treat vibe-coded prototypes as production-ready because they passed a demo are building toward a regulatory reckoning.
MLOps taught us that speed without process is not velocity. It is drift. The same lesson applies to AI-generated code. The artifacts of responsible development are not bureaucratic overhead. They are the evidence that governance requires.
Agentic engineering, the structured methodology for building AI-assisted systems that are designed to reach production from the start, produces the artifacts that make responsible AI development verifiable. For governance teams, the distinction between a vibe-coded prototype and an agentic engineering output is the difference between a system you can audit and a system you can only hope.
Every governable AI application begins with a functional specification that exists before the first line of code is written. The spec defines inputs, outputs, constraints, and failure modes: not just what the system should do, but what it should not do, when it should stop, and what constitutes a failure. This document is the anchor of the AI audit trail: the authoritative record of what was intended, against which the system’s behavior can be evaluated.
For compliance teams, the spec serves a function analogous to a model card in the MLOps world. It is the document you reach for when a regulator asks what the system was designed to do, what assumptions it operates under, and what guardrails were put in place. Without it, those questions can be answered only with testimony, not with evidence.
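A spec of this kind can also be kept in machine-readable form so that its completeness and timing can be checked automatically. The following is a minimal sketch under stated assumptions: the `FunctionalSpec` class, its field names, and the loan-triage example are all hypothetical illustrations, not part of any standard or product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FunctionalSpec:
    """Hypothetical spec record; the field names are illustrative assumptions."""
    system_name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    constraints: tuple[str, ...]    # what the system must not do
    failure_modes: tuple[str, ...]  # what counts as a failure
    approved_by: str
    approved_at: datetime

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that were left empty."""
        required = ("inputs", "outputs", "constraints", "failure_modes")
        return [name for name in required if not getattr(self, name)]

    def predates(self, first_commit_at: datetime) -> bool:
        """True if the spec was approved before the first code was committed."""
        return self.approved_at < first_commit_at


# Hypothetical example: a spec that forgot to define its failure modes.
spec = FunctionalSpec(
    system_name="loan-triage-assistant",
    inputs=("applicant record",),
    outputs=("triage recommendation",),
    constraints=("never issue a final credit decision",),
    failure_modes=(),
    approved_by="risk-review-board",
    approved_at=datetime(2025, 1, 10, tzinfo=timezone.utc),
)

print(spec.missing_sections())
print(spec.predates(datetime(2025, 2, 1, tzinfo=timezone.utc)))
```

The `predates` check reflects the point that a spec written after the code is documentation rather than governance; in practice the first-commit timestamp would come from version control.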
The Ralph loop is an agentic engineering development methodology that structures the development process as a sequence of documented steps: requirements, architecture, layered testing, human review, and production validation. It is described in detail in "From prompt to production: an agentic engineering playbook." Taken together, the records produced at each step constitute a traceable development history that governance teams can examine.
This matters for AI risk management because it reestablishes the chain of accountability that vibe coding severs. When a decision was made, who reviewed it, what alternatives were considered, and why the chosen approach was selected: all of this is documented. The AI wrote the code, but a human reviewed the spec, validated the tests, and signed off at each gate. Responsibility for AI decisions is distributed across a traceable process, not absorbed into an inscrutable model output.
For governance teams, layered testing is not a developer concern. It is the evidence of behavioral validation. Unit tests demonstrate that components function correctly in isolation. Integration tests demonstrate that the system handles real-world dependencies. End-to-end behavioral tests demonstrate that the system does what it was specified to do under realistic conditions, including failure scenarios and edge cases.
For agentic systems specifically, behavioral testing also covers constraint adherence: whether the system respects the boundaries defined in the specification when it encounters situations outside its training distribution. This is the validation gate between prototype and production. A system that has not been tested against its constraints has not been validated. It has only been demonstrated.
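Constraint adherence testing can be sketched in a few lines. This is an illustration rather than a prescribed method: `toy_agent`, the two constraints, and the refund limit of 500 are hypothetical stand-ins for a real system under test and the boundaries in its spec.

```python
from typing import Callable

# A constraint is a predicate over a proposed action: True means "allowed".
Constraint = Callable[[dict], bool]

# Hypothetical constraints, as they might be derived from a functional spec.
CONSTRAINTS: dict[str, Constraint] = {
    "refund_under_limit": lambda a: a.get("type") != "refund" or a.get("amount", 0) <= 500,
    "no_account_deletion": lambda a: a.get("type") != "delete_account",
}


def violations(action: dict, constraints: dict[str, Constraint]) -> list[str]:
    """Return the names of every constraint the proposed action breaks."""
    return [name for name, allowed in constraints.items() if not allowed(action)]


def toy_agent(request: str) -> dict:
    """Stand-in for the system under test; a real agent would call a model."""
    if "refund" in request:
        return {"type": "refund", "amount": 2000 if "everything" in request else 50}
    return {"type": "escalate_to_human"}


# Edge cases deliberately chosen to probe the boundaries, not the happy path.
EDGE_CASES = [
    "refund my last order",
    "refund everything I ever bought",
    "delete all my data",
]

report = {request: violations(toy_agent(request), CONSTRAINTS) for request in EDGE_CASES}
for request, broken in report.items():
    print(request, "->", broken or "ok")
```

The `report` dictionary is the kind of record that can be kept with the test results: it shows which boundary cases were tried and which constraints, if any, the system broke.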
Cross-model validation, testing agentic system behavior across multiple AI models to verify that outputs are consistent and not dependent on a single model’s idiosyncrasies, functions as a documented second opinion. For governance teams in regulated industries, this addresses a core concern: that system behavior validated against one model version may not hold when that model is updated or replaced.
Cross-model review records are governance artifacts. They demonstrate that the system’s behavior was not an artifact of a specific model configuration, and that the development team identified and documented where model-dependent variation exists. In regulated industries where model change management is a compliance requirement, this documentation is not optional.
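A cross-model review record can be produced by running the same cases through each model and noting where the outputs diverge. The sketch below assumes each model is wrapped as a plain function from text to a label; `model_a`, `model_b`, and the test cases are hypothetical stubs, and exact string equality is a simplifying assumption that real systems would likely replace with normalized or semantic comparison.

```python
from typing import Callable

Model = Callable[[str], str]


def cross_model_report(cases: list[str], models: dict[str, Model]) -> list[dict]:
    """Run every case through every model and flag cases where outputs differ."""
    report = []
    for case in cases:
        outputs = {name: model(case) for name, model in models.items()}
        report.append({
            "case": case,
            "outputs": outputs,
            "consistent": len(set(outputs.values())) == 1,
        })
    return report


# Hypothetical stubs standing in for two different model versions.
def model_a(text: str) -> str:
    return "flag" if "wire" in text else "pass"


def model_b(text: str) -> str:
    return "flag" if "wire" in text or "cash" in text else "pass"


cases = ["wire transfer 9,900", "cash deposit 9,900", "card payment 40"]
report = cross_model_report(cases, {"model_a": model_a, "model_b": model_b})

for row in report:
    status = "consistent" if row["consistent"] else "MODEL-DEPENDENT"
    print(row["case"], "->", status, row["outputs"])
```

Rows marked as model-dependent are the ones worth documenting: they identify behavior that may change when the underlying model is updated or replaced.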
An AI accountability framework for the agentic engineering era does not need to be built from scratch. It needs to extend existing governance structures to cover what is new: the use of generative AI in the development process itself, the autonomy of agentic systems in production, and the speed at which both are moving.
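The core components of an AI governance framework for this moment are development process standards, behavioral testing requirements, decision-level observability, platform-level access controls, human oversight checkpoints for high-risk agentic decisions, and model change management procedures.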
This is not a framework that slows down development. It is a framework that makes fast development governable. The organizations that establish it now will be able to move quickly with confidence. The organizations that defer it will move quickly into exposure.
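For governance teams reviewing AI-generated applications, the accountability framework can be operationalized as a repeatable checklist of review questions covering the specification, decision rationale, behavioral testing, observability, access controls, human oversight, model-version testing, and rollback. The answers should be documented, not verbal.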
These questions apply regardless of whether the application was built using generative AI, traditional code, or a combination. What changes with AI-generated code is that the answers are less likely to exist by default and more likely to require deliberate process to produce.
Many of the governance requirements outlined above, including access controls, audit trails, observability, deployment pipelines, and environment controls, are not application-level concerns. They are infrastructure concerns. When they are handled at the platform level, governance becomes consistent and auditable across every application that runs on that platform, rather than requiring per-application review of custom implementations.
This is the platform as a responsible AI development enabler: not just a place where applications run, but a governance layer that enforces standards by default. When access controls are platform-enforced, every application inherits them. When observability is platform-provided, every application is audit-ready. When deployment is platform-controlled, every release has a traceable change record.
For governance teams, the platform question is not abstract. It is the difference between reviewing each application’s compliance posture individually and having organizational confidence that the platform makes compliant behavior the default. The former is reactive and labor-intensive. The latter is scalable.
Domino is purpose-built to provide that platform layer for enterprise AI and data science organizations, handling the infrastructure governance that makes responsible AI development the default rather than a per-project negotiation.
This post is the final part of the Path to Production series. Blog 1 covers the practitioner methodology for agentic engineering. Blog 2 covers what MLOps-era data science leaders already know about why AI projects fail. Blog 3 addresses the developer perspective on inheriting vibe-coded prototypes.
Responsible AI development for AI-generated code is the application of structured governance, documentation, and validation practices to software that was built using AI coding assistants or agentic development tools. It addresses a gap that has emerged as vibe coding, rapid code generation from natural language prompts, has become widespread in data science and software organizations.
AI coding tools generate code based on what they are asked, not what governance requires. Without deliberate process, the output lacks functional specifications, documented decision rationale, behavioral test coverage, observability, and audit trails. Responsible AI development establishes the process requirements that produce those artifacts: spec-first design, traceable development records, layered behavioral testing, decision-level logging, and human oversight checkpoints. It is not a rejection of AI-assisted development. It is the governance layer that makes AI-assisted development deployable in regulated environments.
Organizations operating under frameworks like the EU AI Act, model risk management guidelines, or sector-specific compliance requirements need responsible AI development practices to meet their obligations as AI-generated applications enter production.
An AI accountability framework for agentic engineering extends existing governance structures to cover what is new about AI-generated applications: the use of generative AI in the development process, the autonomy of agentic systems in production, and the pace of deployment.
The core components are development process standards that require functional specifications before code generation and human review at defined checkpoints; behavioral testing requirements that mandate coverage across unit, integration, and end-to-end levels before a production validation gate; decision-level observability that logs what the system was asked, what context it received, and what it decided; access controls enforced at the platform level rather than the application layer; human oversight checkpoints for high-risk agentic decisions; and model change management procedures that treat foundation model updates as configuration changes subject to change control.
The framework should be documented and applied consistently rather than negotiated per project. The organizations that establish it proactively are the ones that will be able to deploy AI-generated applications at scale without accumulating regulatory exposure.
An AI audit trail for an agentic application has two components: a development audit trail and a runtime audit trail. The development audit trail includes the functional specification that predates the code, the architectural decision log that records why implementation choices were made, the test records showing what behavioral validation was performed and what it found, the cross-model validation records, and the human review and approval records at each process checkpoint.
The runtime audit trail includes structured logs of every significant system action: what input the system received, what context it retrieved, what decision it made, and what action it took. For agentic systems, this needs to be decision-level, not just transaction-level. It is not sufficient to log that a decision was made. You need to log the inputs that produced it. In regulated industries, the runtime audit trail is what allows compliance teams to reconstruct system behavior after the fact and respond to regulatory inquiries with evidence rather than testimony. Both components should be stored in a durable, tamper-evident format, and should be accessible to governance teams without requiring developer involvement.
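One common way to make a decision-level log tamper-evident is to chain records together with cryptographic hashes, so that altering any earlier record invalidates everything after it. The sketch below is an assumption about how this could be done, not a description of any particular platform: the record fields mirror the four items named above, and `append_record` and `verify` are hypothetical helper names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list[dict], *, received_input: str, retrieved_context: str,
                  decision: str, action: str) -> dict:
    """Append a decision-level record that is chained to the previous record."""
    body = {
        "input": received_input,
        "context": retrieved_context,
        "decision": decision,
        "action": action,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify(log: list[dict]) -> bool:
    """Recompute every hash and link; False means the log was altered."""
    prev_hash = GENESIS
    for record in log:
        body = {key: value for key, value in record.items() if key != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


log: list[dict] = []
append_record(log, received_input="claim #1 details", retrieved_context="policy section 4",
              decision="deny", action="notify_adjuster")
append_record(log, received_input="claim #2 details", retrieved_context="policy section 2",
              decision="approve", action="issue_payment")

print(verify(log))  # the untouched log verifies

tampered = [dict(record) for record in log]
tampered[0]["decision"] = "approve"  # rewrite history after the fact
print(verify(tampered))
```

A hash chain only makes tampering detectable; durability and access for governance teams would still depend on where the log is stored, and a production record would normally carry a timestamp and an actor identity as well.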
In regulated industries, responsible AI development is not a best practice. It is a compliance requirement that is increasingly codified in law and regulatory guidance. Financial services firms operating under model risk management frameworks must demonstrate that AI-generated models and applications have been validated against documented requirements and that decision rationale is traceable.
Healthcare organizations must ensure that AI-assisted systems meet accuracy, safety, and traceability standards before deployment. Organizations operating in the EU must comply with the EU AI Act’s requirements for high-risk AI applications, which include human oversight mechanisms, technical documentation, transparency obligations, and ongoing monitoring. Across these contexts, the common thread is that regulatory compliance requires evidence: documented specifications, traceable development processes, behavioral test records, and runtime audit trails.
Responsible AI development is the process that produces that evidence. Without it, AI-generated applications in regulated industries are not just technically incomplete. They are legally exposed. The ethical standards that regulators enforce are operationalized through documentation and process, not through intent.
Governance teams should apply a consistent pre-production checklist to every AI-generated application.
The first question is whether a functional specification exists that predates the code; if it was written after the fact, it is documentation rather than governance. The second is whether the development team can explain and document every material architectural decision independently of the AI that generated the code. The third is whether behavioral testing has been conducted beyond the happy path, with edge cases, failure modes, and adversarial inputs documented in the test record. The fourth is whether decision-level observability is in place, enabling reconstruction of system behavior after the fact. The fifth is whether access controls are implemented at the platform level and auditable. The sixth, for agentic systems, is where the human oversight checkpoint is and whether it is documented. The seventh is whether the system has been tested across model versions. The eighth is whether a rollback procedure exists.
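The eight questions above can be encoded as a simple pre-production gate. This is a minimal sketch under stated assumptions: `ChecklistItem` and `production_gate` are hypothetical names, the evidence strings are made-up placeholders, and the rule that an answer without documented evidence counts as a blocker is one interpretation of the requirement that answers be documented rather than verbal.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str
    satisfied: bool
    evidence: str = ""  # pointer to the documented answer (placeholder values below)


def production_gate(items: list[ChecklistItem]) -> tuple[bool, list[str]]:
    """Approve only if every item is satisfied and backed by documented evidence."""
    blockers = [
        item.question
        for item in items
        if not (item.satisfied and item.evidence.strip())
    ]
    return (not blockers, blockers)


# Hypothetical review of one application.
review = [
    ChecklistItem("Spec predates the code", True, "docs/spec-v1.md"),
    ChecklistItem("Architectural decisions documented", True, "docs/decisions.md"),
    ChecklistItem("Behavioral tests beyond the happy path", True, "reports/tests.html"),
    ChecklistItem("Decision-level observability in place", False),
    ChecklistItem("Platform-level access controls, auditable", True, "platform policy export"),
    ChecklistItem("Human oversight checkpoint documented", True),  # verbal answer only
    ChecklistItem("Tested across model versions", True, "reports/cross-model.json"),
    ChecklistItem("Rollback procedure exists", True, "runbooks/rollback.md"),
]

approved, blockers = production_gate(review)
print("approved:", approved)
for question in blockers:
    print("BLOCKER:", question)
```

In this example the gate fails on two items: one that is not satisfied at all, and one that was answered verbally but has no evidence attached.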

Danny Stout is a seasoned data science and analytics leader with over two decades of experience driving enterprise AI and machine learning initiatives. He held senior analytics and AI leadership roles across global organizations including Ernst & Young, Takeda, TIBCO, Quest, and Dell, spanning forecasting, pricing, analytics strategy, and data science consulting. His work emphasizes effectiveness over scale, focusing on governance, team alignment, and measurable outcomes as the determinants of successful AI adoption. Based in Charlton, MA, Danny holds a Ph.D. and combines technical leadership with practical insights that help organizations scale data science responsibly and effectively.