From Notebook to Production, Again: Applying MLOps Lessons to the AI Coding Boom
As enterprises integrate large language models and coding assistants into their development workflows, a critical distinction is emerging between vibe coding and agentic engineering. For any organization using AI assistance to write production code, understanding that gap isn't optional.
At Domino, our Solutions Engineers (SEs) and Forward Deployed Engineers (FDEs) work at this intersection every day. The pattern is consistent across regulated industries like life sciences, financial services, and insurance: a dazzling prototype means nothing if it can't scale, comply, and endure. The lesson is clear: AI is not a magic wand. It's a disciplined engineering partner, and treating it like one is what separates demos from deployable software.
Why vibe coding doesn’t survive production
Vibe coding is fast, intuitive, and conversational. A developer describes what they want, a Large Language Model (LLM) generates the code, and within minutes, there's a working demo. For exploration and early proof-of-concept work, it genuinely is valuable.
The problem is that organizations are accumulating what amounts to a prototype graveyard: impressive one-off applications never designed to move beyond the demo. They can't scale. They haven't been validated. They don't account for authentication, data quality, or regulatory requirements.
For our customers in life sciences, for example, every application touching a regulated workflow must comply with the FDA, EMA, and other regulatory bodies. Vibe coding doesn't have an answer for that. It was never meant to.
The AI software development gap is the MLOps problem in disguise
This pattern is not new. It is the defining problem that the data science industry spent a decade solving.
Data science teams once produced brilliant research notebooks and compelling model experiments that never made it into production. The models worked on a laptop and in a presentation. However, when it came time to deploy the code into real business processes at real scale, everything broke. No reproducibility. No version control. No monitoring. No governance.
That gap gave rise to MLOps: practices and platforms designed to create a repeatable, governed path from experimentation to production. MLOps didn't just solve a technical problem. It solved an organizational one, giving enterprises confidence that the models powering their decisions were validated, maintainable, and auditable over time.
The same pattern is now repeating with AI-assisted application development. Vibe coding is the new research notebook. Impressive in isolation, but without a structured path to production, those artifacts accumulate in the same graveyard where undeployed models once gathered dust.
The answer is the same, too. You need version control, testing, validation, and monitoring. You need a platform that enforces these disciplines across all applications, not just those built by the most disciplined teams. The organizations that learned this lesson through MLOps are better equipped than anyone to apply it now. The ones that didn't are about to learn it again.
What agentic engineering actually looks like
Agentic engineering treats AI as a structured collaborator throughout the entire software development lifecycle, rather than a single conversational exchange that produces code. It is iterative, multi-stage, and designed for production from the start. In many ways, it is the software development equivalent of what MLOps brought to model development: a disciplined framework that turns experimentation into something an enterprise can trust.
The most counterintuitive lesson: the majority of time in agentic engineering is not spent writing code. The bulk of the effort goes into specification and design up front, and into validation and quality assurance on the back end. The actual code generation, the part that vibe coding treats as the whole process, is the narrowest phase in the middle. When AI can generate code in seconds, human value shifts to defining what should be built, why it matters, and whether it actually works.
In practice, here's how our SE and FDE teams have been building with this methodology:
- Specification-first design. Every project starts with a detailed spec before any code. The spec is decomposed into epics and stories, then stress-tested across multiple LLMs, each acting in turn as a platform engineer, end user, security reviewer, and so on.
- Agentic implementation via Ralph loop. Each story runs through a structured prompt cycle grounded in SDLC best practices. The agent audits, plans, implements, and reviews until it meets production-grade standards.
- Comprehensive testing and validation. Layered unit, integration, and end-to-end coverage. Unit tests validate individual functions, integration tests verify components work across service boundaries, and end-to-end tests exercise the full workflow from the user's perspective. The standard is not "does it run" but "does it do what it was designed to do."
- Cross-model validation. Output assessed across multiple LLMs to catch what any single model might miss.
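To make the shape of that loop concrete, here is a minimal sketch of a driver for the agentic implementation step. Everything in it is illustrative: the `llm` client, its `complete` method, and the "APPROVED" completion signal are assumptions for the sketch, not Domino's actual tooling.

```python
# Hypothetical sketch of a Ralph-style loop driver. The `llm` client,
# its `complete` method, and the "APPROVED" signal are illustrative
# assumptions, not Domino's actual tooling.
from dataclasses import dataclass


@dataclass
class Story:
    story_id: str
    spec: str              # requirements and acceptance criteria from the plan
    complete: bool = False


def run_story(llm, story: Story, max_iterations: int = 8) -> Story:
    """Drive one story through audit -> plan -> implement -> review."""
    for _ in range(max_iterations):
        audit = llm.complete(f"Audit the codebase for this story:\n{story.spec}")
        plan = llm.complete(f"Plan the implementation. Audit findings:\n{audit}")
        change = llm.complete(f"Implement against this plan:\n{plan}")
        review = llm.complete(f"Review for production readiness:\n{change}")
        if "APPROVED" in review:  # agent signals production-grade standards met
            story.complete = True
            break
    return story
```

The point of the structure, not the specifics: each phase produces an artifact that feeds the next, and the loop only exits when the review gate is satisfied or the iteration budget runs out.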
The result is code that isn't just functional. It's auditable, maintainable, and ready for the regulatory and operational scrutiny that enterprise environments require.
What a production agentic engineering prompt actually looks like
To make this concrete, here is an example of the kind of structured prompt we run through the Ralph loop for each story in a development plan. This is not a casual vibe coding prompt. It is a production engineering workflow compressed into a single iterative cycle.
The prompt incorporates several best practices that have emerged in the agentic engineering space: defines a clear persona and expertise domain, provides explicit context and scope boundaries, requires a plan before any implementation, builds in self-critique and refinement gates, and treats every test failure as signal rather than noise. The key insight is that you are not asking the AI to generate code. You are instructing an autonomous agent to engineer a feature through the same rigorous cycle a senior development team would follow.
# Persona and Context
You are a world-class production software engineer with deep expertise in [domain/stack]. You are working within a structured development plan where each story has been defined in a detailed specification. Your role is to implement each story in [document] to production-grade standards. There is no room for shortcuts, incomplete implementations, stubs, or TODOs. Every decision you make should optimize for security, scalability, maintainability, and production readiness.
## Scope
You will be given a single story from the development plan. Your scope is limited to that story, but your awareness must span the full codebase. Refer to the specification and plan documents for requirements, acceptance criteria, and architectural context.
## Execution Steps
For each story, execute the following steps sequentially. Do not skip or compress steps. Complete each step fully before proceeding to the next.
### Step 1: Deep Audit
Audit the entire codebase as it relates to this story. Map every file, function, dependency, and interaction that this feature touches or should touch. Identify gaps, inconsistencies, and technical debt that could affect implementation. Document your findings before moving forward.
### Step 2: Unconstrained Planning
Develop a detailed implementation plan. Do not be constrained by the current implementation. If a larger refactor is required to implement this story correctly, include that in the plan. There should be zero duplication of code, zero fallback patterns, and zero workarounds. Define one correct, standardized approach with clean variable pass-through and full traceability.
### Step 3: Self-Critique and Iteration
Audit your own plan. Identify at least 5 to 10 specific, actionable ways to improve it. Consider security vulnerabilities, edge cases, performance bottlenecks, UX implications, and architectural alignment. Revise the plan. Repeat this critique-and-revise cycle until you are confident the plan represents the best possible approach.
### Step 4: Test-First Development
Before writing any implementation code, develop a comprehensive test suite across three layers:
- Unit tests that validate individual functions, logic paths, and data transformations in isolation.
- Integration tests that verify components interact correctly across service boundaries, APIs, and data flows.
- End-to-end tests that exercise the full user workflow for this story from input to output, validating that the solution performs the actual task it was designed to perform.
Pay special attention to edge cases, failure modes, boundary conditions, and data validation scenarios. Tests should verify behavior and outcomes, not implementation details. Every test should trace back to a requirement in the specification.
### Step 5: Production Implementation
Implement the code against your finalized plan. Write clean, well-commented, production-ready code. Every function should have a clear purpose. Every variable should have a meaningful name. Every decision should be traceable back to the plan. This code will be deployed to production and must meet the highest standards of quality.
### Step 6: Test, Diagnose, and Iterate
Run all tests across all three layers: unit, integration, and end-to-end (if possible). For every failure, diagnose the root cause thoroughly. The goal is not to make tests pass. The goal is to build a state-of-the-art solution that genuinely performs its intended function. Every failure is an opportunity to uncover a deeper issue and improve the overall architecture. Iterate until all tests pass and the solution is robust.
### Step 7: Production Readiness Review
Review all changes holistically. Evaluate against the following dimensions: security posture, scalability under load, UI/UX coherence, innovation and best-in-class patterns, observability and logging, error handling and recovery, and deployment readiness. Validate that the implementation fully satisfies the original story's acceptance criteria as defined in the specification.
### Step 8: Iterate or Complete
If the production readiness review identified improvements, return to Step 3 and incorporate them. Once all standards are met, log the implementation details in the spec, mark the story as complete, and proceed to the next story.
Why this prompt works
Several design choices make this prompt effective. The persona block establishes domain expertise and production expectations upfront, giving the AI a clear frame of reference for every decision. Every step has an explicit definition of done, preventing the agent from rushing or compressing phases. The self-critique gates at Steps 3 and 7 force the agent to evaluate its own work from multiple angles before moving forward, a pattern that research has shown reduces errors significantly compared to single-pass generation. The test-first approach at Step 4 creates an objective validation layer across unit, integration, and end-to-end levels, ensuring the solution actually performs the task it was designed to do. And the iterative loop back from Step 8 to Step 3 means the agent never settles for good enough, continuing to refine until the implementation meets a clearly defined production bar.
Notice where the weight sits. Steps 1 through 3 are entirely about understanding, planning, and design. Steps 6 through 8 are entirely about assessment, validation, and quality assurance. Step 5, the actual code generation, is a single step in an eight-step process. When the machine writes the code, the human and the process must own the thinking before and the verification after.
One practical detail worth calling out: LLMs degrade as their context window fills. Within the Ralph loop, we address this in two ways. Each iteration restates the agent's goals and acceptance criteria to keep the LLM focused. And we experiment with context resets after each iteration, giving the agent a clean slate with only the essential state carried forward. This keeps the agent sharp over long development cycles rather than slowly losing coherence.
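That context hygiene can be sketched as a simple prompt builder that restates the goals and acceptance criteria each iteration and carries forward only the essential state. The function and field names here are hypothetical, chosen for illustration:

```python
# Illustrative sketch of per-iteration context management: restate the
# goals every cycle and carry forward only essential state, so each
# iteration can start from a near-clean context window.
def next_iteration_prompt(goals: str, acceptance_criteria: str,
                          carryover: dict) -> str:
    """Build a fresh prompt for the next Ralph-loop iteration."""
    essential = "\n".join(f"- {k}: {v}" for k, v in carryover.items())
    return (
        f"Goals:\n{goals}\n\n"
        f"Acceptance criteria:\n{acceptance_criteria}\n\n"
        f"Essential state from prior iterations:\n{essential}\n"
    )
```

The design choice is what matters: rather than letting a long transcript accumulate, each cycle re-anchors the agent on the spec and passes along only what it needs, such as the last test failure or an open design decision.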
When this loop runs across every story in a plan, and results are cross-validated against multiple LLMs, the output is fundamentally different from what vibe coding produces. It is structured, tested, reviewed, and production-ready.
Velocity is not a strategy
When you can build high-quality software at high velocity, the instinct is to build more. But speed and quality make strategic discipline more important than ever, not less. When the friction of development drops dramatically, the risk shifts from "can we build this?" to "should we build this?" Without clear business objectives and measurable ROI, even well-engineered software becomes waste. It just gets produced faster.
This is another lesson from the MLOps playbook. Early enterprise data science teams spun up model after model without asking whether each was tied to a real business outcome. The result was hundreds of models in development and a handful in production, not because the models were bad, but because the business case was never established. MLOps maturity taught organizations to treat model development as an investment decision: define the value, establish success metrics, and only commit production resources to models that clear that bar.
The same pattern is emerging with AI-assisted software development. A team demonstrates that they can stand up an AI-powered system in days rather than months. Leadership gets excited. Requests multiply. And suddenly, the organization is building at a pace that outstrips its ability to validate whether any of it is driving business value. The prototype graveyard doesn't only fill up with bad code. It fills up with good code that nobody needs.
The discipline starts before the spec. Every project must first answer foundational questions: What is the business problem? Who are the users and what does success look like for them? How will we measure impact? What is the cost of building and maintaining this versus the value it delivers? When AI removes the bottleneck of development speed, the bottleneck moves upstream to strategy and prioritization. The organizations that recognize this will build fewer things that matter more. The ones that don't will build many things that matter very little.
What quality means when AI writes the code
This shift forces some uncomfortable questions about traditional software quality metrics. The conventional SDLC emphasizes test coverage, maintainability, readability, and human-centric code organization. But when an LLM is both writing and maintaining the code, do all of those heuristics still apply?
Some traditional principles become less critical. LLMs tend to work effectively with large single files rather than deeply nested structures. Repeated naming conventions matter less when the AI can instantly index and navigate a full codebase. But other principles, particularly test coverage and security, become more important. The cost of an undetected defect in AI-generated code is just as high as in human-written code. And in regulated industries, auditability is not optional, regardless of who or what wrote the software.
What changes is how we assess quality. Instead of relying solely on human code review, we build LLM-based assessment pipelines: automated audits that evaluate code against the same principles, but at machine speed and scale. Unit tests alone are not sufficient. Integration tests prove components work correctly across service boundaries. End-to-end tests prove the full workflow delivers the outcome it was designed for. We have seen too many AI-generated codebases where every unit test passes, but the system does not actually solve the user's problem because no one validated the behavior against the original spec. Layered testing anchored to the specification closes that gap. And human oversight remains a critical component. LLMs alone cannot be trusted to build the quality of testing that production software requires.
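As a sketch of what such an assessment pipeline might look like, the following combines the three test layers with an LLM audit against the original spec. The pytest markers, the injectable test runner, and the audit prompt are all illustrative assumptions, not a specific product's API:

```python
# Hypothetical sketch of a layered assessment pipeline: run each test
# layer, then ask an LLM to audit the change against the original spec.
# Commands, markers, and the audit prompt are illustrative assumptions.
import subprocess


def run_suite(command: list[str]) -> bool:
    """Run one test layer (e.g. a pytest marker) and report pass/fail."""
    return subprocess.run(command, capture_output=True).returncode == 0


def assess(llm, spec: str, diff: str, run_layer=run_suite) -> dict:
    """Layered checks plus a machine-speed audit anchored to the spec."""
    results = {
        "unit": run_layer(["pytest", "-m", "unit"]),
        "integration": run_layer(["pytest", "-m", "integration"]),
        "e2e": run_layer(["pytest", "-m", "e2e"]),
    }
    # The audit asks the question unit tests cannot answer on their own:
    # does this change actually satisfy the specification?
    verdict = llm.complete(
        f"Specification:\n{spec}\n\nChange:\n{diff}\n\n"
        "Does this change satisfy the specification? Answer PASS or FAIL."
    )
    results["spec_audit"] = "PASS" in verdict
    return results
```

The spec audit is the piece that closes the gap described above: passing tests prove the code runs, while the audit checks the behavior against what was actually asked for, with a human reviewing the results.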
Why the platform advantage is where this all comes together
This is where the MLOps analogy comes full circle. Just as MLOps platforms gave organizations a governed, repeatable path from model research to deployment, the right platform gives organizations the same path from prototype to production software.
When the platform handles authentication, environments, access controls, governance, and model deployment, those concerns disappear from the application code entirely. Engineers stop re-solving infrastructure for every project and focus on the domain logic that actually differentiates the solution. Domino adds a third option to the build versus buy equation: build the business value, not the plumbing.
That's a critical piece of how we move customers beyond the prototype graveyard toward a sustainable application portfolio: not by building more one-off apps faster, but by building on a platform that makes each application enterprise-grade by default. The platform is the path to production. Without it, every application is a custom journey. With it, production readiness becomes the default, not the exception.
Vibe coding got us started. The engineering gets us to production
The gap between vibe coding and agentic engineering is not just technical. It is strategic. Organizations that treat AI-assisted development as a novelty will keep producing demos. Organizations that invest in structured, iterative, multi-model engineering workflows grounded in clear business objectives will ship software that runs in production, passes regulatory review, and scales with the business.
The data science industry learned this lesson over the past decade. The path from research to production required more than better models. It required better processes, better governance, and better platforms. AI-assisted software development is at that same inflection point now.
For Domino's SE and FDE teams, this is already underway. We are not just using AI to write code faster. We are using it to build better and smarter, with the rigor, auditability, and production-mindedness that our customers' most demanding problems require.
Vibe coding got us started. The engineering is what gets us to production. And the strategy is what makes sure production was worth getting to in the first place.
Domino Professional Services
Domino's Solutions Engineering and Forward Deployed Engineering teams work directly with customers to solve their most complex data science and AI challenges.