How to Run an AI Proof-of-Concept That Actually Predicts Production

Every POC in the last three years has passed. None of the corresponding production deployments have shipped on time. At some point the question isn't whether you're running good POCs — it's whether you're running the right ones.

The conventional AI proof-of-concept is a confidence-building exercise. It answers one question: can this model produce useful output? It answers that question under the best possible conditions — curated data, controlled environment, a team motivated to show success. And it almost always answers yes.

The problem is that nobody deploying AI at a financial institution is asking whether the idea can work. They're asking whether it will work, in their environment, with their data, through their integration stack, with their operational team, subject to their governance requirements. The conventional POC doesn't answer that question. It isn't designed to.

What a conventional POC tests

The standard POC runs a model on a prepared dataset, measures performance metrics, and produces a presentation. The dataset has been cleaned by the data science team — outliers removed, missing values handled, the relevant fields populated. The presentation shows AUC, precision, recall, accuracy improvement over baseline. The executive sponsor approves the next phase.

None of that is wrong. The model probably does work under those conditions. The problem is that production conditions are different in every dimension that matters: the data is messier, the integration is real, the operational team is using the output in a workflow that was never part of the POC, and model risk management is asking questions nobody thought to address during the proof-of-concept phase.

A successful conventional POC tells you the model is technically viable. It tells you almost nothing about whether the production deployment will succeed.

What a production-predictive POC tests instead

Four things change when you redesign the POC around production prediction.

Production data, not curated data. The single most predictive test is running the model on data exactly as it exists in your systems: missing values, inconsistent encoding, stale records, fields that are populated in the training set but empty in the live feed. More AI programs have failed in the gap between curated POC data and actual production data than from any modeling error. A model that performs at 89% accuracy on cleaned data and 67% on the live system did not degrade. It was never tested under conditions that resembled production.
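One way to make that gap visible before scoring anything is to compare how often each field is actually populated in the curated extract versus the live feed. The sketch below is a minimal illustration, not a prescribed method: the record layout, the field names, the set of "missing" markers, and the 10-point threshold are all assumptions chosen for the example.

```python
from typing import Any

# Values treated as "missing". This set is an assumption; adjust it to
# whatever your source systems actually use for empty fields.
MISSING = (None, "", "N/A")


def missing_rate(records: list[dict[str, Any]], field: str) -> float:
    """Share of records where `field` is absent or holds a missing marker."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in MISSING)
    return missing / len(records)


def population_gaps(
    curated: list[dict[str, Any]],
    live: list[dict[str, Any]],
    fields: list[str],
    threshold: float = 0.10,  # hypothetical tolerance, not a standard
) -> dict[str, float]:
    """Fields whose live missing rate exceeds the curated rate by more than `threshold`."""
    gaps = {}
    for field in fields:
        delta = missing_rate(live, field) - missing_rate(curated, field)
        if delta > threshold:
            gaps[field] = round(delta, 3)
    return gaps


# Hypothetical records for illustration only.
curated = [
    {"income": 52000, "employer": "Acme"},
    {"income": 61000, "employer": "Globex"},
]
live = [
    {"income": 52000, "employer": ""},
    {"income": None, "employer": None},
    {"income": 48000, "employer": "N/A"},
    {"income": 75000, "employer": "Initech"},
]

gaps = population_gaps(curated, live, ["income", "employer"])
print(gaps)
```

A non-empty result is a finding in its own right: it names the fields the model was trained to rely on that the live system does not reliably supply.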

The integration path, even if simulated. You don't need to build the full integration to test it. You need to know whether your operational systems can surface the data the model requires at inference time, at the latency the use case demands, with the reliability the production environment needs. A manual simulation — pulling data from the relevant systems by hand and feeding it to the model — is not as good as a real integration, but it surfaces the gaps. The teams that discover at this stage that the required data lives in three different systems with different refresh cycles and one of them is batch-updated nightly are lucky. The teams that discover it six months into production integration are not.
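The refresh-cycle problem described above can be checked on paper before any integration work starts. The following is a rough sketch under stated assumptions: the system names, the refresh intervals, and the two-hour freshness requirement are all hypothetical and stand in for whatever your data engineers report.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    refresh_minutes: int  # worst-case age of the data between refreshes


def stale_sources(sources: list[Source], max_age_minutes: int) -> list[str]:
    """Names of sources whose worst-case data age exceeds what the use case tolerates."""
    return [s.name for s in sources if s.refresh_minutes > max_age_minutes]


# Hypothetical inventory: three systems, one of them batch-updated nightly.
sources = [
    Source("core_banking", refresh_minutes=5),
    Source("crm", refresh_minutes=60),
    Source("risk_warehouse", refresh_minutes=24 * 60),
]

# Assumption: the use case needs inputs that are at most two hours old.
blockers = stale_sources(sources, max_age_minutes=120)
print(blockers)
```

Even a table this small forces the conversation about which system sets the effective freshness of the model's inputs.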

The operational workflow. Bring in actual users (even a small group, even informally) and watch them interact with the output. The goal is not to test their acceptance but to test whether the system's output format, timing, and volume fit into how they actually work. The fraud analyst who receives 200 flags per day and has 30 seconds per flag will tell you immediately whether the model's output is usable. The loan officer who needs an explainable recommendation will tell you whether the model's confidence score means anything to her. You cannot design for that without asking. You definitely cannot design for it by having the data science team imagine what the operational team needs.
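The fraud analyst example is simple arithmetic, and it is worth writing down before the first user session. In this sketch, the 200 flags and 30 seconds come from the example above; the 90-minute daily review budget is an assumption added purely for illustration.

```python
def review_minutes(flags_per_day: int, seconds_per_flag: int) -> float:
    """Total daily review time, in minutes, implied by the model's flag volume."""
    return flags_per_day * seconds_per_flag / 60


def fits_capacity(flags_per_day: int, seconds_per_flag: int, budget_minutes: float) -> bool:
    """True if the implied review time fits inside the analyst's daily budget."""
    return review_minutes(flags_per_day, seconds_per_flag) <= budget_minutes


needed = review_minutes(200, 30)
# Hypothetical budget: the analyst can spare 90 minutes a day for flag review.
usable = fits_capacity(200, 30, budget_minutes=90)
print(needed, usable)
```

If the implied review time exceeds the budget, the volume of output is a design problem regardless of how accurate each flag is.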

The governance questions. Before any AI system goes to production at a regulated financial institution, it has to get through model risk management (MRM), legal review, and potentially regulatory disclosure. A POC that doesn't answer the three questions MRM will ask (what decisions does this affect, how will drift be detected, what does the audit trail show) produces a deliverable that has to be rebuilt before it can be validated. That rebuilding is where programs stall. The institutions that include MRM in the POC design conversation, not the post-POC validation review, cut this timeline significantly. The ones that treat MRM as a gate to clear after the technical work is done spend six months answering questions they could have answered in week two.
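Of the three MRM questions, drift detection is the one most easily prototyped inside the POC. The sketch below uses the Population Stability Index (PSI), which is my choice of illustration rather than anything this article or a given MRM team prescribes. The bin edges, the sample scores, and the epsilon floor for empty bins are assumptions. A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold your institution accepts is a question for MRM.

```python
import bisect
import math


def bin_shares(values: list[float], edges: list[float]) -> list[float]:
    """Share of values in each bin; out-of-range values go to the nearest end bin."""
    if not values:
        raise ValueError("values must not be empty")
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        counts[min(max(idx, 0), n_bins - 1)] += 1
    # Floor each share at a small epsilon so empty bins do not produce log(0).
    return [max(c / len(values), 1e-6) for c in counts]


def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population Stability Index between a baseline sample and a recent sample."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical model scores: POC baseline versus a recent production window.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.6, 0.7, 0.9]
recent = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.95]

drift = psi(baseline, recent, edges)
print(f"PSI = {drift:.2f}")
```

The point of prototyping this during the POC is less the number itself than the questions it forces: where the baseline comes from, how often the check runs, and who is paged when it trips.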

The five questions to answer before the POC starts

A production-predictive POC is designed around the questions it needs to answer, not the metrics it needs to demonstrate. Before building anything:

1. What specific decision or process will this system affect in production, and what does a wrong output cost? Not the general use case — the specific workflow, the specific role, the specific decision point. "Credit underwriting" is not specific enough. "The commercial credit analyst's initial assessment on loans between $500K and $2M, where a wrong recommendation affects the approval queue for up to 48 hours" is specific enough to design a POC around.

2. What data does the model need at inference time — and does that data actually exist, in the required format, accessible to the system at the required latency? This question needs to be answered by your data engineers, not your data scientists. The data scientist knows what the model needs. The data engineer knows what you actually have.

3. Who will use the output, in what workflow, and how will they know when to override it? If you can't name the specific role and describe the specific workflow before the POC starts, the POC is testing something hypothetical. The operational co-design work described in the change management article belongs here, not after the model is built.

4. How will the institution know if the model is degrading — and who is responsible for acting on that signal? Monitoring is not a post-production concern. A model that you cannot monitor adequately cannot get through MRM — and it shouldn't. The monitoring design belongs in the POC, both because MRM will ask about it and because building it in is far cheaper than retrofitting it.

5. What documentation will MRM require, and can the model's behavior be described in those terms? This question requires a conversation with MRM before the POC is designed, not after. The institutions that have that conversation early find that MRM is not obstructionist; it has specific, answerable requirements. The institutions that treat MRM as a review gate find that MRM asks questions the model documentation cannot answer, because the documentation was written for a different audience.

If you can't answer all five before the POC starts, the design is incomplete. Running the POC without those answers doesn't just risk failure — it risks a passed POC that tells you nothing useful.
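The five questions can be treated as a literal gate on the POC design document. The sketch below is one possible shape for that gate, not a required template: the class, the field names, and the sample answers are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class PocDesign:
    # One free-text answer per question, in the order listed above.
    decision_and_error_cost: str = ""
    inference_data_availability: str = ""
    user_workflow_and_override: str = ""
    degradation_monitoring_owner: str = ""
    mrm_documentation: str = ""


def unanswered(design: PocDesign) -> list[str]:
    """Names of the questions that still have no answer."""
    return [f.name for f in fields(design) if not getattr(design, f.name).strip()]


# Hypothetical draft: two questions answered, three still open.
draft = PocDesign(
    decision_and_error_cost="Commercial credit analyst's initial assessment, $500K to $2M loans",
    inference_data_availability="Confirmed by data engineering for two of three source systems",
)

open_questions = unanswered(draft)
print(open_questions)
```

A design with any open questions is, by the standard above, not ready to run.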

Design for failure

The most useful thing a production-predictive POC can do is fail visibly and early. Visible early failure means: this use case requires data infrastructure we don't have, or this workflow won't support the model's output format, or MRM will require explainability this approach can't provide. That information is worth more than a passed POC. It comes before you've committed significant resources to a full deployment. The upstream question — which use cases to POC in the first place — has its own failure mode that conventional prioritization exercises consistently produce.

This is counterintuitive. Sponsors approve POCs that succeed. Teams feel better about programs that produce green results. The organizational pressure is toward the kind of POC designed to pass — and that pressure is exactly what produces the pattern at the start of this article: every POC passes, nothing ships on time.

The reframe: a POC that surfaces a blocker early is not a failed POC. It is the POC doing its job. The blockers don't go away if the POC doesn't find them. They appear later, when reversing course costs more and the executive sponsor's patience is thinner.

The teams that consistently ship AI programs to production are not the ones that run the best POCs. They are the ones that run the most honest ones.

What a successful production-predictive POC looks like

It is less impressive to present. There is no clean accuracy curve trending upward. The results include a section called "integration gaps" and another called "governance requirements." The model performance numbers are lower because you ran it on real data instead of a curated extract.

What it does produce is a clear answer to the question that matters: will this work in production, and specifically what does it take to get it there? That answer may be yes, with a concrete list of pre-production requirements and the team that owns each one. It may be not yet: this use case requires data infrastructure we don't currently have, and here's what it would take to build. It may be no: this use case is not viable given our operating constraints, and here's why.

Any of those answers is more valuable than a polished 89% accuracy chart that doesn't reflect the production environment. The 89% chart moves a program forward without telling you whether forward leads somewhere. The honest assessment tells you exactly where you are and what the path actually costs.

I've worked with institutions that have run six consecutive successful POCs without shipping a single production system. The POCs weren't the problem. The problem was what the POCs were measuring.

If your institution's POC track record doesn't match its production track record, it's worth a conversation about what the POC is designed to test.