How to Get an LLM Through Model Risk Management at a Bank

MRM sent back thirty-seven questions. The data science team spent three months writing answers. MRM came back with twenty more. The model had been sitting in validation for eight months, the vendor was losing patience, and nobody was quite sure what would actually resolve it.

I've seen this sequence enough times that I can describe the next chapter: eventually, someone escalates to the CRO. The CRO schedules a meeting with the head of Model Risk. Fifteen minutes into that meeting, both parties discover that the real disagreement is about three things — not thirty-seven — and that they could have had that conversation eight months earlier.

The MRM validation process for LLMs is one of the most consistent blockers I encounter in financial services AI programs. Not because MRM is obstructionist. Not because the data science team is doing bad work. Because they are operating under different assumptions about what the process is supposed to produce, and nobody is brokering the translation between them.

MRM was built for a different kind of model

Model Risk Management as a discipline in US banking is largely shaped by SR 11-7, the supervisory guidance on model risk management that the Federal Reserve and the OCC issued in 2011 (the OCC published the same text as Bulletin 2011-12). SR 11-7 was written in an era of credit scorecards, PD/LGD models, prepayment models, and asset pricing engines. These are systems with a shared set of properties: structured, well-defined inputs, documented mathematical logic, known failure modes, and outputs that map directly to a decision with quantifiable consequences.

For that class of model, the SR 11-7 framework is excellent. You document the model, define the inputs and outputs, validate the methodology against an independent dataset, stress-test the assumptions, and establish ongoing monitoring against defined performance metrics. The process is rigorous and well-understood on both sides.

An LLM has almost none of those properties. The inputs are open-ended natural language. The internal logic is not documented; it emerges from billions of learned parameters. The failure modes are probabilistic, context-dependent, and not enumerable in advance. The output is not a number; it is text that a human interprets and acts on. The relationship between inputs, processing, and outputs cannot be described in a validation memo the way it can for a credit scorecard.

This is not a deficiency of LLMs. It is simply a different class of system. But MRM frameworks are written for the first class, and when an LLM arrives for validation, the framework asks questions that don't have clean answers — and the process stalls.

Why the standard response makes it worse

The typical response from a data science team when MRM asks difficult questions is to write more documentation. More technical detail about the model architecture. More accuracy metrics on the validation set. More benchmark comparisons. The implicit theory is that if the team answers enough questions thoroughly enough, the validation will clear.

This theory is usually wrong. The problem is not a documentation gap — it is a framework mismatch. MRM is asking questions that are structurally designed for a different kind of system, and adding more answers to those questions does not resolve the underlying issue. It produces a very long response to the wrong question.

The better response is to reframe: not to answer all thirty-seven questions, but to identify which three are actually blocking approval and have a direct conversation about those three. Everything else is either answerable with existing documentation or negotiable once the core concerns are addressed.

Getting to that conversation requires someone who can walk into MRM and ask, directly: "Of everything in your questions list, what would need to be true for this validation to clear?" That is not an aggressive question. It is the most efficient question. And MRM teams, in my experience, respond well to it — because they are usually trying to close validations, not block them, and the question gives them permission to prioritize.

The three questions MRM actually needs answered

Strip away the SR 11-7 framework formalism, and what MRM fundamentally needs to know about any AI system — LLM or otherwise — reduces to three questions. Everything else flows from these.

What decisions does this system affect, and what happens when it is wrong? This is the risk exposure question. MRM needs to understand the decision surface: who acts on the model's output, with what authority, for what dollar amount or customer impact, and under what circumstances. A model that summarizes internal documents for an analyst has a very different risk profile from a model that drafts customer-facing communications or flags transactions for review. The failure mode of the first is an analyst spending ten minutes re-reading a document. The failure mode of the other two may be a regulatory finding or a customer complaint at scale.

Most LLM submissions to MRM do not answer this question clearly, because the data science team wrote the validation memo around the model rather than around the use case. MRM can't calibrate risk without understanding the decision surface. Make this explicit, up front, in plain language.

How will you know when it is drifting or degrading? This is the ongoing monitoring question, and it is genuinely harder for LLMs than for traditional models. You cannot monitor an LLM the way you monitor a credit scorecard — there is no single performance metric with a defined acceptable range. But "we can't use the traditional monitoring approach" is not an acceptable answer. What MRM needs is a credible alternative monitoring framework: a set of observable signals that would indicate the model is behaving differently than intended, with defined thresholds and a response protocol when those thresholds are crossed. The specific mechanics of model drift at a bank — why it happens faster in financial data and what adequate monitoring actually looks like — are worth understanding before you write this section of the validation memo.

For a document analysis system, this might be human review of a random sample of outputs at defined intervals, with a rating rubric. For a customer communication system, it might be a review of all flagged customer responses plus a monthly audit of a structured sample. The specific approach depends on the use case. What matters is that it is defined, it is resourced, and it produces an observable audit trail. MRM can work with that.
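To make "defined thresholds" concrete, here is a minimal Python sketch of the sampled-review approach for a document analysis system. Everything specific in it is an illustrative assumption rather than a requirement from SR 11-7 or any particular bank: the names (ReviewResult, draw_review_sample, evaluate_review), the 1-to-5 rubric, the passing score of 4, and the 90 percent pass-rate threshold. The real values should come out of the conversation with MRM.

```python
import random
from dataclasses import dataclass


@dataclass
class ReviewResult:
    output_id: str
    score: int  # reviewer's rubric score; 1 (unacceptable) to 5 (excellent) is assumed


def draw_review_sample(output_ids, sample_size, seed):
    """Pick a reproducible random sample of output IDs for human review."""
    rng = random.Random(seed)  # recorded seed lets the draw be re-created later
    ids = sorted(output_ids)   # sort first so the draw does not depend on input order
    return rng.sample(ids, min(sample_size, len(ids)))


def evaluate_review(results, passing_score=4, min_pass_rate=0.90):
    """Compare the share of passing reviews with the agreed threshold."""
    if not results:
        raise ValueError("no review results to evaluate")
    passed = sum(1 for r in results if r.score >= passing_score)
    pass_rate = passed / len(results)
    return {
        "reviewed": len(results),
        "pass_rate": pass_rate,
        "breach": pass_rate < min_pass_rate,  # True means the response protocol applies
    }
```

The code only makes the trigger unambiguous. The response protocol still has to name who is told when "breach" is true and what they are expected to do about it.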

What does the audit trail show? This is the explainability and governance question — the one that matters most when something goes wrong. When an examiner or a plaintiff's attorney asks what the model did and why, what can the bank actually show them? For a credit scorecard, the answer is a documented model, a feature importance list, and a specific output score for the transaction in question. For an LLM, the answer is necessarily different — but there needs to be an answer.

At minimum, the audit trail should capture: the model version that generated the output, the input that was provided, the output that was produced, what (if anything) a human did with that output, and a timestamp. For high-risk use cases, it should also capture the policy or threshold the system was operating under and any human review that occurred. This is not technically difficult to implement. It is frequently not implemented because nobody specified it as a requirement before the system was built.
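As a rough illustration of how little is involved, the fields listed above fit in one small record. This is a sketch under stated assumptions, not a reference implementation: AuditRecord, make_record, to_log_line, and the field names are hypothetical, and the SHA-256 digest is my own addition as one simple way to notice if a logged line is later altered. It is not a substitute for proper write-once storage.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    model_version: str              # exact model or vendor API version that produced the output
    prompt: str                     # input that was provided
    output: str                     # output that was produced
    human_action: Optional[str]     # what, if anything, a human did with the output
    timestamp: str                  # ISO 8601, UTC
    policy_id: Optional[str] = None    # high-risk use cases: policy or threshold in force
    reviewer_id: Optional[str] = None  # high-risk use cases: who performed human review


def make_record(model_version, prompt, output, human_action=None,
                policy_id=None, reviewer_id=None, now=None):
    """Build one audit record, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return AuditRecord(model_version, prompt, output, human_action,
                       now.isoformat(), policy_id, reviewer_id)


def to_log_line(record):
    """Serialize a record as one JSON line with a digest of its contents."""
    fields = asdict(record)
    canonical = json.dumps(fields, sort_keys=True)  # stable key order for hashing
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return json.dumps({"record": fields, "sha256": digest}, sort_keys=True)
```

The design point is that the record is built at the moment the output is produced, not reconstructed afterwards from application logs.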

How to document an LLM for MRM review

The standard model documentation template — methodology description, data sources, validation results, limitations — does not translate cleanly to an LLM. But the template exists for a reason: it answers the questions a validator needs to answer. The task is to produce documentation that answers those same questions in a form appropriate to this class of system.

A useful LLM documentation package for MRM typically includes four components:

Use case documentation. What problem is this system solving, for whom, in what workflow, and with what human oversight? This should be written for a non-technical reader and should make the risk exposure unambiguous. If a senior MRM analyst who has never seen the system reads this document, they should understand exactly what is at stake.

System architecture documentation. Which model or API is being used, who controls it, what data is sent to it, and where do the outputs go? For third-party hosted models, this should include the vendor's data handling commitments, retention policies, and the contractual provisions governing the bank's data. MRM needs to understand whether the bank has visibility into model changes the vendor might make.

Risk characterization. An honest assessment of the failure modes for this specific use case — not a generic list of LLM risks, but the specific ways this deployment could go wrong and the specific controls that mitigate each. This is the document most teams skip, and it is the one MRM most wants to read.

Monitoring and review plan. How will the bank know the system is performing as intended, how frequently will it be reviewed, who is responsible, and what happens when a threshold is crossed? This should be specific enough that someone other than the person who wrote it could execute it.
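One way to pressure-test the last two components before submission is to write them down as structured data and check for holes. The sketch below is a hypothetical Python example; FailureMode, MonitoringSignal, and find_gaps are names I made up for illustration, and the completeness rules (every failure mode has at least one control, every signal has a threshold, an owner, and a response) are my reading of the components above, not a regulatory checklist.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    description: str                              # a specific way this deployment could go wrong
    controls: list = field(default_factory=list)  # controls that mitigate it


@dataclass
class MonitoringSignal:
    name: str
    threshold: str = ""  # the level at which action is required
    owner: str = ""      # who is responsible for watching it
    response: str = ""   # what happens when the threshold is crossed


def find_gaps(failure_modes, signals):
    """List every failure mode with no control and every incomplete signal."""
    gaps = []
    for fm in failure_modes:
        if not fm.controls:
            gaps.append(f"failure mode without a control: {fm.description}")
    for signal in signals:
        for attr in ("threshold", "owner", "response"):
            if not getattr(signal, attr).strip():
                gaps.append(f"signal '{signal.name}' is missing {attr}")
    return gaps
```

An empty list from find_gaps does not mean the package is adequate. It only means the obvious blanks are filled before MRM has to ask about them.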

Timing is the variable most teams control incorrectly

The most consistent difference I see between programs that clear MRM in two months and programs that are still in validation at month ten is not the quality of the documentation. It is when the conversation with MRM started.

Programs that move quickly bring MRM into the design conversation before the model is built. Not as a reviewer — as a stakeholder. They schedule a working session with MRM early in the design phase and ask: given what we're trying to do, what would a defensible validation package need to show? The answer to that question becomes a design input. The audit logging, the human review protocol, the monitoring plan — all of it gets built into the system from the start, because MRM told the team what they'd need before anyone wrote a line of code.

When that happens, the formal validation submission is not a negotiation. It is a documentation of decisions that were already made with MRM's input. The questions have already been answered because they were asked at the right moment.

Programs that struggle bring MRM in at the submission stage, hand over a documentation package, and then react to questions. Each round of questions is a new negotiation. Each answer surfaces another assumption that needs to be examined. The process has no natural resolution point because MRM's framework was never calibrated to this specific system — it was applied generically, after the fact.

The institutions that move LLMs through MRM fastest are not the ones with the best models. They are the ones that treated MRM as a design partner rather than a review gate — and started that conversation before the system existed.

What to do if you're already in the review cycle

If you are reading this because your LLM has been in MRM validation for six months and is still not cleared, the path forward is not more documentation. It is a direct conversation.

Request a working session with the head of MRM validation for this initiative — not a status update, a working session. Come in with one question: of everything on the current question list, what are the two or three things that, if resolved, would allow the validation to close? Then spend the session on those things specifically, with the people who can actually decide the answers.

In my experience, this meeting almost always produces more progress in ninety minutes than the preceding months of written exchanges. Not because MRM has been holding out, but because written Q&A is a terrible medium for resolving substantive disagreement about framework interpretation. The disagreement needs to be surfaced directly, by people with enough authority to make a call.

If the validation is stuck because MRM is asking for something that genuinely doesn't exist — a mathematical derivation of an LLM's outputs, for example — the resolution is usually a negotiated equivalent: a different form of evidence that satisfies the underlying concern in a way that is achievable for this class of system. That negotiation requires a conversation. It cannot happen by email.

If you need the broader governance frame first, start with Governance Frameworks That Actually Work for AI in Banking. If you are deciding whether this needs outside leadership, see Services.

Navigating a stalled MRM validation or trying to set up the right conversation before you submit? I'd be glad to compare notes — even if it doesn't lead to an engagement.
