The long road to smarter code reviews with AI: making it measurable
A few weeks ago, I wrote about how our team started using risk-based review tiers to keep up with the pace of AI-generated code. The short version: we classify every PR into high, medium, or low risk, and apply different levels of human scrutiny to each tier. It helped.
We’re still doing it. If you’re interested in that post, you can still read it here.
But the same questions kept coming up from colleagues, from clients, from people who read the post: how do you actually decide what counts as high risk? Can you automate the classification itself? And how well does that work?
Those are the right questions. The answer is interesting and more uncomfortable than I expected. I went deep on the Meta paper I referenced in that post, “Moving Faster and Reducing Risk: Using LLMs in Release Deployment,” and mapped out what it would actually take to build a genuinely data-driven risk classifier. The effort required is significant. And my recommendation for most teams is to stop well short of the finish line, at least for now. Here’s why.
You’re evaluating, but not measuring
As I explained in my previous post, we classify our PRs using a general-purpose LLM with a carefully crafted prompt. And we do check how well it’s working: we go back through past pull requests and ask whether the classification held up.
But there’s a gap between evaluating and measuring. Right now, our evaluation is informal: a manual review of recent PRs, a gut feeling about whether the tier assignment looked right in hindsight. We’re not working from a defined rubric. We’re not tracking false negatives—the medium-risk PRs we called low that then caused a regression. We’re not capturing the reasoning behind our judgments in a way that compounds over time.
The result is that our confidence in the classifier is qualitative rather than quantitative. We believe it’s working. We don’t know by how much, or where it’s weakest.
Meta’s paper is built on a much more formal foundation. As part of the process, human experts trace every production incident back to the specific diff that caused it. That linkage is what turns “this PR caused a problem” into a training signal. Without a structured framework for capturing it, you accumulate experience without accumulating data, and those are not the same thing.
Getting from where we are today to a genuinely data-driven classifier is a journey through a foundation phase and four upgrade phases:
- Phase 0 is the instrumentation work: defining incidents, linking them to commits, and logging every classification decision.
- Phase 1 is where we are now: a general-purpose LLM classifying PRs based on a structured prompt.
- Phase 2 adds structural signals on top (churn size, author experience, and critical-service tagging) to check and correct the LLM's judgment.
- Phase 3 is a logistic regression trained on your accumulated incident data.
- Phase 4 is a fine-tuned, diff-aware LLM: the current state of the art.

Each phase is unlocked by the previous one, and the entire roadmap moves only as fast as your data allows.
We largely skipped Phase 0 when we came up with the idea for risk-based code review. Moving from where we are now to the next phase isn't complicated, but it does require fixing the data collection. Define what a "bad classification" looks like: a PR that was called low-risk and caused a production issue, a high-risk classification that turned out to be routine, a case where the agent missed something a human caught. Build a lightweight rubric (a simple spreadsheet is enough) and log these cases consistently. Over time, patterns emerge. You start to see which risk criteria are actually predictive and which are noise. That's the foundation Phase 3 requires.
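To make that concrete, here is a minimal sketch of what the rubric could look like as an append-only CSV. The column names and the failure-mode vocabulary are hypothetical; the point is that the categories are fixed and countable, so patterns show up when you tally them later.

```python
import csv
from pathlib import Path

# Hypothetical rubric: one row per misclassification, with a small fixed
# vocabulary of failure modes so patterns are countable over time.
FAILURE_MODES = {"missed_incident", "overclassified", "agent_missed_human_catch"}

def log_misclassification(path, pr_id, predicted_tier, failure_mode, notes):
    if failure_mode not in FAILURE_MODES:
        raise ValueError(f"unknown failure mode: {failure_mode}")
    is_new = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header once, the first time the file is created.
            writer.writerow(["pr_id", "predicted_tier", "failure_mode", "notes"])
        writer.writerow([pr_id, predicted_tier, failure_mode, notes])

log_misclassification("review_rubric.csv", "PR-1042", "low",
                      "missed_incident", "caused rollback of checkout service")
```

A spreadsheet works just as well; what matters is that every bad call lands in the same place, in the same shape.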
Start now, before you think you need it. Define what counts as a production incident for your team: a hotfix, a rollback, a P1 bug, a broken deployment. For every incident, require a short post-mortem that links to the specific PR that caused it. Log every classification decision your current system makes, along with the reasoning. Store the diff title, description, test plan, and the unified diff itself.
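A sketch of what logging each decision might look like, assuming a simple append-only JSON Lines file. The record fields mirror the list above; the names themselves are illustrative.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical log record for Phase 0: every field comes straight from
# the PR plus the classifier's output. Field names are illustrative.
@dataclass
class ClassificationRecord:
    pr_id: str
    tier: str           # "high" | "medium" | "low"
    reasoning: str      # the classifier's explanation, verbatim
    title: str
    description: str
    test_plan: str
    unified_diff: str
    logged_at: str = ""

def log_decision(path, record):
    record.logged_at = datetime.now(timezone.utc).isoformat()
    # Append-only JSON Lines: one record per line, trivially queryable later.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

A flat file like this is enough to start; the schema can move into a real database once the volume justifies it.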
This costs almost nothing to set up. The cost of not setting it up is that everything I describe below is unavailable to you, possibly for years.
The informal evaluation we are already doing is evidence that it’s worth caring about getting the data collection right. The next step is giving that instinct a more formal home.
The upgrade path
Once you understand that everything depends on data quality, the roadmap becomes clear. It has four phases, each unlocked by the previous one.
Phase 1 is where most of us are today. In our case, a general-purpose LLM (Claude) classifies PRs based on a structured prompt. We give it explicit heuristics: authentication flows are high risk, config changes are low risk, and anything that touches untested behavior is high risk, regardless of diff size. The model outputs a tier and a brief explanation.
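For illustration, here is roughly what that structured prompt and its output parsing could look like. The heuristics are the ones described above; the JSON wire format and the function names are assumptions, not our production prompt.

```python
import json

# Explicit heuristics embedded in the prompt, as described in the text.
HEURISTICS = """\
- Changes to authentication or authorization flows are HIGH risk.
- Pure configuration changes are LOW risk.
- Anything touching untested behavior is HIGH risk, regardless of diff size.
"""

def build_risk_prompt(title, description, test_plan, unified_diff):
    # Assumed wire format: ask the model for a single JSON object back.
    return (
        "Classify the risk of this pull request as high, medium, or low.\n"
        f"Apply these heuristics:\n{HEURISTICS}\n"
        f"Title: {title}\nDescription: {description}\nTest plan: {test_plan}\n"
        f"Diff:\n{unified_diff}\n"
        'Respond with JSON: {"tier": "...", "reasoning": "..."}'
    )

def parse_classification(model_output):
    result = json.loads(model_output)
    tier = result["tier"].lower()
    if tier not in {"high", "medium", "low"}:
        raise ValueError(f"unexpected tier: {tier}")
    return tier, result["reasoning"]
```

The prompt string goes to whatever model API you use; the parser is where you enforce that the output is one of exactly three tiers, so downstream logging stays clean.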
This is actually a reasonable baseline. The Meta paper found that unaligned foundation models — ones used without any task-specific fine-tuning — tend to perform worse than even a simple logistic regression. But that was because they were using the model as an embedder, feeding vectors into an external classifier. A well-prompted LLM doing direct classification is a different setup and can work well, especially if you include the structural signals the paper identified: churn size, number of files changed, whether critical paths are touched, and whether there’s a test plan.
The important thing at this phase is not accuracy. It’s instrumentation. Every classification decision should be logged, challenged, and eventually compared against incident outcomes. You’re building your training set without knowing it yet.
Phase 2 is where you add structural signals on top of the LLM. You can implement the features the Meta paper identified as most predictive: the ratio of added and deleted lines relative to file size, the number of files touched, the author’s historical experience with the changed files, whether the touched files have been involved in past incidents, and (this one is particularly high-value) whether the code is part of a component tagged as critical.
That last point is worth calling out explicitly. Ask your client to tag their critical components. It sounds like governance overhead, but it’s actually one of the most useful signals in the entire regression model Meta built. Code paths that execute as part of a high-criticality service carry inherently different risk than utility code, regardless of what the diff looks like.
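Pulling those signals together, a sketch of the feature extraction might look like this. The input shapes (per-file diff stats, author touch counts, critical-path prefixes) are assumptions about what your VCS tooling can provide; only the feature definitions follow the paper.

```python
def structural_signals(file_stats, author_touch_counts, incident_files, critical_prefixes):
    # file_stats: {path: (added_lines, deleted_lines, total_lines_in_file)}
    paths = list(file_stats)
    churn = sum((a + d) / max(total, 1) for a, d, total in file_stats.values())
    return {
        # Average added+deleted lines relative to file size.
        "churn_ratio": churn / max(len(paths), 1),
        "files_touched": len(paths),
        # How often this author has previously touched the changed files.
        "author_experience": sum(author_touch_counts.get(p, 0) for p in paths),
        # How many touched files have been implicated in past incidents.
        "past_incident_files": sum(p in incident_files for p in paths),
        # Whether any touched path falls under a component tagged critical.
        "touches_critical": any(p.startswith(tuple(critical_prefixes)) for p in paths),
    }
```

Every one of these is cheap to compute from data your VCS already has, which is what makes Phase 2 attainable without any ML infrastructure.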
At this phase, you use these structural signals to check the LLM’s classification. If the heuristics say high risk but the model says low, you need to take a second look. The hybrid outperforms either approach alone.
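A minimal sketch of that check, assuming the structural signals are available as a dict. The rule here is that signals can only escalate the LLM’s tier, never downgrade it; the thresholds are invented and would need tuning against your own logged outcomes.

```python
TIER_RANK = {"low": 0, "medium": 1, "high": 2}

def reconcile(llm_tier, signals):
    # Derive a tier from structural heuristics alone. Thresholds are
    # hypothetical placeholders, not values from the paper.
    structural_tier = "low"
    if signals["touches_critical"] or signals["past_incident_files"] > 0:
        structural_tier = "high"
    elif signals["churn_ratio"] > 0.3 or signals["files_touched"] > 10:
        structural_tier = "medium"
    # Escalate only: a disagreement means a human takes a second look.
    if TIER_RANK[structural_tier] > TIER_RANK[llm_tier]:
        return structural_tier, "escalated: structural signals disagree with LLM"
    return llm_tier, "llm"
```

The asymmetry is deliberate: a cheap heuristic should be able to force extra scrutiny, but never to remove it.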
Here’s an assumption buried in this phase that’s worth making explicit: accumulating useful signal in months rather than years only holds if you’re looking across multiple projects simultaneously. A single project, especially a typical client engagement with a defined scope and end date, will rarely generate enough incident volume on its own to give you a meaningful signal. This is a structural problem. Risk-based review is not a per-project practice; it’s an organizational one. The data you need to make it work has to accumulate across projects, teams, and codebases over time. If you’re treating each project as its own isolated experiment, you’ll never get past Phase 1.
Phase 3 is a properly trained logistic regression model on your incident data. The Meta paper uses this as their production baseline, and it performs well — capturing roughly 28% of medium-severity incidents while only flagging 10% of all diffs. That’s a meaningful improvement over random gating.
But you need roughly a thousand labeled incidents to train this reliably. That number is almost impossible to reach on a single project. Even a large, active codebase with a healthy incident rate might produce a few dozen production incidents per year that are clearly traceable to a specific commit. To reach a thousand, you need years of data from multiple teams across multiple projects, collected with a consistent rubric. This is why Phase 0 has to be an organizational commitment, not a per-project habit. The regression doesn’t exist until the data exists, and the data only exists if you’ve been collecting it everywhere, not just here.
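To make the Phase 3 mechanics concrete, here is a from-scratch logistic regression on invented toy data (two features: churn ratio and a critical-component flag). In practice you would use a real library and, as argued above, far more data than this; the point is only to show how logged features and incident labels become a trained risk score.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=3000):
    # Plain stochastic gradient descent on log loss; w[n] is the bias.
    n = len(X[0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(w[j] * xi[j] for j in range(n)) + w[n])
            err = p - yi  # gradient of log loss w.r.t. the logit
            for j in range(n):
                w[j] -= lr * err * xi[j]
            w[n] -= lr * err
    return w

def predict(w, x):
    n = len(x)
    return sigmoid(sum(w[j] * x[j] for j in range(n)) + w[n])

# Invented training set: (churn_ratio, touches_critical) -> caused incident?
X = [(0.8, 1), (0.6, 1), (0.9, 0), (0.1, 0), (0.2, 0), (0.05, 1)]
y = [1, 1, 1, 0, 0, 0]
w = train(X, y)
```

The output is a probability per diff; the gating decision is then just a threshold chosen to trade incident capture against how many diffs you’re willing to flag.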
Phase 4 is a fine-tuned, diff-aware LLM. This is the state-of-the-art result from the Meta paper. Their best-performing model, iDiffLlama-13B, was neither the largest nor the most sophisticated architecture; it was a smaller model that had been specifically pre-trained on code diffs, not just static code. It understood the intent behind a change, not just the change itself. When aligned to risk prediction via supervised fine-tuning, it captured 1.52× more medium-risk incidents than the logistic regression baseline.
For most teams, this phase is a long-term research project, not a near-term goal. You need a substantial incident corpus across multiple codebases, GPU infrastructure for fine-tuning, and the engineering capacity to maintain a specialized model in production. It’s achievable because Meta did it, but it is not a weekend project.
My honest recommendation: stop short of the finish line
Here’s the part where I give you advice that contradicts the obvious next step.
Phases 3 and 4 are quite impressive. The paper makes a compelling case for what’s possible when you have enough data, enough compute, and enough engineering time. If you’re building risk classification infrastructure for a large organization with hundreds of developers, active incident tracking, and dedicated ML engineering capacity, go for it. The investment pays off.
But most teams aren’t in that situation. Most teams have a handful of developers, inconsistent incident tracking, and no appetite for fine-tuning language models. For those teams, Phases 3 and 4 are not on the realistic horizon.
Phase 1 plus Phase 2 is a genuinely good outcome. A well-prompted LLM with structural heuristics on top, running against a carefully defined set of risk criteria, gets you most of the practical benefit at a fraction of the cost. The marginal improvement from a fine-tuned model matters a lot at Meta’s scale (millions of diffs, hundreds of SEVs per year) and matters much less when you’re processing fifty PRs a week.
What does matter at your scale is trust and iteration speed. You can tune a prompt in an afternoon. You can add a new structural signal in a day. You can update your risk criteria when you discover a new failure mode. That feedback loop is fast and cheap, which matters more than squeezing out extra percentage points of SEV capture rate.
The important caveat: stop at Phase 2 for now, not forever. This is why Phase 0 matters so much. If you do the instrumentation work, you’re not closing the door on Phase 3. You’re leaving it open. The data accumulates quietly in the background while you operate at Phase 2. When you have enough (and you’ll know, because the numbers will tell you), you can move forward.
What this means practically
If I’m advising a team starting from scratch today, here’s what I tell them:
- Define your incident criteria and link every production issue to a commit. Do this before anything else, and do it consistently across every project — not just this one.
- Build Phase 1 with enough structure that you can log every classification decision. You want the tier, the reasoning, and the full diff context stored in a shared location that survives the end of any individual project, so you can query them later.
- Ask your clients to tag critical services. It’s a one-time effort with outsized long-term value.
- Add Phase 2 structural signals once you have meaningful data pooled across projects. The author-experience and critical-path features in particular are easy to implement and meaningfully predictive.
- Revisit Phase 3 only when you have incident data from multiple teams over an extended period. Not before.
The review problem isn’t going away. AI velocity is still outpacing human review capacity, and that gap is going to widen. The right response isn’t to wait for the perfect classifier; it’s to build the foundation now so the classifier can exist later.
Start collecting the data. The rest follows.