
The silence rate: an ongoing AI code review metric

June 13, 2026 · Postil team

Ask any team that has turned off an AI code reviewer why they did it and you will hear the same word: noise. Not "it missed bugs." Noise. The tool commented too much, was wrong too often, and the team stopped reading it, which means they also stopped reading the correct findings. This piece is about the number that predicts that outcome, why one-off category data is not enough, and why Postil makes it an ongoing per-organization dashboard metric.

The 30% threshold

AI reviewers can generate 200 to 400 comments per week on an active repository, with 70 to 90% of them ignored, according to one analysis of review overload. The same analysis describes a behavioral cliff: above roughly 30% false positives, developers triage every comment with suspicion; above 50%, they dismiss by default. The failure is not that the tool wastes time on bad findings. It is that bad findings destroy the credibility of good ones. Once a team learns the reviewer is usually wrong, the real bug it flags on Friday gets the same dismissive glance as the forty nitpicks before it.

The practitioner record is blunt. "Too much noise to PRs, and only a very small percentage of comments are actually useful" (HN, Dec 2024). "I am no longer spending my time solving engineering challenges; I am perfecting code to pass an AI screen… It's theater" (r/webdev, 2026). A founder evaluating the category found two leading tools "at best added noise to PRs, at worst flagged false positives". Even a Cursor employee, in a thread about their own Bugbot, conceded: "You can always ask an LLM for a review but you should expect a lot of false positives, noisy comments, and inconsistent results."

Even the good reviews measure a third wasted

The best public data point comes from a review that liked the product. The Lychee open-source project audited CodeRabbit across 28 PRs and 290 findings and recommended it. The same audit classified 21% of findings as nitpicks, 15% as useless, and 13% as based on wrong assumptions. Even setting the nitpicks aside, the last two categories alone come to 28%: close to a third of the output of a well-regarded tool, measured by a sympathetic reviewer, was waste. That is what "good" currently looks like in this category, and it sits right at the threshold where developers start tuning out.

Why vendors will not publish it

The incentive problem is structural, and practitioners have named it: vendors are rewarded for "more feedback (not higher quality)", because a reviewer that stays quiet looks broken to the buyer who just installed it. Every comment is visible evidence the product is working. Silence, even correct silence, generates an "is this thing on?" support ticket. So defaults trend chatty, and the metric that would expose the cost of that choice goes unreported.

You can see the omission clearly in the benchmarks. DeepSource documented that all four vendors surveyed here that publish a benchmark (Greptile, Qodo, Augment, and Macroscope) rank their own product first. It also noted that when Augment re-ran Greptile's evaluation on the same dataset, Greptile scored 45% against its self-reported 82%. Greptile's benchmark explicitly does not score false positives. Qodo's benchmark drew the obvious HN response: "Company creates a benchmark. Same company is best in that benchmark. Story as old as time." Publishing a high catch rate is easy; measuring precision honestly is not.

The independent evidence is thinner but more interesting. A practitioner ran four reviewers in parallel for 3.5 weeks across 146 PRs and 679 findings (author works at Sentry, conflict disclosed) and found that 93.4% of flagged locations were caught by exactly one tool. Four products looking at the same diffs almost never agreed on what mattered. If these tools were measuring something objective, they would overlap. They are not; they are emitting opinions, and volume is a choice each vendor makes.
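
For readers who want to run the same overlap check on their own repositories, here is a minimal sketch of the calculation. It is not the study's method: the tool names and sample data are hypothetical, and it assumes two tools "agree" only when they flag the exact same (PR, file, line) location, which may be stricter or looser than the matching the study used.

```python
from collections import defaultdict

def exactly_one_tool_share(findings):
    """Share of distinct flagged locations that exactly one tool flagged.

    `findings` is an iterable of (tool, location) pairs. How two findings
    count as the same location is an assumption: exact equality here.
    """
    tools_by_location = defaultdict(set)
    for tool, location in findings:
        tools_by_location[location].add(tool)
    if not tools_by_location:
        return 0.0
    unique = sum(1 for tools in tools_by_location.values() if len(tools) == 1)
    return unique / len(tools_by_location)

# Hypothetical data: each location is (pr_number, file, line).
sample = [
    ("tool_a", (101, "api.py", 40)),
    ("tool_b", (101, "api.py", 40)),  # the only location two tools agree on
    ("tool_a", (101, "db.py", 12)),
    ("tool_c", (102, "ui.ts", 7)),
    ("tool_d", (103, "auth.py", 88)),
]

share = exactly_one_tool_share(sample)
print(f"{share:.0%} of flagged locations were caught by exactly one tool")
# prints "75% of flagged locations were caught by exactly one tool"
```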

Academia noticed the gap too. SWR-Bench built a 1,000-PR evaluation where half the PRs are intentionally clean, specifically so that saying nothing is a scored answer, and found current systems substantially underperform. That design choice is the whole point: a benchmark that contains no clean PRs cannot punish a tool for commenting on everything.
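
A toy illustration of why the clean PRs matter. This is not SWR-Bench's scoring; the `Case` type, both reviewers, and the pass/fail rule are hypothetical, and the rule only checks whether the reviewer spoke, not whether its comment hit the real defect.

```python
from dataclasses import dataclass

@dataclass
class Case:
    is_clean: bool  # True when the PR contains no real defect

def chatty_reviewer(case: Case) -> int:
    """Hypothetical reviewer that posts three comments on every PR."""
    return 3

def quiet_reviewer(case: Case) -> int:
    """Hypothetical reviewer that comments only when the PR has a defect."""
    return 0 if case.is_clean else 1

def accuracy(cases, reviewer) -> float:
    """Share of cases where speaking or staying silent was the right call."""
    correct = 0
    for case in cases:
        spoke = reviewer(case) > 0
        # Correct means: silent on a clean PR, or spoke on a defective one.
        if spoke != case.is_clean:
            correct += 1
    return correct / len(cases)

defects_only = [Case(is_clean=False)] * 4
half_clean = [Case(is_clean=False)] * 2 + [Case(is_clean=True)] * 2

print(accuracy(defects_only, chatty_reviewer))  # 1.0
print(accuracy(half_clean, chatty_reviewer))    # 0.5
print(accuracy(half_clean, quiet_reviewer))     # 1.0
```

With no clean PRs in the set, the reviewer that comments on everything scores perfectly. Once half the set is clean, its score drops to 0.5 while the selective reviewer keeps 1.0.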

Defining the silence rate

The silence rate is the share of reviewed PRs where the tool posted zero findings. On its own it is trivially gameable (a tool that never speaks scores 100%), which is why it only means something paired with its complement: of the findings the tool did ship, how many were acted on rather than dismissed? A reviewer with a high silence rate and a high act-on rate is doing the job senior engineers do: most PRs are fine, say so by saying nothing, and when you do speak, be right. GitHub has published a related category figure in a blog post: Copilot code review stays silent on roughly 29% of reviews. It is a real one-off disclosure, not a number an organization can check on its own repositories this month.
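
The two numbers can be computed from nothing more than a per-PR record of what happened to each finding. The sketch below is illustrative only: the `Review` type and its field names are hypothetical and are not Postil's data model, and it assumes each shipped finding has already been labeled as acted on or not.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    pr: int
    # One entry per shipped finding: True if acted on, False if dismissed or ignored.
    outcomes: list = field(default_factory=list)

def silence_rate(reviews):
    """Share of reviewed PRs where zero findings were posted."""
    if not reviews:
        return 0.0
    silent = sum(1 for r in reviews if not r.outcomes)
    return silent / len(reviews)

def act_on_rate(reviews):
    """Share of shipped findings that were acted on; None if nothing shipped."""
    outcomes = [o for r in reviews for o in r.outcomes]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)

# Hypothetical week: three silent reviews, two reviews with findings.
week = [
    Review(pr=1),
    Review(pr=2, outcomes=[True, True]),
    Review(pr=3),
    Review(pr=4, outcomes=[True, False]),
    Review(pr=5),
]
print(silence_rate(week))  # 0.6
print(act_on_rate(week))   # 0.75

# A reviewer that never speaks: perfect silence rate, no act-on rate at all.
mute = [Review(pr=n) for n in range(5)]
print(silence_rate(mute), act_on_rate(mute))  # 1.0 None
```

The last case is the gaming scenario: a silence rate of 100% with no act-on rate to pair it with says nothing about quality.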

Why Postil reports it continuously

Postil's dashboard leads with the silence rate: the share of your PRs where it said nothing, alongside the confidence distribution of every finding it did ship. Not because silence is a virtue in itself, but because publishing the number changes our incentives. A vendor that reports its silence rate cannot quietly inflate comment volume to look busy; drift toward noise shows up in a chart before your engineers feel it in their notifications. It is the same logic as a chef sitting in their own dining room.

What we are not claiming: that Postil's silence rate beats any competitor's. No peer has run our private evaluation data and we have not published comparative numbers, so there is nothing honest to claim yet. The claim is narrower and checkable: the metric exists, it is on the dashboard from day one, and you can watch it on your own traffic. You can see it run across the public evidence cases, including one where it correctly stays silent.

What to ask any vendor (including us)

What share of PRs does your tool stay silent on, and where do I see that number for my repos, continuously?
What share of shipped findings get dismissed or ignored, and is that on the dashboard too?
Does your benchmark include clean PRs where the correct answer is silence? If not, what stops the tool from commenting on everything?
Can I run advisory-only for two weeks and see these numbers before anything becomes a required check?

That last one matters most. The adoption pattern recommended in third-party integration guides is to run any AI reviewer in advisory mode for a couple of weeks and promote it to a required check only if the dismissal rate stays under roughly 30%. If a vendor cannot show you the numbers that decision needs, that is itself the answer.
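
As a rough sketch, the promotion decision reduces to one ratio and one comparison. The function names are hypothetical, the 0.30 default mirrors the "roughly 30%" figure above and is not a precise cutoff, and treating a trial with no shipped findings as "do not promote yet" is an assumption of this sketch.

```python
def dismissal_rate(shipped, dismissed):
    """Share of shipped findings that were dismissed; None if nothing shipped."""
    if shipped == 0:
        return None
    return dismissed / shipped

def promote_to_required(shipped, dismissed, threshold=0.30):
    """True if the advisory trial supports making the reviewer a required check."""
    rate = dismissal_rate(shipped, dismissed)
    # Assumption: with no findings there is no evidence, so do not promote yet.
    return rate is not None and rate < threshold

# Hypothetical two-week advisory trials.
print(promote_to_required(shipped=40, dismissed=9))   # True  (22.5% dismissed)
print(promote_to_required(shipped=40, dismissed=14))  # False (35% dismissed)
print(promote_to_required(shipped=0, dismissed=0))    # False (no evidence)
```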

Sources

CodeAnt: preventing AI code review overload (comment volume, 30%/50% thresholds)
Lychee: 28-PR CodeRabbit audit (Sep 2025)
DeepSource: AI code review benchmark critique (Feb 2026)
Independent 4-tool parallel study, 146 PRs (May 2026)
SWR-Bench (arXiv)
GitHub: 60 million Copilot code reviews
Practitioner threads: HN (Dec 2024), HN (Jan 2026), HN on the Qodo benchmark, r/webdev, r/ycombinator, r/cursor

Watch the number yourself.

Run Postil advisory on your next PRs. The silence rate is the first thing on the dashboard.
