The silence rate: the AI code review metric nobody publishes
June 2026 · Postil team
Ask any team that has turned off an AI code reviewer why they did it and you will hear the same word: noise. Not "it missed bugs." Noise. The tool commented too much and was wrong too often, so the team stopped reading it, which means they also stopped reading its correct findings. This piece is about the number that predicts that outcome, why no vendor reports it, and why we decided to make it the first thing on Postil's dashboard.
The 30% threshold
AI reviewers can generate 200 to 400 comments per week on an active repository, with 70 to 90% of them ignored, according to one analysis of review overload. The same analysis describes a behavioral cliff: above roughly 30% false positives, developers triage every comment with suspicion; above 50%, they dismiss by default. The failure is not that the tool wastes time on bad findings. It is that bad findings destroy the credibility of good ones. Once a team learns the reviewer is usually wrong, the real bug it flags on Friday gets the same dismissive glance as the forty nitpicks before it.
The practitioner record is blunt. "Too much noise to PRs, and only a very small percentage of comments are actually useful" (HN, Dec 2024). "I am no longer spending my time solving engineering challenges; I am perfecting code to pass an AI screen… It's theater" (r/webdev, 2026). A founder evaluating the category found two leading tools "at best added noise to PRs, at worst flagged false positives". Even a Cursor employee, in a thread about their own Bugbot, conceded: "You can always ask an LLM for a review but you should expect a lot of false positives, noisy comments, and inconsistent results."
Even the good reviews measure a third wasted
The best public data point comes from a review that liked the product. The Lychee open-source project audited CodeRabbit across 28 PRs and 290 findings and recommended it. The same audit classified 21% of findings as nitpicks, 15% as useless, and 13% as based on wrong assumptions. Even setting the nitpicks aside, the useless and wrong-assumption findings add up to 28%: close to a third of the output of a well-regarded tool, measured by a sympathetic reviewer, was waste. That is what "good" currently looks like in this category, and it sits right at the threshold where developers start tuning out.
Why vendors will not publish it
The incentive problem is structural, and practitioners have named it: vendors are rewarded for "more feedback (not higher quality)", because a reviewer that stays quiet looks broken to the buyer who just installed it. Every comment is visible evidence the product is working. Silence, even correct silence, generates an "is this thing on?" support ticket. So defaults trend chatty, and the metric that would expose the cost of that choice goes unreported.
You can see the omission clearly in the benchmarks. DeepSource documented that every vendor benchmark in the category ranks its own product first. One example it cites: when Augment re-ran Greptile's evaluation on the same dataset, Greptile scored 45% against its self-reported 82%. Greptile's benchmark explicitly does not score false positives. Qodo's benchmark drew the obvious HN response: "Company creates a benchmark. Same company is best in that benchmark. Story as old as time." Catch-rate theater is easy; precision accounting is not.
The independent evidence is thinner but more interesting. A practitioner (who works at Sentry and disclosed the conflict) ran four reviewers in parallel for 3.5 weeks across 146 PRs and 679 findings, and found that 93.4% of flagged locations were caught by exactly one tool. Four products looking at the same diffs almost never agreed on what mattered. If these tools were measuring something objective, their findings would overlap. They do not; the tools are emitting opinions, and volume is a choice each vendor makes.
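To check that kind of overlap on your own repositories, the calculation can be sketched as follows. This is an illustration, not the study's method: the tool names and the `(file, line)` key used to identify a flagged location are assumptions, and how the study matched locations across tools is not described here.

```python
from collections import Counter


def exclusive_share(flags_by_tool: dict[str, set[tuple[str, int]]]) -> float:
    """Fraction of distinct flagged locations reported by exactly one tool."""
    tools_per_location = Counter()
    for locations in flags_by_tool.values():
        # Each tool's locations are a set, so a tool counts at most once per location.
        tools_per_location.update(locations)
    if not tools_per_location:
        raise ValueError("no flagged locations")
    exclusive = sum(1 for n in tools_per_location.values() if n == 1)
    return exclusive / len(tools_per_location)


# Hypothetical data: three tools, four distinct locations, one of them shared.
flags = {
    "tool_a": {("api.py", 10), ("api.py", 42)},
    "tool_b": {("api.py", 42), ("db.py", 7)},
    "tool_c": {("ui.ts", 3)},
}
print(exclusive_share(flags))  # 0.75
```

The result depends heavily on how strictly two findings are judged to be "the same location", so treat any single figure as sensitive to that matching rule.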
Academia noticed the gap too. SWR-Bench built a 1,000-PR evaluation where half the PRs are intentionally clean, specifically so that saying nothing is a scored answer, and found that current systems substantially underperform. That design choice is the whole point: a benchmark that contains no clean PRs cannot punish a tool for commenting on everything.
Defining the silence rate
The silence rate is the share of reviewed PRs where the tool posted zero findings. On its own it is trivially gameable (a tool that never speaks scores 100%), which is why it only means something paired with a companion metric, the act-on rate: of the findings the tool did ship, how many were acted on rather than dismissed? A reviewer with a high silence rate and a high act-on rate is doing the job senior engineers do: most PRs are fine, say so by saying nothing, and when you do speak, be right. The only public number close to this from a major vendor is GitHub's, mentioned in passing in a blog post: Copilot code review stays silent on roughly 29% of reviews. It is a real disclosure and GitHub deserves credit for it, but it is a one-off marketing statistic, not a number you can check on your own repositories this month.
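As a minimal sketch of the two rates described above, assuming a hypothetical per-PR record that stores how many findings were posted and how many were acted on (the `ReviewedPR` type and its field names are illustrative, not Postil's data model):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewedPR:
    """Hypothetical record of one reviewed PR (illustrative fields only)."""
    findings_posted: int    # findings the tool shipped on this PR
    findings_acted_on: int  # of those, how many led to a change


def silence_rate(prs: list[ReviewedPR]) -> float:
    """Share of reviewed PRs where the tool posted zero findings."""
    if not prs:
        raise ValueError("no reviewed PRs")
    silent = sum(1 for pr in prs if pr.findings_posted == 0)
    return silent / len(prs)


def act_on_rate(prs: list[ReviewedPR]) -> Optional[float]:
    """Share of shipped findings that were acted on; None if none were shipped."""
    posted = sum(pr.findings_posted for pr in prs)
    if posted == 0:
        return None  # a tool that never speaks has no act-on rate to report
    acted = sum(pr.findings_acted_on for pr in prs)
    return acted / posted


# Two silent PRs, and four findings across the other two, three of them acted on.
prs = [ReviewedPR(0, 0), ReviewedPR(0, 0), ReviewedPR(3, 2), ReviewedPR(1, 1)]
print(silence_rate(prs))  # 0.5
print(act_on_rate(prs))   # 0.75
```

Returning `None` instead of a number when nothing was shipped is a deliberate choice in this sketch: it keeps a permanently silent tool from reporting a perfect score on both metrics.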
Why Postil reports it first
Postil's dashboard leads with the silence rate: the share of your PRs where it said nothing, alongside the confidence distribution of every finding it did ship. Not because silence is a virtue in itself, but because publishing the number changes our incentives. A vendor that reports its silence rate cannot quietly inflate comment volume to look busy; drift toward noise shows up in a chart before your engineers feel it in their notifications. It is the same logic as a chef sitting in their own dining room.
What we are not claiming: that Postil's silence rate beats any competitor's. No peer has run our benchmark and we have not published comparative numbers, so there is nothing honest to claim yet. The claim is narrower and checkable: the metric exists, it is on the dashboard from day one, and you can watch it on your own traffic. You can see it run on three real diffs, including one where it correctly stays silent.
What to ask any vendor (including us)
- What share of PRs does your tool stay silent on, and where do I see that number for my repos, continuously?
- What share of shipped findings get dismissed or ignored, and is that on the dashboard too?
- Does your benchmark include clean PRs where the correct answer is silence? If not, what stops the tool from commenting on everything?
- Can I run advisory-only for two weeks and see these numbers before anything becomes a required check?
That last one matters most. The adoption pattern recommended in third-party integration guides is to run any AI reviewer advisory for a couple of weeks and promote it to a required check only if the dismissal rate stays under roughly 30%. If a vendor cannot show you the numbers that decision needs, that is itself the answer.
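That promotion rule can be sketched as a simple gate. The 30% cutoff comes from the guidance above; the outcome labels, the `min_findings` guard, and the choice to evaluate the whole advisory period as one aggregate (rather than week by week) are assumptions made for illustration.

```python
from collections import Counter

DISMISSAL_CUTOFF = 0.30  # "roughly 30%" per the guidance above


def dismissal_rate(outcomes: list[str]) -> float:
    """Share of shipped findings that were dismissed or ignored."""
    if not outcomes:
        raise ValueError("no findings shipped during the advisory period")
    counts = Counter(outcomes)
    return (counts["dismissed"] + counts["ignored"]) / len(outcomes)


def ready_for_required_check(outcomes: list[str], min_findings: int = 20) -> bool:
    """Promote only with enough evidence and a dismissal rate under the cutoff."""
    if len(outcomes) < min_findings:
        return False  # too few findings to judge (assumed guard, not from the guidance)
    return dismissal_rate(outcomes) < DISMISSAL_CUTOFF


# Hypothetical advisory periods: 4 of 20 findings dismissed or ignored, then 7 of 20.
quiet_enough = ["acted_on"] * 16 + ["dismissed"] * 3 + ["ignored"] * 1
too_noisy = ["acted_on"] * 13 + ["dismissed"] * 7
print(ready_for_required_check(quiet_enough))  # True
print(ready_for_required_check(too_noisy))     # False
```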
Sources
- CodeAnt: preventing AI code review overload (comment volume, 30%/50% thresholds)
- Lychee: 28-PR CodeRabbit audit (Sep 2025)
- DeepSource: AI code review benchmark critique (Feb 2026)
- Independent 4-tool parallel study, 146 PRs (May 2026)
- SWR-Bench (arXiv)
- GitHub: 60 million Copilot code reviews
- Practitioner threads: HN (Dec 2024), HN (Jan 2026), HN on the Qodo benchmark, r/webdev, r/ycombinator, r/cursor
Watch the number yourself.
Run Postil advisory on your next PRs. The silence rate is the first thing on the dashboard.