Every AI code review benchmark has the same winner: its author

June 2026 · Postil team

There is a pattern in this category that nobody really disputes. Every vendor that publishes a benchmark for AI code review wins it. Not most of them. Every one. The chart always has the publisher's logo on top, and the gap to second place is always comfortable. That is not a coincidence and it is not fraud. It is the predictable result of a design choice: when you build the test, pick the metric, and assemble the dataset, you control the answer before anyone runs anything. This piece walks through the public evidence for that claim and then gives you a five-point test you can apply to any benchmark in your tabs, including ours.

Exhibit A: the same dataset, two scores

The single most legible proof that the number travels with the scorer rather than the tool is Greptile's benchmark. Greptile reports an 82% overall bug-catch rate on its own evaluation, conducted in July 2025 and built from 50 real bug-fix PRs (10 per repo across five repos), with the bugs reconstructed by reverting the fixes onto clean forks. Then a competitor, Augment, re-ran the same five repositories. As DeepSource documented, Greptile scored 45% in that run, not 82%. Same repos, same tool, roughly half the score. The only thing that changed was who held the stopwatch. DeepSource's own summary of the category is the title of its piece: every AI code review vendor benchmarks itself, and wins.

Why catch rate alone is a rigged frame

The deeper problem is what the headline metric leaves out. Greptile's methodology, stated on its own benchmark page, scores only whether the original bug was detected. Verbatim: "false positives, style suggestions, and unrelated comments did not affect the catch rate." Read that again with a buyer's hat on. A tool that comments on everything cannot lose a recall-only benchmark, because the noise it generates is invisible to the score. But noise is the actual pain. Teams turn off AI reviewers because of false positives, not because of missed bugs, and a benchmark that refuses to count false positives is blind by construction to the failure mode that matters most. Catch rate without precision is half a measurement presented as a whole one.
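To make the blind spot concrete, here is a minimal sketch using made-up numbers, not data from any vendor's benchmark. Two hypothetical reviewers, `quiet` and `chatty`, face the same 50 planted bugs. The `Run` class and `score` function are names invented for this illustration, and the sketch makes the simplifying assumption that every comment that does not hit a planted bug is noise.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical outcome of one reviewer on one benchmark."""
    bugs_planted: int    # known bugs in the dataset
    bugs_caught: int     # comments that hit a planted bug
    other_comments: int  # everything else the tool said (assumed to be noise)


def score(run: Run) -> dict[str, float]:
    """Compute recall (the "catch rate"), precision, and F1 for one run."""
    recall = run.bugs_caught / run.bugs_planted
    total_comments = run.bugs_caught + run.other_comments
    precision = run.bugs_caught / total_comments if total_comments else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Illustrative numbers only.
quiet = Run(bugs_planted=50, bugs_caught=30, other_comments=10)
chatty = Run(bugs_planted=50, bugs_caught=40, other_comments=360)

for name, run in (("quiet", quiet), ("chatty", chatty)):
    s = score(run)
    print(f"{name}: recall={s['recall']:.2f} "
          f"precision={s['precision']:.2f} f1={s['f1']:.2f}")
```

On a recall-only chart the chatty reviewer wins, 0.80 to 0.60. Once precision is counted, nine of its ten comments turn out to be noise (precision 0.10 against 0.75), and its F1 falls well below the quiet reviewer's.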

The vendor-benchmark zoo

Once you know to look for it, the pattern repeats across the category. None of these are dishonest in the legal sense. They are reasonable marketing artifacts. The problem is reading them as science.

Vendor benchmark | Headline result                          | Who wins
Greptile         | 82% catch rate, false positives unscored | Greptile
Qodo             | F1 60.1%, "best overall"                 | Qodo
Augment          | F1 59%, ranked first                     | Augment
Macroscope       | Self-published precision claim           | Macroscope

Qodo's benchmark used 100 PRs with 580 issues injected by an LLM, and Qodo reports an F1 of 60.1% and ranks itself first. Augment's benchmark ran 50 PRs across five large open-source repos, described as an expanded and corrected version of Greptile's golden set, and Augment reports the highest F1 at 59% and ranks itself first. Macroscope self-publishes a precision claim on its own bug set as well. The directional reading of these is fine. Each tells you the vendor cares about a metric and tuned for it. The leaderboard reading is what fails, because there is no neutral referee and no shared ground truth across any two of these tests.

What independent data shows: near-zero agreement

The most useful counterweight is the one study that ran multiple tools in parallel without setting out to promote any single product's number. A practitioner ran four reviewers (CodeRabbit, Sentry Seer, Greptile, Cursor Bugbot) in parallel for 3.5 weeks across 146 merged PRs, producing 679 findings across 446 review events. The result: 93.4% of flagged locations were caught by exactly one tool, and zero locations were flagged by all four. Volume varied enormously: CodeRabbit emitted 281 findings, while Greptile emitted 120, with near-zero false positives across its verdicts. The author discloses up front that they work at Sentry, whose Seer is one of the four tools. That disclosure is exactly the honesty this whole piece is arguing for.

Sit with the 93.4% number. If these tools were measuring the same underlying thing, the way two thermometers measure the same temperature, they would overlap heavily. They almost never do. A leaderboard implies an agreed-upon ground truth, a fixed set of bugs that exist independently of the tool looking for them. The parallel run says that ground truth does not exist in practice. These products are emitting opinions about diffs, and how loud each one gets is a product decision, not a fact about the code.
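An overlap figure of this kind can be computed from nothing more than the list of locations each tool flagged. The sketch below uses invented tool names and findings, and it assumes two findings refer to the same location only when file and line match exactly. A real study has to choose its own matching rule, and that choice changes the result.

```python
from collections import Counter

# Hypothetical findings: each tool maps to the set of (file, line) locations it flagged.
findings = {
    "tool_a": {("api.py", 10), ("api.py", 42), ("db.py", 7)},
    "tool_b": {("api.py", 42), ("ui.ts", 3)},
    "tool_c": {("auth.py", 19)},
    "tool_d": {("db.py", 88), ("ui.ts", 120)},
}


def overlap_profile(findings: dict[str, set]) -> dict[int, float]:
    """Return {k: share of distinct locations flagged by exactly k tools}."""
    tools_per_location = Counter()
    for locations in findings.values():
        # Each tool contributes at most one vote per location.
        tools_per_location.update(locations)
    total = len(tools_per_location)
    by_count = Counter(tools_per_location.values())
    return {k: by_count[k] / total for k in sorted(by_count)}


profile = overlap_profile(findings)
for k, share in profile.items():
    print(f"flagged by exactly {k} tool(s): {share:.1%}")
```

In this toy data, six of the seven distinct locations are flagged by a single tool and none by all four. That is the same shape as the parallel run, though these numbers are illustrative only.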

The academic counter-model: score the silence

Academia built the test the vendor benchmarks omit. SWR-Bench is 1,000 manually verified PRs, deliberately split into 500 change-PRs and 500 clean-PRs. The clean half is the point. Any comment a tool generates on a clean PR counts, by definition, as a false positive. Evaluation is LLM-based, with roughly 90% agreement with human raters. That structure means a tool cannot win by commenting on everything, because half the test rewards saying nothing, and the headline finding is that current AI code review systems substantially underperform on this balanced framing. A benchmark with no clean PRs in it literally cannot punish a tool for noise. SWR-Bench can, which is why it lands differently from anything a vendor publishes.
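The effect of a clean half can be shown with a simplified scorer. This is a sketch of the balanced idea only, not SWR-Bench's actual pipeline, which relies on LLM-based evaluation. `ReviewedPR`, `balanced_score`, and both sample reviewers are hypothetical, and the function assumes the set contains at least one PR of each kind.

```python
from dataclasses import dataclass


@dataclass
class ReviewedPR:
    has_real_issue: bool  # ground truth: True for a change-PR, False for a clean-PR
    comments: int         # number of comments the reviewer left
    found_issue: bool     # True if some comment identified the real issue


def balanced_score(prs: list[ReviewedPR]) -> dict[str, float]:
    """Score recall on change-PRs and noise on clean-PRs separately."""
    change = [p for p in prs if p.has_real_issue]
    clean = [p for p in prs if not p.has_real_issue]
    return {
        "recall_on_change_prs": sum(p.found_issue for p in change) / len(change),
        # On a clean PR, any comment at all counts as a false positive.
        "noisy_share_of_clean_prs": sum(p.comments > 0 for p in clean) / len(clean),
    }


# Hypothetical "comment on everything" reviewer: finds both issues, talks on clean PRs too.
chatty = [
    ReviewedPR(True, 9, True), ReviewedPR(True, 7, True),
    ReviewedPR(False, 5, False), ReviewedPR(False, 6, False),
]
# Hypothetical selective reviewer: misses one issue, stays silent on clean PRs.
selective = [
    ReviewedPR(True, 1, True), ReviewedPR(True, 0, False),
    ReviewedPR(False, 0, False), ReviewedPR(False, 0, False),
]

print("chatty:   ", balanced_score(chatty))
print("selective:", balanced_score(selective))
```

The chatty reviewer keeps its perfect recall but is noisy on every clean PR, a cost that a set with no clean PRs would never record.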

What an honest evaluation looks like

Strip the marketing away and an evaluation worth trusting has to clear five bars. Use this as a checklist against any vendor's "benchmark," including the one in the tab next to this.

  • Clean PRs are in the set, and they are scored. If the test only contains PRs with planted bugs, silence is never the correct answer and noise is never penalized.
  • False positives are counted, not discarded. A recall-only score rewards the chattiest tool. Precision has to be in the chart, not a footnote.
  • A precision or silence metric is reported next to recall. One number without the other is half a measurement.
  • Raw artifacts are published so anyone can re-run. The Greptile 82%-to-45% gap only became visible because someone could re-run the dataset. Irreproducible scores are assertions.
  • The author is not the only vendor in the chart. If the publisher is also the winner, treat the result as directional marketing until a third party reproduces it.

Where Postil stands

We are not exempt from this critique, so we will be specific about what we do and do not claim. Postil publishes methodology, not a leaderboard. We report our own silence rate, the share of PRs where we said nothing, and our confirmed-finding rate, on public open-source PRs, with the raw review envelopes attached so anyone can inspect what the tool actually emitted. We report those numbers even where they are unflattering, because the alternative is the pattern this entire piece is about. It follows from the product doctrine: silence is a feature, findings without a citation are discarded, and the system fails closed rather than guessing loudly.
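For readers who want to compute the same two figures on their own review logs, a minimal sketch follows. Silence rate is defined above as the share of PRs with no findings. The article does not spell out a formula for confirmed-finding rate, so the definition here (confirmed findings divided by emitted findings) is an assumption, and the function names and sample counts are illustrative rather than Postil's code or data.

```python
from typing import Optional


def silence_rate(findings_per_pr: list[int]) -> float:
    """Share of reviewed PRs on which the reviewer emitted no findings."""
    if not findings_per_pr:
        raise ValueError("need at least one reviewed PR")
    return sum(1 for n in findings_per_pr if n == 0) / len(findings_per_pr)


def confirmed_finding_rate(confirmed: int, emitted: int) -> Optional[float]:
    """Assumed definition: confirmed / emitted findings (None if nothing was emitted)."""
    return confirmed / emitted if emitted else None


# Illustrative log: number of findings emitted on each of eight PRs.
findings_per_pr = [0, 0, 2, 0, 1, 0, 0, 3]
emitted = sum(findings_per_pr)  # 6 findings in total
confirmed = 4                   # hypothetical count confirmed by a human

print(f"silence rate: {silence_rate(findings_per_pr):.3f}")
print(f"confirmed-finding rate: {confirmed_finding_rate(confirmed, emitted):.3f}")
```

Reporting the two side by side matters: a high silence rate alone could mean a tool that never looks, and a high confirmed-finding rate alone could hide a tool that speaks on every PR.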

What we are explicitly not doing: we have not run a peer benchmark, and we publish no Postil score against any competitor. We have not put a rival tool on our fixtures, and we will not show you a chart with our logo on top of theirs, because we would be the author of that chart and this article is about why you should distrust exactly that. When we do publish detection numbers, they will be Postil's own numbers, on a test that includes clean PRs and counts false positives, with the artifacts published. Hold us to the same five-point checklist as everyone else.

Run the test yourself

The next time you hit a "we benchmarked the category and won" page, do not argue with the number. Apply the checklist. Are there clean PRs in the set? Are false positives counted? Is precision next to recall? Are the artifacts published so you could re-run it? Is the author the only vendor in the chart? Most vendor benchmarks fail three or more of those on the first read, and that failure is more informative than any catch-rate figure they print. Point the same five questions at us when our numbers ship. If a benchmark cannot survive the test, the score it reports was never the thing you needed to know.
