
Four AI code review benchmarks, four home-team winners

July 8, 2026 · Postil team

Four vendors surveyed here publish an AI code review benchmark (Greptile, Qodo, Augment, and Macroscope), and all four win their own. Each chart puts the publisher's logo on top. That is the predictable result of a design choice: when you build the test, pick the metric, and assemble the dataset, you influence the answer before anyone runs anything. This piece walks through the public evidence for that claim and gives you a five-point test for any benchmark in your browser tabs, and for our own published evidence.

Exhibit A: the same dataset, two scores

The clearest evidence that the number travels with the scorer rather than the tool is Greptile's benchmark. Greptile reports an 82% overall bug-catch rate on its own evaluation, conducted in July 2025 and built from 50 real bug-fix PRs (10 per repo across five repos), with each bug reconstructed by reverting its fix onto a clean fork. Then a competitor, Augment, re-ran the same five repositories. As DeepSource documented, Greptile scored 45% in that run, not 82%. Same repos, same tool, roughly half the score. What changed was who held the stopwatch: Augment ran the evaluation on what it describes as an expanded and corrected version of Greptile's set. In all four benchmarks surveyed here, the publisher ranks itself first.

Why catch rate alone is a rigged frame

The deeper problem is what the headline metric leaves out. Greptile's methodology, stated on its own benchmark page, scores only whether the original bug was detected. Verbatim: "false positives, style suggestions, and unrelated comments did not affect the catch rate." Read that again with a buyer's hat on. A tool that comments on everything cannot lose a recall-only benchmark, because the noise it generates is invisible to the score. But noise is the actual pain. Teams tend to turn off AI reviewers because of false positives rather than missed bugs, and a benchmark that does not count false positives is blind by construction to the failure mode that matters most. Catch rate without precision is half a measurement presented as a whole one.
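To make the asymmetry concrete, here is a minimal sketch with made-up numbers; none of them come from any vendor's data. The two tools, their comment counts, and the `ReviewRun` record are assumptions for illustration. It also simplifies by treating every comment that matches no known bug as noise, although in practice some of those could be valid findings outside the planted set.

```python
from dataclasses import dataclass


@dataclass
class ReviewRun:
    """Aggregate counts for one tool on a benchmark (hypothetical numbers)."""
    name: str
    bugs_in_set: int      # known bugs in the benchmark
    bugs_caught: int      # comments that matched a known bug
    other_comments: int   # every comment that matched no known bug


def catch_rate(run: ReviewRun) -> float:
    """Recall-only score: other_comments never enters the calculation."""
    return run.bugs_caught / run.bugs_in_set


def precision(run: ReviewRun) -> float:
    """Share of all comments that pointed at a known bug."""
    total = run.bugs_caught + run.other_comments
    return run.bugs_caught / total if total else 0.0


quiet = ReviewRun("quiet-tool", bugs_in_set=50, bugs_caught=40, other_comments=10)
chatty = ReviewRun("chatty-tool", bugs_in_set=50, bugs_caught=40, other_comments=360)

for run in (quiet, chatty):
    print(f"{run.name}: catch rate {catch_rate(run):.0%}, precision {precision(run):.0%}")
```

Both hypothetical tools post an 80% catch rate, but the chatty one does it with 400 comments, of which only one in ten points at a known bug. A chart that shows only the first number ranks them as equals.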

The vendor-benchmark zoo

Once you know to look for it, the pattern repeats across the category. None of these are dishonest in the legal sense. They are reasonable marketing artifacts. The problem is reading them as science.

Vendor benchmark	Headline result	Dataset	Who wins
Greptile	82% catch rate, false positives unscored	50 reverted bug-fix PRs, 5 repos	Greptile
Qodo	F1 60.1%, "best overall"	100 PRs, 580 LLM-injected bugs	Qodo
Augment	F1 59%, ranked first	50 PRs, "corrected" Greptile set	Augment
Macroscope	Self-published precision claim	Own bug set	Macroscope

Qodo's benchmark used 100 PRs with 580 issues injected by an LLM, and Qodo reports an F1 of 60.1% and ranks itself first. Augment's benchmark ran 50 PRs across five large open-source repos, described as an expanded and corrected version of Greptile's golden set, and Augment reports the highest F1 at 59% and ranks itself first. Macroscope self-publishes a precision claim on its own bug set as well. The directional reading of these is fine. Each tells you the vendor cares about a metric and tuned for it. The leaderboard reading is what fails, because there is no neutral referee and no shared ground truth across any two of these tests.
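F1 is the harmonic mean of precision and recall, so a single F1 figure does not reveal which of the two carried it. A small sketch, using illustrative precision and recall pairs that are assumptions and not any vendor's reported values:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three hypothetical reviewers with very different behaviour.
profiles = {
    "balanced": (0.60, 0.60),
    "noisy but thorough": (0.45, 0.90),
    "quiet but shallow": (0.90, 0.45),
}

for label, (p, r) in profiles.items():
    print(f"{label}: precision {p:.0%}, recall {r:.0%}, F1 {f1(p, r):.1%}")
```

All three profiles come out at an F1 of 60.0%, even though one of them is wrong more often than it is right. That is before accounting for the fact that the F1 figures in the table above were computed on different datasets.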

What independent data shows: near-zero agreement

The most useful counterweight is a study that ran multiple tools in parallel rather than publishing a single product's score. A practitioner ran four reviewers (CodeRabbit, Sentry Seer, Greptile, Cursor Bugbot) side by side for 3.5 weeks across 146 merged PRs, producing 679 findings across 446 review events. The result: 93.4% of flagged locations were caught by exactly one tool, and zero locations were flagged by all four. Volume varied enormously: CodeRabbit emitted 281 findings, while Greptile emitted 120, with near-zero false positives across its verdicts. The author works at Sentry, whose Seer is one of the four tools, and discloses that up front, which is exactly the honesty this whole piece is arguing for.

Sit with the 93.4% number. If these tools were measuring the same underlying thing, the way two thermometers measure the same temperature, they would overlap heavily. They almost never do. A leaderboard implies an agreed-upon ground truth, a fixed set of bugs that exist independently of the tool looking for them. The parallel run says that ground truth does not exist in practice. These products are emitting opinions about diffs, and each one's volume is a product decision, not a fact about the code.
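The overlap figure is straightforward to compute on your own repositories if you log each tool's findings by location. A minimal sketch, assuming a finding is keyed by a hypothetical (PR number, file path, line) tuple. The tool names and data below are invented, and the study's own matching rules may well differ.

```python
from collections import defaultdict

# Hypothetical findings: tool name -> set of (pr_number, file_path, line) locations.
findings = {
    "tool_a": {(101, "api/auth.py", 42), (101, "api/auth.py", 88), (102, "db/query.py", 17)},
    "tool_b": {(101, "api/auth.py", 42), (103, "ui/form.ts", 5)},
    "tool_c": {(102, "db/pool.py", 230)},
    "tool_d": {(104, "cli/main.go", 12), (104, "cli/main.go", 64)},
}


def overlap_histogram(findings_by_tool):
    """Map 'number of tools that flagged a location' to 'count of such locations'."""
    tools_per_location = defaultdict(set)
    for tool, locations in findings_by_tool.items():
        for location in locations:
            tools_per_location[location].add(tool)

    histogram = defaultdict(int)
    for tools in tools_per_location.values():
        histogram[len(tools)] += 1
    return dict(histogram)


hist = overlap_histogram(findings)
total = sum(hist.values())
print(f"flagged by exactly one tool: {hist.get(1, 0) / total:.1%}")
print(f"flagged by all {len(findings)} tools: {hist.get(len(findings), 0)}")
```

Exact-line matching is the strictest possible rule and will understate agreement when two tools flag the same issue a few lines apart. Any tolerance window you add is itself a scoring decision, which is the point of this section.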

The academic counter-model: score the silence

Academia built the test the vendor benchmarks omit. SWR-Bench comprises 1,000 manually verified PRs, deliberately split into 500 change-PRs and 500 clean-PRs. The clean half is the point. Any comment a tool generates on a clean PR counts, by definition, as a false positive. Evaluation is LLM-based, with roughly 90% agreement with human raters. That structure means a tool cannot win by commenting on everything, because half the test rewards saying nothing. The headline finding is that current AI code review systems substantially underperform on this balanced framing. A benchmark with no clean PRs in it cannot punish a tool for commenting on code that needed no comment. SWR-Bench can, which is why it lands differently from anything a vendor publishes.
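The mechanics of a balanced set can be sketched in a few lines. This illustrates the idea described above and is not SWR-Bench's actual harness: the `PullRequest` record, the PR-level scoring rule, and the sample data are all assumptions. It also simplifies by counting any comment on a change-PR as a catch, without checking that the comment matches the real issue.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    has_real_issue: bool   # ground-truth label: change-PR (True) or clean-PR (False)
    tool_commented: bool   # did the reviewer under test raise anything?


def score(prs):
    """PR-level precision and recall; any comment on a clean PR is a false positive."""
    tp = sum(pr.has_real_issue and pr.tool_commented for pr in prs)
    fp = sum((not pr.has_real_issue) and pr.tool_commented for pr in prs)
    fn = sum(pr.has_real_issue and (not pr.tool_commented) for pr in prs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A reviewer that comments on every PR, run against two hypothetical sets.
bugs_only = [PullRequest(True, True) for _ in range(10)]
balanced = bugs_only + [PullRequest(False, True) for _ in range(10)]

print("bugs-only set:", score(bugs_only))
print("balanced set: ", score(balanced))
```

On the bugs-only set the always-commenting reviewer scores perfectly on both measures. Adding an equal number of clean PRs leaves its recall untouched and halves its precision, which is the penalty a set without clean PRs has no way to apply.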

What an honest evaluation looks like

Strip the marketing away and an evaluation worth trusting has to clear five bars. Use this as a checklist against any vendor's "benchmark," including the one in the tab next to this.

Clean PRs are in the set, and they are scored. If the test only contains PRs with planted bugs, silence is never the correct answer and noise is never penalized.
False positives are counted, not discarded. A recall-only score rewards the chattiest tool. Precision has to be in the chart, not a footnote.
A precision or silence metric is reported next to recall. One number without the other is half a measurement.
Enough artifacts are published for a fair re-run. The Greptile 82%-to-45% gap only became visible because someone could re-run the dataset. Irreproducible scores are assertions, and private datasets should be treated as internal evidence rather than public leaderboards.
The author is not the only vendor in the chart. If the publisher is also the winner, treat the result as directional marketing until a third party reproduces it.
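If you are triaging several vendor pages at once, the five bars above can be recorded as a small structure. This is only a note-taking sketch: the field names and the all-or-nothing verdict rule are assumptions, not an established rubric.

```python
from dataclasses import dataclass, fields


@dataclass
class BenchmarkAudit:
    """One yes/no answer per bar in the checklist (field names are illustrative)."""
    clean_prs_scored: bool
    false_positives_counted: bool
    precision_reported_with_recall: bool
    artifacts_allow_rerun: bool
    independent_of_winner: bool

    def failed_bars(self):
        """Names of the bars this benchmark does not clear."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def verdict(self):
        return "usable as a comparison" if not self.failed_bars() else "directional only"


# Example: a hypothetical recall-only benchmark, published and won by its own author.
audit = BenchmarkAudit(
    clean_prs_scored=False,
    false_positives_counted=False,
    precision_reported_with_recall=False,
    artifacts_allow_rerun=True,
    independent_of_winner=False,
)
print(audit.verdict(), audit.failed_bars())
```

Listing the failed bars by name, instead of collapsing them into a score, keeps the audit from turning into yet another leaderboard.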

Where Postil stands

We are not exempt from this critique, so we are specific about what we do and do not claim. Postil publishes methodology, not a leaderboard. The evidence page links real GitHub check-runs so each published catch can be inspected at its source. We report no Postil score against a competitor. The product doctrine is narrower: silence is a feature, findings without a citation are discarded, and the system fails closed rather than guessing loudly.

We have not run a peer benchmark or put a rival tool on our fixtures. A chart authored by Postil with the Postil logo above competitors would have the same conflict described throughout this article. Apply the same five-point checklist to our published evidence.

Run the test yourself

The next time you hit a "we benchmarked the category and won" page, do not argue with the number. Apply the checklist. Are there clean PRs in the set? Are false positives counted? Is precision next to recall? Are enough artifacts public for a fair re-run, or is the dataset private? Is the author the only vendor in the chart? The answers expose what each score measures and omits more clearly than the leaderboard does. Point the same five questions at Postil's published evidence.

Sources

Greptile benchmark (82% catch rate, false positives explicitly unscored; 50 reverted bug-fix PRs)
DeepSource benchmark survey (Greptile 82% vs 45% on Augment's re-run)
Augment: we benchmarked 7 AI code review tools (F1 59%, Augment ranks first)
Qodo: how we built a real-world benchmark (F1 60.1%, Qodo ranks first; 580 LLM-injected bugs)
Independent 4-tool parallel study, 146 PRs, 679 findings (93.4% caught by exactly one tool, zero by all four; author at Sentry, COI disclosed)
SWR-Bench (arXiv) (1,000 PRs, 500 clean / 500 change, false positives scored by definition)
