About Long-Context-Code-Bench
Long-Context-Code-Bench evaluates AI coding agents on enterprise-scale repositories with ~20,000 files. Unlike existing benchmarks that focus on small codebases, LCB measures what matters most for enterprise adoption: the ability to understand, modify, and integrate changes across massive real-world repositories.
Each agent is tested on recreating actual PR changes from the Elasticsearch repository given only the PR description, with no access to the solution or git history. Solutions are evaluated by a judge model (Claude Sonnet 4.5) that compares the agent's changes against the ground truth diff, scoring on correctness, completeness, code reuse, best practices, and unsolicited documentation.
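As a rough illustration, each per-PR judgment can be thought of as a record with one score per dimension plus an aggregate. The sketch below is hypothetical: the field names, the unweighted-mean aggregation, and the `JudgeScores` class are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical per-PR judge record; field names and the equal-weight
# aggregation are assumptions, not taken from the benchmark's code.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    correctness: float       # -1 to 1, agreement with the ground truth diff
    completeness: float      # -1 to 1, coverage of the required changes
    code_reuse: float        # -1 to 1, reuse of existing codebase patterns
    best_practices: float    # -1 to 1, coding standards and maintainability
    unsolicited_docs: float  # penalty for unnecessary documentation

    def aggregate(self) -> float:
        """Unweighted mean of the five dimensions (assumed weighting)."""
        parts = [
            self.correctness,
            self.completeness,
            self.code_reuse,
            self.best_practices,
            self.unsolicited_docs,
        ]
        return sum(parts) / len(parts)


example = JudgeScores(0.8, 0.6, 0.5, 0.7, 0.0)
print(f"aggregate score: {example.aggregate():.2f}")
```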
Most Openly Verifiable Benchmark: For each task, you can view the judge's detailed rationale, side-by-side diff comparisons, and complete agent execution logs, making this the most transparent and verifiable coding agent benchmark available. The entire benchmark is reproducible and fully open source at github.com/AugmentedAJ/Long-Context-Code-Bench.
Agent Leaderboard
Results from single-agent evaluation across 100 Elasticsearch PRs. Each agent attempts to recreate the PR changes given only the task description.
All scores shown are averages across all 100 PRs for each agent.
- Win Rate: Percentage of PRs where the agent's aggregate score equals or exceeds the median across all agents for that PR. Calculated as (wins + 0.5 × ties) / total_matches (a short sketch of this calculation follows the metric definitions).
- Correctness (−1 to 1): Average accuracy of the agent's changes vs ground truth. +1 = perfect match, 0 = neutral, −1 = completely wrong.
- Completeness (−1 to 1): Average coverage of required changes. +1 = all changes implemented, 0 = partial, −1 = critical changes missing.
- Code Reuse (−1 to 1): Average score for leveraging existing codebase patterns. +1 = excellent reuse, −1 = reinvents the wheel.
- Best Practices (−1 to 1): Average adherence to coding standards and maintainability. +1 = exemplary, −1 = poor practices.
- Unsol. Docs (−1 to 1): Average penalty for unsolicited documentation. 0 = appropriate, −1 = excessive/unnecessary docs.
All metrics scored by Claude Sonnet 4.5 comparing agent output against the human-authored ground truth PR diff.
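To make the win-rate formula concrete, here is a minimal sketch. It assumes a "win" means the agent's aggregate score is strictly above the per-PR median across all agents, a "tie" means it equals the median, and total_matches is the number of PRs the agent was scored on; the `win_rate` helper and its data layout are illustrative, not the benchmark's actual implementation.

```python
# Hedged sketch of (wins + 0.5 * ties) / total_matches.
# Assumptions: win = strictly above the per-PR median, tie = equal to it.
from statistics import median


def win_rate(agent: str, scores_by_pr: dict[str, dict[str, float]]) -> float:
    """scores_by_pr maps PR id -> {agent name -> aggregate score}."""
    wins = ties = total_matches = 0
    for pr_scores in scores_by_pr.values():
        if agent not in pr_scores:
            continue
        med = median(pr_scores.values())
        total_matches += 1
        if pr_scores[agent] > med:
            wins += 1
        elif pr_scores[agent] == med:
            ties += 1
    return (wins + 0.5 * ties) / total_matches if total_matches else 0.0


# Toy example with three hypothetical agents on two PRs.
scores = {
    "PR-1": {"agent-a": 0.9, "agent-b": 0.4, "agent-c": 0.4},
    "PR-2": {"agent-a": 0.2, "agent-b": 0.8, "agent-c": 0.5},
}
print(f"agent-a win rate: {win_rate('agent-a', scores):.2f}")  # 0.50
```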
| Rank | Agent | Win Rate | Correctness | Completeness | Code Reuse | Best Practices | Unsol. Docs |
|---|---|---|---|---|---|---|---|
Head-to-Head Details by PR
View pairwise agent judgments and results for each PR