Long-Context-Code-Bench

About Long-Context-Code-Bench

Long-Context-Code-Bench (LCB) evaluates AI coding agents on enterprise-scale repositories with ~20,000 files. Unlike existing benchmarks that focus on small codebases, LCB measures what matters most for enterprise adoption: the ability to understand, modify, and integrate changes across massive real-world repositories.

Each agent is tested on recreating actual PR changes from the Elasticsearch repository given only the PR description, with no access to the solution or git history. Solutions are evaluated by a judge model (Claude Sonnet 4.5) that compares the agent's changes against the ground truth diff, scoring on correctness, completeness, code reuse, best practices, and unsolicited documentation.
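For concreteness, here is a minimal sketch of what a per-PR judge verdict could look like. The field names, the dataclass, and the aggregate formula are illustrative assumptions for this page, not the benchmark's actual schema; only the five scoring dimensions and the −1 to 1 scale come from the benchmark itself.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Hypothetical per-PR verdict from the judge model (Claude Sonnet 4.5).

    Each dimension is scored on a -1 to 1 scale, mirroring the metric
    definitions in the leaderboard section below.
    """
    pr_id: str
    correctness: float       # accuracy of the agent's diff vs. ground truth
    completeness: float      # coverage of the required changes
    code_reuse: float        # use of existing codebase patterns
    best_practices: float    # coding standards and maintainability
    unsolicited_docs: float  # penalty for documentation nobody asked for
    rationale: str           # free-text explanation shown in the public logs

    @property
    def aggregate(self) -> float:
        # Assumption: a simple mean of the five dimensions; the benchmark
        # may weight or combine them differently.
        return (self.correctness + self.completeness + self.code_reuse
                + self.best_practices + self.unsolicited_docs) / 5
```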

๐Ÿ” Most Openly Verifiable Benchmark: For each task, you can view the judge's detailed rationale, side-by-side diff comparisons, and complete agent execution logsโ€”making this the most transparent and verifiable coding agent benchmark available. The entire benchmark is reproducible and fully open source at github.com/AugmentedAJ/Long-Context-Code-Bench.

📊 Version v1: This release features 100 PRs from the Elasticsearch repository (~20,000 files) with single-agent evaluation. Future versions will include head-to-head agent-as-judge comparisons, diverse codebases across multiple languages (Java, TypeScript, Go, Rust), and even larger repository sizes to comprehensively evaluate context engines and retrieval systems at scale.

Agent Leaderboard

Results from single-agent evaluation across 100 Elasticsearch PRs. Each agent attempts to recreate the PR changes given only the task description.

📊 Metric Definitions:

All scores shown are averages across all 100 PRs for each agent.

  • Win Rate: Percentage of PRs where the agent's aggregate score meets or beats the median across all agents for that PR, with ties counted as half a win. Calculated as (wins + 0.5 × ties) / total_matches (see the sketch after these definitions).
  • Correctness (−1 to 1): Average accuracy of the agent's changes vs ground truth. +1 = perfect match, 0 = neutral, −1 = completely wrong.
  • Completeness (−1 to 1): Average coverage of required changes. +1 = all changes implemented, 0 = partial, −1 = critical changes missing.
  • Code Reuse (−1 to 1): Average score for leveraging existing codebase patterns. +1 = excellent reuse, −1 = reinvents the wheel.
  • Best Practices (−1 to 1): Average adherence to coding standards and maintainability. +1 = exemplary, −1 = poor practices.
  • Unsol. Docs (−1 to 1): Average penalty for unsolicited documentation. 0 = appropriate, −1 = excessive/unnecessary docs.

All metrics scored by Claude Sonnet 4.5 comparing agent output against the human-authored ground truth PR diff.
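As a worked illustration of the Win Rate formula above, the sketch below computes it from per-PR aggregate scores. The per-PR median comparison and the 0.5 credit for ties follow the definition; the function and variable names are assumptions for this example, not the benchmark's code.

```python
from statistics import median

def win_rate(agent_scores: dict[str, list[float]], agent: str) -> float:
    """Fraction of PRs where `agent`'s aggregate score beats the per-PR
    median across all agents, counting ties as half a win:
    (wins + 0.5 * ties) / total_matches."""
    n_prs = len(next(iter(agent_scores.values())))
    wins = ties = 0
    for i in range(n_prs):
        scores_for_pr = [scores[i] for scores in agent_scores.values()]
        med = median(scores_for_pr)
        if agent_scores[agent][i] > med:
            wins += 1
        elif agent_scores[agent][i] == med:
            ties += 1
    return (wins + 0.5 * ties) / n_prs

# Example with three hypothetical agents over two PRs:
scores = {"agent_a": [0.8, 0.2], "agent_b": [0.5, 0.2], "agent_c": [0.1, 0.9]}
print(win_rate(scores, "agent_a"))  # 0.75: one win, one tie with the median
```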

Leaderboard columns: Rank · Agent · Win Rate · Correctness · Completeness · Code Reuse · Best Practices · Unsol. Docs

Head-to-Head Details by PR

View pairwise agent judgments and results for each PR
