I found myself refreshing ~30 different leaderboards whenever a new model dropped, so I made a dashboard that scrapes my favorites and computes a composite score using the same methodology as Epoch AI’s Epoch Capabilities Index (ECI). For those who are unfamiliar, Epoch combines multiple benchmarks to estimate a single “capability” level per model (kind of like an IQ score). Their index is great, but it doesn’t track a lot of the benchmarks I care about, especially unbounded ones like EQ-Bench.
Capability Index
The index is anchored so that GPT-4.1 is always at 100 and GPT-5 is always at 150. The other values might drift as the benchmark composition changes. Epoch’s model requires scores in the range 0 to 1, so I normalize Elo-based evals such that the top model scores 100% and every other model x scores 2 * P(model_x wins against top_model). The factor of 2 is there because the top model would win against itself with probability 0.5. To reduce noise from models with low benchmark coverage, I require that each model appear on at least eight benchmarks and on at least one third of the benchmarks that have data on or before that model’s release date. For sub-indices, I apply the same one-third rule within the category but reduce the lower bound to four across the full dataset.
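A minimal sketch of the three steps described above may make them more concrete. It is not the dashboard's actual code. It assumes the standard Elo logistic curve with a 400-point scale, a simple linear rescaling for the 100/150 anchoring, and a per-benchmark "first data date" for the coverage check. The function names, parameters, and input shapes are all hypothetical.

```python
from datetime import date
from fractions import Fraction


def elo_to_unit_score(rating, top_rating, scale=400.0):
    """Map an Elo rating to 2 * P(model beats the top model).

    Assumption: standard Elo logistic curve with a 400-point scale. A
    leaderboard fitted with a different scale would need that value instead.
    """
    p_win = 1.0 / (1.0 + 10.0 ** ((top_rating - rating) / scale))
    return 2.0 * p_win  # the top model itself gets 2 * 0.5 = 1.0


def anchor_index(raw, raw_low, raw_high, low=100.0, high=150.0):
    """Linearly rescale a raw capability so two reference models land on fixed values.

    Assumption: the anchoring is a plain linear map; raw_low and raw_high are
    the raw capabilities of the two reference models (GPT-4.1 and GPT-5 in the post).
    """
    return low + (high - low) * (raw - raw_low) / (raw_high - raw_low)


def is_eligible(model_benchmarks, release_date, benchmark_first_dates,
                min_count=8, min_fraction=Fraction(1, 3)):
    """Coverage filter: enough benchmarks overall, and a large enough share of
    the benchmarks that already had data on or before the model's release date."""
    available = {name for name, first in benchmark_first_dates.items()
                 if first <= release_date}
    if not available:
        return False
    covered = set(model_benchmarks) & available
    enough_total = len(set(model_benchmarks)) >= min_count
    enough_share = Fraction(len(covered), len(available)) >= min_fraction
    return enough_total and enough_share


print(elo_to_unit_score(1500, 1500))  # 1.0
```

The share is compared with exact fractions so that a model sitting on exactly one third of the available benchmarks is not rejected by floating-point rounding.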
Expect that I will add / remove benchmarks and adjust the formulas over time. The main objective is to strongly differentiate between the latest models, rather than to track long-term progress. It’s a non-exhaustive compilation of evals that I think are high-signal and that roughly correlate with my personal experience and the anecdotes I hear from others.
How did I decide what to include? My concerns with most benchmarks fall into three categories: data quality, memorization, and relevance. Data quality is perhaps the biggest persistent issue across benchmarks and can be broken down into sub-categories like incorrect answer keys, faulty automated graders, impossible tasks, and many more. The best way to prevent quality issues is simple but boring: look at the data. Ideally you will have a human, or several, attempt each task in your benchmark. If this is impractical, you’d better have a good automated pipeline and manually validate its output. When I evaluate a benchmark, I either look at the data myself or judge whether the authors or someone with domain knowledge have done this thoroughly.
Relevance is a close runner-up for the title of biggest benchmark issue. Many benchmarks make grand claims about measuring some ability, but then quickly get saturated even though models are still clearly lacking in that area. Sometimes this is because the authors generalized too broadly while the underlying tasks still measure something useful; in other cases the tasks are poor imitations of real use-cases and success does not translate. Unbounded tasks (Elo- or profit-based, for example) or aggregation methods like Epoch’s can resolve the saturation problem, while transferability requires authors to ask “what would this look like in production” and design their tasks around the answer. I generally filter for benchmarks that make specific, detailed claims and have sensible orderings. Newer models in a family are almost always better; occasionally there are outliers, but if a benchmark consistently disagrees, it is suspicious. Labs all have extensive internal evals to ensure they are not regressing on capabilities, and they rarely miss the mark.
Memorization was historically a large problem, since any public dataset quickly gets scraped and included in the pretraining mix. Labs attempt to filter out answers for benchmarks they care about, but this is imperfect and there’s no way to know exactly what is and isn’t contaminated. Fortunately, memorization is becoming much less of a problem with newer benchmarks (though it is still something to consider), as we’ve evolved from simple question-answer pairs to “complete this long-horizon workflow.” Full trajectories are often not published, and memorizing perfectly across millions of tokens would be much more difficult. When I look at strictly knowledge-based benchmarks, I prefer those where the answers cannot be easily found through Google and that require multiple reasoning hops to arrive at.
METR Time Horizon vs Capability Index
For fun, I’ve calculated the correlation between the indices and the METR Time Horizon; the coding sub-index has a particularly strong r^2 of 0.88. METR is extremely thorough and has produced a great benchmark, but I’ve excluded it from my index for two reasons. First, it is mostly saturated at the time of writing. Second, I wanted to see how well I could predict their results for new models.
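For readers who want to reproduce this kind of check, here is a small sketch of computing r^2 (the squared Pearson correlation) between two paired lists. It is an illustration only: the function name is hypothetical, and the post does not say whether the time horizons were log-transformed before the fit, so any such transform is left to the caller.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 3:
        raise ValueError("need at least three paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r^2 is undefined when one series is constant")
    return (cov * cov) / (var_x * var_y)


# Placeholder numbers purely to show the call; they are not real results.
print(r_squared([1, 2, 3], [1, 3, 2]))  # 0.25
```

Because time horizons tend to span several orders of magnitude, one plausible choice (an assumption, not something stated above) is to pass `math.log` of the horizon as `ys`, paired with index values for the same models as `xs`.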
US vs China Frontier
I’ve also made a tracker for the US-China gap, which has shrunk and currently stands at ~5 months. This is slightly shorter than Epoch’s January estimate of 7 months, and the fit line does show the gap narrowing, but looking more closely at the data I do not see a clear trend. I also remain skeptical that it will narrow further, given capital constraints and export controls. Currently, Chinese labs seem to lack the ability to secure the necessary compute and data. I will of course revisit this assumption as we get updates.
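One way a gap "in months" could be defined is sketched below. The post does not spell out its method, so this is an assumption: take the best Chinese score to date, find the first date on which a US model scored at least that high, and report the time between the two. The function names and the `(release_date, score)` input shape are hypothetical.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # average month length


def frontier(points):
    """Running-max frontier: keep only points that beat every earlier score."""
    best = float("-inf")
    out = []
    for release, score in sorted(points):
        if score > best:
            best = score
            out.append((release, score))
    return out


def lag_in_months(us_points, cn_points):
    """Months between China's latest frontier point and the first US model
    that scored at least as high. Returns None if either side has no data
    or no US model has reached that score yet."""
    us_frontier = frontier(us_points)
    cn_frontier = frontier(cn_points)
    if not us_frontier or not cn_frontier:
        return None
    cn_date, cn_score = cn_frontier[-1]
    for us_date, us_score in us_frontier:
        if us_score >= cn_score:
            # Negative when the US only reached this score after China did.
            return (cn_date - us_date).days / DAYS_PER_MONTH
    return None
```

Under this definition the result is sensitive to individual release dates, which is one reason a fitted trend line and the raw points can tell slightly different stories.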
US-China Frontier Delta