Why Do Open-Source and Chinese LLMs Lead Coding Benchmarks but Struggle in the Real World?
Do Coding Benchmarks Still Mean Anything? Lessons from SWE-rebench
Youssef Hosni · 8 min read
Over the past year, coding benchmarks have become one of the main ways we compare LLMs. When a new model is released, the first question many people ask is simple: how does it score on the SWE-bench Leaderboard?
Some open-source models, including several Chinese models, have posted impressive numbers on SWE-bench. In some cases, they match the scores of leading closed models such as Claude Opus.
But when these same models are evaluated on fresh tasks using SWE-rebench, the picture changes. Performance drops across the board, and the relative ranking between models begins to shift. Models that looked nearly identical on SWE-bench separate more clearly when tested on recently collected issues.
This raises an uncomfortable but important question: are we measuring true generalization, or familiarity with a benchmark that has been public for too long?
In this article, I will break down what SWE-rebench changes, what the results actually show, and what they imply about choosing coding models for real-world engineering work.