The top of the SWE-bench Verified leaderboard was at around 20% in mid-2024, i.e. AI was failing at most tasks.
Now it's at 70%.
Clearly it has gotten objectively better at tackling typical development tasks.
And it's not like it went from 2% to 7%.
The pressure on AI companies to release a new SOTA model is real, as the technology is rapidly becoming commoditised. I think people have good reason to be skeptical of these benchmark results.
But there are plenty of people who have actually tried LLMs for real work and swear they work now. Do you think they are all lying?
Many of them have good reputations, not just noobs.