
Isn't SWE-bench based on public GitHub issues? Couldn't the increase in performance also be explained by continued training on newer scraped GitHub data, i.e. training on the test set?

The pressure on AI companies to release a new SOTA model is real, as the technology rapidly becomes commoditised. I think people have good reason to be skeptical of these benchmark results.

replies(1): >>killer+D

>>lexand+(OP)
That sounds like a conspiracy theory. If it were just some mysterious benchmark and nothing else, then sure, you would have reason to be skeptical.

But there are plenty of people who have actually tried LLMs for real work and swear they work now. Do you think they are all lying?

Many of them are people with good reputations, not just noobs.