My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
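For anyone who wants to picture that design, here's a rough Python sketch. It's only my reading of the setup described above, not the study's actual code: the function names, the fair coin flip per issue, and the simple ratio of mean times are all my own assumptions.

```python
import random
from statistics import mean


def assign_conditions(issue_ids, seed=None):
    """Randomly label each issue 'ai' or 'no_ai'.

    Assumption: an independent fair coin flip per issue.
    """
    rng = random.Random(seed)
    return {issue: rng.choice(["ai", "no_ai"]) for issue in issue_ids}


def time_ratio(minutes_by_issue, conditions):
    """Mean time on AI-allowed issues divided by mean time on no-AI issues.

    A ratio above 1.0 means this developer was slower when AI was allowed.
    Returns None if either condition has no completed issues.
    """
    ai = [m for i, m in minutes_by_issue.items() if conditions[i] == "ai"]
    no_ai = [m for i, m in minutes_by_issue.items() if conditions[i] == "no_ai"]
    if not ai or not no_ai:
        return None
    return mean(ai) / mean(no_ai)


# One developer, 15 hypothetical issue IDs.
conditions = assign_conditions(range(15), seed=0)
```

With only about 15 issues per developer, a per-developer ratio like this is going to be noisy, which is worth keeping in mind when reading the individual results.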
A quarter of the participants saw increased performance; three quarters saw reduced performance.
One of the top performers with AI was also the one with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve for AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).
Current LLMs just are not as good as they are sold to be as programming assistants, and people consistently predict and self-report in the wrong direction on how useful they are.
One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
"No, the 2.8 release is the first good one. It massively improves workflows"
Then, 6 months later, the study comes out.
"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"
At some point, you roll your eyes and assume it is just snake oil sales
* the release of agentic workflow tools
* the release of MCPs
* the release of new models, Claude 4 and Gemini 2.5 in particular
* subagents
* asynchronous agents
Any or all of these could have had a big or small impact. For example, I’m big on agentic tools, skeptical of MCPs, and don’t think we yet understand subagents. That’s different from those who, for example, think MCPs are the future.
> At some point, you roll your eyes and assume it is just snake oil sales
No, you have to realize you’re talking to a population of people, and not necessarily the same person. Opinions are going to vary; it’s not literally the same person each time.
There are surely snake oil salesmen, but you can’t buy anything from me.