One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PR landed.
There are also a few PRs that never got accepted because the repro wasn't as strong or clear.
If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance; three quarters saw reduced performance.
One of the top performers with AI was also the developer with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate
Not all payment is cash. Compute credits are still compensation by any measure.
I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing change. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, one to use AI tools freely, and one to avoid AI tools entirely. Then teach these developers open source (like a course based on this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then, at the end, track a number of metrics such as leadership position in the community, coding/non-coding contributions, emotional connection to the project, social connections made with the community, knowledge of the code base, etc.
Personally, my prior is that the no-AI group would likely still be ahead overall.
Developers do spend time totally differently, though - this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not - in general, when these devs have AI, they spend a smaller % of time writing code and a larger % of time working with AI (which... makes sense).
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect: some developers seem to think they were trading speed for ease, and others don't!
> like the developers deliberately picked "easy" tasks that they already knew how to do
We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Paper is here: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
TLDR: mixed evidence that developers make it less effortful, from quantitative and qualitative reports. Unclear effect.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person (like steveklabnik above) keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
https://softwarecrisis.dev/letters/llmentalist/
Plus there's a gambling mechanic: Push the button, sometimes get things for free.
This shows that everyone in the study (economics experts, ML experts, and even the developers themselves, even after gaining experience) is a novice if we look at them from the perspective of the Dunning-Kruger effect [1].
[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."
Anyway, AI as the tech currently stands is a new skill to use, and it takes us humans time to learn, but once we do it well, it becomes a force multiplier.
E.g., see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...
[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...
Apple's Response to iPhone 4 Antenna Problem: You're Holding It Wrong https://www.wired.com/2010/06/iphone-4-holding-it-wrong/
We speak (the best we can) to changes in amount of code -- I'll note that this metric is quite messy and hard to reason about!
https://www.businessinsider.com/apple-antennagate-scandal-ti...
e.g., Nokia 1600 user guide from 2005 (page 16) [0]
[0] https://www.instructionsmanuals.com/sites/default/files/2019...
Therefore, classical marketing is less dominant, although more present at downstream sellers.
You brought up Rust; it is fascinating.
Rust's type system differs from typical Hindley-Milner by having operations that can remove definitions from the scope's environment.
Rust was conceived in 2006.
In 2006 there were already HList papers by Oleg Kiselyov [1] showing how to keep type-level key-value lists with addition, removal, and lookup, and type-level stateful operations like in [2] were already possible, albeit, most probably, not with nice monadic syntax support.
[1] https://okmij.org/ftp/Haskell/HList-ext.pdf
[2] http://blog.sigfpe.com/2009/02/beyond-monads.html
It was entirely possible to have a prototype Rust embedded into Haskell, with the borrow checker implemented as type-level manipulation over a doubly parameterized state monad. But it was not; Rust was not embedded into Haskell, and now it will never get effects (even as weak as monad transformers) and, as a consequence, will never get proper high-performance software transactional memory.
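To make that concrete, here is a minimal sketch of the kind of thing I mean, assuming a reasonably modern GHC. It is my own toy example, not the HList library's API and certainly not a real borrow checker: the names Checked, borrow and release are made up. A doubly parameterized ("indexed") state-like monad carries a type-level list of currently borrowed names, and release removes an entry from that type-level environment, so releasing something you never borrowed is a compile-time error.

    {-# LANGUAGE DataKinds #-}
    {-# LANGUAGE TypeFamilies #-}
    {-# LANGUAGE TypeOperators #-}
    {-# LANGUAGE KindSignatures #-}
    {-# LANGUAGE ScopedTypeVariables #-}
    {-# LANGUAGE TypeApplications #-}
    {-# LANGUAGE AllowAmbiguousTypes #-}
    {-# LANGUAGE RebindableSyntax #-}

    -- A toy sketch (hypothetical names), not the HList API: an indexed
    -- monad whose type tracks a type-level list of borrowed names.
    module BorrowSketch where

    import Prelude hiding (return, (>>=), (>>))
    import qualified Prelude as P
    import GHC.TypeLits (Symbol)

    -- Type-level removal of a name from a type-level list.
    type family Remove (x :: Symbol) (xs :: [Symbol]) :: [Symbol] where
      Remove x '[]       = '[]
      Remove x (x ': xs) = xs
      Remove x (y ': xs) = y ': Remove x xs

    -- Type-level membership test, used as a constraint on release.
    type family Elem (x :: Symbol) (xs :: [Symbol]) :: Bool where
      Elem x '[]       = 'False
      Elem x (x ': xs) = 'True
      Elem x (y ': xs) = Elem x xs

    -- The doubly parameterized monad: 'i' is the borrow set before the
    -- action runs, 'j' is the borrow set after it.
    newtype Checked (i :: [Symbol]) (j :: [Symbol]) a =
      Checked { runChecked :: IO a }

    -- Indexed return/bind so do-notation works via RebindableSyntax.
    return :: a -> Checked i i a
    return = Checked . P.pure

    (>>=) :: Checked i j a -> (a -> Checked j k b) -> Checked i k b
    Checked m >>= f = Checked (m P.>>= (runChecked . f))

    (>>) :: Checked i j a -> Checked j k b -> Checked i k b
    m >> n = m >>= \_ -> n

    -- Borrowing adds a name to the index; releasing requires the name
    -- to be present and removes it, i.e. it deletes a definition from
    -- the type-level environment.
    borrow :: forall x i. Checked i (x ': i) ()
    borrow = Checked (P.pure ())

    release :: forall x i. (Elem x i ~ 'True) => Checked i (Remove x i) ()
    release = Checked (P.pure ())

    -- Well-typed: every borrow is released, so the final index is empty.
    ok :: Checked '[] '[] ()
    ok = do
      borrow @"file"
      borrow @"socket"
      release @"socket"
      release @"file"

    -- Rejected at compile time: releasing a name that was never borrowed.
    -- bad :: Checked '[] '[] ()
    -- bad = release @"file"

    main :: IO ()
    main = runChecked ok

Obviously a real borrow checker needs far more (lifetimes, aliasing rules, mutability), but the type-level machinery for "remove a definition from the environment" was available in roughly this shape back then.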
So here we are: everything in Haskell's strong type system world that would make Rust better was there at the very beginning of the Rust journey, but had no impact on Rust.
Rhyme that with LLM.
My original point was about history and about how we can extract possible outcomes from it.
My other comment tries to amplify that too. Type systems have been strong enough for several decades now and had everything Rust needed, and more, years before Rust began, yet they have had little penetration into the real world, an example being that fancy-dandy Rust language.
> By early 2030, the robot economy has filled up the old SEZs, the new SEZs, and large parts of the ocean. The only place left to go is the human-controlled areas. [...]
> The new decade dawns with Consensus-1’s robot servitors spreading throughout the solar system. By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. The surface of the Earth has been reshaped into Agent-4’s version of utopia: datacenters, laboratories, particle colliders, and many other wondrous constructions doing enormously successful and impressive research.
This scenario prediction, which is co-authored by a former OpenAI researcher (now at the AI Futures Project), received almost a thousand upvotes here on HN and the attention of the NYT and other large media outlets.
If you read that and still don't believe the AI hype is _extreme_ then I really don't know what else to tell you.
--
(https://github.com/albumentations-team/Albumentations)
15k stars, 5 million monthly downloads
----
It may happen that Cursor in agentic mode writes code more slowly than I do. But!
It frees me from being in the IDE 100% of the time.
There is an infinite list of educational videos, blog posts, scientific papers, Hacker News, Twitter, and Reddit that I want to read, and going through them while agents do their job is ultra convenient.
=> If I think about "productivity" in a broader way => with Cursor + agents, my overall productivity moved to a whole other level.
- Test subjects consistently report that keyboarding is faster than mousing.
- The stopwatch consistently proves mousing is faster than keyboarding.
So far, Fred Brooks's “No Silver Bullet” remains undefeated.