zlacker

Measuring the impact of AI on experienced open-source developer productivity

submitted by dheera+(OP) on 2025-07-10 16:29:18 | 775 points 485 comments
[view article] [source] [go to bottom]

18. 30minA+w8[view] [source] 2025-07-10 17:24:44
>>dheera+(OP)
This study focused on experienced OSS maintainers. Here is my personal experience, from a very different persona (nearly the opposite of the one in the study). I always wanted to contribute to OSS but never had the time. I finally was able to, thanks to AI. Last month I contributed to 4 different repositories, which I would never have dreamed of doing. I was using an async coding agent I built[1] to generate PRs given a GitHub issue. Some PRs took a lot of back and forth, and some were accepted as is. Without AI, there is no way I would have contributed to those repositories.

One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PRs landed.

There are also a few PRs that never got accepted because the repro was not as strong or clear.
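The failing-repro pattern described above can be sketched as a test that fails on the pre-fix behavior and passes once the fix lands. Everything here (the `parse_version` function, the issue number) is hypothetical, just to illustrate the shape of such a PR:

```python
# Hypothetical repro test: fails before the fix, passes after.
# parse_version and the issue number are made-up names for illustration.

def parse_version(s):
    # Post-fix behavior: pad short versions like "1.2" out to three parts.
    parts = s.split(".")
    while len(parts) < 3:
        parts.append("0")
    return tuple(int(p) for p in parts)

def test_repro_issue_1234():
    # Pre-fix code crashed on two-part versions; this pins the fix down.
    assert parse_version("1.2") == (1, 2, 0)
    assert parse_version("1.2.3") == (1, 2, 3)

test_repro_issue_1234()
```

Including the failing run ("before") and the passing run ("after") in the PR description gives maintainers immediate evidence the change does what it claims.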

[1] https://workback.ai

24. narush+g9[view] [source] 2025-07-10 17:28:37
>>dheera+(OP)
Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!

If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.

[1] https://x.com/METR_Evals/status/1943360399220388093

◧◩
25. iLoveO+A9[view] [source] [discussion] 2025-07-10 17:30:18
>>NewsaH+37
https://metr.org/about -- seems like they get paid by AI companies, and they also get government funding.
33. simonw+Oa[view] [source] 2025-07-10 17:36:21
>>dheera+(OP)
Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
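The per-issue randomization described above can be sketched in a few lines. This is a hypothetical illustration of the design, not the study's actual code:

```python
import random

def assign_conditions(issues, seed=42):
    """Randomly assign each issue to an AI-allowed or AI-disallowed
    condition -- one independent flip per issue (sketch of the design)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return {issue: rng.choice(["AI-allowed", "AI-disallowed"])
            for issue in issues}

# ~15 issues per developer, each independently randomized,
# so every developer works under a mix of both conditions.
issues = [f"issue-{i}" for i in range(15)]
conditions = assign_conditions(issues)
```

Randomizing at the issue level (rather than per developer) is what lets the study compare conditions within the same person.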

A quarter of the participants saw increased performance; three quarters saw reduced performance.

One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:

> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve for AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

◧◩
34. narush+Pa[view] [source] [discussion] 2025-07-10 17:36:33
>>NewsaH+37
Our largest funding was through The Audacious Project -- you can see an announcement here: https://metr.org/blog/2024-10-09-new-support-through-the-aud...

Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate

◧◩◪
42. iLoveO+zc[view] [source] [discussion] 2025-07-10 17:45:31
>>narush+Pa
This is really disingenuous when you also say that OpenAI and Anthropic have provided you with access and compute credits (on https://metr.org/about).

Not all payment is cash. Compute credits are still, by all means, compensation.

◧◩
62. uludag+rf[view] [source] [discussion] 2025-07-10 17:59:13
>>nestor+3c
In this case though you still wouldn't necessarily know if the AI tools had a positive causal effect. For example, I practically live in Emacs. Take that away and no doubt I would be immensely less effective. That Emacs improves my productivity and without it I am much worse in no way implies that Emacs is better than the alternatives.

I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing changes. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, one to freely use AI tools, and one prohibited from AI tools. Then teach these developers open source (like a course off of this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then in the end, track a number of metrics such as leadership position in community, coding/non-coding contributions, emotional connection to project, social connections made with community, knowledge of code base, etc.

Personally, my prior probability is that the no-ai group would likely still be ahead overall.

◧◩
78. narush+vh[view] [source] [discussion] 2025-07-10 18:10:46
>>kokane+T3
Qualitatively, we don't see a drop in PR quality between the AI-allowed and AI-disallowed conditions in the study; the devs who participated are generally excellent, know their repositories' standards super well, and aren't really into the 'put up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.

Developers totally spend time differently, though; this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not -- in general, when these devs have AI, they spend a smaller % of time writing code and a larger % of time working with AI (which... makes sense).

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

◧◩
81. narush+ui[view] [source] [discussion] 2025-07-10 18:16:08
>>MYEUHD+R8
Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.

There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros too!

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

◧◩
90. narush+Hj[view] [source] [discussion] 2025-07-10 18:24:43
>>IshKeb+h6
> which feels like it is easier and hence faster.

We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect; some developers seem to think so, and others don't!

> like the developers deliberately picked "easy" tasks that they already knew how to do

We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

◧◩◪
99. narush+Yk[view] [source] [discussion] 2025-07-10 18:33:26
>>igorkr+3i
Yep, sorry, meant to post this somewhere but forgot in final-paper-polishing-sprint yesterday!

We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).

Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.

◧◩◪
114. narush+so[view] [source] [discussion] 2025-07-10 18:57:08
>>antonv+Pm
The instructions given to developers were not just "implement with AI" - rather, they could use AI if they deemed it would be helpful, but indeed did _not need to use AI if they didn't think it would be helpful_. In about 16% of labeled screen recordings where developers were allowed to use AI, they chose to use no AI at all!

That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

◧◩◪◨
131. narush+6r[view] [source] [discussion] 2025-07-10 19:12:02
>>gojomo+hq
You can see this analysis in the factor analysis of "Below-average use of AI tools" (C.2.7) in the paper [1], which we mark as an unclear effect.

TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

133. LegNea+Br[view] [source] 2025-07-10 19:15:06
>>dheera+(OP)
For certain tasks it can speed me up 30x compared to an expert in the space: https://rust-gpu.github.io/blog/2025/06/24/vulkan-shader-por...
◧◩◪◨⬒
140. narush+Es[view] [source] [discussion] 2025-07-10 19:21:34
>>IshKeb+p6
You can see a list of repositories with participating developers in the appendix! Section G.7.

Paper is here: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

◧◩◪◨⬒⬓
166. Terr_+Lw[view] [source] [discussion] 2025-07-10 19:54:22
>>Dilett+lv
I'm not sure, but something involving Kahneman et al. seems very plausible: The relevant term is probably "Attribute Substitution."

https://en.wikipedia.org/wiki/Attribute_substitution

◧◩◪
188. narush+pA[view] [source] [discussion] 2025-07-10 20:14:08
>>JackC+Fz
We attempted to! We explore this more in the section Trading speed for ease (C.2.5) in the paper (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).

TLDR: mixed evidence that developers make it less effortful, from quantitative and qualitative reports. Unclear effect.

◧◩◪◨⬒
191. nallla+vA[view] [source] [discussion] 2025-07-10 20:14:43
>>blibbl+Kv
If you interact with internet comments and discussions as an amorphous blob of people, you'll see a constant trickle of the view that models now are useful and before were useless.

If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person (like steveklabnik above) keeps flip-flopping on their view.

This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...

◧◩◪
228. EarthL+GG[view] [source] [discussion] 2025-07-10 20:48:15
>>evanel+zF
> The LLMentalist Effect: how chat-based Large Language Models replicate the mechanisms of a psychic’s con

https://softwarecrisis.dev/letters/llmentalist/

Plus there's a gambling mechanic: Push the button, sometimes get things for free.

235. thesz+SH[view] [source] 2025-07-10 20:54:29
>>dheera+(OP)
What is interesting here is that all predictions were positive, but results are negative.

This shows that everyone in the study (economic experts, ML experts, and even the developers themselves, even after gaining experience) is a novice if we look at them from the perspective of the Dunning-Kruger effect [1].

[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."

245. AIorNo+jK[view] [source] 2025-07-10 21:09:33
>>dheera+(OP)
Hey guys, why are we making it so complicated? Do we really need a paper and study?

anyway - AI, as the tech currently stands, is a new skill to use and takes us humans time to learn, but once we do it well, it becomes a force multiplier

ie see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...

◧◩◪◨⬒⬓⬔⧯▣▦
313. mh-+Xc1[view] [source] [discussion] 2025-07-11 01:02:58
>>stevek+nS
Worth noting for the folks asking: there's an official Claude Code extension for VS Code now [0]. I haven't tried it personally, but that's mostly because I mainly use the terminal and vim.

[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...

◧◩◪
350. Maxiou+Gq1[view] [source] [discussion] 2025-07-11 03:58:51
>>ivanov+861
>It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.

Apple's Response to iPhone 4 Antenna Problem: You're Holding It Wrong https://www.wired.com/2010/06/iphone-4-holding-it-wrong/

◧◩
360. narush+Qv1[view] [source] [discussion] 2025-07-11 05:10:35
>>solid_+i71
Check out section AI increasing issue scope (C.2.3) in the paper -- https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

We speak (the best we can) to changes in amount of code -- I'll note that this metric is quite messy and hard to reason about!

◧◩◪◨
363. wiethe+Bw1[view] [source] [discussion] 2025-07-11 05:21:40
>>Maxiou+Gq1
I don't see how Antennagate can be qualified as "acceptable" since it caused a big public uproar and Apple had to settle a class action lawsuit.

https://www.businessinsider.com/apple-antennagate-scandal-ti...

◧◩◪◨
369. davely+lA1[view] [source] [discussion] 2025-07-11 06:13:13
>>Maxiou+Gq1
Mobile phone manufacturers were telling users this long before the iPhone was ever invented.

e.g., Nokia 1600 user guide from 2005 (page 16) [0]

[0] https://www.instructionsmanuals.com/sites/default/files/2019...

◧◩◪◨⬒⬓
381. carsch+AF1[view] [source] [discussion] 2025-07-11 07:11:47
>>TeMPOr+gD1
It's called grassroots marketing. It works particularly well in the context of GenAI because it is fed with esoteric and ideological fragments that overlap with common beliefs and political trends. https://en.wikipedia.org/wiki/TESCREAL

Therefore, classical marketing is less dominant, although more present among downstream sellers.

◧◩◪◨
383. thesz+3I1[view] [source] [discussion] 2025-07-11 07:38:57
>>leshow+UO
That probabilistic output has to be symbolically constrained - SQL/JSON/other code is generated through syntax-constrained beam search.
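A minimal sketch of what "syntax-constrained" generation means: at each decoding step, tokens the grammar forbids are masked out before the decoder picks one. The toy grammar, tokens, and scores below are all made up for illustration; real constrained-decoding systems are far more elaborate:

```python
import math

# Toy "grammar" for a JSON-ish fragment: { "key" : value }
# Maps the current state to the tokens that are legal next.
LEGAL_NEXT = {
    "START": ["{"],
    "{": ['"key"'],
    '"key"': [":"],
    ":": ["1", '"val"'],
    "1": ["}"],
    '"val"': ["}"],
}

def constrained_greedy(model_scores, max_steps=5):
    """Greedy decode, but only over tokens the grammar allows.
    model_scores(state) returns {token: logprob} (a fake model here)."""
    state, out = "START", []
    for _ in range(max_steps):
        legal = LEGAL_NEXT.get(state, [])
        if not legal:
            break
        scores = model_scores(state)
        # Pick the highest-scoring *legal* token; illegal ones are masked out.
        tok = max(legal, key=lambda t: scores.get(t, -math.inf))
        out.append(tok)
        state = tok
    return " ".join(out)

def fake_model(state):
    # Note "oops" has the highest score: an unconstrained greedy decoder
    # would emit it, but the grammar mask never lets it through.
    return {"{": 0.0, '"key"': -0.5, ":": -0.1, "1": -0.2,
            '"val"': -0.9, "}": -0.3, "oops": 5.0}

print(constrained_greedy(fake_model))  # → { "key" : 1 }
```

Beam search adds a width dimension (keeping the k best legal prefixes instead of one), but the masking idea is the same.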

You brought up Rust, it is fascinating.

Rust's type system differs from typical Hindley-Milner by having operations that can remove definitions from the environment of the scope.

Rust was conceived in 2006.

In 2006 there already were HList papers by Oleg Kiselyov [1] that had shown how to keep type-level key-value lists with addition, removal, and lookup, and type-level stateful operations like in [2] were already possible, albeit, most probably, not with nice monadic syntax support.

  [1] https://okmij.org/ftp/Haskell/HList-ext.pdf
  [2] http://blog.sigfpe.com/2009/02/beyond-monads.html
It was entirely possible to have a prototype Rust embedded into Haskell, with the borrow checker implemented as type-level manipulation over a doubly parameterized state monad.

But it was not; Rust was not embedded into Haskell, and now it will never get effects (even ones as weak as monad transformers) and, as a consequence, will never get proper high-performance software transactional memory.

So here we are: everything in Haskell's strong type system world that would make Rust better was there at the very beginning of the Rust journey, but had no impact on Rust.

Rhyme that with LLM.

◧◩◪◨
386. thesz+RI1[view] [source] [discussion] 2025-07-11 07:47:43
>>ruszki+3M
Contrarily, I believe that a strong type system is a plus. Please look at my other comment: >>44529347

My original point was about history and about how can we extract possible outcome from it.

My other comment tries to amplify that too. Type systems have been strong enough for several decades now, and had everything Rust needed and more, years before Rust began, yet they have little penetration into the real world, the example being that fancy-dandy Rust language.

◧◩◪◨⬒⬓⬔
410. pera+h02[view] [source] [discussion] 2025-07-11 10:20:14
>>intend+EK1
It really is, for example here is a quote from AI 2027:

> By early 2030, the robot economy has filled up the old SEZs, the new SEZs, and large parts of the ocean. The only place left to go is the human-controlled areas. [...]

> The new decade dawns with Consensus-1’s robot servitors spreading throughout the solar system. By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. The surface of the Earth has been reshaped into Agent-4’s version of utopia: datacenters, laboratories, particle colliders, and many other wondrous constructions doing enormously successful and impressive research.

This scenario prediction, co-authored by a former OpenAI researcher (now at the Future of Humanity Institute), received almost a thousand upvotes here on HN and attention from the NYT and other large media outlets.

If you read that and still don't believe the AI hype is _extreme_ then I really don't know what else to tell you.

--

>>43571851

458. ternau+405[view] [source] 2025-07-12 13:40:47
>>dheera+(OP)
I guess I am an experienced open-source developer

(https://github.com/albumentations-team/Albumentations)

15k stars, 5 million monthly downloads

----

It may happen that Cursor in agentic mode writes code slower than I do. But!

It frees me from being in the IDE 100% of the time.

There is an infinite list of educational videos, blog posts, scientific papers, Hacker News, Twitter, and Reddit threads that I want to read, and going through them while agents do their job is ultra convenient.

=> If I think about "productivity" in a broader way => with Cursor + agents, my overall productivity has moved to a whole other level.

◧◩◪
475. robenk+oWa[view] [source] [discussion] 2025-07-14 20:23:23
>>fiddle+wu1
This same conclusion has been observed practically verbatim in the keyboard vs. mouse debate: https://www.asktog.com/TOI/toi06KeyboardVMouse1.html From the post:

- Test subjects consistently report that keyboarding is faster than mousing.

- The stopwatch consistently proves mousing is faster than keyboarding.

◧◩◪◨
476. fiddle+Ebb[view] [source] [discussion] 2025-07-14 21:56:09
>>robenk+oWa
Yeah, that was one of the pieces of research I was thinking about. Another is this review of the evidence, such as it is, for static types: https://danluu.com/empirical-pl/

So far, Fred Brook’s “No Silver Bullet” remains undefeated.

482. kgryte+5qj[view] [source] 2025-07-17 19:50:55
>>dheera+(OP)
Here's a reflection from one of the study's participants, which covers some of the common criticisms and where he sees opportunities for further study.

https://blog.stdlib.io/reflection-on-the-metr-study-2025/

483. GregDa+Ezj[view] [source] 2025-07-17 20:47:19
>>dheera+(OP)
AI used to refer to the extensive range of techniques of the field of Artificial Intelligence. Now it refers to LLMs and maybe other multi-layer networks trained on vast datasets. LLMs are great for some tasks and are also great as parts of hybrid systems like the IBM Watson Jeopardy system. There's much more to Artificial Intelligence, e.g. https://en.m.wikipedia.org/wiki/Knowledge_representation_and... et al.