2025: The Year in LLMs

>>simonw+(OP)
All these improvement in a single year, 2025. While this may seem obvious to those who follows along the AI / LLM news. It may be worth pointing out again ChatGPT was introduced to us in November 2022.

I still dont believe AGI, ASI or Whatever AI will take over human in short period of time say 10 - 20 years. But it is hard to argue against the value of current AI, which many of the vocal critics on HN seems to have the opinion of. People are willing to pay $200 per month, and it is getting $1B dollar runway already.

Being more of a Hardware person, the most interesting part to me is the funding of all the developments of latest hardware. I know this is another topic HN hate because of the DRAM and NAND pricing issue. But it is exciting to see this from a long term view where the pricing are short term pain. Right now the industry is asking, we have together over a trillion dollar to spend on Capex over the next few years and will even borrow more if it needs to be, when can you ship us 16A / 14A / 10A and 8A or 5A, LPDDR6, Higher Capacity DRAM at lower power usage, better packaging, higher speed PCIe or a jump to optical interconnect? Every single part of the hardware stack are being fused with money and demand. The last time we have this was Post-PC / Smartphone era which drove the hardware industry forward for 10 - 15 years. The current AI can at least push hardware for another 5 - 6 years while pulling forward tech that was initially 8 - 10 years away.

I so wished I brought some Nvidia stock. Again, I guess no one knew AI would be as big as it is today, and it is only just started.

>>ksec+RD
This is not a great argument:

> But it is hard to argue against the value of current AI [...] it is getting $1B dollar runway already.

The psychic services industry makes over $2 billion a year in the US [1], with about a quarter of the population being actual believers. [2].

[1] The https://www.ibisworld.com/united-states/industry/psychic-ser...

[2] https://news.gallup.com/poll/692738/paranormal-phenomena-met...

>>wpietr+T61
2022/2023: "It hallucinates, it's a toy, it's useless."

2024/2025: "Okay, it works, but it produces security vulnerabilities and makes junior devs lazy."

2026 (Current): "It is literally the same thing as a psychic scam."

Can we at least make predictions for 2027? What shall the cope be then! Lemme go ask my psychic.

>>ctoth+mA1
2022/2023: "Next year software engineering is dead"

2024: "Now this time for real, software engineering is dead in 6 months, AI CEO said so"

2025: "I know a guy who knows a guy who built a startup with an LLM in 3 hours, software engineering is dead next year!"

What will be the cope for you this year?

>>bopbop+KD1
The cope + disappointment will be knowing that a large population of HN users will paint a weird alternative reality. There are a multitude of messages about AI that are out there, some are highly detached from reality (on the optimistic and pessimistic side). And then there is the rational middle, professionals who see the obvious value of coding agents in their workflow and use them extensively (or figure out how to best leverage them to get the most mileage). I don't see software engineering being "dead" ever, but the nature of the job _has already changed_ and will continue to change. Look at Sonnet 3.5 -> 3.7 -> 4.5 -> Opus 4.5; that was 17 months of development and the leaps in performance are quite impressive. You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks) and also push towards the next paradigm to solve things like continual learning. Some folks have opted not to use coding agents (and some folks like yourself seem to revel in strawmanning people who point out their demonstrable usefulness). Not using coding agents in Jan 2026 is defensible. It won't be defensible for long.

>>aspenm+n42
Please do provide some data for this "obvious value of coding agents". Because right now the only thing obvious is the increase in vulnerabilities, people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

>>bopbop+F52
Sure: at my MAANG company, where I watch the data closely on adoption of CC and other internal coding agent tools, most (significant) LOC are written by agents, and most employees have adopted coding agents as WAU, and the adoption rate is positively correlated with seniority.

Like a lot of things LLM related (Simon Willison's pelican test, researchers + product leaders implementing AI features) I also heavily "vibe" check the capabilities myself on real work tasks. The fact of the matter is I am able to dramatically speed up my work. It may be actually writing production code + helping me review it, or it may be tasks like: write me a script to diagnose this bug I have, or build me a streamlit dashboard to analyze + visualize this ad hoc data instead of me taking 1 hour to make visualizations + munge data in a notebook.

> people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

what would satisfy you here? I feel you are strawmanning a bit by picking the most hyperbolic statements and then blanketing that on everyone else.

My workflow is now:

- Write code exclusively with Claude

- Review the code myself + use Claude as a sort of review assistant to help me understand decisions about parts of the code I'm confused about

- Provide feedback to Claude to change / steer it away or towards approaches

- Give up when Claude is hopelessly lost

It takes a bit to get the hang of the right balance but in my personal experience (which I doubt you will take seriously but nevertheless): it is quite the game changer and that's coming from someone who would have laughed at the idea of a $200 coding agent subscription 1 year ago

>>aspenm+z72
Anecdotes don’t prove anything, ones without any metrics, and especially at MAANG where AI use is strongly incentivized.

Evidence is peer reviewed research, or at least something with metrics. Like the METR study that shows that experienced engineers often got slower on real tasks with AI tools, even though they thought they were faster.

>>bopbop+u92
That's why I gave you data! METR study was 16 people using Sonnet 3.5/3.7. Data I'm talking about is 10s of thousands of people and is much more up to date.

Some counter examples to METR that are in the literature but I'll just say: "rigor" here is very difficult (including METR) because outcomes are high dimensional and nuanced, or ecological validity is an issue. It's hard to have any approach that someone wouldn't be able to dismiss due to some issue they have with the methodology. The sources below also have methodological problems just like METR

https://arxiv.org/pdf/2302.06590 -- 55% faster implementing HTTP server in javascript with copilot (in 2023!) but this is a single task and not really representative.

https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)

To me, internal company measures for large tech companies will be most reliable -- they are easiest to track and measure, the scale is large enough, and the talent + task pool is diverse (junior -> senior, different product areas, different types of tasks). But then outcome measures are always a problem...commits per developer per month? LOC? task completion time? all of them are highly problematic, especially because its reasonable to expect AI tools would change the bias and variance of the proxy so its never clear if you're measuring the change in "style" or the change in the underlying latent measure of productivity you care about

>>aspenm+7m2
To be fair, I’ll take a non-biased 16 person study over “internal measures” from a MAANG company that burned 100s of billions on AI with no ROI that is now forcing its employees to use AI.

>>bopbop+2q2
I could have guessed you would say that :) but METR is not an unbiased study either. Maybe you mean that METR is less likely to intentionally inflate their numbers?

If you insist or believe in a conspiracy I don’t think there’s really anything I or others will be able to say or show you that would assuage you, all I can say is I’ve seen the raw data. It’s a mess and again we’re stuck with proxies (which are bad since you start conflating the change in the proxy-latent relationship with the treatment effect). And it’s also hard and arguably irresponsible to run RCTs.

All I will say is: there are flaws everywhere. METR results are far from conclusive. Totally understandable if there is a mismatch between perception and performance. But also consider: even if task takes the same or even slightly more time, one big advantage for me is that it substantially reduces cognitive load so I can work in parallel sessions on two completely different issues.

>>aspenm+lC2
I bet it does reduce your cognitive load, considering you, in your own words "Give up when Claude is hopelessly lost". No better way to reduce cognitive load.

>>bopbop+bF2
I give up using Claude when it gets hopelessly lost, and then my cognitive load increases.

zlacker