zlacker

2025: The Year in LLMs

submitted by simonw+(OP) on 2025-12-31 23:54:46 | 940 points 511 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only. [show all posts]
4. sanrea+T8[view] [source] 2026-01-01 01:11:19
>>simonw+(OP)
> Vendor-independent options include GitHub Copilot CLI, Amp, OpenHands CLI, and Pi

...and the best of them all, OpenCode[1] :)

[1]: https://opencode.ai

◧◩
9. simonw+6a[view] [source] [discussion] 2026-01-01 01:19:38
>>npalli+M9
Given how badly my 2025 predictions aged I'm probably going to sit that one out! https://simonwillison.net/2025/Jan/10/ai-predictions/
11. aussie+Oa[view] [source] 2026-01-01 01:25:37
>>simonw+(OP)
> The year of YOLO and the Normalization of Deviance #

On this point (including AI agents deleting home folders): I was able to run agents in Firejail by isolating VS Code (most of my agents are VS Code based ones, like Kilo Code).

I wrote a little guide on how I did it https://softwareengineeringstandard.com/2025/12/15/ai-agents...

It took a bit of tweaking (VS Code crashed a bunch of times because it couldn't read its config files), but I got there in the end. Now it can only write to my projects folder, and all of my projects are backed up in git.
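
For flavor, here's a minimal sketch of the kind of wrapper this boils down to (Python standing in for my actual setup; the paths are examples, and the full details are in the guide above):

    #!/usr/bin/env python3
    """Minimal sketch: launch VS Code inside a Firejail sandbox so agents
    can only write to the projects folder. Paths are examples."""
    import os
    import subprocess

    HOME = os.path.expanduser("~")
    PROJECTS = os.path.join(HOME, "projects")              # the only writable area
    VSCODE_CONFIG = os.path.join(HOME, ".config", "Code")  # avoids the config-read crashes

    subprocess.run([
        "firejail",
        f"--whitelist={PROJECTS}",       # whitelisting hides the rest of $HOME
        f"--whitelist={VSCODE_CONFIG}",
        "code", "--wait",
    ], check=True)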

◧◩
18. measur+Dc[view] [source] [discussion] 2026-01-01 01:43:02
>>sho_hn+cc
The people working on this stuff have convinced themselves they're on a religious quest, so it's not going to get better: https://x.com/RobertFreundLaw/status/2006111090539687956
◧◩◪◨
27. quaint+jf[view] [source] [discussion] 2026-01-01 02:12:55
>>OGEnth+2b
They were, just not as many. https://www.wired.com/story/the-worlds-biggest-bitcoin-mine-...
◧◩
35. simonw+3h[view] [source] [discussion] 2026-01-01 02:30:17
>>techpr+gg
"Nothing about the severe impact on the environment"

I literally said:

"AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable."

AND I linked to my coverage from last year, which is still true today (hence I felt no need to update it): https://simonwillison.net/2024/Dec/31/llms-in-2024/#the-envi...

◧◩
61. zvolsk+Hm[view] [source] [discussion] 2026-01-01 03:31:24
>>didip+Th
The idea of HN being dismissive of impactful technology is as old as HN. And indeed, with hindsight the crowd often appears stuck in the past. That said, HN discussions aren't homogeneous, and as demonstrated by Karpathy in his recent blogpost "Auto-grading decade-old Hacker News", at least some commenters have impressive foresight: https://karpathy.bearblog.dev/auto-grade-hn/
◧◩◪
79. simonw+pp[view] [source] [discussion] 2026-01-01 04:10:22
>>d4rkp4+ep
Apparently it does work with Claude Max: https://opencode.ai/docs/providers/#anthropic

I don't see a similar option for ChatGPT Pro. Here's a closed issue: https://github.com/sst/opencode/issues/704

◧◩◪
86. passwo+ar[view] [source] [discussion] 2026-01-01 04:36:02
>>tkgall+ze
Don't forget you can pay Simon to keep up with less!

> At the end of every month I send out a much shorter newsletter to anyone who sponsors me for $10 or more on GitHub

https://simonwillison.net/about/#monthly

◧◩
97. crysta+lt[view] [source] [discussion] 2026-01-01 05:04:11
>>waldre+T7
That must have been a long time back. Having lived through the time when web pages were served through CGI and mobile phones only existed in movies, when SVMs were the new hotness in ML and people would write about how weird NNs were, I feel like I've seen a lot more concrete progress in the last few decades than this year.

This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past. They're cool, but they were way cooler 4 years ago. We've taken big ideas like "agents" and "reinforcement learning" and basically stripped them of all meaning in order to claim progress.

I mean, do you remember Geoffrey Hinton's RBM talk at Google in 2010? [0] That was absolutely insane for anyone keeping up with that field. By the mid-2010s RBMs were already outdated. I remember when everyone was implementing flavors of RNNs and LSTMs. Karpathy's 2015 character-level RNN project was insane [1].

This comment makes me wonder if part of the hype around LLMs is just that a lot of software people simply weren't paying attention to the absolutely mind-blowing progress we've seen in this field for the last 20 years. But even ignoring ML, the worlds of web development and mobile application development have gone through incredible progress over the last decade and a half. I remember a time when JavaScript books would have a section warning that you should never use JS for anything critical to the application. Then there's the work in theorem provers over the last decade... If you remember when syntactic sugar was progress, either you remember way further back than I do, or you weren't paying attention to what was happening in the larger computing world.

0. https://www.youtube.com/watch?v=VdIURAu1-aU

1. https://karpathy.github.io/2015/05/21/rnn-effectiveness/

◧◩◪◨
117. jjude+Lw[view] [source] [discussion] 2026-01-01 05:52:26
>>zahlma+Cj
I use predictions to prepare rather than to plan.

Planning depends on a deterministic view of the future. I used to plan (especially annual plans) until about five years ago. Now I scan for trends and prepare myself for the different scenarios that could arrive. Even if you get it only approximately right, you stand apart.

For tech trends, I read Simon, Benedict Evans, Mary Meeker, etc. Simon is in a better position to make these predictions than anyone else, having closely analyzed these trends over the last few years.

Here I wrote about my approach: https://www.jjude.com/shape-the-future/

123. compas+Mx[view] [source] 2026-01-01 06:08:08
>>simonw+(OP)
>I’m still holding hope that slop won’t end up as bad a problem as many people fear.

That's the pure, uncut copium. Meanwhile, in the real world, search on major platforms is so slanted towards slop that people need to specify that they want actual human music:

https://old.reddit.com/r/MusicRecommendations/comments/1pq4f...

◧◩◪◨
128. willia+yy[view] [source] [discussion] 2026-01-01 06:21:15
>>simonw+pp
There's a plugin that evidently supports ChatGPT Pro with Opencode: https://github.com/sst/opencode/issues/1686#issuecomment-349...
◧◩
136. simonw+rA[view] [source] [discussion] 2026-01-01 06:50:10
>>lopati+gA
See https://simonwillison.net/2025/nov/13/training-for-pelicans-... (also in the pelicans section of the post).
138. Razeng+uA[view] [source] 2026-01-01 06:50:25
>>simonw+(OP)
My experience with AI so far: It's still far from "butler" level assistance for anything beyond simple tasks.

I posted about my failures trying to get them to review my bank statements [0] and generally got gaslit about how I was doing it wrong, and that if I trusted them enough to give them full access to my disk and terminal, they could do it better.

But I mean, at that point, it's still more "manual intelligence" than just telling someone what I want. A human could easily understand it, but AI still takes a lot of wrangling, and you still need to think from the "AI's PoV" to get good results.

[0] >>46374935

----

But enough whining. I want AI to get better so I can be lazier. After trying them for a while, one feature I think all natural-language AIs need is the ability to mark certain sentences as "do what I say" (aka Monkey's Paw) and "do what I mean", like how you wrap phrases in quotes on Google etc. to indicate a verbatim search.

So for example I could say "[[I was in Japan from the 5th to 10th]], identify foreign currency transactions on my statement with 'POS' etc in the description", and the part in the [[]] (or whatever other marker) would be treated literally, exactly as written, while the rest of the text would be up to the AI's interpretation/inference, so it would also search for ATM withdrawals etc.
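
A minimal sketch of how those markers could be peeled off before the request is sent (the [[...]] syntax is from my example above; everything else here is made up for illustration):

    import re

    LITERAL = re.compile(r"\[\[(.+?)\]\]")

    def split_prompt(text: str) -> dict:
        """Separate verbatim constraints ("do what I say") from free text
        the model may interpret ("do what I mean")."""
        verbatim = LITERAL.findall(text)
        interpret = LITERAL.sub("", text).strip(" ,")
        return {"verbatim": verbatim, "interpret": interpret}

    req = ("[[I was in Japan from the 5th to 10th]], identify foreign currency "
           "transactions on my statement with 'POS' etc in the description")
    print(split_prompt(req))
    # {'verbatim': ['I was in Japan from the 5th to 10th'],
    #  'interpret': "identify foreign currency transactions ..."}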

Ideally, eventually we should be able to have multiple different AI "personas" akin to different members of household staff: your "chef" would know about your dietary preferences, your "maid" would operate your Roomba and take care of your laundry, and your "accountant" would do accountant-y stuff. Each of them would only learn about that specific domain of your life: the chef would pick up the times when you get hungry, but it wouldn't know about your finances, and so on. The current "Projects" paradigm is not quite that yet.
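
A toy sketch of the isolation I mean (all names here are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class Persona:
        """One member of the household staff; memory is scoped to one domain."""
        name: str
        domain: str
        memory: list = field(default_factory=list)

        def learn(self, fact: str) -> None:
            self.memory.append(fact)  # never shared across personas

    staff = {p.domain: p for p in (Persona("chef", "food"),
                                   Persona("accountant", "finance"))}
    staff["food"].learn("gets hungry around 19:00")
    staff["finance"].learn("was in Japan from the 5th to the 10th")
    # The chef's context never includes the finance facts, and vice versa.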

◧◩◪◨⬒⬓
192. fmbb+rI[view] [source] [discussion] 2026-01-01 08:44:03
>>scotty+tD
> How long before introduction of computers lead to increases in average productivity?

I think it never did. Still has not.

https://en.wikipedia.org/wiki/Productivity_paradox

◧◩
234. monkey+KQ[view] [source] [discussion] 2026-01-01 10:25:23
>>rr808+tQ
https://cognition.ai/blog/devin-annual-performance-review-20...
◧◩◪
236. fullst+iR[view] [source] [discussion] 2026-01-01 10:32:01
>>crysta+fn
I wrote an article complaining about the whole hype over a year ago:

https://chrisfrewin.medium.com/why-llms-will-never-be-agi-70...

Seems to be playing out that way.

◧◩
237. fullst+NR[view] [source] [discussion] 2026-01-01 10:37:58
>>lukasl+dy
I just use a couple of custom MCP tools with the standard claude desktop app:

https://chrisfrew.in/blog/two-of-my-favorite-mcp-tools-i-use...

IMO this is the best balance of getting agentic work done while keeping immediate access to anything else you may need in your development process.
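
For anyone curious, a minimal sketch of what a custom tool looks like with the official MCP Python SDK (the server name and tool here are placeholders, not the ones from my post):

    # pip install mcp
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("my-tools")  # placeholder server name

    @mcp.tool()
    def word_count(path: str) -> int:
        """Count the words in a local file the model asks about."""
        with open(path, encoding="utf-8") as f:
            return len(f.read().split())

    if __name__ == "__main__":
        mcp.run()  # stdio transport; point Claude Desktop's config at this script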

◧◩◪◨⬒⬓
245. spectr+tT[view] [source] [discussion] 2026-01-01 10:56:26
>>scotty+tD
The best example is that even ATMs didn't reduce bank teller jobs.

Why? Because the bank teller does more than take deposits and dispense cash.

IMO there is an ontological bias that pervades our modern society that confuses the map for the territory and has a highly distorted view of human existence through the lens of engineering.

We don't see anything in this time series, because this time series itself is meaningless nonsense that reflects exactly this special kind of ontological stupidity:

https://fred.stlouisfed.org/series/PRS85006092

As if the sum of human interaction in an economy is some kind of machine that we just need to engineer better parts for and then sum the outputs.

Any non-careerist, thinking person who studies economics would conclude that we don't, and probably won't, have the tools to properly study this subject in our lifetimes: the high-dimensional interaction of biology, entropy and time. We have nothing. The career economist is essentially forced to sing for their supper in a kind of time-series theater. Then there is the method acting of pretending to be surprised when some meaningless reductionist aspect of human interaction isn't reflected in the fake time series.

◧◩
273. wpietr+T61[view] [source] [discussion] 2026-01-01 13:26:48
>>ksec+RD
This is not a great argument:

> But it is hard to argue against the value of current AI [...] it is getting $1B dollar runway already.

The psychic services industry makes over $2 billion a year in the US [1], with about a quarter of the population being actual believers [2].

[1] https://www.ibisworld.com/united-states/industry/psychic-ser...

[2] https://news.gallup.com/poll/692738/paranormal-phenomena-met...

◧◩
287. simonw+Ge1[view] [source] [discussion] 2026-01-01 14:30:20
>>Gud+IQ
I talked about that in this section https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-... - and touched on it a bit in the section about Chinese AI labs: https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-...
◧◩◪◨⬒⬓⬔⧯
289. windex+if1[view] [source] [discussion] 2026-01-01 14:35:02
>>orders+u91
> 5 years ago a typical argument against AGI was that computers would never be able to think because "real thinking" involved mastery of language which was something clearly beyond what computers would ever be able to do.

Mastery of words is thinking? By that line of argument, computers have been able to think for decades.

Humans don't think only in words. Our context, memory and thoughts are processed and occur in ways we still don't understand.

There's a lot of great information out there describing this [0][1][2]. Continuing to believe these tools are thinking, however, is dangerous. I'd gather the logic is something like: you can't see the process and it's non-deterministic, so it feels like thinking. ELIZA tricked people. LLMs are no different.

[0] https://archive.is/FM4y8
[1] https://www.theverge.com/ai-artificial-intelligence/827820/l...
[2] https://www.raspberrypi.org/blog/secondary-school-maths-show...

◧◩
322. simonw+Py1[view] [source] [discussion] 2026-01-01 16:58:22
>>asgR1t+Qw1
In what way did they get worse?

I made you a dashboard of my 2025 writing about open-source that didn't include AI: https://simonwillison.net/dashboard/posts-with-tags-in-a-yea...

◧◩◪◨
336. ndiddy+qJ1[view] [source] [discussion] 2026-01-01 18:01:03
>>Al-Khw+eS
Well, the "solution" for that will be the GPU vendors focusing solely on B2B sales because it's more profitable, thereby keeping GPUs out of the hands of average consumers. There are leaks suggesting that Nvidia will gradually hike the prices of its 5090 cards from $2000 to $5000 due to RAM price increases ( https://wccftech.com/geforce-rtx-5090-prices-to-soar-to-5000... ). At that point, why even bother with the R&D for newer consumer cards when you know that barely anyone will be able to afford them?
◧◩◪◨⬒⬓⬔⧯▣
341. aoeusn+aS1[view] [source] [discussion] 2026-01-01 18:50:33
>>llmsla+RG
There is already evidence of it! METR time horizons are going up on an exponential trend. This is literally the most famous AI benchmark, and it was already mentioned in this thread.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-...
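
To make "exponential" concrete: the first post's headline figure is a task time horizon doubling roughly every seven months, so the extrapolation fits in a few lines (a sketch; the ~60 minute starting horizon is a rough assumption, not a precise value):

    def horizon_minutes(months_from_now: float,
                        h0: float = 60.0,          # rough current horizon (assumption)
                        doubling: float = 7.0) -> float:  # METR's reported doubling time
        """Extrapolated task time horizon, if the exponential holds."""
        return h0 * 2 ** (months_from_now / doubling)

    for m in (0, 12, 24):
        print(f"{m:>2} months out: ~{horizon_minutes(m):.0f} minutes")
    #  0 months out: ~60 minutes
    # 12 months out: ~197 minutes
    # 24 months out: ~645 minutes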

◧◩◪◨
342. aoeusn+WS1[view] [source] [discussion] 2026-01-01 18:56:12
>>steveB+Fc1
You say that as if Uber's playbook didn't work. Try this: https://www.google.com/finance/quote/UBER:NYSE
◧◩
357. andai+BY1[view] [source] [discussion] 2026-01-01 19:34:14
>>andai+4K
Re: yolo mode

https://markdownpastebin.com/?id=1ef97add6ba9404b900929ee195...

My notes from back when I set this up! Includes instructions for using a GUI file explorer as the agent user, as well as setting up a systemd service to fix the permissions automatically.

(And a nice trick which shows you which GUI apps are running as which user...)

However, most of these are just workarounds for the permission issue I kept running into, which is that Claude Code would for some reason create files with incorrect permissions so that I couldn't read or write those files from my normal account.

If someone knows how to fix that, or if someone at Anthropic is reading, then most of this Rube Goldberg machine becomes unnecessary :)
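
For reference, a minimal sketch of the kind of permission-fixing pass that systemd service runs (paths and modes are examples, not my real layout):

    #!/usr/bin/env python3
    """Sweep the shared projects dir and loosen files the agent user created
    with modes my main account can't read or write. Example paths/modes."""
    import os
    from pathlib import Path

    PROJECTS = Path("/home/agent/projects")  # example path

    for p in PROJECTS.rglob("*"):
        want = 0o775 if p.is_dir() else 0o664  # group read/write
        if (p.stat().st_mode & 0o777) != want:
            os.chmod(p, want)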

◧◩◪◨⬒
361. simonw+I02[view] [source] [discussion] 2026-01-01 19:50:43
>>arctic+RW1
The summer slump was a thing in 2023 but apparently didn't repeat in 2024: https://www.similarweb.com/blog/insights/ai-news/chatgpt-bea...

The weekend slumps could equally suggest people are using it at work.

◧◩◪◨⬒⬓⬔⧯▣
393. aspenm+7m2[view] [source] [discussion] 2026-01-01 22:16:11
>>bopbop+u92
That's why I gave you data! The METR study was 16 people using Sonnet 3.5/3.7. The data I'm talking about covers tens of thousands of people and is much more up to date.

There are counterexamples to METR in the literature, but I'll just say: "rigor" here is very difficult (including for METR), because outcomes are high-dimensional and nuanced, or ecological validity is an issue. It's hard to find any approach that someone wouldn't be able to dismiss due to some issue they have with the methodology. The sources below also have methodological problems, just like METR.

https://arxiv.org/pdf/2302.06590 -- 55% faster implementing an HTTP server in JavaScript with Copilot (in 2023!), but this is a single task and not really representative.

https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)
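
To put that standard error in perspective, the normal-approximation 95% confidence interval is one line of arithmetic:

    est, se = 26.08, 10.3                      # point estimate and SE from the paper
    lo, hi = est - 1.96 * se, est + 1.96 * se  # normal-approximation 95% CI
    print(f"95% CI: {lo:.1f}% to {hi:.1f}%")   # 5.9% to 46.3% -- still very wide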

To me, internal company measures at large tech companies will be most reliable: they are easiest to track and measure, the scale is large enough, and the talent + task pool is diverse (junior to senior, different product areas, different types of tasks). But then outcome measures are always a problem... commits per developer per month? LOC? Task completion time? All of them are highly problematic, especially because it's reasonable to expect AI tools to change the bias and variance of the proxy, so it's never clear whether you're measuring a change in "style" or a change in the underlying latent measure of productivity you care about.

◧◩◪◨⬒⬓⬔⧯▣▦▧▨
396. windex+Gp2[view] [source] [discussion] 2026-01-01 22:36:00
>>Camper+h12
There are quite a few studies refuting this highly ignorant comment. I'd suggest some reading [0].

From the abstract: "Is thought possible without language? Individuals with global aphasia, who have almost no ability to understand or produce language, provide a powerful opportunity to find out. Astonishingly, despite their near-total loss of language, these individuals are nonetheless able to add and subtract, solve logic problems, think about another person’s thoughts, appreciate music, and successfully navigate their environments. Further, neuroimaging studies show that healthy adults strongly engage the brain’s language areas when they understand a sentence, but not when they perform other nonlinguistic tasks like arithmetic, storing information in working memory, inhibiting prepotent responses, or listening to music. Taken together, these two complementary lines of evidence provide a clear answer to the classic question: many aspects of thought engage distinct brain regions from, and do not depend on, language."

[0] https://pmc.ncbi.nlm.nih.gov/articles/PMC4874898/

◧◩◪
403. martin+qy2[view] [source] [discussion] 2026-01-01 23:38:27
>>cloudk+EF
Totally agree - wrote this over the holidays which sums it all pretty well https://martinalderson.com/posts/why-im-building-my-own-clis...
◧◩◪◨⬒⬓⬔
405. Ianjit+TA2[view] [source] [discussion] 2026-01-01 23:56:35
>>bopbop+F52
The productivity uplift is massive: Meta got 6-12% from AI coding!

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

◧◩◪◨⬒⬓⬔⧯▣▦
406. Ianjit+5B2[view] [source] [discussion] 2026-01-01 23:59:26
>>aspenm+7m2
Meta internal study showed a 6-12% productivity uplift.

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

412. zeroco+NF2[view] [source] 2026-01-02 00:32:50
>>simonw+(OP)
The "local models got good, but cloud models got even better" section nails the current paradox. Simon's observation that coding agents need reliable tool calling that local models can't yet deliver is accurate - but it frames the problem purely as a capability gap.

There's a philosophical angle being missed: do we actually want our coding agents making hundreds of tool calls through someone else's infrastructure? The more capable these systems become, the more intimate access they have to our codebases, credentials, and workflows. Every token of context we send to a frontier model is data we've permanently given up control of.

I've been working on something addressing this directly - LocalGhost.ai (https://www.localghost.ai/manifesto) - hardware designed around the premise that "sovereign AI" isn't just about capability parity but about the principle that your AI should be yours. The manifesto articulates why I think this matters beyond the technical arguments.

Simon mentions his next laptop will have 128GB of RAM, hoping 2026 models close the gap. I'm betting we'll need purpose-built local inference hardware that treats privacy as a first-class constraint, not an afterthought. The YOLO mode section and "normalization of deviance" concerns only strengthen this case - running agents in insecure ways becomes less terrifying when "insecure" means "my local machine" rather than "the cloud plus whoever's listening."

The capability gap will close. The trust gap won't unless we build for it.

◧◩◪◨
413. tim333+EH2[view] [source] [discussion] 2026-01-02 00:46:00
>>rainco+Ln
Yeah, the internet kind of started with ARPANET in 1969 and didn't really get going with the public until around 1999, so thirty years on.

Here's a graph of internet takeoff, with Krugman's famous 1998 prediction that it wouldn't amount to much marking maybe the end of the skepticism: https://www.contextualize.ai/mpereira/paul-krugmans-poor-pre...

In common with AI, there was probably a long period when the hardware wasn't really good enough for it to be useful to most people. I remember 300 baud modems and the rubber acoustic couplers you pressed your telephone handset into back in the 80s.

◧◩◪◨⬒⬓⬔⧯▣▦▧▨◲
418. Camper+bN2[view] [source] [discussion] 2026-01-02 01:34:06
>>windex+Gp2
Yeah, you can prove pretty much anything with a pubmed link. Do dead salmon "think?" fMRI says maybe!

https://pmc.ncbi.nlm.nih.gov/articles/PMC2799957/

The resources that the brain is using to think -- whatever resources those are -- are language-based. Otherwise there would be no way to communicate with the test subjects. "Language" doesn't just imply written and spoken text, as these researchers seem to assume.

◧◩◪◨⬒⬓⬔⧯▣
447. ben_w+tC3[view] [source] [discussion] 2026-01-02 10:53:32
>>Denzel+3T2
To add to your point:

If the M stands for Meta, I would also like to note that as a user I have been seeing increasingly poor UI, of the sort I'd expect when code goes live without being properly checked, i.e. from vibe coding in the original sense of "blindly accept without review". Like, some posts show two copies of the sender's name in the same location on screen, in slightly different fonts, going out of sync with each other.

I can easily believe that the metrics all [MF]AANG bonuses are denominated in are going up; our profession has had jokes about engineers gaming those metrics since back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d

◧◩◪◨⬒⬓⬔
456. aspenm+rS3[view] [source] [discussion] 2026-01-02 13:31:25
>>ben_w+DB3
This is not my own claim; it's based on the following analysis from Epoch: https://epoch.ai/blog/can-ai-scaling-continue-through-2030

But I forgot how old that article is: it looks at 4 orders of magnitude past GPT-4 in terms of total compute, which I think is only 3.5 orders of magnitude from where we are today (based on 4.4x scaling/yr).
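
The back-of-the-envelope arithmetic (the 4.4x/yr scaling rate is my rough figure from above, not Epoch's exact number):

    import math

    oom_per_year = math.log10(4.4)   # 4.4x/yr compute scaling ~= 0.64 OOM/yr
    remaining_oom = 3.5              # headroom left before the 4-OOM-past-GPT-4 mark
    print(remaining_oom / oom_per_year)  # ~5.4 years, if the scaling rate holds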

◧◩◪◨⬒⬓⬔⧯▣▦
467. Camper+vb4[view] [source] [discussion] 2026-01-02 15:32:53
>>habine+BQ3
Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?

https://arxiv.org/abs/2507.15855

It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.

481. tanton+vN4[view] [source] 2026-01-02 18:57:29
>>simonw+(OP)
Claude Opus 4.5 has been a big step up for me personally, and I used to think Sonnet 3.5 was good. It is an amazing deal at $20.

Just yesterday, it helped me parse and understand a research paper, complete with step-by-step examples (this one: https://research.nvidia.com/sites/default/files/pubs/2016-03...). I will now go ahead and implement it myself, possibly delegating some of the more grunt-work type tasks to Claude Code.

Without it, I would have been struggling through the paper for days, wading through WGSL shader code, and there would be a high chance that I'd just give up on it, since this is for a side project and not my $job.

It has been a major force multiplier just for learning things. I have had the $20 subscription for about a year now. I bump it up to the $100 plan if I happen to be working on some project that eats through the $20 allocation. This happens to be one such month. I will probably go back to the $20 plan after this month. I continue to get a lot of value out of it.

◧◩◪◨⬒⬓⬔⧯▣▦▧
498. habine+466[view] [source] [discussion] 2026-01-03 05:29:41
>>Camper+vb4
(Continuing from my other post)

The first thing I checked was "how did they verify the proofs were correct", and the answer was that they got other AI people to check it; those people said there were serious problems with the paper's methodology and that it would not have been a gold medal.

https://x.com/j_dekoninck/status/1947587647616004583

This is why we do not take things at face value.

[go to top]