...and the best of them all, OpenCode[1] :)
[1]: https://opencode.ai
On this topic (including AI agents deleting home folders): I was able to run agents in Firejail by isolating VS Code (most of my agents are VS Code based, like Kilo Code).
I wrote a little guide on how I did it https://softwareengineeringstandard.com/2025/12/15/ai-agents...
It took a bit of tweaking (VS Code crashed a bunch of times because it couldn't read its config files), but I got there in the end. Now it can only write to my projects folder, and all of my projects are backed up in git.
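The core of the sandbox boils down to a couple of Firejail flags; a minimal sketch (not my exact setup, and the paths are examples):

```shell
# Minimal sketch: home is read-only, with write access only for the
# projects folder and VS Code's own config/cache dirs (without these,
# VS Code crashes on startup). Adjust paths to taste.
firejail --noprofile \
  --read-only=~ \
  --read-write=~/projects \
  --read-write=~/.config/Code \
  --read-write=~/.vscode \
  code
```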
I literally said:
"AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable."
AND I linked to my coverage from last year, which is still true today (hence why I felt no need to update it): https://simonwillison.net/2024/Dec/31/llms-in-2024/#the-envi...
I don't see a similar option for ChatGPT Pro. Here's a closed issue: https://github.com/sst/opencode/issues/704
> At the end of every month I send out a much shorter newsletter to anyone who sponsors me for $10 or more on GitHub
This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past. They're cool, but they were way cooler 4 years ago. We've taken big ideas like "agents" and "reinforcement learning" and basically stripped them of all meaning in order to claim progress.
I mean, do you remember Geoffrey Hinton's RBM talk at Google in 2010? [0] That was absolutely insane for anyone keeping up with that field. By the mid-twenty-teens RBMs were already outdated. I remember when everyone was implementing flavors of RNNs and LSTMs. Karpathy's 2015 character-level RNN project was insane [1].
This comment makes me wonder if part of the hype around LLMs is just that a lot of software people simply weren't paying attention to the absolutely mind-blowing progress we've seen in this field for the last 20 years. But even ignoring ML, the worlds of web development and mobile application development have gone through incredible progress over the last decade and a half. I remember a time when JavaScript books would have a section warning that you should never use JS for anything critical to the application. Then there's the work in theorem provers over the last decade... If you remember when syntactic sugar was progress, either you remember way further back than I do, or you weren't paying attention to what was happening in the larger computing world.
Planning depends on a deterministic view of the future. I used to plan (especially annual plans) until about 5 years ago. Now I scan for trends and prepare myself for different scenarios that could come to pass. Even if you get it only approximately right, you stand apart.
For tech trends, I read Simon, Benedict Evans, Mary Meeker etc. Simon is in a better position to make these predictions than anyone else, having closely analyzed these trends over the last few years.
Here I wrote about my approach: https://www.jjude.com/shape-the-future/
That's the pure, uncut copium. Meanwhile, in the real world, search on major platforms is so slanted towards slop that people need to specify that they want actual human music:
https://old.reddit.com/r/MusicRecommendations/comments/1pq4f...
I posted about my failures trying to get them to review my bank statements [0], and generally got gaslit about how I was doing it wrong, and that if I trusted them with full access to my disk and terminal, they could do it better.
But I mean, at that point, it's still more "manual intelligence" than just telling someone what I want. A human could easily understand it, but AI still takes a lot of wrangling, and you still need to think from the "AI's PoV" to get good results.
[0] >>46374935
----
But enough whining. I want AI to get better so I can be lazier. After trying them for a while, one feature I think all natural-language AIs need is the ability to mark certain sentences as "Do what I say" (aka Monkey's Paw) versus "Do what I mean", like the way you wrap phrases in quotes on Google to force a verbatim search.
So for example I could say "[[I was in Japan from the 5th to 10th]], identify foreign currency transactions on my statement with 'POS' etc. in the description". The part in the [[]] (or whatever other marker) would be literal, exactly as written, but the rest of the text would be up to the AI's interpretation/inference, so it would also search for ATM withdrawals etc.
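A toy sketch of how such a marker could be split out before the prompt reaches the model (the [[...]] syntax is made up, nothing standard):

```python
import re

# Matches the hypothetical [[...]] "do what I say" marker.
LITERAL = re.compile(r"\[\[(.*?)\]\]")

def split_prompt(prompt: str):
    """Separate [[...]] verbatim spans from the free text the model
    is allowed to interpret."""
    literal_spans = LITERAL.findall(prompt)
    free_text = LITERAL.sub("", prompt).strip(" ,")
    return literal_spans, free_text

spans, free = split_prompt(
    "[[I was in Japan from the 5th to 10th]], identify foreign "
    "currency transactions on my statement"
)
print(spans)  # ['I was in Japan from the 5th to 10th']
print(free)   # identify foreign currency transactions on my statement
```

The literal spans could then be injected into the final prompt with an instruction like "treat these exactly as written", while the rest stays open to inference.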
Ideally, eventually we should be able to have multiple different AI "personas" akin to different members of household staff: your "chef" would know about your dietary preferences, your "maid" would operate your Roomba, take care of your laundry, your "accountant" would do accounty stuff.. and each of them would only learn about that specific domain of your life: the chef would pick up the times when you get hungry, but it won't know about your finances, and so on. The current "Projects" paradigm is not quite that yet.
I think it never did. Still has not.
https://chrisfrewin.medium.com/why-llms-will-never-be-agi-70...
Seems to be playing out that way.
https://chrisfrew.in/blog/two-of-my-favorite-mcp-tools-i-use...
IMO this is the best balance: getting agentic work done while still having immediate access to anything else you might need in your development process.
Why? Because even the bank teller is doing more than taking and depositing money.
IMO there is an ontological bias that pervades our modern society that mistakes the map for the territory and has a highly distorted view of human existence through the lens of engineering.
We don't see anything in this time series, because this time series itself is meaningless nonsense that reflects exactly this special kind of ontological stupidity:
https://fred.stlouisfed.org/series/PRS85006092
As if the sum of human interaction in an economy is some kind of machine that we just need to engineer better parts for and then sum the outputs.
Any non-careerist, thinking person who studies economics would conclude that we don't, and probably won't in our lifetimes, have the tools to properly study this subject: the high-dimensional interaction of biology, entropy and time. We have nothing. The career economist is essentially forced to sing for their supper in a kind of time-series theater. Then there is the method acting of pretending to be surprised when some meaningless, reductionist aspect of human interaction isn't reflected in the fake time series.
> But it is hard to argue against the value of current AI [...] it is getting $1B dollar runway already.
The psychic services industry makes over $2 billion a year in the US [1], with about a quarter of the population being actual believers [2].
[1] https://www.ibisworld.com/united-states/industry/psychic-ser...
[2] https://news.gallup.com/poll/692738/paranormal-phenomena-met...
Mastery of words is thinking? By that line of argument, computers have been able to think for decades.
Humans don't think only in words. Our context, memory and thoughts are processed and occur in ways we still don't understand.
There's a lot of great information out there describing this [0][1]. Continuing to believe these tools are thinking, however, is dangerous. I'd gather it has something to do with logic: you can't see the process and it's non-deterministic so it feels like thinking. ELIZA tricked people. LLMs are no different.
[0] https://www.theverge.com/ai-artificial-intelligence/827820/l... (archive: https://archive.is/FM4y8)
[1] https://www.raspberrypi.org/blog/secondary-school-maths-show...
I made you a dashboard of my 2025 writing about open-source that didn't include AI: https://simonwillison.net/dashboard/posts-with-tags-in-a-yea...
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-...
https://markdownpastebin.com/?id=1ef97add6ba9404b900929ee195...
My notes from back when I set this up! They include instructions for using a GUI file explorer as the agent user, as well as setting up a systemd service to fix the permissions automatically.
(And a nice trick which shows you which GUI apps are running as which user...)
However, most of these are just workarounds for the permission issue I kept running into, which is that Claude Code would for some reason create files with incorrect permissions so that I couldn't read or write those files from my normal account.
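The least-bad workaround I found was default POSIX ACLs, which keep new files accessible regardless of the creating process's umask; a sketch (the username and paths here are examples, not my real setup):

```shell
# Grant my normal account (example: "simon") access to everything under
# the agent's projects dir, both existing files and, via the default
# ACL, any files the agent creates from now on.
setfacl -R -m u:simon:rwX /home/agent/projects      # existing files
setfacl -R -d -m u:simon:rwX /home/agent/projects   # default for new files
```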
If someone knows how to fix that, or if someone at Anthropic is reading, then most of this Rube Goldberg machine becomes unnecessary :)
The weekend slumps could equally suggest people are using it at work.
There are some counterexamples to METR in the literature, but I'll just say: "rigor" here is very difficult (including for METR), because outcomes are high-dimensional and nuanced, and ecological validity is an issue. It's hard to find any approach that someone couldn't dismiss over some issue they have with the methodology. The sources below have methodological problems too, just like METR:
https://arxiv.org/pdf/2302.06590 -- 55% faster implementing an HTTP server in JavaScript with Copilot (in 2023!), but this is a single task and not really representative.
https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)
To me, internal company measures at large tech companies will be most reliable: they are easiest to track and measure, the scale is large enough, and the talent and task pool is diverse (junior to senior, different product areas, different types of tasks). But outcome measures are always a problem... commits per developer per month? LOC? Task completion time? All of them are highly problematic, especially because it's reasonable to expect AI tools to change the bias and variance of the proxy, so it's never clear whether you're measuring a change in "style" or a change in the underlying latent measure of productivity you care about.
From the abstract: "Is thought possible without language? Individuals with global aphasia, who have almost no ability to understand or produce language, provide a powerful opportunity to find out. Astonishingly, despite their near-total loss of language, these individuals are nonetheless able to add and subtract, solve logic problems, think about another person’s thoughts, appreciate music, and successfully navigate their environments. Further, neuroimaging studies show that healthy adults strongly engage the brain’s language areas when they understand a sentence, but not when they perform other nonlinguistic tasks like arithmetic, storing information in working memory, inhibiting prepotent responses, or listening to music. Taken together, these two complementary lines of evidence provide a clear answer to the classic question: many aspects of thought engage distinct brain regions from, and do not depend on, language."
There's a philosophical angle being missed: do we actually want our coding agents making hundreds of tool calls through someone else's infrastructure? The more capable these systems become, the more intimate access they have to our codebases, credentials, and workflows. Every token of context we send to a frontier model is data we've permanently given up control of.
I've been working on something addressing this directly - LocalGhost.ai (https://www.localghost.ai/manifesto) - hardware designed around the premise that "sovereign AI" isn't just about capability parity but about the principle that your AI should be yours. The manifesto articulates why I think this matters beyond the technical arguments.
Simon mentions his next laptop will have 128GB RAM hoping 2026 models close the gap. I'm betting we'll need purpose-built local inference hardware that treats privacy as a first-class constraint, not an afterthought. The YOLO mode section and "normalization of deviance" concerns only strengthen this case - running agents in insecure ways becomes less terrifying when "insecure" means "my local machine" rather than "the cloud plus whoever's listening."
The capability gap will close. The trust gap won't unless we build for it.
Here's a graph of internet takeoff, with Krugman's famous 1998 quote that it wouldn't amount to much marking maybe the end of the skepticism: https://www.contextualize.ai/mpereira/paul-krugmans-poor-pre...
In common with AI, there was probably a long period when the hardware wasn't really good enough for it to be useful to most people. I remember 300 baud modems and the rubber acoustic couplers you pressed your telephone handset into back in the 80s.
https://pmc.ncbi.nlm.nih.gov/articles/PMC2799957/
The resources that the brain is using to think -- whatever resources those are -- are language-based. Otherwise there would be no way to communicate with the test subjects. "Language" doesn't just imply written and spoken text, as these researchers seem to assume.
If the M stands for Meta, I would also like to note that as a user, I have been seeing increasingly poor UI, of the sort I'd expect from people committing code that wasn't properly checked before going live, as I would expect from vibe coding in the original sense of "blindly accept without review". Like, some posts have two copies of the sender's name in the same location on screen with slightly different fonts going out of sync with each other.
I can easily believe the metrics that all [MF]AANG bonuses are denominated in are going up, our profession has had jokes about engineers gaming those metrics even back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d
But I forgot how old that article is: it's 4 orders of magnitude past GPT-4 in terms of total compute, which I think is only 3.5 orders of magnitude from where we are today (based on 4.4x scaling/yr).
These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?
https://arxiv.org/abs/2507.15855
It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.
Just yesterday, it helped me parse out and understand a research paper - complete with step-by-step examples (this one: https://research.nvidia.com/sites/default/files/pubs/2016-03...). I will now go ahead and implement it myself, possibly relegating some of the more grunt-work type tasks to Claude code.
Without it, I would have been struggling through the paper for days, wading through WGSL shader code, and there would be a high chance I'd just give up on it, since this is for a side project and not my $job.
It has been a major force multiplier just for learning things. I have had the $20 subscription for about a year now. I bump it up to the $100 plan if I happen to be working on some project that eats through the $20 allocation. This happens to be one such month. I will probably go back to the $20 plan after this month. I continue to get a lot of value out of it.
The first thing I checked was "how did they verify the proofs were correct", and the answer was that they got other AI people to check them; those people said there were serious problems with the paper's methodology and that it would not have earned a gold medal.
https://x.com/j_dekoninck/status/1947587647616004583
This is why we do not take things at face value.