From a strategic standpoint of privacy, cost and control, I immediately went for local models, because that allowed to baseline tradeoffs and it also made it easier to understand where vendor lock-in could happen, or not get too narrow in perspective (e.g. llama.cpp/open router depending on local/cloud [1] ).
With the explosion of popularity of CLI tools (claude/continue/codex/kiro/etc) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (being aware of the lack of privacy tradeoffs).
I would absolutely pitch that and evals as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risks, accuracy and regressions.
[1] - https://alexhans.github.io/posts/aider-with-open-router.html
https://docs.z.ai/devpack/tool/claude
https://www.cerebras.ai/blog/introducing-cerebras-code
or i guess one of the hosted gpu providers
if you're basically a homelabber and wanted an excuse to run quantized models on your own device go for it but dont lie and mutter under your own tin foil hat that its a realistic replacement
The one I mentioned called continue.dev [1] is easy to try out and see if it meets your needs.
Hitting local models with it should be very easy (it calls APIs at a specific port)
This is with my regular $20/month ChatGpT subscription and my $200 a year (company reimbursed) Claude subscription.
And the probability that any given model you use today is the same as what you use tomorrow is doubly doubtful:
1. The model itself will change as they try to improve the cost-per-test improves. This will necessarily make your expectations non-deterministic.
2. The "harness" around that model will change as business-cost is tightened and the amount of context around the model is changed to improve the business case which generates the most money.
Then there's the "cataclysmic" lockout cost where you accidently use the wrong tool that gets you locked out of the entire ecosystem and you are black listed, like a gambler in vegas who figures out how to count cards and it works until the house's accountant identifies you as a non-negligible customer cost.
It's akin to anti-union arguments where everyone "buying" into the cloud AI circus thinks they're going to strike gold and completely ignores the fact that very few will and if they really wanted a better world and more control, they'd unionize and limit their illusions of grandeur. It should be an easy argument to make, but we're seeing about 1/3 of the population are extremely susceptible to greed based illusions.,
And they do? That's what the API is.
The subscription always seemed clearly advertised for client usage, not general API usage, to me. I don't know why people are surprised after hacking the auth out of the client. (note in clients they can control prompting patterns for caching etc, it can be cheaper)
This is, however, a major improvement from ~6 months ago when even a single token `hi` from an agentic CLI could take >3 minutes to generate a response. I suspect the parallel processing of LMStudio 0.4.x and some better tuning of the initial context payload is responsible.
6 months from now, who knows?
I've been running CC with Qwen3-Coder-30B (FP8) and I find it just as fast, but not nearly as clever.
tldr; `ollama launch claude`
glm-4.7-flash is a nice local model for this sort of thing if you have a machine that can run it
The API is for using the model directly with your own tools. It can be in dev, or experiments, or anything.
Subscriptions are for using the apps Claude + code. That's what it always said when you sign up.
I set up a bot on 4claw and although it’s kinda slow, it took twenty minutes to load 3 subs and 5 posts from each then comment on interesting ones.
It actually managed to correctly use the api via curl though at one point it got a little stuck as it didn’t escape its json.
I’m going to run it for a few days but very impressed so for for such a small model.
Wildly understating this part.
Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.
For a full claude code replacement I'd go with opencode instead, but good models for that are something you run in your company's basement, not at home
Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc), and even when it works, it's at very low tok/s. On the other hand Qwen3 variants perform very well, speed wise. For local sensitive document work, these are excellent; for serious coding not so much.
One caviat missed in most instructions is that you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1 in your ~/.claude/settings.json, otherwise CC's telemetry pings cause total network failure because local ports are exhausted.
[1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...
Closed models are specifically tuned with tools, that model provider wants them to work with (for example specific tools under claude code), and hence they perform better.
I think this will always be the case, unless someone tunes open models to work with the tools that their coding agent will use.
Will it work? Yes. Will it produce same quality as Sonnet or Opus? No.
LLMs are a hyper-competitive market at the moment, and we have a wealth of options, so if Anthropic is overpricing their API they'll likely be hurting themselves.
Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.
That might incentivize it to perform slightly better from the get go.
Local models are relatively small, it seems wasteful to try and keep them as generalists. Fine tuning on your specific coding should make for better use of their limited parameter count.
Thanks again for this info & setup guide! I'm excited to play with some local models.
we can do much better with a cheap model on openrouter (glm 4.7, kimi, etc.) than anything that I can run on my lowly 3090 :)
But I was really disappointed when I tried to use subagents. In theory I really liked the idea: have Haiku wrangle small specific tasks that are tedious but routine and have Sonnet orchestrate everything. In practice the subagents took so many steps and wrote so much documentation that it became not worth it. Running 2-3 agents blew through the 5 hour quota in 20 minutes of work vs normal work where I might run out of quota 30-45 minutes before it resets. Even after tuning the subagent files to prevent them from writing tests I never asked for and not writing tons of documentation that I didn’t need they still produced way too much content and blew the context window of the main agent repeatedly. If it was a local model I wouldn’t mind experimenting with it more.
I'll add on https://unsloth.ai/docs/models/qwen3-coder-next
The full model is supposedly comparable to Sonnet 4.5 But, you can run the 4 bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. 8 bit needs 85.
But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.
Right now OpenAI is giving away fairly generous free credits to get people to try the macOS Codex client. And... it's quite good! Especially for free.
I've cancelled my Anthropic subscription...
Some open models have specific training for defined tools (a notable example is OpenAI GPT-OSS and its "built in" tools for browser use and python execution (they are called built in tools, but they are really tool interfaces it is trained to use if made available.) And closed models are also trained to work with generic tools as well as their “built in” tools.
It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.
The rest of your points are why I mentioned AI evals and regressions. I share your sentiment. I've pitched it in the past as "We can’t compare what we can’t measure" and "Can I trust this to run on its own?" and how automation requires intent and understanding your risk profile. None of this is new for anyone who has designed software with sufficient impact in the past, of course.
Since you're interested in combating non-determinism, I wonder if you've reached the same conclusion of reducing the spaces where it can occur and compound making the "LLM" parts as minimal as possible between solid deterministic and well-tested building blocks (e.g. https://alexhans.github.io/posts/series/evals/error-compound... ).
That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.
1. Switch to extra usage, which can be increased on the Claude usage page: https://claude.ai/settings/usage
2. Logout and Switch to API tokens (using the ANTHROPIC_API_KEY environment variable) instead of a Claude Pro subscription. Credits can be increased on the Anthropic API console page: https://platform.claude.com/settings/keys
3. Add a second 20$/month account if this happens frequently, before considering a Max account.
4. Not a native option: If you have a ChatGPT Plus or Pro account, Codex is surprisingly just as good and comes with a much higher quota.
I was using GLM on ZAI coding plan (jerry rigged Claude Code for $3/month), but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
To clarify, the code I was getting before mostly worked, it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.
Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
PC or Mac? A PC, ya, no way, not without beefy GPUs with lots of VRAM. A mac? Depends on the CPU, an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max CPU + 64GB of unified RAM (well, that's my setup at least).
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
It’s slower and about 90% as good, so it definitely works as a great back up, but CC with Opus is noticeably better for all of my workloads
Kimi K2.5 is a trillion parameter model. You can't run it locally on anything other than extremely well equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.
Also the proprietary models a year ago were not that good for anything beyond basic tasks.
This is a very common sequence of events.
The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.
I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)
I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but sold out...
(My fallback is I got a whole year of GLM from ZAI for $20 for the year, it's just a bit too slow for interactive use.)
Going dumb/cheap just ends up costing more, in the short and long term.
Unless you include it in "frontier", but that has usually been used to refer to "Big 3".
Now as the other replies say, you should very likely run a quantized version anyway.
No one's running Sonnet/Gemini/GPT-5 locally though.
This completely depends on the domain, as always. Each of the big 3 have their strengths and weaknesses.
I have claude pro $20/mo and sometimes run out. I just set ANTHROPIC_BASE_URL to a localllm API endpoint that connects to a cheaper Openai model. I can continue with smaller tasks with no problem. This has been done for a long time.
(At todays ram prices upgrading to that for me would pay for a _lot_ of tokens...)
Kimi K2.5 is good, but it's still behind the main models like Claude's offerings and GPT-5.2. Yes, I know what the benchmarks say, but the benchmarks for open weight models have been overpromising for a long time and Kimi K2.5 is no exception.
Kimi K2.5 is also not something you can easily run locally without investing $5-10K or more. There are hosted options you can pay for, but like the parent commenter observed: By the time you're pinching pennies on LLM costs, what are you even achieving? I could see how it could make sense for students or people who aren't doing this professionally, but anyone doing this professionally really should skip straight to the best models available.
Unless you're billing hourly and looking for excuses to generate more work I guess?