Claude Code: connect to a local model when your quota runs out

>>fugu2+(OP)
Useful tip.

From a strategic standpoint of privacy, cost and control, I immediately went for local models, because that allowed to baseline tradeoffs and it also made it easier to understand where vendor lock-in could happen, or not get too narrow in perspective (e.g. llama.cpp/open router depending on local/cloud [1] ).

With the explosion of popularity of CLI tools (claude/continue/codex/kiro/etc) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (being aware of the lack of privacy tradeoffs).

I would absolutely pitch that and evals as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risks, accuracy and regressions.

[1] - https://alexhans.github.io/posts/aider-with-open-router.html

[2] - https://www.reddit.com/r/LocalLLaMA

>>fugu2+(OP)
i mean the other obvious answer is to plug in to the other claude code proxies that other model companies have made for you:

https://docs.z.ai/devpack/tool/claude

https://www.cerebras.ai/blog/introducing-cerebras-code

or i guess one of the hosted gpu providers

if you're basically a homelabber and wanted an excuse to run quantized models on your own device go for it but dont lie and mutter under your own tin foil hat that its a realistic replacement

>>mogoma+GQb
What are your needs/constraints (hardware constraints definitely a big one)?

The one I mentioned called continue.dev [1] is easy to try out and see if it meets your needs.

Hitting local models with it should be very easy (it calls APIs at a specific port)

[1] - https://github.com/continuedev/continue

>>fugu2+(OP)
Openrouter can also be used with claude code. https://openrouter.ai/docs/guides/claude-code-integration

>>mogoma+GQb
we recently added a `launch` command to Ollama, so you can set up tools like Claude Code easily: https://ollama.com/blog/launch

tldr; `ollama launch claude`

glm-4.7-flash is a nice local model for this sort of thing if you have a machine that can run it

>>fugu2+(OP)
Since Llama.cpp/llama-server recently added support for the Anthropic messages API, running Claude Code with several recent open-weight local models is now very easy. The messy part is what llama-server flags to use, including chat template etc. I've collected all of that setup info in my claude-code-tools [1] repo, for Qwen3-Coder-next, Qwen3-30B-A3B, Nemotron-3-Nano, GLM-4.7-Flash etc.

Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc), and even when it works, it's at very low tok/s. On the other hand Qwen3 variants perform very well, speed wise. For local sensitive document work, these are excellent; for serious coding not so much.

One caviat missed in most instructions is that you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1 in your ~/.claude/settings.json, otherwise CC's telemetry pings cause total network failure because local ports are exhausted.

[1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...

>>zozbot+i8c
The article mentions https://unsloth.ai/docs/basics/claude-codex

I'll add on https://unsloth.ai/docs/models/qwen3-coder-next

The full model is supposedly comparable to Sonnet 4.5 But, you can run the 4 bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. 8 bit needs 85.

>>paxys+ecc
On Macbooks, no. But there are a few lunatics like this guy:

https://www.youtube.com/watch?v=bFgTxr5yst0

>>fugu2+(OP)
Claude Code Router or ccr can connect to OpenRouter. When your quota runs out, it’s a much better speed vs quality vs cost tradeoff compared to running Qwen3 locally - https://github.com/musistudio/claude-code-router

>>zozbot+i8c
Kimi K2.5 is fourth place for intelligence right now. And it's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet. https://artificialanalysis.ai/models

>>cyanyd+O0c
You're right. Control is the big one and both privacy and cost are only possible because you have control. It's a similar benefit to the one of Linux distros or open source software.

The rest of your points are why I mentioned AI evals and regressions. I share your sentiment. I've pitched it in the past as "We can’t compare what we can’t measure" and "Can I trust this to run on its own?" and how automation requires intent and understanding your risk profile. None of this is new for anyone who has designed software with sufficient impact in the past, of course.

Since you're interested in combating non-determinism, I wonder if you've reached the same conclusion of reducing the spaces where it can occur and compound making the "LLM" parts as minimal as possible between solid deterministic and well-tested building blocks (e.g. https://alexhans.github.io/posts/series/evals/error-compound... ).

>>fugu2+(OP)
Some native Claude code options when your quota runs out:

1. Switch to extra usage, which can be increased on the Claude usage page: https://claude.ai/settings/usage

2. Logout and Switch to API tokens (using the ANTHROPIC_API_KEY environment variable) instead of a Claude Pro subscription. Credits can be increased on the Anthropic API console page: https://platform.claude.com/settings/keys

3. Add a second 20$/month account if this happens frequently, before considering a Max account.

4. Not a native option: If you have a ChatGPT Plus or Pro account, Codex is surprisingly just as good and comes with a much higher quota.

>>Muffin+3Dc
> Are there a lot of options how "how far" do you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.

>>Schema+2fc
I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.

0. https://www.daifi.ai/

>>reilly+A9c
Depending on what your usage requirements are, Mac Minis running UMA over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)

https://buildai.substack.com/i/181542049/the-mac-mini-moment

>>mogoma+GQb
You must try GLM4.7 and KimiK2.5 !

I also highly suggest OpenCode. You'll get the same Claude Code vibe.

If your computer is not beefy enough to run them locally, Synthetic is a bless when it comes to providing these models, their team is responsive, no downtime or any issue for the last 6 months.

Full list of models provided : https://dev.synthetic.new/docs/api/models

Referal link if you're interested in trying it for free, and discount for the first month : https://synthetic.new/?referral=kwjqga9QYoUgpZV

>>BuildT+Ujd
https://ollama.com

Although I'm starting to like LMStudio more, as it has more features that Ollama is missing.

https://lmstudio.ai

You can then get Claude to create the MCP server to talk to either. Then a CLAUDE.md that tells it to read the models you have downloaded, determine their use and when to offload. Claude will make all that for you as well.

zlacker

Claude Code: connect to a local model when your quota runs out