zlacker

For context on what cloud API costs look like when running coding agents:

With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

At 1K tasks/day that's ~$1.5K-3K/month in API spend.

The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

replies(3): >>taneq+P9 >>pstuar+to >>jychan+Xy

>>opencl+(OP)
At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P

replies(1): >>EGreg+Np

>>opencl+(OP)
Might there be a way to leverage local models just to help minimize the retries -- doing the tool calling handling and giving the agent "perfect execution"?

I'm a noob and am asking as wishful thinking.

replies(1): >>jermau+tT1

>>taneq+P9
This is an interesting question actually!

Marginal cost includes energy usage but also I burned out a MacBook GPU with vanity-eth last year so wear-and-tear is also a cost.

>>opencl+(OP)
On the other hand, Deepseek V3.2 is $0.38 per million tokens output. And on openrouter, most providers serve it at 20 tokens/sec.

At 20t/s over 1 month, that's... $19something running literally 24/7. In reality it'd be cheaper than that.

I bet you'd burn more than $20 in electricity with a beefy machine that can run Deepseek.

The economics of batch>1 inference does not go in favor of consumers.

replies(1): >>selcuk+nX

>>jychan+Xy
> At 20t/s over 1 month, that's... $19something running literally 24/7.

You can run agents in parallel, but yeah, that's a fair comparison.

>>pstuar+to
> I'm a noob and am asking as wishful thinking.

Don't minimize your thoughts! Outside voices and naive questions sometimes provide novel insights that might be dismissed, but someone might listen.

I've not done this exactly, but I have setup "chains" that create a fresh context for tool calls so their call chains don't fill the main context. There is no reason why the Tool Calls couldn't be redirected to another LLM endpoint (local for instance). Especially with something like gpt-oss-20b, where I've found executing tools happens at a higher success than claude sonnet via openrouter.