The error amplification numbers are wild! 17x for independent agents vs 4x with some central coordination. Clink gives users (and, more importantly, their agents) the primitives to choose their own pattern.
The most relevant features are:

- work queues with claim/release for parallelizable tasks
- checkpoint dependencies when things need to be sequential
- consensus voting as a gate before anything critical happens
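For a feel of the first primitive, here's a minimal claim/release sketch - the names and shapes are hypothetical on my part, not Clink's actual API (the docs linked below have the real interface):

```python
import threading
import uuid

# Hypothetical sketch of a claim/release work queue -- not Clink's
# actual API; WorkQueue, claim, and release are illustrative names.
class WorkQueue:
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._pending = {uuid.uuid4().hex[:8]: t for t in tasks}
        self._claims = {}  # task_id -> (agent_id, task)

    def claim(self, agent_id):
        """Atomically hand one pending task to an agent, or None."""
        with self._lock:
            if not self._pending:
                return None
            task_id, task = self._pending.popitem()
            self._claims[task_id] = (agent_id, task)
            return task_id, task

    def release(self, task_id, succeeded):
        """Close out a claim; failed tasks go back in the pool."""
        with self._lock:
            _agent, task = self._claims.pop(task_id)
            if not succeeded:
                self._pending[task_id] = task
```

The claim/release split is what keeps parallel agents from stepping on each other: a task is either pending, claimed by exactly one agent, or done.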
The part about tool count increasing coordination overhead is interesting too. I've been considering exposing just a single tool to address this, but I wonder how this plays out as people start stacking more MCP servers together. It feels like we're all still learning what works here. The docs are at https://docs.clink.voxos.ai if anyone wants to poke around!
I can believe SAS works great until the context contains errors that were later corrected - there seems to be leakage between past mistakes and new ones if you leave them all in one context window.
My team wrote a similar paper[1] last month, but we found the core component isn't the orchestrator; it's a specialized evaluator for each action, which at the end of execution matches the result against the goal and methods and reports back to the orchestrator on goal adherence.
The effect is sort of like a perpetual evals loop, which lets us improve the product every week, agent by agent, without the Snowflake agent picking up the BigQuery tools, etc.
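Roughly, the shape is something like this - my own illustrative sketch, not code from the paper:

```python
from dataclasses import dataclass

# Illustrative sketch of a per-action evaluator reporting goal
# adherence back to an orchestrator -- names are hypothetical,
# not taken from the paper.
@dataclass
class ActionReport:
    action: str
    goal_adherence: float  # 0.0-1.0
    notes: str

def evaluate_action(action: str, goal: str, methods: list[str],
                    result: str, judge) -> ActionReport:
    """At the end of execution, match the result against the goal and
    methods, then hand the orchestrator a goal-adherence report."""
    prompt = (
        f"Goal: {goal}\nAllowed methods: {methods}\n"
        f"Action taken: {action}\nResult: {result}\n"
        "Score goal adherence from 0 to 1 and explain briefly."
    )
    score, notes = judge(prompt)  # judge: any LLM call -> (float, str)
    return ActionReport(action=action, goal_adherence=score, notes=notes)
```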
We started building this in Nov 2024, so the paper is more a description of what worked for us (see Section 3).
Also, specific models are great at some tasks but not always at others.
My general finding is that Google models do document extraction best, Claude does code well, and OpenAI does task management in a somewhat sycophantic fashion.
Multi-agent setups were originally supposed to give us a "best of all models" world, but they also work for error correction: having Claude write code and GPT-5 check the results beats piling everything into one context.
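The pattern is simple enough to sketch - this is illustrative on my part, with the model calls abstracted behind plain callables (I'm not reproducing any vendor's API):

```python
# Writer/checker split with separate contexts -- writer and checker
# are any two model clients (e.g. Claude to write, GPT-5 to check),
# each a callable taking a prompt string and returning a string.
def write_then_check(task: str, writer, checker, max_rounds: int = 3) -> str:
    """Writer drafts in its own context; checker reviews in a fresh
    one, so the writer's earlier mistakes don't leak into the review."""
    code = writer(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        review = checker(
            f"Task:\n{task}\n\nCode:\n{code}\n"
            "Reply APPROVED or list concrete problems."
        )
        if review.strip().startswith("APPROVED"):
            break
        # Only the review text flows back, not the checker's context.
        code = writer(
            f"Task:\n{task}\n\nFix these problems:\n{review}\n"
            f"Previous code:\n{code}"
        )
    return code
```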
Even in the case of a single agent, the compounding of errors [1] can easily make your "flow" unacceptable for your use case. A deterministic-where-possible, decoupled, well-tested approach is key.
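To make the compounding concrete (made-up per-step numbers, not figures from the linked post):

```python
# If each step succeeds independently with probability p, a flow of
# n steps succeeds with probability p ** n.
for p in (0.99, 0.98, 0.95):
    for n in (5, 20, 50):
        print(f"p={p}, n={n}: flow succeeds {p ** n:.1%} of the time")
# Even a 98%-reliable step fails about a third of the time over
# 20 steps; at 95% per step, a 20-step flow is down to ~36%.
```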
In such a fast-moving space I'm always wary of adopting optimization techniques that I can't easily prove out and pivot away from (which means measuring/evals are necessary).
Slowly but surely, abstractions let us build on others' deep investments in coordination without losing control (e.g. PySpark's worker/driver coordination), so we can invest in friction removal and direct value generation in our own domains (e.g. banking/retail/legal payments, etc.)
- [1] https://alexhans.github.io/posts/series/evals/error-compound...