The error amplification numbers are wild! 17x for independent agents vs 4x with some central coordination. Clink gives users (and, more importantly, their agents) the primitives to choose their own pattern.
The most relevant features are:

- work queues with claim/release for parallelizable tasks
- checkpoint dependencies when things need to be sequential
- consensus voting as a gate before anything critical happens
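For a feel of the first primitive, here's a minimal claim/release sketch - the names and shapes are hypothetical on my part, not Clink's actual API (the docs linked below have the real interface):

```python
import threading
import uuid

# Hypothetical sketch of a claim/release work queue -- not Clink's
# actual API; WorkQueue, claim, and release are illustrative names.
class WorkQueue:
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._pending = {uuid.uuid4().hex[:8]: t for t in tasks}
        self._claims = {}  # task_id -> (agent_id, task)

    def claim(self, agent_id):
        """Atomically hand one pending task to an agent, or None."""
        with self._lock:
            if not self._pending:
                return None
            task_id, task = self._pending.popitem()
            self._claims[task_id] = (agent_id, task)
            return task_id, task

    def release(self, task_id, succeeded):
        """Close out a claim; failed tasks go back in the pool."""
        with self._lock:
            _agent, task = self._claims.pop(task_id)
            if not succeeded:
                self._pending[task_id] = task
```

The claim/release split is what keeps parallel agents from stepping on each other: a task is either pending, claimed by exactly one agent, or done.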
The part about tool count increasing coordination overhead is interesting too. I've been considering exposing just a single tool to address this, but I wonder how this plays out as people start stacking more MCP servers together. It feels like we're all still learning what works here. The docs are at https://docs.clink.voxos.ai if anyone wants to poke around!
I can believe SAS works great until the context contains errors that were later corrected - there seems to be leakage between past mistakes and new ones if you leave them all in one context window.
My team wrote a similar paper[1] last month, but we found the core component isn't the orchestrator; it's a specialized evaluator for each action, which at the end of execution matches the result against the goal and methods and reports back to the orchestrator on goal adherence.
The effect is sort of like a perpetual evals loop, which lets us improve the product every week, agent by agent, without the Snowflake agent picking up the BigQuery tools, etc.
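Roughly, the shape is something like this - my own illustrative sketch, not code from the paper:

```python
from dataclasses import dataclass

# Illustrative sketch of a per-action evaluator reporting goal
# adherence back to an orchestrator -- names are hypothetical,
# not taken from the paper.
@dataclass
class ActionReport:
    action: str
    goal_adherence: float  # 0.0-1.0
    notes: str

def evaluate_action(action: str, goal: str, methods: list[str],
                    result: str, judge) -> ActionReport:
    """At the end of execution, match the result against the goal and
    methods, then hand the orchestrator a goal-adherence report."""
    prompt = (
        f"Goal: {goal}\nAllowed methods: {methods}\n"
        f"Action taken: {action}\nResult: {result}\n"
        "Score goal adherence from 0 to 1 and explain briefly."
    )
    score, notes = judge(prompt)  # judge: any LLM call -> (float, str)
    return ActionReport(action=action, goal_adherence=score, notes=notes)
```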
We started building this in Nov 2024, so the paper is more a description of what worked for us (see Section 3).
Also, specific models are great at some tasks but not always at others.
My general finding is that Google models do document extraction best, Claude does code well, and OpenAI does task management in a somewhat sycophantic fashion.
Multi-agent setups were originally supposed to give us a "best of all models" world, but they also work for error correction: having Claude write code and GPT-5 check the results beats piling everything into one context.
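The pattern is simple enough to sketch - this is illustrative on my part, with the model calls abstracted behind plain callables (I'm not reproducing any vendor's API):

```python
# Writer/checker split with separate contexts -- writer and checker
# are any two model clients (e.g. Claude to write, GPT-5 to check),
# each a callable taking a prompt string and returning a string.
def write_then_check(task: str, writer, checker, max_rounds: int = 3) -> str:
    """Writer drafts in its own context; checker reviews in a fresh
    one, so the writer's earlier mistakes don't leak into the review."""
    code = writer(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        review = checker(
            f"Task:\n{task}\n\nCode:\n{code}\n"
            "Reply APPROVED or list concrete problems."
        )
        if review.strip().startswith("APPROVED"):
            break
        # Only the review text flows back, not the checker's context.
        code = writer(
            f"Task:\n{task}\n\nFix these problems:\n{review}\n"
            f"Previous code:\n{code}"
        )
    return code
```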
Even in the case of a single agent, the compounding of errors [1] can easily make your "flow" unacceptable for your use case. A deterministic-where-possible, decoupled, well-tested approach is key.
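To make the compounding concrete (made-up per-step numbers, not figures from the linked post):

```python
# If each step succeeds independently with probability p, a flow of
# n steps succeeds with probability p ** n.
for p in (0.99, 0.98, 0.95):
    for n in (5, 20, 50):
        print(f"p={p}, n={n}: flow succeeds {p ** n:.1%} of the time")
# Even a 98%-reliable step fails about a third of the time over
# 20 steps; at 95% per step, a 20-step flow is down to ~36%.
```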
In such a fast-moving space I'm always wary of adopting optimization techniques that I can't easily prove out and pivot away from (which means measuring/evals are necessary).
Slowly but surely, abstractions let us build on others' deep investments in coordination without losing control (e.g. PySpark's worker/driver coordination), so we can invest in friction removal and direct value generation in our own domains (e.g. banking/retail/legal payments, etc.)
- [1] https://alexhans.github.io/posts/series/evals/error-compound...