> The paper sounds too shallow. The errors data doesn't seem to have a rationale or correlation against the architecture. Specifically, what makes the SAS architecture to have lowest error rates while the similar architecture with independent agents having highest error rates?

I can believe SAS works great until the context contains errors that were later corrected. If you leave them all in one context window, past mistakes seem to leak into new ones.

My team wrote a similar paper[1] last month, but we found that the core component is not the orchestrator. It is a specialized evaluator for each action, which checks the result against the goal and methods at the end of execution and reports back to the orchestrator on goal adherence.
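
To make the shape of that concrete, here is a minimal sketch of a per-action evaluator reporting back to an orchestrator. It is not the code from the paper: `ActionReport`, `EVALUATORS`, `evaluate_action`, `orchestrate` and the toy checks are all hypothetical names made up for illustration, and a real evaluator would likely call a model instead of doing string checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ActionReport:
    action: str
    goal: str
    result: str
    adheres: bool  # did the result meet the goal, per this action's evaluator?


# Hypothetical evaluator signature: (goal, result) -> goal adherence.
Evaluator = Callable[[str, str], bool]


def sql_evaluator(goal: str, result: str) -> bool:
    # Toy check: a "write_sql" action should have produced a SELECT statement.
    return result.strip().lower().startswith("select")


def summary_evaluator(goal: str, result: str) -> bool:
    # Toy check: a summary should be non-empty and at most 50 words.
    return 0 < len(result.split()) <= 50


# One specialized evaluator per action type (assumed mapping).
EVALUATORS: Dict[str, Evaluator] = {
    "write_sql": sql_evaluator,
    "summarize": summary_evaluator,
}


def evaluate_action(action: str, goal: str, result: str) -> ActionReport:
    evaluator = EVALUATORS[action]
    return ActionReport(action, goal, result, evaluator(goal, result))


def orchestrate(steps: List[Tuple[str, str, str]]) -> List[ActionReport]:
    """Evaluate each finished (action, goal, result) step and return the
    reports that failed goal adherence, so the orchestrator can retry them."""
    reports = [evaluate_action(*step) for step in steps]
    return [r for r in reports if not r.adheres]
```

The orchestrator only sees pass/fail reports per action, so each evaluator can be tuned on its own without touching the others.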

The effect is sort of like a perpetual evals loop, which lets us improve the product every week, agent by agent, without the Snowflake agent picking up the BigQuery tools etc.
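
A rough sketch of what that tool isolation could look like, assuming each agent is constructed with its own tool table. The `Agent` class, `ToolNotAllowed`, and the stub query functions are hypothetical and do not touch any real Snowflake or BigQuery client.

```python
from typing import Callable, Dict


class ToolNotAllowed(Exception):
    """Raised when an agent asks for a tool outside its own table."""


class Agent:
    def __init__(self, name: str, tools: Dict[str, Callable[[str], str]]):
        self.name = name
        self._tools = dict(tools)  # private copy: no shared global registry

    def call(self, tool_name: str, arg: str) -> str:
        if tool_name not in self._tools:
            raise ToolNotAllowed(f"{self.name} has no tool {tool_name!r}")
        return self._tools[tool_name](arg)


# Stub tools standing in for real warehouse clients (assumption).
def snowflake_query(sql: str) -> str:
    return f"snowflake:{sql}"


def bigquery_query(sql: str) -> str:
    return f"bigquery:{sql}"


snowflake_agent = Agent("snowflake", {"snowflake_query": snowflake_query})
bigquery_agent = Agent("bigquery", {"bigquery_query": bigquery_query})
```

With this layout, changing the BigQuery tools or their evals cannot alter what the Snowflake agent is able to call.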

We started building this in Nov 2024, so the paper is more of a description of what worked for us (see Section 3).

Also specific models are great at some tasks, but not always good at others.

My general finding is that Google models do document extraction best, Claude does code well, and OpenAI does task management in a somewhat sycophantic fashion.

Multi-agent was originally supposed to let us put together a "best of all models" world, but it turns out to work for error correction too: I have Claude write the code and GPT-5 check the results, instead of everything going into one context.
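
A minimal sketch of that writer/checker split, assuming each model is just a function from a list of messages to text. `write_then_check`, `fake_writer` and `fake_checker` are hypothetical stand-ins, not real Claude or GPT-5 API calls. The point is only that the checker gets a fresh context on every round and never sees earlier failed attempts.

```python
from typing import Callable, List, Optional

# Hypothetical model interface: list of messages in, text out.
Model = Callable[[List[str]], str]

checker_inputs: List[List[str]] = []  # recorded only to inspect the sketch


def write_then_check(task: str, writer: Model, checker: Model,
                     max_rounds: int = 3) -> Optional[str]:
    writer_context = [task]
    for _ in range(max_rounds):
        code = writer(writer_context)
        # Fresh checker context: the task and the current candidate only.
        verdict = checker([task, code])
        if verdict == "OK":
            return code
        # Only the feedback goes back to the writer.
        writer_context.append(f"Reviewer feedback: {verdict}")
    return None  # no candidate passed within max_rounds


def fake_writer(messages: List[str]) -> str:
    # Stub: gets it wrong first, fixes it once feedback is present.
    if any(m.startswith("Reviewer feedback") for m in messages):
        return "return a + b"
    return "return a - b"


def fake_checker(messages: List[str]) -> str:
    checker_inputs.append(list(messages))
    return "OK" if "a + b" in messages[-1] else "uses subtraction, expected addition"
```

Calling `write_then_check("add two numbers", fake_writer, fake_checker)` takes two rounds with these stubs, and the checker sees exactly two messages each time.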

[1] - https://arxiv.org/abs/2601.14351