Smaller prompts and fewer tools tend to be more stable. I try to stay within 1,000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models; it's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that scales arbitrarily. For example, if you invoke a sub-agent every time you read a file, you can reduce the impact on the caller's token budget by orders of magnitude: the callee can read 500 KB of source and return only a brief summary or a yes/no answer. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.
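A minimal sketch of that sub-agent delegation, with `call_model` as a hypothetical stand-in for a real LLM call (here a simple truncation stub so the example runs without any API):

```python
def call_model(prompt: str, max_out: int = 60) -> str:
    """Hypothetical LLM call. A real implementation would summarize or
    answer yes/no; this stub just truncates so the sketch is runnable."""
    return prompt[:max_out]

def read_file_subagent(path: str, question: str) -> str:
    """Sub-agent: reads the full file itself, returns only a brief answer.
    The caller's context grows by len(return value), not len(source)."""
    with open(path) as f:
        source = f.read()  # may be hundreds of KB; only the sub-agent sees it
    return call_model(f"{question}\n\n{source}")

# Caller pays only for the short answer, regardless of file size:
# summary = read_file_subagent("big_module.py", "Does this define class Foo?")
```

The same boundary applies at every level: each nested sub-agent absorbs a large input and passes up a bounded output, which is why the savings compound across recursion depth.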