zlacker

1. hatefu+(OP) 2025-05-22 08:52:31
I don't think the length you're talking about is that much of an issue. As you say, depending on how you measure it, LLMs are better at remaining accurate over a long span of text.

The issue seems to be more in the intelligence department. You can't really leave them in an agent-like loop with compiler/shell output and expect them to meaningfully progress on their tasks past some small number of steps.
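The agent-like loop I mean looks roughly like this (a minimal sketch; `query_model` is a hypothetical placeholder for whatever LLM call you'd actually make, and the stopping behavior is just illustrative):

```python
import subprocess

MAX_STEPS = 5  # in practice these loops stall after a small number of steps

def query_model(prompt):
    # Hypothetical stand-in for a real LLM call; a real agent would send
    # `prompt` to a model and get back a shell command or a patch.
    return "echo done"

def agent_loop(task):
    prompt = task
    for step in range(MAX_STEPS):
        command = query_model(prompt)
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0 and "done" in result.stdout:
            return step + 1  # model converged on a working solution
        # Feed compiler/shell output back into the next prompt and retry.
        prompt = task + "\n" + result.stdout + result.stderr
    return None  # never converged within the step budget
```

The feedback step (appending stdout/stderr to the prompt) is exactly where current models fail to make cumulative progress, in my experience.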

Improving their initial error-free token length is solving the wrong problem. I would take a model with less initial accuracy than a human, but one equally capable of iterating on its solution over time.
