I think these subtle issues are just harder to provide a "harness" for — something like a compiler or a rigorous test suite that lets the LLM converge toward a good (if sometimes inelegant) solution. A finer-tuned QA agent probably would have changed the final result.
I wonder what it was doing with all those tokens?
E.g. I wouldn't be surprised if identifying the lack of touch-screen support in the menu, feeding that finding back in, and then regenerating the menu code took a lot of tokens somewhere between the 800k and 7MM marks.
UPDATE: to clarify, I meant without any AI usage at all.