When you have a conversation with an AI, in simple terms, every time you type a new line and hit enter, the client sends the entire conversation to the LLM. It has always worked this way, and it's how "reasoning tokens" were first realized: you allow the client to "edit" the context, so the client deletes the hallucination, appends "Wait..." to the end of the context, and hits enter.
The LLM is tricked into thinking it's confused/wrong/unsure, and "reasons" more about that particular thing.
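Roughly, in Python-ish sketch form (call_llm is just a stand-in for whatever API the client actually talks to, and the hallucination detection here is purely for illustration):

```python
# Rough sketch of the idea, not any particular client's implementation.
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for the real API call; returns a canned reply here."""
    return "I refactored the loader. I have updated the files on disk."

# The client keeps the whole conversation and resends all of it every turn.
messages = [{"role": "user", "content": "Refactor the config loader."}]
reply = call_llm(messages)
messages.append({"role": "assistant", "content": reply})

# The "Wait..." trick: the client edits its own copy of the context.
# Cut the assistant turn just before the hallucinated part, append "Wait...",
# and resubmit as if the model had said it itself.
bad_at = reply.find("I have updated the files")  # wherever the bad text starts
if bad_at != -1:
    messages[-1]["content"] = reply[:bad_at] + "Wait..."
    reply = call_llm(messages)  # the model now second-guesses itself from there
```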
I'm 99.9999% sure Gemini has a dynamic scaling system that will route you to smaller models when it's overloaded, and that seems to be when it will still occasionally do things like tell you it edited some files without actually presenting the changes, or go off on other strange tangents.
That has reliably worked for me with Gemini, Codex, and Opus. If you can get them to check off features as they complete them, it works even better (i.e., success criteria and an empty checkbox for them to mark off).
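The shape I use is roughly this (the feature and criteria here are invented, just to show the format):

```python
# Made-up example of the kind of checklist I mean; paste it into the task prompt.
TASK_CHECKLIST = """
Implement CSV export. Mark each box ONLY after the criterion is actually met:

- [ ] export_csv() writes a header row
- [ ] Fields are escaped per RFC 4180
- [ ] Unit test covers an empty dataset
- [ ] All existing tests still pass
"""
```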