They trained the model with a lot of data to write code instead (probably sandwiched between some special tokens like [run-python]). The LLM runner then takes the code, runs it in a sandbox, feeds the output back into the prompt, and lets GPT continue inference. TL;DR: they trained the model to write code for math problems instead of trying to solve them itself.
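That loop can be sketched in a few lines. Everything here is a guess at the shape, not the real implementation: the [run-python]/[/run-python] sentinels, the `fake_model` stand-in, and the exec-based "sandbox" (which isn't actually sandboxed) are all made up for illustration.

```python
import contextlib
import io
import re

RUN_OPEN, RUN_CLOSE = "[run-python]", "[/run-python]"  # hypothetical sentinel tokens

def run_sandboxed(code: str) -> str:
    # Toy "sandbox": just capture stdout from exec'ing the snippet.
    # A real deployment would isolate this in a separate process/container.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def tool_loop(model, prompt: str, max_rounds: int = 4) -> str:
    # Let the model keep generating; whenever it emits a [run-python] block,
    # execute it and splice the output back into the context.
    for _ in range(max_rounds):
        completion = model(prompt)
        prompt += completion
        m = re.search(re.escape(RUN_OPEN) + r"(.*?)" + re.escape(RUN_CLOSE),
                      completion, re.DOTALL)
        if not m:
            return prompt  # model answered without calling the tool
        prompt += "\n[output]\n" + run_sandboxed(m.group(1)) + "[/output]\n"
    return prompt

# Stand-in "model": first call writes code, second call reads the tool output.
def fake_model(context: str) -> str:
    if "[output]" not in context:
        return f"{RUN_OPEN}\nprint(37 * 43)\n{RUN_CLOSE}"
    result = context.split("[output]")[-1].split("[/output]")[0].strip()
    return "The answer is " + result

print(tool_loop(fake_model, "What is 37*43? "))
```

The key point is the same as above: the model never computes 37*43 itself, it writes `print(37 * 43)` and then reads the interpreter's answer out of its own context.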
>>Me1000+(OP)
It also has some training on problem decomposition. Many smaller models fail before they ever write the code: they fail when parsing the question.
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks. This is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq