They trained the model with a lot of data to write code instead (probably sandwiched between some special tokens like [run-python]). The LLM runner then takes the code, runs it in a sandbox, feeds the output back into the prompt, and lets GPT continue inference. TL;DR: they trained the model to write code for math problems instead of trying to solve them itself.
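That loop can be sketched in a few lines. Everything here is a guess at the shape, not the real implementation: the [run-python]/[/run-python] sentinels, the `fake_model` stand-in, and the exec-based "sandbox" (which isn't actually sandboxed) are all made up for illustration.

```python
import contextlib
import io
import re

RUN_OPEN, RUN_CLOSE = "[run-python]", "[/run-python]"  # hypothetical sentinel tokens

def run_sandboxed(code: str) -> str:
    # Toy "sandbox": just capture stdout from exec'ing the snippet.
    # A real deployment would isolate this in a separate process/container.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def tool_loop(model, prompt: str, max_rounds: int = 4) -> str:
    # Let the model keep generating; whenever it emits a [run-python] block,
    # execute it and splice the output back into the context.
    for _ in range(max_rounds):
        completion = model(prompt)
        prompt += completion
        m = re.search(re.escape(RUN_OPEN) + r"(.*?)" + re.escape(RUN_CLOSE),
                      completion, re.DOTALL)
        if not m:
            return prompt  # model answered without calling the tool
        prompt += "\n[output]\n" + run_sandboxed(m.group(1)) + "[/output]\n"
    return prompt

# Stand-in "model": first call writes code, second call reads the tool output.
def fake_model(context: str) -> str:
    if "[output]" not in context:
        return f"{RUN_OPEN}\nprint(37 * 43)\n{RUN_CLOSE}"
    result = context.split("[output]")[-1].split("[/output]")[0].strip()
    return "The answer is " + result

print(tool_loop(fake_model, "What is 37*43? "))
```

The key point is the same as above: the model never computes 37*43 itself, it writes `print(37 * 43)` and then reads the interpreter's answer out of its own context.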
>>Me1000+(OP)
It also has some training on problem decomposition. Many smaller models fail before they ever write the code: they fail when parsing the question.
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks. This is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq