What did OpenAI do for the LLM to know "if given a math question, write Python for it and run the code to get the result" instead of trying to do the math itself?
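For concreteness, here is a minimal sketch of how that kind of routing is typically wired up with the public Chat Completions function-calling API. The `run_python` tool, the system prompt, and the model name are illustrative assumptions, not OpenAI's internal setup:

```python
# Hedged sketch: expose a (hypothetical) "run_python" tool so the model can
# route math to code instead of computing it in its own weights.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool name for illustration
        "description": "Execute Python code and return stdout. Use for any arithmetic or math.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string", "description": "Python source to run"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[
        {"role": "system",
         "content": "When a question requires calculation, call run_python instead of computing in your head."},
        {"role": "user", "content": "What is 1234567 * 7654321?"},
    ],
    tools=tools,
)

# If the model decides to call the tool, its arguments contain the Python it
# wants run; the caller executes that code (sandboxed!) and returns the output
# to the model as a "tool" message for the final answer.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```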
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks - this is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq
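The probe is easy to reproduce. A sketch, assuming OpenHermes 2.5 is served behind a local OpenAI-compatible endpoint (the base URL, model name, and prompts here are assumptions for illustration):

```python
# Hedged sketch of the Prolog-serialization probe against a local
# OpenAI-compatible server (e.g. llama.cpp server or Ollama).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

problem = (
    "Alice has 3 more apples than Bob. Together they have 11 apples. "
    "How many apples does each of them have?"
)

response = client.chat.completions.create(
    model="openhermes-2.5-mistral-7b",  # assumed local model name
    messages=[
        {"role": "system",
         "content": "Translate the word problem into Prolog facts, rules, and a query. Output only Prolog."},
        {"role": "user", "content": problem},
    ],
    temperature=0,
)

# Wherever the generated Prolog stops matching the problem is where the model's
# understanding breaks; loading it into SWI-Prolog makes the failure explicit.
print(response.choices[0].message.content)
```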