Fails at math of course, even if the problem is very easy, like all Mistrals. Good for generation, but probably not the best for RAG; there are Mistral tunes that stay coherent out to 16k tokens, and that cuts down chunking significantly.
What did OpenAI do to make the LLM decide "if given a math question, write Python for it and run the code to get the result" instead of trying to do the math itself?
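As far as is publicly known, it's a mix of fine-tuning on tool-use demonstrations and a harness that actually executes the emitted code (Code Interpreter / function calling). Here's a toy sketch of that loop with a faked "model", just to show the division of labor the tuning teaches: the model writes code, the harness runs it and trusts the execution result rather than the model's own arithmetic. The `fake_model` and its crude question parsing are placeholders, not anything OpenAI does.

```python
import ast

def fake_model(question: str) -> str:
    # Stand-in for the LLM: instead of answering directly, it emits
    # Python source that computes the answer (the behavior that
    # tool-use fine-tuning is meant to elicit).
    expr = question.rstrip("?").split("is")[-1].strip()
    return f"result = {expr}"

def run_tool(code: str) -> object:
    # Harness side: validate and execute the generated code in an
    # isolated namespace, then read back the result, instead of
    # letting the model do the arithmetic in its head.
    ast.parse(code)  # reject syntactically invalid model output
    ns = {}
    exec(code, {"__builtins__": {}}, ns)
    return ns["result"]

question = "what is 123457 * 987?"
code = fake_model(question)   # model writes code...
answer = run_tool(code)       # ...harness runs it
print(answer)                 # 121852059
```

The point is that the final answer comes from `exec`, not from the model; the model only has to translate the question into code, which is a much easier task for it than multi-digit arithmetic.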
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks; this is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq