Fails at math of course, even if the problem is very easy, like all Mistrals. Good for generation, probably not the best for RAG; there are Mistral tunes that stay coherent to 16k tokens, and that cuts down chunking significantly.
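To make the chunking point concrete, here's a rough back-of-the-envelope sketch (all numbers are made up for illustration, not benchmarks):

```python
# Rough illustration: how context length affects chunk count for RAG.
# All numbers are illustrative, not measurements.
doc_tokens = 200_000          # size of the corpus you want to retrieve over
overlap = 200                 # overlap between adjacent chunks

for context in (4_096, 16_384):
    chunk_size = context // 2          # leave room for the question and answer
    step = chunk_size - overlap
    chunks = -(-doc_tokens // step)    # ceiling division
    print(f"{context:>6} ctx -> chunk size {chunk_size}, ~{chunks} chunks")
```

With a model that stays coherent to 16k you can use chunks roughly 4x larger, so the same corpus needs roughly a quarter of the chunks.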
Edit: mistook tokens for parameters for a moment there. Keeping up with AI jargon is exhausting for an idiot like me.
What you see in the link is a copy-paste of a discussion between me and the model in question, which I pasted into GPT-4 with instructions to evaluate it. The answer with the scores out of 10 is GPT-4 evaluating the chat between me and the smaller model. The smaller model produces the text after ASSISTANT; the questions I ask as USER are part of a fixed script that I run with every new model, so that I have a sort of validation set before doing more rigorous testing.
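For anyone curious, a minimal hypothetical sketch of what such a fixed-script smoke test can look like (the endpoint, model name, and questions are placeholders, not the actual setup described above), assuming a local OpenAI-compatible server:

```python
# Hypothetical smoke test: run a fixed list of questions against a new model
# and dump the transcript so it can be pasted elsewhere for evaluation.
# Assumes a local OpenAI-compatible endpoint (llama.cpp / vLLM style).
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
MODEL = "openhermes-2.5"                                 # placeholder name

QUESTIONS = [
    "A train leaves at 9:00 travelling 60 km/h ...",     # placeholder items
    "Summarise the following paragraph in one sentence: ...",
]

for q in QUESTIONS:
    resp = requests.post(ENDPOINT, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": q}],
        "temperature": 0,
    }, timeout=120)
    answer = resp.json()["choices"][0]["message"]["content"]
    print(f"USER: {q}\nASSISTANT: {answer}\n")
```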
What did OpenAI do for the LLM to know "if given a math question, write Python for it, and run the code to get the result" instead of trying to do the math itself?
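For illustration, a minimal hypothetical sketch of the pattern the question describes, i.e. prompt the model to answer with Python and have a harness execute whatever it writes. This is not OpenAI's actual implementation; the model call is stubbed out:

```python
# Hypothetical sketch of the "write code for math, run it" pattern.
# A real setup would send SYSTEM_PROMPT + the question to an LLM and
# capture the code it returns; here the model call is a stub.
import subprocess, sys, textwrap

SYSTEM_PROMPT = (
    "If the user asks a math question, do not compute it yourself. "
    "Reply with a single Python program that prints the answer."
)

def fake_model(question: str) -> str:
    # Stand-in for an actual LLM call.
    return textwrap.dedent("""
        a = 1234 * 5678
        print(a)
    """)

def answer_math(question: str) -> str:
    code = fake_model(question)                 # model writes the code
    result = subprocess.run(                    # harness runs it
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

print(answer_math("What is 1234 * 5678?"))      # -> 7006652
```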
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks - this is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq