zlacker

Mistral 7B Fine-Tune Optimized

submitted by tosh+(OP) on 2023-12-20 19:50:24 | 234 points 103 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only.
◧◩
7. tomrod+pe[view] [source] [discussion] 2023-12-20 21:10:34
>>nickth+Cb
Looks like they utilized the Bradley-Terry model, but that's not one I'm super familiar with.

https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
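
For anyone else unfamiliar: Bradley-Terry gives each model a latent strength s and models a matchup as P(i beats j) = s_i / (s_i + s_j). A rough Python sketch of fitting strengths from arena-style win counts (the counts below are made up, purely for illustration):

    # Toy Bradley-Terry fit from pairwise win counts (made-up numbers).
    wins = {("gpt-4", "mistral-7b-ft"): 55, ("mistral-7b-ft", "gpt-4"): 45}
    models = ["gpt-4", "mistral-7b-ft"]
    s = {m: 1.0 for m in models}                    # latent strengths, start equal

    for _ in range(100):                            # simple fixed-point (MM) updates
        for m in models:
            w = sum(c for (a, b), c in wins.items() if a == m)   # total wins by m
            denom = sum((wins.get((m, o), 0) + wins.get((o, m), 0)) / (s[m] + s[o])
                        for o in models if o != m)
            s[m] = w / denom

    total = sum(s.values())
    print({m: round(v / total, 3) for m, v in s.items()})
    # P(m beats o) is then estimated as s[m] / (s[m] + s[o])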

◧◩◪◨
8. GaggiX+te[view] [source] [discussion] 2023-12-20 21:11:03
>>nickth+1d
Well, it's pretty easy to find examples online. This one uses Llama 2, not even Mistral or any fancy techniques: https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...
◧◩◪
11. crooke+gf[view] [source] [discussion] 2023-12-20 21:15:52
>>hospit+cd
Don't forget that ChatGPT 4 also has seasonal depression [1].

[1]: https://twitter.com/RobLynch99/status/1734278713762549970

(Though with that said, the seasonal issue might be common to any LLM with training data annotated by time of year.)

◧◩◪◨
13. shiftp+Uh[view] [source] [discussion] 2023-12-20 21:31:37
>>nickth+1d
They're quite close in arena format: https://chat.lmsys.org/?arena
◧◩
24. wavemo+zk[view] [source] [discussion] 2023-12-20 21:48:28
>>xrd+Hi
> I think the adage about "a solution needs to be 10x other solutions to make someone switch" applies here.

Cheaper and faster is also better. The cheapest version of GPT-4 costs $0.01/$0.03 per 1K input/output tokens [1]. Mistral AI is charging 0.14€/0.42€ per ONE MILLION input/output tokens for their 7B model [2]. It's night and day.

If people can start fine-tuning a 7B model to do the same work they were doing with GPT-4, they will 100% switch.

[1]: https://help.openai.com/en/articles/7127956-how-much-does-gp...

[2]: https://docs.mistral.ai/platform/pricing/
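
To put those on the same per-million-token basis (treating USD and EUR as roughly equal, which is close enough for this comparison):

    # Back-of-the-envelope, per 1M tokens, using the prices quoted above.
    gpt4_in, gpt4_out = 0.01 * 1000, 0.03 * 1000   # $0.01/$0.03 per 1K -> $10/$30 per 1M
    mistral_in, mistral_out = 0.14, 0.42           # EUR 0.14/0.42 per 1M for Mistral 7B

    print(round(gpt4_in / mistral_in))             # ~71x cheaper on input tokens
    print(round(gpt4_out / mistral_out))           # ~71x cheaper on output tokens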

◧◩
35. mister+sm[view] [source] [discussion] 2023-12-20 21:58:44
>>nickth+Cb
https://chat.lmsys.org/?arena

Try a few blind comparisons: Mixtral 8x7B-Instruct and GPT-4 are 50-50 for me, and it outperforms 3.5 almost every time. You can run inference on it with a modern CPU and 64 GB of RAM on a personal device, lmfao. And the instruct fine-tuning has had nowhere near the $$$ and RLHF that OpenAI has. It's not a done deal, but people will be able to run models better than today's SOTA on <$1000 hardware in <3 months. I hope for their own sake that OpenAI is moving fast.

45. averev+Lo[view] [source] 2023-12-20 22:13:37
>>tosh+(OP)
Not a bad model. It becomes incoherent above 8k tokens, and it's not helped by the fact that it's very verbose, but it seems very coherent and stays closely on topic until then: https://chat.openai.com/share/089d1b8c-3467-4c01-af9f-6568c0...

It fails at math of course, even if the problem is very easy, like all Mistrals. Good for generation, probably not the best for RAG; there are Mistral tunes that stay coherent to 16k tokens, and that cuts down chunking significantly.

◧◩
46. kcorbi+Zo[view] [source] [discussion] 2023-12-20 22:14:50
>>nickth+Cb
(Post author here). Totally fair concern. I'll find some representative examples on a sample task we've done some fine-tuning on and add them to the post.

EDIT: Ok so the prompt and outputs are long enough that adding them to the post directly would be kind of onerous. But I didn't want to leave you waiting, so I copied an example into a Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25...

60. YetAno+wB[view] [source] 2023-12-20 23:40:14
>>tosh+(OP)
One thing that most people don't realize is that (full-parameter) fine-tuned models are costly unless you run them in batched mode. Which means that unless the request rate is very high and consistent, it is better to use prompts with GPT-3.5. E.g. at a batch size of 1, Mistral is more expensive than GPT-4 [1].

[1]: https://docs.mystic.ai/docs/mistral-ai-7b-vllm-fast-inferenc...
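
The intuition, with purely illustrative numbers (the GPU price and throughput below are assumptions, not measurements from the linked post): self-hosting cost is dominated by GPU-hours, so the cost per token falls roughly with batch size until the GPU saturates.

    # Illustrative only: assumed GPU rental price and throughput, not measured values.
    gpu_cost_per_hour = 1.50            # assumed $/hour for a GPU that fits Mistral 7B
    tokens_per_sec_at_bs1 = 40          # assumed generation speed at batch size 1

    def cost_per_1m_tokens(batch_size):
        # Batching raises throughput (sub-linearly in practice; linear here for simplicity).
        throughput = tokens_per_sec_at_bs1 * batch_size
        return gpu_cost_per_hour * (1_000_000 / throughput) / 3600

    for bs in (1, 8, 32):
        print(bs, round(cost_per_1m_tokens(bs), 2))   # ~$10.4, ~$1.3, ~$0.33 per 1M
    # At batch size 1 that's in GPT-4-input-price territory, far above GPT-3.5;
    # at a high, steady batch size it drops well below either.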

◧◩◪◨
73. potato+WO[view] [source] [discussion] 2023-12-21 01:19:01
>>moneyw+3l
Nope, they're using GPT for those

https://blogs.microsoft.com/blog/2023/08/22/microsoft-and-ep...

◧◩◪◨
77. coder5+RR[view] [source] [discussion] 2023-12-21 01:49:23
>>m3kw9+AR
Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked, but base models are raw. They have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often generate their own questions to answer, which they will then answer, followed by more questions and more answers.

When chat models are trained, they are first pre-trained (the "PT" in "GPT"), which creates a base model, then they are "fine tuned" (RLHF, aligned, whatever you want to call it).

A base model can be fine tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned with a collection of any inputs and the expected outputs, and learn how to do that specific task.

OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.

[0]: https://huggingface.co/datasets/Open-Orca/OpenOrca
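
As a concrete illustration of that last kind of fine-tuning (pairs of task inputs and the exact outputs you want), here is a minimal LoRA-style sketch with Hugging Face transformers + peft. The model name, hyperparameters, and the two toy examples are placeholders, not what OpenPipe actually uses:

    # Minimal sketch: task-specific fine-tuning of a base model on input/output pairs.
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "mistralai/Mistral-7B-v0.1"              # placeholder base model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

    examples = [  # each row is the task input plus the exact output the model should learn
        {"text": "Extract the city: 'Shipped from Berlin on Monday.'\nAnswer: Berlin"},
        {"text": "Extract the city: 'Our Paris office is hiring.'\nAnswer: Paris"},
    ]
    ds = Dataset.from_list(examples).map(
        lambda r: tok(r["text"], truncation=True, max_length=512), remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                               num_train_epochs=3, learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()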

◧◩◪
80. coder5+DS[view] [source] [discussion] 2023-12-21 01:59:54
>>Turing+CO
I haven't used the Llama 2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.

Something potentially helpful here: https://github.com/ggerganov/llama.cpp/discussions/2494

If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.
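
For anyone who wants to try the grammar route mentioned above, this is roughly what it looks like with llama-cpp-python (the GGUF path and the tiny schema are placeholders; llama.cpp also ships a ready-made grammars/json.gbnf you can load the same way):

    # Sketch of grammar-constrained generation with llama-cpp-python.
    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string(r'''
    root   ::= "{" ws "\"city\"" ws ":" ws string ws "}"
    string ::= "\"" [^"]* "\""
    ws     ::= [ \t\n]*
    ''')

    llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")   # placeholder path
    out = llm(
        "Extract the city as JSON: 'Our Paris office is hiring.'\n",
        grammar=grammar,     # sampling is constrained to strings matching the grammar
        max_tokens=64,
    )
    print(out["choices"][0]["text"])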

◧◩
85. airgap+pW[view] [source] [discussion] 2023-12-21 02:43:38
>>empora+8U
This is not so surprising if you consider the fact that finetuning is extremely sparse and barely imparts any new knowledge to the model. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch"[1] made this clear:

> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]

Insofar as those adaptations are mostly distinct, you can just preserve both sets, and that's what explains the success of merging, I guess.

1. https://arxiv.org/abs/2311.03099
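
The DARE trick itself is tiny: drop a random fraction p of the delta (fine-tuned minus base) parameters and rescale the survivors by 1/(1 - p). A numpy sketch with toy arrays standing in for real weight tensors:

    # DARE in miniature: drop-and-rescale the fine-tuning deltas (toy arrays, not real weights).
    import numpy as np

    rng = np.random.default_rng(0)
    base = rng.normal(size=1000)                          # stand-in for a base-model tensor
    finetuned = base + rng.normal(scale=0.01, size=1000)  # stand-in for the SFT weights

    p = 0.99                                              # drop rate; the paper pushes this to ~99%
    delta = finetuned - base
    mask = rng.random(delta.shape) >= p                   # keep ~1% of the delta parameters
    dare_delta = np.where(mask, delta, 0.0) / (1.0 - p)   # rescale survivors by 1/(1 - p)

    merged = base + dare_delta    # per the paper, performance is largely unchanged
    print(mask.mean())            # fraction of deltas kept, ~0.01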

95. daniel+od1[view] [source] 2023-12-21 06:02:30
>>tosh+(OP)
If anyone wants to fine-tune their own Mistral 7B model 2.2x faster and with 62% less memory, give our open-source package Unsloth a try! https://github.com/unslothai/unsloth thanks! :)
◧◩◪◨
98. averev+Lg1[view] [source] [discussion] 2023-12-21 06:35:24
>>Me1000+oS
It also has some training on problem decomposition. Many smaller models fail before writing the code; they fail when parsing the question.

You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks - this is Open Hermes 2.5: https://pastebin.com/raw/kr62Hybq

◧◩
100. MacsHe+5p1[view] [source] [discussion] 2023-12-21 08:26:24
>>YetAno+wB
I can cloud-host Mistral 7B for 20x cheaper than GPT-4-Turbo.

And the Mistral 7B API on OpenRouter is $0.00 per 1M tokens, i.e. free: https://openrouter.ai/models/mistralai/mistral-7b-instruct

[go to top]