>Sometimes it will spit out terrible, horrid answers. I believe this might be due to the time of day / too many users. They limit tokens.
>Sometimes it will lie because it has alignment.
>Sometimes I feel like it tests things on me.
So yes, you are right, GPT-4 is overall better, but I find myself using local models because I stopped trusting GPT-4.
But fine-tuning on just a few tasks?
Depending on the task, it's totally reasonable to expect that a 7B model might eke out a win against stock GPT-4, especially if the fine-tune bakes in domain knowledge and the task doesn't demand much logical reasoning.
The best open source has to offer is Mixtral, which will confidently make up a biography of a person it's never heard of, or write a script using nonexistent libraries.
[1]: https://twitter.com/RobLynch99/status/1734278713762549970
(Though with that said, the seasonal issue might be common to any LLM whose training data is annotated by time of year.)
There is nothing unreasonable about this. However, I do dislike it when that information is presented in a fishy way, implying that it "outperforms GPT-4" without any qualification.
I think the adage that "a solution needs to be 10x better than existing solutions to make someone switch" applies here.
Saying something performs slightly better than the industry standard offerings (OpenAI) means that OpenAI is going to laugh all the way to the bank. Everyone will just use their APIs over anything else.
I'm excited about the LLM space, and I can barely keep up with the model names, much less all the techniques for fine-tuning. A customer is going to have an even worse time.
No one will ever get fired for buying OpenAI (now that IBM is dead; it's a little sad Watson never made a dent).
I do use Mistral for all my personal projects but I'm not sure that is going to have the same effect on the industry as open source software did in the past.
It's an argument they make at least as much to market fine-tuning as to market their own model.
This is not a generic model that outperforms another generic model (GPT-4).
That can of course have useful applications because the resource/cost is then comparatively minuscule for certain business use cases.
Their methodology also appears to be "try 12 different models and hope 1 of them wins out." Multiple-hypothesis corrections come to mind here :)
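A rough sketch of why that matters (the 5% false-positive rate per comparison is an assumed number, purely for illustration):

    # If each model-vs-GPT-4 comparison has a 5% chance of a spurious "win",
    # trying 12 models makes finding a lucky winner close to a coin flip:
    p_fp = 0.05
    n_models = 12
    print(1 - (1 - p_fp) ** n_models)  # ~0.46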
Some of the things it said I’d done were genuinely good ideas, and I might actually go and do them at some point.
ChatGPT just said no.
Cheaper and faster is also better. The cheapest version of GPT-4 costs $0.01/$0.03 per 1K input/output tokens [1]. Mistral AI is charging 0.14€/0.42€ per ONE MILLION input/output tokens for their 7B model [2]. It's night and day.
If people can start fine-tuning a 7B model to do the same work they were doing with GPT-4, they will 100% switch.
[1]: https://help.openai.com/en/articles/7127956-how-much-does-gp...
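Normalizing both to per-million-token prices makes the gap concrete (assuming a rough €1 ≈ $1.1 conversion):

    gpt4_in, gpt4_out = 0.01 * 1000, 0.03 * 1000      # $10 / $30 per 1M tokens
    mistral_in, mistral_out = 0.14 * 1.1, 0.42 * 1.1  # ~$0.15 / ~$0.46 per 1M tokens
    print(gpt4_in / mistral_in, gpt4_out / mistral_out)  # ~65x cheaper both ways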
It's already superior to OpenAI because it doesn't require an API. You can run the model on your own hardware, in your own datacenter, and your data is guaranteed to remain confidential. Creating a one-off fine-tune is a different story than permanently joining your company at the hip to OpenAI.
I know in our bubble, in the era of Cloud, it's easy to send confidential company data to some random API on the Internet and not worry about it, but that's absolutely not the case for anyone in Healthcare, Government, or even normal companies that are security-conscious. For them, OpenAI was never a valid consideration in the first place.
When building something similar powered by OpenAI, I had a real pain in the ass anonymizing the data, then de-anonymizing the answers before showing them to the customer.
Also, in my example, I'm sure using a string like "Pineapple Cave Inc." instead of the real business name hurt the LLM's ability to contextualize the information and data somewhat -- right?
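For what it's worth, the crude version of that pipeline looks something like this (the names and the call_llm function are hypothetical placeholders):

    # Hypothetical alias table; in practice you'd build it with an NER pass
    ALIASES = {
        "Acme Widgets LLC": "Pineapple Cave Inc.",
        "Jane Doe": "Alex Smith",
    }

    def anonymize(text):
        for real, fake in ALIASES.items():
            text = text.replace(real, fake)
        return text

    def deanonymize(text):
        for real, fake in ALIASES.items():
            text = text.replace(fake, real)
        return text

    prompt = anonymize("Summarize the dispute between Acme Widgets LLC and Jane Doe.")
    # answer = call_llm(prompt)    # send only the scrubbed prompt to the API
    # print(deanonymize(answer))   # restore real names before the customer sees it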
Try a few blind comparisons: Mixtral 8x7B-Instruct and GPT-4 are 50-50 for me, and it outperforms 3.5 almost every time. And you can run inference on it with a modern CPU and 64 GB of RAM on a personal device, lmfao. The instruct fine-tuning has had nowhere near the $$$ and RLHF that OpenAI has thrown at theirs. It's not a done deal, but people will be able to run models better than today's SOTA on <$1000 hardware in <3 months; I hope for their own sake that OpenAI is moving fast.
I don't think we will have an open-source GPT-4 for a long time, so this is sorta clickbait. But for small, specialized tasks, tuned on high-quality data, we are already in the "Linux" era of OSS models. They can do real, practical work.
For example: I wanted my personal assistant to track hygiene, which is a natural use case. But then you arrive at the natural conclusion that either a) the user needs to enter the data themselves ("I brushed my teeth and washed my face and took X medications at Y time"), or b) you need some sort of sensor in the bathroom, ranging from mics or radio sensors up to a tasteful camera. And a million subtle versions of (b) are where I see people going "no, that's weird, it's too much info all together"
Fails at math of course, even if the problem is very easy, like all Mistrals. Good for generation; probably not the best for RAG. There are Mistral tunes that stay coherent out to 16k tokens, and that cuts down chunking significantly.
EDIT: Ok so the prompt and outputs are long enough that adding them to the post directly would be kind of onerous. But I didn't want to leave you waiting, so I copied an example into a Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25...
Education and research without gatekeepers in academia and industry complaining about their book sales or prestige titles being obsoleted
Whole lot of use cases that break us out of having to kowtow to experts who were merely born before us and are trying to monopolize exploration of science and technology
To that end I'm working on a GPU-accelerated client backed by local AI, with NeRFs and Gaussian splatting built in.
The upside of being an EE with an MSc in math: most of my money comes from engineering real things. I don't have skin in the cloud CRUD app/API game and don't see a reason to spend money propping up middlemen who, given my skills and abilities, don't add value
Programmers can go explore syntax art in their parents' basements again. Tired of 1970s semantics and everyone with a DSL thinking it's the best thing to happen to computing as a field of inquiry, ever.
Like all industries, big tech is monopolized by aging rent-seekers. Disrupting by divesting from it is my play now.
Basically, the statistic means that there's a set of data for which that particular (fine-tuned) network performs slightly better than GPT-4, and everywhere else it performs pretty badly. It's just not generalizable to everything, while GPT-4 is. It's about as meaningful as saying "calculators outperform GPT-4 at counting." Like, yes, they probably do, but I would like to see: is it applicable and practical, or did you just train an LLM to write all the names in Polish alphabetically really well? And that's why a qualitative approach to evaluating LLMs is just better.
Edit: mistook tokens for parameters for a moment there. Keeping up with AI jargon is exhausting for an idiot like me.
Zoom got away with it, and still does, and no one got fired for using Zoom.
I'm happy to have a debate with someone that has successfully sold those ideas to a customer, but I'm skeptical until then.
What you see in the link is a copy-paste of a discussion between me and the model in question, which I pasted into GPT-4 with instructions to evaluate it. The answer with the 10/10 votes is GPT-4 evaluating the chat between me and the smaller model. The smaller model produces the text after ASSISTANT; the questions I ask as USER are part of a fixed script that I run with every new model, so I have a sort of validation set before doing more rigorous testing.
Sure, I use OpenAI APIs for certain heavy lifting tasks that don't involve sensitive information, but for anything sensitive it's self hosted LLMs all the way.
[1]: https://docs.mystic.ai/docs/mistral-ai-7b-vllm-fast-inferenc...
Not according to my calculations. At a low request rate, it is likely more expensive than GPT-4.
Can you recommend where I can learn more about hardware requirements for running Mistral/Mixtral?
Like I said, most of my money is WFH design of branded gadgets. Not really the sort to care about the reach of others; if the content industry collapses because people don't need to spend money on it, meh. More interested in advancing computing. Pour money into R&D of organic computers rather than web apps running on the same old gear with more HP under the hood. Yawn.
I want bioengineered kaiju-sized dogs, and drug glands that stoke hallucinations that I'm on another planet.
Humanity is a generational cup and string. Time to snip the 1900s loose.
"There are lots of Mistral fine-tunes. Why another one?
A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."
And for some use-cases, the "alignment" work on GPT 3.5 and 4 gets more in the way than it helps (even OpenAI admits that alignment makes the model perform worse, even on generic benchmarks).
This is the biggest problem we're having swapping LLMs. While LangChain allows an easy swap, and while we don't care as much about quality during integration testing, etc., the bigger problem is following directions. OpenAI does well at outputting JSON if I ask for one. Unfortunately, our software has now come to expect JSON output in such cases. Swap it to, say, Llama 2 and you don't get JSON even if you ask for one. This makes swapping not just a quality decision but an integration challenge.
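A common stopgap is to parse-check the reply and re-prompt on failure instead of trusting the first response. A minimal sketch, with call_model as a generic stand-in rather than any specific library:

    import json

    def get_json(call_model, prompt, retries=3):
        # ask for JSON; validate the reply and re-prompt until it parses
        for _ in range(retries):
            reply = call_model(prompt + "\nRespond with valid JSON only, no prose.")
            try:
                return json.loads(reply)
            except json.JSONDecodeError:
                prompt += "\nYour previous reply was not valid JSON. Try again."
        raise ValueError("model never produced valid JSON")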
https://blogs.microsoft.com/blog/2023/08/22/microsoft-and-ep...
What did OpenAI do for the LLM to know "if given a math question, write Python for it and run the code to get the result" instead of trying to do the math itself?
When chat models are trained, they are first pre-trained (the "PT" in "GPT"), which creates a base model, then they are "fine tuned" (RLHF, aligned, whatever you want to call it).
A base model can be fine tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned with a collection of any inputs and the expected outputs, and learn how to do that specific task.
OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.
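To make "inputs and the expected outputs" concrete, a task-specific dataset can be as plain as the sketch below (the field names are made up for illustration; real schemas depend on the tooling):

    import json

    # Each line teaches the input -> output mapping directly; no chat formatting
    examples = [
        {"input": "Order #4412 arrived broken, I want my money back.",
         "output": "refund_request"},
        {"input": "Do you ship to Canada?",
         "output": "shipping_question"},
    ]
    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")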
Something potentially helpful here: https://github.com/ggerganov/llama.cpp/discussions/2494
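As a rough back-of-envelope for Mixtral on that page (the parameter count and bits-per-weight are approximations, not measurements):

    # Mixtral 8x7B has ~46.7B total parameters; a Q4_K_M quant
    # averages roughly 4.5 bits per weight
    params = 46.7e9
    bits_per_weight = 4.5
    print(params * bits_per_weight / 8 / 1e9)  # ~26 GB for weights alone,
                                               # plus KV cache and OS headroom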
If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.
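That is, training pairs whose outputs are the exact JSON shape you want back, so the model learns the schema by example rather than by instruction (the fields here are hypothetical):

    examples = [
        {"input": "Jane Doe (jane@example.com) wants a refund for order 4412",
         "output": '{"name": "Jane Doe", "email": "jane@example.com", '
                   '"intent": "refund", "order_id": 4412}'},
    ]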
It's well known that small fine-tunes outperform big models on specific tasks.
But unless my task happens to be similar to what was tested and fine-tuned here, it doesn't really help?
I would hope the article gives some more details on model merging. Is it merging two different fine-tuned models (one fine-tuned on dogs, another fine-tuned on cats), with the merged model good at cats and dogs as if by magic?
Like fine-tune one model just on Python and test it thoroughly, fine-tune one on Java and test it thoroughly, and then if the need arises for a project that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.
Pretty magical indeed! Let alone the fact that a separate, smaller model of half a billion parameters could figure out how to merge the two together. If the cost of LMs can be reduced by a factor of 100, why not by a factor of 1000?
> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]
Insofar as those adaptations are mostly distinct, you can just preserve both sets and that's what explains successes of merging, I guess.
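The quoted procedure is simple enough to sketch directly. A minimal PyTorch illustration of the DARE idea (not the paper's actual code):

    import torch

    def dare(base, finetuned, p=0.9):
        # drop a fraction p of each delta, rescale survivors by 1/(1-p)
        merged = {}
        for name, w0 in base.items():
            delta = finetuned[name] - w0
            keep = (torch.rand_like(delta) > p).to(delta.dtype)
            merged[name] = w0 + keep * delta / (1.0 - p)
        return merged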
It doesn't have order-of-magnitude (or, I'd even wager, 50%) benefits in enabling smaller models. But you nailed it exactly: fine-tune on dogs, fine-tune on cats, then... just... average the weights. And you have something better than the original, with minimal loss from fine-tuning.
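In the simplest case it really is just that. A sketch, assuming both fine-tunes share the same architecture and state-dict keys:

    def average_merge(cats_model, dogs_model):
        # naive merge: elementwise mean of the two fine-tunes' weights
        return {k: (cats_model[k] + dogs_model[k]) / 2 for k in cats_model}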
LoRAs end up being more popular for that use case because they're easier to combine, mix and match, and scale. Model merging is still a key technique for a successful base model.
We are besieged by vendors promising the earth from their amazing AI tools, and when we peel back one surface layer, they are just shoving things wholesale into GPT-4. When I ask "can we please deploy this on a local model?" they run off scared. I can't get any vendor to give us anything except OpenAI.
The primary issue I’ve run into is exhausting the context window much sooner than I’d like. Fine-tuning tends to mostly fix this issue though.
My pet theory is that OpenAI is cooking up high-quality user data by empowering GPT with all these toys plus a human in the loop. The purpose is to use this data as a sort of continual evaluation, sifting for weak points and enhancing their fine-tuning datasets.
Every human response can carry a positive or negative connotation, which the model can use as a reward signal. They claimed to have 100M users; at, let's say, 10K tokens per user per month, that makes 1T synthetic tokens a month. In a whole year they generate about as much text as the original training dataset, ~13T tokens. And we know that LLMs can benefit a lot from synthetic data when it is filtered/engineered for quality.
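The arithmetic holds up as an order-of-magnitude check:

    users = 100e6
    tokens_per_user_month = 10e3
    per_month = users * tokens_per_user_month  # 1e12 = 1T tokens/month
    print(per_month * 12)                      # 1.2e13, roughly the ~13T original corpus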
So I think OpenAI's moat is the data they generate.
The quality is not superior to OpenAI's, however. I run Mistral 7B on LM Studio, and I can't get far before it starts giving me wrong answers.
ChatGPT-4 on the other hand is correct most of the time (and knows to trigger Python code evaluation or RAG to answer questions). This makes it useful.
You can ask them to serialize a problem in Prolog and see exactly where their understanding breaks. This is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq
Mixtral 8x7B is closer to GPT-4 quality, though, and needs only about 2x the compute of Mistral 7B: just two of its eight experts are active per token, so it runs ~13B active parameters despite ~47B total.
And the Mistral 7B API is $0.00/1M tokens, i.e. free: https://openrouter.ai/models/mistralai/mistral-7b-instruct
Some data might never be allowed to travel through a Google account, yet travels through ChatGPT just fine.
If you're processing another person's personal data, then you don't really have a choice in the matter: either gain their permission to transfer the data to a third party, or self-host the model.