zlacker

It doesn’t really follow instructions too well: if you ask it to list 10 songs or 5 things, it’ll give you way more. I’m not sure why. Some models do this well, like Mistral Instruct v1 and ChatGPT 3.5/4, but this one is extremely verbose and outputs like a short-circuited robot.

replies(3): >>coder5+W4 >>Turing+W7 >>fl0id+rf

>>m3kw9+(OP)
They released a base model. It is not instruction-tuned, so it won't really follow instructions unless you fine-tune it to do that.

"There are lots of Mistral fine-tunes. Why another one?

A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."

replies(1): >>m3kw9+Ua

>>m3kw9+(OP)
>> Doesn’t really follow instructions too well,

This is the biggest problem we're having when swapping LLMs. LangChain makes the swap itself easy, and we don't care as much about quality during integration testing, etc. The bigger problem is following directions. OpenAI does well at outputting JSON if I ask for it, and our software has now come to expect JSON output in such cases. Swap it for, say, llama2 and you don't get JSON even when asking for it. This makes swapping not just a quality decision but an integration challenge.
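One way to soften the integration side is to stop assuming the reply is pure JSON and pull the first JSON object out of whatever comes back. A minimal sketch, assuming the reply is a plain string; `extract_json_object` is a made-up helper name, not part of LangChain or any other library:

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object found in a model reply, or raise ValueError."""
    decoder = json.JSONDecoder()
    # Try every "{" as a possible start, since models often add chatter first.
    for start, char in enumerate(raw):
        if char != "{":
            continue
        try:
            value, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from here, keep scanning
        if isinstance(value, dict):
            return value
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the JSON:\n{"title": "x", "tags": ["a"]}\nHope that helps.'
parsed = extract_json_object(reply)
```

This only rescues replies that contain valid JSON somewhere. It does nothing for a model that ignores the request entirely, so the caller still needs a retry or fallback path for the `ValueError`.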

replies(2): >>coder5+Xb >>jay-ba+ep

>>coder5+W4
Then how come the base model can follow instructions somewhat, just not very well? Why won’t it follow them well?

replies(1): >>coder5+bb

>>m3kw9+Ua
Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked, but base models are raw. They have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often generate their own questions to answer, which they will then answer, followed by more questions and more answers.
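The "doesn't know when to stop" part is usually handled outside the model, by cutting the completion at a stop sequence. A rough sketch of that idea in plain Python; the `"\nQ:"` marker is an assumption about how the prompt was formatted, and many inference APIs accept stop sequences directly so you may not need to do this by hand:

```python
def truncate_at_stop(completion: str, stop_sequences: list[str]) -> str:
    """Cut a raw completion at the earliest stop sequence, if any appears."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].strip()


# A base model answering one question, then inventing the next one itself.
raw = " Paris.\n\nQ: What is the capital of Spain?\nA: Madrid."
answer = truncate_at_stop(raw, ["\nQ:"])  # -> "Paris."
```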

When chat models are trained, they are first pre-trained (the "P" in "GPT"), which creates a base model, then they are fine-tuned (RLHF, aligned, whatever you want to call it).

A base model can be fine-tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned with a collection of any inputs and the expected outputs, and learn how to do that specific task.
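To make "a collection of inputs and expected outputs" concrete, here is roughly what preparing such a dataset could look like as JSON Lines. The `prompt` and `completion` field names and the invoice example are assumptions for illustration; each fine-tuning tool defines its own schema, so check the one you actually use:

```python
import json


def to_jsonl(examples: list[tuple[str, dict]]) -> str:
    """Serialize (input text, expected output) pairs, one JSON record per line."""
    lines = []
    for text, expected in examples:
        record = {
            "prompt": text,
            # The target is stored as a JSON string the model should reproduce.
            "completion": json.dumps(expected, sort_keys=True),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


training_data = to_jsonl([
    ("Invoice 42 from Acme, total $10", {"vendor": "Acme", "total": 10}),
    ("Invoice 43 from Globex, total $25", {"vendor": "Globex", "total": 25}),
])
```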

OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.

[0]: https://huggingface.co/datasets/Open-Orca/OpenOrca

>>Turing+W7
I haven't used the llama2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.

Something potentially helpful here: https://github.com/ggerganov/llama.cpp/discussions/2494
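For intuition about why this gives a guarantee: at every decoding step, any token that would break the grammar is simply not eligible, no matter how much the model wants it. Below is a toy sketch of that idea, not how llama.cpp implements it. A finite list of allowed outputs stands in for a real grammar, and `chatty_score` is a made-up stand-in for a model's token scores:

```python
from typing import Callable


def constrained_decode(
    score: Callable[[str, str], float],
    vocabulary: list[str],
    allowed_outputs: list[str],
) -> str:
    """Greedy decoding where every step must stay a prefix of an allowed output."""
    text = ""
    while text not in allowed_outputs:
        # Keep only tokens that leave the text on track for some allowed output.
        candidates = [
            token
            for token in vocabulary
            if token and any(out.startswith(text + token) for out in allowed_outputs)
        ]
        if not candidates:
            raise ValueError("vocabulary cannot spell any allowed output")
        text += max(candidates, key=lambda token: score(text, token))
    return text


def chatty_score(prefix: str, token: str) -> float:
    """Hypothetical model that would rather chat than emit JSON."""
    preferences = {"Sure": 5.0, "!": 4.0, '{"ok": ': 1.0, "true": 0.5, "false": 0.4, "}": 0.1}
    return preferences.get(token, 0.0)


vocabulary = ["Sure", "!", '{"ok": ', "true", "false", "}"]
allowed = ['{"ok": true}', '{"ok": false}']
result = constrained_decode(chatty_score, vocabulary, allowed)  # -> '{"ok": true}'
```

The toy model scores "Sure" highest, but that token never becomes a candidate, so the output is valid JSON anyway. A real grammar describes an unbounded set of strings rather than a fixed list, and the loop stops at the first complete match, so this sketch would need more care if one allowed output were a prefix of another.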

If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.

>>m3kw9+(OP)
At least ChatGPT 3.5 also has that problem. Ask it to summarize in X sentences and chances are it’ll give you the wrong number.

>>Turing+W7
In my experience, Llama 2 (70B) can semi-reliably provide JSON output when provided with clear instructions and various distinct but similarly structured examples. It goes from “semi-reliably” to “consistently” when fine-tuned.
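For anyone wondering what "clear instructions and various distinct but similarly structured examples" might look like in practice, here is a minimal sketch of a few-shot prompt builder. The `Input:` / `Output:` layout and the example data are assumptions, not a format Llama 2 requires:

```python
import json


def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, dict]],
    new_input: str,
) -> str:
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for text, expected in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {json.dumps(expected)}")
        parts.append("")
    # End on an open "Output:" so the model's continuation is the JSON itself.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract the city and country as a JSON object. Reply with JSON only.",
    [
        ("I flew to Paris, France last week.", {"city": "Paris", "country": "France"}),
        ("Our office is in Osaka, Japan.", {"city": "Osaka", "country": "Japan"}),
    ],
    "She grew up in Lima, Peru.",
)
```

Every example added this way makes the prompt longer, so there is a direct trade-off between reliability and how much of the context window is left for the actual input.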

The primary issue I’ve run into is exhausting the context window much sooner than I’d like. Fine-tuning tends to mostly fix this issue though.