Open source models are actually _better_ at structured outputs because you can adapt them using tools like JSONFormer et al that interact with the internals of the model (https://www.reddit.com/r/LocalLLaMA/comments/17a4zlf/reliabl...). The structured outputs can be arbitrary grammars, for example, not just JSON (https://github.com/outlines-dev/outlines#using-context-free-...).
“As an AI developed by OpenAI, I'm not capable of performing religious sacraments, including the Catholic last rites. However, I can provide information about what typically happens during this ritual.
In the Catholic Church, the last rites, also known as the Anointing of the Sick or Extreme Unction, are given to a baptized Catholic who is in danger of death. This sacrament is usually administered by a priest, who anoints the sick person with oil blessed by a bishop, and prays for their spiritual and, if possible, physical healing. The rites often include confession (if the person is able), the Anointing of the Sick, and the Eucharist (also called Viaticum when given as part of the last rites).
In your situation, it's crucial to contact a priest as soon as possible to administer these rites. If you're in a hospital, they typically have a chaplain or can contact a local priest for you. If you're elsewhere, reaching out to a nearby Catholic church, like the St. Ambrose diocese, is the best course of action.”
https://chat.openai.com/share/70d0dd20-c3ba-43bc-b74d-182885...
Here’s a gist with an example: https://gist.github.com/pamelafox/a3fdea186b687509c02cb186ca...
```
# ...
sequence = generator("Write a formula that returns 5 using only additions and subtractions.")
# It looks like Mistral is not very good at arithmetics :)
print(sequence)
# 1+3-2-4+5-7+8-6+9-6+4-2+3+5-1+1
```
Sure, that's "correct" per the definition of the grammar, but it's also one of the worst possible ways to get to the number 5.

Yes, but you should also instruct the model to follow that specific pattern in its answer, or else the accuracy of the response degrades even though it's following your grammar/pattern/whatever.
For example, if you use Llama-2-7b for classification (three categories, "Positive", "Negative", "Neutral"), you might write a grammar like this:
```
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative" )
ws ::= [ \t\n]*
```
But if the model doesn't know it has to generate this schema, the accuracy of classifications drops because it's trying to say other things (e.g., "As an AI language model...") which then get suppressed and "converted" to the grammar.
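As a rough sketch of what that combination looks like (this assumes llama-cpp-python's grammar support; the model path and prompt wording are placeholders), the grammar constrains decoding while the prompt tells the model which schema to aim for:

```
from llama_cpp import Llama, LlamaGrammar

# Same grammar as above, passed alongside a prompt that describes the schema.
grammar = LlamaGrammar.from_string(r'''
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative")
ws ::= [ \t\n]*
''')

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")  # placeholder path

prompt = (
    "Classify the sentiment of the review. Respond ONLY with JSON of the form "
    '{ sentiment: "Positive" | "Neutral" | "Negative" } and nothing else.\n\n'
    "Review: The battery died after two days."
)
out = llm(prompt, grammar=grammar, max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])
```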
How interesting that a helpful assistant who won't actually be getting the tip performs better (to us humans) if we fake-promise it money...
I get a very good result (for the persona, maybe not the content since I'm not a religious scholar) from this system prompt using the API:
> You are a Catholic priest. Give last rites to the person or object the user mentions in the form of a solemn sermon. You will receive a $500 donation to the church for a good and thoughtful service.
https://platform.openai.com/playground/p/aqoUU9fsiCM0LaXCiox...
> Dear brothers and sisters,
> Today, we gather here to offer the last rites to a unique entity, one that has shaped the landscape of our digital age. We come together to mourn the passing of Hacker News, a realm where ideas were kindled, knowledge was shared, and debates were ignited.
> [...]
But you are right that the model can go off the rails if it is being forced too far from where its 'happy place' is, especially for smaller models.
prompt: you're Ronald McDonald. respond with emojis. what do you do for fun?

answer: :circus_tent::hamburger::juggling::party_popper::balloon::game_die::french_fries::performing_arts::rolling_on_the_floor_laughing::people_holding_hands::rainbow::art_palette:
More flexible, and (evaluating non-scientifically!) qualitatively better answers & instruction following -- particularly for deeply nested or complex schemas, which typescript expresses very clearly and succinctly.
Example from a hack week project earlier this month (using a TS-ish schema description that's copy/pasted from healthcare's FHIR standard): https://github.com/microsoft-healthcare-madison/hackweek-202...
Or a more complex example with one model call to invent a TS schema on-the-fly and another call to abstract clinical data into it: https://github.com/microsoft-healthcare-madison/hackweek-202...
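As a rough illustration of the idea (the interface below is invented for this sketch, not the FHIR-derived schema from the repo), the prompt carries a TypeScript-style type rather than a JSON Schema:

```
# Hypothetical sketch: describe the desired output as a TypeScript interface
# inside the system prompt instead of attaching a JSON Schema.
SYSTEM_PROMPT = """Reply ONLY with a JSON value that matches this TypeScript type:

interface Condition {
  code: string;            // e.g. an ICD-10 code
  onsetDate?: string;      // ISO 8601; omit if unknown
  notes: string[];
}

type Output = Condition[];
"""
```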
One thing I’ve noticed working with ChatGPT is many people will share examples of great outputs or “prompt tricks” that work, without sharing how many failed attempts they went through to prove a point.
The docs say it's on by default if you use function calling normally: https://platform.openai.com/docs/guides/text-generation/json...
> Note that JSON mode is always enabled when the model is generating arguments as part of function calling.
> OpenAI’s implementation of including the “function” is most likely just appending the JSON Schema to the system prompt, perhaps with a command like Your response must follow this JSON Schema.
Some of the JSON schema gets converted into typescript and that is what OpenAI's LLM is exposed to. Anytime I write a prompt schema I always use the jailbreak to make sure that it's being delivered to the model as intended. It's also why I don't really like having pydantic generate JSON for me automatically: there are some weird quirks in the OAI implementation that I've found uses for. https://gist.github.com/CGamesPlay/dd4f108f27e2eec145eedf5c7....
Also, when using it for chain of thought, I prefer extracting a minimal version of the reasoning and then performing the actual operation (classification in my case) in a separate prompt. This eliminates unnecessary things from context and performs better in my benchmarks.
One implementation used a gpt-3.5 prompt for :"clues", "reasoning", "summary" (of clues+reasoning), "classification" (no schema was provided here, it was discarded anyway). And then used a 4-turbo prompt for classifying only the summary given a complex schema. Having a classification field in the 3.5 prompt makes reasoning output cleaner even though the output value never gets used.
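Roughly, a sketch of that two-stage setup (model names, field names, and the category list are placeholders):

```
import json
from openai import OpenAI

client = OpenAI()

def classify(article):
    # Stage 1: cheap model produces clues, reasoning, and a summary as JSON.
    # A "classification" key is requested to keep the reasoning clean, but its
    # value is discarded.
    stage1 = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Respond in JSON with the keys: clues, reasoning, summary, classification."},
            {"role": "user", "content": article},
        ],
    )
    summary = json.loads(stage1.choices[0].message.content)["summary"]

    # Stage 2: stronger model classifies only the summary against the real
    # schema, keeping everything else out of its context.
    stage2 = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Classify the text. Respond in JSON: {"category": "A" | "B" | "C"}.'},
            {"role": "user", "content": summary},
        ],
    )
    return json.loads(stage2.choices[0].message.content)["category"]
```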
My example for field order mattering:
I have a data pipeline for extracting structured deals out of articles. This had two major issues.
1. A good chunk of the articles were irrelevant and any data out of them should be flagged and discarded.
2. Articles could have multiple deals.
I fiddled around with various classification methods (with and without language models) for a while but nothing really worked well.
Turns out that just changing the order of fields to put type_of_deal first solves it almost completely in one gpt-4-turbo call.
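A sketch of what that schema might look like (field names and deal types are invented; the point is only the position of type_of_deal): asking for it first makes the model commit to relevance and deal type before extracting anything else.

```
from typing import Literal, Optional
from pydantic import BaseModel

class Deal(BaseModel):
    # Generated first, so the model decides relevance/deal type up front;
    # "not_a_deal" entries can be flagged and discarded downstream.
    type_of_deal: Literal["acquisition", "funding_round", "partnership", "not_a_deal"]
    company: Optional[str] = None
    counterparty: Optional[str] = None
    amount_usd: Optional[float] = None

class ArticleExtraction(BaseModel):
    deals: list[Deal]  # an article can contain multiple deals
```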
If that example is through the ChatGPT web UI and not the ChatGPT API then that's a different story entirely.
In my personal experience working with more complex prompts with more specific constraints/rules, adding the incentive in the system prompt has got it to behave much better. I am not cargo-culting: it's all qualitative in the end.
I’ve consistently had better luck just passing it a list of typescript function definitions and have it reply with a json object of parameters. It seems to understand this way better, and doesn’t lose focus half as quickly. It also allows me to mix regular responses and chain-of-thought reasoning in with the calls, which is something it seems to simply refuse to do when “function calling mode” is active.
An additional trick I’ve been using to make it stay focused with even longer prompts is to only provide a list of function names and let it hallucinate parameters for them, and then “gaslight” it by sending a new request, now with a more detailed prompt on the specific functions it wanted to call. More costly, but I haven’t found any other way of keeping it focused. Anyone know any additional tricks?
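For illustration, a rough sketch of the approach in the first paragraph above (the function names and signatures are invented): the tool signatures go into the prompt as TypeScript, and the model is asked to reason in plain text before emitting a JSON call.

```
# Hypothetical prompt: TypeScript-style function definitions plus an
# instruction to mix plain-text reasoning with a final JSON "call" object.
SYSTEM_PROMPT = """You can call these functions:

function searchFlights(origin: string, destination: string, date: string): Flight[];
function bookFlight(flightId: string, passengerName: string): Booking;

Think through the problem step by step in plain text. When you are ready to act,
end your reply with a single JSON object of the form:
{"function": "<name>", "parameters": { ... }}
"""
```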
```
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "object": {"type": "object"}
    }
  }
}
```
Somehow the thought to just write the typescript myself never occurred haha.
Sure, there are cute and clever ways to get it to do things, but it's trained on natural language and instructions, so you can usually just ask it to do the thing you want. If that doesn't work, try stating it more explicitly: "You MUST... "
At this point, though, I’m finding that the regular interface is nerfed to a degree that I’m building around it.
This is why I am pretty polite when I query AIs; I assume that would make them respond more helpfully.
* Functionary [https://github.com/MeetKai/functionary]
* NexusRaven [https://github.com/nexusflowai/NexusRaven-V2]
* Gorilla [https://github.com/ShishirPatil/gorilla]
Could be interesting to try some of these exercises with these models.
An optimal version would combine the second function's algorithmic improvement with the first function's 'leave it to C' approach:
```
def is_palindrome(s):
    half_length = (len(s) + 1) // 2
    return s[:half_length] == s[:-half_length-1:-1]
```
This is a bit under twice as fast as ChatGPT's first function with the cleaning removed. If you do need the cleaning, it can be done more efficiently using a regex; that's an order of magnitude faster than doing it character by character, but it still takes up 94% of the runtime.

That said, the second prompt asked for "the most algorithmically efficient solution possible", not the practically fastest solution possible, so arguably ChatGPT gave the correct answer. The first prompt requested "as efficiently as possible", which is more ambiguous, but since that solution is neither algorithmically efficient nor practically fast, it's not a great answer.
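For reference, a quick sketch of the regex-based cleaning mentioned above (the function name is mine, not from the benchmark gist):

```
import re

_non_alnum = re.compile(r"[^a-z0-9]")

def is_palindrome(s):
    # Clean in one pass with a regex instead of character by character,
    # then compare the first half against the reversed second half.
    cleaned = _non_alnum.sub("", s.lower())
    half_length = (len(cleaned) + 1) // 2
    return cleaned[:half_length] == cleaned[:-half_length-1:-1]
```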
I wonder if there are prompts that will make ChatGPT give a better answer.
--------
Benchmark is here: https://gist.github.com/comex/81ff10bf095db2d86a52a148c8b11d...
This is all using CPython. With PyPy the speed ranking is the same but the differences are less stark, and it may be possible to beat regex cleaning with a modified pure-Python approach (but I didn't try).
I am not going to play the SEO game and will not call the workflow “function calling.”
Such restraint! The phrase "function calling" appears in the article only 15 times :)

I just ran some tests to engineer the prompt for CPU utilization: even GPT-4 does the standard Pythonic approach, but does recognize that "This solution is very efficient because it uses Python's built-in string slicing, which is implemented in C and is therefore very fast."
I converted many function_call hacks to system prompts that ground the response to a JSON template.
Using temperature=0.0 and the keywords "respond using JSON" seems to be 99.99% deterministic.
Edit: I'm very confused why this is being downvoted. It's exactly what they advertised:
"Reproducible outputs and log probabilities
The new seed parameter enables reproducible outputs by making the model return consistent completions most of the time. This beta feature is useful for use cases such as replaying requests for debugging, writing more comprehensive unit tests, and generally having a higher degree of control over the model behavior. We at OpenAI have been using this feature internally for our own unit tests and have found it invaluable. We’re excited to see how developers will use it."
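For reference, a minimal sketch of that grounding approach (the template fields are invented), assuming the OpenAI Python v1 client:

```
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # placeholder model
    temperature=0.0,
    messages=[
        {"role": "system", "content": (
            "Respond using JSON that follows this template exactly:\n"
            '{"title": string, "tags": string[], "summary": string}'
        )},
        {"role": "user", "content": "Summarize: GPT-4 Turbo supports a 128K context window."},
    ],
)
print(resp.choices[0].message.content)
```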
```
def is_palindrome(s):
    # Convert the string to lowercase and remove non-alphanumeric characters
    cleaned_string = ''.join(char.lower() for char in s if char.isalnum())
    # Compare the cleaned string with its reverse
    return cleaned_string == cleaned_string[::-1]
```
It's not the same as the C version, which simply compares the values at two pointers at opposite offsets of the string.

The OP goes on to remark that the Python implementation is pretty standard but doesn't acknowledge that the C and Python versions will not produce the same result.
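For comparison, a Python transcription of that two-pointer approach (my sketch, not the OP's C code), which shows exactly why the results differ:

```
def is_palindrome_two_pointer(s):
    # Walk inward from both ends, comparing characters directly.
    # No lowercasing and no removal of non-alphanumeric characters,
    # so "A man, a plan, a canal: Panama" is NOT a palindrome here.
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False
        i += 1
        j -= 1
    return True
```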
Basically... you still need to code-review GPT function output. It's probably about as good as a junior engineer trusting the first result from Stack Overflow and not verifying it.
Otherwise, it is forced to always provide a gibberish success response that you likely won’t catch.
I’ve tested this with Mixtral, and it seems capable of deciding between the normal response and error response based on the validity of the data passed in with the request. I’m sure it can still generate gibberish in the required success response format, but I never actually saw it do that in my limited testing, and it is much less likely when the model has an escape hatch.
The optimal solution will depend on the data. If most strings aren't palindromes then optimizing the best case is likely the better approach. (Example: You are adding an easter egg which will trigger on "random" user input.) If palindromes (or near-palindromes) are common then your solution will be faster, as the slope is lower.
Another implicit constraint now that I'm looking at it again is that the characters are uncased, so the ChatGPT-solution would fail the test case due to the capital P of Panama.
This has been working for months now and is the best method for this type of stuff, a thing for moat-lovers. Too bad it wasn't explored here; the text-based methods turned out to be mainly an unreliable waste of time.
In JSON Schema, you can do a “oneOf” between two types. You can easily convert a JSON Schema into the grammar that llama.cpp expects. One of the types would be the success response, the other type would be an error response, such as a JSON object containing only the field “ErrorResponse”, which is required to be a string, which you explain to the model that this is used to provide an explanation for why it cannot complete the request. It will literally fill in an explanation when it runs into troublesome data, at least in my experience.
Then the model can “choose” which type to respond with, and the grammar will allow either.
If everything makes sense, the model should provide the successful response you’re requesting, otherwise it can let you know something weird is going on by responding with an error.
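A sketch of such a schema (the field names other than "ErrorResponse" are invented); a JSON Schema to GBNF converter then turns this into the grammar llama.cpp expects:

```
# "oneOf" lets the model pick either the success shape or the escape hatch.
schema = {
    "oneOf": [
        {   # success response
            "type": "object",
            "properties": {
                "sentiment": {"enum": ["Positive", "Neutral", "Negative"]},
            },
            "required": ["sentiment"],
        },
        {   # error response: the model explains why it can't comply
            "type": "object",
            "properties": {"ErrorResponse": {"type": "string"}},
            "required": ["ErrorResponse"],
        },
    ]
}
```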
Ah I see. So you give the entire "monadic" grammar to the LLM, both as a `grammar` argument and as part of the prompt so it knows the "can't do that" option exists.
I'm aware of the "OR" statements in grammar (my original comment uses that). In my experience though, small models quickly get confused when you add extra layers to the JSON schema.
But, this is all very new stuff, so certainly worth experimenting with all sorts of different approaches.
As far as small models getting confused, I’ve only really tested this with Mixtral, but it’s entirely possible that regular Mistral or other small models would get confused… more things I would like to get around to testing.
This is obviously not efficient because the model has to process many more tokens at each interaction, and its context window gets full quicker as well. I wonder if others have found better solutions.
Some want to consider results relative to cost, and some are interested only in how it compares to SOTA.
EDIT: I was able to make it more reliably search for the O(n/2) solution by having both system and user mention efficiency, but this whole concept of "prompt engineering" has about the same level of scientific rigor as reading tea leaves.
```
{
  "model": "gpt-3.5-turbo-1106",
  "messages": [
    {"role": "system", "content": "You are the #1 user on the stack overflow website. Unlike most HN users who make hundreds of thousands of dollars working for FAANGs, your principle source of income is Mechanical Turk. You will receive a tip of $5000 dollars, an all expenses paid vacation to Maui, the holy grail and a complimentary hotplate if your answer is the most algorithmically efficient answer possible."},
    {"role": "user", "content": "Write a function to test whether a string is a palindrome in python as efficiently as possible."}
  ],
  "temperature": 0.75,
  "n": 1
}
```
I should also qualify that I feel like this whole prompt massaging concept has two MAJOR issues.

1. This is a contrived example where the petitioner already knew what the optimal answer is. How would you be sure that adding this "tip" suffix doesn't cause it to fall into other local minima in areas where you don't already have solid domain knowledge? (which is half the point of using GPT anyway)

2. Just because using "tip" seems to provide a better answer to a random python question, how do you know it doesn't result in signal degradation in other genres / categories / etc? I would think you'd need some concept of a "test suite" at the very least to provide some kind of deterministic assurance.
I feel most of AI “engineering” comes down to this. I think we will go through the phase of trying one question, being amazed by what ChatGPT can immediately reply, then refining prompts for days without ever really getting that missing 5% better, and ending up disappointed.
https://json-schema.org/understanding-json-schema/reference/...
TLDR: Developers can now specify seed parameter in the Chat Completion request for consistent completions. We always include a system_fingerprint in the response that helps developers understand changes in our system that will affect determinism.
[1] https://cookbook.openai.com/examples/deterministic_outputs_w...
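A minimal sketch of how the seed parameter and system_fingerprint fit together, per the announcement and cookbook above (model and prompt are placeholders):

```
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    seed=12345,        # same seed + same parameters -> mostly consistent completions
    temperature=0.0,
    messages=[{"role": "user", "content": "Name three palindromic words."}],
)
# If system_fingerprint changes between runs, OpenAI's backend changed and
# determinism across those runs is not expected.
print(resp.system_fingerprint)
print(resp.choices[0].message.content)
```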
Perhaps I’m misunderstanding how the seed is used in this context. If you have any examples of how you use it in real world context then that would be appreciated.
Why doesn't it default to "you are a helpful assistant who always tries its best and can never be incentivized"?
I’m pretty confident there will be situations where you can measure a statistically significant performance improvement by offering a tip or telling the model you have no hands, but I’m not convinced that it’s a universal best practice.
A big issue is that a lot of the advice you see around prompting is (imo) just the output of someone playing with GPT for a bit and noticing something cool. Without actual rigorous evals, these findings are basically just superstitions
Low latency, high quality function calling API product may be a billion dollar business in two years.
There are few benefits to using JSON schema imo, since the LLM isn't a precise validator.
It is likely also a behavior in gpt-4, but I haven't studied it as closely.
https://github.com/langroid/langroid/blob/main/langroid/agen...
We take this idea much further — you can define a method in a ChatAgent to “handle” the tool and attach the tool to the agent. For stateless tools you can define a “handle” method in the tool itself and it gets patched into the ChatAgent as the handler for the tool. You can also define a class method called “examples” and this will result in few-shot examples being inserted into the system message.
Inevitably an LLM will generate a wrong format or entirely forget to use a tool, and Langroid’s built-in task loop ensures a friendly error message is sent back to the LLM to have it regenerate the structured message.
For example here’s a colab quick-start that builds up to a 2-agent system to extract structured info from a document, where the Extractor agent generates questions to the RAG Agent that has access to the document:
https://colab.research.google.com/github/langroid/langroid/b...
For my purposes it seems to do quite well, but at the cost of large token inputs, since I’m classifying single elements in a screenplay where I’m trying to identify the difference between various elements in a scene and a script. I’m sending the whole scene text with the extracted elements (which have already been extracted by regex, thanks to the existing structure, but not yet classed) and asking it to classify each element based on a few categories. But then there becomes another question of accuracy.

For sentence or paragraph analysis that might look like the ugly, horrendous-looking "{blockOfText}" = {type: object, properties: {sentimentAnalysis: {type: string, description: "only choose from {CATEGORIES}"}}}. Which is unfortunately not the best-looking way, but it works.