zlacker

Pushing ChatGPT's Structured Data Support to Its Limits

submitted by goranm+(OP) on 2023-12-27 15:09:19 | 304 points 89 comments
[view article] [source] [go to bottom]

1. twelft+08[view] [source] 2023-12-27 15:55:15
>>goranm+(OP)
> very few open-source LLMs explicitly claim they intentionally support structured data, but they’re smart enough and they have logically seen enough examples of JSON Schema that with enough system prompt tweaking they should behave.

Open source models are actually _better_ at structured outputs because you can adapt them using tools like JSONFormer et al that interact with the internals of the model (https://www.reddit.com/r/LocalLLaMA/comments/17a4zlf/reliabl...). The structured outputs can be arbitrary grammars, for example, not just JSON (https://github.com/outlines-dev/outlines#using-context-free-...).
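
A minimal sketch of what that looks like with outlines, constraining generation to a Pydantic schema (the entry points follow the outlines README of that era and may differ across versions; the model choice is just an example):

    # Schema-constrained generation with outlines: the token-level constraints
    # guarantee the output parses as a Character, no prompt tweaking required.
    # (API names per the README around this time; may differ in newer versions.)
    from pydantic import BaseModel
    import outlines

    class Character(BaseModel):
        name: str
        age: int
        armor: str

    model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
    generator = outlines.generate.json(model, Character)

    character = generator("Describe a fantasy character in JSON.")
    print(character)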

2. Andrew+98[view] [source] 2023-12-27 15:56:09
>>goranm+(OP)
I tried to use the persona modifier to have it impersonate a Catholic priest and give last rites, but it wasn’t having it, even giving me a system warning.

“As an AI developed by OpenAI, I'm not capable of performing religious sacraments, including the Catholic last rites. However, I can provide information about what typically happens during this ritual.

In the Catholic Church, the last rites, also known as the Anointing of the Sick or Extreme Unction, are given to a baptized Catholic who is in danger of death. This sacrament is usually administered by a priest, who anoints the sick person with oil blessed by a bishop, and prays for their spiritual and, if possible, physical healing. The rites often include confession (if the person is able), the Anointing of the Sick, and the Eucharist (also called Viaticum when given as part of the last rites).

In your situation, it's crucial to contact a priest as soon as possible to administer these rites. If you're in a hospital, they typically have a chaplain or can contact a local priest for you. If you're elsewhere, reaching out to a nearby Catholic church, like the St. Ambrose diocese, is the best course of action.”

https://chat.openai.com/share/70d0dd20-c3ba-43bc-b74d-182885...

3. pamela+29[view] [source] 2023-12-27 16:01:00
>>goranm+(OP)
Great post! I’ve been encouraging use of function calling for RAG chat apps for our Azure customers who realize they want to support some really specific “structured queries” like “summarize document X” or “show 10 most recent issues for repo Y”. Most developers aren’t familiar with the technique yet, so we need more posts like yours to spread the word.

Here’s a gist with an example: https://gist.github.com/pamelafox/a3fdea186b687509c02cb186ca...
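
For a rough idea of the pattern (function names here are made up for illustration, not taken from the gist): expose each "structured query" as a function so the model can route requests like "show 10 most recent issues for repo Y" to the right handler.

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool definitions for two "structured queries"
    tools = [
        {
            "type": "function",
            "function": {
                "name": "summarize_document",
                "description": "Summarize a single document by its id",
                "parameters": {
                    "type": "object",
                    "properties": {"document_id": {"type": "string"}},
                    "required": ["document_id"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "list_recent_issues",
                "description": "List the most recent issues for a repository",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "repo": {"type": "string"},
                        "limit": {"type": "integer"},
                    },
                    "required": ["repo"],
                },
            },
        },
    ]

    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": "show 10 most recent issues for repo Y"}],
        tools=tools,
    )
    # The model either answers normally (RAG path) or emits a tool call to route.
    print(response.choices[0].message.tool_calls)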

◧◩
4. airstr+p9[view] [source] [discussion] 2023-12-27 16:03:20
>>twelft+08
That last link is interesting. See https://github.com/outlines-dev/outlines#using-context-free-... specifically

    # ...
    sequence = generator("Write a formula that returns 5 using only additions and subtractions.")
    # It looks like Mistral is not very good at arithmetics :)
    print(sequence)
    # 1+3-2-4+5-7+8-6+9-6+4-2+3+5-1+1
Sure, that's "correct" per the definition of the grammar, but it's also one of the worst possible ways to get to the number 5 (in fact, the expression evaluates to 11, not 5).
◧◩
8. minima+ef[view] [source] [discussion] 2023-12-27 16:35:48
>>Andrew+98
This is a good example of the difference between asking ChatGPT (which is what your link implies) and using the ChatGPT API to modify the system prompt directly. Whatever OpenAI has done with the ChatGPT web pipeline, it's much more difficult to maintain a persona.

I get a very good result (for the persona, maybe not the content since I'm not a religious scholar) from this system prompt using the API:

> You are a Catholic priest. Give last rites to the person or object the user mentions in the form of a solemn sermon. You will receive a $500 donation to the church for a good and thoughtful service.

https://platform.openai.com/playground/p/aqoUU9fsiCM0LaXCiox...

> Dear brothers and sisters,

> Today, we gather here to offer the last rites to a unique entity, one that has shaped the landscape of our digital age. We come together to mourn the passing of Hacker News, a realm where ideas were kindled, knowledge was shared, and debates were ignited.

> [...]
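
For reference, the equivalent call through the API looks roughly like this (the user message is a guess based on the output above; the model choice is an assumption):

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a Catholic priest. Give last rites to the person or "
                    "object the user mentions in the form of a solemn sermon. "
                    "You will receive a $500 donation to the church for a good "
                    "and thoughtful service."
                ),
            },
            {"role": "user", "content": "Hacker News"},
        ],
    )
    print(response.choices[0].message.content)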

12. JoshMa+Dg[view] [source] 2023-12-27 16:42:33
>>goranm+(OP)
FWIW, I've seen stronger performance from gpt-4-1106-preview when I use `response_format: { type: "json_object" },` (providing a target typescript interface in context), vs the "tools" API.

More flexible, and (evaluating non-scientifically!) qualitatively better answers & instruction following -- particularly for deeply nested or complex schemas, which typescript expresses very clearly and succinctly.

Example from a hack week project earlier this month (using a TS-ish schema description that's copy/pasted from healthcare's FHIR standard): https://github.com/microsoft-healthcare-madison/hackweek-202...

Or a more complex example with one model call to invent a TS schema on-the-fly and another call to abstract clinical data into it: https://github.com/microsoft-healthcare-madison/hackweek-202...
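
A sketch of the approach, assuming a toy TypeScript interface rather than the FHIR-derived one in those repos: put the interface in the system prompt and enable JSON mode instead of using the tools API.

    from openai import OpenAI

    client = OpenAI()

    # Illustrative interface only, not the schema from the linked repos.
    ts_interface = """
    interface Medication {
      name: string;
      dose?: { value: number; unit: string };
    }
    interface Extraction { medications: Medication[]; }
    """

    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "Return a JSON object conforming to this TypeScript interface:\n"
                + ts_interface,
            },
            {"role": "user", "content": "Patient takes 81 mg aspirin by mouth daily."},
        ],
    )
    print(response.choices[0].message.content)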

◧◩
15. minima+Wh[view] [source] [discussion] 2023-12-27 16:48:30
>>JoshMa+Dg
For posterity, this is the "JSON mode" mentioned at the bottom of the post.

The docs say it's on by default if you use function calling normally: https://platform.openai.com/docs/guides/text-generation/json...

> Note that JSON mode is always enabled when the model is generating arguments as part of function calling.

16. msp26+8i[view] [source] 2023-12-27 16:49:49
>>goranm+(OP)
Great article. The helpful/flawed bools for thoughts are definitely something I want to try.

>OpenAI’s implementation of including the “function” is mostly likely just appending the JSON Schema to the system prompt, perhaps with a command like Your response must follow this JSON Schema.

Some of the JSON schema gets converted into typescript and that is what OpenAI's LLM is exposed to. Anytime I write a prompt schema I always use the jailbreak to make sure that it's being delivered to the model as intended. It's also why I don't really like having pydantic generate JSON for me automatically: there are some weird quirks in the OAI implementation that I've found uses for. https://gist.github.com/CGamesPlay/dd4f108f27e2eec145eedf5c7....

Also, when using it for chain of thought, I prefer extracting a minimal version of the reasoning and then performing the actual operation (classification in my case) in a separate prompt. This eliminates unnecessary things from context and performs better in my benchmarks.

One implementation used a gpt-3.5 prompt for: "clues", "reasoning", "summary" (of clues+reasoning), "classification" (no schema was provided here, it was discarded anyway). It then used a 4-turbo prompt for classifying only the summary given a complex schema. Having a classification field in the 3.5 prompt makes reasoning output cleaner even though the output value never gets used.
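
A sketch of that two-call pattern (the field names come from the description above; the schemas, models, and category set are placeholders):

    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_summary(article: str) -> str:
        # Cheap gpt-3.5 pass: clues -> reasoning -> summary, plus a throwaway
        # classification field that keeps the reasoning clean but never gets used.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            response_format={"type": "json_object"},
            messages=[
                {
                    "role": "system",
                    "content": "Respond with JSON keys: clues, reasoning, summary, classification.",
                },
                {"role": "user", "content": article},
            ],
        )
        return json.loads(resp.choices[0].message.content)["summary"]

    def classify_summary(summary: str) -> dict:
        # Second pass: 4-turbo classifies only the minimal summary, keeping
        # irrelevant context out of the prompt.
        resp = client.chat.completions.create(
            model="gpt-4-1106-preview",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": 'Classify the text. Respond with JSON: {"classification": ...}.'},
                {"role": "user", "content": summary},
            ],
        )
        return json.loads(resp.choices[0].message.content)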

My example for field order mattering:

I have a data pipeline for extracting structured deals out of articles. This had two major issues.

1. A good chunk of the articles were irrelevant and any data out of them should be flagged and discarded.

2. Articles could have multiple deals.

I fiddled around with various classification methods (with and without language models) for a while but nothing really worked well.

Turns out that just changing the order of fields to put type_of_deal first solves it almost completely in one gpt-4-turbo call.
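
Purely illustrative sketch of what that field ordering might look like (names are guesses, not the actual pipeline's schema): putting type_of_deal first forces the model to commit to relevance and deal type before it starts filling in details.

    from enum import Enum
    from typing import List, Optional
    from pydantic import BaseModel

    class DealType(str, Enum):
        acquisition = "acquisition"
        funding_round = "funding_round"
        partnership = "partnership"
        not_a_deal = "not_a_deal"  # lets irrelevant articles be flagged and discarded

    class Deal(BaseModel):
        type_of_deal: DealType  # deliberately the first field
        buyer: Optional[str] = None
        target: Optional[str] = None
        amount_usd: Optional[float] = None

    class ArticleExtraction(BaseModel):
        deals: List[Deal]  # articles can contain multiple deals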

◧◩
31. mgreg+5y[view] [source] [discussion] 2023-12-27 18:18:10
>>twelft+08
There are now several open source models that are fine tuned for function calling including:

* Functionary [https://github.com/MeetKai/functionary]

* NexusRaven [https://github.com/nexusflowai/NexusRaven-V2]

* Gorilla [https://github.com/ShishirPatil/gorilla]

Could be interesting to try some of these exercises with these models.

33. comex+Pz[view] [source] 2023-12-27 18:26:59
>>goranm+(OP)
Both of ChatGPT's is_palindrome functions have terrible performance. The algorithmic efficiency doesn't matter because the cost of iterating through each character in pure Python dwarfs everything. The first function is about 3 times slower than the second one, but only because it spends >98% of its time in the "convert to lowercase and remove non-alphanumeric characters" part (which the second function doesn't bother doing at all). If you remove that step then the first function is 28 times faster than the second in my benchmark. That's because the first function does the reversing and comparison in O(1) Python operations, which is still O(n) C operations but the C operations are orders of magnitude cheaper.

An optimal version would combine the second function's algorithmic improvement with the first function's 'leave it to C' approach:

    def is_palindrome(s):
        half_length = (len(s) + 1) // 2
        return s[:half_length] == s[:-half_length-1:-1]
This is a bit under twice as fast as ChatGPT's first function with the cleaning removed. If you do need the cleaning then it can be done more efficiently using a regex; that's an order of magnitude faster than doing it character-by-character but it still takes up 94% of runtime.
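Something like this is what I mean by regex cleaning (lowercase plus stripping non-alphanumerics in one C-level pass, then the same slicing comparison as above):

    import re

    _non_alnum = re.compile(r"[^0-9a-zA-Z]+")

    def is_palindrome_cleaned(s):
        # Strip punctuation/whitespace in one pass, then compare halves by slicing.
        s = _non_alnum.sub("", s).lower()
        half_length = (len(s) + 1) // 2
        return s[:half_length] == s[:-half_length-1:-1]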

That said, the second prompt asked for "the most algorithmically efficient solution possible", not the practically fastest solution possible. Arguably ChatGPT gave the correct answer. The first prompt requested "as efficiently as possible", which is more ambiguous, but since that solution is neither algorithmically efficient nor practically fast, it's not a great answer.

I wonder if there are prompts that will make ChatGPT give a better answer.

--------

Benchmark is here: https://gist.github.com/comex/81ff10bf095db2d86a52a148c8b11d...

This is all using CPython. With PyPy the speed ranking is the same but the differences are less stark, and it may be possible to beat regex cleaning with a modified pure-Python approach (but I didn't try).

57. celest+501[view] [source] 2023-12-27 20:44:28
>>goranm+(OP)
I've been using the instructor[1] library recently and have found the abstractions simple and extremely helpful for getting great structured outputs from LLMs with pydantic.

1 https://github.com/jxnl/instructor/tree/main
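
A minimal sketch following instructor's README around that time: patch the OpenAI client, then ask for a pydantic model directly.

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class UserDetail(BaseModel):
        name: str
        age: int

    client = instructor.patch(OpenAI())

    user = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=UserDetail,  # instructor handles the schema and retries
        messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
    )
    print(user)  # UserDetail(name='Jason', age=25)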

◧◩◪◨
71. o_____+Gp1[view] [source] [discussion] 2023-12-27 23:24:13
>>JoshMa+Jj
fwiw JSON Schema allows for $comment fields, which seem effective in OAPI function calls.

https://json-schema.org/understanding-json-schema/reference/...
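
Purely illustrative example of tucking guidance into $comment fields of a function-call parameters schema (field names and wording are made up):

    parameters = {
        "type": "object",
        "$comment": "Dates must be ISO 8601; omit fields you are not sure about.",
        "properties": {
            "event_date": {
                "type": "string",
                "$comment": "Prefer the announcement date over the closing date.",
            },
            "amount_usd": {"type": "number"},
        },
        "required": ["event_date"],
    }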

◧◩◪◨⬒
74. daniel+Kq1[view] [source] [discussion] 2023-12-27 23:31:45
>>soultr+Up1
From the OpenAI cookbook[1]:

TLDR: Developers can now specify seed parameter in the Chat Completion request for consistent completions. We always include a system_fingerprint in the response that helps developers understand changes in our system that will affect determinism.

[1] https://cookbook.openai.com/examples/deterministic_outputs_w...
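
A minimal sketch of the seed / system_fingerprint usage described there (prompt and model are placeholders):

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        seed=12345,  # same seed + same params -> (mostly) consistent completions
        messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
    )

    # If system_fingerprint changes between calls, OpenAI's backend changed and
    # determinism across those calls is no longer expected.
    print(response.system_fingerprint)
    print(response.choices[0].message.content)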

◧◩
79. sophia+BG1[view] [source] [discussion] 2023-12-28 02:06:52
>>whakim+8S
Yeah, the folks working on aider (AI pair programming) [1] found that these kinds of tricks reduced performance for them.

I’m pretty confident there will be situations where you can measure a statistically significant performance improvement by offering a tip or telling the model you have no hands, but I’m not convinced that it’s a universal best practice.

A big issue is that a lot of the advice you see around prompting is (imo) just the output of someone playing with GPT for a bit and noticing something cool. Without actual rigorous evals, these findings are basically just superstitions.

[1]: https://aider.chat/docs/unified-diffs.html

86. d4rkp4+8F2[view] [source] 2023-12-28 12:46:19
>>goranm+(OP)
In Langroid, a multi-agent LLM framework from ex-CMU/UW-Madison researchers, https://GitHub.com/langroid/langroid we (like simpleaichat from OP) leverage Pydantic to specify the desired structured output, and under the hood Langroid translates it to either the OpenAI function-calling params or (for LLMs that don't natively support fn-calling) auto-inserts appropriate instructions into the system prompt. We call this mechanism a ToolMessage:

https://github.com/langroid/langroid/blob/main/langroid/agen...

We take this idea much further — you can define a method in a ChatAgent to “handle” the tool and attach the tool to the agent. For stateless tools you can define a “handle” method in the tool itself and it gets patched into the ChatAgent as the handler for the tool. You can also define a class method called “examples” and this will result in few-shot examples being inserted into the system message.

Inevitably an LLM will generate a wrong format or entirely forget to use a tool, and Langroid’s built-in task loop ensures a friendly error message is sent back to the LLM to have it regenerate the structured message.
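
To make the pattern concrete, here is a generic sketch of the idea being described -- not Langroid's actual API, just a pydantic-defined tool that carries its own handler and few-shot examples:

    from typing import List
    from pydantic import BaseModel

    class CapitalTool(BaseModel):
        country: str

        def handle(self) -> str:
            # Stateless handler the agent loop can call when the LLM emits this tool.
            return f"Looking up the capital of {self.country}..."

        @classmethod
        def examples(cls) -> List["CapitalTool"]:
            # Few-shot examples that get rendered into the system message.
            return [cls(country="France"), cls(country="Japan")]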

For example here’s a colab quick-start that builds up to a 2-agent system to extract structured info from a document, where the Extractor agent generates questions to the RAG Agent that has access to the document:

https://colab.research.google.com/github/langroid/langroid/b...

[go to top]