When you pull in multiple large documents, especially all at once, things start to act weird, but pulling in documents one at a time seems to preserve context across them. There's a character limit of 100k per API request, which at a rough 4 characters per token for English text works out to about 25k tokens, so I'm assuming a 32k context window, but it's not totally clear what is going on in the background.
It's kind of clunky but works well enough for me. It's not something I would put sensitive info into, but it's also much cheaper than using GPT-4 via the API, and I maintain control of the data flow and storage.
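For anyone who wants to try the one-document-at-a-time approach, here's a rough sketch of how I'd keep each request under the 100k character limit. The names `split_document` and `feed_documents` are made up for illustration, and `send` is a placeholder for whatever function actually makes your API call; none of this is the service's real client.

```python
from typing import Callable, Iterable, List

MAX_CHARS = 100_000  # per-request character limit mentioned above


def split_document(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Try to cut just after the last blank line inside the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        pieces.append(text[start:end])
        start = end
    return pieces


def feed_documents(
    docs: Iterable[str],
    send: Callable[[str], str],  # hypothetical: wraps your actual API call
    max_chars: int = MAX_CHARS,
) -> List[str]:
    """Send documents one at a time, one request per piece, and collect replies."""
    replies = []
    for doc in docs:
        for piece in split_document(doc, max_chars):
            replies.append(send(piece))
    return replies
```

You'd call it as `feed_documents(my_docs, send=my_api_call)`. Splitting on blank lines is just my guess at a sensible cut point; it only guarantees each piece fits the character limit, not that the model keeps context between requests.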