Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.
None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance so they can train models.
It's not really a conspiracy when we have multiple examples of high-profile companies doing exactly this. And it keeps happening. Granted, I'm unaware of cases of this currently occurring with professional AI services, but it's basic security 101 that you should never give anything even the remote opportunity to ingest data unless you don't care about the data.
Also I wonder if the ToS covers "queries & interaction" vs "uploaded data" - I could imagine some tricky language in there that says we won't use your Word document, but we may at some point use the queries you run against it, not as a raw corpus but as a second layer for examining what tools/workflows to expand/exploit.
“How can I control whether my data is used for model training?
If you are logged into Copilot with a Microsoft Account or other third-party authentication, you can control whether your conversations are used for training the generative AI models used in Copilot. Opting out will exclude your past, present, and future conversations from being used for training these AI models, unless you choose to opt back in. If you opt out, that change will be reflected throughout our systems within 30 days.” https://support.microsoft.com/en-us/topic/privacy-faq-for-mi...
At this point, suggesting it has never happened and never will is wildly optimistic.
While this isn't used specifically for LLM training, it can involve aggregating insights from customer behaviour.
This is objectively untrue? Giant swaths of enterprise software are based on establishing trust with approved vendors and systems.
If they can get away with it (say by claiming it's "fair use"), they'll ignore corporate ones too
despite all 3 branches of the government disagreeing with them over and over again
Many of the top AI services use human feedback to continuously apply "reinforcement learning" after the initial deployment of a pre-trained model.
https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...
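To make that feedback loop concrete, here is a rough sketch of how post-deployment thumbs-up/thumbs-down signals can be logged and paired for a later reward-model pass; the file name, schema, and function names are illustrative assumptions, not any vendor's actual pipeline.

    # Hypothetical sketch: logging thumbs-up/down feedback as records that
    # a later RLHF / reward-model pass could consume. The file name and
    # schema are made up for illustration.
    import json
    import time

    def log_feedback(prompt: str, response: str, thumbs_up: bool,
                     path: str = "feedback.jsonl") -> None:
        """Append one (prompt, response, rating) record to a JSONL log."""
        record = {"ts": time.time(), "prompt": prompt,
                  "response": response, "rating": 1 if thumbs_up else 0}
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def build_preference_pairs(path: str = "feedback.jsonl"):
        """Pair liked and disliked responses to the same prompt into
        (chosen, rejected) examples for reward-model training."""
        by_prompt = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                by_prompt.setdefault(rec["prompt"], []).append(rec)
        pairs = []
        for prompt, records in by_prompt.items():
            liked = [r["response"] for r in records if r["rating"] == 1]
            disliked = [r["response"] for r in records if r["rating"] == 0]
            pairs.extend({"prompt": prompt, "chosen": good, "rejected": bad}
                         for good in liked for bad in disliked)
        return pairs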
What? That's literally my point: under enterprise agreements, these companies aren't training on their enterprise customers' data, despite what the parent commenter claimed.
Do you have any citations or sources for this at all?
There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.
The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.
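A minimal sketch of the hypothetical described above, assuming a placeholder paraphrase_with_llm helper (not a real API): only the paraphrase is persisted, never the verbatim query.

    # Hypothetical sketch of a "parallel paraphrased corpus" pipeline.
    # paraphrase_with_llm is a made-up placeholder, not any vendor's API,
    # and nothing here describes a known production system.
    import json

    def paraphrase_with_llm(text: str) -> str:
        # Placeholder for a call to some model that rewrites the query
        # while preserving its intent.
        return "[paraphrase of a query about: " + text[:40] + "...]"

    def serve_and_log(user_query: str, corpus_path: str = "paraphrased.jsonl") -> None:
        paraphrase = paraphrase_with_llm(user_query)
        # Only the paraphrase is persisted; the verbatim query is discarded
        # once the response has been served.
        with open(corpus_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"text": paraphrase}) + "\n")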
Inference (what happens when you use an LLM as a customer) is separate from training.
Inference and training are separate processes. Using an LLM doesn’t train it. That’s not what RLHF means.
Merely using an LLM for inference does not train it on the prompts and data, as many incorrectly assume. There is a surprising lack of understanding of this separation even on technical forums like HN.
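To make the distinction concrete, a toy PyTorch illustration (the linear layer is a stand-in for an LLM): inference runs with no gradients and leaves the weights untouched, while training is a separate, deliberate process with an explicit update step.

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)        # stand-in for an LLM
    prompt = torch.randn(1, 4)     # stand-in for a tokenised prompt

    # Inference: no gradients, no optimizer, the weights never change.
    with torch.no_grad():
        output = model(prompt)

    # Training: a separate process that explicitly updates the weights.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    target = torch.randn(1, 2)
    loss = nn.functional.mse_loss(model(prompt), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()               # only here do the weights change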
Many businesses simply couldn't afford to operate without such an edge.
The enterprise user agreement is preventing this.
Suggesting that AI companies will uniquely ignore the law or contracts is conspiracy theory thinking.
There are claims all through this thread that “AI companies” are probably doing bad things with enterprise customer data but nobody has provided a single source for the claim.
This has been a theme on HN. There was a thread a few weeks back where someone confidently claimed up and down the thread that Gemini’s terms of service allowed them to train on your company’s customer data, even though 30 seconds of searching leads to the exact docs that say otherwise. There is a lot of hearsay being spread as fact, but nobody actually linking to ToS or citing sections they’re talking about.
The big companies - take Midjourney or OpenAI, for example - take the feedback generated by users and then apply it as part of the RLHF pass on the next model release, which happens every few months. That's why they have terms in their ToS that allow them to do that.
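For illustration only, here is roughly what applying that feedback in a reward-model update could look like; the tiny model and random tensors are placeholders standing in for real preference data, not anyone's actual training code.

    import torch
    import torch.nn as nn

    class TinyRewardModel(nn.Module):
        def __init__(self, dim: int = 16):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # embedding -> scalar reward

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.score(x).squeeze(-1)

    model = TinyRewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Pretend these are embeddings of (chosen, rejected) response pairs
    # built from user feedback; here they are just random placeholders.
    chosen = torch.randn(8, 16)
    rejected = torch.randn(8, 16)

    # Bradley-Terry style pairwise loss: the chosen response should score
    # higher than the rejected one.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # an actual weight update, i.e. training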
There's a range of ways to lie by omission here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.
“You can use an LLM to paraphrase the incoming requests and save that. Never save the verbatim request. If they ask for all the request data we have, we tell them the truth, we don’t have it. If they ask for paraphrased data, we’d have no way of correlating it to their requests.”
“And what would you say, is this a 3 or a 5 or…”
Everything obvious happens. Look closely at the PII management agreements. Btw OpenAI won’t even sign them because they’re not sure if paraphrasing “counts.” Google will.
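As a toy illustration of the kind of scrubbing those PII agreements gesture at (real pipelines involve far more than a couple of regexes):

    # Toy illustration only: real PII handling covers names, addresses,
    # IDs, context, and audits, not just two patterns.
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def redact_pii(text: str) -> str:
        text = EMAIL.sub("[EMAIL]", text)
        text = PHONE.sub("[PHONE]", text)
        return text

    print(redact_pii("Reach me at jane@example.com or +1 555 123 4567"))
    # -> "Reach me at [EMAIL] or [PHONE]"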
"Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal"
https://www.wired.com/story/new-documents-unredacted-meta-co...
They even admitted to using copyrighted material.
"‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says"
https://www.theguardian.com/technology/2024/jan/08/ai-tools-...
"We will train new models using data from Free, Pro, and Max accounts when this setting is on (including when you use Claude Code from these accounts)."
For example, in ML you have a train set and a test set; the model never sees the test set, but it is used to validate the model - why not put proprietary data in the test set?
I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.
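A sketch of what that gray area could look like, with random tensors standing in for proprietary data: the model is scored on the data but never updated from it, yet the score still steers which model ships.

    import torch
    import torch.nn as nn

    model = nn.Linear(8, 1)             # placeholder model
    proprietary_x = torch.randn(32, 8)  # stand-in for customer-derived features
    proprietary_y = torch.randn(32, 1)

    model.eval()
    with torch.no_grad():               # no gradients, so no weight changes
        val_loss = nn.functional.mse_loss(model(proprietary_x), proprietary_y)

    # val_loss now informs model selection and hyperparameter choices,
    # which is the sense in which the data still shapes the shipped model.
    print(float(val_loss))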
Or alternatively - it's easier to ask for forgiveness than permission.
I've recently had an apocalyptic vision that one day we'll wake up and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop, etc.
https://www.vice.com/en/article/meta-says-the-2400-adult-mov...
However, let's say I record human interactions with my app; for example, when a user accepts or rejects an AI-synthesised answer.
This data can be used by me to influence the behaviour of an LLM via RAG or by altering application behaviour.
It's not going to change the model's weights, but it would influence its behaviour.
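A minimal sketch of that approach, with illustrative names: accepted answers are stored and retrieved into future prompts, so the behaviour shifts without the model's weights ever being touched.

    # Hypothetical sketch: recorded accept/reject decisions fed back into
    # prompts via retrieval; the model's weights are never modified.
    accepted_answers = []  # (question, answer) pairs the user accepted

    def record_feedback(question: str, answer: str, accepted: bool) -> None:
        if accepted:
            accepted_answers.append((question, answer))

    def build_prompt(new_question: str, k: int = 3) -> str:
        # Naive retrieval: reuse the most recent accepted examples.
        # A real RAG setup would rank by embedding similarity instead.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in accepted_answers[-k:])
        return ("Previously accepted answers:\n" + context +
                "\n\nNew question: " + new_question +
                "\nAnswer in the same style:")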