OpenAI departures: Why can’t former employees talk?

>>fnbr+(OP)
The best approach to circumventing the nondisclosure agreement is for the affected employees to get together, write out everything they want to say about OpenAI, train an LLM on that text, and then release it.

Based on these companies' arguments that copyrighted material is not actually reproduced by these models, and that any seemingly-infringing use is the responsibility of the user of the model rather than those who produced it, anyone could freely generate an infinite number of high-truthiness OpenAI anecdotes, freshly laundered by the inference engine, that couldn't be used against the original authors without OpenAI invalidating their own legal stance with respect to their own models.

>>mwigda+OQ
Clever, but no.

The argument about LLMs not being copyright laundromats making sense hinges the scale and non-specificity of training. There's a difference between "LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet", vs. "LLM was intentionally trained to specifically reproduce variants of this particular work". Whatever one's stances on the former case, the latter case would be plain infringing copyrights and admitting to it.

In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.

>>TeMPOr+0T
Seems absurd that somehow the scale being massive makes it better somehow

You would think having a massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water

>>bluefi+kT
So, the law has this concept of 'de minimus' infringement, where if you take a very small amount - like, way smaller than even a fair use - the courts don't care. If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low, so courts aren't likely to care.

If you're only training on a handful of works then you're taking more from them, meaning it's not de minimus.

For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimus until the Second Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes without Congressional intervention even!

[0] who is NOT pro-AI, he just thinks labor law is a better bulwark against it than copyright

>>kmeist+SX
You don't even need to go this far.

The word-probabilities are transformative use, a form of fair use and aren't an issue.

The specific output at each point in time is what would be judged to be fair use or copyright infringing.

I'd argue the user would be responsible for ensuring they're not infringing by using the output in a copyright infringing manner i.e. for profit, as they've fed certain inputs into the model which led to the output. In the same way you can't sue Microsoft for someone typing up copyrighted works into Microsoft Word and then distributing for profit.

De minimus is still helpful here, not all infringments are noteworthy.

>>KoolKa+TY
OpenAI is outputting the partially copyright-infringing works of their LLM for profit. How does that square?

>>rcbdev+g31
You, the user, is inputting variables into their probability algorithm that's resulting in the copyright work. It's just a tool.

>>KoolKa+Z61
Let's say a torrent website asks the user through an LLM interface what kind of copyrighted content they want to download and then offers me links based on that, and makes money off of it.

The user is "inputting variables into their probability algorithm that's resulting in the copyright work".

>>maeil+i91
Theoretically a torrent website that does not distribute the copyright files themselves in anyway should be legal, unless there's a specific law for this (I'm unaware of any, but I may be wrong).

They tend to try argue for conspiracy to commit copyright infringement, it's a tenuous case to make unless they can prove that was actually their intention. I think in most cases it's ISP/hosting terms and conditions and legal costs that lead to their demise.

Your example of the model asking specifically "what copyrighted content would you like to download", kinda implies conspiracy to commit copyright infringement would be a valid charge.

zlacker