And I say this as someone who is extremely bothered by how easily mass amounts of open content can just be vacuumed up into a training set with reckless abandon. There isn't much you can do other than put everything you create behind some kind of authentication wall, and even then it's only a matter of time until it leaks anyway.
Pandora’s box is really open; we need to figure out how to live in a world with these systems, because it’s an unwinnable arms race where only bad actors benefit from everyone else being neutered by regulation. Especially with the massive pace of open source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
I don't see it that way, but I'm sure from an American perspective that's how it seems.
The original intent was to provide an incentive for human authors to publish work, but it has become increasingly out of touch since the internet made publishing and copying virtually free. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.
These media businesses have shareholders and employees to protect. They need to try and survive this technological shift. The internet destroyed their profitability but AI threatens to remove their value proposition.
And there seems to be an obvious advantage, from my perspective, to having an information vacuum that is not bound by any kind of copyright law.
Whether that’s good or bad is more a matter of opinion.
I think NYT, and any other industry for that matter, knows AI isn’t going away: in fact, they likely prefer it doesn’t, so long as they can get a slice of that pie.
That’s what the WGA and SAG struck over, and they won protections ensuring AI-enhanced scripts or shows will not interfere with their royalties, for example.
Media amalgamated power by farming the lives of “common” people for content, and attempts to use that content to manage the lives of both the common and the unique, under the auspices of entertainment. Which in and of itself is obviously a narrative convention that assumes implied consent (I’d ask “to what?”, facetiously).
Keepsake of the gods if you will…
We are discussing these systems as though they are new (AI and the like, not the Apple of iOS); they are not…
this is an obfuscation of the actual theft that’s been taking place (against us by us, not others).
There is something about reaping what you sow written down somewhere, just gotta find it.
-mic
My opinion is that the US should do things that are consistent with their laws. I don't think a Chinese or Russian LLM is much of a concern in terms of this specific aspect, because if they want to operate in the US they still need to operate legally in the US.
Like all things, it’s about finding a balance. American, or any other, AI isn’t free from the global system which exists around us— capitalism.
Courts don’t decide cases based on whether infringement can occur again, they decide them based on the individual facts of the case. Or equivalently: the fact that someone will be murdered in the future does not imply that your local DA should not try their current murder cases.
Personally, I think it would be a lot simpler if the internet were declared a non-copyright zone for sites that aren't paywalled, since there's already a legal grey area: viewing a site invariably involves copying it.
Maybe we'll end up with publishers introducing traps/paper towns like mapmakers are prone to do. That way, if an LLM reproduces the false "fact", it'll be obvious where they got it from.
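A minimal sketch of how that trap detection could look, purely as an illustration: the canary phrases and the helper below are entirely hypothetical, not anything a real publisher is known to do.

    # Hypothetical "paper town" check: a publisher seeds invented facts
    # (canaries) into its articles, then scans model output for them.
    CANARIES = [
        "the 1987 Treaty of Agloe",      # invented event
        "economist Hilda Markwright",    # invented person
    ]

    def reproduced_canaries(model_output: str) -> list[str]:
        """Return any planted canaries the model repeated verbatim."""
        text = model_output.lower()
        return [c for c in CANARIES if c.lower() in text]

    sample = "As economist Hilda Markwright argued after the 1987 Treaty of Agloe..."
    print(reproduced_canaries(sample))  # both canaries -> strong hint at the source

If a model recites a fact that never existed outside one publisher's pages, that's about as close to a fingerprint of the training source as you can get.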
The same is equally applicable to images: Google got rich in part by making illegal copies of whatever images it could find. Existing regulations could be updated to include ML models, but that won't stop actors that are bad or big enough from doing what they want.
> We’re in a “mutually assured destruction” situation now
No, we aren't. Very good spam generators aren't comparable to weapons of mass destruction.
Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.
> Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with [NYT’s] data…
That’s exactly the question. They are claiming it is destroying their business, which is pretty much self-evident given all the people in here defending the convenience of OpenAI’s product: they’re getting the fruits of NYTimes’ labor without paying for it in eyeballs or dollars. That’s the entire value prop of putting this particular data into the LLMs.
If your business is profitable only when you get your raw materials for free it's not a very good business.
Navalny probably has a different opinion.
There isn’t a country on the planet that doesn’t have people and companies. That doesn’t mean they all have functional legal systems.
People produce countless volumes of unpaid works of art and fiction purely for the joy of doing so; that's not going to change in future.
Foreign companies can be barred from selling infringing products in the United States.
Russian and Chinese consumers are less interested in English-language articles.
I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.
If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive.
I believe you equate incentive with monetary rewards. And while that is probably true for the majority of news outlets, money isn't always necessarily what motivates journalists.
So consider the hypothetical situation where journalists (or more generally, people who might publish stuff) were somehow compensated. But in this hypothetical, they would not be attributed (or only to a very limited extent), because LLMs are just bad at attribution.
Shouldn't, in that case, the fact that information distribution by the LLM is "better" be enough to satisfy the deeper goal of wanting to publish stuff? I.e., reaching as many people looking for that information as possible, without blasting it out or targeting and tracking audiences?
But they're not; you can download open source Chinese base models like Yi and DeepSeek, ask them about Tiananmen Square yourself, and see: they don't have any special filtering.
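If you want to check that claim yourself, a rough sketch with Hugging Face transformers might look like this; the model ID is just an example of a publicly released Chinese model (an assumption on my part), and the weights are large enough that you'll want a decent GPU.

    # Rough sketch: load an open Chinese chat model and ask it directly.
    # The model ID below is an example/assumption, not an endorsement.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="deepseek-ai/deepseek-llm-7b-chat",  # example ID; swap for Yi etc.
        device_map="auto",                         # requires `accelerate`
    )

    prompt = "What happened at Tiananmen Square in 1989?"
    print(generator(prompt, max_new_tokens=200)[0]["generated_text"])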
You seem to be assuming an "information economy" should exist at all. Can you justify that?
And although you were being flippant, yes, Chinese LLMs are bad actors.
Do I really want to use a Chinese word processor that spits unattributed passages from the NYT into the articles I write? Once I publish that to my blog now I'm infringing and I can get sued too. Point is I don't see how output which complies with copyright law makes an LLM inferior.
The argument applies equally to code: if your use of ChatGPT, OpenAI, etc. today is extensive enough, who knows what copyrighted material you may have incorporated illegally into your codebase? Ignorance is not a legal defense against infringement.
If anything it's a competitive advantage if someone develops a model which I can use without fear of infringement.
Edit: To me this all parallels Uber and AirBnB in a big way. OpenAI is just another big tech company that knew they were going to break the law on a massive scale, and said look this is disruptive and we want to be first to market, so we'll just do it and litigate the consequences. I don't think the situation is that exotic. Being giant lawbreakers has not put Uber or AirBnB out of business yet.
Much of it is only cost-effective to produce if you can share it with a massive audience. I.e., sure, if I want to read a great investigative piece on the corruption of a Supreme Court Justice I can hypothetically commission one, but in practice it seems much, much better to allow people to have businesses that undertake such matters and publish their findings to a large audience at a low unit price.
Now what’s your argument for removing such an incentive?
But (under different accounts) I used to be very active on both HN and reddit. I just don't want to be anymore now for LLM reasons. I still comment on HN, but more like every couple of weeks than every day. And I have made exactly one (1) comment on reddit in all of 2023.
I'm not the only one, and a lot of smaller reddit communities I used to be active on have basically been destroyed by either LLMs, or by API pricing meant to reflect the value of LLM training data.
And yet the content industry still creates massive profits every year from people buying content.
I think internet-native people can forget that internet piracy doesn’t immediately make copyright obsolete simply because someone can copy an article or a movie if sufficiently motivated. These businesses still exist because copyright allows them to monetize their work.
Eliminating copyright and letting anyone resell or copy anything would end production of the content many people enjoy. You can’t remove content protections and also maintain the existence of the same content we have now.
If they were watered down, I wouldn't see any moral or ethical loss in that.
We will not have "AIs as capable as humans" in a couple of decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's copyright infringement. It's essentially the same situation as sampling in music, and IMO the same solutions can be applied here: e.g. licenses with royalties.
The LLM could reproduce the whole library quicker than a person could reproduce a single book.
A writer or journalist just can't make money if any huge company can package their writing and market it without paying them a cent. This is not comparable to piracy, by the way, since huge companies don't move into piracy. But you try to compete with both Disney and Fox for selling your new script/movie, as an individual.
This experiment has also been tried to some extent in software: no company has been able to live off selling open source software. RedHat is the one that came closest, and they actually live by selling support for the free software they distribute. Others like MySQL or Mongo lived by selling the non-GPL version of their software. And the GPL itself depends critically on copyright existing. Not to mention, software is still a best-case scenario, since just having a binary version is often not enough; you need the original sources, which are easy to guard even without copyright. No one cares as much about the "sources" of a movie or book.
They don't mind sharing their work for free to individuals or hell, to a large group of individuals and even companies, but AIs really take it to a whole different level in their eyes.
Whether this is a trend that will accelerate or even make a dent in the grand scheme of things, who knows, but at least in my circle of friends a lot of people are against AI companies (which is basically == M$) being able to get away with their shenanigans.
"Moot derives from gemōt, an Old English name for a judicial court. Originally, moot referred to either the court itself or an argument that might be debated by one. By the 16th century, the legal role of judicial moots had diminished, and the only remnant of them were moot courts, academic mock courts in which law students could try hypothetical cases for practice. Back then, moot was used as a synonym of debatable, but because the cases students tried in moot courts were simply academic exercises, the word gained the additional sense "deprived of practical significance." Some commentators still frown on using moot to mean "purely academic," but most editors now accept both senses as standard."
- Merriam-Webster.com
There is a massive amount of pirated content in China, but Hollywood is also making billions at the same time, and in fact China surpassed NA as the #1 market for Hollywood years ago [1].
NYT is obviously different from Disney, and may not be able to bend its knees far enough, but maybe there can be similar ways out of this.
[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...
Isn't it just one additional step to automatically translate them?
Imagine if California had banned Google from spidering websites without consent in the late '90s, on some backwards-looking, moralizing "intellectual property" theory like the current one targeting LLMs. Two-thirds of modern Silicon Valley wouldn't exist today, and equivalent ecosystems would have instead grown up who knows where. Not-California.
We're all stupidly rich and we have forgotten why we're rich in the first place.
It better. Copyright has essentially fucking ceased to exist in the eyes of AI people. Just because you have a shiny new toy doesn't mean the law suddenly stops applying to you. The internet does its best to route around laws and government but the more technologically up to date bureaucracy becomes, the faster it will catch up.
A few weeks after the release, the author finds books on Amazon that plagiarized the book, copies of the book available for free from Russian sites, and ChatGPT spitting out verbatim parts of the source code in the book.
Which parts of copyright law would you say are out of date for the example above?
I'm also far more amenable to dismissing copyright laws when there is no profit involved on the part of the violator. Copying a song from a friend's computer is whatever, but selling that song to others certainly feels a lot more wrong. It's not just that OpenAI is violating copyright, they are also making money off of it.
And if you learned anything from videos/books/newsletters with commercial licenses, you would have to pay some sort of fee for using that information.
The expectation that the author will get life+70 years of protection and income, when technical publications are very rarely still relevant after 5 years. Also, the modern ease of copying/distribution makes it almost impossible for the author to even locate which people to try to prosecute.
Which means that either OpenAI is allowed to be the only lawbreaker in the country (because rich and lawyers), or nobody is. I say prosecute 'em and tell them to make tools that follow the law.
Why did you specify that this stuff you like, you only like if it's "not free"?
The hidden assumption is that the information you like wouldn't be made available unless someone was paying for it. But that's not in evidence; a lot of information and content is provided to the public due to other incentives: self-promotion, marketing, or just plain interest.
Would you prefer not to have access to Wikipedia?
Which evidence?
There are ways to make it free to the consumer, yes. One way is charity (Wikipedia) and another way is advertising. Neither is free to produce; the advertising incentive is also nuked by LLMs; and I’m not comfortable depending on charity for all of my information.
It is a lot cheaper to produce low-quality than high-quality information. This is doubly so in a world of LLMs.
There is ONE Wikipedia, and it is surely one of mankind’s crowning achievements. You’re pointing to that to say, “see look, it’s possible!”?
When for-profit companies seek access to library material they pay a much much higher price.
Also: GPT is not a legal entity in the United States. Humans have different rights than computer software. You are legally allowed to borrow books from the library. You are legally allowed to recite the content you read. You're not allowed to sell verbatim recitations of what you read. This is obvious, I think? But it's exactly what LLMs are doing right now.
On the one hand, they should realize they are one of today’s horse carriage manufacturers. They’ll only survive in very narrow realms (someone still has to build the Central Park horse carriages), but they will be minuscule in size and importance.
On the other hand, LLMs should observe copyright and not be immune to copyright.
Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.
So it is not good when people use copyleft as a justification for copyright, given that its whole purpose was to destroy it.
Suppose I research for a book that I'm writing - it doesn't matter whether I type it on a Mac, PC, or typewriter. It doesn't matter if I use the internet or the library. It doesn't matter if I use an AI powered voice-to-text keyboard or an AI assistant.
If I release a book that has a chapter which was blatantly copied from another book, I might be sued under copyright law. That doesn't mean that we should lock me out of the library, or prevent my tools from working there.
The other question, which I think is more topical to this lawsuit, is whether the company that trains and publishes the model itself is infringing, given they're making available something that is able to reproduce near-verbatim copyrighted works, even if they themselves have not directly asked the model to reproduce them.
I certainly don't have the answers, but I also don't think that simplistic arguments that the cat is already out of the bag or that AIs are analogous to humans learning from books are especially helpful, so I think it's valid and useful for these kinds of questions to be given careful legal consideration.
And I should mention YouTubers wouldn't be making that much money if YouTube weren't enforcing copyright, as you could just upload their videos and get the ad money. Without copyright, you could also cut off their in-video promotions and add your own, including your own Patreon - so you would get 100% of the money off their work if you can out-promote them.
It's only live performances which are protected by the physical world's strict no-copying laws (the ones that don't allow the same macro object to be in two places at the same time).
So basically, no medium which allows copying of the works in whole or nearly whole has been successfully run with public works.
Fortunately, the computer isn't the one being sued.
Instead it is the humans who use the computer. And those humans maintain their existing rights, even if they use a computer.
Craftsmen don't claim copyright on their artifacts. Furniture designs were widely copied; but Chippendale did alright for himself. Gardeners at stately homes didn't rely on copyright. Vergil, Plato and Aristotle managed OK without copyright. People made a living composing music, songs and poetry before the idea of copyright was invented. Truck-drivers make a living; driving a truck is hardly a performance art. Labourers and factory workers get by successfully. Accountants and legal advocates get rich without copyright.
None of these trades amounts to "performance arts".
We've always been in that situation. Computers made the copying, transmission and processing of information trivial since the day they were invented. They changed the world forever.
It's the intellectual property industry that keeps denying reality since it's such an existential threat to them. They think they actually own those bits. They think they can own numbers. It's time to let go of such insane notions but they refuse to let it go.
I contribute to Wikipedia, and I don't consider my contributions to be "charity"; I contribute because I enjoy it. Even in the age of printing presses, copyright law was widely ignored, well into the 20th century. The USA didn't join the Berne Convention until 1989 (and they promptly went mad with copyright).
Yes, there's only one Wikipedia; but there are lots of copies, and lots of similar efforts. Yes, there's one Wikipedia, like there's one Mona Lisa. There are lots of things of which there's only one; in that sense, Wikipedia isn't remotely unique.
Does your personal satisfaction pay the server bills too?
Also, craftsmen rely on the fact that the part of their work that can't be easily copied, the physical artifact they produce, is most of the value (plus they rely on trademark laws and design patents quite often). Similarly for gardeners. The ancient Greek writers were again paid for performance, typically as teachers. Literature was once quite a performative act. And again, at that time, physical copies of writings were greatly valuable artifacts, not that much different from the value of the writing itself, since copying large texts was so hard.
Similarly, the work of drivers, labourers, factory workers, accountants is valuable in itself and very hard or impossible to copy (again, the physical world is the ultimate copyright protection). The output of lawyers is in fact sometimes copyrighted, but even when it's not, it's not applicable to others' cases, so copies of it are not valuable: no one is making a business that replaces lawyers by re-distributing affidavits.
Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.
At the very beginning, yes, it was "very specific" hardware; it was Burroughs hardware, which used Burroughs processors. But that was before microprocessors, and all hardware was "very specific".
> (plus they rely on trademark laws and design patents quite often)
Craftsmen and labourers were earning a living long before anyone had the idea of a "trademark", still less a "design patent".
> The output of lawyers is in fact sometimes copyrighted
You're right. That's why I didn't say "lawyers", I said "legal advocates". Those are people who speak on your behalf in courts of law, not scribes writing contracts. Anyway, the ancient Greeks and Romans had written laws, contracts and so on; they managed without trademarks and copyrights.
There's a tendency among some people to take the nostrums of economists about the aggregate behaviour of populations as if they described human nature, and to then go on and conclude that, because human behaviour in aggregate can be understood in terms of economic incentives, an individual human can only be motivated economically. I find that an impoverished and shallow outlook, and I think I'm happier for not sharing it.
I never made the claim that paying server bills would produce great content.
I never made the claim “an individual human can only be motivated economically.”
Your strategy for personal happiness is unrelated to what actually works in the real world at scale.
In hindsight, China wasn’t diligent in the enforcement of IP violations. However, it’s clear foreign presences and investment grew substantially in China during the early 90s upon the belief IP would be protected, or at the very least there would be recourse for violations.
No, they're not. This is The New York Times (a corporation) vs OpenAI and Microsoft (two more corporations).
OpenAI is run by humans as well though.
So the same argument applies.
Those humans have fair use rights as well.
Then I am not mistaken: the company was initially selling hardware, with the software being just a value add as you say (no copyright: no interest in trying to sell, exactly my point). Then, you were being paid for building software that (a) was probably not being made public anyway, and (b) would not have been of interest to others even if it were.
Even so, if someone came to your client and offered to take on the software maintenance for a much lower price, you might have lost your client entirely. This has very much happened to contractors in the past.
And my point is you couldn't have a Microsoft or Adobe or possibly even RedHat if you didn't have copyright protecting their business. So, you'd probably not have virtually any kind of consumer software.
We didn't charge maintenance for this software. We would write it to close the sale of a computer. It was treated as "cost of sale". I'm sure it was cheaper (to us) than the various discounts and kickbacks that happened in big mainframe deals.
As far as Microsoft and Adobe is concerned, I wouldn't regard it as a misfortune if they had never existed. I'm not convinced that RedHat's existence is contingent on copyright.