Robots.txt has failed as a system; if it hadn't, we wouldn't have captchas or Cloudflare.
In the age of AI we need to better understand where copyright applies to it, and potentially need reform of copyright to align legislation with what the public wants. We need test cases.
The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...
In many ways an ai.txt would be worse than doing nothing as it's a meaningless veneer that would be ignored, but pointed to as the answer.
I like the idea of "ai.txt" but those who eat resources rarely listen to ToS. Frankly, I serve 503s to all identifiable bots, unless they are on my explicit allow list.
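A minimal sketch of the "503 unless allow-listed" policy described above, assuming bots are identified only by their User-Agent string. The allow-list entry, the keyword pattern, and the function name `status_for` are all hypothetical; a real setup would more likely live in the web server or CDN configuration and use better signals than the User-Agent header alone.

```python
import re

# Hypothetical allow list: lowercase substrings of User-Agent strings that are welcome.
ALLOWED_BOTS = {"examplesearchbot"}

# Assumed heuristic for "identifiable" bots: common self-describing keywords.
BOT_PATTERN = re.compile(r"bot|crawler|spider|scraper", re.IGNORECASE)


def status_for(user_agent: str) -> int:
    """Return the HTTP status to serve for a given User-Agent header."""
    ua = (user_agent or "").lower()
    # Explicitly allowed bots get normal service.
    if any(name in ua for name in ALLOWED_BOTS):
        return 200
    # Any other client that identifies itself as a bot gets a 503.
    if BOT_PATTERN.search(ua):
        return 503
    # Everything else is treated as a regular visitor.
    return 200
```

As the replies point out, this only catches bots that identify themselves; a scraper that sends a browser-like User-Agent passes straight through.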
It doesn't work for bad actors, but then again, nothing really does.
Failing to solve every problem does not mean a solution is a failure.
From sunscreen to seatbelts, the world is full of great solutions that occasionally fail due to statistics and large numbers.
AI is being used to do copyright laundering, at the same time "we", the people who can't afford to run our own AI, are still subject to absurd rules that AI owners get to ignore, apparently.
While I’m sure others besides you share this opinion, I don’t think it’s as uniform as the more common “shorten/rationalize copyright terms and fair use” crowd “we.”
I consider myself a knowledge worker and a pretty staunch proponent of FLOSS, and I am perfectly fine with training AI on everything publicly available. While I create stuff, I don’t make a living off selling particular copies of things I make, so my self-preservation bias isn’t kicking in as much as it would for someone who does want to sell items of their work.
But I also made some pretty explicit choices in the 90s based on where I thought IP would go so I was never in a position where I had to sell copies to survive. My decision was more pragmatic first and philosophical second.
I think someone entering the workforce now probably wants to align their livelihood with AI training on everything and not go against that. Even if US/Euro law limits training, there’s no way all other countries are going to, so it’s going to happen. And I don’t think it’s worth locking down the world to try to stop AIs from training on text, images, etc.
If you "violate" a robots.txt the server administrator can choose to block your bot (if they can fingerprint it) or IP (if it's static).
With an ai.txt there is no potential downside to violating it - unless we get new legislation enforcing its legal standing. The nature of ML models is that it's opaque exactly what content they were trained on, so there is no obvious retaliation or retribution.
I don't think that's what OP is envisioning based on their post!
Serving more than the minimum wastes resources. Worse yet, a better solution would cost my time.
"Sending errors just incentivizes bot owners to fix the identifiable parts"
Sure, someone could make or configure their scraper perfectly. "Perfect" is now the table stakes though.
Edit:
My solution strives to make circumvention disproportionately expensive. I want 10x on my time.
The purpose OP is suggesting in the submission is the opposite, help AI crawlers to understand what the page/website is about without actually having to infer the purpose from the content itself.
That depends what you expect from it. For the purpose of limiting crawlers, at least the major search engines respect it.
I don't see the OP saying anything about "ai.txt" being for that? They're advocating it as a way that AIs could use fewer tokens to understand what a site is about.
(Which I also don't think is a good idea, since we already have lots of ways of including structured metadata in pages, but the main problem is not that crawlers would ignore it.)
At least in my country (Germany), respecting robots.txt is a legal requirement for data mining. See German Copyright Code, section 44b: https://www.gesetze-im-internet.de/urhg/__44b.html
(IANAL)
The only IP that will be allowed to be stolen is that of other common people.
There is something to be said, though, for OP's point that it's actually better to do nothing than to have an ai.txt, because it can give a false sense of security, which is obviously not what you want.
Nah. It'll just make them fake their identity so it is harder to tell the traffic is from a bot.
And if you feel like rolling out the "welcome friend!" doormat to a particular training data crawler, you are free to dedicate as detailed a robots.txt block as you like to its user agent header of choice. No new conventions needed, everything is already in place.
This gross generalization of other people's views on important issues is really offensive.
My view is that the Copyright Act of 1976 had it about right when they established the duration of copyright. My view is that members of Congress were handsomely rewarded by a specific corporation to carve out special exceptions to this law because they wanted larger profits. "We" didn't call the Copyright Term Extension Act of 1998 the "Mickey Mouse Act" for nothing. It's also no coincidence that Disney is now the largest media company in the world.
Reducing copyright term extension has everything to do with restoring competition and creativity to our economy, and reversing corruption that borders on white collar crime. It has nothing to do with AI. Don't recruit me into some bullshit argument that rewrites history and entrenches Disney's ill-gotten monopoly.
Companies that can leverage this new wave of AI will have, in reality, 1000x the advantage that you believe Disney has.
In general, without a fair use exemption or permission from robots.txt, saving a copy of a website’s content to your own servers is copyright infringement.
Purely factual information like Amazon’s prices isn’t protected by copyright, but if you want to save artwork or source files to train AI, that’s a copyright issue even before you get into the possibility of your AI being considered a derivative work.
There's this little thing called brand value. Disney has one of the most valuable brands in the world. Forbes estimated it at being worth about $60 billion as I recall.
That brand was built heavily over many decades on IP that dates back to the 1920s, such as the most recognizable Disney character, Mickey Mouse. They manipulated the law to enhance the value of that IP and thereby gained an edge over their competitors. That's a big part of why they now enjoy such a dominant position.
None of this is especially controversial (you will get a very different spin from Disney of course).
If you want to comment about how business works you should read history and learn how business works first. AI luminary that you are, if you choose to remain ignorant then I guess this whole cycle will happen again with AI.
But AI does not change anything there. The problem of being sued into oblivion despite being right exists there even without it.
In places where defending does not cost money, this works out in favor of the individuals.
There is a massive number of amazing stories based on ancient myths because they are one of the few large corpora that aren't copyrighted. Once you see it in media you can't unsee it. The only space where that kind of creativity can thrive anymore is fan fiction, which lives in a weird limbo where it's illegal but the copyright owners don't care. And when you want to bring any of it to the mainstream you have to hide it: all of Ali Hazelwood's books are reworked fanfics because she can't use the actual characters that inspired her. Her most famous book, "The Love Hypothesis", is a Reylo fic.
Go check out https://archiveofourown.org/media and see how many works are owned by a few large corporations.
It has felt on HN and elsewhere that the prevailing attitude to copyright has been these two, somewhat contradictory, things. That's what I was trying to highlight with my phrasing of "we", which was also not meant to include myself but be a nod to the way a vocal group tries to steer and dominate the conversation.
Both debates are important to have, I don't know the answers.
Robots.txt files have served the simple purpose of directing bots like Google to the different parts of your website since the beginning of internet time.
They still serve the same purpose, they tell bots where to go, and most importantly, they tell bots how to find your site map.
Robots.txt is not there to prevent malicious crawlers from accessing pages as you have suggested.
The robots.txt file acts simply like a garden gate. The good and honest people will honor the gate, while the more malicious might ignore it and hop the fence or something.
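To make the "garden gate" concrete, here is a small sketch of what a well-behaved crawler does with a robots.txt file, using Python's standard `urllib.robotparser`. The file contents, the `example.com` URLs, and the bot name `ExampleBot` are made up for illustration. Nothing here enforces anything; it only shows what an honest bot would read and conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one gated directory plus a sitemap pointer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def read_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = read_robots(ROBOTS_TXT)

# Where the site says its sitemap lives (site_maps() needs Python 3.8+).
sitemaps = parser.site_maps()

# What a polite crawler would conclude about two URLs.
public_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/page.html")

print(sitemaps, public_ok, private_ok)
```

A crawler that never calls `can_fetch` is the one hopping the fence; the file itself cannot stop it.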
There's a phrase I like which describes what you're doing. It's "vaguely gesturing at imagined hypocrisy".
You don't think it's them being allowed to buy Marvel, Pixar, Lucasfilm? Is creativity ruined because I can't make a Mickey Mouse cartoon or t-shirt? Does the world need Luke Skywalker coming from any individual studio?
People are free to make the Little Mermaid, Beauty and the Beast, Hunchback of Notre Dame, Aladdin, etc. and there's nothing out there that stops them.
I've got no love for giant corporations but I see it a lot less about copyright than massive corporation gobbling up more corporations. There's no shortage of creativity out there if you look for it.
Can you explain your line of thinking here? How does the ability to use another company’s intellectual property restore creativity? It just seems like a path to allow bootlegging.
Extending robots.txt to direct AI would have a similar effect: not sufficient, but useful (if for no other reason than to make it easy to distinguish reputable AI projects from ones that feel like they own the Internet to do with as they please).
Up until the point when some person or entity with deep pockets puts a clear license / terms of use on their site that prohibits ignoring robots.txt, and is willing to sue those who ignore it.
The long timelines stifle new creative works by forcing other, especially smaller, outfits to make sure they don't accidentally run afoul of another copyright from decades ago. This needs capital to either be proactive in searching or to defend a suit that's brought.
Here's a recent article about the battle between the copyright holders of Let's Get It On and Ed Sheeran for Thinking Out Loud. Those two songs are separated by around 40 years. https://www.theguardian.com/music/2023/may/07/ed-sheeran-cop...
To me it is pretty much the same thing - not a fan of nepo-kids living off of trust funds they didn't earn - but if you are going to fix one problem, you should try to fix all of the almost identical ones at the same time, and not get upset that Disney is still making money off of something they created 100 years ago while not being upset about Kennedys, Rockefellers, and the like still living off the money their great-greats generated a hundred years ago.
A lot of people in this thread seem to be undervaluing those old school Disney characters, yes now Disney is huge and has a much larger portfolio of IP, but in 1998 they were a far bigger percentage of Disney's portfolio than they are now.
You're not wrong that consolidation is a problem. My point is that Congress changed the law in a way that helped Disney and at least partially enabled that consolidation. (In fact, it's fairly rare to come across a monopoly or any heavily entrenched corporation that isn't enabled in some way by government collusion.)
If you shoot someone, take all his money, then build a business with it, you're still a murderer. (Just now you're a rich murderer.)
Thomas Jefferson put it beautifully:
If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it. Its peculiar character, too, is that no one possesses the less, because every other possesses the whole of it. He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me. That ideas should freely spread from one to another over the globe, for the moral and mutual instruction of man, and improvement of his condition, seems to have been peculiarly and benevolently designed by nature, when she made them, like fire, expansible over all space, without lessening their density in any point, and like the air in which we breathe, move, and have our physical being, incapable of confinement or exclusive appropriation. Inventions then cannot, in nature, be a subject of property.
Which is good design: don't pretend to solve problems you can't.
Meanwhile, now that the laws are inconvenient for them, tech companies are straight up ignoring labeling their training data to respect IP law. Labeling the data would be expensive, thereby eroding profits. The loss of usable data would also harm the efficacy of their models, and the time spent classifying the data will hamper their iteration time.
The ideas are only dissonant if you are looking at the trees (copyright term, DMCA, right to repair, etc.) and not the forest: which is a class struggle between a few thousand billionaires versus the rest of humanity.
In other words, there's no need to create an ai.txt when the robots.txt standard can just be extended.
Do I feel I should have control over what I create? I make hammers for a living. I sell them for $10. I don't expect any control over what people do with "my" hammers once I sell them. I don't even expect to stop my neighbor from buying one, teaching herself to build hammers, and then manufacturing and selling identical ones for $9. Do you?
(To anticipate the rest of this tired conversation, the temporary monopoly tradeoff ("securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries") is facially reasonable. But it's important to recognize that the "shouldn't" and "feel" in your questions are based on a very recent recharacterization of these temporary monopolies as "intellectual property," which is probably the most financially successful propaganda term ever devised. Start with "temporary monopoly" instead, and then the better rhetorical question for you to be asking is "when should Disney's temporary monopoly end?")
We should accept that people can get offended by anything and, because of this, just demote the concept.
If, let's say, Star Wars falls out of copyright tomorrow, economically that has two effects. One, Disney loses a ton of future revenue. Two, countless other people create derivatives of Star Wars, and they make money from those. Competition is increased.
So the expiration of a copyright results in a sharing of the wealth. The wealth generating potential along with the creative potential is passed along to all members of society. Our culture becomes richer and deeper. A great example of this is all the works that build on the mythos created by HP Lovecraft, one of the last great ones created before Congress started indefinitely extending copyright. Lovecraft wrote great literature and some of the authors that built on his world are fantastic as well, I'm sure they've come up with countless ideas he never considered. But as long as Congress keeps on extending copyright, nothing we create today will ever become like that.
There is of course an important question about what is fair and how long a copyright should last. Most people these days agree that it should last for at least the author's lifetime, maybe long enough to benefit their kids and grandkids as well. But the status quo is basically permanent copyright which prevents substantial creative and economic benefits to society.
Right now, we have FOSS organizations that will help you in lawsuits against companies that don't follow licenses. With "AI" in the picture, companies can launder your code with "plausible" deniability. [1]
[1]: https://matthewbutterick.com/chron/will-ai-obliterate-the-ru...
It's like the EU doesn't understand that bad law has a negative value.
Copyright has been the most powerful tool in any media company's toolbox when it comes to consolidating power and IP and rolling into a larger and larger ball of what we call culture.
The #1 issue with copyright today in my opinion is that if we keep on extending it forever, it will forever entrench the wealth and power of a small number of companies that hold the largest portfolios of IP. I think this is also a huge issue for AI, maybe the biggest issue, because at the end of the day an AI is really just another copyrighted work. It is not the anthropomorphized thing that countless people are acting like it is, it's a work. Change copyright and you change the nature of future AI works.
Long copyright terms encourage copyright holders to milk a single work for the length of the copyright (90+ years) and therefore discourage the creation of something new. They also encourage people to obtain copyrights to leverage them for profit, rather than making anything at all. A child of an artist can spend their entire life supported by their parent's copyright, and never has to make anything unique for as long as they live.
How is any of this good for creativity?
The people who work for the company collect rent on things they didn't make.
Normal property ownership is something we use to manage scarcity that already exists—that there is only one of something, and we have to decide where it will go and who will be able to decide how it is used. Intellectual property, by contrast, creates artificial scarcity by means of a government-enforced monopoly (in the case of copyright, the monopoly is on the right to produce a copy of a work).
It is unfortunate (and perhaps not accidental) that we settled on the term "intellectual property" as opposed to something more descriptive like "intellectual monopoly." "Intellectual property" encourages equating such monopolies with normal property, a mistake that tends to muddle debates on the subject.
I don't know that this is true for the US. As far back as I can remember, there have been questions about whether a robots.txt file means you don't have permission to engage in those activities. The CFAA is one law that has repeatedly come up. See for example https://www.natlawreview.com/article/doj-revises-policy-cfaa...
It might be the case that there is nothing there legally, but I don't think I'd describe the actions of search engines as being driven by a moral imperative.
So that is still something possible to do in roughly 20 years.
Here's one key bit from the OP: - - - - -
But the lawsuits have been where he’s really highlighted the absurdity of modern copyright law. After winning one of the lawsuits a year ago, he put out a heartfelt statement on how ridiculous the whole thing was. A key part:
There’s only so many notes and very few chords used in pop music. Coincidence is bound to happen if 60,000 songs are being released every day on Spotify—that’s 22 million songs a year—and there’s only 12 notes that are available.
In the aftermath of this, Sheeran has said that he’s now filming all of his recent songwriting sessions, just in case he needs to provide evidence that he and his songwriting partners came up with a song on their own, which is depressing in its own right.
Right? There were even competitors back then. People all but forgot the Looney Tunes.
Anytime a business is caught using that content, they can't claim that they used publicly available information, because the ai.txt specifically signalled to everyone in a clear and unambiguous manner that the copyright granted by viewing the page is withheld from AI training.
Are these mutually exclusive? If you couldn't make Avengers movie Thanos memes but all the 90s X-Men and Spiderman content was a free for all, I think a lot of people would take that trade off.
https://www.robotstxt.org/faq/legal.html
If an "ai.txt" were to exist, I hope it's a signal for opt-in rather than opt-out. Whereas "robots.txt" being an explicit signal for opt-out might be useful because people who build public websites generally want their websites to be discovered, it seems unlikely that training unknown AI would be a use case that content creators had in mind, considering that most existing content predates current AI systems.
Individual high value IP was always much less accessible (not available as a webpage on the internet). Gen AI/LLMs with the internet scale data is too powerful and maybe easier to monetize.
IMO I would rather a structure that:
- Guarantees creators (and their descendants) some number of years of financial benefit / veto (30 seems fine!) - i.e. pay me what I want or you can't use this creative work.
- Separately grant creators the ability to veto "official" projects that use their creative output in their lifetimes.
IMO, it seems like there's a productive "middle ground" between total control and anything goes. After the 30 year benefit expired, you couldn't sue for damages - just costs & to stop use.
Information wants you to stop anthropomorphizing it.
Yes there certainly is[1]. The robots.txt clearly specifies authorized use and violating it exceeds that authorization. Now granted good luck getting the FBI to doorkick their friends at Google and other politically connected tech companies, but as the law is written crawlers need to honor the site owner's robots.txt.
[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act
There are many ways to restrict access. Use one of them. But if you respond to an anonymous http request with content then it shouldn’t matter if it’s a robot looking at it or a human (or a man or a woman or whatever).
I think this both for simplicity and because I foresee a future where human consciousness is simulated and basically an AI. I don’t want to have rules that biological humans can view and digital humans can’t.
The "we" that has been calling for shorter terms is no more a gross generalization than the "we" that is calling for more protection against AI use of stuff.
The world outside of HN-and-similar has been much less anti-copyright than the world in here. More "neutral" seems to be dominant - we're not extending it anymore; we're not shrinking it either. And currently generally more panicked about AI taking away their jobs and rendering their skills and creativity useless.
The original post was a very fair summary of how there are now two ground-level movements competing that there weren't two years ago.
But we have selected an economic system that depends on ownership to drive exchange in a market, so... that's why.
They are trying to use it as a form of extended metadata for training AIs. Essentially, "ah I see you're training using my website! Here's some extra info about it: [...]"
The nature of information is to dissolve into entropy.
Specifically, copyright and patent law in the USA trace back to Article I, Section 8, Clause 8 of the Constitution, which gives Congress the power "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries".
Three, the derivatives are made and Disney starts marketing "Disney's Star Wars", which continues to be the high-demand (and high-value) version. The situation is unchanged.
For example, you can currently buy The Little Mermaid in non-Disney form[1], but Disney's version is what most people want.
[1] - https://www.amazon.com/s?k=little+mermaid+Hans+Christian+And...
That's the same thing.
No one can use my stuff..........(unless you pay me royalties).
It absolutely is.
Doing it at all requires time & attentive focus, which is a finite resource for anybody mortal, and moreover a resource that's scarce and has to be spent in multiple places.
Doing it well requires significant investment in practice and training, often years of it, maybe even decades in order to develop certain levels of expressive fluency.
As with any issue of scarcity, economics comes in. If you want this activity supported, one good way of doing it is enabling the investment of time. Copyright does this by giving people an economic/legal claim on how copies of their work are distributed.
Paying for copies has the usual market merits -- the economic reward and signals of value are proportional to copies acquired. There are other ways of course, common ones brought up here are patronage and merchandising, but they lose the market merits, and both are basically another way of saying "nobody should have to pay for the value in your work directly," and merchandising is even worse in that it's basically saying "yeah, you'll just need another job to support yourself while you're doing this thing", which is time taken away from investment in the creative endeavor, so you'll get less of the actual endeavor.
There've always been solid human arguments for sustaining copyright legally. The balance is the tricky part.
On one hand we had a period where terms got too long, and some of the really aggressive legal enforcement from 20 years ago, before stakeholders actually figured out how to get into digital markets, was entitled and useless. The pendulum also swung the other way with things like buffet streaming services essentially offering an economic bargain for creators with a sliver of compensatory difference from piracy but with none of piracy's actual benefits (people who simply pirate know they're not participating in a relationship of economic support with creators and might be persuaded to; someone who uses Spotify is under the illusion there's something fully legit on that front).
But the fundamental copyright bargain -- creators can recoup investments of time and effort in proportion to how popular engagement with their work is -- has always made sense.
> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...
Both these things can be true:
(1) Using a work as training data for AI is a very novel use, it's entirely plausible there should be novel considerations and rights to go with it.
(2) The incentives & benefits of copyrights have diminishing returns the longer the horizons are, while the cost in terms of social inaccessibility only increases. Where that's balanced out precisely is a debatable question, but something longer than a human lifespan is probably on the wrong side.
The trouble with IP is that there are lots of influential people that very much would like IP to be useful in creating welfare. Unfortunately the evidence for that is surprisingly scarce. For discussion, see e.g. Boldrin & Levine.
It would also be useful to distinguish training crawlers from indexing crawlers. Maybe I'm publishing personal content. It's useful for me to have it indexed for search, but I don't want an AI to be able to simulate me or my style.
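The existing robots.txt format can already express this split, provided training and indexing crawlers announce themselves under different user agents and choose to honor the file. Below is a sketch using Python's standard `urllib.robotparser`; the crawler names `ExampleSearchBot` and `ExampleTrainingBot`, the paths, and the URLs are hypothetical placeholders, not real products.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow indexing, refuse training, keep drafts private from everyone else.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

POST_URL = "https://example.com/posts/1"
DRAFT_URL = "https://example.com/drafts/1"

# The indexing crawler may fetch the post; the training crawler may not.
search_allowed = parser.can_fetch("ExampleSearchBot", POST_URL)
training_allowed = parser.can_fetch("ExampleTrainingBot", POST_URL)

# An unnamed crawler falls through to the wildcard block.
other_post = parser.can_fetch("SomeOtherBot", POST_URL)
other_draft = parser.can_fetch("SomeOtherBot", DRAFT_URL)

print(search_allowed, training_allowed, other_post, other_draft)
```

This only works when the same operator does not reuse one user agent for both purposes, and, as others in the thread note, it relies entirely on the crawler's good faith.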
You can certainly pay the rights holder to use their property! Still! You could do it even without copyright I suppose. However, I think a space where it costs time and money for the rights holder to try to stop use and they won't get paid for it is super useful.
Consider this in the case of software as well - you get ~30 years of benefit from your work, but you can refuse to allow companies to incorporate it into their products as long as you live. Whichever companies you want! You can also not do that.
Expression has no value in today's digital world.
Creation has value but using expression to exchange for that value is difficult, requiring limits on expression in order for the system to work.
It shouldn't. Or, well, it should, but it should be the one and only thing taxed: https://en.wikipedia.org/wiki/Georgism
The ownership, with heavy taxes on that ownership, pushes towards making sure people benefit from the land.
IP law reasonably does. See: https://trademarks.justia.com/852/28/the-little-mermaid-8522...
I don’t know who “we” are, but I absolutely don't want “stricter copyright law when it comes to AI”. More clarity? Sure. Narrowing fair use? No fucking way.
I would be stealing if I prevented you from making money from it.
Technically life + 70 years - or 1 million years for that matter - is "limited" - but I imagine 14+14 is probably closer to what they had in mind.
"goods are scarce because there are not enough resources to produce all the goods that people want to consume" (quoted at [1]).
Physical books are intrinsically scarce because they require physical resources to make and distribute copies. Libraries are often limited by physical shelf space.
Ebooks are not intrinsically scarce because there are enough resources to enable anyone on the internet to download any one of millions of ebooks at close to zero marginal cost, with minimal physical space requirements per book. Archive.org and Z-Library are examples of this.
Consider also free goods:
"Examples of free goods are ideas and works that are reproducible at zero cost, or almost zero cost. For example, if someone invents a new device, many people could copy this invention, with no danger of this "resource" running out."[2]
i.e. enforce egregious IP violations while criminalizing trolls.
For extremely loose values of "we", perhaps - I didn't select it, and I would vote "no" if the idea were proposed...
Then once smaller competitors are out of business, raise prices.
Of course, force can go into it, such as when a big company sues a smaller company with a frivolous lawsuit that the smaller company can't afford to fight. Then the smaller company goes out of business, and the big company can use their ideas free.
It's pretty mysterious that you think you need to introduce this to the conversation at this point given how prominently scarcity dynamics figure into the comment you're replying to.
> Physical books are intrinsically scarce
Once their production was industrialized with printing press tech, copies of books weren't scarce, they were actually revolutionarily cheap.
The copyright bargain isn't born out of ignorance of how changes in that direction affect the overall dynamic; it's born out of a deep understanding of what remains scarce and risky and difficult to compensate for when the marginal cost of producing copies drops drastically, and what kind of claims might help.
Authorship may be scarce - costly and resource intensive (LLMs notwithstanding) as you describe, while copying and distribution of intangible goods like ideas or digital media is essentially free and unlimited, as I suspect PP was trying to say.
As you correctly note, the constitutional copyright bargain permits a limited time monopoly in return for (hopefully) advancing "the progress of science and the useful arts."
"which people in particular are benefitting the most" seems to be the perennial question.
Of course in reality things are usually more complex and we are talking about two different opinions A and B that aren't even inherently incompatible; it's just that some motivations for A would lead to ¬B and vice versa.
But in this particular case I think the flaw is in your assumption that the majority wants stricter copyright law for AI rather than wanting the same copyright law that humans are beholden to to also apply to AI, whether that law is the current may-as-well-be-perpetual monopoly or 0 copyright or anything in between.
Practically speaking, that's the only effective solution. I just think that it's a shame that's necessary. It would be better for everyone if there wasn't a disincentive to making works publicly available.
> I don’t want to have rules that biological humans can view and digital humans can’t.
This is a point we disagree on.
And "digital humans"? I would argue that such a thing can't exist, if you mean "human" in any way other than rough analogy.
"Property is theft" is not a new idea, makes a lot of sense. Unless you have a lot of it, and then those [censored] can [censored] right off.
Copyright that doesn't expire would make "a whole lot of cents".
(I agree with you but, the ownership is the corrupting factor.)