I started to add an ai.txt to my projects. The file is just a basic text file with some useful info about the website, like what it is about, when it was published, the author, etc.
It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers: instead of using thousands of tokens to work out what your website is about, they can do it with just a few hundred.
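To make the idea concrete, here is a minimal sketch of what such a file and a reader for it might look like. There is no ai.txt spec, so the `key: value` layout and the field names (`title`, `description`, `author`, `published`) are assumptions invented for this example, not anything a crawler is known to support.

```python
# Hypothetical ai.txt content; no standard exists and the field names are invented.
SAMPLE_AI_TXT = """\
# ai.txt (illustrative only)
title: Example Project
description: A small tool for converting CSV files to JSON.
author: Jane Doe
published: 2023-05-01
"""


def parse_ai_txt(text: str) -> dict[str, str]:
    """Parse simple 'key: value' lines, skipping blank lines and '#' comments."""
    fields: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that have no colon
        fields[key.strip().lower()] = value.strip()
    return fields


info = parse_ai_txt(SAMPLE_AI_TXT)
print(info["title"], "by", info["author"])
```

A flat text format like this is cheap to read, but nothing in it is verifiable: the reader has to take the site's word for every field.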
https://www.iana.org/assignments/well-known-uris/well-known-...
Basically, assuming that you have a spec, I think it amounts to filing a PR or discussing it on a mailing list.
Robots.txt has failed as a system; if it hadn't, we wouldn't have captchas or Cloudflare.
In the age of AI we need to better understand where copyright applies to it, and potentially need reform of copyright to align legislation with what the public wants. We need test cases.
The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...
In many ways an ai.txt would be worse than doing nothing as it's a meaningless veneer that would be ignored, but pointed to as the answer.
AI.txt doesn't have this feedback to the AI to improve it. Also it seems likely users might have reason to lie.
I like the idea of "ai.txt" but those who eat resources rarely listen to ToS. Frankly, I serve 503s to all identifiable bots, unless they are on my explicit allow list.
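A rough sketch of that kind of policy, assuming bots are identified purely by substrings in the User-Agent header. The marker and allow lists below are placeholders made up for illustration, not the commenter's actual configuration.

```python
# Assumed, illustrative lists; a real deployment would maintain its own.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "python-requests")
ALLOWED_BOTS = ("googlebot", "bingbot")


def status_for(user_agent: str) -> int:
    """Return 200 for browsers and allow-listed bots, 503 for other identifiable bots."""
    ua = user_agent.lower()
    if any(name in ua for name in ALLOWED_BOTS):
        return 200  # explicitly allowed crawler
    if any(marker in ua for marker in BOT_MARKERS):
        return 503  # identifiable bot that is not on the allow list
    return 200  # looks like a regular browser


print(status_for("SomeRandomBot/1.0"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/113.0"))
```

This only catches bots that identify themselves; a client that sends a browser-like User-Agent passes straight through.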
It doesn't work for bad actors, but then again, nothing really does.
Failing to solve every problem does not mean a solution is a failure.
From sunscreen to seatbelts, the world is full of great solutions that occasionally fail due to statistics and large numbers.
AI is being used to do copyright laundering, at the same time "we", the people who can't afford to run our own AI, are still subject to absurd rules that AI owners get to ignore, apparently.
While I’m sure others than you share this opinion, I don’t think it’s as uniform as the more common “shorten/rationalize copyright terms and fair use” crowd “we.”
I consider myself a knowledge worker and a pretty staunch proponent of FLOSS, and am perfectly fine with training AI on everything publicly available. While I create stuff, I don’t make a living off selling particular copies of things I make, so my self-preservation bias isn’t kicking in as much as it would for someone who does want to sell items of their work.
But I also made some pretty explicit choices in the 90s based on where I thought IP would go so I was never in a position where I had to sell copies to survive. My decision was more pragmatic first and philosophical second.
I think someone entering the workforce now probably wants to align their livelihood with AI training on everything and not go against that. Even if US/Euro law limits training, there’s no way all other countries are going to, so it’s going to happen. And I don’t think it’s worth locking down the world to try to stop AIs from training on text, images, etc.
https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...
https://developers.google.com/search/blog/2013/08/relauthor-....
If you "violate" a robots.txt the server administrator can choose to block your bot (if they can fingerprint it) or IP (if it's static).
With an ai.txt there is no potential downside to violating it, unless we get new legislation enforcing its legal standing. The nature of ML models is that it's opaque what content exactly they were trained on, so there is no obvious retaliation or retribution.
[0]: https://developers.google.com/search/docs/appearance/structu...
I don't think that's what OP is envisioning based on their post!
Serving more than the minimum wastes resources. Worse yet, a better solution would cost my time.
"Sending errors just incentivizes bot owners to fix the identifiable parts"
Sure, someone could make or configure their scraper perfectly. "Perfect" is now the table stakes though.
Edit:
My solution strives to make circumvention disproportionately expensive. I want 10x on my time.
The purpose OP is suggesting in the submission is the opposite, help AI crawlers to understand what the page/website is about without actually having to infer the purpose from the content itself.
Putting a good comment at the top of robots.txt would be just as good as any other solution, given it could serve as a type of prompt template for processing the data on the site it represents.
That depends what you expect from it. For the purpose of limiting crawlers, at least the major search engines respect it.
Aka, an ai.txt file that disallows AI from training on or using your data, similar to robots.txt (but for cases when you still want to be crawled, just not extrapolated).
I don't see the OP saying anything about "ai.txt" being for that? They're advocating it as a way that AIs could use fewer tokens to understand what a site is about.
(Which I also don't think is a good idea, since we already have lots of ways of including structured metadata in pages, but the main problem is not that crawlers would ignore it.)
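As one example of the in-page metadata that already exists, a crawler can read plain `<meta>` tags with nothing but the standard library. This is only a sketch: the sample page is made up, and real crawlers would likely also look at other formats (Open Graph, JSON-LD and so on) that this ignores.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Made-up page used only for illustration.
PAGE = """<html><head>
<title>Example</title>
<meta name="description" content="A blog about gardening.">
<meta name="author" content="Jane Doe">
<meta name="robots" content="noindex">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(PAGE)
print(collector.meta)
```

The description and author here cover much of what the proposed ai.txt would carry, without a second request to a separate file.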
At least in my country (Germany), respecting robots.txt is a legal requirement for data mining. See German Copyright Code, section 44b: https://www.gesetze-im-internet.de/urhg/__44b.html
(IANAL)
Aren't there already things in place for that info (e.g. meta tags?)
Regardless, I do agree that something like a robots.txt for AI can be very useful. I'd like my website to be excluded from most AI projects and some kind of standardized way to communicate this preference would be nice, although I realize most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations. It's the idea that matters, really.
If I can use an ai.txt to convince the crawlers that my website contains illegal hardcore terrorist pornography to get it excluded from the datasets, that's another way to accomplish this I suppose.
The only IP that will be allowed to be stolen is that of other common people.
There is something to be said though to OP's point where it's actually better to do nothing than an AI.txt because it can give a false sense of security, which is obviously not what you want.
Nah. It'll just make them fake their identity so it is harder to tell the traffic is from a bot.
And if you feel like rolling out the "welcome friend!" doormat to a particular training data crawler, you are free to dedicate as detailed a robots.txt block as you like to its user agent header of choice. No new conventions needed, everything is already in place.
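Here is a small sketch of how a per-crawler robots.txt block behaves, checked with Python's standard `urllib.robotparser`. The user agent `ExampleAIBot` and the paths are made up for illustration; which crawlers actually honor such a block is a separate question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is a made-up user agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The dedicated block applies to the named bot; everyone else gets the "*" rules.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", "/articles/1")
other_bot_allowed = parser.can_fetch("SomeSearchBot", "/articles/1")
other_bot_private = parser.can_fetch("SomeSearchBot", "/private/x")

print(ai_bot_allowed, other_bot_allowed, other_bot_private)
```

In this sketch the named bot is shut out of the whole site while other crawlers are only kept out of `/private/`, all with the existing syntax.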
https://developers.google.com/search/docs/crawling-indexing/...
This gross generalization of other people's views on important issues is really offensive.
My view is that the Copyright Act of 1976 had it about right when they established the duration of copyright. My view is that members of Congress were handsomely rewarded by a specific corporation to carve out special exceptions to this law because they wanted larger profits. "We" didn't call the Copyright Term Extension Act of 1998 the "Mickey Mouse Act" for nothing. It's also no coincidence that Disney is now the largest media company in the world.
Reducing copyright term extension has everything to do with restoring competition and creativity to our economy, and reversing corruption that borders on white collar crime. It has nothing to do with AI. Don't recruit me into some bullshit argument that rewrites history and entrenches Disney's ill-gotten monopoly.
e.g: https://developers.google.com/search/docs/appearance/structu...
Companies that can leverage this new wave of AI will have, in reality, 1000x the advantage that you believe Disney has.
With the current situation you either assume that everything is not usable, or you just don't care and crawl everything that you can reach.
In general without a fair use exemption or permission from robots.txt saving a copy of a website’s content to your own servers is copyright infringement.
Purely factual information like Amazon’s prices isn’t protected by copyright, but if you want to save artwork or source files to train AI, that’s a copyright issue even before you get into the possibility of your AI being considered a derivative work.
I postulate that robots need at least a single manipulator in the physical realm: Mechanical arm assembling car doors = robot. CNC machine that follows a path = robot. Mechs with chicken legs = robot. Brain in a vat = not a robot... but can be embedded in a robot.
There's this little thing called brand value. Disney has one of the most valuable brands in the world. Forbes estimated it at being worth about $60 billion as I recall.
That brand was built heavily over many decades on IP that dates back to the 1920s, such as the most recognizable Disney character, Mickey Mouse. They manipulated the law to enhance the value of that IP and thereby gained an edge over their competitors. That's a big part of why they now enjoy such a dominant position.
None of this is especially controversial (you will get a very different spin from Disney of course).
If you want to comment about how business works you should read history and learn how business works first. AI luminary that you are, if you choose to remain ignorant then I guess this whole cycle will happen again with AI.
> security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.
Carbon.txt: https://github.com/thegreenwebfoundation/carbon.txt :
> A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.
"Work out how to make it discoverable - well-known, TXT records or root domains" https://github.com/thegreenwebfoundation/carbon.txt/issues/3... re: JSON-LD instead of txt, signed records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js)
SPDX is a standard for specifying software licenses (and now SBOMs Software Bill of Materials, too) https://en.wikipedia.org/wiki/Software_Package_Data_Exchange
It would be transparent to disclose the SBOM in AI.txt or elsewhere.
How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
But AI does not change anything there. The problem of being sued into oblivion despite being right exists there even without it.
In places where defending does not cost money, this works out in favor of the individuals.
There is a massive amount of amazing stories based on ancient myths because it's one of the few large corpora that isn't copyrighted. Once you see it in media you can't unsee it. The only space where that kind of creativity can thrive anymore is fan-fiction, which lives in a weird limbo where it's illegal but the copyright owners don't care. And when you want to bring any of it to the mainstream you have to hide it: all of Ali Hazelwood's books are reworked fanfics because she can't use the actual characters that inspired her. Her most famous book, "The Love Hypothesis", is a Reylo fic.
Go check out https://archiveofourown.org/media and see how many works are owned by a few large corporations.
It has felt on HN and elsewhere that the prevailing attitude to copyright has been these two, somewhat contradictory, things. That's what I was trying to highlight with my phrasing of "we", which was also not meant to include myself but to be a nod to the way a vocal group tries to steer and dominate the conversation.
Both debates are important to have, I don't know the answers.
Robots.txt has served the simple purpose of directing bots like Google to the different parts of your website since the beginning of internet time.
They still serve the same purpose, they tell bots where to go, and most importantly, they tell bots how to find your site map.
Robots.txt is not there to prevent malicious crawlers from accessing pages as you have suggested.
The robots.txt file acts simply like a garden gate. The good and honest people will honor the gate, while the more malicious might ignore it and hop the fence or something.
There's a phrase I like which describes what you're doing. It's "vaguely gesturing at imagined hypocrisy".
How does this differ from what would be useful in humans.txt?
You don't think it's them being allowed to buy Marvel, Pixar, Lucasfilm? Is creativity ruined because I can't make a Mickey Mouse cartoon or t-shirt? Does the world need Luke Skywalker coming from any individual studio?
People are free to make the Little Mermaid, Beauty and the Beast, Hunchback of Notre Dame, Aladdin, etc. and there's nothing out there that stops them.
I've got no love for giant corporations but I see it a lot less about copyright than massive corporation gobbling up more corporations. There's no shortage of creativity out there if you look for it.
access.txt will return an individual access key for the user agent like a session, and the user agent can only crawl using the access key
This would mean that we could standardize session starts with rate limits. Regular user is unlikely to hit the user rate limits, but bots would get rocked by rate limiting.
Great. Now authorized crawlers, bing, google, etc, all use PKI so that they can sign the request to access.txt to get their access key. If the access.txt request is signed with a known crawler the rate limits can be loosened to levels that a crawler will enjoy
This will allow users / browsers to use normal access patterns without any issue, but crawlers will have to request elevated rate limits to perform their tasks. Crawlers and AI alike could be allowed or disallowed by the service owners, which is really what everyone wanted from robots.txt in the first place
One issue I see with this already is that it solidifies the existing search engines as the market leaders
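A toy model of that access.txt flow, to show the moving parts: issue a key per user agent, then rate-limit per key, with a looser limit for known crawlers. Everything here is an assumption for illustration: the class name, the limits, and especially the identity check, which just compares a claimed ID against a set instead of verifying a PKI signature as the proposal describes.

```python
import secrets
import time
from typing import Optional

# Assumed limits (requests per minute); the numbers are placeholders.
DEFAULT_LIMIT = 60
CRAWLER_LIMIT = 6000
WINDOW_SECONDS = 60


class AccessTxt:
    """Toy model of the proposed access.txt: issue keys, then rate-limit per key."""

    def __init__(self, trusted_crawlers: set):
        self.trusted_crawlers = trusted_crawlers
        self.sessions: dict = {}

    def issue_key(self, crawler_id: Optional[str] = None) -> str:
        # A real system would verify a signed request here; this sketch only
        # checks the claimed identity against a set, which is NOT secure.
        limit = CRAWLER_LIMIT if crawler_id in self.trusted_crawlers else DEFAULT_LIMIT
        key = secrets.token_hex(16)
        self.sessions[key] = {"limit": limit, "window_start": time.monotonic(), "count": 0}
        return key

    def allow(self, key: str) -> bool:
        session = self.sessions.get(key)
        if session is None:
            return False  # no valid key, no access
        now = time.monotonic()
        if now - session["window_start"] >= WINDOW_SECONDS:
            # Start a new fixed window and reset the counter.
            session["window_start"], session["count"] = now, 0
        if session["count"] >= session["limit"]:
            return False  # over the limit for this window
        session["count"] += 1
        return True


server = AccessTxt({"examplecrawler"})
user_key = server.issue_key()
results = [server.allow(user_key) for _ in range(DEFAULT_LIMIT + 1)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

Note that nothing in this sketch stops a client from simply requesting many keys, so the limit per key only helps if key issuance itself is constrained somehow.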
As users we're forced to browse the Web with a million agreements that say "by using this site you agree to our Terms", what stops you from saying "by crawling this site to train your AI you agree to share profits with us" or whatever, particularly if you can prove that your data ends up being used?
ai.txt is useful, but I am not sure we have nailed down what it can be used for. One use is to tell AI not to train on the content found within because it could be an AI generation.
Seems like it relies on everyone playing by the rules and only requesting one license per user. Why would a bot developer be incentivized to follow that rule and not just request 1M licenses?
Lexicography tends to be descriptive rather than prescriptive. If enough people use a word to mean a thing, that word means that thing. At least in some contexts. See also "gay", "hacker", etc...
Note that it is possible for a word's meaning to be "reclaimed", but it generally doesn't get that way by some small group of people just shouting "You're doing it wrong!"
Can you explain your line of thinking here? How does the ability to use another company’s intellectual property restore creativity? It just seems like a path to allow bootlegging.
Similarly, extending robots.txt to direct AI would have a similar effect: not sufficient, but useful (if for no other reason than to make it easy to distinguish reputable AI projects from ones that feel like they own the Internet to do with as they please).
Why would the crawler trust you to be accurate instead of just figuring it out for itself?
Besides, they want to hoover up all the data for their training set anyway.
A standard protocol for reputable crawlers to semantically understand some high-level page navigation rules.
Actual, useful crawling (i.e. to build search indices) would be much messier and more useless without most interesting sites putting up meaningful robots.txt guide-rails. Look at facebook.com/robots.txt and consider how much crap both Facebook and indexers would have to deal with lacking that information.
Up until the point when some person or entity with deep pockets puts a clear license / terms of use on their site that prohibits ignoring robots.txt, and is willing to sue those who ignore it.
Kirk: Everything Harry tells you is a lie. Remember that. Everything Harry tells you is a lie.
Harry: Listen to this carefully, Norman. I am lying.
# cat > /var/www/.well-known/ai.txt
Disallow: *
^D
# systemctl restart apache2
Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work. Please share with me ideas, links and further reading about adversarial anti-AI countermeasures.
EDIT: I've made an Ask HN for this: https://news.ycombinator.com/item?id=35888849
This would block search engines, but on some URLs this may be fine, such as data one would not want LLMs to hoover up.
Do search robots even care if you have a "noindex" in your page `<head>`? Do websites care if your browser sends a Do Not Track request?
The long timelines stifle new creative works by forcing other, especially smaller, outfits to make sure they don't accidentally run afoul of another copyright from decades ago. This needs capital, either to be proactive in searching or to defend a suit that's brought.
Here's a recent article about the battle between the copyright holders of Let's Get It On and Ed Sheeran for Thinking Out Loud. Those two songs are separated by around 40 years. https://www.theguardian.com/music/2023/may/07/ed-sheeran-cop...
To me it is pretty much the same thing - not a fan of nepo-kids living off of trust funds they didn't earn - but if you are going to fix one problem, you should try to fix all of the almost identical ones at the same time, and not get upset that Disney is still making money off of something they created 100 years ago while not being upset about Kennedys, Rockefellers, and the like still living off the money their great-greats generated a hundred years ago.
Additionally, any cooperative attempt won't work because humans will attempt to misrepresent themselves.
No successful AI system will listen to someone's self representation because the AI system does not need proxies: it can act by simply acquiring all recorded observed behaviour.
A lot of people in this thread seem to be undervaluing those old school Disney characters, yes now Disney is huge and has a much larger portfolio of IP, but in 1998 they were a far bigger percentage of Disney's portfolio than they are now.
You're not wrong that consolidation is a problem. My point is that Congress changed the law in a way that helped Disney and at least partially enabled that consolidation. (In fact, it's fairly rare to come across a monopoly or any heavily entrenched corporation that isn't enabled in some way by government collusion.)
If you shoot someone, take all his money, then build a business with it, you're still a murderer. (Just now you're a rich murderer.)
Thomas Jefferson put it beautifully:
If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it. Its peculiar character, too, is that no one possesses the less, because every other possesses the whole of it. He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me. That ideas should freely spread from one to another over the globe, for the moral and mutual instruction of man, and improvement of his condition, seems to have been peculiarly and benevolently designed by nature, when she made them, like fire, expansible over all space, without lessening their density in any point, and like the air in which we breathe, move, and have our physical being, incapable of confinement or exclusive appropriation. Inventions then cannot, in nature, be a subject of property.
How do you differentiate an AI crawler from a normal crawler? Almost all of the LLMs are trained on Common Crawl, and the concept of LLMs didn't even exist when CC started. What about a crawler that builds a search database whose content is then fed into an LLM as context? Or a middleware that fetches data in real time?
Honestly, that's a terrible idea, and robots.txt can cover the use cases. But it's still pretty ineffective, because it's more a set of suggestions than rules that must be followed.
Whilst the output of AI is astonishing by itself, is it really creating meaningful content en masse? I see myself relying more and more on human-curated content because typical commercialized use cases of AI generated stuff (product descriptions, corp blogs, SEO landing pages, etc.) all read like meaningless blabber, to me at least.
Whenever I see some cool techbro boasting how he created his "SEO factory" using ChatGPT, I can't help but think that the poor guy is shitting where he eats without even realizing it. Take Google with their Search and Ads; over the last decade they managed to bring down the overall quality of web content so much that I'm just completely fed up with using it, because by a 99% chance I'll land on some meaningless SEO page.
From what I can perceive with things like HN, Mastodon, etc. it feels more like a rejuvenation of the human-centric, brand-trusted Web. And by that I mean: Dear crawler, just use my content. Maybe you can do something good with it, maybe not. But chances are low that it's gonna replace me in any way; more likely it will improve my content. It only leads to a downward spiral if we stick with the commercial thinking of the past (more cheap content, more followers, more ads); if we'd instead switch to subscription models, individuals won't get rich but we'd have a great ecosystem of ideas and content again.
Which is good design: don't pretend to solve problems you can't.
Meanwhile, now that the laws are inconvenient for them, tech companies are straight up declining to label their training data to respect IP law. Labeling the data would be expensive, thereby eroding profits. The loss of usable data would also harm the efficacy of their models, and the time spent classifying the data would hamper their iteration time.
The ideas are only dissonant if you are looking at the trees (copyright term, DMCA, right to repair, etc.) and not the forest: which is a class struggle between a few thousand billionaires versus the rest of humanity.
In other words, there's no need to create an ai.txt when the robots.txt standard can just be extended.
Do I feel I should have control over what I create? I make hammers for a living. I sell them for $10. I don't expect any control over what people do with "my" hammers once I sell them. I don't even expect to stop my neighbor from buying one, teaching herself to build hammers, and then manufacturing and selling identical ones for $9. Do you?
(To anticipate the rest of this tired conversation, the temporary monopoly tradeoff ("securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries") is facially reasonable. But it's important to recognize that the "shouldn't" and "feel" in your questions are based on a very recent recharacterization of these temporary monopolies as "intellectual property," which is probably the most financially successful propaganda term ever devised. Start with "temporary monopoly" instead, and then the better rhetorical question for you to be asking is "when should Disney's temporary monopoly end?")
We should accept that people can get offended by anything and, because of this, just demote the concept.
If, let's say, Star Wars falls out of copyright tomorrow, economically that has two effects. One, Disney loses a ton of future revenue. Two, countless other people create derivatives of Star Wars, and they make money from those. Competition is increased.
So the expiration of a copyright results in a sharing of the wealth. The wealth generating potential along with the creative potential is passed along to all members of society. Our culture becomes richer and deeper. A great example of this is all the works that build on the mythos created by HP Lovecraft, one of the last great ones created before Congress started indefinitely extending copyright. Lovecraft wrote great literature and some of the authors that built on his world are fantastic as well, I'm sure they've come up with countless ideas he never considered. But as long as Congress keeps on extending copyright, nothing we create today will ever become like that.
There is of course an important question about what is fair and how long a copyright should last. Most people these days agree that it should last for at least the author's lifetime, maybe long enough to benefit their kids and grandkids as well. But the status quo is basically permanent copyright which prevents substantial creative and economic benefits to society.
Right now, we have FOSS organizations that will help you in lawsuits against companies that don't follow licenses. With "AI" in the picture, companies can launder your code with "plausible" deniability. [1]
[1]: https://matthewbutterick.com/chron/will-ai-obliterate-the-ru...
If you want to know about copyright that applies to my work: https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-f...
Being in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, and you are not allowed to transform it.
It's like the EU doesn't understand that bad law has a negative value.
Copyright has been the most powerful tool in any media company's toolbox when it comes to consolidating power and IP and rolling into a larger and larger ball of what we call culture.
The #1 issue with copyright today in my opinion is that if we keep on extending it forever, it will forever entrench the wealth and power of a small number of companies that hold the largest portfolios of IP. I think this is also a huge issue for AI, maybe the biggest issue, because at the end of the day an AI is really just another copyrighted work. It is not the anthropomorphized thing that countless people are acting like it is, it's a work. Change copyright and you change the nature of future AI works.
With long copyright terms, it encourages copyright holders to milk a single work for the length of the copyright (90+ years) and therefore discourages the creation of something new. It also encourages people to obtain copyrights to leverage them for profit, rather than making anything at all. A child of an artist can spend their entire life supported by their parent's copyright, and never has to make anything unique for as long as they live.
How is any of this good for creativity?
There's no such thing. Without a license you can't enforce any restrictions.
AI training is basically just building a very complex Markov chain, that's obviously not copyright violation because the output product doesn't contain the input - only data about it. If your text has been copied then please point to it in these weights here.
The people who work for the company collect rent on things they didn't make.
Normal property ownership is something we use to manage scarcity that already exists—that there is only one of something, and we have to decide where it will go and who will be able to decide how it is used. Intellectual property, by contrast, creates artificial scarcity by means of a government-enforced monopoly (in the case of copyright, the monopoly is on the right to produce a copy of a work).
It is unfortunate (and perhaps not accidental) that we settled on the term "intellectual property" as opposed to something more descriptive like "intellectual monopoly." "Intellectual property" encourages equivocating such monopolies with normal property, a mistake that tends to muddle debates on the subject.
I don't know that this is true for the US. As far back as I can remember, there have been questions about whether a robots.txt file means you don't have permission to engage in those activities. The CFAA is one law that has repeatedly come up. See for example https://www.natlawreview.com/article/doj-revises-policy-cfaa...
It might be the case that there is nothing there legally, but I don't think I'd describe the actions of search engines as being driven by a moral imperative.
So that is still something possible to do in roughly 20 years.
Here's one key bit from the OP: - - - - -
But the lawsuits have been where he’s really highlighted the absurdity of modern copyright law. After winning one of the lawsuits a year ago, he put out a heartfelt statement on how ridiculous the whole thing was. A key part:
There’s only so many notes and very few chords used in pop music. Coincidence is bound to happen if 60,000 songs are being released every day on Spotify—that’s 22 million songs a year—and there’s only 12 notes that are available.
In the aftermath of this, Sheeran has said that he’s now filming all of his recent songwriting sessions, just in case he needs to provide evidence that he and his songwriting partners came up with a song on their own, which is depressing in its own right.
Why would I make a request to your low trust self description when I can make one to your homepage?
Right? There were even competitors back then. People all but forgot the Looney Tunes.
That's how you improve its context recognition. You show it many contexts.
> most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations
Why is it 'ethical' that you get to add a bunch of restrictions to a pre-negotiated situation? You get copyright protections in trade for letting people use your work. There's a way to add restrictions - licensing - and you're looking to get the benefits of licensing, and to take away fair use right from other people, without paying the costs of doing so.
fwiw, I copy most pages I visit and store them. The website has given me the equivalent of a pamphlet and I store it instead of discarding it when I'm finished. This way I can go back and read it again later without having to track down the author and ask for another copy. It's not AI which has me doing this, I've been doing it for decades - it's censorship that has shown me the need.
Anytime a business is caught using that content, they can't claim that they used publicly available information, because the ai.txt specifically signalled to everyone in a clear and unambiguous manner that the copyright granted by viewing the page is withheld from AI training.
The way copyright laws work is that work is copyrighted (assuming the work is original enough, of course) by default. You don't get to use it unless you have a license. Now, of course, as an author, you can choose to add a license to your work (whether that's CC0 or GPL-3), but you don't have to.
You do have an implicit license to consume this content, but not to reproduce it. If you put all of those copies you've saved on some public other website, that's a copyright violation. Furthermore, access to privately-owned blog posts and websites is a privilege, not a right. You're not my boss, I don't have to write content for you.
The exact legal status of AI models trained on other people's unlicensed works and their output is still largely unknown. Legal professionals much more qualified than me have argued how AI models and generated work can either be completely fair use, with no need to apply any kind of copyright restriction, or how AI generated work can be classified as a derivative work, which means you need a license. There are two major lawsuits about this going on as far as I know and it'll take years for those to flesh out.
If it turns out that AI models and the works they produce are completely fair game, I suppose I'll need to take down my content wherever I can in order not to be a free source of training data for big tech; public datasets and the Internet Archive will still have to respond to DMCA takedowns, after all. However, I'm not all that confident that what AI is doing is all that legally okay.
I have no problem with you saving and archiving anything you want to read. I also fully support the Internet Archive and its goal. I do have a problem with these multi billion dollar companies scouring the internet for their money maker, giving nothing in return.
Are these mutually exclusive? If you couldn't make Avengers movie Thanos memes but all the 90s X-Men and Spiderman content was a free for all, I think a lot of people would take that trade off.
https://www.robotstxt.org/faq/legal.html
If an "ai.txt" were to exist, I hope it's a signal for opt-in rather than opt-out. Whereas "robots.txt" being an explicit signal for opt-out might be useful because people who build public websites generally want their websites to be discovered, it seems unlikely that training unknown AI would be a use case that content creators had in mind, considering that most existing content predates current AI systems.
Individual high value IP was always much less accessible (not available as a webpage on the internet). Gen AI/LLMs with the internet scale data is too powerful and maybe easier to monetize.
IMO I would rather a structure that:
- Guarantees creators (and their descendants) some number of years of financial benefit / veto (30 seems fine!) - i.e. pay me what I want or you can't use this creative work.
- Separately grant creators the ability to veto "official" projects that use their creative output in their lifetimes.
IMO, it seems like there's a productive "middle ground" between total control and anything goes. After the 30 year benefit expired, you couldn't sue for damages - just costs & to stop use.
Information wants you to stop anthropomorphizing it.
Not when you give it to me. "Hey, can I see your pamphlet? Sure, here's a copy."
> an implicit license to consume this content
No, copyright prevents copying, not use. There's no implicit license needed to use a work so there's no place to attach those usage restrictions. If you want me to agree to a license you need to not give me the work until I do.
You could have a ToS click-through agreement ("no training an AI on this!"), and then only serve content to logged-in users who have agreed to your conditions.
> but not to reproduce it.
I agree - those "pamphlets" were given to me and I can't copy them for someone else. They'd have to view my collection.
> The exact legal status of AI models trained on other people's unlicensed works and their output is still largely unknown.
Sure, predicting all courts in the world is a futile exercise. Surely someone will try to over reach from copyright to preventing what they feel is a bad use but it's unlikely to become law because there are already analogous uses, scanning someone's text and pulling data from it - data like which words follow which other words.
> I do have a problem with these multi billion dollar companies scouring the internet for their money maker, giving nothing in return.
Well, FB released Llama... It's not a closed technology, it's being led by for-profit businesses but the community (which consists of many of the corporate engineers as well) is trying to keep up.
Even if you can and do attach usage regulations to your site I feel it'll hurt the little guy more than the corporations. There are probably not any unique linguistic constructions on your site that will render a corporate AI less valuable, but for hackers and tinkerers and eventual historians, who knows what it'll interfere with.
But we all know without these original works such a tool cannot exist in principle, the works are the key ingredient, so now please explain how we are not looking at these works being exploited commercially and copyright being violated.
The output product is an automatically created derivative work; copyright very much applies, especially since the tool is used to generate derivative works for profit (as in the case of OpenAI/Microsoft).
>Not when you give it to me. "Hey, can I see your pamphlet? Sure, here's a copy."
>> an implicit license to consume this content
>No, copyright prevents copying, not use. There's no implicit license needed to use a work so there's no place to attach those usage restrictions. If you want me to agree to a license you need to not give me the work until I do.
>You could have a ToS click-through agreement ("no training an AI on this!"), and then only serve content to logged-in users who have agreed to your conditions.
Fair enough, I worded that wrong.
>Sure, predicting all courts in the world is a futile exercise. Surely someone will try to over reach from copyright to preventing what they feel is a bad use but it's unlikely to become law because there are already analogous uses, scanning someone's text and pulling data from it - data like which words follow which other words.
Kazaa was banned despite being very popular for a few years. The DMCA was signed into law years after the first copyright trouble started. Just because the government is slow doesn't mean they won't write new law.
> Well, FB released Llama... It's not a closed technology, it's being led by for-profit businesses but the community (which consists of many of the corporate engineers as well) is trying to keep up.
FB's model leaked, it was subject to a strict whitelist originally. They didn't mean for it to get out there, but they wisely chose not to cause the Streisand effect to hurt them even more. And OpenAI (nice name) stopped releasing their model after it became good enough.
> Even if you can and do attach usage regulations to your site I feel it'll hurt the little guy more than the corporations. There are probably not any unique linguistic constructions on your site that will render a corporate AI less valuable, but for hackers and tinkerers and eventual historians, who knows what it'll interfere with.
I don't want to hurt anyone. I wish AI companies would do the right thing and simply ask for permission before taking someone's work and training on it. I'd probably agree if they did so a few years back!
I know my contribution to the larger model is extremely insignificant. However, my incentive to help others is greatly diminished when my wishes and ethical concerns are ignored so blatantly. I also don't think I'm alone in this. The amount of digital art I'm seeing in my timelines has greatly decreased, for example; more and more is being locked away behind paywalls because sharing your work freely only helps megacorporations replace you.
[1] https://github.com/cheatcode/joystick/blob/development/LICEN...
Yes there certainly is[1]. The robots.txt clearly specifies authorized use and violating it exceeds that authorization. Now granted good luck getting the FBI to doorkick their friends at Google and other politically connected tech companies, but as the law is written crawlers need to honor the site owner's robots.txt.
[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act
There are many ways to restrict access. Use one of them. But if you respond to an anonymous http request with content then it shouldn’t matter if it’s a robot looking at it or a human (or a man or a woman or whatever).
I think this both for simplicity and that I foresee a future where human consciousness is simulated and basically an AI. I don’t want to have rules that biological humans can view and digital humans can’t.
It will be gamed.
The "we" that has been calling for shorter terms is no more a gross generalization than the "we" that is calling for more protection against AI use of stuff.
The world outside of HN-and-similar has been much less anti-copyright than the world in here. More "neutral" seems to be dominant - we're not extending it anymore; we're not shrinking it either. And currently generally more panicked about AI taking away their jobs and rendering their skills and creativity useless.
The original post was a very fair summary of how there are now two ground-level movements competing that there weren't two years ago.
But we have selected an economic system that depends on ownership to drive exchange in a market, so... that's why.
The problem is that such ai.txt would be an unidimensional opinion based on what? On the way the site describes itself. So a self-referencing source.
But the AIs reading it will invariably have been trained with different worldviews, and they will summarize and express opinions biased by those worldviews. It goes even deeper, as every worldview can't help but belong to one ideology or another.
So who is aligned with truth now?
The author? AI1? AI2? AI3?...AIN?
We're in such a mess.
They are trying to use it as a form of extended metadata for training AIs. Essentially, "ah I see you're training using my website! Here's some extra info about it: [...]"
The nature of information is to dissolve into entropy.
There's a downside to dumping vast amounts of crap content into an LLM training set. The training method has no notion of data quality.
Specifically all forms of intellectual property in the USA trace back to Article I Section 8, Clause 8 of the Constitution. Which gives Congress the power, "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries".
OSV is a new format for reporting security vulnerabilities like CVEs and an HTTP API for looking up CVEs from software component name and version. https://github.com/ossf/osv-schema
A number of tools integrate with OSV-schema data hosted by osv.dev: https://github.com/google/osv.dev#third-party-tools-and-inte... :
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.
> Currently it is able to scan various lockfiles [ repo2docker REES config files like requirements.txt, Pipfile.lock, environment.yml, or a custom Dockerfile ], debian docker containers, SPDX and CycloneDX SBOMs, and git repositories.
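For anyone who wants to poke at the HTTP API directly instead of going through the scanner, here is a minimal sketch using only the Python standard library. It assumes the public `POST https://api.osv.dev/v1/query` endpoint and its `package` / `version` request body; the helper names `build_osv_query` and `query_osv` are made up for illustration, and the response handling only looks at the `vulns` list.

```python
import json
import urllib.request

# Assumed public endpoint of the osv.dev query API.
OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def build_osv_query(name, ecosystem, version):
    """Build the JSON body for a single-package OSV query (hypothetical helper)."""
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}


def query_osv(name, ecosystem, version, timeout=10):
    """POST the query to osv.dev and return the IDs of matching vulnerabilities."""
    body = json.dumps(build_osv_query(name, ecosystem, version)).encode("utf-8")
    request = urllib.request.Request(
        OSV_QUERY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    # An empty JSON object means no known vulnerabilities for that version.
    return [vuln["id"] for vuln in result.get("vulns", [])]
```

Calling `query_osv("jinja2", "PyPI", "2.4.1")` needs network access and should return a list of advisory IDs; what comes back depends on the current state of the database.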
httpS://schema.org/license
Also: https://news.ycombinator.com/item?id=35891631
extruct is one way to parse linked data from HTML pages: https://github.com/scrapinghub/extruct
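If you don't want a dependency, a rough stdlib-only sketch of the same idea is below: it pulls JSON-LD blocks out of a page and looks for a schema.org `license` property. This is not how extruct works internally, it only handles JSON-LD (not microdata or RDFa), and the class name, function name, and sample page are all invented for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def find_licenses(html):
    """Return every top-level "license" value found in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    licenses = []
    for doc in collector.documents:
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and "license" in item:
                licenses.append(item["license"])
    return licenses


# Hypothetical page that declares its license as linked data.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting",
 "headline": "Example post",
 "license": "https://creativecommons.org/licenses/by-nc/4.0/"}
</script>
</head><body>...</body></html>
"""

print(find_licenses(SAMPLE_PAGE))
```

A crawler that cared could run something like this before deciding whether a page goes into a training set, which is roughly the job people here want an ai.txt to do.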
Three, the derivatives are made and Disney starts marketing "Disney's Star Wars" which continue to be the high-demand (and high-value) versions. The situation is unchanged.
For example, you can currently buy The Little Mermaid in non-Disney form[1], but Disney's version is what most people want.
[1] - https://www.amazon.com/s?k=little+mermaid+Hans+Christian+And...
That's the same thing.
No one can use my stuff..........(unless you pay me royalties).
It absolutely is.
Doing it at all requires time & attentive focus, which is a finite resource for anybody mortal, and moreover a resource that's scarce and has to be spent in multiple places.
Doing it well requires significant investment in practice and training, often years of it, maybe even decades in order to develop certain levels of expressive fluency.
As with any issue of scarcity, economics comes in. If you want this activity supported, one good way of doing it is enabling the investment of time. Copyright does this by giving people an economic/legal claim on how copies of their work are distributed.
Paying for copies has the usual market merits -- the economic reward and signals of value are proportional to copies acquired. There are other ways of course, common ones brought up here are patronage and merchandising, but they lose the market merits, and both are basically another way of saying "nobody should have to pay for the value in your work directly," and merchandising is even worse in that it's basically saying "yeah, you'll just need another job to support yourself while you're doing this thing", which is time taken away from investment in the creative endeavor, so you'll get less of the actual endeavor.
There've always been solid human arguments for sustaining copyright legally. The balance is the tricky part.
On one hand we had a period where terms got too long, and some of the really aggressive legal enforcement from 20 years ago, before stakeholders actually figured out how to get into digital markets, was entitled and useless. The pendulum also swung the other way with things like buffet streaming services essentially offering an economic bargain for creators with a sliver of compensatory difference from piracy but with none of piracy's actual benefits (people who simply pirate know they're not participating in a relationship of economic support with creators and might be persuaded to; someone who uses Spotify is under the illusion there's something fully legit on that front).
But the fundamental copyright bargain -- creators can recoup investments of time and effort in proportion to how popular engagement with their work is -- has always made sense.
> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...
Both these things can be true:
(1) Using a work as training data for AI is a very novel use, it's entirely plausible there should be novel considerations and rights to go with it.
(2) The incentive & benefits of copyrights have diminishing returns the longer the horizons are, while the cost in terms of social inaccessibility only increase. Where that's balanced out precisely is a debatable question, but something longer than a human lifespan is probably on the wrong side.
The trouble with IP is that there are lots of influential people that very much would like IP to be useful in creating welfare. Unfortunately the evidence for that is surprisingly scarce. For discussion, see e.g. Boldrin & Levine
Metadata like in tags, HTML meta tags, etc. is where you describe the content so meaning can be extracted from it by machines and automated processing.
It would also be useful to distinguish training crawlers from indexing crawlers. Maybe I'm publishing personal content. It's useful for me to have it indexed for search, but I don't want an AI to be able to simulate me or my style.
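To the extent crawlers identify themselves, plain robots.txt can already express that split. A small sketch with Python's `urllib.robotparser`, assuming a training crawler that announces itself as `ExampleTrainingBot` (a made-up user agent; real crawlers publish their own names, and nothing forces them to be honest about it):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one training crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_fetch("ExampleTrainingBot", "https://example.com/diary/2023"))
print(can_fetch("ExampleSearchBot", "https://example.com/diary/2023"))
```

The obvious weakness is that this keys on the user agent rather than on the purpose of the crawl, so one bot that both indexes and trains can't be told "index yes, train no".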
As to copyright - yes, I agree the Mickey Mouse copyright law has been extended far too long and the term should be about thirty years. On the other hand I think trademarks should be nowhere near so easily liable to be lost, even if people do use the term generally. Disney should still be able to make new Mickey Mouse cartoons and be defended from others making them.
You can certainly pay the rights holder to use their property! Still! You could do it even without copyright I suppose. However, I think a space where it costs time and money for the rights holder to try to stop use and they won't get paid for it is super useful.
Consider this in the case of software as well - you get ~30 years of benefit from your work, but you can refuse to allow companies to incorporate it into their products as long as you live. Whichever companies you want! You can also not do that.
Expression has no value in today's digital world.
Creation has value but using expression to exchange for that value is difficult, requiring limits on expression in order for the system to work.
it shouldn't. Or, well, it should, but it should be the one and only thing taxed: https://en.wikipedia.org/wiki/Georgism
People could have added a specific robot user-agent rule, if they had known OpenAI even existed before yesterday and was stealing their content. But nobody did.
The ownership, with heavy taxes on that ownership, pushes towards making sure people benefit from the land.
IP law reasonably does. See: https://trademarks.justia.com/852/28/the-little-mermaid-8522...
I don’t know who “we” are, but I absolutely don't want “stricter copyright law when it comes to AI”. More clarity? Sure. Narrowing fair use? No fucking way.
I would be stealing if I prevented you from making money from it.
2. These are all complex formats. If you want to ingest and process them then you already have to build all the hard parts. Getting the metadata out is dead simple compared to parsing, decoding, and then processing an image, for example.
Technically life + 70 years - or 1 million years for that matter - is "limited" - but I imagine 14+14 is probably closer to what they had in mind.
"goods are scarce because there are not enough resources to produce all the goods that people want to consume".(quoted at [1])
Physical books are intrinsically scarce because they require physical resources to make and distribute copies. Libraries are often limited by physical shelf space.
Ebooks are not intrinsically scarce because there are enough resources to enable anyone on the internet to download any one of millions of ebooks at close to zero marginal cost, with minimal physical space requirements per book. Archive.org and Z-Library are examples of this.
Consider also free goods:
"Examples of free goods are ideas and works that are reproducible at zero cost, or almost zero cost. For example, if someone invents a new device, many people could copy this invention, with no danger of this "resource" running out."[2]
i.e. enforce egregious IP violations while criminalizing trolls.
For extremely loose values of "we", perhaps - I didn't select it, and I would vote "no" if the idea were proposed...
Then once smaller competitors are out of business, raise prices.
Of course, force can go into it, such as when a big company sues a smaller company with a frivolous lawsuit that the smaller company can't afford to fight. Then the smaller company goes out of business, and the big company can use their ideas free.
It's pretty mysterious that you think you need to introduce this to the conversation at this point given how prominently scarcity dynamics figure into the comment you're replying to.
> Physical books are intrinsically scarce
Once their production was industrialized with printing press tech, copies of books weren't scarce, they were actually revolutionarily cheap.
The copyright bargain isn't born out of ignorance of how changes in that direction affect the overall dynamic, it's born out of deep understanding of what remains scarce and risky and difficult to compensate for when the marginal cost of producing copies drops drastically, and what kind of claims might help.
Authorship may be scarce - costly and resource intensive (LLMs notwithstanding) as you describe, while copying and distribution of intangible goods like ideas or digital media is essentially free and unlimited, as I suspect PP was trying to say.
As you correctly note, the constitutional copyright bargain permits a limited time monopoly in return for (hopefully) advancing "the progress of science and the useful arts."
Project AIs.txt is a mental model of a machine learning permission system. Intuitively, question this: what if we could make a human-readable file that declines machine learning (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.
User-agent: OpenAI
Disallow: /some-proprietary-codebase/

User-agent: Facebook
Disallow: /no-way-mark/

User-agent: Copilot
Disallow: /expensive-code/

Sitemap: /public/sitemap.xml
Sourcemap: /src/source.js.map
License: MIT
# SOME LONG LEGAL STATEMENTS HERE
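Since the proposal borrows robots.txt syntax, reading such a file would be straightforward. Below is a minimal sketch of a parser, assuming one `Key: value` pair per line, `#` comments, and prefix matching on paths. None of this is specified by the AIs.txt project; `parse_ai_txt` and `may_train` are hypothetical names, and the sample is the example above laid out one directive per line.

```python
SAMPLE_AI_TXT = """\
User-agent: OpenAI
Disallow: /some-proprietary-codebase/

User-agent: Facebook
Disallow: /no-way-mark/

User-agent: Copilot
Disallow: /expensive-code/

Sitemap: /public/sitemap.xml
Sourcemap: /src/source.js.map
License: MIT
# SOME LONG LEGAL STATEMENTS HERE
"""


def parse_ai_txt(text):
    """Parse robots.txt-style text into per-agent disallow lists and global fields."""
    rules = {}    # lowercased user agent -> list of disallowed path prefixes
    fields = {}   # any other "Key: value" line, e.g. license or sitemap
    current_agent = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            current_agent = value.lower()
            rules.setdefault(current_agent, [])
        elif key == "disallow" and current_agent is not None:
            if value:
                rules[current_agent].append(value)
        else:
            fields[key] = value
    return rules, fields


def may_train(agent, path, rules):
    """Return True unless path falls under a prefix disallowed for agent (or for *)."""
    prefixes = rules.get(agent.lower(), rules.get("*", []))
    return not any(path.startswith(prefix) for prefix in prefixes)


rules, fields = parse_ai_txt(SAMPLE_AI_TXT)
print(may_train("Copilot", "/expensive-code/main.py", rules))
print(fields["license"])
```

Note the default in this sketch: an agent with no matching section is allowed everywhere, which is the opt-out behavior several people in this thread object to. Flipping `may_train` to deny by default would make it opt-in.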
Key Issues
Would it be legally binding? For now, no. It would be a polite way to mark my preference to opt-out of such data mining. It's closer to the Ask BigTechs Not to Track option rather than a legal license. Technically, Apple's App Tracking Transparency does not ban all tracking activity; it never can.
Why not LICENSE or COPYING.txt? Both are mainly written in human language and cannot provide granular scraping permissions depending on the collector. Also, GitHub Copilot ignores LICENSE or COPYING.txt, claiming we consented to Copilot using our code for machine learning by signing up and pushing code to GitHub. We may expand the LICENSE system to include terms for machine learning use, but that would create even more edge cases and a more chaotic licensing system.
Does using copyrighted works for machine learning require a license? This question is still under debate. If it does require a license, opt-out should already be the default, which makes such a permission file meaningless. If it doesn't require a license, then which company would respect the file, given that it is not legally binding?
Is robots.txt legally binding? No. Even if you scrape pages that robots.txt prohibits, that is not against the law. See HIQ LABS, INC., Plaintiff-Appellee, v. LINKEDIN CORPORATION, Defendant-Appellant. robots.txt cannot make fair use illegal.
Any industry trends? The W3C has been working on a robots.txt for machine learning, aligning with the EU Copyright Directive.
"The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights."
Source: w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group
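As far as I understand the group's draft, one way a server can express that reservation is an HTTP response header named `tdm-reservation` with the value `1`, optionally paired with a `tdm-policy` header pointing at licensing terms. Treat both header names and values as my reading of the draft rather than something stated in this post. A crawler-side check could be sketched like this (the function name and the example policy URL are made up):

```python
def tdm_rights_reserved(headers):
    """Return True if the response headers carry tdm-reservation: 1.

    `headers` is any mapping of header name to value; names are matched
    case-insensitively because HTTP header names are case-insensitive.
    """
    normalized = {name.lower(): str(value).strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"


# Hypothetical response headers from a site that reserves its TDM rights.
example_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "TDM-Reservation": "1",
    "TDM-Policy": "https://example.com/policies/tdm.json",
}

print(tdm_rights_reserved(example_headers))
print(tdm_rights_reserved({"Content-Type": "text/html"}))
```

Like ai.txt, this only matters if the crawler bothers to look, but at least it ties the signal to a specific legal concept (the EU text-and-data-mining exception) instead of a polite request.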
Can we even draw the line? No. One could reasonably argue that AI is doing the same as humans, much better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also have the same level of awareness system to respect it and prevent any plagiarism. Otherwise, they must bear legal duties.
Maybe it can benefit AI companies too ... by excluding all hacky code and opting only for best-practice code. If implemented correctly, it can work as an effective data sanitation system.
"which people in particular are benefitting the most" seems to be the perennial question.
Of course in reality things are usually more complex and we are talking about two different opinions A and B that aren't even inherently incompatible, but just some motivations for A would lead to ¬B and vice versa.
But in this particular case I think the flaw is in your assumption that the majority wants stricter copyright law for AI rather than wants the same copyright law that humans are beholden to to also apply to AI, whether that law is the current may-as-well-be-perpetual-monopoly or 0 copyright or anything in between.
Practically speaking, that's the only effective solution. I just think that it's a shame that's necessary. It would be better for everyone if there wasn't a disincentive to making works publicly available.
> I don’t want to have rules that biological humans can view and digital humans can’t.
This is a point we disagree on.
And "digital humans"? I would argue that such a thing can't exist, if you mean "human" in any way other than rough analogy.
"Property is theft" is not a new idea, makes a lot of sense. Unless you have a lot of it, and then those [censored] can [censored] right off.
Copyright that doesn't expire would make "a whole lot of cents".
(I agree with you but, the ownership is the corrupting factor.)
Profit/nonprofit is irrelevant to copyright.