Regardless, I do agree that something like a robots.txt for AI can be very useful. I'd like my website to be excluded from most AI projects, and some kind of standardized way to communicate this preference would be nice, although I realize most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations. It's the idea that matters, really.
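For what it's worth, a few crawlers already advertise user-agent strings you can target in a plain robots.txt: GPTBot (OpenAI), Google-Extended, and CCBot (Common Crawl) all claim to honor it. A minimal exclusion, assuming those agents keep their published names and actually respect the file:

```
# robots.txt - ask known AI training crawlers to stay out
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Of course, this only works for crawlers polite enough to check the file in the first place, which is exactly the problem.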
If I can use an ai.txt to convince the crawlers that my website contains illegal hardcore terrorist pornography to get it excluded from the datasets, that's another way to accomplish this, I suppose.
That's how you improve its context recognition. You show it many contexts.
> most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations
Why is it 'ethical' that you get to add a bunch of restrictions to a pre-negotiated situation? You get copyright protections in trade for letting people use your work. There's a way to add restrictions - licensing - and you're looking to get the benefits of licensing, and to take away fair use rights from other people, without paying the costs of doing so.
fwiw, I copy most pages I visit and store them. The website has given me the equivalent of a pamphlet and I store it instead of discarding it when I'm finished. This way I can go back and read it again later without having to track down the author and ask for another copy. It's not AI which has me doing this; I've been doing it for decades - it's censorship that has shown me the need.
The way copyright law works is that a work is copyrighted by default (assuming it's original enough, of course). You don't get to use it unless you have a license. Now, of course, as an author, you can choose to add a license to your work (whether that's CC0 or GPLv3), but you don't have to.
You do have an implicit license to consume this content, but not to reproduce it. If you put all of those copies you've saved on some other public website, that's a copyright violation. Furthermore, access to privately owned blog posts and websites is a privilege, not a right. You're not my boss; I don't have to write content for you.
The exact legal status of AI models trained on other people's unlicensed works, and of their output, is still largely unsettled. Legal professionals much more qualified than I am have argued both that AI models and their generated work can be completely fair use, with no need for any kind of copyright restriction, and that AI-generated work can be classified as a derivative work, which would require a license. There are two major lawsuits about this going on as far as I know, and it'll take years for those to play out.
If it turns out that AI models and the works they produce are completely fair game, I suppose I'll need to take down my content wherever I can in order not to be a free source of training data for big tech; public datasets and the Internet Archive will still have to respond to DMCA takedowns, after all. However, I'm not all that confident that what AI is doing is legally okay.
I have no problem with you saving and archiving anything you want to read. I also fully support the Internet Archive and its goal. I do have a problem with these multi-billion-dollar companies scouring the internet for their money maker while giving nothing in return.
Not when you give it to me. "Hey, can I see your pamphlet? Sure, here's a copy."
> an implicit license to consume this content
No, copyright prevents copying, not use. There's no implicit license needed to use a work, so there's no place to attach those usage restrictions. If you want me to agree to a license, you need to not give me the work until I do.
You could have a ToS click-through agreement ("no training an AI on this!"), and then only serve content to logged-in users who have agreed to your conditions.
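A minimal sketch of what that gate could look like, using Flask purely as an example framework (the routes and the agreed_to_tos flag are invented for illustration, not anyone's actual setup):

```python
# Hypothetical sketch: only serve content to sessions that clicked through the ToS.
from flask import Flask, session, redirect, abort

app = Flask(__name__)
app.secret_key = "change-me"  # required for session cookies

@app.route("/agree", methods=["POST"])
def agree():
    # The click-through: record that this session accepted the conditions.
    session["agreed_to_tos"] = True
    return redirect("/article")

@app.route("/article")
def article():
    # No recorded agreement? Refuse to serve the work at all.
    if not session.get("agreed_to_tos"):
        abort(403)
    return "The content, now served under explicitly agreed terms."
```

Whether such a click-through actually binds a scraper is its own legal question, but at least the work was never handed over without conditions attached.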
> but not to reproduce it.
I agree - those "pamphlets" were given to me and I can't copy them for someone else. They'd have to view my collection.
> The exact legal status of AI models trained on other people's unlicensed works and their output is still largely unknown.
Sure, predicting all the courts in the world is a futile exercise. Surely someone will try to overreach, stretching copyright to prevent what they feel is a bad use, but it's unlikely to become law because there are already analogous uses: scanning someone's text and pulling data from it - data like which words follow which other words.
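To make that concrete, the kind of extraction I mean is nothing more exotic than this toy sketch (the sample sentence is made up):

```python
# Toy illustration: extract which words follow which other words.
# The extracted facts (word-succession counts) are data about the text,
# not a copy of the text itself.
from collections import Counter, defaultdict

text = "the cat sat on the mat and the cat slept"  # made-up sample
words = text.split()

follows = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

print(dict(follows["the"]))  # {'cat': 2, 'mat': 1}
```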
> I do have a problem with these multi billion dollar companies scouring the internet for their money maker, giving nothing in return.
Well, FB released Llama... It's not a closed technology; it's being led by for-profit businesses, but the community (which includes many of those corporate engineers as well) is trying to keep up.
Even if you can and do attach usage restrictions to your site, I feel it'll hurt the little guy more than the corporations. There are probably not any unique linguistic constructions on your site whose loss would make a corporate AI less valuable, but for hackers and tinkerers and eventual historians, who knows what it'll interfere with.