I've started adding an ai.txt to my projects. The file is just a basic text file with some useful info about the website: what it's about, when it was published, the author, and so on.
It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers: instead of spending thousands of tokens to figure out what your website is about, they can do it with just a few hundred.
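There's no standard format yet (that's part of the problem), but to give a concrete idea, it can be as simple as a handful of human-readable fields; the names and values below are made up for illustration:

Title: Example Personal Blog
Author: Jane Doe
Published: 2023-05-01
Description: Notes on web development, self-hosting, and open source.
Contact: jane@example.com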
https://www.iana.org/assignments/well-known-uris/well-known-...
Basically, assuming that you have a spec, I think it amounts to filing a PR or discussing it on a mailing list.
https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...
https://developers.google.com/search/blog/2013/08/relauthor-....
[0]: https://developers.google.com/search/docs/appearance/structu...
At least in my country (Germany), respecting robots.txt is a legal requirement for data mining. See German Copyright Code, section 44b: https://www.gesetze-im-internet.de/urhg/__44b.html
(IANAL)
https://developers.google.com/search/docs/crawling-indexing/...
e.g: https://developers.google.com/search/docs/appearance/structu...
> security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.
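For reference, a minimal security.txt served from /.well-known/security.txt looks something like this (field names per RFC 9116; the values are placeholders):

Contact: mailto:security@example.com
Expires: 2026-01-01T00:00:00.000Z
Policy: https://example.com/security-policy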
Carbon.txt: https://github.com/thegreenwebfoundation/carbon.txt :
> A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.
"Work out how to make it discoverable - well-known, TXT records or root domains" https://github.com/thegreenwebfoundation/carbon.txt/issues/3... re: JSON-LD instead of txt, signed records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js)
SPDX is a standard for specifying software licenses (and now SBOMs, Software Bills of Materials, too) https://en.wikipedia.org/wiki/Software_Package_Data_Exchange
It would be transparent to disclose the SBOM in AI.txt or elsewhere.
How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
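For comparison, the SPDX convention inside source files is a one-line machine-readable tag, e.g. at the top of a Python file (REUSE-style copyright line included; the name and year are placeholders):

# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: 2023 Jane Doe <jane@example.com>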
There is a massive amount of amazing storytelling based on ancient myths because they're one of the few large corpora that isn't copyrighted. Once you see it in media you can't unsee it. The only space where that kind of creativity can still thrive is fan fiction, which lives in a weird limbo where it's illegal but the copyright owners don't care. And when you want to bring any of it to the mainstream you have to hide it: all of Ali Hazelwood's books are reworked fanfics because she can't use the actual characters that inspired her; her most famous book, "The Love Hypothesis", is a Reylo fic.
Go check out https://archiveofourown.org/media and see how many works are owned by a few large corporations.
# cat > /var/www/.well-known/ai.txt
Disallow: *
^D
# systemctl restart apache2
Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work. Please share ideas, links, and further reading about adversarial anti-AI countermeasures.
EDIT: I've made an Ask HN for this: https://news.ycombinator.com/item?id=35888849
The long timelines stifle new creative works by forcing other, especially smaller, outfits to make sure they don't accidentally run afoul of a copyright from decades ago. That takes capital, either to search proactively or to defend a suit that's brought.
Here's a recent article about the battle between the copyright holders of Let's Get It On and Ed Sheeran for Thinking Out Loud. Those two songs are separated by around 40 years. https://www.theguardian.com/music/2023/may/07/ed-sheeran-cop...
Right now, we have FOSS organizations that will help you in lawsuits against companies that don't follow licenses. With "AI" in the picture, companies can launder your code with "plausible" deniability. [1]
[1]: https://matthewbutterick.com/chron/will-ai-obliterate-the-ru...
If you want to know about copyright that applies to my work: https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-f...
Being in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, and you are not allowed to transform it.
I don't know that this is true for the US. As far back as I can remember, there have been questions about whether a robots.txt file means you don't have permission to engage in those activities. The CFAA is one law that has repeatedly come up. See for example https://www.natlawreview.com/article/doj-revises-policy-cfaa...
It might be the case that there is nothing there legally, but I don't think I'd describe the actions of search engines as being driven by a moral imperative.
Here's one key bit from the OP:
But the lawsuits have been where he’s really highlighted the absurdity of modern copyright law. After winning one of the lawsuits a year ago, he put out a heartfelt statement on how ridiculous the whole thing was. A key part:
> There’s only so many notes and very few chords used in pop music. Coincidence is bound to happen if 60,000 songs are being released every day on Spotify—that’s 22 million songs a year—and there’s only 12 notes that are available.
In the aftermath of this, Sheeran has said that he’s now filming all of his recent songwriting sessions, just in case he needs to provide evidence that he and his songwriting partners came up with a song on their own, which is depressing in its own right.
https://www.robotstxt.org/faq/legal.html
If an "ai.txt" were to exist, I hope it's a signal for opt-in rather than opt-out. Whereas "robots.txt" being an explicit signal for opt-out might be useful because people who build public websites generally want their websites to be discovered, it seemed unlikely that training unknown AI would be a use case that content creators had in mind, considering that most existing content predates current AI systems.
Yes there certainly is[1]. The robots.txt clearly specifies authorized use, and violating it exceeds that authorization. Now, granted, good luck getting the FBI to door-kick their friends at Google and other politically connected tech companies, but as the law is written, crawlers need to honor the site owner's robots.txt.
[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act
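As a practical aside, checking robots.txt programmatically takes only a few lines; here's a minimal sketch using Python's standard library (the site URL and user-agent string below are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()                                     # fetch and parse the file

# A well-behaved crawler checks before fetching each page.
if rp.can_fetch("ExampleBot/1.0", "https://example.com/private/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")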
OSV is a new format for reporting security vulnerabilities (like CVEs), plus an HTTP API for looking up known vulnerabilities by software component name and version. https://github.com/ossf/osv-schema
A number of tools integrate with OSV-schema data hosted by osv.dev: https://github.com/google/osv.dev#third-party-tools-and-inte... :
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.
> Currently it is able to scan various lockfiles [ repo2docker REES config files like requirements.txt, Pipfile.lock, environment.yml, or a custom Dockerfile ], debian docker containers, SPDX and CycloneDX SBOMs, and git repositories.
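For the lookup API mentioned above, here's a minimal Python sketch using only the standard library (the package name and version are just an illustrative example):

import json
import urllib.request

# Ask the OSV API for known vulnerabilities affecting one package version.
query = {
    "package": {"name": "jinja2", "ecosystem": "PyPI"},  # illustrative package
    "version": "2.4.1",
}
req = urllib.request.Request(
    "https://api.osv.dev/v1/query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    vulns = json.load(resp).get("vulns", [])

print(f"{len(vulns)} known vulnerabilities")
for v in vulns:
    print(v["id"], v.get("summary", ""))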
https://schema.org/license
Also: https://news.ycombinator.com/item?id=35891631
extruct is one way to parse linked data from HTML pages: https://github.com/scrapinghub/extruct
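For instance, a rough sketch of pulling https://schema.org/license metadata out of a page's JSON-LD with extruct (the HTML below is a made-up minimal page, not from any real site):

import extruct

# A made-up page carrying schema.org CreativeWork metadata as JSON-LD.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "CreativeWork",
 "name": "Example Article",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body>...</body></html>
"""

data = extruct.extract(html, base_url="https://example.com", syntaxes=["json-ld"])
for item in data["json-ld"]:
    print(item.get("@type"), item.get("license"))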
Three, the derivatives are made and Disney starts marketing "Disney's Star Wars" which continue to be the high-demand (and high-value) versions. The situation is unchanged.
For example, you can currently buy The Little Mermaid in non-Disney form[1], but Disney's version is what most people want.
[1] - https://www.amazon.com/s?k=little+mermaid+Hans+Christian+And...
It shouldn't. Or, well, it should, but it should be the one and only thing taxed: https://en.wikipedia.org/wiki/Georgism
IP law reasonably does. See: https://trademarks.justia.com/852/28/the-little-mermaid-8522...
"goods are scarce because there are not enough resources to produce all the goods that people want to consume".(quoted at [1])
Physical books are intrinsically scarce because they require physical resources to make and distribute copies. Libraries are often limited by physical shelf space.
Ebooks are not intrinsically scarce because there are enough resources to enable anyone on the internet to download any one of millions of ebooks at close to zero marginal cost, with minimal physical space requirements per book. Archive.org and Z-Library are examples of this.
Consider also free goods:
"Examples of free goods are ideas and works that are reproducible at zero cost, or almost zero cost. For example, if someone invents a new device, many people could copy this invention, with no danger of this "resource" running out."[2]
Then once smaller competitors are out of business, raise prices.
Of course, force can come into it, such as when a big company hits a smaller company with a frivolous lawsuit the smaller company can't afford to fight. Then the smaller company goes out of business, and the big company can use its ideas for free.
Project AIs.txt is a mental model of a machine learning permission system. Intuitively, the question is: what if we could make a human-readable file that declines machine learning use (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.
User-agent: OpenAI
Disallow: /some-proprietary-codebase/

User-agent: Facebook
Disallow: /no-way-mark/

User-agent: Copilot
Disallow: /expensive-code/

Sitemap: /public/sitemap.xml
Sourcemap: /src/source.js.map
License: MIT
# SOME LONG LEGAL STATEMENTS HERE
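To make the mental model concrete, here is a rough Python sketch of how a crawler might parse this hypothetical format (the field names follow the example above; none of this is a standard):

# Hypothetical AIs.txt parser: collects Disallow rules per User-agent.
def parse_ais_txt(text):
    rules = {}        # user-agent -> list of disallowed path prefixes
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = value
            rules.setdefault(current, [])
        elif key.lower() == "disallow" and current is not None:
            rules[current].append(value)
    return rules

def allowed(rules, agent, path):
    # True unless the path falls under one of the agent's Disallow prefixes.
    return not any(path.startswith(prefix) for prefix in rules.get(agent, []))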
Key Issues

Would it be legally binding? For now, no. It would be a polite way to mark my preference to opt out of such data mining. It's closer to an "Ask BigTechs Not to Track" option than a legal license. Technically, even Apple's App Tracking Transparency does not ban all tracking activity; it never can.
Why not LICENSE or COPYING.txt? Both are mainly written in human language and cannot express granular scraping permissions per collector. Also, GitHub Copilot ignores LICENSE and COPYING.txt, claiming we consented to Copilot using our code for machine learning by signing up and pushing code to GitHub. We could expand the LICENSE system to include terms for machine learning use, but that would create even more edge cases and an even more chaotic licensing system.
Does machine learning use of copyrighted works require a license? This question is still under debate. If it does require a license, opt-out is effectively the default, which makes a file like this meaningless. If it doesn't, then which company would respect it, given that it is not legally binding?
Is robots.txt legally binding? No. Scraping pages that robots.txt prohibits is not, by itself, against the law. See hiQ Labs, Inc. v. LinkedIn Corp. robots.txt cannot make fair use illegal.
Any industry trends? The W3C has been working on a robots.txt-style mechanism for machine learning, aligning with the EU Copyright Directive.
> The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights. (w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group)
Can we even draw the line? No. One could reasonably argue that AI is doing the same thing humans do, only better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also come with a system aware enough to respect it and prevent plagiarism. Otherwise, its operators must bear the legal duties.
Maybe it can benefit AI companies too ... by excluding hacky code and opting in only best-practice code. If implemented correctly, it could work as an effective data sanitization system.