zlacker

Tell HN: We should start to add “ai.txt” as we do for “robots.txt”

submitted by Jeanne+(OP) on 2023-05-10 12:20:05 | 562 points 281 comments
[source] [go to bottom]

I started to add an ai.txt to my projects. The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.

It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.


NOTE: showing posts with links only show all posts
◧◩◪
3. jruoho+z2[view] [source] [discussion] 2023-05-10 12:37:11
>>a28002+Z1
I am not sure about that but I think IANA is quite open to recognizing new well-known URIs:

https://www.iana.org/assignments/well-known-uris/well-known-...

Basically, assuming that you have a spec, I think it amounts to filing a PR or discussing it on a mailing list.

15. aww_da+D9[view] [source] 2023-05-10 13:15:06
>>Jeanne+(OP)
Most of what you listed is already covered by existing meta tags and structured data.

https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...

https://schema.org/author

https://developers.google.com/search/blog/2013/08/relauthor-....

17. nstj+ma[view] [source] 2023-05-10 13:19:12
>>Jeanne+(OP)
“Google Search works hard to understand the content of a page. You can help us by providing explicit clues about the meaning of a page to Google by including structured data on the page.”[0]

[0]: https://developers.google.com/search/docs/appearance/structu...

◧◩
22. thefox+xc[view] [source] [discussion] 2023-05-10 13:29:20
>>dingle+2b
Good point.

Also Killer Robots are Robots: https://www.youtube.com/watch?v=4K6XJuH6P_w

◧◩
34. majews+Oe[view] [source] [discussion] 2023-05-10 13:39:50
>>samwil+H5
> All a robots.txt is is a polite request to please follow the rules in it

At least in my country (Germany), respecting robots.txt is a legal requirement for data mining. See German Copyright Code, section 44b: https://www.gesetze-im-internet.de/urhg/__44b.html

(IANAL)

◧◩
48. revico+Zk[view] [source] [discussion] 2023-05-10 14:08:09
>>matsem+ae
Feels like an enhancement to a sitemap.xml could be a better way to go here.

https://developers.google.com/search/docs/crawling-indexing/...

◧◩◪◨
49. mnot+Il[view] [source] [discussion] 2023-05-10 14:11:21
>>jruoho+z2
You can open an issue: https://github.com/protocol-registries/well-known-uris
50. qbasic+Ol[view] [source] 2023-05-10 14:11:34
>>Jeanne+(OP)
Your HTML already has semantic meta elements like author and description you should be populating with info like that: https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...
53. h1fra+Xl[view] [source] 2023-05-10 14:12:00
>>Jeanne+(OP)
Something like JSON+LD ? It should cover most of your needs and can also be used for actual search engine

e.g: https://developers.google.com/search/docs/appearance/structu...

70. westur+ds[view] [source] 2023-05-10 14:38:20
>>Jeanne+(OP)
security.txt https://github.com/securitytxt/security-txt :

> security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.

Carbon.txt: https://github.com/thegreenwebfoundation/carbon.txt :

> A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.

"Work out how to make it discoverable - well-known, TXT records or root domains" https://github.com/thegreenwebfoundation/carbon.txt/issues/3... re: JSON-LD instead of txt, signed records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js)

SPDX is a standard for specifying software licenses (and now SBOMs Software Bill of Materials, too) https://en.wikipedia.org/wiki/Software_Package_Data_Exchange

It would be transparent to disclose the SBOM in AI.txt or elsewhere.

How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?

◧◩◪◨
74. Spivak+xu[view] [source] [discussion] 2023-05-10 14:47:39
>>casey2+1p
The argument goes that copyright has allowed massive corporations to buy up and exert near total control over all of our shared stories. And when you own the cultural touchstones of whole generations that gives you power that no one else can ever wield.

There is a massive amount of amazing stories based on ancient myths because it's one of the few large corpora that isn't copywritten. Once you see it in media you can't unsee it. The only space where that kind of creativity can thrive anymore is fan-fiction which lives in weird limbo where it's illegal but the copyright owners don't care. And when you want to bring any of it to the mainstream you have to hide it, all of Ali Hazelwoods books are reworked fanfics because she can't use the actual characters that inspired her -- her most famous book "The Love Hypothesis" is a Reylo fic.

Go check out https://archiveofourown.org/media and see how many works are owned by a few large corporations.

◧◩
75. techaq+7v[view] [source] [discussion] 2023-05-10 14:50:26
>>qbasic+Ol
and also opengraph meta tags https://ogp.me/
78. javier+Gv[view] [source] 2023-05-10 14:53:10
>>Jeanne+(OP)
Related, there is also https://datatxt.org
◧◩◪
89. doodle+Hy[view] [source] [discussion] 2023-05-10 15:06:26
>>techaq+7v
And also schema.org: https://schema.org/
◧◩
96. mtmail+qA[view] [source] [discussion] 2023-05-10 15:13:18
>>javier+Gv
Which claims 'under active development'. Four years ago the author took the robots.txt RFC and changed a couple of paragraphs https://github.com/datatxtorg/datatxt-spec/commit/36028e2280... Meanwhile the robots.txt was updated in 2022 https://www.rfc-editor.org/rfc/rfc9309.html
108. sph+AD[view] [source] 2023-05-10 15:24:43
>>Jeanne+(OP)

    # cat > /var/www/.well-known/ai.txt
    Disallow: *
    ^D
    # systemctl restart apache2
Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work.

Please share with me ideas, links and further reading about adversarial anti-AI countermeasures.

EDIT: I've made an Ask HN for this: https://news.ycombinator.com/item?id=35888849

◧◩◪◨
114. placat+qF[view] [source] [discussion] 2023-05-10 15:32:26
>>lances+8y
Both things can, and I think are, true. I see it as reduced competition in both cases, corporate consolidation making companies huge and large copyright timelines.

The long timelines stifle new creative works by keeping other, especially smaller, outfits having to make sure they don't accidentally run afoul of another copyright from decades ago. This needs capital to either be proactive in searching or to defend a suit that's brought.

Here's a recent article about the battle between the copyright holders of Let's Get It On and Ed Sheeran for Thinking Out Loud. Those two songs are separated by around 40 years. https://www.theguardian.com/music/2023/may/07/ed-sheeran-cop...

119. mxurib+3H[view] [source] 2023-05-10 15:39:57
>>Jeanne+(OP)
@Jeannen I really like the thinking here...But instead of ai.txt - since the intent is not to block, but rather, to inform AI models (or any other presumably automaton) - my reflex is to suggest something more general like readme.txt. But, then i thought, well, since its really more about metadata, as others have stated, there might already be existing standards...Or, at least, common behaviors that could become standardized. For example, someone noted about security.txt, and i know there's the humans.txt approach (see https://humanstxt.org/), and of course there are web manifest files (see https://developer.mozilla.org/en-US/docs/Web/Manifest), etc. I wonder if you might want to consider reviewing existing approaches, and maybe augment them or see if any of thjose makese sense (or not)...?
◧◩◪◨⬒⬓
141. gavinh+PN[view] [source] [discussion] 2023-05-10 16:06:51
>>rhn_mk+4t
"AI" changes things by making it even harder for individuals to defend against.

Right now, we have FOSS organizations that will help you in lawsuits against companies that don't follow licenses. With "AI" in the picture, companies can launder your code with "plausible" deniability. [1]

[1]: https://matthewbutterick.com/chron/will-ai-obliterate-the-ru...

142. fredri+wO[view] [source] 2023-05-10 16:10:23
>>Jeanne+(OP)
If anyone want to use my blog posts, they can contact me. I want to know my customer.

If you want to know about copyright that applies to my work: https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-f...

Beeing in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, you are not allowed to transform it.

◧◩
153. bachme+aV[view] [source] [discussion] 2023-05-10 16:40:30
>>samwil+H5
> All a robots.txt is is a polite request to please follow the rules in it, there is no "legal" agreement to follow those rules, only a moral imperative.

I don't know that this is true for the US. As far back as I can remember, there have been questions about whether a robots.txt file means you don't have permission to engage in those activities. The CFAA is one law that has repeatedly come up. See for example https://www.natlawreview.com/article/doj-revises-policy-cfaa...

It might be the case that there is nothing there legally, but I don't think I'd describe the actions of search engines as being driven by a moral imperative.

◧◩◪◨
155. CWuest+HV[view] [source] [discussion] 2023-05-10 16:42:42
>>dclowd+lB
Yesterday's conversation here about the Ed Sheeran lawsuit should explain much of this: https://news.ycombinator.com/item?id=35868421

Here's one key bit from the OP: - - - - -

But the lawsuits have been where he’s really highlighted the absurdity of modern copyright law. After winning one of the lawsuits a year ago, he put out a heartfelt statement on how ridiculous the whole thing was. A key part:

There’s only so many notes and very few chords used in pop music. Coincidence is bound to happen if 60,000 songs are being released every day on Spotify—that’s 22 million songs a year—and there’s only 12 notes that are available.

In the aftermath of this, Sheeran has said that he’s now filming all of his recent songwriting sessions, just in case he needs to provide evidence that he and his songwriting partners came up with a song on their own, which is depressing in its own right.

◧◩◪◨
159. rzzzt+AW[view] [source] [discussion] 2023-05-10 16:46:55
>>Karell+FA
Hmm, "robot" in its spelled out form sounds weird to me for this use ("bot" is more frequent). Wikipedia redirects people looking for software agents to a separate page from the article about the beep-boop ones: https://en.wikipedia.org/wiki/Robot
◧◩
166. omoika+l21[view] [source] [discussion] 2023-05-10 17:10:29
>>samwil+H5
There is no legal agreement to follow robots.txt, but it appears to have came up a few times (from the first search result for "court cases involving robots.txt"):

https://www.robotstxt.org/faq/legal.html

If an "ai.txt" were to exist, I hope it's a signal for opt-in rather than opt-out. Whereas "robots.txt" being an explicit signal for opt-out might be useful because people who build public websites generally want their websites to be discovered, it seemed unlikely that training unknown AI would be a use case that content creators had in mind, considering that most existing content predates current AI systems.

◧◩
179. rglove+3a1[view] [source] [discussion] 2023-05-10 17:46:25
>>matsem+ae
I've been (slowly) writing a new type of OSS license around this exact concept so it's easier to (legally) stop LLMs hoovering up IP [1] (under "derivative works not permitted").

[1] https://github.com/cheatcode/joystick/blob/development/LICEN...

◧◩◪
180. spc476+pa1[view] [source] [discussion] 2023-05-10 17:47:19
>>shaneb+P6
It might be a better idea to serve up a 418 ("I'm a tea pot") with a line line text file saying "I'm not an HTTP server". That solved a problem I had with bots making HTTP requests to my gopher server [1]. Serving up a 503 informs the bot that there's a server issue and it may try again later. A 418 informs the bot that it made an erroneous request and such an odd error code might get someone to look into it and stop.

[1] https://boston.conman.org/2019/09/30.2

◧◩
183. User23+Oe1[view] [source] [discussion] 2023-05-10 18:06:39
>>samwil+H5
> there is no "legal" agreement to follow those rules

Yes there certainly is[1]. The robots.txt clearly specifies authorized use and violating it exceeds that authorization. Now granted good luck getting the FBI to doorkick their friends at Google and other politically connected tech companies, but as the law is written crawlers need to honor the site owner's robots.txt.

[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act

188. moimik+zh1[view] [source] 2023-05-10 18:17:30
>>Jeanne+(OP)
what differentiates this from https://humanstxt.org/?
◧◩◪
201. westur+bo1[view] [source] [discussion] 2023-05-10 18:45:42
>>mtmail+3B
JSON-LD or RDFa (RDF in HTML attributes) in at least the /index.html the HTML footer should be sufficient to indicate that there is structured linked data metadata for crawlers that then don't need an HTTP request to a .well-known URL /.well-known/ai_security_reproducibility_carbon.txt.jsonld.json

OSV is a new format for reporting security vulnerabilities like CVEs and an HTTP API for looking up CVEs from software component name and version. https://github.com/ossf/osv-schema

A number of tools integrate with OSV-schema data hosted by osv.dev: https://github.com/google/osv.dev#third-party-tools-and-inte... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

> Currently it is able to scan various lockfiles [ repo2docker REES config files like and requirements.txt, Pipfile lock, environment.yml, or a custom Dockerfile, ], debian docker containers, SPDX and CycloneDB SBOMs, and git repositories.

◧◩◪◨
203. westur+4r1[view] [source] [discussion] 2023-05-10 18:57:32
>>doodle+Hy
Thing > CreativeWork > WebSite https://schema.org/WebSite ... scroll down to "Examples" and click the "JSON-LD" and/or "RDFa" tabs. (And if there isn't an example then go to the schema.org/ URL of a superClassOf (rdfs:subClassOf) of the rdfs:Class or rdfs:Property; there are many markup examples for CreativeWork and subtypes).

httpS://schema.org/license

Also: https://news.ycombinator.com/item?id=35891631

extruct is one way to parse linked data from HTML pages: https://github.com/scrapinghub/extruct

◧◩◪◨⬒
205. mathqu+cr1[view] [source] [discussion] 2023-05-10 18:57:59
>>safety+BN
> If let's say Star Wars falls out of copyright tomorrow, economically that has two effects. One, Disney loses a ton of future revenue. Two, countless Disney other people create derivatives of Star Wars, and they make money from those. Competition is increased.

Three, the derivatives are made and Disney starts marketing "Disney's Star Wars" which continue to be the high-demand (and high-value) versions. The situation is unchanged.

For example, you can currently buy The Little Mermaid in non-Disney form[1], but Disney's version is what most people want.

[1] - https://www.amazon.com/s?k=little+mermaid+Hans+Christian+And...

◧◩◪◨⬒⬓⬔⧯
237. asdkjl+Xg2[view] [source] [discussion] 2023-05-10 23:29:58
>>majorm+Ch1
> Why should land be owned?

it shouldn't. Or, well, it should, but it should be the one and only thing taxed: https://en.wikipedia.org/wiki/Georgism

◧◩◪◨
241. 8note+Nj2[view] [source] [discussion] 2023-05-10 23:48:46
>>lances+8y
> People are free to make the Little Mermaid, Beauty and the Beast, Hunchback of Notre Dame, Aladdin, etc. and there's nothing out there that stops them.

IP law reasonably does. See: https://trademarks.justia.com/852/28/the-little-mermaid-8522...

◧◩◪◨⬒⬓⬔⧯▣▦
250. musica+SJ2[view] [source] [discussion] 2023-05-11 03:02:47
>>wwwest+Ou1
I think the concept that PP may be trying to get across is scarcity:

"goods are scarce because there are not enough resources to produce all the goods that people want to consume".(quoted at [1])

Physical books are intrinsically scarce because they require physical resources to make and distribute copies. Libraries are often limited by physical shelf space.

Ebooks are not intrinsically scarce because there are enough resources to enable anyone on the internet to download any one of millions of ebooks at close to zero marginal cost, with minimal physical space requirements per book. Archive.org and Z-Library are examples of this.

Consider also free goods:

"Examples of free goods are ideas and works that are reproducible at zero cost, or almost zero cost. For example, if someone invents a new device, many people could copy this invention, with no danger of this "resource" running out."[2]

[1] https://en.wikipedia.org/wiki/Scarcity

[2] https://en.wikipedia.org/wiki/Free_good

◧◩◪◨⬒⬓⬔⧯▣▦▧▨
254. gavinh+LP2[view] [source] [discussion] 2023-05-11 03:52:25
>>alphan+4H2
https://en.wikipedia.org/wiki/Loss_leader

Then once smaller competitors are out of business, raise prices.

Of course, force can go into it, such as when a big company sues a smaller company with a frivolous lawsuit that the smaller company can't afford to fight. Then the smaller company goes out of business, and the big company can use their ideas free.

257. anaclu+ba3[view] [source] 2023-05-11 06:28:15
>>Jeanne+(OP)
Some interesting studies on this I've done: https://cho.sh/r/F9F706

Project AIs.txt is a mental model of a machine learning permission system. Intuitively, question this: what if we could make a human-readable file that declines machine learning (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.

User-agent: OpenAI Disallow: /some-proprietary-codebase/

User-agent: Facebook Disallow: /no-way-mark/

User-agent: Copilot Disallow: /expensive-code/

Sitemap: /public/sitemap.xml Sourcemap: /src/source.js.map License: MIT

# SOME LONG LEGAL STATEMENTS HERE

Key Issues Would it be legally binding? For now, no. It would be a polite way to mark my preference to opt-out of such data mining. It's closer to the Ask BigTechs Not to Track option rather than a legal license. Technically, Apple's App Tracking Transparency does not ban all tracking activity; it never can.

254AFC.png

Why not LICENSE or COPYING.txt? Both are mainly written in human language and cannot provide granular scraping permissions depending on the collector. Also, GitHub Copilot ignores LICENSE or COPYING.txt, claiming we consented to Copilot using our codes for machine learning by signing up and pushing code to GitHub, We may expand the LICENSE system to include the terms for machine learning use, but that would even more edge case and chaotic licensing systems.

Does machine learning purposes of copyrighted works require a license? This question is still under debate. Opt-out should be the default if it requires a license, making such a license system meaningless. If it doesn't require a license, then which company would respect the license system, given that it is not legally binding?

Is robots.txt legally binding? No. Even if you scrape the web prohibited under robots.txt, it is not against the law. See HIQ LABS, INC., Plaintiff-Appellee, v. LINKEDIN CORPORATION, Defendant-Appellant.. robots.txt cannot make fair use illegal.

Any industry trends? W3 has been working on robots.txt for machine learning, aligning with EU Copyright Directives.

The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights. w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group

Can we even draw the line? No. One could reasonably argue that AI is doing the same as humans, much better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also have the same level of awareness system to respect it and prevent any plagiarism. Otherwise, they must bear legal duties.

Maybe it can benefit AI companies too ... by excluding all hacky codes and only opting for best-practice codes. If implemented correctly, it can work as an effective data sanitation system.

259. juliel+Sp3[view] [source] 2023-05-11 08:33:26
>>Jeanne+(OP)
It is fair to give more information about the information exposed on a website, especially when it comes to partnering with AI systems. There is an international effort which includes such information. It is done under the auspices of the W3C. See https://www.w3.org/community/tdmrep/. It has been developed to implement the Text & Data Mining + AI "opt-out" that is legal in Europe. It does not use robots.txt because this one is about indexing a website and should stay focus on it. The information about website managers is contained in the /.well-known directory, in a JSON-LD file, which is much more well structured than robots.txt. Why not adhere to an international effort rather than creating N fragmented initiatives?
◧◩
281. menro2+Hxp[view] [source] [discussion] 2023-05-18 05:44:54
>>theand+56
I've been thinking about ai.txt more as rss - just beginning to vet the ideas and process" https://github.com/menro/ai.txt
[go to top]