zlacker

[parent] [thread] 0 comments
1. winddu+(OP)[view] [source] 2023-05-10 15:43:15
> It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.

How do you differentiate an AI crawler from a normal crawler? Almost all LLMs are trained on Common Crawl, which started before the concept of LLMs even existed. What about a crawler that builds a search index whose results are later fed into an LLM as context? Or middleware that fetches data in real time?

Honestly, that's a terrible idea, and robots.txt already covers these use cases. It's still pretty ineffective, though, because it's more a set of suggestions than rules that must be followed.
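For anyone unfamiliar, the "set of suggestions" looks something like this: a robots.txt sketch that asks a couple of known crawlers (GPTBot is OpenAI's, CCBot is Common Crawl's) to stay out. Nothing enforces it; a crawler that ignores the file fetches your pages anyway.

```
# robots.txt — a polite request, not an access control

User-agent: GPTBot    # OpenAI's crawler
Disallow: /

User-agent: CCBot     # Common Crawl's crawler
Disallow: /

# Everyone else may crawl everything
User-agent: *
Allow: /
```

Note this also illustrates the differentiation problem: you can only target user agents that identify themselves honestly.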