Robot.txt isn't about copyrights, its about preventing bots. Its effectively a EULA. Copyright law only goes into effect when you distribute the content you scrape. If you scraped New York times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.
>>adrr+(OP)
> If you scraped New York times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.
Why?
As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious vilation, though.