And I say this as someone who is extremely bothered by how easily massive amounts of open content can just be vacuumed up into a training set with reckless abandon. There isn’t much you can do other than put everything you create behind some kind of authentication wall, and even then it’s only a matter of time until it leaks anyway.
Pandora’s box really is open. We need to figure out how to live in a world with these systems, because it’s an unwinnable arms race where only bad actors will benefit from everyone else being neutered by regulation, especially given the massive pace of open source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
Foreign companies can be barred from selling infringing products in the United States.
Russian and Chinese consumers are less interested in English-language articles.
I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.
If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering questions on topics that their government deems sensitive.
Isn't it just one additional step to automatically translate them?