And I say this as someone that is extremely bothered by how easily mass amounts of open content can just be vacuumed up into a training set with reckless abandon and there isn’t much you can do other than put everything you create behind some kind of authentication wall but even then it’s only a matter of time until it leaks anyway.
Pandora’s box is really open, we need to figure out how to live in a world with these systems because it’s an un winnable arms race where only bad actors will benefit from everyone else being neutered by regulation. Especially with the massive pace of open source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.
We will not have "AIs as capable as humans" in a couple decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's vopyright infringement. It's essentially the same situation as sampling in music, and imo the same solutions can be applied here: e.g. licenses with royalties.