It's kind of funny when you consider that these companies might see their user data as valuable, especially with recent advances in LLMs. I've been thinking that if you just excluded Reddit posts from training, you'd probably get much lower bullshit scores, since that seems to be what most posts on there amount to. I think data curation by source could achieve quite a bit.