It's a hive of misinformation, disinformation, and toxicity. It's succinct, I guess, but nothing is eloquent or descriptive because of the character limit. And it's full of repetitive "filler" content.
Who wants that in a foundational LLM dataset?
Maybe it's OK for finding labeled images... but even that seems kinda iffy.
Twitter is great for examples of that, and the toxicity and disinformation don't get in the way.
Conversely, a training set doesn't need to be up to date to be useful for that.
I don't know if anyone really was trying to scrape it (examples of Musk disagreeing with his own engineers come to mind), but I assume it's possible, and given the quality of code ChatGPT spits out, I can easily believe a really bad scraper has been produced by someone who thought they could do without hiring a software developer. If so, they might think they can get hot stock tips or forewarning of a pandemic from which emoji people post, or something like that. That's not really what an LLM is for, but loads of people (even here!) conflate all the different kinds of AI into one thing.
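For what it's worth, here's a minimal sketch of the kind of naive scraper I have in mind: a plain HTTP fetch plus HTML parsing. The profile URL pattern and the selector are just placeholders, and since Twitter/X builds its timeline client-side in JavaScript, a fetch like this comes back mostly empty, which is roughly the failure mode I'd expect from unreviewed ChatGPT output:

    # Naive "fetch the page and parse the HTML" scraper. The profile URL
    # and the <p> selector are hypothetical placeholders; real Twitter/X
    # pages render tweets in JavaScript, so the initial HTML contains
    # almost no tweet text and this returns next to nothing.
    import requests
    from bs4 import BeautifulSoup

    def scrape_tweets(user: str) -> list[str]:
        resp = requests.get(f"https://twitter.com/{user}", timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Grabs whatever static paragraph text happens to be in the
        # server-rendered shell, not the actual tweets.
        return [p.get_text(strip=True) for p in soup.select("p")]

    if __name__ == "__main__":
        print(scrape_tweets("example_user"))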