At this point, all the good content has been sucked into LLM training sets. Other than a need to keep up with current events, there's no point in crawling more of the web to get training data.
But dumping vast amounts of crap content into an LLM training set has a downside: the training method has no notion of data quality.