What this misses is all the regulatory capture that he’s been campaigning for. All the platforms have now closed their gardens. Authors and artists are much more vigilant about copyright, etc. So it’s now a totally different game compared to 3 years ago, because the data is no longer just up for grabs.
https://academictorrents.com/details/89d24ff9d5fbc1efcdaf9d7...
I assume this must be only the text portion, and heavily compressed?
For example, the entire English-language Wikipedia is only around 60GB in ZIM, a format that can be readily searched and randomly accessed: https://kiwix.org/