So I imagine there is a big pile of Reddit comments, Twitter messages, and LibGen and arXiv PDFs.
So there is some shit, but also painstakingly encoded knowledge (i.e. writing), and yeah, it is miraculous that LLMs are right as often as they are.
The most recent numbers from LibGen itself are 2.4 million non-fiction books and 80 million science journal articles. The Atlantic's database, published in 2025, has 7.5 million books.[0] The publishing industry estimates that many books are published each year. As of 2010, Google counted over 129 million books.[1] At best, an LLM like Llama will have 20% of all books in its training set.
0. https://www.theatlantic.com/technology/archive/2025/03/libge...
1. https://booksearch.blogspot.com/2010/08/books-of-world-stand...
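
For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted above. It assumes the whole database ends up in the training set and uses Google's 2010 count as the denominator, which is an assumption on my part (the real total today is larger, so the true shares would be lower). The function and constant names are just illustrative.

```python
# Figures quoted in the comment above (books only; journal articles excluded).
LIBGEN_NONFICTION_BOOKS = 2_400_000
ATLANTIC_DB_BOOKS = 7_500_000
GOOGLE_BOOK_COUNT_2010 = 129_000_000  # assumed denominator; a 2010 figure


def share_of_all_books(books_in_set: int,
                       total_books: int = GOOGLE_BOOK_COUNT_2010) -> float:
    """Return books_in_set as a percentage of total_books."""
    return 100 * books_in_set / total_books


libgen_share = share_of_all_books(LIBGEN_NONFICTION_BOOKS)
atlantic_share = share_of_all_books(ATLANTIC_DB_BOOKS)

print(f"LibGen non-fiction: {libgen_share:.1f}% of the 2010 count")
print(f"Atlantic database:  {atlantic_share:.1f}% of the 2010 count")
```

Under those assumptions both shares come out in the single digits, comfortably under the 20% ceiling.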