I do suspect that well-curated and hand-tuned corpora, including possibly Cyc's, are of significant use to LLM AI. And will likely be more so as the feedback / autophagy problem worsens.
Natural-language content-based classification, as in Google and other Web text search, relies effectively on documents' self-descriptions (that is, their content itself) to classify and search works, though a ranking scheme (e.g., PageRank) is typically layered on top of that. What distinguished early Google from prior full-text search was that the latter had no ranking criteria, which led to keyword stuffing. An alternative approach was Yahoo, originally "Yet Another Hierarchical Officious Oracle", a curated, ontological classification of websites. That approach was already proving infeasible at Web scale by 1997/98, though the resulting directory might prove useful as training data for machine classification.
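To make the keyword-stuffing point concrete, here's a minimal power-iteration sketch of the PageRank idea. The graph, damping factor, and iteration count are all illustrative assumptions, not Google's actual implementation; the point is only that a page's score comes from inbound links, not from its own text.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a link graph (dict: page -> outlinks)."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleport term: baseline rank every page receives.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                # Each page splits its rank evenly among its outlinks.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank across all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: "d" can stuff its own text with keywords,
# but since nothing links to it, its rank stays at the teleport floor.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
print(pagerank(links))
```

This is why layering a link-based ranking over full-text matching blunted keyword stuffing: inflating a page's own content changes what queries it matches, but not how much rank flows into it from the rest of the Web.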