Reddit and HN are among the highest-quality sources of training text and are likely weighted very heavily as "probably human" in the mainstream models.
Any source of text with heavy automated and community moderation will be higher quality than, say, Twitter.