Leaked OpenAI documents reveal aggressive tactics toward former employees

>>apengw+(OP)
Looking forward to a document leak about OpenAI using YouTube data to train their models. When asked whether they use it, Murati (CTO) said she didn't know, which makes me think it's 99% likely that they do.

>>znitur+dH
I would say 100%, simply because there is no other reasonable source of video data at that scale.

>>Dr_Bir+NY
I use multiple websites that have hundreds of thousands of free stock videos that are much easier to label than YouTube videos.

>>iLoveO+Y31
The number of videos is less relevant than the total duration of high-quality video (quality can be approximated on YouTube with metrics such as view and subscriber counts). Also, while YouTube videos are not labelled directly, you can extract signal from the title, the captions, and perhaps even the comments. Lastly, many sites host their videos on YouTube and embed them in their own pages, and those pages probably contain more text that can be used as labels.
