I'm not sure why I've never heard of this being done; it would be a good use of GPUs in between training runs.
If it's possible to produce intelligence just by ingesting text, then the current tech companies already have all the data they need from their initial scrapes of the internet. They don't need more. That's different from keeping models up to date on current affairs.
EVERY YouTube video?? Even the 9/11 truther videos? The Sandy Hook conspiracy videos? Flat earth? Even the blatantly racist ones? That would be some bad training data without some pruning.
> Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT.
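For context on what "purely through RL" means here: instead of supervised fine-tuning on labeled reasoning traces, you sample a group of completions per prompt, score each with a verifiable reward (e.g., does the final answer match?), and reinforce the above-average ones. A toy GRPO-style sketch of that loop (every function here is an illustrative stub, not DeepSeek's actual code):

```python
import random

def sample_completion(prompt):          # stub: stands in for model.generate()
    return prompt + " ... answer: " + str(random.randint(0, 9))

def reward(completion, gold_answer):    # verifiable reward: no SFT labels needed
    return 1.0 if completion.endswith("answer: " + gold_answer) else 0.0

def grpo_step(prompt, gold_answer, group_size=8):
    # Sample a group of completions and score them against the checkable answer.
    completions = [sample_completion(prompt) for _ in range(group_size)]
    rewards = [reward(c, gold_answer) for c in completions]
    # Group-normalized advantages: above-average completions get pushed up.
    mean = sum(rewards) / group_size
    std = (sum((r - mean) ** 2 for r in rewards) / group_size) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # In a real trainer these advantages would weight the policy-gradient loss
    # on each completion's token log-probs; here we just report them.
    return list(zip(completions, advantages))

for c, a in grpo_step("What is 3 + 4?", "7"):
    print(f"advantage={a:+.2f}  {c}")
```

The point is that the only supervision signal is the checkable answer; the chain-of-thought in between is never labeled.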
Part of the reason kids need less material is that they aren't just listening; they're also able to run experiments to see what works and what doesn't.
but also the myriad hardcore private repositories of many high-tech US enterprises hacking on amazing shit (mine included) :)