What this misses is all the regulatory capture that he’s been campaigning for. All the platforms have now closed their gardens. Authors and artists are much more vigilant about copyright, etc. So it’s now a totally different game compared to 3 years ago, because the data is no longer just there for the taking.
Here's to hoping there's still some poetic irony left to dish out in the world.
People speculated it was the funding, or attracting talent, or having "access". Turns out it was none of them (obviously they all play a part, but having all three doesn't mean you can best OpenAI, which gives you the fundamental reason why it is so hard to compete with them).
If the app does use certificate pinning, then you can use an Android phone and a modified app that removes the logic that enforces certificate pinning. This is more involved but also not impossible.
They "steal" access to data because the LLM launders it on the other end.
Easier said than done
https://academictorrents.com/details/89d24ff9d5fbc1efcdaf9d7...
I assume this must be only the text portion, and heavily compressed?
The data has to come from somewhere, and all of the outlets that were used to train ChatGPT, Stable Diffusion, etc. have since been locked down. Any new company that Sam Altman makes in the AI space won't be competing just on the merits of talent and product; it will also need to pay for and negotiate access to data.
I'd actually expect this to get far worse going forward, now that other organizations have an idea of how valuable their data is. It's also trivial to justify locking it down under the guise of protecting people, privacy, etc.
LLMs know the contents of books because they are analyzed, reviewed and spoken about everywhere. Pick some obscure book that doesn't show up on any social media and ask about its contents. GPT won't have a clue.
What's your evidence to the contrary? Sounds like your common sense rather than inside knowledge.
The entire English language Wikipedia is only around 60GB in a format that can be readily searched and randomly accessed (ZIM), for example: https://kiwix.org/
For the mobile app I used one of the smaller Wikipedia subsets, since I didn't want to take up too much space on my phone. The full offline Wikipedia download is saved to my laptop.
I'm building a magazine encyclopedia and I would estimate that 99.9% of all magazines ever published are not available electronically. And that the content in magazines probably exceeds the content in books by an order of magnitude.
Many people have betrayed their country to foreign governments in exchange for mere thousands of dollars. It is never safe to rule out the willingness of employees to engage in corporate espionage, even in exchange for truly pitiful rewards. It would be a stupid idea, but that doesn't mean it won't happen.
It is harder to prove to a "should have known" standard compared to, say, buying stolen speakers from the back of a truck for 20% of the list price.