(You'd make the billion small requests, if you are doing this on the sly.)
But that still makes the original commenter's argument moot:
> Enterprise deals between these user generated content platforms and LLM platforms may well involve many billions of API requests, and the pricing is likely an order of magnitude less expensive per call due to the volume. The result is a cost-per-call that is cost-prohibitive at smaller scales, [...]
That speculation doesn't match how things have actually worked.
Pushshift was shut down by Reddit earlier this year, so they're probably getting hammered by LLM folks trying to get the data now, since they killed Pushshift without understanding how it fit into the ecosystem.
Reddit is completely stupid if they think people are going to pay for "enterprise API" access... Pushshift existed because the API was trash and the only real option was to dump the entire dataset into something usable. The reason Reddit's data was used so much is that Pushshift offered an SQL-style API and you could also download archives of the entire dataset in one go.
Oh is this why all the comment undelete websites broke?
Then rotate between accounts and insert a random delay between requests. Restrict certain accounts to browsing within certain hours/timezones. Load pages as a normal user would and scrape the data from the rendered page rather than via the API.
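A minimal sketch of that scheduling idea (the account names, UTC offsets, and hour windows are all made up for illustration; the actual page fetching and parsing are left out):

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical account pool: each account only "browses" during its
# local daytime, approximated by a UTC offset and an active-hour window.
ACCOUNTS = [
    {"name": "acct_a", "utc_offset": -5, "active_hours": range(8, 23)},
    {"name": "acct_b", "utc_offset": 1,  "active_hours": range(9, 22)},
    {"name": "acct_c", "utc_offset": 9,  "active_hours": range(7, 24)},
]

def active_accounts(now_utc):
    """Return the accounts whose local hour falls inside their window."""
    eligible = []
    for acct in ACCOUNTS:
        local_hour = (now_utc + timedelta(hours=acct["utc_offset"])).hour
        if local_hour in acct["active_hours"]:
            eligible.append(acct)
    return eligible

def next_request(now_utc):
    """Pick a random eligible account and a jittered delay (seconds)
    before the next page load. Returns (account_name, delay), or None
    if every account is outside its browsing window."""
    pool = active_accounts(now_utc)
    if not pool:
        return None
    acct = random.choice(pool)
    delay = random.uniform(4.0, 45.0)  # human-ish, not a fixed interval
    return acct["name"], delay
```

The caller would fetch the HTML page with the chosen account's session, sleep for `delay`, and parse the markup directly instead of calling the API.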
That said, I believe in a company's right to charge whatever it wants for its services. But I also believe in people's right to choose not to use that service, and in freer alternatives springing up.
Just like Tumblr, Reddit seems intent on killing itself, although these days I'm not so sure. When Elon took over Twitter, everyone said all the users would leave and it would die. That hasn't happened; human nature means people seek familiarity and cling on.
A) there is demand for functionality that depends on semi-real-time data, e.g. a prompt like “explain {recent_trending_topic} to me and describe its evolution” where the return could be useful in various contexts;
B) the degradation of search experience and the explosion of chat interfaces seem to indicate “the future of search is chat” and the number of Google searches prefixed or suffixed with “Reddit” make it obvious that LLM-powered chat models with search functionality will want to query Reddit extensively, and in the example prompt above, the tree of queries generated to fulfill a single prompt could be sizeable;
C) improvements to fine-tuning pipelines make it more and more feasible to use real-time data in the context of LLMs, such as a “trending summary” function that could cache many potentially related queries from Reddit, Twitter, etc., and use them to fine-tune a model that would serve a response to my example prompt.
Wouldn't network effects be the obvious null hypothesis, before we start speculating about human nature?