Had a call with Reddit to discuss pricing

>>robbie+(OP)
I think it's very clear that the recent LLM boom is directly responsible for Twitter, Reddit, and others quickly moving to restricted APIs with exorbitant pricing structures. I don't think these orgs really care much about third-party clients other than as a nuisance consuming some fraction of their userbase.

Enterprise deals between these user generated content platforms and LLM platforms may well involve many billions of API requests, and the pricing is likely an order of magnitude less expensive per call due to the volume. The result is a cost-per-call that is cost-prohibitive at smaller scales, and undoubtedly the UGC platform operators are aware that they're pricing out third-party applications like Apollo and Pushshift. These operators need high baseline pricing so they can discount in negotiation with LLM clients.

Or, perhaps, it's the opposite: for instance, Reddit could be developing its own first-party language model, and any other model with access to semi-realtime data is a potentially existential competitor. The best strategic route is to make it economically infeasible for some hypothetical competitor to arise, while still generating revenue from clients willing to pay these much higher rates.

Ultimately, this seems to be playing out as the endgame of the open internet v. corporate consolidation, and while it's unclear who's winning, I think it's pretty obvious that most of us are losing.

>>58x14+Je
If you want training data for an LLM and are actively talking to some data providers, you'd just ask for a dump, instead of making a billion small requests.

(You'd make the billion small requests, if you are doing this on the sly.)

>>eru+Zt1
Right, that'd be the case now, but previously you could just make a billion small requests for free.

>>sahila+RN1
Or at least you could try.

But that still makes the original commenter's argument moot:

> Enterprise deals between these user generated content platforms and LLM platforms may well involve many billions of API requests, and the pricing is likely an order of magnitude less expensive per call due to the volume. The result is a cost-per-call that is cost-prohibitive at smaller scales, [...]

That speculation is not how things have been or were.

>>eru+R02
I think most people who wanted large datasets got their data via Pushshift. Pushshift was basically a guy who started out doing small things, got so frustrated with the API, and eventually grew into maintaining large mirrors of Reddit content on Google Cloud that people could access and query. I don't know why anyone doing research would have used Reddit's API instead of Pushshift.

Pushshift was shut down by Reddit earlier this year, so they are probably getting hammered by LLM folks trying to get the data now, since they killed Pushshift without understanding how it fit into the universe.

Reddit is completely stupid if they think people are going to pay for "enterprise API" access... Pushshift existed because the API was trash and the only real option was to dump the entire dataset into something usable. The reason Reddit's data was used so much is that there was an SQL API via Pushshift, and you could also download archives of the entire dataset in one go.

>>fluidc+od2
> Pushshift was shut down by Reddit earlier this year

Oh, is this why all the comment-undelete websites broke?

>>doglea+wB2
Yep, this is exactly why.
