zlacker

[return to "Had a call with Reddit to discuss pricing"]
1. 58x14+Je 2023-05-31 18:31:34
>>robbie+(OP)
I think it's very clear that the recent LLM boom is directly responsible for Twitter, Reddit, and others quickly moving to restricted APIs with exorbitant pricing structures. I don't think these orgs regard third-party clients as much more than a nuisance consuming some fraction of their user base.

Enterprise deals between these user-generated-content platforms and LLM vendors may well involve many billions of API requests, priced an order of magnitude cheaper per call because of the volume. The result is a list price that is cost-prohibitive at smaller scales, and the UGC platform operators are undoubtedly aware that they're pricing out third-party applications like Apollo and Pushshift. These operators need a high baseline price so they have room to discount when negotiating with LLM clients.
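
Back-of-the-envelope, using the figures from the linked post ($12,000 per 50M requests, i.e. $0.24 per 1,000 calls, and Apollo's ~7 billion requests/month); the bulk discount for an LLM client is purely my assumption:

    # Rack rate from the linked post: $12,000 per 50M requests.
    RATE_PER_1K = 12_000 / 50_000_000 * 1_000   # $0.24 per 1,000 calls

    apollo_monthly_calls = 7_000_000_000        # Apollo's reported volume
    monthly = apollo_monthly_calls / 1_000 * RATE_PER_1K
    print(f"Apollo at rack rate: ${monthly:,.0f}/mo, ${monthly * 12:,.0f}/yr")
    # -> ~$1.7M/mo, ~$20M/yr, matching the post's estimate

    # A hypothetical LLM client negotiating 90% off still pays real money:
    llm_monthly_calls = 10_000_000_000          # assumed volume
    print(f"LLM client at 10% of rack: ${llm_monthly_calls / 1_000 * RATE_PER_1K * 0.1:,.0f}/mo")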

Or perhaps it's the opposite: Reddit, for instance, could be developing its own first-party language model, in which case any other model with access to semi-real-time data is a potentially existential competitor. The best strategic move is then to make it economically infeasible for such a competitor to arise, while still generating revenue from clients willing to pay the much higher rates.

Ultimately, this seems to be playing out as the endgame of the open internet vs. corporate consolidation, and while it's unclear who's winning, I think it's pretty obvious that most of us are losing.

2. eru+Zt1 2023-06-01 01:58:32
>>58x14+Je
If you want training data for an LLM and you're actively talking to a data provider, you'd just ask for a dump instead of making a billion small requests.

(You'd only make the billion small requests if you were doing it on the sly.)

3. 58x14+c88 2023-06-03 00:01:26
>>eru+Zt1
You're right, but I think it's also pretty clear that:

A) there is demand for functionality that depends on semi-real-time data, e.g. a prompt like "explain {recent_trending_topic} to me and describe its evolution", where the response could be useful in various contexts;

B) the degradation of the search experience and the explosion of chat interfaces suggest that "the future of search is chat", and the number of Google searches prefixed or suffixed with "Reddit" makes it obvious that LLM-powered chat models with search functionality will want to query Reddit extensively; for the example prompt above, the tree of queries generated to fulfill a single prompt could be sizeable (see the first sketch below);

C) improvements to fine-tuning pipelines make it increasingly feasible to use real-time data with LLMs, such as a "trending summary" function that caches many potentially related queries from Reddit, Twitter, etc. and uses them to fine-tune a model that would serve a response to my example prompt (see the second sketch below).
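
To make (B) concrete, here's a minimal sketch of the fan-out: one user prompt becomes several sub-queries, each hitting Reddit's public JSON search endpoint. The sub-query list is hardcoded and purely illustrative; a real system would have the model generate it, and a paying API client would authenticate via OAuth.

    import requests

    def subqueries(topic: str) -> list[str]:
        # Illustrative; a real system would have the LLM generate these.
        return [topic, f"{topic} explained", f"{topic} timeline", f"{topic} reactions"]

    def fan_out(topic: str) -> list[dict]:
        posts = []
        for q in subqueries(topic):
            r = requests.get(
                "https://www.reddit.com/search.json",
                params={"q": q, "sort": "new", "limit": 25},
                headers={"User-Agent": "fanout-demo/0.1"},
                timeout=10,
            )
            r.raise_for_status()
            posts.extend(child["data"] for child in r.json()["data"]["children"])
        return posts

    posts = fan_out("recent_trending_topic")
    print(f"1 prompt -> {len(posts)} documents")  # 4 API calls here; real trees go deeper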
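
And a sketch of the caching side in (C), reusing `posts` from above: dump the fetched posts as prompt/completion pairs in JSONL, a common fine-tuning input format. The record shape and refresh logic are my own assumptions, not any particular vendor's pipeline.

    import json, time

    def build_finetune_set(posts: list[dict], topic: str, path: str = "trending.jsonl"):
        with open(path, "w") as f:
            for p in posts:
                record = {
                    "prompt": f"Explain {topic} to me and describe its evolution.",
                    "completion": p.get("title", "") + "\n\n" + p.get("selftext", ""),
                    "fetched_at": time.time(),  # lets a refresh job evict stale rows
                }
                f.write(json.dumps(record) + "\n")
        # A periodic job would rebuild this file and kick off a fine-tune run,
        # giving the model "semi-real-time" knowledge of the trending topic.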
