zlacker

I think it's very clear that the recent LLM boom is directly responsible for Twitter, Reddit, and others quickly moving to restricted APIs with exorbitant pricing structures. I don't think these orgs really care much about third-party clients other than a nuisance consuming some fraction of their userbase.

Enterprise deals between these user generated content platforms and LLM platforms may well involve many billions of API requests, and the pricing is likely an order of magnitude less expensive per call due to the volume. The result is a cost-per-call that is cost-prohibitive at smaller scales, and undoubtedly the UGC platform operators are aware that they're pricing out third-party applications like Apollo and Pushshift. These operators need high baseline pricing so they can discount in negotiation with LLM clients.

Or, perhaps, it's the opposite: for instance, Reddit could be developing its own first-party language model, and any other model with access to semi-realtime data is a potentially existential competitor. The best strategic route is to make it economically infeasible for some hypothetical competitor to arise, while still generating revenue from clients willing to pay these much higher rates.

Ultimately, this seems to be playing out as the endgame of the open internet v. corporate consolidation, and while it's unclear who's winning, I think it's pretty obvious that most of us are losing.

replies(7): >>ameliu+9L >>Nextgr+CT >>quartz+BU >>eru+gf1 >>dagesh+dA1 >>hacker+0W1 >>Macha+NM5

>>58x14+(OP)
Can't they pull the data from archive.org?

replies(3): >>notaco+jM >>KuiN+AR >>SllX+KU

>>ameliu+9L
That would be worse.

>>ameliu+9L
Archive.org was knocked offline the other day due to some AI startup scraping it to death. It’s not a good thing.

replies(1): >>moneyw+Oi1

>>58x14+(OP)
LLMs have nothing to do with it. Someone skilled enough and rich enough to develop and train an LLM is absolutely capable of reverse-engineering your private API or scraping your web UI and defeating whatever protections you have.

replies(5): >>throw_+O11 >>moneyw+zi1 >>numpad+Pi1 >>applea+dm1 >>lost_t+Ls1

>>58x14+(OP)
Yes it's this. This has nothing to do with 3rd party app operation and everything to do with generally closing the gate to the data garden.

The value of reddit's content to non-reddit entities is rapidly increasing as its monetizable use shifts from a set of signals on which to build first-party ad targeting (which they never really figured out) to generally useful llm training data.

>>ameliu+9L
Archive.org is a non-profit without the capacity to serve that many requests. An excellent resource for people to use carefully, but not a treasure trove for bots to scrape down to the last bit.

replies(1): >>notpus+B32

>>Nextgr+CT
And open yourself to potential lawsuits. You can fork any public repo in github too, don't need any fancy resverse-engineering or web scraping. But if you use the content illegally then what's the point.

replies(1): >>dvngnt+j31

>>throw_+O11
i think you can only scrape public information, so if everything is behind a login screen then that might cause issues

>>58x14+(OP)
If you want training data for an LLM and are actively talking to some data providers, you'd just ask for a dump, instead of making a billion small requests.

(You'd make the billion small requests, if you are doing this on the sly.)

replies(3): >>sahila+8z1 >>fennec+gI2 >>58x14+tT7

>>Nextgr+CT
Managing all those LTE proxies is far from cheap

>>KuiN+AR
Source, they don’t rate limit

replies(3): >>Kon-Pe+pk1 >>pipers+Iq1 >>edgyqu+ew1

>>Nextgr+CT
I heard researchers on public funding can’t violate ToS without invalidating their current and future employment, and therefore cannot engage in social media researches without free API…

>>moneyw+Oi1
https://news.ycombinator.com/item?id=36110527

>>Nextgr+CT
dont tell this to youtube

>>moneyw+Oi1
True - and their lack of rate limiting ended up letting someone overwhelm their servers, knocking them offline.

>>Nextgr+CT
lol you can't get in trouble for datascraping or figuring out ways around their anti scraping measures. Good luck enforcing any user agreements the bot has to click through. If they don't want it scraped then they have to not put it on a public facing webpage.

>>moneyw+Oi1
They put out a blog asking people not to scrape afterwards. A simple google will be much fast than asking for sources.

>>eru+gf1
Right that'd be the case now but previously you could just make a billion small requests for free.

replies(1): >>eru+8M1

>>58x14+(OP)
This is very obviously what's going on.

The web is in the process of rapidly filling up with AI regurgitated garbage, eventually there's going to be a handful of sites with real usable content on them left, reddit being one of the biggest.

replies(1): >>xtract+qX4

>>sahila+8z1
Or at least you could try.

But that still makes the original commenters argument moot:

> Enterprise deals between these user generated content platforms and LLM platforms may well involve many billions of API requests, and the pricing is likely an order of magnitude less expensive per call due to the volume. The result is a cost-per-call that is cost-prohibitive at smaller scales, [...]

That speculation is not how things have been or were.

replies(1): >>fluidc+FY1

>>58x14+(OP)
ah that explains why Twitter led the pack making their APIs insanely expensive. there is value in the data and the LLM companies will be willing to fork it. a whole new business model and monetization of mass data not predicated on ads or user privacy. what could go wrong?

replies(1): >>pas+Qx2

>>eru+8M1
I think most people who wanted large datasets got their data via pushshift. Pushshift was basically a guy who started out doing small things got so frustrated with the API that he eventually grew to maintaining large mirrors of Reddit content on Google cloud that people could access and query. I don't know why anyone doing research would have used reddit's API instead of using pushshift.

Pushshift has been shutdown by reddit earlier this year, so probably they are getting hammered by LLM folks trying to get the data now since they killed pushshift without understanding how it fit into the universe.

Reddit is completely stupid if they think people are going to pay for "enterprise API" access... pushshift existed because the API was trash and the only real option is to dump the entire dataset into something usable. The reason reddit's data was used so much is because there was an SQL API via pushshift and you could also download archives of the entire dataset at one go.

replies(1): >>doglea+Nm2

>>SllX+KU
Would be cool if they introduce some reasonably priced access for mass scrapers. Should make some nice income in addition to donations, and a valuable service to community.

>>fluidc+FY1
> Pushshift has been shutdown by reddit earlier this year

Oh is this why all the comment undelete websites broke?

replies(1): >>redeye+1K2

>>hacker+0W1
That's unlikely. Elon jacked up the price because he wanted Twitter to make revenue, so get people to use their app and site, and basically to just get rid of anything he doesn't control.

>>eru+gf1
Eh can still just automate creating a bunch of accounts and do it manually. Use one of the many captcha completion services where you pay for people to complete captchas for you. ML models can already pretty much do them anyway.

Then rotate between accounts and put a random time between requests. Restrict certain accounts to browse within certain hours/timezones. Load pages as usual and just scrape data from the page rather than via api.

However, I believe in a company's right to charge whatever they want for their services. But I also believe in the right for people to choose not to use that service and for freer alternatives to spring up.

Just like Tumblr, Reddit seem intent on killing themselves, although these days I'm not so sure. When Elon took over Twitter everyone was saying that all the users would leave and it would die. This is not the case, human nature means that people seek familiarity and will cling on, hmm.

replies(1): >>eru+Etd

>>doglea+Nm2
Yep this is exactly why

>>dagesh+dA1
>The web is in the process of rapidly filling up with AI regurgitated garbage

This is already the case. See the oceans of crap SEO optimized "food recipe sites". It's unbearable.

So sad that, BBC back in 199ps and 2000s, there were so many random sites to visit with interesting things. Search engines were of actual use.

Now, it's basically facebook, reddit, pinterest, instagram, stackoverflow , and a couple of counted others, depending on what you like. And EVERYTHING is monetized.

The WWW of today is terrible.

Now

>>58x14+(OP)
I don't think the LLM boom caused Twitter's first API lockdown in 2012, nor do I really think it's anything to do with the more recent final nail which seems much more in tune with Elon's twitter trying to increase ARPU/engagement while also dealing with a 90% reduction in headcount.

>>eru+gf1
You’re right, but I think it’s also pretty clear that

A) there is demand for functionality that depends on semi-real-time data, e.g. a prompt like “explain {recent_trending_topic} to me and describe its evolution” where the return could be useful in various contexts;

B) the degradation of search experience and the explosion of chat interfaces seem to indicate “the future of search is chat” and the number of Google searches prefixed or suffixed with “Reddit” make it obvious that LLM-powered chat models with search functionality will want to query Reddit extensively, and in the example prompt above, the tree of queries generated to fulfill a single prompt could be sizeable;

C) improvements to fine-tuning pipelines make it more and more feasible to use real-time data in the context of LLMs, such as a “trending summary” function that could cache many potentially related queries from Reddit, Twitter, etc and use them to fine-tune a model which would serve a response to my example prompt

>>fennec+gI2
> Just like Tumblr, Reddit seem intent on killing themselves, although these days I'm not so sure. When Elon took over Twitter everyone was saying that all the users would leave and it would die. This is not the case, human nature means that people seek familiarity and will cling on, hmm.

Wouldn't network effects be the obvious null hypothesis, before we start speculating about human nature?