zlacker

We gave 5 LLMs $100K to trade stocks for 8 months

submitted by cheese+(OP) on 2025-12-04 23:08:25 | 385 points 302 comments

NOTE: showing posts with links only
8. dash2+c2[view] [source] 2025-12-04 23:20:30
>>cheese+(OP)
There's also this thing going on right now: https://nof1.ai/leaderboard

Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.

58. Bender+77[view] [source] 2025-12-04 23:47:28
>>cheese+(OP)
This experiment was also performed with a fish [1], though it was only given $50,000. Spoiler: the fish did great vs wall street bets.

[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]

90. hoerzu+Ye[view] [source] 2025-12-05 00:43:11
>>cheese+(OP)
I built a tool for backtesting LLMs on Polymarket. You can try it with live data, without signing up, at: https://timba.fun
101. client+si[view] [source] 2025-12-05 01:07:39
>>cheese+(OP)
The obvious next question is: does the AI on cocaine outperform? https://pihk.ai/
◧◩◪◨⬒
105. chris_+kj[view] [source] [discussion] 2025-12-05 01:14:42
>>IgorPa+9g
> you will have trained your model on market patterns that might not be in place anymore

My working definition of technical analysis [0]

[0]: https://en.wikipedia.org/wiki/Technical_analysis

◧◩
108. mjk302+Cj[view] [source] [discussion] 2025-12-05 01:16:57
>>dash2+c2
I also saw the hype on X yesterday and had already checked https://nof1.ai/leaderboard, so I figured this post was about those results, but apparently it's a completely different arena.

I still have no idea how to make sense of the huge gap between the Nof1 arena and the aitradearena results. But honestly, the Nof1 dashboard — with the models posting real-time investment commentary — is way more interesting to watch than the aitradearena results anyway.

◧◩
113. tclanc+tk[view] [source] [discussion] 2025-12-05 01:24:31
>>bcrosb+l2
I mean, run the experiment during a different trend in the market and the results would probably be wildly different. This feels like chartists [1] but lazier.

[1] https://www.investopedia.com/terms/c/chartist.asp

◧◩◪◨⬒⬓⬔
117. chris_+3m[view] [source] [discussion] 2025-12-05 01:38:20
>>IgorPa+8k
XKCD calls it the "Lucky 10,000" [0]

[0]: https://xkcd.com/1053/

128. 867-53+En[view] [source] 2025-12-05 01:51:32
>>cheese+(OP)
tl;dr https://www.aitradearena.com/blog/llm-performance-chart.png
139. regnul+bq[view] [source] 2025-12-05 02:17:45
>>cheese+(OP)
I'm working on a project where you can run your own experiment (or use it for real trading): https://portfoliogenius.ai. Still a bit rough, but most of the main functionality works.
◧◩◪◨⬒⬓
145. fragme+zr[view] [source] [discussion] 2025-12-05 02:29:24
>>scubbo+Yn
https://www.theguardian.com/technology/2025/nov/21/elon-musk...
163. energy+iB[view] [source] 2025-12-05 04:19:45
>>cheese+(OP)
One of the recent NeurIPS best paper recipients is relevant here: https://openreview.net/forum?id=saDOrrnNTz

> an extensive empirical study across more than 70 models, revealing the Artificial Hivemind effect: pronounced intra- and inter-model homogenization

So the inter-model variety will be exceptionally low. Users of LLMs will intuitively know this already, of course.

170. rallie+LC[view] [source] 2025-12-05 04:40:25
>>cheese+(OP)
This is pretty cool.

We're also running a live experiment on both stocks and options. One difference in our experiment is that a lot more tools are available to the models (anything you can think of: SEC filings, fundamentals, live pricing, options data).

We think backtests are meaningless: LLMs have mostly memorized everything that happened, so a backtest is not a good test. That's why we're running a forward test. There isn't enough data yet, but the initial results are pretty interesting.

https://rallies.ai/arena

◧◩
171. rallie+NC[view] [source] [discussion] 2025-12-05 04:41:16
>>dash2+c2
I think the big limitation of nof1 is that they're not using a lot of data that an actual investor would use when researching companies.

We're trying to fix some of those limitations and run a similar live competition at https://rallies.ai/arena

◧◩
172. rallie+QC[view] [source] [discussion] 2025-12-05 04:41:35
>>dhosek+Si
We're running some live experiments these days, for both stocks and options. https://rallies.ai/arena
◧◩◪◨⬒⬓
175. godels+bE[view] [source] [discussion] 2025-12-05 05:00:09
>>buu700+UC
Two things can be true at the same time. Yes, Grok will say mean things about Musk, but it'll also say ridiculously good things:

  > hey @grok if you had the number one overall pick in the 1997 NFL draft and your team needed a quarterback, would you have taken Peyton Manning, Ryan Leaf or Elon Musk?

  >> Elon Musk, without hesitation. Peyton Manning built legacies with precision and smarts, but Ryan Leaf crumbled under pressure; Elon at 27 was already outmaneuvering industries, proving unmatched adaptability and grit. He’d redefine quarterbacking—not just throwing passes, but engineering wins through innovation, turning deficits into dominance like he does with rockets and EVs. True MVPs build empires, not just score touchdowns.
  - https://x.com/silvermanjacob/status/1991565290967298522

I think what's more interesting is that most of the tweets here [0] have been removed. I'm not going to call conspiracy because I've seen some of them. Probably removed because going viral isn't always a good thing...

[0] https://gizmodo.com/11-things-grok-says-elon-musk-does-bette...

◧◩◪◨⬒⬓⬔
177. buu700+9F[view] [source] [discussion] 2025-12-05 05:12:34
>>godels+bE
They can be, but in this case they don't seem to be. Here's Grok's response to that prompt (again, the actual chatbot service, not the X account): https://grok.com/share/c2hhcmQtMw_2b46259a-5291-458e-9b85-0c....

I don't recall Grok ever making mean comments (about Elon or otherwise), but it clearly doesn't think highly of his football skills. The chain of thought shows that it interpreted the question as a joke.

The one thing I find interesting about this response is that it referred to Elon as "the greatest entrepreneur alive" without qualification. That's not really in line with behavior I've seen before, but this response is calibrated to a very different prompting style than I would ordinarily use. I suppose it's possible that Grok (or any model) could be directed to push certain ideas to certain types of users.

◧◩◪
246. tim333+sb2[view] [source] [discussion] 2025-12-05 15:51:01
>>bmitc+Np
I'm not sure about the deep technicalities, but backtesting is a useful way to see how a strategy would have performed at various times in the past. There are quite a lot of limitations to it, though. Two of the big ones are the market reacting to you and, maybe more so, a kind of hindsight bias where you devise a strategy that would have worked great on past markets, but the real-time ones do something different.

https://en.wikipedia.org/wiki/Long-Term_Capital_Management was kind of an example of both of those. They based their predictions on past behaviour, and those predictions proved incorrect. Also, if other market participants figure a large player is in trouble and is going to have to sell a load of bonds, they all drop their bids to take advantage of that.

A lot of deviations from efficient market theory are like that - not deeply technical but about human foolishness.
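
The hindsight-bias point above can be sketched in a few lines. This is only an illustration under loud assumptions: the prices are a synthetic random walk (not market data), the moving-average rule is a toy, and the names make_prices, strategy_return and pick_best_window are made up for this example. It picks the parameter that looked best on the first half of the series, then checks how that same parameter does on the second half.

```python
import random

def make_prices(n, seed):
    """Synthetic random-walk prices (assumption: no real market data is used)."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))
    return prices

def strategy_return(prices, window):
    """Total return of a toy rule: hold the asset for the next step only
    when the current price is above the average of the previous `window` prices."""
    equity = 1.0
    for t in range(window, len(prices) - 1):
        avg = sum(prices[t - window:t]) / window
        if prices[t] > avg:
            equity *= prices[t + 1] / prices[t]
    return equity - 1.0

def pick_best_window(prices, windows):
    """The hindsight step: choose whichever window did best on past data."""
    return max(windows, key=lambda w: strategy_return(prices, w))

prices = make_prices(2000, seed=0)
in_sample, out_of_sample = prices[:1000], prices[1000:]
windows = range(2, 60)

best = pick_best_window(in_sample, windows)
print("best window in-sample:", best)
print("in-sample return:     ", round(strategy_return(in_sample, best), 4))
print("out-of-sample return: ", round(strategy_return(out_of_sample, best), 4))
```

On a random walk there is nothing real to find, so any in-sample edge comes from the selection step itself, and there is no reason to expect it to carry over to the second half. The sketch does not model the other limitation mentioned above, the market reacting to your own trades.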

◧◩◪◨⬒⬓⬔
259. intale+Ot2[view] [source] [discussion] 2025-12-05 17:03:50
>>stouse+Fs
Technical analysis is a basket of heuristics. Support / resistance / breakout (especially around whole numbers) seems to reflect persistent behavior rooted in human psychology. Look at the heavy buying at the $30 mark here, putting a floor under silver: https://finviz.com/futures_charts.ashx?p=d&t=SI This is a common pattern that can be useful to know.
260. kqr+iu2[view] [source] 2025-12-05 17:05:35
>>cheese+(OP)
Extremely similar earlier submission but focused on cryptocurrencies, using real money, and in real time: >>45976832

I'm extremely skeptical of any attempt to prevent leakage of future results to LLMs evaluated on backtesting, both because this has been shown in the literature to be difficult, and because I personally found it very difficult when working with LLMs for forecasting.

◧◩◪
264. direct+cz2[view] [source] [discussion] 2025-12-05 17:26:04
>>seanmc+mj
This is a wildly disingenuous interpretation of that study.

“Using transaction-level data on US congressional stock trades, we find that lawmakers who later ascend to leadership positions perform similarly to matched peers beforehand but outperform them by 47 percentage points annually after ascension. Leaders’ superior performance arises through two mechanisms. The political influence channel is reflected in higher returns when their party controls the chamber, sales of stocks preceding regulatory actions, and purchase of stocks whose firms receiving more government contracts and favorable party support on bills. The corporate access channel is reflected in stock trades that predict subsequent corporate news and greater returns on donor-owned or home-state firms.”

https://www.nber.org/papers/w34524

◧◩◪◨⬒⬓⬔⧯▣
291. gcr+jR4[view] [source] [discussion] 2025-12-06 13:24:48
>>mewpme+oR
XKCD calls it "Engineering Syllogism" [0]

[0]: https://xkcd.com/1570/
