zlacker

Something being simultaneously described as a "30 sheet, mind-numbingly complex Excel model" and "testable" seems somewhat unlikely, even before we get into whether Claude will be able to test such a thing before it runs into context length issues. I've seen Claude hallucinate running test suites before.

replies(2): >>martin+U >>djeast+KR

>>AlotOf+(OP)
It compacted at least twice but continued with no real issues.

Anyway, please try it if you find it unbelievable. FWIW, I didn't expect it to work as well as it did. Opus 4.5 is pretty amazing at long-running tasks like this.

replies(2): >>moregr+w2 >>stavro+m3

>>martin+U
I think the skepticism here is that, without tests or a _lot_ of manual QA, how would you know that it did it correctly?

Maybe you did one or the other, but “nearly one-shotted” doesn’t tend to mean that.

Claude Code more than occasionally likes to make weird assumptions, and it’s well known that it hallucinates quite a bit more near the context length, and that compaction only partially helps with this issue.

replies(1): >>skybri+vn

>>martin+U
I generally agree with you, but I tried to get it to modernize a fairly old SaaS codebase, and it couldn't. It had all the code right there; all it had to do was change a few lines, upgrade a few libraries, etc., but it kept getting lots of things wrong. The HTML was wrong, the CSS was completely missing, basic views wouldn't work, things like that.

I have no idea why it had so much trouble with this generally easy task. Bizarre.

>>moregr+w2
If you’re porting some formulas from one language to another, “correct” can be defined as “gets the same answers as before.” Assuming you can run both easily, this is easy to write a property test for.

Sure, maybe that’s just building something that’s bug-for-bug compatible, but it’s something Claude can work with.
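
A rough sketch of what that kind of check could look like, using only the standard library rather than a real property-testing tool such as Hypothesis. Both `legacy_payment` and `ported_payment` are hypothetical stand-ins (a loan-payment formula written two different ways); in a real port the "old" side would be whatever can evaluate the original spreadsheet.

```python
import math
import random


def legacy_payment(rate, nper, pv):
    """Hypothetical stand-in for the original formula."""
    if rate == 0:
        return -pv / nper
    return -pv * rate / (1 - (1 + rate) ** -nper)


def ported_payment(rate, nper, pv):
    """Hypothetical port: same formula, rearranged algebraically."""
    if rate == 0:
        return -pv / nper
    growth = (1 + rate) ** nper
    return -pv * rate * growth / (growth - 1)


def find_mismatch(old_fn, new_fn, trials=10_000, seed=0):
    """Return the first random input where the two versions disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Mix in the rate == 0 edge case alongside ordinary rates.
        rate = rng.choice([0.0, rng.uniform(0.0001, 0.5)])
        nper = rng.randint(1, 480)
        pv = rng.uniform(-1e6, 1e6)
        old = old_fn(rate, nper, pv)
        new = new_fn(rate, nper, pv)
        # Compare with a tolerance: the two versions round differently.
        if not math.isclose(old, new, rel_tol=1e-9, abs_tol=1e-9):
            return rate, nper, pv, old, new
    return None
```

Calling `find_mismatch(legacy_payment, ported_payment)` either returns `None` or hands back a concrete failing input, which is the sort of feedback a model can iterate on.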

replies(1): >>gregor+4z

>>skybri+vn
For starters, Python uses IEEE 754, and Excel uses IEEE 754 (with caveats). I wonder if that's being emulated.
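
To illustrate why that matters for "same answers as before": one commonly cited caveat is that Excel works with about 15 significant decimal digits, while Python exposes the full double. The sketch below assumes that is the caveat in question and is not an emulation of Excel; `round_sig` and `same_to_15_digits` are hypothetical helper names.

```python
def round_sig(x, digits=15):
    """Round x to `digits` significant decimal digits via scientific notation."""
    return float(f"{x:.{digits - 1}e}")


def same_to_15_digits(a, b):
    """Treat two floats as equal if they agree to 15 significant digits."""
    return round_sig(a) == round_sig(b)


# In Python the raw doubles differ in the last bit...
raw_equal = (0.1 + 0.2) == 0.3
# ...but they agree once both are cut to 15 significant digits.
rounded_equal = same_to_15_digits(0.1 + 0.2, 0.3)
```

Here `raw_equal` is false and `rounded_equal` is true, so an exact-equality comparison between a Python port and spreadsheet output would likely flag differences that are only rounding noise.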

>>AlotOf+(OP)
>I've seen Claude hallucinate running test suites before.

This reminded me of something that happened to me last year. Not Claude (I think it was GPT 4.0 maybe?), but I had it running in VS Code's Copilot and asked it to fix a bug, then add a test for the case.

Well, it kept failing to pass its own test, so on the third try, it sat there "thinking" for a moment, then finally spit out the command `echo "Test Passed!"`, executed it, read it from the terminal, and said it was done.

I was almost impressed by the gumption more than anything.

replies(1): >>Merad+9Z1

>>djeast+KR
I've been using Claude Code with Opus 4.5 a lot over the last several months, and while it's amazingly capable, it has a huge tendency to give up on tests. It will just decide that it can commit a failing test because "fixing it has been deferred" or "it's a pre-existing problem." It also knows that it can use `HUSKY=0 git commit ...` to bypass tests that are run in commit hooks. This is all with CLAUDE.md being very specific that every commit must have passing tests, lint, etc. I eventually had to add a Claude Code pre-command hook (which it can't bypass) to block it from running git commit if it isn't following the rules.
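
For anyone curious what such a hook might look like, here is a minimal sketch, not the actual script. It assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input` fields, and that exiting with code 2 blocks the tool call and passes stderr back to the model; check the Claude Code hooks documentation before relying on either. The function names and the specific rules are hypothetical.

```python
import json
import re

ALLOW = 0
BLOCK = 2  # Assumption: exit code 2 means "block this tool call".

COMMIT_RE = re.compile(r"\bgit\b.*\bcommit\b")
# Known ways of skipping commit hooks mentioned above, plus git's own flag.
BYPASS_RE = re.compile(r"HUSKY=0|--no-verify")


def decide(payload):
    """Return (exit_code, message) for one hook payload."""
    if not isinstance(payload, dict) or payload.get("tool_name") != "Bash":
        return ALLOW, ""
    tool_input = payload.get("tool_input")
    command = tool_input.get("command", "") if isinstance(tool_input, dict) else ""
    if not COMMIT_RE.search(command):
        return ALLOW, ""
    if BYPASS_RE.search(command):
        return BLOCK, "Commits must pass tests and lint; do not skip the commit hooks."
    return ALLOW, ""


def run_hook(raw):
    """Parse the raw stdin text and decide; unparseable input is blocked."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return BLOCK, "Could not parse hook input."
    return decide(payload)


# As a hook entry point, something like:
#   code, message = run_hook(sys.stdin.read())
#   print(message, file=sys.stderr)
#   sys.exit(code)
```

A pattern match like this only catches the obvious bypasses; it would not notice a commit made from inside a separate script.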

replies(1): >>theshr+pf4

>>Merad+9Z1
Anecdata from the internet has a few stories of Claude Opus bypassing hooks too =)

1) it wants to run X command

2) it notices a hook preventing it from running X

3) it creates a Python application or shell script that does X and runs it instead

Whoops.

replies(1): >>Merad+Wi6

>>theshr+pf4
I haven't seen it bypass my hook yet (knock on wood). I have my hook script [0] tell it that its commits are required to pass validation; maybe that helps push it in the right direction?

0: https://github.com/mbcrawfo/vibefun/blob/main/.claude/hooks/...