zlacker

From this commit: https://github.com/cloudflare/workers-oauth-provider/commit/...

===

"Fix Claude's bug manually. Claude had a bug in the previous commit. I prompted it multiple times to fix the bug but it kept doing the wrong thing.

So this change is manually written by a human.

I also extended the README to discuss the OAuth 2.1 spec problem."

===

This is super relatable to my experience trying to use these AI tools. They can get halfway there and then struggle immensely.

replies(6): >>nisega+K >>diggan+G1 >>myster+u2 >>krooj+O9 >>nicce+Nb >>arendt+HF

>>infini+(OP)
Same. But I personally find it a lot easier to do those bits at the end than to begin from a blank file/function, so it's a good match for me.

replies(1): >>SkyPun+b6

>>infini+(OP)
> They can get halfway there and then struggle immensely.

Restart the conversation from scratch. As soon as you get something incorrect, begin from the beginning.

It seems to me like any mistake in a messages chain/conversation instantly poisons the output afterwards, even if you try to "correct" it.

So if something was wrong at one point, you need to go back to the initial message, and adjust it to clarify the prompt enough so it doesn't make that same mistake again, and regenerate the conversation from there on.

replies(4): >>eikenb+p8 >>dingnu+lk >>int_19+ep >>viktor+EN

>>infini+(OP)
This to me is why I think these tools don't have actual understanding, and are instead producing emergent output from pooling an incomprehensibly large set of pattern-recognized data.

replies(1): >>diggan+33

>>myster+u2
> these tools don't have actual understanding, and are instead producing emergent output from pooling an incomprehensibly large set of pattern-recognized data

I mean, bypassing the fact that "actual understanding" doesn't have any consensus about what it is, does it matter if it's "actual understanding" or "kind of understanding", or even "barely understanding", as long as it produces the results you expect?

replies(2): >>scepti+m4 >>myster+YC

>>diggan+33
> as long as it produces the results you expect?

But it's more the case of "until it doesn't produce the results you expect" and then what do you do?

replies(3): >>diggan+j5 >>seunos+Jw >>JW_000+Q92

>>scepti+m4
> "until it doesn't produce the results you expect" and then what do you do?

I'm not sure I understand what you mean. You're asking it to do something, and it doesn't do that?

replies(1): >>dingnu+Ek

>>nisega+K
Same here. Sometimes you just need time to stew in the problem/solution space.

LLMs let me be ultraproductive upfront then come in at the end to clean up when I have a full understanding.

>>diggan+G1
I thought Claude still has a problem generating the same output for the same input? That you can't just rewind and rerun and get to the same point again.

replies(2): >>diggan+ea >>throwa+rs

>>infini+(OP)
The comment in lines 163 - 172 make some claims that are outright false and/or highly A/S dependent, to the point where I question the validity of this post entirely. While it's possible that an A/S can be pseudo-generated based on lots of training data, each implementation makes very specific design choices: i.e.: Auth0's A/S allows for a notion of "leeway" within the scope of refresh token grant flows to account for network conditions, but other A/S implementations may be far more strict in this regard.

My point being: assuming you have RFCs (which leave A LOT to the imagination) and some OSS implementations to train on, each implementation usually has too many highly specific choices made to safely assume an LLM would be able to cobble something together without an amount of oversight effort approaching simply writing the damned thing yourself.

>>eikenb+p8
> I thought Claude still has a problem generating the same output for the same input?

I haven't used Anthropic's models/software in a long time (months, basically forever in AI ecosystem), so don't know exactly how it works now.

But last time I used Claude, you could edit the first message, and then re-generate the assistants next message based on your edit. Most of the LLM interfaces has one or another way of doing this, I can't imagine they got rid of that feature.

What I'm suggesting isn't to use the exact same input (the first message), but rather change it so you remove the chances of something incorrect happening later after that.

>>infini+(OP)
I am waiting for studies whether we have just an illusion of production or these actually save man hours in the long term in creation of production-level systems.

>>diggan+G1
Can you imagine if Excel worked like this? the formula put out the wrong result, so try again! It's like that scene from The Office where Michael has an accountant "run it again." It's farcical. They have created computers that are bad at math and I will never forgive them.

Also, each try costs money! You're pulling the lever on a god damned slot machine!

I will TRY AGAIN with the same prompt when I start getting a refund for my wasted money and time when the model outputs bullshit, otherwise this is all confirmation and sunk cost bias talking, I'm sure if it.

replies(1): >>diggan+Ko

>>diggan+j5
if you give an LLM a spec with a new language and no examples, it can't write the new language.

until someone does that, I think we've demonstrated that they do not have understanding or abstract thought. they NEED examples in a way humans do not.

replies(1): >>Powder+Gl

>>dingnu+Ek
https://openreview.net/pdf?id=GTHD2UnDIb

replies(1): >>myster+CD

>>dingnu+lk
> Can you imagine if Excel worked like this?

I mean, why would I imagine that? Who would want that? It's like the argument against legal marijuana, and someone replies "But would you like your pilot to be high when flying?!". Right tool for the right job, clearly when you want 100% certainty then LLMs aren't the tool for that. Just because they're useful for some things don't mean we have to replace everything with them.

> Also, each try costs money!

I guess you're using some paid API? Try a different way then. I mostly use the web UI from OpenAI, or Codex lately, or ran locally with my own agent using local weights, neither is "each try costs money" more than writing data to my SSD is costing me money.

It's not a holy grail some people paint it, and not sure we're across the "productivity threshold" (>>44160664 ) yet, but it's worth trying it out probably before jumping to conclusions. But no one is forcing you either, YMMV and all that.

>>diggan+G1
Chatbot UIs really need better support for conversation branching all around. It's very handy to be able to just right-click on any random message in the conversation in LM Studio and say, "branch from here".

replies(3): >>diggan+Np >>carlos+Vt >>impure+Hx2

>>int_19+ep
Maybe it's contrarian, maybe it's not, but I don't think Chat UIs are well suited for software engineering/programming at all, we need something completely different. Being able to branch conversations and such would be useful, but probably not for the way I do software. Besides, I'm rarely beyond 3 messages (1 system, 1 user, 1 assistant) in any usage of the chat UIs. Maybe it's more useful to people with different workflows.

replies(1): >>int_19+Q23

>>eikenb+p8
> can't just rewind and rerun and get to the same point again

Why would you want to? The whole point of a retry is that your previous conversation attempt went poorly.

replies(1): >>eikenb+HT

>>int_19+ep
AI Studio has this, I usually ask it to plan and I do some rounds of refining until the plan covers all my requirements, then I branch this conversation, a branch for each feature, none of the branches get polluted this way.

>>scepti+m4
Then you teach it. Even humans don't always produce the results we expect.

replies(1): >>scepti+i85

>>diggan+33
No, I was not making a critique on its effectiveness at generating usable results. I was responding to what I've seen in several other articles here arguing towards anthropomorphism.

>>Powder+Gl
Interesting paper, thanks for sharing. I assume the effectiveness depends greatly on the syntax of the language to be learned (c-like, etc).

>>infini+(OP)
One way to mitigate the issue is to use tests or specifications and let the AI find a solution to the spec.

A few months ago, solving such a spec riddle could take a while, and most of the time, the solutions that were produced by long run times were worse than the quick solutions. However, recently the models have become significantly better at solving such riddles, making it fun (depending on how well your use case can be put into specs).

In my experience, sonnet 3.7 represented a significant step forward compared to sonnet 3.5 in this discipline, and Gemini 2.5 Pro was even more impressive. Sonnet 4 makes even fewer mistakes, but it is still necessary to guide the AI through sound software engineering practices (obtaining requirements, discovering technical solutions, designing architecture, writing user stories and specifications, and writing code) to achieve good results.

Edit: And there is another trick: Provide good examples to the AI. Recently, I wanted to create an app with the OpenAI Realtime API and at first it failed miserably, but then I added the most important two pages of the documentation and one of the demo projects into my workspace and just like that it worked (even though für my use-case the API calls had to be use quite differently).

replies(1): >>fxnn+oT

>>diggan+G1
It can be done, but for my environment the sum of all prompts that I end up typing to get the right result ends up being longer than the actual code.

So now I'm using LLMs as crapshoot machines for generating ideas which I then implement manually

>>arendt+HF
That's one thing where I love Golang. I just tell Aider to `/run go doc github.com/some/package`, and it includes the full signatures in the chat history.

It's true: often enough AI struggles to use libraries, and doesn't remember the usage correctly. Simply adding the go doc fixed that often.

>>throwa+rs
Good engineering? You want automated steps to be repeatable so you know your tweak to the previous conversation have the effect you desire. Though using an AI for coding is probably closer in spirit the the art of writing code than the engineering of writing code and art is pretty much unrepeatable by definition.

replies(1): >>throwa+741

>>eikenb+HT
Fair enough. Use the respective API or Google Gemini which will let you set temperature to zero resulting in deterministic output barring FP errors accumulating when paired with non-standard GPU/TPU configurations. Likely not to differ by much in the vast majority of cases though.

>>scepti+m4
Then you do that part yourself. You let AI automate the 20/50/80% (*) of work it can, and you now only need to do the remainder manually.

(*) which one of these it is depends on your case. If you're writing a run-of-the-mill Next.js app, AI will automate 80%; if you're doing something highly specific, it'll be closer to 20%.

>>int_19+ep
Certainly in my version of LM Studio (0.3.15) it has a branch button at the end of every message [0]

[0] https://i.imgur.com/xZ2Fkn7.png

replies(1): >>int_19+u23

>>impure+Hx2
It does indeed. What I'm saying is that, for some mysterious reason, none of the first-party chatbot apps do that - ChatGPT, Claude, Gemini all lack this feature.

>>diggan+Np
I don't see how you'd avoid using chat if you need the bot to work on some bug end-to-end. I usually have many rounds in a chat session, first asking it to identify the overall approach, reviewing and approving that, then one or more rounds for coding, and several more to request edits as needed.

If you only ever ask it for trivial changes that don't require past context to make sense, then chat is indeed overkill. But we already have different UX approaches for that - e.g. some IDEs watch for specially formatted comments to trigger code generation, so you literally just type what you want right there in the editor, exactly where you want the code to go.

replies(1): >>diggan+ZA3

>>int_19+Q23
Yeah, I'd agree you want to iterate, but I'm not sure the UX of "Log of messages, where some of yours, some are tool calls, others are the assistant" and the workflow of "Add more messages into the log of messages"/"Change existing messages" is the right broad UX for this type of work.

I'm sorry I can't substantiate it more than that, as my own head is still trying to wrap itself around what I think is needed instead. Still, sounds very "fluffy" even when I read it back myself.

>>seunos+Jw
Have you tried that? It generally doesn't go so well.

In this example there are several commits where you can see they needed to fix the code because they couldn't get (teach) the LLM to generate the required code.

And there's no memory there, you open a new prompt and it's forgotten everything you said previously.