Look at this one:
> Ask Claude to remove the "backup" encryption key. Clearly it is still important to security-review Claude's code!
> prompt: I noticed you are storing a "backup" of the encryption key as `encryptionKeyJwk`. Doesn't this backup defeat the end-to-end encryption, because the key is available in the grant record without needing any token to unwrap it?
I don’t think a non-expert would even know what this means, let alone spot the issue and direct the model to fix it.
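For anyone unsure what the issue means: the intended design is roughly key wrapping, where the grant's data key is only ever stored wrapped by material derived from the token. Here is a minimal sketch using the WebCrypto API available in Workers; it is not the library's actual code, and the SHA-256 derivation and all names are purely illustrative:

```ts
// Minimal sketch (illustrative only -- not the library's actual code) of the
// intended design: the grant's data key is stored only in *wrapped* form, so
// reading the grant record alone, without presenting the token, yields nothing.

async function wrappingKeyFromToken(token: string): Promise<CryptoKey> {
  // Illustrative derivation: SHA-256 of the token used as an AES-KW key.
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(token));
  return crypto.subtle.importKey("raw", digest, "AES-KW", false, ["wrapKey", "unwrapKey"]);
}

async function wrapGrantKey(token: string, grantKey: CryptoKey): Promise<ArrayBuffer> {
  const kek = await wrappingKeyFromToken(token);
  return crypto.subtle.wrapKey("raw", grantKey, kek, "AES-KW");
}

async function unwrapGrantKey(token: string, wrapped: ArrayBuffer): Promise<CryptoKey> {
  const kek = await wrappingKeyFromToken(token);
  return crypto.subtle.unwrapKey("raw", wrapped, kek, "AES-KW", "AES-GCM", false, ["decrypt"]);
}

// The flaw being flagged: also persisting the raw key (e.g. as `encryptionKeyJwk`)
// next to the wrapped copy means anyone who can read the grant record gets the key
// without ever holding a token.
```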
If you look at the README it is completely revealed... so I would argue there is nothing to "reveal" in the first place.
> I started this project on a lark, fully expecting the AI to produce terrible code for me to laugh at. And then, uh... the code actually looked pretty good. Not perfect, but I just told the AI to fix things, and it did. I was shocked.
> To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
It's more like the genius coworker who has an overassertive ego and sometimes shows up drunk, but who, if you know how to work with them and see past their flaws, can be a real asset.
That's the biggest issue I see. In most cases I don't use an LLM because DIYing it takes less time than prompting/waiting/checking every line.
Yes:
> It took me a few days to build the library with AI.
> I estimate it would have taken a few weeks, maybe months to write by hand.
> or just tried to prove a point that if you actually already know all details of impl you can guide llm to do it?
No:
> I was an AI skeptic. I thought LLMs were glorified Markov chain generators that didn't actually understand code and couldn't produce anything novel. I started this project on a lark, fully expecting the AI to produce terrible code for me to laugh at. And then, uh... the code actually looked pretty good. Not perfect, but I just told the AI to fix things, and it did. I was shocked.
— https://github.com/cloudflare/workers-oauth-provider/?tab=re...
Using an LLM to write code you already know how to write is just like using intellisense or any other smart autocomplete, but at a larger scale.
No it doesn't. Typing speed is never the bottleneck for an expert.
As an offline database of Google-tier knowledge, LLMs are useful. Current LLM tech is half-baked, though; we need:
a) Cheap commodity hardware for running your own models locally. (And by "locally" I mean separate dedicated devices, not something that fights over your desktop's or laptop's resources.)
b) Standard bulletproof ways to fine-tune models on your own data. (Inference is mostly there already with things like llama.cpp; fine-tuning isn't.)
An expert reasons, plans ahead, thinks and reasons a little bit more before even thinking about writing code.
If you are measuring productivity by lines of code per hour then you don't understand what being a dev is.
They didn't suggest that at all, they merely suggested that the component of the expert's work that would otherwise be spent typing can be saved, while the rest of their utility comes from intense scrutiny, problem solving, decision making about what to build and why, and everything else that comes from experience and domain understanding.
How could that possibly be true!? It seems like suggesting that being constrained to analog writing utensils wouldn't bottleneck the process of publishing a book or research paper. At the very least, such a statement implies that people with ADHD can't be experts.
In my experience, it takes longer to debug/instruct the LLM than to write it from scratch.
Removing expert humans from the loop is the deeply stupid thing the Tech Elite Who Want To Crush Their Own Workforces / former-NFT fanboys keep pushing. Just letting an LLM generate code for a human to review, then send out for more review, is really pretty boring and already very effective for simple to medium-hard things.
How? The prompts still have to be typed, right? And then the output examined in earnest.
It's a lie. The marketing one, to be specific, which makes it even worse.
I mean, I try to use them for Svelte or Vue and they still recommend React snippets sometimes.
(I'll assume you're not joking, because your post is ridiculous enough to look like sarcasm.)
The answer is because programmers read code 10 times more (and think about code 100 times more) than they write it.
In less than 5 minutes Claude created code that:
- encapsulated the API call
- modeled the API response using TypeScript
- created a re-usable and responsive UI component for the card (including a load state)
- included it in the right part of the page
Even if I typed at 200 wpm I couldn't produce that much code from such a simple prompt. (A rough sketch of that kind of output follows below.)
I also had similar experiences/gains refactoring back-end code.
This being said, there are cases in which writing the code yourself is faster than writing a detailed enough prompt, BUT those cases are becoming the exception with each new LLM iteration. I noticed that after the jump from Claude 3.7 to Claude 4 my prompts can be way less technical.
How many times have you read a story card and by the time you finished reading it you thought "It's an easy task, should take me 1 hour of work to write the code and tests"?
In my experience, in most of those cases the AI can do the same amount of code writing in under 10 minutes, leaving me the other 50 minutes to review the code, make/ask for any necessary adjustments, and move on to another task.
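For concreteness, the kind of output described above (a typed response model plus a small card component with a load state) looks roughly like the sketch below; every name and endpoint here is made up for illustration, not taken from the commenter's actual project:

```tsx
// Hypothetical sketch: typed API response model + reusable card with a load state.
import { useEffect, useState } from "react";

interface WeatherResponse {        // modeled from the (hypothetical) API's JSON
  city: string;
  temperatureC: number;
  summary: string;
}

export function WeatherCard({ city }: { city: string }) {
  const [data, setData] = useState<WeatherResponse | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    fetch(`/api/weather?city=${encodeURIComponent(city)}`)   // encapsulated API call
      .then((res) => (res.ok ? res.json() : Promise.reject(res.statusText)))
      .then((json: WeatherResponse) => setData(json))
      .catch((e) => setError(String(e)));
  }, [city]);

  if (error) return <div className="card card--error">Failed to load: {error}</div>;
  if (!data) return <div className="card card--loading">Loading…</div>;  // load state

  return (
    <div className="card">
      <h3>{data.city}</h3>
      <p>{data.temperatureC}°C, {data.summary}</p>
    </div>
  );
}
```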
So what you're saying makes sense. And I'm definitely on the other side of that fence.
I hate to say, though, but I have reviewed a lot of human code in my time, and I've definitely caught many humans making similar-magnitude mistakes. :/
The main value I've gotten out of AI writing software comes from the two extremes, not from the middle ground you present. Vibe coding can be great and seriously productive, but if I have to check it or manually maintain it in nearly any capacity more complicated than changing one string, productivity plummets. Conversely, delegating highly complex, isolated function writing to an AI can also be super productive, because it can (sometimes) showcase intelligence beyond mine and arrive at solutions that would take me 10x longer; but by definition I am not the right person to check its code output, outside of maybe writing some unit tests for it (a third thing AI tends to be quite good at).
I remember hearing somewhere that humans have a limited capacity in terms of number of decisions made in a day, and it seems to fit here: If I'm writing the code myself, I have to make several decisions on every line of code, and that's mentally tiring, so I tend to stop and procrastinate frequently.
If an LLM is handling a lot of the details, then I'm just making higher-level decisions, allowing me to make more progress.
Of course this is totally speculation and theories like this tend to be wrong, but it is at least consistent with how I feel.
This is less a matter of "mindset" and more a general problem of information.
But you're correct, when you use more exotic and/or quite new libraries, the outputs can be of mixed quality. For my current stack (Typescript, Node, Express, React 19, React Router 7, Drizzle and Tailwind 4) both Grok 3 (the paid one with 100k+ context) and Gemini 2.5 are pretty damn good. But I use them for prototyping, i.e. quickly putting together new stuff, for types, refactorings... I would never trust their output verbatim. (YET.) "Build an app that ..." would be a nightmare, but React-like UI code at sufficiently granular level is pretty much the best case scenario for LLMs as your components should be relatively isolated from the rest of the app and not too big anyways.
"code base must do X with Y conditions"
The reviewer is at no disadvantage, other than the ability to walk the problem without coding.
Oddly enough, security-critical flows are likely to be one of the few exceptions, because, when reviewing code that you didn't write, catching subtle reasoning errors that won't trip any unit tests is extremely difficult.
An expert in OAuth, perhaps. Not your typical expert dev who doesn't specialize in auth but rather in whatever he's using the auth for. Navigating those sorts of standards is extremely time-consuming.
The worst case is an intern or LLM having generated some code where the intent is not obvious and them not being able to explain the intent behind it. "How is that even related to the ticket"-style code.
I am curious, did you find the work of reviewing Claude's output more mentally tiring/draining than writing it yourself? Like some other folks mentioned, I generally find reviewing code more mentally tiring than writing it, but I get a lot of personal satisfaction by mentoring junior developers and collaborating with my (human) colleagues (most of them anyway...) Since I don't get that feeling when reviewing AI code, I find it more draining. I'm curious how you felt reviewing this code.
I thought this experience was so helpful as it gave an objective, evidence-based sample on both the pros and cons of AI-assisted coding, where so many of the loudest voices on this topic are so one-sided ("AI is useless" or "developers will be obsolete in a year"). You say "removing expert humans from the loop is the deeply stupid thing the Tech Elite Who Want To Crush Their Own Workforces / former-NFT fanboys keep pushing", but the fact is many people with the power to push AI onto their workers are going to be more receptive to actual data and evidence than developers just complaining that AI is stupid.
It's like with some auto-driving systems - I say it's like having a slightly inebriated teenager at the wheel. I can't just relax and read a book, because then I'd die. So I have to be more mentally alert than if I were just driving myself, because everything could be going smoothly and relaxed, but at any moment the driving system could decide to drive into a tree.
This was a surprise to me! Until I tried it, I dreaded the idea.
I think it is because of the shorter feedback loop. I look at what the AI writes as it is writing it, and can ask for changes which it applies immediately. Reviewing human code typically has hours or days of round-trip time.
Also with the AI code I can just take over if it's not doing the right thing. Humans don't like it when I start pushing commits directly to their PR.
There's also the fact that the AI I'm prompting is, obviously, working on my priorities, whereas humans are often working on other priorities, but I can't just decline to review someone's code because it's not what I'm personally interested in at that moment.
When things go well, reviewing the AI's work is less draining than writing it myself, because it's basically doing the busy work while I'm still in control of high-level direction and architecture. I like that. But things don't always go well. Sometimes the AI goes in totally the wrong direction, and I have to prompt it too many times to do what I want, in which case it's not saving me time. But again, I can always just cancel the session and start doing it myself... humans don't like it when I tell them to drop a PR and let me do it.
Personally, I don't generally get excited about mentoring and collaborating. I wish I did, and I recognize it's an important part of my job which I have to do either way, but I just don't. I get excited primarily about ideas and architecture and not so much about people.
Probably very language-specific. I use a lot of Ruby; typing things takes no time, it's so terse. Instead I get to spend 95% of my time pondering my problems (or prompting the LLM)...
It lets me still have my own style preferences with the benefit of AI code generation. It bridged the gap I had with code coming from Claude/ChatGPT/etc., where the style preferences were based on the wider internet's standards. This is probably a preference on the level of tabs vs spaces, but ¯\_(ツ)_/¯
While it's often a luxury, I'd much rather work on code I wrote than code somebody else wrote.
How novel is an OAuth provider library for Cloudflare Workers? I wouldn't be surprised if it'd been trained on multiple examples.
Edit: this comment was more a result of me being in a terrible mood than a true claim. Sorry.
My ability to review and understand the intent behind code isn't the primary bottleneck to my efficiently reviewing code when it's requested of me; the primary bottleneck is being notified at the right time that I have a waiting request to review code.
If compilers were never a bottleneck, why would we ever try to make them faster? If build tools were never a bottleneck, why would we ever optimize those? These are all just some of the things that can stand between the identification of a problem and producing a solution for it.
Again, I think your posting of this is probably the best actual, real world evidence that shows both the pros and cons of AI-assisted coding, dispassionately. Awesome work!
The trade-off here would be that you must create the spec file (and customize the template files where needed) that drives the codegen, in exchange for explicit control over deterministic output. So there's more typing but potentially less cognitive overhead than reviewing a bunch of LLM output.
For this use case I find the explicit codegen UX preferable to inspecting what the LLM decided to do with my human-language prompt, if attempting to have the LLM directly code the library/executable source (as opposed to asking it to create the generator, template or API spec).
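As a toy illustration of that trade-off (the spec format and all names here are entirely hypothetical): a hand-written spec plus a small template gives the same output on every run, so review effort goes into the spec and the template rather than into every generated line.

```ts
// Toy spec-driven codegen sketch (hypothetical spec format, not a real tool).
interface EndpointSpec {
  name: string;
  method: "GET" | "POST";
  path: string;
}

// The hand-written spec: the part you type and review.
const spec: EndpointSpec[] = [
  { name: "getUser", method: "GET", path: "/users/:id" },
  { name: "createUser", method: "POST", path: "/users" },
];

// The "template": a pure function from a spec entry to source text.
const render = (e: EndpointSpec): string =>
  `export const ${e.name} = (init?: RequestInit) =>\n` +
  `  fetch("${e.path}", { method: "${e.method}", ...init });\n`;

// Same spec in, same code out -- no prompt, no sampling, nothing new to re-review.
console.log(spec.map(render).join("\n"));
```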
Are you being serious here?
Let's do the math.
1200 lines in an hour would be one line every three seconds, with no breaks.
And your figure of 1200 lines is apparently omitting whitespace and comments. The actual code is 2626 lines. Let's say we ignore blank lines, then it's 2251 lines. So one line per ~1.6 seconds.
The best typists type like 2 words per second, so unless the average line of code has 3 words on it, a human literally couldn't type that fast -- even if they knew exactly what to type.
Of course, people writing code don't just type non-stop. We spend most of our time thinking. Also time testing and debugging. (The test is 2195 lines BTW, not included in above figures.) Literal typing of code is a tiny fraction of a developer's time.
I'd say your estimate is wrong by at least one, but realistically more likely two orders of magnitude.
On my very most productive days of my entire career I've managed to produce ~1000 lines of code. This library is ~5000 lines (including comments, tests, and documentation, which you omitted for some reason). I managed to prompt it out of the AI over the course of about five days. But they were five days when I also had a lot of other things going on -- meetings, chats, code reviews, etc. Not my most productive.
So I estimate it would have taken me 2x-5x longer to write this library by hand.
this is completely expected behavior by them. departments with well-paid experts will be among the first they'll want to cut. in every field. experts cost money.
we're a long, long, long way off from a bot that can go into random houses and fix under-the-sink plumbing, or diagnose and then fix an electrical socket. for those who do most of their work on a computer, however, they're pretty close to a point where they can cut those departments.
in every industry, in every field, those will be the jobs cut first. move fast and break things.
I go line-by-line through the code that I wrote (in my git client) before I stage+commit it.
Afterwards, I make sure the LLM's code passes all the tests before I spend my time reviewing it.
I find this process keeps the iteration count low for the review -> prompt -> review loop.
I personally love writing code with an LLM. I’m a sloppy typist but love programming. I find it’s a great burnout prevention.
For context: Node.js development/React (a very LLM-friendly stack).
Some basic things it fails at:
* Upgrading the React code-base from Material-UI V4 → V5
* Implementing a simple header navigation dropdown in HTML/CSS that looks decent and is usable (it kept having bugs with hovering, wrong sizes, padding, responsiveness, duplicated code etc.)
* Changing anything. About half of the time, it keeps saying "I made those changes", but no changes were made (it happens with all of them: Windsurf, Copilot, etc.). It did it in no time and with no complaints.
Defeats the purpose in this case though.
Would you have been able to review it for its faults, had you not experienced the pain of committing some of them yourself, or been warned by your peers of the same?
Would you have become the senior expert that you are if that had been your learning process?
This is EXTREMELY false. When you write the code you *remember* it, it's fresh in your head, and you *know* what it is doing and exactly what it's supposed to do. This is why debugging a codebase you didn't write is harder than one you wrote: if a bug happens, you know exactly the spots where it could be happening and can easily go and check them.
You are doing something wrong. I go line-by-line through my code like 7x faster than I would through someone else's code, because I know what I wrote, my own intentions, my flow of coding, and all of those details. I can just look at it en passant, while with AI code I need to carefully review every single detail and the connections between them to approve it.