I see this as a work in progress. I am almost certain the humans in the loop on these PRs are well aware of what's going on and have their expectations in check, and this isn't just "business as usual" like any other PR or work assignment.
This is a test. You can't improve a system without testing it under real-world conditions.
How do we know they're not tweaking the Copilot system prompts and settings behind the scenes while they're doing this work?
Can no one see the possibility that what is happening in those PRs is exactly what all the people involved expected to happen, and that they're just going through the process of seeing what happens when you try to refine and coach the system to either success or failure?
When we adopted AI coding assist tools internally over a year ago we did almost exactly this (not directly in GitHub though).
We asked a bunch of senior engineers to see how far they could get by coaching the AI to write code rather than writing it themselves. We wanted to calibrate our expectations and better understand the limits, strengths, and weaknesses of the new tools we were adopting.
In most of those early cases we ended up with worse code than if it had been written by humans, but we learned a ton. We can also clearly see how much better things have gotten over time, since we have that benchmark to look back on.
It's going to look stupid... until the point it doesn't. And my money's on, "This will eventually be a solved problem."
Good decision making would weigh the odds of 1 vs 8 vs 16 years. This isn’t good decision making.
Why is doing a public test of an emerging technology not good decision making?
> Good decision making would weigh the odds of 1 vs 8 vs 16 years.
What makes you think this isn't being done?
I'm not so sure they'll get there. If the solved problem is defined as sub-standard but low-cost, then I wouldn't bet against that. A solution better than that, though, I don't think I'd put my money on.
>> This is a test. You can't improve a system without testing it on real world conditions.
Software developers know to fix build problems before asking for a review. The AIs are submitting PRs in bad faith because they don't know any better. Compilers and other build tools produce errors when they fail, and the AI is ignoring this first line of feedback.
It is not a maintainer's job to review code for syntax errors, use of APIs that don't actually exist, or other silly mistakes. That's the compiler's job, and it does it well. The AI needs to take that feedback and fix the issues before escalating to humans.
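A rough sketch of the gate that seems to be missing, assuming a stand-in `attempt_fix` step for whatever model loop is driving the PR; everything here is hypothetical, not how Copilot actually works:

```python
import subprocess

MAX_ATTEMPTS = 5  # hypothetical retry budget before giving up


def run_build() -> tuple[bool, str]:
    """Run the project's build and return (passed, combined output)."""
    result = subprocess.run(
        ["make", "build"],  # stand-in for the real build command
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def attempt_fix(build_output: str) -> None:
    """Placeholder: feed the compiler errors back to the model for another pass."""
    ...


def ready_for_human_review() -> bool:
    """Only escalate to a human once the build stops failing."""
    for _ in range(MAX_ATTEMPTS):
        passed, output = run_build()
        if passed:
            return True  # first line of feedback satisfied; now ask a human
        attempt_fix(output)
    return False  # don't open the PR at all
```

The point is just that asking a human happens only after the cheap, automatic feedback has been exhausted.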
AI can remain stupid longer than you can remain solvent.
So the typical expectations or norms of how code reviews and PRs work between humans don't really apply here.
That's my guess at least. I have no more insider information than you.
I have met people who believe that automobile engineering peaked in the 1960's, and they will argue that until you are blue in the face.
Otherwise it would check that the tests are passing.
EVERY single prompt should have the opportunity to get copied off into a permanent log, triggered by the end user: log all input and all output, then have the human write a summary of what he wanted to happen but didn't, what he thinks might have gone wrong, and what he thinks should have happened (domain-specific experts giving feedback about how things are fucking up). And even then it's only useful with long-term tracking, like whether someone actually made a training change to fix this exact failure scenario.
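As a sketch of what I mean, one record in that kind of log might look like this (the field names and the JSON-lines storage are just assumptions for illustration, not any existing tool):

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PromptFeedbackRecord:
    # Raw exchange, captured verbatim
    prompt: str
    model_output: str
    # Human annotations from the domain expert who triggered the capture
    intended_outcome: str  # what they wanted to happen but didn't
    suspected_cause: str   # what they think might have gone wrong
    expected_output: str   # what they think should have happened
    # Long-term tracking: filled in later if/when a fix actually ships
    resolved_by: str | None = None  # e.g. a training-change or release identifier


def append_record(record: PromptFeedbackRecord, path: str = "feedback.jsonl") -> None:
    """Append one record to a permanent JSON-lines log."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

The `resolved_by` field is the long-term-tracking piece: without it, the log never closes the loop on whether a failure actually got fixed.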
None of that exists. So just like "full self driving" was a pie-in-the-sky bullshit dream that proved machine learning has an 80/20, never-gonna-fully-work problem, it's the same thing here.
My variation was:
"Leadership can stay irrational longer than you can stay employed"
What if the goalpost is shifted backwards, to the 90% mark (instead of demanding that AI get to 100%)?
* Big corps could redefine "good enough" as "what the SotA AI can do" and call it good.
* They could then lay off even more employees, since the AI would be, by definition, Good Enough.
(This isn't too far-fetched, IMO, seeing how there are already calls for copyright violation to be classified as legal-when-we-do-it)
Unfortunately, just about every thread on this genre is like that now.