zlacker

[return to "Memory and new controls for ChatGPT"]
1. anothe+Pf[view] [source] 2024-02-13 19:29:04
>>Josely+(OP)
This is a bit off topic from the actual article, but I see a lot of top-ranking comments complaining that ChatGPT has become lazy at coding. I wanted to make two observations:

1. Yes, GPT-4 Turbo is quantitatively getting lazier at coding. I benchmarked the last 2 updates to GPT-4 Turbo, and it got lazier each time.

2. For coding, asking GPT-4 Turbo to emit code changes as unified diffs causes a 3X reduction in lazy coding (rough sketch of that setup below the links).

Here are some articles that discuss these topics in much more detail.

https://aider.chat/docs/unified-diffs.html

https://aider.chat/docs/benchmarks-0125.html
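
Concretely, here is roughly what "ask for unified diffs" can look like via the API. This is just a sketch, not aider's actual prompt; the prompt wording, model name, and helper function are illustrative assumptions:

    # Sketch only: prompt text and model name are assumptions, not aider's.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM = (
        "You are an expert programmer. Return every code change as a unified "
        "diff (like `diff -U3` output) with correct @@ hunk headers. Never "
        "elide code with comments such as '... rest of function unchanged ...'."
    )

    def request_edit(file_name: str, file_text: str, instruction: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user",
                 "content": f"{instruction}\n\n--- {file_name}\n{file_text}"},
            ],
        )
        # The reply is expected to be a unified diff to apply to file_text.
        return resp.choices[0].message.content

The diff format seems to matter because it nudges the model toward stating exact hunks instead of rewriting (or silently skipping) whole functions.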

2. omalle+6q[view] [source] 2024-02-13 20:21:14
>>anothe+Pf
Can you say in one or two sentences what you mean by “lazy at coding” in this context?
3. anothe+xu[view] [source] 2024-02-13 20:47:50
>>omalle+6q
Short answer: Rather than fully writing code, GPT-4 Turbo often inserts comments like "... finish implementing function here ...". I built a benchmark around refactoring tasks that provokes and quantifies that behavior.

Longer answer:

I found that I could provoke lazy coding by giving GPT-4 Turbo refactoring tasks where I ask it to refactor a large method out of a large class. I analyzed 9 popular open source Python repos, found 89 such methods that were conceptually easy to refactor, and built them into a benchmark [0].

GPT succeeds on this task if it removes the method from its original class and adds it to the top level of the file without a significant change in the size of the abstract syntax tree (AST). If the AST size hasn't changed much, we can infer that GPT didn't replace a bunch of code with a comment like "... insert original method here ...". The benchmark also gathers other laziness metrics, like counting the number of new comments that contain "...". These metrics correlate well with the AST size check. A rough sketch of both checks is below.

[0] https://github.com/paul-gauthier/refactor-benchmark
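
For the curious, a minimal sketch of the two checks described above (AST size and "..." comments), using only the standard library. The threshold and function names are illustrative, not the benchmark's actual harness:

    # Sketch of the laziness checks; the 10% threshold is an assumption.
    import ast
    import io
    import tokenize

    def ast_size(source: str) -> int:
        # Count every node in the file's abstract syntax tree.
        return sum(1 for _ in ast.walk(ast.parse(source)))

    def ellipsis_comments(source: str) -> int:
        # Count comments containing "...", a telltale sign of elided code.
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        return sum(1 for tok in tokens
                   if tok.type == tokenize.COMMENT and "..." in tok.string)

    def looks_lazy(original: str, refactored: str, tolerance: float = 0.10) -> bool:
        # Flag the edit if the AST shrank noticeably or new "..." comments appeared.
        shrinkage = 1 - ast_size(refactored) / ast_size(original)
        new_dots = ellipsis_comments(refactored) - ellipsis_comments(original)
        return shrinkage > tolerance or new_dots > 0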

4. Shamel+391[view] [source] 2024-02-14 01:13:09
>>anothe+xu
I use GPT-4 Turbo through the API many times a day for coding. I have encountered this behavior maybe once or twice, period. Whenever it did happen, it made sense as the model summarizing and/or assuming some shared knowledge (that was indeed known to me).

This, and people generally saying that ChatGPT has been intentionally degraded, are just super strange to me. I believe it's happening, but it's making me question my sanity. What am I doing to get decent outputs? Am I simply not as picky? I treat every conversation as though it needs to be vetted, because it does, regardless of how good the model is. I only trust output from the model on topics where I am a subject matter expert or in a closely adjacent field. Otherwise I treat it much like an internet comment: useful for surfacing curiosities but requiring vetting.

5. xionpl+MM7[view] [source] 2024-02-15 23:17:45
>>Shamel+391
IMO it is because there is a huge stochastic element to all this.

If we were all flipping coins, there would be people claiming that coins only come up tails. There would be nothing they were doing, though, to make the coin come up tails; that is just the streak they had.

Some days I get lucky with chatGPT4 and some days I don't.

It is also ridiculous how we talk about this as if every subject and bit of context presented to ChatGPT-4 will produce uniform output. A one-word difference in your own prompt might change things completely while trying to accomplish exactly the same thing. Now scale that across all the people talking about ChatGPT, each using it for something different.
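
To put a toy number on the coin-flip point (all sizes here are made up, not real usage data): with a 50/50 chance of a "good" answer, a surprising number of people will see nothing but bad answers purely by chance.

    # Toy simulation of unlucky streaks; every number is illustrative.
    import random

    random.seed(0)
    USERS, PROMPTS = 10_000, 8  # hypothetical population and prompts per user
    all_bad = sum(
        1
        for _ in range(USERS)
        if all(random.random() < 0.5 for _ in range(PROMPTS))
    )
    print(f"{all_bad} of {USERS} users saw only bad outputs")
    # Expected rate is 0.5 ** 8, about 0.4%: dozens of users hit an
    # all-tails streak despite identical settings.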
