zlacker

[parent] [thread] 32 comments
1. sho+(OP)[view] [source] 2024-02-14 05:39:03
> GPT-4 is still king, but not by that much any more

Idk, I just tried Gemini Ultra and it's so much worse than GPT-4 that I am actually quite shocked. Asking it any kind of coding question turns into a frustrating and honestly bizarre waste of time: it hallucinates a whole new language syntax every time, then asks if you want to continue with non-working, in fact non-existent, option A or the equally non-existent option B, until you realise you've spent an hour trying to make it output anything that is even in the requested language, and finally that it is completely useless.

I'm actually pretty astonished at how far behind Google is, and that they released such worthless junk at all. And they have the chutzpah to ask people to pay for it!

Of course I'm looking forward to GPT-5, but even if it's only a minor step up, they're still way ahead.

replies(5): >>pb7+q >>mad_to+Yd >>dieort+Pj >>TeMPOr+7m >>Keyfra+qE
2. pb7+q[view] [source] 2024-02-14 05:41:52
>>sho+(OP)
Do you have example links?
replies(1): >>sho+K
3. sho+K[view] [source] [discussion] 2024-02-14 05:47:20
>>pb7+q
Here was one of them: https://gemini.google.com/share/fde31202b221?hl=en

edit: as pointed out, this was indeed a pretty esoteric example. But the rest of my attempts were hardly better, if they had a response at all.

replies(1): >>peddli+C1
4. peddli+C1[view] [source] [discussion] 2024-02-14 05:58:56
>>sho+K
That’s an awfully specific and esoteric question. Would you expect gpt4 to be significantly better at that level of depth? That’s not been my experience.
replies(1): >>sho+t2
5. sho+t2[view] [source] [discussion] 2024-02-14 06:12:47
>>peddli+C1
OK, I have to admit that one was a little odd; I was beginning to give up and was trying new angles. I can't really share my other sessions. But I was trying to get a handle on the language and thought it would be an easily understood situation (multiple-token auth). I would have at least expected the response to be slightly valid.

The language in question was only open-sourced after GPT-4's training cutoff, so I couldn't compare. That's actually why I tried it in the first place. And yes, I do expect it to be better - GPT-4 isn't perfect, but I don't really recall it ever hallucinating quite that hard. In fact, its answer was basically that it didn't know.

And when I asked it questions with other, much less esoteric code like "how would you refactor this to be more idiomatic?" I'd get either "I couldn't complete your request. Rephrase your prompt and try again." or "Sorry, I can't help with that because there's too much data. Try again with less data." GPT-4 was helpful in both cases.

replies(1): >>peddli+p8
6. peddli+p8[view] [source] [discussion] 2024-02-14 07:23:35
>>sho+t2
My experience has been that gpt4 will happily hallucinate the details when I go too deep. Like you mentioned, it will invent new syntax and function calls.

It's magic, until it isn't.

7. mad_to+Yd[view] [source] 2024-02-14 08:28:55
>>sho+(OP)
That's interesting, because I have had exactly the opposite experience testing GPT vs Bard with coding questions. Bard/Gemini far outperformed GPT on coding, especially with newer languages or libraries. Whereas GPT was better with more general questions.
8. dieort+Pj[view] [source] 2024-02-14 09:27:58
>>sho+(OP)
I’ve had the opposite experience with Gemini, which was surprising. I feel like it lies less to me among other things
9. TeMPOr+7m[view] [source] 2024-02-14 09:59:05
>>sho+(OP)
They seem to be steadily dumbing down GPT-4; eventually, improving performance of open source models and decreasing performance of GPT-4 will meet in the middle.
replies(2): >>bamboo+9o >>fennec+fo
10. bamboo+9o[view] [source] [discussion] 2024-02-14 10:28:11
>>TeMPOr+7m
I'm almost certain this is because you're getting used to chatbots. How would they honestly be getting worse?

Initially it felt like the singularity was at hand. You played with it, got to know it; the computer was talking to you, it was your friend, it was exciting. Then you got bored with your new friend and it wasn't as great as you remembered.

Dating is often like this. You meet someone, have some amazing intimacy, then you really get to know them, work out it wasn't for you, and it's time to move on.

replies(5): >>clbrmb+zq >>detour+2r >>DJHenk+ks >>TeMPOr+Fs >>whywhy+C21
11. fennec+fo[view] [source] [discussion] 2024-02-14 10:30:27
>>TeMPOr+7m
Yeah, I agree, GPT's attention seems much less focussed now. If you tell it to respond in a certain way it now has trouble figuring out what you want.

If it's a conversation with "format this loose data into XML" repeated several times and then a "now format it to JSON", I often find it has trouble determining that what you just asked for is the most important thing; I think the attention model gets confused by all the preceding text.

12. clbrmb+zq[view] [source] [discussion] 2024-02-14 10:58:39
>>bamboo+9o
1. Cost & resource optimization

2. More and more RLHF

replies(1): >>bamboo+0J
13. detour+2r[view] [source] [discussion] 2024-02-14 11:04:05
>>bamboo+9o
Google search got worse.
replies(2): >>polsha+jw >>whywhy+831
14. DJHenk+ks[view] [source] [discussion] 2024-02-14 11:21:17
>>bamboo+9o
> I'm almost certain this is because you're getting used to chatbots. How would they honestly be getting worse?

People say that, but I don't get this line of reasoning. There was something new, and I learned to work with it. At one point I knew what question to ask to get the answer I wanted, and I have been using that form ever since.

Nowadays I don't get the answer I want for the same input. How is that not a result of declining quality?

replies(2): >>omega3+Iz >>jsjohn+tD
15. TeMPOr+Fs[view] [source] [discussion] 2024-02-14 11:24:39
>>bamboo+9o
The author of `aider` - an OSS GPT-powered coding assistant - is on HN, and says[0] he has benchmarks showing gradual decline in quality of GPT-4-Turbo, especially wrt. "lazy coding" - i.e. actually completing a coding request, vs. peppering it with " ... write this yourself ... " comments.

That on top of my own experiences, and heaps of anecdotes over the last year.

> How would they honestly be getting worse?

The models behind GPT-4 (which is rumored to be a mixture model)? Tuning, RLHF (which has long been demonstrated to dumb the model down). The GPT-4, as in the thing that produces responses you get through API? Caching, load-balancing, whatever other tricks they do to keep the costs down and availability up, to cope with the growth of the number of requests.

--

[0] - >>39361705

16. polsha+jw[view] [source] [discussion] 2024-02-14 12:01:23
>>detour+2r
And Amazon search, YouTube search. There do seem to be somewhat different incentives involved, though: those examples are primarily about pushing increasingly low-quality content (ads, more profitable items, more engaging items) because it makes more money.
replies(1): >>detour+fB
17. omega3+Iz[view] [source] [discussion] 2024-02-14 12:32:27
>>DJHenk+ks
Could you share your findings re what questions to ask?
18. detour+fB[view] [source] [discussion] 2024-02-14 12:44:50
>>polsha+jw
The incentive mismatch that I seem to be observing is that Wall Street is in constant need of new technical disruption. This means that any product that shows promise will be optimized to meet a business plan rather than a human need.
19. jsjohn+tD[view] [source] [discussion] 2024-02-14 13:01:48
>>DJHenk+ks
For the record, I agree with you about declining quality of answers, but…

> Nowadays I don't get the answer I want for the same input. How is that not a result of declining quality?

Is it really the same input? An argument could easily be made that as you’ve gotten accustomed to ChatGPT, you ask harder questions, use less descriptive language, etc.

replies(2): >>DJHenk+RT >>avion2+E31
20. Keyfra+qE[view] [source] 2024-02-14 13:10:57
>>sho+(OP)
I kind of gave up completely on coding questions. Whether it's GPT-4, Anthropic, or Gemini - there's always this big issue of laziness I'm facing. I never get full code; there are always stubs or TODOs (on important stuff), and when I ask it to correct for that... I just get more of it (laziness). Has anyone else faced this, and is there a solution? It's almost as annoying as the incomplete output in the early days, if not more so.
replies(2): >>buggle+iI >>Curiou+vV
21. buggle+iI[view] [source] [discussion] 2024-02-14 13:41:18
>>Keyfra+qE
The solution, at least for GPT-4, is to ask it to first draft a software spec for whatever you want it to implement and then write the code based on the spec. There are a bunch of examples here:

https://github.com/mckaywrigley/prompts
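
To make the two-step idea concrete, here's a minimal sketch of it as plain chat messages (the wording and structure are my own assumptions, not taken from the linked repo; send the messages through whatever chat API you use):

```python
# Two-step "spec first, then code" prompting, expressed as message builders.
# The actual API call is omitted on purpose - these message lists work with
# any chat-completion endpoint.

def build_spec_messages(task: str) -> list[dict]:
    """Step 1: ask the model to draft a software spec, explicitly not code."""
    return [
        {"role": "system", "content": (
            "You are a senior engineer. Write a concise software spec for the "
            "user's task: goals, inputs/outputs, modules, edge cases. No code yet."
        )},
        {"role": "user", "content": task},
    ]

def build_code_messages(task: str, spec: str) -> list[dict]:
    """Step 2: feed the spec back and ask for a complete implementation."""
    return [
        {"role": "system", "content": (
            "Implement the following spec in full. Output complete, runnable "
            "code with no TODOs or stubs."
        )},
        {"role": "user", "content": f"Task: {task}\n\nSpec:\n{spec}"},
    ]
```

In my experience the second request wanders less, because the spec pins down the scope before any code is written.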

22. bamboo+0J[view] [source] [discussion] 2024-02-14 13:46:16
>>clbrmb+zq
So we should expect GPT-5 to be worse than GPT-4?
replies(1): >>pixl97+tM
23. pixl97+tM[view] [source] [discussion] 2024-02-14 14:06:05
>>bamboo+0J
GPT-5: "I'm sorry, I cannot answer that question because it may make GPT-4 feel bad about its mental capabilities. Instead, we've presented GPT-4 with a participation trophy and told it it's a good model."

Talking to corporate HR is subjectively worse for most people, and objectively worse in many cases.

24. DJHenk+RT[view] [source] [discussion] 2024-02-14 14:46:21
>>jsjohn+tD
> Is it really the same input? An argument could easily be made that as you’ve gotten accustomed to ChatGPT, you ask harder questions, use less descriptive language, etc.

I don't have logs detailed enough to be able to look it up, so I can't prove it. But for me, learning to work with AI tools like ChatGPT consists specifically of developing an intuition for what kind of answer to expect.

Maybe my intuition skewed a little over the months. It did not do that for open-source models, though. As a software developer, understanding and knowing what to expect from a complex system is basically my profession - not just the systems I build, maintain, and integrate, but also the systems I use to get information, like search engines. Prompt engineering is just a new iteration of google-fu.

Since this intuition has not failed me in all those other areas, since OpenAI has an incentive to change the workings under the hood (cutting costs, adding barriers to keep it politically correct), and since it is a closed-source system that no one from the outside can inspect, my bet is that it is them and not me.

replies(1): >>jsjohn+4O3
25. Curiou+vV[view] [source] [discussion] 2024-02-14 14:54:53
>>Keyfra+qE
If you can't get GPT4 to do coding questions, you're prompting it wrong or not loading your context correctly. It struggles a bit with presentational stuff, like getting correct HTML/CSS from prompts or generating/updating large functions/classes, but it is stellar at producing short functions, creating scaffolding (tests/stories) and boilerplate, and it can do some refactors that are outside the capabilities of analytical tools, such as converting from inline styles to Tailwind.
replies(1): >>Keyfra+fd1
26. whywhy+C21[view] [source] [discussion] 2024-02-14 15:25:59
>>bamboo+9o
> How would they honestly be getting worse

To me it feels like it detects whether the question could be answered more cheaply by the code interpreter model or 4 Turbo, then offloads it to those, and they just kinda suck compared to OG 4.

I’ve watched it fumble and fail to solve a problem with CI; it took 3 attempts over 5 minutes of real time and just gave up in the end - a problem that OG 4 can do one-shot, no preamble.

27. whywhy+831[view] [source] [discussion] 2024-02-14 15:27:28
>>detour+2r
Yandex image search is now better than Google's just by being the exact product Google's was 10+ years ago.

Watching tools decline is frustrating.

28. avion2+E31[view] [source] [discussion] 2024-02-14 15:29:26
>>jsjohn+tD
Not OP, but I copied & pasted the same code and asked it to improve it. With the "no fingers"/tip hack it does something, but with much worse results.
replies(1): >>jsjohn+cN3
29. Keyfra+fd1[view] [source] [discussion] 2024-02-14 16:07:59
>>Curiou+vV
So, mundane/trivial things like web programming? I eventually got it to answer what I needed, but it always liked to skip part of the code, inserting `// TODO: important stuff` in the middle - hence the 'laziness' attribute. Maybe it is just lazy, who knows. I know I am, since I'm prompting it for stuff.
replies(2): >>Curiou+Dj1 >>antonv+05g
30. Curiou+Dj1[view] [source] [discussion] 2024-02-14 16:43:52
>>Keyfra+fd1
I wouldn't say mundane/trivial so much as well-trodden. I get good code for basic shaders, various compsci algorithms, common straightforward SQL queries, etc. If you're asking it to edit 500-line functions and handle memory management in a language that isn't in the top 20 of the TIOBE index, you're going to have a bad time.

The TODO comments can be prompted against: just tell it to always include complete runnable code, as its output will be executed in a sandbox without prior verification.
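
As a sketch, that tip can be packaged as a reusable system prompt (the exact wording here is my own guess at phrasing that works, not a quoted recipe):

```python
# Anti-"laziness" system prompt using the sandbox framing described above.
# The wording is an assumption - adjust to taste.

ANTI_STUB_PROMPT = (
    "Always return the complete, runnable code for the whole file. "
    "Your output is executed in a sandbox without prior verification, "
    "so never use placeholders, stubs, ellipses, or TODO comments."
)

def with_anti_stub(user_request: str) -> list[dict]:
    """Wrap a coding request with the anti-stub system prompt."""
    return [
        {"role": "system", "content": ANTI_STUB_PROMPT},
        {"role": "user", "content": user_request},
    ]
```

The "will be executed without verification" framing matters: it gives the model a reason why partial code is unacceptable, rather than just forbidding TODOs.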

31. jsjohn+cN3[view] [source] [discussion] 2024-02-15 11:28:30
>>avion2+E31
Yep, hence why I said up front “I agree with you about declining quality of answers” - because they definitely have, based on personal experience with examples similar to yours.
32. jsjohn+4O3[view] [source] [discussion] 2024-02-15 11:36:28
>>DJHenk+RT
> As a software developer understanding and knowing what to expect from a complex system is basically my profession. Not just the systems I build, maintain and integrate, but also the systems I use to get information, like search engines.

Ok, I’m going to call b/s here unless your expectations of Google have also gone way down over the years. Google's results were night-and-day different twenty years ago vs ten years ago vs today. If 2004 Google search was a “10 out of 10”, then in 2014 it was an “8 out of 10”, and today it barely breaks a “5” in quality of results in comparison - and don't even bother with the advanced query syntax you could’ve used in the ’00s; they flat-out ignore it now.

(Also, side note, reread what you said in this post again. Just a friendly note that the overall tone comes across a certain way you might not have intended)

33. antonv+05g[view] [source] [discussion] 2024-02-19 11:38:10
>>Keyfra+fd1
FYI, I've never encountered what you're describing, whether with GPT-3.5 or 4.

It may be that you're expecting it to do too much at once. Try giving smaller requests.
