Idk, I just tried Gemini Ultra and it's so much worse than GPT4 that I am actually quite shocked. Trying to ask it any kind of coding question ends up being this frustrating and honestly bizarre waste of time as it hallucinates a whole new language syntax every time and then asks if you want to continue with non-working, in fact non-existing, option A or the equally non-existent option B until you realise that you've spent an hour trying to make it at least output something that is even in the requested language and finally that it is completely useless.
I'm actually pretty astonished at how far Google is behind and that they released such a bunch of worthless junk at all. And have the chutzpah to ask people to pay for it!
Of course I'm looking forward to gpt-5 but even if it's only a minor step up, they're still way ahead.
edit: as pointed out, this was indeed a pretty esoteric example. But the rest of my attempts were hardly better, if they had a response at all.
The language in question was only open sourced after GPT4's training date, so i couldn't compare. That's actually why I tried it in the first place. And yes, I do expect it to be better - GPT4 isn't perfect but I don't really it ever hallucinating quite that hard. In fact, its answer was basically that it didn't know.
And when I asked it questions with other, much less esoteric code like "how would you refactor this to be more idiomatic?" I'd get either "I couldn't complete your request. Rephrase your prompt and try again." or "Sorry, I can't help with that because there's too much data. Try again with less data." GPT-4 was helpful in both cases.
It's magic, until it isn't.
Initially it felt like the singularity was at hand. You've played with it, got to know it, the computer was taking to you, it was your friend, it was exciting then you got bored with your new friend and it wasn't as great as you remember it.
Dating is often like this. You meet someone, have some amazing intimacy, then you get really get to know someone, you work out it wasn't for you and it's time to move on.
If it's a conversation with "format this loose data into XML" repeated several times and then a "now format it to JSON" I find often it has trouble determing that what you just asked for is the most important; I think the attention model gets confused by all the preceding text.
People say that, but I don't get this line of reasoning. There was something new, I learned to work with it. At one point I knew what question to ask to get the answer I want and have been using that form ever since.
Nowadays I don't get the answer I want for the same input. How is that not a result of declining quality?
That on top of my own experiences, and heaps of anecdotes over the last year.
> How would they honestly be getting worse?
The models behind GPT-4 (which is rumored to be a mixture model)? Tuning, RLHF (which has long been demonstrated to dumb the model down). The GPT-4, as in the thing that produces responses you get through API? Caching, load-balancing, whatever other tricks they do to keep the costs down and availability up, to cope with the growth of the number of requests.
--
[0] - >>39361705
> Nowadays I don't get the answer I want for the same input. How is that not a result of declining quality?
Is it really the same input? An argument could easily be made that as you’ve gotten accustomed to ChatGPT, you ask harder questions, use less descriptive of language, etc.
Talking to corporate HR is subjectively worse for most people, and objectively worse in many cases.
I don't have logs detailed enough to be able to look it up, so I can't prove it. But for me learning to work with AI tools like ChatGPT consists specifically developing an intuition of what kind of answer to expect.
Maybe my intuition skewed a little over the months. It did not do that for open source models though. As a software developer understanding and knowing what to expect from a complex system is basically my profession. Not just the systems I build, maintain and integrate, but also the systems I use to get information, like search engines. Prompt engineering is just a new iteration of google-fu.
Since this intuition has not failed me in all those other areas and since OpenAI has an incentive to change the workings under the hood (cutting costs, adding barriers to keep it politically correct) and it is a closed source system that no-one from the outside can inspect, my bet is that it is them and not me.
To me it feels like it detects if the answer could be answered cheaper by code interpreter model or 4 Turbo and then it offloads them to that and they just kinda suck compared to OG 4.
I’ve watched it fumble and fail to solve a problem with CI, took it 3 attempts over 5 minutes real time and just gave up in the end, a problem that OG 4 can do one shot no preamble.
Watching tools decline is frustrating.
The todo comments can be prompted against, just tell it to always include complete runnable code as its output will executed in a sandbox without prior verification.
Ok, I’m going to call b/s here unless your expectations of Google have not gone way down over the years. Google was night and day different results twenty years ago vs ten years ago vs today. If 2004 Google search was a “10 out of 10”, then 2014 it was an “8 out of 10”, and today barely breaks a “5” in quality of results in comparison and don’t even bother with the advanced query syntax you could’ve used in the 00’s, they flat ignore it now.
(Also, side note, reread what you said in this post again. Just a friendly note that the overall tone comes across a certain way you might not have intended)
It may be that you're expecting it to do too much at once. Try giving smaller requests.