There are still significant limitations; no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models are finally able to replace searches and Stack Overflow for a lot of my day-to-day programming.
Are we sure they know these things, as opposed to being able to consistently guess correctly? With LLMs I'm not sure we even have a clear definition of what it means for one to "know" something.
But it also did this for things where guessing was desirable. For example, given a riddle it would tell you it didn't know, or that there wasn't enough information. After pressuring it to answer anyway, it would correctly solve the riddle.
The official Llama 2 finetune was pretty bad about this stuff.
And if you bully it enough on something nonsensical, it'll give you a wrong answer.
You press it, it takes a guess even though you told it not to, it gets the guess right, and then you go "see, it knew!" But there's no database hanging out in ChatGPT/Claude/Gemini's weights with a list of cities and their tallest buildings. There's a whole bunch of opaque statistics derived from the content it's been trained on, which means that most of the time it'll come up with the same guess. And there's no difference in process between that highly consistent response when you ask for the tallest building in New York and the one where it hallucinates a Python method that doesn't exist, or suggests glue to keep the cheese on your pizza. It's all the same process to the LLM.
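To make that concrete, here's a toy sketch of a single next-token step. The candidate tokens and scores are entirely made up (no real model exposes anything this simple); the point is only that the "confident fact" and the "hallucinated method" fall out of the exact same score-and-pick loop.

    import math

    # Toy next-token step: made-up scores standing in for a model's internal logits.
    # Whether the continuation is a real fact or an invented method name,
    # the procedure is identical: score every candidate, pick the top one.
    def next_token(logits):
        # Softmax turns raw scores into probabilities.
        total = sum(math.exp(v) for v in logits.values())
        probs = {tok: math.exp(v) / total for tok, v in logits.items()}
        return max(probs, key=probs.get), probs

    # "The tallest building in New York is ___" -- one candidate dominates,
    # so the answer is stable and looks like retrieved knowledge.
    print(next_token({"One World Trade Center": 9.1,
                      "Empire State Building": 5.2,
                      "Chrysler Building": 3.0}))

    # "df.do_the_thing(" -- no candidate is grounded in anything real,
    # but the same argmax still returns something, and that something
    # is the hallucinated Python method.
    print(next_token({"normalize()": 2.1,
                      "auto_clean()": 2.0,
                      "fixup_all()": 1.9}))

Consistency just means one candidate's score is way ahead of the rest; it doesn't mean there's a lookup happening anywhere.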