A lot of the hard issues with driving are preemptive-knowledge issues. I see a ball rolling toward the road from the left. As a human I know that, one, the ball will likely roll out in front of me, and two, a kid or person may be following it. But if you see a blowing trash bag, you probably aren't going to take any risky corrective action to avoid it.
The problem with a vision-only system is that a ball and a blowing trash bag are just objects with the same priority. There's no categorization of the relative meaning and danger behind each one.
But things start getting weird when you couple LLMs with vision. It's much too slow currently, but in multi-modal systems objects get depth of meaning: the trash bag can be identified and assigned a low risk, while the ball can be identified and assigned a high risk, along with a lot of the other generalizations humans typically make.
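To make that concrete, here's a minimal sketch of the idea. It assumes the vision stack hands you labels like "ball" and "trash_bag"; the lookup table here is just a stand-in for whatever semantic judgment the LLM/VLM layer would actually return, and all the names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str            # raw class label from the vision model
    heading_to_road: bool  # coarse trajectory estimate

# Stand-in for the "depth of meaning" an LLM would provide:
# the same detection gets very different risk and implications.
SEMANTIC_RISK = {
    "ball":      {"risk": "high", "implies": "a child may follow it"},
    "trash_bag": {"risk": "low",  "implies": "harmless debris"},
}

def assess(d: Detection) -> str:
    info = SEMANTIC_RISK.get(d.label, {"risk": "unknown", "implies": "treat cautiously"})
    if d.heading_to_road and info["risk"] == "high":
        return f"{d.label}: slow down -- {info['implies']}"
    return f"{d.label}: monitor -- {info['implies']}"

if __name__ == "__main__":
    print(assess(Detection("ball", heading_to_road=True)))       # slow down
    print(assess(Detection("trash_bag", heading_to_road=True)))  # monitor
```

A pure vision pipeline stops at the label; the point is that the second stage is where the "ball implies kid" reasoning lives.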