I would like to propose that reasoning needs an intermediate representation for it to be effective. Consider the scene graph representation in computer graphics. This scene graph is the intermediate representation. The algorithm is not reasoning about individual pixels of two objects interacting in the scene graph. It uses IR. Now for some that IR takes the form of language/words. For some it takes the form of visuals. For some, these are just abstract feelings.