Convolutional filters lend themselves to rich combinatorics of compositions[1]: think of them as of context-dependent texture-atoms, repulsing and attracting over the variations of the local multi-dimensional context in the image. The composition is literally a convolutional transformation of local channels encoding related principal components of context.
Astronomical amounts of computations spent via training allow the network to form a lego-set of these texture-atoms in a general distribution of contexts.
At least this is my intuition for the nature of the convnets.
1. https://microscope.openai.com/models/contrastive_16x/image_b...