SIMT means many execution lanes all executing the exact same instruction: one instruction decoder, but something like 64 execution pipelines behind it.
The difference is that the software "kernels" (i.e. software threads) may be mapped by the compiler and hardware either to hardware threads or to hardware SIMD lanes, and this mapping is not controlled by the programmer. Divergent branches cause inefficient (serialized) execution when the diverging threads happen to land on SIMD lanes of the same core, so they should be avoided.
This, however, is only an optimization problem: if speed were irrelevant, all the kernels could execute divergent instructions freely.
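A minimal sketch of that cost, with hypothetical kernel names: in the first kernel, even and odd threads take different branches, so each warp executes both paths serially with lanes masked off; in the second, the condition is uniform within each warp, so there is no serialization.

```
// Divergent: lanes within one warp disagree on the branch, so the warp
// runs both paths back-to-back with inactive lanes masked off.
__global__ void divergent_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 2 == 0)            // even lanes take this path...
        out[i] = in[i] * 2.0f;
    else                       // ...odd lanes take this one
        out[i] = in[i] + 1.0f;
}

// Uniform: the condition is constant across each 32-lane warp,
// so every warp executes only one of the two paths.
__global__ void uniform_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```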
The reason CUDA exists is to hide the SIMD lanes from the programmer and to allow programming as if software threads mapped only to hardware threads. Nevertheless, for optimal performance the programmer must be aware that this abstraction is not really true, so programs should be written with the limitations introduced by SIMD in mind.
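As one illustration of writing code "with awareness" of the lanes, here is a sketch (hypothetical kernel name) of a warp-level reduction: instead of pretending the threads are independent and synchronizing through shared memory, it exploits the fact that 32 threads form one SIMD warp and exchanges values through register shuffles.

```
// Each warp sums its 32 values using __shfl_down_sync, then lane 0
// adds the warp's partial sum to a global accumulator.
__global__ void warp_sum(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Tree reduction within the warp: each step halves the active values.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Lane 0 of each warp now holds that warp's partial sum.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, v);
}
```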
On conventional SIMT implementations (pre-Volta), the programmer also has to be aware of it to avoid deadlocks when atomics are used for synchronization across different lanes of the same warp.
From NVIDIA Volta onwards, each SIMT lane has its own program counter, with opportunistic reconvergence when possible.
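A sketch of that hazard (hypothetical names): several lanes of one warp contend for a spin lock. Pre-Volta, the warp has a single program counter, so the lane that acquired the lock cannot advance to the release while its sibling lanes keep the warp spinning in the loop, and the warp hangs forever. With Volta's per-thread program counters the scheduler can eventually let the lock holder run its critical section, so the loop can make forward progress (though intra-warp locks are still best avoided).

```
__device__ int lock = 0;      // 0 = free, 1 = held
__device__ int counter = 0;

__global__ void contended_increment()
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {    // try to acquire the lock
            counter += 1;                     // critical section
            __threadfence();
            atomicExch(&lock, 0);             // release the lock
            done = true;
        }
        // Pre-Volta: lanes that failed the CAS keep the whole warp
        // spinning here, so the holder never reaches the release.
    }
}
```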