zlacker

[parent] [thread] 2 comments
1. hasman+(OP)[view] [source] 2023-10-14 16:12:41
Subtle difference. A parallel_for can have asynchronous threads. They can all diverge and run independent instructions (due to if statements etc.)

SIMT means multiple processors all executing the exact same instruction. One instruction decoder, but like 64 execution pipelines.

replies(1): >>adrian+Yb
2. adrian+Yb[view] [source] 2023-10-14 17:25:36
>>hasman+(OP)
The same is true for CUDA/OpenCL kernels. They can include conditionals and they can execute independent instructions.

The difference is that the software "kernels" (i.e. software threads) may be mapped by the compiler either to hardware threads or to hardware SIMD lanes, and this mapping is not controlled by the programmer. Divergent instructions cause inefficient (serial) execution when they happen to land on SIMD lanes of the same core, so they should be avoided.

This, however, is only an optimization problem: if speed were irrelevant, all the kernels could execute divergent instructions.
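A minimal sketch of what such divergence looks like (the kernel name and the branch condition are hypothetical, chosen only for illustration). Within one 32-lane warp, lanes taking different paths of the `if` are serialized: one group idles while the other executes, then vice versa. The result is still correct, just slower — exactly the optimization problem described above.

```cuda
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);  // lanes whose input is positive run first...
    else
        out[i] = 0.0f;          // ...then the remaining lanes of the warp
}
```

If the condition correlates with lane index (e.g. whole warps take the same path), the serialization cost disappears, which is why data layout matters as much as the branch itself.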

The reason for the existence of CUDA is to mask for the programmer the existence of the SIMD lanes and to allow programming as if the software threads would map only to hardware threads. Nevertheless, for optimum performance the programmer must be aware that this abstraction is not really true, so the programs should be written with awareness of the limitations introduced by SIMD.

replies(1): >>my123+2h
3. my123+2h[view] [source] [discussion] 2023-10-14 18:04:20
>>adrian+Yb
> Nevertheless, for optimum performance

On conventional SIMT implementations (pre-Volta), the programmer also has to be aware of it to avoid deadlocks when atomics are used across different lanes of the same warp.
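The classic hazard is an intra-warp spinlock (a sketch; `mutex` and `counter` are assumed to be device pointers initialized by the host). On pre-Volta hardware the warp executes in lockstep, so the lane that wins the lock cannot proceed to release it until every lane leaves the loop — but the losing lanes never leave the loop, and the warp deadlocks:

```cuda
__global__ void lock_demo(int *mutex, int *counter) {
    // Pre-Volta: if one lane of a warp acquires the lock, the other
    // 31 lanes spin here forever, and the winning lane can never
    // advance to the release below -- deadlock.
    while (atomicCAS(mutex, 0, 1) != 0) { /* spin */ }
    *counter += 1;
    atomicExch(mutex, 0);  // never reached on a deadlocked warp
}
```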

On NVIDIA Volta onwards, each SIMT lane has its own program counter, with opportunistic reconvergence when possible.
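The flip side of independent thread scheduling is that code can no longer assume lanes are implicitly reconverged; warp-wide operations should use the CUDA 9+ `*_sync` intrinsics with an explicit participation mask. A sketch of a warp reduction written that way (kernel name hypothetical):

```cuda
__global__ void warp_sum(int *out, const int *in) {
    int v = in[threadIdx.x];
    // Explicit full-warp mask: each _sync call names which lanes
    // participate instead of relying on implicit lockstep.
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, off);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);  // lane 0 of each warp holds the warp's sum
}
```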
