
[return to "Implementing a GPU's programming model on a CPU"]
1. Techni+Eb2 2023-10-14 18:15:52
>>luu+(OP)
> This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data

This statement is comparing the SIMT model to SIMD. Can anyone explain the last part, about SIMT being better suited for many programs, each operating on its own data? Are they just saying you can have individual “threads” executing independently (via predication/masks and such)?
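For concreteness, here's roughly how I picture the two models (a toy sketch, not from the article; the function names and the AVX width are just made up for illustration):

    // Explicit SIMD: one program, the programmer packs data into vector registers.
    // (x86 AVX intrinsics; assumes n is a multiple of 8.)
    #include <immintrin.h>
    void add_simd(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
    }

    // SIMT (CUDA): each "thread" is written as its own scalar program over its own
    // element; the hardware groups threads into warps and handles the per-thread
    // branch with predication/masking.
    __global__ void add_simt(const float *a, const float *b, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a[i] + b[i];
    }

Is the independence they're describing just what the second version expresses, or is there more to it?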

2. mhh__+2j2 2023-10-14 19:09:42
>>Techni+Eb2
SIMT lets a scheduler get clever about memory accesses. SIMD can practically only access memory linearly (scatter/gather can do better, but it's still usually quite linear), whereas SIMT can be much smarter about keeping lots of similar bits of work in flight in ways that use the bandwidth maximally and don't overlap.
3. kllrno+Gj2 2023-10-14 19:14:25
>>mhh__+2j2
https://developer.nvidia.com/blog/how-access-global-memory-e...

SIMT still expects coalesced memory accesses that are close together, otherwise performance falls off a cliff.
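Roughly what that article's comparison boils down to (a sketch; the kernel names and the stride parameter are mine, not from the post):

    // Coalesced: consecutive threads in a warp touch consecutive 4-byte words,
    // so a 32-thread warp's loads merge into a single 128-byte transaction.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads hit addresses `stride` elements apart, so
    // each warp needs many separate transactions and effective bandwidth drops.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }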

4. the_sv+A44 2023-10-15 14:32:49
>>kllrno+Gj2
Yes, but not all threads in the block need to. As long as you fill a cache line you’re good.
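E.g. with 4-byte floats, a 32-thread warp covers 32 × 4 B = 128 B, one cache line on recent NVIDIA parts, so the lanes can be shuffled within that line and still coalesce (a sketch, assuming blockDim.x is a multiple of 32; the rotation is arbitrary):

    // Permuted-within-a-line access: each warp still touches exactly one
    // 128-byte segment, so the loads coalesce even though lane i doesn't
    // read element i.
    __global__ void copy_permuted(const float *in, float *out, int n) {
        int lane = threadIdx.x & 31;                               // position within the warp
        int base = (blockIdx.x * blockDim.x + threadIdx.x) & ~31; // warp's first element
        int i = base + ((lane + 7) & 31);                          // a permutation of 0..31
        if (i < n) out[i] = in[i];
    }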