Modern GPGPUs also have more hardware dedicated to this beyond the SIMD/SIMT models. In NVIDIA's CUDA programming model, besides the group of threads that represents a vector operation (a warp), you also have groups of warps (thread blocks) that are assigned to the same processor and can explicitly address a fast, shared memory. Each processor has a large register file that is partitioned among its resident threads, so each thread has its own dedicated registers. Scheduling is done in hardware at instruction granularity, so context switches between warps effectively take a single cycle. Starting with Volta, each thread even gets its own program counter and call stack (independent thread scheduling), so threads in a warp that take divergent branches can make forward progress independently and interleave their execution, rather than one side of the branch sitting idle until the warp reconverges.
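To make the warp/thread-block distinction concrete, here is a minimal sketch of a block-level reduction. The kernel name, block size, and variable names are illustrative assumptions, not from the original text; the point is that all warps in a block cooperate through `__shared__` memory and synchronize with `__syncthreads()`.

```cuda
// Sketch: sum n floats, one partial sum per thread block.
// Assumes a launch like blockSum<<<gridDim, 256>>>(in, out, n),
// i.e. 256 threads (8 warps of 32) per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // fast memory shared by the whole thread block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // each thread loads one element
    __syncthreads();                     // wait for every warp in the block

    // Tree reduction in shared memory; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];       // thread 0 writes the block's partial sum
}
```

Note that threads never save or restore registers across warp switches here: because each thread keeps its own registers resident, the hardware can swap between warps every cycle to hide the shared-memory and global-memory latencies.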
There are many other hardware additions that make this programming model very efficient. Similar to how C and x86 each provide abstractions over the actual micro-ops being executed, hiding complexity like pipelining, out-of-order execution, and speculative execution, CUDA and the PTX ISA provide abstractions over complex hardware implementations that specifically benefit this kind of SIMT paradigm.
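One visible example of that layering is the warp shuffle intrinsic: a single CUDA builtin that compiles down to the PTX `shfl.sync` instruction, letting lanes of a warp exchange register values directly with no trip through shared memory. The function below is a standard warp-sum idiom, offered here as a sketch rather than anything from the original text.

```cuda
// Warp-level sum: after the loop, lane 0 holds the sum of v across all
// 32 lanes. __shfl_down_sync maps to the PTX shfl.sync instruction.
__device__ float warpSum(float v) {
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}
```

The programmer writes what looks like an ordinary function call, while the compiler and hardware decide how the lane-to-lane data movement actually happens, the same way an x86 compiler never exposes micro-op scheduling.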