zlacker

SIMT and SIMD are different things. It's fortunate that they have different names.

A GPU is a single instruction multiple data machine. That's what the predicated vector operations are. 32 floats at a time, each with a disable bit.

Cuda is a single instruction multiple thread language. You write code in terms of one float and branching on booleans, as if it was a CPU, with some awkward intrinsics for accessing the vector units in the GPU.

That is, the programming model of a GPU ISA and that of Cuda are not the same. The GPU gives you vector instructions. Cuda gives you (mostly) scalar instructions and a compiler that deals with this mismatch, lowering branches to changes in exec mask and so forth.

With my numerical library hat on, I hate this. Programming a simd machine through a simt language means trying to get the compiler to transform the control flow into the thing you could easily write using vector instructions.

With my compiler implementer hat on, I hate this. It gives you two control flow graphs intertwined and a really bad time in register allocation.

It's not totally clear to me why simt won out over writing the vector operations. I'm certainly in the minority opinion here.

replies(4): >>pca006+W9 >>shihab+sb >>alphab+jI1 >>konsta+ws2

>>JonChe+(OP)
I guess that a lot of people are uncomfortable thinking about vector instructions, and dealing with masks manually? And for vector instructions you need to align things properly, pad the arrays such that they are of the right size, that people are not used to I guess.

replies(1): >>RevEng+KI2

>>JonChe+(OP)
I think it boils down to which GPGPU camp you are in, AI or HPC.

AI has relatively simple workflow, less thread divergence, so the SIMT abstractions add very little value. HPC workflow on the other hand is lot more complex. Writing a good simulation program for example, is going to get inhumanly complex with just SIMD.

>>JonChe+(OP)
> It's not totally clear to me why simt won out over writing the vector operations.

From the user side, it is probably simpler to write an algorithm once without vectors, and have a compiler translate it to every vector ISA it supports, rather than to deal with each ISA by hand.

Besides, in many situations, having the algorithm executed sequentially or in parallel is irrelevant to the algorithm itself, so why introduce that concern?

> I'm certainly in the minority opinion here.

There are definitely more userland programmers than compiler/numerical library ones.

replies(1): >>JonChe+kC2

>>JonChe+(OP)
> SIMT and SIMD are different things.

was anything said against that?

the comment said SIMT is same as SPMD

>>alphab+jI1
I think it's easier to write the kernel in terms of the scalar types if you can avoid the warp level intrinsics. If you need to mix that with the cross lane operations it gets very confusing. In the general case volta requires you to pass the current lane mask into intrinsics, so you get to try to calculate what the compiler will turn your branches into as a bitmap.

So if you need/want to reason partly in terms of warps, I think the complexity is lower to reason wholly in terms of warps. You have to use vector types and that's not wonderful, but in exchange you get predictable control flow out of the machine code.

Argument is a bit moot though, since right now you can't program either vendor hardware using vectors, so you also need to jump the barrier to assembly. None of the GPUs are very easy to program in assembly.

>>pca006+W9
I agree. I work on numerical simulation software that involves very sparse, very irregular matrices. We hardly use SIMD because of the challenges of maintaining predicates, bitmasks, mapping values into registers, and so on. It's not bad if we can work on dense blocks, but dealing with sparsity is a huge headache. Now that we are implementing these methods with CUDA using the SIMT paradigm, that complexity is largely taken care of. We still need to consider how to design algorithms to have parallelism, but we don't have to manage all the minutiae of mapping things into and out of registers.

Modern GPGPUs also have more hardware dedicated to this beyond the SIMD/SIMT models. In NVIDIAs CUDA programming model, besides the group of threads that represents a vector operation (a warp), you also have groups of warps (thread blocks) that are assigned the same processor and can explicitly address a fast, shared memory. Each processor has many registers that are automatically mapped to each thread so that each thread has its own dedicated registers. Scheduling is done in hardware at an instruction level so that you can effectively single cycle context switches between warps. Starting with Volta, it will even assemble vectors from threads in any warps in the same thread block, so lanes that are predicated off in a warp don't have to go to waste - they can take lanes from other warps.

There are many other hardware additions that make this programming model very efficient. Similar to how C and x86 each provide abstractions over the actual micro ops being executed that hides complexity like pipelining, out of order execution, and speculative execution, CUDA and the PTX ISA provide abstractions over complex hardware implementations that specifically benefit this kind of SIMT paradigm.