As another poster has already mentioned, there is a CUDA-inspired compiler for CPUs that has been available for many years: ISPC (Implicit SPMD Program Compiler), at https://github.com/ispc/ispc.
NVIDIA has the very annoying habit of using many terms that differ from those that have been used in computer science for decades. The worst part is that NVIDIA has not invented new words; it has frequently reused words that were already widely used with other meanings.
SIMT (Single-Instruction Multiple Thread) is not the worst term coined by NVIDIA, but there was no need for yet another acronym. For instance, they could have used SPMD (Single Program, Multiple Data), which dates from 1988, two decades before CUDA.
Moreover, SIMT is the same thing that was called an "array of processes" by C.A.R. Hoare in August 1978 (in "Communicating Sequential Processes"), "replicated parallel" by Occam in 1985, "PARALLEL DO" by OpenMP Fortran in October 1997, and "parallel for" by OpenMP C and C++ in October 1998.
Each so-called CUDA kernel is just the body of a "parallel for" (which is multi-dimensional, like in Fortran).
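To make this concrete, a minimal sketch using the usual saxpy example (the names are mine, just for illustration):

    // The loop a CPU programmer would write, as OpenMP would parallelize it:
    //   #pragma omp parallel for
    //   for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
    //
    // The same thing as a CUDA kernel: the kernel is the loop body, and the
    // launch configuration replaces the loop header.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // recover the "loop index"
        if (i < n)                                      // guard: the grid is rounded up
            y[i] = a * x[i] + y[i];
    }

    // Launch: one logical iteration per thread; the compiler and hardware
    // decide how the threads map onto cores and SIMD lanes.
    // saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);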
The only (but extremely important) innovation brought by CUDA is that the compiler is smart enough that the programmer does not need to know the structure of the processor, i.e. how many cores it has and how many SIMD lanes each core has. The CUDA compiler automatically distributes the work over the available SIMD lanes and cores, and in most cases the programmer does not care whether two executions of the per-data-item function happen on two different cores or on two different SIMD lanes of the same core.
This distribution of the work over SIMD lanes and cores is simple when the SIMD operations are maskable, as in GPUs, AVX-512 (a.k.a. AVX10), or Arm SVE. When masking is not available, as in AVX2 or Armv8-A NEON, implementing conditional statements and expressions is more complicated.
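A minimal sketch of what that means, written for one element (the names are illustrative, not real intrinsics):

    // With hardware masks (GPU, AVX-512, SVE) the conditional below can run
    // directly under a per-lane predicate. Without masks (AVX2, Armv8-A NEON)
    // a vectorizer typically computes BOTH arms for every lane and then
    // blends, which is only legal when both arms are safe to execute
    // speculatively (no faults, no side effects).
    float one_lane(float x) {
        float then_val = x * 2.0f;               // both arms computed unconditionally...
        float else_val = x + 1.0f;
        bool  take_then = (x > 0.0f);            // the per-lane predicate
        return take_then ? then_val : else_val;  // ...then a vector blend/select
    }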
CUDA didn't show up until 2007.
There is some discussion in the ispc paper: https://pharr.org/matt/assets/ispc.pdf
A GPU is a single-instruction, multiple-data machine. That's what the predicated vector operations are: 32 floats at a time, each with a disable bit.
Cuda is a single-instruction, multiple-thread language. You write code in terms of one float and branches on booleans, as if it were a CPU, with some awkward intrinsics for accessing the vector units in the GPU.
That is, the programming model of a GPU ISA and that of Cuda are not the same. The GPU gives you vector instructions. Cuda gives you (mostly) scalar instructions and a compiler that deals with this mismatch, lowering branches to changes in exec mask and so forth.
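Roughly, as a sketch (the lowered form in the comments is schematic pseudo-ISA, not any vendor's real instruction set):

    // The source is scalar, per thread:
    __global__ void clamp_negatives(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && v[i] < 0.0f)   // a branch in the source...
            v[i] = 0.0f;
    }
    // ...becomes straight-line vector code, roughly:
    //   p  = (lane < n) & (v[lane] < 0.0)   ; per-lane predicate
    //   @p st v[lane], 0.0                  ; store only in active lanes
    // An if/else would execute both arms back to back, with the exec mask
    // inverted in between.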
With my numerical library hat on, I hate this. Programming a simd machine through a simt language means trying to get the compiler to transform the control flow into the thing you could easily write using vector instructions.
With my compiler implementer hat on, I hate this. It gives you two control flow graphs intertwined and a really bad time in register allocation.
It's not totally clear to me why simt won out over writing the vector operations. I'm certainly in the minority opinion here.
SIMT means multiple processors all executing the exact same instruction. One instruction decoder, but like 64 execution pipelines.
AI has a relatively simple workflow with little thread divergence, so the SIMT abstraction adds very little value. HPC workflows, on the other hand, are a lot more complex. Writing a good simulation program, for example, would get inhumanly complex with just SIMD.
The difference is that the software "kernels" (i.e. software threads) may be mapped by the compiler either to hardware threads or to hardware SIMD lanes, and this mapping is not controlled by the programmer. Divergent instructions therefore cause inefficient (serial) execution whenever they happen to land on SIMD lanes of the same core, so they must be avoided.
This, however, is only an optimization problem: if speed were irrelevant, all the kernels could execute divergent instructions.
The reason for the existence of CUDA is to hide the SIMD lanes from the programmer and to allow programming as if the software threads mapped only to hardware threads. Nevertheless, for optimum performance the programmer must be aware that this abstraction is not really true, so programs should be written with awareness of the limitations introduced by SIMD.
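A toy sketch of that concern (path_a/path_b are stand-ins for two expensive code paths; 32 is the warp width on current NVIDIA hardware):

    __device__ float path_a(float x) { return x * x; }     // stand-ins for two
    __device__ float path_b(float x) { return x + 1.0f; }  // costly code paths

    // Divergent: odd and even threads of the same 32-wide warp take different
    // paths, so every warp executes BOTH paths serially, with half of its
    // lanes masked off each time.
    __global__ void divergent(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) v[i] = path_a(v[i]);
        else            v[i] = path_b(v[i]);
    }

    // Same total work, but the condition is uniform within each warp, so no
    // warp ever executes both paths.
    __global__ void warp_uniform(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0) v[i] = path_a(v[i]);
        else                   v[i] = path_b(v[i]);
    }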
On conventional SIMT implementations (pre-Volta), the programmer also has to be aware of it to avoid deadlocks from atomics across different lanes in the same warp.
From NVIDIA Volta onwards, each SIMT lane has its own instruction pointer, with opportunistic reconvergence when possible.
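The classic example is a per-thread spinlock; a minimal sketch, assuming a single int in global memory used as the lock:

    // Pre-Volta, all lanes of a warp share one program counter, so the lane
    // that wins the lock cannot advance to unlock() while its warp-mates keep
    // spinning in the loop: the warp hangs forever. With Volta's independent
    // per-lane program counters the pattern can make progress (though taking
    // a lock per lane is still a bad idea).
    __device__ void lock(int *mutex)   { while (atomicCAS(mutex, 0, 1) != 0) { } }
    __device__ void unlock(int *mutex) { atomicExch(mutex, 0); }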
Two major examples:
https://en.m.wikipedia.org/wiki/TMS34010
https://en.m.wikipedia.org/wiki/RenderMan_Shading_Language
Shaders also existed in DirectX 8, though not as we know them today, because assembly was the only programming language for them.
Nowadays CUDA does C, C++, Fortran, and Python JIT out of the box, and has partner collaborations for Haskell, Java, Julia, and C#.
From the user side, it is probably simpler to write an algorithm once without vectors, and have a compiler translate it to every vector ISA it supports, rather than to deal with each ISA by hand.
Besides, in many situations, having the algorithm executed sequentially or in parallel is irrelevant to the algorithm itself, so why introduce that concern?
> I'm certainly in the minority opinion here.
There are definitely more userland programmers than compiler/numerical library ones.
was anything said against that?
the comment said SIMT is the same as SPMD
So if you need or want to reason partly in terms of warps, I think the complexity is lower if you reason wholly in terms of warps. You have to use vector types, and that's not wonderful, but in exchange you get predictable control flow in the machine code.
The argument is a bit moot, though, since right now you can't program either vendor's hardware using vectors, so you would also need to jump the barrier to assembly. None of the GPUs are very easy to program in assembly.
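That said, CUDA does expose a limited way to reason explicitly in warps from inside the SIMT model, via the warp-level primitives; a standard reduction sketch (0xffffffff assumes all 32 lanes are active at the call):

    // Sum a value across the 32 lanes of a warp, treating the warp as a vector.
    __device__ float warp_sum(float x) {
        for (int offset = 16; offset > 0; offset /= 2)
            x += __shfl_down_sync(0xffffffffu, x, offset);  // add lane i + offset
        return x;  // lane 0 ends up holding the warp-wide total
    }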
Modern GPGPUs also have more hardware dedicated to this beyond the SIMD/SIMT models. In NVIDIA's CUDA programming model, besides the group of threads that represents a vector operation (a warp), you also have groups of warps (thread blocks) that are assigned to the same processor and can explicitly address a fast shared memory. Each processor has many registers that are automatically mapped to the threads so that each thread has its own dedicated registers. Scheduling is done in hardware at instruction level, so you effectively get single-cycle context switches between warps. Starting with Volta, the hardware can even assemble vectors from threads in different warps of the same thread block, so lanes that are predicated off in a warp don't have to go to waste: they can be taken by threads from other warps.
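As a sketch of how those pieces combine, here is a block-level sum that stages per-warp partials in that fast shared memory (assuming the block size is a multiple of 32 and at most 1024):

    // Warp-wide sum via shuffles (repeated here so the sketch is self-contained).
    __device__ float warp_sum(float x) {
        for (int offset = 16; offset > 0; offset /= 2)
            x += __shfl_down_sync(0xffffffffu, x, offset);
        return x;
    }

    __global__ void block_sum(const float *in, float *out) {
        __shared__ float partial[32];          // one slot per warp in the block
        int lane = threadIdx.x % 32;
        int warp = threadIdx.x / 32;

        float x = warp_sum(in[blockIdx.x * blockDim.x + threadIdx.x]);
        if (lane == 0) partial[warp] = x;      // each warp publishes its partial
        __syncthreads();                       // all warps in the block meet here

        if (warp == 0) {                       // the first warp reduces the partials
            x = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
            x = warp_sum(x);
            if (lane == 0) out[blockIdx.x] = x;
        }
    }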
There are many other hardware additions that make this programming model very efficient. Similar to how C and x86 each provide abstractions over the actual micro-ops being executed, hiding complexity like pipelining, out-of-order execution, and speculative execution, CUDA and the PTX ISA provide abstractions over complex hardware implementations that specifically benefit this kind of SIMT paradigm.