[ACCEPTED]-help me understand cuda-parallel-processing

Accepted answer
Score: 69

You should check out the webinars on the NVIDIA website; you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars. They will really help, as you can see the diagrams and have them explained at the same time.

When you execute a function (a kernel) on a GPU, it is executed as a grid of blocks of threads.

  • A thread is the finest granularity. Each thread has a unique identifier within the block (threadIdx), which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory, which is used for register file spilling and any large automatic variables.
  • A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
  • A grid is a set of blocks which together execute the GPU operation.
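To make the hierarchy concrete, here is a minimal sketch of how `blockIdx` and `threadIdx` are usually combined to pick the element a thread works on. The kernel name `scaleKernel`, the host helper `scaleOnGpu`, and the block size of 128 are assumptions made for illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread scales one element. The global index combines the block's
// position in the grid (blockIdx) with the thread's position in its block
// (threadIdx); blockDim is the number of threads per block.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // the last block may be only partly used
        data[i] *= factor;
    }
}

// Hypothetical host helper: copies the input to the GPU, launches a 1D grid
// with enough blocks to cover every element, and copies the result back.
std::vector<float> scaleOnGpu(std::vector<float> host, float factor)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    size_t bytes = n * sizeof(float);
    float* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;                                   // assumed block size
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(dev, factor, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `scaleOnGpu(values, 2.0f)` on 1000 elements would launch 8 blocks of 128 threads; the `i < n` guard keeps the spare threads in the last block from writing out of bounds.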

That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you need to understand the hardware too, which is made up of SMs (streaming multiprocessors) and SPs (the processor cores inside each SM).

A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but the actual number is not a major concern until you're getting really advanced.
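If you want to see the numbers for your own card, the runtime can report them. This is a rough sketch using `cudaGetDeviceProperties`; the helper name `printDeviceSummary` is made up for this example. As far as I know the number of SPs per SM is not reported as a field, since it depends on the GPU generation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints a few hardware limits for the given device and
// returns its SM count, or -1 if the device could not be queried.
int printDeviceSummary(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    std::printf("Device:                 %s\n", prop.name);
    std::printf("SMs:                    %d\n", prop.multiProcessorCount);
    std::printf("Warp size:              %d\n", prop.warpSize);
    std::printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    std::printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Shared memory / block:  %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Registers / block:      %d\n", prop.regsPerBlock);
    return prop.multiProcessorCount;
}
```

`printDeviceSummary(0)` reports on the first GPU in the system.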

The first point to consider for performance is that of warps. A warp is a set of 32 threads: if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being:

  • Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else' side, then all 32 threads actually go down both sides. Functionally there is no problem, since the threads that should not have taken a branch are disabled and you always get the correct result, but if both sides are long then the performance penalty is significant.
  • Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
  • Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
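The memory points in the last two bullets can be sketched with a tiled matrix transpose, which is a common way to illustrate them (branch divergence is not shown here). Assumptions to flag: the names `transposeTiled` and `transposeOnGpu` are hypothetical, the 32x32 block needs a GPU that allows 1024 threads per block (more than the 512 mentioned for the hardware of the time), the padding trick is written for 32 shared memory banks, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32  // tile edge equals the warp size, so one warp handles one tile row

// Transposes a height x width row-major matrix into a width x height one.
__global__ void transposeTiled(const float* in, float* out, int width, int height)
{
    // The extra column shifts each row by one bank, so reading a tile column
    // touches 32 different banks instead of the same bank 32 times.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        // Neighbouring threads in a warp read neighbouring addresses (coalesced).
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }

    __syncthreads();  // every thread in the block reaches this, tile is complete after it

    // Swap the block offsets so the write is also along a row of the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        // Neighbouring threads write neighbouring addresses (coalesced) and
        // read a column of the padded tile from shared memory.
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}

// Hypothetical host helper around the kernel.
std::vector<float> transposeOnGpu(const std::vector<float>& in, int width, int height)
{
    std::vector<float> result(in.size());
    if (in.empty()) return result;

    size_t bytes = in.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, width, height);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

A naive transpose that writes `out[x * height + y]` directly would have each warp write addresses `height` floats apart, which is the scattered access pattern the second bullet warns about. Dropping the `+ 1` padding keeps the result correct but makes all 32 threads of a warp hit the same bank on the shared memory read.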

So, having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.

Each block will start on one SM and will remain there until it has completed. As soon as it has completed, it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives GPUs their scalability: if you have one SM, then all blocks run on the same SM in one big queue; if you have 30 SMs, then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function, your grid is composed of a large number of blocks (at least hundreds) so that it scales across any GPU.
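If you are curious where your blocks actually land, one diagnostic trick is to have each block record the PTX special register `%smid`. Treat this as a sketch for experimenting only: the helper names `currentSm`, `recordSm`, and `smPerBlock` are made up here, the SM numbering is not guaranteed to be contiguous, and the PTX documentation describes the value as volatile, so it should not be used for program logic.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reads the PTX special register %smid, which identifies the SM the calling
// thread is running on at the moment of the read.
__device__ unsigned int currentSm()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records which SM the block was scheduled on.
__global__ void recordSm(unsigned int* smOfBlock)
{
    if (threadIdx.x == 0) {
        smOfBlock[blockIdx.x] = currentSm();
    }
}

// Hypothetical host helper: launches numBlocks blocks and returns the SM id
// recorded by each one. Entries still equal to 0xFFFFFFFF were never written.
std::vector<unsigned int> smPerBlock(int numBlocks)
{
    std::vector<unsigned int> result(numBlocks);
    if (numBlocks <= 0) return result;

    size_t bytes = numBlocks * sizeof(unsigned int);
    unsigned int* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0xFF, bytes);          // sentinel so unwritten slots are visible

    recordSm<<<numBlocks, 64>>>(dev);      // 64 threads per block, an arbitrary choice

    cudaMemcpy(result.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return result;
}
```

With a few hundred blocks you would expect the returned ids to cover all the SMs on a multi-SM GPU, with each id appearing many times, while a grid of only two or three blocks leaves most SMs idle.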

The final point to make is that an SM can execute more than one block at any given time. This explains why an SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory), then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
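More recent CUDA toolkits can also answer the "how many blocks fit on one SM" question from code, through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. This is not the spreadsheet mentioned above, just a sketch of the same idea; `sampleKernel` and `residentBlocksPerSm` are placeholder names, and the result depends on the kernel's own register and shared memory usage.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the occupancy result depends on the resources that the
// compiled kernel actually uses, so query the kernel you intend to launch.
__global__ void sampleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Hypothetical helper: how many blocks of the given size can be resident on
// one SM at the same time for sampleKernel. Returns -1 if the query fails.
int residentBlocksPerSm(int blockSize)
{
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        sampleKernel,
        blockSize,
        0);              // bytes of dynamic shared memory per block (none here)
    return (err == cudaSuccess) ? numBlocks : -1;
}
```

Multiplying `residentBlocksPerSm(128)` by 128 gives the number of threads resident on one SM for that configuration, which you can compare against `maxThreadsPerMultiProcessor` from the device properties.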

Sorry for the brain dump. Watch the webinars; it'll be easier!

Score: 3

It's a little confusing at first, but it helps to know that each SP does something like 4-way SMT: it cycles through 4 threads, issuing one instruction per clock, with a 4-cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs (8 SPs x 4 threads each = 32).

Rather than go through all the rest of the stuff with warps, blocks, threads, etc., I'll refer you to the NVIDIA CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.
