(301) 571-4564

Looks like this became a small series about GPU compute, and this week I'm going to wrap it up by discussing execution units and caches. I highly recommend reading (616) 536-3042 and 205-458-6664 first, as I'll be referencing those regularly.

Caches and execution units, where should we start? Let's take the execution units first.

Issue ports

If you're familiar with modern CPU design, you know that a CPU is not a scalar thing processing one instruction at a time. A modern CPU architecture like (541) 508-0775 has 10 issue ports, split between integer and floating point:

Zen pipeline diagram showing 10 execution units
Zen has ten execution units across the floating point and integer blocks.

GPUs also take advantage of multiple issue ports, but not in the same way as CPUs. On a CPU, instructions get executed out-of-order, and some of them speculatively at that. This is not feasible on a GPU. The whole out-of-order machinery with register renaming requires even more registers, and GPUs already have tons of registers (a Vega10 GPU has for instance a whopping 16 MiB of registers on the die.) Not to mention that speculative execution increases power usage and that's already a heavily limiting factor for GPUs running very wide workloads. Finally, GPU programs don't look like CPU programs to start with -- out-of-order, speculation, prefetching and more is all great if you're executing GCC, but not for ye old pixel shader.

That said, it totally makes sense to issue memory requests while the SIMD is busy working on data, or execute scalar instructions concurrently. So how can we get those advantages without going out-of-order? The advantage we have on a GPU over a CPU is that there is a ton of work in flight. Just as we take advantage for this for hiding memory latency, we can also exploit this for scheduling. On a CPU, we look ahead on a single instruction stream and try to find independent instructions. On a GPU, we already have tons of independent instruction streams. The easiest way to get instruction-level parallelism is to simply have different units for different instruction types, and issue accordingly. Turns out, that's exactly how GCN is set up, with a few execution ports per CU:

GCN pipeline diagram showing 5 execution units
GCN has multiple execution ports per CU -- scalar ALU/scalar memory, Branch/Message, vector ALU, LDS, Export/GDS and vector memory.

In total, there are six distinct execution ports, and the dispatcher can send one instruction to up to five of them per cycle. There are some special-case instructions which are handled in the dispatcher directly (like no-ops -- there's no use in sending them to a unit.) At each clock cycle, the dispatcher looks at the active waves, and the next instruction that is ready. It will then send it to the next available unit. For instance, let's assume we have code like this executing:

v_add_f32 r0, r1, r2
s_cmp_eq_i32 s1, s2

If there are two waves ready, the dispatcher will issue the first v_add to the first SIMD. In the next cycle, it will issue the s_cmp from the first wave, and the v_add from the second wave. This way the scalar instruction overlaps with the execution of the vector instruction, and we get instruction level parallelism without any lookahead or expensive out-of-order machinery.

Let's look at a more complete example, with multiple wavefronts executing a decent mix of scalar, vector, and memory instructions:

GCN compute unit processing a variety of instructions.
The top part shows how four wavefronts are scheduled (left to right are clock cycles.) In the first cycle, three independent instructions get issued to three units. Keep in mind that VALU runs for four cycles. The bottom part shows how much utilization the units see. As VALU instructions run for four cycles, all four SIMD units get used rather quickly; a good instruction mix ensures that all units are kept busy.

One last thing before we wrap this up is handling of loads and stores. On a CPU, it's all transparent, you can write a sequence like this:

mov rcx,QWORD PTR [rsp+0x8]
add rdx, rcx

This will just work, because the CPU "knows" that the load needs to finish before the operation can start by tracking this information. On a GPU, tracking which register is written by a load would require a lot of extra hardware. The solution the GPU folks came up with is moving this problem "one level up", into the shader compiler. The compiler has the required knowledge, and inserts the waits for loads manually. In GCN ISA, a special instruction -- s_waitcnt -- is used to wait until a certain number of loads has finished. It's not just waiting for everything, as this allows piping in multiple loads simultaneously and then consuming them one-by-one. The corresponding GCN ISA would look somewhat like this:

s_buffer_load_dword s0, s[12:15], 0x0   ; load a single dword
s_waitcnt     lgkmcnt(0)                ; wait for the previous
                                        ; load to finish
v_add r0, s0, r1                        ; consume the loaded data

I think a good idea is to think of a (GCN) GPU as a CPU, running four threads per core (compute unit), and each thread can call scalar, vector and other instructions. It's in-order, and the designers made a trade-off between hardware and software complexity. Instead of requiring expensive hardware, a GPU requires massively parallel software -- not just to hide latency, but also to take advantage of all execution units. Instead of "automatic" tracking, it requires the compiler to insert extra operations, and requires the application to provide enough parallelism to fully utilize it, but at the same time, it provides massive throughput and tons of execution units. It's a nice example how the properties of the code you're executing can influence the design of the hardware.


Speaking of software influencing hardware, we should also talk about caches. GPU caches are yet another -- and in this series, the final -- example of how a GPU was built for massively parallel workloads, and what trade offs the designer have taken compared to a CPU. We'll also realize that CPUs are actually following the same path as GPUs along the way!


Let's start by looking at how a modern CPU looks like, for example, a 32-core server CPU built on the Zen architecture:

4 Zen dies, containing 2 CCX comprised of one L3 cache, 4 L1/L2/Instruction caches, and four cores each.
A modern server CPU built on top of the Zen architecture. The package contains four dice, each containing two core complexes. Each core complex has 8 MiB of L3, 4× L2 caches of 512 KiB each, 4× L1 caches of 32 KiB each, and 4× instruction caches of 64 KiB each. That's a lot of cache -- the whole package packs 64 MiB of L3 alone!

That's a huge CPU right there, and interestingly, the topology does matter for high performance code. Two cores sharing the same L3 can obviously exchange data right there, while going into another L3 already requires some travelling -- not to mention moving across dice. That's simply the nature of the beast, as larger and larger chips are also becoming larger in terms of die area, and travelling far distances becomes increasingly expensive in terms of power usage and latency. There's no silver bullet around that -- it's physics at work -- and except for making every core pay the worst case latency for all others, there will be always some in closer proximity.

What does this mean for the application developer? It means that the application must try to keep work "close together" -- usually, the OS scheduler will take care of this. By default, all caches are coherent with each other, which means that if the core in the top-left corner writes something to memory, the core in the bottom right can see this. Various protocols have been designed to handle this; the gist of it is that any core can write some memory, and any other core will see the new data by default. No extra work required by the application -- but you can imagine already that sharing data between cores will force a lot of cross-core communication.


Now, GPUs are much more parallel than CPUs. A Vega10 GPU is practically speaking a 64-core CPU, with a similar cache hierarchy. Let's take a look:

A Vega10 GPU, consisting of 4 shader engines, each containing 16 compute units. Four compute units share a constant and instruction cache. Each compute unit has a L1 cache. All shader engines share a single L2 cache.
A Vega10 GPU, containing 4 shader engines. Each shader engine has 16 compute units. A compute unit has 16 KiB of L1 cache, and four compute units share a 32 KiB instruction cache and a 16 KiB of scalar (constant) cache. All shader engines are attached to a 4 MiB L2 cache.

The sizes are completely off, but it's still vaguely similar. If you squint enough, you could think it's a 64-core CPU on a single die. Obviously, the devil is hiding in the details again, because where CPUs have coherency by default, GPUs are running in a very different mode. The programming model is designed for many independent tasks, so let's think for a moment how this would impact our cache design. Given every work item is independent, we can assume that each core works on its own data, and there is little to no sharing across cores. How could we optimize our hardware for this? First of all, we'll get rid of coherency by default. If core A writes something, and wants core B to see it, it's now the developer's responsibility to do. The assumption is that this is rarely necessary, as it requires two syncs -- one is memory (we need to flush the data out of the cache somehow), and the second one, it requires that the second core is actually processing the same data. As we learned, execution order is something GPUs are not usually happy to provide, and that directly impacts the cache handling. Given a developer has to synchronize already, they can do the memory barriers at the same time.

The other part is that the caches don't manage themselves. On a CPU, things tend to just work within a single process, but on a GPU, flushing and invalidating caches is a very explicit operation. If you finish one compute shader and you want to start the next one, the GPU will typically insert a drain to make sure that all work finished, and also flush and invalidate all compute-unit local caches to make sure the next dispatch sees the data. This makes it critical to keep the data in L2 -- flushing the L1 caches happens a lot, but because they're small, it's cheap. Compare this with a CPU where the L2 of a single core is already half the size of all L1 caches of a GPU combined!

The other interesting bit are the shared caches. In the CPU case, the only shared data is in L3. In the GPU case, where we expect a single program to execute for many compute units, the instruction cache is shared among compute units. This implies that ideally, we want to send the same program to groups of four compute units and not totally random across the GPU. Similarly, we assume that the same constants get loaded across all waves executing the same program, which results in separate constant/scalar caches. These are practically read-only (except for atomics instructions), which means that they don't need to get flushed (no data changed), but they still require invalidation between dispatches.

You might wonder, with this default setup, how do I get coherency across caches? Surely there must be a way, as GLSL for instance has the coherent modifier. Glad you've asked -- and the solution to this is rather simple. All compute units share the same L2, so if we want to ensure coherence, we can just bypass L1. If you look at the odorometer, there's a GLC bit which says: "Force bypass of L1 cache". By writing through to L2, and always reading from L2, we can get the impression of coherent caches, without any coherency protocol. All at the expense of basically ignoring the (tiny) L1 -- again, a trade-off which makes sense for GPUs.

Finally, let's talk once more about the sizes. Compared to a CPU, the caches are tiny, so why are they even there? On a CPU, caches are all about reuse, and as you can keep next to no data around in registers, you need large caches. The other side of the story is that CPU code tends to read memory all over the place, but typically it doesn't read large chunks of nearby data. Think databases -- the chance that you're going to read a few entries which are next to each other is rather low.

GPUs on the other hand have to solve a different problem. Tons of threads in flight, all of them wanting to either stream through data, or access data in a spatially coherent fashion (think textures). For that use case, what you really want is a cache which helps you combine reads/writes and keeps data just around long enough that you can move it into registers. For example, you're loading a 4-component vector component by component. Ideally, you want to have the four components "cached" until your load finishes. A tiny cache is perfect for that -- it keeps the cache line around until it's consumed, and the chance you're going to hit it again is super small anyway as your threads are processing tons of (independent) data in general. For this series, this is the last example of how the applications and the expected usage have shaped the GPU to be very different from a CPU.


That's it, folks! I hope you enjoyed the series -- and noticed that CPUs and GPUs are both multi-core processors, but specifically designed for different use cases and tuned for those. The other interesting bit is how the programming model influenced the hardware design and vice versa -- and how we're on some path to convergence. Modern GPU code tends to run fine on modern CPUs; it's already well designed to take advantage of many cores, can handle non-uniform memory access, and can easily cope with little cache coherency guarantees. Where are we heading? I don't know, but for sure, knowledge about compute shaders and GPU execution models will help prepare you for whatever future is ahead of us!

More compute shaders

Last week I've covered compute shaders, and I've been asked to go a bit deeper on the hardware side to cover subgroups and more. But before we get there, let's recap briefly what a compute unit looks like and what is happening in it.

In the last post, I explained that the hardware is optimized for many items executing the same program. This resulted in the usage of very wide SIMD units (in the case of AMD's GCN, 16-wide), hiding memory latency through task switching, and a branching model which relies on masking. I didn't get into too much detail about those though, which we'll do now, and after that, we'll find out what to do with our new-found knowledge!

SIMD execution

When it comes to executing your code, a GCN compute unit has two main building blocks you care about: Several SIMD units and a scalar unit. The SIMD units are each 16 wide, meaning they process 16 elements at a time. However, they don't have a latency of one -- they don't finish executing an instruction in a single clock cycle. Instead, it takes four cycles to process an instruction from start to end (some take longer, but let's pretend all take four cycles for now.) Four cycles is about the speed you'd expect for something like a fused-multiply-add, which needs to fetch three operands from the register file, do a multiply-and-add, and write the result back (highly optimized CPU designs also take four cycles, as can be seen in 5203954081.)

Latency doesn't equal throughput, though. An instruction with a latency of four can still have a throughput of one, if it's properly pipelined. Let's look at an example:

Comparison of pipelined and non-pipelined execution.
A non-pipelined unit compared to a pipelined execution unit. Latency is four cycles in both cases, but once filled, the pipelined unit has a throughput of one instruction per cycle.

Which means that if we have sufficient work to do for a single SIMD, we can get 16 FMA instructions executed per clock. If we could issue a different instructions in every cycle, we'd have to deal with another problem though -- some of the results may not be ready. Imagine our architecture would have a four-cycle latency on all instructions, and a single-cycle dispatch (meaning every cycle we can throw a new instruction into the pipeline.) Now, we want to execute this imaginary code:

v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2

We have a dependency between the first and the second instruction -- the second instruction cannot start before the first has finished, as it needs to wait for r0 to be ready. This means we'd have to stall for three cycles before we issue another instruction. The GCN architects solved this by issuing one instruction to a SIMD every four cycles. Additionally, instead of executing one operation on 16 elements and then switching to the next instruction, GCN runs the same instruction four times on a total of 64 elements. The only real change this requires to the hardware is to make the registers wider than the SIMD unit. Now, you never have to wait, as by the time the v_mul_f32 starts on the first 16 elements, the v_add_f32 just finished them:

Instruction issue for a single SIMD on GCN.
Instruction issue on a single GCN SIMD. One instruction is issued every four clock cycles, and instructions can start immediately one after the other.

You'll immediately notice those wait cycles, and clearly that's no good to have a unit spend most of its time waiting. To fill those up, the GCN designers use four SIMD units, so the real picture on the hardware is as following:

Instruction issue for a single CU on GCN.
By duplicating the SIMD four times, one instruction can be issued every clock cycle, bringing the total throughput to 64 elements/clock (4 SIMDs à 16 elements/SIMD/clock).

This 64-wide construct is called a "wavefront" or "wave", and it's the smallest unit of execution. A wave can get scheduled onto a SIMD to execute, and each thread group consists of at least one wave.

Scalar code

Phew, that was quite a bit, and we're unfortunately still not done. So far, we pretended everything is executed on the SIMD, but remember when I wrote there are two blocks related to execution? High time we get to this other ... thing.

If you program anything remotely complex, you'll notice that there are really two kinds of variables: Uniform values, which are constant across all elements, and non-uniform values, which differ per lane.

A non-uniform variable would be for instance the laneId. We'll pretend there's a special register we can read -- g_laneId, and then we want to execute the following code:

if (laneId & 1) {
    result += 2;
} else {
    result *= 2;

In this example, we're not going to talk about conditional moves, so this needs to get compiled to a branch. How would this look like on a GPU? As we learned, there is something called the execution mask which controls what lanes are active (also known as divergent control flow.) With that knowledge, this code would probably compile to something like this:

v_and_u32 r0, g_laneId, 1   ; r0 = laneId & 1
v_cmp_eq_u32 exec, r0, 1    ; exec[lane] = r0[lane] == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

The exact instructions don't matter, what matters is that all the values we're looking at that are not literals are per-lane values. I.e. g_laneId has a different value, as has r1, for every single lane. This is a "non-uniform" value, and the default case, as each lane has its own slot in a vector register.

Now, if the control flow was looking like this, with cb coming from a constant buffer:

if (cb == 1) {
    result += 2
} else {
    result *= 2;

Turning this into this straightforward code:

v_cmp_eq_u32 exec, cb, 1    ; exec[i] = cb == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

This has suddenly a problem the previous code didn't have -- cb is constant for all lanes, yet we pretend it's not. As cb is an uniform value, the comparison against 1 could be computed once instead of per lane. That's how you'd do it on a CPU, where vector instructions are a new addition. You'd probably do a normal conditional jump (again, ignore conditional moves for now), and call a vector instruction in each branch. Turns out, GCN has the same concept of a "non-vector" execution, which is aptly named "scalar" as it operates on a single scalar instead of a vector. In GCN assembly, the code could compile to:

s_cmp_eq_u32 cb, 1          ; scc = (cb == 1)
s_cbranch_scc0 else         ; jump to else if scc == 0
v_add_f32 r1, r1, 2         ; r1 += 2
s_branch end                ; jump to end
v_mul_f32 r1, r1, 2         ; r1 *= 2

What does this buy us? The big advantage is that those scalar units and register are super cheap compared to vector units. Whereas a vector register is 64x32 bit in size, a scalar register is just 32 bit, so we can throw many more scalar registers on the chip than vector registers (some hardware has special predicate registers for the same reason, one bit per lane is much less storage than a full-blown vector register.) We can also add exotic bit-manipulation instructions to the scalar unit as we don't have to instantiate it 64 times per CU. Finally, we use less power as the scalar unit has less data to move and work on.

Putting it all together

Now, being hardware experts, let's see how we can put our knowledge to good use finally. We'll start with wavefront wide instructions, which are a hot topic for GPU programmers. Wavefront wide means we do something for all lanes, instead of doing it per lane -- what could that possibly be?

Scalar optimizations

The first thing we might want to try is to play around with that execution mask. Every hardware has this present in some form, be it explicit or as a predication mask. With that, we can do some neat optimization. Let's assume we have the following code:

if (distanceToCamera < 10) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();

Looks innocent enough, but both function calls sample memory and are thus rather expensive. As we learned, if we have divergent control flow, the GPU will execute both paths. Even worse, the compiler will likely compile this to following pseudo-code:

VectorRegisters<0..32> allLayerCache = loadAllTerrainLayers();
VectorRegisters<32..40> simpleCache = loadPreblendedTerrainLayers();
if (distanceToCamera < 10) {
    blend (allLayerCache);
} else {
    blend (sampleCache);

Basically, it will try to front-load the memory access as much as possible, so by the time we reach the else path, there's a decent chance the loads has finished. However, we -- as developers -- know that the all-layer variant is higher quality, so how about this approach: If any lane goes down the high-quality path, we send all lanes to the high-quality path. We'll have slightly higher quality overall, and on top of that, we get two optimizations in return:

  • The compiler can use fewer registers by not front-loading both
  • The compiler can use scalar branches

There's a bunch of functions for this, all of which operate on the execution mask (or predicate registers, from here on I'll pretend it's the execution mask.) The three functions you'll hear of are:

  • ballot () -- returns the exec mask
  • any () -- returns exec != 0
  • all() -- returns ~exec == 0

Changing our code to take advantage of this is trivial:

if (any (distanceToCamera < 10)) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();

Another common optimization applies to atomics. If we want to increment a global atomic by one per lane, we could do it like this:

atomic<int> globalAtomic;
if (someDynamicCondition) {

This will require up to 64 atomic increments on GCN (as GCN does not coalesce them across a full wavefront.) That's quite expensive, we can do much better by translating this into:

atomic<int> globalAtomic;
var ballotResult = ballot (someDynamicCondition);
if (laneId == 0) {
    globalAtomic += popcount (ballotResult);

Where popcount counts the number of set bits. This cuts down the number of atomics by 64. In reality, you probably still want to have a per-lane value if you're doing a compaction, and it turns out that case is so common that GCN has a separate opcode for it (v_mbcnt) which is used automatically by the compiler when doing atomics.

Finally, one more on the scalar unit. Let's assume we have a vertex shader which passes through some drawId, and the pixel shader gets it as a normal interpolant. In this case (barring cross-stage optimization and vertex-input-layout optimization), code like this will cause problems:

var materialProperties = materials [drawId];

As the compiler does not know that drawId is uniform, it will assume it can be non-uniform, and thus perform a vector load into vector registers. If we do know it's uniform -- dynamically uniform is the specific term here -- we can tell the compiler about this. GCN has a special instruction for this which somewhat become the "standard" way to express it -- v_readfirstlane. Read-first-lane reads the first active lane and broadcasts its value to all other lanes. In an architecture with separate scalar registers, this means that the value can be loaded into a scalar register. The optimal code would be thus:

var materialProperties = materials [readFirstLane (drawId)];

Now, the materialProperties are stored in scalar registers. This reduces the vector register pressure, and will also turn branches that reference the properties into scalar branches.

Vector fun

So much for the scalar unit, let's turn to the vector unit, because things get really funny here. Turns out, pixel shaders have a huge impact on compute, because they force the hardware to do something really funky -- make lanes talk to each other. Everything we learned so far about GPUs said you can't talk across lanes except through LDS, or by broadcasting something through scalar registers (or read a single lane.) Turns out, pixel shaders have a very unique requirement -- they require derivatives. GPUs implement derivatives by using quads, i.e. 2×2 pixels, and exchange data between them in a dynamically varying way. Mind blown?

Pixel shaders access neighboring lanes during ddx, ddy instructions.
Pixel shaders can access neighboring lanes through ddx(), ddy() instructions. Each lane processes one pixel, and within 4 lanes, a lot of exchange is required to make derivatives work. On the right side, we can see the initial packing, and how the derivatives swap data between lanes within four lanes.

This is commonly referred to as quad swizzle, and you'll be hard pressed to find a GPU which doesn't do this. However, most GPUs go much further, and provide more than just simple swizzling across four lanes. GCN takes it really far since the introduction of DPP -- data parallel primitives. DPP goes beyond swizzling and provides cross-lane operand sourcing. Instead of just permuting lanes within a lane, it actually allows one lane to use another lane as the input for an instruction, so you can express something like this:

v_add_f32 r1, r0, r0 row_shr:1

What does it do? It takes the current value of r0 on this lane and the one to the right on the same SIMD (row-shift-right being set to one), adds them together, and stores it in the current lane. This is some really serious functionality which introduces new wait states, and also has various limitations as to which lanes you can broadcast to. All of this requires intimate knowledge of the implementation, and as vendors differ in how they can exchange data between lanes, the high level languages expose general wave-wide reductions like min etc. which will either swizzle or use things like DPP to get to one value. With those, you can reduce values across a wavefront in very few steps, without access to memory -- it's faster and still easy to use; what's not to like here!


I hope I could shed some light on how things really work this time around. There's really not much left that hasn't been covered so far, for GCN, I can only think of separate execution ports and how waits are handled, so maybe we'll cover that in a later post? Other than that, thanks for reading, and if you have questions, don't hesitate to get in touch with me!


A couple of months ago I went to the Munich GameCamp -- a bar camp where anyone can propose a talk, and then a quick vote is cast which talks get accepted. I've been asking around for some ideas, and one developer mentioned I might want to cover compute shaders. So I went in, without much hope to attract many people, but I ended up with an overcrowded room with roughly one quarter of all the game camp participants, rambling about compute shaders for roughly an hour. Afterwards, the main question I got was: "Where can I read about this?", and I couldn't quite nail a good introduction to compute shaders (there's 782-435-2547, but that's already quite past an introduction.)

The hardware

To understand how compute shaders happened, we have to take a look at the hardware evolution. Back in the old days, before shaders, we had geometry processing and texturing separated. For instance, a Voodoo² card had one rasterizer and two texturing units on the card, splitting the workload between them. This theme continued for a long time, even after shaders were introduced. Up until the GeForce 7 and the Radeon X1950, GPUs had separate vertex and pixel shader units. Those units had usually similar capabilities in terms of what they could compute (after all, additions and multiplications are the bulk of the work on a GPU), but memory access differed a lot. For instance, accessing textures was something vertex shaders couldn't do for a long time. At that time, the split made sense as scenes consisted of few polygons, covering many pixels, so having less vertex shading power typically didn't result in a bottleneck. By removing functionality from the vertex shaders, they could be better optimized and thus run faster.

A basic vertex & pixel pipeline.
A basic GPU pipeline with separate vertex and pixel units. Vertex units write shaded vertices into an attribute buffer, the pixel units consume it and also access memory through texture units.

However, a fixed distribution also makes it impossible to load-balance resources. As games evolved, sometimes more vertex power was required -- for instance, when rendering dense geometry like trees -- while other games were exploiting the new shader programming models to write more complex pixel shaders. This was also the time at which GPGPU programming starting, that is, programmers were trying to solve numerical programs on GPUs. Eventually, it became clear that the fixed split was not good enough.

In late 2006 -- early 2007, the age of "unified shaders" began with the release of the GeForce 8800 GTX and the Radeon HD 2800 (technically speaking, the XBox 360 was first with (581) 339-8276.) Gone were the days of separate units; instead, the shader core could process any kind of workload.

A basic, unified vertex & pixel pipeline.
An unified pixel & vertex pipeline. The units can all talk to memory through the texture units, and pass data between them through an attribute buffer.

So what happened back then? The ALUs -- the units executing the math instructions -- were already similar. What changed was that the distinction between the units was removed completely. Vertex shaders were now executing on the same units as pixel shaders, making it possible to balance the workload between vertex and pixel shader work. One thing you'll notice here is that vertex and pixel shaders need to communicate somehow -- a decent chunk of memory is needed to store attributes for interpolation, like vertex color. A vertex shader will compute those per vertex and move on to the next vertex, but a pixel shader might have to hang on this information for a while until all pixels are processed. We'll come to this memory later, let's remember if for now.

A basic compute pipeline.
The unified pipeline, shown as a compute pipeline. The attribute buffer becomes the local or shared memory, the texture units become the gateway to global memory, and the pixel and vertex units are now general compute units.

This model -- some ALUs bundled together, with some extra memory to allow communicating between sets of them -- was exposed at the same time as a "compute shader". In reality there are a few more details to this, for instance, a compute unit usually does not process a single element, but multiple of those, and there are quite a few more caches involved to make all of this efficient. Below is the diagram for a single compute unit in the AMD GCN architecture -- note that what I previously called "compute unit" is now a "SIMD", more on this in a moment.

An advanced compute pipeline.
An AMD GCN class compute unit, consisting of four 16-wide SIMD units, local memory, and various caches. Blocks with a dashed outline are shared between more than one compute unit.

Note that other GPUs look very similar, I'm simply using GCN here as I'm most familiar with it. On NVIDIA hardware, a GCN compute unit maps to a "streaming multiprocessor", or SM. The SIMDs have different width, and the caches will look a bit different, but the basic compute model is still exactly the same.

Back to the the SIMD unit -- GPUs have been always optimized to process many things concurrently. Where you have one pixel, you have many, and all of them run the same shader, so this is how the hardware was designed. In the case of GCN, the designers built four, 16-wide SIMDs into each compute unit. A SIMD (SIMD is short for single instruction, multiple data) unit can execute one operation across 16 elements at once. Not in one clock cycle -- there's some latency -- but for GCN, that latency is four cycles, so by having 4 SIMD, and pretending a SIMD is not 16, but 64 wide, the machine behaves as-if it executes 64 instructions per cycle. Confused? Don't worry, this is just an implementation detail, the point here is that every GPU executes some set of instructions together, be it 64 in the GCN case or 32 on current NVIDIA architectures.

What we ended up with is a GPU consisting of many compute units, and each compute unit containing:

  • A bunch of SIMD processors, which execute the instructions.
  • Some local memory inside each compute unit which can be used to communicate between shader stages.
  • Each SIMD unit executes one instruction across many items.

With those prerequisites, let's have a look at how we might expose them when not doing pixel and vertex shader work. Let's get going!

The kernel

The goal is to write a programming system which allows us to take advantage of the hardware at hand. Obviously, the first thing we notice is that cross-compute-unit communication is something we want to avoid. All of the compute units are clients of the L2 cache, but clearly the L1 caches can go out of sync, so if we distribute work, we should pretend we can't talk across compute units. This means that we somehow have to split our work into smaller units. We could go just pretend there are no compute units and dispatch individual items, but then we're missing out on the local memory, so it seems that we want one level below "here's all work".

We could go directly at the SIMD level, but that's not optimal as there is a capability we haven't discussed yet within a compute unit. As we have some kind of local memory, it's natural to assume we can use that to synchronize the SIMD units (given we could just write something to local memory and wait for it to show up). This also means we can only synchronize within a single compute unit, so let's use that as the next grouping level -- we'll call the work that gets put onto a single compute unit a work group. We thus start with a work domain, which is then split into work groups. Each work group runs independently of the others. Given the developer might not know how many groups can run concurrently, we enforce that work groups can't depend on the fact that:

  • Another work group from the same domain is running
  • Work groups inside a domain execute in any particular order

That allows us to run one work group after the other on a single unit, or all of them at once. Now, inside a work group, we still need many independent items of work, so we can fill all of our SIMD units. Let's call the individual item of work a work item, and our programming model will launch work groups containing many work items -- at least enough to fill up one compute unit (we probably want more than just enough to fill it, but more on this later.)

Let's put all of this together: Our domain consists of work groups, each work group gets mapped onto one compute unit, and the SIMD units process the work items.

A basic programming model
Here is how our programming model maps onto the hardware so far. A domain consists of work groups, those get assigned to compute units. Each workgroup is comprised of multiple work items.

One thing we forgot to mention so far is the fact that there's apparently one more level which we didn't cover. I mentioned a SIMD unit executes multiple work items together, yet we haven't exposed this in our programming model, we only have "independent" work items there. Let's give the work items executing together a name -- we'll call them a subgroup (on AMD, they're typically referred to as a "wavefront" or "wave", while NVIDIA calls them "warp".) Now, we can expose another set of instructions, which perform operations across all items in a SIMD. For instance, we could compute the minimum of all values without having to go through local memory.

I mentioned before that there are only very few constraints on ordering. For instance, a GPU may want to execute a whole domain on a single compute unit. One of the reasons for doing this is memory latency. Those wide SIMD units are great when it comes to number crunching, but it also means that memory access is relatively speaking super slow. In fact, it's horribly slow, because a GCN CU can process 64 float multiply-add instructions per cycle, which is 64×3×4 bytes of input data, and 64×4 bytes of out. Across a large chip like a Vega10, that's 48 KiB worth of data read in a single cycle -- at 1.5 GHz, that's 67 TiB of data you'd have to read in. GPU memory systems have been thus optimized for throughput, at the expense of latency, and that impacts how we program them.

Before we head into the effects of this on the programming model, let's summarize how compute shaders look like:

  • Work is defined through a 3-level hierarchy:
    • A domain specifies all work
    • The domain is subdivided into work groups, which are executed independently, but allow for communication inside a group
    • Work items are the individual elements of work
  • Some APIs also expose an intermediate level -- the subgroup, which allows some optimizations below the work group, but above the work item level.
  • Work groups can synchronize internally and exchange data through local memory.

This programming model is universal across all GPU compute shader APIs. Differences occur in terms of what guarantees are provided, for instance, an API might limit the number of work groups in flight to allow you to synchronize between all of them; or some hardware might guarantee the execution order so you can pass information from one workgroup to the next.

Latency hiding & how to write code

Now that we understand how the hardware and the programming model looks like, how do we actually program this efficiently? Until now, it sounds like we're writing normal code, and all that happens is that we process a work-item in it, potentially communicating between our neighbors on the same SIMD, others through local memory, and finally read and write through global memory. The problem is that "normal" code is not good, as we need to hide latency.

The way a GPU hides latency is by having more work in flight. A lot more work -- taking GCN again as the example, every compute unit can have up to 40 subgroups in flight. Every time a subgroup accesses memory, another one gets scheduled in. Given only four can execute at the same time, this means we can switch up to 10 times before we return to the subgroup which started the original request.

Latency hiding
The first subgroup hits a memory instruction and stalls. The compute unit immediately schedules the next subgroup, which also stalls. After the fourth subgroup has issued an instruction, the memory subsystem has returned, and the first subgroup can resume processing. This way, the compute unit was executing work at every cycle.

There's a problem here though -- subgroups switching needs to be instantaneous to make this possible. Which means you can't write the program state to memory and read it back. Instead, all state of all programs is kept permanently resident in registers. This requires a huge amount of registers -- a single compute unit on GCN has 256 KiB of registers. With that, we can use up to (256 KiB / 40 / 4 / 64B) = 24 registers for a single item before we need to reduce occupancy. For our programming style, this means we should try to minimize the amount of state we need to keep around as much as possible, at least as long as we're accessing memory. If we don't access memory, a single wave can keep a SIMD 100% occupied. We should also make sure to use that local memory and L1 cache as much as we can, as they have more bandwidth and less latency than the external memory.

Execution on SIMD
One SIMD can process multiple work items in a single clock cycle, in this figure, it's 8-wide. If a branch is short, we don't waste much computation. Long branches, which disable most of the SIMD lanes however can drastically reduce our throughput.

The other problem we're going to face is how branches are handled. SIMDs execute everything together, so they can't perform jumps. What happens instead is that the individual lanes get masked off, i.e. they no longer participate. It also implies that unless all work items take the same branch, a GPU will typically execute both sides of the branch.

Which means that for a SIMD of length N, we can get a worst-case usage of 1/N. On CPUs N is usually 1..8, so it's not catastrophic, but on a GPU N can be as high as 64 at which point this does really matter. How can we make sure this doesn't happen? First, we can take advantage of the subgroup execution; if we have a branch to select between a cheap and expensive version, and some work items take the expensive version, we might as well all take it. This reduces the cost of a branch from expensive + cheap to just expensive. The other part is to simply avoid excessive branching. As CPUs get wider, this is also becoming more important, and techniques like sorting all data first, and the processing it become more interesting than heavy branching on individual work items. For instance, if you're writing a particle simulation engine, it's much faster to sort the particles by type, and the run a specialized program for each, compared to running all possible simulation programs for each particle.

What have we learned? We need:

  • A large problem domain -- the more independent work items, the better.
  • Memory intensive code should minimize the state, to allow for high occupancy.
  • We should avoid branches which disable most of the subgroup.

Those guidelines will apply universally to all GPUs. After all, all of them were designed to solve massively parallel graphics problems, and those have little branching, not too much state in flight, and tons and tons of independent work items.


I hope this blog post gives you a good idea how we ended up with those compute shaders. The really interesting bit is that the concepts of minimal communication, local memory access, and divergence costs are universally applicable. Modern multi-core CPUs are also introducing new levels of communications costs, NUMA has been long used to model costs for memory access, and so on. Understanding that not all memory is equal, and that your code is executed in some particular way on the underlying hardware will enable you to squeeze out more performance everywhere!

(312) 380-9770

As many other developers, I wrote (778) 221-4571 over the years, and because I do believe in open source, I released most of them in public. As I focused on scratching my own itch, nearly all of my projects are not very popular, and some of them are mostly dead these days. However, if you went to my project overview, you wouldn't be able to tell what state each project is in.

This weekend, I sat down to solve this once and for all, by creating an inventory of my projects, categorizing the status, and getting this all on my website in an easy-to-digest fashion. Sounds straightforward, but there are a few details which are hopefully of interest: Maintenance work, inventory management, and figuring out what states I want to actually report. Let's dive in!


Going through my various projects I noticed that my most successful projects in terms of users and community contribution were in a rather sad state. No easy-to-find readme, no continuous integration, and even worse, no changelog. Those are the three main things I look for in other projects, and I failed my own minimum bar here.

Solving the changelog is rather simple, but the continuous integration and the readme required a bit more work. You might think how can a readme require extra work? Before we get to that, a bit of context first: The most popular projects of mine are pure Python projects. This means there's exactly one way to publish them, which is (504) 534-6931. PyPI uses setuptools for distribution, and there have been quite a few changes since I originally uploaded my projects.

These days, uploading to PyPI requires twine. Additionally, as PyPI moved to warehouse, update paths and more need to get adjusted. Fortunately, the 8478364581 was also significantly improved. Some of this may sound like busywork, but it's actually worth spending the time. For me, it meant I could include some nicely formatted readme as the long project description, for instance -- which required some extra markup in setup.py.

I also used the opportunity to get some continuous integration in place. I'm not a big fan of the git monoculture, so I use Mercurial as my primary version control management, and hence Bitbucket. Unfortunately, popular services like 3604915949 don't support Bitbucket, so continuous integration was usually a pain point. Turns out, Bitbucket solved the problem themselves by introducing BitBucket pipelines, which is a very simple way to run a project inside a Docker container. With some updates to my 770-587-4881 scripts, I got this running for my projects in no time.

With this out of the way, it was time to figure out what projects I had floating around besides the few projects I updated: Time for some archeology!


The first step is getting a proper inventory of all projects. As usual, I have a lot of choice as to what is my "source of truth". Is it Bitbucket, where nearly all of my projects live? Is it PyPI? Do I store it inside each project or do I use some other place?

I was tinkering around with just using Bitbucket. Turns out, nearly all my projects are in fact on Bitbucket, so stuffing a meta.json in there seemed like a good choice. I even have a mirror script for Bitbucket which I use to back up my repositories locally, so that seemed like the natural choice … except for my closed source projects. To get that to work, I would have to index my local repository store which contains both my public and private repositories. Given I have roughly a dozen projects I care about, this started to smell like some serious overkill, but I'd probably go for this if I had more projects floating around (or if I had more time to automate some things.)

Turns out, I do have a decent index already -- the project descriptions I'm using to generate my website in the first place. Liara -- the backend for the website -- uses a YAML block at the beginning of each page to store arbitrary metadata, and that's now the new canonical index. Problem solved, but now comes the next problem: What do we want to store there?


The key bit of information is what status a project can be in. There are some obvious candidates like "active" and "abandoned", but there are a few more states which I think are worth discerning. Looking through all my projects, these are the states I settled with:

  • Active: I'm working on this project, which means I'm doing some work at least once a quarter. Plus if someone reports an issue or requests a feature, there's a decent changes it will actually happen.
  • Maintenance: The project is good enough, all features I wanted to write are in there, and there's really no reason to keep working on it. However, if someone would ask for a new feature or a bug fix, I'll most likely get to it in a timely manner.
  • Archived: Similar to maintenance, but I'm not going to fix bugs or add new features even if requested. Those are projects which are online for reference purposes, hopefully useful but not popular enough to warrant spending time on them.
  • Stalled: That's a tricky one, as it's kind of similar to the next state -- abandoned. The main difference here is that a stalled project solves some problem, but it's probably not generally useful. An archived project is completely done, I achieved everything I wanted there. For a stalled project, I have some things in mind for how to continue, but no good use case to make it happen or justify additional effort.
  • Abandoned: That's the case everyone is familar with. I started hacking but due to lack of motivation or users, I stopped before it was useful. That's the main difference to stalled -- stalled does work for some limited use cases, abandoned probably never got beyond proof-of-concept.
  • Deprecated: This is a special state for projects which are no longer relevant. I only have few projects in this state, where the upstream project moved along and hence there's no reason any more to use my library. It's similar to archived, but due to factors outside of my control, it's no longer useful.

On top of that, I also added a quality tag for production-ready projects. Some libraries like SJSON are extensively tested (more than 90% code coverage), have been used by other developers, and are also well-documented. Sticking the maintenance tag onto those doesn't seem to do them justice, so if you happen to look at the details of one of those projects, the status will mention it's production ready.


Besides a much friendlier overview page, I also got something I was missing a lot -- peace of mind. Previously, any project which was in "limbo" was a constant mental weight, as I couldn't put it to rest properly. Now I have a simple way to clearly mark it as done and I can finally stop thinking about it without having a bad conscience about it. Time to move on to new projects 😊!

(226) 206-9854

Today, something completely different than the usual programming stuff I've been recently blogging about. It's time for some gaming again! Or more specifically, time to discuss one of the major problems I have with a bunch of games -- the save game handling!

What's the deal?

I'm a highly irregular gamer these days. The times when I can sit down for a couple hours in a row to practice a game are mostly gone. Instead, I often have half an hour or an hour I can spend on a game and I want to make some progress there. Which is usually fine, as many games these days allow you to consume the content in small chunks with a decent save game system. With decent, it means the save game system implements (hopefully all) of the following features:

  • Allow me to save the game at any time
  • Allow me to continue from whatever the last save was with one click
  • Supports more than one save game per session

This is really the base feature set I expect. The ability to save at any time means I can just turn off the machine, instead of having to continue playing until the next checkpoint/and or replay parts of the game next time I get to play again. I get really upset when my OS doesn't allow them to shutdown my machine because it's "installing updates". I don't need games to mimic this behavior. Continuing from the last save point with one click is a similar convenience feature. Both together allow me to enjoy the game and don't force me to replay parts of it.

More than one save game per session is an interesting one. I tend to save games once in a while, like every time I finish a session. It's very uncommon I return to them, but I still like keeping save games at interesting places around. The primary use case is probably to go back and capture some screenshot or try another approach I learn about later on (for instance, in "The Witcher 2", the second chapter is drastically different depending on a choice. I do want to finish the story on one path, then just play the second chapter with the different choice to see the difference. Another great example of how to do this right is "Papers, please", which automatically saves at every single day, and shows you the branches in the save game dialog).

It's also helpful when playing a new game, as sometimes you can dig yourself into a hole you can't get out of, and then returning to a savegame an hour ago may be the best option instead of starting the whole session from scratch. Not to mention trying out something -- what happens when I press this red button? Oh, I see, that wasn't good, glad I don't have to replay the last two hours!

Hard mode!

The problem I've been facing recently in a couple of games are "one save only". The usual reasoning is "hard mode", or something along those lines. While I get the intention, I hope this post helps to understand the position of people like me who will not buy as a result. The non-existence of a good save game system reduces the appeal of a game for me -- and there's no upside for me. If you (designing the game) really like the "hard mode", or it's a rogue like game, or whatever, just add an option to play it the "proper way", but don't force it upon everyone. I can't believe that's going to impact sales, as everyone can still play with just one save game if they want for maximum immersion (see X-Com with the notorious Iron Man mode).

I might really enjoy your game, but I simply don't have the time to replay two hours just to get to the point where I failed. It's not like a game can really prevent me from consuming it my way: I can usually hack around those save game systems with some file monitoring and by exiting/restarting the game. Why make it so difficult?

Here's my plea to all game developers out there: Don't remove standard save game options while designing your game. For customers like me, this is a buy-no-buy decision, because I don't have the time to replay your game over and over just because the save game system is designed for "hard mode" or rogue like. I'm perfectly fine spending my money to see the carefully crafted content you prepared for me just once, and it's my decision to make if there's enough content to warrant the price. Making me practice the game or forcing me to replay parts over and over is not helping my buy decision. Thanks!