[ACCEPTED]-What techniques to avoid conditional branching do you know?-assembly

Accepted answer
Score: 17

Using Matt Joiner's example:

if (b > a) b = a;

You could also 2 do the following, without having to dig 1 into assembly code:

bool if_else = b > a;
b = a * if_else + b * !if_else;
Score: 12

I believe the most common way to avoid branching 24 is to leverage bit parallelism in reducing 23 the total jumps present in your code. The 22 longer the basic blocks, the less often 21 the pipeline is flushed.

As someone else 20 has mentioned, if you want to do more than 19 unrolling loops, and providing branch hints, you're 18 going to want to drop into assembly. Of 17 course this should be done with utmost caution: your 16 typical compiler can write better assembly 15 in most cases than a human. Your best hope 14 is to shave off rough edges, and make assumptions 13 that the compiler cannot deduce.

Here's an 12 example of the following C code:

if (b > a) b = a;

In assembly 11 without any jumps, by using bit-manipulation 10 (and extreme commenting):

sub eax, ebx ; = a - b
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0

Note that while 9 conditional moves are immediately jumped 8 on by assembly enthusiasts, that's only 7 because they're easily understood and provide 6 a higher level language concept in a convenient 5 single instruction. They are not necessarily 4 faster, not available on older processors, and 3 by mapping your C code into corresponding 2 conditional move instructions you're just 1 doing the work of the compiler.

Score: 8

The generalization of the example you give 12 is "replace conditional evaluation with 11 math"; conditional-branch avoidance largely 10 boils down to that.

What's going on with 9 replacing && with & is that, since && is short-circuit, it 8 constitutes conditional evaluation in and 7 of itself. & gets you the same logical results 6 if both sides are either 0 or 1, and isn't 5 short-circuit. Same applies to || and | except 4 you don't need to make sure the sides are 3 constrained to 0 or 1 (again, for logic 2 purposes only, i.e. you're using the result 1 only Booleanly).

Score: 6

At this level things are very hardware-dependent 32 and compiler-dependent. Is the compiler 31 you're using smart enough to compile < without 30 control flow? gcc on x86 is smart enough; lcc is 29 not. On older or embedded instruction sets 28 it may not be possible to compute < without 27 control flow.

Beyond this Cassandra-like 26 warning, it's hard to make any helpful general 25 statements. So here are some general statements 24 that may be unhelpful:

  • Modern branch-prediction 23 hardware is terrifyingly good. If you could 22 find a real program where bad branch prediction 21 costs more than 1%-2% slowdown, I'd be very 20 surprised.

  • Performance counters or other 19 tools that tell you where to find branch 18 mispredictions are indispensible.

  • If you 17 actually need to improve such code, I'd 16 look into trace scheduling and loop unrolling:

    • Loop 15 unrolling replicates loop bodies and gives 14 your optimizer more control flow to work 13 with.

    • Trace scheduling identifies which paths 12 are most likely to be taken, and among other 11 tricks, it can tweak the branch directions 10 so that the branch-prediction hardware works 9 better on the most common paths. With unrolled 8 loops, there are more and longer paths, so 7 the trace scheduler has more to work with

  • I'd 6 be leery of trying to code this myself in 5 assembly. When the next chip comes out 4 with new branch-prediction hardware, chances 3 are excellent that all your hard work goes 2 down the drain. Instead I'd look for a 1 feedback-directed optimizing compiler.

Score: 5

An extension of the technique demonstrated 14 in the original question applies when you 13 have to do several nested tests to get an 12 answer. You can build a small bitmask from 11 the results of all the tests, and the "look 10 up" the answer in a table.

if (a) {
  if (b) {
    result = q;
  } else {
    result = r;
} else {
  if (b) {
    result = s;
  } else {
    result = t;

If a and b are 9 nearly random (e.g., from arbitrary data), and 8 this is in a tight loop, then branch prediction 7 failures can really slow this down. Can 6 be written as:

// assuming a and b are bools and thus exactly 0 or 1 ...
static const table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];

You can generalize this to 5 several conditionals. I've seen it done 4 for 4. If the nesting gets that deep, though, you 3 want to make sure that testing all of them 2 is really faster than doing just the minimal 1 tests suggested by short-circuit evaluation.

Score: 4

GCC is already smart enough to replace conditionals 10 with simpler instructions. For example newer 9 Intel processors provide cmov (conditional move). If 8 you can use it, SSE2 provides some instructions 7 to compare 4 integers (or 8 shorts, or 16 chars) at a time.

Additionaly 6 to compute minimum you can use (see these 5 magic tricks):

min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))

However, pay attention to things like:

c[i][j] = min(c[i][j], c[i][k] + c[j][k]);   // from Floyd-Warshal algorithm

even 4 no jumps are implied is much slower than

int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
    c[i][j] = tmp;

My 3 best guess is that in the first snippet 2 you pollute the cache more often, while 1 in the second you don't.

Score: 2

In my opinion if you're reaching down to 10 this level of optimization, it's probably 9 time to drop right into assembly language.

Essentially 8 you're counting on the compiler generating 7 a specific pattern of assembly to take advantage 6 of this optimization in C anyway. It's 5 difficult to guess exactly what code a compiler 4 is going to generate, so you'd have to look 3 at it anytime a small change is made - why 2 not just do it in assembly and be done with 1 it?

Score: 1

Most processors provide branch prediction 6 that is better than 50%. In fact, if you 5 get a 1% improvement in branch prediction 4 then you can probably publish a paper. There 3 are a mountain of papers on this topic if 2 you are interested.

You're better off worrying 1 about cache hits and misses.

Score: 0

This level of optimization is unlikely to 5 make a worthwhile difference in all but 4 the hottest of hotspots. Assuming it does 3 (without proving it in a specific case) is 2 a form of guessing, and the first rule of optimization 1 is don't act on guesses.

More Related questions