Practical usage of branchmasking technique from CryENGINE3?

3 comments, last by Matias Goldberg 10 years, 8 months ago

I am referring to these functions: https://github.com/gamerankur/alecmercer-origins/blob/master/Code/CryEngine/CryCommon/branchmask.h I found 1 or 2 threads on them, but they only covered setting uints. Can branchmasking also be used to check pointers or call functions depending on a condition? I couldn't find any practical uses for them in my engine :(


Can branchmasking also be used to check pointers or call functions depending on a condition? I couldn't find any practical uses for them in my engine :(

In general, no. That would require checking the result of the mask and doing something based on its value, hence you would branch. These functions are really only useful for refactoring a blend of math and logic into just math. Oftentimes the end result is more instructions, but the execution path doesn't diverge, so it avoids instruction cache misses and can be better pipelined by the hardware and/or mixed with a concurrently executing thread on hyper-threaded cores.
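To make "a blend of math and logic into just math" concrete, here is a minimal sketch of a mask-based select on unsigned integers. The helper names (maskLess, selectBits, clampToMax) are made up for illustration and are not the functions from branchmask.h; a modern compiler may well generate equivalent branch-free code from a plain ternary on its own.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names for illustration; NOT the CryENGINE API.

// Returns 0xFFFFFFFF when a < b, otherwise 0.
constexpr std::uint32_t maskLess(std::uint32_t a, std::uint32_t b) {
    // (a < b) converts to 0 or 1; subtracting it from 0 in unsigned
    // arithmetic turns 1 into "all bits set" and leaves 0 as 0.
    return 0u - static_cast<std::uint32_t>(a < b);
}

// Picks x where mask bits are set, y where they are clear.
constexpr std::uint32_t selectBits(std::uint32_t mask,
                                   std::uint32_t x,
                                   std::uint32_t y) {
    return (x & mask) | (y & ~mask);
}

// Same result as: (v < maxV) ? v : maxV, written without an if/else.
constexpr std::uint32_t clampToMax(std::uint32_t v, std::uint32_t maxV) {
    return selectBits(maskLess(v, maxV), v, maxV);
}

static_assert(clampToMax(3u, 10u) == 3u, "value below the limit is kept");
static_assert(clampToMax(42u, 10u) == 10u, "value above the limit is clamped");
```

Note that both candidate values are always computed; the mask only picks one of them. That is why this cannot "call a function depending on a condition": skipping work is exactly what a branch is for.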

This kind of thing is reserved for gritty optimization passes and should be used with care. Modern CPUs are complex and if you don't profile both before and after an optimization you run the risk of the optimization decreasing readability for no performance gain and maybe even reducing performance due to some synergistic quirk between the hardware architecture and your code.

http://stackoverflow.com/questions/13894302/how-does-branch-masking-work-in-cryengine-3

...applicability starts and ends with the example given in answer to that post.

Yeah. And unless you're working on PS3/X360 (not PS4/XB1!), this is likely to be a waste of time. Those platforms have in-order CPUs that pay a huge cost for mis-predicted branches. Modern PC CPUs (including the ones found in PS4/XB1) can handle a typical amount of branching without paying a significant overhead. Any optimization effort is almost certainly better spent elsewhere...

Note however that tricks like these are pretty much standard practice when programming for GPUs.

GPUs tend to have a large number of in-order cores, and branching is very expensive, so it makes a lot of sense there.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

These are part of what is known as "data oriented design".

There are three kinds of "if" or branch statements:

  1. Those that are inherent to the logic flow. Example: if playerInFov then attack else patrol.
  2. To verify validity/bookkeeping. Example: if pointer != null then pointer->doSomething
  3. For math operations: if length > 0 then vector /= length; if inheritOrientation then orientation = parentOrientation * orientation

Out of the 3, the first one is almost always unavoidable and is better handled with branches anyway.

The second one you should try to design away so that the checks aren't needed in the first place. For example, create a rule that no pointer can be null, and assign a shared dummy object where you would otherwise store null. An alternative is to keep a list of the pointers on which doSomething will be called, and make sure this list only ever contains the valid, needed pointers. Of course this needs some thinking and coding, so it's only worth it if you're checking validity many times every frame or profiling shows cache misses near that code.
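Both ideas can be sketched roughly as follows. All the type and function names here (Listener, Entity, dummyListener, updateAll) are hypothetical and only meant to show the shape of the design, not any particular engine's code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration only.
struct Listener {
    int calls = 0;
    void doSomething() { ++calls; }
};

// Shared dummy object that stands in for "no listener".
inline Listener& dummyListener() {
    static Listener dummy;
    return dummy;
}

struct Entity {
    // Rule: this pointer is never null; it points at the dummy instead.
    Listener* listener = &dummyListener();

    // The only validity check lives here, outside the per-frame path.
    void setListener(Listener* l) { listener = l ? l : &dummyListener(); }

    // Per-frame path: no "if (listener != nullptr)" needed.
    void update() { listener->doSomething(); }
};

// Alternative: iterate a list that, by construction, holds only valid
// pointers. The assert documents the rule in debug builds.
inline void updateAll(const std::vector<Listener*>& active) {
    for (Listener* l : active) {
        assert(l != nullptr);
        l->doSomething();
    }
}
```

The trade-off is that the dummy still costs a call per update, so the list approach tends to fit better when most entries would otherwise be null.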

The third one is the one "branchmask" aims to solve. Math intensive operations are better when pipelined. I'm not familiar with CryEngine's functions, but an example would be:

orientation = (parentOrientation * checkIfSet( inheritOrientation ) ) * orientation;

parentOrientation will be nullified if checkIfSet returns zero, and kept intact if it returns one. This works better with architectures that support bitwise logic directly in floating point registers, for example SSE, where you can do parentOrientation & checkIfSet( inheritOrientation ) instead (much faster).
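A portable scalar sketch of the same idea follows, using scale inheritance rather than orientation to keep it short. One caveat it tries to make explicit: zeroing one factor of a product zeroes the whole product, so when the flag is clear this sketch substitutes the neutral factor 1.0f instead of 0. That substitution, and the names Vec3 and inheritScale, are my assumptions for the example and not CryEngine code.

```cpp
#include <cassert>

// Hypothetical type and function for illustration only.
struct Vec3 {
    float x, y, z;
};

// Applies the parent's scale only when 'inherit' is true, without an if.
// Assumes the parent values are finite: NaN or infinity would leak into
// the result even when the flag is clear, because 0 * inf is NaN.
inline Vec3 inheritScale(Vec3 local, Vec3 parent, bool inherit) {
    const float f = static_cast<float>(inherit);  // 1.0f or 0.0f
    const float g = 1.0f - f;                     // the opposite flag

    // Each factor is parent.* when f == 1 and exactly 1.0f when f == 0.
    return Vec3{ local.x * (parent.x * f + g),
                 local.y * (parent.y * f + g),
                 local.z * (parent.z * f + g) };
}
```

For example, inheritScale({1, 2, 3}, {2, 2, 2}, true) gives {2, 4, 6}, while passing false returns the local scale unchanged. With SSE the multiply by a 0/1 flag would typically be replaced by a bitwise AND against an all-ones or all-zeros mask, as described above.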

The thing is, a branch can cause cache misses and pipeline stalls, and these can take longer than the computation being skipped.

You can read more info about it. Note that that blog talks about the PPC architecture in particular. I see many posters here telling you that on PC it's not needed because x86 has branch predictors and OoOE (Out of Order Execution).

But it still does make a difference; core applications for this kind of optimization are node transformation updates, frustum culling, and AABB transform updates. The only exception, of course, is when the amount of work being skipped is large enough to outweigh the cost of the pipeline stalls.

Chances are that if you're in game logic, you won't require that kind of math throughput and won't notice a single millisecond of difference (if it doesn't run slower), and you will only have made your code harder to read.

But I wanted to note that there are applications in the x86 architecture where this is useful, despite branch predictors and OoOE.

I hope that clears it up.
