jezham

Manually unrolling a massive loop (bitmap array)?


Hi.

I'm a little confused about loop unrolling. The compiler will apparently unroll certain loops on its own, but some of what I've read suggests it may also re-roll them at the end of the day!

As for the unrolling that's required: I generate a texture's new data with a very small function (every frame). Rather than manually unroll the y/x loop (which would produce a ridiculously massive source file), I'd like a function to do it at runtime by allocating a chunk of memory and writing the unrolled loop there. Of course I could partially unroll inside a smaller loop, but I presume that would still be slower to execute than unrolling the lot?
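For reference, a partially unrolled version of such a loop might look like the sketch below. The texture dimensions and the per-pixel `shade` function are made up for illustration; the real generator would compute whatever the texture needs:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel function; stands in for whatever the real
 * texture generator computes each frame. */
static inline uint8_t shade(size_t x, size_t y) {
    return (uint8_t)((x ^ y) & 0xFF);
}

/* Fill a w*h 8-bit texture, processing 4 pixels per iteration
 * (assumes w is a multiple of 4, for brevity). */
void fill_texture(uint8_t *dst, size_t w, size_t h) {
    for (size_t y = 0; y < h; ++y) {
        uint8_t *row = dst + y * w;
        for (size_t x = 0; x < w; x += 4) {
            row[x + 0] = shade(x + 0, y);
            row[x + 1] = shade(x + 1, y);
            row[x + 2] = shade(x + 2, y);
            row[x + 3] = shade(x + 3, y);
        }
    }
}
```

This quarters the number of loop-condition checks and branches without bloating the code the way a full unroll would.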

The only part I can't find out how to do is filling the allocated memory with the required instructions (I haven't touched ASM for almost 20 years!). I'm guessing this isn't the best idea in the world, as I've not found much info... or perhaps my search string is wrong?

Is a total unroll the way forward here, or is handling, say, 4 pixels per iteration enough? If the former, how can I do what the compiler does and generate the code?

Thanks.

Every instruction needs to be decoded, and since that is a fairly expensive process, CPUs have a so-called instruction cache.

Unrolling too deep will thrash that cache and result in low throughput without an obvious culprit.

Typically the unrolling is done over 4-8 elements.

Remember that instructions are read from memory the same way as data. If each unrolled loop body is 3-6 bytes of opcodes, the CPU will spend more time loading instructions than it does loading each 1-byte pixel.

As a rule of thumb, optimizing for space (smaller executable and code footprint) will frequently result in faster running code.


As for pixel processing, it's either SIMD or GPU. Manual unrolling of for loops is simply something compilers will do on their own.

The only exception would be unrolling for better utilization of SSE instructions, but even then it would likely be just 4 elements. Trivial operations will likely be memory bound as well, especially if data is larger than L2 cache.
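As a concrete illustration of that exception, here is a sketch of a pixel operation vectorized with SSE2 intrinsics (assumes an x86 target; on ARM/iOS the equivalent would be NEON). The `brighten_sse2` name and the brighten-by-a-constant operation are invented for the example:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Brighten an 8-bit grayscale buffer by 'amount' with saturation,
 * 16 pixels per iteration (assumes n is a multiple of 16). */
void brighten_sse2(uint8_t *px, size_t n, uint8_t amount) {
    const __m128i add = _mm_set1_epi8((char)amount);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, add);  /* saturating unsigned byte add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
}
```

One iteration here already does the work of 16 scalar iterations, which is why further unrolling on top of SIMD rarely pays off before memory bandwidth becomes the bottleneck.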

You won't gain much benefit from unrolling a loop that size... the speed benefit from loop unrolling comes from the theory that, if a loop only has one or two instructions inside it, the cost of a branch will be dramatically larger than the cost of the loop body itself. For example, imagine the cost of a branch is 100 clock cycles, but the loop is small enough that full execution of the body is only 25 clock cycles.

Without loop unrolling, execution of the loop 8 times requires (100+25)*8 = 1000 clock cycles.
With 4x loop unrolling, execution of the loop requires (100+4*25)*2 = 400 clock cycles.
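Those figures fall out of a simple cost model; a small helper (with the cycle costs above taken as assumed constants, not measured values) makes the trade-off easy to recompute for any unroll factor:

```c
/* Total cycles to process 'total' elements when each element's body
 * costs 'body' cycles, each branch costs 'branch' cycles, and the
 * loop is unrolled by 'factor' (assumes factor divides total). */
long unrolled_cost(long total, long body, long branch, long factor) {
    long iterations = total / factor;
    return (branch + factor * body) * iterations;
}
```

With the numbers above: `unrolled_cost(8, 25, 100, 1)` gives 1000 cycles and `unrolled_cost(8, 25, 100, 4)` gives 400, matching the hand calculation.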

That said, however, you can easily see that, as you increase the unroll factor, the amount of benefit you gain decreases very, very quickly. If you unroll this same loop 100x, then the amount of time you are "wasting" is only 100 clocks out of every 2600 clocks spent in the loop. Unrolling the loop 200x on top of that makes your "wasted time" percentage decrease from 1/26 to 1/52, which is almost NO real-world time saved at all.

In addition, even if you REALLY, really need that ~2% speed boost from unrolling 200x versus 100x, it doesn't work anyway, because in the real world you have an instruction cache, and unrolling the loop too much will cost you severely: your application now pays for an icache miss every 100 loop iterations, versus the previous cost of a pipelined, predicted branch to an instruction already in cache.

If you are really trying to get that speed boost, unroll the loop by hand by a factor of 8x, benchmark it, then run it again at 16x. My guess is that the speed difference will be so negligible that you will leave the hand-unrolled 16x implementation as it stands and give up on the runtime-generation plan.
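A minimal, portable timing harness for that comparison could look like this (the `fill_plain` baseline and the function-pointer signature are assumptions; substitute your own 4x/8x/16x variants):

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Baseline variant: one pixel per iteration. */
void fill_plain(uint8_t *buf, size_t n) {
    for (size_t i = 0; i < n; ++i)
        buf[i] = (uint8_t)(i & 0xFF);
}

/* Crude timing harness: run 'fill' over an n-byte buffer 'reps'
 * times and return elapsed CPU seconds.  Pass each unrolled
 * variant in turn and compare the results. */
double time_fill(void (*fill)(uint8_t *, size_t),
                 uint8_t *buf, size_t n, int reps) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        fill(buf, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Note that `clock()` has coarse resolution, so use enough repetitions that each variant runs for at least a few hundred milliseconds.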

Very informative replies, thank you. I was too distracted by the plan to see the wood for the trees!

Clearly, keeping the function inside the cache is the way to go. I shall profile 4x, 8x, and 16x. My aim is to complete the loop in around the same time it takes GL to update the texture from the other buffer, though that ratio will be platform specific. Thanks again.


btw: I've already tried running the function on the GPU, but it runs on GLES2, which is very limited and didn't take much to slow down (iOS).
