Sign in to follow this  
Sik_the_hedgehog

Problem: turning a bit shift into a division

Recommended Posts

OK, reading [url="http://www.gamasutra.com/blogs/AndreyKarpov/20120416/168596/Wade_not_in_unknown_C_waters_Part_three.php"]this[/url] made me remember how trying to do any operation that isn't arithmetic or a comparison with signed integers in C is pretty much considered undefined behavior. While it's true that the chances of me touching anything that doesn't use two's complement are nil, I thought I may as well try to remove bit shifts in my code that are meant to be multiplications/divisions since after all the compiler will optimize them anyways.

Then I stumble upon this piece of code...
[CODE]int32_t speed_to_int(int32_t value) {
return (value + comma_offset) >> 8;
}[/CODE]

This function takes a 24.8 fixed comma value and returns an integer. This is used when calculating how much an object should move. The variable [font=courier new,courier,monospace]comma_offset[/font] is an offset calculated every frame used to fake sub-pixel accuracy (I know I could have just used fixed comma for everything, but that's another topic).

What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it [i]rounds away from zero[/i], and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.

So I was wondering: does anybody here know what'd be the way to implement the same behavior using pure arithmetic, and not abusing [font=courier new,courier,monospace]if[/font]? Not like I really need it (since like I said, the chances of this ever running on a processor not using two's complement is pretty much non-existent), but I'm curious as to what would be the approach - especially if it's something the compiler can understand and optimize into a bit shift where possible. Just going to stick to what I have for now because it works as intended and I have better things to do.

In case somebody wants to know how [font=courier new,courier,monospace]comma_offset[/font] is calculated, here we go, where [font=courier new,courier,monospace]game_anim[/font] is just a counter that goes up by 1 every frame:
[CODE]comma_offset =
(game_anim & 0x80) >> 7 |
(game_anim & 0x40) >> 5 |
(game_anim & 0x20) >> 3 |
(game_anim & 0x10) >> 1 |
(game_anim & 0x08) << 1 |
(game_anim & 0x04) << 3 |
(game_anim & 0x02) << 5 |
(game_anim & 0x01) << 7;[/CODE]

PS: I had to reformat the code above because the editor decides to be idiotic with indentation while pasting.

Share this post


Link to post
Share on other sites
Did not quite get what you mean :/

Something like this? : int32( int64u( int64( signed_32_bit_value ) ) / 16) // compiler replaces the division with bit-shift.

PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.

Share this post


Link to post
Share on other sites
Many of those bit hacks were great back in the day when machines had no pipeline and very little cache. Those optimizations were good 25 years ago, even back in the PS1 days a few of them were legitimately good on the hardware.

But hardware is different today. It is deeply pipelined, where the result of any operation may not be known until 20 or 50 cycles later. Modify-and-query instructions can be horrible on performance.

Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.

Today that wonderful little comparison could cause 16 pipeline stalls.

Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;

Write it clearly and simply. Compilers are really quite good at optimizing code.

Share this post


Link to post
Share on other sites
Somebody should write a book about all these "optimizations of the olden days" with a detailed explanation why they are a really bad idea. Why? Because not so long ago I had a coworker stubbornly insisting that a leap-year function was THE reason for bad performance and applied all his 80ies C-hacker knowledge to optimize it. Making it about 10 times slower. On the other hand, he probably wouldn't read that book or just write it off as "nonsense".

Share this post


Link to post
Share on other sites
OK, but what would be the way to replace that piece of code with pure arithmetic that has exactly the same behavior? Because that's what I'm asking =P


[quote name='tanzanite7' timestamp='1335133445' post='4933893']PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.[/quote]
For maximum compatibility don't use signed integers for anything that isn't the basic arithmetic operations and comparisons, and don't do comparisons between signed and unsigned integers (I even get compiler warnings when I try that), and don't do overflow and underflow (the standard marks those two situations as undefined for signed integers).


For two's complement hardware (practically all standard processors) the behavior of other operations on signed integers is pretty much a given. The problem is that some DSPs still use sign+magnitude (throwing away all bit hackery to hell), and that they like to clamp the result on overflow/underflow (instead of making it wrap), when not outright throwing out an error. The standard still has to accommodate to this, which is why doing anything but what I've described is undefined behavior on signed integers.

[quote name='frob' timestamp='1335143238' post='4933918']Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.[/quote]
Also that since the shift barrel was removed, it's possible that multiplication and division can actually be [i]faster[/i] in some situations. Only the compiler would know this (although honestly, it seems like every iteration of processors throw away many optimizations from the previous one and bring in new ones, so probably it isn't worth worrying all that much outside pipelining and cache misses).

[quote name='frob' timestamp='1335143238' post='4933918']Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;[/quote]
It doesn't help that the XOR trick is flawed (try swapping a variable with itself and you'll zero it out instead), making it impossible for compilers to do the proper replacement.

Share this post


Link to post
Share on other sites
To restate the obvious in case one finds a bottleneck and wants to make it faster (in strict order):
* find better algorithms (also, keep in mind cache-trashing and badly-predictable branches [might be worth to combine branches also] in choosing an algorithm).
* express more cleanly your intent to the compiler (ie, avoid bit manipulation and other hard-to-see-through meddling. Use __restrict / __assume etc - but do not forget to add relevant asserts! Use [url=http://msdn.microsoft.com/en-us/library/26td21ds%28v=vs.80%29.aspx]intrinsics[/url] where possible).
* time to do some magic (ie. try some alternatives and see what works best with your compiler / hardware configuration etc).

Obviously, it is assumed the last option is avoided if at all possible - i assume OP is at that point, so ... back to finding an alternative to division:
------------

Got curious and tested how int32/256 will be done with VC, as expected - it does not use division:
[code]
00000001400255D9 mov eax,dword ptr [test1 (1400FA310h)]
00000001400255DF cdq
00000001400255E0 movzx edx,dl
00000001400255E3 add eax,edx
00000001400255E5 sar eax,8
00000001400255E8 mov dword ptr [test1 (1400FA310h)],eax
[/code]
Looking at it - i do not really see any other option. There just are no better suited instructions it could use => so, how about just using the division, knowing that the best alternative to division is used anyway.

PS. the "solution" in my first reply is wrong :insert-rolleyes:

Share this post


Link to post
Share on other sites
The problem is that [font=courier new,courier,monospace](value + comma_offset) / 0x100[/font] doesn't do what you'd think it does. It's what I said about rounding in the first post - division always rounds towards zero, bit shifting always rounds towards negative infinity. It's the first thing I tried and it completely broke up the physics (everything was moving way slower than it should when moving left or up).

Share this post


Link to post
Share on other sites
Hm... seems my grasp of English is biting me :/. Finding this, somewhy, confusing...

[quote name='Sik_the_hedgehog' timestamp='1335115479' post='4933819']
What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it [i]rounds away from zero[/i], and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.
[/quote]
Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

*useless stuff cut*

Nope, still confused. Could you give an example of intended behavior using -257/256 and 257/256?

Share this post


Link to post
Share on other sites
[quote name='tanzanite7' timestamp='1335167870' post='4933995']Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?[/quote]
Exactly. That's how arithmetic bit shifting behaves.

[quote name='tanzanite7' timestamp='1335167870' post='4933995']In that case, you might try: (val & ~255) / 256.[/quote]
The whole point of this is to get rid of bit trickery though.

[quote name='tanzanite7' timestamp='1335167870' post='4933995']hopefully there is an intrinsic for "sar" to get rid of the nonsense[/quote]
No need to because that's exactly what bit shifting on a signed integer does. Remember, two's complement processors explicitly have an "arithmetic" bit shift for doing divisions on signed integers, and this is what compilers use with them.

Share this post


Link to post
Share on other sites
[quote name='Sik_the_hedgehog' timestamp='1335168263' post='4934000']
[quote name='tanzanite7' timestamp='1335167870' post='4933995']Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?[/quote]
Exactly. That's how arithmetic bit shifting behaves.

The whole point of this is to get rid of bit trickery though.
[/quote]
(you replied before i realized i was talking crap)

At some point bit trickery must be used if a div instruction is to be avoided (either by you or the compiler) - [b]there is no way around that[/b]. I assumed you had a problem with the unspecified behavior of ">>" (arithmetic vs not, although i have never encountered a compiler that used non-arithmetic for signed, but afaik it is allowed).

Confused what the actual goal here is :/ (ie. what goal is there for the replacement of the shift?)

edit:
This are the options (a far as i can see):
* val < 0 ? ((val - 255) / 256) : (val / 256) // performance is not a concern => no bit manipulations in source
* (val & ~255) / 256 // sacrificing a bit of performance to avoid ">>" ambiguity.
* val >> 8 // simple and fast, but relies on ">>" being arithmetic Edited by tanzanite7

Share this post


Link to post
Share on other sites
[quote name='tanzanite7' timestamp='1335169545' post='4934005']At some point bit trickery must be used if a div instruction is to be avoided (either by you or the compiler) - [b]there is no way around that[/b]. I assumed you had a problem with the unspecified behavior of ">>" (arithmetic vs not, although i have never encountered a compiler that used non-arithmetic for signed, but afaik it is allowed).[/quote]
The problem strikes in that according to the C standard, bit shifting on signed integers is undefined behavior, so the idea here is to replace that piece of code with something that has defined behavior for the standard.

If the compiler outputs a bit shift it's OK. In fact, it's the ideal case =P (EDIT: in other words, we don't care what the compiler does, only we care about the source code)

[quote name='tanzanite7' timestamp='1335169545' post='4934005']Confused what the actual goal here is :/ (ie. what goal is there for the replacement of the shift?)[/quote]
Must be defined behavior according to the C standard. This pretty much means we're stuck with pure arithmetic (+, -, *, /, %).

Share this post


Link to post
Share on other sites
Then the second option (you probably missed the edit) in my previous reply is the only way (fully specified behavior) ... except.

Best choice would be: use a helper function for shift right. So you can use ">>" on compilers that use arithmetic (VC specifies that it uses arithmetic - no ambiguity there) and whatever else where not (some compiler specific intrinsic functions or "(val & ~255) / 256" if unknown compiler).

edit: if inlining / performance in non-optimized debug build is a problem then a macro function would be prudent as long as the value is used only once in the macro (this criteria is met here).

edit: Hm, actually. I think i am wrong. I doubt C specifies how signed values are represented. Bummer.

Share this post


Link to post
Share on other sites
[quote name='tanzanite7' timestamp='1335169545' post='4934005']edit:This are the options (a far as i can see):[/quote]It would be interesting to compare the actual assembly that each of these produces, and see how clever modern optimisers are!
[quote name='Sik_the_hedgehog' timestamp='1335168263' post='4934000']
The whole point of this is to get rid of bit trickery though
[/quote][code]float speed_to_float(float value) {
return value;
}[/code]... Sorry [img]http://public.gamedev.net//public/style_emoticons/default/tongue.png[/img] [img]http://public.gamedev.net//public/style_emoticons/default/wink.png[/img]

Share this post


Link to post
Share on other sites
[quote name='Hodgman' timestamp='1335171438' post='4934014'][code]float speed_to_float(float value) {
return value;
}[/code]... Sorry [img]http://public.gamedev.net//public/style_emoticons/default/tongue.png[/img] [img]http://public.gamedev.net//public/style_emoticons/default/wink.png[/img][/quote]
And that breaks log-based replays across different machines! *hides*

Jeez, I'm amazed at how many things can go wrong so easily. And now is when I wish the C standard would just specify some defined behavior for two's complement machines, since all of them work the same. Only on DSPs and the like you may find other encodings these days.

Share this post


Link to post
Share on other sites
[quote name='Hodgman' timestamp='1335171438' post='4934014']
It would be interesting to compare the actual assembly that each of these produces, and see how clever modern optimisers are!
[/quote]
A bit OT, but got curious and ...

[code]
00000001400255A9 mov eax,dword ptr [test1 (1400FA310h)]
00000001400255AF test eax,eax
00000001400255B1 jns tester+28h (1400255B8h)
00000001400255B3 add eax,0FFFFFF01h
00000001400255B8 cdq
00000001400255B9 movzx edx,dl
00000001400255BC add eax,edx
00000001400255BE sar eax,8
00000001400255C1 mov dword ptr [test1 (1400FA310h)],eax
[/code]
Very good imo (branchless instructions are an option also, but they usually perform much worse than branching code due to branch prediction and worse behavior with register-renaming hardware [or something i forgot {fucks up the dependency chain or something?}]). Expecting compiler to see through this code to just replace with a "sar" - would be too much to ask.

[code]
00000001400255C7 mov eax,dword ptr [test2 (1400FA314h)]
00000001400255CD and eax,0FFFFFF00h // if it would be clever then it would realize that this is actually unnecessary (compiler fails again because it assumes the next 3 instructions actually do something [it otherwise does keep track of this kind of bit manipulations and would remove the "and" - did test that])
00000001400255D2 cdq // this does absolutely nothing useful (it would be relatively easy for a compiler to deduce it based on previous instruction, but - evidently it does not)
00000001400255D3 movzx edx,dl // --||--
00000001400255D6 add eax,edx // --||--
00000001400255D8 sar eax,8 // this is actually all it should do
00000001400255DB mov dword ptr [test2 (1400FA314h)],eax
[/code]
[b]Absolutely terrible[/b]. Like i said - one should avoid bit manipulations as compilers just can not see through it most of the time.

[code]
00000001400255E1 sar dword ptr [test3 (1400FA318h)],8
[/code]
Perfect. Not a fair comparision with 2. option tho as it rolled "load" and "save" also into the instruction - load / save should not be part of the test as in real case one could not roll them into one. So, the code should be seen as (for comparision purposes):
[code]
mov eax,dword ptr [test3]
sar eax,8
mov dword ptr [test3],eax
[/code]

edit: compiled with full optimizations with _ReadWriteBarrier separating the tests to ensure compiler does not mix the tests.

Share this post


Link to post
Share on other sites
[quote name='Sik_the_hedgehog' timestamp='1335170327' post='4934007']The problem strikes in that according to the C standard, bit shifting on signed integers is undefined behavior[/quote]Despite being ill-defined, every project is going to have non-standard code in it somewhere. If you just want to document the ([i]unlikely[/i]) portability issue and keep working, I usually just use static assertions nearby the bit-magic code[code]//Non-standard compiler extensions assumed
STATIC_ASSERT( s8(0x80) >> 7 == s8(0xFF) );//signed shift is arithmetic / sign-extending
STATIC_ASSERT( u8(0x80) >> 7 == u8(0x01) );//unsigned shift is logical[/code]This assumption is probably true on every platform you care about, and in the case where it's not true, the above code will produce an error you when you compile.
I personally make the above assumptions in my code, and don't expect them to fail me any time soon.

Share this post


Link to post
Share on other sites
[quote]the code relies on this behavior[/quote]

Algorithm you're dealing with here is black box. There is no division that would make it more portable or better.

Because it gets worse. If you rely on exactly identical behavior, then you not only need to properly emulate operation, but also overflow behavior and CPU flags. And perhaps something else, who knows, there might be an interrupt hooked on overflow.

Code like this wasn't written for longevity, it's defined by the code itself.


As for >> and signed types - it wasn't until recently that developers finally realized that it doesn't work quite correctly for negative numbers. Before, >> was the equivalent to division. Best part - nobody noticed it wasn't correct.

Then there's matter of larger values - does it perform identically across 16/32/64 register sizes?

If trying for emulation, then the rabbit hole goes really deep.

Share this post


Link to post
Share on other sites
Am I the only one who didn't understand the point of that last post at all?

[quote name='Antheus' timestamp='1335187312' post='4934088']Because it gets worse. If you rely on exactly identical behavior, then you not only need to properly emulate operation, but also overflow behavior and CPU flags. And perhaps something else, who knows, there might be an interrupt hooked on overflow.[/quote]
First of all, this is C, I don't care about what the flags are or anything (the compiler takes care about that). And this code assumes overflow shouldn't happen (if it does then it's considered a bug in the game).

Second, the only behavior that matters here is the result the function returns, not what the compiler decides to do inside (though ideally the compiler should not do stupid things to achieve it!).

[quote name='Antheus' timestamp='1335187312' post='4934088']As for >> and signed types - it wasn't until recently that developers finally realized that it doesn't work quite correctly for negative numbers. Before, >> was the equivalent to division. Best part - nobody noticed it wasn't correct.[/quote]
Pretty sure this issue was known for as long as bit shifting was used for this purpose. It would have appeared pretty quickly during debugging. (EDIT: assuming here you talk about the rounding issue, if you mean it not working on non-two's-complement hardware, nevermind)

Share this post


Link to post
Share on other sites
[quote]First of all, this is C, I don't care about what the flags are or anything (the compiler takes care about that). And this code assumes overflow shouldn't happen (if it does then it's considered a bug in the game).

Second, the only behavior that matters here is the result the function returns, not what the compiler decides to do inside (though ideally the compiler should not do stupid things to achieve it!).[/quote]
No, the compiler is, by spec, required to do stupid things. Result of bitshift on a signed number is compiler-defined.

So whatever you're trying to emulate "correctly" has nothing to compare to.

The following function is correct:[code]
int32_t speed_to_int(int32_t value) {
if (value + comma_offset >= 0) return (value+comma_offset) / 256;
return rand();
}[/code]For a given compiler that chooses such behavior. C language does not forbid it, compiler that generates code like above is perfectly C standard compliant.

Since that is probably not what you intended, it is up to you what the supposed result will be and how it's implemented.

If log replayes across machines (as mentioned above) are important, then any behavior will do, as long as its consistent.

But there is no way to provide 1:1 correct version, since original one is undefined, the only thing one can do is compare against a specific result of a specific compiler.

Share this post


Link to post
Share on other sites
[quote name='Antheus' timestamp='1335206406' post='4934192']No, the compiler is, by spec, required to do stupid things. Result of bitshift on a signed number is compiler-defined.[/quote]
By "stupid" I meant "don't try to pull off 20 lines when it could have done with 5" and stuff like that.

[quote name='Antheus' timestamp='1335206406' post='4934192']So whatever you're trying to emulate "correctly" has nothing to compare to.[/quote]
Except I've already described what I'm expecting, and the whole point here is to write code that achieves what I'm expecting without relying on undefined behavior (ideally without making the compiler output needlessly complex code).

Share this post


Link to post
Share on other sites
Well, the [url="http://stackoverflow.com/questions/3041946/how-to-integer-divide-round-negative-numbers-down"]marked solution here[/url] is correct across all values.

[quote]* val < 0 ? ((val - 255) / 256) : (val / 256) // performance is not a concern => no bit manipulations in source[/quote]

This one is simplest, but incorrect for values where (val - 255) overflows.

Just about all other answers in SO's thread are also incorrect in similar ways. What's worse is that they may work if int is 64 bits.

Share this post


Link to post
Share on other sites
The problem is that such a solution requires comparisons, which is something I'd definitely like to avoid, especially since I don't trust the compiler to actually understand the idiom even when optimization is set to maximum (I suppose that in the worst case I could just do an #ifdef, but whatever).

And I said I don't care about overflow. If the value passed is large enough that overflow happens, then something is seriously wrong with the program and I shouldn't expect it to behave properly at all. So, I don't care about what happens if overflow kicks in.

Also in case this helps: this function is used to turn speeds into pixels. It takes a 24.8 fixed comma value and returns a normal integer. The integer is rounded in such a way as to fake subpixel precision (e.g. if it's 320 (1¼), it'd return 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, etc. as frames pass by). I still can afford to change the algorithm yet, but I'd like the behavior I've described to remain.

Share this post


Link to post
Share on other sites
Have you tried looking at what assembly the various methods generate?

A compiler might be able to optimize something like this in to a very short instruction sequence:
[code]
uint32_t signed_rshift(uint32_t number, uint32_t shift)
{
if (number & 0x80000000) return (~(((~number) + uint32_t(1)) >> shift)) + uint32_t(1); // check this ;)
else return number >> shift;
}
[/code]

I know for example that GCC will optimize circular bit-rotations implemented with shifts and bitwise-or in to ROL/ROR on x86, so it's probably worth a go...

Share this post


Link to post
Share on other sites
[quote name='edd²' timestamp='1335222912' post='4934276']I know for example that GCC will optimize circular bit-rotations implemented with shifts and bitwise-or in to ROL/ROR on x86, so it's probably worth a go...[/quote]
That's probably the worst example out there, since it's a very common idiom GCC was explicitly programmed to handle. In fact, if you do it in any way other than you've described (two shifts and an OR), GCC will [i]not[/i] generate a rotate instruction, even if the final result is guaranteed to be 100% identical.

I *know* there's a way to turn a signed division into a signed bit shift properly that involves adding some number to the value before dividing it, but I can't recall the details right now. I suppose the solution here would be to do the same, but subtracting the value instead of adding it.

EDIT: OK, I can't find out how that was done, argh. But in the meanwhile I tested some combinations... GCC, -O3, x86-64 (also being GCC means AT&T syntax, watch out).

Bit shift:
[CODE]void test1(int value) {
result = value >> 8;
}[/CODE][CODE]sarl $8, %edi[/CODE]

Naive comparison:
[CODE]
void test2(int value) {
result = value >= 0 ? value / 0x100 : (value + 0xFF) / 0x100;
}[/CODE][CODE]testl %edi, %edi
jns .L5
leal 510(%rdi), %eax
addl $255, %edi
cmovs %eax, %edi
.L5:
sarl $8, %edi[/CODE]
Same as above, trying to fool it to go into the asm optimizer:
[CODE]void test3(int value) {
if (value < 0) value += 0xFF;
result = value / 0x100;
}[/CODE][CODE]leal 255(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
leal 255(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $8, %edi[/CODE]

PS: somebody make the editor stop stripping indentation when copying! =/

Share this post


Link to post
Share on other sites
I really don't understand the goal here.

It should duplicate original behavior, yet it doesn't matter if algorithm isn't correct.
It should be portable, yet should be a perfect match for generated assembly opcode.
It should be compiler independent and standard-compliant yet perfectly optimized for any compiler, even those not written and those running on exotic architectures.
Performance is secondary to portability yet performance comes first.

Bitshift is perfectly fine solution if performance matters. Otherwise, there's a so-called reference implementation available on stack overflow.

Implement that, then profile. Is it good enough? Then it's fine. Otherwise, use an ifdef and a bitshift.

Especially since for this type of optimization keyhole optimizations won't matter. 5 lines of assembly is frequently faster than a single instruction.

And if performance is crucial, there's SIMD, for which either approach looks good.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this