• 10
• 10
• 12
• 12
• 14

# Problem: turning a bit shift into a division

This topic is 2193 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

OK, reading this made me remember how trying to do any operation that isn't arithmetic or a comparison with signed integers in C is pretty much considered undefined behavior. While it's true that the chances of me touching anything that doesn't use two's complement are nil, I thought I may as well try to remove bit shifts in my code that are meant to be multiplications/divisions since after all the compiler will optimize them anyways.

Then I stumble upon this piece of code...
int32_t speed_to_int(int32_t value) { return (value + comma_offset) >> 8; }

This function takes a 24.8 fixed comma value and returns an integer. This is used when calculating how much an object should move. The variable [font=courier new,courier,monospace]comma_offset[/font] is an offset calculated every frame used to fake sub-pixel accuracy (I know I could have just used fixed comma for everything, but that's another topic).

What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it rounds away from zero, and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.

So I was wondering: does anybody here know what'd be the way to implement the same behavior using pure arithmetic, and not abusing [font=courier new,courier,monospace]if[/font]? Not like I really need it (since like I said, the chances of this ever running on a processor not using two's complement is pretty much non-existent), but I'm curious as to what would be the approach - especially if it's something the compiler can understand and optimize into a bit shift where possible. Just going to stick to what I have for now because it works as intended and I have better things to do.

In case somebody wants to know how [font=courier new,courier,monospace]comma_offset[/font] is calculated, here we go, where [font=courier new,courier,monospace]game_anim[/font] is just a counter that goes up by 1 every frame:
comma_offset = (game_anim & 0x80) >> 7 | (game_anim & 0x40) >> 5 | (game_anim & 0x20) >> 3 | (game_anim & 0x10) >> 1 | (game_anim & 0x08) << 1 | (game_anim & 0x04) << 3 | (game_anim & 0x02) << 5 | (game_anim & 0x01) << 7;

PS: I had to reformat the code above because the editor decides to be idiotic with indentation while pasting.

##### Share on other sites
Did not quite get what you mean :/

Something like this? : int32( int64u( int64( signed_32_bit_value ) ) / 16) // compiler replaces the division with bit-shift.

PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.

##### Share on other sites
Many of those bit hacks were great back in the day when machines had no pipeline and very little cache. Those optimizations were good 25 years ago, even back in the PS1 days a few of them were legitimately good on the hardware.

But hardware is different today. It is deeply pipelined, where the result of any operation may not be known until 20 or 50 cycles later. Modify-and-query instructions can be horrible on performance.

Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.

Today that wonderful little comparison could cause 16 pipeline stalls.

Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;

Write it clearly and simply. Compilers are really quite good at optimizing code.

##### Share on other sites
Somebody should write a book about all these "optimizations of the olden days" with a detailed explanation why they are a really bad idea. Why? Because not so long ago I had a coworker stubbornly insisting that a leap-year function was THE reason for bad performance and applied all his 80ies C-hacker knowledge to optimize it. Making it about 10 times slower. On the other hand, he probably wouldn't read that book or just write it off as "nonsense".

##### Share on other sites
OK, but what would be the way to replace that piece of code with pure arithmetic that has exactly the same behavior? Because that's what I'm asking =P

PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.

For maximum compatibility don't use signed integers for anything that isn't the basic arithmetic operations and comparisons, and don't do comparisons between signed and unsigned integers (I even get compiler warnings when I try that), and don't do overflow and underflow (the standard marks those two situations as undefined for signed integers).

For two's complement hardware (practically all standard processors) the behavior of other operations on signed integers is pretty much a given. The problem is that some DSPs still use sign+magnitude (throwing away all bit hackery to hell), and that they like to clamp the result on overflow/underflow (instead of making it wrap), when not outright throwing out an error. The standard still has to accommodate to this, which is why doing anything but what I've described is undefined behavior on signed integers.

Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.

Also that since the shift barrel was removed, it's possible that multiplication and division can actually be faster in some situations. Only the compiler would know this (although honestly, it seems like every iteration of processors throw away many optimizations from the previous one and bring in new ones, so probably it isn't worth worrying all that much outside pipelining and cache misses).

Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;

It doesn't help that the XOR trick is flawed (try swapping a variable with itself and you'll zero it out instead), making it impossible for compilers to do the proper replacement.

##### Share on other sites
To restate the obvious in case one finds a bottleneck and wants to make it faster (in strict order):
* find better algorithms (also, keep in mind cache-trashing and badly-predictable branches [might be worth to combine branches also] in choosing an algorithm).
* express more cleanly your intent to the compiler (ie, avoid bit manipulation and other hard-to-see-through meddling. Use __restrict / __assume etc - but do not forget to add relevant asserts! Use intrinsics where possible).
* time to do some magic (ie. try some alternatives and see what works best with your compiler / hardware configuration etc).

Obviously, it is assumed the last option is avoided if at all possible - i assume OP is at that point, so ... back to finding an alternative to division:
------------

Got curious and tested how int32/256 will be done with VC, as expected - it does not use division:
 00000001400255D9 mov eax,dword ptr [test1 (1400FA310h)] 00000001400255DF cdq 00000001400255E0 movzx edx,dl 00000001400255E3 add eax,edx 00000001400255E5 sar eax,8 00000001400255E8 mov dword ptr [test1 (1400FA310h)],eax 
Looking at it - i do not really see any other option. There just are no better suited instructions it could use => so, how about just using the division, knowing that the best alternative to division is used anyway.

PS. the "solution" in my first reply is wrong :insert-rolleyes:

##### Share on other sites
The problem is that [font=courier new,courier,monospace](value + comma_offset) / 0x100[/font] doesn't do what you'd think it does. It's what I said about rounding in the first post - division always rounds towards zero, bit shifting always rounds towards negative infinity. It's the first thing I tried and it completely broke up the physics (everything was moving way slower than it should when moving left or up).

##### Share on other sites
Hm... seems my grasp of English is biting me :/. Finding this, somewhy, confusing...

What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it rounds away from zero, and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.

Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

*useless stuff cut*

Nope, still confused. Could you give an example of intended behavior using -257/256 and 257/256?

##### Share on other sites
Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

Exactly. That's how arithmetic bit shifting behaves.

In that case, you might try: (val & ~255) / 256.

The whole point of this is to get rid of bit trickery though.

hopefully there is an intrinsic for "sar" to get rid of the nonsense

No need to because that's exactly what bit shifting on a signed integer does. Remember, two's complement processors explicitly have an "arithmetic" bit shift for doing divisions on signed integers, and this is what compilers use with them.

##### Share on other sites

[quote name='tanzanite7' timestamp='1335167870' post='4933995']Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

Exactly. That's how arithmetic bit shifting behaves.

The whole point of this is to get rid of bit trickery though.
[/quote]
(you replied before i realized i was talking crap)

At some point bit trickery must be used if a div instruction is to be avoided (either by you or the compiler) - there is no way around that. I assumed you had a problem with the unspecified behavior of ">>" (arithmetic vs not, although i have never encountered a compiler that used non-arithmetic for signed, but afaik it is allowed).

Confused what the actual goal here is :/ (ie. what goal is there for the replacement of the shift?)

edit:
This are the options (a far as i can see):
* val < 0 ? ((val - 255) / 256) : (val / 256) // performance is not a concern => no bit manipulations in source
* (val & ~255) / 256 // sacrificing a bit of performance to avoid ">>" ambiguity.
* val >> 8 // simple and fast, but relies on ">>" being arithmetic Edited by tanzanite7