Jump to content

  • Log In with Google      Sign In   
  • Create Account

Awesome job so far everyone! Please give us your feedback on how our article efforts are going. We still need more finished articles for our May contest theme: Remake the Classics

Problem: turning a bit shift into a division


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
25 replies to this topic

#1 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 22 April 2012 - 11:24 AM

OK, reading this made me remember how trying to do any operation that isn't arithmetic or a comparison with signed integers in C is pretty much considered undefined behavior. While it's true that the chances of me touching anything that doesn't use two's complement are nil, I thought I may as well try to remove bit shifts in my code that are meant to be multiplications/divisions since after all the compiler will optimize them anyways.

Then I stumble upon this piece of code...
int32_t speed_to_int(int32_t value) {
   return (value + comma_offset) >> 8;
}

This function takes a 24.8 fixed comma value and returns an integer. This is used when calculating how much an object should move. The variable comma_offset is an offset calculated every frame used to fake sub-pixel accuracy (I know I could have just used fixed comma for everything, but that's another topic).

What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it rounds away from zero, and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.

So I was wondering: does anybody here know what'd be the way to implement the same behavior using pure arithmetic, and not abusing if? Not like I really need it (since like I said, the chances of this ever running on a processor not using two's complement is pretty much non-existent), but I'm curious as to what would be the approach - especially if it's something the compiler can understand and optimize into a bit shift where possible. Just going to stick to what I have for now because it works as intended and I have better things to do.

In case somebody wants to know how comma_offset is calculated, here we go, where game_anim is just a counter that goes up by 1 every frame:
comma_offset =
   (game_anim & 0x80) >> 7 |
   (game_anim & 0x40) >> 5 |
   (game_anim & 0x20) >> 3 |
   (game_anim & 0x10) >> 1 |
   (game_anim & 0x08) << 1 |
   (game_anim & 0x04) << 3 |
   (game_anim & 0x02) << 5 |
   (game_anim & 0x01) << 7;

PS: I had to reformat the code above because the editor decides to be idiotic with indentation while pasting.
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

Sponsor:

#2 tanzanite7   Members   -  Reputation: 595

Like
0Likes
Like

Posted 22 April 2012 - 04:24 PM

Did not quite get what you mean :/

Something like this? : int32( int64u( int64( signed_32_bit_value ) ) / 16) // compiler replaces the division with bit-shift.

PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.

#3 frob   Moderators   -  Reputation: 7828

Like
3Likes
Like

Posted 22 April 2012 - 07:07 PM

Many of those bit hacks were great back in the day when machines had no pipeline and very little cache. Those optimizations were good 25 years ago, even back in the PS1 days a few of them were legitimately good on the hardware.

But hardware is different today. It is deeply pipelined, where the result of any operation may not be known until 20 or 50 cycles later. Modify-and-query instructions can be horrible on performance.

Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.

Today that wonderful little comparison could cause 16 pipeline stalls.

Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;

Write it clearly and simply. Compilers are really quite good at optimizing code.

#4 Trienco   Members   -  Reputation: 1307

Like
0Likes
Like

Posted 22 April 2012 - 10:39 PM

Somebody should write a book about all these "optimizations of the olden days" with a detailed explanation why they are a really bad idea. Why? Because not so long ago I had a coworker stubbornly insisting that a leap-year function was THE reason for bad performance and applied all his 80ies C-hacker knowledge to optimize it. Making it about 10 times slower. On the other hand, he probably wouldn't read that book or just write it off as "nonsense".
f@dzhttp://festini.device-zero.de

#5 Sik_the_hedgehog   Members   -  Reputation: 958

Like
1Likes
Like

Posted 22 April 2012 - 11:03 PM

OK, but what would be the way to replace that piece of code with pure arithmetic that has exactly the same behavior? Because that's what I'm asking =P


PS. for maximum compatibility, it would be wise to not use ">>" bit-shifts at all - any compiler is able to do the correct substitution if you just use multiplication / division.

For maximum compatibility don't use signed integers for anything that isn't the basic arithmetic operations and comparisons, and don't do comparisons between signed and unsigned integers (I even get compiler warnings when I try that), and don't do overflow and underflow (the standard marks those two situations as undefined for signed integers).


For two's complement hardware (practically all standard processors) the behavior of other operations on signed integers is pretty much a given. The problem is that some DSPs still use sign+magnitude (throwing away all bit hackery to hell), and that they like to clamp the result on overflow/underflow (instead of making it wrap), when not outright throwing out an error. The standard still has to accommodate to this, which is why doing anything but what I've described is undefined behavior on signed integers.

Compilers are great at optimizing for today's hardware. They will reorder instructions so the results won't be needed until the pipeline is clear. Strength reduction, such as replacing a division with a shift, is among the most trivial of the optimizations.

Also that since the shift barrel was removed, it's possible that multiplication and division can actually be faster in some situations. Only the compiler would know this (although honestly, it seems like every iteration of processors throw away many optimizations from the previous one and bring in new ones, so probably it isn't worth worrying all that much outside pipelining and cache misses).

Another classic was to swap in place: a xor b, b xor a, a xor b. Three decades ago it was great. Today you can get over 50x the performance with temp=a; a=b; b=temp;

It doesn't help that the XOR trick is flawed (try swapping a variable with itself and you'll zero it out instead), making it impossible for compilers to do the proper replacement.
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#6 tanzanite7   Members   -  Reputation: 595

Like
0Likes
Like

Posted 23 April 2012 - 01:07 AM

To restate the obvious in case one finds a bottleneck and wants to make it faster (in strict order):
* find better algorithms (also, keep in mind cache-trashing and badly-predictable branches [might be worth to combine branches also] in choosing an algorithm).
* express more cleanly your intent to the compiler (ie, avoid bit manipulation and other hard-to-see-through meddling. Use __restrict / __assume etc - but do not forget to add relevant asserts! Use intrinsics where possible).
* time to do some magic (ie. try some alternatives and see what works best with your compiler / hardware configuration etc).

Obviously, it is assumed the last option is avoided if at all possible - i assume OP is at that point, so ... back to finding an alternative to division:
------------

Got curious and tested how int32/256 will be done with VC, as expected - it does not use division:
00000001400255D9  mov         eax,dword ptr [test1 (1400FA310h)]
00000001400255DF  cdq
00000001400255E0  movzx       edx,dl
00000001400255E3  add         eax,edx
00000001400255E5  sar         eax,8
00000001400255E8  mov         dword ptr [test1 (1400FA310h)],eax
Looking at it - i do not really see any other option. There just are no better suited instructions it could use => so, how about just using the division, knowing that the best alternative to division is used anyway.

PS. the "solution" in my first reply is wrong :insert-rolleyes:

#7 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 01:16 AM

The problem is that (value + comma_offset) / 0x100 doesn't do what you'd think it does. It's what I said about rounding in the first post - division always rounds towards zero, bit shifting always rounds towards negative infinity. It's the first thing I tried and it completely broke up the physics (everything was moving way slower than it should when moving left or up).
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#8 tanzanite7   Members   -  Reputation: 595

Like
0Likes
Like

Posted 23 April 2012 - 01:57 AM

Hm... seems my grasp of English is biting me :/. Finding this, somewhy, confusing...

What's the issue here? That arithmetic (signed) bit shift doesn't behave exactly like division when it comes to negative integers: instead of rounding towards zero, it rounds away from zero, and the code relies on this behavior. I tried some quick tricks to get it to work with division but I couldn't get it working. Applying an offset doesn't work because positive integers still have to round towards zero.

Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

*useless stuff cut*

Nope, still confused. Could you give an example of intended behavior using -257/256 and 257/256?

#9 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 02:04 AM

Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

Exactly. That's how arithmetic bit shifting behaves.

In that case, you might try: (val & ~255) / 256.

The whole point of this is to get rid of bit trickery though.

hopefully there is an intrinsic for "sar" to get rid of the nonsense

No need to because that's exactly what bit shifting on a signed integer does. Remember, two's complement processors explicitly have an "arithmetic" bit shift for doing divisions on signed integers, and this is what compilers use with them.
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#10 tanzanite7   Members   -  Reputation: 595

Like
1Likes
Like

Posted 23 April 2012 - 02:25 AM

Am i to understand that this is the intended result: -257/256 = -2, 257/256 = 1 ?

Exactly. That's how arithmetic bit shifting behaves.

The whole point of this is to get rid of bit trickery though.

(you replied before i realized i was talking crap)

At some point bit trickery must be used if a div instruction is to be avoided (either by you or the compiler) - there is no way around that. I assumed you had a problem with the unspecified behavior of ">>" (arithmetic vs not, although i have never encountered a compiler that used non-arithmetic for signed, but afaik it is allowed).

Confused what the actual goal here is :/ (ie. what goal is there for the replacement of the shift?)

edit:
This are the options (a far as i can see):
* val < 0 ? ((val - 255) / 256) : (val / 256) // performance is not a concern => no bit manipulations in source
* (val & ~255) / 256 // sacrificing a bit of performance to avoid ">>" ambiguity.
* val >> 8 // simple and fast, but relies on ">>" being arithmetic

Edited by tanzanite7, 23 April 2012 - 03:05 AM.


#11 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 02:38 AM

At some point bit trickery must be used if a div instruction is to be avoided (either by you or the compiler) - there is no way around that. I assumed you had a problem with the unspecified behavior of ">>" (arithmetic vs not, although i have never encountered a compiler that used non-arithmetic for signed, but afaik it is allowed).

The problem strikes in that according to the C standard, bit shifting on signed integers is undefined behavior, so the idea here is to replace that piece of code with something that has defined behavior for the standard.

If the compiler outputs a bit shift it's OK. In fact, it's the ideal case =P (EDIT: in other words, we don't care what the compiler does, only we care about the source code)

Confused what the actual goal here is :/ (ie. what goal is there for the replacement of the shift?)

Must be defined behavior according to the C standard. This pretty much means we're stuck with pure arithmetic (+, -, *, /, %).
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#12 tanzanite7   Members   -  Reputation: 595

Like
0Likes
Like

Posted 23 April 2012 - 02:54 AM

Then the second option (you probably missed the edit) in my previous reply is the only way (fully specified behavior) ... except.

Best choice would be: use a helper function for shift right. So you can use ">>" on compilers that use arithmetic (VC specifies that it uses arithmetic - no ambiguity there) and whatever else where not (some compiler specific intrinsic functions or "(val & ~255) / 256" if unknown compiler).

edit: if inlining / performance in non-optimized debug build is a problem then a macro function would be prudent as long as the value is used only once in the macro (this criteria is met here).

edit: Hm, actually. I think i am wrong. I doubt C specifies how signed values are represented. Bummer.

#13 Hodgman   Moderators   -  Reputation: 13602

Like
0Likes
Like

Posted 23 April 2012 - 02:57 AM

edit:This are the options (a far as i can see):

It would be interesting to compare the actual assembly that each of these produces, and see how clever modern optimisers are!

The whole point of this is to get rid of bit trickery though

float speed_to_float(float value) {
   return value;
}
... Sorry Posted Image Posted Image

#14 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 03:07 AM

float speed_to_float(float value) {
   return value;
}
... Sorry Posted Image Posted Image

And that breaks log-based replays across different machines! *hides*

Jeez, I'm amazed at how many things can go wrong so easily. And now is when I wish the C standard would just specify some defined behavior for two's complement machines, since all of them work the same. Only on DSPs and the like you may find other encodings these days.
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#15 tanzanite7   Members   -  Reputation: 595

Like
1Likes
Like

Posted 23 April 2012 - 03:24 AM

It would be interesting to compare the actual assembly that each of these produces, and see how clever modern optimisers are!

A bit OT, but got curious and ...

00000001400255A9  mov         eax,dword ptr [test1 (1400FA310h)]  
00000001400255AF  test        eax,eax  
00000001400255B1  jns         tester+28h (1400255B8h)  
00000001400255B3  add         eax,0FFFFFF01h  
00000001400255B8  cdq  
00000001400255B9  movzx       edx,dl  
00000001400255BC  add         eax,edx  
00000001400255BE  sar         eax,8  
00000001400255C1  mov         dword ptr [test1 (1400FA310h)],eax  
Very good imo (branchless instructions are an option also, but they usually perform much worse than branching code due to branch prediction and worse behavior with register-renaming hardware [or something i forgot {fucks up the dependency chain or something?}]). Expecting compiler to see through this code to just replace with a "sar" - would be too much to ask.

00000001400255C7  mov         eax,dword ptr [test2 (1400FA314h)]  
00000001400255CD  and         eax,0FFFFFF00h // if it would be clever then it would realize that this is actually unnecessary (compiler fails again because it assumes the next 3 instructions actually do something [it otherwise does keep track of this kind of bit manipulations and would remove the "and" - did test that])
00000001400255D2  cdq                 // this does absolutely nothing useful (it would be relatively easy for a compiler to deduce it based on previous instruction, but - evidently it does not)
00000001400255D3  movzx       edx,dl  //   --||--
00000001400255D6  add         eax,edx //   --||--
00000001400255D8  sar         eax,8   // this is actually all it should do
00000001400255DB  mov         dword ptr [test2 (1400FA314h)],eax  
Absolutely terrible. Like i said - one should avoid bit manipulations as compilers just can not see through it most of the time.

00000001400255E1  sar         dword ptr [test3 (1400FA318h)],8
Perfect. Not a fair comparision with 2. option tho as it rolled "load" and "save" also into the instruction - load / save should not be part of the test as in real case one could not roll them into one. So, the code should be seen as (for comparision purposes):
mov         eax,dword ptr [test3]  
sar         eax,8
mov         dword ptr [test3],eax  

edit: compiled with full optimizations with _ReadWriteBarrier separating the tests to ensure compiler does not mix the tests.

#16 Hodgman   Moderators   -  Reputation: 13602

Like
0Likes
Like

Posted 23 April 2012 - 04:57 AM

The problem strikes in that according to the C standard, bit shifting on signed integers is undefined behavior

Despite being ill-defined, every project is going to have non-standard code in it somewhere. If you just want to document the (unlikely) portability issue and keep working, I usually just use static assertions nearby the bit-magic code
//Non-standard compiler extensions assumed
STATIC_ASSERT( s8(0x80) >> 7 == s8(0xFF) );//signed shift is arithmetic / sign-extending
STATIC_ASSERT( u8(0x80) >> 7 == u8(0x01) );//unsigned shift is logical
This assumption is probably true on every platform you care about, and in the case where it's not true, the above code will produce an error you when you compile.
I personally make the above assumptions in my code, and don't expect them to fail me any time soon.

#17 Antheus   Members   -  Reputation: 2369

Like
-2Likes
Like

Posted 23 April 2012 - 07:21 AM

the code relies on this behavior


Algorithm you're dealing with here is black box. There is no division that would make it more portable or better.

Because it gets worse. If you rely on exactly identical behavior, then you not only need to properly emulate operation, but also overflow behavior and CPU flags. And perhaps something else, who knows, there might be an interrupt hooked on overflow.

Code like this wasn't written for longevity, it's defined by the code itself.


As for >> and signed types - it wasn't until recently that developers finally realized that it doesn't work quite correctly for negative numbers. Before, >> was the equivalent to division. Best part - nobody noticed it wasn't correct.

Then there's matter of larger values - does it perform identically across 16/32/64 register sizes?

If trying for emulation, then the rabbit hole goes really deep.

#18 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 12:18 PM

Am I the only one who didn't understand the point of that last post at all?

Because it gets worse. If you rely on exactly identical behavior, then you not only need to properly emulate operation, but also overflow behavior and CPU flags. And perhaps something else, who knows, there might be an interrupt hooked on overflow.

First of all, this is C, I don't care about what the flags are or anything (the compiler takes care about that). And this code assumes overflow shouldn't happen (if it does then it's considered a bug in the game).

Second, the only behavior that matters here is the result the function returns, not what the compiler decides to do inside (though ideally the compiler should not do stupid things to achieve it!).

As for >> and signed types - it wasn't until recently that developers finally realized that it doesn't work quite correctly for negative numbers. Before, >> was the equivalent to division. Best part - nobody noticed it wasn't correct.

Pretty sure this issue was known for as long as bit shifting was used for this purpose. It would have appeared pretty quickly during debugging. (EDIT: assuming here you talk about the rounding issue, if you mean it not working on non-two's-complement hardware, nevermind)
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#19 Antheus   Members   -  Reputation: 2369

Like
-1Likes
Like

Posted 23 April 2012 - 12:40 PM

First of all, this is C, I don't care about what the flags are or anything (the compiler takes care about that). And this code assumes overflow shouldn't happen (if it does then it's considered a bug in the game).

Second, the only behavior that matters here is the result the function returns, not what the compiler decides to do inside (though ideally the compiler should not do stupid things to achieve it!).

No, the compiler is, by spec, required to do stupid things. Result of bitshift on a signed number is compiler-defined.

So whatever you're trying to emulate "correctly" has nothing to compare to.

The following function is correct:
int32_t speed_to_int(int32_t value) {
   if (value + comma_offset >= 0) return (value+comma_offset) / 256;
   return rand();
}
For a given compiler that chooses such behavior. C language does not forbid it, compiler that generates code like above is perfectly C standard compliant.

Since that is probably not what you intended, it is up to you what the supposed result will be and how it's implemented.

If log replayes across machines (as mentioned above) are important, then any behavior will do, as long as its consistent.

But there is no way to provide 1:1 correct version, since original one is undefined, the only thing one can do is compare against a specific result of a specific compiler.

#20 Sik_the_hedgehog   Members   -  Reputation: 958

Like
0Likes
Like

Posted 23 April 2012 - 12:54 PM

No, the compiler is, by spec, required to do stupid things. Result of bitshift on a signed number is compiler-defined.

By "stupid" I meant "don't try to pull off 20 lines when it could have done with 5" and stuff like that.

So whatever you're trying to emulate "correctly" has nothing to compare to.

Except I've already described what I'm expecting, and the whole point here is to write code that achieves what I'm expecting without relying on undefined behavior (ideally without making the compiler output needlessly complex code).
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.




Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS