# Quick average of 2 numbers


## Recommended Posts

Does anyone know of any slick tricks to calculate the average of 2 numbers without using addition? It's okay if the result is off by 1; I'm just looking for a fast technique. I thought of using look-up tables, but couldn't come up with a way to index into them quickly. I'm thinking there's a bit trick. Thanks

##### Share on other sites
Umm, why do you want to speed up addition? Adding two numbers together and dividing by two should be extremely quick (you could use a bit shift for the division if you really want to).

There's probably a good reason; I'm just a little confused... [looksaround]

##### Share on other sites
You could use a look-up table, but that would probably be slower than adding them and dividing by 2.

##### Share on other sites
I don't know of any speed tricks, but one useful formula is (a & b) + ((a ^ b) / 2), which avoids integer overflow from the addition.
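The trick works because a & b collects the bits the two operands share (each contributes a full unit), while (a ^ b) / 2 — equivalently (a ^ b) >> 1 for non-negative values — contributes half of each bit where they differ. A minimal sketch (the function name is mine, not from the thread):

```c
#include <assert.h>
#include <limits.h>

/* Overflow-safe average: the shared bits (a & b) count in full, and the
   differing bits (a ^ b) count half. Rounds down. Name is illustrative. */
static unsigned avg_no_overflow(unsigned a, unsigned b)
{
    return (a & b) + ((a ^ b) >> 1);
}
```

Unlike (a + b) / 2, this never forms the intermediate sum, so avg_no_overflow(UINT_MAX, UINT_MAX) is still UINT_MAX instead of wrapping.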

##### Share on other sites
I guess I should've posted this in general.

I'm trying to speed up my image blending operation and am willing to sacrifice some quality to do it. The add-and-shift trick isn't quite fast enough for the amount of blending I'm doing.

##### Share on other sites
Consider that shifting isn't necessarily faster than multiplication or division on modern processors (division is still slow, however). Your compiler may even decide to replace your shift operator with whatever it considers better.

See the topic discussed here:
Is shift still worth it?

The bottom line is, let the compiler do the optimization and write your code in such a way it can do it well.

##### Share on other sites
Hey,

This sounds like a time I'd consider using SSE instructions. Either that or do it on the GPU or something. The more parallel processing you can get, the better IMO here, and you're unlikely to get better than the GPU.

--CJM

##### Share on other sites
I guess you could speed it up a little by replacing the division with a multiplication... x * 0.5f

*shrugs*

##### Share on other sites
Quote:
 Original post by Wavarian
 I guess you could speed it up a little by replacing the division with a multiplication... x * 0.5f
 *shrugs*

I doubt this would end up being faster than the bit-shift option suggested since it would require a conversion to float, then back to int for the RGB value.

##### Share on other sites
Quote:
 Original post by Wavarian
 I guess you could speed it up a little by replacing the division with a multiplication... x * 0.5f

This ranks right up there with the shift "optimization." It is something that the compiler can do if such a change is worth it.

*edit: And I doubt anything will be worth it. Go back a step, and see if there's some way to blend multiple pixels at once. Chances are, that's your best bet, short of using somebody else's library.

CM

##### Share on other sites

So, the relatively slow division operation doesn't bother you at all, but you're concerned about the addition? Addition is the absolute fastest instruction most processors provide, in most (all?) cases. A P4 can accomplish 4 add instructions in a single cycle (2-way exec plus double-pumped ALUs). It's basically free.

##### Share on other sites
Quote:
 Original post by Promit
 Without... addition? So, the relatively slow division operation doesn't bother you at all, but you're concerned about the addition? Addition is the absolute fastest instruction most processors provide, in most (all?) cases. A P4 can accomplish 4 add instructions in a single cycle (2-way exec plus double-pumped ALUs). It's basically free.
Absolutely!

Addition practically is the 'bit trick' you were looking for. Profile first to see where your bottleneck is. Then if you find it is actually there, then simply take shortcuts by doing it less often, or doing many in parallel. Seriously though, you're basically barking up the wrong tree.

##### Share on other sites
Sorry, guess I didn't provide enough info.

I'm working on an ARM processor. No SSE, no graphics hardware. CPU speed is 100 MHz.

Also, I'm working with integers, and my ARGB values are represented by a short; 0xffff is full white. The reason I wanted to avoid addition is that performing several bit-wise ops on the components doesn't hurt performance much, but when I throw in that addition, things slow down quite a bit.

So I figured that dealing with components whose values range from 0-15 lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though). The whole table ought to be able to fit in cache, if I could come up with a technique. I've seen look-up tables used in a similar fashion, but the components were added first and the result was an index into the array. If possible, I'd like to get rid of the add step.

It also seems like there could be a bit-op way to get a value close to the actual average (a little error is acceptable in my case), so I've been trying stuff out, but to no avail.

##### Share on other sites
You can always process two RGB values at a time, assuming you've got 4 channels of 4 bits each:

```c
typedef unsigned short u16; // Must be at least 16 bits wide!

u16 average(u16 a, u16 b)
{
    u16 a2 = a & 0x0f0f;
    u16 b2 = b & 0x0f0f;
    a &= 0xf0f0;
    b &= 0xf0f0;
    a2 += b2;
    a2 >>= 1;
    a2 &= 0x0f0f;
    a >>= 1;
    b >>= 1;
    a += b;
    a &= 0xf0f0;
    return a | a2;
}
```

This boils down to 14 instructions on a P4 that could be pipelined quite well.
Still got two additions, but maybe you had four before?
Disclaimer: This code is untested, but the theory still holds.
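Since the snippet above is explicitly untested, it's cheap to spot-check against a per-channel reference. The harness below reproduces the routine (renamed to eq_average; the reference function and sampling stride are my own):

```c
#include <assert.h>

typedef unsigned short u16;

/* eq's packed average, reproduced from the post above */
static u16 eq_average(u16 a, u16 b)
{
    u16 a2 = a & 0x0f0f;
    u16 b2 = b & 0x0f0f;
    a &= 0xf0f0;
    b &= 0xf0f0;
    a2 += b2;   /* low nibbles: carries land in the masked-out gap bits */
    a2 >>= 1;
    a2 &= 0x0f0f;
    a >>= 1;    /* high nibbles: halve first so the u16 sum can't overflow */
    b >>= 1;
    a += b;
    a &= 0xf0f0;
    return a | a2;
}

/* Reference: average each 4-bit channel separately, truncating */
static u16 per_channel_average(u16 a, u16 b)
{
    u16 r = 0;
    for (int shift = 0; shift < 16; shift += 4) {
        unsigned ca = (a >> shift) & 0xf;
        unsigned cb = (b >> shift) & 0xf;
        r |= (u16)(((ca + cb) >> 1) << shift);
    }
    return r;
}

/* Strided sample of the 2^32 input space */
static void check_eq_average(void)
{
    for (unsigned a = 0; a < 0x10000; a += 61)
        for (unsigned b = 0; b < 0x10000; b += 67)
            assert(eq_average((u16)a, (u16)b) == per_channel_average((u16)a, (u16)b));
}
```

The check passes for every sampled pair: masking before the add keeps each channel's carry inside the zeroed gap bits, so the result is the exact truncated per-channel average.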

##### Share on other sites
Hey that looks pretty good, I did a couple of quick tests and the results are correct. And there are fewer total ops, so hopefully I'll get some speed-ups.

Thanks!

##### Share on other sites
See, you really should've mentioned that it's an embedded system beforehand. There's a slight difference between a P4 at 3.06 GHz and an ARM at 100 MHz [lol] Also, the fact that you're using crazy colors would have been relevant earlier...

##### Share on other sites
Yep, you're right. Again I apologize for not providing enough info up front.

I actually didn't realize there was a way to combine the whole colors at once.

##### Share on other sites
Quote:
 Original post by Unfadable
 it seems like performing several bit-wise ops on the components doesn't bother performance too much, but when I throw in that addition, things slow down quite a bit.

I highly doubt that. Of course, it's theoretically possible, but addition is usually just about the fastest operation you can perform (And certainly doesn't take more than a single cycle)

Quote:
 So I figured that dealing with components whose values range from 0-15 lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though).

Addition is *definitely* faster than lookup tables, even if they fit in cache. I suspect that the version using addition has some other bottleneck holding it back.

##### Share on other sites
Quote:
Original post by Spoonbender
Quote:
 So I figured that dealing with components whose values range from 0-15 lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though).

Addition is *definitely* faster than lookup tables, even if they fit in cache. I suspect that the version using addition has some other bottleneck holding it back.

And besides, a lookup table would require an addition to offset from the base pointer of your table anyway.

If your processor is 32-bit, here's a variation on eq's code that may be slightly faster:
```c
u16 average(u16 a, u16 b)
{
    u32 a2 = a << 17;
    u32 b2 = b << 17;
    a2 |= a;
    b2 |= b;
    a2 &= 0x1e1ef0f0;
    b2 &= 0x1e1ef0f0;
    a2 += b2;
    a2 >>= 1;
    a2 &= 0x1e1ef0f0;
    return (a2 >> 17) | a2;
}
```

Compared to eq's code:
1 extra shift
2 extra ORs
3 fewer ANDs

BTW, make sure your compiler is inlining the code, otherwise the overhead of the function call will far outweigh these optimisations.
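This variant folds all four channels into a single 32-bit add, so it deserves the same sanity check as eq's version; a sketch (renamed, with a u32 typedef and a cast added — the per-channel reference and stride are mine):

```c
#include <assert.h>

typedef unsigned short u16;
typedef unsigned int   u32;  /* assumed 32 bits wide, as the masks require */

/* joanusdmentia's 32-bit variant, reproduced from the post above */
static u16 joanus_average(u16 a, u16 b)
{
    u32 a2 = (u32)a << 17;  /* copy the low nibbles above bit 16... */
    u32 b2 = (u32)b << 17;
    a2 |= a;                /* ...while the high nibbles stay in the low half */
    b2 |= b;
    a2 &= 0x1e1ef0f0;
    b2 &= 0x1e1ef0f0;
    a2 += b2;               /* one add averages all four channels */
    a2 >>= 1;
    a2 &= 0x1e1ef0f0;
    return (u16)((a2 >> 17) | a2);
}

/* Reference: average each 4-bit channel separately, truncating */
static u16 per_channel_average(u16 a, u16 b)
{
    u16 r = 0;
    for (int shift = 0; shift < 16; shift += 4) {
        unsigned ca = (a >> shift) & 0xf;
        unsigned cb = (b >> shift) & 0xf;
        r |= (u16)(((ca + cb) >> 1) << shift);
    }
    return r;
}

/* Strided sample of the input space */
static void check_joanus_average(void)
{
    for (unsigned a = 0; a < 0x10000; a += 61)
        for (unsigned b = 0; b < 0x10000; b += 67)
            assert(joanus_average((u16)a, (u16)b) == per_channel_average((u16)a, (u16)b));
}
```

The spread-out mask 0x1e1ef0f0 leaves a zero gap bit above every channel, so the single add can't carry between channels and the result matches the per-channel average exactly.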

[Edited by - joanusdmentia on October 21, 2005 6:43:17 PM]

##### Share on other sites
Quote:
 Original post by Prototype
 Consider that shifting isn't necessarily faster than multiplication or division on modern processors (division is still slow, however). Your compiler may even decide to replace your shift operator with anything it considers better. See the topic discussed here: Is shift still worth it? The bottom line is, let the compiler do the optimization and write your code in such a way it can do it well.
Very true, but misleading in this case.

You often have to let the compiler know that you're willing to sacrifice rounding toward zero by using right shifts and ANDs instead of signed division and modulo. In other words, you can't just assume that the compiler will translate "(a + b) / 2" into a right shift.

Here's a quick example of what kind of code VC6 generates for the following operations (with maximum optimizations enabled of course):
```c
int divide(int input) {
	return input / 2;
	// cdq
	// sub eax,edx
	// sar eax,1
}

int shift(int input) {
	return input >> 1;
	// sar eax,1
}

int modulo(int input) {
	return input % 2;
	// and eax,80000001h
	// jns continue
	// dec eax
	// or  eax,fffffffeh
	// inc eax
	// continue:
}

int and(int input) {
	return input & 1;
	// and eax,1
}
```
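The extra instructions exist because of a real semantic difference: C's signed division truncates toward zero, while an arithmetic right shift rounds toward negative infinity, and similarly % 2 can be negative where & 1 never is. A small demonstration (note that right-shifting a negative value is implementation-defined in C, though mainstream compilers all do an arithmetic shift):

```c
#include <assert.h>

/* C signed division truncates toward zero */
static int div2(int x)   { return x / 2; }

/* Arithmetic right shift rounds toward negative infinity
   (implementation-defined for negative x, but universal in practice) */
static int shift2(int x) { return x >> 1; }

/* The sign of the remainder follows the dividend */
static int mod2(int x)   { return x % 2; }

/* ...which is why & 1 is not a drop-in replacement for % 2 on signed ints */
static int and1(int x)   { return x & 1; }
```

For unsigned operands the pairs agree, which is why simply switching to unsigned types often lets the compiler emit the bare shift without being asked.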

##### Share on other sites
Wouldn't look-up tables require addition (for the pointer)?

##### Share on other sites
Hi guys, thanks for the replies.

So about the addition thing, I was initially looking at a component-wise blend like this:

red = (((pixel1 & 0x0f00) + (pixel2 & 0x0f00)) >> 1) & 0x0f00

From what I understand, ARM processors can perform a barrel shift on one operand of an instruction for free. With the expression above, both inner &'s could take the shift (shift the pixel, then & it with 0x0780), the assignment could take the shift, and the last & could take the shift, but I couldn't see a way for either of the addition's operands to get the free shift. So my conclusion was that I can't piggy-back that shift onto an add instruction, which is why I hoped to remove it.

Looking at joanusdmentia's method, I see a couple of good opportunities for the free shift, but when it comes to the addition, I think no luck. I'll pen-and-paper joanus' and eq's methods tomorrow to see if this is indeed the case (I'm a little tired at the moment). Btw, thanks for those tips, guys.

Quote:
 Wouldn't look-up tables require addition (for the pointer)?

Yep, but I wanted to get rid of the extra addition for the index, e.g. lookup_table[a + b]. The barrel shift can be used on load/store instructions, so I could come away with just the one add, if I could get unique indices with a slick shift trick.

Btw, I read in a white paper at arm.com that SoC ARM L1 cache accesses are usually 1 cycle. I need to verify this, but if it's true then investigating look-up tables seems worthwhile.
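If the table route still looks attractive, one way to get unique indices without an add is to pack the two 4-bit operands into a single byte with a shift and an OR — which an ARM data-processing instruction can fold together — leaving the load's base-plus-offset addressing as the only remaining "add". A hypothetical sketch (table size, layout, and all names here are mine, not from the thread):

```c
#include <assert.h>

/* 256-entry table: the index (a << 4) | b is formed with shift+OR instead
   of add; the two nibbles occupy disjoint bit ranges, so OR equals add. */
static unsigned char avg_tab[256];

static void init_avg_tab(void)
{
    for (int a = 0; a < 16; ++a)
        for (int b = 0; b < 16; ++b)
            avg_tab[(a << 4) | b] = (unsigned char)((a + b) >> 1);
}

static unsigned avg_nibble(unsigned a, unsigned b)
{
    return avg_tab[((a & 0xf) << 4) | (b & 0xf)];
}
```

Whether this actually beats the packed-add versions depends entirely on that 1-cycle L1 claim, so it's worth benchmarking both on the target.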

[Edited by - Unfadable on October 22, 2005 1:24:33 AM]

##### Share on other sites
Quote:
 It's okay if the number is off by 1

Didn't see this at first; here's a fast way of doing the averaging with some loss of precision:

```c
u16 average(u16 a, u16 b)
{
    a >>= 1;
    b >>= 1;
    a &= 0x7777;
    b &= 0x7777;
    return a + b;
}
```

A faster (than my previous) method with full precision:

```c
u16 average(u16 a, u16 b)
{
    u16 c = a & b;    // If an element's lowest bit is set in both numbers, we need to add one to the result
    c &= 0x1111;      // Only interested in the lowest bit
    a &= 0xeeee;      // Remove the lowest bit of each element of a
    b &= 0xeeee;      // Remove the lowest bit of each element of b
    a >>= 1;          // Shift down to avoid overflow
    b >>= 1;          // Shift down to avoid overflow
    return a + b + c; // Sum it all up
}
```

Edit: Cleaned up the code so it's easier to understand what's going on.
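Both snippets are easy to sanity-check side by side; a small harness (the two functions are reproduced from the post, renamed; the checking loop is mine):

```c
#include <assert.h>

typedef unsigned short u16;

/* Approximate version: each channel's low bit is dropped before the add,
   so the result can be one below the true average. */
static u16 avg_approx(u16 a, u16 b)
{
    a >>= 1;
    b >>= 1;
    a &= 0x7777;
    b &= 0x7777;
    return a + b;
}

/* Full-precision version: c adds back the 1 that is lost exactly when a
   channel's low bit is set in both inputs. */
static u16 avg_exact(u16 a, u16 b)
{
    u16 c = a & b;
    c &= 0x1111;
    a &= 0xeeee;
    b &= 0xeeee;
    a >>= 1;
    b >>= 1;
    return a + b + c;
}

/* Check every channel over a strided sample of inputs */
static void check_averages(void)
{
    for (unsigned a = 0; a < 0x10000; a += 61) {
        for (unsigned b = 0; b < 0x10000; b += 67) {
            for (int s = 0; s < 16; s += 4) {
                unsigned want   = (((a >> s) & 0xf) + ((b >> s) & 0xf)) >> 1;
                unsigned exact  = (avg_exact((u16)a, (u16)b)  >> s) & 0xf;
                unsigned approx = (avg_approx((u16)a, (u16)b) >> s) & 0xf;
                assert(exact == want);                        /* always exact */
                assert(approx == want || approx + 1 == want); /* at most 1 low */
            }
        }
    }
}
```

The approximate version only errs (by exactly 1, always low) when both inputs have an odd channel value, which is precisely the case the full-precision version repairs with its c term.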

##### Share on other sites
Nice dude, thanks!