Unfadable

Quick average of 2 numbers

Does anyone know of any slick tricks to calculate the average of 2 numbers without using addition? It's okay if the result is off by 1; I'm just looking for a fast technique. I thought of using look-up tables, but couldn't come up with a way to index into them quickly. I'm thinking there's a bit trick. Thanks

Umm, why do you want to speed up addition? Adding two numbers together and dividing by two should be extremely quick (you could use a bit shift for the division if you really want to).

There's probably a good reason; I'm just a little confused... [looksaround]

I guess I should've posted this in general.

I'm trying to speed up my image blending operation and am willing to sacrifice some quality to do it. The add-and-shift trick isn't quite fast enough for the amount of blending I'm doing.

Consider that shifting isn't necessarily faster than multiplication or division on modern processors (division is still slow, however). Your compiler may even decide to replace your shift operator with anything it considers better.

See the topic discussed here:
Is shift still worth it?

The bottom line is: let the compiler do the optimization, and write your code in such a way that it can do it well.

Hey,

This sounds like a time I'd consider using SSE instructions. Either that, or do it on the GPU or something. The more parallelism you can get here the better, IMO, and you're unlikely to do better than the GPU.

--CJM

Quote:
Original post by Wavarian
I guess you could speed it up a little by replacing the division with a multiplication... x * 0.5f

*shrugs*

I doubt this would end up being faster than the bit-shift option suggested since it would require a conversion to float, then back to int for the RGB value.

Quote:
Original post by Wavarian
I guess you could speed it up a little by replacing the division with a multiplication... x * 0.5f

This ranks right up there with the shift "optimization." It is something that the compiler can do if such a change is worth it.

*edit: And I doubt anything will be worth it. Go back a step, and see if there's some way to blend multiple pixels at once. Chances are, that's your best bet, short of using somebody else's library.

CM

Without...addition?

So, the relatively slow division operation doesn't bother you at all, but you're concerned about the addition? Addition is the absolute fastest instruction most processors provide, in most (all?) cases. A P4 can accomplish 4 add instructions in a single cycle (2-way exec plus double-pumped ALUs). It's basically free.

Quote:
Original post by Promit
Without...addition?

So, the relatively slow division operation doesn't bother you at all, but you're concerned about the addition? Addition is the absolute fastest instruction most processors provide, in most (all?) cases. A P4 can accomplish 4 add instructions in a single cycle (2-way exec plus double-pumped ALUs). It's basically free.
Absolutely!

Addition practically is the 'bit trick' you were looking for. Profile first to see where your bottleneck is. Then, if you find it really is the addition, take shortcuts by doing it less often, or by doing many in parallel. Seriously though, you're basically barking up the wrong tree.

Sorry, guess I didn't provide enough info.

I'm working on an ARM processor. No SSE, no graphics hardware. CPU speed is 100MHz.

Also, I'm working with integers, and my ARGB values are represented by a short; 0xffff is full white. The reason I wanted to avoid addition is that performing several bit-wise ops on the components doesn't seem to hurt performance much, but when I throw in that addition, things slow down quite a bit.

So I figured that dealing with components whose values range from 0-15 seems like it lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though). The whole table ought to fit in cache, if I could come up with a technique. I've seen look-up tables used in a similar fashion, but the components were added first and the result was an index into the array. If possible, I'd like to get rid of the add step.

It also seems like there could be a bit-op way to get a value close to the actual average (a little error is acceptable in my case), so I've been trying things out, but to no avail.

You can always process 2 RGB values at a time, assuming you've got 4 channels of 4 bits each:


typedef unsigned short u16; // Must be at least 16 bits wide!

u16 average(u16 a, u16 b)
{
    u16 a2 = a & 0x0f0f;  // Isolate nibbles 0 and 2
    u16 b2 = b & 0x0f0f;
    a &= 0xf0f0;          // Isolate nibbles 1 and 3
    b &= 0xf0f0;
    a2 += b2;             // Sums fit their fields; no carry reaches the next one
    a2 >>= 1;
    a2 &= 0x0f0f;
    a >>= 1;              // Shift before adding to avoid overflow
    b >>= 1;
    a += b;
    a &= 0xf0f0;
    return a | a2;
}

This boils down to 14 instructions on a P4 that could be pipelined quite well.
Still got two additions, but you maybe had four before?
Disclaimer: This code is untested, but the theory still holds.

See, you really should've mentioned that it's an embedded system beforehand. There's a slight difference between a P4 at 3.06GHz and an ARM at 100MHz [lol] Also, the fact that you're using crazy colors would have been relevant earlier...

Quote:
Original post by Unfadable
performing several bit-wise ops on the components doesn't seem to hurt performance much, but when I throw in that addition, things slow down quite a bit.

I highly doubt that. Of course, it's theoretically possible, but addition is usually just about the fastest operation you can perform (and it certainly doesn't take more than a single cycle).

Quote:

So I figured that dealing with components whose values range from 0-15 seems like it lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though).

Addition is *definitely* faster than lookup tables, even if they fit in cache. I suspect that the version using addition has some other bottleneck holding it back.

Quote:
Original post by Spoonbender
Quote:

So I figured that dealing with components whose values range from 0-15 seems like it lends itself well to look-up tables, provided that cache access is fast enough (maybe it isn't, though).

Addition is *definitely* faster than lookup tables, even if they fit in cache. I suspect that the version using addition has some other bottleneck holding it back.

And besides, a lookup table would require an addition to offset from the base pointer of your table anyway.

If your processor is 32-bit, here's a variation on eq's code that may be slightly faster:

typedef unsigned int u32; // Assuming unsigned int is 32 bits wide

u16 average(u16 a, u16 b)
{
    u32 a2 = (u32)a << 17;
    u32 b2 = (u32)b << 17;
    a2 |= a;               // Both copies of each pixel in one register
    b2 |= b;
    a2 &= 0x1e1ef0f0;      // Nibbles 0,2 in the high half; 1,3 in the low half
    b2 &= 0x1e1ef0f0;
    a2 += b2;
    a2 >>= 1;
    a2 &= 0x1e1ef0f0;
    return (u16)((a2 >> 17) | a2);
}

Compared to eq's code:
1 extra shift
2 extra ORs
1 fewer addition
3 fewer ANDs

BTW, make sure your compiler is inlining the code, otherwise the overhead of the function call will far outweigh these optimisations

[Edited by - joanusdmentia on October 21, 2005 6:43:17 PM]

Quote:
Original post by Prototype
Consider that shifting isn't necessarily faster than multiplication or division on modern processors (division is still slow, however). Your compiler may even decide to replace your shift operator with anything it considers better.

See the topic discussed here:
Is shift still worth it?

The bottom line is, let the compiler do the optimization and write your code in such a way it can do it well.
Very true, but misleading in this case.

You must often let the compiler know that you're willing to sacrifice rounding towards zero before it will use right shifts and ANDs instead of signed division and modulo. In other words, you can't just assume that the compiler will translate "(a + b) / 2" into a right shift.

Here's a quick example of what kind of code VC6 generates for the following operations (with maximum optimizations enabled of course):
int divide(int input) {
    return input / 2;
    // cdq
    // sub eax,edx
    // sar eax,1
}

int shift(int input) {
    return input >> 1;
    // sar eax,1
}

int modulo(int input) {
    return input % 2;
    // and eax,80000001h
    // jns continue
    // dec eax
    // or  eax,fffffffeh
    // inc eax
    // continue:
}

int and(int input) {
    return input & 1;
    // and eax,1
}

Hi guys, thanks for the replies.

So, about the addition thing: I was initially looking at a component-wise blend like this:
red = (((pixel1 & 0x0f00) + (pixel2 & 0x0f00)) >> 1) & 0x0f00

From what I understand, ARM processors can perform a barrel shift on one operand of an instruction for free. With the expression above, both inner &'s could take the shift (shift the pixel, then & it with 0x0780), the assignment could take the shift, and the last & could get the shift, but I couldn't see a way for either of the addition's operands to get the free shift. So my conclusion was that I can't piggy-back that shift onto an add instruction, which is why I hoped to remove it.

Looking at joanusdmentia's method, I see a couple of good opportunities for the free shift, but I think when it comes to the addition, no luck. I'll work through joanus's and eq's methods on paper tomorrow to see if this is indeed the case (I'm a little tired at the moment). Btw, thanks for those tips, guys.

Quote:

Wouldn't look-up tables require addition (for the pointer)?

Yep, but I wanted to get rid of the extra addition for the index, e.g. lookup_table[a+b]. The barrel shift can be used on load/store instructions, so I could come away with just the one add, if I could get unique indices with a slick shift trick.

Btw, I read in a white paper at arm.com that SoC ARM L1 cache accesses are usually 1 cycle. I need to verify this, but if it's true, then investigating look-up tables seems worthwhile.

[Edited by - Unfadable on October 22, 2005 1:24:33 AM]

Quote:
Its okay if the number is off by 1

Didn't see this at first. Here's a fast way of doing the averaging with some loss of precision:

u16 average(u16 a, u16 b)
{
    a >>= 1;
    b >>= 1;
    a &= 0x7777;         // Clear the bit shifted in from the next component
    b &= 0x7777;
    return a + b;        // Off by at most 1 (both low bits are discarded)
}

A faster (than my previous) method with full precision:


u16 average(u16 a, u16 b)
{
    u16 c = a & b;   // Where both low bits are set, we need to add one to the result
    c &= 0x1111;     // Only interested in the lowest bit of each component
    a &= 0xeeee;     // Remove the lowest bit of each component of a
    b &= 0xeeee;     // Remove the lowest bit of each component of b
    a >>= 1;         // Shift down to avoid overflow
    b >>= 1;         // Shift down to avoid overflow
    return a + b + c; // Sum it all up
}

Edit: Cleaned up the code so it's easier to understand what's going on.

