Sign in to follow this  

Silly Question On Performance

This topic is 4297 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I'm trying to improve the performance of some C++ code. The logic is simple and I don't see a way to improve it. double test = 0.0; double array_1[MAX]; double array_2[MAX]; double array_result[MAX]; for(xy = 0; xy < MAX; xy++) { test += array_1[xy]; test += array_2[xy]; array_result[xy] = test * 0.50; //I want the average; I hear multiply is quicker than divide } I don't see a way to improve it by anything meaningful. Sorry if this post is kind of silly. My problem is I have fairly simple calculations to do, but I have to do them a repeated number of times.

Share this post


Link to post
Share on other sites
There's only two ways I can think of to improve the performance of that - use a more optimizing compiler or buy a faster computer.

Share this post


Link to post
Share on other sites
Fast math routines, that's about all I can think of.

The only other thing is not doing it in series since you have a finite set of data it shouldn't be that hard.
But this is C and the overhead of creating threads would probably make all potential performance nil for all but large data sets.

So if you have a large set of data, thread it.

Share this post


Link to post
Share on other sites
Quote:
Original post by anonuser
So if you have a large set of data, thread it.


Not going to help unless you have multiple CPUs. Even then, the loop is memory-bound. The only way I can think to make that might make that code faster would be to interleave the arrays in memory ( [A,B,C,A,B,C,...] rather than [A,A,A...] [B,B,B...] [C,C,C...] ) to better exploit locality of reference (and CPU cache).

Share this post


Link to post
Share on other sites
Quote:
Original post by Fruny
Quote:
Original post by anonuser
So if you have a large set of data, thread it.


Not going to help unless you have multiple CPUs. Even then, the loop is memory-bound. The only way I can think to make that might make that code faster would be to interleave the arrays in memory ( [A,B,C,A,B,C,...] rather than [A,A,A...] [B,B,B...] [C,C,C...] ) to better exploit locality of reference (and CPU cache).


I did forget to mention the CPU part, thanks for the correction.


Share this post


Link to post
Share on other sites
You could possibly do some parallel processing with a single CPU that supports sse2:

http://www.gamedev.net/reference/articles/article1987.asp

Also, if you don't need the precision, I would use floats instead of doubles. Doing so can allow you to do 4 floating point operations at once. It's also just less data to move around not to mention 32bit cpus generally like 32bit vars.

Share this post


Link to post
Share on other sites
Quote:
Original post by fathom88
I'm trying to improve the performance of some C++ code. The logic is simple and I don't see a way to improve it.


double test = 0.0;
double array_1[MAX];
double array_2[MAX];
double array_result[MAX];
for(xy = 0; xy < MAX; xy++)
{
test += array_1[xy];
test += array_2[xy];

array_result[xy] = test * 0.50;


You say you want the average when you are dividing test by 2, i think you need to set test back to zero.
Yes, multiplying is faster, but a good compiler will optimize that for you.
xy++, typically you want to do ++xy. Although it does not matter unless xy is some sort of object.
You could thread if you had multiple CPUs, but most likely you (and whoever is going to run this) do not.
The only optimization I see that would make a difference is to unfurl your loop. "xy < MAX" takes the CPU about 2 clock cycles to calculate each iteration. Also, the conditional jump will interfere with your processing pipeline.. yadda yadda, but most CPU's are so damned smart that you will only loose a couple clock cycles in total due to the conditional part of the for loop. So if you know what MAX is before hand, you can get rid of the for loop and write as many lines as need be to get the job done. It will save you about 2 clock cycles * MAX = (probably less than a microsecond). :(
The loop unfurl technique works well with memory copies that are used an insane amount of times... works well => saves milliseconds. Again, :(

You shouldn't worry about optimizing small amounts of code like this, because the compiler will optimize it for you.

Share this post


Link to post
Share on other sites
Quote:
Original post by ProphecyEye
You could possibly do some parallel processing with a single CPU that supports sse2:

http://www.gamedev.net/reference/articles/article1987.asp

Also, if you don't need the precision, I would use floats instead of doubles. Doing so can allow you to do 4 floating point operations at once. It's also just less data to move around not to mention 32bit cpus generally like 32bit vars.


It would be less data to move around, but 32 bit cpus don't execute anything with floats. They let the FPU handle that. And the FPU converts both 32bit and 64bit floats to 80bit floats before calculating anything. And what are you talking about (floats) would allow 4 floating point operations at once? I don't think the FPU knows how to do that.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
That is the perfect code to optimize with SSE/AMD 3DNow. That's a classic example of a vector computation.

Share this post


Link to post
Share on other sites
Quote:
Original post by blaze02
Quote:
Original post by ProphecyEye
You could possibly do some parallel processing with a single CPU that supports sse2:

http://www.gamedev.net/reference/articles/article1987.asp

Also, if you don't need the precision, I would use floats instead of doubles. Doing so can allow you to do 4 floating point operations at once. It's also just less data to move around not to mention 32bit cpus generally like 32bit vars.


It would be less data to move around, but 32 bit cpus don't execute anything with floats. They let the FPU handle that. And the FPU converts both 32bit and 64bit floats to 80bit floats before calculating anything. And what are you talking about (floats) would allow 4 floating point operations at once? I don't think the FPU knows how to do that.


I heard this a year ago and thought that the performance difference between floats and doubles would be almost non-existant because of this After seeing some benchmarks between floats and doubles my friend did with one of his projects though, I was shocked to find a huge difference. I don't know my x86 cpus well enough to know why there is such a big difference, I only know that there is one and it's well worth taking advantage of if possible.

If I were to guess though I would imagine it partly has to do with moving half the data around to/from memory and partly to do with compilers getting smart enough to generate SSE instructions for float calculations.

Share this post


Link to post
Share on other sites
Don't discount "less data to move around". Especially with big datasets. Getting something from memory is really slow compared to doing pretty much anything once you have it at the processor. And for really big datasets, paging is really really really slow compared to processors. You can chew a lot of floats in the time it takes to take care of one page fault.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
How much is MAX exactly? I certainly hope it's not > 2); xy = xy + 4)
{
array_result[xy + 0] = 0.5 * (array_1[xy + 0] + array_2[xy + 0]);
array_result[xy + 1] = 0.5 * (array_1[xy + 1] + array_2[xy + 1]);
array_result[xy + 2] = 0.5 * (array_1[xy + 2] + array_2[xy + 2]);
array_result[xy + 3] = 0.5 * (array_1[xy + 3] + array_2[xy + 3]);
}
// remainder, might not be needed
for (xy = 0; xy <

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
How much is MAX exactly? I certainly hope it's not under 32.

As for the loop itself, there's few things you can do.

First is looking into L1 data cache organization on the processor you're using. It may get you quite some performance increase, if you're using a single input, single output array, or even a single array for all the data.

Second, loop unroling. Some compilers may do this by themself, others might not. The for loop contains an if statement on every iteration. That can mess with pipeline, so this is almost always faster. Actual size of unrolled loop depends on available cache space.


for(xy = 0; xy < (4 * MAX); xy = xy + 4)
{
array_result[xy + 0] = 0.5 * (array_1[xy + 0] + array_2[xy + 0]);
array_result[xy + 1] = 0.5 * (array_1[xy + 1] + array_2[xy + 1]);
array_result[xy + 2] = 0.5 * (array_1[xy + 2] + array_2[xy + 2]);
array_result[xy + 3] = 0.5 * (array_1[xy + 3] + array_2[xy + 3]);
}
// remainder, might not be needed
for (xy = 0; xy < (MAX - (MAX && 0x3)); xy++) {
array_result[xy] = 0.5 * (array_1[xy] + array_2[xy]);
}



Third, is converting the above either partially or fully into assembly, using one of the vector extensions.

Last, although perhaps it should be first, is what are you trying to acomplish in the first place. Do you really need to calculate the data like this, or could this all be processed while populating the arrays in the first place? How often is the code called? How many elements? Is it really a bottleneck?

Unless MAX is well over 1M, and you need to process this several times per second, performance of this loop shouldn't be the issue, and if it is, you'll need to aproach the problem differently.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
[quote]Original post by Anonymous Poster

for(xy = 0; xy < (4 * MAX); xy = xy + 4)
for (xy = 0; xy < (MAX - (MAX && 0x3)); xy++) {
}

[/quote]
Both of your loops iterate wrongly. The first one goes 4 times too high, the second one starts from 0, uses logical AND etc..

Share this post


Link to post
Share on other sites
Another thing you could try would be to always perform the multiplication on the data from the previous iteration. I'm not sure how much (if anything) you'd gain from it, but it would remove some dependencies from the loop body (the multiplication would no longer depend on the addition performed in the same iteration)

Share this post


Link to post
Share on other sites

This topic is 4297 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this