trs79

performance hit when accessing array in large loop

Hi all,

I have 4 nested loops in a C++ function such that there are 125,000 total iterations. In the innermost loop I do a simple write to a float array[5], i.e.

[code]
float array[5];

for (uint a = 0; a < 1000; a++)
{
    for (uint b = 0; b < 5; ++b)
    {
        for (uint c = 0; c < 5; ++b)
        {
            for (uint d = 0; d < 5; ++b)
            {
                unsigned int ctr = 0;
                float v = 23 * a * b;
                array[ctr] = v;
            }
        }
    }
}
[/code]

This code is part of a larger 3D test. If I comment out the "array[ctr] = v;" line I get 600 fps; if I leave it in, I lose over 200 fps and drop to ~350 fps. Could this be due to CPU cache issues? I'm really stumped on this one; thanks for any help.

[url=http://www.mvps.org/directx/articles/fps_versus_frame_time.htm]Don't measure in FPS, measure in milliseconds[/url].

[code]
60 fps: 16.67 ms/frame (target)
600 fps: 1.67 ms/frame (10% of target)
350 fps: 2.86 ms/frame (17% of target)
[/code]

You've still got about 13.81 ms/frame in your budget before you'll notice any performance hit.
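(The conversion is just the reciprocal, in case you want to print these yourself; a trivial sketch, with the FPS values hard-coded:)

[code]
#include <cstdio>

int main()
{
    // frame time in milliseconds is the reciprocal of frames per second
    const double fps_values[] = { 60.0, 600.0, 350.0 };
    for (int i = 0; i < 3; ++i)
        printf("%6.1f fps -> %5.2f ms/frame\n", fps_values[i], 1000.0 / fps_values[i]);
    return 0;
}
[/code]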

I wouldn't worry about it until performance actually becomes a problem, and when it does, the bottleneck may well turn out to be somewhere else entirely.

Thanks for the link, I see your point, but the drop from ~600 fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps, which is a greater relative performance drop (and closer to the 60 fps target). I'm concerned because I still have more code to add and I'm afraid it will drop even further. Does anyone have any clue why a simple memory write takes that much time? Thanks.

[quote name='trs79' timestamp='1318535519' post='4872280']
Thanks for the link, I see your point, but the drop from ~600 fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps, which is a greater relative performance drop (and closer to the 60 fps target). I'm concerned because I still have more code to add and I'm afraid it will drop even further. Does anyone have any clue why a simple memory write takes that much time? Thanks.
[/quote]

It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.

For your code... What is it that you are computing? How often are you computing it? Your code looks like it will always compute the same value.

[quote name='colinhect' timestamp='1318536274' post='4872286']
[quote name='trs79' timestamp='1318535519' post='4872280']
Thanks for the link, I see your point, but the drop from ~600 fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps, which is a greater relative performance drop (and closer to the 60 fps target). I'm concerned because I still have more code to add and I'm afraid it will drop even further. Does anyone have any clue why a simple memory write takes that much time? Thanks.
[/quote]

It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.

For your code... What is it that you are computing? How often are you computing it? Your code looks like it will always compute the same value.
[/quote]

I did read the article; the takeaway I got was that the lower the fps is when it drops, the greater the relative performance loss, i.e. the drop from 60 to 55 was worse than the drop from 900 to 400. Hence my drop from 170 to ~95 was worse on my target machine than the drop from 600 to 350 on my development machine. (I'm looking specifically at the part towards the end of the article: "If you've been paying attention, you should suspect that the 5FPS drop is a sign of a greater performance cost than the 450FPS drop seen with the first method! In fact, that 5FPS drop represents 36.4% more execution time than the 450FPS drop!")

[font="Arial"]The computation I posted is simplified for clarity, the actual code deals with creating an implicit surface for Smoothed particle hydrodynamics[font="Trebuchet MS, Arial, Helvetica"]. Like I mentioned earlier it's computed 125,000 times during the loops executions, and that function is called as fast as possible.[/font][/font][/font] So it seems whatever latency is caused by the array access is being accumulated to the point where it's noticeable. The rendered scene is actually very simple at this point.

Your code doesn't make sense: you have ++b in the two innermost loops, where I guess you want ++c and ++d. I assume that's not the code you actually use?
Always show your actual real code when asking a question.
Also, compile in Release mode. It is often 10x faster than Debug mode.

[quote]It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.[/quote]

That doesn't make sense either, though: the execution time increased by about 79%, from roughly 6 ms per frame (1000/170 ≈ 5.9 ms) to 11 ms (1000/95 ≈ 10.5 ms).

[quote]The computation I posted is simplified for clarity; the actual code deals with creating an implicit surface for smoothed particle hydrodynamics. Like I mentioned earlier, it's computed 125,000 times during the loop's execution, and that function is called as fast as possible. So it seems whatever latency is caused by the array access accumulates to the point where it's noticeable. The rendered scene is actually very simple at this point.[/quote]

That's an entirely different matter. Post your actual calculation function, as your problem probably has exactly nothing to do with changing a float value 125k times.
Again, make sure to compile in Release mode.

[quote name='trs79' timestamp='1318534024' post='4872269']
[...] If I comment out the "array[ctr] = v;" line I get 600 fps; if I leave it in, I lose over 200 fps and drop to ~350 fps.
[/quote]

If you comment out the "array[ctr] = v;" line, your compiler may decide that the whole piece of code doesn't have any effect and just optimize it into nothing. It's really not a good indication that the array access is slow. Array accesses are usually not slow at all. Also, the code you posted only ever writes to array[0], so I don't see the point.
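If you want the test to measure what you think it measures, give the result an observable use. A minimal sketch of the idea (the run_loops name and the volatile sink are just illustrative, not from your code):

[code]
#include <cstdio>

// Illustrative stand-in for the loop nest being timed.
static float run_loops()
{
    float array[5] = {0};
    for (unsigned a = 0; a < 1000; a++)
        for (unsigned b = 0; b < 5; ++b)
            for (unsigned c = 0; c < 5; ++c)
                for (unsigned d = 0; d < 5; ++d)
                    array[d] = 23.0f * a * b;

    // Fold the array into one value so the stores are actually "used".
    float sum = 0.0f;
    for (int i = 0; i < 5; ++i)
        sum += array[i];
    return sum;
}

int main()
{
    // Writing to a volatile forces the compiler to keep the computation;
    // without an observable side effect, a Release build can delete the
    // entire loop nest as dead code.
    volatile float sink = run_loops();
    printf("%f\n", (float)sink);
    return 0;
}
[/code]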

[quote name='Erik Rufelt' timestamp='1318536993' post='4872294']
Your code doesn't make sense: you have ++b in the two innermost loops, where I guess you want ++c and ++d. I assume that's not the code you actually use?
Always show your actual real code when asking a question.
Also, compile in Release mode. It is often 10x faster than Debug mode.

[quote]It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.[/quote]

That doesn't make sense either, though: the execution time increased by about 79%, from roughly 6 ms per frame (1000/170 ≈ 5.9 ms) to 11 ms (1000/95 ≈ 10.5 ms).

[quote]The computation I posted is simplified for clarity; the actual code deals with creating an implicit surface for smoothed particle hydrodynamics. Like I mentioned earlier, it's computed 125,000 times during the loop's execution, and that function is called as fast as possible. So it seems whatever latency is caused by the array access accumulates to the point where it's noticeable. The rendered scene is actually very simple at this point.[/quote]

That's an entirely different matter. Post your actual calculation function, as your problem probably has exactly nothing to do with changing a float value 125k times.
Again, make sure to compile in Release mode.
[/quote]

My apologies for the typo; I need to get in the habit of posting the real code. Here it is (it's actually taken from [url="http://www.ss.iij4u.or.jp/%7Eamada/fluid/"]http://www.ss.iij4u....p/~amada/fluid/[/url]; I'm in the process of modifying it and integrating it into my engine):

[code]
void Ren::sph_render_create_implicit(sph_render* r, float k, const vector3* pos, int n_particles)
{
    float stride;
    float inv_stride;

    int x, y, z;

    float fmin_x, fmax_x;
    float fmin_y, fmax_y;
    float fmin_z, fmax_z;

    int min_x, max_x;
    int min_y, max_y;
    int min_z, max_z;

    int vol_width;
    int vol_height;
    int vol_depth;

    vector3 piv; /* position in volume */

    stride = r->mc_stride;
    inv_stride = 1.0f/stride;

    fmin_x = fmin_y = fmin_z = .10;
    fmax_x = fmax_y = fmax_z = -.10;

    vec3_set(&r->volpos, fmin_x - r->mc_range*stride, fmin_y - r->mc_range*stride, fmin_z - r->mc_range*stride);

    min_x = 0;
    min_y = 0;
    min_z = 0;
    max_x = (int)((fmax_x - fmin_x)*inv_stride);
    max_y = (int)((fmax_y - fmin_y)*inv_stride);
    max_z = (int)((fmax_z - fmin_z)*inv_stride);

    if ((min_x < 0) || (min_y < 0) || (min_z < 0))
        printf("min_x=%d, min_y=%d, min_z=%d\n", min_x, min_y, min_z);

    vol_width = 32;
    vol_height = 32;
    vol_depth = 32;

    memset(r->volume, 0, vol_width*vol_height*vol_depth*sizeof(float));

    r->im_vol.density = r->volume;
    r->im_vol.width = vol_width;
    r->im_vol.height = vol_height;
    r->im_vol.depth = vol_depth;
    r->im_vol.voxelsize = stride;

    float rArray[5];
    float v;

    for (int i = 0; i < n_particles; i++) // n_particles is set to 1000
    {
        float bias = 1.0f;

        vec3_sub(&piv, &pos[i], &r->volpos);
        vec3_scale(&piv, 1.0f/stride, &piv);

        for (z = -r->mc_range; z <= r->mc_range; z++) // mc_range is set to 2, so from -2 to +2 is 5 iterations
        {
            for (y = -r->mc_range; y <= r->mc_range; y++)
            {
                int vindex = 0;
                unsigned int ctr = 0;

                for (x = -r->mc_range; x <= r->mc_range; x++)
                {
                    float h2_r2;
                    float dx, dy, dz;

                    dx = ((int)piv.x)*stride + r->volpos.x - pos[i].x + stride*x;
                    dy = ((int)piv.y)*stride + r->volpos.y - pos[i].y + stride*y;
                    dz = ((int)piv.z)*stride + r->volpos.z - pos[i].z + stride*z;

                    h2_r2 = max(r->iso_radius*r->iso_radius - (dx*dx + dy*dy + dz*dz), 0.0f);

                    v = k*h2_r2*h2_r2*h2_r2;

                    if ((vindex < 0) || (vindex > vol_width*vol_height*vol_depth))
                        continue;

                    rArray[ctr] = v; // this is the culprit line
                    ++ctr;
                }
            }
        }
    }

    r->vol_width = vol_width;
    r->vol_height = vol_height;
    r->vol_depth = vol_depth;
}
[/code]

The fps results are actually from a Release build; in Debug mode they're far lower. Thanks for any help. Yeah, I didn't think changing a float would be such an issue, but I can't figure out what else it could be.

Like alvaro said, if you comment out that "culprit line", the optimizer will remove everything else as well, so the whole inner loop is the actual culprit.

There you have 375k 'floor' ops from converting to int and then multiplying that int by a float. That is not a good thing. Try replacing the cast with 'floor' and see if you get any difference; floor isn't free in general though.
Try enabling SSE2 if you are compiling for 32-bit, as faster int/float conversion ops are used then. I'm not sure how many of those the compiler is allowed to optimize away. Try not to do arithmetic with both integers and floats together.

In addition you have 375k int * float operations from 'x/y/z * stride'.

Then you have like a million or two adds and subtracts, same for multiplies, and 250k branches.
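To make that concrete: the ((int)piv.x)*stride + r->volpos.x - pos[i].x part doesn't depend on x, y or z, so you can do the truncation and int-to-float conversion once per particle per axis instead of 375k times. A rough sketch of the idea, untested, and assuming the piv components are non-negative so floorf matches your (int) truncation:

[code]
/* Once per particle, before the z/y/x loops: */
float base_x = floorf(piv.x)*stride + r->volpos.x - pos[i].x;
float base_y = floorf(piv.y)*stride + r->volpos.y - pos[i].y;
float base_z = floorf(piv.z)*stride + r->volpos.z - pos[i].z;

for (z = -r->mc_range; z <= r->mc_range; z++)
{
    float dz  = base_z + stride*z;   /* hoisted out of the inner loops */
    float dz2 = dz*dz;
    for (y = -r->mc_range; y <= r->mc_range; y++)
    {
        float dy    = base_y + stride*y;
        float dy2z2 = dy*dy + dz2;
        for (x = -r->mc_range; x <= r->mc_range; x++)
        {
            float dx = base_x + stride*x;  /* the only per-x work left */
            float h2_r2 = max(r->iso_radius*r->iso_radius
                              - (dx*dx + dy2z2), 0.0f);
            float v = k*h2_r2*h2_r2*h2_r2;
            /* ... store v as before ... */
        }
    }
}
[/code]

Note that stride*x is still an int*float multiply; if that still shows up, you could precompute a 5-entry table of stride*x for x in [-mc_range, mc_range].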

[quote name='alvaro' timestamp='1318537330' post='4872295']
If you comment out the "array[ctr] = v;" line, your compiler may decide that the whole piece of code doesn't have any effect and just optimize it into nothing. It's really not a good indication that the array access is slow. Array accesses are usually not slow at all. Also, the code you posted only ever writes to array[0], so I don't see the point.
[/quote]

Sorry, there should have been a ++ctr line in there so more than just array[0] would have been written to. (I see why posting full code is best; I always seem to think simplifying it will help, but then it causes more problems....)

I think you all found the solution. If I compile in Debug mode I get ~211 fps without the array write and ~195 fps with it (on my fast machine), so the performance is very close, probably because in Debug mode the loops never get optimized out. But in Release mode without the array write, the loops do get optimized out, so in essence the large fps drop is because in one case there are 125,000 loop iterations and in the other there are none. That would also explain why the performance hit is a greater percentage on the slower laptop: the processor is slower.

Does that seem like a reasonable conclusion?

[quote name='Erik Rufelt' timestamp='1318538955' post='4872302']
Like alvaro said, if you comment out that "culprit line", the optimizer will remove everything else as well, so the whole inner loop is the actual culprit.

There you have 375k 'floor' ops from converting to int and then multiplying that int by a float. That is not a good thing. Try replacing the cast with 'floor' and see if you get any difference; floor isn't free in general though.
Try enabling SSE2 if you are compiling for 32-bit, as faster int/float conversion ops are used then. I'm not sure how many of those the compiler is allowed to optimize away. Try not to do arithmetic with both integers and floats together.

In addition you have 375k int * float operations from 'x/y/z * stride'.

Then you have like a million or two adds and subtracts, same for multiplies, and 250k branches.
[/quote]

Interesting, I hadn't realized that was such an expensive operation, but it makes sense; I'll try changing it so ints and floats aren't used together. Yeah, this whole function seems very expensive. Any other ideas on how to reduce the O(N^3) complexity? I guess maybe that's just the nature of SPH fluid surface generation; I've tried to think of ways around it but am stumped. Currently the compiler is set to use SSE2 (I'm using MSVC with the /arch:SSE2 compiler switch), and the disassembly shows use of the xmm0 register.

A couple of points:

[list][*]Timing in FPS is pointless. It's counterintuitive because you're dealing with the [i]reciprocal[/i] of the actual speed value, which means it follows a weird curve instead of a nice easy-to-understand linear pattern. Always talk about timings and performance in terms of how many milliseconds it takes to do some operation, not in terms of how many FPS you have. That was the point of the linked article from the beginning of the thread.[/list]

[list][*]You should pick up a profiler (I like Very Sleepy and Luke Stackwalker, personally; both are free) and see what it says. Guessing about performance bottlenecks is risky business. Even the best assembly hackers can't always look at a program and tell you where its performance issues will be. So the conventional advice is, if you're talking about performance, you should have some profiler numbers to prove out your assumptions about what's actually slow. In this case, the float/int/float conversion paths are probably hurting you the most, but you might discover other interesting things about the code if you throw a good profiler at it.[/list]
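If you just want a quick millisecond number before setting up a full profiler, wrapping the call in a timer works too. A minimal sketch using std::chrono (work_under_test is just a stand-in for whatever you want to measure, e.g. your sph_render_create_implicit call):

[code]
#include <chrono>
#include <cstdio>

// Stand-in for the code being measured, e.g. the call to
// Ren::sph_render_create_implicit(r, k, pos, n_particles).
static void work_under_test()
{
}

int main()
{
    using namespace std::chrono;

    steady_clock::time_point t0 = steady_clock::now();
    work_under_test();
    steady_clock::time_point t1 = steady_clock::now();

    // report cost in milliseconds, per the article's advice
    double ms = duration<double, std::milli>(t1 - t0).count();
    printf("cost: %.3f ms\n", ms);
    return 0;
}
[/code]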

Good luck!

Thanks for the info and profiler links, I'll give those a shot. I can't believe it didn't dawn on me that the compiler was optimizing away the loops; thanks everyone for showing me that!

