performance hit when accessing array in large loop

11 comments, last by trs79 12 years, 6 months ago
Hi all,

I have 4 nested loops in a C++ function such that there are 125,000 total iterations. In the innermost loop I do a simple write to a float array[5], i.e.


float array[5];

for (uint a = 0; a < 1000; a++)
{
    for (uint b = 0; b < 5; ++b)
    {
        for (uint c = 0; c < 5; ++b)
        {
            for (uint d = 0; d < 5; ++b)
            {
                unsigned int ctr = 0;
                float v = 23 * a * b;
                array[ctr] = v;
            }
        }
    }
}


This code is part of a larger 3D test. If I comment out the "array[ctr] = v;" line, I get 600fps, if not I lose over 200fps and drop down to ~350 fps. Could this be due to CPU cache issues? I'm really stumped on this one, thanks for any help.
Don't measure in FPS, measure in milliseconds.


60 fps: 16.67 ms/frame (target)
600 fps: 1.67 ms/frame (10% of target)
350 fps: 2.85 ms/frame (17% of target)


You've still got 13.82 ms/frame in your budget before you'll notice any performance hits.

I wouldn't worry about it until performance actually becomes a problem, and when it does, the bottleneck will most likely be elsewhere.
Thanks for the link, I see your point but the drop from ~600fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps which seems like a greater percentage based performance drop (and getting closer to the 60fps target). I'm concerned because I still have more code to add and I'm afraid it will drop down even further. Anyone have any clues as to why a simple memory write takes that much time? Thanks.

[quote]Thanks for the link, I see your point but the drop from ~600fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps which seems like a greater percentage based performance drop (and getting closer to the 60fps target). I'm concerned because I still have more code to add and I'm afraid it will drop down even further. Anyone have any clues as to why a simple memory write takes that much time? Thanks.[/quote]


It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.

For your code... What is it that you are computing? How often are you computing it? Your code looks like it will always compute the same value.

[quote name='trs79' timestamp='1318535519' post='4872280']
Thanks for the link, I see your point but the drop from ~600fps to ~360 fps was on my fast development machine. I'm trying to target low-end laptops, and when I tested the code on one of those I went from 170 fps to ~95 fps which seems like a greater percentage based performance drop (and getting closer to the 60fps target). I'm concerned because I still have more code to add and I'm afraid it will drop down even further. Anyone have any clues as to why a simple memory write takes that much time? Thanks.


It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.

For your code... What is it that you are computing? How often are you computing it? Your code looks like it will always compute the same value.
[/quote]

I did read the article; the takeaway I got was that the lower the fps when the drop happens, the greater the relative performance loss, i.e. the drop from 60 to 55 was worse than the drop from 900 to 400, hence my drop from 170 to 95 on my target machine was worse than the drop from 600 to 350 on my development machine. (I'm looking specifically at the part towards the end of the article: "If you've been paying attention, you should suspect that the 5FPS drop is a sign of a greater performance cost than the 450FPS drop seen with the first method! In fact, that 5FPS drop represents 36.4% more execution time than the 450FPS drop!")

The computation I posted is simplified for clarity; the actual code creates an implicit surface for Smoothed Particle Hydrodynamics. Like I mentioned earlier, it's computed 125,000 times during the loops' execution, and that function is called as fast as possible. So it seems whatever latency is caused by the array access accumulates to the point where it's noticeable. The rendered scene is actually very simple at this point.
Your code doesn't make sense, you have ++b in the two innermost loops where I guess you want ++c and ++d. I assume that's not the code you use?
Always show your actual real code when asking a question.
Also, compile in Release mode. It is often 10x faster than Debug mode.

[quote]It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.[/quote]

That doesn't make sense either though: the execution time increased by 79%, from about 6 ms to 11 ms.

[quote]The computation I posted is simplified for clarity, the actual code deals with creating an implicit surface for Smoothed particle hydrodynamics. Like I mentioned earlier it's computed 125,000 times during the loops executions, and that function is called as fast as possible. So it seems whatever latency is caused by the array access is being accumulated to the point where it's noticeable. The rendered scene is actually very simple at this point.[/quote]

That's an entirely different matter. Post your actual calculation function, as your problem probably has exactly nothing to do with changing a float value 125k times.
Again, make sure to compile in Release mode.

[quote][...] If I comment out the "array[ctr] = v;" line, I get 600fps, if not I lose over 200fps and drop down to ~350 fps.[/quote]


If you comment out the "array[ctr] = v;" line, your compiler may decide that the whole piece of code doesn't have any effect and just optimize it into nothing. It's really not a good indication that the array access is slow. Array accesses are usually not slow at all. Also, the code you posted only ever writes to array[0], so I don't see the point.

[quote]
Your code doesn't make sense, you have ++b in the two innermost loops where I guess you want ++c and ++d. I assume that's not the code you use?
Always show your actual real code when asking a question.
Also, compile in Release mode. It is often 10x faster than Debug mode.

[quote]It doesn't seem like you read the article. Don't mislead yourself into believing a drop from 170 to 95 FPS is close to a 50% performance drop.[/quote]

That doesn't make sense either though: the execution time increased by 79%, from about 6 ms to 11 ms.

[quote]The computation I posted is simplified for clarity, the actual code deals with creating an implicit surface for Smoothed particle hydrodynamics. Like I mentioned earlier it's computed 125,000 times during the loops executions, and that function is called as fast as possible. So it seems whatever latency is caused by the array access is being accumulated to the point where it's noticeable. The rendered scene is actually very simple at this point.[/quote]

That's an entirely different matter. Post your actual calculation function, as your problem probably has exactly nothing to do with changing a float value 125k times.
Again, make sure to compile in Release mode.
[/quote]

My apologies for the typo. I need to get in the habit of posting the real code; here it is (this is actually taken from http://www.ss.iij4u....p/~amada/fluid/, which I'm in the process of modifying and integrating into my engine).


void Ren::sph_render_create_implicit(sph_render* r, float k, const vector3* pos, int n_particles)
{
    float stride;
    float inv_stride;

    int x, y, z;

    float fmin_x, fmax_x;
    float fmin_y, fmax_y;
    float fmin_z, fmax_z;

    int min_x, max_x;
    int min_y, max_y;
    int min_z, max_z;

    int vol_width, vol_height, vol_depth;

    vector3 piv; /* position in volume */

    stride = r->mc_stride;
    inv_stride = 1.0f/stride;

    fmin_x = fmin_y = fmin_z = 0.10f;
    fmax_x = fmax_y = fmax_z = -0.10f;

    vec3_set(&r->volpos, fmin_x - r->mc_range*stride, fmin_y - r->mc_range*stride, fmin_z - r->mc_range*stride);

    min_x = 0;
    min_y = 0;
    min_z = 0;
    max_x = (int)((fmax_x - fmin_x)*inv_stride);
    max_y = (int)((fmax_y - fmin_y)*inv_stride);
    max_z = (int)((fmax_z - fmin_z)*inv_stride);

    if ((min_x < 0) || (min_y < 0) || (min_z < 0))
        printf("min_x=%d, min_y=%d, min_z=%d\n", min_x, min_y, min_z);

    vol_width = 32;
    vol_height = 32;
    vol_depth = 32;

    memset(r->volume, 0, vol_width*vol_height*vol_depth*sizeof(float));

    r->im_vol.density = r->volume;
    r->im_vol.width = vol_width;
    r->im_vol.height = vol_height;
    r->im_vol.depth = vol_depth;
    r->im_vol.voxelsize = stride;

    float rArray[5];
    float v;

    for (int i = 0; i < n_particles; i++) /* n_particles is set to 1000 */
    {
        float bias = 1.0f;

        vec3_sub(&piv, &pos[i], &r->volpos);
        vec3_scale(&piv, 1.0f/stride, &piv);

        for (z = -r->mc_range; z <= r->mc_range; z++) /* mc_range is set to 2, so -2 to +2 is 5 iterations */
        {
            for (y = -r->mc_range; y <= r->mc_range; y++)
            {
                int vindex = 0;
                unsigned int ctr = 0;

                for (x = -r->mc_range; x <= r->mc_range; x++)
                {
                    float h2_r2;
                    float dx, dy, dz;

                    dx = ((int)piv.x)*stride + r->volpos.x - pos[i].x + stride*x;
                    dy = ((int)piv.y)*stride + r->volpos.y - pos[i].y + stride*y;
                    dz = ((int)piv.z)*stride + r->volpos.z - pos[i].z + stride*z;

                    h2_r2 = max(r->iso_radius*r->iso_radius - (dx*dx + dy*dy + dz*dz), 0.0f);

                    v = k*h2_r2*h2_r2*h2_r2;

                    if ((vindex < 0) || (vindex > vol_width*vol_height*vol_depth))
                        continue;

                    rArray[ctr] = v; /* this is the culprit line */
                    ++ctr;
                }
            }
        }
    }

    r->vol_width = vol_width;
    r->vol_height = vol_height;
    r->vol_depth = vol_depth;
}



The fps results above are actually from a Release build; in Debug mode they're far lower. I didn't think changing a float would be such an issue, but I can't figure out what else it could be. Thanks for any help.
Like alvaro said, if you comment out that "culprit line", everything else will be removed by the optimizer too, so the whole inner loop is the actual culprit.

There you have 375k 'floor' operations: converting to int and then multiplying that int by a float. That is not a good thing. Try replacing it with 'floor' and see if you get any difference. Floor isn't free in general though.
Try enabling SSE2 if you are compiling for 32-bit, as faster int/float conversion instructions are used then. I'm not sure how many of those the compiler is allowed to optimize away. Try not to mix integer and float arithmetic.

In addition you have 375k int * float operations from 'x/y/z * stride'.

Then you have a million or two adds and subtracts, about the same number of multiplies, and 250k branches.

[quote]If you comment out the "array[ctr] = v;" line, your compiler may decide that the whole piece of code doesn't have any effect and just optimize it into nothing. It's really not a good indication that the array access is slow. Array accesses are usually not slow at all. Also, the code you posted only ever writes to array[0], so I don't see the point.[/quote]


Sorry, there should have been a ++ctr line in there so more than just array[0] would have been written to. (I see why posting full code is best; I always seem to think simplifying it will help, but then it causes more problems.)

I think you all found the solution. If I compile in Debug mode, I get ~211 fps without the array write and ~195 fps with it (on my fast machine), so very close performance, probably because in Debug mode the loops never get optimized out. But in Release mode, without the array write, the loops do get optimized out, so in essence the large fps drop is because in one case there are 125,000 loop iterations and in the other there are none. That would also explain why the performance hit is an even greater percentage on the slower laptop: the processor is slower.

Does that seem like a reasonable conclusion?

This topic is closed to new replies.
