OpenMP slow unless I trace.



Hello! I'm giving OpenMP a bit of a try, seeing as I have a nice new Quad Core. The following code takes upwards of 35 seconds. If the two commented 'azTrace' calls are uncommented, the code takes ~ 2 seconds. Without OpenMP pragmas the time is ~ 2.8 seconds. It's a templated texture fill function for running 'functor' for each pixel. The functor in this case is 'Noise' at the bottom.
// function preamble, locking texture
D3DCOLOR * const colorBasePtr = (D3DCOLOR *)lockedRect.pBits;

azTimeDuration loopDuration;

const int width = m_width, height = m_height;
int y, x;
#pragma omp parallel default(none) private( y, x ) shared( height, width, colorBasePtr, lockedRect, functor )
{
//	azTrace( "tn = %d\n", tn );
	int seen = 0;

	#pragma omp for schedule( static, height/4 )
	for( y = 0; y < height; ++y )
	{
//		if( seen++ == 0 )
//			azTrace( "y = %d\n", y );

		for( x = 0; x < width; ++x )
		{
			*(colorBasePtr + x + y * (lockedRect.Pitch >> 2)) = functor( x, y, float(x) / float(width), float(y) / float(height) );
		}
	}
}

azTrace( "Fill loop duration was %gms\n", loopDuration.GetSeconds() * 1000.0f );
// unlocking texture, function end

// functor
static D3DCOLOR Noise( int x, int y, float u, float v )
{
	unsigned int randomValue;
	rand_s( &randomValue );
	return -1 * ((randomValue & 0xff) >= 0x80);
}

There's a whole bunch of things I'm aware of:

* The rand_s call probably updates globals; I'm not worried in this instance.
* The memory filled is write-combined, but supposedly there are 4 independent write-combine buffers on CPUs. I'm interested to know if that's 4 per core or 4 per physical chip. I tried a cached memory buffer instead of lockedRect.pBits and it was the same speed (with the tracing).

The main questions are:

* What's wrong with this that's fixed by sprintf and OutputDebugString (in azTrace)?
* Why is the improvement only 0.8 seconds when it *does* 'work'?

TIA, Slagh

I can guess as to why the speedup is so small - the performance is probably primarily limited by memory bandwidth and not CPU speed. If the function did more computation then the speedup should be more noticeable.

In addition, under some compilers the rand() function will be wrapped in a critical section or similar so that it is thread safe when updating its global data. That would obviously hit threaded performance.

The only side effect I can think of that OutputDebugString() would have is that it might put the thread to sleep while it waits for the I/O to complete; maybe that change in timing would fix it. What you really need is a profiler to tell you what's going on.

As a side note if you want to make it go quicker I'd recommend adjusting it so your functor fills in a whole scanline instead of a single pixel to minimize the function call overhead, and allow for other optimizations.

The compiler turns off certain aspects of the optimization, AFAIK. This result is very likely, especially since you have many shared vars and much memory access.

Is it correct to make width and height shared? I am thinking of making them firstprivate.

- You might also want to cast width and height to float one time, not in every iteration.

Just to make sure:

{
//		if( seen++ == 0 )
//			azTrace( "y = %d\n", y );

		for( x = 0; x < width; ++x )
		{

Do you properly uncomment *both* lines, or just one? If only the first line is uncommented, the for loop becomes the body of the if and executes only once per thread rather than for every row. That would quickly explain the absurd difference in running time.
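The hazard can be reproduced in a tiny standalone sketch (CountInnerRuns is a made-up name for illustration):

```cpp
#include <cassert>

// Made-up demo of the dangling-body hazard: if only the first commented
// line is restored, the inner for loop becomes the body of the if and
// runs for the first row only.
int CountInnerRuns(int height, int width)
{
    int seen = 0;
    int inner = 0;
    for (int y = 0; y < height; ++y)
        if (seen++ == 0)                    // uncommented
            // azTrace( "y = %d\n", y );    // still commented out!
            for (int x = 0; x < width; ++x)
                ++inner;
    return inner;
}
```

CountInnerRuns(4, 8) returns 8 instead of the expected 32 - only the first row is processed.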

Other candidates are the common optimizations of loop-invariant expressions:

1) (lockedRect.Pitch >> 2)                    // constant per function
2) float(y) / float(height)                   // constant per row
3) colorBasePtr + y * (lockedRect.Pitch >> 2) // constant per row

I'm pointing these out since they would drastically cut down the access to shared variables.

If the single-core and 4-core versions take 2 seconds either way, you have congestion somewhere. The 4-core version should run in roughly 1/4 of the time.

This:
rand_s( &randomValue );
should be a one-liner function. Make its state non-shared; it's likely the reason for the lack of performance increase.

Quote:
 * Why is the improvement only .8 seconds when it *does* 'work'?

Because your algorithm isn't parallelized, and all 4 cores are waiting on a single shared resource, sitting idle most of the time.

Quote:
 * What's wrong with this that's fixed by sprintf and OutputDebugString (in azTrace)?

Excluding bugs, the printf statement likely desynchronizes accesses to the shared variable, causing less contention, preventing pipeline stalls, and other horrors.

Try to make as much data as possible local and non-shared. Often, the greatest gains in scalability come from duplicating resources (where that is viable).


Quote:
 I can guess as to why the speedup is so small - the performance is probably primarily limited by memory bandwidth and not CPU speed. If the function did more computation then the speedup should be more noticeable.

The timing was filling a 2048x2048 ARGB8 texture (16 MB), so 2 seconds equates to around 8 MB/s, something I'd be more accustomed to seeing from or to a hard drive, but then it could be going over the PCIe bus.

Quote:
 In addition, under some compilers the rand() function will be wrapped in a critical section or similar so that it is thread safe when updating its global data. That would obviously hit threaded performance.

I ran a test with the Noise function simply returning 0, and the timing was trivially different.

Quote:
 The only side effect I can think of that OutputDebugString() would have is that it might put the thread to sleep while it waits for the I/O to complete; maybe that change in timing would fix it. What you really need is a profiler to tell you what's going on.

I think you're right about the profiling. This is my first quick foray into OpenMP, I think I have a lot to learn!

Quote:
 As a side note if you want to make it go quicker I'd recommend adjusting it so your functor fills in a whole scanline instead of a single pixel to minimize the function call overhead, and allow for other optimizations.

This is a very good idea!

Quote:
 Original post by hydroo
The compiler turns off certain aspects of the optimization, AFAIK. This result is very likely, especially since you have many shared vars and much memory access. Is it correct to make width and height shared? I am thinking of making them firstprivate. You might also want to cast width and height to float one time, not in every iteration.

The examples I was trying to understand suggested that the loop comparison values should be shared. I'll have to have a look at firstprivate I think!

Quote:
 Original post by Antheus
Do you properly uncomment *both* lines, or just one?

Absolutely both! :)

I know there's a number of invariants in the loop, and I think I've misunderstood the meaning of shared - my first interpretation has been 'import into the parallel region'.

Ultimately the rand_s function is being called 4M times and I'd be better off pulling that call out of the loop before bringing in the OpenMP hammer.

I suggest replacing the standard PRNG with something better and generating all the random numbers before the loops. Some PRNGs have functions to generate a large number of random numbers in a single call.
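A sketch of that approach, generating the whole table up front with std::mt19937 from &lt;random&gt; (MakeRandomTable and FillFromTable are made-up names; &lt;random&gt; is C++11, available as TR1 around the time of this thread):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <random>

// Generate all random values once, with a better generator than rand_s.
std::vector<uint32_t> MakeRandomTable(std::size_t count, uint32_t seed)
{
    std::mt19937 gen(seed);  // Mersenne Twister
    std::vector<uint32_t> table(count);
    for (std::size_t i = 0; i < count; ++i)
        table[i] = gen();
    return table;
}

// The parallel fill loop then only reads from the table: no shared
// generator state in the hot loop.
void FillFromTable(uint32_t* pixels, const std::vector<uint32_t>& table)
{
    const int count = (int)table.size();
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        pixels[i] = ((table[i] & 0xFF) >= 0x80) ? 0xFFFFFFFFu : 0u;
}
```

As a bonus the fill becomes deterministic for a given seed, which makes it easy to compare the serial and parallel versions for correctness.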
