Slagh

OpenMP slow unless I trace.


Hello! I'm giving OpenMP a bit of a try, seeing as I have a nice new Quad Core. The following code takes upwards of 35 seconds. If the two commented 'azTrace' calls are uncommented, the code takes ~ 2 seconds. Without OpenMP pragmas the time is ~ 2.8 seconds. It's a templated texture fill function for running 'functor' for each pixel. The functor in this case is 'Noise' at the bottom.
        // function preamble, locking texture
	D3DCOLOR * const colorBasePtr = (D3DCOLOR *)lockedRect.pBits;

        azTimeDuration loopDuration;

	const int width = m_width, height = m_height;
	int y,x;
	#pragma omp parallel default(none) private( y, x ) shared( height, width, colorBasePtr, lockedRect, functor )
	{
		int tn = omp_get_thread_num();
//		azTrace( "tn = %d\n", tn );
		int seen = 0;

		#pragma omp for schedule( static,height/4)
		for( y = 0; y < height; ++y )
		{
//			if( seen++ == 0 )
//				azTrace( "y = %d\n", y );

			for( x = 0; x < width; ++x )
			{
				*(colorBasePtr + x + y * (lockedRect.Pitch >> 2) ) = functor( x, y, float(x) / float(width), float(y) / float(height) );
			}
		}
	}

	azTrace( "Fill loop duration was %gms\n", loopDuration.GetSeconds() * 1000.0f );
        // unlocking texture, function end

// functor
static D3DCOLOR Noise( int x, int y, float u, float v )
{
	unsigned int randomValue;
	rand_s( &randomValue );
	return -1 * ((randomValue & 0xff) >= 0x80);
}
There's a whole bunch of things I'm aware of:
* The rand_s call probably updates globals; I'm not worried in this instance.
* The memory filled is write-combined, but supposedly there are 4 independent write-combine buffers on CPUs. I'm interested to know whether that's 4 per core or 4 per physical chip. I tried a cached memory buffer instead of lockedRect.pBits and it was the same speed (with the tracing).

The main questions are:
* What's wrong with this that's fixed by sprintf and OutputDebugString (in azTrace)?
* Why is the improvement only 0.8 seconds when it *does* 'work'?

TIA, Slagh

I can guess as to why the speedup is so small - the performance is probably primarily limited by memory bandwidth, not CPU speed. If the function did more computation, the speedup should be more noticeable.

In addition, under some compilers the rand() function will be wrapped in a critical section or similar so that it is thread-safe when updating its global data. That would obviously hurt threaded performance.

The only side effect I can think of that OutputDebugString() would have is that it might put the thread to sleep while it waits for the I/O to complete; maybe that change in timing masks the problem. What you really need is a profiler to tell you what's going on.

As a side note, if you want to make it go quicker I'd recommend adjusting it so your functor fills in a whole scanline instead of a single pixel, to minimize the function-call overhead and allow for other optimizations.

The compiler turns off certain aspects of optimization when OpenMP is enabled, as far as I know.
This result is very likely, especially since you have many shared vars and a lot of memory access.

Is it correct to make width and height shared?
I am thinking of making them firstprivate.

- You might want to cast width and height to float once, not in every iteration.

Just to make sure:

{
// if( seen++ == 0 )
// azTrace( "y = %d\n", y );
for( x = 0; x < width; ++x ){




Do you properly uncomment *both* lines, or just one? That would quickly explain the absurd difference in running time: with only the 'if' uncommented, the inner for loop becomes its body and executes once per thread rather than every time.

Other things to look at are the common optimizations of in-loop invariants.


1) (lockedRect.Pitch >> 2) // constant per function
2) float(y) / float(height) // constant per row
3) colorBasePtr + y * (lockedRect.Pitch >> 2) // constant per row


I'm pointing these out since they would drastically cut down the access to shared variables.

If the single-core and 4-core versions both take 2 seconds, you have contention somewhere; 4 cores should run in roughly 1/4 of the time.

This:
rand_s( &randomValue );
goes through shared state. Make it per-thread; it's likely the reason for the missing performance increase.

Quote:
* Why is the improvement only 0.8 seconds when it *does* 'work'?


Because your algorithm isn't parallelized, and all 4 cores are waiting on a single shared resource, sitting idle most of the time.

Quote:
* What's wrong with this that's fixed by sprintf and OutputDebugString (in azTrace)?


Excluding bugs, the printf call likely desynchronizes accesses to the shared variable, causing less contention, preventing pipeline stalls, and other horrors.

Try to make as much data as possible local and non-shared. Often the greatest gains in scalability come from duplicating resources, where that is viable.

Thanks for your replies!

Quote:

I can guess as to why the speedup is so small - the performance is probably primarily limited by memory bandwidth and not CPU speed. If the function did more computation then the speedup should be more noticable.


The timing was filling a 2048x2048 ARGB8 texture; at 4 bytes per pixel that's 16 MB, so 2 seconds equates to around 8 MB a second, something I'd be more accustomed to seeing from or to a hard drive, but then it could be going over the PCIE bus.

Quote:

In addition under some compilers the rand() function will be wrapped in a critical section or similar so that it is thread safe when updating it's global data. That would obviously hit threaded performance.


I ran a test with the Noise function simply returning 0, and the timing was trivially different.

Quote:

The only side effect I can think that OutputDebugString() would have is that it might put the thread to sleep while it waited for the I/O to complete, maybe that change in timing would fix it. What you really need is a profiler to tell you what's going on.


I think you're right about the profiling. This is my first quick foray into OpenMP, I think I have a lot to learn!

Quote:

As a side note if you want to make it go quicker I'd recommend adjusting it so your functor fills in a whole scanline instead of a single pixel to minimize the function call overhead, and allow for other optimizations.


This is a very good idea!

Quote:
Original post by hydroo
The compiler turns off certain aspects of optimization when OpenMP is enabled, as far as I know.
This result is very likely, especially since you have many shared vars and a lot of memory access.

Is it correct to make width and height shared?
I am thinking of making them firstprivate.

- You might want to cast width and height to float once, not in every iteration.


The examples I was trying to understand suggested that the loop comparison values should be shared. I'll have to have a look at firstprivate I think!

Quote:
Original post by Antheus
Do you properly uncomment *both* lines, or just one?


Absolutely both! :)

I know there are a number of invariants in the loop, and I think I've misunderstood the meaning of shared; my first interpretation was 'import into the parallel region'.

Ultimately the rand_s function is being called 4M times and I'd be better off pulling that call out of the loop before bringing in the OpenMP hammer.

