Navier-Stokes Equations Solver - Compute Shader!

Published May 22, 2009
Advertisement

Navier-Stokes Equations Solver - Compute Shader!

After a couple of weeks of engine development and testing, it was really nice to spend a few days on an actual algorithm to implement in D3D11. When I originally heard about the Compute Shader, one of the first techniques that I thought of was the NSE water solver that was popularized a few years ago. Up until this point, I only had support for the water solver in my engine on the CPU. From today onward, I have support for it on the GPU - completely on the GPU! (I know, I know, there isn't any D3D11 GPUs out there yet - but it will run when they are available [rolleyes]).

So without further ado - here is my fully GPU water solver in still action:



I still have some optimizations to make (specifically to increase the parallelism by the way I store the current state of the simulation), but it is pretty cool to see it running. I also uploaded a youtube video as well, where I initialized the state of the water to very wavy at one end of a 32x32 grid of water. You can see the waves propagating over to the near side of the water [cool].

">You can view the video here.

I'll post the details of the algorithm once I have the final version of the shader ready, but here is the current state of the shader for you guys to try out:

//-----------------------------------------------------------------------------// NSE_Solver_CS.hlsl//// The NSE Solver takes as input a shader resource view representing the // current state of the system in a texture2D, and the current flow values in // another texture2D.  The calculated output is stored in a texture2D bound to // the compute shader with an unordered access view to allow all of the various// threads to write their output simultaneously.//// Copyright (C) 2003-2009 Jason Zink.  All rights reserved.//-----------------------------------------------------------------------------Texture2D<float> InputHeight : register( t0 ); Texture2D InputFlow : register( t1 ); RWTexture2D<float> OutputHeight : register( u0 );RWTexture2D OutputFlow : register( u1 );#define size_x 32#define size_y 32groupshared float loadedheights[size_x * size_y];groupshared float4 loadedflows[size_x * size_y];[numthreads(size_x, size_y, 1)]void CSMAIN( uint3 GroupID : SV_GroupID, uint3 DispatchThreadID : SV_DispatchThreadID, uint3 GroupThreadID : SV_GroupThreadID, uint GroupIndex : SV_GroupIndex ){	//----------------------------------------	// Perform all texture accesses here and load the group shared memory with last frame's height and flow		loadedheights[GroupIndex] = InputHeight.Load( uint3(DispatchThreadID.x, DispatchThreadID.y, 0 ) );	loadedflows[GroupIndex] = InputFlow.Load( uint3(DispatchThreadID.x, DispatchThreadID.y, 0 ) );	GroupMemoryBarrierWithGroupSync();	//----------------------------------------	// Update all flow calculations	//	// Programmer artwork:    o--o x	//                       /|\	//                      / | \	//                     o  o  o	//                     w  z  y	// Initialize the flow variable to the last frames flow values	float4 NewFlow = float4( 0.0f, 0.0f, 0.0f, 0.0f );//loadedflows[GroupIndex];	// Check for 'not' right edge	if ( DispatchThreadID.x < size_x - 1 )	{		NewFlow.x = ( loadedheights[GroupIndex+1] - loadedheights[GroupIndex] );		// Check for 'not' bottom edge		if ( DispatchThreadID.y < size_y - 1 )		{			NewFlow.y = ( loadedheights[(GroupIndex+1) + size_x] - loadedheights[GroupIndex] );		}	}	// Check for 'not' bottom edge	if ( DispatchThreadID.y < size_y - 1 )	{		NewFlow.z = ( loadedheights[GroupIndex+size_x] - loadedheights[GroupIndex] );		// Check for 'not' left edge		if ( DispatchThreadID.x > 0 )		{			NewFlow.w = ( loadedheights[GroupIndex + size_x - 1] - loadedheights[GroupIndex] );		}	}	const float TIME_STEP = 0.05f;	const float PIPE_AREA = 0.0001f;	const float GRAVITATION = 10.0f;	const float PIPE_LENGTH = 0.2f;	const float FLUID_DENSITY = 1.0f;	const float COLUMN_AREA = 0.05f;	const float DAMPING_FACTOR = 0.9998f;	const float fAccelFactor = ( TIME_STEP * PIPE_AREA * GRAVITATION ) / ( PIPE_LENGTH * COLUMN_AREA );	NewFlow = ( NewFlow * fAccelFactor + loadedflows[GroupIndex] ) * DAMPING_FACTOR;	GroupMemoryBarrierWithGroupSync();	//----------------------------------------	// Calculate the new height for each column, then store the height and modified flow values.	// The updated height values are now stored in the loadedheights shared memory.	// Out of the current column...	loadedheights[GroupIndex] = loadedheights[GroupIndex] + NewFlow.x + NewFlow.y + NewFlow.z + NewFlow.w;	GroupMemoryBarrierWithGroupSync();	// Into the right columns	if ( DispatchThreadID.x < size_x - 1 )		loadedheights[GroupIndex+1] = loadedheights[GroupIndex+1] - NewFlow.x;	GroupMemoryBarrierWithGroupSync();	// Into the bottom right columns	if ( DispatchThreadID.x < size_x - 1 )		if ( DispatchThreadID.y < size_y - 1 )			loadedheights[GroupIndex+size_x+1] = loadedheights[GroupIndex+size_x+1] - NewFlow.y;	GroupMemoryBarrierWithGroupSync();	// Into the bottom columns	if ( DispatchThreadID.y < size_y - 1 )		loadedheights[GroupIndex+size_x] = loadedheights[GroupIndex+size_x] - NewFlow.z;	GroupMemoryBarrierWithGroupSync();	// Into the bottom left columns	if ( DispatchThreadID.y < size_y - 1 )		if ( DispatchThreadID.x > 0 )			loadedheights[GroupIndex+size_x-1] = loadedheights[GroupIndex+size_x-1] - NewFlow.w;	GroupMemoryBarrierWithGroupSync();	OutputHeight[DispatchThreadID.xy] = loadedheights[GroupIndex];	OutputFlow[DispatchThreadID.xy] = NewFlow;}

There's plenty more D3D11 goodness to come - this compute shader functionality really opens up the doors for trying new algorithms...
0 likes 5 comments

Comments

Moe
That's pretty cool. Not knowing much of the difference between D3D10 and D3D11, how hard would it be to get that working on D3D10 hardware?
May 26, 2009 01:34 PM
Jason Z
This implementation specifically uses the compute shader, which allows for read/write access to a texture buffer. That makes the implementation easier to do in a single pass in D3D11. However, you could probably decompose the calculations to two passes in a pixel shader on D3D10 without too much trouble. If you do it, I'd be interested in seeing the results!

The other consideration is that the CS allows for shared memory between invocations, whereas the PS doesn't. This allows for reduced texture bandwidth, and perhaps some clever optimizations in some cases. This won't really help until the D3D11 GPUs come around, but it is still fun to play with!
May 27, 2009 11:31 AM
FReY
Hey this looks pretty good; have you got an implementation that propagates across multiple thread group/blocks ie. a water surface greater than 32x32?

Moe: to get this working on DX10 feature level Compute under DX11, you could actually do it in 1 pass if you changed this to:
1] max 768 threads in a group. Although for efficiency reasons you'd actually want less than 384 threads.
2] a single UAV - possible if one converts the Height+Flow into a single RWStructuredBuffer for output, and a single StructuredBuffer for input.
3] a single groupshared region - possible again if you make it structured.
4] no shared memory scattered writes; ie. convert the algorithm to do gather reads.
5] The output StructuredBuffer can be directly bound as an SRV to other shader stages (VS, GS, PS).
6] compile for cs_4_0 target rather than cs_5_0 target.

(more info here: http://www.slideshare.net/repii/your-game-needs-direct3d-11-so-get-started-now)
May 29, 2009 03:28 AM
Jason Z
Quote:Original post by FReY
Hey this looks pretty good; have you got an implementation that propagates across multiple thread group/blocks ie. a water surface greater than 32x32?
That is what I'm working on now. I have already changed the algorithm to perform the reads from neighboring pixel areas instead of the scattered writes that you mention below. I should have something up and running with multiple groups in the next day or so...
Quote:Original post by FReY
Moe: to get this working on DX10 feature level Compute under DX11, you could actually do it in 1 pass if you changed this to:
1] max 768 threads in a group. Although for efficiency reasons you'd actually want less than 384 threads.
2] a single UAV - possible if one converts the Height+Flow into a single RWStructuredBuffer for output, and a single StructuredBuffer for input.
3] a single groupshared region - possible again if you make it structured.
4] no shared memory scattered writes; ie. convert the algorithm to do gather reads.
5] The output StructuredBuffer can be directly bound as an SRV to other shader stages (VS, GS, PS).

Those are good suggestions, but I'm not sure if you can do a multi-group implementation with a single pass on a single buffer - you would be writing to the same resource that you are using to calculate the update, which would be troublesome... Maybe there is a way to do it, but I couldn't figure it out at first thought.
May 29, 2009 09:37 AM
FReY
Quote:
Those are good suggestions, but I'm not sure if you can do a multi-group implementation with a single pass on a single buffer - you would be writing to the same resource that you are using to calculate the update, which would be troublesome...


Actually, you can still keep using 2 buffers and double-buffer the update like you're doing now. Your input will be an SRV, and your output will just be a single UAV. Basically, pseudocode for your loop will still look like this:


ID3D11Buffer* buffer[2];
CreateStructuredBuffer(buffer[0]);
CreateStructuredBuffer(buffer[1]);
CreateStructuredBufferSRV(srv[0], buffer[0]);
CreateStructuredBufferSRV(srv[1], buffer[1]);
CreateStructuredBufferUAV(uav[0], buffer[0]);
CreateStructuredBufferUAV(uav[1], buffer[1]);
uint readBuff  = 0;
uint writeBuff = 1;
while (running)
{
   pDev->CsSetUnorderedAccessViews(uav[writeBuff], 1);
   pDev->CsSetShaderResourceViews(srv[readBuff], 1);
   pDev->Dispatch(numGroupsX, numGroupsY, 1);
   pDev->CsSetUnorderedAccessViews(NULL, 1);

   // swap read and write buffers
   writeBuff = (writeBuff) ? 0 : 1;
   readBuff  = (readBuff) ? 0 : 1;

   pDev->VSSetShaderResourceViews(srv[readBuff], 1);
   pDev->DrawIndexed(...)
}





Effectively you'll be doing the following:

  struct GridPoint
  {
      float4 Flow;
      float  Height;
  };
  RWStructuredBuffer<GridPoint> Output : register(u0);
  StructuredBuffer<GridPoint>   Input  : register(t0);

  etc..




instead of this:

  Texture2D<float> InputHeight : register( t0 ); 
  Texture2D<float4> InputFlow : register( t1 ); 

  RWTexture2D<float> OutputHeight : register( u0 );
  RWTexture2D<float4> OutputFlow : register( u1 );

  etc...




Funny enough I just found some info on creating structured buffer UAVs and SRVs right here on gamedev :)

http://www.gamedev.net/community/forums/topic.asp?topic_id=516043
May 29, 2009 09:56 AM
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!
Advertisement