
About this blog

Ramblings of a DirectX MVP playing at the sharp end of 3D graphics

Entries in this blog

jollyjeffers

AWOL Update

Afternoon all!
Writing this from my Nokia 5800's mobile internet, so apologies in advance for any typos or bad formatting.
Mobile internet has certainly improved since I worked in the field ('06-'07), but I'm not convinced it's quite there yet. I still get the occasional urge to throw said mobile out the window [smile]
Anyway, my PC is still broken and I don't really know why. I've tried a few combinations of OSes and HDDs and it still isn't happy. In particular the performance is very poor, VERY poor. I'm wondering if the mobo is at fault, maybe some sort of heat/power issue? I am overclocking a Q6600 quite a long way, after all...
Consequently I doubt I'll be active online for a while, factoring in a day job, a social life and replacing hardware/rebuilding a PC [sad]
It's particularly irritating as I've just been on "holiday" for a week walking the second section of the Pennine Way - 90 miles in 6 days, 270 miles for the full route. There were so many occasions in the last week where I was miles from civilization and the only noise was wildlife and the wind. It reminded me of how much I enjoy getting away from big cities and technology every now and then! The mind naturally wanders whilst walking, and I thought up various crazy ideas and neat solutions to programming puzzles... To not have a PC to play with upon my return may see these forgotten...

Once I've got a working PC I'll put my photos up for anyone interested in what the more remote and picturesque parts of England look like.
jollyjeffers

Windows 7

Well, I sort of have a PC working again.

Saturday last week my machine was working absolutely fine. Sunday morning, the boot disk apparently failed and the machine wouldn't even boot. Great.

I'd previously tried dual-booting Windows 7 on this machine (it's Vista x64 normally) and it wasn't happy at all. Then a week or two later the whole machine broke, which says to me that either Win7 is very good at spotting an error early or that it itself caused the error.

So having lost everything (maybe - I still need to try some disk recovery tricks) I tried installing Windows 7 just to get any OS up and running; the 7 disc simply happened to be closer to hand.

Anyway, that simply wouldn't go anywhere - the installer randomly crashed, or it wouldn't let me pick a disk...

Gave up and stuck a Vista x64 disk in... worked first time.

In short, I now have zero confidence/trust in Win7 and will most likely stick with Vista for the foreseeable future. That kinda runs counter to the general consensus, but I never really had any objections to Vista so I don't much care [grin]


P.S. I'm off on holiday tomorrow. I'll rebuild my machine in July and get back to some D3D11/SlimDX dev work then...
jollyjeffers
Evening all,

Been quite busy lately, hence the recent drought of journal updates. Not entirely sure where the time has been going, but I hope to get back on the case soon! Although, what with the RMT paralysing London for the next 3-4 days, I doubt this week will be any more than a total write-off (which I think is totally unreasonable on their part, but hey-ho, I just have to deal with it...)

Anyway, Mike Popoloski recently asked me to have a look at SlimDX and their adaptation of the new Direct3D 11 API. I've been wanting to make time to check out their work for a while and this seems like a suitable opportunity.

Grabbed the latest version of TortoiseSVN and pointed it to their repository, built SlimDX.sln in VS'08 with the latest DXSDK successfully and I think I'm good to go.

In fact, the download speed of TortoiseSVN from SourceForge was the slowest part! I'm quite impressed that I could just pull down all the code and hit "build solution" and get only 27 seemingly unimportant warnings, no errors and a shiny new SlimDX.dll waiting to be used [cool]

Getting a little late to do any more now, but I intend to implement the Curved Point Normal Triangles (aka ATI TruForm from the D3D8 era) sometime this week using their API.
jollyjeffers

Windows 7

hmmm.

Cleared out a spare partition and installed the Windows 7 Release Candidate. Took 90 minutes before I saw the desktop which wasn't impressive given Vista takes all of 30 minutes to do the same.

But that's nothing compared to the fact that it's almost totally unresponsive once it gets to the desktop. Even trying to get task manager and My Computer loaded failed!

I hope for its sake that it was just busy doing some 1-time initialization or whatever. A fast quad core with plenty of RAM really shouldn't be struggling with viewing the start menu [lol]
jollyjeffers
Was just reading the private MVP newsgroups and ZMan posted a link to a Win7 developer blog: Windows 7 Managed Code APIs.

I must admit I've not looked into it in much detail, but Andy flagged up the "Support for Direct3D 11.0 and DXGI 1.0/1.1 APIs" comment near the top. I get the impression it's not "MDX 3.0" or any sort of official successor to the now-dead MDX API, but from a functional standpoint it sounds like it might fit in the same space...

Figured you guys might well find that interesting [smile]
jollyjeffers
Evening all,

Was ill most of last week so haven't really done all that much.

  • Downloading Windows 7 RC x64, probably dual boot that on my Vista 64 machine over the weekend if I'm not too hungover
  • Also downloading Visual Studio 2010 Beta 1. No particular reason, but figured I might as well pair it up with Win7 [smile]

    What follows is part of my ongoing article (ongoing...and going...and going...and going... 40+ A4 pages and counting). Took me f'in ages to upload these images (I <3 GDNet [razz]) so you'd better appreciate it [wink]




    Pre-processing the height map

    Despite being a pre-processing step in this context, the approach taken is very similar to the idea of post-processing, which has been common in real-time graphics for several years.

    The input texture will be divided up into kernels; each kernel area will have its pixels read in and four values generated from the raw data which can then be stored in a 2D output texture. This 2D output texture will then be indexed by the Hull Shader to enable it to have the previously described context when making LOD computations.
    The key design decision is how to map the 16x16 height map samples down to a single per-patch output value.

    It is relatively straightforward to compute a variety of statistics from the source data, but really all that the Hull Shader cares about is having a measure of how much detail the patch requires. Is this piece of terrain flat? If yes, generate less detail. Alternatively, is this piece of terrain very bumpy and noisy? If yes, generate more detail.
    A good objective for this pre-pass is to find a statistical measure of coplanarity - to what extent do the 256 samples lie on the same plane in 3D space?
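
    Concretely, if N is the patch's unit plane normal and D its plane offset (both derived later in this section), then each of the 256 samples P(i) has a signed distance from that plane, and the single per-patch measure used here is the standard deviation of those distances - this is the quantity the Compute Shader later in this section computes:

    d(i) = N · P(i) + D

    coplanarity = sqrt( ( d(1)^2 + d(2)^2 + ... + d(256)^2 ) / (256 - 1) )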

    Consider the following two diagrams:



    The left-hand diagram shows a relatively uniform slope, possibly the side of a hill or valley. The right-hand diagram, however, is much more erratic and noisy, originating from a more complex section of terrain. Ideally the Hull Shader would give the right-hand example a much higher level of detail as it, quite simply, requires more triangles to represent it.



    The above diagram is the same two examples but with a plane inserted into the dataset. Whilst Direct3D isn't capable of rendering quads natively, the plane is the best possible surface if there were no tessellation involved and only a single primitive used to represent this piece of landscape. Notice that the plane in the left-hand diagram is a much closer match to the surface than in the right-hand diagram.



    This next diagram shows a side-on view of a terrain segment with plane and lines indicating how far each sample is from the plane. It is from this basis that we can measure coplanarity - the shorter the lines between the samples and the plane the more coplanar the data is.
    Picking the plane to base these calculations off requires a 'best fit' approach as it needs to be representative of the overall shape of the patch yet it is unlikely that any plane generated will be a perfect match to the real data.



    The above diagram demonstrates one computationally efficient method of getting an acceptable 'best fit' plane. On the left is the original patch geometry introduced earlier and on the right is the same geometry but with only the four corners joined together. Whilst this simplified primitive appears coplanar in this instance, there is no guarantee that it always will be.

    For each of the four corners the two adjacent neighbours are also known, and from here it is trivial to generate the pairs of vectors denoted in red. The cross-product of each pair of vectors results in a normal vector for that corner, denoted in blue. Combining and normalizing these four raw normal vectors will result in a single unit length normal vector for the patch, one that is generally representative of the underlying surface. By taking any of the four corner positions it is possible to derive a standard plane equation:

    Ax + By + Cz + D = 0
    A = Nx
    B = Ny
    C = Nz
    D = -(N · P)


    Where N is the unit normal vector, P is a corner point and · denotes the dot product.
    With this plane equation known, the compute shader can evaluate each height map sample for the distance between it and the plane.

    Implementing with a Compute Shader



    Notation and indexes in the compute shader are not immediately obvious; the above diagram introduces two of the key variables in the context of a terrain rendering pre-pass.
    The core HLSL shader has an entry point with the [numthreads(x,y,z)] attribute attached to it:

    [numthreads(16,16,1)]
    void csMain()
    {
        /* shader body here */
    }


    This attribute defines a group, aka a kernel, and in the above context it is defining a 16x16 array of threads per group. The body of the csMain method is written for a single thread, but via system-generated values it is able to identify which of these 256 (16x16) threads it actually is. Knowing which thread it is, the code can be written to ensure each thread reads from and writes to the correct location.

    In the preceding diagram the Dispatch(x, y, z) call is also introduced. This is made by the application and is essentially a draw call as it begins execution of the compute shader. At this level the parameters indicate how many groups of 16x16 threads to create. For this particular algorithm the application simply divides the input height map texture dimensions by 16 and uses this as the number of kernels.

    For a 1024x1024 height map there will be 64x64 kernels, each kernel being 16x16 threads. Conceptually this would imply a very large number of threads, one per pixel in this case, but it is up to the implementation quite how these tasks are scheduled on the GPU and how many actually execute concurrently.
    A key detail omitted until now is how an invocation is able to identify itself relative to its group as well as the entire dispatch call. Direct3D defines four system-generated values for this purpose (a minimal entry point signature using all four is sketched after the list):

    1. SV_GroupID
      This uint3 returns indexes into the parameters provided by ID3D11DeviceContext::Dispatch(). It allows this invocation to know which group it is relative to all others being executed. In particular, this value is useful for determining the output location in a many:one relationship. In this algorithm it is the index into the output texture where the results for the whole group are written.

    2. SV_GroupThreadID
      This uint3 returns indexes local to the current group - the parameters provided at compile-time as part of the [numthreads()] attribute. In this algorithm it is used to know which threads represent corner pixels for the current 16x16 area.

    3. SV_DispatchThreadID
      This uint3 is a combination of the previous two. Whereas they index relative to only one set of input parameters (::Dispatch() or [numthreads()]), this is a global index - essentially SV_GroupID multiplied by the [numthreads()] dimensions plus SV_GroupThreadID. For a 64x64 dispatch of 16x16 threads this system value will vary between 0 and 1023 in both axes (64*16=1024), thus for this algorithm it provides the thread with the address of the source pixel to read from.

    4. SV_GroupIndex
      This uint gives the flattened index into the current group. For a 16x16 area this value will be between 0 and 255 and for the purpose of this algorithm it is essentially the thread ID, used only to coordinate work across the group.
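
    As promised above, here is a minimal sketch of an entry point declaring all four system-generated values. The semantic names are fixed by Direct3D; the parameter names (Gid, DTid, GTid and GI) are simply the conventions used throughout this article:

    [numthreads(16, 16, 1)]
    void csMain( uint3 Gid  : SV_GroupID,          // which 16x16 group within the Dispatch() grid
                 uint3 DTid : SV_DispatchThreadID, // global address: Gid * 16 + GTid
                 uint3 GTid : SV_GroupThreadID,    // 0-15 in x and y within this group
                 uint  GI   : SV_GroupIndex )      // flattened 0-255 index within this group
    {
        /* shader body here */
    }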


    The final piece in the puzzle is the ability for threads to communicate with each other. This is done via a 4kb chunk of shared memory and synchronization intrinsics. Variables defined at the global scope with the 'groupshared' prefix can be both read from and written to by all threads in the current group:

    groupshared float groupResults[16 * 16];
    groupshared float4 plane;
    groupshared float3 rawNormals[2][2];
    groupshared float3 corners[2][2];


    Synchronization is done via a choice of six barrier functions. The code can be authored with either a *MemoryBarrier() or *MemoryBarrierWithGroupSync() call - the former blocks until memory operations have finished, but progress can continue before remaining ALU instructions complete; the latter blocks until all threads in the group have reached the specified point - both memory and arithmetic instructions must be complete. The barrier can either be 'All', 'Device' or 'Group' - with decreasing scope at each level. Thus an AllMemoryBarrierWithGroupSync() is the heaviest intrinsic to employ whereas a GroupMemoryBarrier() is more lightweight. In this algorithm only GroupMemoryBarrierWithGroupSync() is used.



    The first phase of the algorithm utilizes four threads, one for each corner of the 16x16 pixel group. Each of the four threads reads in a single sample and stores the height in groupResults[] and then a 3D position in corners[][]. All other threads are idle at this point. The code for this is as follows:

    if(
        ((GTid.x ==  0) && (GTid.y ==  0))
        ||
        ((GTid.x == 15) && (GTid.y ==  0))
        ||
        ((GTid.x ==  0) && (GTid.y == 15))
        ||
        ((GTid.x == 15) && (GTid.y == 15))
      )
    {
        // This is a corner thread, so we want it to load
        // its value first
        groupResults[GI] = texHeightMap.Load( uint3( DTid.xy, 0 ) ).r;

        corners[GTid.x / 15][GTid.y / 15] = float3(GTid.x / 15, groupResults[GI], GTid.y / 15);

        // The above will unfairly bias based on the height ranges
        corners[GTid.x / 15][GTid.y / 15].x /= 64.0f;
        corners[GTid.x / 15][GTid.y / 15].z /= 64.0f;
    }

    // Block until all threads have finished reading
    GroupMemoryBarrierWithGroupSync();




    The next phase sees the same four threads continuing to process the corner points. In this instance they need to know about their neighbouring corners so that they can generate the cross-product and hence a normal vector for each corner - entirely ALU work. Concurrently the other 252 threads can be reading in the remaining height map samples.

    if((GTid.x ==  0) && (GTid.y ==  0))
    {
        rawNormals[0][0] = normalize( cross(
                               corners[0][1] - corners[0][0],
                               corners[1][0] - corners[0][0]
                           ) );
    }
    else if((GTid.x == 15) && (GTid.y ==  0))
    {
        rawNormals[1][0] = normalize( cross(
                               corners[0][0] - corners[1][0],
                               corners[1][1] - corners[1][0]
                           ) );
    }
    else if((GTid.x ==  0) && (GTid.y == 15))
    {
        rawNormals[0][1] = normalize( cross(
                               corners[1][1] - corners[0][1],
                               corners[0][0] - corners[0][1]
                           ) );
    }
    else if((GTid.x == 15) && (GTid.y == 15))
    {
        rawNormals[1][1] = normalize( cross(
                               corners[1][0] - corners[1][1],
                               corners[0][1] - corners[1][1]
                           ) );
    }
    else
    {
        // This is just one of the other threads, so let it
        // load in its sample into shared memory
        groupResults[GI] = texHeightMap.Load( uint3( DTid.xy, 0 ) ).r;
    }

    // Block until all the data is ready
    GroupMemoryBarrierWithGroupSync();




    Phase four is where the next big chunk of work takes place, but prior to this the group must have a plane from which to measure offsets. This only requires a single thread and simply implements the plane-from-point-and-normal equations as shown below:

    if(GI == 0)
    {
        // Let the first thread only determine the plane coefficients

        // First, decide on the average normal vector
        float3 n = normalize
                   (
                       rawNormals[0][0]
                     + rawNormals[0][1]
                     + rawNormals[1][0]
                     + rawNormals[1][1]
                   );

        // Second, decide the lowest point on which to base it
        float3 p = float3(0.0f, 1e9f, 0.0f);
        for(int i = 0; i < 2; ++i)
            for(int j = 0; j < 2; ++j)
                if(corners[i][j].y < p.y)
                    p = corners[i][j];

        // Third, derive the plane from point+normal
        plane = CreatePlaneFromPointAndNormal(n, p);
    }

    GroupMemoryBarrierWithGroupSync();




    With a plane available it is necessary to process each of the raw heights as originally loaded from the height map. Each thread takes a single height and computes the distance between this sample and the plane previously computed and replaces the original raw height value.

    // All threads now translate the raw height into the distance
    // from the base plane.
    groupResults[GI] = ComputeDistanceFromPlane(plane, float3((float)GTid.x / 15.0f, groupResults[GI], (float)GTid.y / 15.0f));

    GroupMemoryBarrierWithGroupSync();




    The final phase of the algorithm takes all of the height values and computes the standard deviation from the surface of the plane. This single value is a good metric of how coplanar the 256 individual height samples are - lower values imply a flatter surface and higher values a noisier and varying patch. This single value and the plane's normal vector is written out as a float4 in the output texture - 256 height map samples reduced down to four numbers.

    if(GI == 0)
    {
        float stddev = 0.0f;

        for(int i = 0; i < 16*16; ++i)
            stddev += pow(groupResults[i], 2);

        stddev /= ((16.0f * 16.0f) - 1.0f);

        stddev = sqrt(stddev);

        // Write the normal vector and standard deviation
        // to the output buffer for use by the Domain and Hull Shaders
        bufferResults[uint2(Gid.x, Gid.y)] = float4(plane.xyz, stddev);
    }


    Two utility functions were referenced in the above fragments; for completeness they are as follows:

    float4 CreatePlaneFromPointAndNormal(float3 n, float3 p)
    {
        return float4( n, -dot(n, p) );
    }

    float ComputeDistanceFromPlane(float4 plane, float3 position)
    {
        // For a plane stored as (N, D) with D = -(N · P), the signed
        // distance of a point X from the plane is dot(N, X) + D
        return dot(plane.xyz, position) + plane.w;
    }


    Integrating the Compute Shader

    The previous section details the actual Compute Shader implementing the algorithm, but it is still necessary for the application to coordinate this work.
    Firstly the output texture needs to be created. This will be bound as an output to the Compute Shader but later used as an input into the Hull Shader. The underlying type is a regular 2D texture, with the important detail that D3D11_BIND_UNORDERED_ACCESS is one of its bind flags:

    D3D11_TEXTURE2D_DESC outputDesc;

    ZeroMemory( &outputDesc, sizeof( D3D11_TEXTURE2D_DESC ) );

    outputDesc.ArraySize          = 1;
    outputDesc.BindFlags          = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    outputDesc.Usage              = D3D11_USAGE_DEFAULT;
    outputDesc.Format             = DXGI_FORMAT_R32G32B32A32_FLOAT;
    outputDesc.Width              = TERRAIN_WIDTH;
    outputDesc.Height             = TERRAIN_LENGTH;
    outputDesc.MipLevels          = 1;
    outputDesc.SampleDesc.Count   = 1;
    outputDesc.SampleDesc.Quality = 0;

    if( FAILED( hr = g_pd3dDevice->CreateTexture2D( &outputDesc, NULL, &g_pPrePassResults ) ) )
    {
        LOG( L"Failed to create 2D pre-pass results texture!" );
        return hr;
    }

    // Create a SRV on to the output buffer so the HS can read it
    if( FAILED( hr = g_pd3dDevice->CreateShaderResourceView( reinterpret_cast<ID3D11Resource*>( g_pPrePassResults ), NULL, &g_pPrePassResultsView ) ) )
    {
        LOG( L"Failed to create a SRV for the pre-pass results texture!" );
        return hr;
    }
    Next an unordered access view needs to be created so that the Compute Shader can read and write to the texture that has just been created:

    ID3D11UnorderedAccessView* pUAV = NULL;
    D3D11_UNORDERED_ACCESS_VIEW_DESC outputUAV;

    outputUAV.Format             = DXGI_FORMAT_R32G32B32A32_FLOAT;
    outputUAV.ViewDimension      = D3D11_UAV_DIMENSION_TEXTURE2D;
    outputUAV.Texture2D.MipSlice = 0;

    if( FAILED( hr = g_pd3dDevice->CreateUnorderedAccessView( g_pPrePassResults, &outputUAV, &pUAV ) ) )
    {
        LOG( L"Failed to create unordered access view for CS output!" );

        SAFE_RELEASE( pUAV );

        return hr;
    }


    At this point the necessary resources have been created so they simply need to be bound to the pipeline and the Compute Shader initiated:

    g_pContext->CSSetShaderResources( 0, 1, &g_pHeightMapView );

    ID3D11UnorderedAccessView* outputView[ 1 ] = { pUAV };
    // The final parameter is the array of initial hidden-counter values;
    // it is ignored for a plain texture UAV like this one
    g_pContext->CSSetUnorderedAccessViews( 0, 1, outputView, (UINT*)(&outputView) );

    g_pContext->CSSetShader( g_pPrePassComputeShader, NULL, 0 );

    g_pContext->Dispatch( xCount, yCount, 1 );

    SAFE_RELEASE( pUAV );
    ID3D11ShaderResourceView* nullEntry = NULL;
    g_pContext->CSSetShaderResources( 0, 1, &nullEntry );
    ID3D11UnorderedAccessView* nullView[ 1 ] = { NULL };
    g_pContext->CSSetUnorderedAccessViews( 0, 1, nullView, (UINT*)(&nullView) );



    The code after the Dispatch() call is particularly important. Without it the UAV referencing the 2D output texture would still be bound to the pipeline; Direct3D would then prevent it from being bound as an input to the Hull Shader, as it is illegal to have a resource set as both an input and an output at the same time!

    At this point the work is done and the texture can be used by the Hull Shader. However, there is one additional piece of work that can greatly improve the quality of results - normalizing the standard deviations. Currently the values stored in the texture are raw, as-is deviation values from the per-patch plane. The range of these values across the entire dataset can be very small, often between 0.0 and 0.4, which later leaves very little separation between the flattest and the bumpiest terrain segments. By post-processing the Compute Shader results the values can be stretched out to the full 0.0 to 1.0 range, giving a much better spread of detail when the Hull Shader executes.

    outputDesc.BindFlags      = 0;
    outputDesc.Usage          = D3D11_USAGE_STAGING;
    outputDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;

    ID3D11Texture2D *pStaging = NULL;
    if( FAILED( hr = g_pd3dDevice->CreateTexture2D( &outputDesc, NULL, &pStaging ) ) )
    {
        LOG( L"Failed to create staging resource to copy CS output data to!" );
        return hr;
    }

    g_pContext->CopyResource( pStaging, g_pPrePassResults );


    The above code creates a CPU accessible staging resource and copies the GPU results to it. This copy of the data can be normalized by the application code using the following construct:

    D3D11_MAPPED_SUBRESOURCE data;
    if( SUCCEEDED( g_pContext->Map( pStaging, 0, D3D11_MAP_READ_WRITE, 0, &data ) ) )
    {
        D3DXVECTOR4 *pResults = (D3DXVECTOR4*)data.pData;

        float minStdDev = 1e9f;
        float maxStdDev = -1e9f;

        for( UINT i = 0; i < outputDesc.Width * outputDesc.Height; ++i )
        {
            if( pResults[i].w > maxStdDev )
                maxStdDev = pResults[i].w;

            if( pResults[i].w < minStdDev )
                minStdDev = pResults[i].w;
        }

        float scalar = maxStdDev - minStdDev;
        if( scalar <= 1e-5f ) // avoid divide-by-zero
            scalar = 1.0f;

        for( UINT i = 0; i < outputDesc.Width * outputDesc.Height; ++i )
        {
            pResults[i].w = (pResults[i].w - minStdDev) / scalar;
        }

        g_pContext->Unmap( pStaging, 0 );
    }


    Because the above operation has had to be performed on a CPU-accessible staging resource it is necessary to copy the staging resource back over the GPU-accessible texture. Failing to do this would mean the GPU simply uses the un-normalized results straight from the Compute Shader.

    g_pContext->CopyResource( g_pPrePassResults, pStaging );


    Results

    The following images, based on height map data for Puget Sound, Washington State USA ([Georgia Institute, 01]), demonstrate the difference that the new algorithm has:


    Naive Distance Based LOD
    208,068 Triangles Generated (60% rasterized)


    Compute Shader Based LOD
    200,452 Triangles Generated (64% rasterized)


    Whilst the top image may appear more aesthetically pleasing due to the smooth gradients, the more chaotically shaded bottom image is by far the better output from a geometric perspective. In both images the patch detail is translated into a colour - red for high detail, green for mid detail and blue for low detail.

    Consider the two areas marked by the box; they represent a good example of the benefits of the deviation-based heuristic. In the top image note that the majority of tiles being rendered have all been assigned the same LOD (notable by being the same shade of green), yet the region to the left of the box is very flat and the region to the right is very rough. Conversely, in the bottom image the flat region to the left is predominantly blue and the rough area to the right is mostly green.

    The bottom image is demonstrating that the Hull Shader is using the pre-pass information to assign detail to patches that warrant it and reducing detail from those that do not require it.
  • jollyjeffers
    Been ill most of this week so haven't really been up to much development wise. That said, I still spent most of this afternoon playing around with my Direct3D 11 Terrain Renderer.

    During the week I came across a link for the Puget Sound dataset that I thought I'd make use of as it's definitely better than anything I can invent myself!

    For those who aren't familiar with Puget Sound, it's the bay area around Seattle. Compare the colour map with a Google map (depressingly, I had to resort to using a Google service as none of the others have a 'terrain' mode [headshake]) and you should be able to identify it - look for the triangle of peaks in the lower-left portion of the image: Mt Rainier (top), Mt St. Helens (bottom-left) and Mt Adams (bottom-right).

    Courtesy of y2kiah's comments in my previous journal entry I implemented a plane-based error metric and pushed ahead with a standard deviation based calculation.

    I now have a more complex Compute Shader pre-pass that generates a plane for each patch and takes the distance of each point to the plane as input into the standard deviation equation rather than just the raw height as I had previously done. Essentially I end up with a measurement of coplanarity (is that a word?) - low values indicate that all the points in the patch are roughly following the same flat surface and high values indicate a much rougher/uneven patch. Perfect!

    I also modified the way that the final LOD is chosen. I use the standard deviation as well as distance as inputs and have taken to simply adding them together with a given bias. Currently I'm using 35% distance and 65% deviation.
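
    As a rough illustration, the blend might look something like the following in the Hull Shader constant function. This is a sketch only - the resource layout, function and constant names here are assumptions for illustration, not my actual code:

    // Pre-pass results: xyz = patch plane normal, w = normalized standard deviation
    Texture2D<float4> prePassData : register( t1 );

    float ComputePatchLOD( uint2 patchIdx, float distanceLOD ) // distanceLOD assumed already in [0,1]
    {
        float deviation = prePassData.Load( uint3( patchIdx, 0 ) ).w;

        // Weighted blend: 35% distance, 65% deviation
        float lod = 0.35f * distanceLOD + 0.65f * deviation;

        // Map into the tessellation factor range fed to the fixed-function tessellator
        return lerp( 1.0f, 64.0f, saturate( lod ) );
    }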

    From initial testing the results are pretty much exactly what I want. Flat and distant areas are low detail and rough pieces are high detail subject to their distance to the camera.

    jollyjeffers
    I've spent most of this afternoon playing around with more complex LOD selection algorithms.

    I tweaked and fixed the last few bugs in the Compute Shader pre-pass I previously discussed and now have it seeding my Hull Shader with additional per-patch data.

    However, I'm not really happy with the results. Today's experiments have mostly demonstrated to me that a "one size fits all" metric is very hard to find - some heightmaps suit different heuristics better than others, and also having a multi-variable LOD scheme is very hard to balance regardless of target data. It's proven far too easy to invalidate one variable in favour of another, or have multiple variables cancel each other out, or have variables working well in different parts of the image (my main problem)...



    The above is the naive approach - simply take the distance from the camera, clamped to a maximum distance.

    Two main problems exist - there is extra detail where it's not needed (flat areas are the same/similar shade of green as the hilly areas) and there is no red, as the geometry closest to the camera is clipped by the view/projection transform and the near clip plane.



    The above is a revision of the previous approach in that it implements a near distance as well as a far distance. Notice that you can now see all three graduations of detail - red, green and blue.

    Still, the problem of detail where it's not necessary remains.



    The above is a static LOD metric using the standard deviation of height values. The idea here being that 'noisy' patches have a high standard deviation whereas flatter areas will have a very low standard deviation. This should distribute detail to the patches that vary the most and thus deserve the most detail.

    It works pretty well, but there are a few cases where it can be thrown off quite badly - particularly where most of a patch is flat and only the edge is raised. Like the skirting tiles around the islands.



    The above is based on the spread of heights - basically maximum less minimum. This achieves a similar effect to the standard deviation but isn't so easily fooled by skirting tiles at the expense of generating a few patches with more detail than they probably need.



    The above modulates the standard deviation by the distance from the camera, which should work well as a hybrid. However, the typically very small standard deviation (a maximum of 0.305 in this image) means the result either leans heavily on the distance and gives mostly blue or, weighted the other way, the deviation is drowned out by the distance metric.



    Above modulates distance with the spread of heights and seems to produce much more pleasing results with a better distribution of detail. At this time it's my preferred hybrid metric for LOD.
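
    For reference, a minimal sketch of that hybrid (the names here are illustrative assumptions, not my actual code) is simply:

    // minMaxHeight.x/y = per-patch minimum/maximum height from the pre-pass
    float ComputeHybridLOD( float distanceFactor, float2 minMaxHeight ) // distanceFactor in [0,1], 1 = closest
    {
        float spread = minMaxHeight.y - minMaxHeight.x;  // 0 for a perfectly flat patch

        // Modulate the distance-based factor by how much the heights vary
        return saturate( distanceFactor * spread );
    }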




    I've posted another YouTube video of the latter algorithm, and in it (as well as in the above images) you can spot some gaps between patches - the twinkling white pixels. This is really not good, and it seems to be a discontinuity introduced by my "improved" distance-from-camera equation, which is a shame. Something I need to look into tomorrow.

    I want to start capturing some of the amplification ratios and other statistics as part of the display. I've got them writing to the console, but I want those in the videos so you can see the actual geometric complexity differences.


    Thoughts?
    jollyjeffers
    Was going to get an early night tonight, but an hour past my target I'm still messing around with Direct3D 11 Compute Shaders.

    At less than two hours from first keypress to a working algorithm I'm quite impressed, and the CS definitely seems to suit image processing algorithms nicely. Definitely a lot nicer than using pixel shaders!

    Texture2D           texHeightMap  : register( t0 );
    RWTexture1D<float4> bufferResults : register( u0 );

    groupshared float groupResults[16 * 16];
    groupshared float4 stats;

    [numthreads(16, 16, 1)]
    void csMain( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
    {
        // Gid  = The xyz index as spun up by the actual application call
        //        e.g. if ::Dispatch(16,16,1) is used then this will index into those bounds
        // DTid = Similar to above, but the global offset, so for [numthreads(16,16,1)] and
        //        ::Dispatch(16,16,1) this variable will range between 0-255,0-255,0
        // GTid = The offset within the [numthreads()] bounds e.g. 0-15,0-15,0
        // GI   = The flattened offset for this group e.g. 0-255 (16*16)

        // Sample the appropriate 16x16 region
        groupResults[GI] = texHeightMap.Load( uint3( DTid.xy, 0 ) ).r;

        // Block until all threads have finished reading
        GroupMemoryBarrierWithGroupSync();

        // Then sum up the details
        if(GI == 0)
        {
            // Thread #0 computes the minimum
            // value from the raw data
            float mn = 1e9f;

            for(uint i = 0; i < 16 * 16; ++i)
                if(groupResults[i] < mn)
                    mn = groupResults[i];

            // Store!
            stats.r = mn;
        }
        else if(GI == 1)
        {
            // Thread #1 computes the maximum value
            // from the raw data
            float mx = -1e9f;

            for(uint i = 0; i < 16 * 16; ++i)
                if(groupResults[i] > mx)
                    mx = groupResults[i];

            // Store!
            stats.g = mx;
        }
        else if(GI == 2)
        {
            // Thread #2 computes the average value
            // from the raw data
            float avg = 0.0f;

            for(uint i = 0; i < 16 * 16; ++i)
                avg += groupResults[i];

            avg /= (16.0f * 16.0f);

            // Store!
            stats.b = avg;
        }
        else if(GI == 3)
        {
            // Thread #3 computes the standard
            // deviation of the raw data
            float avg = 0.0f;

            for(uint i = 0; i < 16 * 16; ++i)
                avg += groupResults[i];

            avg /= (16.0f * 16.0f);

            float stdDev = 0.0f;

            for(uint i = 0; i < 16 * 16; ++i)
                stdDev += pow(groupResults[i] - avg, 2);

            stdDev /= ((16.0f*16.0f)-1.0f);

            stats.a = sqrt(stdDev);
        }

        GroupMemoryBarrierWithGroupSync();

        if( GI == 0 )
        {
            // Determine which cell in the output to write
            uint outIdx = Gid.x + Gid.y * 16;

            // Store!
            bufferResults[outIdx] = stats;
        }
    }


    It's basically a three-phase algorithm:
    1. All 256 threads read one sample from the heightmap and store it in group-shared memory
    2. Once complete, the first four threads (leaving 252 idle) each calculate one of the statistics. In theory these will be done concurrently.
    3. Once all four statistics have been computed, they are written to the actual output buffer by the first thread (255 threads idle)

    Bit pointless to guess the performance profile, but it should be the sum of the slowest or most delayed read, then the most expensive of the four statistics, then the time to write to the 1D texture. I could definitely implement it better, but plenty of scope for lovely concurrent execution [grin]

    Running it over a 256x256 heightmap gives me:

    LOG: Ran compute shader pre-pass, generated 256-element look-up table (4096 bytes) in 0.02ms. Used 16x16 groups of 16x16 threads over a 256x256 heightmap (65536 threads in total). [RunComputeShaderPrePass(...) @ line 1406]
    Result 0 (for tile 0,0 starting at 0,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 1 (for tile 1,0 starting at 16,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 2 (for tile 2,0 starting at 32,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 3 (for tile 3,0 starting at 48,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 4 (for tile 4,0 starting at 64,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 5 (for tile 5,0 starting at 80,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 6 (for tile 6,0 starting at 96,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 7 (for tile 7,0 starting at 112,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 8 (for tile 8,0 starting at 128,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 9 (for tile 9,0 starting at 144,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 10 (for tile 10,0 starting at 160,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 11 (for tile 11,0 starting at 176,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 12 (for tile 12,0 starting at 192,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 13 (for tile 13,0 starting at 208,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 14 (for tile 14,0 starting at 224,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 15 (for tile 15,0 starting at 240,0): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 16 (for tile 0,1 starting at 0,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 17 (for tile 1,1 starting at 16,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 18 (for tile 2,1 starting at 32,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 19 (for tile 3,1 starting at 48,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 20 (for tile 4,1 starting at 64,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 21 (for tile 5,1 starting at 80,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 22 (for tile 6,1 starting at 96,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 23 (for tile 7,1 starting at 112,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 24 (for tile 8,1 starting at 128,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 25 (for tile 9,1 starting at 144,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 26 (for tile 10,1 starting at 160,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 27 (for tile 11,1 starting at 176,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 28 (for tile 12,1 starting at 192,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 29 (for tile 13,1 starting at 208,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 30 (for tile 14,1 starting at 224,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 31 (for tile 15,1 starting at 240,16): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 32 (for tile 0,2 starting at 0,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 33 (for tile 1,2 starting at 16,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 34 (for tile 2,2 starting at 32,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 35 (for tile 3,2 starting at 48,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 36 (for tile 4,2 starting at 64,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 37 (for tile 5,2 starting at 80,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 38 (for tile 6,2 starting at 96,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 39 (for tile 7,2 starting at 112,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 40 (for tile 8,2 starting at 128,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 41 (for tile 9,2 starting at 144,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 42 (for tile 10,2 starting at 160,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 43 (for tile 11,2 starting at 176,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 44 (for tile 12,2 starting at 192,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 45 (for tile 13,2 starting at 208,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 46 (for tile 14,2 starting at 224,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 47 (for tile 15,2 starting at 240,32): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 48 (for tile 0,3 starting at 0,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 49 (for tile 1,3 starting at 16,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 50 (for tile 2,3 starting at 32,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 51 (for tile 3,3 starting at 48,48): Minimum = 0.000, Maximum = 0.110, Average = 0.001, Standard Deviation = 0.008
    Result 52 (for tile 4,3 starting at 64,48): Minimum = 0.000, Maximum = 0.333, Average = 0.027, Standard Deviation = 0.081
    Result 53 (for tile 5,3 starting at 80,48): Minimum = 0.000, Maximum = 0.333, Average = 0.028, Standard Deviation = 0.083
    Result 54 (for tile 6,3 starting at 96,48): Minimum = 0.000, Maximum = 0.333, Average = 0.021, Standard Deviation = 0.065
    Result 55 (for tile 7,3 starting at 112,48): Minimum = 0.000, Maximum = 0.165, Average = 0.004, Standard Deviation = 0.020
    Result 56 (for tile 8,3 starting at 128,48): Minimum = 0.000, Maximum = 0.161, Average = 0.006, Standard Deviation = 0.023
    Result 57 (for tile 9,3 starting at 144,48): Minimum = 0.000, Maximum = 0.165, Average = 0.014, Standard Deviation = 0.041
    Result 58 (for tile 10,3 starting at 160,48): Minimum = 0.000, Maximum = 0.196, Average = 0.015, Standard Deviation = 0.044
    Result 59 (for tile 11,3 starting at 176,48): Minimum = 0.000, Maximum = 0.157, Average = 0.008, Standard Deviation = 0.025
    Result 60 (for tile 12,3 starting at 192,48): Minimum = 0.000, Maximum = 0.027, Average = 0.000, Standard Deviation = 0.002
    Result 61 (for tile 13,3 starting at 208,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 62 (for tile 14,3 starting at 224,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 63 (for tile 15,3 starting at 240,48): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 64 (for tile 0,4 starting at 0,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 65 (for tile 1,4 starting at 16,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 66 (for tile 2,4 starting at 32,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 67 (for tile 3,4 starting at 48,64): Minimum = 0.000, Maximum = 0.333, Average = 0.027, Standard Deviation = 0.081
    Result 68 (for tile 4,4 starting at 64,64): Minimum = 0.443, Maximum = 1.000, Average = 0.945, Standard Deviation = 0.115
    Result 69 (for tile 5,4 starting at 80,64): Minimum = 0.667, Maximum = 1.000, Average = 0.971, Standard Deviation = 0.083
    Result 70 (for tile 6,4 starting at 96,64): Minimum = 0.302, Maximum = 1.000, Average = 0.788, Standard Deviation = 0.196
    Result 71 (for tile 7,4 starting at 112,64): Minimum = 0.000, Maximum = 0.780, Average = 0.426, Standard Deviation = 0.218
    Result 72 (for tile 8,4 starting at 128,64): Minimum = 0.000, Maximum = 0.498, Average = 0.191, Standard Deviation = 0.136
    Result 73 (for tile 9,4 starting at 144,64): Minimum = 0.310, Maximum = 0.733, Average = 0.503, Standard Deviation = 0.090
    Result 74 (for tile 10,4 starting at 160,64): Minimum = 0.333, Maximum = 0.737, Average = 0.651, Standard Deviation = 0.113
    Result 75 (for tile 11,4 starting at 176,64): Minimum = 0.047, Maximum = 0.737, Average = 0.413, Standard Deviation = 0.224
    Result 76 (for tile 12,4 starting at 192,64): Minimum = 0.000, Maximum = 0.082, Average = 0.006, Standard Deviation = 0.018
    Result 77 (for tile 13,4 starting at 208,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 78 (for tile 14,4 starting at 224,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 79 (for tile 15,4 starting at 240,64): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 80 (for tile 0,5 starting at 0,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 81 (for tile 1,5 starting at 16,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 82 (for tile 2,5 starting at 32,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 83 (for tile 3,5 starting at 48,80): Minimum = 0.000, Maximum = 0.333, Average = 0.028, Standard Deviation = 0.083
    Result 84 (for tile 4,5 starting at 64,80): Minimum = 0.533, Maximum = 1.000, Average = 0.927, Standard Deviation = 0.116
    Result 85 (for tile 5,5 starting at 80,80): Minimum = 0.502, Maximum = 0.996, Average = 0.779, Standard Deviation = 0.151
    Result 86 (for tile 6,5 starting at 96,80): Minimum = 0.000, Maximum = 0.925, Average = 0.488, Standard Deviation = 0.305
    Result 87 (for tile 7,5 starting at 112,80): Minimum = 0.000, Maximum = 0.380, Average = 0.157, Standard Deviation = 0.091
    Result 88 (for tile 8,5 starting at 128,80): Minimum = 0.055, Maximum = 0.733, Average = 0.402, Standard Deviation = 0.185
    Result 89 (for tile 9,5 starting at 144,80): Minimum = 0.263, Maximum = 0.737, Average = 0.638, Standard Deviation = 0.116
    Result 90 (for tile 10,5 starting at 160,80): Minimum = 0.251, Maximum = 0.737, Average = 0.611, Standard Deviation = 0.177
    Result 91 (for tile 11,5 starting at 176,80): Minimum = 0.000, Maximum = 0.737, Average = 0.313, Standard Deviation = 0.304
    Result 92 (for tile 12,5 starting at 192,80): Minimum = 0.000, Maximum = 0.008, Average = 0.000, Standard Deviation = 0.000
    Result 93 (for tile 13,5 starting at 208,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 94 (for tile 14,5 starting at 224,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 95 (for tile 15,5 starting at 240,80): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 96 (for tile 0,6 starting at 0,96): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 97 (for tile 1,6 starting at 16,96): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 98 (for tile 2,6 starting at 32,96): Minimum = 0.000, Maximum = 0.000, Average = 0.000, Standard Deviation = 0.000
    Result 99 (for tile 3,6 starting at 48,96): Minimum = 0.000, Maximum = 0.333, Average = 0.027, Standard Deviation = 0.080
    Result 100 (for tile 4,6 starting at 64,96): Minimum = 0.498, Maximum = 1.000, Average = 0.693, Standard Deviation = 0.184
    Result 101 (for tile 5,6 starting at 80,96): Minimum = 0.000, Maximum = 0.686
    jollyjeffers
    Evening all,

    My quiet spell has almost been broken now as I've spent pretty much all of bank holiday Monday writing up my Direct3D 11 tessellation algorithm for terrain rendering. Probably the most productive day I've had in weeks, which is a little disappointing [headshake]

    Anyway, two new YouTube videos for you to check out:

    D3D11 Terrain Tessellation Demo 2 - Solid and
    D3D11 Terrain Tessellation Demo 2 - Debug.

    The former video matches up with the wireframe video I posted a while back after figuring out what the [partitioning("")] attribute really did. I recommend you check the solid rendering out, as it is nearly impossible to tell where the geomorphing occurs - watching the silhouette of horizon geometry and/or some shading artefacts are your only clues, and they tend to be subtle enough that you wouldn't see them in a 'production' implementation.

    I'm genuinely quite impressed at how well the fixed-function tessellator works in this case. The only dampener could be if the performance of fractional_[odd|even] somehow turns out to be significantly worse than integer when hardware finally arrives.


    (YouTube video)


    The above image is more what I wanted to discuss this evening though.

    Today I wrote up 17 pages of A4 describing my adaptation of Greg Snook's Interlocking Terrain Tiles algorithm, including various code snippets and diagrams to make it as straightforward as possible.

    For my encore I want to write up an enhancement that I shamelessly 'borrowed' from an Nvidia GDC slide deck (can't find the link right now). In it they have a single slide indicating that pre-processing with a CS can provide useful weighting for terrain tessellation.

    I've yet to work on the actual implementation here but have started to scope it out and decide what I want to do - but more on that in a future journal entry.

    For now, the above screenshot (and associated video) were created as my justification for this extension.

    Basically, a simple linear mapping of LODs based on camera distance doesn't work very well at all. The above image takes the inner tessellation factor and shades it red for high detail, green for middle detail and blue for low detail. As you can see, there is little (if any!) red in the image, and a large proportion of the final screen real-estate is covered by pixels generated from low-detail geometry.

    For a given performance budget it is quite clear that the camera distance heuristic does not spend it wisely. My current implementation has the Hull Shader determining LODs and the Domain Shader performing displacement mapping - specifically, the Hull Shader has no visibility of the data that actually makes up the final surface!!
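
    To make that last point concrete, here's roughly what the Domain Shader side looks like. This is an illustrative sketch rather than my actual code - the struct members, resource names and quad-patch layout are all assumptions:

    cbuffer cbPerFrame : register( b0 )
    {
        matrix matWorldViewProj;
    };

    Texture2D<float> texHeightMap  : register( t0 );
    SamplerState     linearSampler : register( s0 );

    struct HS_CONSTANT_OUTPUT
    {
        float edges[4]  : SV_TessFactor;
        float inside[2] : SV_InsideTessFactor;
    };

    struct CONTROL_POINT
    {
        float3 position : POSITION;
        float2 uv       : TEXCOORD0;
    };

    struct DS_OUTPUT
    {
        float4 position : SV_Position;
    };

    [domain("quad")]
    DS_OUTPUT dsMain( HS_CONSTANT_OUTPUT patchConst,
                      float2 domain : SV_DomainLocation,
                      const OutputPatch<CONTROL_POINT, 4> patch )
    {
        DS_OUTPUT output;

        // Bilinearly interpolate position and texture coordinate across the quad patch
        float3 pos = lerp( lerp( patch[0].position, patch[1].position, domain.x ),
                           lerp( patch[2].position, patch[3].position, domain.x ), domain.y );
        float2 uv  = lerp( lerp( patch[0].uv, patch[1].uv, domain.x ),
                           lerp( patch[2].uv, patch[3].uv, domain.x ), domain.y );

        // Displacement mapping: the height map is only read here, in the Domain Shader -
        // the Hull Shader never sees it, which is the motivation for the CS pre-pass
        pos.y = texHeightMap.SampleLevel( linearSampler, uv, 0.0f );

        output.position = mul( float4( pos, 1.0f ), matWorldViewProj );
        return output;
    }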

    I want to put in a CS-based pre-processing step to down-sample the heightmap to a per-tile set of coefficients that allow the Hull Shader constant function to extract more information about the surface it'll eventually end up representing.

    If nothing else it means that I can play with the Compute Shader and have a Direct3D 11 sample that exercises every single shader unit available [cool]
    jollyjeffers

    Single threaded GDI

    One of the many blogs I keep an eye on is Engineering Windows 7. For an official blog it's been a bit wordy at times, but it's worth reading when the topic is of interest.

    The most recent one is Engineering Windows 7 Graphics Performance, which I thought might be of interest to you guys. It's pitched at a much higher level than most people here, as games and graphics developers, will be used to, but it provides some interesting context nonetheless.



    The above diagram/section caught my attention. Seems that after some experimentation they found that the single threaded and globally locking nature of GDI was a huge bottleneck in the responsiveness of the system.

    Firstly this raised an interesting example for me as my team are working on profiling/benchmarking a complex enterprise IT system at the moment. The above finding on the E7 blog demonstrates that getting cold, hard data can throw up problems or characteristics you never even thought existed!

    Secondly, this is of interest because my NUMBER ONE biggest irritation with Windows is when it's unresponsive. I simply won't tolerate it; I expect that if I click on something it should respond immediately - maybe not with the full results, but at least with a progress bar, loading image or something similar. A locked-up application is absolutely and unequivocally unacceptable.

    Anyway. Seems that if they solved that bottleneck then the lock-ups disappeared:



    There are a few other interesting bits and pieces in the blog entry along similar lines, but that's the one that really stood out for me.

    Anyone here using the Win7 RC yet?


    In other news... Extremely busy at the moment, and any spare time I've got is being directed towards 'side projects', namely D3D11 tessellation, my SO debugging article and taking a look at SlimDX11. Not much spare time for forums unfortunately.

    Oh, and I'm getting a new bike - so more time outdoors and less time behind a PC [cool]
    jollyjeffers

    DirectX Blog

    Started a thread in the DirectX & XNA forum that says it all really... FYI - Official DirectX blog has appeared
    jollyjeffers

    D3D11 Partitioning - pow2

    Evening all,

    My previous entry (p.s. you're all losers for not taking me up on my question) missed out the pow2 partitioning mode, which should be useful for texture-based displacement mapping (2^n subdivision conveniently matches mip-map levels). I chased this up with Michael Oneppo, PM for the Direct3D team, and it turns out you need to use the Process**() HLSL intrinsics to enable this functionality.

    At this stage I'm not sure if that's a by-product of D3D11 still being in CTP or a 'by-design' choice, but without the Process**() intrinsics it appears that pow2 produces the same output as integer partitioning.

    More on this later I hope.

As an aside, I just watched Waltz With Bashir, having bought it on DVD last week. I didn't realise it was in Hebrew with English subtitles! Regardless, it's definitely a film worth watching and one that I'm truly glad I now own and have watched. But I don't think I'll sleep properly tonight - it's got that sort of gravity to it [oh]. For me, I'd give it the silver medal, between Oliver Stone's Platoon (gold) and Stanley Kubrick's Full Metal Jacket (bronze).
    jollyjeffers

    Hull Shader Partitioning Methods



Missing the 4th method - "pow2" - because as far as I can tell it's broken in Nov'08 and Mar'09 [sad]

    Can you spot the pattern of how the three methods work? Bonus points for why morphing won't work in the same way on the integer mode.
    jollyjeffers

    Tessellation, now with less pop

    ">


An updated video using [partitioning("fractional_odd")] instead of "integer" like in the first video. Note that it is now much harder, if possible at all, to see the popping of the geometry between LOD levels [cool]
    jollyjeffers

    Partitioning Modes

    Had a day or two off over the Easter weekend but now giving some thought to the next 'problem' with Direct3D 11 tessellation - interpolating between tessellation factors.

If you watch either of the videos I posted you will see some quite noticeable 'popping', which is all down to the fact that it does not interpolate between factors. With integer partitioning it will, for example, flip between 2 and 3 at 2.5 and it'll be a binary change - it's either 2 or it's 3, and never a combination of the two.

    - edit -

I just wrote up a possible plan for interpolating and handling the case where the generated domain points didn't line up, and then realised the GPU does that for me.

    D'Oh.

    Some diagrams I've drawn for the article I'm writing showed some very odd patterns for certain tessellation levels - especially non-uniform ones. I hadn't quite figured that out until reading a key sentence in the spec again just now.

TBC, but from what I now understand, fractional partitioning values will have their domain points interpolated to aid transitions between LOD's.

    So for a [partitioning("fractional_even")] case you can get a pattern along the lines of:

    +-------+-------+ LOD = 2
    +------+++------+ LOD = 2.5 (rounded up to LOD = 4, 25% interpolation)
    +-----+-+-+-----+ LOD = 3.0 (rounded up to LOD = 4, 50% interpolation)
    +----+--+--+----+ LOD = 3.5 (rounded up to LOD = 4, 75% interpolation)
    +---+---+---+---+ LOD = 4.0

    Bit tricky to see I'm sure, but the domain points at the higher LOD appear to "grow" out of the lower LOD ones.
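For reference, the partitioning mode is just an attribute on the Hull Shader entry point, so switching away from the popping integer mode is a one-word change. A minimal sketch - the structures and the fixed-LOD constant function below are placeholders, not the actual terrain code:

struct VS_OUTPUT { float3 worldPos : WORLD_POSITION; };
struct HS_OUTPUT { float3 worldPos : WORLD_POSITION; };

struct HS_CONSTANTS
{
    float edges[4]  : SV_TessFactor;
    float inside[2] : SV_InsideTessFactor;
};

HS_CONSTANTS hsConstants( InputPatch<VS_OUTPUT, 4> patch )
{
    // Arbitrary fractional LOD purely for illustration
    HS_CONSTANTS c;
    c.edges[0] = c.edges[1] = c.edges[2] = c.edges[3] = 7.3f;
    c.inside[0] = c.inside[1] = 7.3f;
    return c;
}

[domain("quad")]
[partitioning("fractional_even")]  // was "integer" - the only change needed
[outputtopology("triangle_cw")]
[outputcontrolpoints(4)]
[patchconstantfunc("hsConstants")]
HS_OUTPUT hsMain( InputPatch<VS_OUTPUT, 4> patch, uint i : SV_OutputControlPointID )
{
    // Straight pass-through of the control points
    HS_OUTPUT output;
    output.worldPos = patch[i].worldPos;
    return output;
}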

    I just re-ran the video I posted with that partitioning mode instead of integer and I get no more popping. That was a lot easier than I expected [cool]
    jollyjeffers

    Videos and GP&T thread...

Today's journal entry is really in the Graphics Programming & Theory Forum. I wanted to get some feedback from the knowledgeable chaps that hang out over there - bit of an RFC really [smile]

    Read the thread here.

    For those who can't be bothered with that, YouTube videos:

    ">
    ">

    (Click for YouTube video)
    jollyjeffers

    Another screenshot



Might be a little easier to see the LOD terrain tessellation in the above shot than in yesterday's diagrams...

Anyone got any recommendations for posting videos of the above? I'm rendering one out with VirtualDub now and considering uploading it to YouTube, but surely there must be something better?
    jollyjeffers

    Snook's terrain algo on a GPU

    Evening all,

I'm still alive. The move from West London to North West London finally appears complete - boxes sorted, broadband switched on.

    Was quite productive living in this "offline" world.



    The above is generated from a simple 8x8 grid of quads (81 vertices @ 972 bytes / 768 indices @ 1536 bytes) and the GPU renders me a total of 1572 triangles. All in a single draw call and without the CPU creating any look-up tables or constant buffers with pre-computed LOD values. Cool, huh? [cool]

    I always quite liked Greg Snook's "Simplified Terrain Using Interlocking Tiles" that first featured in Game Programming Gems vol. 2 (to my knowledge, there is no digital version available for free).

    Back when that text was up-to-date it had many of the quality properties of the more favourable ROAM-like algorithms whilst also mapping very nicely to GPU's. If you didn't mind storing extra indices (tiny - uint16's!) you could get very good results without having to constantly update your raw data (one of the main problems with the more CPU-centric LOD algorithms).

    In a nutshell Greg's algorithm divided a terrain tile into 5 sections - the inner area and the 4 edges:

    +-----+
    |\___/|
    ||___||
|/   \|
    +-----+


    Basically the inner section would be the main LOD and the 4 'skirts' would map between that inner LOD and whatever the neighbouring tile was using.

    So I realised a week or so back that the above (crappy) diagram matches the Direct3D 11 quad patch tessellation model. Pretty much 100% exactly as well.

    The initial screen-shot shows my current implementation of this algorithm (less height data). As a slightly clearer alternative:



    So the above is a 3x3 grid of quads - you can get a feel for how big they are by looking at the top-middle quad. That's the lowest detail, a quad divided into two triangles only.

    The LOD increases as the mid-point of the tile gets closer to the camera.

    Now consider the right-most corner quad. This has two neighbours, diagonally up-left and down-left. The up-left has an LOD of 2 and the down-left has an LOD of 5 whereas the tile in question has an LOD of 4. Notice how the edges, the bits with the odd patterns, blend between all 3 of the LOD's such that there aren't any 'stray' vertices - if this were a solid rendering and not wireframe there would be NO gaps in this surface.
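The heart of the D3D11 version is the Hull Shader constant function: each edge factor is derived from both tiles that share that edge, so neighbouring patches always agree on the tessellation along their shared boundary. A rough sketch of the idea is below - the 12-control-point layout, the LOD heuristic and the names are all assumptions for illustration, not the actual code, and it would be paired with a simple pass-through control-point function like the one sketched in the Partitioning Modes entry above:

// Assumed per-frame constants
cbuffer cbCamera : register( b0 )
{
    float3 cameraPosition;
};

struct VS_OUTPUT { float3 worldPos : WORLD_POSITION; };

struct HS_CONSTANTS
{
    float edges[4]  : SV_TessFactor;
    float inside[2] : SV_InsideTessFactor;
};

// Hypothetical LOD heuristic - simple distance from the camera to a point
float ComputeLOD( float3 midPoint )
{
    float d = distance( cameraPosition, midPoint );
    return clamp( 64.0f / d, 1.0f, 64.0f );
}

HS_CONSTANTS hsTerrainConstants( InputPatch<VS_OUTPUT, 12> patch )
{
    // Assumed layout: patch[0..3] are this tile's corners, patch[4..11] supply
    // two points per neighbouring tile so its midpoint can be reconstructed.
    float3 thisMid = ( patch[0].worldPos + patch[1].worldPos
                     + patch[2].worldPos + patch[3].worldPos ) * 0.25f;
    float thisLOD = ComputeLOD( thisMid );

    HS_CONSTANTS c;

    // Each edge uses the minimum of this tile's LOD and its neighbour's, so
    // both sides of a shared edge compute the same factor and tessellate identically.
    [unroll]
    for( int e = 0; e < 4; ++e )
    {
        float3 neighbourMid = ( patch[4 + e * 2].worldPos + patch[5 + e * 2].worldPos ) * 0.5f;
        c.edges[e] = min( thisLOD, ComputeLOD( neighbourMid ) );
    }

    c.inside[0] = c.inside[1] = thisLOD;
    return c;
}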

    I'm writing this up for my main article, so I'll have some better diagrams soon!

    After this I want to try and render it into a YouTube video so you can see how it works and, more importantly, how damned ugly the popping is [wink]
    jollyjeffers

    Detailed flow through the Direct3D 11 Pipeline

    Feedback on the below is greatly appreciated!



    Detailed flow

At first glance the new stages don't appear overly complex, but on closer inspection when designing and writing code the flow of data and the responsibilities of each unit can quickly become confusing. This is compounded by the deeper pipeline being harder to visualize - a classic VS+PS pipeline was easy and small enough to be held in a developer's head, but trying to juggle all 6 stages can get much more difficult!



The above diagram shows a simplified flow of the Curved Point Normal implementation covered later in this article. To help make the diagram clearer the conceptual view discussed in the previous section has been included with a matching colour scheme. Arrows show the flow of data and/or execution and individual shader functions are denoted in the form "xx(yy) -> zz", where 'xx' is the abbreviation of the shader type, 'yy' are important parameter(s) and 'zz' is the output.

    Before getting involved in the full details there are two important and general observations to be made:

1. Several of the shaders are executed repeatedly for different inputs, according to how the developer configures them. Whilst a bit obvious, it is important to note that a Domain Shader transforms only a single new point at a time - unlike the Geometry Shader, the code you write does not have to write out all of the new points to a stream.

    2. Previously in Direct3D 10 the Geometry Shader was the only unit that had visibility of the entire set of inputs. Now the Hull, Domain and Geometry Shaders have visibility of all outputs from the previous stage.


    The Input Assembler



As previously mentioned, the Input Assembler can now emit primitives with up to 32 vertices (control points), but regardless of this the IA functions in exactly the same way as it has in previous versions. It uses the vertex declaration (an ID3D11InputLayout created from an array of D3D11_INPUT_ELEMENT_DESC's), a vertex buffer (an ID3D11Buffer with a binding of D3D11_BIND_VERTEX_BUFFER) and an index buffer (another ID3D11Buffer with a binding of D3D11_BIND_INDEX_BUFFER). The topology set on the device will be D3D11_PRIMITIVE_TOPOLOGY_n_CONTROL_POINT_PATCHLIST, where 'n' is between 1 and 32, and the IA will then read the index buffer in chunks of 'n' and pick out the appropriate vertices from the vertex buffer.

In this example there are four vertices defining a quad on the XZ plane and six indices defining two triangles. Unlike many other tessellation algorithms there is no adjacency information required, so it doesn't appear any different from rendering a normal quad. Without tessellation the output would look like:
    < image of just plain 4-vert quad >

    The Vertex Shaders



As previously stated, the vertex shader no longer has to output a projection-space vertex to SV_Position like in Direct3D 10 (technically this could be done in a Geometry Shader in D3D10, but it was more efficient to stick with the conventional VS approach). It is now completely free to operate on data in whatever form the application gives it (via the IA) and output that data in whatever coordinate system or format it chooses.

    The common use-case for a vertex shader with D3D11 tessellation will be for animation - transforming a model according to the bones provided for a skeletal animation being a good example. In this example the vertex shader simply transforms the model-space vertex buffer data into world-space for the later stages. Note that once the VS has executed the later stages cannot see any of the original data from the vertex buffer, so if this is useful it should be passed down as part of an output.
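A minimal sketch of such a vertex shader - the structures and the world matrix constant are assumptions for illustration, not taken from the actual sample:

// Assumed per-object constants
cbuffer cbWorld : register( b0 )
{
    float4x4 matWorld;
};

struct VS_INPUT  { float3 position : POSITION;       };
struct VS_OUTPUT { float3 worldPos : WORLD_POSITION; };

// No SV_Position here - the projection-space transform now happens in the
// Domain Shader, so the VS just moves the control points into world space.
VS_OUTPUT vsMain( VS_INPUT input )
{
    VS_OUTPUT output;
    output.worldPos = mul( float4( input.position, 1.0f ), matWorld ).xyz;
    return output;
}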

    The Hull Shaders



    This is the first new programmable unit in the Direct3D 11 pipeline. It is made up of two developer-authored functions - the Hull Shader itself and a 'constant function'.

    The constant function is executed once per-patch and its job is to compute any values that are shared by all control points and don't logically belong as per-control-point attributes. As far as Direct3D 11 is concerned the requirement is to output an array of SV_TessFactor and SV_InsideTessFactor values. Depending on the primitive topology the size of these arrays varies, but all this is discussed in a later section. The outputs from a constant function are limited to 128 scalars (32 float4's) which gives ample room for per-patch constants once the tessellation factors have been included.

An attribute on the main Hull Shader function declares how many output control points will be generated. This can be a maximum of 32 and does not have to match the topology set on the Input Assembler - it is perfectly legal for the HS to increase or decrease the number of control points. This attribute also determines how many times the Hull Shader is executed: once for each declared output control point, with the index provided via an SV_OutputControlPointID uint input. The quantity of data per control point can be up to 32 float4's (128 scalars), the same as for the per-patch constant function, but with one difference: the maximum output across all HS invocations is 3,968 scalars. In practice this means that if you're outputting 32 control points then each can only use 31 float4's instead of 32. Putting these numbers together the entire Hull Shader output size is clamped to 4kb.

    Both functions have full visibility of all vertices output by the Vertex Shader and deemed to be part of this primitive. This is represented on the diagram by the blue arrows leading from the Vertex Shading section to each of the Hull Shader and constant function invocations.

    In this example of Curved Point Normal Triangles (discussed in detail later) the later Domain Shader needs 10 control points to construct the appropriate cubic surface. These are broken up into 3 for the original vertices, 2 for each edge (6 in total) and 1 in the middle of the triangle, for 10 in total.
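Pulling the above together, a skeleton of the two Hull Shader functions might look something like the sketch below. The structures, the fixed tessellation factors, the partitioning attribute and the control point maths are all placeholders - the point is the attribute block, the per-patch constant output and the SV_OutputControlPointID indexing:

struct VS_OUTPUT { float3 worldPos : WORLD_POSITION; };
struct HS_OUTPUT { float3 worldPos : WORLD_POSITION; };

// Per-patch output: three edge factors and one inside factor for the tri domain
struct HS_CONSTANT_OUTPUT
{
    float edges[3] : SV_TessFactor;
    float inside   : SV_InsideTessFactor;
};

HS_CONSTANT_OUTPUT hsConstants( InputPatch<VS_OUTPUT, 3> patch )
{
    HS_CONSTANT_OUTPUT output;

    // Fixed LOD purely for illustration - a real implementation would compute
    // these from the patch data and/or application-supplied constants.
    output.edges[0] = output.edges[1] = output.edges[2] = 4.0f;
    output.inside = 4.0f;
    return output;
}

[domain("tri")]
[partitioning("fractional_odd")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(10)]          // 3 corners + 6 edge points + 1 centre
[patchconstantfunc("hsConstants")]
HS_OUTPUT hsMain( InputPatch<VS_OUTPUT, 3> patch, uint i : SV_OutputControlPointID )
{
    // Executed once per output control point; 'i' identifies which of the 10
    // control points this invocation is responsible for producing.
    HS_OUTPUT output;
    output.worldPos = patch[ min( i, 2u ) ].worldPos; // placeholder - real code builds the cubic control mesh here
    return output;
}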

    The Fixed Function Tessellator



    The next stage of processing is entirely fixed function and operates as a black-box except for the two inputs - the SV_TessFactor and SV_InsideTessFactor values output by the Hull Shader constant function.

A very important point to note is that the control points output by the Hull Shader are not used by this stage. That is, it does all of its tessellation work based on the two aforementioned inputs. The control points exist entirely for you, the developer, to build your tessellation algorithm with; the pipeline itself never pays them any attention, which is why there are no restrictions (other than space) on the output from the individual invocations.
The output from the tessellator is a set of weights corresponding to the primitive topology declared in the Hull Shader - line, triangle or quad. Each of these weights gets fed into a separate Domain Shader invocation, discussed next. In addition to these newly created points, which the developer only sees as inputs to the Domain Shading stage, the tessellator also handles the winding and connectivity between domain samples so that they form correct triangles that can later be rasterized.

    The Domain Shaders



Domain Shaders run in isolation, but they can see all of the control points and per-patch constants output by the earlier Hull Shader stage. Simply put, the Domain Shader's job is to take the point on the line/triangle/quad domain provided by the tessellator and use the control mesh provided by the Hull Shader to create a complete new, renderable vertex.
It is now the Domain Shader's responsibility to output a projection-space vertex coordinate to SV_Position (strictly speaking the GS could still do this, but it will typically be less efficient).
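A minimal sketch of a triangle-domain Domain Shader - the linear blend of the three corner control points stands in for the full cubic evaluation, and the structures and view-projection constant are assumptions:

cbuffer cbCamera : register( b0 ) // assumed constant buffer
{
    float4x4 matViewProj;
};

struct HS_OUTPUT { float3 worldPos : WORLD_POSITION; };

struct HS_CONSTANT_OUTPUT
{
    float edges[3] : SV_TessFactor;
    float inside   : SV_InsideTessFactor;
};

struct DS_OUTPUT { float4 position : SV_Position; };

[domain("tri")]
DS_OUTPUT dsMain( HS_CONSTANT_OUTPUT constants,
                  float3 uvw : SV_DomainLocation,
                  const OutputPatch<HS_OUTPUT, 10> patch )
{
    // Barycentric blend of the three corner control points - a stand-in for
    // evaluating the full 10-point cubic surface.
    float3 worldPos = uvw.x * patch[0].worldPos
                    + uvw.y * patch[1].worldPos
                    + uvw.z * patch[2].worldPos;

    DS_OUTPUT output;
    output.position = mul( float4( worldPos, 1.0f ), matViewProj );
    return output;
}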

    The Geometry Shaders



This stage remains unchanged from previous versions of Direct3D and effectively marks the end of any tessellation-related programming. Unless the Domain Shader passed along any of the control mesh information as per-vertex attributes, the Geometry Shader has no knowledge of the tessellation work that preceded it.
    Unlike in Direct3D 10 where the number of GS invocations was directly linked to the parameters of a draw call, the number of executions is now linked to the SV_TessFactor and SV_InsideTessFactor values emitted from the Hull Shader constant function. If these are constant and/or set by the application (e.g. via a constant buffer) then you can derive the number of GS invocations, but if a more intelligent LOD scheme is implemented then the number of invocations will be much more difficult to compute.

    Rasterization and the Pixel Shaders

Tessellation is a geometry-based operation, thus the final rasterization stages remain completely unchanged and oblivious to anything that came before them.

    One Final Note on Efficiency

    The diagram presented here takes the naive approach of executing as many times as there are outputs or inputs. It is expected that hardware can take advantage of commonality (via pre- or post-transform caches for example) and reduce the number of invocations. The two best candidates are the vertex and domain shaders; in this example there are 6 VS invocations and 36 DS invocations yet there are only 4 unique control points and 14 domain points. Specifically, the example used here would do 50% more vertex shading and 157% more domain shading - in a field where performance is crucial it's easy to see why the hardware would want to be cleverer!
    jollyjeffers

    Diagram update

    Another updated diagram preview for you.

    It's now pretty much 'feature complete' and I just need to tidy it up and make it a bit more presentable.

    Question: Can you look at this and get a feel for both the flow of execution/data in the D3D11 pipeline as well as some context on how many times each shader is executed?

    jollyjeffers

    The Direct3D 11 Pipeline

    WORK IN PROGRESS: (but I wanted to share it with you all anyway!)

    jollyjeffers

    Using D3D11 Stream Out for debugging

    Using Direct3D's Stream Out Stage for Debugging

As requested in a previous journal entry I'm writing up my new-found debugging trick. At the beginning of March I had the pleasure of visiting Microsoft's Direct3D team over in Redmond, and during one of the discussions they suggested using SO for debugging. I'd never even contemplated using a Geometry Shader with Stream Out as a log file before!

I've implemented this technique for debugging Direct3D 11 tessellation code and the code that I present below is for D3D11, but implementing it under D3D10 should be pretty straightforward. For this usage of stream output there haven't really been any big changes between 10, 10.1 and 11.

    What do you get back?

Simply put, you can pull back a blob containing a subset of the output from your geometry shader. You can then map this buffer to its equivalent C/C++ struct and read the data that was returned. In a nutshell it allows you to perform 'printf' style debugging of your geometry processing.

Its usefulness is debatable in two regards:
    1. We have PIX for Windows that provides a much more intuitive interface

    2. It can generate a huge amount of data that may not be practical to debug.


    However, in the context of Direct3D 11 where the tools aren't yet mature it is a very useful and very powerful tool.

More generally, you may find it useful because you can output this data mixed in with whatever app-specific data was used to generate the draw call, offering the ability to correlate and cross-reference cause and effect. Given the separation from your application when using PIX this is a definite plus-point.

    Show me the money!

    There are four basic steps:
    1. Modify your shaders
      You need to add and generate any additional fields that you want to output. For example, you might not normally output the world-space position but for this case you may well want to include it.

    2. Change how you generate your GS
      You need to use slightly different API methods for constructing your ID3D11GeometryShader (or ID3D10GeometryShader)

    3. Create a buffer to capture the output
      Quite simple really - you can't output data to nowhere!

    4. Decode the results
      Once you've finished rendering you need to put this new found source of information to good use


    First-up, modify your shaders.

As you should be familiar with, Direct3D uses a "pass forward" mechanism: anything you want at a later stage must be passed through from the earlier stages, and if it isn't there is no way to get it back again. This is the crucial detail - if you want intermediary values from your vertex shader you need to output them and, in the case of D3D11, persist them through the HS/DS stages as well.

    Take the following Direct3D 11 Geometry Shader:
struct DS_OUTPUT
{
    float4 position : SV_Position;
    float3 colour   : COLOUR;
    float3 uvw      : DOMAIN_SHADER_LOCATION;
    float3 wPos     : WORLD_POSITION;
};

[maxvertexcount(3)]
void gsMain( triangle DS_OUTPUT input[3], inout TriangleStream<DS_OUTPUT> TriangleOutputStream )
{
    TriangleOutputStream.Append( input[0] );
    TriangleOutputStream.Append( input[1] );
    TriangleOutputStream.Append( input[2] );
    TriangleOutputStream.RestartStrip();
}


The GS operates entirely as a pass-through, such that utilizing it will not actually change the behaviour of your application. In particular note the DS_OUTPUT struct; in the next step we will choose which elements we want to be made available to the application.

An important point is that your pixel shader need not be changed, provided the order of elements is preserved and its input is still a strict subset of the GS output. In the above example the input I provide to the Pixel Shader only expects the second element - float3 colour : COLOUR - and ignores everything else. Thus a simple design methodology is to just append any new SO-specific fields to the end of each struct that you pass from stage-to-stage.
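For clarity, a sketch of what that cut-down Pixel Shader might look like against the DS_OUTPUT struct above - the function name is hypothetical, and the extra debug fields simply never appear in the PS signature:

// The PS only declares the leading elements it cares about; the
// DOMAIN_SHADER_LOCATION and WORLD_POSITION fields added for stream output
// debugging are ignored entirely.
struct PS_INPUT
{
    float4 position : SV_Position;
    float3 colour   : COLOUR;
};

float4 psMain( PS_INPUT input ) : SV_Target
{
    return float4( input.colour, 1.0f );
}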

    Second-up is to modify how you create your geometry shader.

In either Direct3D 11 or 10 you need to call CreateGeometryShaderWithStreamOutput() instead of CreateGeometryShader(), which is pretty straightforward except that you also need to provide an array of D3D11_SO_DECLARATION_ENTRY or D3D10_SO_DECLARATION_ENTRY structures (depending on which version you use):

    D3D11_SO_DECLARATION_ENTRY soDecl[] = 
    {
    { 0, "COLOUR", 0, 0, 3, 0 }
    , { 0, "DOMAIN_SHADER_LOCATION", 0, 0, 3, 0 }
    , { 0, "WORLD_POSITION", 0, 0, 3, 0 }
    };

    UINT stride = 9 * sizeof(float); // *NOT* sizeof the above array!
    UINT elems = sizeof(soDecl) / sizeof(D3D11_SO_DECLARATION_ENTRY);


    There are three things to pay attention to:
    1. The semantic name: this must match one of the entries in your HLSL shader, note that this declaration outputs the last 3 of the 4 declared in the previous snippet.

2. The start component and component count fields: these are the 4th and 5th entries, all 0,3 in this case. For vector types, such as float3 in the above HLSL, you state which component to start reading from (0=x, 1=y, 2=z, 3=w) and then how many components to read. For a float3 a 0,3 declaration means ALL components - but if it were a float4 then the 'w' component would not be streamed out.

    3. The element stride: The call to CreateGeometryShaderWithStreamOutput() needs to know the stride of a streamed out structure. Not exactly hard to compute, but easy to mistake it for the size of the soDecl array!


The only difference for D3D10 is that D3D11 introduces the concept of a 'stream' field; we're not using it here, so for D3D10 you'd simply drop the first integer on each row of the above.

    Thirdly, you need to create a buffer to write the output to.

This is pretty much identical to how you create vertex and index buffers for rendering, with two caveats: you need two buffers - one GPU-writeable and one CPU-readable - and you don't provide any initial data prior to rendering like you would for a VB or IB.

D3D11_BUFFER_DESC soDesc;

soDesc.BindFlags           = D3D11_BIND_STREAM_OUTPUT;
soDesc.ByteWidth           = 10 * 1024 * 1024; // 10mb
soDesc.CPUAccessFlags      = 0;
soDesc.Usage               = D3D11_USAGE_DEFAULT;
soDesc.MiscFlags           = 0;
soDesc.StructureByteStride = 0;

if( FAILED( hr = g_pd3dDevice->CreateBuffer( &soDesc, NULL, &g_pStreamOutBuffer ) ) )
{
    /* handle the error here */
    return hr;
}

// Simply re-use the above struct
soDesc.BindFlags      = 0;
soDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
soDesc.Usage          = D3D11_USAGE_STAGING;

if( FAILED( hr = g_pd3dDevice->CreateBuffer( &soDesc, NULL, &g_pStagingStreamOutBuffer ) ) )
{
    /* handle the error here */
    return hr;
}


You can't call Map() on a D3D11_USAGE_DEFAULT resource (or the v10 equivalent) and you can't bind a CPU-readable staging resource as a pipeline output, so you create one of each and later copy from the GPU-writeable buffer to the CPU-readable one in order to get at the data.

    With the two buffers created the final step is to make sure you bind the output:

    UINT offset = 0;
    g_pContext->SOSetTargets( 1, &g_pStreamOutBuffer, &offset );


In D3D10 it'll be on a device rather than a context, but that's a trivial difference.

    Finally you'll obviously want to read back the results!


g_pContext->CopyResource( g_pStagingStreamOutBuffer, g_pStreamOutBuffer );

D3D11_MAPPED_SUBRESOURCE data;
if( SUCCEEDED( g_pContext->Map( g_pStagingStreamOutBuffer, 0, D3D11_MAP_READ, 0, &data ) ) )
{
    struct GS_OUTPUT
    {
        D3DXVECTOR3 COLOUR;
        D3DXVECTOR3 DOMAIN_SHADER_LOCATION;
        D3DXVECTOR3 WORLD_POSITION;
    };

    GS_OUTPUT *pRaw = reinterpret_cast< GS_OUTPUT* >( data.pData );

    /* Work with the pRaw[] array here */
    // Consider StringCchPrintf() and OutputDebugString() as simple ways of printing the above struct, or use the debugger and step through.

    g_pContext->Unmap( g_pStagingStreamOutBuffer, 0 );
}


    All of the above is executed after you've issued the draw call. The first line handles the previously mentioned case where GPU-writeable and CPU-readable can't be the same resource.

    You need to be a bit careful with the struct you cast the pointer to - if you've any fancy alignment or padding in your application the C/C++ struct may not exactly match the binary representation in the buffer!

    How much data is returned?

    This is a subtle but rather important question!

In a conventional pipeline where you only use a pass-through GS you can work out the amount of data in your SO buffer from the draw call parameters alone, but if you use the GS to amplify or clip data, or use the tessellation features in D3D11, it becomes non-trivial.

    When you're not entirely sure how many invocations there might be you need to resort to queries.

    The exact mechanism varies between the API's but the basic idea is to start a query before the draw call, end it immediately after and then grab the result(s) which will tell you how much data to expect.

    In my code I use the D3D11_QUERY_PIPELINE_STATISTICS query, which returns the D3D11_QUERY_DATA_PIPELINE_STATISTICS::GSPrimitives field. Similar should work with Direct3D 10.

// When initializing/loading
D3D11_QUERY_DESC queryDesc;
queryDesc.Query     = D3D11_QUERY_PIPELINE_STATISTICS;
queryDesc.MiscFlags = 0;
if( FAILED( hr = g_pd3dDevice->CreateQuery( &queryDesc, &g_pDeviceStats ) ) )
{
    return hr;
}

// When rendering
g_pContext->Begin( g_pDeviceStats );

g_pContext->DrawIndexed( 3, 0, 0 ); // one triangle only

g_pContext->End( g_pDeviceStats );

// Spin until the GPU has finished and the query results are available
D3D11_QUERY_DATA_PIPELINE_STATISTICS stats;
while( S_OK != g_pContext->GetData( g_pDeviceStats, &stats, g_pDeviceStats->GetDataSize(), 0 ) );


Alternatively there are the D3D11_QUERY_SO_STATISTICS and D3D11_QUERY_SO_OVERFLOW_PREDICATE queries (replace '11' with '10' if desired) which will return similar information. More usefully they will also tell you if data was truncated due to overflowing the buffer - important for 'proper' use of SO, but when debugging you're unlikely to be generating so much output that overflow becomes a problem.

    Any limitations?

    Sadly, yes!

  • The performance of this solution is not going to be great. But as it's a debugging feature this shouldn't really be a major concern - you're not going to want something like this in your production code! In particular, the use of a query forces lock-step between the CPU and GPU, the copy operation may be slow(ish) and the extra data the GPU has to pass around is going to have some impact (e.g. register pressure decreasing the number of in-flight threads).

  • There is no pixel shader output as this is a debugging trick for your geometry processing only. If you want to debug a pixel shader, consider using MRT techniques to output intermediaries...

  • It's an intrusive technique - you need to modify your codebase and your shaders in order to expose this information. For that reason alone it may well be more hassle than it's worth.

  • You may have noticed that the GS never explicitly pipes data to an SO output. Instead we just peek into and pull bits out of the stream that leaves the GS on its way to the rasterizer. Thus you're forced to output at the same frequency as normal rendering and you'd find it hard to implement arbitrary debug output using this technique.

  • There are storage constraints when using SO, but in general these won't be an issue for debugging tasks. E.g. in D3D10 the output structure is limited to 64 or fewer scalars (i.e. members of the output struct), or a maximum of 2kb if using vectorised types (a 32-matrix palette for vertex skinning, for example).


    That's all folks!
jollyjeffers

    More pictures

    So I tweaked the output from my code:






    On the left is the solid shading and on the right is the same frame but in wireframe.

    I set up the colours as follows:
    • WHITE for the original control points
    • BLUE for the first edge
    • GREEN for the second edge
    • RED for the third edge
    • BLACK for the middle points


In theory I should be getting white corners with a black spot/area in the middle. The edges should fade from their respective colours to black towards the middle and to white at either end.

    In particular, the distribution of colour should be equal. As should the shape - a coplanar triangle.

As I'm sure you can easily see, it's not quite right [sad]

    I also experimented a bit further with my Stream-Out debugging feature, but I'll leave that to the weekend to explain in more detail...
    jollyjeffers

    Note to self - don't be stupid

    If your Domain Shader expects a constant buffer and you don't give it one, you might - surprise frickin' surprise - not see anything rendered on screen.

    [headshake][headshake][headshake]
    [headshake][embarrass][headshake]
    [headshake][headshake][headshake]


    So I was running around plugging in some SO logging and when it came to bind my SO output I scratched my head wondering why I had a //TODO: Set constant buffer here comment in my code.

    Someone, please bang my head against the nearest brick wall.




    SV_TessFactor == 1.0f



    Pipeline Stats:
    IA: Formed 2 primitives from 6 vertices.
    VS: Shaded 6 vertices.
    HS: Invoked the hull shader on 2 patches.
    DS: 6 new vertices generated.
    GS: Processed 2 triangles.
    PS: 2 triangles sent, 2 rasterized and 7568 pixels rendered.
    [vert 1, tri 1 of 2] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 1 of 2] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 1 of 2] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 2 of 2] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 2 of 2] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 2 of 2] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    LOG: Finished rendering in 250.57ms, saved as Output\Frame 0001.png in 18.95ms. TOTAL TIME 269.52ms (~3.71hz) [Render(...) @ line 951]


    SV_TessFactor == 3.0f



    Pipeline Stats:
    IA: Formed 2 primitives from 6 vertices.
    VS: Shaded 6 vertices.
    HS: Invoked the hull shader on 2 patches.
    DS: 28 new vertices generated.
    GS: Processed 18 triangles.
    PS: 18 triangles sent, 18 rasterized and 27480 pixels rendered.
    [vert 1, tri 1 of 18] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 1 of 18] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 1 of 18] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 2 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 2 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 2 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 3 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 3 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 3 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 4 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 4 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 4 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 5 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 5 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 5 of 18] SV_Position={1.498, -0.411, 3.130, 3.224}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 6 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 6 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 6 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 7 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 7 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 7 of 18] SV_Position={0.437, -0.744, 2.934, 3.028}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 8 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 8 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 8 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 9 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 9 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 9 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 10 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 10 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 10 of 18] SV_Position={-0.101, -1.970, 2.215, 2.310}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 11 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 11 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 11 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 12 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 12 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 12 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 13 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 13 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 13 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 14 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 14 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 14 of 18] SV_Position={-0.571, -1.882, 2.266, 2.361}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 15 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 15 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 15 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 16 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 16 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 16 of 18] SV_Position={-1.565, -0.902, 2.841, 2.935}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 17 of 18] SV_Position={-2.559, 0.077, 3.416, 3.509}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 17 of 18] SV_Position={-2.559, 0.077, 3.416, 3.509}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 17 of 18] SV_Position={-2.559, 0.077, 3.416, 3.509}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 18 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 18 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 18 of 18] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    LOG: Finished rendering in 275.59ms, saved as Output\Frame 0001.png in 20.68ms. TOTAL TIME 296.27ms (~3.38hz) [Render(...) @ line 951]


    SV_TessFactor == 10.0f



    Pipeline Stats:
    IA: Formed 2 primitives from 6 vertices.
    VS: Shaded 6 vertices.
    HS: Invoked the hull shader on 2 patches.
    DS: 64 new vertices generated.
    GS: Processed 60 triangles.
    PS: 60 triangles sent, 60 rasterized and 70276 pixels rendered.
    [vert 1, tri 1 of 60] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 1 of 60] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 1 of 60] SV_Position={2.559, -0.077, 3.325, 3.419}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 2 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 2 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 2 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 3 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 3 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 3 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 4 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 4 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 4 of 60] SV_Position={2.272, -0.230, 3.236, 3.329}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 5 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 5 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 5 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 6 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 6 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 6 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 7 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 7 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 7 of 60] SV_Position={1.951, -0.327, 3.179, 3.272}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 8 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 8 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 8 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 9 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 9 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 9 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 10 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 10 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 10 of 60] SV_Position={1.612, -0.392, 3.141, 3.234}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 11 of 60] SV_Position={1.269, -0.449, 3.107, 3.201}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 11 of 60] SV_Position={1.269, -0.449, 3.107, 3.201}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 11 of 60] SV_Position={1.269, -0.449, 3.107, 3.201}, COLOUR={1.00, 1.00, 1.00}
    [vert 1, tri 12 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 2, tri 12 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
    [vert 3, tri 12 of 60] SV_Position={-0.034, -0.657, 2.985, 3.079}, COLOUR={1.00, 1.00, 1.00}
[vert 1, tri 13 of 60] SV_Position={1.269, -0.449, 3.107, 3....