
[DX9] Multi-stream rendering performance



Hi,
I'm implementing geomipmapping and need to use two vertex buffers in order to reduce memory consumption. One VB would contain the (x, z) data for a single terrain block, while the second would contain only the height value for each vertex. That way I can reuse the first VB for every terrain chunk and change only the second one.


const D3DVERTEXELEMENT9 Decl[4] =
{
    { 0, 0, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 8, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    { 1, 0, D3DDECLTYPE_FLOAT1, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 },
    D3DDECL_END()
};

...

g_pd3dDevice->CreateVertexDeclaration(Decl, &g_pVertDecl);

...

g_pd3dDevice->SetStreamSource(0, g_pHeightMapVB, 0, sizeof(DATAVERTEX));
g_pd3dDevice->SetStreamSource(1, g_pHeightMapValuesVB, 0, sizeof(HEIGHTVERTEX));

g_pd3dDevice->SetIndices(g_pHeightMapIB);
g_pd3dDevice->SetVertexDeclaration(g_pVertDecl);

g_pd3dDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0, 0, numVertices, 0, numFaces);





When I render a single 256x256 terrain block with two VBs as in the code above, the framerate is ~35 fps, but when I render the same block with all the data (x, y, z) in one VB, it is ~60 fps. What causes such a slowdown? Is multi-stream rendering inherently inefficient?

Thanks for help.


What graphics card do you have?
Can you show the code you use to create the buffers?
Do you create your device with hardware vertex processing?

Also, disable VSync if you haven't already (use D3DPRESENT_INTERVAL_IMMEDIATE for the presentation interval when creating your device). Your baseline of exactly 60 FPS sounds like VSync, which makes it hard to say how big the actual performance difference is.
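For reference, a minimal sketch of the relevant field (the rest of the D3DPRESENT_PARAMETERS setup is omitted):

D3DPRESENT_PARAMETERS d3dpp;
ZeroMemory(&d3dpp, sizeof(d3dpp));
// Present immediately instead of waiting for the vertical retrace.
d3dpp.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;
// ... fill in the remaining members and pass &d3dpp to CreateDevice.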

Quote:
What graphics card do you have?

ATI Radeon X1550
Quote:
Can you show the code you use to create the buffers?


struct DATAVERTEX
{
    D3DXVECTOR2 Position;
    D3DXVECTOR2 TexCoord;
};

struct HEIGHTVERTEX
{
    float height;
};

...

g_pd3dDevice->CreateVertexBuffer( HMsize*HMsize*sizeof(DATAVERTEX), 0, 0, D3DPOOL_DEFAULT, &g_pHeightMapVB, NULL );
g_pd3dDevice->CreateVertexBuffer( HMsize*HMsize*sizeof(HEIGHTVERTEX), 0, 0, D3DPOOL_DEFAULT, &g_pHeightMapValuesVB, NULL);


Quote:
Do you create your device with hardware vertex processing?

Yes, I use hardware vertex processing.

VSync is disabled. I use D3DPRESENT_INTERVAL_IMMEDIATE.

Drawing a 256x256 heightfield at 60 FPS gives you about 8 million triangles per second, which seems very low. Try running the OptimizedMesh sample that comes with the SDK, in /Samples/C++/Direct3D/Bin/OptimizedMesh.exe, to make sure you get performance appropriate for your graphics card; I'm not sure how much it should be able to handle. (Select 36 meshes, since that sample uses VSync and otherwise you will get a very low number.)
Also, make sure you have the latest DX SDK installed, and the latest drivers from AMD.com.

As for the difference in performance with two vertex buffers, it could be that the GPU doesn't like reading from two buffers at the same time, especially when one of them holds only a single float per vertex. As a test, try padding that structure to a size of 4 floats, to see whether the alignment helps or reduces performance further.
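A minimal sketch of that padding test. The declaration can keep D3DDECLTYPE_FLOAT1, since the spacing between vertices comes from the stride passed to SetStreamSource, not from the declaration; the buffer just has to be created and filled with the widened struct:

struct HEIGHTVERTEX
{
    float height;
    float pad[3]; // unused, only widens the vertex from 4 to 16 bytes
};

...

// The stride is now 16 bytes instead of 4.
g_pd3dDevice->SetStreamSource(1, g_pHeightMapValuesVB, 0, sizeof(HEIGHTVERTEX));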

Don't you also need a normal vector in the second stream's vertex buffer? Anyway, did you create the second vertex buffer with D3DUSAGE_DYNAMIC, since you are potentially locking and uploading new height data every frame? I am doing exactly the same thing as you (but with a normal vector in the second stream) and my performance did not decrease.
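In case it helps, a sketch of what I mean, assuming the heights really are re-uploaded per frame ('heights' is just a hypothetical source array; note that D3DUSAGE_DYNAMIC requires D3DPOOL_DEFAULT):

g_pd3dDevice->CreateVertexBuffer(HMsize*HMsize*sizeof(HEIGHTVERTEX),
    D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY, 0, D3DPOOL_DEFAULT,
    &g_pHeightMapValuesVB, NULL);

// When refilling, discard the old contents so the GPU doesn't stall
// waiting for the buffer to become free.
void* pData = NULL;
if (SUCCEEDED(g_pHeightMapValuesVB->Lock(0, 0, &pData, D3DLOCK_DISCARD)))
{
    memcpy(pData, heights, HMsize*HMsize*sizeof(HEIGHTVERTEX));
    g_pHeightMapValuesVB->Unlock();
}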

I get about 100 million tris per second when running the OptimizedMesh sample with 36 meshes and VSync disabled.

I'm also going to add a normal vector to the second vertex buffer. I don't think the second VB has to be dynamic, though, since I only change the indices.

What do you mean by changing the indices?
If your heights VB is larger than the X/Z VB, you will run into problems with the indices, since the same indices are used to address both vertex streams. However, you could perhaps use the OffsetInBytes parameter of SetStreamSource.
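Something like this sketch, where all chunk heights live in one large VB and the same indices then address both streams (chunkIndex and verticesPerChunk are hypothetical names; note that a nonzero stream offset requires the D3DDEVCAPS2_STREAMOFFSET cap):

// Point stream 1 at the current chunk's slice of the shared heights VB.
UINT offsetInBytes = chunkIndex * verticesPerChunk * sizeof(HEIGHTVERTEX);
g_pd3dDevice->SetStreamSource(1, g_pAllHeightsVB, offsetInBytes, sizeof(HEIGHTVERTEX));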

Here is the corrected code for creating the vertex buffers and the declaration:

struct DATAVERTEX
{
    D3DXVECTOR4 PosTex; // (x, y) - position, (z, w) - texture coordinates
};

struct HEIGHTVERTEX
{
    D3DXVECTOR4 HeightNormal; // (x, y, z) - normal, (w) - height
};

const D3DVERTEXELEMENT9 Decl[3] =
{
    { 0, 0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 1, 0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};

...

g_pd3dDevice->CreateVertexDeclaration(Decl, &g_pVertDecl);

...

g_pd3dDevice->CreateVertexBuffer( HMsize*HMsize*sizeof(HEIGHTVERTEX), 0, 0, D3DPOOL_DEFAULT, &g_pHeightMapValuesVB, NULL);

g_pd3dDevice->CreateVertexBuffer( HMsize*HMsize*sizeof(DATAVERTEX), 0, 0, D3DPOOL_DEFAULT, &g_pHeightMapVB, NULL );






My render function:

VOID Render()
{
    g_pd3dDevice->BeginScene();

    g_pd3dDevice->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, D3DCOLOR_XRGB(0,0,0), 1.0f, 0);

    g_pEffectHM->SetTechnique("Heightmap");

    // Use named variables here: passing the address of a temporary,
    // as in &(view * proj), is non-standard C++.
    D3DXMATRIX viewProj = Camera.GetMatrices().GetView() * Camera.GetMatrices().GetProj();
    g_pEffectHM->SetMatrix("WorldViewProjMatrix", &viewProj);
    D3DXVECTOR4 color(0.5f, 1.0f, 0.2f, 1.0f);
    g_pEffectHM->SetVector("Color", &color);

    UINT passes;
    g_pEffectHM->Begin(&passes, 0);
    for (UINT p = 0; p < passes; ++p)
    {
        g_pEffectHM->BeginPass(p);

        g_pd3dDevice->SetVertexDeclaration(g_pVertDecl);
        g_pd3dDevice->SetStreamSource(0, g_pHeightMapVB, 0, sizeof(DATAVERTEX));
        g_pd3dDevice->SetStreamSource(1, g_pHeightMapValuesVB, 0, sizeof(HEIGHTVERTEX));
        g_pd3dDevice->SetIndices(g_pHeightMapIB);

        g_pd3dDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, HMsize*HMsize, 0, (HMsize-1)*(HMsize-1)*2);

        g_pEffectHM->EndPass();
    }
    g_pEffectHM->End();

    g_pd3dDevice->EndScene();

    g_pd3dDevice->Present(NULL, NULL, NULL, NULL);
}






And the shaders (no lighting yet):

void VS_HM( float4 vPosTex : POSITION,
            float4 vHeightNormal : TEXCOORD0,
            out float4 oPos : POSITION,
            out float2 oTex : TEXCOORD0 )
{
    float height = vHeightNormal.w;
    float4 Pos = float4(vPosTex.x, height * 30, vPosTex.y, 1);
    oPos = mul(Pos, WorldViewProjMatrix);
    oTex = vPosTex.zw;
}

float4 PS_HM( float2 vTex : TEXCOORD0 ) : COLOR0
{
    return tex2D(texSampler, vTex);
}

technique Heightmap
{
    pass P0
    {
        VertexShader = compile vs_3_0 VS_HM();
        PixelShader  = compile ps_3_0 PS_HM();
    }
}






I create the device like this:

HRESULT InitD3D(HWND hWnd)
{
    if ((g_pD3D = Direct3DCreate9(D3D_SDK_VERSION)) == NULL) return E_FAIL;

    D3DDISPLAYMODE d3ddm;
    if (FAILED(g_pD3D->GetAdapterDisplayMode(D3DADAPTER_DEFAULT, &d3ddm))) return E_FAIL;

    D3DPRESENT_PARAMETERS d3dpp;
    ZeroMemory(&d3dpp, sizeof(d3dpp));

    d3dpp.BackBufferWidth            = d3ddm.Width;
    d3dpp.BackBufferHeight           = d3ddm.Height;
    d3dpp.BackBufferFormat           = d3ddm.Format;
    d3dpp.BackBufferCount            = 1;
    d3dpp.MultiSampleType            = D3DMULTISAMPLE_NONE;
    d3dpp.SwapEffect                 = D3DSWAPEFFECT_DISCARD;
    d3dpp.hDeviceWindow              = hWnd;
    d3dpp.Windowed                   = false;
    d3dpp.EnableAutoDepthStencil     = true;
    d3dpp.AutoDepthStencilFormat     = D3DFMT_D16;
    d3dpp.FullScreen_RefreshRateInHz = D3DPRESENT_RATE_DEFAULT;
    d3dpp.PresentationInterval       = D3DPRESENT_INTERVAL_IMMEDIATE;
    d3dpp.Flags                     |= D3DPRESENTFLAG_LOCKABLE_BACKBUFFER;

    if (FAILED(g_pD3D->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                                    D3DCREATE_HARDWARE_VERTEXPROCESSING, &d3dpp, &g_pd3dDevice)))
        return E_FAIL;
    return S_OK;
}






Now I get ~145 fps when rendering the 256x256 terrain block. That's 255*255*2 = 130050 triangles per frame, so 130050 * 145 = ~19M tris/sec.
When I put everything into one mesh (LPD3DXMESH) I get ~361 fps, which gives 130050 * 361 = ~47M tris/sec.
The same mesh without optimization: ~355 fps, ~46M tris/sec.
When I render the mesh 20 times in a loop each frame I get:
- optimized mesh: ~38 fps, which gives 20 * 130050 * 38 = ~99M tris/sec;
- unoptimized mesh: ~34 fps, which gives 20 * 130050 * 34 = ~88M tris/sec.
I also tried creating the buffers with D3DPOOL_MANAGED instead of D3DPOOL_DEFAULT and got ~175 fps, which gives 130050 * 175 = ~23M tris/sec.
Why is the mesh, even unoptimized, so much faster than my vertex buffers? Maybe there's something wrong with my implementation?

The 145 FPS, is that with two vertex buffers?
I guess the mesh only uses one, so perhaps that's the same problem you're seeing again. Interleaved buffers will probably always be faster, but I'm a bit surprised by the margin. When using only one buffer, is the vertex exactly 8*sizeof(float) in size, instead of the two 4*sizeof(float) vertices?
If so, then perhaps your card simply doesn't play that well with multiple vertex streams.

Quote:
The 145 FPS, is that with two vertex buffers?

Yes, it is. As I said before, when I create these buffers with D3DPOOL_MANAGED it's about 175 fps.
I've just tested rendering with one vertex buffer, and the result is quite interesting. The vertex declaration for one VB is very similar to the declaration for two VBs:

const D3DVERTEXELEMENT9 Decl2[3] = // vertex declaration for 2 VBs
{
    { 0, 0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 1, 0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};

const D3DVERTEXELEMENT9 Decl1[3] = // vertex declaration for 1 VB
{
    { 0, 0,  D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 16, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};



When I render the terrain block with one VB I get 60 fps, which is very low compared with two VBs. But when I create this VB with D3DPOOL_MANAGED, the framerate is 359 fps! That's as good as the optimized mesh. But still, two VBs with D3DPOOL_MANAGED give only 175 fps.

It seems that you get pretty much exactly a 100% increase in frame time when reading from two streams instead of one. It might be that you can't expect anything different from your card.
I just tried a similar simple test; I have a GeForce 8800 GT, and I got an increase of about 40% with two buffers instead of one (same declarations as yours).
It will depend entirely on your graphics hardware, but you can probably expect a performance decrease when using two buffers. Perhaps newer generations of cards are much better at this, as it seems like a cache or memory access issue.

These tests measure strictly the vertex rate, however, which might not be that big a deal in the end; though in your case, where it halves the number of vertices you can draw in a given time, it does seem like a high cost.

Quote:
Use the D3DUSAGE_WRITEONLY flag if you create the buffers with D3DPOOL_DEFAULT.

With D3DUSAGE_WRITEONLY and D3DPOOL_DEFAULT the framerate is ~175 fps, so it's the same as with usage = 0 and D3DPOOL_MANAGED.
Should I create the buffers with D3DPOOL_DEFAULT or D3DPOOL_MANAGED? What's the difference?

Quote:
These tests measure strictly the vertex rate, however, which might not be that big a deal in the end; though in your case, where it halves the number of vertices you can draw in a given time, it does seem like a high cost.

My tests also measure only the vertex rate, because I point the camera away from the object, so the pixel shader does nothing. As for the number of vertices I can draw: right now it's only a single block, but since I'm implementing geomipmapping I have to render more blocks, even with frustum culling. Of course, those blocks can be smaller than 256x256, but together they can reduce the framerate just as much.

Quote:
Original post by miloszmaki
With D3DUSAGE_WRITEONLY and D3DPOOL_DEFAULT the framerate is ~175 fps, so it's the same as with usage = 0 and D3DPOOL_MANAGED.
Should I create the buffers with D3DPOOL_DEFAULT or D3DPOOL_MANAGED? What's the difference?


D3DPOOL_MANAGED resources are managed by D3D: a copy of the data is kept in system memory, and D3D handles transferring it to the graphics card, so the internal buffer in video memory probably uses D3DUSAGE_WRITEONLY automatically. You can read more in the D3DPOOL Enumeration documentation.
It is easier to create your buffers in the managed pool, because otherwise you have to recreate them whenever your device is lost or reset (alt-tabbing etc.). See Lost Devices (Direct3D 9) for more information on that.
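A rough sketch of the extra work D3DPOOL_DEFAULT forces on you (ReleaseDefaultPoolResources and RecreateDefaultPoolResources are hypothetical helpers that destroy and rebuild your default-pool buffers; d3dpp is the same present parameters used at creation):

// Call once per frame before rendering.
HRESULT hr = g_pd3dDevice->TestCooperativeLevel();
if (hr == D3DERR_DEVICELOST)
{
    Sleep(50); // device lost and not yet resettable; try again later
}
else if (hr == D3DERR_DEVICENOTRESET)
{
    ReleaseDefaultPoolResources(); // free everything in D3DPOOL_DEFAULT
    if (SUCCEEDED(g_pd3dDevice->Reset(&d3dpp)))
        RecreateDefaultPoolResources(); // recreate and refill the buffers
}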

