Dynamic Textures for Skinning

6 comments, last by Hodgman 8 years, 2 months ago

Hey there,

I'm using dynamic textures with format D3DFMT_A32B32G32R32F to hold the bone matrix palette. In the vertex shader I read it with tex2Dlod. However, some cards don't support this format. Before I try it, I was wondering if any of you know whether D3DFMT_A8R8G8B8 could also hold matrices.

Thanks for your time


That texture format works with vertex shaders on GPUs with feature level 9_3 and higher.

i.e. every D3D10/11 GPU, and some D3D9-era GPUs (everything released during the past 10 years).

You can use D3DFMT_A8R8G8B8 as long as your matrices only contain values from zero to one...
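To make the zero-to-one restriction concrete, here is a minimal sketch of what an 8-bit unsigned-normalized channel does to a float. The helper names `encodeUnorm8` and `decodeUnorm8` are made up for illustration and are not part of D3D; the point is only that anything outside [0, 1] (a bone translation of 12.5, or a negative rotation entry) gets clamped, and everything else is rounded to one of 256 steps.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper: store a float the way one 8-bit UNORM channel would.
// Values outside [0, 1] are clamped, the rest are rounded to 1/255 steps.
inline std::uint8_t encodeUnorm8(float v) {
    const float clamped = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// Hypothetical helper: what the shader would read back from that channel.
inline float decodeUnorm8(std::uint8_t c) {
    return static_cast<float>(c) / 255.0f;
}
```

So a matrix would only survive the round trip if you first remapped every entry into [0, 1] and undid that in the shader, and even then you'd be left with 8 bits of precision per entry.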

However... feature level 9_x does not support fetching from textures inside a vertex shader whatsoever, so the format of your texture is irrelevant :)

If you want to support those decade-old cards, you'll have to use a cbuffer to store your matrix palette.

Ah, if tex2Dlod won't work then trying D3DFMT_A8R8G8B8 is useless.

I don't know about feature levels because I'm solely using Direct3D 9.0c. I'm just trying to make all things work on that only.

I changed to texture buffers because 'cbuffers' on vs_3_0 only support fewer than 64 matrices, and that's not enough. Quite annoying having to rethink it. I'll see if I can use cbuffers.

I was thinking of doing the matrix calculations on the CPU if D3DFMT_A32B32G32R32F is not supported. Thing is, I've designed my code so that the positions, normals, tex coords etc. live in different vertex buffers bound to different streams. Is there any way I can transform those vertices inside a static vertex buffer on the CPU before it goes to the GPU without resorting to 'DrawPrimitiveUP' functions? That would be the simplest solution.

I don't know about feature levels because I'm solely using Direct3D 9.0c. I'm just trying to make all things work on that only.

Ah ok, I assumed D3D11, sorry.

Yeah, on D3D9 you can query whether different formats support vertex texture fetch. In my experience, the only formats that ever worked were D3DFMT_A32B32G32R32F and D3DFMT_A16B16G16R16F on some other cards. Modern (DX10+) GPUs will likely return true for D3DFMT_A8R8G8B8, but these ones will support 32F as well.

However, vertex texture fetch basically only works on DX10+ GPUs, or nVidia GPUs that support shader model 3.
Instead of supporting VTF, AMD chose to implement "render to vertex buffer"... but R2VB is now a completely dead technique that doesn't even work on modern GPUs.
At the time, a comprehensive D3D9 engine would use VTF on GeForce 7/8, R2VB on Radeon X1000, and otherwise use constant registers... What a pain!

If you want to limit yourself to D3D10+ capable GPUs, then I'd just use VTF with the 32F format, otherwise, I'd just use constant registers.
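The decision above can be sketched as a small selection function. Everything here is hypothetical glue code: `SkinningCaps`, `SkinningPath` and `chooseSkinningPath` are illustrative names, and I'm assuming the application fills the caps in itself, for example from an `IDirect3D9::CheckDeviceFormat` query with `D3DUSAGE_QUERY_VERTEXTEXTURE` and from `D3DCAPS9::MaxVertexShaderConst`. The CPU fallback is the one discussed further down the thread.

```cpp
// Hypothetical description of what the device reported at startup.
struct SkinningCaps {
    bool vtfFloat32Supported;   // assumed: result of a vertex-texture format query
    int  maxVertexShaderConst;  // assumed: number of float4 constant registers
};

enum class SkinningPath { VertexTextureFetch, ConstantRegisters, CpuSkinning };

// Pick VTF when the 32F format is usable, otherwise constant registers if the
// palette fits (3 registers per bone, i.e. 4x3 matrices), otherwise the CPU.
inline SkinningPath chooseSkinningPath(const SkinningCaps& caps,
                                       int boneCount,
                                       int reservedRegisters) {
    if (caps.vtfFloat32Supported) {
        return SkinningPath::VertexTextureFetch;
    }
    const int registersPerBone = 3;
    const int available = caps.maxVertexShaderConst - reservedRegisters;
    if (available >= boneCount * registersPerBone) {
        return SkinningPath::ConstantRegisters;
    }
    return SkinningPath::CpuSkinning;
}
```

`reservedRegisters` stands for whatever the shader needs besides bones (view/projection matrices, lights and so on); the right number depends on your shaders.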

FWIW, it's very common for D3D9 games that are still being released to require a D3D10-capable GPU ;)

vs_3_0 only support fewer than 64 matrices

One row/column of your bone matrices should be (0,0,0,1), so you can hard-code that row/column and store each matrix in 3 registers rather than 4. There's also a Crytek technique where you convert each matrix into a dual-quaternion, which can be stored in 2 registers.
You can also just split your model into several sub-meshes, so that each sub-mesh only uses <64 bones, even if the model has hundreds.
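As a rough illustration of the 3-register idea, here's a sketch that drops the constant (0,0,0,1) row before upload and works out how many bones fit. It assumes row-major matrices with the translation in the last column (so the bottom row is the constant one), and it assumes 256 float4 constant registers for vs_3_0; `Float4`, `Matrix4x4`, `packBones4x3` and `maxBones` are names invented for the example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Float4 = std::array<float, 4>;
using Matrix4x4 = std::array<Float4, 4>;  // row-major: m[row][col]

// Keep rows 0..2 of each bone; row 3 is assumed to be (0,0,0,1) and is
// hard-coded in the shader instead of being uploaded.
inline void packBones4x3(const std::vector<Matrix4x4>& bones,
                         std::vector<Float4>& registers) {
    registers.clear();
    registers.reserve(bones.size() * 3);
    for (const Matrix4x4& m : bones) {
        registers.push_back(m[0]);
        registers.push_back(m[1]);
        registers.push_back(m[2]);
    }
}

// How many bones fit once some registers are set aside for other constants.
inline int maxBones(int constantRegisters, int reservedRegisters, int registersPerBone) {
    return (constantRegisters - reservedRegisters) / registersPerBone;
}
```

With 256 registers and nothing reserved, that's 64 bones at 4 registers each, 85 at 3, and 128 at 2 (the dual-quaternion case). Reserve 16 registers for everything else and the 4x3 layout lands at 80.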

Is there any way I can transform those vertices inside a static vertex buffer on the CPU before it goes to the GPU without resorting to 'DrawPrimitiveUP' functions? That would be the simplest solution.

Yes, use a dynamic vertex buffer. See: https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf
We even did that on a D3D11-era game released last year as we were GPU-bound and had some very well optimized CPU-skinning code.

Ah right, yeah, most players have cards where VTF works fine. It's just the few that don't that I still want to support (i.e. some Windows XP users). I noticed that even some people on Windows 7 have cards that don't support D3DFMT_A32B32G32R32F. They have integrated Intel chips. Kind of strange because I always thought Windows 7 required DX11 cards.

I'd prefer not to split the mesh into sub-meshes by bone; it seems like a lot of work, and I already split it per material. So I'm thinking about the other options.

I looked up dual quaternions, but converting to them at this point seems like a lot of work (not to mention any unexpected dreadful artifacts that might appear). Storing 4x3 matrices gets me to about 80 bones, which is still not enough.

It looks like one static buffer with base vertex data, and one dynamic vertex buffer for rendering after transforming the vertices with the bone matrices on the CPU might just be the best idea! And here I was thinking about going back to DrawPrimitiveUP building big buffers on the CPU.. good I started this topic.
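For what it's worth, the CPU side of that plan could look roughly like the sketch below. It assumes rigid skinning (one bone index per vertex, stored as a float, no weights) and separate position/normal arrays, as described elsewhere in this thread. `Float3`, `Bone4x3` and `skinRigid` are invented names, and the destination pointers stand in for whatever memory you get back from locking the dynamic vertex buffers; no D3D calls are shown.

```cpp
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// 4x3 bone matrix: three rows of (rotation | translation), implied last row (0,0,0,1).
struct Bone4x3 { float m[3][4]; };

inline Float3 transformPoint(const Bone4x3& b, const Float3& p) {
    return Float3{
        b.m[0][0] * p.x + b.m[0][1] * p.y + b.m[0][2] * p.z + b.m[0][3],
        b.m[1][0] * p.x + b.m[1][1] * p.y + b.m[1][2] * p.z + b.m[1][3],
        b.m[2][0] * p.x + b.m[2][1] * p.y + b.m[2][2] * p.z + b.m[2][3]};
}

// Directions ignore translation. Using the upper 3x3 directly is only valid
// when the bone has no non-uniform scale.
inline Float3 transformDirection(const Bone4x3& b, const Float3& d) {
    return Float3{
        b.m[0][0] * d.x + b.m[0][1] * d.y + b.m[0][2] * d.z,
        b.m[1][0] * d.x + b.m[1][1] * d.y + b.m[1][2] * d.z,
        b.m[2][0] * d.x + b.m[2][1] * d.y + b.m[2][2] * d.z};
}

// Reads the static source data and writes skinned results to the destination
// pointers, which must have room for srcPositions.size() elements each.
inline void skinRigid(const std::vector<Float3>& srcPositions,
                      const std::vector<Float3>& srcNormals,
                      const std::vector<float>& boneIndices,
                      const std::vector<Bone4x3>& bones,
                      Float3* dstPositions,
                      Float3* dstNormals) {
    for (std::size_t i = 0; i < srcPositions.size(); ++i) {
        const Bone4x3& bone = bones[static_cast<std::size_t>(boneIndices[i])];
        dstPositions[i] = transformPoint(bone, srcPositions[i]);
        dstNormals[i] = transformDirection(bone, srcNormals[i]);
    }
}
```

Since the source data is only ever read by the CPU here, it can live in plain arrays; only the two destination streams need to be dynamic vertex buffers.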

I'm also having trouble with some cards not supporting Index Buffers for 32 bit indices. It's one really big mesh and I can't split it up. The only solution to that seems to be DrawPrimitiveUP (as a backup technique used only when the card doesn't support 32 bit indices, of course!)

Thanks for your insights

I always thought Windows 7 required DX11 cards.

Win7 introduced the Dx11 API (it was later available on WinVista in an update).
The Dx11 API works with Shader model 2-5 GPUs (D3D9-11 era) and newer.

WinXP users might have an SM5 (D3D11-era) GPU, and Win7 users might have an SM2 (D3D9-era) GPU.

I'm also having trouble with some cards not supporting Index Buffers for 32 bit indices. It's one really big mesh and I can't split it up.

This shouldn't be an issue. You should pretty much always use 16bit indices for speed. I forget the parameter names in dx9, but DrawIndexedPrimitive should have:
* the number of indices to read / number of primitives to draw (can convert between the two if the primitive type is known).
* an offset into the index buffer of where to start reading indices from.
* an offset into the vertex buffer of where to start reading vertices from -- or if you like, a 32 bit value that gets added to every 16bit index.
Those tools let you put as many vertices / indices as you like into a buffer and then draw them with one DrawIndexedPrimitive call per 65k verts.

I can share a routine for performing the buffer reorganisation / draw splitting if it's helpful.

It looks like one static buffer with base vertex data, and one dynamic vertex buffer for rendering after transforming the vertices with the bone matrices on the CPU might just be the best idea!

Only as long as you've got the CPU cycles and RAM to burn. It obviously scales well across threads too ;)
If you want to invest in this approach, it pays off to structure your source data to be compact and well-aligned for SSE instructions. e.g. we ended up with 16 bytes per vertex for the skinned attributes - 16bit fixed-point positions, 10bit fixed-point normals/tangents and one bit for binormal sign, which meant we could load those attributes with a single aligned SSE load instruction, then quickly unpack, transform/skin, and write them into the dynamic vertex buffer.
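To give a feel for the bit-packing side of this, here is a sketch of one way to squeeze a normal plus a binormal sign into a single 32-bit word using 10-bit signed fixed point. This is my own illustrative layout (10 bits each for x, y, z, one spare bit, sign in the top bit), not the actual format described above, and it leaves out the SSE loading entirely. All names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Quantize a value in [-1, 1] to a signed 10-bit integer in [-511, 511],
// stored as two's complement in the low 10 bits.
inline std::uint32_t quantizeSnorm10(float v) {
    const float clamped = std::fmax(-1.0f, std::fmin(1.0f, v));
    const int q = static_cast<int>(std::lround(clamped * 511.0f));
    return static_cast<std::uint32_t>(q) & 0x3FFu;
}

inline float dequantizeSnorm10(std::uint32_t bits) {
    int q = static_cast<int>(bits & 0x3FFu);
    if (q & 0x200) {
        q -= 0x400;  // sign-extend the 10-bit value
    }
    return static_cast<float>(q) / 511.0f;
}

struct UnpackedNormal {
    float x, y, z;
    bool binormalSignNegative;
};

// Assumed layout: bits 0-9 x, 10-19 y, 20-29 z, bit 30 unused, bit 31 sign flag.
inline std::uint32_t packNormal(float x, float y, float z, bool binormalSignNegative) {
    return quantizeSnorm10(x)
         | (quantizeSnorm10(y) << 10)
         | (quantizeSnorm10(z) << 20)
         | (binormalSignNegative ? 0x80000000u : 0u);
}

inline UnpackedNormal unpackNormal(std::uint32_t packed) {
    return UnpackedNormal{
        dequantizeSnorm10(packed),
        dequantizeSnorm10(packed >> 10),
        dequantizeSnorm10(packed >> 20),
        (packed & 0x80000000u) != 0u};
}
```

The worst-case error per component is about half a step, i.e. roughly 1/1022, which is usually fine for normals that get renormalized after skinning.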

Hm, a buffer reorganisation / draw splitting routine would be useful yeah. It's like having a 16 bit index buffer but still allowing use of a vertex buffer containing more than 65k vertices?

You were able to fit the whole vertex info into 128 bits in one buffer? What I currently do is keep vertex positions (3 floats), normals (3 floats), bone index (1 float only; there's no blended skinning, so no weights either) and uv coords (2 floats) each in separate vertex buffers. At render time, they are separate streams. So for me, I'd have a dynamic vertex buffer for the positions and normals only, to transform those at render time.

I'm not planning to 'thread' the transformations into the dynamic vertex buffer, because I build the bone matrices for a given frame practically right before rendering, so it seems pointless for the rendering thread to wait for a separate thread to finish what the rendering thread itself could do. It might be useful if I'd do other things between building the bone matrices and actual rendering, but it'd have to be faster than the whole thread overhead.

Lastly, I noticed there are no usage flags for creating read-only vertex buffers? I should just leave out D3DUSAGE_WRITEONLY when creating the vertex buffer, and use D3DLOCK_READONLY when I lock the vertex buffer for reading? Is this the fastest read-only static vertex buffer? Or should I just not use d3d vertex buffers at all and go with std vectors or so...

Hm, a buffer reorganisation / draw splitting routine would be useful yeah. It's like having a 16 bit index buffer but still allowing use of a vertex buffer containing more than 65k vertices?

Yep. So in my (C#) tools, I've got some code like this, which takes a single non-indexed triangle list, and splits it into 1 or more indexed triangle lists with a max of 65k verts per sub-list so that 16-bit indices can be used:


IndexedTriList[] ReindexTriList(TriList list)
{
	var outputLists = new List<IndexedTriList>();
	//number of vertices in the input list; a triangle list should hold a multiple of 3
	int numInputVerts = list.vertices.Count();
	//stop before any trailing partial triangle so that vertIdx+2 always stays in range
	for (int vertIdx = 0; vertIdx + 2 < numInputVerts; )
	{
		//This will become the index buffer content for this group of up to 65k verts:
		var indices = new List<int>();
		//This will become the vertex buffer content for this group of up to 65k verts:
		var vertices = new List<Vertex>();
		//This lets us keep track of duplicate vertices, mapping them to their index into vertices
		var uniqueVertices = new Dictionary<Vertex, int>(new VertexEqualityComparer());
		//number of triangles in this group:
		int triCount = 0;
		//keep pushing triangles into the group until we've consumed them all, or there's 65k unique verts in the buffer
		for (; vertIdx + 2 < numInputVerts && vertices.Count < 0xFFFC; vertIdx += 3)
		{
			++triCount;
			//read the next triangle out of the input non-indexed triangle list
			for (int j = 0; j != 3; ++j)
			{
				Vertex v = list.vertices[vertIdx+j];
				//check if we've already added this vertex to the group
				int index = -1;
				if (!uniqueVertices.TryGetValue(v, out index))
				{
					//if not, add this vertex to the group now
					index = vertices.Count;
					vertices.Add(v);
					uniqueVertices.Add(v, index);
				}
				//add the index of the vertex within the group to the group's index buffer
				indices.Add(index);
			}
		}
		//add this group of <=65k verts to the output
		outputLists.Add( new IndexedTriList(triCount, vertices.ToArray(), indices.ToArray()) );
	}
	return outputLists.ToArray();
}

You can take the resulting array of indexed tri-lists and put them all into a single vertex-buffer and single index-buffer if you like -- and then use the DrawIndexedPrimitive parameters to set the appropriate offsets into your buffers: BaseVertexIndex (32bit number to add to each 16-bit index) and StartIndex (offset into the index buffer of where to start reading from).
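On the C++ side, merging the groups and working out the per-draw offsets might look like the sketch below. `Vertex`, `IndexedGroup`, `DrawRange`, `MergedMesh` and `mergeGroups` are placeholder names for the example; the only D3D9 facts relied on are the `DrawIndexedPrimitive` parameters (BaseVertexIndex, MinVertexIndex, NumVertices, StartIndex, PrimitiveCount). The indices stay local to each group, which is why they still fit in 16 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };  // placeholder vertex layout

// One output group from the re-indexing step: at most 65536 vertices,
// with 16-bit indices that are local to the group.
struct IndexedGroup {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Values to hand to one DrawIndexedPrimitive call.
struct DrawRange {
    int baseVertexIndex;      // added to every 16-bit index by the API
    unsigned numVertices;     // vertices referenced by this draw
    unsigned startIndex;      // first index to read from the shared index buffer
    unsigned primitiveCount;  // triangles in this draw
};

struct MergedMesh {
    std::vector<Vertex> vertices;        // contents of the single vertex buffer
    std::vector<std::uint16_t> indices;  // contents of the single index buffer
    std::vector<DrawRange> draws;        // one entry per group
};

inline MergedMesh mergeGroups(const std::vector<IndexedGroup>& groups) {
    MergedMesh out;
    for (const IndexedGroup& g : groups) {
        DrawRange r;
        r.baseVertexIndex = static_cast<int>(out.vertices.size());
        r.numVertices = static_cast<unsigned>(g.vertices.size());
        r.startIndex = static_cast<unsigned>(out.indices.size());
        r.primitiveCount = static_cast<unsigned>(g.indices.size() / 3);
        out.vertices.insert(out.vertices.end(), g.vertices.begin(), g.vertices.end());
        out.indices.insert(out.indices.end(), g.indices.begin(), g.indices.end());
        out.draws.push_back(r);
    }
    return out;
}

// At render time, each range would map onto a call along these lines:
//   device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, r.baseVertexIndex, 0,
//                                r.numVertices, r.startIndex, r.primitiveCount);
```

Only the vertex offset is a 32-bit quantity, so the index buffer can stay D3DFMT_INDEX16 no matter how large the combined vertex buffer gets.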

I'm not planning to 'thread' the transformations into the dynamic vertex buffer, because I build the bone matrices for a given frame practically right before rendering, so it seems pointless for the rendering thread to wait for a separate thread to finish what the rendering thread itself could do.

Most engines use a "job system" for threading these days. Say you've got 100k vertices to be processed, you could add 100 jobs to the job queue, each of which is responsible for processing 1k vertices. All of your threads (main thread, and worker threads) can then consume those jobs when they've got nothing else to do. While the main thread is waiting on these jobs to finish (before it can continue with rendering), it can consume jobs too. This model lets you (periodically) get 100% CPU usage (all cores busy) pretty easily, at least whenever you've got large batches of data to process.
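A very stripped-down sketch of that idea, using only the C++ standard library: the work is cut into fixed-size jobs, an atomic counter hands them out, and the calling thread consumes jobs alongside the workers instead of idling. `runJobs` is a made-up name, and a real job system would keep its worker threads alive between batches rather than spawning them per call as this does.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, itemCount) into jobs of jobSize items (jobSize must be > 0) and
// runs processRange(begin, end) for each job on whichever thread grabs it.
inline void runJobs(std::size_t itemCount,
                    std::size_t jobSize,
                    unsigned workerCount,
                    const std::function<void(std::size_t, std::size_t)>& processRange) {
    const std::size_t jobCount = (itemCount + jobSize - 1) / jobSize;
    std::atomic<std::size_t> nextJob{0};

    // Each thread keeps claiming the next unclaimed job until none are left.
    auto consume = [&]() {
        for (;;) {
            const std::size_t job = nextJob.fetch_add(1);
            if (job >= jobCount) {
                return;
            }
            const std::size_t begin = job * jobSize;
            const std::size_t end = std::min(itemCount, begin + jobSize);
            processRange(begin, end);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back(consume);
    }
    consume();  // the calling (e.g. render) thread helps out instead of waiting
    for (std::thread& w : workers) {
        w.join();
    }
}
```

For the 100k-vertex case you'd call something like `runJobs(100000, 1000, 3, skinRange)`, where `skinRange` transforms vertices `begin` to `end`. Each job touches a disjoint slice, so no locking is needed on the vertex data itself.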

Lastly, I noticed there are no usage flags for creating read-only vertex buffers? I should just leave out D3DUSAGE_WRITEONLY when creating the vertex buffer, and use D3DLOCK_READONLY when I lock the vertex buffer for reading? Is this the fastest read-only static vertex buffer? Or should I just not use d3d vertex buffers at all and go with std vectors or so...

If you're never going to be sending the data to the GPU or otherwise transforming it using D3D... then yeah, just allocate the memory yourself instead of using a D3D buffer object. If you do need CPU-side D3D data for whatever reason though, D3DPOOL_SYSTEMMEM is what you're looking for.

