Inefficient use of VBO?

Started by maxest
33 comments, last by maxest 14 years, 1 month ago
I have a cross-platform renderer (D3D9 and OpenGL). To my surprise I found that D3D9 works approximately 25-40% faster than OGL. So I started digging. I know it's not a pixel-processing problem; it's vertex. Right now I render about 120 objects, each around 20k faces. With a small viewport set (to eliminate pixel-processing influence) I get around 30 fps for OGL and 40 fps for D3D9. After many tests I really don't know what's going on.

One interesting thing is that for OGL I tested two approaches to specifying the data. One approach is to have an array of interleaved vertices (with a structure describing the vertex, just like in D3D9). The other approach is to specify all vertex positions first, then all normals, then the other data (I guess it's not possible to arrange data that way in D3D9). And when I tried this second approach the fps grew to around 32. So the arrangement of the data matters, and I'm wondering if DX is doing something "magical" to gain better performance when sending vertex data? My init code for the vertex buffer (the index buffer's is similar):

		#ifdef RENDERER_D3D9 // macro name assumed; the original post omits the opening guard
		{
			CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &id, NULL);
		}
		#else
		{
			glGenBuffers(1, &id);
			glBindBuffer(GL_ARRAY_BUFFER, id);
			glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);
		}
		#endif
Code for mapping:

		#ifdef RENDERER_D3D9 // macro name assumed; the original post omits the opening guard
		{
			id->Lock(0, 0, (void**)&data, 0);
		}
		#else
		{
			glBindBuffer(GL_ARRAY_BUFFER, id);
			data = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
		}
		#endif
I guess the buffer configuration is the same for both D3D9 and OGL. Or maybe I'm missing something?
"GL_STATIC_DRAW" - try replacing it with GL_STREAM_DRAW or GL_DYNAMIC_DRAW. You could also try using glBufferData() instead of glMapBuffer, the driver might be more optimized for that.
Interleaved vertex data is generally faster. Do mind the DWORD alignment of attributes, too (don't use a 3-byte attribute; pad it to 4 bytes).
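To illustrate (the field layout is an assumption, not taken from your code), an interleaved vertex with every attribute starting on a 4-byte boundary could look like this:

		// 32 bytes per vertex; no attribute straddles a DWORD boundary
		struct Vertex
		{
			float         pos[3];     // 12 bytes
			float         normal[3];  // 12 bytes
			unsigned char color[4];   // RGB stored as 4 bytes (RGBX), never 3
			unsigned char pad[4];     // padding to round the size up to 32
		};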

Streaming is by far most efficient when two copies of the data can fit in the CPU's L1 cache. 20k verts * 32 bytes/vert can't fit: that's 640 KB, far beyond a typical 32-64 KB L1 data cache.
First of all, note that I'm only talking about static buffers, and GL_STREAM_DRAW/GL_DYNAMIC_DRAW causes the fps to drop from 30 to 19.
Aligning the vertex structure to 32/64 bytes or using 4-byte attributes doesn't help either (I'm working on a GF8400, if that matters).
One interesting thing: some time ago I was testing the speed of updating animated geometry data, and in that case OGL was a bit faster than D3D. But when it comes to simple static data, D3D is definitely much faster. I really can't understand this big difference (40 vs 30); OGL must be laying the data out in some really unfortunate way.
You didn't mention if the data should be static, and instead posted how you map it (which is used mostly for streaming).

Just measure the calls with RDTSC
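For what it's worth, a minimal sketch of such a measurement using the MSVC intrinsic (the draw call is only an example):

		#include <intrin.h> // __rdtsc; GCC/Clang provide it in <x86intrin.h>

		unsigned __int64 t0 = __rdtsc();
		glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
		unsigned __int64 t1 = __rdtsc();
		// t1 - t0 = CPU cycles spent inside the call; this captures only
		// the submission cost, as the driver may defer the actual work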
Quote:Original post by maxest
Right now I render about 120 objects, each around 20k faces.
By my reckoning that makes roughly 2.4 million tris. Any reason you can't cull these further? 2.4 million tris will bring most cards into the low double-digits.

Vertex layout is unlikely to be the culprit here, so I would guess that you haven't matched the vertex processing between D3D and OpenGL. Are you using a vertex shader, or the fixed function OpenGL vertex pipe? And what operations is it performing (i.e. lighting, fog, etc.)?

Another high probability is drivers - D3D drivers are often more mature. Make sure you have installed the very latest drivers for your card, and benchmark again.

Quote:Original post by idinev
Just measure the calls with RDTSC
How will that help? The API is asynchronous, so individual call times are completely meaningless - operations may be deferred substantially where the driver considers it beneficial.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Quote:
You didn't mention if the data should be static, and instead posted how you map it (which is used mostly for streaming).

You're right. I should have mentioned that I mean static buffers. I use Map/Unmap (instead of BufferSubData) to keep consistency with my D3D9 renderer.

Quote:
Vertex layout is unlikely to be the culprit here

Note that when I use an array of interleaved vertices, just like in D3D, the performance is a bit worse than when I first specify all positions, then all normals, then the other data in video memory (I upload the latter with glBufferSubData and the former with glMap/glUnmap), so the vertex layout has some meaning, even if very subtle.
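To make the two arrangements concrete (Vertex and numVertices are illustrative names, not my actual code):

		// interleaved: attributes addressed by stride within one struct
		glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)0);
		glNormalPointer(GL_FLOAT, sizeof(Vertex), (void*)12);

		// separate blocks: all positions first, then all normals
		glVertexPointer(3, GL_FLOAT, 0, (void*)0);
		glNormalPointer(GL_FLOAT, 0, (void*)(numVertices * 3 * sizeof(float)));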

I've also reduced the number of objects to 25. And FPS is now about 240 for D3D9 and 120 for OGL!

Quote:
so I would guess that you haven't matched the vertex processing between D3D and OpenGL. Are you using a vertex shader, or the fixed function OpenGL vertex pipe? And what operations is it performing (i.e. lighting, fog, etc.)?

I'm using Cg shaders, compiled to NVIDIA's VP40 and FP40 profiles, and for D3D I use shader model 3.0. At first I was doing some tangent-space calculations, but now I've simplified the shader a lot and found something interesting: after simplifying, D3D9's performance increased to around 300, whereas OGL's is still 120! So the bottleneck is definitely not vertex processing, but rather memory access to geometry or something. This may also look like a CPU bottleneck, but I have very few drawcalls now, and I've also tested a situation where I abuse drawcalls. In that case (I don't remember the precise number) OGL had 2x better performance.

Quote:
Another high probability is drivers - D3D drivers are often more mature. Make sure you have installed the very latest drivers for your card, and benchmark again.

On my notebook with the GF8400 I have an old driver, but I tested the application on my desktop with a GF6600 and the newest NVIDIA drivers, and performance is similar to the notebook's.
Have you tried coding the shaders directly in each native shading language? Cg isn't always known for producing the best bytecode, and you'll almost always see a performance improvement in complex shaders from writing them directly in GLSL. Could it be that Cg just isn't doing as good a job of optimizing the shaders for OpenGL?
My simplified vertex shader has 9 instructions. Do you really think Cg could mess something up here? :)
I've just tried the GLSL profile instead of NV40 and it changed nothing. The FPS is exactly the same, so it probably isn't a problem of a badly written shader.
Quote:Original post by maxest
Quote:
Vertex layout is unlikely to be the culprit here

Note that when I use an array of interleaved vertices, just like in D3D, the performance is a bit worse than when I first specify all positions, then all normals, then the other data in video memory (I upload the latter with glBufferSubData and the former with glMap/glUnmap), so the vertex layout has some meaning, even if very subtle.
Quote:So the bottleneck is definitely not vertex processing, but rather memory access to geometry or something.
Are you absolutely sure that your vertices are falling on 32-byte boundaries? Incorrect alignment can cause performance problems, and I believe that the D3D vertex formats handle this for you.

For example, 3 floats position + 3 floats normal + 2 floats texcoord = 32 bytes. Add/remove one float to that, and your cache behaviour is shot to hell.
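A cheap way to guard against that is a compile-time size check (static_assert is C++11; a typedef-based compile-time assert does the same on older compilers):

		struct Vertex
		{
			float pos[3];    // 12 bytes
			float normal[3]; // 12 bytes
			float uv[2];     //  8 bytes
		};
		static_assert(sizeof(Vertex) % 32 == 0, "vertex size must stay a multiple of 32 bytes");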

Quote:I've also tested a situation where I abuse drawcalls. In that case (I don't remember the precise number) OGL had 2x better performance.
This is expected - OpenGL drawcalls are far less expensive than D3D9 drawcalls.
Quote:My simplified vertex shader has 9 instructions. Do you really think Cg could mess something up here? :)
I've just tried the GLSL profile instead of NV40 and it changed nothing. The FPS is exactly the same, so it probably isn't a problem of a badly written shader.
NVIDIA's Cg compiler uses the same backend as their GLSL compiler, so I wouldn't expect major differences.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

You are definitely doing something somewhere that you shouldn't be. The large performance difference is not normal. In fact, in my experience NVIDIA's OpenGL driver tends to be slightly (but consistently) faster than D3D9, if (and only if) you perform the exact same operations in both APIs.

Now, what exactly is going wrong on your end is impossible to say without going much deeper into your code and profiling it. You might want to try an OpenGL profiler such as gDebugger on an instrumented driver.

That said, I find it suspicious that your OpenGL FPS figures always seem to be multiples of 30. It actually looks like you have a 60Hz vsync enabled in OpenGL, but disabled in D3D.
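To rule that out on Windows, vsync can be switched off via the WGL_EXT_swap_control extension, roughly like this:

		#include <windows.h>

		// interval 0 = vsync off, 1 = locked to the display refresh rate
		typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);
		PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
			(PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
		if (wglSwapIntervalEXT)
			wglSwapIntervalEXT(0); // benchmark with vsync disabled in both APIs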

