maxest

Ineffective use of VBO?

I have a cross-API renderer (D3D9 and OGL). To my surprise I found that D3D9 runs approximately 25-40% faster than OGL, so I started digging. I know it's not a pixel-processing problem; it's on the vertex side. Right now I render about 120 objects, each around 20k faces. With a small viewport set (to eliminate pixel-processing influence) I get around 30 fps for OGL and 40 fps for D3D9. After many tests I really don't know what's going on. One interesting thing is that for OGL I tested two approaches to laying out the data. One approach is an interleaved array of vertices (with a structure describing the vertex, just like in D3D9). The other approach is to store all vertex positions first, then all normals, then the other data (I guess it's not possible to arrange data that way in D3D9). With this second approach the fps grew to around 32. So the arrangement of the data matters, and I'm wondering whether DX is doing something "magical" to gain better performance when sending vertex data. My init code for the vertex buffer (the index buffer is similar):
			// D3D9 path:
			CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &id, NULL);

			// OGL path:
			glGenBuffers(1, &id);
			glBindBuffer(GL_ARRAY_BUFFER, id);
			glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);
Code for mapping:
			// D3D9 path:
			id->Lock(0, 0, (void**)&data, 0);

			// OGL path:
			glBindBuffer(GL_ARRAY_BUFFER, id);
			data = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
I believe the buffer is configured the same way for both D3D9 and OGL. Or maybe I'm missing something?
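
For illustration, here is a minimal sketch of the two OGL layouts described above: interleaved data filled through glMapBuffer, versus a planar layout (all positions, then all normals, then texcoords) filled with glBufferSubData. The Vertex struct, attribute set and function names are assumptions for the example, not code from this thread.

#include <GL/glew.h>
#include <cstring>

// Illustrative 32-byte vertex: 3 + 3 + 2 floats.
struct Vertex { float pos[3]; float normal[3]; float uv[2]; };

// Interleaved layout, written through a mapped pointer.
void UploadInterleaved(GLuint vbo, const Vertex* verts, int count)
{
	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex), NULL, GL_STATIC_DRAW);
	void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
	memcpy(dst, verts, count * sizeof(Vertex));
	glUnmapBuffer(GL_ARRAY_BUFFER);
}

// Planar layout: all positions, then all normals, then all texcoords.
void UploadPlanar(GLuint vbo, const float* positions, const float* normals, const float* uvs, int count)
{
	GLsizeiptr posBytes = count * 3 * sizeof(float);
	GLsizeiptr nrmBytes = count * 3 * sizeof(float);
	GLsizeiptr uvBytes  = count * 2 * sizeof(float);
	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	glBufferData(GL_ARRAY_BUFFER, posBytes + nrmBytes + uvBytes, NULL, GL_STATIC_DRAW);
	glBufferSubData(GL_ARRAY_BUFFER, 0, posBytes, positions);
	glBufferSubData(GL_ARRAY_BUFFER, posBytes, nrmBytes, normals);
	glBufferSubData(GL_ARRAY_BUFFER, posBytes + nrmBytes, uvBytes, uvs);
}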

"GL_STATIC_DRAW" - try replacing it with GL_STREAM_DRAW or GL_DYNAMIC_DRAW. You could also try using glBufferData() instead of glMapBuffer, the driver might be more optimized for that.
Interleaved vtx data is generally faster. Do mind the DWORD-alignment of attribs, too. (don't have a 3-byte attrib, instead go for 4-byte)

Streaming is by far most efficient when two copies of the data can fit in the CPU's L1 cache. 20k verts * 32 bytes/vert = 640 KB, which can't fit.
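
To make the interleaving and alignment advice above concrete, here is a sketch of a 32-byte interleaved vertex and the matching pointer setup with a single stride. The struct fields are assumptions, and vbo stands for a previously created buffer holding the interleaved data.

#include <stddef.h> // offsetof

// 32-byte, DWORD-aligned interleaved vertex: 3 + 3 + 2 floats.
struct Vertex { float pos[3]; float normal[3]; float uv[2]; };

// vbo: a VBO already filled with interleaved Vertex data.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
// With a VBO bound, the "pointer" parameter is a byte offset into the buffer.
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, pos));
glNormalPointer(GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, normal));
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, uv));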

First of all, note that I'm only talking about static buffers, and GL_STREAM_DRAW/GL_DYNAMIC_DRAW causes the fps to drop from 30 to 19.
Aligning the vertex structure to 32/64 bytes or using 4-byte attributes doesn't help either (I'm working on a GF8400, if that matters).
One interesting thing: some time ago I was testing the speed of updating animated geometry data, and in that case OGL was a bit faster than D3D. But when it comes to simple static data, D3D is definitely much faster. I really can't understand this big difference (40 vs 30 fps); OGL must be laying out the data in some unfortunate way.

Quote:
Original post by maxest
Right now I render about 120 objects, each around 20k faces.
By my reckoning that makes 2.5 million tris. Any reason you can't cull these further? 2.5 million tris will bring most cards into the low double-digits.

Vertex layout is unlikely to be the culprit here, so I would guess that you haven't matched the vertex processing between D3D and OpenGL. Are you using a vertex shader, or the fixed function OpenGL vertex pipe? And what operations is it performing (e.g. lighting, fog, etc.)?

Another high probability is drivers - D3D drivers are often more mature. Make sure you have installed the very latest drivers for your card, and benchmark again.

Quote:
Original post by idinev
Just measure the calls with RDTSC
How will that help? The API is asynchronous, so individual call times are completely meaningless - operations may be deferred substantially where the driver considers it beneficial.
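
A more meaningful way to time GPU work than RDTSC around the calls is a GPU timer query. A sketch, assuming the driver exposes EXT_timer_query (the GL_TIME_ELAPSED_EXT enum and glGetQueryObjectui64vEXT entry point, both available through GLEW):

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED_EXT, query);
// ... the draw calls being measured ...
glEndQuery(GL_TIME_ELAPSED_EXT);

// Read the result later; this call blocks until the GPU has finished.
GLuint64EXT elapsedNs = 0;
glGetQueryObjectui64vEXT(query, GL_QUERY_RESULT, &elapsedNs);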

Quote:

You didn't mention if the data should be static, and instead posted how you map it (which is used mostly for streaming).

You're right. I should have made it clear that I mean static buffers. I use map/unmap (instead of BufferSubData) to stay consistent with my D3D9 renderer.

Quote:

Vertex layout is unlikely to be the culprit here

Note that when I use an interleaved array of vertices, just like in D3D, the performance is a bit worse than when I store all positions first, then normals, then the other data in video memory (the latter I do with glBufferSubData, the former with glMap/glUnmap), so vertex layout does matter, even if only subtly.

I've also reduced the number of objects to 25. And FPS is now about 240 for D3D9 and 120 for OGL!

Quote:

so I would guess that you haven't matched the vertex processing between D3D and OpenGL. Are you using a vertex shader, or the fixed function OpenGL vertex pipe? And what operations is it performing (e.g. lighting, fog, etc.)?

I'm using Cg shaders, compiled to NVidia's VP40 and FP40 profiles, and for D3D I use shader model 3.0. At first I was doing some tangent-space calculations, but now I've simplified the shader a lot and found something interesting: after simplifying, D3D9's performance increased to around 300 fps, whereas OGL's is still 120! So the bottleneck is definitely not vertex processing, but rather memory access to the geometry or something like that. It might also look like a CPU bottleneck, but I have very few draw calls now, and I've also tested a situation where I abuse draw calls. In that case (I don't remember the precise numbers) OGL had 2x better performance.

Quote:

Another high probability is drivers - D3D drivers are often more mature. Make sure you have installed the very latest drivers for your card, and benchmark again.

On my notebook with the GF8400 I have some old drivers, but I tested the application on my desktop with a GF6600 and the newest NVidia drivers, and the performance is similar to the notebook's.

Have you tried coding the shaders specifically to each native shading language? Cg isn't always known for making the best bytecode, and you'll almost always see a performance improvement in complex shaders from doing them directly in GLSL. Could be that Cg just isn't doing as good a job at optimizing the shaders for OpenGL?

My simplified vertex shader has 9 instructions. Do you really think Cg could mess something up here? :)
I've just tried the GLSL profile instead of NV40 and it changed nothing; the FPS is exactly the same, so it probably isn't a problem of a badly written shader.

Quote:
Original post by maxest
Quote:

Vertex layout is unlikely to be the culprit here

Note that when I use an interleaved array of vertices, just like in D3D, the performance is a bit worse than when I store all positions first, then normals, then the other data in video memory (the latter I do with glBufferSubData, the former with glMap/glUnmap), so vertex layout does matter, even if only subtly.
Quote:
So the bottleneck is definitely not vertex processing, but rather memory access to the geometry or something like that.
Are you absolutely sure that your vertices are falling on 32-byte boundaries? Incorrect alignment can cause performance problems, and I believe that the D3D vertex formats handle this for you.

For example, 3 floats position + 3 floats normal + 2 floats texcoord = 32 bytes. Add/remove one float to that, and your cache behaviour is shot to hell.
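
A quick way to catch this at build time, assuming the 32-byte layout above (the struct is purely illustrative):

struct Vertex
{
	float pos[3];    // 12 bytes
	float normal[3]; // 12 bytes
	float uv[2];     // 8 bytes -> 32 bytes total
};

// Compile-time size check (C++11 static_assert; on older compilers a
// negative-sized-array typedef achieves the same thing).
static_assert(sizeof(Vertex) == 32, "Vertex is not 32 bytes");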

Quote:
I've also tested a situation where I abuse draw calls. In that case (I don't remember the precise numbers) OGL had 2x better performance.
This is expected - OpenGL drawcalls are far less expensive than D3D9 drawcalls.
Quote:
My simplified vertex shader has 9 instructions. Do you really think Cg could mess something up here? :)
I've just tried the GLSL profile instead of NV40 and it changed nothing; the FPS is exactly the same, so it probably isn't a problem of a badly written shader.
NVidia's Cg compiler uses the same backend as their GLSL compiler, so I wouldn't expect major differences.

You are definitely doing something that you shouldn't be doing somewhere. The large performance difference is not normal. In fact, from my own experience, NVidia's OpenGL tends to be slightly (but consistently) faster than D3D9, if (and only if) you perform the exact same operations in both APIs.

Now, what exactly is going wrong on your end is impossible to say without going much deeper into your code and profiling it. You might want to try an OpenGL profiler such as gDebugger on an instrumented driver.

That said, I find it suspicious that your OpenGL FPS figures always seem to be multiples of 30. It actually looks like you have a 60Hz vsync enabled in OpenGL, but disabled in D3D.

Quote:

That said, I find it suspicious that your OpenGL FPS figures always seem to be multiples of 30. It actually looks like you have a 60Hz vsync enabled in OpenGL, but disabled in D3D.

It's not that; I'm just rounding the FPS :)

Quote:

Now, what exactly is going wrong on your end is impossible to say without going much deeper into your code and profiling it. You might want to try an OpenGL profiler such as gDebugger on an instrumented driver.

I would do that, but the 7-day trial of gDEBugger has expired. On my notebook I'm also having problems installing an instrumented driver, and I can't use NVPerfHUD.

Quote:

Are you absolutely sure that your vertices are falling on 32-byte boundaries? Incorrect alignment can cause performance problems, and I believe that the D3D vertex formats handle this for you.

That's funny. I was using a 60-byte vertex. Now I've switched to 64 bytes, and D3D9 went from 300 to 350 FPS, while OGL is still around 120 (though I've noticed a very subtle drop).

I'm really not doing anything unusual. I prepare the data, load it, set glVertex/Normal/...Pointer and finally draw everything with glDrawElements. Maybe I should upload my code so you can see if there's something suspicious.

I don't know how important it is, but I remember testing one of my applications (based on the same renderer) with gDEBugger and seeing numerous calls to glGetProgram (or something like it, a function that retrieves the shader's id) every frame. gDEBugger says this is not optimal, but I can't reduce it since Cg is the one doing it, and I don't know why it's so stubborn about calling that function.

Quote:
Original post by maxest
I'm really not doing anything unusual. I prepare the data, load it, set glVertex/Normal/...Pointer and finally draw everything with glDrawElements. Maybe I should upload my code so you can see if there's something suspicious.

A) You should use generic vertex attributes instead of named ones. The latter are deprecated, and may not be an optimal codepath in newer drivers anymore.

B) Make sure to use IBOs with your VBOs (see the sketch after this list). Not doing so can make you CPU bound in some scenarios.

C) Cg may or may not be problematic. Take a look at the generated shader assembly to be sure.

D) Are you changing fragment shader uniform variables per frame? Some older NVidia GPUs require a shader recompile every time a uniform variable is modified, which is done transparently by the driver. And that obviously kills performance.
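
A sketch of points A and B combined: generic vertex attributes fed from an interleaved VBO, with an IBO bound for indexed drawing. The attribute locations, Vertex struct, vbo/ibo handles and index count are assumptions for the example.

#include <stddef.h> // offsetof

struct Vertex { float pos[3]; float normal[3]; float uv[2]; };

// vbo/ibo: previously created and filled buffer objects.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);

glEnableVertexAttribArray(0); // position (location assumed)
glEnableVertexAttribArray(1); // normal
glEnableVertexAttribArray(2); // texcoord
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, pos));
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, normal));
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, uv));

// Indices are read from the bound GL_ELEMENT_ARRAY_BUFFER, starting at offset 0.
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);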

Besides that, there's not much more one can say without further profiling. This situation is strange, as your usage case (static geometry, simple shaders, no FFP) essentially maps directly down to the hardware, without much interference from the driver. The only point where the CPU and driver play a significant role is batch submission. And ironically that's exactly the situation where OGL is much faster than D3D9. But you said you already tested that.

Quote:
Original post by maxest
I don't know how important it is, but I remember testing one of my applications (based on the same renderer) with gDEBugger and seeing numerous calls to glGetProgram (or something like it, a function that retrieves the shader's id) every frame. gDEBugger says this is not optimal, but I can't reduce it since Cg is the one doing it, and I don't know why it's so stubborn about calling that function.


Every frame? There's really no reason that should be necessary assuming all the binding locations are being cached. If it's only calling glGetProgram, it's probably doing some kind of error/bounds checking. If it's making other glGet*** calls too, then it could be doing a lot of unnecessary stuff.

glslDevil should tell you exactly what's going on, and it isn't a limited trial (it can give you a trace of OpenGL function calls).

I've used glIntercept. This is the log of my whole frame:

glViewport(0,700,100,100)
glClearColor(0.500000,0.500000,0.500000,1.000000)
glClearDepth(1.000000)
glClear(GL_DEPTH_BUFFER_BIT | GL_COLOR_BUFFER_BIT)

// the block below is repeated 25 times
glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,GL_PROGRAM_BINDING_ARB,0x12fcd8)
glProgramLocalParameters4fvEXT( ??? )
glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,GL_PROGRAM_BINDING_ARB,0x12fcd8)
glProgramLocalParameters4fvEXT( ??? )
glDrawRangeElements(GL_TRIANGLES,0,31518,60000,GL_UNSIGNED_SHORT,0x0000) VP=1 FP=5 Textures[ (4,6) ]
...

wglSwapBuffers(0x520118f1)=true

So as you can see, I'm doing minimal work, and I don't think that calling glGetProgram 50 times per frame could wreck the performance. Currently I mostly suspect it might be something with the deprecated glVertex/Normal/...Pointer calls, as Yann L stated; that's the next thing I'll check in my investigation. I'm 95% sure the driver places the geometry in some bad way, and the deprecated functions might be what's causing this whole mess.

Looks like you are using the old GL_ARB_vertex_program. I suggest GLSL.
Why do you need glGetProgramivARB? It could very well drag performance down. All glGet calls should be avoided.
You could upload the code somewhere but it should be compilable. What compiler does it need?

Quote:
Currently I mostly suspect it might be something with the deprecated glVertex/Normal/...Pointer calls

I doubt it, because I've always thought glVertexAttrib is just sugar coating, but it's worth a try. The switch is not difficult.

Quote:

Looks like you are using old GL_ARB_vertex_program. I suggest GLSL.

As I mentioned, using Cg's GLSL profile didn't help. Would using "native" GLSL and bypassing Cg completely help here?

Quote:

Why do you need glGetProgramivARB? It could very well drag performance down. All glGet calls should be avoided.

I don't need it; the stupid Cg runtime calls it every time I call cgGLSetParameterXXX! At first I thought the problem could be cgGetNamedParameter, which I was calling before every change of a parameter's value. So I switched to an STL map and keep every new "NamedParameter" in it, so I don't have to call cgGetNamedParameter all the time; I just look up the proper parameter in the map. But this didn't help, so Cg is probably issuing these gets when I call something like the aforementioned cgGLSetParameterXXX.
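
For reference, a sketch of the handle-caching approach described above. The class and parameter names are assumptions, not taken from the thread.

#include <map>
#include <string>
#include <Cg/cg.h>
#include <Cg/cgGL.h>

// Cache CGparameter handles so cgGetNamedParameter runs once per name.
class ParameterCache
{
public:
	explicit ParameterCache(CGprogram program): program(program) {}

	CGparameter Get(const std::string& name)
	{
		std::map<std::string, CGparameter>::iterator it = cache.find(name);
		if (it != cache.end())
			return it->second;
		CGparameter param = cgGetNamedParameter(program, name.c_str());
		cache[name] = param;
		return param;
	}

private:
	CGprogram program;
	std::map<std::string, CGparameter> cache;
};

// Usage (hypothetical parameter name):
//   cgGLSetParameter4fv(params.Get("lightPosition"), value);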

Quote:

I doubt it, because I've always thought glVertexAttrib is just sugar coating, but it's worth a try. The switch is not difficult.

Mhm, I've just checked it. It didn't change the FPS at all.

Quote:

You could upload the code somewhere but it should be compilable. What compiler does it need?

I'm using VS2008, but the code has always compiled fine with GCC (my coding colleague uses it, so I have to keep the code compatible with that compiler :)). It uses libraries like SDL, GLEW and Cg. I'll upload it tomorrow after I do some cleaning. Right now I'm going to bed, because I'm really tired of this struggle for today :).

zedz:
You could have at least read the whole thread before posting. In my first post I wrote that I'm rendering to a small viewport, and in the post with glIntercept's log you can clearly see I use a 100x100 viewport, which is small enough to make pixel processing negligible.

As for all these glGetProgram calls, I've googled a bit and found the following threads:
http://www.gamedev.net/community/forums/topic.asp?topic_id=396698
http://www.gamedev.net/community/forums/topic.asp?topic_id=325816
http://developer.nvidia.com/forums/index.php?showtopic=598

The last one is particularly interesting. I've checked it, and now I call nothing more than:

glProgramLocalParameters4fvEXT( ??? )
glProgramLocalParameters4fvEXT( ??? )
glDrawElements(GL_TRIANGLES,60000,GL_UNSIGNED_SHORT,0x0000) VP=1 FP=5

per object, but this still doesn't help. The FPS hasn't changed at all.

Please download the application from http://maxest.fm.interia.pl/PerformanceTest.zip and test it yourself. I think the code is clear enough that you can experiment with it. Note that you need Fraps or some other external tool to measure FPS, since I'm not outputting that information.

If you would like to create a log of OGL function calls with glIntercept, rename the file "!OpenGL32.dll" to "OpenGL32.dll" and run the application.

To get Cg to work correctly with GLSL, you need to do a few things.

1.) Explicitly set CG_PROFILE_GLSLV, CG_PROFILE_GLSLF for the profiles, since the Cg runtime won't select them for you.
2.) Use said profiles to load your vertex and fragment programs separately.
3.) Use cgCombinePrograms() to combine the vertex shader and fragment shader CGprograms into one CGprogram.
4.) Use this program wherever you used to use the two program handles separately.
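
A minimal sketch of steps 1-4 under some assumptions: the file name "shaders.cg" and the entry points "vs_main"/"ps_main" are illustrative, not from the thread.

CGcontext context = cgCreateContext();

// 1) Force the GLSL profiles; the runtime won't pick them for you.
CGprofile vsProfile = CG_PROFILE_GLSLV;
CGprofile fsProfile = CG_PROFILE_GLSLF;

// 2) Load the vertex and fragment programs separately.
CGprogram vs = cgCreateProgramFromFile(context, CG_SOURCE, "shaders.cg", vsProfile, "vs_main", NULL);
CGprogram fs = cgCreateProgramFromFile(context, CG_SOURCE, "shaders.cg", fsProfile, "ps_main", NULL);

// 3) Combine them into a single CGprogram.
CGprogram combined = cgCombinePrograms2(vs, fs);

// 4) Use the combined program wherever the two handles were used before.
cgGLLoadProgram(combined);
cgGLEnableProfile(vsProfile);
cgGLEnableProfile(fsProfile);
cgGLBindProgram(combined);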

This approach also works for any other profile that is available, so you can build a codebase around it. As far as I know, this is the only way to make GLSL work in Cg.

In my own code, I made it so that Nvidia cards will automatically get the highest version of Nvidia's assembly language. In the case of non-Nvidia cards, I manually set the profiles to GLSL to get the full feature-set.

Thank you for your test.

According to what Yann L stated:
Quote:

You are definitely doing something that you shouldn't be doing somewhere. The large performance difference is not normal. In fact, from my own experience, NVidia's OpenGL tends to be slightly (but consistently) faster than D3D9, if (and only if) you perform the exact same operations in both APIs.

I believe that a 650 vs 560 difference is not satisfying for OpenGL. This is strengthened by the fact that the OGL renderer uses only 3 GL calls per rendered object, roughly 80 GL calls per frame. I've also found in PIX that DX does much more work, creating and releasing a lot of resources, about 2-3 per shader parameter change (this is done by the Cg runtime).

This whole situation is getting more and more interesting. I've managed to update my drivers to some of the newest beta drivers intended for notebooks, and D3D9 performance dropped from 350 to 210. OGL stays untouched.

I'm also wondering whether it's really an OGL problem. I've just checked one of my "normal" applications running on this renderer. It does quite a lot of pixel processing and renders one huge box (a few faces), and I get 100 fps for D3D and 60-80 for OGL. I'm also not sure how important this is, but my simple PerformanceTest, with barely 4 boxes of 20k faces each, reached 999 fps for D3D and 450 fps for OGL. When I render nothing (to a small viewport) I get 9999 fps for D3D and 999 fps for OGL. Maybe I'm overreacting, but shouldn't I get similar results for both APIs in such a situation? (They both just clear a small viewport, after all.) Maybe the problem is somewhere in SDL and I've configured something the wrong way?

Quote:
Original post by maxest
When I render nothing (to a small viewport) I get 9999 fps for D3D and 999 fps for OGL. Maybe I'm overreacting, but shouldn't I get similar results for both APIs in such a situation? (They both just clear a small viewport, after all.)
That is a difference of well under a millisecond per frame - hardly something to worry about.

I also wouldn't be surprised if SDL managed to initialise a less-than-optimal OpenGL context - it is a pretty ancient code base, unless you are using the 1.3 branch.
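
If it helps rule that out, here is a sketch of requesting the context attributes explicitly with SDL 1.2, including turning vsync off at the SDL level. SDL_GL_SWAP_CONTROL needs SDL 1.2.10 or newer; the resolution and bit depths are illustrative.

#include <SDL.h>

SDL_Init(SDL_INIT_VIDEO);

// Request an explicit, double-buffered context and disable vsync,
// so both renderers are compared under the same conditions.
SDL_GL_SetAttribute(SDL_GL_RED_SIZE, 8);
SDL_GL_SetAttribute(SDL_GL_GREEN_SIZE, 8);
SDL_GL_SetAttribute(SDL_GL_BLUE_SIZE, 8);
SDL_GL_SetAttribute(SDL_GL_DEPTH_SIZE, 24);
SDL_GL_SetAttribute(SDL_GL_DOUBLEBUFFER, 1);
SDL_GL_SetAttribute(SDL_GL_SWAP_CONTROL, 0); // 0 = vsync off

SDL_SetVideoMode(800, 600, 32, SDL_OPENGL);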

