jsg007

OpenGL VBOs not giving any performance

Hey everyone... It seems that VBOs are not giving me any performance gain on my PC. It was not working in my application, so I downloaded the VBO demo from here, which has a VBO version and a regular vertex array version, but both of them run at the same speed. Has anyone got any idea why the VBO version isn't faster? My specs: GeForce 6600, AMD64 3000+, 512MB RAM, Windows XP SP2. I tried the same demo on my brother's Radeon X700 and the VBO version ran about 50% faster.

Performance relative to what? I take it you mean glBegin/glEnd.

VBOs are best used for large numbers of vertices. With few vertices you are unlikely to get much of a performance boost, and I think it may even be possible to get a performance loss.

Quote:
Original post by TGA
Performance relative to what? I take it you mean glBegin/glEnd.

VBOs are best used for large numbers of vertices. With few vertices you are unlikely to get much of a performance boost, and I think it may even be possible to get a performance loss.


I mean performance relative to regular vertex arrays, and I'm rendering 250,000 triangles to test it...

But I don't think there is a problem in my code, since none of the VBO demos give me the performance boost they give on other systems... check out the demo in my first post.

Your computer and your brother's no doubt have different OpenGL implementations. You'll have to read up on your card and see whether or not it supports the VBO extension and actually stores the vertex data on-card, or just simulates them with vertex arrays (caching the vertex data in main memory, then streaming it over each pass).

Since you're getting no performance boost (but your brother is), I doubt your card supports the extension (but his does). There's probably an easy way to check it, but I've no idea how </3

Well... I think a GeForce 6 class video card should have support for VBOs, even if it's not a high-end video card right now... the extension is obviously present, but the performance is exactly the same as normal vertex arrays, so it might be simulated internally.

It is quite weird...


Is someone else experiencing this on a 6600?

I use something such as this:

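#include <windows.h>   // must come before <GL/gl.h> on Windows
#include <GL/gl.h>
#include <string.h>

// Returns 1 if extensionName is in the GL_EXTENSIONS string, 0 otherwise.
// A GL context must be current when this is called.
int CheckExtension(const char* extensionName)
{
    // get the list of supported extensions
    const char* extensionList = (const char*) glGetString(GL_EXTENSIONS);

    if (!extensionName || !extensionList)
        return 0;

    while (*extensionList)
    {
        // find the length of the first extension substring
        size_t firstExtensionLength = strcspn(extensionList, " ");

        if (strlen(extensionName) == firstExtensionLength &&
            strncmp(extensionName, extensionList, firstExtensionLength) == 0)
        {
            return 1;
        }

        // move to the next substring; only skip a separator if one is there,
        // so the pointer never steps past the terminating '\0'
        extensionList += firstExtensionLength;
        if (*extensionList == ' ')
            ++extensionList;
    }

    return 0;
} // end CheckExtension()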


Then call with:

if (!CheckExtension("GL_ARB_vertex_buffer_object"))
    MessageBox(NULL, "You don't have VBO Support", "No VBO Support", MB_OK);

Obviously the MessageBox call assumes Windows, but I'm sure you could write to a file or use OS-independent calls instead.

P.S. How do you get those nice code scroll windows in your posts? Thanks
[Like this. Edit your post to see how it works. -Yann]

Jamie

[Edited by - Yann L on August 14, 2007 5:02:06 PM]
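The token-matching idea in CheckExtension() can be tried without a GL context by passing the extension string in as a parameter. The following is only a sketch along those lines: the name has_extension_token and the sample strings in the note below are made up for illustration, and a positive result still says nothing about whether the driver keeps the buffer in video memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in `list`.
 * `list` stands in for the string returned by glGetString(GL_EXTENSIONS). */
int has_extension_token(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t name_len = strlen(name);
    if (name_len == 0)
        return 0;

    while (*list) {
        /* length of the current token, up to the next space or the end */
        size_t token_len = strcspn(list, " ");

        if (token_len == name_len && strncmp(list, name, name_len) == 0)
            return 1;

        /* advance past the token and any separating spaces */
        list += token_len;
        while (*list == ' ')
            ++list;
    }
    return 0;
}
```

Comparing whole tokens matters because a plain substring search would report "GL_EXT_texture" as present when the list only contains "GL_EXT_texture3D".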

Quote:
Original post by jsg007
It is quite weird...

No, it's not. If your test app is not bus-bandwidth bound (and being bus bound is highly unlikely with only 250k triangles on a modern PCIe, or even an AGP 8x system), then you won't see the slightest difference between VAs and VBOs. Also, I highly suspect NVidia of cheating a little around the literal (read: obsolete) wording of the GL specs and caching VAs onboard, which for all practical purposes makes them equivalent to VBOs in terms of performance.

Never forget that profiling a stream processor (i.e. a GPU) is very different from profiling a general purpose CPU. Unless you modify the part of the pipeline that happens to be the current bottleneck of the stream operation chain, you won't see any difference in performance at all. In fact, if you're e.g. fragment bound, then immediate mode can be just as fast (or slow) as VBOs.
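A back-of-envelope way to judge whether a test could be bus-bandwidth bound at all is to compute the frame rate the bus alone would allow if every vertex crossed it each frame. The sketch below is only that arithmetic; the function name and the figures used in the note afterwards are illustrative assumptions, not measurements from this thread.

```c
#include <assert.h>

/* Upper bound on frames per second if all vertex data had to cross the bus
 * every frame. All three inputs are estimates supplied by the caller. */
double bus_bound_fps(double vertices_per_frame,
                     double bytes_per_vertex,
                     double bus_bytes_per_second)
{
    assert(vertices_per_frame > 0.0 && bytes_per_vertex > 0.0);
    return bus_bytes_per_second / (vertices_per_frame * bytes_per_vertex);
}
```

For example, assuming an indexed mesh that sends 125,000 vertices of 32 bytes per frame over a bus sustaining about 2e9 bytes per second (in the region of AGP 8x's theoretical peak), the bound is 500 frames per second. If the measured frame rate is far below such a bound, something other than the bus is the bottleneck, and moving the data into a VBO would not be expected to change much.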

Are you testing in a small window, e.g. 400x300 (or smaller)?

I just tried my game (which pushes >100k tris per frame):
1366x768 VBO=51fps, VA=48fps
400x300 VBO=70fps, VA=58fps (much bigger gap)

preprocessing at 1366x768 on both
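Frame rates are easier to compare once converted to time per frame. The helper names below are made up for this sketch; the four fps values in the note afterwards are the ones quoted in the post above.

```c
#include <assert.h>

/* Milliseconds spent on one frame at the given frame rate. */
double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Milliseconds saved per frame when going from slow_fps to fast_fps. */
double ms_saved(double slow_fps, double fast_fps)
{
    return frame_ms(slow_fps) - frame_ms(fast_fps);
}
```

With the numbers above, going from VA to VBO saves roughly 1.2 ms per frame at 1366x768 (48 to 51 fps) and roughly 3.0 ms per frame at 400x300 (58 to 70 fps). That is consistent with the larger window being more fragment bound, so that less of the vertex-side saving shows up there, though these two data points alone do not prove it.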

Thanks Yann.

Was a bit weird at first, was shocked when the page loaded to see my code in blocks lol.

I would expect VBOs to be available for your graphics card (mine's really old but it still has them), so maybe it's a graphics driver thing?

Jamie

Well, I made my app vertex bound: it renders to a small viewport with no texture or fragment program, it loads a model with ~61,000 triangles and renders it 4 times (that's about 250,000 triangles), in 4 different places (so there's no overdraw).

The VA is much faster than the immediate mode, but the VBO runs with the same speed as normal VA, just like in every application I have tested.

I also use the latest driver.

If the models you're using are static, you can try putting the regular vertex arrays in a display list. Try one list for the model and reuse it 4 times. Then see what performance difference you get. In theory, display lists should even be somewhat faster than VBOs. If you aren't seeing any increase with display lists, then I don't think you're vertex bound. On my PC (GF 7800 GS) I see an incredible performance difference with all VBO demos (up to 3X).

To Yann: isn't accessing data from fast video RAM always going to be faster than fetching it from system RAM over the bus?! It must be, because where else would the performance difference I'm seeing come from?

Jeroen

Quote:
Original post by jsg007
Well, I made my app vertex bound: it renders to a small viewport with no texture or fragment program, it loads a model with ~61,000 triangles and renders it 4 times (that's about 250,000 triangles), in 4 different places (so there's no overdraw).

The VA is much faster than the immediate mode, but the VBO runs with the same speed as normal VA, just like in every application I have tested.

I also use the latest driver.


What vertex format do you use?

I just implemented rendering through a display list and the performance jump is absolutely amazing! It renders 4 times faster than immediate mode!

While VA (and VBO in my case) renders only 30% faster than immediate mode.


I don't think I'm going to bother with VBOs anymore, I'll stick with the display lists.



Quote:
Original post by jsg007
I just implemented rendering through a display list and the performance jump is absolutely amazing! It renders 4 times faster than immediate mode!

While VA (and VBO in my case) renders only 30% faster than immediate mode.


I don't think I'm going to bother with VBOs anymore, I'll stick with the display lists.

That's a sign that either you are not using VBOs correctly, or your drivers are not up to date. Incorrect use could mean that the buffers are not in VRAM (which would explain the missing performance difference between VBOs and VAs), or that you don't do correct CPU-side frustum culling on your objects. VBOs only being 30% faster than immediate mode is definitely a sign of the former (i.e. incorrect use).

Display lists are a dead end. If used correctly, VBOs are at least as fast as display lists, and almost always considerably faster.

Yep, it sounds as if you're not doing frustum culling with your meshes.
On NVIDIA hardware the driver does a DL vs. frustum check to see if the thing is onscreen; if not, it won't draw it (with VBO/VA each vert still has to be transformed to see if it's onscreen).
With correct usage there should be practically no speed difference between DLs and VBOs.

It depends on vertex format used. For example, someone had done this

struct myvertex
{
    float x, y, z;
    float s, t;
    unsigned char red, green, blue;   // 3-byte color, no alpha
};


Since alpha was missing, each frame the driver was processing the entire VBO to add alpha before sending it to the video card.
I've seen people who use double precision everywhere. Again, the driver converts the entire VBO.

If you don't stay on the fast path, you won't see any difference.
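As a sketch of the alternative, the color can be stored as four bytes so that every attribute has a size that is a multiple of 4 bytes. The struct name is hypothetical, the checks assume a platform with 4-byte floats, and whether a particular driver treats this layout as its fast path is an assumption rather than something the thread confirms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: position, texture coordinate, RGBA color. */
struct VertexRGBA
{
    float x, y, z;                          /* offset 0,  12 bytes */
    float s, t;                             /* offset 12,  8 bytes */
    unsigned char red, green, blue, alpha;  /* offset 20,  4 bytes */
};

/* Compile-time checks of the layout described in the comments above. */
static_assert(sizeof(struct VertexRGBA) == 24, "unexpected vertex size");
static_assert(offsetof(struct VertexRGBA, s) == 12, "unexpected texcoord offset");
static_assert(offsetof(struct VertexRGBA, red) == 20, "unexpected color offset");
```

With the legacy array pointers, the color would then be described with a size of 4, type GL_UNSIGNED_BYTE and a stride of sizeof(struct VertexRGBA) in the glColorPointer call, so the driver has nothing to expand.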

Quote:
Original post by V-man
It depends on vertex format used. For example, someone had done this

struct myvertex
{
    float x, y, z;
    float s, t;
    unsigned char red, green, blue;   // 3-byte color, no alpha
};


Since alpha was missing, each frame the driver was processing the entire VBO to add alpha before sending it to the video card.
I've seen people who use double precision everywhere. Again, the driver converts the entire VBO.

If you don't stay on the fast path, you won't see any difference.


What if you have a vertex struct that holds 56 bytes? I am assuming the VBO would like to see 64 bytes then, so one should pad it with 8 extra bytes. From what I have read, 32-byte alignment is best?

Quote:
Original post by MARS_999
What if you have a vertex struct that holds 56 bytes? I am assuming the VBO would like to see 64 bytes then, so one should pad it with 8 extra bytes. From what I have read, 32-byte alignment is best?


That's true for ATI. I do that for both ATI and nVidia however.
The struct size has to be a multiple of 32 bytes.
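The padding rule from the question above is just rounding the struct size up to the next multiple of 32. A minimal sketch, with a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Rounds `size` up to the next multiple of `multiple` (e.g. 32 bytes). */
size_t pad_stride(size_t size, size_t multiple)
{
    assert(multiple > 0);
    return (size + multiple - 1) / multiple * multiple;
}
```

For the 56-byte struct in the question this gives 64, i.e. 8 bytes of padding; a struct that is already 32 bytes stays at 32.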

ATI/AMD hardware likes 4-byte alignment on its data; so that's either floats for everything or floats and 4-byte based types. I can't speak for NV, but last I heard they didn't have this 'issue'.

The 32 bytes comes from the minimum data transfer size of an AGP or PCIe transaction across the bus. If you can get your data structs into 32 bytes then they get transferred in one hit (of course, if you only have 16 bytes then 2 blocks will be transferred). I also believe it has some relevance to the pre-T&L cache on the chips (I wouldn't be shocked if these were 32 bytes per cache line, given the transaction size).

Basically, be more worried about the 4-byte alignment of the various attributes than the whole vertex size; if you are in VRAM then it's only likely to affect the pre-T&L cache usage, and that can 'glue' entries together anyways.

This is quite confusing for me, because as far as I have tested, using interleaved vertex data decreases performance... even in immediate mode...


And for example I render in multiple passes, and I don't need color and texture information for all passes...

32 bytes is usually for the pre-transform cache. I wouldn't get too hung up on it though, you have to be doing some pretty weird stuff to get vertex fetch bound on a modern card, especially if fetching from VRAM. Generally the speed of your shaders and how well you reuse the post-transform cache are going to be much bigger influences on vertex throughput.
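Post-transform cache reuse can be estimated offline by running the index buffer through a simulated cache and counting misses, since each miss is a vertex that has to be transformed again. The sketch below assumes a 16-entry FIFO; real cache sizes and replacement policies vary by GPU, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16  /* assumed FIFO size; real hardware varies */

/* Counts how many indices miss a simulated FIFO post-transform cache.
 * Dividing the result by the triangle count gives the average number of
 * vertices transformed per triangle. */
size_t count_cache_misses(const unsigned *indices, size_t count)
{
    unsigned cache[CACHE_SIZE];
    size_t used = 0;    /* number of valid entries */
    size_t next = 0;    /* slot that will be overwritten next */
    size_t misses = 0;

    assert(indices != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            /* FIFO replacement: overwrite the oldest slot */
            cache[next] = indices[i];
            next = (next + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE)
                ++used;
            ++misses;
        }
    }
    return misses;
}
```

Two triangles sharing an edge, indexed as 0,1,2 and 2,1,3, cost 4 misses instead of the 6 that unshared vertices would cost, i.e. 2.0 transformed vertices per triangle rather than 3.0.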

Have you tested your own code rather than someone else's application? My game works on the 6600 with VBOs just fine. With two of these cards I get 1.8 million triangles running at 34 frames per second. Try pushing more than 250,000, because you should still get a good framerate.
