RPTD

Memory-Array vs Dynamic-VBO: which is better?

I'm trying to optimize my render code. While doing so I noticed that copying values to the VBO takes quite some time. The VBO is a dynamic one as the mesh bends around ( creature ). Now I wonder whether I would be quicker using a memory array instead of a VBO. I would also like to keep memory consumption in mind: with higher-resolution meshes the VBO data can quickly explode, eating precious texture memory. Is it worth dropping a VBO in favor of CPU memory, especially if the data changes every frame?

You would probably be better off using VBOs for more static data that's displayed a lot. I would recommend trying both methods and measuring which one is fastest for you.

Quote:
Original post by RPTD
The VBO is a dynamic one as the mesh bends around ( creature ).
Instead of using a dynamic mesh (one that is generated in software), it may be possible to use a static mesh and deform it in hardware with matrices (i.e. hardware skinning / skeletal animation).
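
For reference, the per-vertex work that skinning moves onto the GPU is a weighted blend of bone transforms. Below is a minimal CPU-side sketch in C, assuming at most four influences per vertex and row-major 3x4 matrices. The names (`Mat3x4`, `SkinVertex`, `skin_position`) are made up for illustration and do not come from any engine mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INFLUENCES 4  /* assumption: up to four bones per vertex */

/* Row-major 3x4 affine transform: columns 0..2 rotate/scale, column 3 translates. */
typedef struct { float m[3][4]; } Mat3x4;

typedef struct {
    float position[3];            /* bind-pose position */
    int   bone[MAX_INFLUENCES];   /* indices into the bone matrix array */
    float weight[MAX_INFLUENCES]; /* blend weights, expected to sum to 1 */
} SkinVertex;

/* Linear blend skinning: out = sum_i weight[i] * (bones[bone[i]] * position). */
void skin_position(const SkinVertex *v, const Mat3x4 *bones,
                   size_t bone_count, float out[3])
{
    out[0] = out[1] = out[2] = 0.0f;
    for (int i = 0; i < MAX_INFLUENCES; ++i) {
        if (v->weight[i] == 0.0f)
            continue; /* unused influence slot */
        assert(v->bone[i] >= 0 && (size_t)v->bone[i] < bone_count);
        const Mat3x4 *b = &bones[v->bone[i]];
        for (int r = 0; r < 3; ++r) {
            /* Transform the bind-pose position by this bone (one matrix row). */
            float t = b->m[r][0] * v->position[0]
                    + b->m[r][1] * v->position[1]
                    + b->m[r][2] * v->position[2]
                    + b->m[r][3];
            out[r] += v->weight[i] * t;
        }
    }
}
```

In a vertex shader the same loop would run per vertex, with the bone matrices supplied by the application each frame, which is why the vertex buffer itself can stay static.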

A VBO might stand out as the better approach if you go deeper and do things like caching or multiple passes per frame. And you shouldn't copy an array of vertices to the VBO, but calculate the data directly into it using memory mapping. And map the VBO write-only; read-write mode _will_ kill you. And then there are the performance hints, static/dynamic usage etc.

For single-pass, non-cached, low-poly models, VAs work really well these days at least.

ch.
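
A rough sketch of the "calculate directly into the mapped buffer" idea from the post above, in C. The generator `write_bent_strip` and its bend formula are invented purely for illustration. The point is only that the destination is written once, front to back, and never read, which is what a write-only mapping expects. The `glMapBuffer` / `glUnmapBuffer` calls appear only in a comment because they need a live GL context.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Writes `count` xyz positions straight into `dst`.
 * In a renderer, `dst` would be the pointer returned by
 * glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY); here it can be any float array.
 * The function never reads from `dst`.
 */
void write_bent_strip(float *dst, size_t count, float bend)
{
    assert(dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        float x = (float)i;
        /* Each component is stored exactly once, in increasing address order. */
        *dst++ = x;
        *dst++ = bend * x * x; /* hypothetical "bending" displacement */
        *dst++ = 0.0f;
    }
}

/*
 * Usage inside a frame, with a GL context and the VBO bound (not run here):
 *
 *   float *dst = (float *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
 *   if (dst) {
 *       write_bent_strip(dst, vertex_count, bend);
 *       glUnmapBuffer(GL_ARRAY_BUFFER);
 *   }
 */
```

Whether this beats a plain copy depends on driver and hardware, so it is worth timing both.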

@PhilMorton:
The problem is that I use a complex animation system. For example, my dragon player model weighs in at roughly 410 weight matrices ( from over 100 bones with vertex bone weights ). While I could use a float-texture hack to calculate the vertices, I am at a complete loss when it comes to normals and tangents. I have to calculate them all over again from the transformed vertices, as there is no way to produce a weight matrix for those ( a simple example situation shows the impossibility immediately ). Hence transforming on the GPU would not be impossible, but it would require heavy tricks with float textures and various GLSL shaders. This approach would cost a huge amount of texture memory, and I don't know if the speed would really catch up in the end.

@christian:
And here I had heard that memory mapping is worse than copying an array: so which is true? Furthermore, I am using OpenGL here. I don't know where there would be a "write-only" mode and such things. You can only set usage modes like STATIC or STREAM ( 3 in total ).

Quote:
Original post by christian h
A VBO might stand out as the better approach if you go deeper and do things like caching or multiple passes per frame. And you shouldn't copy an array of vertices to the VBO, but calculate the data directly into it using memory mapping.
ch.


If you use glBufferSubData, then your latter method is not possible.
I have read that ATI prefers glBufferSubData over glMapBuffer. I don't really know which method is better.

The OP can make a dynamic VBO.

glBindBuffer(...., VBOID);
glBufferData(..., ..., ..., GL_STREAM_DRAW);
or
glBufferData(..., ..., ..., GL_DYNAMIC_DRAW);

STREAM means you will change the data very often: change, draw, change, draw
DYNAMIC means you will change it less often: change, draw, draw, change, draw, draw, draw, draw, draw

But these are only hints to the driver.
For some drivers, STREAM and DYNAMIC may be the same thing.

Quote:
Original post by V-man
Quote:
Original post by christian h
A VBO might stand out as the better approach if you go deeper and do things like caching or multiple passes per frame. And you shouldn't copy an array of vertices to the VBO, but calculate the data directly into it using memory mapping.
ch.


If you use glBufferSubData, then your latter method is not possible.
I have read that ATI prefers glBufferSubData over glMapBuffer. I don't really know which method is better.


Yes, glBufferSubData is faster on modern hardware with the latest drivers. The map method is older and generally slower. I have actually seen FPS drop because of mapping VBOs.



Quote:
Original post by RPTD
So if I have a preallocated VBO, using glBufferSubData for the entire range is faster than glBufferData on a modern system?


I will answer your question indirectly because, frankly, I am not sure how glBufferSubData and glBufferData are managed internally by the drivers and the GPUs.

1. glBufferData allocates memory every time you call it.

2. glBufferSubData updates part of the data and does no memory allocation or deallocation, so it should be faster.

Now the results. I made some changes in our engine's renderer, which uses VBOs whenever possible and falls back on vertex arrays if VBO support is absent. I made sure the renderer was using VBOs and then tried replacing glBufferSubData with glBufferData. There was a drop in speed, but the results are far from conclusive. Also, I currently have only a single GPU to test the code on, so I can't really say whether glBufferSubData was of any real value. Another reason may be that the engine batches data aggressively, so no real difference is noticeable. I need to test more with animated meshes and a bunch of other stuff.

Then I tried using vertex arrays instead of VBOs on the same GPU, and speed dropped considerably. This again is with the entire engine and not with one particular case like the one you are interested in ("With higher resolution meshes").

Quote:
Original post by RPTD
So if I have a preallocated VBO, using glBufferSubData for the entire range is faster than glBufferData on a modern system?


glBufferSubData doesn't allocate. It updates an already existing buffer.
glBufferData allocates, much like glTexImage2D.

If you lose performance with glBufferSubData, then that's not normal. Email the hardware vendor. I'd be disappointed if I lost performance.

I was accidentally calling glBufferData instead of glBufferSubData, but when I fixed it, it gave no improvement because other parts are keeping the GPU busy.

Quote:
Original post by _neutrin0_
Yes, glBufferSubData is faster on modern hardware with the latest drivers. The map method is older and generally slower. I have actually seen FPS drop because of mapping VBOs.


That's news to me :o I tried pretty much all variations on a GF6600GT, and mapping was the fastest one in the case where you had to rewrite the data every frame. It was faster than glBufferSubData, which was faster than VAs, as expected.

ch.

Quote:
Original post by RPTD
What do you mean exactly by "batching" in this context? ( just to see if I have something similar to compare results )


By batch I mean

Create vertex buffer.
Create index buffer

While( !gameEnd )
    Load mesh 1 vertices into VBO.
    Load mesh 2 vertices into VBO.
    Load mesh 1 faces into IBO.
    Load mesh 2 faces into IBO.
    Draw mesh 1.
    Draw mesh 2.

Destroy VBO
Destroy IBO

The VBO/IBO is statically allocated and not created and destroyed for every iteration. In fact it is not even resized. If the vertex buffer fills up, we draw the geometry and restart from the beginning of the buffer.
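
The "fill up, draw, restart from the beginning" bookkeeping described above could look roughly like this in C. This is only a sketch of the offset tracking: `Batch` and `batch_reserve` are hypothetical names, and the actual upload and draw calls are left out (the flush is merely counted).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t capacity; /* fixed size of the buffer in bytes, never resized */
    size_t used;     /* bytes handed out since the last flush */
    int    flushes;  /* how many times the batch was drawn and restarted */
} Batch;

/*
 * Reserves `size` bytes and returns the byte offset to upload at
 * (for example the offset argument of glBufferSubData).
 * If the chunk does not fit, the pending geometry is flushed and
 * writing restarts at the beginning of the buffer.
 */
size_t batch_reserve(Batch *b, size_t size)
{
    assert(b != NULL && size <= b->capacity);
    if (b->used + size > b->capacity) {
        /* A real renderer would draw everything in [0, used) here. */
        b->flushes++;
        b->used = 0;
    }
    size_t offset = b->used;
    b->used += size;
    return offset;
}
```

For example, with a 100-byte buffer, reserving 60, 30 and then 20 bytes yields offsets 0, 60 and 0 again, with one flush in between.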

Ok, I managed to do some more tests and it seems that glBufferSubData is indeed faster than glBufferData.

The engine uses chunks of vertex data in a contiguous array that it loads using calls to glBufferSubData. I tried substituting glBufferData here and it was slower. The reason is that glBufferData replaces all the data in the buffer and may do some internal memory allocation/deallocation. glBufferData takes a performance hit as the size of the VBO increases (a big hit). So for a big VBO it is best to use glBufferSubData.

Quote:
Original post by christian h
Quote:
Original post by _neutrin0_
Yes, glBufferSubData is faster on modern hardware with the latest drivers. The map method is older and generally slower. I have actually seen FPS drop because of mapping VBOs.


That's news to me :o I tried pretty much all variations on a GF6600GT, and mapping was the fastest one in the case where you had to rewrite the data every frame. It was faster than glBufferSubData, which was faster than VAs, as expected.

ch.


When the vertex buffer is small, glMapBuffer might be faster.
Have you tried your method on large chunks of data and big VBOs?

The reason I am asking is that the VBO document here (ref pages 12 and 13) says that the value passed to glMapBuffer is just a hint. glMapBuffer will "map" the data into system RAM. In the worst-case scenario, the whole buffer might get mapped to system RAM. For small VBOs it might not matter. If the VBO is large, then there could be a performance issue.

[Edited by - _neutrin0_ on October 9, 2006 2:10:44 PM]

Quote:
Original post by _neutrin0_
By batch I mean

Create vertex buffer.
Create index buffer

While( !gameEnd )
    Load mesh 1 vertices into VBO.
    Load mesh 2 vertices into VBO.
    Load mesh 1 faces into IBO.
    Load mesh 2 faces into IBO.
    Draw mesh 1.
    Draw mesh 2.

Destroy VBO
Destroy IBO

The VBO/IBO is statically allocated and not created and destroyed for every iteration. In fact it is not even resized. If the vertex buffer fills up, we draw the geometry and restart from the beginning of the buffer.

I see. That matches up with some parts of my engine.

Quote:
The engine uses chunks of vertex data in a contiguous array that it loads using calls to glBufferSubData. I tried substituting glBufferData here and it was slower. The reason is that glBufferData replaces all the data in the buffer and may do some internal memory allocation/deallocation. glBufferData takes a performance hit as the size of the VBO increases (a big hit). So for a big VBO it is best to use glBufferSubData.

This makes sense. I'll do some testing tomorrow by replacing the calls in the appropriate places. For my models, for example, a VBO can easily reach 400K worth of dynamic data ( but the size stays the same all the time ).

I have now done some testing and switched one VBO ( the large one ) to the SubData call. That bumped the framerate at the worst place in the map from 32 up to 40. Still a long way to go, but that sounds much better.

I am just hazarding a guess here...

How big is your VBO? Very large VBOs may overflow the GPU memory especially if you are already loading large textures, other VBOs and a bunch of other data.

Worst case, OpenGL maps the VBO in system RAM.

Maybe reducing the VBO size and sending the data in batches for very large meshes would help.

You need to verify this by actually doing it.

If you are replacing a whole VBO, you could do what is known in D3D circles as 'render and discard'.

Basically, you fill your VBO with data; then, when it comes time to update that VBO, you rebind it and issue a glBufferData() call with NULL as the pointer to the data, then repeat the glBufferData() call with the pointer pointing to the data to copy into the buffer.

For both NV and ATI this sequence tells the driver 'I don't care about the data in that VBO, discard it when you are done rendering from it and create me a new buffer to put data in'.
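
The call sequence described above can be sketched in C as follows. To keep the example runnable without a GL context, the calls go through a small table of function pointers; `BufferApi`, `discard_and_refill` and the logging backend are hypothetical helpers, not part of OpenGL. The parameter lists mirror `glBindBuffer` and `glBufferData`, and the two enum values are the real values of `GL_ARRAY_BUFFER` and `GL_STREAM_DRAW`. That a NULL upload is treated as a discard is driver behaviour as reported in this thread for NV and ATI, so treat it as an assumption elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values of the OpenGL enums GL_ARRAY_BUFFER and GL_STREAM_DRAW. */
enum { ARRAY_BUFFER = 0x8892, STREAM_DRAW = 0x88E0 };

/* Hypothetical entry-point table; a real program would point these at
 * thin wrappers around glBindBuffer and glBufferData. */
typedef struct {
    void (*bind_buffer)(unsigned target, unsigned buffer);
    void (*buffer_data)(unsigned target, ptrdiff_t size,
                        const void *data, unsigned usage);
} BufferApi;

/* Replace the whole contents of `vbo`: discard first, then upload. */
void discard_and_refill(const BufferApi *gl, unsigned vbo,
                        const void *vertices, ptrdiff_t size)
{
    assert(gl != NULL && vertices != NULL && size > 0);
    gl->bind_buffer(ARRAY_BUFFER, vbo);
    gl->buffer_data(ARRAY_BUFFER, size, NULL, STREAM_DRAW);     /* discard old contents */
    gl->buffer_data(ARRAY_BUFFER, size, vertices, STREAM_DRAW); /* upload new contents */
}

/* Dry-run backend that only records calls, so the sequence can be
 * checked without a GL context. */
typedef struct { int binds, discards, uploads; unsigned last_buffer; } CallLog;
CallLog call_log;

void log_bind(unsigned target, unsigned buffer)
{
    (void)target;
    call_log.binds++;
    call_log.last_buffer = buffer;
}

void log_buffer_data(unsigned target, ptrdiff_t size,
                     const void *data, unsigned usage)
{
    (void)target; (void)size; (void)usage;
    if (data == NULL) call_log.discards++;
    else              call_log.uploads++;
}

const BufferApi logging_api = { log_bind, log_buffer_data };
```

Calling `discard_and_refill(&logging_api, vbo, data, size)` records one bind, one discard and one upload; swapping in wrappers around the real GL calls would issue the same sequence to the driver.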

I'm pretty sure this is covered in one of the performance pdfs in the Forum FAQ

@_neutrin0_:
As stated somewhere above, the dynamic VBO for the player character comes in at about 400 KB. This is more or less the size of the largest dynamic VBOs that I should come across.

@phantom:
Is re-allocating the memory by the driver ( through BufferData ) what eats time? I have now used SubData to avoid this reallocation, and it showed that it really reduces the processing time. It just confuses me that it should be better the other way round.

Quote:
Original post by RPTD
@phantom:
Is re-allocating the memory by the driver ( through BufferData ) what eats time? I have now used SubData to avoid this reallocation, and it showed that it really reduces the processing time. It just confuses me that it should be better the other way round.


Yup. You need to discard the data by calling glBufferData and passing a NULL pointer to it.

Quote:
Original post by RPTD
@phantom:
Is re-allocating the memory by the driver ( through BufferData ) what eats time? I have now used SubData to avoid this reallocation, and it showed that it really reduces the processing time. It just confuses me that it should be better the other way round.


Ah yes, I should have been clearer: what this gets around is any sync issue that might arise from reusing the same VBO. Without the discard, the system might well have to wait for drawing from the VBO to complete before it can be updated; this causes a stall and wastes time while the system sits around and waits.

glBufferSubData() also suffers from this sync problem, as in this case you are saying 'only change this data, leave the rest intact', which could lead to the driver having to make copies and other such things.

If you are totally replacing a buffer then, judging by what the IHVs/driver writers have said, the method I outlined should give you the best results.

Quote:
Original post by phantom
glBufferSubData() also suffers from this sync problem, as in this case you are saying 'only change this data, leave the rest intact', which could lead to the driver having to make copies and other such things.

If you are totally replacing a buffer then, judging by what the IHVs/driver writers have said, the method I outlined should give you the best results.


Yes, that is what the docs on the NVIDIA site say too. But for some reason I keep getting more speed with glBufferSubData() than with glBufferData(), and yes, I am passing NULL to glBufferData to do a 'render and discard' before doing a glBufferSubData() call to update the buffer.

Oh well, it may be something to do with my particular implementation!!

It is indeed a bit faster than using SubData alone. With SubData I had between 37 and 40 fps. With this version I have between 39 and 42 fps. A difference of 2, but better than nothing.

Is this behaviour stated in the extension specs or is it just an optimization done by the driver makers? I'm asking because I could not gather from the extension specs that such a combination of commands yields better performance.

Quote:
Original post by RPTD
Is this behaviour stated in the extension specs or is it just an optimization done by the driver makers? I'm asking because I could not gather from the extension specs that such a combination of commands yields better performance.


I believe it's just something NV and ATI agreed on. It would have been nice for it to be part of the spec, but such is life.

Quote:
Original post by _neutrin0_
When the Vertex buffer is small, glMapBuffer might be faster.
Have you tried your method on large chunks of data and big VBOs?

I used a high-poly model, at least dozen-hundred tris, maybe more, so at least >300kb.

Quote:

The reason I am asking is that the VBO document here (ref pages 12 and 13) says that the value passed to glMapBuffer is just a hint. glMapBuffer will "map" the data into system RAM. In the worst-case scenario, the whole buffer might get mapped to system RAM. For small VBOs it might not matter. If the VBO is large, then there could be a performance issue.


It was faster to map it in write-only mode than in read-write mode, which was really slow, so it really might be mapping it?

Oh wait, this was on Linux; I haven't tried it on Windows though :o.

ch.

[edit]
I meant write-only, not read-only!
[/edit]

[Edited by - christian h on October 12, 2006 11:54:35 AM]
