Fingers_

Unusual VBO slowdown

I implemented vertex arrays and VBO in my planet generator experiment... To my great surprise, VBO is performing much slower than regular vertex array, and even slower than immediate mode. The polygons are divided into 20 batches (there are 20 unique textures), so the batch size in the following data varies from 1024 to 65536 triangles.
(Mtris/second with immediate mode/regular vertex array/VBO)
20480 tris:   2.56 / 4.09 / 1.86
81920 tris:   2.64 / 13.7 / 1.95
327680 tris:  2.69 / 2.73 / 2.05
1310720 tris: 2.65 / 2.76 / 0.1 (out of VRAM?)

Setup: P4/2.4 GHz, Radeon 9700 Pro 128 MB
 
Things to note are that the vertex array performance has a sharp peak at 4096 tris/batch, but VBO is consistently slow. Can you see anything obviously wrong in the code below? (this is called once per frame for each quadrant [batch], the arrays are written to only once in an init function)
#ifdef USE_VAR
		glEnableClientState(GL_VERTEX_ARRAY);
		glEnableClientState(GL_TEXTURE_COORD_ARRAY);
		glEnableClientState(GL_NORMAL_ARRAY);

		if (gfx.vbosup)
		{
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nvar);
			glVertexPointer(3, GL_INT, 0, NULL);
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nnar);
			glNormalPointer(GL_FLOAT, 0, NULL);
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].ntar);
			glTexCoordPointer(2, GL_FLOAT, 0, NULL);
		}
		else
		{
			glVertexPointer(3, GL_INT, 0, orb->rants[r].var);
			glNormalPointer(GL_FLOAT, 0, orb->rants[r].nar);
			glTexCoordPointer(2, GL_FLOAT, 0, orb->rants[r].tar);
		}

		glDrawElements(GL_TRIANGLES, orb->rants[r].ntris*3, GL_UNSIGNED_INT, orb->rants[r].iar); // array of vertex/normal/texcoord indices

		glDisableClientState(GL_TEXTURE_COORD_ARRAY);
		glDisableClientState(GL_NORMAL_ARRAY);
		glDisableClientState(GL_VERTEX_ARRAY);
#else
		glBegin(GL_TRIANGLES);
		tris = orb->rants[r].tris;
		verts = orb->rants[r].verts;
		for (tri = 0; tri < orb->rants[r].ntris; tri++)
		{
			//glNormal3f(tris[tri].norm[0], tris[tri].norm[1], tris[tri].norm[2]);
			glNormal3f(orb->rants[r].nar[tris[tri].v[0]*3+0], orb->rants[r].nar[tris[tri].v[0]*3+1], orb->rants[r].nar[tris[tri].v[0]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[0]*2+0], orb->rants[r].tar[tris[tri].v[0]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[0]*3+0], orb->rants[r].var[tris[tri].v[0]*3+1], orb->rants[r].var[tris[tri].v[0]*3+2]);

			glNormal3f(orb->rants[r].nar[tris[tri].v[1]*3+0], orb->rants[r].nar[tris[tri].v[1]*3+1], orb->rants[r].nar[tris[tri].v[1]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[1]*2+0], orb->rants[r].tar[tris[tri].v[1]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[1]*3+0], orb->rants[r].var[tris[tri].v[1]*3+1], orb->rants[r].var[tris[tri].v[1]*3+2]);

			glNormal3f(orb->rants[r].nar[tris[tri].v[2]*3+0], orb->rants[r].nar[tris[tri].v[2]*3+1], orb->rants[r].nar[tris[tri].v[2]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[2]*2+0], orb->rants[r].tar[tris[tri].v[2]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[2]*3+0], orb->rants[r].var[tris[tri].v[2]*3+1], orb->rants[r].var[tris[tri].v[2]*3+2]);
			pc++;
		}
		glEnd();
#endif

it seems like you need to access three different vertex buffers for one vertex. i dont know if thats regular behaviour, but when i tried to place my texcoords somewhere else (ie the moment i was accessing more than one vb at once) the performance degraded horribly. in other words: dont. allocate one big buffer and either use offsets for the different kinds of data or (probably better) store them interleaved.

looking something like either this:
glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nvar);
glVertexPointer(3, GL_INT, 0, 0);
glNormalPointer(GL_FLOAT, 0, (char*)NormOffset);
glTexCoordPointer(2, GL_FLOAT, 0, (char*)TexOffset);

or (with struct Vertex as {int x,y,z; float nx,ny,nz; float u,v;}):
glVertexPointer(3, GL_INT, sizeof(Vertex), 0);
glNormalPointer(GL_FLOAT, sizeof(Vertex), (char*)(3*sizeof(int)));
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (char*)(3* (sizeof(int)+sizeof(float)) ));


the idea behind method 2 is that the closer you store your data the less it needs to wildly jump all over memory to collect it, though i wouldnt expect it to make much difference. just try keeping your whole stuff for one draw-call in one single vb.
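As a rough sketch of method 2, the byte offsets can be taken from the struct itself with offsetof rather than computed by hand. The Vertex struct is the one given in the post; VertexLayout and vertex_layout are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Interleaved vertex, following the struct described in the post. */
typedef struct {
    int   x, y, z;      /* position */
    float nx, ny, nz;   /* normal   */
    float u, v;         /* texcoord */
} Vertex;

/* Hypothetical helper type: stride and byte offsets of each attribute. */
typedef struct {
    size_t stride;
    size_t position;
    size_t normal;
    size_t texcoord;
} VertexLayout;

/* Let the compiler work out the offsets instead of hand-written arithmetic. */
VertexLayout vertex_layout(void)
{
    VertexLayout l;
    l.stride   = sizeof(Vertex);
    l.position = offsetof(Vertex, x);
    l.normal   = offsetof(Vertex, nx);
    l.texcoord = offsetof(Vertex, u);
    return l;
}
```

With the single buffer bound, the values would be passed as the stride and the "pointer" argument, e.g. glNormalPointer(GL_FLOAT, (GLsizei)l.stride, (const void *)l.normal).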

[edited by - Trienco on November 16, 2003 7:39:52 AM]

You should also use GL_ELEMENT_ARRAY_BUFFER_ARB to store indices in video memory. I know ATi prefers this over system-stored indices. Overall those numbers (tris/sec) are very low for that kind of card. You should be getting at least 5x more. The "out of VRAM" thing is probably because there is a 32 MB limit on VBO size (not written anywhere, but both nVidia and ATi fail to allocate this size in VRAM).

You should never let your fears become the boundaries of your dreams.

glVertexPointer(3, GL_INT , sizeof(Vertex), 0);
ints aren't optimised on most drivers.

quote:
Blocks of vertex array data may be stored in buffer objects with the
same format and layout options supported for client-side vertex
arrays. However, it is expected that GL implementations will (at
minimum) be optimized for data with all components represented as
floats, as well as for color data with components represented as
either floats or unsigned bytes.
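A minimal sketch of the obvious workaround, assuming the integer positions can simply be converted once at init time before the buffer is uploaded. ints_to_floats is a hypothetical helper, not something from the original code. Note that a float only represents integers exactly up to 2^24 in magnitude, so very large coordinates would lose precision.

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns a newly allocated float copy of `count` ints, or NULL if the
   allocation fails. The caller is responsible for calling free(). */
float *ints_to_floats(const int *src, size_t count)
{
    float *dst = (float *)malloc(count * sizeof *dst);
    size_t i;

    if (dst == NULL)
        return NULL;
    for (i = 0; i < count; i++)
        dst[i] = (float)src[i];
    return dst;
}
```

The converted array would then be described to GL with GL_FLOAT instead of GL_INT in glVertexPointer.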



_______________

Jester, studient programmer
The Jester Home in French

jesterlecodeur is correct. Here's a table from ATI's OpenGL SDK (http://www.ati.com/developer/sdk/radeonsdk/Gl_sdk.zip):

Type                    Native  Alignment  Components  Range
GLdouble                No
GLfloat                 Yes     32-bit     1,2,3,4     +/- MAX_FLOAT
GLuint                  No
GLint                   No
GLushort                Yes     32-bit     2,4         [0,65536]
GLshort                 Yes     32-bit     2,4         [-32768,32767]
GLushort (normalized)   Yes     32-bit     2,4         [0,1]
GLshort (normalized)    Yes     32-bit     2,4         [-1,1]
GLubyte                 Yes     32-bit     4           [0,255]
GLbyte                  Yes     32-bit     4           [-128,127]
GLubyte (normalized)    Yes     32-bit     4           [0,1]
GLbyte (normalized)     Yes     32-bit     4           [-1,1]

Thanks for the good suggestions... In particular the lack of int support would explain a lot. I'll try all of these and see how it turns out. I haven't used VBO before so this is all new to me.

Ah, the smell of progress


tris       Mtris/s: VAR / VBO1 / VBO2
20480      4.09 / 4.09 / 4.09
81920      13.7 / 16.4 / 16.4
327680     2.73 / 27.3 / 41.0
1310720    2.76 / 21.5 / 42.3


Replacing ints with floats alone increased the triangle rates dramatically (VBO1). Interleaving the vertex/normal/texcoord data had negligible effect (<1ms/frame). Adding a hardware buffer for indices caused another performance jump at the high end of poly counts (VBO2), although I may not end up using it if/when I implement some kind of a LOD scheme.

Also it turns out that I'm not out of VRAM after all. I'm using ~21M for the vertex arrays at the highest detail level. Still, this means I'll have to cut the detail if I ever want to display more than one planet.

In case you're wondering what the thing looks like, here's a picture.
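A back-of-the-envelope check of that figure, as a sketch only. It assumes things the post does not state: 20 triangular sectors whose edges are each split into n segments (n = 256 gives the 65536 triangles per batch mentioned earlier), no vertices shared between sectors, and 32 bytes per vertex (three floats each for position and normal, two for the texcoord). All function names are hypothetical.

```c
/* Vertices in a triangular sector whose edges are split into n segments. */
unsigned long sector_vertices(unsigned long n)
{
    return (n + 1) * (n + 2) / 2;
}

/* Triangles in the same sector. */
unsigned long sector_triangles(unsigned long n)
{
    return n * n;
}

/* Total bytes of vertex data, assuming sectors do not share vertices. */
unsigned long vertex_bytes(unsigned long sectors, unsigned long n,
                           unsigned long bytes_per_vertex)
{
    return sectors * sector_vertices(n) * bytes_per_vertex;
}
```

Under those assumptions vertex_bytes(20, 256, 32) comes to 21,217,920 bytes, which is in the same range as the ~21M reported, though the real layout may differ.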

Thanks for your help!

so youre saying you dont have any slowdowns when using multiple vbs for position/normals etc.? hm, time to either get an ati or hope newer drivers work better, because the current setup is horribly chaotic *g*

Yes, it's interesting because what you said made a lot of sense. I'm keeping them all in a single array now anyway since it's easier to manage (and other hardware might not be as forgiving).

I did find that the indices themselves want to be as sequential as possible rather than jumping around within the array(s). And I guess it's easier for caching when subsequent triangles re-use vertices too. So ordering the triangles like it was a triangle strip seems to be the fastest to render.
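One way to get a feel for the vertex re-use part is to count misses against a simulated cache. This is only a sketch: it models a plain FIFO of recently used indices, and the size of 16 entries is an assumption, since real post-transform caches differ between GPUs. It says nothing about memory locality of the vertex data itself.

```c
#include <stddef.h>

#define CACHE_SIZE 16  /* assumed size; actual hardware caches vary */

/* Counts how many indices miss a simple FIFO cache of CACHE_SIZE entries.
   Fewer misses per triangle means more vertices are re-used from the cache. */
size_t count_cache_misses(const unsigned int *indices, size_t count)
{
    unsigned int cache[CACHE_SIZE];
    size_t used = 0, next = 0, misses = 0, i, j;

    for (i = 0; i < count; i++) {
        int hit = 0;
        for (j = 0; j < used; j++) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            misses++;
            cache[next] = indices[i];        /* overwrite the oldest entry */
            next = (next + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE)
                used++;
        }
    }
    return misses;
}
```

Dividing the result by the triangle count gives an average miss rate per triangle, which can be compared between two orderings of the same mesh.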

that one for sure, making the most of your cache is always a good thing (though you often end up with fillrate being the limit and all your geometry optimization was pointless).

if your data is nice and sequential you can always just create a vertex struct and use that. the reason i wanted it in separate buffers was reducing redundant data. for example texture coordinates would repeat a lot and storing only 16x16 is better than storing the same 256 coords around 1024 times. so i end up with wild offsets, elements of completely different layout etc. single buffers would have cleaned up the code a lot but for some reason were slow as hell.

about your sorting: i assume you just write them to an array in the order of a strip and then remove duplicates? and was it worth the extra work? ,-)

The vertex array is just rows and columns without any duplicated vertices, and it coincides with drawing the triangles in the same order (in rows left to right). This was the first and easiest method I tried and turned out to be the fastest. My sectors are triangular (and curved) rather than square but otherwise it's very much like any heightmap.

In fact I'll probably make it actually use a triangle strip per sector instead of a triangle list and see if it goes any faster. I didn't before because I was thinking about occlusion culling, but it appears that brute-force drawing the whole thing is faster than selectively "skipping" triangles even if I don't spend too much time selecting what to skip.
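A sketch of what such a row-ordered index list could look like for one triangular sector. The layout is an assumption, not the poster's actual code: row r holds r + 1 vertices stored consecutively, starting from the apex, and triangles are emitted left to right between each pair of adjacent rows. row_start and build_sector_indices are hypothetical names.

```c
#include <stddef.h>

/* Index of the first vertex in row r, when row r holds r + 1 vertices. */
static unsigned int row_start(unsigned int r)
{
    return r * (r + 1) / 2;
}

/* Writes a triangle list for a sector with n rows of triangles into `out`,
   which must have room for 3 * n * n indices. Returns the number written. */
size_t build_sector_indices(unsigned int n, unsigned int *out)
{
    size_t k = 0;
    unsigned int r, c;

    for (r = 0; r < n; r++) {
        unsigned int top = row_start(r);
        unsigned int bot = row_start(r + 1);

        for (c = 0; c <= r; c++) {
            /* triangle with one vertex in the upper row, two in the lower */
            out[k++] = top + c;
            out[k++] = bot + c;
            out[k++] = bot + c + 1;

            if (c < r) {
                /* triangle with two vertices in the upper row, one in the lower */
                out[k++] = top + c;
                out[k++] = bot + c + 1;
                out[k++] = top + c + 1;
            }
        }
    }
    return k;
}
```

Consecutive triangles share two of their three vertices here, so the indices stay close together and recently used vertices come up again almost immediately.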

last time i tried that (the other way round, from one big strip to triangle list) it didnt make any difference at all, so its quite likely that the index has enough time to be sent while the last vertex is transformed.. but of course i already had a lot of texturing at that point and once youre fillrate limited all those micro optimizations to the geometry submission will become a little pointless anyway.
