Unusual VBO slowdown

11 comments, last by Fingers_ 20 years, 4 months ago
I implemented vertex arrays and VBO in my planet generator experiment... To my great surprise, VBO is performing much slower than regular vertex array, and even slower than immediate mode. The polygons are divided into 20 batches (there are 20 unique textures), so the batch size in the following data varies from 1024 to 65536 triangles.

(Mtris/second with immediate mode/regular vertex array/VBO)
20480 tris:   2.56 / 4.09 / 1.86
81920 tris:   2.64 / 13.7 / 1.95
327680 tris:  2.69 / 2.73 / 2.05
1310720 tris: 2.65 / 2.76 / 0.1 (out of VRAM?)

Setup: P4 2.4 GHz, Radeon 9700 Pro 128 MB
 
Things to note are that the vertex array performance has a sharp peak at 4096 tris/batch, but VBO is consistently slow. Can you see anything obviously wrong in the code below? (this is called once per frame for each quadrant [batch], the arrays are written to only once in an init function)

#ifdef USE_VAR
		glEnableClientState(GL_VERTEX_ARRAY);
		glEnableClientState(GL_TEXTURE_COORD_ARRAY); // client-side array state is toggled with glEnableClientState, not glEnable
		glEnableClientState(GL_NORMAL_ARRAY);

		if (gfx.vbosup)
		{
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nvar);
			glVertexPointer(3, GL_INT, 0, NULL);
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nnar);
			glNormalPointer(GL_FLOAT, 0, NULL);
			glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].ntar);
			glTexCoordPointer(2, GL_FLOAT, 0, NULL);
		}
		else
		{
			glVertexPointer(3, GL_INT, 0, orb->rants[r].var);
			glNormalPointer(GL_FLOAT, 0, orb->rants[r].nar);
			glTexCoordPointer(2, GL_FLOAT, 0, orb->rants[r].tar);
		}

		glDrawElements(GL_TRIANGLES, orb->rants[r].ntris*3, GL_UNSIGNED_INT, orb->rants[r].iar); // array of vertex/normal/texcoord indices

		glDisableClientState(GL_TEXTURE_COORD_ARRAY); // client-side array state is toggled with glDisableClientState, not glDisable
		glDisableClientState(GL_NORMAL_ARRAY);
		glDisableClientState(GL_VERTEX_ARRAY);
#else
		glBegin(GL_TRIANGLES);
		tris = orb->rants[r].tris;
		verts = orb->rants[r].verts;
		for (tri = 0; tri < orb->rants[r].ntris; tri++)
		{
			//glNormal3f(tris[tri].norm[0], tris[tri].norm[1], tris[tri].norm[2]);
			glNormal3f(orb->rants[r].nar[tris[tri].v[0]*3+0], orb->rants[r].nar[tris[tri].v[0]*3+1], orb->rants[r].nar[tris[tri].v[0]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[0]*2+0], orb->rants[r].tar[tris[tri].v[0]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[0]*3+0], orb->rants[r].var[tris[tri].v[0]*3+1], orb->rants[r].var[tris[tri].v[0]*3+2]);

			glNormal3f(orb->rants[r].nar[tris[tri].v[1]*3+0], orb->rants[r].nar[tris[tri].v[1]*3+1], orb->rants[r].nar[tris[tri].v[1]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[1]*2+0], orb->rants[r].tar[tris[tri].v[1]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[1]*3+0], orb->rants[r].var[tris[tri].v[1]*3+1], orb->rants[r].var[tris[tri].v[1]*3+2]);

			glNormal3f(orb->rants[r].nar[tris[tri].v[2]*3+0], orb->rants[r].nar[tris[tri].v[2]*3+1], orb->rants[r].nar[tris[tri].v[2]*3+2]);
			glTexCoord2f(orb->rants[r].tar[tris[tri].v[2]*2+0], orb->rants[r].tar[tris[tri].v[2]*2+1]);
			glVertex3i(orb->rants[r].var[tris[tri].v[2]*3+0], orb->rants[r].var[tris[tri].v[2]*3+1], orb->rants[r].var[tris[tri].v[2]*3+2]);
			pc++;
		}
		glEnd();
#endif
it seems like you need to access three different vertex buffers for one vertex. i dont know if thats regular behaviour, but when i tried to place my texcoords somewhere else (ie the moment i was accessing more than one vb at once) the performance degraded horribly. in other words: dont. allocate one big buffer and either use offsets for the different kinds of data or (probably better) store them interleaved.

looking something like either this:
glBindBufferARB( GL_ARRAY_BUFFER_ARB, orb->rants[r].nvar);
glVertexPointer(3, GL_INT, 0, 0);
glNormalPointer(GL_FLOAT, 0, (char*)NormOffset);
glTexCoordPointer(2, GL_FLOAT, 0, (char*)TexOffset);

or: (with struct Vertex as {int x,y,z; float nx,ny,nz; float u,v;})
glVertexPointer(3, GL_INT, sizeof(Vertex), 0);
glNormalPointer(GL_FLOAT, sizeof(Vertex), (char*)(3*sizeof(int)));
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (char*)(3* (sizeof(int)+sizeof(float)) ));


the idea behind method 2 is that the closer you store your data the less it needs to wildly jump all over memory to collect it, though i wouldnt expect it to make much difference. just try keeping your whole stuff for one draw-call in one single vb.
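
Editor's sketch of the offsets used in method 2 above, assuming the Vertex struct exactly as given in the post. The helper names (vertex_stride, normal_offset, texcoord_offset) are hypothetical and only for illustration; the point is that offsetof and sizeof can produce the stride and offsets instead of hand-written arithmetic.

```c
#include <stddef.h>

/* Interleaved vertex as described in the post above. */
typedef struct {
    int   x, y, z;    /* position */
    float nx, ny, nz; /* normal */
    float u, v;       /* texture coordinate */
} Vertex;

/* Stride to pass to glVertexPointer / glNormalPointer / glTexCoordPointer. */
size_t vertex_stride(void)   { return sizeof(Vertex); }

/* Byte offset of the normal within one vertex. */
size_t normal_offset(void)   { return offsetof(Vertex, nx); }

/* Byte offset of the texture coordinate within one vertex. */
size_t texcoord_offset(void) { return offsetof(Vertex, u); }
```

With a buffer object bound, each offset would be cast to a pointer, e.g. (char*)NULL + normal_offset(), and passed as the last argument of the matching gl*Pointer call.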

[edited by - Trienco on November 16, 2003 7:39:52 AM]
f@dz
http://festini.device-zero.de
You should also use GL_ELEMENT_ARRAY_BUFFER_ARB to store the indices in video memory. I know ATi prefers this over indices stored in system memory. Overall those numbers (tris/sec) are very low for that kind of card. You should be getting at least 5x more. The "out of VRAM" thing is probably because there is a 32 MB limit on VBO size (not written anywhere, but both nVidia and ATi fail to allocate this size in VRAM).

You should never let your fears become the boundaries of your dreams.
glVertexPointer(3, GL_INT , sizeof(Vertex), 0);
ints aren't optimised on most drivers.

quote: Blocks of vertex array data may be stored in buffer objects with the
same format and layout options supported for client-side vertex
arrays. However, it is expected that GL implementations will (at
minimum) be optimized for data with all components represented as
floats, as well as for color data with components represented as
either floats or unsigned bytes.


_______________

Jester, studient programmer
The Jester Home in French
_______________
jesterlecodeur is correct. Here's a table from ATI's OpenGL SDK (http://www.ati.com/developer/sdk/radeonsdk/Gl_sdk.zip):

Type                   Native  Alignment  Components  Range
GLdouble               No
GLfloat                Yes     32-bit     1,2,3,4     +/- MAX_FLOAT
GLuint                 No
GLint                  No
GLushort               Yes     32-bit     2,4         [0,65536]
GLshort                Yes     32-bit     2,4         [-32768,32767]
GLushort (normalized)  Yes     32-bit     2,4         [0,1]
GLshort (normalized)   Yes     32-bit     2,4         [-1,1]
GLubyte                Yes     32-bit     4           [0,255]
GLbyte                 Yes     32-bit     4           [-128,127]
GLubyte (normalized)   Yes     32-bit     4           [0,1]
GLbyte (normalized)    Yes     32-bit     4           [-1,1]
-Ostsol
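
Editor's sketch: since GLint is not a native vertex format according to the table above, one option is to convert integer positions to floats once at init time and then pass GL_FLOAT instead of GL_INT to glVertexPointer. The function name ints_to_floats is hypothetical, and this is only one way to do the conversion, not necessarily what the original poster did.

```c
#include <stddef.h>

/* Convert `count` integer components to floats, one for one.
   A float has a 24-bit significand, so values with magnitude above
   16777216 (2^24) may not be represented exactly. */
void ints_to_floats(const int *src, float *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (float)src[i];
    }
}
```

If the integer coordinates are large, it may be worth scaling them down during the conversion so that precision is not lost; whether that matters depends on the coordinate range, which the thread does not state.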
Thanks for the good suggestions... In particular the lack of int would explain a lot. I'll try all of these and see how it turns out. I haven't used VBO before so this is all new to me.
Ah, the smell of progress

tris          Mtris/s VAR / VBO1 / VBO2
20480         4.09 / 4.09 / 4.09
81920         13.7 / 16.4 / 16.4
327680        2.73 / 27.3 / 41.0
1310720       2.76 / 21.5 / 42.3


Replacing ints with floats alone increased the triangle rates dramatically (VBO1). Interleaving the vertex/normal/texcoord data had negligible effect (<1ms/frame). Adding a hardware buffer for indices caused another performance jump at the high end of poly counts (VBO2), although I may not end up using it if/when I implement some kind of a LOD scheme.

Also it turns out that I'm not out of VRAM after all. I'm using ~21M for the vertex arrays at the highest detail level. Still, this means I'll have to cut the detail if I ever want to display more than one planet.
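
Editor's back-of-envelope check of the memory figure, as a sketch under stated assumptions. The thread does not say how the mesh is built; this assumes each of the 20 batches is a triangular patch with every edge split into 256 segments (256 * 256 = 65536 triangles per batch, matching the largest batch size quoted), that patches do not share vertices, and that each vertex takes 32 bytes (3 + 3 + 2 floats). The index size follows from the glDrawElements call in the first post (ntris*3 indices of GL_UNSIGNED_INT). All function names are hypothetical.

```c
#include <stddef.h>

/* Vertices in a triangular patch whose edges are split into `segments`
   pieces (assumed layout, not confirmed by the thread). */
size_t patch_vertices(size_t segments)
{
    return (segments + 1) * (segments + 2) / 2;
}

/* Total bytes of vertex data for `patches` independent patches. */
size_t vertex_bytes(size_t patches, size_t segments, size_t bytes_per_vertex)
{
    return patches * patch_vertices(segments) * bytes_per_vertex;
}

/* Bytes of index data: three 4-byte indices per triangle. */
size_t index_bytes(size_t triangles)
{
    return triangles * 3 * 4;
}
```

Under these assumptions vertex_bytes(20, 256, 32) is 21,217,920 bytes, in line with the ~21M quoted, and index_bytes(1310720) is 15,728,640 bytes for the index array on top of that.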

In case you're wondering what the thing looks like, here's a picture.

Thanks for your help!
Well, I don't think you're going to have multiple planets close enough together for such detail to need to be shown simultaneously on all...
-Ostsol
so youre saying you dont have any slowdowns when using multiple vbs for position/normals etc.? hm, time to either get an ati or hope newer drivers work better, because the current setup is horribly chaotic *g*
f@dz
http://festini.device-zero.de
Yes, it's interesting because what you said made a lot of sense. I'm keeping them all in a single array now anyway since it's easier to manage (and other hardware might not be as forgiving).

I did find that the indices themselves want to be as sequential as possible rather than jumping around within the array(s). And I guess it's easier for caching when subsequent triangles re-use vertices too. So ordering the triangles as if they formed a triangle strip seems to be the fastest to render.
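
Editor's sketch of why index order matters, assuming a simple FIFO vertex cache of 16 entries. Both the FIFO policy and the size are assumptions for illustration; the thread does not document how the Radeon 9700's cache actually behaves. The hypothetical function count_cache_misses counts how many indices in a list would miss such a cache, so two orderings of the same mesh can be compared.

```c
#include <stddef.h>

#define CACHE_SIZE 16 /* assumed size; real hardware varies */

/* Count the indices that are not found in a FIFO cache of recently
   used vertices. Fewer misses means more vertex re-use. */
size_t count_cache_misses(const unsigned *indices, size_t n)
{
    unsigned cache[CACHE_SIZE];
    size_t used = 0;   /* number of valid cache entries */
    size_t next = 0;   /* slot that the next miss overwrites */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        int hit = 0;
        for (size_t c = 0; c < used; c++) {
            if (cache[c] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            misses++;
            cache[next] = indices[i];       /* evict the oldest entry */
            next = (next + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE) {
                used++;
            }
        }
    }
    return misses;
}
```

For four triangles listed in strip order, {0,1,2, 2,1,3, 2,3,4, 4,3,5}, this model reports 6 misses for 12 indices, while four triangles that share no vertices report 12. Running it over a real index array before and after reordering gives a rough, hardware-independent measure of the effect described above.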

