
# VAR + SwapBuffers performance problems


## Recommended Posts

Hello. Can somebody explain this weird behavior? Let me explain. I'm working on a demo level with about 30k triangles, subdivided using octrees. Until yesterday I was using simple vertex arrays for rendering; today I set up NV_VAR. I have minimized state changes (binding textures, changing blending) and I have sorted my triangle lists in a cache-friendly way (at least I did the best I could on that!).

Yesterday I wrote a simple in-code (not external, like VTune) profiler (based on the Enginuity series), and I tried to measure the time spent in various parts of my code. I tested the demo with some different code paths and found out some things. First of all, most of the time (about 80-90%) was taken by Octree::Render() + SwapBuffers(). Everything else (text output, demo updating, and Octree::Update()) was extremely fast. I think you knew this without me telling you!

Here are the setups I used for testing (in order of execution):

- Case 1: Individual triangles, manual backface culling, simple VAs, with compiled vertex arrays (CVA).
- Case 2: Individual triangles, manual backface culling, simple VAs, without CVA.
- Case 3: Bunch of triangles in one call, OGL backface culling, simple VAs, with CVA.
- Case 4: Bunch of triangles in one call, OGL backface culling, simple VAs, no CVA.
- Case 5: Individual triangles, manual backface culling, NV_VAR, no CVA.
- Case 6: Bunch of triangles, OGL backface culling, NV_VAR, no CVA.

In all cases except case 6, the minimum FPS I got was 38 (in case 3) and the maximum was 300 (in case 5). Everything looked normal: the FPS went smoothly up and down, and the average FPS was about 98-102. No significant performance differences between cases 1-5.

Now the weird stuff: the last setup (case 6). First of all, Octree::Render() time dropped from about 9 ms to 2 ms, but the total render time increased. To figure out what was happening, I placed a timer around the SwapBuffers() calls. Guess what: SwapBuffers() took about 17 ms (average) to complete. Despite the fact that the whole render time had increased (from 131 s to 255 s for rendering a demo with 11000 frames), and despite the fact that the minimum FPS was 18, the average had increased from 102 to 110! My first thought was, "VAR has finally worked!" It did what it's supposed to do. But that's not what I'm looking for. To be more precise, in case 6 the FPS counter was jumping between 30 and 700 all the time, and the whole system felt completely "unstable".

Despite my disappointment, I tried to stress the system a little, to see if it fails (FPS drops). I placed a Sleep(15) before SwapBuffers() and got the same results; now SwapBuffers() took 2 ms to complete (17 - 15). I thought, "I can place more CPU work before swapping to fill the gap!" But that isn't the answer, is it?

I want to ask if there is something I can do to make this work in a more stable way. Triple buffering may be an option, but I don't know if it is possible with OpenGL. Do you have any suggestions for the above "weird" behavior? I don't think it is really weird; VAR is supposed to do this kind of thing. But how can I make it more stable?

Any feedback appreciated. Thanks in advance.

HellRaiZer

[edited by - HellRaiZer on October 7, 2003 12:48:29 PM]

##### Share on other sites
quote:

Individual triangles, manual backface culling, NV_VAR

That sentence is a paradox. VAR cannot deal with individual triangles, and you can't do manual BF culling on vertex arrays (unless you repopulate every frame, which is evil™).

I didn't understand everything in your message. But keep in mind two important things: first, profiling a 3D card is not that easy. It has a command FIFO, and will cache commands given to it. A SwapBuffers() will wait for that cache to empty, so it is normal that a standard profiler will hang on this one.

Second, VAR can achieve extremely high performance and is 100% stable, but it is very sensitive to correct usage. I suspect that you are using it in a way it wasn't supposed to be used. There are a lot of small details you have to take into account with VAR. If you want something more beginner-friendly, try VBO, as it hides the ugly stuff in the driver. It's a little slower than VAR, though.

That's pretty much all one can say from the info you posted. For more details, post the source skeleton of your renderer.

BTW: in cases 5 and 6, are you checking for a valid vertex range? I.e. make sure that the GPU has actually activated VAR. If there is the slightest incompatibility between your code and VAR, the GPU will automatically turn it off. And depending on where your data was stored, you will experience a tremendous performance drop.

[edited by - Yann L on October 7, 2003 3:01:39 PM]

##### Share on other sites
Do you have vertical sync turned on?

If so, I believe it will cause the SwapBuffers call to wait on a vertical refresh, which would explain why your program spends a larger amount of time in that one function.

Also, when changing various render settings you are altering the amount of time spent not waiting on the swap buffer, so by the time you reach SwapBuffers() it spends less time waiting there.

Just a theory and it all is hinging on whether vsync is on.

##### Share on other sites
quote:

That sentence is a paradox. VAR cannot deal with individual triangles, and you can't do manual BF culling on vertex arrays (unless you repopulate every frame, which is evil)

I didn't understand everything in your message. But keep in mind two important things: first, profiling a 3D card is not that easy. It has a command FIFO, and will cache commands given to it. A SwapBuffers() will wait for that cache to empty, so it is normal that a standard profiler will hang on this one.

Sorry for the confusion. I wanted to keep it short, so someone could read it easily. Or maybe I did stupid things after all!

Here is what the keywords in the cases mean.

1) Individual triangles :
```cpp
for every material {
    for every mesh using this material {
        for every triangle in that mesh {
            glDrawElements(GL_TRIANGLES, curPol->NumVertices, GL_UNSIGNED_INT, curPol->VertexID);
        }
    }
}
```

1a) Bunch of triangles:
```cpp
for every material {
    for every mesh using this material {
        glDrawElements(GL_TRIANGLES, curMesh->NumVertices, GL_UNSIGNED_INT, &curMesh->VertexIndex[0]);
    }
}
```

2) Manual backface culling
```cpp
for every material {
    for every mesh using this material {
        for every triangle in that mesh {
            if (!PolyAlreadyRendered[curPol->Index] && curPol->IsVisible(Camera->Pos)) {
                PolyAlreadyRendered[curPol->Index] = true;
                glDrawElements(GL_TRIANGLES, curPol->NumVertices, GL_UNSIGNED_INT, curPol->VertexID);
            }
        }
    }
}
```

3) Simple VA vs NV_VAR
The difference between these two is where the data lives. Simple VAs => system memory; NV_VAR => AGP or video memory. (Is there any way of knowing which? Is the priority parameter enough to decide?) Of course the vertex pointers are set accordingly.
```cpp
// One-time setup, if NV_VAR is being used.
glVertexArrayRangeNV(MyVertexArray.NumVertices() * sizeof(GEVertex3D), VAR_Pointer);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

// Setup for rendering (every frame).
GEVertex3D* firstVertInArray = NULL;
if (VAR_Pointer != NULL)  // Do we use VAR? VAR_Pointer is a float*
{
    firstVertInArray = (GEVertex3D*)VAR_Pointer;
}
else
{
    firstVertInArray = MyVertexArray.GetVertex(0);
}
glVertexPointer(3, GL_FLOAT, sizeof(GEVertex3D), &firstVertInArray->x);
// All other pointers needed are set too.
```

I think that explains it all.

quote:

Second, VAR can achieve extremely high performance and is 100% stable, but it is very sensitive to correct usage.

Misunderstanding again. By "stable" I didn't mean that the system (PC) wasn't stable. What I wanted to say was that even when the camera stands still for a few frames, looking at exactly the same triangles, the frame rate wasn't stable.

quote:

BTW: in case 5 and 6, are you checking for a valid vertex range ? Ie. make sure that the GPU has actually activated VAR. If there is the slightest incompatibility between your code and VAR, the GPU will automatically turn it off. And depending on where your data was stored, you will experience a tremendeous performance drop.

I had some experience with incompatibility/misuse of VAR, as you say.
1) I tried to allocate memory with readFrequency 1.0f, writeFrequency 0.0f and priority 1.0f. The specs didn't say anything about limitations on these values, except that they must be inside [0.0f, 1.0f]. I read a doc from nVidia, and I found that readFrequency and writeFrequency must be inside [0.0f, 0.25f) for best performance. So I did that.
2) I used glEnable instead of glEnableClientState for GL_VERTEX_ARRAY_RANGE_NV, and it gave the same speed results as simple VAs.
3) I re-set the range every frame, and I found that this slows down the whole procedure, so I moved it to one-time setup while loading. The same holds for glEnableClientState() and glDisableClientState(). As the specs say, all of these (plus some more calls) cause a buffer flush, so I had to avoid them. But SwapBuffers() also flushes the buffers, so I can't avoid that one!!!

quote:

If you want something more beginner-friendly, try VBO, as it hides the ugly stuff in the driver. It's a little slower than VAR, though.

I tried VBOs after posting the first post. They really are slower than VAR, but the FPS went up and down much more smoothly. And SwapBuffers() still eats the most time there, too.

Finally, I haven't turned VSync on. God, why would I do something like that???

Any suggestions? Any tips for correct VAR usage? Anything I can do to speed things up a little? I know: "Buy a new card, you ...!"

Thanks. I hope I didn't forget anything.

HellRaiZer

Btw, how can you profile a GPU? I mean an older GPU. For the GeForce FX there is NVPerfHUD, but...!

[edited by - HellRaiZer on October 8, 2003 3:40:51 AM]

##### Share on other sites
OK. There are a lot of problems with your code. *takes deep breath*

First of all, when I saw cases 1 and 2, I almost got a heart attack. Drop them now, delete them, annihilate them, make a huge fire in your backyard, throw them in, and dance around until they are totally destroyed...

Seriously though, those two approaches must be the most inefficient ways to render a mesh imaginable under OpenGL. They are by far worse than even immediate mode. glDrawElements() has an inherent overhead, and becomes efficient from around 500 triangles on. Don't use it on chunks with less than, say, 50 triangles.

quote:

1) I tried to allocate memory with readFrequency 1.0f, writeFrequency 0.0f and priority 1.0f. The specs didn't say anything about limitations on this values, except that they must be inside [0.0f, 1.0f]. I read a doc from nVidia, and i found that readFrequency and writeFrequency must be inside [0.0f, 0.25f) for best performance. So i did.

For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);

quote:

2) I used glEnable instead of glEnableClientState for GL_VERTEX_ARRAY_RANGE_NV, and it gave the same speed results as with simple VAs.

Of course. GL_VERTEX_ARRAY_RANGE_NV is for use with glEnableClientState. You can't just use it with a different command and expect it to work. It's all detailed in the docs.

quote:

3) I re-set the range every frame, and i found that this slows down the whole procedure, so i moved it to one time setup while loading. The same stands for glEnableClientState() and glDisableClientState(). As the specs say, all of this (plus some more calls), make the buffers flush, so i had to avoid them. But SwapBuffers() also flushes the buffers, so i can't do something for it!!!

Well, you have to flush the pipeline at the end of the frame. Otherwise, you'd never see your rendered image. But the flushing should generally be reserved for SwapBuffers(). Never flush in the middle of rendering something, other than for profiling.

quote:

Finally, i haven't turned on VSync. God, why should i do something like that ???

To get smoother animation. VSync should always be turned on, except for profiling.

Your code to setup the range looks somewhat weird. Use this:
```cpp
// Get memory
char *DataPointer = (char *)wglAllocateMemoryNV(size, 0, 0, 1);
if (!DataPointer) throw var_error("Can't allocate the VAR !");

// Setup VAR
glVertexArrayRangeNV(size, DataPointer);

// Copy your data into the VAR memory
memcpy(DataPointer, MyVertexDataPool, size);

// Enable VAR
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glEnableClientState(GL_VERTEX_ARRAY);

// For each frame
while (rendering) {
    for (each mesh) {
        // Set vertex pointers, colours, texcoords, etc.
        glVertexPointer(blah, blah, blah, &DataPointer[...]);

        // Make sure the VAR is still valid
        int u;
        glGetIntegerv(GL_VERTEX_ARRAY_RANGE_VALID_NV, &u);
        if (!u) throw var_error("VAR state is invalid !");

        // Render stuff
        glDrawElements(...);
    }
}
```

To profile 3D code with a minimum of precision, enclose the code between glFinish() calls:

```cpp
void Render(void)
{
    // ... stuff ...

    // Begin profiling section
    glFinish();
    uint64 cycles = rdtsc();

    // Profile this
    DrawStuffToProfile();

    // End profiling section
    glFinish();
    cycles = rdtsc() - cycles;
}
```

[edited by - Yann L on October 8, 2003 7:43:17 AM]

##### Share on other sites
Oh, and:

quote:

Misunderstanding again. By "stable" I didn't mean that the system (PC) wasn't stable. What I wanted to say was that even when the camera stands still for a few frames, looking at exactly the same triangles, the frame rate wasn't stable.

That's not a problem with VAR, but a problem with your profiling code, which is unable to handle parallel execution of GPU and CPU code (as happens with VAR, but not with standard VAs).

##### Share on other sites
Thanks for the precious tips. They helped a lot!

quote:

First of all, when I saw cases 1 and 2, I almost got a heart attack. Drop them now, delete them, annihilate them, make a huge fire in your backyard, throw them in, and dance around until they are totally destroyed...

Seriously though, those two approaches must be the most inefficient ways to render a mesh imaginable under OpenGL. They are by far worse than even immediate mode. glDrawElements() has an inherent overhead, and becomes efficient from around 500 triangles on. Don't use it on chunks with less than, say, 50 triangles.

The reason I included those two cases in the whole "benchmark" was that I wanted to see how the worst-case scenario executes. The real worst case would be immediate mode, but I didn't include it.

Because it's raining here, I decided not to burn them. Instead I buried them. Buried them deeply, on top of a high mountain, in the middle of a desert island, where ordinary people can't go. I went there using my custom travel machine, and I'm not planning on going back to dig them out again. In fact I ran out of fuel, so... Also, I placed an electrical wire around the spot, to keep others from digging them out.

quote:

For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);

Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?

quote:

Of course. GL_VERTEX_ARRAY_RANGE_NV is for use with glEnableClientState. You can't just use it with a different command and expect it to work. It's all detailed in the docs.

This was a mistake, made in my hurry to see some results. While reading the specs I hadn't paid proper attention to this little detail. After a second read of the specs, I found my mistake.

The code you provided worked fine. No weird up-and-down frame rate; everything is working fine. I don't know where I made the mistake, but I'll find out. Thanks for that.

About the profiling snippet: I gave it a shot to see what I get. The result was that all the time spent in SwapBuffers() was transferred to the second glFinish() call. I can't understand the need for such a profiling method. Why would someone want to profile a piece of code without keeping in mind all the code before and after the profiled part? Especially for the 3D rendering part of the code. If I put collision detection code immediately after the second glFinish(), what I get is double the real time. I mean,

profile(Rendering + glFinish()) + profile(ColDet) != profile(Rendering + ColDet) + profile(glFinish())

I think that's why you mentioned
quote:

Never flush in the middle of rendering something, other than for profiling.

but I still don't see any use for that kind of profiling.

quote:

That's not a problem with VAR, but a problem with your profiling code, which is unable to handle parallel execution of GPU and CPU code (as happens with VAR, but not with standard VAs).

I also don't understand this part. Why is my profiling code unable to handle the parallel execution of the two? I'm profiling a whole frame. What am I missing? When SwapBuffers() returns, both CPU and GPU are in the same condition: both their queues are empty (the CPU's maybe not) and they are waiting for new instructions. Isn't that correct?

Some more experimental results.

Using the code you gave, the frame rate became as stable as without VAR, without crazy values (like frame 1: 30 FPS, frame 2: 600 FPS). But it still isn't as fast as it's supposed to be. Average FPS for the demo was 60 when the octree nodes had a maximum of 500 triangles, and 70 when I increased it to 2500. I played with these values, but I couldn't get any better than that. Is there anything else I can do?

With my old VAR setup the demo took 255 secs to complete, and with your setup it took 180 secs. That's an improvement. But with one of the other setups (I don't remember which, but it didn't use VAR), I made it run in 130 secs. This is the best I've ever achieved. Is there anything I can do to reach this level using VARs?

Thanks.

HellRaiZer

PS. What's the "rdtsc()" function you use in the source? I searched for it in MSDN but haven't found it. Does the name mean something, or is it an existing function?

##### Share on other sites
Forget about rdtsc(). I searched Google, and I found what it is and how to use it. Thanks for that. It looks pretty interesting. Hope it works on my Duron!

HellRaiZer

##### Share on other sites
Some things came to my mind when I woke up this morning.

1) In your code you use char* as the data pointer type. Does it guarantee 4-byte alignment? Wouldn't using float* be more logical? Does it matter?

2) The time stamp counter works fine, and I'm going to rewrite all my timer functions to use it instead of QueryPerformanceCounter(). Doesn't that make the same calls underneath? It has "performance counter" in its name.

3) [Quoting myself]
quote:

quote:

For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);

Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?

What I can't understand is what the read frequency is supposed to represent. Isn't it a "use" frequency? Like, when you use DrawElements() with VAR set up, don't you actually read the buffer? Or does reading the buffer mean something like:

```cpp
VARDataPointer[100] = VertexPool[21].x;
```

4) How bad for performance is rendering a single triangle twice in the same frame? I'm not talking about some double-pass algorithm. Because I use octrees as the subdivision algorithm, and some triangles lie on the box boundaries, I don't split them. Instead, I push them into all the nodes they intersect. When sending a bunch of triangles to the card at a time, it's not easy to see which and how many triangles have already been passed in, and I have to render some of them two or maybe three times in a frame!!! Does this really kill performance? I'm afraid that if I split those tris, I'll get a much bigger dataset, and it'll be slower than the existing one. This is another reason why I tested the individual-triangles cases. I'll try to split them to see what I get. But the question stands.

5) And, once more, is there anything else I can do to speed up VAR rendering on my system? Or is this the best I can have? My system is a Duron 700MHz with a GeForce 256 DDR. When can I say that the results I got are the best I can get?

HellRaiZer

##### Share on other sites
quote:
Original post by HellRaiZer
Some things came to my mind when I woke up this morning.

1) In your code you use char* as the data pointer type. Does it guarantee 4-byte alignment? Wouldn't using float* be more logical? Does it matter?

No, it doesn't. The NV memory allocator will always 16-align the data (or even 256, it seems). When you fill in the data, you have to make sure it is aligned. I'm using a char* because I like calculating everything in bytes, and because some vertex formats use single-byte components (normals, RGBA colours, vertex weights, etc.). You can use any type you like, even void*.

quote:

2) The time stamp counter works fine, and I'm going to rewrite all my timer functions to use it instead of QueryPerformanceCounter(). Doesn't that make the same calls underneath? It has "performance counter" in its name.

QueryPerformanceCounter() can use rdtsc, but is not guaranteed to. I personally use rdtsc because it is OS independent and more predictable (you know exactly what you get, and the precise overhead of the rdtsc instruction). I don't feel comfortable calling a Win32 function for performance profiling.

quote:

Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?

I don't know, ask nVidia. The whole read/write frequency stuff was a bad idea from the beginning. Luckily, nVidia soon realized that themselves and published the exact values to use in order to get the RAM type you want. Don't worry about the specific values; just use the two mentioned above. They're the ones recommended by nVidia, and they're guaranteed to work.

quote:

4) How bad for performance is rendering a single triangle twice in the same frame? I'm not talking about some double-pass algorithm. Because I use octrees as the subdivision algorithm, and some triangles lie on the box boundaries, I don't split them. Instead, I push them into all the nodes they intersect. When sending a bunch of triangles to the card at a time, it's not easy to see which and how many triangles have already been passed in, and I have to render some of them two or maybe three times in a frame!!! Does this really kill performance?

Yes, it does. Especially on a lowend card such as yours, this is going to stress the fillrate for nothing. And fillrate is exactly what a GF-256 lacks.

quote:

I'm afraid that if I split those tris, I'll get a much bigger dataset, and it'll be slower than the existing one.

Don't be too sure about that. It obviously depends on the type of scene you use, but generally (on modern 3D cards) splitting triangles is much faster than rendering them twice or more. Also, consider transparent faces: rendering them more than once will introduce visual artifacts.

quote:

This is another reason why I tested the individual-triangles cases. I'll try to split them to see what I get. But the question stands.

You might want to consider a different spatial subdivision structure than an octree, in order to avoid the redundancy problem and minimize the required splits.

quote:

5) And, once more, is there anything else I can do to speed up VAR rendering on my system? Or is this the best I can have? My system is a Duron 700MHz with a GeForce 256 DDR. When can I say that the results I got are the best I can get?

Ah, but you didn't mention you have a GF-1... But you didn't really mention how much performance you actually get, either. First, get rid of those double triangles you mentioned above; they will suck away your precious fillrate like mad. I'm starting to suspect that you aren't really geometry limited, but fillrate limited. VAR won't help you very much in that case.

Try this: setup the VARs. Then, render a certain number of simple spheres onto the screen. Each sphere should have around 1000 faces, and should be rendered with a single glDrawElements() call. Draw a couple of thousand spheres, and measure the speed. Now you have the raw maximum performance containing both the geometry pipeline and the fragment pipeline. Next step, disable any texturing, lighting, environment combiners. Set your output window size to 8*8. Set the z compare function to always fail every fragment. Disable zbuffer and colour write masks. This will more or less take out the impact of fillrate, and you can measure the raw geometry processing speed.

quote:

About the profiling snippet: I gave it a shot to see what I get. The result was that all the time spent in SwapBuffers() was transferred to the second glFinish() call. I can't understand the need for such a profiling method. Why would someone want to profile a piece of code without keeping in mind all the code before and after the profiled part?

Well, that is generally what is called profiling: measuring the exact impact of a specific piece of code, without external interference from other code parts. Things like parallel execution pipelines can hide the actual performance hit of a function, because it is delayed/overlapped. What you are talking about is benchmarking, which is basically a speed measure over the entire program. Profiling is micro-benchmarking, on subsystem or even instruction level.
