
HellRaiZer

VAR + SwapBuffers performance problems


Hello. Can somebody explain this weird behavior? Let me explain. I'm working on a demo level with about 30k triangles, subdivided using octrees. Until yesterday I was using simple vertex arrays for rendering; today I set up NV_VAR. I have minimized state changes (binding textures, changing blending) and I have sorted my triangle lists in a cache-friendly way (at least I did the best I could on that!).

Yesterday I wrote a simple in-code profiler (not an external one like VTune; it's based on the Enguinity series) and tried to measure the time spent in various parts of my code. I tested the demo with some different code setups and found out some things. First of all, most of the time (about 80-90%) was taken by Octree::Render() + SwapBuffers(). Everything else (text output, demo updating, and Octree::Update()) was extremely fast. You probably knew that without me telling you!

Here are the setups I used for testing (in order of execution):

Case 1: Individual triangles, manual backface culling, simple VAs, with compiled vertex arrays (CVA).
Case 2: Individual triangles, manual backface culling, simple VAs, without CVA.
Case 3: Bunch of triangles in one call, OpenGL backface culling, simple VAs, with CVA.
Case 4: Bunch of triangles in one call, OpenGL backface culling, simple VAs, no CVA.
Case 5: Individual triangles, manual backface culling, NV_VAR, no CVA.
Case 6: Bunch of triangles, OpenGL backface culling, NV_VAR, no CVA.

In all cases except case 6, the minimum FPS I got was 38 (in case 3) and the maximum was 300 (in case 5). Everything looked normal: the FPS went up and down smoothly and the average was about 98-102. There were no significant performance differences between cases 1-5.

Now the weird stuff: the last setup (case 6). First of all, Octree::Render() time dropped from about 9 ms to 2 ms, but the total render time increased. To figure out what was happening, I placed a timer around the SwapBuffers() call. Guess what: SwapBuffers() took about 17 ms (average) to complete. Despite the fact that the whole render time had increased (from 131 secs to 255 secs for rendering a demo with 11000 frames), and despite the minimum FPS being 18, the average FPS went up from 102 to 110!

My first thought was, "VAR has finally worked!" It did what it's supposed to do. But that's not what I'm looking for. To be more precise, in case 6 the FPS counter was jumping from 30 to 700 all the time, and the whole thing is completely "unstable". Despite my disappointment, I tried to stress the system a little, to see if it fails (FPS drops). I placed a Sleep(15) before SwapBuffers() and got the same results; now SwapBuffers() took 2 ms to complete (17 - 15). I thought, "I can place more CPU work before swapping, to fill the gap!" But that isn't the point, is it?

I want to ask if there is something I can do to make this work in a more stable way. Triple buffering may be an option, but I don't know if it is possible with OpenGL. Do you have any suggestions about the above "weird" behavior? I don't think it is really weird; VAR is supposed to do this kind of thing. But how can I make it more stable?

Any feedback appreciated. Thanks in advance.

HellRaiZer

[edited by - HellRaiZer on October 7, 2003 12:48:29 PM]
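(For reference, the kind of per-call timing described above could look like the following sketch. It assumes the Win32 QueryPerformanceCounter API that HellRaiZer mentions later in the thread; hDC is a placeholder for the window's device context.)

#include <windows.h>

// Time a single SwapBuffers() call with the high-resolution performance counter.
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
SwapBuffers(hDC);                  // hDC: the window's device context (placeholder)
QueryPerformanceCounter(&t1);

double msec = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;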

quote:

Individual triangles, manual backface culling, NV_VAR


That sentence is a paradox. VAR cannot deal with individual triangles, and you can't do manual BF culling on vertex arrays (unless you repopulate every frame, which is evil™).

I didn't understand everything in your message. But keep in mind two important things: first, profiling a 3D card is not that easy. It has a command FIFO, and will cache commands given to it. A SwapBuffers() will wait for that cache to empty, so it is normal that a standard profiler will hang on this one.

Second, VAR can achieve extremely high performance and is 100% stable, but it is very sensitive to correct usage. I suspect that you are using it in a way it wasn't supposed to be used. There are a lot of small details you have to take into account with VAR. If you want something more beginner-friendly, try VBO, as it hides the ugly stuff in the driver. It's a little slower than VAR, though.
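For reference, a minimal VBO setup along those lines might look like this sketch (hypothetical vertexData, vertexDataSize, vertexStride, indexCount and indices; the GL_ARB_vertex_buffer_object entry points must already be loaded, e.g. via wglGetProcAddress or an extension loader):

// One-time setup: create a buffer object and upload the vertex data into it.
GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexDataSize, vertexData, GL_STATIC_DRAW_ARB);

// Every frame: bind the buffer and set the pointers as byte offsets into it.
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glVertexPointer(3, GL_FLOAT, vertexStride, (const GLvoid*)0);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, indices);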

That's pretty much all one can say from the info you posted. For more details, post the source skeleton of your renderer.

BTW: in cases 5 and 6, are you checking for a valid vertex range? I.e. make sure that the GPU has actually activated VAR. If there is the slightest incompatibility between your code and VAR, the GPU will automatically turn it off. And depending on where your data was stored, you will experience a tremendous performance drop.


[edited by - Yann L on October 7, 2003 3:01:39 PM]

Do you have vertical sync turned on?

If so, I believe it will cause the SwapBuffers call to wait on a vertical refresh, which would explain why your program spends a larger amount of time in that one function.

Also, when you change various render settings you alter the amount of time spent outside of SwapBuffers, so when you do reach SwapBuffers it has less time left to wait there.

Just a theory and it all is hinging on whether vsync is on.

quote:

That sentence is a paradox. VAR cannot deal with individual triangles, and you can't do manual BF culling on vertex arrays (unless you repopulate every frame, which is evil)

I didn't understand everything in your message. But keep in mind two important things: first, profiling a 3D card is not that easy. It has a command FIFO, and will cache commands given to it. A SwapBuffers() will wait for that cache to empty, so it is normal that a standard profiler will hang on this one.



Sorry for the confusion. I wanted to keep it short so someone could read it easily. Or maybe I did stupid things after all!

Here's what the keywords in the cases mean.

1) Individual triangles:

for every material
{
    for every mesh using this material
    {
        for every triangle in that mesh
        {
            glDrawElements(GL_TRIANGLES, curPol->NumVertices, GL_UNSIGNED_INT, curPol->VertexID);
        }
    }
}


1a) Bunch of triangles:

for every material
{
    for every mesh using this material
    {
        glDrawElements(GL_TRIANGLES, curMesh->NumVertices, GL_UNSIGNED_INT, &curMesh->VertexIndex[0]);
    }
}


2) Manual backface culling:

for every material
{
    for every mesh using this material
    {
        for every triangle in that mesh
        {
            if(!PolyAlreadyRendered[curPol->Index] && curPol->IsVisible(Camera->Pos))
            {
                PolyAlreadyRendered[curPol->Index] = true;
                glDrawElements(GL_TRIANGLES, curPol->NumVertices, GL_UNSIGNED_INT, curPol->VertexID);
            }
        }
    }
}


3) Simple VA vs NV_VAR
The difference between these two is where the data lives. Simple VAs => system memory. NV_VAR => AGP or video memory (is there any way of knowing which? Is priority enough to decide?). Of course the vertex pointers are set accordingly.

// One-time setup, if NV_VAR is being used.

glVertexArrayRangeNV(MyVertexArray.NumVertices() * sizeof(GEVertex3D), VAR_Pointer);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

// Setup for rendering (every frame)

GEVertex3D* firstVertInArray = NULL;
if(VAR_Pointer != NULL) // Do we use VAR? VAR_Pointer is a float*
{
    firstVertInArray = (GEVertex3D*)VAR_Pointer;
}
else
{
    firstVertInArray = MyVertexArray.GetVertex(0);   // first vertex in the system-memory array
}

glVertexPointer(3, GL_FLOAT, sizeof(GEVertex3D), &firstVertInArray->x);
// All the other pointers needed are set too.



I think that explains it all.

quote:

Second, VAR can achieve extremely high performance and is 100% stable, but it is very sensitive to correct usage.



Misunderstanding again. By "stable" I didn't mean that the system (PC) wasn't stable. What I wanted to say was that even when the camera stands still for a few frames, looking at exactly the same set of triangles, the frame rate isn't stable.

quote:

BTW: in cases 5 and 6, are you checking for a valid vertex range? I.e. make sure that the GPU has actually activated VAR. If there is the slightest incompatibility between your code and VAR, the GPU will automatically turn it off. And depending on where your data was stored, you will experience a tremendous performance drop.



I have had some experience with incompatibility/misuse of VAR, as you say.
1) I tried to allocate memory with readFrequency 1.0f, writeFrequency 0.0f and priority 1.0f. The specs don't say anything about limitations on these values, except that they must be inside [0.0f, 1.0f]. Then I read a doc from nVidia and found that readFrequency and writeFrequency should be inside [0.0f, 0.25f) for best performance, so that's what I did.
2) I used glEnable instead of glEnableClientState for GL_VERTEX_ARRAY_RANGE_NV, and it gave the same speed results as with simple VAs.
3) I re-set the range every frame, and I found that this slows down the whole procedure, so I moved it to one-time setup while loading. The same goes for glEnableClientState() and glDisableClientState(). As the specs say, all of these (plus some other calls) make the buffers flush, so I had to avoid them. But SwapBuffers() also flushes the buffers, and I can't do anything about that!

quote:

If you want something more beginner-friendly, try VBO, as it hides the ugly stuff in the driver. It's a little slower than VAR, though.



I tried VBOs after posting the first post. They really are slower than VAR, but the FPS went up and down much more smoothly. And SwapBuffers() still eats most of the time there as well.

Finally, I haven't turned VSync on. God, why would I do something like that???

Any suggestions? Any tips for correct VAR usage? Something I can do to speed things up a little? I know: "Buy a new card, you ...!"

Thanks. I hope I didn't forget anything.

HellRaiZer

Btw, how can you profile a GPU? I mean an older GPU. For GFx there is NVPerfHUD, but...!


[edited by - HellRaiZer on October 8, 2003 3:40:51 AM]

OK. There are a lot of problems with your code. *takes deep breath*

First of all, when I saw cases 1 and 2, I almost got a heart attack. Drop them now, delete them, annihilate them, make a huge fire in your backyard, throw them in, and dance around until they are totally destroyed...

Seriously though, those two approaches must be the most inefficient ways to render a mesh imaginable under OpenGL. They are by far worse than even immediate mode. glDrawElements() has an inherent overhead, and only becomes efficient from around 500 triangles per call. Don't use it on chunks with fewer than, say, 50 triangles.

quote:

1) I tried to allocate memory with readFrequency 1.0f, writeFrequency 0.0f and priority 1.0f. The specs don't say anything about limitations on these values, except that they must be inside [0.0f, 1.0f]. Then I read a doc from nVidia and found that readFrequency and writeFrequency should be inside [0.0f, 0.25f) for best performance, so that's what I did.


For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);
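A minimal sketch of how those two calls are usually combined with a fallback (hypothetical helper; wglAllocateMemoryNV itself has to be fetched with wglGetProcAddress, and malloc needs <cstdlib>):

// Try video memory first, then AGP memory; fall back to plain system memory,
// in which case VAR simply isn't used.
void* AllocateVARMemory(int size, bool& usingVAR)
{
    void* p = wglAllocateMemoryNV(size, 0, 0, 1.0f);     // video memory
    if(!p) p = wglAllocateMemoryNV(size, 0, 0, 0.5f);    // AGP memory
    usingVAR = (p != 0);
    if(!p) p = malloc(size);                             // system memory, no VAR
    return p;
}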

quote:

2) I used glEnable instead of glEnableClientState for GL_VERTEX_ARRAY_RANGE_NV, and it gave the same speed results as with simple VAs.


Of course. GL_VERTEX_ARRAY_RANGE_NV is for use with glEnableClientState. You can't just use it with a different command and expect it to work. It's all detailed in the docs.

quote:

3) I re-set the range every frame, and I found that this slows down the whole procedure, so I moved it to one-time setup while loading. The same goes for glEnableClientState() and glDisableClientState(). As the specs say, all of these (plus some other calls) make the buffers flush, so I had to avoid them. But SwapBuffers() also flushes the buffers, and I can't do anything about that!


Well, you have to flush the pipeline at the end of the frame. Otherwise, you'd never see your rendered image. But flushing should generally be reserved for SwapBuffers(). Never flush in the middle of rendering something, other than for profiling.

quote:

Finally, I haven't turned VSync on. God, why would I do something like that???


To get a smoother animation. VSync should always be turned on, except for profiling.

Your code to set up the range looks somewhat weird. Use this:

// Get memory
char *DataPointer = (char *)wglAllocateMemoryNV(size, 0, 0, 1);
if( !DataPointer ) throw var_error("Can't allocate the VAR !");

// Setup VAR
glVertexArrayRangeNV(size, DataPointer);

// Copy your data into the VAR memory
memcpy(DataPointer, MyVertexDataPool, size);

// Enable VAR
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glEnableClientState(GL_VERTEX_ARRAY);

// For each frame
while( rendering ) {

    for( each mesh ) {

        // Set vertex pointers, colours, texcoords, etc
        glVertexPointer(blah, blah, blah, &DataPointer[...]);

        // Make sure the VAR is still valid
        int u;
        glGetIntegerv(GL_VERTEX_ARRAY_RANGE_VALID_NV, &u);
        if( !u ) throw var_error("VAR state is invalid !");

        // Render stuff
        glDrawElements(...);
    }
}


To profile 3D code with at least some precision, enclose the code between glFinish() calls:


void Render(void)
{
    ... Stuff ...

    // Begin profiling section
    glFinish();
    uint64 cycles = rdtsc();

    // Profile this
    DrawStuffToProfile();

    // End profiling section
    glFinish();
    cycles = rdtsc() - cycles;
}



[edited by - Yann L on October 8, 2003 7:43:17 AM]

Oh, and:

quote:

Misunderstanding again. By "stable" I didn't mean that the system (PC) wasn't stable. What I wanted to say was that even when the camera stands still for a few frames, looking at exactly the same set of triangles, the frame rate isn't stable.


That's not a problem with VAR, but a problem with your profiling code, which is unable to handle the parallel execution of GPU and CPU code (as happens with VAR, but not with standard VAs).

Thanks for the precious tips. They helped a lot!

quote:

First of all, when I saw cases 1 and 2, I almost got a heart attack. Drop them now, delete them, annihilate them, make a huge fire in your backyard, throw them in, and dance around until they are totally destroyed...

Seriously though, those two approaches must be the most inefficient ways to render a mesh imaginable under OpenGL. They are by far worse than even immediate mode. glDrawElements() has an inherent overhead, and only becomes efficient from around 500 triangles per call. Don't use it on chunks with fewer than, say, 50 triangles.



The reason I included those two cases in the whole "benchmark" was that I wanted to see how the worst-case scenario executes. The real worst case would be immediate mode, but I didn't include it.

Because it's raining here, I decided not to burn them. Instead I buried them. Buried them deeply, on top of a high mountain, in the middle of a desert island, where ordinary people can't go. I went there using my custom travel machine, and I'm not planning on going back to dig them up again. In fact I ran out of fuel, so... I also placed an electrical wire around the spot, to keep others from digging them out.

quote:

For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);



Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?

quote:

Of course. GL_VERTEX_ARRAY_RANGE_NV is for use with glEnableClientState. You can't just use it with a different command and expect it to work. It's all detailed in the docs.



This was a mistake, made in my hurry to see some results. While reading the specs I hadn't paid the required attention to this little detail. After a second read of the specs, I found my mistake.

The code you provided works fine. No weird up-and-down in the frame rate, and everything is working. I don't know where I made the mistake, but I'll find out. Thanks for that.

About the profiling snippet: I gave it a shot to see what I would get. The result was that all the time spent in SwapBuffers() was transferred to the second glFinish() call. I can't understand the need for such a profiling method. Why would someone want to profile a piece of code without keeping in mind all the code before and after the profiled part? Especially for the 3D rendering part of the code. If I put collision detection code immediately after the second glFinish(), what I will get is double the real time. I mean,

profile(Rendering + glFinish()) + profile(ColDet) != profile(Rendering + ColDet) + profile(glFinish())

I think that's why you mentioned
quote:

Never flush in the middle of rendering something, other than for profiling.


but I still don't see any use for that kind of profiling.

quote:

That's not a problem with VAR, but a problem with your profiling code, which is unable to handle the parallel execution of GPU and CPU code (as happens with VAR, but not with standard VAs).



I also don't understand this part. Why isn't my profiling code able to handle the parallel execution of the two? I'm profiling a whole frame; where am I missing something? When SwapBuffers() returns, both CPU and GPU are in the same condition: both their caches are empty (the CPU's maybe not) and they are waiting for new instructions. Isn't this correct?

Some more experimental results.

Using the code you gave, the frame rate became as stable as without VAR, without crazy values (like frame 1: 30 FPS, frame 2: 600 FPS). But it still isn't as fast as it's supposed to be. The average FPS for the demo was 60 when the octree nodes had a maximum of 500 triangles, and 70 when I increased it to 2500. I played with these values, but I couldn't get any better than that. Is there anything else I can do?

With my old VAR setup the demo took 255 secs to complete, and with your setup it took 180 secs. That's an improvement. But with one of the other setups (I don't remember which, but it didn't use VAR) I made it run in 130 secs. That is the best I've ever achieved. Is there anything I can do to reach this level using VAR?

Thanks.

HellRaiZer

PS. What's the "rdtsc()" function you use in the source? I searched for it in MSDN but haven't found it. Does the name mean something, or is it an existing function?

Some things came to my mind when I woke up this morning.

1) In your code you use char* as the data pointer type. Does it guarantee 4-byte alignment? Isn't using float* instead more logical? Does it matter?

2) The time stamp counter works fine, and I'm going to rewrite all my timer functions to use it instead of QueryPerformanceCounter(). Doesn't that end up making the same calls? It does have "performance counter" in its name.

3)[Quoting myself]
quote:

quote:
--------------------------------------------------------------------------------

For VRAM, use wglAllocateMemoryNV(size, 0, 0, 1);
For AGP, use wglAllocateMemoryNV(size, 0, 0, 0.5f);

--------------------------------------------------------------------------------

Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?



What I can't understand is what the read frequency is supposed to represent. Isn't it a "use" frequency? Like, when you are using glDrawElements() with VAR set up, don't you actually read the buffer? Or does reading the buffer mean something like:

VARDataPointer[100] = VertexPool[21].x;

4) How bad for performance is rendering a single triangle twice in the same frame? I'm not talking about some multi-pass algorithm. Because I use octrees as the subdivision structure, and some triangles lie on node boundaries, I don't split them; instead I push each one into every node it intersects. When sending a bunch of triangles to the card at a time, it's not easy to see which and how many triangles have already been submitted, and I end up rendering some of them two or maybe three times in a frame!!! Does this really kill performance? I'm afraid that if I split those tris I'll get a much bigger dataset, and it'll be slower than the existing one. That is another reason why I tested the individual-triangles cases. I'll try splitting them to see what I get. But the question stands.

5) And, once more, is there anything else I can do to speed up VAR rendering on my system? Or is this the best I can have? My system is a Duron 700 MHz with a GeForce 256 DDR. When can I say that the results I got are the best I can get?

HellRaiZer

quote:
Original post by HellRaiZer
Some things came to my mind when I woke up this morning.

1) In your code you use char* as the data pointer type. Does it guarantee 4-byte alignment? Isn't using float* instead more logical? Does it matter?


No, it doesn't. The NV memory allocator will always 16-align the data (or even 256-align it, it seems). When you fill in the data, you have to make sure it is aligned. I'm using a char* because I like calculating everything in bytes, and because some vertex formats use single-byte components (normals, RGBA colours, vertex weights, etc). You can use any type you like, even void*.

quote:

2) The time stamp counter works fine, and I'm going to rewrite all my timer functions to use it instead of QueryPerformanceCounter(). Doesn't that end up making the same calls? It does have "performance counter" in its name.


QueryPerformanceCounter() can use rdtsc, but is not guaranteed to. I personally use rdtsc, because it is OS independent, and it is more predictable (you know exactly what you get, and the precise overhead of the rdtsc command). I don't feel comfortable calling a Win32 function for performance profiling.
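For reference, a minimal rdtsc() wrapper of the kind described here might look like this sketch (assumes 32-bit MSVC-style inline assembly of that era; newer compilers expose the same thing as the __rdtsc() intrinsic in <intrin.h>):

typedef unsigned __int64 uint64;

// Read the CPU's time stamp counter (clock cycles since reset), returned in EDX:EAX.
inline uint64 rdtsc()
{
    uint64 t;
    __asm {
        rdtsc
        mov dword ptr [t],     eax
        mov dword ptr [t + 4], edx
    }
    return t;
}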

quote:

Does this make any sense? What do the read and write frequencies really mean? Why should I set both of them to 0?


I don't know, ask nVidia. The whole read/write frequency stuff was a bad idea from the beginning. Luckily, nVidia soon realized that themselves and published the exact values to use in order to get the RAM type you want. Don't worry about the specific values, just use the two mentioned above. They're the ones recommended by nVidia, and they're guaranteed to work.

quote:

4) How bad for performance is rendering a single triangle twice in the same frame? I'm not talking about some multi-pass algorithm. Because I use octrees as the subdivision structure, and some triangles lie on node boundaries, I don't split them; instead I push each one into every node it intersects. When sending a bunch of triangles to the card at a time, it's not easy to see which and how many triangles have already been submitted, and I end up rendering some of them two or maybe three times in a frame!!! Does this really kill performance?


Yes, it does. Especially on a low-end card such as yours, this is going to stress the fillrate for nothing. And fillrate is exactly what a GF-256 lacks.

quote:

I'm afraid that if I split those tris I'll get a much bigger dataset, and it'll be slower than the existing one.


Don't be too sure about that. It obviously depends on the type of scene you use, but generally (on modern 3D cards) splitting triangles is much faster than rendering them twice or more. Also, consider transparent faces: rendering them more than once will introduce visual artifacts.

quote:

That is another reason why I tested the individual-triangles cases. I'll try splitting them to see what I get. But the question stands.


You might want to consider a different spatial subdivision structure than an octree, in order to avoid the redundancy problem and minimize the required splits.

quote:

5) And, once more, is there anything else I can do to speed up VAR rendering on my system? Or is this the best I can have? My system is a Duron 700 MHz with a GeForce 256 DDR. When can I say that the results I got are the best I can get?


Ah, but you didn't mention you have a GF-1... And you didn't really say how much performance you actually get, either. First, get rid of those double triangles you mentioned above; they will suck away your precious fillrate like mad. I'm starting to suspect that you aren't really geometry limited, but fillrate limited. VAR won't help you very much in that case.

Try this: set up the VARs. Then render a certain number of simple spheres onto the screen. Each sphere should have around 1000 faces and should be rendered with a single glDrawElements() call. Draw a couple of thousand spheres and measure the speed. Now you have the raw maximum performance, covering both the geometry pipeline and the fragment pipeline. Next step: disable any texturing, lighting and environment combiners. Set your output window size to 8x8. Set the z compare function to fail every fragment. Disable the zbuffer and colour write masks. This will more or less take out the impact of fillrate, and you can measure the raw geometry processing speed.
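A sketch of the state setup for that second measurement, using plain OpenGL 1.x calls (under the assumptions above; the geometry is still transformed and clipped, which is what is being measured):

// Take fillrate out of the equation: no texturing, no lighting, and reject
// every fragment before it touches the framebuffer.
glDisable(GL_TEXTURE_2D);
glDisable(GL_LIGHTING);
glDepthFunc(GL_NEVER);                                   // fail the z test for every fragment
glDepthMask(GL_FALSE);                                   // no depth writes
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);     // no colour writes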

quote:

About the profiling snippet: I gave it a shot to see what I would get. The result was that all the time spent in SwapBuffers() was transferred to the second glFinish() call. I can't understand the need for such a profiling method. Why would someone want to profile a piece of code without keeping in mind all the code before and after the profiled part?


Well, that is generally what is called profiling: measuring the exact impact of a specific piece of code, without external interference from other code parts. Things like parallel execution pipelines can hide the actual performance hit of a function, because it is delayed/overlapped. What you are talking about is benchmarking, which is basically a speed measure over the entire program. Profiling is micro-benchmarking, on subsystem or even instruction level.

I just finished testing the sphere benchmark you suggested. Results from some tests are:


- Lighting is disabled in all tests.
- Rendering 3000 balls with 950 triangles each, grouped into one triangle list per ball.

**********************************************************************
** withOUT GL_NV_vertex_array_range (Simple VA) (800x600x32)
**********************************************************************
================================================
- Minimum Geometry : 2,905,622.25 tris/sec
- Maximum Geometry : 3,595,494.50 tris/sec
- Average Geometry : 3,509,296.25 tris/sec
================================================

**********************************************************************
** with GL_NV_vertex_array_range (2 textures) (800x600x32)
**********************************************************************
================================================
- Minimum Geometry : 7,029,515.00 tris/sec
- Maximum Geometry : 9,769,094.00 tris/sec
- Average Geometry : 9,260,820.00 tris/sec
================================================

**********************************************************************
** with GL_NV_vertex_array_range (1 texture) (800x600x32)
**********************************************************************
================================================
- Minimum Geometry : 8,782,627.00 tris/sec
- Maximum Geometry : 9,797,635.00 tris/sec
- Average Geometry : 9,347,072.00 tris/sec
================================================

**********************************************************************
** with GL_NV_vertex_array_range (No textures) (800x600x32)
**********************************************************************
================================================
- Minimum Geometry : 9,405,980.00 tris/sec
- Maximum Geometry : 9,830,263.00 tris/sec
- Average Geometry : 9,382,089.00 tris/sec
================================================

**********************************************************************
** with GL_NV_vertex_array_range (No textures) (8x8x32) (all enabled)
**********************************************************************
================================================
- Minimum Geometry : 9,450,366.00 tris/sec
- Maximum Geometry : 9,836,973.00 tris/sec
- Average Geometry : 9,388,133.00 tris/sec
================================================

**********************************************************************
** with GL_NV_vertex_array_range (No textures) (8x8x32) (nothing enabled)
**********************************************************************
================================================
- Minimum Geometry : 9,559,716.00 tris/sec
- Maximum Geometry : 10,462,466.00 tris/sec
- Average Geometry : 9,577,451.00 tris/sec
================================================


As you can see, I barely break the 10 Mtris/sec barrier even when rendering into an 8x8 window with depth, stencil, color and texture disabled. I'm not saying the numbers aren't good enough; nVidia's VAR demo (the wavy thing) gave similar results.

The 3000-sphere batch is rendered by randomly translating the origin for every sphere, every frame, so every ball lies inside an imaginary box. The balls' dimensions are small, because I thought that way I could minimize fillrate. I placed the camera so that all the balls are visible every frame (the imaginary box is completely inside the frustum). Backface culling is enabled, and there is no hint for volume clipping.
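For what it's worth, the measurement loop looks roughly like this sketch (hypothetical NUM_SPHERES, spherePos, sphereIndexCount, sphereIndices and cpuHz; rdtsc() as in the profiling snippet above):

// Per frame: draw all spheres, then wait for the GPU before stopping the timer.
uint64 t0 = rdtsc();

for(int s = 0; s < NUM_SPHERES; s++)
{
    glPushMatrix();
    glTranslatef(spherePos[s].x, spherePos[s].y, spherePos[s].z);   // random position inside the box
    glDrawElements(GL_TRIANGLES, sphereIndexCount, GL_UNSIGNED_INT, sphereIndices);
    glPopMatrix();
}

glFinish();
uint64 cycles = rdtsc() - t0;

double seconds    = (double)cycles / cpuHz;              // cpuHz = measured TSC frequency
double trisPerSec = NUM_SPHERES * (sphereIndexCount / 3.0) / seconds;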

Can you explain the above results? I'm a little confused by them, because there seems to be no big difference between the multitextured and non-textured tests (except for the minimum counter).

quote:

You might want to consider a different spatial subdivision structure than an octree, in order to avoid the redundancy problem and minimize the required splits.



The reason I started (and stuck) with octrees is the simplicity of creating them. No tree is that hard to implement, but octrees are the simplest. I wanted octrees for another reason too. As I had read in some occlusion culling papers, their perfect-cube node shape is friendlier to those algorithms than an arbitrary bounding box. While I was implementing HOMs, I didn't see any advantage of perfect cubes over arbitrary boxes, but I was so focused on occlusion culling that I didn't have time to change them. Now that my HOM implementation is stuck on the software rasterizer (a really hard part, I must admit, not only from the speed point of view but also because it's hard to get OpenGL-like precise results), I'm too bored to change them. But occlusion culling is another topic, which deserves plenty of threads and posts!

[off topic]
(Sad memories come to his mind. "What the hell," he thinks, "I'll complete it someday.")

- Reminder (to myself) : Change Octrees.
- Question (to myself) : With what??????
[/off topic]

quote:

Ah, but you didn't mention you have a GF-1... And you didn't really say how much performance you actually get, either. First, get rid of those double triangles you mentioned above; they will suck away your precious fillrate like mad. I'm starting to suspect that you aren't really geometry limited, but fillrate limited. VAR won't help you very much in that case.


Please give the definition of performance! Do you mean tris/sec or pixels/sec (how can you measure that? Is the obvious way the way to go?)? I thought FPS was enough as a performance measure.

My opinion is that I'm both geometry and fillrate limited. But let's assume I'm only fillrate limited; what can I do to overcome it? Changing the resolution, smaller textures, texture compression, texture filtering, minimizing blended polygons, and no multitexturing are some possible solutions, I think. But if I can't "implement" one of them, because I really want the functionality it gives, then I guess the only solution is a newer card.

Thanks for the support, Yann. I'm really grateful.

I'll now try to eliminate double-rendered triangles, and I'll be back with the final results.

HellRaiZer

PS:
quote:

Well, that is generally what is called profiling: measuring the exact impact of a specific piece of code, without external interference from other code parts. Things like parallel execution pipelines can hide the actual performance hit of a function, because it is delayed/overlapped. What you are talking about is benchmarking, which is basically a speed measure over the entire program. Profiling is micro-benchmarking, on subsystem or even instruction level.


My bad English made me think of benchmarking and profiling as the same thing. Thanks for the clear explanation.

quote:
Original post by HellRaiZer
As you can see, I barely break the 10 Mtris/sec barrier even when rendering into an 8x8 window with depth, stencil, color and texture disabled


Well, that's the maximum a GeForce1 can do. Actually, 10 Mtris/sec is a very good number for a GF1.

quote:

Can you explain the above results? I'm a little confused by them, because there seems to be no big difference between the multitextured and non-textured tests (except for the minimum counter).


Texturing will make very little difference until you saturate the fragment pipeline, i.e. you are fillrate limited. In your sphere test, you obviously aren't. Try to make your spheres much bigger, but without changing the face count. From a certain size on, you will hit the bandwidth limit of the fragment pipeline or texture memory. Then you'll see an extreme difference between the two figures.

quote:

- Reminder (to myself) : Change Octrees.
- Question (to myself) : With what??????


I would (of course) suggest ABTs, but I'm probably biased on that point.

quote:

Please give the definition of performance! Do you mean tris/sec or pixels/sec (how can you measure that? Is the obvious way the way to go?)? I thought FPS was enough as a performance measure.


FPS is an absolutely arbitrary performance unit, only valid for one single 3D scene, with a specific camera path, a specific shader setup, etc. It's unusable for general performance comparison. You need to provide at least tris/sec, and the median triangle area. Better is to provide two numbers: one with minimized fillrate impact (i.e. very small render window, no textures), and one with the fillrate impact included.
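(As a worked example using the sphere numbers posted earlier in the thread: 3000 spheres x 950 triangles is 2.85 Mtris per frame, so an average of roughly 9.38 Mtris/sec corresponds to about 3.3 FPS in that test; quoting the 3.3 FPS alone would say nothing about the triangle load behind it.)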

quote:

My opinion is that I'm both geometry and fillrate limited.


As you could see in your tests, the geometry limit is 10 Mtris/sec. You haven't reached that in your engine. Since you are not performing any special vertex processing (hardware lights, or texgen, for example), pretty much everything else comes from the fragment pipeline, texture memory and framebuffer accesses.

quote:

But let's assume I'm only fillrate limited; what can I do to overcome it? Changing the resolution, smaller textures, texture compression, texture filtering, minimizing blended polygons, and no multitexturing are some possible solutions, I think.


Correct.

quote:

But if I can't "implement" one of them, because I really want the functionality it gives, then I guess the only solution is a newer card.


Also correct.

Maybe you could overclock it to get better performance, but I don't think there will be much improvement.

quote:

I have sorted my triangle lists in a cache-friendly way



What do you mean by that, and how did you do it?

[ My Site ]
'I wish life was not so short,' he thought. 'Languages take such a time, and so do all the things one wants to know about.' - J.R.R Tolkien
/*ilici*/

By cache-friendly triangle lists, I mean all the triangles are in a specific order; that is, you render adjacent triangles one after another, so the cache already has (at least) 2 of the needed indices in it. You can't achieve perfect continuity, but I do my best.

In a few words: sort triangles so adjacent triangles are rendered next to each other.

NVTriStrip is a tool that can do these things for you. Keep in mind your card's cache size (which I think NVTriStrip does for you), and try to group triangles accordingly.

I prefer to do it with my own code, because I don't want to mess with nVidia's stuff. I know this may not be as efficient (more cache misses will occur), but it is mine. Also, I have added this procedure to a plugin I wrote for Lightwave, so I can export to my own format. It's easier this way.
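A minimal sketch of that kind of greedy reordering (hypothetical Tri struct; NVTriStrip or a real cache simulator will do better, this only illustrates the idea):

#include <vector>

struct Tri { unsigned v[3]; };   // three vertex indices

// Greedy reordering: keep emitting a triangle that shares at least one vertex index
// with the previously emitted one, so consecutive triangles tend to hit the
// post-transform vertex cache. O(n^2), but it only runs at export time.
std::vector<Tri> ReorderForCache(std::vector<Tri> tris)
{
    std::vector<Tri> out;
    out.reserve(tris.size());

    while(!tris.empty())
    {
        out.push_back(tris.front());             // start a new run
        tris.erase(tris.begin());

        bool found = true;
        while(found && !tris.empty())
        {
            found = false;
            for(size_t i = 0; i < tris.size() && !found; i++)
            {
                for(int a = 0; a < 3 && !found; a++)
                    for(int b = 0; b < 3 && !found; b++)
                        found = (tris[i].v[a] == out.back().v[b]);

                if(found)
                {
                    out.push_back(tris[i]);       // adjacent triangle: emit it next
                    tris.erase(tris.begin() + i);
                }
            }
        }
    }
    return out;
}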

I have to go now. CU.

And Yann, thanks once again. I think this is over.

HellRaiZer
