Convert triangle list to triangle strip

Thank you for all the article references! Excellent stuff, and I look forward to reading them all!

One thing is now utterly confusing me though. I can understand why indexed triangle lists are very fast, but not why they would be faster than a non-indexed triangle strip, since an indexed triangle list pushes more data down the AGP bus than a non-indexed triangle strip does. Is this perhaps because of an algorithm I don't know about that can exploit the extra data encoded in the indices of a triangle list? Surely a straight triangle strip without indices is the fastest technique of all, simply because it requires the least amount of data to be pushed? (Fewer vertices and no indices.)

Or is it the case that an unoptimised indexed triangle list is slower than an optimised triangle strip (without indices), but that we can exploit the extra data encoded in the indices to manually optimise the mesh data (e.g. reordering the indices to improve vertex cache usage and minimise overdraw)?
Quote:Original post by snk_kid
Interesting, I assume this isn't ATI-only? Here is another paper, from 2007, which seems to extend the work: Fast Triangle Reordering for Vertex Locality and Reduced Overdraw.
I don't see why it would be ATI-only, but I haven't used it, so I can't say with 100% confidence. I don't know why they would do that, though.

It does indeed appear that paper is an extension/update of the one I linked to, written by the same three authors. Just from reading the abstract, it seems they were able to speed it up so significantly that it's feasible to run it at load time, or even whenever the mesh is altered. Figure 1 says the old technique took 40 seconds on the 40k-triangle dragon mesh, while the new technique took only 76 ms with similar results. That's pretty damn impressive! I'll have to read through that paper and see what they're doing; such a drastic change could even mean it's a totally different technique that just happens to have been developed by the same people.

Thanks for the link, you just ruined my Friday night! [grin]
Quote:Original post by Kalidor
Thanks for the link, you just ruined my Friday night! [grin]


I totally agree.

If I was helpful, feel free to rate me up ;) If I wasn't and you feel like rating me down, please let me know why!
Quote:Original post by TheGilb
Surely a straight triangle strip without indices is the fastest technique of all, simply because it requires the least amount of data to be pushed? (Fewer vertices and no indices.)
Downstream bandwidth on AGP 8x is 2.1 GB/s, and 4 GB/s on PCI Express x16. The amount of data involved is not a big deal.
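
To put some rough, purely illustrative numbers on that (assuming 32-byte vertices, 16-bit indices, a closed mesh where the triangle count is about twice the vertex count, and a single perfect strip), the indexed list can even end up smaller than the raw strip, because the strip has to repeat whole vertices rather than cheap indices:

#include <cstdio>

int main()
{
    // All numbers below are assumptions made for the sake of the comparison.
    const unsigned V          = 10000;   // unique vertices
    const unsigned T          = 2 * V;   // triangles (typical for a closed mesh)
    const unsigned vertexSize = 32;      // bytes: e.g. position + normal + one UV set
    const unsigned indexSize  = 2;       // bytes: 16-bit indices

    // Indexed triangle list: each unique vertex stored once, plus 3 indices per triangle.
    const unsigned indexedListBytes = V * vertexSize + 3 * T * indexSize;

    // Non-indexed strip, best case: a single perfect strip needs T + 2 vertices,
    // but every entry is a full vertex, and in practice shared vertices between
    // separate strips get duplicated on top of that.
    const unsigned rawStripBytes = (T + 2) * vertexSize;

    std::printf("indexed list: %u bytes\n", indexedListBytes); // 440000 here
    std::printf("raw strip:    %u bytes\n", rawStripBytes);    // 640064 here
    return 0;
}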

Indexed triangle lists are the most efficient when it comes to actually rendering, because they give the hardware an opportunity to maximize the number of vertex cache hits while processing vertices.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Kalidor
Quote:Original post by snk_kid
Interesting, I assume this isn't ATI-only? Here is another paper, from 2007, which seems to extend the work: Fast Triangle Reordering for Vertex Locality and Reduced Overdraw.
I don't see why it would be ATI-only, but I haven't used it, so I can't say with 100% confidence. [...]

Thanks for the link, you just ruined my Friday night! [grin]


Just thought I'd chime in here. I'm the primary developer on Tootle, and one of the 3 paper authors.

1. Tootle is GPU-neutral. It uses D3D for overdraw measurement, but that should run on any GPU with occlusion query support (there's a rough sketch of the general idea at the end of this post). And if you run it as a preprocess (which is what we recommend, since it can take a while), you can render the resulting meshes on any platform you want. At the moment we're just using D3DX directly for the vertex cache optimization, but there are plans to eventually include the methods from our SIGGRAPH 2007 paper.

2. Kalidor is right, the two papers are related, but very different. The SIGGRAPH 2007 paper presents a fast vertex cache algorithm which works just as well as any existing ones. It also includes a quicker, approximate algorithm for overdraw that's competitive with Tootle in some cases. In other cases, the methods used in Tootle will do a better job on overdraw, because Tootle bases its results on actual measurements whereas the SIG'07 method is a heuristic. So, if you don't care about running time, you'll probably want to stick with Tootle. The advantage of the SIGGRAPH method is that it's fast enough to run at load time (or even on the fly at runtime if you like).

3. Ditto to what others say about strips: they're more trouble than they're worth. The only time it really pays to use strips is if you're somehow running on hardware that lacks a vertex cache (or has a really small one, e.g. 2 vertices).
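
On the overdraw measurement mentioned in point 1: the sketch below is my own illustration of the general occlusion-query idea, not Tootle's actual code. With depth testing on and the depth buffer cleared, the query returns the number of fragments that passed the depth test, i.e. the number of pixel writes; drawing the same mesh from the same view with two different triangle orders and comparing the counts gives a relative measure of overdraw. 'DrawMesh' is a placeholder for whatever issues the draw calls.

#include <d3d9.h>   // link with d3d9.lib

// Placeholder: draws the mesh with whichever triangle ordering is being measured.
void DrawMesh(IDirect3DDevice9* device);

DWORD CountShadedPixels(IDirect3DDevice9* device)
{
    IDirect3DQuery9* query = NULL;
    if (FAILED(device->CreateQuery(D3DQUERYTYPE_OCCLUSION, &query)))
        return 0;

    device->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER,
                  D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0);
    device->SetRenderState(D3DRS_ZENABLE, TRUE);

    device->BeginScene();
    query->Issue(D3DISSUE_BEGIN);
    DrawMesh(device);
    query->Issue(D3DISSUE_END);
    device->EndScene();

    DWORD pixels = 0;   // fragments that passed the depth test
    while (query->GetData(&pixels, sizeof(pixels), D3DGETDATA_FLUSH) == S_FALSE)
        ;               // spin until the result is available
    query->Release();
    return pixels;      // for the same view, a lower count means less overdraw
}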
Joshua Barczak | 3D Application Research Group | AMD
Why is the vertex-cache (VCACHE) optimized ordering faster than the tri-strip approach? Here's an explanation.

Each GPU has its own vertex cache (much like the caches on Intel/AMD CPUs), and a vertex that is already in the cache is processed far more cheaply than one that is not. The idea is to load 16 or 24 vertices into that cache, use all the faces that reference vertices in that group first, and then load another 16 or 24 vertices into the cache and use the next set of faces.

You have to be careful, though. If a face uses two vertices that are in the cache and one that isn't, you lose part of the optimization. That's also why triangle strips aren't automatically cache-friendly: a long strip may reference more than 16 vertices in one run and thrash the cache.
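
You can get a feel for how well a given index order uses the cache by simulating a small FIFO cache offline and counting misses. The snippet below is just an illustrative sketch (it is not what D3DX or any driver actually does, and real cache sizes vary, as noted below); it reports the ACMR, i.e. how many vertices have to be transformed per triangle on average:

#include <algorithm>
#include <deque>
#include <vector>

// Average cache miss ratio for an indexed triangle list against a simulated
// FIFO post-transform cache: 3.0 means no reuse at all, while well-ordered
// large meshes typically get down toward ~0.6-0.7.
double EstimateACMR(const std::vector<unsigned>& indices, size_t cacheSize = 16)
{
    std::deque<unsigned> fifo;   // oldest entry at the front
    size_t misses = 0;

    for (unsigned idx : indices)
    {
        if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end())
        {
            ++misses;                 // not in the cache: vertex must be (re)transformed
            fifo.push_back(idx);
            if (fifo.size() > cacheSize)
                fifo.pop_front();     // FIFO eviction, no reinsertion on a hit
        }
    }
    return 3.0 * misses / indices.size();   // misses per triangle
}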

Cache access is far faster than any runtime reordering would ever be, because it happens directly on the GPU, without any face/vertex index remapping.

The problem is that you cannot query how big the vertex cache is; every graphics card may have a different vertex cache size. The usual advice is to underestimate the size rather than overestimate it, because if you optimize for 24 vertices but there are only 16 cache slots, you don't get the speed boost.

So D3DX assumes you have 16 VCACHE slots, even though you might have 24 or 32 (e.g. the NVIDIA GeForce 6 series and above).
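
For reference, one way to apply that D3DX vertex cache optimization to an ID3DXMesh is OptimizeInplace with D3DXMESHOPT_VERTEXCACHE. The snippet below is only a sketch with error handling omitted, and I can't say whether Tootle goes through this exact entry point or one of the lower-level D3DX functions:

#include <d3dx9.h>   // link with d3dx9.lib
#include <vector>

// Reorders the mesh's faces (and vertices) for better post-transform cache reuse.
void VCacheOptimize(ID3DXMesh* mesh)
{
    // Adjacency information: 3 DWORDs per face.
    std::vector<DWORD> adjacency(mesh->GetNumFaces() * 3);
    mesh->GenerateAdjacency(0.0f, &adjacency[0]);

    mesh->OptimizeInplace(D3DXMESHOPT_VERTEXCACHE | D3DXMESHOPT_COMPACT,
                          &adjacency[0],
                          NULL,    // optimized adjacency out (not needed here)
                          NULL,    // face remap (not needed here)
                          NULL);   // vertex remap (not needed here)
}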
Quote:Original post by jbarcz1
Just thought I'd chime in here. I'm the primary developer on Tootle


Hello there, I just wanted to know whether your work is related to or based on Linear-Speed Vertex Cache Optimisation (ignoring the part which reduces overdraw), or whether it is something different altogether. If so, how does Tootle's method compare?
Quote:Original post by snk_kid
Quote:Original post by jbarcz1
Just thought I'd chime in here. I'm the primary developer on Tootle


Hello there, I just wanted to know whether your work is related to or based on Linear-Speed Vertex Cache Optimisation (ignoring the part which reduces overdraw), or whether it is something different altogether. If so, how does Tootle's method compare?


At the moment, Tootle uses D3DX directly for VCache optimization. The SIGGRAPH 2007 paper (which we're planning to switch to) is a different algorithm from that one. Ours is designed to target a specific (FIFO) cache size, and it's a bit more straightforward (fewer magic numbers). I don't know if we ever sat down and did an apples-to-apples comparison against it. Our ACMR numbers (average cache miss ratio, i.e. vertices transformed per triangle) are similar to the ones presented in that link, but without an exact comparison on a variety of meshes it's hard to say which is better. I don't know how their method would perform on larger meshes (where ACMR is more important). I also have no idea how our running time compares, since they didn't list running times.

I suspect that our method will do a little better in terms of ACMR for a known cache size, since we model a FIFO cache directly.


[Edited by - jbarcz1 on July 21, 2007 5:21:03 PM]
Joshua Barczak | 3D Application Research Group | AMD

