67rtyus

Optimizing for the vertex cache and multiple vertex pipelines



Hi, I am studying how to improve performance by optimizing indexed triangle lists for vertex cache use. The algorithms I found on the internet assume that a vertex list is processed by a single vertex shader (or processor), one vertex at a time.

For example, we have the vertex buffer: v0,v1,v2,v3,v4,v5
And the index buffer: 0,1,2,1,2,3,2,3,4,3,4,5

According to those algorithms, vertices 0,1,2, then 1,2 again, then 3 and so on are processed by a single vertex shader, in the order they appear in the index buffer. If a vertex is already in the cache, it won't get shaded. If it is not, it is shaded and added to the cache. This works fine for a single vertex shader (or pipeline; everyone seems to use a different word for the thing that does the T&L job).

But in reality, today's graphics cards have multiple shader units and the vertices get shaded in parallel. Considering the huge amount of parallel processing done in the GPU, how is the vertex cache used? I mean, if there are 8 vertex shaders running in parallel, then for the example buffers above the vertices 0-1-2-1-2-3-2-3 are processed at the same time and some vertices get shaded redundantly. So the cache becomes simply useless.

Now, what is the real situation here? How does the cache handle the parallel processing of the vertices?

Thanks in advance
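To make the single-unit model concrete, here is a minimal Python sketch of it. It is only an illustration of the model described in the post, not of any real GPU: the function name count_shaded and the cache size of 4 are assumptions chosen for the example.

```python
from collections import deque

def count_shaded(indices, cache_size):
    """Count vertex shader runs for ONE unit with a FIFO cache (toy model)."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    shaded = 0
    for index in indices:
        if index in cache:
            continue         # cache hit: reuse the already shaded vertex
        shaded += 1          # cache miss: shade the vertex...
        cache.append(index)  # ...and remember the result
    return shaded

index_buffer = [0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5]
with_cache = count_shaded(index_buffer, cache_size=4)     # hypothetical size
without_cache = count_shaded(index_buffer, cache_size=0)  # no cache at all
print(with_cache, without_cache)  # prints: 6 12
```

With the cache, each of the 6 vertices is shaded once; without it, every one of the 12 indices triggers a shader run.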

I don't think you need to worry about the number of vertex shader units.

If you have 1 unit pulling off vertices 0,1,2,3, the cache will perform well.

If you have 4 units pulling off the same 4 vertices in parallel, well, that's no different to the vertex cache.

I doubt there's anything at all clever about a vertex cache; it just services memory requests the same as any other shared cache.



Quote:
I don't think you need to worry about the number of vertex shader units.

If you have 1 unit pulling off vertices 0,1,2,3, the cache will perform well.

If you have 4 units pulling off the same 4 vertices in parallel, well, that's no different to the vertex cache.

I doubt there's anything at all clever about a vertex cache; it just services memory requests the same as any other shared cache.

Referring to my example:

If we have 1 unit, it will process 0,1,2 and put them in the cache. For the next three indices, 1,2,3, it will skip shading 1 and 2 because they are already in the cache; it will just process 3. It will consume all the vertices by following the same procedure. Everything is in order.

But if we have 4 units, they will first process 0,1,2 and 1, so 1 gets shaded twice. Let's say 0, 1 and 2 are put into the cache after that pull. Then come 2,3,2,3. 2 is already in the cache, so we only need to shade 3. Let's say we skip 2 here: we would have nothing to gain, because in this cycle we still need to process vertex 3, so skipping vertex 2 or processing it makes no difference to the time taken. Moreover, if we skip vertex 2, the two units which would normally process it stand idle for this cycle, bringing down the efficiency.
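The lockstep scenario described here can be sketched in Python as follows. This models only the poster's hypothetical (all units look at the cache as it was before their batch started); it makes no claim about how real hardware dispatches vertices. The function name and the cache size of 8 are assumptions for illustration.

```python
from collections import deque

def count_shaded_lockstep(indices, units, cache_size):
    """Toy model: `units` shaders each take one index per cycle, and all of
    them consult the cache as it was BEFORE the cycle started."""
    cache = deque(maxlen=cache_size)
    shaded = 0
    for start in range(0, len(indices), units):
        batch = indices[start:start + units]
        misses = [i for i in batch if i not in cache]
        shaded += len(misses)  # a duplicate inside one batch is shaded twice
        for i in misses:
            if i not in cache:  # store each newly shaded vertex once
                cache.append(i)
    return shaded

index_buffer = [0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5]
one_unit = count_shaded_lockstep(index_buffer, units=1, cache_size=8)
four_units = count_shaded_lockstep(index_buffer, units=4, cache_size=8)
print(one_unit, four_units)  # prints: 6 9
```

Under this model the four-unit case runs 9 shader invocations instead of 6, yet it still finishes in 3 cycles rather than the 6 a single unit would need, and it does less work than the 12 invocations required with no cache at all.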

Therefore I have the feeling that the vertex cache doesn't give any advantage when there are multiple shader units. There is a very good chance that I am wrong, and I want to know how.

Try not to think in triangles, just think of individual memory fetches.

Let's pretend you have 4 vertex shading cores, VS1-4, and that your vertex cache can hold 8 verts at a time (you should appreciate that smaller vertex sizes are better).

VS1 requests v0, stalls because it's not in the cache, boo
VS2 requests v1, stalls because it's not in the cache, boo
VS3 requests v2, stalls because it's not in the cache, boo
VS4 requests v3, stalls because it's not in the cache, boo
the cache loads v0-v7
VS1 processes v0
VS2 processes v1
VS3 processes v2
VS4 processes v3
VS1 requests v4, this is in the cache, hooray!
VS2 requests v5, this is in the cache, hooray!
VS3 requests v6, this is in the cache, hooray!
VS4 requests v7, this is in the cache, hooray!
VS1 processes v4
VS2 processes v5
VS3 processes v6
VS4 processes v7

So, we're going 4x faster because we're achieving great parallelism!
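The walk-through above treats the cache like an ordinary memory cache that loads a whole line of 8 vertices on a miss. A rough Python sketch of that idea follows; the line size, the number of resident lines and the oldest-first eviction are all assumptions made for the illustration, not properties of any particular GPU.

```python
def count_line_loads(requests, verts_per_line=8, resident_lines=1):
    """Count how many times a cache line must be fetched from memory."""
    resident = []  # line numbers currently cached, oldest first
    loads = 0
    for vertex in requests:
        line = vertex // verts_per_line
        if line in resident:
            continue            # hit: the line is already loaded
        loads += 1              # miss: fetch the whole line
        resident.append(line)
        if len(resident) > resident_lines:
            resident.pop(0)     # evict the oldest line
    return loads

in_order = list(range(8))               # v0..v7, as in the walk-through
scattered = [0, 8, 1, 9, 2, 10, 3, 11]  # alternates between two lines

print(count_line_loads(in_order))                     # prints: 1
print(count_line_loads(scattered))                    # prints: 8
print(count_line_loads(scattered, resident_lines=2))  # prints: 2
```

In-order requests need a single line load for all eight vertices. A badly ordered request stream thrashes a cache that holds one line, while a cache that can hold two lines at once copes with the same stream in two loads.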

OK, this is simplistic, I know. What if the index order is bad? Well, there are things called cache ways: the cache can start fetching from up to N locations at once. (Think of a texture cache fetching multiple textures at once!) So, for example, a 2-way cache could be feeding 4 vertex units. This would make it more tolerant of index list jumps, and that would be good for parallelism as well! However, the complexity of the cache circuitry goes up, and transistors are at a premium. There will be a sweet spot between cache size, number of cache ways and number of available vertex processing units.

It may even be that on modern cards the vertex cache and texture cache are unified, who knows! It would make sense to me: if you're doing a lot of texture ops which slow you down, you don't care so much about rip-roaring vertex cache performance, whereas for a z-only pass you care about it very much! Having said that, I expect the post transform cache is very different to the pre transform cache, and the post transform cache is the one you really want to be hitting. (It's likely to hold fewer vertices, especially if you're making heavy use of interpolators; hint, console devs.)

It might also be worth bearing in mind that vertex cache discovery was removed from the D3D10 API. There was originally an ID3D10Device::CheckVertexCache() method (or similar) that was dropped during the beta.

The reason given at the time was that with unified shader architectures and complex GPUs it was next to impossible to create an abstracted (and useful) view of the vertex cache for application developers. This seems reasonable to me, even if I was annoyed about not having visibility of it.

Cache optimization is still a good idea, but it's probably not worth investing ridiculous amounts of time in getting it perfect, as you're likely to need different tweaks for each major hardware revision to get optimal performance.

hth
Jack

Martin, does the example you gave show how the pre transform cache works? The way it gets 8 vertices in a row resembles a CPU cache. If so, how would the post transform cache behave in this situation with 4 shader units?

I agree it's not worth spending the time trying to work out what cache size you have at runtime. There are good papers on how to optimise for any size of cache; it isn't too hard. If you're really worried, then let DirectX optimise your index lists on load. (I wouldn't bother, however; offline should be good enough.)

All the information I've seen (either read or witnessed with my own eyes) suggests that there is very little difference between a CPU cache and a cache on the graphics card. Indeed, why would there be? I suspect that in a typical PC the cache lines are simply a touch smaller than in a CPU cache, and there are perhaps fewer ways.

The example I showed was really a generic cache example; the post transform cache will work in the same way as the pre transform cache.

The reasons the post transform cache typically works less well:
1. The physical location of the cache means it is likely to be smaller than the pre transform cache. (The data needs to flow fast to keep the pixel shaders busy!)
2. No compression of interpolator data (all floats) means larger vertex sizes. Vertices in the pre transform cache can be compressed; a colour, for example, expands by a factor of 4!
3. Procedurally generating additional information to send to the pixel shader makes the vertex size larger.
4. Passing pixel shader constants via interpolators bloats the vertex size.
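The effect of vertex size on cache capacity is simple arithmetic, sketched below in Python. The vertex layouts and the 1024-byte budget are made-up numbers for illustration only; real cache sizes and vertex formats vary by hardware and are not stated in this thread.

```python
FLOAT_BYTES = 4

def vertices_that_fit(cache_bytes, vertex_bytes):
    """How many whole vertices fit in a cache of the given size."""
    return cache_bytes // vertex_bytes

# Hypothetical pre transform vertex: 3-float position + colour packed in 4 bytes.
packed_colour = 4
pre_vertex = 3 * FLOAT_BYTES + packed_colour

# Hypothetical post transform vertex: 4-float position + colour as 4 floats.
float_colour = 4 * FLOAT_BYTES
post_vertex = 4 * FLOAT_BYTES + float_colour

CACHE_BYTES = 1024  # assumed budget, identical for both cases

print(float_colour // packed_colour)                # colour grows 4x
print(vertices_that_fit(CACHE_BYTES, pre_vertex))   # 64 vertices
print(vertices_that_fit(CACHE_BYTES, post_vertex))  # 32 vertices
```

Every extra interpolator (points 3 and 4 in the list) adds to the post transform vertex size in this sketch and so reduces the number of vertices that fit.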

Quote:
Original post by Martin
All the information I've seen (either read or witnessed with my own eyes) suggests that there is very little difference between a CPU cache and a cache on the graphics card. Indeed why would there be?
I suspect this is more true now than it used to be.

Up until the last round of D3D9 parts (or the first generation of D3D10 parts, I forget which) there were discrete VS and PS units on the hardware. Thus they had very specific caches customized for the type of work being done: sure, still 'normal' caches, but they did have specific characteristics you could code for.

Now, from what I understand, the cache is more general purpose as is the computation architecture - unified shaders and all that jazz. Thus you could now have the GS/DS/HS/SO interfere with the VS post-transform cache...


Cheers,
Jack

Hi JollyJeffers

That's my understanding as well. As mentioned previously, I would not be at all surprised to find that the post transform cache is now unified with the texture cache.

OK, I believe I am now greatly confused; the discussion went off topic. Let me rearrange my thoughts and questions in steps:

1- AFAIK, the vertex cache (the post-transform one, it seems) is actually similar to a FIFO queue. The most recently processed vertex is put into the cache. The optimizer algorithms build the triangle list greedily, each time adding the triangle that has the most vertices already in the cache.

2- Since the cache is a FIFO structure, it is meant to work with a single stream of vertices coming from a single vertex shader unit. When there are multiple vertex shaders running in parallel, the cache becomes useless, as a vertex can get processed at the same time in multiple processors (as I mentioned in my first posts). If 4 units shade the vertices 0-1-2-1, vertex 1 is processed by the second and fourth shaders at the same time. If there were only a single shader, it could put 1 into the cache in the second cycle and read it directly from the cache in the fourth cycle.

3- So how does the cache work with multiple vertex shaders? I think there could be separate caches for every shader (2 or 3 cache entries each). Or the hardware could split the triangle list into 4 parts for the 4 shader units, with unit i starting to read at index i*N/4, where N is the number of indices and i = 0, 1, 2 or 3.
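The greedy idea in step 1 can be sketched in Python as below. This is a deliberately naive illustration (quadratic time, simple tie-breaking), not any published optimizer; the function names and the FIFO size of 4 are assumptions for the example.

```python
from collections import deque

def greedy_reorder(triangles, cache_size):
    """Repeatedly emit the remaining triangle with the most vertices
    already in a simulated FIFO cache (ties: earliest in the input)."""
    remaining = list(triangles)
    cache = deque(maxlen=cache_size)
    ordered = []
    while remaining:
        best = max(remaining, key=lambda tri: sum(v in cache for v in tri))
        remaining.remove(best)
        ordered.append(best)
        for v in best:
            if v not in cache:
                cache.append(v)  # shaded vertex enters the FIFO
    return ordered

def count_shaded_triangles(triangles, cache_size):
    """Shader runs for a single unit walking the list with a FIFO cache."""
    cache = deque(maxlen=cache_size)
    shaded = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                shaded += 1
                cache.append(v)
    return shaded

shuffled = [(0, 1, 2), (3, 4, 5), (1, 2, 3), (2, 3, 4)]
ordered = greedy_reorder(shuffled, cache_size=4)
print(ordered)
print(count_shaded_triangles(shuffled, 4), count_shaded_triangles(ordered, 4))
```

For this small strip the greedy pass recovers the order 0,1,2 / 1,2,3 / 2,3,4 / 3,4,5 and reduces the simulated shader runs from 10 to 6 under the single-unit FIFO model.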

