PolarWolf

Does video memory cache affect efficiency?


Suppose I have a VBO filled with some vertex data, and I draw it in two different ways.

Way 1: draw in sequence (vertex1, vertex2, vertex3, ...)

Way 2: draw randomly (vertex77, vertex34, vertex1356, ...)

Will way 1 be faster than way 2? Thanks.


I have googled "video memory cache" a lot and found nothing useful. Thanks for the information; it's very clear now.


The general reason why method 1 is faster is that it lets the GPU easily iterate through memory as it draws in sequence. If the GPU needs to randomly access its memory, there will be hell to pay as it unleashes chaos on the pointer, forcing some sort of sanity checking before it draws. This is typically why, when a model is created, people run it through a preprocessing tool from NVIDIA that reorganizes the vertex data so it can be read faster.


If the GPU needs to randomly access its memory, there will be hell to pay as it unleashes chaos on the pointer, forcing some sort of sanity checking before it draws.

Nope. The reason is simply scattered reads: each access most likely hits a memory address that is not in cache, and it may also trigger bank/channel conflicts. In a fully sequential walk the next address is hopefully close by, more likely to be in cache, and more likely to be read 'packed' as the various processors fetch their data.

The sanity checking is in hardware; it does not have 'I am accessing something guaranteed valid' operations, AFAIK.


MaxDZ8 is correct. Memory is accessed in bursts, and each burst is stored in a cache line. If the vertex data is small enough you get multiple vertices per cache line, so if a memory access costs 100 cycles you get 2 or 4 vertices out of it instead of one. In addition, linear prefetching is fast.

 

edit - linear prefetching might not apply to GPUs

Edited by Infinisearch


To be more clear: GPU caches are more often streaming buffers than true caches. That's because classical 3D work is very predictable. On the CPU side you cannot predict the access pattern; you'd need to execute all previous instructions to determine what accesses a particular piece of code will make. On the GPU, on the other hand, once the command buffer is flushed to the GPU, all draw calls of the frame are 100% specified: you know exactly what vertex #16 of draw call #1337 is going to be.

Hence, in a lot of cases the GPU just needs to start reading from video memory ahead of the usage: it could be processing vertex #0 while already loading vertex #100 into the streaming buffer (aka cache).

Having a random order of vertices might not be noticeable if there is enough work to do on other units, as the GPU (unlike a CPU) should usually not stall on memory lookups. But if there is not much to do per vertex, the memory fetching will just not keep up with the processing, as accessing random places in memory is far more work and far more wasteful than accessing data linearly.

 

(In modern rendering, where the flow is more CPU-like, this changes, of course.)


Rasterizing triangles is a really complex mechanism, and getting the fastest result depends on many conditions.

 

The first point is the way you use the vertex data: if you have an index buffer, the driver can prefetch the data far ahead, and random access becomes less and less of a problem.

The second point is the vertex data access pattern. E.g. v1,v2,v3, v34,v35,v36, ... will not be a big deal, because the driver fetches multiple large chunks of vertices and the processing is split among all processing units. The cache lines started at 16 vertices on the first GeForce generation and have grown since then.

But if you have a pattern like v1,v34,v103, v8, v2,v143, then you will notice a penalty, because you break the cache-line mechanism and the memory access cost rises.

 

The third point is that vertex processing takes a very small amount of time compared to fragment processing.

Which means: if you can reduce the number of fragments needed to draw the triangles, it is worth sacrificing some vertex cache efficiency.

Sort the triangles by distance to the center of the model (from farthest to nearest).

This will render the triangles depth-buffer friendly, avoids overdraw, and is the faster solution.

The pattern will look like v1,v2,v3, v34,v35,v36, ... .

Edited by TAK2004
