Does video memory cache affect efficiency?

16 comments, last by Infinisearch 7 years, 10 months ago

Suppose I have a VBO filled with some vertex data, and I draw it in two different ways.

Way 1: draw in sequence (vertex1, vertex2, vertex3, ...)

Way 2: draw in a random order (vertex77, vertex34, vertex1356, ...)

Will way 1 be faster than way 2? Thanks.
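To make the two cases concrete, here is a minimal sketch of what "way 1" and "way 2" could look like with an index buffer in OpenGL. It assumes a current GL context (loaded here via GLEW), a VBO already filled and with vertex attributes set up, and a vertex count that is a multiple of 3; the identifiers are illustrative, not from the original post.

```cpp
// Sketch only: assumes a current OpenGL context, a VBO already filled with
// vertex data, and vertex attributes already configured (e.g. via a VAO).
#include <GL/glew.h>
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

void drawBothWays(GLsizei vertexCount) // vertexCount assumed to be a multiple of 3
{
    // Way 1: indices in sequence (vertex 0, 1, 2, ...)
    std::vector<std::uint32_t> sequential(vertexCount);
    for (GLsizei i = 0; i < vertexCount; ++i) sequential[i] = i;

    // Way 2: the same indices, but in a random order
    std::vector<std::uint32_t> shuffled = sequential;
    std::mt19937 rng(1234);
    std::shuffle(shuffled.begin(), shuffled.end(), rng);

    GLuint ibo = 0;
    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);

    // Way 1: sequential access into the VBO, cache/prefetch friendly
    glBufferData(GL_ELEMENT_ARRAY_BUFFER,
                 sequential.size() * sizeof(std::uint32_t),
                 sequential.data(), GL_STATIC_DRAW);
    glDrawElements(GL_TRIANGLES, vertexCount, GL_UNSIGNED_INT, nullptr);

    // Way 2: scattered access into the same VBO
    glBufferData(GL_ELEMENT_ARRAY_BUFFER,
                 shuffled.size() * sizeof(std::uint32_t),
                 shuffled.data(), GL_STATIC_DRAW);
    glDrawElements(GL_TRIANGLES, vertexCount, GL_UNSIGNED_INT, nullptr);

    glDeleteBuffers(1, &ibo);
}
```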


Yes, it will be faster. In addition, there is something called the post-transform vertex cache that you can take advantage of if you use indexed primitives. Take a look at this:

http://gpuopen.com/gaming-product/tootle/
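Tools like Tootle reorder indices to make better use of that post-transform cache. As a rough sketch of the metric involved, the following simulates a small FIFO cache over an index buffer and reports the average cache miss ratio (ACMR). The FIFO model and the cache size of 32 are simplifying assumptions, not a description of any particular GPU.

```cpp
// Rough sketch: estimate how well an index buffer uses the post-transform
// vertex cache by simulating a small FIFO cache and counting misses.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

double averageCacheMissRatio(const std::vector<std::uint32_t>& indices,
                             std::size_t cacheSize = 32) // assumed cache size
{
    std::deque<std::uint32_t> fifo;   // simulated post-transform cache
    std::size_t misses = 0;

    for (std::uint32_t index : indices)
    {
        if (std::find(fifo.begin(), fifo.end(), index) == fifo.end())
        {
            ++misses;                 // vertex has to be shaded again
            fifo.push_back(index);
            if (fifo.size() > cacheSize)
                fifo.pop_front();     // oldest entry falls out of the cache
        }
    }

    // Misses per triangle: ~3.0 is the worst case, well-optimized meshes
    // (e.g. after running through Tootle) land much lower.
    return indices.empty() ? 0.0
                           : static_cast<double>(misses) / (indices.size() / 3.0);
}
```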

-potential energy is easily made kinetic-

I have googled "video memory cache" a lot and found nothing useful. Thanks for your information, now it's very clear.

The general reason why way 1 is faster is that it lets the GPU iterate through memory easily as it draws in sequence. If the GPU needs to randomly access its memory, there will be hell to pay as it unleashes chaos on the pointer, forcing some sort of sanity checking before it draws. This is typically why, when a model is created, people run it through a preprocessing tool from Nvidia that reorganizes the vertex data so it can be read faster.
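The snippet below is not that Nvidia tool, just a minimal sketch of the general idea: reorder the vertex array so it matches the order in which the indices first reference each vertex, which makes drawing walk vertex memory roughly linearly. The Vertex struct is a hypothetical layout.

```cpp
// Minimal illustration of vertex-fetch reordering: place vertices in the
// order the index buffer first uses them, and rewrite the indices to match.
#include <cstdint>
#include <utility>
#include <vector>

struct Vertex { float px, py, pz, nx, ny, nz, u, v; }; // assumed layout

void reorderVerticesByFirstUse(std::vector<Vertex>& vertices,
                               std::vector<std::uint32_t>& indices)
{
    const std::uint32_t unused = 0xFFFFFFFFu;
    std::vector<std::uint32_t> remap(vertices.size(), unused);
    std::vector<Vertex> reordered;
    reordered.reserve(vertices.size());

    for (std::uint32_t& index : indices)
    {
        if (remap[index] == unused)          // first time this vertex is used
        {
            remap[index] = static_cast<std::uint32_t>(reordered.size());
            reordered.push_back(vertices[index]);
        }
        index = remap[index];                // rewrite the index in place
    }

    // Note: vertices that are never referenced by the indices are dropped.
    vertices = std::move(reordered);
}
```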

If the GPU needs to randomly access its memory, there will be hell to pay as it unleashes chaos on the pointer, forcing some sort of sanity checking before it draws.

Nope. The reason is simply scattered reads: the GPU is hitting memory addresses that are most likely not in cache, and it may also trigger bank/channel conflicts. With fully sequential access, each successive address is hopefully close by, more likely to be in cache, and more likely to be read 'packed' as the various processors fetch their data.

The sanity checking is done in hardware; AFAIK there are no 'I am accessing something guaranteed valid' operations.
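As a CPU-side analogy only (a GPU differs in the details), the sketch below walks the same vertex array once sequentially and once in a shuffled order; the shuffled walk touches a new cache line on almost every access, which is exactly the scattered-read cost being described. Sizes and names are assumptions for illustration.

```cpp
// CPU-side analogy: sequential vs scattered reads over the same data.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    struct Vertex { float data[8]; };               // 32-byte "vertex"
    std::vector<Vertex> vertices(1 << 22);          // ~128 MB of vertex data

    std::vector<std::uint32_t> order(vertices.size());
    std::iota(order.begin(), order.end(), 0u);

    auto touch = [&](const std::vector<std::uint32_t>& idx) {
        float sum = 0.0f;
        auto t0 = std::chrono::steady_clock::now();
        for (std::uint32_t i : idx) sum += vertices[i].data[0];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.3f ms (sum=%f)\n",
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };

    touch(order);                                   // sequential walk

    std::mt19937 rng(42);
    std::shuffle(order.begin(), order.end(), rng);  // scattered walk
    touch(order);
}
```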

Previously "Krohm"

MaxDZ8 is correct: memory is accessed in bursts, and each burst is stored in a cache line. If the vertex data is small enough, you get multiple vertices per cache line. So if a memory access costs 100 cycles, you get 2 or 4 vertices' worth of data out of it instead of one. In addition, linear prefetching is fast.

Edit: linear prefetching might not apply to GPUs.
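As a back-of-the-envelope illustration of that amortization, assuming a 64-byte cache line, a 100-cycle access cost, and a tightly packed 32-byte vertex (all assumed numbers, not tied to any specific GPU):

```cpp
// Amortized cost per vertex: sequential access shares one cache line
// between several vertices, scattered access pays for a line per vertex.
#include <cstdio>

int main()
{
    const int cacheLineBytes = 64;   // assumed burst/cache-line size
    const int missCostCycles = 100;  // assumed cost of one memory access
    const int vertexBytes    = 32;   // e.g. position + normal + UV, packed

    const int verticesPerLine = cacheLineBytes / vertexBytes;            // 2
    const double sequentialCyclesPerVertex =
        static_cast<double>(missCostCycles) / verticesPerLine;           // 50
    const double scatteredCyclesPerVertex  = missCostCycles;             // 100

    std::printf("vertices per cache line: %d\n", verticesPerLine);
    std::printf("amortized cycles/vertex, sequential: %.1f\n", sequentialCyclesPerVertex);
    std::printf("cycles/vertex, scattered: %.1f\n", scatteredCyclesPerVertex);
}
```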

-potential energy is easily made kinetic-

E: whoops beaten :(

I think you'll find more results if you search for "GPU cache" instead of "video memory cache". This is because the cache structure is really a part of a GPU and not its onboard memory, and also because the term "video memory" is pretty outdated.

Unlike CPUs, there's no generic cache structure that's used by all GPUs. Instead, GPUs have a mixture of special-case and general-purpose caches, where the exact number and details can vary significantly between hardware vendors, or even across different architectures from the same vendor. They also tend to be much smaller and much more transient compared to CPU caches. CPUs actually dedicate a relatively large portion of their die space to cache, while GPUs tend to dedicate more space to their SIMD ALU units and corresponding register files. Ultimately this all means that the cache behavior ends up being different from what you would expect from a CPU with large L1/L2 caches, and you can't always apply the same rules of thumb.

To be more clear, GPU caches are more often streaming buffers than actual caches. That's because classical 3D work is very predictable. On the CPU side, you cannot predict the access pattern; you'd need to execute all previous instructions to determine what accesses a particular piece of code will make. On the GPU side, on the other hand, once the command buffer is flushed to the GPU, all draw calls of the frame are 100% specified: you know exactly what vertex #16 of draw call #1337 is going to be.

Hence, in a lot of cases, the GPU just needs to start reading from video memory ahead of use: it could be processing vertex #0 while already loading vertex #100 into the streaming buffer (aka the cache).

A random order of vertices might not be noticeable if there is enough work to do on other units, as the GPU (unlike a CPU) should usually not stall on memory look-ups. But if there is not much to do per vertex, the memory fetching will just not keep up with the processing, as accessing random places in memory is far more work and far more wasteful than accessing data linearly.

(In modern rendering, where the flow is more CPU-like, this changes, of course.)

Rasterizing triangles is a really complex mechanism, and getting the fastest result depends on many conditions.

The first point is the way you use the vertex data: if you have an index buffer, the data can be prefetched far ahead and random access becomes less and less of a problem.

The second point is the vertex access pattern. Something like v1,v2,v3, v34,v35,v36, ... is not a big deal, because the driver fetches multiple large chunks of vertices and the processing is split among all processing units. The cache lines started at 16 vertices on the first GeForce generation and have grown since then.

But if you have a pattern like v1,v34,v103,v8,v2,v143, then you will notice a penalty, because you break the cache-line mechanism and the memory access penalty rises.

The third point is that vertex processing takes a very small amount of time compared to fragment processing.

This means that if you can reduce the number of fragments needed to draw the triangles, it's worth sacrificing some vertex cache efficiency.

Sort the triangles by distance to the center of the model (from farthest to nearest).

This renders the triangles in a depth-buffer-friendly order, avoids overdraw, and is the faster solution.

The pattern will then look like v1,v2,v3, v34,v35,v36, ... (see the sketch below).
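For illustration, a sketch of that sorting step: order triangles by the distance of their centroid from the model's center, farthest first, and rewrite the index buffer accordingly. The Vertex struct and the use of the centroid are assumptions for the example, not taken from the post above.

```cpp
// Sort triangles far-to-near relative to the model center and rewrite the
// index buffer in that order.
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; }; // assumed position-only layout

void sortTrianglesFarToNear(const std::vector<Vertex>& vertices,
                            std::vector<std::uint32_t>& indices, // 3 per triangle
                            const Vertex& modelCenter)
{
    struct Tri { std::array<std::uint32_t, 3> idx; float distSq; };

    std::vector<Tri> tris;
    tris.reserve(indices.size() / 3);

    for (std::size_t i = 0; i + 2 < indices.size(); i += 3)
    {
        const Vertex& a = vertices[indices[i]];
        const Vertex& b = vertices[indices[i + 1]];
        const Vertex& c = vertices[indices[i + 2]];

        // squared distance of the triangle centroid from the model center
        const float cx = (a.x + b.x + c.x) / 3.0f - modelCenter.x;
        const float cy = (a.y + b.y + c.y) / 3.0f - modelCenter.y;
        const float cz = (a.z + b.z + c.z) / 3.0f - modelCenter.z;

        tris.push_back({{indices[i], indices[i + 1], indices[i + 2]},
                        cx * cx + cy * cy + cz * cz});
    }

    // farthest from the center first
    std::sort(tris.begin(), tris.end(),
              [](const Tri& lhs, const Tri& rhs) { return lhs.distSq > rhs.distSq; });

    for (std::size_t t = 0; t < tris.size(); ++t)
    {
        indices[t * 3 + 0] = tris[t].idx[0];
        indices[t * 3 + 1] = tris[t].idx[1];
        indices[t * 3 + 2] = tris[t].idx[2];
    }
}
```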

