Performance Techniques

Greetings,
I am doing a revision of my game engine and I am confused about its architecture, especially its usage of resources.
My current engine can create almost all types of DirectX 11 resources in any combination. But I honestly find this feature set pretty useless, since all vertex buffers, constant buffers, and textures do is supply data to the shader stages. I fail to see the difference between any of them, except for the kind of data they are meant to supply: a vertex buffer stores vertices, a constant buffer stores frequently updated data, a texture resource stores textures. With a little extra time I was able to make a minimal project that uses only texture resources to supply all types of data to the shaders: one texture that holds vertices, one very small texture as a constant buffer, and more textures for actual textures.
Using only textures seems like a nice idea to me. It makes my engine so small, simple, efficient, and cool in design. So I ask: why are there so many different resources? Are there any performance benefits to using vertex buffers or constant buffers? In a different thread someone told me AMD drivers ultimately convert vertex buffers to texture resources and attach them to the shader stage. So I thought, what the heck, why should I make different resources?
One more thing: what performance difference do D3D11_USAGE_DEFAULT, D3D11_USAGE_IMMUTABLE, and D3D11_USAGE_DYNAMIC make? I tried them all and found almost no difference, not a noticeable one at least. I am pretty sure my tests were insufficient to conclude anything, because the test project was minimal and did not make even a thousand graphics calls or use more than a few thousand vertices. So I ask how much they matter exactly. Should I just hard-code D3D11_USAGE_DEFAULT everywhere?
One more thing: how much difference is there in terms of performance between UpdateSubresource() and Map()?
One more thing: is there a performance difference between Draw() and DrawIndexed()? My current engine uses only Draw(), because when I made it, it was much simpler and faster to unwrap all indexed vertices into unindexed vertices, upload them to graphics memory, and Draw() directly. I may be wrong, but I did a few calculations and vertex data is very small in comparison to the texture data of models, so I think converting it to unindexed format doesn't make that much of a difference either.
I understand I might not be the first one to ask all these questions; I'd be amazed if I were. So if you want to point me to solutions in other threads, I can work with that too.
Thanks a million

In a different thread someone told me AMD drivers ultimately convert vertex buffers to texture resources and attach them to the shader stage. So I thought, what the heck, why should I make different resources?

Nope. The fundamental data types on AMD are images, buffers, samplers (which tell the hardware how to read from images), and "memory" (which is a raw buffer with less safety).

What performance difference do D3D11_USAGE_DEFAULT, D3D11_USAGE_IMMUTABLE, and D3D11_USAGE_DYNAMIC make?
Should I just hard-code D3D11_USAGE_DEFAULT everywhere?

The perf difference between DEFAULT and IMMUTABLE probably isn't that great. You should use IMMUTABLE wherever you can (e.g. textures loaded from files, which don't change afterwards) just to be friendly to the driver, but it's not a big deal if you don't.

DYNAMIC is very important for textures which are streamed from the CPU to the GPU once per frame. It will tell the driver to optimize the memory location, and/or pre-allocate a triple-buffered version, etc... To test this, do a perf-test where you generate a procedural texture on the CPU once per frame, comparing DYNAMIC and DEFAULT.
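To make that perf-test concrete, here's a minimal sketch of the DYNAMIC update path, assuming a texture created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE (the function name and parameters are illustrative, not a canonical API):

#include <d3d11.h>
#include <cstdint>
#include <cstring>

// Sketch: upload a new CPU-generated image into a DYNAMIC texture each frame.
void UpdateDynamicTexture(ID3D11DeviceContext* ctx, ID3D11Texture2D* tex,
                          const uint8_t* src, UINT width, UINT height, UINT bpp)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    // DISCARD tells the driver to hand back a fresh block of memory (orphaning),
    // so the GPU can keep reading last frame's contents without a sync point.
    if (SUCCEEDED(ctx->Map(tex, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        for (UINT row = 0; row < height; ++row)  // copy row by row; the driver's
        {                                        // RowPitch may include padding
            memcpy((uint8_t*)mapped.pData + row * mapped.RowPitch,
                   src + row * width * bpp, width * bpp);
        }
        ctx->Unmap(tex, 0);
    }
}

For the DEFAULT comparison, the same per-frame update would go through UpdateSubresource() instead; the interesting number is how much the orphaning path wins in your particular case.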

One more thing: how much difference is there in terms of performance between UpdateSubresource() and Map()?

Depends on what you're using them for - they can be quite different semantically. e.g. Map with NO_OVERWRITE lets you stream to the GPU with zero driver synchronization, instead taking responsibility for GPU sync yourself, as in D3D12.

One more thing: is there a performance difference between Draw() and DrawIndexed()?

Yes. Indexed draws generally achieve better vertex shader performance. However, Draw with triangle-strip primitives performs similarly (but strips are harder to author).

Using only textures seems like a nice idea to me. It makes my engine so small, simple, efficient, and cool in design. So I ask: why are there so many different resources? Are there any performance benefits to using vertex buffers or constant buffers?

Yes. A lot of GPUs have specialized hardware for constant-buffers, as it's assumed that EVERY vertex/pixel/etc in the draw call will require ALL of the constants -- so it's smart for a GPU to download all the cbuffers into a local cache before starting the draw-call, and leave them present in that local cache for the entire duration of the draw-call.
On the other hand, each pixel/vertex usually fetches only a small part of a texture/buffer, so this data is fetched on demand and dynamically cached in L2.
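To picture the difference, here's a hypothetical HLSL snippet (the buffer names and register slots are placeholders): the cbuffer value is uniform across the whole draw, while the texel fetched differs per pixel:

cbuffer PerMaterial : register(b0)
{
    float4 tintColor;   // same value for every pixel in the draw call
};
Texture2D    gAlbedo  : register(t0);
SamplerState gSampler : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    // tintColor is ideal cbuffer data: identical for the whole draw,
    // so pre-loading it into a local cache up front pays off.
    // The texel below differs per pixel, so it is fetched on demand
    // and cached in L2 instead.
    return gAlbedo.Sample(gSampler, uv) * tintColor;
}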

Also, textures are more complex than buffers -- more dimensions, cube-map support, filtering options, addressing modes, etc... On some GPUs it might be significantly more expensive to fetch from a texture rather than from a buffer. Also, texture-descriptors are generally much larger than buffer descriptors (and usually require an accompanying sampler descriptor), meaning your shaders must allocate more SGPR's per draw-call -- basically, the driver has to create an invisible "cbuffer" that contains your texture/buffer descriptors (or you have to do this yourself via a descriptor heap on D3D12).

Older GPUs have specialized hardware for the Input Assembler stage -- that's the fixed-function hardware which fetches vertices from buffers immediately prior to vertex shading.

Even on newer hardware though, buffers can allow for more complex data layouts (i.e. structures) than textures allow for.

@Hodgman Thank you very much. You have been helping me a lot these days.


do a perf-test where you generate a procedural texture on the CPU once per frame

I'll test it with Canny's edge detection algorithm.


stream to the GPU with zero driver synchronization

Can you tell me more about it, or point me somewhere I can read about it? I would like to know what actually goes on under the DirectX API. Maybe this way I can learn what to use and how to use it.


so it's smart for a GPU to download all the cbuffers into a local cache before starting the draw-call, and leave them present in that local cache for the entire duration of the draw-call.
On the other hand, each pixel/vertex usually fetches only a small part of a texture/buffer, so this data is fetched on demand and dynamically cached in L2.

I see. I did not know GPUs cache constant buffers. This changes a lot of things.

Say, would it be smart to put all very frequently used data in constant buffers, updating them only once at initialization and never again? That way you could get high shader performance from the fast fetching of some variables, without the performance drop that comes with updating constant buffers.

And is there any way to pin some texture pixels in the cache somehow? I don't think such a method exists, but there's no harm in asking.


Also, textures are more complex than buffers -- more dimensions, cube-map support, filtering options, addressing modes, etc...

I see. But that only means they are more feature-rich. So are you sure the benefit of the better feature set outweighs the slower fetching speed from a texture?


Older GPUs have specialized hardware for the Input Assembler stage -- that's the fixed-function hardware which fetches vertices from buffers immediately prior to vertex shading.
Even on newer hardware though, buffers can allow for more complex data layouts (i.e. structures) than textures allow for.

This is actually what I meant when I said


In a different thread someone told me AMD drivers ultimately convert vertex buffers to texture resources and attach them to the shader stage.

I think I said it completely wrong; you said it correctly. I heard new hardware doesn't have dedicated hardware for the input assembler, so shouldn't putting vertices in a vertex buffer or in a 1D texture give similar fetch performance? In fact, putting vertices in a texture seems better: you even have access to other vertices if needed in some applications. You could also make a 1D UAV texture for on-the-fly vertex modification. I hope you see the applications by now, which I think you might already know. From the perf test I did, a UAV had similar speed to a non-UAV as long as no two shader invocations tried to access the same texel.

Also, do you have any good links about how to improve performance? All these dos and don'ts. I see that many people on this forum, including you, have this knowledge, but I can't find a source to acquire it. Up till now I think you all got it by sheer experience.

Can you tell me more about it, or point me somewhere I can read about it? I would like to know what actually goes on under the DirectX API. Maybe this way I can learn what to use and how to use it.

The CPU and GPU work best when there is a decent latency between the two of them. Typically GL/D3D will store all of your commands (draw/etc) into a command buffer, which the GPU consumes about 1 frame after you've made the calls. This buffering helps the GPU maintain maximal throughput.
However, it causes inconvenience when the CPU decides that it wants to modify a resource.

If you tell the GPU to draw an object using a green texture, then modify that texture to be pink and tell the GPU to draw a second object, you expect to see one green object and one pink object... and that's what D3D will do for you. But D3D has to pull off some magic to make this work.
Those draw commands are enqueued for a long period of time, so if D3D didn't do any resource synchronization, you'd end up seeing two pink objects, due to a race condition!

D3D has two main options:
1) When you ask to modify a resource, D3D stalls the CPU until the GPU has finished (at least until it has finished any draw-commands which will use that resource). After waiting for the GPU, then it lets you modify the resource.
In the example, that ends up with:
Draw using texture. Modify texture -- D3D magic sync point: wait dozens of milliseconds until the GPU has actually drawn the green object. Draw using texture.

2) Perform "resource orphaning" aka "resource renaming".
When you ask to modify a resource, D3D actually allocates you a brand new resource and marks the old one for garbage collection in a few frames time!
In the example, you end up with:
Draw using texture (D3D uses version #1). Modify texture -- D3D magic allocation: your 'texture' handle now points to version #2. Draw using texture (D3D uses version #2).

The DYNAMIC resource flag, and the DISCARD map flag, are hints that you want it to use strategy #2 (orphaning). This causes some overhead (garbage collection of resources, the fact that resource handles can be backed by multiple versions, extra memory usage) but avoids the performance-killing sync points of strategy #1. I think constant-buffers are generally treated as "dynamic" data by drivers too.

The NO_OVERWRITE map flag does neither of these strategies, so it's pretty dangerous. It's up to you to invent your own strategy to avoid race conditions. Game engines often use this to implement ring-buffers, where you record which sections of your buffer are in use by data from each frame. You can use events to tell which frames the GPU has finished rendering, which lets you know that it's now safe to reuse those sections of your buffer.
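As a rough sketch of that ring-buffer idea (the structure and wrap policy here are illustrative; the McDonald presentation linked below describes the real-world version):

#include <d3d11.h>

// One big DYNAMIC buffer used as a ring; NO_OVERWRITE while appending,
// DISCARD when we wrap so we never stomp on data the GPU may still read.
struct RingBuffer
{
    ID3D11Buffer* buffer;   // created with D3D11_USAGE_DYNAMIC + CPU_ACCESS_WRITE
    UINT capacity;          // total size in bytes
    UINT head;              // next free byte

    void* Allocate(ID3D11DeviceContext* ctx, UINT size, UINT& outOffset)
    {
        D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
        if (head + size > capacity)
        {
            head = 0;                           // wrapped around:
            mapType = D3D11_MAP_WRITE_DISCARD;  // orphan the whole buffer
        }
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (FAILED(ctx->Map(buffer, 0, mapType, 0, &mapped)))
            return nullptr;
        outOffset = head;
        head += size;
        // NO_OVERWRITE is a promise that we won't touch bytes in flight,
        // so the driver needs no stall and no new buffer version.
        return (unsigned char*)mapped.pData + outOffset;
    }
    // The caller fills the memory, calls ctx->Unmap(buffer, 0), then draws
    // from the returned byte offset (e.g. via IASetVertexBuffers).
};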

Check this out: https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf

I did not know GPUs cache constant buffers. This changes a lot of things.

It certainly used to be the case that cbuffers were cached differently. I think NVidia might still do that. AMD no longer does, but IIRC they do fetch cbuffer variables differently: AMD GPUs always operate on 64 pixels/vertices at a time, so when any one of those pixels requires a cbuffer value, it is fetched once and then used by all 64 pixels, which means it's basically 1/64th the cost.
If values are per-object or per-material, try to put them into a constant buffer. If they change per-vertex or per-pixel, then put them into a regular buffer or a texture.

I see. But that only means they are more feature-rich. So are you sure the benefit of the better feature set outweighs the slower fetching speed from a texture?

It's up to you. If buffers had no use, they would no longer be HW accelerated, and the HW itself would just treat everything as a texture!
The HW designers have decided (for now) to still support buffers/textures differently, and let us choose when to use each type.

You said it correctly. I heard new hardware doesn't have dedicated hardware for the input assembler, so shouldn't putting vertices in a vertex buffer or in a 1D texture give similar fetch performance?

No - it means that putting vertices into a buffer that is bound to the (no-longer-HW-accelerated) "input assembler" stage should have similar performance to putting vertices into a buffer that is bound directly to the vertex shader via a shader-resource-view, and fetched manually in the VS.
Doing this via a texture instead of a buffer may still incur extra overhead.
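For example, the manual-fetch path might look something like this sketch in HLSL (the vertex layout and names are assumptions):

// Vertex data bound as a shader-resource-view instead of through the IA.
struct Vertex { float3 pos; float3 normal; float2 uv; };
StructuredBuffer<Vertex> gVertices : register(t0);

float4 VSMain(uint vid : SV_VertexID) : SV_Position
{
    Vertex v = gVertices[vid];    // manual fetch, indexed by the vertex id
    return float4(v.pos, 1.0);    // real transform omitted for brevity
}

On the C++ side this is an ordinary Draw() with no input layout and no IA vertex buffers bound; the buffer is bound with VSSetShaderResources instead.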

In fact, putting vertices in a texture seems better: you even have access to other vertices if needed in some applications. You could also make a 1D UAV texture for on-the-fly vertex modification.

You can do that with buffer resources as well as texture resources -- i.e. you can make shader-resource-views and unordered-access-views of both resource types.

Also, do you have any good links about how to improve performance? All these dos and don'ts. I see that many people on this forum, including you, have this knowledge, but I can't find a source to acquire it. Up till now I think you all got it by sheer experience.

Yeah a lot of it just comes down to experience. When doing it as a full-time job, you'll often be required to read documents like this, which explain exactly how a particular GPU works. When doing it for consoles, you also get access to non-public documents :|

For the shader side of things, these two presentations are really helpful :)

The first starts with prev-gen GPUs, and the second then extends that to current ones.

http://www.humus.name/Articles/Persson_LowLevelThinking.pdf

http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf

Posted these in another thread a few minutes ago but you might be interested in them as well so:

https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

If you want to peek behind the veil a little the above are good reading.


One more thing: is there a performance difference between Draw() and DrawIndexed()?

Hodgman already answered this, but I'd add that it speeds up vertex performance by enabling the post-transform vertex cache, thus saving the GPU from redundant vertex shader invocations.

-potential energy is easily made kinetic-

All of this is very informative. After learning so much, it now feels like it was just the tip of the iceberg. This is what I have been looking for, but it's still a lot to take in. I will read everything you have all given me. I have been reading it all since you posted. I'll get back if anything comes up.


Posted these in another thread a few minutes ago but you might be interested in them as well so:
https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline
https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

If you want to peek behind the veil a little the above are good reading.

These are amazing!!!


speeds up vertex performance by enabling the post-transform vertex cache

What do you mean? And how do I do it?

Thank you all, you have been a BIG help.


Infinisearch, on 03 Sept 2015 - 10:53 PM, said:

speeds up vertex performance by enabling the post-transform vertex cache
What do you mean? And how do I do it?

Using indexed primitives automatically enables the post-transform vertex cache. This cache stores vertices after the vertex shader has run on them, so when a vertex in a mesh is reused it doesn't need to be transformed again. More on this later in the post.
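For example, the D3D11 side of indexed drawing looks roughly like this (a sketch with illustrative names, assuming the 16-bit indices recommended later in this post):

#include <d3d11.h>
#include <cstdint>

ID3D11Buffer* CreateIndexBuffer(ID3D11Device* dev, const uint16_t* indices, UINT count)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = count * UINT(sizeof(uint16_t));
    desc.Usage     = D3D11_USAGE_IMMUTABLE;      // never changes after creation
    desc.BindFlags = D3D11_BIND_INDEX_BUFFER;
    D3D11_SUBRESOURCE_DATA init = { indices, 0, 0 };
    ID3D11Buffer* ib = nullptr;
    dev->CreateBuffer(&desc, &init, &ib);
    return ib;
}

void DrawMesh(ID3D11DeviceContext* ctx, ID3D11Buffer* ib, UINT indexCount)
{
    ctx->IASetIndexBuffer(ib, DXGI_FORMAT_R16_UINT, 0);  // 16-bit indices
    // When an index repeats, the GPU can reuse the already-transformed
    // vertex from the post-transform cache instead of re-running the VS.
    ctx->DrawIndexed(indexCount, 0, 0);
}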


Also, do you have any good links about how to improve performance? All these dos and don'ts. I see that many people on this forum, including you, have this knowledge, but I can't find a source to acquire it. Up till now I think you all got it by sheer experience.

There was a pretty good post by L. Spiro on performance, IIRC; I don't remember which thread though. I'll do my best to help.

0. I assume you know about frustum culling and BSP trees, quadtrees, octrees, and bounding volume hierarchies, aka send less to the GPU.

1. Vertex data

a. Smaller vertices; see how small you can get away with.

b. Vertex stream sizes that are cache-aligned (pre-transform cache); add padding if necessary.

c. Minimize the number of vertex streams.

d. Use indexing, and optimize the ordering for the post-transform vertex cache and potentially for visibility. Look here for both: http://gfx.cs.princeton.edu/pubs/Sander_2007_TR/tipsy.pdf

Edit: use 16-bit indices where possible.

2. Batch to minimize state changes; see an old NVIDIA document called Batch, Batch, Batch, IIRC.

a. Common advice nowadays is to sort by shader, then texture, then distance, then other state (although in some cases I think sorting by texture first might be appropriate). A sort-key sketch appears after this list.

b. The fewer state changes between draw calls, the more calls you get, IIRC (D3D11 and below).

c. Consider an occluder pass that breaks batching for the first section of rendering; then, after you've filled the z-buffer with major occluders, do batched rendering.

d. Minimize state changes by putting like data into the same buffer: all vertices of a type in a single vertex buffer, texture atlases...

e. Look at this commonly cited link: http://realtimecollisiondetection.net/blog/?p=86

3. What kind of renderer are you using?

a. Forward, or two-pass forward (z or id prepass)

b. Deferred (Google for presentations on Killzone 2, and see http://download.nvidia.com/developer/presentations/2004/6800_Leagues/6800_Leagues_Deferred_Shading.pdf)

c. Light pre-pass

d. Tiled deferred: http://www.slideshare.net/DICEStudio/directx-11-rendering-in-battlefield-3

e. Forward+: http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

f. Clustered forward+: http://www.humus.name/index.php?page=Articles (Practical Clustered Shading)

This has a review of most of the above:

http://dl.acm.org/citation.cfm?id=2776880.2792712&coll=DL&dl=GUIDE

4. Assorted stuff

a. Tiled alpha blending to take advantage of on-chip memory; look for GPU particles here: http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

b. Post-process AA.

c. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Vertex-Shader-Tricks-Bill-Bilodeau.ppsx

d. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/04/DX11PerformanceReloaded.ppsx

e. GPU-driven rendering... see http://advances.realtimerendering.com/s2015/index.html

Edit: also play nice with the API and driver; do everything you're supposed to do. For example, use the right resource types and creation flags; they give hints to the driver that can impact performance.

Edit: found L. Spiro's post: http://www.gamedev.net/topic/660883-best-way-to-make-game-render-most-efficient/#entry5179755
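Regarding 2a: a common trick is to pack all the sort criteria into one integer key so a single sort orders everything at once. Here's a hypothetical sketch along the lines of the link in 2e (the field widths and IDs are arbitrary choices, not canonical):

#include <algorithm>
#include <cstdint>
#include <vector>

// Higher bits compare first: shader changes are the most expensive, so they
// dominate the ordering, then texture, then depth bucket, then other state.
uint64_t MakeSortKey(uint16_t shaderId, uint16_t textureId,
                     uint16_t depthBucket, uint16_t stateId)
{
    return (uint64_t(shaderId)    << 48) |
           (uint64_t(textureId)   << 32) |
           (uint64_t(depthBucket) << 16) |
            uint64_t(stateId);
}

struct DrawItem { uint64_t key; /* mesh, material, transform, ... */ };

void SortDraws(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.key < b.key; });
}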

-potential energy is easily made kinetic-

Hi,
Sorry for the late reply. I was trying to read through all the data you have bombarded me with, and everything is AWESOME!!!
1.)

Using indexed primitives automatically enables the post-transform vertex cache. This cache stores vertices after the vertex shader has run on them, so when a vertex in a mesh is reused it doesn't need to be transformed again. More on this later in the post.
This is an awesome technique. Damn, why didn't I think of this before? This seems like a classic dynamic programming problem from DAA. I am not yet sure how I might implement it. I might use DrawIndexed(), or I might manually reuse the already transformed vertex data by manual indexing. Although, I am not sure the manual method would be as fast, since reused vertices might be picked up from cache memory while the manual way would require fetching the reusable vertex from graphics memory. So I am confused there. The manual method's advantage is that I could directly modify the vertex buffer, which is actually a UAV buffer for bone animation, as it would remove the need to transform mesh vertices for bones every frame. Yeah, so there's that.
2.)
This says:
"Alignment matters! (16-byte, please) Aligned copies can be ~30x faster"
So what does that mean, and in what context? I saw 30 times faster and I was like, wow, interesting.
3.)
It also says to avoid CPU-GPU sync points.
My main thread has only graphics calls. Everything else I have multi-threaded.
I have minimized all CPU-GPU data transfers. Resource updates and state changes occur only when necessary.
Any more thoughts?
4.)
Is this the Batch, Batch, Batch you are talking about: http://www.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
and
This thing is SUPER!:
I took a look at the index ordering that 3DS Max already exports, though. This is a sample:
0,1,2;,
1,0,3;,
2,4,5;,
4,2,1;,
5,6,7;,
6,5,4;,
7,8,9;,
8,7,6;,
9,10,11;,
10,9,8;,
11,12,13;,
12,11,10;,
13,14,15;,
14,13,12;,
15,16,17;,
16,15,14;,
17,18,19;,
18,17,16;;
These look pretty neat in terms of least-recently-used order, so I think I won't need code to reorder them. The general trend is that an index is reused within at most 9-10 positions. Now I understand why my vertex data needs to be small. This is my Vertex struct:
struct Vertex
{
    float pos[3];    // position:  12 bytes
    float normal[3]; // normal:    12 bytes
    float tex[2];    // texcoord:   8 bytes
};                   // 32 bytes per vertex in total
It never changes, ever. I have no idea how large GPU caches are. Does it look adequate for the post-transform vertex cache?
5.) I use deferred, and btw this is very nice:

http://download.nvidia.com/developer/presentations/2004/6800_Leagues/6800_Leagues_Deferred_Shading.pdf

I have gotten through about half of everything. It's taking a lot of time to crunch all the information. I will come back if I need anything. Thank you again!!!


This is an awesome technique. Damn, why didn't I think of this before? This seems like a classic dynamic programming problem from DAA. I am not yet sure how I might implement it. I might use DrawIndexed(), or I might manually reuse the already transformed vertex data by manual indexing.

What do you mean by manual indexing? You need to use DrawIndexed() or the post-transform cache won't be enabled.


This says:
"Alignment matters! (16-byte, please) Aligned copies can be ~30x faster"
So what does that mean, and in what context? I saw 30 times faster and I was like, wow, interesting.

If I'm not mistaken, it means that if you are uploading data to a buffer, make sure the source buffer is aligned to a 16-byte boundary. This will allow for fast data uploads.
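As a sketch of what that looks like in practice (the helper name is made up; _aligned_malloc and posix_memalign are the usual platform calls):

#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

// Allocate CPU-side staging memory on a 16-byte boundary so the driver can
// use wide (SSE-sized) copies when this data is memcpy'd into a mapped buffer.
void* AllocAligned16(size_t size)
{
#if defined(_MSC_VER)
    return _aligned_malloc(size, 16);
#else
    void* p = nullptr;
    if (posix_memalign(&p, 16, size) != 0)
        return nullptr;
    return p;
#endif
}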


It also says to avoid CPU-GPU sync points.
My main thread has only graphics calls. Everything else I have multi-threaded.
I have minimized all CPU-GPU data transfers. Resource updates and state changes occur only when necessary.
Any more thoughts?

Hodgman explains the CPU-GPU sync point in option 1 here; option 2 is the alternative.


D3D has two main options:
1) When you ask to modify a resource, D3D stalls the CPU until the GPU has finished (at least until it has finished any draw-commands which will use that resource). After waiting for the GPU, then it lets you modify the resource.
In the example, that ends up with:
Draw using texture. Modify texture -- D3D magic sync point: wait dozens of milliseconds until the GPU has actually drawn the green object. Draw using texture.

2) Perform "resource orphaning" aka "resource renaming".
When you ask to modify a resource, D3D actually allocates you a brand new resource and marks the old one for garbage collection in a few frames time!
In the example, you end up with:
Draw using texture (D3D uses version #1). Modify texture -- D3D magic allocation: your 'texture' handle now points to version #2. Draw using texture (D3D uses version #2).


Is this the Batch, Batch, Batch you are talking about: http://www.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf

Yes, that is the document, but it is really old. Read it and get what you can from it, then read the link to L. Spiro's post I gave you; he basically gives the same advice, but updated.

http://www.gamedev.net/topic/660883-best-way-to-make-game-render-most-efficient/#entry5179755


This thing is SUPER!:
Infinisearch, on 04 Sept 2015 - 6:10 PM, said:

http://gfx.cs.princeton.edu/pubs/Sander_2007_TR/tipsy.pdf
I took a look at the index ordering that 3DS Max already exports, though. This is a sample:

That one and this one (https://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html) are the common ones you will come across, but there are others. Your modeling software will not always output vertices/indices in an optimal order; use one of the above or some other. Remember, it's for indexed drawing.


struct Vertex
{
    float pos[3];    // position:  12 bytes
    float normal[3]; // normal:    12 bytes
    float tex[2];    // texcoord:   8 bytes
};                   // 32 bytes per vertex in total
It never changes, ever. I have no idea how large GPU caches are. Does it look adequate for the post-transform vertex cache?

Vertex size is relevant for the pre-transform vertex cache and for reducing the bandwidth the input stream requires (there isn't one in modern GPUs; it's cached by the standard GPU cache instead). You don't really need to worry about the vertex size output by the vertex shader (as long as you're being reasonable). As for what you posted: do you really need floats for all of those values?
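For example, a trimmed-down vertex might look like this sketch; whether these precisions hold up depends on your content, so treat the choices as assumptions:

#include <cstdint>

struct PackedVertex
{
    float    pos[3];     // 12 bytes - keep full precision for position
    int8_t   normal[4];  //  4 bytes - DXGI_FORMAT_R8G8B8A8_SNORM (w unused)
    uint16_t tex[2];     //  4 bytes - DXGI_FORMAT_R16G16_FLOAT (half floats)
};                       // 20 bytes vs the original 32

// Matching input-layout formats (offsets in bytes):
//   { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, ... }
//   { "NORMAL",   0, DXGI_FORMAT_R8G8B8A8_SNORM,  0, 12, ... }
//   { "TEXCOORD", 0, DXGI_FORMAT_R16G16_FLOAT,    0, 16, ... }
// The IA expands these back to float for the shader at no extra cost.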


5.) I use deferred, and btw this is very nice:
Infinisearch, on 04 Sept 2015 - 6:10 PM, said:

http://download.nvidia.com/developer/presentations/2004/6800_Leagues/6800_Leagues_Deferred_Shading.pdf

That is a very old document; you should at least look up the Killzone 2 presentation on deferred. Also, if speed is your goal, tiled deferred and clustered tiled deferred are faster options, especially with more lights.

-potential energy is easily made kinetic-


a. Common advice nowadays is to sort by shader, then texture, then distance, then other state (although in some cases I think sorting by texture first might be appropriate).

BTW, for opaques, sort front to back. Front-to-back rendering of opaques is important, but on DX11 it's not really possible to sort depth-first (read Batch, Batch, Batch) unless you do virtual texturing (and maybe with DrawIndirect, if DX11 has it, I don't know). See here why front to back is important: http://developer.amd.com/wordpress/media/2012/10/Depth_in-depth.pdf

Edit: for deferred rendering you don't need to store position; you can reconstruct it from depth, making your G-buffer thinner, which helps performance. A minimal sketch follows the links below.

https://mynameismjp.wordpress.com/2009/03/10/reconstructing-position-from-depth/

https://mynameismjp.wordpress.com/2010/03/22/attack-of-the-depth-buffer/
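For reference, one common variant of the reconstruction looks roughly like this in HLSL (a sketch: MJP's posts above cover faster ray-based variants, and the matrix name here is an assumption):

cbuffer Camera : register(b0)
{
    float4x4 invViewProj;  // inverse of the view-projection matrix
};

// Rebuild world-space position from a depth-buffer value and screen UV.
float3 ReconstructWorldPos(float2 uv, float deviceDepth)
{
    float2 ndc   = uv * float2(2, -2) + float2(-1, 1); // [0,1] UV -> [-1,1] NDC, y flipped
    float4 clip  = float4(ndc, deviceDepth, 1.0);
    float4 world = mul(clip, invViewProj);
    return world.xyz / world.w;                        // perspective divide
}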

-potential energy is easily made kinetic-

