D3D11_USAGE for a vertex buffer


I'm making the transition from DX9 to D3D11. Looking at the docs for the D3D11_USAGE enumeration: IF a vertex buffer can be initialized when it's created, will not change during its lifetime, and will not have to be accessed by the CPU, is there any disadvantage to creating it with D3D11_USAGE_IMMUTABLE? Is it guaranteed, for instance, to be faster, or have fewer cache misses, than D3D11_USAGE_DEFAULT?

On the other hand, is D3D11_USAGE_DEFAULT the only choice if CPU access may be needed (say, for picking purposes)? The docs don't say explicitly, but I'm assuming DEFAULT can be accessed by the CPU. Is that correct? EDIT: And DYNAMIC obviously can be accessed by the CPU (as I read a little further), so that part is answered.
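For concreteness, here's a minimal sketch of the kind of creation I'm asking about. The Vertex struct and function names are placeholders for illustration, not anything from the docs:

```cpp
#include <d3d11.h>

// Hypothetical vertex layout used only for this sketch.
struct Vertex { float pos[3]; float uv[2]; };

// Creates a vertex buffer that is initialized once and never touched again.
// IMMUTABLE requires initial data at creation time and forbids CPU access flags.
ID3D11Buffer* CreateImmutableVB(ID3D11Device* device,
                                const Vertex* vertices, UINT vertexCount)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = sizeof(Vertex) * vertexCount;
    desc.Usage          = D3D11_USAGE_IMMUTABLE;    // never written after creation
    desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = 0;                        // no CPU read/write access

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = vertices;                        // mandatory for IMMUTABLE

    ID3D11Buffer* vb = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, &init, &vb);
    return SUCCEEDED(hr) ? vb : nullptr;
}
```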

Please don't PM me with questions. Post them in the forums for everyone's benefit, and I can embarrass myself publicly.

You don't forget how to play when you grow old; you grow old when you forget how to play.


I can't answer questions about speed. However, I created a method to wrap the DX calls away from me, and I added a boolean flag to indicate whether the VBO contents will change each frame.

So if it is never changed, I create it with usage IMMUTABLE and CPUAccessFlags 0.

If it is changed, I create it with usage DYNAMIC and CPUAccessFlags CPU_ACCESS_WRITE.

YMMV
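A minimal sketch of that kind of flag-driven creation (my own illustration under the assumptions above, not the poster's actual wrapper; names are placeholders):

```cpp
#include <d3d11.h>

// Chooses usage/CPU-access flags from a single "will this change?" boolean.
ID3D11Buffer* CreateVertexBuffer(ID3D11Device* device, const void* data,
                                 UINT byteWidth, bool changesEachFrame)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = byteWidth;
    desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    desc.Usage          = changesEachFrame ? D3D11_USAGE_DYNAMIC
                                           : D3D11_USAGE_IMMUTABLE;
    desc.CPUAccessFlags = changesEachFrame ? D3D11_CPU_ACCESS_WRITE : 0;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = data;

    // IMMUTABLE buffers must be given initial data; DYNAMIC ones may start empty.
    ID3D11Buffer* vb = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, data ? &init : nullptr, &vb);
    return SUCCEEDED(hr) ? vb : nullptr;
}
```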


The docs don't say explicitly, but I'm assuming DEFAULT can be accessed by the CPU. Is that correct?
I don't know what docs you are referring to; in "my" docs

http://msdn.microsoft.com/en-us/library/windows/desktop/ff476259(v=vs.85).aspx

there's a table about halfway down the page; the only resource usage with CPU access (no matter if R or W) is STAGING.

It also depends on whether you want to use Map or UpdateSubresource, but the documentation is rather clear to me: leave immutable resources alone.
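For what it's worth, a sketch of the two update paths being contrasted; the names here are placeholders, and neither path is legal on an IMMUTABLE resource:

```cpp
#include <d3d11.h>
#include <cstring>

// Two ways to push new data into a buffer, depending on its usage.
void UpdateBuffers(ID3D11DeviceContext* context,
                   ID3D11Buffer* defaultBuffer,   // created with D3D11_USAGE_DEFAULT
                   ID3D11Buffer* dynamicBuffer,   // DYNAMIC + D3D11_CPU_ACCESS_WRITE
                   const void* newData, UINT byteWidth)
{
    // DEFAULT: no CPU mapping, but new data can be copied in with UpdateSubresource.
    context->UpdateSubresource(defaultBuffer, 0, nullptr, newData, 0, 0);

    // DYNAMIC: map for CPU writing, discard the old contents, write, unmap.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(dynamicBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, newData, byteWidth);
        context->Unmap(dynamicBuffer, 0);
    }
}
```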

Previously "Krohm"

@Krohm: Actually I said "the" docs, and I was looking at the same page you link. Your comment with regard to DYNAMIC and CPU access is correct.


the documentation is rather clear to me: leave immutable resources alone.

Interesting. I actually read it the other way 'round. Although it specifically mentions textures, it implies that IMMUTABLE is the choice for unchanging formatted data used only for input. That's why I'm curious about the disadvantages. Would you explain further?

EDIT: This link implies that D3D11_USAGE_IMMUTABLE may be a choice for vertex (as well as index and constant) buffers.


You've mentioned the disadvantage of immutable resources -- they're immutable. You get the best possible memory layouts and locations, but in exchange, you can't modify the data.

You should generally use it for everything that's... immutable... or not dynamic.

To give the graphics driver the best chance of actually doing a good job, you should generally always read up on all the usage hints/flags in a graphics API, and be as truthful to the driver as you can about your intended usage.

Even if it were valid to dynamically map/unmap an immutable resource, doing so would mean that you've lied to the driver... Which means the driver may have made some bad resource allocation decisions.


...the disadvantage of immutable resources -- they're immutable. You get the best possible memory layouts and locations, but in exchange, you can't modify the data.

If I understand what you're saying, that's a restriction, but not a disadvantage for a buffer that "will not change during its lifetime, and will not have to be accessed by the CPU." With those previously mentioned assumptions, it appears IMMUTABLE would be the choice, and I'm trying to learn if there may be something "under the hood" that isn't clear. EDIT: Maybe I am misunderstanding what you're saying, and the assumptions mentioned do not meet the intent of "immutable"? If so, could you explain?

EDIT: I'm also just getting into mapping, and that appears to be a disadvantage if mapping would be needed.

As mentioned, I'm just getting into D3D11, but I've been programming for a long time. For me, as I learn any new interface, I try to be fairly thorough exploring advantages and limitations for making choices, so I can make informed decisions as I code.


Short answer: If the resource is actually immutable and is only ever going to be read by the GPU, then there is no disadvantage. It is only advantageous to tell the driver this information to let it make the best internal decisions.

Very long answer:

Say that the OS is able to "malloc" memory of two different types:


Name   Location    Cache policy
"main" Motherboard Write-back
"vram" GPU         Write-combine

See http://en.wikipedia.org/wiki/Write-combining and http://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies

In your own programs, when you call malloc/new, you always get "main" RAM. However, the graphics driver is also able to allocate the "vram" type.

Although they're physically located in different places, as far as software is concerned they can both be treated the same way. If you've got a pointer -- e.g. char* bytes = (char*)0x12345678 -- that address (which is virtual) might be physically located on the motherboard or on the GPU. It doesn't matter what the physical location is; it's all just memory.
This means that the CPU can actually read/write "vram", and the GPU can actually read/write "main" RAM too. All that's required is that the OS allocates some of these physical resources and maps them to appropriate virtual addresses.

So, all that said, this means that when a GPU-driver is allocating a resource, such as a texture, it basically has two options -- does it use main or vram?

To make that decision, it has to know which processors will be reading/writing the resource, and then consider the performance behaviors of each of those situations:

  • CPU writes to VRAM -> The CPU writes to a write-combine cache and forgets about it. Asynchronously, this cache is flushed with large sized burst writes through to VRAM. The path is fairly long, so it's not the absolute fastest data transfer (CPU->PCI->VRAM).
  • CPU reads from VRAM -> the local write-combining cache must first be flushed! After stalling, waiting for that to occur, it can finally request large blocks of memory to be sent back from VRAM->PCI->CPU. This is extremely slow due to the write-combining strategy.
  • GPU reads from VRAM -> There will likely be a large L2 cache between the VRAM and the processor that needs the data, so not every read operation actually accesses RAM. A cache-miss will result in a very large block of data being moved from VRAM to this L2 cache, where smaller relevant sections can be moved to the processor. These accesses are all lightning fast.
  • GPU writes to VRAM -> it's got its own internal, specialized buses. Frickin' space magic occurs, and it's lightning fast. If you need to implement "thread safety"/synchronization (e.g. you want to read that data, but only after the write has actually completed), then there's a lot of complex software within the GPU driver, relying on specific internals of each GPU. The driver will likely have to manually instruct the GPU to flush its L2 cache after each major write operation too (e.g. when a render-target has finished being written to).
  • CPU reads from Main -> There will be a complex L1/L2/L3 cache hierarchy, complete with a magical predictive prefetcher that tries to guess which sections of RAM you'll access next, and tries to move them from RAM to L3 before you even ask for them. When a RAM access can't be fulfilled by the cache (i.e. a cache miss), then it tries the next cache up. Each level deals with larger and larger blocks/'cache lines' of RAM, making large transfers from the level above, and smaller transfers to the level below.
  • CPU writes to Main -> The CPU writes the data into L1 and forgets about it. Asynchronously in the background, L1 propagates it up to L2, to L3, to RAM, etc, as required.
  • GPU reads from Main (A) -> The GPU fetches the data over GPU->PCI->Main. Bandwidth is less than just accessing the internal VRAM, but still pretty fast. If the CPU has recently written to this memory and the data is still in its caches, then the GPU won't be reading those latest values! The software must make sure to flush all data out of the CPU cache before the GPU is instructed to read it.
  • GPU reads from Main (B) -> The GPU fetches the data over GPU->PCI->CPU->Main. Bandwidth is less than just accessing the internal VRAM, and also less than regular CPU->Main accesses. The transfer from main will go through the CPU cache, so any values recently written by the CPU (which are present in the CPU cache) will be fetched from there automatically instead of from Main.
  • GPU writes to Main (A) -> The GPU sends the data over GPU->PCI->Main. Bandwidth is less than just accessing the internal VRAM, but still pretty fast. If the CPU has recently used this memory and the data is still in its caches, then the CPU could possibly still be using these invalid cached values instead of the latest data! The software must make sure to flush all data out of the CPU cache before the CPU is instructed to read it.
  • GPU writes to Main (B) -> The GPU sends the data over GPU->PCI->CPU->Main. Bandwidth is less than just accessing the internal VRAM, and also less than regular CPU->Main accesses. The transfer from main will go through the CPU cache, so any values recently used by the CPU (which are present in the CPU cache) will be updated automatically.

In reality it's more complex than this -- e.g. modern GPUs will also have an asynchronous DMA unit, whose job is to manage a queue of asynchronous memcpy events, which might involve transferring data between Main/VRAM...

Also, I've given two options for how the GPU might interact with Main, above (A/B), but there are others. Also, more and more systems are going towards heterogeneous designs, where Main/VRAM are physically the same RAM chips (there is no RAM "on the GPU") -- however, even in these designs you might still have multiple buses, such as the WC and WB cache-policy buses described above.

So... to answer the question now. Generally, if a resource is only going to be read from the GPU and never modified (it's immutable), then you'd choose to put it in VRAM, accessed from the CPU via a write-combining bus. This gives you super-fast read access from the GPU, and the best possible CPU-write performance too, but CPU-read performance will be horrible.

If you "map" this resource, the driver can do two things.

1) It can return you an actual pointer to the resource -- if you write to this pointer then you'll be performing fast WC writes, but if you read from this pointer then you'll be performing very slow, WC-flushing, non-cached reads.

2) It can allocate a temporary buffer on the CPU, memcpy the resource into that buffer using its DMA engine, and then return a pointer to this buffer. This means that you can read/write it at regular speeds; however, depending on the API (i.e. if Map is a blocking function), the CPU still has to stall for the DMA transfer to complete. If the API has an async-Map function, then this could still be fast (there's an async bulk memcpy from VRAM to Main, you then get the mapped main memory to use as you like, and then there's another async bulk memcpy from Main back to VRAM) as long as the CPU/GPU don't both need to be working on the buffer at the same time (i.e. they need a lot of latency here to hide the time wasted in the async memcpy'ing).

Either way, reading from the resource on the CPU will be a slow operation due to the driver's choice to allocate the resource behind a WC bus.

If you know in advance that you'll want to be reading from the resource on the CPU, then you should tell the driver this information. It will then likely allocate the resource within Main RAM, accessed via the regular CPU cache. GPU writes to the resource will be slightly slower, but CPU-reads will be the same speed as any old bit of malloc'ed memory.
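As a rough illustration of that last point, CPU read-back in D3D11 goes through a STAGING copy, since STAGING is the only usage with CPU read access. A sketch with placeholder names and minimal error handling:

```cpp
#include <d3d11.h>

// Copies a GPU buffer into a STAGING resource and maps it for CPU reading.
void ReadBack(ID3D11Device* device, ID3D11DeviceContext* context,
              ID3D11Buffer* gpuBuffer, UINT byteWidth)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = byteWidth;
    desc.Usage          = D3D11_USAGE_STAGING;     // CPU-readable copy target
    desc.BindFlags      = 0;                       // staging resources can't be bound
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

    ID3D11Buffer* staging = nullptr;
    if (FAILED(device->CreateBuffer(&desc, nullptr, &staging)))
        return;

    // GPU-side copy into the staging buffer, then map it for CPU reading.
    // (Map will stall until the copy has actually completed.)
    context->CopyResource(staging, gpuBuffer);

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // ... read mapped.pData here ...
        context->Unmap(staging, 0);
    }
    staging->Release();
}
```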

Thank you very much, Hodgman! Actually, it was a bit more than I needed, but not at all more than I was willing to read! I've got this topic bookmarked. And thanks for the Wikipedia links. Wouldn't have had a clue.


