Short answer: If the resource is actually immutable and is only ever going to be read by the GPU, then there is no disadvantage. It is only advantageous to tell the driver this information to let it make the best internal decisions.
Very long answer:
Say that the OS is able to "malloc" memory of two different types:
Name     Location      Cache policy
"main"   Motherboard   Write-back
"vram"   GPU           Write-combine
See http://en.wikipedia.org/wiki/Write-combining and http://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
In your own programs, when you call malloc/new, you always get "main" RAM. However, the graphics driver is also able to allocate the "vram" type.
Despite the fact that they're physically located in different places, as far as software is concerned, they can both be treated the same way. If you've got a pointer -- e.g. char* bytes = (char*)0x12345678 -- that address (which is virtual) might be physically located on the motherboard or the GPU. It doesn't matter what the physical location is, it's all just memory.
This means that the CPU can actually read/write "vram", and the GPU can actually read/write "main" RAM too. All that's required is that the OS allocates some of these physical resources and maps them to appropriate virtual addresses.
So, all that said, this means that when a GPU-driver is allocating a resource, such as a texture, it basically has two options -- does it use main or vram?
To make that decision, it has to know which processors will be reading/writing the resource, and then consider the performance behaviors of each of those situations:
- CPU writes to VRAM -> The CPU writes to a write-combine cache and forgets about it. Asynchronously, this cache is flushed with large burst writes through to VRAM. The path is fairly long, so it's not the absolute fastest data transfer (CPU->PCI->VRAM).
- CPU reads from VRAM -> the local write-combining cache must first be flushed! After stalling, waiting for that to occur, it can finally request large blocks of memory to be sent back from VRAM->PCI->CPU. This is extremely slow due to the write-combining strategy.
- GPU reads from VRAM -> There will likely be a large L2 cache between the VRAM and the processor that needs the data, so not every read operation actually accesses RAM. A cache-miss will result in a very large block of data being moved from VRAM to this L2 cache, where smaller relevant sections can be moved to the processor. These accesses are all lightning fast.
- GPU writes to VRAM -> it's got its own internal, specialized buses. Frickin' space magic occurs, and it's lightning fast. If you need to implement "thread safety"/synchronization (e.g. you want to read that data, but only after the write has actually completed), then there's a lot of complex software within the GPU driver, relying on specific internals of each GPU. The driver will likely have to manually instruct the GPU to flush its L2 cache after each major write operation too (e.g. when a render-target has finished being written to).
- CPU reads from Main -> There will be a complex L1/L2/L3 cache hierarchy, complete with a magical predictive prefetcher that tries to guess which sections of RAM you'll access next, and tries to move them from RAM to L3 before you even ask for them. When a RAM access can't be fulfilled by the cache (i.e. a cache miss), then it tries the next cache up. Each level deals with larger and larger blocks/'cache lines' of RAM, making large transfers from the level above, and smaller transfers to the level below.
- CPU writes to Main -> The CPU writes the data into L1 and forgets about it. Asynchronously in the background, L1 propagates it up to L2, to L3, to RAM, etc, as required.
- GPU reads from Main (A) -> The GPU fetches the data over GPU->PCI->Main. Bandwidth is less than just accessing the internal VRAM, but still pretty fast. If the CPU has recently written to this memory and the data is still in its caches, then the GPU won't be reading those latest values! The software must make sure to flush all data out of the CPU cache before the GPU is instructed to read it.
- GPU reads from Main (B) -> The GPU fetches the data over GPU->PCI->CPU->Main. Bandwidth is less than just accessing the internal VRAM, and also less than regular CPU->Main accesses. The transfer from Main will go through the CPU cache, so any values recently written by the CPU (which are present in the CPU cache) will be fetched from there automatically instead of from Main.
- GPU writes to Main (A) -> The GPU sends the data over GPU->PCI->Main. Bandwidth is less than just accessing the internal VRAM, but still pretty fast. If the CPU has recently used this memory and the data is still in its caches, then the CPU could possibly still be using these stale cached values instead of the latest data! The software must make sure to flush all data out of the CPU cache before the CPU is instructed to read it.
- GPU writes to Main (B) -> The GPU sends the data over GPU->PCI->CPU->Main. Bandwidth is less than just accessing the internal VRAM, and also less than regular CPU->Main accesses. The transfer to Main will go through the CPU cache, so any values recently used by the CPU (which are present in the CPU cache) will be updated automatically.
In reality it's more complex than this -- e.g. modern GPUs will also have an asynchronous DMA unit, whose job is to manage a queue of asynchronous memcpy events, which might involve transferring data between Main/VRAM...
Also, I've given two options for how the GPU might interact with Main, above (A/B), but there are others. Also, more and more systems are going towards heterogeneous designs, where Main/VRAM are physically the same RAM chips (there is no RAM "on the GPU") -- however, even in these designs you might still have multiple buses, such as the WC and WB cache-policy buses above.
So... to answer the question now. Generally, if a resource is only going to be read by the GPU and never modified (it's immutable), then you'd choose to put it in VRAM, accessed from the CPU via a write-combining bus. This gives you super-fast read access from the GPU, and the best possible CPU-write performance too, but CPU-read performance will be horrible.
If you "map" this resource, the driver can do two things.
1) It can return you an actual pointer to the resource -- if you write to this pointer then you'll be performing fast WC writes, but if you read from this pointer then you'll be performing very slow, WC-flushing, non-cached reads.
2) It can allocate a temporary buffer on the CPU, memcpy the resource into that buffer using its DMA engine, and then return a pointer to this buffer. This means that you can read/write it at regular speeds, however, depending on the API (i.e. if Map is a blocking function), then the CPU still has to stall for the DMA transfer to complete. If the API has an async-Map function, then this could still be fast (there's an async bulk memcpy from VRAM to Main, you then get the mapped main memory to use as you like, and then there's another async bulk memcpy from Main back to VRAM) as long as the CPU/GPU don't both need to be working on the buffer at the same time (i.e. they need a lot of latency here to hide the time wasted in the async memcpy'ing).
Either way, reading from the resource on the CPU will be a slow operation due to the driver's choice to allocate the resource behind a WC bus.
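To make the two "map" strategies above a bit more concrete, here's a rough toy model in C++. It's only a sketch: ModelResource, mapDirect, mapStaged, unmapStaged and fillDirect are names I've made up for illustration, the "vram" is just an ordinary std::vector standing in for memory behind a WC mapping, and the copies are plain synchronous assignments rather than real DMA transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model only: every name here is hypothetical, and "vram_" is ordinary
// memory standing in for storage behind a write-combining mapping.
class ModelResource {
public:
    explicit ModelResource(std::size_t size) : vram_(size, 0) {}

    // Strategy 1: hand out a pointer to the real storage.
    // On real WC memory this pointer should be treated as write-only.
    std::uint8_t* mapDirect() { return vram_.data(); }

    // Strategy 2: copy the resource into a CPU-side staging buffer and hand
    // that out instead. A real driver would likely use its DMA engine here,
    // and a blocking Map would have to wait for that copy to finish.
    std::uint8_t* mapStaged() {
        staging_ = vram_;
        return staging_.data();
    }

    // Copy the (possibly modified) staging buffer back into the resource.
    void unmapStaged() {
        assert(staging_.size() == vram_.size()); // mapStaged() must come first
        vram_ = staging_;
        staging_.clear();
    }

    const std::vector<std::uint8_t>& contents() const { return vram_; }

private:
    std::vector<std::uint8_t> vram_;
    std::vector<std::uint8_t> staging_;
};

// Write-only fill for a direct mapping: the data is prepared in ordinary
// memory, then written in one sequential pass. Nothing is read back through
// `mapped`. The caller must ensure the mapping holds at least src.size() bytes.
inline void fillDirect(std::uint8_t* mapped, const std::vector<std::uint8_t>& src) {
    std::memcpy(mapped, src.data(), src.size());
}
```

With strategy 1 you'd call fillDirect(resource.mapDirect(), data) and never dereference the pointer for reading; with strategy 2 you can read and modify the pointer from mapStaged() freely, and the cost moves into the two bulk copies instead.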
If you know in advance that you'll want to be reading from the resource on the CPU, then you should tell the driver this information. It will then likely allocate the resource within Main RAM, accessed via the regular CPU cache. GPU writes to the resource will be slightly slower, but CPU-reads will be the same speed as any old bit of malloc'ed memory.
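As a rough sketch of the decision described above, here's how a driver might pick between the two memory types based on the hints you give it. This is an assumption-laden simplification, not how any particular driver works: MemoryType, UsageHints and choosePlacement are hypothetical names, and a real driver would weigh far more than a single flag.

```cpp
#include <cassert>

// Hypothetical names throughout; no real driver exposes this interface.
enum class MemoryType {
    Main, // write-back cached, fast for CPU reads
    Vram  // write-combined from the CPU's point of view, fast for the GPU
};

// The information the application gives the driver up front.
struct UsageHints {
    bool cpuReads  = false; // will the CPU ever read the resource back?
    bool cpuWrites = false; // will the CPU write to it after creation?
};

inline MemoryType choosePlacement(const UsageHints& hints) {
    // CPU reads through a WC mapping are the slow case, so a resource the
    // CPU will read back is better off in Main, at the cost of slower
    // GPU access.
    if (hints.cpuReads) {
        return MemoryType::Main;
    }
    // CPU writes through WC are fast, so cpuWrites alone doesn't change
    // the choice in this simplified model: GPU-only and CPU-write-only
    // resources both go to VRAM.
    return MemoryType::Vram;
}
```

An immutable, GPU-read-only texture corresponds to the default UsageHints and lands in VRAM; setting cpuReads flips the choice to Main.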
Edited by Hodgman, 10 March 2014 - 10:05 PM.