On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.
I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.
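Roughly like this -- just a sketch of the idea, where StreamingFile, GpuDevice, cpuMalloc and gpuMalloc are hypothetical names standing in for the engine's file reader, its thin device wrapper, and the platform allocators:

// Header goes into ordinary cached CPU memory; pixels go straight into
// write-combined GPU memory. Creating the texture is then just bookkeeping.
void StreamTexture(StreamingFile& file, GpuDevice& device)
{
    TextureHeader* header = (TextureHeader*)cpuMalloc(sizeof(TextureHeader));
    file.Read(header, sizeof(TextureHeader));

    size_t pixelBytes = file.Size() - sizeof(TextureHeader);
    void* pixels = gpuMalloc(pixelBytes);   // "GPU malloc" -- no API object exists yet
    file.Read(pixels, pixelBytes);          // stream directly into place

    device.CreateTexture(header, pixels);   // very little left to do here
}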
Interesting, I'm unfamiliar with the concept. Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?
Xb360, XbOne, PS4 are known to have unified memory, but that's not important here -- unified memory just means the CPU and GPU share the same physical pool of RAM, as opposed to a typical PC where some RAM is physically on the motherboard and some is physically on the GPU's PCIe card. That distinction doesn't matter here.
What matters is whether you can map "GPU memory" (whether that's physically on the GPU, or unified memory) into the CPU's address space.
Pointers in C, C++ (or even in assembly!) aren't physical addresses - they're virtual addresses, which the hardware translates into addresses of some physical resource, which doesn't even have to be RAM! A pointer might refer to RAM on the motherboard, RAM on a PCIe card, a file on an HDD, a register in an IO device, etc...
The act of making a pointer (aka virtual address) correspond to a physical resource is called mapping.
When you call malloc, you're allocating some virtual address space (a contiguous range of pointer values), allocating a range of physical RAM, and then mapping that RAM to those pointers.
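To make that decomposition concrete, here are the same two steps spelled out with the Win32 virtual-memory API (the exact calls aren't important -- mmap/mprotect do the same job elsewhere):

#include <windows.h>

// 1) Grab a range of virtual addresses -- just pointer values, no RAM behind them yet.
void* p = VirtualAlloc(NULL, 64 * 1024, MEM_RESERVE, PAGE_NOACCESS);

// 2) Map physical storage to that range -- only now can the CPU actually read/write it.
VirtualAlloc(p, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);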
In the same way, some OS's let you allocate RAM that's physically on your PCIe GPU, but map it to a range of pointers so the CPU can use it.
This just magically works, except it will of course be slower than usual when you use it, because there's a bigger physical distance between the CPU and GPU-RAM than there is between the CPU and motherboard-RAM (plus the buses on that physical path are slower).
So, if you can obtain a CPU-usable pointer (aka virtual address) into GPU-usable physical RAM, then your streaming system can stream resources directly into place, with zero graphics API involvement!
Yes, this is mostly reserved for game console devs... :(
But maybe Mantle/GLNext/D3D12 will bring it into PC land.
GL4 has already kinda added support for it though! You can't completely do your own resource management (no "GPU malloc"), but you can actually map GPU-RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a texture with the appropriate size/format, but no initial data, and then map it using the "PERSISTENT" flag.
This maps the texture's GPU-side allocation into CPU address space so you can write/stream into it.
You should avoid the "COHERENT" flag, as this will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when reading from the texture :(
If not specifying COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU will automatically bypass its own caches (as you'll only be writing to these addresses, never reading from them) and will queue up your writes into a "write combining buffer" to do more efficient bulk transfers through to the GPU (even if your code is only writing one byte at a time). The only catch is you have to call a GL fence function when you've finished writing/streaming, which will ensure the write-combine buffer is flushed out completely, and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.
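A rough sketch of that flow on GL 4.4, going through a persistently-mapped pixel-unpack buffer as the CPU-visible allocation (width/height/pixelData are placeholders, error checking omitted):

GLuint pbo, tex;
const GLsizeiptr size = width * height * 4;

// Immutable storage we're allowed to keep mapped while the GPU uses it. No COHERENT bit.
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferStorage(GL_PIXEL_UNPACK_BUFFER, size, NULL,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT);

// Persistent map -- FLUSH_EXPLICIT so we control when our writes become visible.
void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_FLUSH_EXPLICIT_BIT);

// ... stream pixel data into 'ptr' (e.g. from the loader thread) ...
memcpy(ptr, pixelData, size);

// Make the write-combined writes visible to the GPU, then copy into the texture.
glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size);

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);   // sources from the bound PBO

// Fence so we know when the GPU has finished reading before reusing/unmapping the buffer.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000 /* 1s, in ns */);
glDeleteSync(fence);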
AFAIK, the background context DMA transfer method is probably the fastest though.
As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high-level code whether multithreaded resource creation is going to be fast or will incur a performance penalty.
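Something along these lines -- the names are made up for illustration:

struct GpuCapabilities
{
    bool concurrentResourceCreation;   // can loader threads create resources cheaply?
    // ... other caps ...
};

// High-level streaming code just branches on it:
if (caps.concurrentResourceCreation)
    CreateTextureOnLoaderThread(job);
else
    QueueCreationForMainThread(job);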
How do you check for this? Is it a simple divide like PC vs. consoles or can it also relate to graphics card/driver version?
If I'm emulating the feature myself (e.g. on D3D9), I know to simply set that boolean to false :D
On D3D11, you can ask the device if multithreaded resource creation is performed natively by the driver (fast) or emulated by the MS D3D runtime (slower).
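That query looks like this (assuming you already have an ID3D11Device* called device):

D3D11_FEATURE_DATA_THREADING threading = {};
device->CheckFeatureSupport(D3D11_FEATURE_THREADING, &threading, sizeof(threading));

// TRUE: the driver handles concurrent creates natively (fast).
// FALSE: the MS runtime emulates it behind a lock (slower).
bool fastConcurrentCreates = (threading.DriverConcurrentCreates == TRUE);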
On GL you have to use some vendor-specific knowledge to decide if you'll even attempt background-context resource creation (or whether you'll emulate multithreaded resource creation yourself), and make an educated guess as to whether the vendor is going to actually optimize that code path by using DMA transfers, or whether you'll actually have fallen onto a slow path... Fun times...
As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!
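The shape of that path is something like the sketch below. Creating the shared context is platform-specific (wglCreateContext + wglShareLists on Windows, etc.), so it's hidden behind hypothetical helpers here; mipCount/width/height/pixels are placeholders:

void LoaderThreadMain(PlatformWindow* window, GLContext mainContext)
{
    // Hypothetical helpers wrapping the platform's shared-context creation.
    GLContext loaderContext = CreateSharedGLContext(window, mainContext);
    MakeContextCurrent(loaderContext);

    // Upload on this thread; on drivers with the fast path this becomes an async DMA.
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexStorage2D(GL_TEXTURE_2D, mipCount, GL_RGBA8, width, height);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    // Fence + flush so the main context only touches the texture once the sync signals.
    GLsync ready = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();
    PublishTextureToMainThread(tex, ready);   // hypothetical hand-off
}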
Neat, and all you have to do is kick off a DMA on a concurrent context?
As with everything in GL, there are multiple ways to do it, it might silently be emulated really slowly, it might be buggy, you'll have to test on every vendor's GPUs, and each vendor probably has conflicting guidelines on how best to implement it.
Besides that, it's pretty simple :lol: