What performance will AMD's HBM bring for graphics programmers?

As graphics programmers, can we rely on better GPU-to-main-memory transfer times, or just better VRAM-to-GPU-cache transfer times?

There's been some discussion lately about how VRAM requirements are increasing and 4 GB is 'not enough'. Is HBM going to bring a different outlook to the table, especially with current AMD cards only supporting up to 4 GB?

I'm interested to hear from fellow, more experienced graphics developers on this situation.

This doesn't belong in this forum.

The thing is simple: you already get +75% bandwidth, so you can get up to 75% more performance, as long as you are bandwidth-limited and fit within the memory budget (see the rough back-of-envelope sketch below).

It seems there's also an improvement in latency.

It might be a game-changer for manufacturing, but it really isn't much from a programming point of view. It's "just" faster memory. Except it's way faster. And, as a flagship, it's fairly affordable.
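For a rough sense of scale, here is a back-of-envelope sketch. Every number in it (resolution, G-buffer size, the 320 GB/s baseline) is an illustrative assumption rather than a measurement, and the +75% is simply the figure quoted above:

#include <cstdio>

int main()
{
    // Illustrative assumptions: a 1080p pass that writes a fat G-buffer once
    // and reads it back once during lighting. Byte counts are hypothetical.
    const double pixels          = 1920.0 * 1080.0;
    const double bytesPerPixel   = 16.0 + 4.0;         // 4x RGBA8 targets + depth
    const double touchesPerPixel = 2.0;                // one write pass + one read pass
    const double bytesPerFrame   = pixels * bytesPerPixel * touchesPerPixel;

    const double baseBandwidth = 320e9;                // assumed GDDR5-class card, bytes/s
    const double hbmBandwidth  = baseBandwidth * 1.75; // the "+75%" figure above

    // Lower bound on the time spent just moving this data, ignoring caches
    // and compression.
    std::printf("GDDR5-class: %.3f ms\n", bytesPerFrame / baseBandwidth * 1e3);
    std::printf("HBM (+75%%): %.3f ms\n", bytesPerFrame / hbmBandwidth * 1e3);
    return 0;
}

Either way the pass gets cheaper, but only because it was bandwidth-bound to begin with.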

Previously "Krohm"

Yeah, I know it's not related to programming per se, but I wondered whether techniques could change to take advantage of it.

Mods feel free to move it to off-topic.

I'm anxious to see the inevitable CPU + GPU + HBM on a single 'chip'. That'll be awesome!

The problem with GPU to system memory transfers isn't interface or memory speeds, it's the huge pipeline stall that they require.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

The problem with GPU to system memory transfers isn't interface or memory speeds, it's the huge pipeline stall that they require.


Which is exactly why a true shared memory architecture would be great. You could actually use fine-grained algorithms as no data would need to be transferred, simply a pointer/handle here or there.

Well, there are a few consoles out there that have unified memory and CPU/GPU on the same chip ;)

Actually, Intel PCs with integrated graphics are already in this category too...

You benefit from not having to waste time physically moving data between two different RAM locations, and you can flexibly increase how much video memory you have by stealing it from system memory :)

Anything where you dynamically generate data on the CPU each frame will benefit, as you don't have to stream it over the PCIe bus (see the sketch below).

Of course, you need high-performance system memory for that, otherwise you're shackling the GPU -- e.g. existing Intel systems have DDR3, but using GDDR5 would be awesome.
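To make the per-frame streaming case concrete, here is a minimal D3D11-style sketch of uploading CPU-generated vertex data each frame. The function and buffer names are hypothetical; it assumes a vertex buffer created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE, and error handling is kept to a minimum.

#include <d3d11.h>
#include <cstring>

// Hypothetical per-frame upload of CPU-generated vertices (particles, UI, etc.).
// 'dynamicVB' is assumed to have been created with D3D11_USAGE_DYNAMIC and
// D3D11_CPU_ACCESS_WRITE, and byteCount must not exceed its size.
void UploadDynamicVertices(ID3D11DeviceContext* ctx,
                           ID3D11Buffer* dynamicVB,
                           const void* cpuVerts,
                           size_t byteCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // WRITE_DISCARD hands us a fresh region, so we never stall on the GPU
    // still reading last frame's copy. On a discrete card the driver then
    // streams the data over PCIe; on a unified-memory system the same call
    // can boil down to writing straight into memory the GPU reads directly.
    if (SUCCEEDED(ctx->Map(dynamicVB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, cpuVerts, byteCount);
        ctx->Unmap(dynamicVB, 0);
    }
}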

The problem with GPU to system memory transfers isn't interface or memory speeds, it's the huge pipeline stall that they require.

Which is exactly why a true shared memory architecture would be great. You could actually use fine-grained algorithms as no data would need to be transferred, simply a pointer/handle here or there.
I assume he's talking about the fact that the GPU usually has 1+ frame of latency, so the CPU must wait before reading the data (else it will stall). That disadvantage doesn't change, but yes, transfers would be eliminated. There are still complications on top of just having a pointer -- e.g. knowing when/which caches to flush/invalidate/wait for, to ensure coherency. But that's something that current APIs do for you :)

Yes, that's exactly what I'm talking about. The problem isn't memory transfers -- whether or not they happen, or how fast they are if they do. The problem is that the CPU and GPU are two separate processors that operate asynchronously. If you have one frame of latency and you need to do a readback every frame, the fastest memory transfer in the world (or no memory transfer at all) won't help you; the CPU ends up waiting on the GPU instead of running ahead of it, so you'll still halve your framerate.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.
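For anyone who hasn't run into this, here is a rough D3D11-style sketch of the usual workaround: copy results into a small ring of staging buffers each frame and map the copy made a couple of frames ago, so the Map (usually) doesn't block. The buffer names and the latency count are hypothetical, and the first few warm-up frames are ignored.

#include <d3d11.h>

static const int kLatencyFrames = 3;

// 'staging' is assumed to be a ring of D3D11_USAGE_STAGING buffers with
// D3D11_CPU_ACCESS_READ, each the same size as 'gpuResult'.
void ReadbackWithLatency(ID3D11DeviceContext* ctx,
                         ID3D11Buffer* gpuResult,
                         ID3D11Buffer* staging[kLatencyFrames],
                         unsigned frameIndex)
{
    // Queue this frame's copy; the GPU executes it whenever it gets there.
    ctx->CopyResource(staging[frameIndex % kLatencyFrames], gpuResult);

    // Map the oldest copy in the ring (issued kLatencyFrames - 1 frames ago).
    // If the GPU still hasn't finished it, Map blocks -- which is exactly the
    // stall being discussed above.
    ID3D11Buffer* oldest = staging[(frameIndex + 1) % kLatencyFrames];
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(oldest, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // ... consume mapped.pData on the CPU ...
        ctx->Unmap(oldest, 0);
    }
}

The data you get back is a few frames stale, which is fine for things like queries or average luminance, but it's no help if you genuinely need this frame's result on the CPU this frame -- which is the case being discussed.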

Yes, that's exactly what I'm talking about. The problem isn't memory transfers -- whether or not they happen, or how fast they are if they do. The problem is that the CPU and GPU are two separate processors that operate asynchronously. If you have one frame of latency and you need to do a readback every frame, the fastest memory transfer in the world (or no memory transfer at all) won't help you; the CPU ends up waiting on the GPU instead of running ahead of it, so you'll still halve your framerate.


With current APIs, sure. With ones designed around shared memory, not necessarily. On a shared-memory system there's no reason why, after issuing a command, it couldn't be running on the GPU nanoseconds (or at the very least microseconds) later, provided the GPU wasn't busy of course. With that level of fine-grained control you could easily switch back and forth between CPU and GPU multiple times per frame.

Even with a shared memory architecture, it doesn't mean that you can suddenly run the CPU and GPU in lockstep with no consequences. Or at least, certainly not in the general case of issuing arbitrary rendering commands. What happens when the CPU issues a draw command that takes 3 milliseconds on the GPU? Does the CPU now sit around for 3ms waiting for the GPU to finish? It also totally breaks the concurrency model exposed by D3D12/Mantle/Vulkan, which are all based around the idea of different threads writing commands to separate command buffers that are later submitted in batches. On top of that, the GPU hardware that I'm familiar with is very much built around this submission model, and requires kernel-level access to privileged registers in order to submit command buffers. So it's certainly not something you'd want to do after every draw or dispatch call.

Obviously these problems aren't insurmountable, but I think at the very least you would need a much tighter level of integration between CPU and GPU for the kind of generalized low-latency submission that you're talking about. Currently the only way to get anywhere close to that is to use async compute on AMD GPUs, which is specifically designed to let you submit small command buffers containing compute jobs with minimal latency. With shared memory and careful cache management it is definitely possible to get your async compute results back pretty quickly, but that's only going to work well for a certain category of short-running tasks that don't need a lot of GPU resources to execute.
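To illustrate the submission model being described (a generic sketch, not the poster's actual code), here is roughly what a small async compute job looks like in D3D12: record into a command list, submit the batch on a compute queue, then use a fence to find out when the CPU can safely read the results. All of the objects are hypothetical and assumed to be created (and the command list closed) beforehand; resource bindings and error handling are omitted.

#include <d3d12.h>
#include <windows.h>

// Hypothetical helper: submit one small compute job and wait for its results.
// 'list' must be in the closed state before this call, as D3D12 requires
// before Reset(); binding of UAVs/descriptors is omitted for brevity.
void SubmitSmallComputeJob(ID3D12CommandQueue* computeQueue,
                           ID3D12CommandAllocator* allocator,
                           ID3D12GraphicsCommandList* list,
                           ID3D12PipelineState* computePSO,
                           ID3D12RootSignature* rootSig,
                           ID3D12Fence* fence,
                           UINT64& fenceValue)
{
    // 1. Record the work. In a real engine many threads would each record
    //    into their own command list in parallel.
    allocator->Reset();
    list->Reset(allocator, computePSO);
    list->SetComputeRootSignature(rootSig);
    list->Dispatch(64, 1, 1);                       // hypothetical small job
    list->Close();

    // 2. Submit the whole batch in one call -- this is the relatively heavy,
    //    kernel-assisted step, so the APIs are built around doing it rarely.
    ID3D12CommandList* lists[] = { list };
    computeQueue->ExecuteCommandLists(1, lists);

    // 3. Fence so the CPU knows when the results are safe to read. The wait
    //    below is where the latency discussed above shows up.
    computeQueue->Signal(fence, ++fenceValue);
    if (fence->GetCompletedValue() < fenceValue)
    {
        HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        fence->SetEventOnCompletion(fenceValue, evt);
        WaitForSingleObject(evt, INFINITE);
        CloseHandle(evt);
    }
}

As noted above, that round trip can be short on the right hardware, but it only pays off for small, short-running jobs.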

This topic is closed to new replies.
