Mr_Fox

GPU read/write speed wrt Format


Hey Guys,

 

What's the GPU read/write speed for different texture/buffer formats, for example 32-bit, 16-bit, and 8-bit? In my experience, R32G32B32A32 (128-bit) is slower than R16G16B16A16 (64-bit), and the latter is slower than R8G8B8A8 (32-bit), so I assumed R32 should be the same as R8G8B8A8, R16 should be faster than R32, and R8 should be faster than R16. However, in my recent test, switching between R32, R16, and R8 (both as a texture and as a typed buffer) made no perf difference.

 

But why is there no perf difference for formats smaller than 32 bits? At the very least, the cache hit rate should be higher for a smaller format, so it should perform a little better (more data fits in the same cache line), right?

 

 

Thanks

 

P.S. Would you mind having a look at my other question? :)

 

This is specific to every GPU, not the API. Even a GTX970 and a GTX980 could give completely different data for this kind of profiling... Or even an Asus GTX980 vs an MSI GTX980!

If I reduce the amount of texture data being read but do not observe a reduction in the total execution time, then I would assume that the texture sampling instructions are not on the critical path, i.e. the shader is likely ALU-bound.


As a general guideline, you want the smallest number of bytes per pixel that you can use to represent your data. While there's no guarantee that smaller textures will be faster, they are very unlikely to be slower.

 

For textures that generally means that the block compressed formats like DXT1/BC1 are ideal. DXT1 is essentially 4 bits per pixel, which is smaller than any uncompressed format. Of course you lose a bit of image quality to get the size so small, but in almost all cases it's a good trade off.
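
To put numbers on the "4 bits per pixel" figure, here is a rough sketch that compares the size of one mip level in BC1 against uncompressed formats. It relies only on the fact that BC1/DXT1 stores each 4x4 block of texels in 8 bytes; the function names are made up for illustration and are not part of any graphics API.

```cpp
#include <cassert>
#include <cstddef>

// Bytes for one mip level of an uncompressed texture.
// Hypothetical helper, not a D3D/DXGI function.
inline std::size_t uncompressedBytes(std::size_t width, std::size_t height,
                                     std::size_t bitsPerTexel) {
    assert(width > 0 && height > 0 && bitsPerTexel > 0);
    return width * height * bitsPerTexel / 8;
}

// Bytes for one mip level of a BC1/DXT1 texture.
// Each 4x4 texel block takes 8 bytes (64 bits / 16 texels = 4 bits per texel).
// Partial blocks at the right/bottom edges are rounded up to whole blocks.
inline std::size_t bc1Bytes(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    const std::size_t blocksX = (width + 3) / 4;
    const std::size_t blocksY = (height + 3) / 4;
    return blocksX * blocksY * 8;
}
```

For a 1024x1024 texture, `bc1Bytes(1024, 1024)` gives 512 KiB, versus 4 MiB for `uncompressedBytes(1024, 1024, 32)` (R8G8B8A8) and 1 MiB for an 8-bit single-channel format.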

 

Having small textures helps for several reasons:

 

1. You can fit more stuff in video memory. If you run out of memory on the card, then performance will tend to suffer as the driver is forced to move data between the GPU and main memory.

2. GPUs have texture caches. The more pixels that fit in the cache, the better the performance should be.

3. Small textures use less of the available memory bandwidth.

4. You can load small textures from disc faster, and they take less storage space.
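
Points 2 and 3 can be made concrete with a back-of-the-envelope calculation of how many texels fit in one cache line for each format. This is only a sketch: the 64-byte line size is an assumption picked for illustration (actual GPU cache line sizes vary by vendor and generation), and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: 64 bytes is an illustrative cache line size, not a value
// taken from any particular GPU. Check your hardware documentation.
constexpr std::size_t kAssumedCacheLineBytes = 64;

// Number of whole texels of the given size that fit in one cache line.
inline std::size_t texelsPerCacheLine(std::size_t bitsPerTexel) {
    assert(bitsPerTexel > 0);
    return kAssumedCacheLineBytes * 8 / bitsPerTexel;
}
```

Under that assumption a line holds 4 texels of R32G32B32A32, 16 of R32 or R8G8B8A8, 32 of R16, and 64 of R8. Whether that extra density shows up as a measurable speedup still depends on whether the shader is actually limited by memory traffic.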

 

Having said that, GPUs do a lot of work to try to hide the bandwidth and latency costs of memory accesses, so how much performance impact a different texture format has depends very much on exactly what you're doing.



Thanks Hodgman, any recommendations for a profiling tool to find the shader critical path on PC?



Thanks Adam_42, so you mean there are no extra pack/unpack ops when reading or writing texture/buffer elements smaller than 32 bits? I have a volume texture in which each voxel is purely a flag, so you suggest I should try DXGI_FORMAT_R1_UNORM without worrying about a slowdown? (I always thought atomic operations on formats smaller than 32 bits were slow because of GPU data alignment issues.) Also, it would be great if you could point me to some resources/samples about using compressed formats like DXT/BC. I find it really hard to find those topics online.

 

Thanks again

Edited by Mr_Fox


 

I have a volume texture in which each voxel is purely a flag, so you suggest I should try DXGI_FORMAT_R1_UNORM

 

One-bit voxels? Have you tried a 4x4x4 bit cube in a uint64 buffer instead of a texture?
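
A minimal sketch of what such a 4x4x4 bit cube could look like on the CPU side, packing 64 one-bit voxels into a single 64-bit word. The bit layout (x fastest, then y, then z) and all function names are assumptions made for illustration; any consistent layout would work.

```cpp
#include <cassert>
#include <cstdint>

// Bit index of voxel (x, y, z) inside a 4x4x4 brick.
// ASSUMPTION: x + 4*y + 16*z is just one possible layout.
inline unsigned brickBitIndex(unsigned x, unsigned y, unsigned z) {
    assert(x < 4 && y < 4 && z < 4);
    return x + 4u * y + 16u * z;
}

// Returns true if the voxel's flag bit is set in the brick.
inline bool testVoxel(std::uint64_t brick, unsigned x, unsigned y, unsigned z) {
    return ((brick >> brickBitIndex(x, y, z)) & 1u) != 0;
}

// Returns a copy of the brick with the voxel's flag bit set.
inline std::uint64_t setVoxel(std::uint64_t brick, unsigned x, unsigned y, unsigned z) {
    return brick | (std::uint64_t{1} << brickBitIndex(x, y, z));
}
```

One 8-byte load then brings in a whole 4x4x4 neighbourhood, and `brick == 0` tells you the entire neighbourhood is empty without testing individual voxels.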

 

 

I thought of packing multiple voxels into one buffer element, but then every update to a voxel within that element (which will be the general case in my project) needs to be atomic, right? That may be much slower. Also, this volume will be accessed during a raycasting pass, which means non-linear access, and random access to a buffer is slower than to a (swizzled) texture. Or are there better ways?
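
For reference, the atomic update in question is a single atomic OR on the packed word. The sketch below is a CPU analogue using `std::atomic`, not shader code; on the GPU the equivalent would be the shading language's atomic OR (for example HLSL's `InterlockedOr`), and its cost relative to a plain write is hardware-dependent and would need measuring. The function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically sets one flag bit in a shared 64-bit word.
// Because fetch_or is a single read-modify-write, concurrent writers that
// touch different bits of the same word cannot overwrite each other.
// Returns true if this call changed the bit from 0 to 1.
inline bool atomicSetBit(std::atomic<std::uint64_t>& word, unsigned bit) {
    assert(bit < 64);
    const std::uint64_t mask = std::uint64_t{1} << bit;
    const std::uint64_t previous = word.fetch_or(mask, std::memory_order_relaxed);
    return (previous & mask) == 0;
}
```

The return value falls out of the atomic for free and can be handy, e.g. for counting newly filled voxels.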

 

Thanks

Edited by Mr_Fox


I consider the bit cube idea extremely interesting e.g. for AO.

On top of those cubes you can have a parent volume with a bit set if any bit of the child cube is set -> a nice hierarchy to quickly determine empty space, and it can be sparse...
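
A rough CPU-side sketch of that parent level, assuming each parent word summarizes 64 child bricks (one bit per child, set when the child has any voxel set). The grouping of exactly 64 children per parent and the function names are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Builds one parent word from 64 child bricks: bit i of the result is set
// if child i contains at least one set voxel.
inline std::uint64_t buildParentBrick(const std::array<std::uint64_t, 64>& children) {
    std::uint64_t parent = 0;
    for (unsigned i = 0; i < 64; ++i) {
        if (children[i] != 0) {
            parent |= std::uint64_t{1} << i;
        }
    }
    return parent;
}

// True if the region covered by child `childIndex` is completely empty,
// so a ray can skip it without loading the child brick at all.
inline bool regionIsEmpty(std::uint64_t parent, unsigned childIndex) {
    assert(childIndex < 64);
    return ((parent >> childIndex) & 1u) == 0;
}
```

With 4x4x4 children per parent, one parent word covers a 16x16x16 voxel region, and empty child bricks need not be stored, which is where the sparsity comes from.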

Also you can copy a big block of volume data to LDS and process all rays that intersect it -> good ALU utilization with little bandwidth. Or you could... ( too many simultaneous ideas :)

That's not non-linear access, and even if it were, a work-efficient algorithm with non-linear access is still faster than brute force if the problem is complex enough. And even if that's wrong, it's only 1/8th of the memory!

 

There is this paper about Dreams, in case you don't know it yet: http://www.mediamolecule.com/blog/article/siggraph_2015

Their entire renderer works with atomic splatting and they also have working AO like this.

