
Member Since 06 Sep 2004
Offline Last Active Sep 21 2016 03:07 PM

Posts I've Made

In Topic: GPU support for D3D12_CROSS_NODE_SHARING_TIER

03 August 2015 - 02:39 PM



The last post by "ManuelG" shows the Win10 DX caps viewer output for several different GPUs. However, flags beyond 11.3 are not available, so who knows whether D3D12_CROSS_NODE_SHARING_TIER is supported or not... I guess it's still too early.

In Topic: GPU support for D3D12_CROSS_NODE_SHARING_TIER

03 August 2015 - 02:23 PM

Well... GPU manufacturers tend to be sketchy. "Fully supports" can mean supporting only the lower tiers. I'd say that a GPU that reports D3D12_CROSS_NODE_SHARING_TIER_1_EMULATED may still claim DirectX 12 support. It'd be nice to get a clear list of the tier levels each GPU supports for the D3D12_FEATURE_DATA_D3D12_OPTIONS structure (https://msdn.microsoft.com/en-us/library/windows/desktop/Dn770364(v=VS.85).aspx). I've made the mistake in the past of buying a GPU that claimed full support of DX11.2, only to be caught out when the features I wanted turned out to be supported only at higher tier levels.

In Topic: Weird Direct Compute performance issue

01 March 2015 - 12:42 PM

Thank you for your reply.

Each pass sorts 2 bits at a time and issues 3 dispatches: the 1st creates the block sums buffer mentioned in the paper, the 2nd does a prefix sum on it, and the 3rd scatters and sorts. So 12-bit keys need 6*3 = 18 dispatches, and 14-bit keys need 7*3 = 21 dispatches. Regardless of the number of bits in the keys, every pass uses the same kernels for those 3 dispatches.
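To make the pass structure concrete, here is a minimal CPU-side sketch in C++ of the three steps per pass, plus the dispatch arithmetic. It is not the actual compute shader code: the block size, the digit-major layout of the counts, and all names (dispatchCount, radixPass, radixSort) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for this sketch, not taken from the real kernels.
constexpr unsigned kBitsPerPass = 2;               // each pass sorts 2 bits
constexpr unsigned kBuckets = 1u << kBitsPerPass;  // 4 possible digit values
constexpr unsigned kDispatchesPerPass = 3;         // block sums, prefix sum, scatter
constexpr std::size_t kBlockSize = 256;            // assumed elements per block

// Number of dispatches needed to sort keys of the given bit width.
unsigned dispatchCount(unsigned keyBits) {
    assert(keyBits % kBitsPerPass == 0);
    return (keyBits / kBitsPerPass) * kDispatchesPerPass;
}

// One stable pass over the 2-bit digit that starts at bit `shift`.
std::vector<std::uint32_t> radixPass(const std::vector<std::uint32_t>& keys,
                                     unsigned shift) {
    const std::size_t n = keys.size();
    const std::size_t blocks = (n + kBlockSize - 1) / kBlockSize;

    // Step 1 ("block sums"): per-block digit counts, stored digit-major so
    // that one linear scan produces global scatter offsets.
    std::vector<std::uint32_t> counts(kBuckets * blocks, 0);
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned digit = (keys[i] >> shift) & (kBuckets - 1);
        ++counts[digit * blocks + i / kBlockSize];
    }

    // Step 2: exclusive prefix sum over the counts.
    std::uint32_t running = 0;
    for (auto& c : counts) {
        const std::uint32_t v = c;
        c = running;
        running += v;
    }

    // Step 3: scatter each key to its position for this digit.
    std::vector<std::uint32_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned digit = (keys[i] >> shift) & (kBuckets - 1);
        out[counts[digit * blocks + i / kBlockSize]++] = keys[i];
    }
    return out;
}

// Full sort: one pass per 2-bit digit, least significant digit first.
std::vector<std::uint32_t> radixSort(std::vector<std::uint32_t> keys,
                                     unsigned keyBits) {
    for (unsigned shift = 0; shift < keyBits; shift += kBitsPerPass)
        keys = radixPass(keys, shift);
    return keys;
}
```

On the GPU each of the three steps would be its own dispatch, which is where the factor of 3 in dispatchCount comes from. The amount of work per pass does not depend on the key width, only the number of passes does.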


Here are some more benchmarks.


Sorting 50,000 values:

2-bit integers:   ~0.18 ms
4-bit integers:   ~0.37 ms
6-bit integers:   ~0.55 ms
8-bit integers:   ~0.74 ms
10-bit integers:  ~0.92 ms
12-bit integers:  ~1.09 ms
14-bit integers:  ~8.92 ms
16-bit integers:  ~10.00 ms
32-bit integers:  ~11.45 ms


Sorting 10,000 values:

2-bit integers:   ~0.10 ms
4-bit integers:   ~0.19 ms
6-bit integers:   ~0.27 ms
8-bit integers:   ~0.36 ms
10-bit integers:  ~0.45 ms
12-bit integers:  ~0.54 ms
14-bit integers:  ~8.08 ms
16-bit integers:  ~9.47 ms
32-bit integers:  ~11.46 ms



If you're interested, I can provide the source code (which is already on Bitbucket) or upload the executable so that you can benchmark on your own computer.

In Topic: Tiled Resources & Large RWByteAddressBuffers

05 September 2014 - 10:32 AM

I haven't used tiled resources yet, but I think you have to perform calculations that depend on the tiled resources' address mapping model on the CPU, and send the results to your shader, treating them as relative to the start of the region of active tiles. Also, I think that the RWByteAddressBuffer you access in your shader is not actually the whole buffer, but only the active tiles, so the RWByteAddressBuffer's address 0 actually corresponds to the tiled resource's pDestTileRegionStartCoordinate value set with ID3D11DeviceContext2::UpdateTiles, and RWByteAddressBuffer::GetDimensions returns only the size of the active tiles...


Also, there's no such thing as a negative uint. :)


Hey, thanks for the reply. You're right about uints not being negative... that was silly of me. I wrote that because I was storing the values and reading them back on the CPU as integers. Regardless, the returned values are incorrect once the tiled resource RWByteAddressBuffer is larger than what can be addressed with a 32-bit uint.

With tiled resources, indexing into buffers remains the same. If you hit a sparse area, though, the behavior differs depending on the "tier level" supported by your GPU. In my case, I map a single "dummy" physical tile to every sparse tile of the buffer. Though inefficient, this means any store to a sparse area lands in the dummy tile.


I'm pretty sure the code is correct since I tested the kernel on a smaller buffer. The problem is that the API allows you to create really huge tiled resources (since memory isn't allocated until you actually use UpdateTiles(...)), but doesn't let you access areas of byte addressable buffers that are beyond 2^32 bytes. The only options I currently see are either to bind multiple buffers to the kernel and implement some sort of logic that spills over to the next buffer once you reach areas that are no longer addressable, or to rethink my algorithm as a whole :(.
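The "spill over to the next buffer" idea boils down to splitting one large logical offset into a buffer index and a 32-bit offset inside that buffer. Below is a rough C++ sketch of that arithmetic only. The 2 GiB per-buffer size and the names (kBytesPerBuffer, BufferAddress, splitAddress) are hypothetical, and the shader-side equivalent is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-buffer size: 2 GiB, so every in-buffer offset fits in a 32-bit uint.
constexpr std::uint64_t kBytesPerBuffer = 1ull << 31;

struct BufferAddress {
    std::uint32_t bufferIndex;  // which of the bound buffers to access
    std::uint32_t byteOffset;   // byte offset inside that buffer
};

// Split a logical 64-bit byte offset into (buffer index, 32-bit offset).
BufferAddress splitAddress(std::uint64_t globalByteOffset) {
    const std::uint64_t index = globalByteOffset / kBytesPerBuffer;
    assert(index <= 0xFFFFFFFFull);  // the index itself must fit in 32 bits
    BufferAddress a;
    a.bufferIndex = static_cast<std::uint32_t>(index);
    a.byteOffset = static_cast<std::uint32_t>(globalByteOffset % kBytesPerBuffer);
    return a;
}
```

For example, a logical offset of 2^32 + 8 would land at byte 8 of the third buffer (index 2). Choosing a power-of-two buffer size keeps the divide and modulo cheap, since they reduce to a shift and a mask.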

In Topic: Run DirectX 11 stream output without drawing

07 June 2014 - 04:03 PM

Hey unbird, thanks for your reply.



Edit: Is this for vanilla DX11? Because 11.1 allows writing to UAVs from every shader stage. You wouldn't even need stream out functionality.


You know, I recently bought a GTX 770 card that claims to support 11.2. Scattered RW from any stage was one feature I really wanted on my new GPU. Turns out NVIDIA does not support the full feature set. I do find it a bit misleading when the specifications fail to mention that the card is merely "capable" of the 11.2 feature set rather than fully supporting it.