hiya83

  1. OK, I hate my life... apparently draw-call bindings are different from dispatch bindings?? The buffer is written as a UAV from a draw call's pixel shader (bound via OMSetRenderTargetsAndUnorderedAccessViews to slot u1, since the RTV takes u0 by default). In the following dispatch, which reads it as an SRV, I call CSSetUnorderedAccessViews with 2 other UAVs, thinking that would overwrite the bindings from the previous draw call. Apparently not! I have to call OMSetRenderTargetsAndUnorderedAccessViews again with nulls before the dispatch, otherwise it complains about the resource being double-bound as SRV/UAV... I guess the draw-call and dispatch bind slots are separate?? (Sketch of the working sequence below.)
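     A rough sketch of the sequence that works, with placeholder names (ctx, rtv, dsv, bufUAV, bufSRV, otherUAVs etc. are not the real identifiers):

         // Draw pass: write the buffer through a UAV bound on the graphics (OM) stage.
         ID3D11UnorderedAccessView* drawUAVs[] = { bufUAV };
         ctx->OMSetRenderTargetsAndUnorderedAccessViews(
             1, &rtv, dsv,                    // RTV takes u0...
             1, 1, drawUAVs, nullptr);        // ...so the UAV goes to slot u1
         ctx->Draw(vertexCount, 0);

         // Explicitly clear the graphics-stage UAV slot before the dispatch;
         // CSSetUnorderedAccessViews only touches the compute-stage slots.
         ID3D11UnorderedAccessView* nullUAV[] = { nullptr };
         ctx->OMSetRenderTargetsAndUnorderedAccessViews(
             1, &rtv, dsv,
             1, 1, nullUAV, nullptr);

         // Compute pass: the buffer can now be bound as an SRV without the double-bind warning.
         ctx->CSSetShaderResources(0, 1, &bufSRV);
         ctx->CSSetUnorderedAccessViews(0, 2, otherUAVs, nullptr);
         ctx->Dispatch(groupsX, groupsY, 1);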
  2. I am trying to use a buffer as a RWStructuredBuffer in one call and then reuse it as a StructuredBuffer in another call to read the data back, but that's not working. Can you not do that? There is no error thrown when binding the buffer as both ShaderResource & UnorderedAccess, and it works fine if I replace the StructuredBuffer with a RWStructuredBuffer in the reading call. I am also already doing the following (setup sketch below):
     - using StructuredBuffer<uint> : register(tx) vs RWStructuredBuffer<uint> : register(ux) in the shaders
     - using CSSetShaderResources for the structured view vs CSSetUnorderedAccessViews for the RW-structured one
     I am hoping to use this buffer in different shaders (VS, PS, CS, but these are all separate draw calls/dispatches), and since a RWStructuredBuffer is not usable in a VS AFAIK, I was hoping I could bind it as both structured and RW-structured. Any thoughts? Thanks!
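     For reference, roughly how the buffer and its two views are set up on my side (placeholder names; device, buffer, srv, uav and elementCount are not the real identifiers):

         // One ID3D11Buffer, created so it can back both a StructuredBuffer (SRV)
         // and a RWStructuredBuffer (UAV).
         D3D11_BUFFER_DESC bd = {};
         bd.ByteWidth           = elementCount * sizeof(UINT);
         bd.Usage               = D3D11_USAGE_DEFAULT;
         bd.BindFlags           = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
         bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
         bd.StructureByteStride = sizeof(UINT);
         device->CreateBuffer(&bd, nullptr, &buffer);

         // SRV for the read-only (StructuredBuffer) binding.
         D3D11_SHADER_RESOURCE_VIEW_DESC srvd = {};
         srvd.Format              = DXGI_FORMAT_UNKNOWN;
         srvd.ViewDimension       = D3D11_SRV_DIMENSION_BUFFER;
         srvd.Buffer.FirstElement = 0;
         srvd.Buffer.NumElements  = elementCount;
         device->CreateShaderResourceView(buffer, &srvd, &srv);

         // UAV for the read/write (RWStructuredBuffer) binding.
         D3D11_UNORDERED_ACCESS_VIEW_DESC uavd = {};
         uavd.Format              = DXGI_FORMAT_UNKNOWN;
         uavd.ViewDimension       = D3D11_UAV_DIMENSION_BUFFER;
         uavd.Buffer.FirstElement = 0;
         uavd.Buffer.NumElements  = elementCount;
         device->CreateUnorderedAccessView(buffer, &uavd, &uav);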
  3. "Depth and stencil testing happen at the same time - so either both early or both late. If you need late depth testing then you're out of luck."
     That's unfortunate; I guess I'll try a manual stencil test at the start of the pixel shader. I'm guessing it'll probably complain about divergence.
     I also wonder if it works for stencil too - i.e. if I enable that for depth, will it do the stencil test at the same time it does the early Z? I can give it a test, thanks
  4. I'll be darned, UpdateSubresource is actually faster: low-3 ms instead of high-3 ms. Not ideal yet, but it's better. Thanks for the tip! :D (Sketch of the call below.)
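     Roughly the call I switched to, in case anyone else hits this (placeholder names; assumes a DEFAULT-usage BC7 Texture2D being replaced whole each frame):

         // BC7 is block-compressed: 16 bytes per 4x4 block, so the source row pitch
         // is bytes per row of blocks, and each block row covers 4 texel rows.
         const UINT blocksWide = (width  + 3) / 4;
         const UINT blocksHigh = (height + 3) / 4;
         const UINT rowPitch   = blocksWide * 16;
         ctx->UpdateSubresource(
             texture,                // DEFAULT-usage BC7 texture
             0,                      // mip 0
             nullptr,                // whole subresource
             frameData,              // this frame's compressed data on the CPU
             rowPitch,
             rowPitch * blocksHigh); // depth pitch (only relevant for 3D textures)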
  5. So I tried the triple-buffering approach (sketch below), hoping CopyResource would go through async DMA, but it still stalls the GPU command stream. :( Also, since the D3D11 device is free-threaded, I tried something really "dumb": creating another thread that just keeps deleting old and creating new textures (with the new content), hoping that texture creation/deletion is asynchronous with respect to the GPU graphics engine. That plan fell flat as well; even though the device is free-threaded relative to the context, creating/deleting resources apparently still runs in the same pipeline as the context commands. :( Any other thoughts/ideas would be appreciated.
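     For context, the shape of that triple-buffered attempt (frameIndex, staging[], defaultTex and FillStaging are placeholders; FillStaging stands in for the usual Map(D3D11_MAP_WRITE)/memcpy/Unmap):

         // Ring of three staging copies: the slot the GPU copies from this frame
         // was filled two frames ago, so the CPU write never overlaps the GPU read.
         const UINT writeSlot = frameIndex % 3;
         const UINT copySlot  = (frameIndex + 1) % 3;          // oldest slot
         FillStaging(staging[writeSlot], frameData);            // CPU-side fill (placeholder helper)
         ctx->CopyResource(defaultTex, staging[copySlot]);      // still shows up as a stall on the 3D queue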
  6. Does that mean Unmap for dynamic textures internally triggers some sort of copy from CPU-accessible GPU memory to default GPU memory, since it takes about the same time as Unmap/CopyResource for staging/default textures? Also, possibly a dumb question: how would you set up a memory fence in DX from the CPU?? There is no query for that, and everything seems to be implicit...
     I did try texture arrays already; that was the third thing I tried in my original post, sorry if that was misleading.
     Also, sorry, but I'm not sure what you meant by how long the 40MB copy operation is taking. If you mean the methods I've tried above, they are all in the upper-3 ms ballpark (3.6 - 3.9 ms). Yeah, I am aware of large-memory video cards; I am working on other forms of compression as well, but I just want to get this down with BC7 for now :D
  7. I am measuring GPU times (timestamp sketch below):
     - in the dynamic case, I put GPU ticks around the Unmap
     - in the default & staging case, the Unmap doesn't take time, but CopyResource is where the time is
     - in the default/staging case with a Texture2DArray, same as the second case
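     A rough sketch of the timestamp-query pattern behind those GPU ticks (placeholder names; this is the standard D3D11 disjoint/timestamp setup, not the exact code):

         ID3D11Query *qDisjoint, *qBegin, *qEnd;
         D3D11_QUERY_DESC qd = {};
         qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;  device->CreateQuery(&qd, &qDisjoint);
         qd.Query = D3D11_QUERY_TIMESTAMP;           device->CreateQuery(&qd, &qBegin);
                                                     device->CreateQuery(&qd, &qEnd);

         ctx->Begin(qDisjoint);
         ctx->End(qBegin);                   // timestamp before the call under test
         ctx->Unmap(dynamicTex, 0);          // or ctx->CopyResource(defaultTex, stagingTex)
         ctx->End(qEnd);                     // timestamp after
         ctx->End(qDisjoint);

         // A few frames later, once the GPU has caught up:
         D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
         UINT64 t0, t1;
         while (ctx->GetData(qDisjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
         ctx->GetData(qBegin, &t0, sizeof(t0), 0);
         ctx->GetData(qEnd,   &t1, sizeof(t1), 0);
         if (!dj.Disjoint)
         {
             double ms = double(t1 - t0) * 1000.0 / double(dj.Frequency);
             // ms = GPU time attributed to the bracketed call
         }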
  8. I am not sure PCIe is the problem; I have maybe 40MB per frame (on PCIe 3.0 x16, ~32GB/s), and I am already double-buffering (with a frame of delay) to hide the memcpy. However, as I was thinking earlier, with the staging/default approach the time is not in the Unmap but in CopyResource. Does D3D11's CopyResource automatically use the copy engine and avoid stalling the 3D engine (if there is no dependency), or would I have to use D3D12 for that? I'll have to test that with triple buffering and a two-frame delay, I guess. :D The 2048^3 limit is a problem because my widths are > 2048 (height and depth are fine).
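     (Back-of-the-envelope on the bandwidth: 40 MB per frame at the theoretical 32 GB/s of PCIe 3.0 x16 is roughly 40 / 32000 s ≈ 1.25 ms of pure bus time per frame, well under the ~3.6-3.9 ms being measured, which is why raw PCIe bandwidth alone doesn't look like the bottleneck.)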
  9. Hi,
     I am trying to upload multiple (say n) BC7 textures to the GPU each frame (there's new data read from the CPU every frame; there is no way around this), and I am trying to minimize this time as much as possible. I was wondering if anyone has insights beyond what I've already tried:
     - each texture is dynamic, with 2 copies (2n textures total), alternating between a CPU-mapped (D3D11_MAP_WRITE_DISCARD) version to copy data into and a GPU-side unmapped version to render from
     - each texture has 2 corresponding resources, a default & a staging version (2n staging, 2n default); map the staging with D3D11_MAP_WRITE and CopyResource (n times each) from staging to default
     - a staging & default Texture2DArray (array size = n, 2 staging, 2 default); call Map with D3D11_MAP_WRITE once per frame on the staging array, CopyResource once, and Unmap once
     - I also wanted to try 3D textures, but the 2048x2048x2048 limit means I can't use them
     All of these take approximately the same time. Does anyone have thoughts on how I can hide/reduce this time? I am aware the GPU has compute/copy/3D engines (exposed in D3D12), but is there any way to move whatever Unmap/CopyResource is doing onto a separate engine from the 3D engine in D3D11? If not, any suggestions/thoughts? (Rough sketch of the staging/default variant below.)
     Thanks
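     Sketch of the second (staging/default) variant, with placeholder names, in case the description above is unclear:

         // Per texture i, per frame: fill the STAGING copy on the CPU, then copy to DEFAULT.
         D3D11_MAPPED_SUBRESOURCE mapped;
         ctx->Map(stagingTex[i], 0, D3D11_MAP_WRITE, 0, &mapped);
         for (UINT row = 0; row < blockRows; ++row)        // BC7: one row of 4x4 blocks = 4 texel rows
             memcpy((BYTE*)mapped.pData + row * mapped.RowPitch,
                    srcData + row * srcRowPitch,
                    srcRowPitch);
         ctx->Unmap(stagingTex[i], 0);

         // The DEFAULT texture is what the shaders sample; the measured time is dominated by this copy.
         ctx->CopyResource(defaultTex[i], stagingTex[i]);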
  10. Sorry, I was distracted by other stuff and just got back to this...
     I don't think you can guarantee that m_fence is immediately updated to 2 right after entering the if-block, since GetCompletedValue is a CPU call but Signal is a command queued in the GPU command buffer?
     "Why would there be a race? Both those calls happen on a single CPU thread."
     Isn't m_fenceEvent set by the GPU driver/kernel thread while WaitForSingleObject is done by the CPU userland thread? I don't think those are the same CPU thread? I thought SetEventOnCompletion simply sets a callback/event for when the fence value equals the given value; does it really wait for the fence value to equal that value before setting the event?
     Thanks
  11. Hi,
     I am going through the D3D12 samples from Microsoft (the ones at https://github.com/Microsoft/DirectX-Graphics-Samples.git) and have a question about the synchronization they do for their command queues. In all of the samples, ::WaitForPreviousFrame() does the following:

         const UINT64 fence = m_fenceValue;
         ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), fence));
         m_fenceValue++;

         // Wait until the previous frame is finished.
         if (m_fence->GetCompletedValue() < fence)
         {
             ThrowIfFailed(m_fence->SetEventOnCompletion(fence, m_fenceEvent));
             WaitForSingleObject(m_fenceEvent, INFINITE);
         }

     Isn't there a possible race condition here, where the Signal command on the GPU queue executes right after the CPU finishes checking the if (which passes) but before it sets the event on completion for the fence? Wouldn't the CPU then wait infinitely? Am I missing something glaring here, since all the samples seem to use this pattern?
     Even if SetEventOnCompletion fires the event when the fence value is already at the trigger value, wouldn't there now be a race between that event firing and the CPU running the WaitForSingleObject instruction?
     Thanks