pcmaster

Members
  • Content count

    192

Community Reputation

984 Good

About pcmaster

  • Rank
    Member

Personal Information

  • Industry Role
    Programmer
  • Interests
    Programming
  1. It's totally fine to have a "fat" vertex with many attributes and only use a subset of them in various vertex shaders. For example, the shadow-map VS will only need the position and maybe uv0 (for alpha test), but the gbuffer VS will use all the attributes. You don't need to prepare two vertex buffers for the same mesh; just reuse the same one in all the necessary passes. It doesn't have to be multiple "streams" either, the data can be interleaved. Both options have slightly different performance characteristics, but I wouldn't be concerned with that at all in the beginning. A layout is a "view" of the vertex data, which is just a bunch of bytes in memory. The layout tells the VS at which offset each "variable" sits and where to fetch it from. Switching a vertex shader has a cost. Switching layouts also has a cost. Switching buffers has no cost.
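    As a rough illustration (a sketch only -- the struct, semantic names and offsets below are made up for this example), two D3D12 input layouts can view the same interleaved "fat" vertex buffer, one using every attribute for the gbuffer pass and one using just a subset for the shadow pass:

      // One interleaved "fat" vertex, 44 bytes, stored in a single vertex buffer.
      struct FatVertex
      {
          float position[3];  // offset  0
          float normal[3];    // offset 12
          float tangent[3];   // offset 24
          float uv0[2];       // offset 36
      };

      // Full layout for the gbuffer PSO -- uses every attribute.
      const D3D12_INPUT_ELEMENT_DESC gbufferLayout[] =
      {
          { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
          { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
          { "TANGENT",  0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 24, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
          { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 36, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
      };

      // Reduced layout for the shadow-map PSO -- same buffer, same stride, only position + uv0.
      const D3D12_INPUT_ELEMENT_DESC shadowLayout[] =
      {
          { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
          { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 36, D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
      };

      // Both PSOs bind the same vertex buffer view with StrideInBytes = sizeof(FatVertex);
      // the attributes a layout doesn't mention are simply skipped over.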
  2. While I'm here, I'm wondering what exactly CopyDescriptors does (CopyDescriptorsSimple is... simple). Can the number of destination and source ranges be different? E.g. is it good for gathering scattered descriptors before a draw? Like copying 10 scattered SRVs into 1 contiguous range and 5 scattered UAVs into another contiguous range? The documentation doesn't say. Also, are we allowed to copy/write descriptors into a GPU-visible descriptor heap from the CPU (or GPU) directly, i.e. without CopyDescriptors? The descriptor heap has a CPU virtual address (pointer), and the size of a descriptor is known...
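    As a concrete sketch of the gather I have in mind (as I understand the API -- the function and parameter names below are made up): ten scattered SRV descriptors from non-shader-visible heaps copied into one contiguous block of a shader-visible heap, with differing source and destination range counts in a single call of one heap type.

      void GatherSrvs(ID3D12Device* device,
                      const D3D12_CPU_DESCRIPTOR_HANDLE srcStarts[10], // scattered, non-shader-visible
                      D3D12_CPU_DESCRIPTOR_HANDLE dstStart)            // start of a contiguous block in a shader-visible heap
      {
          UINT srcSizes[10];
          for (UINT i = 0; i < 10; ++i)
              srcSizes[i] = 1;        // each source range holds a single descriptor

          const UINT dstSize = 10;    // one destination range of ten descriptors

          device->CopyDescriptors(1, &dstStart, &dstSize,      // destination ranges
                                  10, srcStarts, srcSizes,     // source ranges
                                  D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
      }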
  3. Aaaaaand I think I have my answer:
  4. Hello! Is it possible to mix ranges of samplers and ranges of SRVs and ranges of UAVs in one root parameter descriptor table? Like so:

      D3D12_DESCRIPTOR_RANGE ranges[3];
      D3D12_ROOT_PARAMETER param;
      param.ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
      param.DescriptorTable.NumDescriptorRanges = 3;
      param.DescriptorTable.pDescriptorRanges = ranges;
      ranges[0].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV; ..
      ranges[1].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_UAV; ..
      ranges[2].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SAMPLER; ..

    I wonder especially about CopyDescriptors, which would need to copy a range of D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER descriptors and a range of D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV descriptors. Thanks if anyone knows (while I try it :)). P
  5. Do you have the debug layer enabled? Does everything return VK_SUCCESS? How are you sure it "works"?
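    For context, a minimal Vulkan sketch of what I mean -- enabling the standard validation layer at instance creation and actually checking the VkResult (the layer name is the one shipped with the LunarG SDK of that era):

      const char* layers[] = { "VK_LAYER_LUNARG_standard_validation" }; // "VK_LAYER_KHRONOS_validation" on newer SDKs

      VkInstanceCreateInfo ci = {};
      ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
      ci.enabledLayerCount   = 1;
      ci.ppEnabledLayerNames = layers;

      VkInstance instance = VK_NULL_HANDLE;
      VkResult res = vkCreateInstance(&ci, nullptr, &instance);
      if (res != VK_SUCCESS)
      {
          // It didn't actually "work" -- inspect res and the validation layer output.
      }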
  6. This also means that you actually can't compile to any HW-dependent format with D3D11 on PC at all; it all happens in the driver at run-time, as stated above. That can cost a measurable amount of time, so in order to reduce spikes during gameplay on PC (when a shader is accessed for the first time), "pre-warm" your shader cache at load time. On consoles, on the other hand, you can't compile at run-time (at all!) and must load everything as final microcode for the target HW. So it makes sense to build offline in all cases and compile online only for rapid development on PC.
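    A hedged sketch of that load-time pre-warm on D3D11, assuming the shaders were built offline to bytecode files (the path and names below are hypothetical):

      #include <d3d11.h>
      #include <d3dcompiler.h>

      ID3D11PixelShader* PrewarmPixelShader(ID3D11Device* device)
      {
          // Load offline-compiled bytecode (e.g. produced by fxc) -- path is made up.
          ID3DBlob* bytecode = nullptr;
          if (FAILED(D3DReadFileToBlob(L"shaders/gbuffer_ps.cso", &bytecode)))
              return nullptr;

          // Creating the shader object up front lets the driver do its run-time
          // translation to HW microcode at load time instead of on first use in gameplay.
          ID3D11PixelShader* ps = nullptr;
          device->CreatePixelShader(bytecode->GetBufferPointer(),
                                    bytecode->GetBufferSize(), nullptr, &ps);
          bytecode->Release();
          return ps; // keep it around and reuse it when rendering
      }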
  7. How will fp16 affect games

    It can be quite useful on PS4 Neo, and I expect the same on new PC HW. You just have to know what you're doing and not mix it with fp32 (too much). And profile to see whether it's actually better than fp32. The gains can be savings in interpolators (the VS->PS parameter cache), double-rate execution of (some) ALU instructions, and maybe even lower register pressure. It's totally good for games, e.g. HDR colour computations where you don't need that much precision, but also plenty of other things.
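    If you want to check up front whether the PC HW/driver even advertises 16-bit minimum precision, a small D3D12 sketch (assuming an already created device):

      D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
      if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                                &options, sizeof(options))))
      {
          const bool fp16MinPrecision =
              (options.MinPrecisionSupport & D3D12_SHADER_MIN_PRECISION_SUPPORT_16_BIT) != 0;
          // If this is false, min16float in HLSL simply runs at fp32 rate,
          // which is why profiling against fp32 is the only real answer.
      }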
  8. D3d12 fence value

    A fence is accessible by both the CPU and the GPU.

    Scenario 1 -- GPU produces and CPU consumes
    1. ID3D12CommandQueue::Signal inserts a command into the queue. This command will be executed by the GPU command processor LATER (in order, with the other (draw) commands), and at that point the GPU will change the fence value.
    2. The CPU then uses ID3D12Fence::GetCompletedValue (which immediately returns the current value) plus busy-waiting, or SetEventOnCompletion. This is good for reclaiming/reusing/recycling/destroying resources used by the GPU. Once the GPU actually signals, the CPU can be sure the GPU is done with what it's been doing and can do whatever it likes with the resources.

    Scenario 2 -- CPU produces and GPU consumes
    1. ID3D12CommandQueue::Wait inserts a command into the queue. This command will later be executed by the GPU and will cause its command processor to stall (and not launch any draws/dispatches) until the fence has the specified value.
    2. ID3D12Fence::Signal immediately sets the fence value from the CPU and 'unblocks' the GPU. This might be good for async loading of stuff -- the GPU is running on its own, the CPU is producing stuff. The wait ensures that the GPU won't run ahead of the CPU.

    I agree that the docs aren't very clear :)
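    A condensed sketch of both scenarios in code (queue, fence and event are assumed to exist already; the values are arbitrary, with the fence starting at 0):

      // Scenario 1 -- GPU produces, CPU consumes.
      const UINT64 frameFenceValue = 1;
      queue->Signal(fence, frameFenceValue);             // the GPU will set the value when it gets here

      if (fence->GetCompletedValue() < frameFenceValue)  // returns immediately
      {
          fence->SetEventOnCompletion(frameFenceValue, event);
          WaitForSingleObject(event, INFINITE);          // CPU sleeps until the GPU has signalled
      }
      // Safe to recycle/destroy anything the GPU was using before the Signal.

      // Scenario 2 -- CPU produces, GPU consumes.
      const UINT64 uploadDoneValue = 2;
      queue->Wait(fence, uploadDoneValue);               // the GPU command processor will stall here...
      // ...the CPU finishes producing (e.g. async loading), then unblocks the GPU:
      fence->Signal(uploadDoneValue);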
  9. Question about InterlockedOr

    Here, by a memory client I mean: CPU, shader core, colour block, depth block, command processor, DMA. Each has different caches (the CPU has L1/L2 or even L3, shaders have L1 and L2, CB/DB have separate caches for accessing render targets (but not UAVs), etc.). So in your case of PS/CS using atomics, it's all the same memory client (the shader) using only GPU L1 and GPU L2.

    What nobody here seems to know is whether interlocked atomics go via L1 or L2 or somewhere else. If it were as I describe (and it is on the consoles), we'd be safe not flushing, or doing just a partial flush (which I'm not sure we can do with PC D3D12).

    After all, D3D12 probably doesn't even want us to know things like this :)

    So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on GCN consoles. Thus we can't know what happens between dispatches/draws, and therefore it seems completely necessary to flush the caches to ensure visibility :( That's my worst-case assessment and I hope somebody proves me wrong and it's possible with a less heavy cache sync.

    I'm very sorry, I don't know any better yet...
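    At the PC D3D12 API level, that "flush between dispatches" is expressed as a UAV barrier on the resource; the driver then does whatever cache flush/invalidate the HW needs. A minimal sketch (uavBuffer and groupsX are placeholders):

      cmdList->Dispatch(groupsX, 1, 1);                  // dispatch A writes / atomically updates the UAV

      D3D12_RESOURCE_BARRIER barrier = {};
      barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_UAV;
      barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
      barrier.UAV.pResource = uavBuffer;                 // or nullptr to cover all UAV accesses
      cmdList->ResourceBarrier(1, &barrier);

      cmdList->Dispatch(groupsX, 1, 1);                  // dispatch B now sees A's results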
  10. Question about InterlockedOr

    Mr_Fox -- regarding L2, there's no specific region for any dispatch. Maybe Matias Goldberg might explain what he meant by that.

    The writes to your UAV might, for example, bypass L1 (which is per compute unit, for example) and go directly to L2 (one per GPU), all depending on the actual architecture. The writes don't go directly to main memory, though, not automatically at least; they go to a cache instead. The reads will look into L2 and, if they see a non-invalidated cache line for the memory address in question, they won't read RAM and will return the cached value instead. Hence the need to flush and invalidate (L2) caches between dispatches to ensure visibility for all other clients.

    A flush will just transfer all the changed 64-byte (usual size) cache lines from L2 to main memory and mark all lines invalid in the L2 cache.

    However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything. I'm not sure here... But if the atomic accesses don't bypass L1, which I can't tell from the MS docs (I don't know how I'd configure that), a flush + invalidate is definitely necessary. I'd bet this is the default situation.

    On one of the recent consoles with GCN, it's possible to set up such a cache operation that both L1 and L2 are completely bypassed and all reads and writes (on a given resource) from shaders go straight to RAM, which is slow, yet usable for certain scenarios involving small amounts of transferred data. I'm not sure if this is possible to set up on Windows/DX12; someone might hint if yes or no.
  11. Team makeup

    In my experience at both a small (~15 people) and a medium-sized (~200 people) developer, the ratio of artists to programmers was always around 3 : 1. As Josh Petrie mentioned, it's impossible for one person to manage more than 7-10 people, so there have to be at least some lead roles plus managers (producers). Even with 15 people you'll have to have at least one or two managerial positions. I, myself, am a programmer, so this is a bottom-up perspective :)
  12. Visualize points as spheres

    You can render them as billboards with custom depth output from the pixel shader (also a pixel kill). You would draw them the same way for your shadow map (they're symmetrical).

    You'd draw the billboards instanced, packing the vertices better into vertex wavefronts, instead of shading 4 vertices at a time and greatly under-utilising/wasting the wavefront.

    Whether it's going to be slower than rendering actual sphere meshes is unknown to me.
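    A sketch of the instanced submission in D3D12 terms (buffer/variable names made up; the sphere intersection, SV_Depth write and pixel kill live in the shaders):

      // One 4-vertex quad (triangle strip), instanced once per point/sphere.
      // Slot 0: the 4 corner offsets (per-vertex). Slot 1: centre + radius (per-instance).
      D3D12_VERTEX_BUFFER_VIEW views[2] = { quadCornersVbv, sphereInstancesVbv };
      cmdList->IASetVertexBuffers(0, 2, views);
      cmdList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP);
      cmdList->DrawInstanced(4, sphereCount, 0, 0);      // 4 vertices, sphereCount instances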
  13. Non-trivial union

    Hi.

    unique_ptr::reset() calls the destructor and bang, you're dead, because its internal members are most probably garbage. If you alias an ArrayInfo over the same bytes as an ObjectInfo, it can't end well.

    The sizes of unique_ptr and int are implementation-defined, and your String is who knows what (a pointer + an int, an int + a pointer, or possibly anything). How do you imagine it should work? What are you trying to achieve? The unique_ptr won't be in a healthy state for you to call reset on.

    Furthermore, the standard says it's only legal to read from the most recently written union member, although actual compilers are less strict.

    What you suggest might work, I just wouldn't call a "String" dtor on an instance of an ObjectInfo class :)

    But what's the point?
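    A minimal sketch of the lifetime rules I mean (the member types are made up, since the original ObjectInfo/ArrayInfo/String aren't shown):

      #include <memory>
      #include <new>
      #include <string>

      union Info
      {
          std::unique_ptr<int> object;   // non-trivial members: the union gets no
          std::string          name;     // implicit special member functions
          Info() : object(nullptr) {}    // explicitly start the lifetime of one member
          ~Info() {}                     // the active member must be destroyed by hand
      };

      int main()
      {
          Info info;
          info.object = std::make_unique<int>(42);

          // Switching the active member: end the old member's lifetime first, then
          // start the new one with placement new. Calling String's dtor (or reset())
          // on bytes that actually hold something else is undefined behaviour.
          std::destroy_at(&info.object);
          new (&info.name) std::string("hello");

          std::destroy_at(&info.name);   // clean up whichever member is active at the end
          return 0;
      }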