About pcmaster

  1. This also means that you can't compile to any HW-dependent format with D3D11 on PC at all; it will all happen in the driver at run-time, as stated above. That can cost a measurable amount of time, so to reduce spikes during gameplay on PC (when a shader is accessed for the first time), "pre-warm" your shader cache at load-time. On consoles, on the other hand, you can't compile at runtime (at all!) and must load everything as final microcode for the target HW. So it makes sense to build offline in all cases and compile online only for rapid development on PC.
  2. How will fp16 affect games

    It can be quite useful on PS4 Neo, and I expect the same on new PC HW. You just have to know what you're doing and not mix it with fp32 (too much), and profile to see whether it's actually better than fp32. The gains can be savings in interpolators (the VS->PS parameter cache), double-rate execution of (some) ALU ops, and maybe even lower register pressure. It's totally good for games, e.g. HDR colour computations where you don't need that much precision, but also anything else.
  3. D3d12 fence value

    A fence is accessible by both CPU and GPU.

    Scenario 1 -- GPU produces and CPU consumes:
    1. ID3D12CommandQueue::Signal inserts a command into the queue. This command will be executed by the GPU command processor LATER (in order, with the other (draw) commands), and at that point the GPU will change the fence value.
    2. The CPU then uses ID3D12Fence::GetCompletedValue (which immediately returns the current value) plus busy waiting, or SetEventOnCompletion. This is good for reclaiming/reusing/recycling/destroying resources used by the GPU. Once the GPU actually signals, the CPU can be sure the GPU is done with what it's been doing and can do whatever it wants with the resources.

    Scenario 2 -- CPU produces and GPU consumes:
    1. ID3D12CommandQueue::Wait inserts a command into the queue. This command will later be executed by the GPU and will cause its command processor to stall (and not launch any draws/dispatches) until the fence has the specified value.
    2. ID3D12Fence::Signal immediately sets the fence value from the CPU and 'unblocks' the GPU. This is good for async loading of stuff -- the GPU is running on its own while the CPU is producing data, and the wait ensures that the GPU won't run ahead of the CPU.

    I agree that the docs aren't very clear :)
  4. Question about InterlockedOr

    Here, by a memory client I mean: CPU, shader core, colour block, depth block, command processor, DMA. Each has different caches (the CPU has L1/L2 or even L3, shaders have L1 and L2, CB/DB have separate caches for accessing render targets (but not UAVs), etc.). So in your case of PS/CS using atomics, it's all the same memory client (shader) using only GPU L1 and GPU L2.

    What nobody here seems to know is whether interlocked atomics go via L1, via L2, or some other way. If it worked as I describe (and it does on the consoles), we'd be safe not flushing, or doing just a partial flush (which I'm not sure we can do with PC D3D12). After all, D3D12 probably doesn't even want us to know things like this :)

    So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on GCN consoles. Thus we can't know what happens between dispatches/draws, so it seems completely necessary to flush the caches to ensure visibility :( That's my worst-case assessment, and I hope somebody proves me wrong and it's possible with a less heavy cache sync. I'm very sorry, I don't know any better yet...
  5. Question about InterlockedOr

    Mr_Fox -- regarding L2, there's no specific region for any dispatch. Maybe Matias Goldberg might explain what he meant by that.

    The writes to your UAV might, for example, bypass L1 (which is per compute unit) and go directly to L2 (one per GPU), all depending on the actual architecture. The writes don't go directly to main memory, though, not automatically at least; they go to a cache instead. The reads will look into L2 and, if they see a non-invalidated cache line for the memory address in question, they won't read RAM and will return the cached value instead. Hence the need to flush and invalidate (L2) caches between dispatches to ensure visibility for all other clients.

    A flush will just transfer all the changed 64-byte (the usual size) cache lines from L2 to main memory and mark all lines invalid in the L2 cache.

    However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything. I'm not sure here... But if the atomic accesses don't bypass L1, which I can't tell from the MS docs (I don't know how I'd configure that), a flush+invalidate is definitely necessary. I'd bet this is the default situation.

    On one of the recent consoles with GCN, it's possible to set up such a cache operation that both L1 and L2 are completely bypassed and all reads and writes (on a given resource) from shaders go straight to RAM, which is slow, yet usable for certain scenarios involving small amounts of transferred data. I'm not sure if this is possible to set up on Windows/DX12; someone might hint whether it is.
  6. Team makeup

    In my experience at a small (~15 people) and a medium (~200 people) developer, the ratio of artists : programmers was always around 3 : 1. As Josh Petrie mentioned, it's impossible for one person to manage more than 7-10 people, so there have to be at least some lead roles + managers (producers). Even with 15 people you'll have to have at least one or two managerial positions. I, myself, am a programmer, so this is a bottom-up perspective :)
  7. Visualize points as spheres

    You can render them as billboards with custom depth output from the pixel shader (plus a pixel kill for texels outside the silhouette). You would draw them the same way for your shadow map (they're symmetrical). You'd draw the billboards instanced, packing the vertices better into vertex wavefronts, instead of shading 4 vertices at a time and greatly underutilising/wasting the wavefront size. Whether it's going to be slower than rendering actual sphere meshes is unknown to me.
  8. Non-trivial union

    Hi.

    unique_ptr::reset() calls the destructor and bang, you're dead, because its internal members are most probably garbage. If you alias an ArrayInfo over the same bytes as an ObjectInfo, it can't end well. The sizes of unique_ptr and int are implementation-defined, and your String is who knows what (a pointer + an int, an int + a pointer, or possibly anything). How do you imagine it should work? What are you trying to achieve? The unique_ptr won't be in a healthy state for you to call reset.

    Furthermore, the standard says it's only legal to read from the most recently written union member, although actual compilers are less strict. What you suggest might work; I just wouldn't call a String dtor on an instance of an ObjectInfo class :) But what's the point?
  9. Thank you Alessio, this is exactly what I've failed to find! Your answer will become a pointer on how to do it :)
  10. Hi guys! I've failed to find how to create Reference/Software/WARP devices for Direct3D 12 (not 11). D3D12CreateDevice is missing the D3D_DRIVER_TYPE argument from D3D11CreateDevice, so there's no way to specify D3D_DRIVER_TYPE_REFERENCE or D3D_DRIVER_TYPE_WARP, and the documentation on WARP itself still only mentions D3D11 and lower. Am I missing something in the docs, or is there really no reference software implementation? Thanks!
  11. Exactly, there's a position/parameter cache, but its "key" is the vertex index. If you use index buffers, the same vertex won't be shaded twice if it was shaded recently, and you save a bit. If you don't use indices then, I think, it doesn't matter where you put your computation -- VS or GS.
  12. Side note: watch out for the "fancy" DMA; for example on consoles it can have unexpectedly shitty performance. Be sure to profile it.
  13. Can't Link Shader Program

    This should give you an answer: http://stackoverflow.com/questions/5366416/in-opengl-es-2-0-glsl-where-do-you-need-precision-specifiers

    The default precision in fragment and vertex shaders isn't the same. Also be aware that you aren't using GLSL but the "OpenGL ES Shading Language" (GLSL ES, if you want). Not the same.
  14. Hehe, my epic fail. There was indeed a leak on my side -- a forgotten release on a bunch of textures. So now I believe WARP is fine and actually releases the memory.
  15. Okay, I reduced it as much as I could, and now I do only some rendering using immutable resources (and constant buffers) to a few render targets, then a few Map/NoOverwrite for a fullscreen quad, and finally read back the render targets on the CPU. At that point (the first Map/Read), all the massive allocations happen. No Map/Discard at all.