

Community Reputation

164 Neutral

1 Follower

About NikiTo

  1. The space that the triangles do not cover becomes the mask (it is masked out). I don't even care about the clear color of those pixels; they are totally ignored. I should perhaps explain that I am not rendering 3D but doing GPGPU, so there is no 3D inside the 3D or anything like that.
  2. No, I'm not discarding pixels in the pixel shader. My triangles cover only the pixels that need to be rendered, so the rest are discarded in a natural way. So it would be the dedicated stencil hardware vs. the dedicated rasterizer hardware; the discarded pixels never enter the pixel shader. But now I see that the stencil might discard pixels even before the vertex shader. Is that so?
  3. I assumed that the layout was similar. My bad! I rendered to R8G8B8A8_UINT because I expected to be able to read the RTV easily in another format. Thank you for the definite answer that it will not work. I will have to reorganize everything now. It hurts to change the RTV from R8G8B8A8_UINT to R8_UINT, so first I will look for another way around the problem, and only if cornered will I change the RTV to match. (Is a triangle "painted" on the stencil faster at discarding pixels than a triangle coming from the vertex shader?)
  4. I have to render only part of the RTV further down the pipeline. So I create triangles that determine which part of the RTV to render, and the next shader renders only those pixels (like a mask). The problem is that the triangles are, by default, aligned to nothing coarser than 1 pixel, and my input data to the shader is one byte per pixel. I have no control over the triangles either; they can be of any shape. It is even a little more complex than this, but I can't share more details. Initially I wanted, of course, to read 16 R8_UINT values from one R32G32B32A32_UINT texel, but I don't want to use if/else, and crossing the Bresenham line of a triangle's edge leaves me with up to 15 pixels of data I don't need. I have been reorganizing it for the last three days and can't find a better setup. I am at the very beginning of the pipeline; further on it will become more complex, and being able to reinterpret an RTV in any format without copying it would be almost a must-have. Is an R8G8B8A8_UINT RTV read as an R8_UINT SRV definitely not going to work? I changed the height so that the total size matches the heap, but it still fails. Both Intel and AMD fail in exactly the same way. Is there another way to accomplish it? D3D12_HEAP_DESC has its "Flags" set to D3D12_HEAP_FLAG_NONE. Could this be the problem?
  5. I render to DXGI_FORMAT_R8G8B8A8_UINT and need to read it as DXGI_FORMAT_R8_UINT. (I spent three days trying to make it simply read four DXGI_FORMAT_R8_UINT values as one DXGI_FORMAT_R8G8B8A8_UINT, but I can't do it; it is hard to explain.)
  6. Case 1:
     device->CreatePlacedResource(heap, 0, resDesc1, state_render_target, nullptr, IID..(resource1)) // success
     device->CreatePlacedResource(heap, 0, resDesc1, state_pixel_shader, nullptr, IID..(resource2)) // success
     device->CreateCommittedResource(...) fails
     Case 2 (what I want to accomplish):
     device->CreatePlacedResource(heap, 0, resDesc1, state_render_target, nullptr, IID..(resource1)) // success
     device->CreatePlacedResource(heap, 0, resDesc2, state_pixel_shader, nullptr, IID..(resource2)) // fails
     (The total size of the texture described in resDesc2 is equal to or less than the size of the heap used; both cases fail.)
     I need to render to an RTV in one format and then, without copying, read from it as an SRV using another format. Is this the way I should try to do it?
  7. I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example in the OP above), giving me 16x4 fully parallel operations.
  8. @turanszkij So there is yet another compiler that gets in the middle... For the simple example above, I could, for example, read the texture as DXGI_FORMAT_R32G32B32A32_UINT and XOR it with 0x44444444. This way I would solve my problem from the high-level language, without losing time on assembler for the GPU. The problem is that my project is more complex than this simplified example. I am trying to eliminate my doubts by changing my solution. I could try to force some kind of parallelism by using only four-component vector variables in my code, but it is painful to guess how it works.
  9. (I've googled for a way to write assembler-like instructions for DirectX 12, but found nothing beyond hacking the bytecode, which I don't have the patience to do. So I need to deal with HLSL, but I need to make a few things clear.) I have thought of a simple example that demonstrates exactly what I need to know. I have two 2D textures of the same size, both of type DXGI_FORMAT_R8_UINT; let's say 128x128 "pixels". One is the input to the pixel shader, and the pixel shader outputs its result to the other. I want the shader to take the input byte, XOR it with 44h and output it to the render target. My doubt is which of these two will happen (assuming for the example that the registers used in the GPU's ALUs are 16 bytes wide): Case 1: the HLSL compiler loads the byte into the first component of a 16-byte-wide vector register (or into the lowest 8 bits of the first 4-dword-wide vector register), and the remaining 15 bytes are just zeroed. Case 2: the HLSL compiler is wise enough to read the data in chunks of 16 bytes, load them into the 16 bytes of a 16-component vector register, propagate that 44h to another 16-byte-wide register, XOR once and write a single 16-byte chunk to the render target. (But I can't see that happening, because MSDN says that the unused components are zeroed. It is not clear whether they are just hidden from me or really zeroed as in the first case...) What about variables: would this float fVar = 3.1f; occupy the same type of register as this float4 fVector = { 0.2f, 0.3f, 0.4f, 0.1f }; ? (I'm sorry if I'm posting too often, but it is very hard for me to find anything helpful on Google.)
  10. My initial intention was to run millions of those threads on the GPU... Thank you all! Now it is much clearer to me.
  11. I remember doing safe branching in SIMD with the Intel instructions that take an extra vector for the decision making. But that is only good for a few situations, and I don't remember exactly which situations I used it in.
  12. I used to use SIMD a lot in assembler, and I'm used to it. For some reason it is harder for me to explore the GPU the same way I did the CPU. Intel has tons of very explicit documentation; in ASM I know what goes where and where everything is at any moment. On the GPU, although it should be very similar, it is not so clear to me. I will try to redesign the algorithm to run on Intel's SIMD. That should help me orient myself, and it should be easier to port it from Intel SIMD to the GPU than from my C++ code full of branches. (For some reason I totally forgot about branches. If you asked me "are branches good for the GPU?" I would say NO!, but for some reason I totally forgot about them.)
  13. I am designing my algorithm in C++ now. Later I want to adapt it to HLSL. It is working, but it is huge and I have to optimize it further, because I know the processor each shader runs on is not as fast as the main CPU where the C++ runs. I could even redesign it completely if needed, but I don't think there is a way to do it with only 256 registers. I could say I wrote a genetic algorithm, but I don't want to get into discussions about what counts as a genetic algorithm/ML/big data/etc. I know for sure that my data grows iteratively and then shrinks iteratively. I think it is a genetic algorithm, but I don't want to argue. Should I use RWBuffer* or something else more appropriate than an array? (A "cell" here means "a small room in which a prisoner, the data, is locked up", not "the smallest structural and functional unit of an organism".)
  14. Would it be a problem to create ~50 uninitialized arrays of ~300000 cells each in HLSL and then use them for my algorithm? (This is what I currently do in C++, where I had stack overflow problems because of the large arrays.) It is something internal to the shader: the shader creates the arrays at the beginning, uses them and then no longer needs them. No data for the arrays comes from the outside world, and no data from the arrays goes back to it either. Nothing is shared. My question is not very specific; it is about memory consumption considerations when writing shaders in general, because my algorithm still has to be polished. I will leave the HLSL writing until the algorithm is completely finished and working (because I expect writing HLSL to be just as unpleasant as GLSL). Still, it is useful for me to know beforehand what problems to consider.
  15. @benjamin1441 I am very thankful for your wishes!! You made my day! Thank you! I wish you good luck in exchange!
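
Editorial sketch for the XOR example in the posts above: the idea of reading the data as DXGI_FORMAT_R32G32B32A32_UINT and XORing with 0x44444444 relies on the fact that XOR acts on each bit independently, so XORing four packed bytes with 0x44444444 gives the same bytes as XORing each one with 44h. The CPU-side C++ below only checks that arithmetic; it says nothing about how an HLSL compiler or a GPU actually schedules the work. The names `Block16`, `xor_bytewise` and `xor_packed` are hypothetical and do not come from the posts.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical name: 16 consecutive R8_UINT texels, i.e. the 16 bytes
// that one R32G32B32A32_UINT texel would hold.
using Block16 = std::array<std::uint8_t, 16>;

// "Case 1" style: XOR every byte on its own with 44h.
Block16 xor_bytewise(const Block16& in) {
    Block16 out{};
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] ^ 0x44u);
    }
    return out;
}

// Packed style: view the same 16 bytes as four 32-bit words and XOR each
// word with 0x44444444. Byte order does not matter here because every
// byte of the constant is the same.
Block16 xor_packed(const Block16& in) {
    std::array<std::uint32_t, 4> words{};
    std::memcpy(words.data(), in.data(), in.size());
    for (std::uint32_t& w : words) {
        w ^= 0x44444444u;
    }
    Block16 out{};
    std::memcpy(out.data(), words.data(), out.size());
    return out;
}
```

Both functions return identical blocks for any input, which is why the packed read is a valid rewrite of the simple example; it does not help once neighbouring bytes need different treatment, as with the unaligned triangle edges described in the earlier posts.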
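
Editorial sketch for the memory question in the posts above (~50 arrays of ~300000 cells each, with the stated intention of running millions of threads): a rough size estimate makes the scale concrete. The cell size of 4 bytes and the thread count of 1,000,000 are assumptions chosen for illustration; the posts give neither number, and the function names are hypothetical.

```cpp
#include <cstdint>

// Bytes needed by one thread that owns `arrays` private arrays of
// `cells` elements, each `bytes_per_cell` bytes wide.
constexpr std::uint64_t bytes_per_thread(std::uint64_t arrays,
                                         std::uint64_t cells,
                                         std::uint64_t bytes_per_cell) {
    return arrays * cells * bytes_per_cell;
}

// Total bytes if every one of `threads` threads keeps its own copy.
constexpr std::uint64_t total_bytes(std::uint64_t per_thread,
                                    std::uint64_t threads) {
    return per_thread * threads;
}

// Figures from the post: ~50 arrays of ~300000 cells.
// ASSUMPTION: 4 bytes per cell (for example one uint or float).
constexpr std::uint64_t kPerThread = bytes_per_thread(50, 300000, 4);

// ASSUMPTION: 1,000,000 threads, standing in for "millions".
constexpr std::uint64_t kTotal = total_bytes(kPerThread, 1000000);

static_assert(kPerThread == 60000000ULL, "60 MB per thread");
static_assert(kTotal == 60000000000000ULL, "60 TB for a million threads");
```

Under these assumptions a single thread already needs 60 MB of scratch space, which suggests the arrays would have to live in explicitly allocated buffers shared or partitioned across threads rather than as private arrays inside each shader invocation.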