galop1n

Member
  • Content count
    274
  • Joined
  • Last visited

Community Reputation

1026 Excellent

2 Followers

About galop1n

Personal Information

  • Industry Role
    DevOps
    Programmer
  • Interests
    Programming

Social

  • Twitter
    galop1n
  • Github
    galop1n
  • Steam
    galop1n

Recent Profile Visitors

5498 profile views
  1. The perfect answer still does not exist. It depends on your needs: the nature of the terrain ( static / dynamic / large scale / small scale ), the complexity of the materials, whether it supports caves and cliffs, ... LOD techniques are more needed than ever, because it is still impossible to blast thousands of millions of triangles per frame, and because GPUs are not almighty and can fall over quickly on a terrain renderer that is too brute force, especially the tessellation unit on AMD ( no, it is not an innocent example ! ). You do not even have to render geometry anymore: what about raycasting/marching a terrain from a single screen-aligned quad? A rough sketch of the idea follows.
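
    A minimal CPU-side sketch of that heightfield raymarching idea, with an analytic height function standing in for a heightmap fetch; in practice this loop would live in a pixel or compute shader driven by the screen-aligned quad, and every name here is illustrative.

    #include <cmath>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // Placeholder analytic heightfield standing in for a heightmap texture fetch.
    static float TerrainHeight(float x, float z) {
        return 2.0f * std::sin(x * 0.1f) * std::cos(z * 0.1f);
    }

    // Fixed-step march: stop at the first sample that falls below the terrain.
    static bool RaymarchTerrain(Vec3 o, Vec3 d, float maxDist, Vec3& hit) {
        const float step = 0.25f;
        for (float t = 0.0f; t < maxDist; t += step) {
            Vec3 p = { o.x + d.x * t, o.y + d.y * t, o.z + d.z * t };
            if (p.y < TerrainHeight(p.x, p.z)) { hit = p; return true; }
        }
        return false;
    }

    int main() {
        Vec3 hit;
        if (RaymarchTerrain({ 0.0f, 10.0f, 0.0f }, { 0.0f, -0.5f, 0.866f }, 100.0f, hit))
            std::printf("hit at (%f, %f, %f)\n", hit.x, hit.y, hit.z);
    }
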
  2. The DXBC bytecode does not matter much compared to the final uCode. Plus, GPUs are scalar these days, so the SIMD instructions are counterproductive for the driver anyway. Could it have done a write4 ? Maybe ! But does it matter without seeing your GPU uCode ? No.
  3. The desktop coordinates are based on the resolution your monitor is currently set to, not its physical capabilities. It may also be influenced by the DPI setting ( not sure about this one, DPI management in Windows sucks ). You have to call IDXGIOutput::GetDisplayModeList to obtain the list of supported modes ( a sketch is below ). EDIT: Confirmed, it is DPI dependent ( https://msdn.microsoft.com/en-us/library/windows/desktop/bb173068(v=vs.85).aspx ): "DesktopCoordinates. Type: RECT. A RECT structure containing the bounds of the output in desktop coordinates. Desktop coordinates depend on the dots per inch (DPI) of the desktop. For info about writing DPI-aware Win32 apps, see High DPI."
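
    A minimal sketch of that call, assuming an already-acquired IDXGIOutput and trimming error handling: GetDisplayModeList is called twice, once to size the array and once to fill it.

    #include <dxgi.h>
    #include <vector>

    std::vector<DXGI_MODE_DESC> EnumerateModes(IDXGIOutput* output) {
        UINT count = 0;
        // First call: query the number of modes matching the format.
        output->GetDisplayModeList(DXGI_FORMAT_R8G8B8A8_UNORM, 0, &count, nullptr);
        std::vector<DXGI_MODE_DESC> modes(count);
        // Second call: fill the mode descriptors.
        output->GetDisplayModeList(DXGI_FORMAT_R8G8B8A8_UNORM, 0, &count, modes.data());
        return modes;
    }
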
  4. What if dot(GLM_YUP, lightdir) == 1 or -1 ? ( See the sketch below. )
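
    A hedged sketch of the guard that case calls for, assuming the usual build-a-basis-from-up pattern the question implies ( GLM_YUP being whatever up constant the original code defines ): when the light direction is (anti)parallel to up, cross() returns the zero vector and normalizing it yields NaNs, so fall back to another axis. SafeRight is a hypothetical helper name.

    #include <cmath>
    #include <glm/glm.hpp>

    // Hypothetical helper: a safe "right" axis even when lightDir is parallel to up.
    glm::vec3 SafeRight(const glm::vec3& up, const glm::vec3& lightDir) {
        if (std::abs(glm::dot(up, lightDir)) > 0.999f)      // nearly parallel
            return glm::normalize(glm::cross(glm::vec3(1, 0, 0), lightDir));
        return glm::normalize(glm::cross(up, lightDir));
    }
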
  5. Tiled resources are just like virtual memory. You allocate a contiguous virtual address range ( a texture here ) and then map physical pages to it ( here from a buffer ). And it is OK to only map a portion of the space at a time. It means that you need to introduce streaming capability and a way to determine what is needed to draw your terrain ( mip levels, which materials ). Tiled resources are a great tool to simplify virtual texturing techniques ( when you have a huge partially resident texture, there is no need for atlas indirection or padding/duplication for anisotropic filtering ), and some hardware even lets you know if you try to fetch from an unmapped page. Maybe in your case you can just limit the size of your texture array and only stream the slices that are visible. A sketch of the mapping calls is below.
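
    A minimal D3D11.2 sketch of those calls ( error handling omitted, sizes and formats illustrative, and assuming the adapter reports tiled resource support ): create a tiled texture, a buffer-backed tile pool, then map one 64KB tile of mip 0 to the first page of the pool.

    #include <d3d11_2.h>

    void MapOneTile(ID3D11Device2* device, ID3D11DeviceContext2* context) {
        // Tiled texture: a virtual address range with no physical memory behind it yet.
        D3D11_TEXTURE2D_DESC texDesc = {};
        texDesc.Width = 4096; texDesc.Height = 4096;
        texDesc.MipLevels = 1; texDesc.ArraySize = 1;
        texDesc.Format = DXGI_FORMAT_BC1_UNORM;
        texDesc.SampleDesc.Count = 1;
        texDesc.Usage = D3D11_USAGE_DEFAULT;
        texDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        texDesc.MiscFlags = D3D11_RESOURCE_MISC_TILED;
        ID3D11Texture2D* texture = nullptr;
        device->CreateTexture2D(&texDesc, nullptr, &texture);

        // Tile pool: a plain buffer holding the physical 64KB pages.
        D3D11_BUFFER_DESC poolDesc = {};
        poolDesc.ByteWidth = 64 * 1024 * 256;        // 256 pages
        poolDesc.Usage = D3D11_USAGE_DEFAULT;
        poolDesc.MiscFlags = D3D11_RESOURCE_MISC_TILE_POOL;
        ID3D11Buffer* tilePool = nullptr;
        device->CreateBuffer(&poolDesc, nullptr, &tilePool);

        // Map tile (0,0) of subresource 0 onto pool page 0.
        D3D11_TILED_RESOURCE_COORDINATE coord = {};  // X = Y = Subresource = 0
        D3D11_TILE_REGION_SIZE region = {};
        region.NumTiles = 1;
        UINT rangeFlags = 0, poolOffset = 0, tileCount = 1;
        context->UpdateTileMappings(texture, 1, &coord, &region, tilePool,
                                    1, &rangeFlags, &poolOffset, &tileCount, 0);
    }
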
  6. DX11 - Problem with SOSetTargets

    SoldierOfLight is right. Personally, I do not like the COM ptr, so I usually roll my own smart pointer to deal with DX objects. On a side note, have you looked at compute shaders? They are usually a better solution than stream output because they do not involve the uber costly and inefficient geometry shader stage. A sketch of the compute path is below.
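
    A rough sketch of that compute path ( D3D11, error handling omitted ), assuming an already-compiled compute shader that writes its output to a RWStructuredBuffer instead of a stream-output target:

    #include <d3d11.h>

    void RunComputePass(ID3D11DeviceContext* context, ID3D11ComputeShader* cs,
                        ID3D11UnorderedAccessView* outputUav, UINT elementCount) {
        context->CSSetShader(cs, nullptr, 0);
        context->CSSetUnorderedAccessViews(0, 1, &outputUav, nullptr);
        // One thread per element, 64 threads per group ( must match [numthreads] ).
        context->Dispatch((elementCount + 63) / 64, 1, 1);

        // Unbind the UAV so the buffer can be consumed as an SRV afterwards.
        ID3D11UnorderedAccessView* nullUav = nullptr;
        context->CSSetUnorderedAccessViews(0, 1, &nullUav, nullptr);
    }
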
  7. It does not make sense to measure the DirectX CPU cost that way either. The nVidia driver especially uses heavy multithreading and has a notoriously fat, do-it-all Present call black box. What you may have observed is some cost migrating from a deferred path to the immediate one and vice versa. And in every real scenario I have had to observe, AMD drivers always perform worse than nVidia with regard to their CPU usage across a frame, trust me ! Also, the speed of light of a single draw call, while interesting, usually does not matter: drivers are not optimized to send one draw call, but thousands, with complex state changes. The difference you see could absolutely vanish in a real-case scenario because the driver can work in parallel and do whatever it wants.

    As for the GPU, a cube is again one of the worst unit tests you can have. It does not provide a proper amount of work to the GPU per instance, and you can hit hidden bottlenecks with partially empty wavefronts and bad vertex cache usage. I recommend you focus on GPU performance: it is always possible to some extent to improve the CPU side by getting rid of redundant states and reordering draws, but your GPU frame usually quickly reaches a sum of little things that are each as fast as possible yet have to be there ! And to measure the GPU, you need to use timestamp queries first ( see the sketch below ), then make sure that your driver is not throttling frequency/voltage when you run !

    And for your sanity, forget about vertex buffers, they are so rigid, even when not involving per-instance data. Even if it costs you a little more CPU, the gain in flexibility, control, and maintenance is so priceless that you should decide to afford it !
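
    A minimal D3D11 timestamp query sketch ( error handling omitted ): a disjoint query brackets two timestamps; here we spin for the result, while a real renderer would buffer the queries and poll a few frames later.

    #include <d3d11.h>
    #include <cstdio>

    void MeasureGpu(ID3D11Device* device, ID3D11DeviceContext* context) {
        D3D11_QUERY_DESC desc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
        ID3D11Query *disjoint, *start, *end;
        device->CreateQuery(&desc, &disjoint);
        desc.Query = D3D11_QUERY_TIMESTAMP;
        device->CreateQuery(&desc, &start);
        device->CreateQuery(&desc, &end);

        context->Begin(disjoint);
        context->End(start);                 // timestamp before the workload
        // ... issue the draws you want to measure here ...
        context->End(end);                   // timestamp after the workload
        context->End(disjoint);

        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT clock;
        while (context->GetData(disjoint, &clock, sizeof(clock), 0) != S_OK) {}
        UINT64 t0 = 0, t1 = 0;
        context->GetData(start, &t0, sizeof(t0), 0);
        context->GetData(end, &t1, sizeof(t1), 0);
        if (!clock.Disjoint)                 // otherwise the clock changed mid-frame
            std::printf("GPU time: %.3f ms\n",
                        double(t1 - t0) / double(clock.Frequency) * 1000.0);
    }
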
  8. To get vendor disassembly in PIX ( the DX12-only one ), I believe that for AMD the driver is all you need, and for nVidia you can request the disassembly DLL if you are a registered developer: https://developer.nvidia.com/shader-disasm The fetch shader inlining is always on from what I have seen so far with AMD DX12, because the PSO is statically bound to a unique input layout and has a guarantee that compilation happens at creation. When I said that if it runs fine on AMD then don't worry about nVidia, it was more like: if you achieve your performance target on AMD, even if you do something counterproductive on nVidia, you probably still perform well over the full frame, so no big deal. I would never bind a vertex buffer as per-instance data ever again ( unless it is for a very specialized technique ) because it is cumbersome, less flexible ( try to add extra instance params ? ), slower on the CPU, and it is notorious that AMD is way worse on vertex waves than nVidia in the first place anyway…
  9. There is no input assembler on AMD hardware; a shader is patched to branch to a fetch shader at the beginning, which reads the vertex buffers as regular buffers and uses conversion intrinsics to fill registers. On DX12, the PSO approach allows inlining the fetch shader into your shaders, possibly improving latency hiding and register pressure. For nVidia, we have less knowledge of the internals ( I still have the side task of documenting their assembly for myself, visible in PIX ). But you can usually assume that if you run fast enough on AMD, then nVidia is not a concern. It is sad, but it is the best you can do without more insight into what to optimize on their GPUs.
  10. OpenGL - Nsight 5.4.0.17240

    The best way for people who do not have a direct contact at nVidia is to use the email generated from the crash dialog box Nsight opens. You can also try DevSupport@nvidia.com. To maximize your chances, you have to provide as much information as you have: dxdiag, Nsight version, driver, ... What helps the most is a crash dump. In your application, add a wait-for-debugger in your main ( a stub is sketched below ), and attach to the process started by Nsight with a secondary Visual Studio. You will have a call stack and a dump to provide.
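
    The wait-for-debugger stub is a few lines of Win32; drop something like this at the top of main so the process idles until you attach the second Visual Studio:

    #include <windows.h>

    int main() {
        while (!IsDebuggerPresent())
            Sleep(100);          // poll until a debugger attaches
        DebugBreak();            // break immediately so you land in the debugger
        // ... rest of the application ...
        return 0;
    }
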
  11. What is your geometry ? How many triangles per instance, how many instances ? Are they optimized for vertex cache ? Is the 30% only from that specific draw or for the full frame ? How did you measure ( GPU marker, RenderDoc, frame delta time ? ) On AMD hardware, a vertex shader reading a structured buffer or a vertex buffer from the instance id would look identical at best and extremely similar at worst, but it is hard to tell on nVidia though ( the structured buffer setup is sketched below ). They still have a fast path for constant buffers that can outperform regular buffers too, but it is usually not worth the effort of a double implementation and maintenance, plus the size limitation.
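
    For reference, a sketch of that structured buffer path on the C++ side ( D3D11, error handling omitted, names hypothetical ): the per-instance matrices live in a StructuredBuffer the vertex shader indexes with SV_InstanceID, so no per-instance vertex stream is needed.

    #include <d3d11.h>
    #include <DirectXMath.h>

    ID3D11ShaderResourceView* CreateInstanceSrv(ID3D11Device* device,
                                                const DirectX::XMFLOAT4X4* data,
                                                UINT instanceCount) {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth = sizeof(DirectX::XMFLOAT4X4) * instanceCount;
        desc.Usage = D3D11_USAGE_DEFAULT;
        desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        desc.StructureByteStride = sizeof(DirectX::XMFLOAT4X4);
        D3D11_SUBRESOURCE_DATA init = { data, 0, 0 };

        ID3D11Buffer* buffer = nullptr;
        device->CreateBuffer(&desc, &init, &buffer);

        ID3D11ShaderResourceView* srv = nullptr;
        device->CreateShaderResourceView(buffer, nullptr, &srv); // full-buffer view
        buffer->Release();   // the view keeps the buffer alive
        return srv;
    }

    Bind it with VSSetShaderResources and index it with SV_InstanceID in the vertex shader.
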
  12. You can have a glimpse of the latest engine I worked on by buying Black Ops 3, haha. May I ask why it has to be a modeling tool ? Because there are a lot of cons to not just relying on an existing solution like Maya/Max/Modo…
  13. It is a unit vector; your light orientation is a quaternion ( it encodes an axis + angle ), so if you transform the unit vectors (1,0,0), (0,1,0), and (0,0,+/-1), you get back a 3x3 rotation matrix, and likely the last row is your light direction ( in 99% of engines; idTech's view vector is X for historical reasons… ). That is why I said it is not necessary to store it separately. My size comparison was fair, by the way: I grouped the quaternion with a translation and scale, and I did not even mention that you do not need full-precision floats for a quaternion if you really want to push it ( I pack the TBN of my meshes in a single HALF4 + sign, for example; others even encode it in 32 bits total ). And no, you do not want skewing; you do not even want non-uniform scale in an engine, as it leads to malformed tangent spaces and incorrect lighting.
  14. You do not have to keep the direction separately ( and duplicated data is a source of bugs, as you have to keep the copies in sync ); rotating a vector by a quaternion is only a few muls and adds. In your case it is even less work, as what you want is to rotate (0,0,1) to get back the direction, and that nullifies many terms ( see the sketch below ). You did not need quaternions to solve your initial problem; they are just an extra tool in your arsenal, and you are likely to need them at some point. What you are aiming for is a good understanding of euclidean space and trigonometry, nothing more, in order to get your mind around the geometry problems you encounter in a 3D engine. If you want a few reasons why quaternions are useful:

    * They solve the gimbal lock issue in most camera systems ( the camera losing a degree of freedom )
    * Small memory footprint ( 4 values ). A full 4x4 world matrix (rot/trans/scale) is 64B, while a quat+trans+scale is 32/40B ( homogeneous scale or not )
    * Premium for animation using slerp, an interpolation method with constant speed ( soon in DXIL 6.1 with custom interpolators ).
    * Easy to negate, multiply, …
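
    A minimal sketch of both forms, using the usual v' = v + 2*cross(q.xyz, cross(q.xyz, v) + w*v) identity and then the specialization for (0,0,1), which is just the corresponding column of the equivalent rotation matrix:

    struct Quat { float x, y, z, w; };   // assumed normalized
    struct Vec3 { float x, y, z; };

    // General rotation of v by q: a few muls and adds, no matrix needed.
    Vec3 Rotate(const Quat& q, const Vec3& v) {
        // t = 2 * cross(q.xyz, v)
        Vec3 t = { 2 * (q.y * v.z - q.z * v.y),
                   2 * (q.z * v.x - q.x * v.z),
                   2 * (q.x * v.y - q.y * v.x) };
        // v' = v + w * t + cross(q.xyz, t)
        return { v.x + q.w * t.x + q.y * t.z - q.z * t.y,
                 v.y + q.w * t.y + q.z * t.x - q.x * t.z,
                 v.z + q.w * t.z + q.x * t.y - q.y * t.x };
    }

    // Specialized for v = (0,0,1): most terms drop out.
    Vec3 Forward(const Quat& q) {
        return { 2 * (q.x * q.z + q.w * q.y),
                 2 * (q.y * q.z - q.w * q.x),
                 1 - 2 * (q.x * q.x + q.y * q.y) };
    }
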