deadc0deh

Member
About deadc0deh

  • Rank
    Member

Personal Information

  • Role
    Programmer
  • Interests
    Programming

  1. That depends a little on which hardware you are on. In general, things like root constants are pre-loaded into SGPRs, so there is no indirection in the shader, but there is a limit to this, after which data gets 'promoted' to other memory. Some hardware also has more flexibility in what it can access through raw pointers versus descriptors (e.g. some unified-memory consoles are far more flexible).

    On some older NVIDIA GPUs it also seems to help to store frequently accessed constant buffer data early in the buffer, because there is apparently special hardware to prefetch it (I couldn't find the link offhand, but you can google it). On much newer hardware that seems to matter less. (A good anecdote: on one project, accessing a large buffer through a descriptor was 2x slower on the GPU than using dedicated constant buffers on a Tegra X1, whereas on a GTX 1080 there was nearly zero difference.) That said, I haven't seen similar benefits on AMD hardware, so YMMV.

    Just to be clear, a cache miss on the GPU isn't terrible by definition. As long as you can hide the latency (similar to a texture fetch) with another warp/wavefront, you won't notice it much. In my experience it is generally better to keep VGPR pressure low. That's not to say there is no benefit in laying out descriptors well, but VGPR usage translates to a larger subset of hardware, whereas descriptor layout tends to be finicky per piece of hardware. Spending a lot of time on that is totally fine if you're optimizing for PS4 or Xbox One, but less so if you are targeting the PC/mobile market.

    I find that AMD is particularly open about this, while NVIDIA is fairly secretive. If anyone does find good links on the topic, please post them, as I would like to know more.
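
    To make the root-constant point concrete, here is a minimal D3D12-style sketch (the values, and the assumption that cmdList is an existing ID3D12GraphicsCommandList*, are mine, not from the thread): a small block of 32-bit root constants is declared in the root signature and set per draw, so the driver can preload the data into scalar registers without a descriptor indirection.

        // Sketch only: error handling and root-signature serialization omitted.
        #include <d3d12.h>

        // Root parameter 0: four 32-bit root constants, visible to the vertex shader (b0 in HLSL).
        D3D12_ROOT_PARAMETER param = {};
        param.ParameterType            = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
        param.Constants.ShaderRegister = 0;
        param.Constants.RegisterSpace  = 0;
        param.Constants.Num32BitValues = 4;
        param.ShaderVisibility         = D3D12_SHADER_VISIBILITY_VERTEX;

        D3D12_ROOT_SIGNATURE_DESC rootDesc = {};
        rootDesc.NumParameters = 1;
        rootDesc.pParameters   = &param;
        // ... serialize/create the root signature and the PSO as usual ...

        // Per draw: the values travel with the command list, no descriptor fetch in the shader.
        float tint[4] = { 1.0f, 0.5f, 0.25f, 1.0f };
        cmdList->SetGraphicsRoot32BitConstants(/*RootParameterIndex*/ 0,
                                               /*Num32BitValuesToSet*/ 4,
                                               tint,
                                               /*DestOffsetIn32BitValues*/ 0);
        cmdList->DrawInstanced(3, 1, 0, 0);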
  2. deadc0deh

    Is vkCmdPushDescriptorSetKHR efficient?

    Your approach B) has a problem: most devices have a fairly low maxBoundDescriptorSets limit (https://vulkan.gpuinfo.org/listlimits.php), so you might not get away with that approach for more complicated draws.

    Approach A) is generally a reasonable solution, provided that you divide your rendering data by update frequency, e.g. PerFrame, PerView, PerMaterial, PerDraw. The per-frame descriptor set is created and bound once per frame; the per-view data gets a descriptor set per view, bound per view. For per-material data you can build descriptor sets per material up front and then just bind them (not update them). For per-draw data performance really matters, and in my own experiments I have found dynamic uniform buffers and push constants to be really helpful (see the sketch below).

    To your point, creating descriptor sets for this type of short-lived data carries an overhead that isn't really acceptable; in DX12 this is usually solved by setting the data as root constants. The DX12 approach you describe is only partially true, though: root constants are limited in size (similar to the Vulkan maxBoundDescriptorSets problem), and you can't set certain resources (e.g. shader resource views) that way.

    I hope this helps.
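
    A minimal sketch of what I mean by binding per frequency (the handle names, the set indices, and the assumption that set 3 contains a single dynamic uniform buffer are illustrative, not from the original question):

        // Assumed to already exist: the descriptor sets, the pipeline layout with a
        // push-constant range of at least sizeof(PerDrawPush), and dynamicOffset.
        #include <vulkan/vulkan.h>

        struct PerDrawPush { float world[16]; };

        void sketchBindByFrequency(VkCommandBuffer cmd, VkPipelineLayout layout,
                                   VkDescriptorSet perFrameSet, VkDescriptorSet perViewSet,
                                   VkDescriptorSet perMaterialSet, VkDescriptorSet perDrawSet,
                                   uint32_t dynamicOffset, const PerDrawPush* perDraw)
        {
            // Once per frame / per view: sets 0 and 1 are built once and just bound.
            VkDescriptorSet lowFreq[] = { perFrameSet, perViewSet };
            vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                                    /*firstSet*/ 0, 2, lowFreq, 0, NULL);

            // Per material: prebuilt set, bound when the material changes, never updated here.
            vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                                    /*firstSet*/ 2, 1, &perMaterialSet, 0, NULL);

            // Per draw: no descriptor allocation at all. Either push a few bytes...
            vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_VERTEX_BIT,
                               0, sizeof(PerDrawPush), perDraw);

            // ...or rebind a set holding one VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC
            // with a new offset into a big per-frame buffer (cheap compared to allocating sets).
            vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                                    /*firstSet*/ 3, 1, &perDrawSet, 1, &dynamicOffset);

            vkCmdDraw(cmd, 3, 1, 0, 0);
        }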
  3. deadc0deh

    Fast random number generation

    The Mersenne Twister is quite expensive; it is, however, really random. rand() is known to not be a good random generator: over many, many samples, patterns emerge. It is, however, decently fast. A fast 'random' is usually a linear congruential generator (https://en.wikipedia.org/wiki/Linear_congruential_generator); a minimal one is sketched below.

    However, if you are using it for a ray tracer, you actually don't want purely random samples, since pure random is usually not very uniform. You want something a little better. Hammersley is a good start, but depending on your exact case not great (http://holger.dammertz.org/stuff/notes_HammersleyOnHemisphere.html). If you want higher quality you need to look at something like http://s2017.siggraph.org/technical-papers/sessions/random-sampling.html

    But keep in mind: different use cases, different generators; different people, different opinions. Nevertheless, I hope this gets you in the right direction.
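
    For reference, a minimal LCG sketch in C++ (the constants are the common Numerical Recipes ones; fast but low quality, so treat it as quick noise rather than a statistically sound generator):

        #include <cstdint>

        struct Lcg {
            uint32_t state;
            explicit Lcg(uint32_t seed) : state(seed) {}

            // One multiply-add per number; wraps around modulo 2^32.
            uint32_t nextU32() {
                state = state * 1664525u + 1013904223u;
                return state;
            }

            // Uniform float in [0, 1): use the top 24 bits so they fit the float mantissa exactly.
            float nextFloat() {
                return (nextU32() >> 8) * (1.0f / 16777216.0f);
            }
        };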
  4. It seems the compute shader is simply not a valid compute shader on that device, so when you attach it, pipeline creation fails as incomplete because there effectively is no compute shader (the handle will be the invalid VK_NULL_HANDLE). Is your compute shader perhaps using a feature/profile that isn't supported by the Adreno 506 but IS supported by the G71/G76? You can try running it through the validator tool; a quick way to catch this at runtime is sketched below.
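
    For example (Vulkan assumed because of the VK_NULL_HANDLE remark; device, spirvWords and spirvSizeInBytes are placeholders): check that the shader module was actually created before building the compute pipeline.

        #include <vulkan/vulkan.h>

        // Assumed to exist: device, plus the SPIR-V blob (spirvWords / spirvSizeInBytes).
        VkShaderModuleCreateInfo smci = { VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO };
        smci.codeSize = spirvSizeInBytes;
        smci.pCode    = spirvWords;

        VkShaderModule computeModule = VK_NULL_HANDLE;
        VkResult res = vkCreateShaderModule(device, &smci, NULL, &computeModule);
        if (res != VK_SUCCESS || computeModule == VK_NULL_HANDLE) {
            // If this fails (or you skip the check and feed VK_NULL_HANDLE into
            // VkComputePipelineCreateInfo), pipeline creation cannot succeed.
            // Validate the SPIR-V offline with spirv-val / glslangValidator to find
            // the feature the Adreno 506 rejects.
        }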
  5. And not even a +1, sad panda. Edited: Thank you.
  6. In general this happens when you accidentally calculate the window size incorrectly and the output ends up being scaled slightly to fit the display surface. Make sure you size your framebuffer from the window's client rectangle, not the entire window (a sketch is below). The second most common cause I know of is multisampled render targets: the resolve will then 'smear' the points out (for lack of a better term, and out of laziness to avoid explaining multisampling here). It is also important to keep glPointSize at 1.0f, as different sizes can produce different results; see https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/glPointSize.xml
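
    A minimal Win32 sketch of the client-rectangle calculation (Win32 and the window class name are assumptions on my part; the original post doesn't say which platform):

        #include <windows.h>

        // Create a window whose *client* area is exactly 1280x720, so the GL surface
        // is not stretched by the border and title bar.
        RECT r = { 0, 0, 1280, 720 };
        DWORD style = WS_OVERLAPPEDWINDOW;
        AdjustWindowRect(&r, style, FALSE);   // grow the rect to include the non-client area

        HWND hwnd = CreateWindowExA(0, "MyWindowClass", "App", style,   // class registered elsewhere
                                    CW_USEDEFAULT, CW_USEDEFAULT,
                                    r.right - r.left, r.bottom - r.top,
                                    NULL, NULL, GetModuleHandleA(NULL), NULL);

        // When (re)creating the framebuffer / setting the viewport, query the client rect again.
        RECT client;
        GetClientRect(hwnd, &client);
        int fbWidth  = client.right - client.left;   // 1280
        int fbHeight = client.bottom - client.top;   // 720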
  7. deadc0deh

    Optimizing OpenGL rendering

    Did you ever figure out what was wrong? It would be nice to know what it ended up being.
  8. This is partially true. In the "real world", for large scenes everything gets transformed to be relative to the camera on the CPU side first, which greatly reduces the required bit depth (see the sketch below). A lot of special cases exist too: in an FPS, for example, the gun/hands usually use a different depth range than the rest of the scene, because they need a really close near plane, which destroys z precision at distance; far-away geometry (other than the skybox) uses yet another range, and so on. Another example of relocating to the camera is skeletal animation, which usually suffers heavily from imprecision in the hierarchy of matrix multiplications combined with compression; moving the whole thing into a camera-local frame before multiplying helps greatly.
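
    A minimal sketch of the camera-relative idea (GLM is my assumption for the math types; the post itself doesn't name a library):

        #include <glm/glm.hpp>

        // Subtract the camera position while still in double precision, then hand the
        // GPU a model matrix whose translation is small, paired with a view matrix
        // whose translation is zero.
        glm::mat4 cameraRelativeModel(const glm::dvec3& objectWorldPos,
                                      const glm::dvec3& cameraWorldPos,
                                      const glm::mat4&  objectRotationScale)
        {
            glm::dvec3 relative = objectWorldPos - cameraWorldPos;  // big values cancel here

            glm::mat4 model = objectRotationScale;
            model[3] = glm::vec4(glm::vec3(relative), 1.0f);        // small float translation
            return model;
        }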
  9. You cast a ray from the camera's eye position in the direction, inside the view frustum, that corresponds to the mouse location. (An easy way to do this is to project the mouse point onto the near plane and transform it with the inverse of World * View * Projection; note, not the per-object World, only a global World matrix if you have one, otherwise just the inverse of View * Projection. A sketch is below.)

    The ray is then intersected with bounding volumes to quickly cull geometry the ray cannot hit. If you have overlapping geometry, their boxes overlap; you then need to do a ray-triangle test on the geometry inside the bounding volume and pick the hit point you want, usually the closest one, i.e. origin + t * direction for the minimum t (but not a negative t). In any case the ray may intersect multiple triangles. If you have a lot of triangles in a single bounding volume and you're going to do a lot of ray casts, it may be beneficial to build a bounding volume hierarchy to speed them up.

    Alternatively, if you don't really care about the triangle itself, only about which object was hit, and you have trouble with the math, an easy 'solution' is to render a 1x1 viewport at the mouse location and render each object with a unique color. You can then read back that color and you know which object was closest to the viewer. (The downside is that you use the GPU a little, and the read-back causes a sync point, but in practice this is a fairly easy method for editors and the like.)
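
    The unprojection step, as a sketch (GLM assumed, OpenGL-style NDC where -1 is the near plane; the parameter names are mine):

        #include <glm/glm.hpp>

        struct Ray { glm::vec3 origin; glm::vec3 direction; };

        Ray mousePickRay(float mouseX, float mouseY, float viewportW, float viewportH,
                         const glm::mat4& view, const glm::mat4& proj, const glm::vec3& eye)
        {
            // Mouse -> normalized device coordinates (screen y grows downwards, so flip it).
            float ndcX = 2.0f * mouseX / viewportW - 1.0f;
            float ndcY = 1.0f - 2.0f * mouseY / viewportH;

            // A point on the near plane in clip space, transformed back to world space.
            glm::vec4 nearWorld = glm::inverse(proj * view) * glm::vec4(ndcX, ndcY, -1.0f, 1.0f);
            nearWorld /= nearWorld.w;

            Ray ray;
            ray.origin    = eye;
            ray.direction = glm::normalize(glm::vec3(nearWorld) - eye);
            return ray;
        }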
  10. Thank you for that link, that was a good read. I'm guessing you are referring to this section: the only issue I have with it is that the hardware mentioned is again from around 2012, which perhaps doesn't make it too representative of 2019; the original poster quoted an Adreno 630, which is from 2018. Nevertheless, your link set me onto an interesting search, and I found some more relevant numbers for Unreal Engine in this presentation, where they talk about holding 30 fps at a comfortable CPU usage for Fortnite. (Note that the time stamp is at the draw call count averages and is for fairly dated hardware.) It seems they vary from roughly 1000-2000 root draw calls (not counting instances/draw indirect) on mainstream hardware from around 2015-2016. Anyway, thanks again Wyrframe for the information; it set me down a path of watching videos and reading articles in which I learned more about cool ways of optimizing content too!
  11. Do you have any information that supports this "sub-100, or even sub-30"? The best I could find was a Unity post from 2014 that claims 2000 draw calls is a good number: https://answers.unity.com/questions/694570/unity-mobile-draw-call-limit.html I am really interested in getting a feel for a rough draw call budget on mobile. On PC the numbers are really high (for example: https://www.anandtech.com/show/11223/quick-look-vulkan-3dmark-api-overhead), but I don't have a good feeling for mobile. That said, draw call count isn't really a representative number by itself; there are many factors. There is the CPU overhead of setting render state and setting up resources (textures, buffers, etc.), which makes the driver overhead vary wildly between a simple call that just draws and one that also sets a lot of state. Then there is the GPU side, which varies even more wildly, but that is not really related to the original poster's problem, as he/she doesn't change the scene. In any case, I hope someone has a good explanation for this, or that iGrfx posts the results he finds, because it is an interesting question.
  12. deadc0deh

    Optimizing OpenGL rendering

    Based on the information you have provided (in particular the NSight screenshots) there are a lot of texture stalls: the SM Throughput For Active Cycles is low, combined with a relatively high SM Warp Stall Long Scoreboard. You can read more about this at https://devblogs.nvidia.com/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/, in particular "Example 3: TEX-Latency Limited Workload".

    Digging through your RenderDoc capture (which I couldn't open in RenderDoc 1.4 or 1.3, so I just dug through the XML), it seems like you're creating reasonably sized textures with uncompressed formats that are not mipmapped (I see a 512x512 GL_RGB, a 1000x1000 GL_RGB, a 1024x1024 GL_RGB, etc.). Because I cannot open the capture I can't easily tell whether mipmaps are generated for these textures elsewhere, but it is worth checking. You could try forcing 1x1 textures in NSight; if the framerate goes up, make sure the textures are mipmapped, and ideally use a compressed format such as DXT1 for them (see the sketch below).

    Also, from the screenshot it seems you have some sort of depth of field going on, and from the same capture it looks like you're using a 1366x763 framebuffer in a half-float format (GL_RGB16F). If you are using this for the DoF, that costs a lot of texture bandwidth. I would imagine the downsample is somewhere in your NSight capture around 30 ms, and if so, that looks expensive at first glance.

    I hope this helps.

    ------------

    Speculation on my part, but based on the results you quoted: it could be that at low resolutions fewer fragments are needed, so less memory bandwidth is used, while at 1024x1024 and 2048x2048 window sizes multiple fragments could be reusing the same texture sample, and hence performance doesn't deteriorate drastically at higher resolutions. Again, this is speculation.
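
    The mipmapping part, as a minimal OpenGL sketch (a GL 3.0+ loader and the width/height/pixels variables are assumed; the DXT1 upload path is left out for brevity):

        // Upload the texture, generate a full mip chain, and use a mipmapped min filter
        // so distant fragments read small mip levels instead of thrashing the texture cache.
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels);   // ideally a compressed format (e.g. DXT1)
        glGenerateMipmap(GL_TEXTURE_2D);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);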
  13. deadc0deh

    Rendering Order, Blending and Performance.

    It might be worth noting that for the front-to-back opaque objects the order does not have to be perfect. One of the main points of this optimization is that you populate the depth buffer so that any fragments (that don't change depth in the shader) can be rejected before the fragment/pixel shader runs at all. There is usually an acceleration structure such as HTile that makes this very efficient, multiple fragments per clock (e.g. see http://developer.amd.com/wordpress/media/2013/10/evergreen_cayman_programming_guide.pdf, chapter 7.3). Because the order only needs to be approximate, you can simply use a dot product between the object's origin and the camera's view direction (you may have to subtract the camera's eye location if you're dealing with large scenes, and obviously this works best if you don't have huge objects).

    Furthermore, you can easily limit the set of objects you have to sort by marking large occluding objects and rendering only those front-to-back; once most of the depth buffer is populated, a lot of the smaller objects will never make it to the fragment shader, so it doesn't matter whether they go front-to-back or not. This can severely reduce the set of objects you have to sort. It is also commonplace to perform a 'light weight sort' by storing, say, a struct of { float distance; int index; } and later using just the index to walk the actual objects. This is particularly beneficial if you're not sorting pointers but heavyweight objects, as it limits memory traffic during the sort (a radix sort works quite well here); see the sketch below.

    Note that you will still need to sort all transparent objects back-to-front, although not 'all' objects with alpha need this: a lot of particle systems render in additive mode (src = one, dst = one), which is render-order independent, so no sorting is required. Regarding what user DerTroll points out: it might be worth noting that order-independent transparency is usually quite memory hungry. If you are willing to opt for a multi-pass algorithm, it might be worth looking into a technique called "depth peeling". It depends a little on the scenario and the type of objects you are trying to render, but knowing yet another technique doesn't hurt.

    Hope this helps.
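
    A sketch of the light-weight sort using the approximate view-direction distance (the names, and using std::sort instead of a radix sort, are my choices for brevity):

        #include <algorithm>
        #include <cstdint>
        #include <vector>
        #include <glm/glm.hpp>

        struct SortKey {
            float    distance;   // signed distance along the camera's view direction
            uint32_t index;      // index into the real (heavyweight) object array
        };

        // Front-to-back for opaque objects; reverse the comparison for transparent ones.
        void buildAndSortOpaqueKeys(std::vector<SortKey>& keys,
                                    const std::vector<glm::vec3>& objectOrigins,
                                    const glm::vec3& cameraEye,
                                    const glm::vec3& cameraViewDir)
        {
            keys.resize(objectOrigins.size());
            for (size_t i = 0; i < keys.size(); ++i) {
                keys[i].index    = static_cast<uint32_t>(i);
                // Dot product with the view direction; subtracting the eye keeps large scenes sane.
                keys[i].distance = glm::dot(objectOrigins[i] - cameraEye, cameraViewDir);
            }
            std::sort(keys.begin(), keys.end(),
                      [](const SortKey& a, const SortKey& b) { return a.distance < b.distance; });
            // Draw submission then walks keys[i].index, never the heavy objects themselves.
        }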
  14. https://devblogs.microsoft.com/cppblog/c-game-development-workload-in-visual-studio-2017/
  15. deadc0deh

    Applying a Shader to a sprites leads to strange things

    So I just reproduced a version of your problem. Given a test image (generated in paint.net), creating a sprite in Unity, assigning the unlit shader to a material and assigning that material to a mesh, I'm seeing the same thing. The problem seems to be that the unlit shader is marked as "Opaque", which means it renders mostly solid. The weird cutout you see appears in the parts where the alpha is below a specific threshold (without going into details, that is where the renderer simply discards the pixels). The different colors you see in the transparent parts of your image are 'hidden' in the alpha channel of the source images; if you have the source image, you can verify this by raising the alpha for that layer, and they will then show. If you want the sprite to look as intended, just use a sprite and don't assign an unlit shader. In general you should not be using an unlit shader with sprites, and I'm not sure what specific use you could have for wanting to override this shader.