HellRaiZer

Member
  • Content count

    663
  • Joined

  • Last visited

Community Reputation

1001 Excellent

About HellRaiZer

  • Rank
    Advanced Member

  1. HellRaiZer

    Useless Snippet #2: AABB/Frustum test

    Everyone, thanks for the constructive comments. I've updated the article based on some of your suggestions.   @Matias Goldberg: Unfortunately, changing _mm_load_ps((float*)&absPlaneMask[0]) to be standard compliant as you suggested, as well as adding the __restrict keyword, requires a rerun of the benchmarks, because otherwise the numbers won't be accurate. I'll keep a note and hope to be able to do it sometime soon.   One small note about your last comment. The 4 AABBs at a time SSE version has a loop at the end which should handle the case where the number of AABBs isn't a multiple of 4. The code isn't shown for clarity (it should actually be the same as the 1 AABB at a time version).   Thanks again for the great input.
  2. HellRaiZer

    Useless Snippet #2: AABB/Frustum test

    @zdlr: Of course you are right. I forgot about those two. I'll edit the article to read "C++" instead of "C".   @Matias: Thank you very much for the tips. I'll try to find some time to make the changes you suggest and edit the article. I'll also try to make a little VS project to attach to the article at the same time.    About the two dot products: do you mean method 5 in that article? In that case, the resulting code wouldn't be able to distinguish between the fully inside and intersecting states, which is a requirement for this article. I know it might sound bad to optimize something while adding arbitrary restrictions that might affect performance, but I think being able to distinguish between those two cases helps when culling a hierarchy. E.g. if the parent is completely inside, there's no need to check any of its children.   Please correct me if I'm wrong.
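A minimal scalar sketch of the three-state test discussed above, assuming a center/extent AABB and plane normals that point into the frustum. The type and function names (Vec3, Plane, AABB, classify) are hypothetical and are not taken from the article.

```cpp
#include <cmath>

struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };      // assumed convention: dot(n, p) + d >= 0 is the inside half-space
struct AABB  { Vec3 center, extent; };  // extent holds the half-sizes along each axis

enum class CullResult { Outside, Intersecting, Inside };

// Classifies a box against a set of planes (hypothetical helper).
CullResult classify(const AABB& box, const Plane* planes, int numPlanes)
{
    bool fullyInside = true;
    for (int i = 0; i < numPlanes; ++i) {
        const Plane& p = planes[i];
        // Signed distance of the box center from the plane (without p.d).
        const float d = box.center.x * p.n.x + box.center.y * p.n.y + box.center.z * p.n.z;
        // Projected radius of the box onto the plane normal.
        const float r = box.extent.x * std::fabs(p.n.x)
                      + box.extent.y * std::fabs(p.n.y)
                      + box.extent.z * std::fabs(p.n.z);
        if (d + r + p.d < 0.0f) return CullResult::Outside;  // farthest corner is behind the plane
        if (d - r + p.d < 0.0f) fullyInside = false;         // nearest corner is behind the plane
    }
    return fullyInside ? CullResult::Inside : CullResult::Intersecting;
}
```

When a parent node comes back as Inside, a hierarchy traversal can accept all of its children without testing them, which is the case the extra state is meant to enable.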
  3. HellRaiZer

    Useless Snippet #2: AABB/Frustum test

    Thank you all for the comments.   @Servant: Bacterius is right. The numbers are "cycles per AABB". Culling a batch of 1024 AABBs in a single loop ends up being faster per AABB than a batch of 32, probably because the fixed function overhead (e.g. calculating the abs of the plane normals) is amortized over more iterations of the main loop.    Example: Assume that the initial loop which calculates the abs of the plane normals requires 200 cycles. Also, assume that each AABB requires 100 cycles. For a batch of 10 AABBs, the function would require 1200 cycles to complete, or in other words 120 cycles per AABB. If the batch had 1000 AABBs, the function would require 100200 cycles, or 100.2 cycles per AABB. Hope that makes sense.   @zdlr: Variable names were kept like that in order to match the examples in Fabian Giesen's article which I linked above. Also, the term 'reference implementation' doesn't mean that the code uses references. It means that this specific snippet is used as a baseline for performance comparisons (if this was what you meant).
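The arithmetic in the example above can be sketched as a one-line function. The name cyclesPerAABB and its parameters are hypothetical, and the 200 and 100 cycle figures are the illustrative numbers from the post, not measurements.

```cpp
// Average cost per AABB when a fixed setup cost is amortized over a batch.
// setupCycles: one-time cost per call (e.g. computing abs of the plane normals).
// cyclesPerBox: cost of testing a single AABB in the main loop.
double cyclesPerAABB(double setupCycles, double cyclesPerBox, int numBoxes)
{
    return (setupCycles + cyclesPerBox * numBoxes) / numBoxes;
}
```

With the figures from the post, a batch of 10 gives (200 + 100 * 10) / 10 = 120 cycles per AABB, and a batch of 1000 gives (200 + 100 * 1000) / 1000 = 100.2.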
  4. HellRaiZer

    Shader "plugin" system?

    It has been a long time since I touched any rendering-related code, but I'll try to describe what I remember from my implementation.

    Each surface shader (i.e. a shader that will be applied to the surface of a 3D model) can use one or more plugins. Shader plugins were implemented using Cg interfaces.

    So, your example above would look something like this (pseudocode, since I haven't written a line of Cg for a couple of years now):

        IAmbientLighting g_AmbientLighting;
        sampler2D DiffuseTex;

        float4 mainPS(VS_OUTPUT i)
        {
            float4 diffuseColor;
            diffuseColor.rgb = tex2D(DiffuseTex, i.UV.xy).rgb;
            diffuseColor.rgb *= g_AmbientLighting.CalcAmbientLighting(i.PosWS.xyz);
            diffuseColor.a = 1.0;
            return diffuseColor;
        }

    The IAmbientLighting interface would look like this:

        interface IAmbientLighting
        {
            float3 CalcAmbientLighting(float3 posWS);
        }

    Your current shader would have used a constant ambient color implementation. Something like:

        class ConstAmbientLight : IAmbientLighting
        {
            float4 AmbientColor;

            float3 CalcAmbientLighting(float3 posWS)
            {
                // rgb is the color, a is the intensity.
                return AmbientColor.rgb * AmbientColor.a;
            }
        }

    If you would like to change to an SSAO implementation, instead of using this class you would use:

        class SSAO : IAmbientLighting
        {
            sampler2D SSAOTex;
            float4 AmbientColor;
            float4x4 WSToSSMatrix;

            float3 CalcAmbientLighting(float3 posWS)
            {
                float2 screenSpacePos = TransformToSS(posWS, WSToSSMatrix);
                float ssao = tex2D(SSAOTex, screenSpacePos).r;
                return AmbientColor.rgb * AmbientColor.a * ssao;
            }
        }

    With those two interface implementations available, the renderer is responsible for selecting the correct one at run-time, based on some criteria (user prefs, GPU caps, etc.), and linking it to all the surface shaders which use an IAmbientLighting object.

    The idea can be extended to other things. E.g. different kinds of lights (omni, point, directional) can be implemented as implementations of one common ILight interface.

    This way you can create (e.g.) a Phong shader with or without SSAO, using one or more lights of any type.

    That's the basic idea. Hope it makes some sense. If not, just say so and I'll do my best to describe it better.
  5. Try creating one View.OnClickListener object and use View.getId() on the passed view object.

    Something like this:

        View.OnClickListener listener = new View.OnClickListener() {
            public void onClick(View v) {
                int id = v.getId();
                switch (id) {
                case 1000: // button0 was clicked...
                    break;
                case 1001: // button1 was clicked...
                    break;
                }
            }
        };
        button0.setOnClickListener(listener);
        button1.setOnClickListener(listener);
        ...
        button0.setId(1000);
        button1.setId(1001);
        ...

    Hope that helps.
  6. HellRaiZer

    Is my frustum culling slow ?

    If the AABBs correspond to static geometry, translating them to world space every frame is overkill. You should do it once at start-up.   If it's about dynamic geometry, then it isn't that simple once rotation is involved. If your objects rotate and you want to use the same code for all your objects, you should calculate a new AABB from the OBB defined by the original AABB and the object's transformation. Otherwise you can find/write another function which culls OBBs against the frustum.    In case you go the OBB route, it might be faster to just check the bounding sphere (which isn't affected by rotation) against the frustum, at the expense of rendering a few more objects (bounding spheres tend to be larger than AABBs, depending on the object they enclose).
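One common way to compute the AABB that encloses a rotated and translated AABB is to transform the center normally and multiply the extents by the absolute value of the rotation matrix. The sketch below assumes a center/extent box and a row-major 3x3 rotation plus a translation; the Transform layout and the name transformAABB are hypothetical.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct AABB { Vec3 center, extent; };           // extent holds the half-sizes
struct Transform { float r[3][3]; Vec3 t; };    // assumed: row-major rotation, then translation

// Returns the world-space AABB enclosing the box after applying m (hypothetical helper).
AABB transformAABB(const AABB& box, const Transform& m)
{
    const float c[3] = { box.center.x, box.center.y, box.center.z };
    const float e[3] = { box.extent.x, box.extent.y, box.extent.z };
    const float t[3] = { m.t.x, m.t.y, m.t.z };
    float nc[3], ne[3];
    for (int i = 0; i < 3; ++i) {
        nc[i] = t[i];
        ne[i] = 0.0f;
        for (int j = 0; j < 3; ++j) {
            nc[i] += m.r[i][j] * c[j];             // new center = R * c + t
            ne[i] += std::fabs(m.r[i][j]) * e[j];  // new extent = |R| * e
        }
    }
    return AABB{ { nc[0], nc[1], nc[2] }, { ne[0], ne[1], ne[2] } };
}
```

The resulting box is conservative: it is never smaller than the rotated box, so culling with it may keep a few objects that a tighter OBB test would reject.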
  7. HellRaiZer

    Is my frustum culling slow ?

    @lipsryme If you get an access violation as soon as you add a 4th AABB to the list, it means that your aabbList array isn't 16-byte aligned. Two choices here: 1) Explicitly allocate the aabbList array to be 16-byte aligned (e.g. using _aligned_malloc()), or 2) Replace the 6 _mm_load_ps() calls with _mm_loadu_ps(), which allows reading from unaligned addresses.   Hope that helps.   PS. To test if an array address is 16-byte aligned you can use one of the functions found here: http://stackoverflow.com/a/1898487/2136504 E.g. check &aabbList[(iIter << 2) + 0].center.x; if it returns true but the _mm_load_ps() still fails, then something else is wrong with your array.
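A minimal sketch of such an alignment check, assuming the usual approach of testing the address modulo the alignment. The name isAligned is hypothetical and is not necessarily the function from the linked answer.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if ptr is a multiple of alignment (e.g. 16 for _mm_load_ps).
inline bool isAligned(const void* ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

For example, calling isAligned(&aabbList[(iIter << 2) + 0].center.x, 16) just before the first aligned load shows whether the array itself is the culprit.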
  8. HellRaiZer

    Is my frustum culling slow ?

    I believe the changes you made to the code are the problem. To be more exact:

    The original code reads the centers and extents of the 4 AABBs from an array with the following layout:

        c0.x, c0.y, c0.z, e0.x, e0.y, e0.z,
        c1.x, c1.y, c1.z, e1.x, e1.y, e1.z,
        c2.x, c2.y, c2.z, e2.x, e2.y, e2.z,
        c3.x, c3.y, c3.z, e3.x, e3.y, e3.z, ...

    When the following instructions are executed, the XMM registers hold the values mentioned in their names:

        // NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
        // Load the 4 Center/Extents pairs for the 4 AABBs.
        __m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
        __m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
        __m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
        __m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
        __m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
        __m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);

    If we assume that the initial aabbList array is 16-byte aligned, all loads are 16-byte aligned and the instructions are executed correctly. This is the reason we are loading the XMM regs with those specific array offsets.

    On the other hand, your code doesn't do the same thing. It just stores the AABBs on the stack and the layout isn't the one expected by the code. The best case scenario is that your layout is:

        c0.x, c0.y, c0.z, c1.x, c1.y, c1.z, c2.x, c2.y, c2.z, c3.x, c3.y, c3.z,
        e0.x, e0.y, e0.z, e1.x, e1.y, e1.z, e2.x, e2.y, e2.z, e3.x, e3.y, e3.z

    but:
    1) I think you can't be 100% sure about that (e.g. that the compiler will place the centers before the extents).
    2) It's not guaranteed to be 16-byte aligned.
    3) Most importantly, it's not what the code expects.

    If you have to read the AABBs the way you did (one element at a time) I would suggest something like this:

        __declspec(align(16)) _Vector3f aabbData[8];
        aabbData[0].x = ... // center0.x
        aabbData[0].y = ... // center0.y
        aabbData[0].z = ... // center0.z
        aabbData[1].x = ... // extent0.x
        ...

    And then use this array to load the XMM regs as in the original code snippet.

    PS. If you try to understand what the code does with those SSE instructions, you might be able to "optimize" it and get rid of the loads and shuffles completely. This is in case you continue to read the AABB data the way you do it.
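A portable sketch of the repacking step, assuming 4 boxes are copied into 24 consecutive floats in the center/extent interleaved order the loads expect. The names Vec3, AABB and packFourAABBs are hypothetical, and alignas(16) stands in for the compiler-specific __declspec(align(16)).

```cpp
struct Vec3 { float x, y, z; };
struct AABB { Vec3 center, extent; };

// Writes c0.xyz, e0.xyz, c1.xyz, e1.xyz, ... into out (hypothetical helper).
// The caller is expected to declare the destination as: alignas(16) float packed[24];
void packFourAABBs(const AABB boxes[4], float out[24])
{
    for (int i = 0; i < 4; ++i) {
        float* dst = out + i * 6;   // 6 floats per box
        dst[0] = boxes[i].center.x;
        dst[1] = boxes[i].center.y;
        dst[2] = boxes[i].center.z;
        dst[3] = boxes[i].extent.x;
        dst[4] = boxes[i].extent.y;
        dst[5] = boxes[i].extent.z;
    }
}
```

With a 16-byte aligned destination, the six loads then start at float offsets 0, 4, 8, 12, 16 and 20, each of which is a multiple of 16 bytes.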
  9. HellRaiZer

    Is my frustum culling slow ?

      Now I am confused, as the 4-box version says:   // NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.   where that part of the code you mentioned is.   What I meant is that the calculations of (d+r) and (d-r) in the 4-box-at-a-time loop are correct.   When you replace the comment you mentioned with the loop from CullAABBList_SSE_1(), you have to fix the typo I mentioned in order for it to be correct.   Hope that makes sense.
  10. HellRaiZer

    Is my frustum culling slow ?

    It's been a long time since my last reply here on gamedev.net.

    @lipsryme: Happy to know that my blog post actually helped someone. Unfortunately, there seems to be an error in the code, so the culling will be incorrect. Since you haven't noticed it yet, I'd assume that you are just rendering more objects than needed.

    The error is in:

        __m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
        __m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

    Can you spot it? xmm_d_m_r should subtract r from d, not add it! It should be:

        __m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

    I don't have the project anymore, so I'd assume it's just a blog post typo and it didn't affect the timings.

    On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly.

    Hope this doesn't ruin your benchmarks.
  11. HellRaiZer

    Vertex Cache Optimal Grid

    The post, from my Google Reader cache:

    This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn’t good enough. I’m sure many people don’t know this either, so here it goes.

    Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object. It’s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.

    That’s enough for the motivation. Imagine we have a 4×4 grid. The first two rows would look like this:

        * - * - * - * - *
        | / | / | / | / |
        * - * - * - * - *
        | / | / | / | / |
        * - * - * - * - *

    With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:

        7 - 5 - 3 - 1 - *
        | / | / | / | / |
        6 - 4 - 2 - 0 - *
        | / | / | / | / |
        * - * - * - * - *

    And after the next two triangles:

        * - 7 - 5 - 3 - 1
        | / | / | / | / |
        * - 6 - 4 - 2 - 0
        | / | / | / | / |
        * - * - * - * - *

    Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles two of the vertices that were previously in the cache need to be loaded again:

        * - * - * - 7 - 5
        | / | / | / | / |
        3 - 1 - * - 6 - 4
        | / | / | / | / |
        2 - 0 - * - * - *

    Instead of using the straightforward traversal, it’s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.
    All these options are better than not doing anything, but still produce results that are far from the optimal. In the table below you can see the results obtained for a 16×16 grid and with a FIFO cache with 20 entries:

        Method            ACMR    ATVR
        Scanline          1.062   1.882
        NVTriStrip        0.818   1.450
        Morton            0.719   1.273
        K-Cache-Reorder   0.711   1.260
        Hilbert           0.699   1.239
        Forsyth           0.666   1.180
        Tipsy             0.658   1.166
        Optimal           0.564   1.000

    Note that I’m using my own implementation for all of these methods. So, the results with the code from the original author might differ slightly.

    The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles. When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:

        4 - 3 - 2 - 1 - 0
        | / | / | / | / |
        * - * - * - * - *
        |   |   |   |   |
        * - * - * - * - *

    That can be done issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:

        * - 7 - 6 - 5 - 4
        | / | / | / | / |
        3 - 2 - 1 - 0 - *
        | / | / | / | / |
        * - * - * - * - *

    In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2. The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them.
    The following code accomplishes that:

        void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)
        {
            if (x1 - x0 + 1 < cacheSize)
            {
                if (2 * (x1 - x0) + 1 > cacheSize)
                {
                    // Prefetch the first row of vertices with degenerate triangles.
                    for (int x = x0; x < x1; x++)
                    {
                        indices.push_back(x + 0);
                        indices.push_back(x + 0);
                        indices.push_back(x + 1);
                    }
                }

                // Emit the triangles of this section in scanline order.
                for (int y = y0; y < y1; y++)
                {
                    for (int x = x0; x < x1; x++)
                    {
                        indices.push_back((width + 1) * (y + 0) + (x + 0));
                        indices.push_back((width + 1) * (y + 1) + (x + 0));
                        indices.push_back((width + 1) * (y + 0) + (x + 1));

                        indices.push_back((width + 1) * (y + 0) + (x + 1));
                        indices.push_back((width + 1) * (y + 1) + (x + 0));
                        indices.push_back((width + 1) * (y + 1) + (x + 1));
                    }
                }
            }
            else
            {
                // The section is too wide for the cache: split it and recurse.
                int xm = x0 + cacheSize - 2;
                gridGen(x0, xm, y0, y1, width, cacheSize);
                gridGen(xm, x1, y0, y1, width, cacheSize);
            }
        }

    This may not be the most optimal grid partition, but the method still performs pretty well in those cases. Here are the results for a cache with 16 entries:

        Method            ACMR    ATVR
        Scanline          1.062   1.882
        NVTriStrip        0.775   1.374
        K-Cache-Reorder   0.766   1.356
        Hilbert           0.754   1.336
        Morton            0.750   1.329
        Tipsy             0.711   1.260
        Forsyth           0.699   1.239
        Optimal           0.598   1.059

    And for a cache with only 12 entries:

        Method            ACMR    ATVR
        Scanline          1.062   1.882
        NVTriStrip        0.875   1.550
        Forsyth           0.859   1.522
        K-Cache-Reorder   0.807   1.491
        Morton            0.812   1.439
        Hilbert           0.797   1.412
        Tipsy             0.758   1.343
        Optimal           0.600   1.062

    In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would be interesting to take into account some of these observations in a general mesh optimization algorithm.
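The ACMR figures in the tables can be reproduced in spirit with a small FIFO cache simulation. The sketch below is an assumption about how such a measurement could be done, not the code used for the tables: the name computeACMR is hypothetical, and every index triple (degenerate ones included) is counted as a triangle, which may differ from the convention behind the quoted numbers.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Average cache miss ratio: post-transform cache misses per triangle,
// simulated with a FIFO cache holding cacheSize vertex indices.
double computeACMR(const std::vector<int>& indices, std::size_t cacheSize)
{
    std::deque<int> fifo;
    std::size_t misses = 0;
    for (int idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end()) {
            ++misses;                 // vertex has to be transformed again
            fifo.push_back(idx);
            if (fifo.size() > cacheSize) {
                fifo.pop_front();     // evict the oldest entry
            }
        }
        // A hit does not change the order of a FIFO cache.
    }
    const std::size_t numTriangles = indices.size() / 3;
    return numTriangles ? static_cast<double>(misses) / numTriangles : 0.0;
}
```

Feeding it the index list produced by a grid generator for different cache sizes gives a quick way to compare traversal orders; dividing the miss count by the number of unique vertices instead would give the ATVR column.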
  12. HellRaiZer

    Virtualized Scenes and Rendering

    I think you have to include the reply_id=XXXXXX part of the post. E.g. all links right now seem to point to the journal itself (http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=363003?). Btw, you don't need the last '?'. Change those to (e.g.) http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=363003&reply_id=3473003. (I hope the last link works when I press Reply, otherwise ignore me :) )
  13. HellRaiZer

    Another terrain editing question

    I think what haegarr described is similar to this (if I remember the details correctly): Indirection Mapping for Quasi-Conformal Relief Texturing. Unfortunately, I don't think this will solve the problem. The algorithm described in the above paper is about increasing the texture resolution on steep slopes. For relief/parallax/etc. mapping this is way better than not doing anything. But in the case of a heightmap terrain, increasing the resolution of those regions won't solve the problem. The textures will still be stretched. To overcome stretching you might need a different set of texcoords for those problematic regions (e.g. like in tri-planar mapping). On the other hand, maybe my understanding/implementation of the above paper wasn't correct after all. HellRaiZer
  14. HellRaiZer

    Vectorising a Dot Product

    I think the comparison is unfair. The "normal" way of doing things isn't the same as the SSE one. Two reasons: 1) You keep the result of all the transformations in the same variable. This can be optimized away by the compiler if it's clever enough. And if it's not, it will certainly help the cache. 2) The 'result' variable is kept on the stack and it's not used by the code following the loop, so again it can be optimized away. Why don't you write the transformVector1() function in the same way BatchTransform1() works? a) Get the z coord from the input vector. b) Use the out[] array in order to keep the results. In order to be sure that the actual work isn't omitted by the compiler, print all the values to a file after you are done (outside of the tic()/toc() blocks, of course). Finally, using intrinsics instead of inline asm can let the compiler optimize the function when it's inlined. HellRaiZer
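A sketch of what a fair scalar baseline could look like under those suggestions: it reads every input vector and writes every result to an output array, so it does the same memory traffic as a batched SIMD version. The names Vec4, transformBatchScalar and checksum, and the row-major 4x4 matrix layout, are assumptions; a checksum is used here in place of printing to a file.

```cpp
#include <cstddef>

struct Vec4 { float x, y, z, w; };

// Scalar baseline: out[i] = M * in[i], with M stored row-major in m[16] (assumed layout).
void transformBatchScalar(const float m[16], const Vec4* in, Vec4* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        const Vec4& v = in[i];
        out[i].x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
        out[i].y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
        out[i].z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
        out[i].w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
    }
}

// Consuming every result after the timed region keeps the compiler
// from discarding the work as dead code.
float checksum(const Vec4* v, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += v[i].x + v[i].y + v[i].z + v[i].w;
    }
    return sum;
}
```

Calling checksum() on the output of both the scalar and the SSE version, outside the timed blocks, also doubles as a quick check that the two versions compute the same thing.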
  15. HellRaiZer

    alpha blending on cpu SSE

    Unfortunately, afaik there is no pack instruction which interleaves data between registers. The closest you can get is _mm_packus_epi16, but this will not work in this case (interleaving the 1st and 3rd bytes from the 1st DWORD in one reg with the 1st and 3rd bytes from the 1st DWORD of the second reg). I think the best you can do is an AND, a shift and an OR, like you do now. Remember that masking and then shifting can be the same as shifting and then masking, so you might need only one mask (fewer constants) for both packing and unpacking. (I think you do that in your unpacking code.) If anyone has a better idea on how to do this (interleaving bytes from xmm regs), I'd be glad to hear it. HellRaiZer
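The single-mask idea can be sketched on one 32-bit pixel in scalar code; the same AND, shift and OR sequence applies per DWORD in an XMM register. The names kMask, Unpacked, unpack and pack are hypothetical, and the byte numbering (1st byte = least significant) is an assumption.

```cpp
#include <cstdint>

constexpr std::uint32_t kMask = 0x00FF00FFu;   // selects the 1st and 3rd bytes of a DWORD

struct Unpacked {
    std::uint32_t even;   // 1st and 3rd bytes, each widened to 16 bits
    std::uint32_t odd;    // 2nd and 4th bytes, each widened to 16 bits
};

// Shift first, then mask: the same constant serves both halves.
inline Unpacked unpack(std::uint32_t pixel)
{
    return Unpacked{ pixel & kMask, (pixel >> 8) & kMask };
}

// Mask first, then shift: reuses kMask instead of needing 0xFF00FF00.
inline std::uint32_t pack(const Unpacked& u)
{
    return (u.even & kMask) | ((u.odd & kMask) << 8);
}
```

Between unpack() and pack(), each 16-bit lane has 8 bits of headroom, which is what makes per-channel multiplies by an 8-bit alpha possible without overflow into the neighbouring channel.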