
HellRaiZer

Member
  1. Useless Snippet #2: AABB/Frustum test

    Everyone, thanks for the constructive comments. I've updated the article based on some of your suggestions.

    @Matias Goldberg: Unfortunately, changing _mm_load_ps((float*)&absPlaneMask[0]) to be standard compliant as you suggested, as well as adding the __restrict keyword, requires a rerun of the benchmarks, because otherwise the numbers won't be accurate. I'll keep a note and hope to be able to do it sometime soon.

    One small note about your last comment: the 4-AABBs-at-a-time SSE version has a loop at the end which should handle the case where the number of AABBs isn't a multiple of 4. The code isn't shown for clarity (it should actually be the same as the 1-AABB-at-a-time version).

    Thanks again for the great input.
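    For what it's worth, a minimal sketch of how such a remainder loop could look, assuming the counters from CullAABBList_SSE_4() and a hypothetical ClassifyAABB_Scalar() helper that wraps the per-box test from CullAABBList_SSE_1():

    // ...at the end of CullAABBList_SSE_4(): any AABBs past the last full group of 4
    // fall through to a scalar path (ClassifyAABB_Scalar() is hypothetical here).
    unsigned int numIterations = numAABBs >> 2;                  // full groups of 4 handled above
    for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs; ++iAABB)
    {
        aabbState[iAABB] = ClassifyAABB_Scalar(aabbList[iAABB], frustumPlanes, absFrustumPlanes);
    }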
  2. Useless Snippet #2: AABB/Frustum test

    @zdlr: Of course you are right. I forgot about those two. I'll edit the article to read "C++" instead of "C".

    @Matias: Thank you very much for the tips. I'll try to find some time to make the changes you suggest and edit the article. I'll also try to make a little VS project to attach to the article at the same time.

    About the two dot products: do you mean method 5 in that article? In this case, the resulting code wouldn't be able to distinguish between the fully-inside and intersecting states, which is a requirement for this article. I know, it might sound bad to try to optimize something and then add arbitrary restrictions which might affect performance, but I think having the ability to distinguish between those two cases might help in the case of a hierarchy. E.g. if a parent is completely inside, there's no need to check any of its children.

    Please correct me if I'm wrong.
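    To illustrate the hierarchy argument, a rough sketch (hypothetical Node struct and ClassifyAABB() helper, not from the article): a parent classified as INSIDE lets its whole subtree skip further plane tests, which is exactly what keeping the INSIDE/INTERSECT distinction buys.

    #include <vector>

    struct Node
    {
        AABB aabb;
        unsigned int state;
        std::vector<Node> children;
    };

    void CullHierarchy(Node& node, const Plane* frustumPlanes, unsigned int parentState)
    {
        // If the parent was fully inside, the child is fully inside as well; no plane tests needed.
        node.state = (parentState == INSIDE) ? INSIDE : ClassifyAABB(node.aabb, frustumPlanes);

        if(node.state == OUTSIDE)
            return; // the whole subtree is outside; don't visit the children

        for(Node& child : node.children)
            CullHierarchy(child, frustumPlanes, node.state);
    }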
  3. Useless Snippet #2: AABB/Frustum test

    Thank you all for the comments.

    @Servant: Bacterius is right. The numbers are "cycles per AABB". Culling a batch of 1024 AABBs in a single loop ends up being faster than 32 probably because the function overhead (e.g. calculating the abs of the plane normals) is minimal compared to the main loop.

    Example: Assume that the initial loop which calculates the abs of the plane normals requires 200 cycles. Also, assume that each AABB requires 100 cycles. For a batch of 10 AABBs, the function would require 1200 cycles to complete, or in other words 120 cycles per AABB. If the batch had 1000 AABBs, the function would require 100200 cycles, or 100.2 cycles per AABB. Hope that makes sense.

    @zdlr: Variable names were kept like that in order to match the examples in Fabian Giesen's article which I linked above. Also, the term 'reference implementation' doesn't mean that the code uses references. It means that that specific snippet is used as a baseline for performance comparisons (if this was what you meant).
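    In code form, the amortization works out like this (the numbers are just the ones from the example above):

    // cycles/AABB = (setup + perBox * N) / N; matches the worked example above.
    double CyclesPerAABB(double setupCycles, double cyclesPerBox, unsigned int numAABBs)
    {
        return (setupCycles + cyclesPerBox * numAABBs) / numAABBs;
    }

    // CyclesPerAABB(200.0, 100.0, 10)   == 120.0 cycles/AABB
    // CyclesPerAABB(200.0, 100.0, 1000) == 100.2 cycles/AABB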
  4. Welcome back. This time we'll take a look at one of the most common operations in a 3D graphics engine: frustum culling. How fast can such code become by using SIMD?

     The problem
     Classify each AABB in a batch as completely inside, completely outside, or intersecting a frustum (6 planes).

     Restrictions
     AABBs are defined as (Center, Extent) pairs. All vectors are Vector3f's.

     Structs and initialization
     #define OUTSIDE 0
     #define INSIDE 1
     #define INTERSECT 2 // or 3 depending on the algorithm used (see the discussion on the 2nd SSE version)

     struct Vector3f
     {
         float x, y, z;
     };

     struct AABB
     {
         Vector3f m_Center;
         Vector3f m_Extent;
     };

     struct Plane
     {
         float nx, ny, nz, d;
     };

     All the functions presented in the snippets below have the same signature:

     void CullAABBList(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState);

     As you can see, I've taken the state of the AABB (with regard to the specified frustum) outside of the struct itself. This has a couple of advantages. You can cull the same AABB list with multiple different frustums in parallel without worrying about conflicts: just pass a different aabbState array to each function call and you are done. All relevant AABB data are close to each other, which helps reduce cache misses (since the array is expected to be read sequentially). Finally, all arrays are assumed to be 16-byte aligned. This is required for the SSE version.

     Let's take a look at some code.

     C++ (reference implementation)
     // Performance (cycles/AABB): Average = 102.5 (stdev = 12.0)
     void CullAABBList_C(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
     {
         for(unsigned int iAABB = 0; iAABB < numAABBs; ++iAABB)
         {
             const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
             const Vector3f& aabbSize = aabbList[iAABB].m_Extent;

             unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
             for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
             {
                 const Plane& frustumPlane = frustumPlanes[iPlane];

                 float d = aabbCenter.x * frustumPlane.nx + aabbCenter.y * frustumPlane.ny + aabbCenter.z * frustumPlane.nz;
                 float r = aabbSize.x * fast_fabsf(frustumPlane.nx) + aabbSize.y * fast_fabsf(frustumPlane.ny) + aabbSize.z * fast_fabsf(frustumPlane.nz);

                 float d_p_r = d + r;
                 float d_m_r = d - r;

                 if(d_p_r < -frustumPlane.d)
                 {
                     result = OUTSIDE;
                     break;
                 }
                 else if(d_m_r < -frustumPlane.d)
                     result = INTERSECT;
             }

             aabbState[iAABB] = result;
         }
     }

     The code is based on method 4c from an excellent post by Fabian Giesen on AABB culling: View frustum culling. If you haven't read it yet, check it out now. He presents several different ways of performing the same test, each of which has its own advantages and disadvantages.

     There's nothing special about the above code, except for the fact that it breaks out of the inner loop as soon as the box is classified as completely outside one of the planes. No need to check the others. But we can do better. First, precalculate the absolute values of the plane normals once, outside of the AABB loop. No need to recalculate the same thing over and over. We can also replace the floating point comparisons with bitwise ANDs. Let's take a look at the code.
     C++ Optimized
     // Performance (cycles/AABB): Average = 84.9 (stdev = 12.3)
     void CullAABBList_C_Opt(AABB* __restrict aabbList, unsigned int numAABBs, Plane* __restrict frustumPlanes, unsigned int* __restrict aabbState)
     {
         Plane absFrustumPlanes[6];
         for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
         {
             absFrustumPlanes[iPlane].nx = fast_fabsf(frustumPlanes[iPlane].nx);
             absFrustumPlanes[iPlane].ny = fast_fabsf(frustumPlanes[iPlane].ny);
             absFrustumPlanes[iPlane].nz = fast_fabsf(frustumPlanes[iPlane].nz);
         }

         for(unsigned int iAABB = 0; iAABB < numAABBs; ++iAABB)
         {
             const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
             const Vector3f& aabbSize = aabbList[iAABB].m_Extent;

             unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
             for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
             {
                 const Plane& frustumPlane = frustumPlanes[iPlane];
                 const Plane& absFrustumPlane = absFrustumPlanes[iPlane];

                 float d = aabbCenter.x * frustumPlane.nx + aabbCenter.y * frustumPlane.ny + aabbCenter.z * frustumPlane.nz;
                 float r = aabbSize.x * absFrustumPlane.nx + aabbSize.y * absFrustumPlane.ny + aabbSize.z * absFrustumPlane.nz;

                 float d_p_r = d + r + frustumPlane.d;
                 if(IsNegativeFloat(d_p_r))
                 {
                     result = OUTSIDE;
                     break;
                 }

                 float d_m_r = d - r + frustumPlane.d;
                 if(IsNegativeFloat(d_m_r))
                     result = INTERSECT;
             }

             aabbState[iAABB] = result;
         }
     }

     fast_fabsf() and IsNegativeFloat() are simple functions which treat the float as int and remove/check the MSB. I don't have any other C-level optimizations in mind. So let's see what we can do using SSE.

     SSE (1 AABB at a time)
     // 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
     __declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

     // Performance (cycles/AABB): Average = 63.9 (stdev = 10.8)
     void CullAABBList_SSE_1(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
     {
         __declspec(align(16)) Plane absFrustumPlanes[6];

         __m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
         for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
         {
             __m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
             __m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
             _mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
         }

         for(unsigned int iAABB = 0; iAABB < numAABBs; ++iAABB)
         {
             __m128 xmm_aabbCenter_x = _mm_load_ss(&aabbList[iAABB].m_Center.x);
             __m128 xmm_aabbCenter_y = _mm_load_ss(&aabbList[iAABB].m_Center.y);
             __m128 xmm_aabbCenter_z = _mm_load_ss(&aabbList[iAABB].m_Center.z);
             __m128 xmm_aabbExtent_x = _mm_load_ss(&aabbList[iAABB].m_Extent.x);
             __m128 xmm_aabbExtent_y = _mm_load_ss(&aabbList[iAABB].m_Extent.y);
             __m128 xmm_aabbExtent_z = _mm_load_ss(&aabbList[iAABB].m_Extent.z);

             unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
             for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
             {
                 __m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
                 __m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

                 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
                 xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

                 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
                 xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

                 __m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
                 __m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);
                 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
                 xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

                 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
                 xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

                 __m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
                 __m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
                 __m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

                 // Shuffle d_p_r and d_m_r in order to perform only one _mm_movemask_ps
                 __m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
                 int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

                 // Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
                 if(negativeMask & 0x01)
                 {
                     result = OUTSIDE;
                     break;
                 }
                 else if(negativeMask & 0x04)
                     result = INTERSECT;
             }

             aabbState[iAABB] = result;
         }
     }

     Again, since we are processing one AABB at a time, we can break out of the inner loop as soon as the AABB is found to be completely outside one of the 6 planes. The SSE code should be straightforward. All arithmetic operations are scalar and we use _mm_movemask_ps in order to extract the signs of (d + r) and (d - r). Depending on the signs, we classify the AABB as completely outside or intersecting the frustum. Even though we are testing one AABB at a time, the code is faster than the optimized C++ implementation.

     The "problem" with the above snippet is that all SSE operations are scalar. We can do better by swizzling the data from 4 AABBs in such a way that it will be possible to calculate 4 (d + r) and 4 (d - r) simultaneously. Let's try that.

     SSE (4 AABBs at a time)
     // 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
     __declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

     // Performance (cycles/AABB): Average = 24.1 (stdev = 4.2)
     void CullAABBList_SSE_4(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
     {
         __declspec(align(16)) Plane absFrustumPlanes[6];

         __m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
         for(unsigned int iPlane = 0; iPlane < 6; ++iPlane)
         {
             __m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
             __m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
             _mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
         }

         // Process 4 AABBs in each iteration...
         unsigned int numIterations = numAABBs >> 2;
         for(unsigned int iIter = 0; iIter < numIterations; ++iIter)
         {
             // NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
             // Load the 4 Center/Extents pairs for the 4 AABBs.
             __m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
             __m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
             __m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
             __m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
             __m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
             __m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);
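    The article mentions fast_fabsf() and IsNegativeFloat() without listing them; a minimal sketch of how they might look, treating the float's bits as an integer and clearing/testing the sign bit (memcpy is used here for the type pun; the original may well have used pointer casts):

    #include <cstring>

    static inline float fast_fabsf(float x)
    {
        unsigned int bits;
        std::memcpy(&bits, &x, sizeof(bits)); // reinterpret the float's bits as an unsigned int
        bits &= 0x7FFFFFFFu;                  // clear the sign bit (MSB)
        std::memcpy(&x, &bits, sizeof(x));
        return x;
    }

    static inline bool IsNegativeFloat(float x)
    {
        unsigned int bits;
        std::memcpy(&bits, &x, sizeof(bits));
        return (bits & 0x80000000u) != 0;     // test the sign bit
    }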
  5. Shader "plugin" system?

    It has been a long time since I touched any rendering-related code, but I'll try to describe what I remember from my implementation.

    Each surface shader (e.g. a shader that will be applied to the surface of a 3D model) can use one or more different plugins. Shader plugins were implemented using Cg interfaces.

    So, your example above would look something like this (pseudocode, since I haven't written a line of Cg for a couple of years now):

    IAmbientLighting g_AmbientLighting;
    sampler2D DiffuseTex;

    float4 mainPS(VS_OUTPUT i)
    {
        float4 diffuseColor;
        diffuseColor.rgb = tex2D(DiffuseTex, i.UV.xy).rgb;
        diffuseColor.rgb *= g_AmbientLighting.CalcAmbientLighting(i.PosWS.xyz);
        diffuseColor.a = 1.0;
        return diffuseColor;
    }

    The IAmbientLighting interface would look like this:

    interface IAmbientLighting
    {
        float3 CalcAmbientLighting(float3 posWS);
    };

    Your current shader would have used a constant ambient color implementation. Something like:

    class ConstAmbientLight : IAmbientLighting
    {
        float4 AmbientColor;

        float3 CalcAmbientLighting(float3 posWS)
        {
            return AmbientColor.rgb * AmbientColor.a;
        }
    };

    If you would like to change to an SSAO implementation, instead of using this class you would use:

    class SSAO : IAmbientLighting
    {
        sampler2D SSAOTex;
        float4 AmbientColor;
        float4x4 WSToSSMatrix;

        float3 CalcAmbientLighting(float3 posWS)
        {
            float2 screenSpacePos = TransformToSS(posWS, WSToSSMatrix);
            float ssao = tex2D(SSAOTex, screenSpacePos).r;
            return AmbientColor.rgb * AmbientColor.a * ssao;
        }
    };

    With those two interface implementations available, the renderer is responsible for selecting the correct one at run-time, based on some criteria (user prefs, GPU caps, etc.) and linking it to all the surface shaders which use an IAmbientLighting object.

    The idea can be extended to other things. E.g. different kinds of lights (omni, point, directional) can be implemented as implementations of one common ILight interface. This way you can create (e.g.) a Phong shader with or without SSAO, using one or more lights of any type.

    That's the basic idea. Hope it makes some sense. If not, just say so and I'll do my best to describe it better.
  6. Try creating one View.OnClickListener object and use View.getId() on the passed view object. Something like this:

     View.OnClickListener listener = new View.OnClickListener() {
         public void onClick(View v) {
             int id = v.getId();
             switch(id) {
                 case 1000:
                     // button0 was clicked...
                     break;
                 case 1001:
                     // button1 was clicked...
                     break;
             }
         }
     };

     button0.setOnClickListener(listener);
     button1.setOnClickListener(listener);
     ...
     button0.setId(1000);
     button1.setId(1001);
     ...

     Hope that helps.
  7. This is a rewrite of an old blog post I wrote a couple of years back. Finally, I found some time to redo the performance tests based on an observation made by Sean Barrett on the original post. The code below remains the same. The difference lies in the way performance is measured. Here I use RDTSC instead of Intel's Performance Counter Monitor library, which ended up producing a high overhead compared to the actual time taken by the measured functions. As an added bonus, I'm uploading a small VS project for anyone interested to try it out.

     The problem
     Goal: Multiply a batch of Vector3f's with the same 4x4 matrix.

     Restrictions
     - 'src' and 'dst' arrays shouldn't point to the same memory location
     - All pointers should be 16-byte aligned (see below for details on array sizes)
     - Treat Vector3f's as positions (w = 1.0)
     - Matrix is column major

     Structs and helper functions
     struct _Vector3f
     {
         float x, y, z;
     };

     struct _Vector4f
     {
         float x, y, z, w;
     };

     _Vector3f* AllocateSrcArray(unsigned int numVertices)
     {
         // We are loading 3 xmm regs per loop. So we need numLoops * 12 floats, where
         // numLoops is (numVertices / 4) + 1 for the SSE version
         unsigned int memSize = sizeof(float) * ((numVertices >> 2) + 1) * 12;
         return (_Vector3f*)_aligned_malloc(memSize, 16);
     }

     _Vector4f* AllocateDstArray(unsigned int numVertices)
     {
         unsigned int memSize = sizeof(_Vector4f) * (((numVertices >> 2) + 1) * 4);
         return (_Vector4f*)_aligned_malloc(memSize, 16);
     }

     void TransformVec3Pos_C(_Vector3f* src, _Vector4f* dst, unsigned int numVertices, float* matrix4x4)
     {
         for(unsigned int iVertex = 0; iVertex < numVertices; ++iVertex)
         {
             _Vector3f* s = &src[iVertex];
             _Vector4f* d = &dst[iVertex];

             d->x = matrix4x4[0] * s->x + matrix4x4[4] * s->y + matrix4x4[ 8] * s->z + matrix4x4[12];
             d->y = matrix4x4[1] * s->x + matrix4x4[5] * s->y + matrix4x4[ 9] * s->z + matrix4x4[13];
             d->z = matrix4x4[2] * s->x + matrix4x4[6] * s->y + matrix4x4[10] * s->z + matrix4x4[14];
             d->w = matrix4x4[3] * s->x + matrix4x4[7] * s->y + matrix4x4[11] * s->z + matrix4x4[15];
         }
     }

     Performance (8k vertices): ~19 cycles/vertex (stdev: 1.4)

     There is nothing special about this code. Straight Vec3/Matrix4x4 multiply. Plain old FPU code generated by the compiler.

     Note: Turning on the /arch:SSE compiler option doesn't seem to produce any SSE code for the above function. The compiler insisted on using the FPU for all the calculations. Using the /arch:SSE2 compiler option ended up producing a lot of SSE2 double-to-float and float-to-double conversions, which in turn made things worse performance-wise.

     SSE Optimized
     The code below processes 4 vertices at a time:

     4 vertices * 3 floats/vertex = 12 floats = 3 loads of 4 floats per loop
     First load  = (x0, y0, z0, x1)
     Second load = (y1, z1, x2, y2)
     Third load  = (z2, x3, y3, z3)

     We don't keep the matrix in xmm regs in order to avoid intermediate stores (profiling says this is good). What we do is try to keep the matrix loads to a minimum by reusing each column as much as possible (see comment in code). Hopefully it should be in the cache most of the time. Written in asm because I couldn't find a way to force the compiler to generate the code below using intrinsics. The compiler insisted on using the stack for intermediate stores (matrix columns).

     void TransformVec3Pos_SSE_Asm(_Vector3f* __restrict src, _Vector4f* __restrict dst, unsigned int numVertices, float* __restrict matrix)
     {
         __asm
         {
             // Calculate the total number of iterations...
             mov ecx, [numVertices];
             shr ecx, 0x02;
             inc ecx;             // ecx = totalIterations = (numVertices >> 2) + 1;
             xor esi, esi;        // esi = iIter = 0;
             mov eax, [src];      // eax = &src[0];
             mov edx, [dst];      // edx = &dst[0];
             mov edi, [matrix];   // edi = &matrix[0];

         _LoopStart:
             // Matrix column 0 is being loaded 2 times.
             // Matrix column 1 is being loaded 2 times.
             // Matrix column 2 is being loaded 1 time.
             // Matrix column 3 is being loaded 1 time.
             // Other than the above extra memory loads, each vector is loaded once,
             // and there aren't any intermediate stores.
             prefetcht0 [eax + 0x100];

             movaps xmm0, [eax];  // xmm0 = (x0, y0, z0, x1)
             movaps xmm1, [edi];  // xmm1 = (m0, m1, m2, m3)
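    Not part of the original post, but for readers more comfortable with intrinsics, a minimal single-vertex version of the same column-major multiply (hypothetical helper; w is treated as 1.0 and the matrix pointer is assumed 16-byte aligned). The asm above effectively does this for four vertices per iteration while packing the inputs into three loads:

    #include <xmmintrin.h>

    static inline void TransformVec3Pos_SSE_Single(const _Vector3f* __restrict s,
                                                   _Vector4f* __restrict d,
                                                   const float* __restrict matrix)
    {
        __m128 col0 = _mm_load_ps(&matrix[0]);   // (m0, m1, m2, m3)
        __m128 col1 = _mm_load_ps(&matrix[4]);
        __m128 col2 = _mm_load_ps(&matrix[8]);
        __m128 col3 = _mm_load_ps(&matrix[12]);

        // res = col0 * x + col1 * y + col2 * z + col3 (w = 1.0, so column 3 is added as-is)
        __m128 res = _mm_add_ps(_mm_add_ps(_mm_mul_ps(col0, _mm_set1_ps(s->x)),
                                           _mm_mul_ps(col1, _mm_set1_ps(s->y))),
                                _mm_add_ps(_mm_mul_ps(col2, _mm_set1_ps(s->z)),
                                           col3));
        _mm_store_ps(&d->x, res);                // writes x, y, z, w (dst is 16-byte aligned)
    }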
  8. Is my frustum culling slow ?

    If the AABBs correspond to static geometry, translating them to world space every frame is overkill. You should do it once at start up.

    If it's about dynamic geometry, then it isn't that simple when rotation is involved. If your objects rotate, you should calculate the AABB from the OBB defined by the original AABB and the object's transformation, in case you want to use the same code for all your objects. Otherwise you can find/write another function which culls OBBs against the frustum.

    In case you go the OBB route, it might be faster to just check the bounding sphere (which isn't affected by rotation) against the frustum, at the expense of rendering a few more objects (bounding spheres tend to be larger than AABBs, depending on the object they enclose).
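    A rough sketch of the "AABB from the transformed OBB" computation mentioned above (the usual absolute-value trick), assuming the Vector3f/AABB structs from the blog post and a hypothetical row-major 3x4 object transform (3x3 rotation/scale plus translation in the last column):

    #include <cmath>

    AABB TransformAABB(const AABB& localBox, const float m[3][4])
    {
        AABB worldBox;
        const float* c = &localBox.m_Center.x; // treat the three contiguous floats as an array
        const float* e = &localBox.m_Extent.x;
        float* wc = &worldBox.m_Center.x;
        float* we = &worldBox.m_Extent.x;

        for(int i = 0; i < 3; ++i)
        {
            wc[i] = m[i][3];  // translation
            we[i] = 0.0f;
            for(int j = 0; j < 3; ++j)
            {
                wc[i] += m[i][j] * c[j];            // transform the center normally
                we[i] += std::fabs(m[i][j]) * e[j]; // extents only grow; use |m| per axis
            }
        }
        return worldBox;
    }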
  9. Is my frustum culling slow ?

    @lipsryme: If you get an access violation as soon as you add a 4th AABB to the list, it means that your aabbList array isn't 16-byte aligned. Two choices here:
    1) Explicitly allocate the aabbList array to be 16-byte aligned (e.g. using _aligned_malloc()), or
    2) Replace the 6 _mm_load_ps() calls with _mm_loadu_ps(), which allows reading from unaligned addresses.

    Hope that helps.

    PS. To test whether an array address is 16-byte aligned you can use one of the functions found here: http://stackoverflow.com/a/1898487/2136504
    E.g. check &aabbList[(iIter << 2) + 0].center.x; if the check returns true but the _mm_load_ps() still fails, then something else is wrong with your array.
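    The check from the linked answer boils down to something like this (16-byte case only; hypothetical helper name):

    #include <cstdint>

    static inline bool IsAligned16(const void* ptr)
    {
        return (reinterpret_cast<std::uintptr_t>(ptr) & 15u) == 0; // low 4 address bits must be zero
    }

    // e.g. assert(IsAligned16(&aabbList[(iIter << 2) + 0].center.x));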
  10. Is my frustum culling slow ?

    I believe the changes you made to the code are the problem. To be more exact:

    The original code reads the centers and extents of the 4 AABBs from an array with the following layout:

    c0.x, c0.y, c0.z, e0.x, e0.y, e0.z,
    c1.x, c1.y, c1.z, e1.x, e1.y, e1.z,
    c2.x, c2.y, c2.z, e2.x, e2.y, e2.z,
    c3.x, c3.y, c3.z, e3.x, e3.y, e3.z,
    ...

    When the following instructions are executed, the XMM registers hold the values mentioned in their names:

    // NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
    // Load the 4 Center/Extents pairs for the 4 AABBs.
    __m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
    __m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
    __m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
    __m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
    __m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
    __m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);

    If we assume that the initial aabbList array is 16-byte aligned, all loads are 16-byte aligned and the instructions are executed correctly. This is the reason we are loading the XMM regs with those specific array offsets.

    On the other hand, your code doesn't do the same thing. It just stores the AABBs on the stack, and the layout isn't the one expected by the code. The best case scenario is that your layout is:

    c0.x, c0.y, c0.z, c1.x, c1.y, c1.z, c2.x, c2.y, c2.z, c3.x, c3.y, c3.z,
    e0.x, e0.y, e0.z, e1.x, e1.y, e1.z, e2.x, e2.y, e2.z, e3.x, e3.y, e3.z

    but:
    1) I think you can't be 100% sure about that (e.g. that the compiler will place the centers before the extents).
    2) It's not guaranteed to be 16-byte aligned.
    3) Most importantly, it's not what the code expects.

    If you have to read the AABBs the way you did (one element at a time), I would suggest something like this:

    __declspec(align(16)) _Vector3f aabbData[8];
    aabbData[0].x = ... // center0.x
    aabbData[0].y = ... // center0.y
    aabbData[0].z = ... // center0.z
    aabbData[1].x = ... // extent0.x
    ...

    And then use this array to load the XMM regs as in the original code snippet.

    PS. If you try to understand what the code does with those SSE instructions, you might be able to "optimize" it and get rid of the loads and shuffles completely. This is in case you continue to read the AABB data the way you do it.
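    A hedged completion of that suggestion, using the Vector3f/AABB types from the original article (where the center/extent values actually come from in your code is up to you): gather the 4 boxes into an aligned scratch buffer with the interleaved layout, then point the six aligned loads at it.

    __declspec(align(16)) Vector3f aabbData[8]; // c0, e0, c1, e1, c2, e2, c3, e3
    for(unsigned int i = 0; i < 4; ++i)
    {
        aabbData[i * 2 + 0] = aabbList[(iIter << 2) + i].m_Center; // or however you fetch the data
        aabbData[i * 2 + 1] = aabbList[(iIter << 2) + i].m_Extent;
    }

    // The six loads from the original snippet, now reading from the scratch buffer:
    __m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbData[0].x);
    __m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbData[1].y);
    __m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbData[2].z);
    __m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbData[4].x);
    __m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbData[5].y);
    __m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbData[6].z);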
  11. Is my frustum culling slow ?

    [quote]Now i am confused as in 4box version says:
    // NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.
    where that part of code is that you mentioned.[/quote]

    What I meant is that the calculations of (d + r) and (d - r) in the 4-boxes-at-a-time loop are correct. When you substitute the comment you mentioned with the loop from CullAABBList_SSE_1(), you have to fix the typo I mentioned in order for it to be correct.

    Hope that makes sense.
  12. Is my frustum culling slow ?

    It's been a long time since my last reply here on gamedev.net.

    @lipsryme: Happy to know that my blog post actually helped someone. Unfortunately, there seems to be an error in the code, so the culling should be incorrect. Since you haven't noticed it yet, I'd assume that you are just rendering more objects than needed.

    The error is in:

    __m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
    __m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

    Can you spot it? xmm_d_m_r should subtract r from d, not add it! It should be:

    __m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

    I don't have the project anymore, so I'd assume it's just a blog post typo and it didn't affect the timings. On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly.

    Hope this doesn't ruin your benchmarks.
  13. OpenGL Vertex Cache Optimal Grid

    The post from my google reader cache:
    [quote]
    This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn't good enough. I'm sure many people don't know this either, so here it goes.

    Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object. It's possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU, and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.

    That's enough for the motivation. Imagine we have a 4x4 grid. The first two rows would look like this:

    * - * - * - * - *
    | / | / | / | / |
    * - * - * - * - *
    | / | / | / | / |
    * - * - * - * - *

    With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:

    7 - 5 - 3 - 1 - *
    | / | / | / | / |
    6 - 4 - 2 - 0 - *
    | / | / | / | / |
    * - * - * - * - *

    And after the next two triangles:

    * - 7 - 5 - 3 - 1
    | / | / | / | / |
    * - 6 - 4 - 2 - 0
    | / | / | / | / |
    * - * - * - * - *

    Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles, two of the vertices that were previously in the cache need to be loaded again:

    * - * - * - 7 - 5
    | / | / | / | / |
    3 - 1 - * - 6 - 4
    | / | / | / | / |
    2 - 0 - * - * - *

    Instead of using the straightforward traversal, it's possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms. All these options are better than not doing anything, but still produce results that are far from optimal. In the table below you can see the results obtained for a 16x16 grid and with a FIFO cache with 20 entries:

    Method            ACMR    ATVR
    Scanline          1.062   1.882
    NVTriStrip        0.818   1.450
    Morton            0.719   1.273
    K-Cache-Reorder   0.711   1.260
    Hilbert           0.699   1.239
    Forsyth           0.666   1.180
    Tipsy             0.658   1.166
    Optimal           0.564   1.000

    Note that I'm using my own implementation for all of these methods, so the results with the code from the original author might differ slightly.

    The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles. When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:

    4 - 3 - 2 - 1 - 0
    | / | / | / | / |
    * - * - * - * - *
    |   |   |   |   |
    * - * - * - * - *

    That can be done by issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:

    * - 7 - 6 - 5 - 4
    | / | / | / | / |
    3 - 2 - 1 - 0 - *
    | / | / | / | / |
    * - * - * - * - *

    In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2.

    The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them. The following code accomplishes that:

    void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)
    {
        if (x1 - x0 + 1 < cacheSize)
        {
            if (2 * (x1 - x0) + 1 > cacheSize)
            {
                for (int x = x0; x < x1; x++)
                {
                    indices.push_back(x + 0);
                    indices.push_back(x + 0);
                    indices.push_back(x + 1);
                }
            }

            for (int y = y0; y < y1; y++)
            {
                for (int x = x0; x < x1; x++)
                {
                    indices.push_back((width + 1) * (y + 0) + (x + 0));
                    indices.push_back((width + 1) * (y + 1) + (x + 0));
                    indices.push_back((width + 1) * (y + 0) + (x + 1));

                    indices.push_back((width + 1) * (y + 0) + (x + 1));
                    indices.push_back((width + 1) * (y + 1) + (x + 0));
                    indices.push_back((width + 1) * (y + 1) + (x + 1));
                }
            }
        }
        else
        {
            int xm = x0 + cacheSize - 2;
            gridGen(x0, xm, y0, y1, width, cacheSize);
            gridGen(xm, x1, y0, y1, width, cacheSize);
        }
    }

    This may not be the most optimal grid partition, but the method still performs pretty well in those cases. Here are the results for a cache with 16 entries:

    Method            ACMR    ATVR
    Scanline          1.062   1.882
    NVTriStrip        0.775   1.374
    K-Cache-Reorder   0.766   1.356
    Hilbert           0.754   1.336
    Morton            0.750   1.329
    Tipsy             0.711   1.260
    Forsyth           0.699   1.239
    Optimal           0.598   1.059

    And for a cache with only 12 entries:

    Method            ACMR    ATVR
    Scanline          1.062   1.882
    NVTriStrip        0.875   1.550
    Forsyth           0.859   1.522
    K-Cache-Reorder   0.807   1.491
    Morton            0.812   1.439
    Hilbert           0.797   1.412
    Tipsy             0.758   1.343
    Optimal           0.600   1.062

    In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would be interesting to take into account some of these observations in a general mesh optimization algorithm.
    [/quote]
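    Not part of the quoted post, but if you want to reproduce the ACMR/ATVR numbers for your own index buffers, a small FIFO cache simulator along these lines should do (ACMR = vertices transformed per triangle, ATVR = vertices transformed per unique vertex, matching the tables above):

    #include <algorithm>
    #include <deque>
    #include <vector>

    struct CacheStats { double acmr; double atvr; };

    CacheStats MeasureCache(const std::vector<int>& indices, size_t cacheSize, size_t numUniqueVertices)
    {
        std::deque<int> fifo;
        size_t misses = 0;
        for (int index : indices)
        {
            if (std::find(fifo.begin(), fifo.end(), index) == fifo.end())
            {
                ++misses;                 // vertex not in the cache: it has to be transformed again
                fifo.push_back(index);
                if (fifo.size() > cacheSize)
                    fifo.pop_front();     // FIFO eviction, as in the quoted post
            }
        }
        size_t numTriangles = indices.size() / 3;
        return { double(misses) / double(numTriangles),
                 double(misses) / double(numUniqueVertices) };
    }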
  14. Virtualized Scenes and Rendering

    I think you have to include the reply_id=XXXXXX part of the post. E.g. all links right now seem to point to the journal itself (http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=363003?). Btw, you don't need the last '?'. Change those to (e.g.) http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=363003&reply_id=3473003. (I hope the last link works when I press Reply, otherwise ignore me :) )
  15. Another terrain editing question

    I think what haegarr described is similar to this (if I remember the details correctly): Indirection Mapping for Quasi-Conformal Relief Texturing.

    Unfortunately, I don't think this will solve the problem. The algorithm described in the above paper is about increasing the texture resolution on steep slopes. For relief/parallax/etc. mapping this is way better than not doing anything. But in the case of a heightmap terrain, increasing the resolution of those regions won't solve the problem. The textures will still be stretched. To overcome stretching you might need a different set of texcoords for those problematic regions (e.g. like in tri-planar mapping). On the other hand, maybe my understanding/implementation of the above paper wasn't correct after all.

    HellRaiZer
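    For completeness, the tri-planar idea mentioned above blends three axis-aligned projections with weights derived from the surface normal; a minimal sketch of the weight computation, written as plain C++ rather than shader code (hypothetical helper, assumes a non-zero normal):

    #include <cmath>

    void TriplanarWeights(float nx, float ny, float nz, float outWeights[3])
    {
        float wx = std::fabs(nx);
        float wy = std::fabs(ny);
        float wz = std::fabs(nz);
        float sum = wx + wy + wz;          // normalize so the three weights add up to 1
        outWeights[0] = wx / sum;          // weight for the texture projected along X (YZ plane)
        outWeights[1] = wy / sum;          // weight for the texture projected along Y (XZ plane)
        outWeights[2] = wz / sum;          // weight for the texture projected along Z (XY plane)
    }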