Community Reputation

342 Neutral

About corysama

  1. Frustum Culling

    This is a very nice article.   The bit at the end of the OBB function

        __m128i outside_res_i = _mm_cvtps_epi32(outside);
        ALIGN_SSE int obj_culling_res[4];
        _mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);
        culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 || obj_culling_res[2] != 0) ? 1 : 0;

    is unnecessary memory traffic.  The store, loads, all of those compares and the branch can be replaced by a single instruction that keeps everything in registers.

        culling_res[i] = _mm_movemask_ps(outside);

    With SSE, RAM is the enemy.  Registers are your only friend.
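    One detail to keep in mind with the single-instruction version: _mm_movemask_ps returns a 4-bit mask (one bit per lane) rather than 0 or 1, and it also includes lane 3, which the original expression ignores. Whether that matters depends on how culling_res is consumed. The sketch below is only a Python model of the two variants, with lanes represented as 32-bit integers; the helper names are made up for illustration.

```python
ALL_ONES = 0xFFFFFFFF  # bit pattern an SSE compare writes into a lane where the test is true


def movemask_ps(lanes):
    """Model of _mm_movemask_ps: bit i of the result is the top (sign) bit of lane i."""
    mask = 0
    for i, bits in enumerate(lanes):
        mask |= ((bits >> 31) & 1) << i
    return mask


def store_and_compare(lanes):
    """Model of the store/load/compare version: only lanes 0..2 are inspected."""
    return 1 if (lanes[0] != 0 or lanes[1] != 0 or lanes[2] != 0) else 0


# Lane 1 and lane 3 are "outside" in this made-up example.
outside = [0, ALL_ONES, 0, ALL_ONES]

mask = movemask_ps(outside)                 # bitmask over all four lanes
same_as_original = int((mask & 0x7) != 0)   # ignore lane 3, collapse to 0/1
```

    If the caller only tests culling_res[i] against zero and lane 3 is always zero, the raw mask is enough; otherwise masking with 0x7 keeps the original meaning and still stays in registers.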
  2. Take a look at   It's used by Firefox.   The license is confusing, but here's how I think it works:   The important part is that you do not compile and link the patented code.  You do not ship the shared library binary.  Your app downloads the library and Cisco volunteers to pay the patent license fee for that user on your behalf.   The code is open source. But that does not grant you a license to use the patents.  It's only open source so that anyone can contribute to the source.
  3.     Have a third look.  Last I checked, the runtime exception allows static linking.  Would you use the C stdlib that comes with GCC?   Nearly all programs compiled with GCC do.  This is literally the same license.  I think GNU made their stdlib GPL because they're GNU --that's what they do.  But, they added a minimal "exception" that basically weakens the GPL to "If you modify specifically our code, you must publish specifically those modifications.  The end."  because they knew that making all programs compiled with GCC implicitly GPL would doom GCC.  TBB is the same deal.
  4. Here's a pretty direct translation of your code to SSE.  The big change is that, in order to get good speed without huge changes to the algorithm, this code requires the data for 4 triangles to be packed in a transposed {xxxx, yyyy, zzzz} structure-of-arrays format so that 4 triangles can be processed in a single function call.  All of the functions work in SOA form so that they can resemble scalar math while actually working on 4 values at once.  This function takes 4 packed triangles and returns 4 8-bit colors packed as {rrrr,gggg,bbbb,0000}.

    I have not checked this code for correctness.  There are probably bugs.  I will not be available to answer further questions about this code.  Sorry.

        #include <pmmintrin.h>

        struct FourVec3s { __m128 x; __m128 y; __m128 z; };
        struct FourTris { FourVec3s a; FourVec3s b; FourVec3s c; __m128i colors; };

        // transposed
        static FourVec3s lightDirs = {{0.2, 0.5, -0.5, -0.5}, {-1.6,-0.7,-0.3, 1.3}, {-1.7, 20.3,-0.6, 0.6}};
        // transposed
        static FourVec3s lightColors = {{.4, .4145, .584, .41 }, {.414, .451, .51414,.44}, {.515, .543, .43, .3414}};

        static __m128 modelRight = {1.0, 0.0, 0.0, 0.0};
        static __m128 modelUp    = {0.0, 1.0, 0.0, 0.0};
        static __m128 modelDir   = {0.0, 0.0, 1.0, 0.0};
        static __m128 modelPos   = {0.0, 0.0, 0.0, 1.0};

        // broadcast one component of v to all 4 lanes
        static inline __m128 splatX(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(0,0,0,0)); }
        static inline __m128 splatY(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(1,1,1,1)); }
        static inline __m128 splatZ(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(2,2,2,2)); }

        static inline __m128 add(__m128 l, __m128 r) { return _mm_add_ps(l, r); }
        static inline __m128 sub(__m128 l, __m128 r) { return _mm_sub_ps(l, r); }
        static inline __m128 mul(__m128 l, __m128 r) { return _mm_mul_ps(l, r); }
        // "and" is a reserved alternative token in C++, so this helper needs a different name
        static inline __m128 andBits(__m128 l, __m128 r) { return _mm_and_ps(l, r); }
        static inline __m128 less(__m128 l, __m128 r) { return _mm_cmplt_ps(l, r); }

        // 4 dot products at once, one per lane
        static inline __m128 dot(const FourVec3s &l, const FourVec3s &r)
        {
            return add(add(mul(l.x,r.x), mul(l.y,r.y)), mul(l.z,r.z));
        }

        // unpack 8-bit rrrrggggbbbbaaaa (transposed colors) into 32-bit rrrr, gggg or bbbb
        static inline __m128i unpackR(__m128i iv) { return _mm_unpacklo_epi16(_mm_unpacklo_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }
        static inline __m128i unpackG(__m128i iv) { return _mm_unpackhi_epi16(_mm_unpacklo_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }
        static inline __m128i unpackB(__m128i iv) { return _mm_unpacklo_epi16(_mm_unpackhi_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }

        static inline __m128 intsToFloats(__m128i iv) { return _mm_cvtepi32_ps(iv); }
        static inline __m128i floatToInts(__m128 fv) { return _mm_cvttps_epi32(fv); }

        // the final pack uses unsigned saturation (packus) so channel values above 127 are not clamped
        static inline __m128i packAndSaturate32To8(__m128i r, __m128i g, __m128i b, __m128i a)
        {
            return _mm_packus_epi16(_mm_packs_epi32(r,g),_mm_packs_epi32(b,a));
        }

        static inline FourVec3s normalizeFourVec3s(const FourVec3s &v)
        {
            __m128 length = _mm_sqrt_ps(add(add( mul(v.x,v.x), mul(v.y,v.y)), mul(v.z,v.z) ));
            FourVec3s result = { _mm_div_ps(v.x,length), _mm_div_ps(v.y,length), _mm_div_ps(v.z,length) };
            return result;
        }

        __m128i Shade4Triangles(const FourTris &tris)
        {
            __m128 x1 = add(add(add(
                mul(sub(tris.a.x, splatX(modelPos)), splatX(modelRight)),   // (((*triangle).a.x - modelPos.x)*modelRight.x +
                mul(sub(tris.a.y, splatY(modelPos)), splatY(modelRight))),  // ((*triangle).a.y - modelPos.y)*modelRight.y +
                mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelRight))),  // ((*triangle).a.z - modelPos.z)*modelRight.z) +
                splatX(modelPos));                                          // modelPos.x
            __m128 y1 = add(add(add(
                mul(sub(tris.a.x, splatX(modelPos)), splatX(modelUp)),
                mul(sub(tris.a.y, splatY(modelPos)), splatY(modelUp))),
                mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelUp))),
                splatY(modelPos));
            __m128 z1 = add(add(add(
                mul(sub(tris.a.x, splatX(modelPos)), splatX(modelDir)),
                mul(sub(tris.a.y, splatY(modelPos)), splatY(modelDir))),
                mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelDir))),
                splatZ(modelPos));

            __m128 x2 = add(add(add(
                mul(sub(tris.b.x, splatX(modelPos)), splatX(modelRight)),
                mul(sub(tris.b.y, splatY(modelPos)), splatY(modelRight))),
                mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelRight))),
                splatX(modelPos));
            __m128 y2 = add(add(add(
                mul(sub(tris.b.x, splatX(modelPos)), splatX(modelUp)),
                mul(sub(tris.b.y, splatY(modelPos)), splatY(modelUp))),
                mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelUp))),
                splatY(modelPos));
            __m128 z2 = add(add(add(
                mul(sub(tris.b.x, splatX(modelPos)), splatX(modelDir)),
                mul(sub(tris.b.y, splatY(modelPos)), splatY(modelDir))),
                mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelDir))),
                splatZ(modelPos));

            __m128 x3 = add(add(add(
                mul(sub(tris.c.x, splatX(modelPos)), splatX(modelRight)),
                mul(sub(tris.c.y, splatY(modelPos)), splatY(modelRight))),
                mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelRight))),
                splatX(modelPos));
            __m128 y3 = add(add(add(
                mul(sub(tris.c.x, splatX(modelPos)), splatX(modelUp)),
                mul(sub(tris.c.y, splatY(modelPos)), splatY(modelUp))),
                mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelUp))),
                splatY(modelPos));
            __m128 z3 = add(add(add(
                mul(sub(tris.c.x, splatX(modelPos)), splatX(modelDir)),
                mul(sub(tris.c.y, splatY(modelPos)), splatY(modelDir))),
                mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelDir))),
                splatZ(modelPos));

            // cross product of the edges (v2 - v1) and (v3 - v2), 4 triangles at once
            FourVec3s normal;
            normal.x = sub( mul(sub(y2,y1),sub(z3,z2)), mul(sub(z2,z1),sub(y3,y2)) );
            normal.y = sub( mul(sub(z2,z1),sub(x3,x2)), mul(sub(x2,x1),sub(z3,z2)) );
            normal.z = sub( mul(sub(x2,x1),sub(y3,y2)), mul(sub(y2,y1),sub(x3,x2)) );
            normal = normalizeFourVec3s(normal);

            __m128 s1234 = dot(normal, lightDirs);
            // clamp negative values to zero: keep s only in lanes where 0 < s
            s1234 = andBits(s1234, less(_mm_setzero_ps(), s1234));

            __m128 l = add(_mm_set_ps1(0.1f), add(add( mul(s1234,lightColors.x), mul(s1234,lightColors.y)), mul(s1234,lightColors.z)));

            __m128i r = floatToInts(mul(l,intsToFloats(unpackR(tris.colors))));
            __m128i g = floatToInts(mul(l,intsToFloats(unpackG(tris.colors))));
            __m128i b = floatToInts(mul(l,intsToFloats(unpackB(tris.colors))));

            return packAndSaturate32To8(r,g,b,_mm_setzero_si128());
        }
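    The transposed {xxxx, yyyy, zzzz} layout that the post asks for can be illustrated without intrinsics. The following is a minimal Python sketch, with each 4-wide register modelled as a list of 4 numbers; the function names mirror the SSE helpers above, but transpose_to_soa and the sample data are assumptions made up for this example.

```python
def transpose_to_soa(vectors):
    """Turn 4 (x, y, z) tuples (array-of-structures) into {xxxx, yyyy, zzzz}."""
    xs, ys, zs = zip(*vectors)
    return {"x": list(xs), "y": list(ys), "z": list(zs)}


def add(l, r):
    """Lane-wise add, like _mm_add_ps."""
    return [a + b for a, b in zip(l, r)]


def mul(l, r):
    """Lane-wise multiply, like _mm_mul_ps."""
    return [a * b for a, b in zip(l, r)]


def dot(l, r):
    """Four dot products at once: lane i holds dot(l[i], r[i])."""
    return add(add(mul(l["x"], r["x"]), mul(l["y"], r["y"])), mul(l["z"], r["z"]))


# Four made-up vectors, first in the usual one-struct-per-vector form.
a = transpose_to_soa([(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 2, 3)])
b = transpose_to_soa([(1, 2, 3), (1, 2, 3), (1, 2, 3), (1, 2, 3)])

lanes = dot(a, b)  # one result per input pair
```

    The point of the layout is that dot() reads exactly like scalar math, yet every operation touches all four inputs with no shuffling between lanes.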
  5. Float packing

    Half-precision floats only have a range of +/-65K, but they are otherwise exactly what you want. Here are a couple of 32->16->32 bit converters:
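    As a rough illustration of what the 32->16->32 round trip does to a value, here is a small Python sketch. It relies on the struct module's "e" (IEEE 754 half-precision) format code rather than on the converters the post refers to; the function names are made up for this example.

```python
import struct


def float_to_half_bits(value):
    """Round a Python float to half precision and return the 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits):
    """Expand a 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_trip(value):
    """32->16->32 style round trip: shows the precision that survives."""
    return half_bits_to_float(float_to_half_bits(value))


one_bits = float_to_half_bits(1.0)        # sign 0, exponent 15, mantissa 0
largest = half_bits_to_float(0x7BFF)      # largest finite half value
third = round_trip(1.0 / 3.0)             # only about 3 decimal digits survive
```

    The largest finite value is what the "+/-65K" range in the post refers to, and the round trip of 1/3 shows the 10-bit mantissa at work.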
  6. SSE, Alignment and Packing

    For the matrix, the fastest would be to just use an array of __m128s. But you might get away with an aligned struct containing 4 __m128s. Don't use std::array. BTW: STL does not support aligned types. But, luckily for you, EASTL was recently open-sourced and it does support aligned types.
  7. Note that in many cases where per-object sorting breaks down (like the wine in a glass example) there is usually a fixed order for the parts that works best (first wine, then glass). That's why every engine I've worked on had an optional per-material z-sorting bias value. The artists would bias the wine to be sorted as if it were "farther away" by a distance approximately equal to the radius of the glass.

    Another trick for self-intersecting objects (like hair or tree leaves) is to draw the object in two passes.

        1st pass : z-write on, blending off, alpha test "a==1.0"
        2nd pass : z-write off, blending on, alpha test "a<1.0"

    If the majority of the object is pretty much opaque then that part will sort correctly because it is drawn with no blending. The blending fringes won't sort perfectly, but the error becomes much harder to notice because it only happens on the edges. Just watch out for artists turning this option on when it's not needed...
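    The per-material sorting bias can be sketched in a few lines. This is a minimal Python illustration, assuming objects are plain dictionaries with hypothetical "position" and "z_bias" fields and that transparent objects are drawn back to front; it is not taken from any particular engine.

```python
import math


def back_to_front(objects, camera_pos):
    """Sort transparent objects so the farthest one (after bias) is drawn first."""
    def sort_distance(obj):
        dx, dy, dz = (p - c for p, c in zip(obj["position"], camera_pos))
        distance = math.sqrt(dx * dx + dy * dy + dz * dz)
        # The bias pretends the object is farther away than it really is.
        return distance + obj["z_bias"]

    return sorted(objects, key=sort_distance, reverse=True)


GLASS_RADIUS = 0.05  # made-up size for the example

glass = {"name": "glass", "position": (0.0, 0.0, 5.0), "z_bias": 0.0}
# The wine shares the glass's pivot, so without a bias the order is a coin toss.
wine = {"name": "wine", "position": (0.0, 0.0, 5.0), "z_bias": GLASS_RADIUS}

draw_order = [obj["name"] for obj in back_to_front([glass, wine], (0.0, 0.0, 0.0))]
```

    With the bias set to roughly the glass radius, the wine is always drawn first regardless of the viewing angle, which is the fixed order the post describes.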
  8. Old thread got my attention. You can find a nice explanation of marching cubes and tetrahedrons here: and some example code for both techniques here: For the specific topic of how to break a cube into tetrahedrons, look for this function:

        //vMarchCube2 performs the Marching Tetrahedrons algorithm on a single cube by making six calls to vMarchTetrahedron
        GLvoid vMarchCube2(GLfloat fX, GLfloat fY, GLfloat fZ, GLfloat fScale)
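    For readers without the example code at hand, here is one common way to split a cube into six tetrahedra, sketched in Python. All six share the diagonal from corner 0 to corner 6. The corner numbering and the table below are assumptions chosen for this sketch and are not claimed to match the referenced code exactly.

```python
# Unit-cube corners, numbered 0..7 (assumed numbering for this sketch).
CUBE_VERTICES = [
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
]

# Six tetrahedra, each listed as 4 corner indices; all contain the 0-6 diagonal.
TETRAHEDRA_IN_A_CUBE = [
    (0, 5, 1, 6), (0, 1, 2, 6), (0, 2, 3, 6),
    (0, 3, 7, 6), (0, 7, 4, 6), (0, 4, 5, 6),
]


def cube_tetrahedra(origin, scale):
    """Return the corner positions of the six tetrahedra for one cube."""
    corners = [tuple(o + scale * v for o, v in zip(origin, vertex))
               for vertex in CUBE_VERTICES]
    return [tuple(corners[i] for i in tetra) for tetra in TETRAHEDRA_IN_A_CUBE]


def tetrahedron_volume(a, b, c, d):
    """Unsigned volume: |det(b - a, c - a, d - a)| / 6."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


tetras = cube_tetrahedra((0.0, 0.0, 0.0), 1.0)
volumes = [tetrahedron_volume(*t) for t in tetras]
```

    Each tetrahedron has one sixth of the cube's volume, which is a quick sanity check on the table. A marching tetrahedrons pass would then run its per-tetrahedron step once for each of the six entries.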
  9. Aras did a significant investigation into this problem. His results are here:
  10. A Programmer's choice

    In school you learn the foundations of programming: programming paradigms, data structures, algorithms, numerical analysis, etc. In the industry you apply these foundations to lots of different areas. They are necessary to be effective in whatever you do, and what you do will change often. It is very common for people to change what aspect of development they are focusing on. I've been programming games professionally for a long time. I've mentored lots of juniors, given lots of interviews and talked to crowds about how to "get into the biz". My advice has always been the same: Get a 4-year degree from a strong but general college. Learn the theory from school and learn the practice by making your own games and demos on your own time. Over the summer after your junior year, get an internship; it's really a multi-month interview. Do well in the internship and you can easily get a job at the same company after you graduate. Work on a game there from start to finish and you are golden. You can stick with that company or start shopping around for someplace better. Worked for me. Learn hard, work hard and it'll work for you.
  11. If you want to store a value over 1, you need to know what your range is. Before converting with floatToFixed, you will need to divide by the range to map [0,range] down to [0,1]. Then, when converting back with fixedToFloat, you multiply the [0,1] value by the maximum range. If you only want 16 bits of precision then just do the r and g portions of the code. The algorithm is incremental over any number of channels.
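    A minimal Python sketch of that range mapping, using only the r and g channels for 16 bits of precision. It follows the same 255/256 scaling as the floatToFixed/fixedToFloat code in the post it builds on; encode16, decode16 and the sample range are names and values made up for this example.

```python
TO_FIXED = 255.0 / 256
FROM_FIXED = 256.0 / 255


def frac(f):
    return f - int(f)


def store(channel):
    """Model of writing a [0,1] float to an 8-bit texture channel."""
    return int(channel * 255)


def load(byte):
    """Model of reading an 8-bit texture channel back as a float."""
    return byte / 255.0


def encode16(value, max_range):
    """Map [0, max_range] to [0, 1], then split across two 8-bit channels."""
    f = value / max_range * TO_FIXED
    return store(frac(f)), store(frac(f * 255))


def decode16(rg, max_range):
    """Recombine the two channels and scale back up to [0, max_range]."""
    r, g = load(rg[0]), load(rg[1])
    return (r * FROM_FIXED + g * FROM_FIXED / 255) * max_range


MAX_RANGE = 100.0                 # assumed known upper bound of the data
packed = encode16(37.5, MAX_RANGE)
restored = decode16(packed, MAX_RANGE)
```

    The restored value is close to, but not exactly, the input; the error is on the order of max_range divided by 255*255, so a larger range costs proportionally more absolute precision.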
  12. I just posted a response to a later, but almost identical question under the header "Encoding 16 and 32 bit floating point value into RGBA byte texture"
  13. In shaderland, each channel of an 8888 texture represents a number between 0.0 and 1.0 inclusive using an 8-bit integer. The conversion from float to int that happens when a texture is written to is basically int(floatValue*255.0). The conversion from int to float that happens when a texture is read is float(intValue)/255.0.

    To store a value between 0.0 and 1.0 using four 8-bit channels to achieve 32-bit precision, you need to simulate fixed-point operations using floating-point math. Use multiplies as shifts, frac() to mask off the >1.0 bits and the assignment to an 8-bit output to mask off the <1.0/255 bits. The obvious way to do this is shown here. WARNING: This doesn't work (explained later).

        FloatToInt()
        out.r = frac(floatValue*1);
        out.g = frac(floatValue*255);
        out.b = frac(floatValue*255*255);
        out.a = frac(floatValue*255*255*255);

        IntToFloat()
        in = intValue.r/(1)
           + intValue.g/(255)
           + intValue.b/(255*255)
           + intValue.a/(255*255*255);

    Obviously, FloatToInt() can be optimized to frac(floatValue*vectorConstant) and IntToFloat() can be optimized to dot(intValue, vectorConstant2).

    Unfortunately, we don't want to store 0.0 to 1.0 inclusive in each of the channels. If the channels include both extremes then the extreme values of each channel would overlap because they are equivalent. That means the above math would record 1.0/255 as (1, 255, 0, 0), which is double the correct value. Instead we want to store 0.0 to 255.0/256 in each channel. That is the range of values represented by an 8-bit fixed-point value. To convert a floating point [0.0,1.0] to a fixed point-ish [0.0,255.0/256] we multiply by 255.0/256.

        FloatToInt()
        const float toFixed = 255.0/256;
        out.r = frac(floatValue*toFixed*1);
        out.g = frac(floatValue*toFixed*255);
        out.b = frac(floatValue*toFixed*255*255);
        out.a = frac(floatValue*toFixed*255*255*255);

        IntToFloat()
        const float fromFixed = 256.0/255;
        in = intValue.r*fromFixed/(1)
           + intValue.g*fromFixed/(255)
           + intValue.b*fromFixed/(255*255)
           + intValue.a*fromFixed/(255*255*255);

    Here's the bit of python I wrote to make sure I'm not full of shit.

        # load/store model what the hardware does when reading/writing an 8-bit channel
        def load(v):
            r,g,b,a = v
            return r/255.0, g/255.0, b/255.0, a/255.0

        def store(v):
            r,g,b,a = v
            return int(r*255), int(g*255), int(b*255), int(a*255)

        def frac(f):
            return f - int(f)

        def floatToFixed(f):
            toFixed = 255.0/256
            return frac(f*toFixed*1), frac(f*toFixed*255), frac(f*toFixed*255*255), frac(f*toFixed*255*255*255)

        def fixedToFloat(v):
            r,g,b,a = v
            fromFixed = 256.0/255
            return r*fromFixed/1 + g*fromFixed/(255) + b*fromFixed/(255*255) + a*fromFixed/(255*255*255)

        print(fixedToFloat(load(store(floatToFixed(1.0)))))
        print(fixedToFloat(load(store(floatToFixed(0.0)))))
        print(fixedToFloat(load(store(floatToFixed(0.5)))))
        print(fixedToFloat(load(store(floatToFixed(1.0/3)))))

    result:

        0.999999999763
        0.0
        0.499999999882
        0.333333333254
  14. Recovering world position in postprocess

    When you rendered the depth buffer, you transformed the point from world-space to projection-space. Now you just need to invert the transform to get it back to world-space. You have the projection-space Z value. You must recreate the X and Y values using texture coordinate interpolators. Then just multiply by the inverse of the world-to-projection matrix and divide by W. Voilà!

        // code from memory. not tested.
        float2 UV; // (0,0)->(1,1) full screen quad UVs
        float depth = tex2D(depthSampler, UV).r;
        // map 0,1 to -1,1; Y is flipped, assuming UV (0,0) is the top-left corner of the screen
        float2 clipUV = float2(UV.x*2-1, 1-UV.y*2);
        float4 projPos = float4(clipUV.x, clipUV.y, depth, 1);
        // projToWorld = inverse(worldToProj);
        float4 worldPos = mul(projToWorld, projPos);
        worldPos /= worldPos.w;

    If you are using DX9 or less, you will need to add that annoying half-texel offset when sampling the depth buffer, but don't let it slip into the clipUV value. This is all from vague memory. No warranties of correctness. :P [Edited by - corysama on February 21, 2008 1:40:23 AM]
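    The round trip can be checked on the CPU. Below is a minimal Python sketch under several assumptions: a column-vector convention, a D3D-style perspective matrix with depth in [0,1], an identity view matrix (so world-to-projection is just the projection), and UV (0,0) at the top-left of the screen. All function names are made up for this example.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def invert4(m):
    """Gauss-Jordan inverse of a 4x4 matrix with partial pivoting."""
    n = 4
    a = [list(map(float, m[r])) + [1.0 if r == c else 0.0 for c in range(n)]
         for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def perspective(fov_y_deg, aspect, near, far):
    """D3D-style left-handed projection; depth after the divide lies in [0, 1]."""
    t = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    q = far / (far - near)
    return [[t / aspect, 0.0, 0.0, 0.0],
            [0.0, t, 0.0, 0.0],
            [0.0, 0.0, q, -near * q],
            [0.0, 0.0, 1.0, 0.0]]


def world_to_depth_sample(world_to_proj, p):
    """What the depth pass effectively stores: screen UV and z/w."""
    clip = mat_vec(world_to_proj, [p[0], p[1], p[2], 1.0])
    ndc = [c / clip[3] for c in clip]
    uv = (ndc[0] * 0.5 + 0.5, 0.5 - ndc[1] * 0.5)  # UV origin at top-left
    return uv, ndc[2]


def reconstruct_world(proj_to_world, uv, depth):
    """Rebuild X and Y from the UVs, then invert the transform and divide by W."""
    x = uv[0] * 2.0 - 1.0
    y = 1.0 - uv[1] * 2.0
    h = mat_vec(proj_to_world, [x, y, depth, 1.0])
    return [h[i] / h[3] for i in range(3)]


world_to_proj = perspective(60.0, 16.0 / 9.0, 0.1, 100.0)
point = [1.0, 2.0, 5.0]
uv, depth = world_to_depth_sample(world_to_proj, point)
recovered = reconstruct_world(invert4(world_to_proj), uv, depth)
```

    The divide by W at the end is what undoes the perspective divide from the depth pass; without it the recovered position is scaled by 1/w.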
  15. Mesh Rendering Idea

    In Quake 1, if the monsters were far enough away, the renderer would switch to single-pixel point rendering to avoid doing the triangle setup. Remember that they were targeting a 320x240 software renderer; individual pixels were pretty big back then.