

Community Reputation

352 Neutral

About corysama

  1. The best advice I can give you is to develop your code on multiple platforms simultaneously from the start. When you do that, you will discover tons of tiny surprises along the way that you need to adjust for to keep the code clean on all platforms. Fixing issues as they appear is quick and easy. But if you put off discovery until after lots of code has been written, you are signing yourself up for lots of unpleasant surprises (and rewrites).
  2. corysama

    Gamma correction confusion

    Yep. You understand it now. When you sample a framebuffer, it's just another texture. Having sRGB framebuffers and textures is not just a convenience. Blending and texture filtering need to be done in linear space to work properly. So, under the hood, the blend op has to do newSrgbPixelColor = toSRGB(blend(linearShaderOutputColor, toLinear(currentSrgbPixelColor))). You can't do that in a pixel shader without "programmable blending" extensions. Similarly, the texture sampler has to convert all samples to linear before performing filtering. In theory you could do a bunch of point samples and convert+filter yourself in a pixel shader. But, you really do not want to. Especially not for complicated sampling like anisotropic.
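To make the blend formula above concrete, here is a minimal per-channel sketch on the CPU. It assumes the standard piecewise sRGB transfer functions and a plain alpha blend; `blendLinear` and `blendIntoSrgbTarget` are hypothetical names for illustration, not anything the GPU or an API exposes.

```cpp
#include <cassert>
#include <cmath>

// Standard sRGB transfer functions on normalized [0,1] channel values.
float toLinear(float s) {
    return s <= 0.04045f ? s / 12.92f : std::pow((s + 0.055f) / 1.055f, 2.4f);
}

float toSRGB(float l) {
    return l <= 0.0031308f ? l * 12.92f : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}

// Hypothetical blend op: ordinary alpha blending, performed on linear values.
float blendLinear(float srcLinear, float dstLinear, float alpha) {
    return srcLinear * alpha + dstLinear * (1.0f - alpha);
}

// Roughly what an sRGB render target has to do per channel when blending:
// decode the stored pixel, blend in linear space, re-encode the result.
float blendIntoSrgbTarget(float linearShaderOutput, float currentSrgbPixel, float alpha) {
    return toSRGB(blendLinear(linearShaderOutput, toLinear(currentSrgbPixel), alpha));
}
```

Blending white over black at 50% this way stores roughly 0.735 in the sRGB target, not the 0.5 you would get by blending the encoded values directly.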
  3. corysama

    Gamma correction confusion

    In your postprocessing you'll need to set the framebuffer as an sRGB source texture to convert it back to linear when you sample it. It's the same as your loaded textures. The 8-bit render target is being used as intermediate storage between fp32 calculations. The linear->sRGB->linear round-trip is designed to minimize loss during that 8-bit intermediate step. So, it goes: load 8-bit sRGB texture file, sample 8-bit sRGB texture converting it to a linear fp32 color, do lighting math in fp32, convert to sRGB to store in an 8-bit framebuffer, set the framebuffer as an sRGB8888 source texture, sample the 8-bit texture converting from sRGB to linear fp32, do post-processing math, store to another 8-bit sRGB framebuffer. You can avoid the linear->sRGB->linear process if you can afford a higher-precision intermediate format.
  4. corysama

    Gamma correction confusion

    There are 2 purposes to using sRGB as an image/framebuffer format: 1) It's what monitors expect. So, it can be pumped straight to the monitor without extra work. 2) 8 bits is not enough precision to avoid banding artifacts in dark regions when the color is stored linearly. If you use a higher bit depth framebuffer then you can get away with storing linear-space colors. So, sRGB888 is OK as an intermediate format. But, RGB888 filled with linear-space colors will result in a lot of banding.
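One way to see the banding argument is to count how many of the 256 available 8-bit codes land in the dark end of the range under each encoding. This is only a sketch: the function names and the 0.01 "dark" threshold are arbitrary choices for illustration.

```cpp
#include <cassert>
#include <cmath>

// Standard sRGB decode for a normalized [0,1] channel value.
float srgbToLinear(float s) {
    return s <= 0.04045f ? s / 12.92f : std::pow((s + 0.055f) / 1.055f, 2.4f);
}

// Number of 8-bit codes whose linear intensity is below `limit`
// when the byte stores the linear value directly.
int darkCodesLinearStorage(float limit) {
    int count = 0;
    for (int code = 0; code < 256; ++code)
        if (code / 255.0f < limit) ++count;
    return count;
}

// Same count when the byte stores an sRGB-encoded value.
int darkCodesSrgbStorage(float limit) {
    int count = 0;
    for (int code = 0; code < 256; ++code)
        if (srgbToLinear(code / 255.0f) < limit) ++count;
    return count;
}
```

With a limit of 0.01, linear storage has only 3 codes to spend on the darkest 1% of intensities, while sRGB storage has a couple dozen. That difference is why linear RGB888 bands in shadows.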
  5. Are you making a math library for games or for "serious" applications --stuff where rare bugs have serious physical/financial consequences? In general, the theme for bug handling in games is to have lots of testing up front. During production, error as early as possible to make it easy to identify the hopefully recent change that caused the bug. When you ship, remove all error checks because you are confident in your testing, right? So, yeah. Assert. If a non-invertible matrix goes through the invert function in a game, something went irrecoverably wrong. There is no exception/error-handling to be done. There's a bug and it needs to be fixed quickly.
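As a sketch of the "just assert" approach, here is a tiny 2x2 inverse. The `Mat2` type and the determinant tolerance are assumptions made up for this example; a real library would pick its own epsilon.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal matrix type, row-major: [a b; c d].
struct Mat2 { float a, b, c, d; };

Mat2 invert(const Mat2& m) {
    const float det = m.a * m.d - m.b * m.c;
    // A singular matrix here means a bug upstream. Fail loudly in debug
    // builds; the check compiles away when NDEBUG is defined for shipping.
    assert(std::fabs(det) > 1e-12f && "invert() called on a singular matrix");
    const float invDet = 1.0f / det;
    return { m.d * invDet, -m.b * invDet, -m.c * invDet, m.a * invDet };
}
```

There is no error code or exception path for the caller to handle. The assert points straight at the frame where the bad matrix showed up.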
  6. corysama

    Marching Tetrahedra algorithm normals

    Hey, Thomas. What is the run time source of your 3D scalar field? The easiest way to generate vertex normals for marching cubes/tets is to do central differencing (take the gradient) on the scalar field. This should also be much more accurate than trying to reconstruct the field normal from the generated triangles. You can see an implementation of this technique in the vGetNormal function of my ancient example code: http://paulbourke.net/geometry/polygonise/marchingsource.cpp It's pretty easy to implement. But, it can be slow for complicated procedural fields. If you need that to be faster, you can dig into analytic differentiation. But, be prepared to break out your math textbooks.
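Central differencing on a scalar field looks roughly like this. It is a sketch, not the code from the linked file: `sampleField` is a stand-in field (squared distance from the origin) and the step size `h` is an arbitrary assumption you would tune to your grid spacing.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Stand-in scalar field for illustration: squared distance from the origin.
float sampleField(float x, float y, float z) {
    return x * x + y * y + z * z;
}

// Approximates the normalized gradient of the field at (x, y, z)
// by sampling one step in each direction along every axis.
Vec3 fieldNormal(float x, float y, float z, float h = 0.01f) {
    Vec3 n = {
        sampleField(x + h, y, z) - sampleField(x - h, y, z),
        sampleField(x, y + h, z) - sampleField(x, y - h, z),
        sampleField(x, y, z + h) - sampleField(x, y, z - h)
    };
    const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    if (len > 0.0f) { n.x /= len; n.y /= len; n.z /= len; }
    return n;
}
```

The gradient points toward increasing field values, so flip the sign if your surface convention wants normals pointing the other way. Note the cost: six extra field samples per vertex, which is where the slowness for complicated procedural fields comes from.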
  7. corysama

    Relative to Camera rendering.

    r_Grid * vec4(r_Vertex, 1.0) is what's giving you grief. In order to avoid fp32 precision errors, you can never allow an fp32 calculation to contain an excessively large value. r_ModelViewProjection * vec4(V, 1.0) is taking an already stairsteppy V and transforming it to projection space. What you need is a GridModelViewProjection matrix that is calculated as doubles before being converted to floats. How to pull off the normalize trick after that is a good question... Bikeshedding bonus: Personally, I prefer to name my mats in the style of ProjectionFromGrid rather than GridModelViewProjection. That way it connects nicely when I write out "projectionPosition = ProjectionFromGrid * gridPosition".
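The "calculate as doubles, then convert to floats" point can be shown with just the translation part of such a matrix, reduced to one axis. Both function names are hypothetical and the coordinates are made up for illustration.

```cpp
#include <cassert>

// Converts each large coordinate to float first, then subtracts.
// Near 1e7, adjacent floats are 1.0 apart, so sub-unit detail is already gone.
float relativeNaive(double worldPos, double cameraPos) {
    return static_cast<float>(worldPos) - static_cast<float>(cameraPos);
}

// Subtracts in double precision, then converts the small result to float.
float relativeViaDouble(double worldPos, double cameraPos) {
    return static_cast<float>(worldPos - cameraPos);
}
```

For an object at 10000000.25 and a camera at 10000000.0, the naive version returns 0 while the double version returns 0.25. Concatenating the full grid, model, view, and projection matrices in doubles on the CPU applies the same idea to every term before the shader ever sees a float.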
  8. corysama

    Packing uniforms into matrix

    Don't stop at packing some of them into a matrix. Pack all of them into an array of vec4s and alias them with "#define shininess array[3].y"
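The aliasing trick is written for GLSL, but the idea can be sketched in C++ to show how the names map onto slots. The array name `uniforms`, the slot layout, and the parameter names other than `shininess` are assumptions for this example.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// Every parameter lives in one flat block, uploaded in a single call.
// In GLSL this would be a uniform array of vec4s.
static Vec4 uniforms[4];

// Hypothetical layout: each named parameter is an alias for one slot or component.
#define lightColor uniforms[0]
#define fogDensity uniforms[3].x
#define shininess  uniforms[3].y

// Code reads the parameters by name, unaware of the packing.
float specularScale(float base) {
    return base * shininess;
}
```

The payoff is that the CPU side only ever updates one array, and changing the layout means editing the defines rather than every use site.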
  9. So, sounds like what you are doing is: reGamma(Ambient(deGamma(texture))) + reGamma(Light(deGamma(texture))). This doesn't work because reGamma is a non-linear operator. So, you can't expect a linear operator like + to make sense with its results. Instead what you have to do is: reGamma(Ambient(deGamma(texture)) + Light(deGamma(texture))). That means removing the gamma operation from the end of your shader, accumulating linear values into your framebuffer, then switching back to gamma as a post pass on the final value. To make that work, you are going to need to switch to an fp16 format for your framebuffer because the whole point of the gamma curve is that rgb888 isn't enough bits to store linear colors without banding.
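A quick numeric check of why the order matters. This sketch assumes a simple power-law gamma of 2.2 rather than the exact sRGB curve, and the two function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Simple power-law approximation of the gamma encode step.
float reGamma(float linear) { return std::pow(linear, 1.0f / 2.2f); }

// Wrong: encode each lighting term separately, then add the encoded values.
float sumOfEncoded(float ambientLinear, float lightLinear) {
    return reGamma(ambientLinear) + reGamma(lightLinear);
}

// Right: add the linear terms, then encode once at the end.
float encodedSum(float ambientLinear, float lightLinear) {
    return reGamma(ambientLinear + lightLinear);
}
```

With two linear contributions of 0.25 each, the correct result is about 0.73, while adding the encoded values gives about 1.07, which is already past white.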
  10. corysama

    Frustum Culling

    This is a very nice article. The bit at the end of the OBB function

    __m128i outside_res_i = _mm_cvtps_epi32(outside);
    ALIGN_SSE int obj_culling_res[4];
    _mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);
    culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 || obj_culling_res[2] != 0) ? 1 : 0;

    is unnecessary memory traffic. The store, the loads, all of those compares and the branch can be replaced by a single instruction that keeps everything in registers:

    culling_res[i] = _mm_movemask_ps(outside);

    With SSE, RAM is the enemy. Registers are your only friend.
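For anyone who has not used it, here is a small standalone sketch of what `_mm_movemask_ps` returns. The function name `lanesLessThan` is hypothetical, and the example assumes an x86 target with SSE.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Returns a 4-bit mask: bit i is set when lane i of `a` is less than lane i of `b`.
// The comparison result never leaves the SSE registers until the final int.
int lanesLessThan(__m128 a, __m128 b) {
    const __m128 mask = _mm_cmplt_ps(a, b);  // all-ones in each lane that passes
    return _mm_movemask_ps(mask);            // gathers the sign bit of each lane
}
```

Comparing {1, 5, 2, 9} against 3 in every lane yields binary 0101. Two things to watch when dropping this into the culling code: the result is a bitmask rather than 0 or 1, and it covers all four lanes, so mask it with 0x7 if only the first three lanes are meaningful, as in the original test.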
  11. Take a look at https://en.wikipedia.org/wiki/OpenH264 and http://www.openh264.org/faq.html   It's used by Firefox.  https://blog.mozilla.org/blog/2013/10/30/video-interoperability-on-the-web-gets-a-boost-from-ciscos-h-264-codec/   The license is confusing, but here's how I think it works:   The important part is that you do not compile and link the patented code.  You do not ship the shared library binary.  Your app downloads the library and Cisco volunteers to pay the patent license fee for that user on your behalf.   The code is open source. But, that does not grant you a license to use the patents.  It's only open source so that anyone can contribute to the source.
  12.     Have a third look.  Last I checked, the runtime exception allows static linking.  Would you use the C stdlib that comes with GCC?   Nearly all programs compiled with GCC do.  This is literally the same license.  I think GNU made their stdlib GPL because they're GNU --that's what they do.  But, they added a minimal "exception" that basically weakens the GPL to "If you modify specifically our code, you must publish specifically those modifications.  The end."  because they knew that making all programs compiled with GCC implicitly GPL would doom GCC.  TBB is the same deal.
  13. Here's a pretty direct translation of your code to SSE. The big change is that, in order to get good speed without huge changes to the algorithm, this code requires the data for 4 triangles to be packed in a transposed {xxxx, yyyy, zzzz} structure-of-arrays (SOA) format so that 4 triangles can be processed in a single function call. All of the functions work in SOA form so that they can resemble scalar math while actually working on 4 values at once. This function takes 4 packed triangles and returns 4 8-bit colors packed as {rrrr,gggg,bbbb,0000}.

    I have not checked this code for correctness. There are probably bugs. I will not be available to answer further questions about this code. Sorry.

    #include <pmmintrin.h>

    struct FourVec3s { __m128 x; __m128 y; __m128 z; };
    struct FourTris { FourVec3s a; FourVec3s b; FourVec3s c; __m128i colors; };

    // transposed
    static FourVec3s lightDirs = {{0.2, 0.5, -0.5, -0.5}, {-1.6,-0.7,-0.3, 1.3}, {-1.7, 20.3,-0.6, 0.6}};
    // transposed
    static FourVec3s lightColors = {{.4, .4145, .584, .41 }, {.414, .451, .51414,.44}, {.515, .543, .43, .3414}};

    static __m128 modelRight = {1.0, 0.0, 0.0, 0.0};
    static __m128 modelUp    = {0.0, 1.0, 0.0, 0.0};
    static __m128 modelDir   = {0.0, 0.0, 1.0, 0.0};
    static __m128 modelPos   = {0.0, 0.0, 0.0, 1.0};

    // broadcast one component of v to all 4 lanes
    static inline __m128 splatX(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(0,0,0,0)); }
    static inline __m128 splatY(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(1,1,1,1)); }
    static inline __m128 splatZ(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(2,2,2,2)); }

    static inline __m128 add(__m128 l, __m128 r) { return _mm_add_ps(l, r); }
    static inline __m128 sub(__m128 l, __m128 r) { return _mm_sub_ps(l, r); }
    static inline __m128 mul(__m128 l, __m128 r) { return _mm_mul_ps(l, r); }
    // named bitAnd because "and" is a reserved token in C++
    static inline __m128 bitAnd(__m128 l, __m128 r) { return _mm_and_ps(l, r); }
    static inline __m128 less(__m128 l, __m128 r) { return _mm_cmplt_ps(l, r); }

    // 4 dot products at once, one per lane
    static inline __m128 dot(const FourVec3s &l, const FourVec3s &r)
    { return add(add(mul(l.x,r.x), mul(l.y,r.y)), mul(l.z,r.z)); }

    // unpack 8-bit transposed {rrrr,gggg,bbbb,aaaa} into 32-bit RRRR, GGGG or BBBB
    static inline __m128i unpackR(__m128i iv)
    { return _mm_unpacklo_epi16(_mm_unpacklo_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }
    static inline __m128i unpackG(__m128i iv)
    { return _mm_unpackhi_epi16(_mm_unpacklo_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }
    static inline __m128i unpackB(__m128i iv)
    { return _mm_unpacklo_epi16(_mm_unpackhi_epi8(iv,_mm_setzero_si128()),_mm_setzero_si128()); }

    static inline __m128 intsToFloats(__m128i iv) { return _mm_cvtepi32_ps(iv); }
    static inline __m128i floatToInts(__m128 fv) { return _mm_cvttps_epi32(fv); }

    // the final pack uses unsigned saturation so that 8-bit values above 127 survive
    static inline __m128i packAndSaturate32To8(__m128i r, __m128i g, __m128i b, __m128i a)
    { return _mm_packus_epi16(_mm_packs_epi32(r,g),_mm_packs_epi32(b,a)); }

    static inline FourVec3s normalizeFourVec3s(const FourVec3s &v)
    {
        __m128 length = _mm_sqrt_ps(add(add( mul(v.x,v.x), mul(v.y,v.y)), mul(v.z,v.z) ));
        FourVec3s result = { _mm_div_ps(v.x,length), _mm_div_ps(v.y,length), _mm_div_ps(v.z,length) };
        return result;
    }

    __m128i Shade4Triangles(const FourTris &tris)
    {
        __m128 x1 = add(add(add(
            mul(sub(tris.a.x, splatX(modelPos)), splatX(modelRight)),   // ((*triangle).a.x - modelPos.x)*modelRight.x +
            mul(sub(tris.a.y, splatY(modelPos)), splatY(modelRight))),  // ((*triangle).a.y - modelPos.y)*modelRight.y +
            mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelRight))),  // ((*triangle).a.z - modelPos.z)*modelRight.z) +
            splatX(modelPos));                                          // modelPos.x
        __m128 y1 = add(add(add(
            mul(sub(tris.a.x, splatX(modelPos)), splatX(modelUp)),
            mul(sub(tris.a.y, splatY(modelPos)), splatY(modelUp))),
            mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelUp))),
            splatY(modelPos));
        __m128 z1 = add(add(add(
            mul(sub(tris.a.x, splatX(modelPos)), splatX(modelDir)),
            mul(sub(tris.a.y, splatY(modelPos)), splatY(modelDir))),
            mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelDir))),
            splatZ(modelPos));

        __m128 x2 = add(add(add(
            mul(sub(tris.b.x, splatX(modelPos)), splatX(modelRight)),
            mul(sub(tris.b.y, splatY(modelPos)), splatY(modelRight))),
            mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelRight))),
            splatX(modelPos));
        __m128 y2 = add(add(add(
            mul(sub(tris.b.x, splatX(modelPos)), splatX(modelUp)),
            mul(sub(tris.b.y, splatY(modelPos)), splatY(modelUp))),
            mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelUp))),
            splatY(modelPos));
        __m128 z2 = add(add(add(
            mul(sub(tris.b.x, splatX(modelPos)), splatX(modelDir)),
            mul(sub(tris.b.y, splatY(modelPos)), splatY(modelDir))),
            mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelDir))),
            splatZ(modelPos));

        __m128 x3 = add(add(add(
            mul(sub(tris.c.x, splatX(modelPos)), splatX(modelRight)),
            mul(sub(tris.c.y, splatY(modelPos)), splatY(modelRight))),
            mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelRight))),
            splatX(modelPos));
        __m128 y3 = add(add(add(
            mul(sub(tris.c.x, splatX(modelPos)), splatX(modelUp)),
            mul(sub(tris.c.y, splatY(modelPos)), splatY(modelUp))),
            mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelUp))),
            splatY(modelPos));
        __m128 z3 = add(add(add(
            mul(sub(tris.c.x, splatX(modelPos)), splatX(modelDir)),
            mul(sub(tris.c.y, splatY(modelPos)), splatY(modelDir))),
            mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelDir))),
            splatZ(modelPos));

        // cross product of edges (v2 - v1) and (v3 - v2), 4 triangles at once
        FourVec3s normal;
        normal.x = sub( mul(sub(y2,y1),sub(z3,z2)), mul(sub(z2,z1),sub(y3,y2)) );
        normal.y = sub( mul(sub(z2,z1),sub(x3,x2)), mul(sub(x2,x1),sub(z3,z2)) );
        normal.z = sub( mul(sub(x2,x1),sub(y3,y2)), mul(sub(y2,y1),sub(x3,x2)) );
        normal = normalizeFourVec3s(normal);

        __m128 s1234 = dot(normal, lightDirs);
        // zero out lanes where the dot product is not positive
        s1234 = bitAnd(s1234, less(_mm_setzero_ps(), s1234));

        __m128 l = add(_mm_set_ps1(0.1f), add(add(
            mul(s1234,lightColors.x),
            mul(s1234,lightColors.y)),
            mul(s1234,lightColors.z)));

        __m128i r = floatToInts(mul(l,intsToFloats(unpackR(tris.colors))));
        __m128i g = floatToInts(mul(l,intsToFloats(unpackG(tris.colors))));
        __m128i b = floatToInts(mul(l,intsToFloats(unpackB(tris.colors))));

        return packAndSaturate32To8(r,g,b,_mm_setzero_si128());
    }
  14. corysama

    Float packing

    Half-precision floats only have a range of +/-65K, but they are otherwise exactly what you want. Here are a couple of 32->16->32 bit converters: http://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion/3542975#3542975 http://cellperformance.beyond3d.com/articles/2006/07/update-19-july-06-added.html
  15. corysama

    SSE, Alignment and Packing

    For the matrix, the fastest would be to just use an array of __m128s. But, you might get away with an aligned struct containing 4 __m128s. Don't use std::array. BTW: STL does not support aligned types. But, luckily for you, EASTL was recently open-sourced and it does support aligned types. https://github.com/paulhodge/EASTL