
Aressera

Members
  • Content count: 425
  • Joined
  • Last visited

Community Reputation: 2915 Excellent

About Aressera
  • Rank: Member

Personal Information
  • Location: Chapel Hill, NC
  1. Try this code, which implements the very fast 4SED algorithm. It computes the SDF for a large (e.g. 1024x1024) image in tens of milliseconds. It takes a single-channel grayscale image, applies a threshold operation to compute a binary image, then applies the 4SED algorithm twice to compute the distance for the foreground and background. The final image is the difference of the two fields, which is then normalized and clamped to a maximum distance range. The only part missing would be conversion to fixed point [0,255], if desired.

typedef math::Vector2f Offset;

Bool DistanceFilter:: processFrame( const ImageFrame& inputFrame, ImageFrame& outputFrame )
{
    if ( inputFrame.getImageCount() == 0 || outputFrame.getImageCount() == 0 )
        return false;

    const ImageBuffer* inputImage = inputFrame.getImage(0);
    ImageBuffer* outputImage = outputFrame.getImage(0);

    if ( inputImage == NULL || outputImage == NULL )
        return false;

    const PixelFormat& inputPixelFormat = inputImage->getPixelFormat();

    // Make sure the input image is 2D and has one channel.
    if ( inputImage->getDimensionCount() != 2 || inputPixelFormat.getChannelCount() != 1 )
        return false;

    // Determine the size of the output image.
    const Size2D inputSize = inputImage->getSize2D();
    const Size2D outputSize = inputSize + padding*2;
    PixelFormat outputPixelFormat( ColorSpace::LINEAR_GRAY, ScalarType(ScalarType::FLOAT32) );
    outputImage->setFormat( outputPixelFormat, outputSize.x, outputSize.y );
    Float32* const outputPixels = outputImage->getPixels();

    // Allocate temporary storage.
    const Size outputPixelCount = outputSize.x*outputSize.y;
    offsetMap.allocate( outputPixelCount );
    Offset* const offsetPixels = offsetMap.getPointer();

    //********************************************************************************
    // Compute the distance map for the background area.

    computeDistanceMap( inputImage->getPixels(), inputSize, padding, threshold,
                        offsetPixels, outputSize, invert );

    // Distance evaluation.
    {
        Offset* offset = offsetPixels;
        const Offset* const offsetEnd = offsetPixels + outputPixelCount;
        Float32* output = outputPixels;

        for ( ; offset != offsetEnd; offset++, output++ )
            *output = (*offset).getMagnitude();
    }

    //********************************************************************************
    // Compute the distance map for the foreground area.

    if ( signedDistance )
    {
        computeDistanceMap( inputImage->getPixels(), inputSize, padding, threshold,
                            offsetPixels, outputSize, !invert );

        // Distance evaluation.
        {
            Offset* offset = offsetPixels;
            const Offset* const offsetEnd = offsetPixels + outputPixelCount;
            Float32* output = outputPixels;

            for ( ; offset != offsetEnd; offset++, output++ )
                *output -= (*offset).getMagnitude();
        }
    }

    //********************************************************************************
    // Normalization.

    if ( normalize )
    {
        Float32 maxDistance = range;

        if ( maxDistance == 0.0f )
        {
            maxDistance = math::max( math::abs(math::min( outputPixels, outputPixelCount )),
                                     math::abs(math::max( outputPixels, outputPixelCount )) );
        }

        if ( maxDistance != 0.0f )
        {
            const Float32 outputCenter = signedDistance ? outputThreshold : 0.0f;
            const Float32 invRange = (Float32(1.0) - outputCenter)/maxDistance;
            Float32* output = outputPixels;
            const Float32* const outputEnd = outputPixels + outputPixelCount;

            for ( ; output != outputEnd; output++ )
                *output = (*output)*invRange + outputCenter;
        }
    }

    return true;
}

/**
  * This implementation uses the 4SED distance mapping algorithm proposed in:
  * Danielsson, P. "Euclidean Distance Mapping" (1980)
  *
  * The algorithm is slightly modified to handle special cases at the image boundaries,
  * where the original algorithm would produce incorrect results if the foreground touched
  * the edge of the image.
  *
  * TODO: Implement the corrections proposed in:
  * Cuisenaire, O. and Macq, B.
  * "Fast and Exact Signed Euclidean Distance Transformation with Linear Complexity" (1999)
  */
void DistanceFilter:: computeDistanceMap( const Float32* inputPixels, Size2D inputSize, Size2D inputPadding,
                                          Float32 inputThreshold, Offset* offsetPixels, Size2D outputSize, Bool invert )
{
    const Size outputPixelCount = outputSize.x*outputSize.y;

    //********************************************************************************
    // Initialize the offset image.

    const Float32 maxPossibleDistance = math::Vector2f(outputSize).getMagnitude();
    const Float32 backgroundInitial = invert ? 0.0f : maxPossibleDistance;
    const Float32 foregroundInitial = invert ? maxPossibleDistance : 0.0f;

    {
        Offset* offset = offsetPixels;
        const Offset* const offsetRowPaddingEnd = offsetPixels + outputSize.x*inputPadding.y;
        const Offset* const offsetInputEnd = offsetRowPaddingEnd + outputSize.x*inputSize.y;
        const Offset* const offsetEnd = offsetPixels + outputPixelCount;
        const Float32* input = inputPixels;

        // The padding area above the image.
        for ( ; offset != offsetRowPaddingEnd; offset++ )
            *offset = Offset(backgroundInitial);

        // The rows containing the input image.
        while ( offset != offsetInputEnd )
        {
            const Offset* const paddingEnd = offset + inputPadding.x;
            const Offset* const inputRowEnd = paddingEnd + inputSize.x;
            const Offset* const offsetRowEnd = offset + outputSize.x;

            for ( ; offset != paddingEnd; offset++ )
                *offset = Offset(backgroundInitial);

            for ( ; offset != inputRowEnd; offset++, input++ )
                *offset = Offset( (*input < inputThreshold) ? backgroundInitial : foregroundInitial );

            for ( ; offset != offsetRowEnd; offset++ )
                *offset = Offset(backgroundInitial);
        }

        // The padding area below the image.
        for ( ; offset != offsetEnd; offset++ )
            *offset = Offset(backgroundInitial);
    }

    //********************************************************************************
    // First scan.

    for ( Index y = 0; y < outputSize.y; y++ )
    {
        if ( y == 0 )
        {
            // Handle the first row outside the bounds of the image.
            for ( Index x = 0; x < outputSize.x; x++ )
            {
                Offset& L_xy = offsetPixels[y*outputSize.x + x];
                Offset Lshift = Offset(backgroundInitial) + Offset(0,1);
                if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
            }
        }
        else
        {
            for ( Index x = 0; x < outputSize.x; x++ )
            {
                Offset& L_xy = offsetPixels[y*outputSize.x + x];
                Offset Lshift = offsetPixels[(y-1)*outputSize.x + x] + Offset(0,1);
                if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
            }
        }

        // First column.
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + 0];
            Offset Lshift = Offset(backgroundInitial) + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        for ( Index x = 1; x < outputSize.x; x++ )
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + x];
            Offset Lshift = offsetPixels[y*outputSize.x + (x-1)] + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        // Last column.
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + (outputSize.x-1)];
            Offset Lshift = Offset(backgroundInitial) + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        for ( Index x = outputSize.x - 2; x < outputSize.x; x-- )
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + x];
            Offset Lshift = offsetPixels[y*outputSize.x + (x+1)] + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }
    }

    //********************************************************************************
    // Second scan.

    const Index lastRow = outputSize.y - 1;

    for ( Index y = lastRow; y < outputSize.y; y-- )
    {
        if ( y == lastRow )
        {
            // Handle the last row outside the bounds of the image.
            for ( Index x = 0; x < outputSize.x; x++ )
            {
                Offset& L_xy = offsetPixels[y*outputSize.x + x];
                Offset Lshift = Offset(backgroundInitial) + Offset(0,1);
                if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
            }
        }
        else
        {
            for ( Index x = 0; x < outputSize.x; x++ )
            {
                Offset& L_xy = offsetPixels[y*outputSize.x + x];
                Offset Lshift = offsetPixels[(y+1)*outputSize.x + x] + Offset(0,1);
                if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
            }
        }

        // First column.
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + 0];
            Offset Lshift = Offset(backgroundInitial) + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        for ( Index x = 1; x < outputSize.x; x++ )
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + x];
            Offset Lshift = offsetPixels[y*outputSize.x + (x-1)] + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        // Last column.
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + (outputSize.x-1)];
            Offset Lshift = Offset(backgroundInitial) + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }

        for ( Index x = outputSize.x - 2; x < outputSize.x; x-- )
        {
            Offset& L_xy = offsetPixels[y*outputSize.x + x];
            Offset Lshift = offsetPixels[y*outputSize.x + (x+1)] + Offset(1,0);
            if ( Lshift.getMagnitudeSquared() < L_xy.getMagnitudeSquared() ) L_xy = Lshift;
        }
    }
}
  2. It hasn't been mentioned yet, but the Xoroshiro128+ PRNG is a fairly new generator that is both very fast and has very good statistical quality; it performs better in the BigCrush test suite than any of the previously mentioned ones. I use it in my path tracer for ray sample generation. (A sketch of the core step is below.)
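Below is a minimal C++ sketch of the xoroshiro128+ step, following the published reference implementation (the rotation/shift constants 55, 14, 36 are from the original version of the generator). The double conversion helper is just an illustration; the only seeding requirement is that the state is not all zero.

#include <cstdint>

struct Xoroshiro128Plus
{
    std::uint64_t s[2]; // generator state; must not be all zero

    static std::uint64_t rotl( std::uint64_t x, int k )
    {
        return (x << k) | (x >> (64 - k));
    }

    std::uint64_t next()
    {
        const std::uint64_t s0 = s[0];
        std::uint64_t s1 = s[1];
        const std::uint64_t result = s0 + s1; // the "+" output function

        s1 ^= s0;
        s[0] = rotl( s0, 55 ) ^ s1 ^ (s1 << 14); // a, b
        s[1] = rotl( s1, 36 );                   // c
        return result;
    }

    // Convert the high 53 bits to a double in [0,1), e.g. for ray sample generation.
    double nextDouble()
    {
        return (next() >> 11) * (1.0 / 9007199254740992.0); // divide by 2^53
    }
};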
  3. I'll add that the performance of ray tracing strongly depends on whether the rays are coherent (e.g. camera rays) or incoherent (e.g. shadow rays). Coherent rays are faster because nearby rays traverse the same BVH nodes, which are therefore more likely to be in cache. Different ray-tracing/BVH-traversal algorithms are usually used for coherent versus incoherent rays (e.g. intersecting a packet of 4 coherent rays against one BVH node at a time, versus intersecting a single ray against 4 BVH nodes at once).
  4. I think astronomers use frequency-domain analysis techniques to find periodic signals in the Doppler shift of the star's light; this allows the detection of multiple planets if wobbles are detected at different orbital periods. They can isolate the signal for each planet and use that data to determine various properties of the planet(s). The main limitation of the wobble method is that it's harder to find small planets within the noise, while big planets show a stronger signal.
  5. I would check whether the total job size is less than some threshold (e.g. 512 objects); if so, just run a sequential algorithm on the main thread. Otherwise, if you have N_t threads and N_w work items, I would split the work into pieces of size ceiling(N_w/N_t). This creates jobs of roughly equal size (aside from the last), maximizes parallelism, and minimizes the number of synchronizations (see the sketch below). The number of threads to use depends on what else is going on in the background (e.g. audio, physics, asset loading) and how much CPU those tasks use. If you completely saturate the CPU, it might cause other tasks to be preempted. I would use one thread for rendering, one for physics, one for audio, and the rest (out of the total available) for the pool, but you should really look at the usage and make an informed decision from there.
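A minimal sketch of that splitting heuristic, spawning std::thread directly rather than using a persistent pool; the 512-item threshold and the callable Job type are illustrative assumptions.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Run a batch of callable jobs, sequentially if the batch is small,
// otherwise split into ceil(N_w / N_t) sized chunks across threadCount threads.
template < typename Job >
void runJobs( std::vector<Job>& jobs, std::size_t threadCount )
{
    const std::size_t sequentialThreshold = 512; // below this, threading isn't worth the overhead
    if ( jobs.size() < sequentialThreshold || threadCount <= 1 )
    {
        for ( Job& job : jobs )
            job();
        return;
    }

    // Chunks of roughly equal size (the last one may be smaller).
    const std::size_t chunkSize = (jobs.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> threads;
    for ( std::size_t start = 0; start < jobs.size(); start += chunkSize )
    {
        const std::size_t end = std::min( start + chunkSize, jobs.size() );
        threads.emplace_back( [&jobs, start, end]
        {
            for ( std::size_t i = start; i < end; i++ )
                jobs[i]();
        } );
    }

    for ( std::thread& t : threads )
        t.join(); // a single synchronization point at the end
}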
  6. I think in that case you need to look at the direction of the geometry's surface normals to determine which side of the geometry is the front; then you can push the object out in that direction with split-impulse position correction. It's a matter of choosing the correct contact point and normal.
  7. Hi, I am using signed distance fields for font and GUI icon rendering, and I was wondering what the consensus is on the sign convention for these types of images. The most mathematically consistent choice seems to be making the background of the image positive and the interior of the foreground negative, so that the image represents the signed distance to the boundary. Encoded into [0,1], this means outside is >0.5 and inside is <0.5. However, this is backwards from the usual alpha blending, where smaller values indicate the background area, so the alpha test must be reversed to get correct results. This isn't a problem if I handle it in the shader, but I wonder if it would be better to generate the signed distance field such that inside = positive and outside = negative. Valve's original paper uses this convention, but I've seen other people use the opposite. Side note: out of curiosity, has anyone ever tried SDFs for alpha-tested foliage (e.g. packing the SDF in the alpha channel of an RGBA image)? It seems like it would produce nicer results.
  8. It shouldn't take any extra work to handle non-uniform scaling if you handle the ray right. The key is to NOT normalize the ray direction when you transform to local space. The transformation may change the direction's length, but that's OK, because then you don't have to transform t_min/t_max to get the right intersection. In local space, t_min/t_max are not distances but the parameterization of the ray: x(t) = origin + direction*t. Also, there is no need to compute the intersection point anywhere except in world space after the scene traversal. I would look at Intel's Embree source code to see how to do the transforms well. It's a bit hard to decipher, but it's probably the best implementation I've seen. Here are some excerpts from my code; it is very similar to how Embree works:

/// Transform a 3D point by this transformation.
inline SIMDFloat4 BVHTransform:: transformPoint( const SIMDFloat4& point ) const
{
    return position + basis.x*point[0] + basis.y*point[1] + basis.z*point[2];
}

/// Transform a 3D vector by this transformation, neglecting the translation.
inline SIMDFloat4 BVHTransform:: transformVector( const SIMDFloat4& vector ) const
{
    return basis.x*vector[0] + basis.y*vector[1] + basis.z*vector[2];
}

/// Return whether or not the primitive with the specified index is intersected by the specified ray.
inline void intersectSingleBVH( PrimitiveIndex bvhIndex, BVHRay& ray ) const
{
    // Transform the ray to local space.
    const BVHTransform worldToLocal = transforms[bvhIndex].worldToLocal;
    const SIMDFloat4 worldOrigin = ray.origin;
    const SIMDFloat4 worldDirection = ray.direction;
    const PrimitiveIndex worldPrimitive = ray.primitive;
    ray.origin = worldToLocal.transformPoint( worldOrigin );
    ray.direction = worldToLocal.transformVector( worldDirection );
    ray.primitive = BVHGeometry::INVALID_PRIMITIVE;

    // Intersect the ray with the child BVH.
    bvhs[bvhIndex]->intersectRay( ray );

    // Restore the ray state.
    ray.origin = worldOrigin;
    ray.direction = worldDirection;

    if ( ray.hitValid() )
    {
        // Transform the normal to world space.
        ray.normal = transforms[bvhIndex].localToWorld.transformVector( ray.normal );
        ray.instance = bvhIndex;
    }
    else
        ray.primitive = worldPrimitive;
}

// Intersect a single ray against 4 AABBs at once and return the result mask.
inline SIMDInt4 intersectRay( const TraversalRay& ray, const SIMDFloat4& tMin, const SIMDFloat4& tMax, SIMDFloat4& near ) const
{
    SIMDFloat4 txmin = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMin[0])) - ray.origin.x) * ray.inverseDirection.x;
    SIMDFloat4 txmax = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMax[0])) - ray.origin.x) * ray.inverseDirection.x;
    SIMDFloat4 tymin = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMin[1])) - ray.origin.y) * ray.inverseDirection.y;
    SIMDFloat4 tymax = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMax[1])) - ray.origin.y) * ray.inverseDirection.y;
    SIMDFloat4 tzmin = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMin[2])) - ray.origin.z) * ray.inverseDirection.z;
    SIMDFloat4 tzmax = (SIMDFloat4::load((const Float32*)((const UByte*)bounds + ray.signMax[2])) - ray.origin.z) * ray.inverseDirection.z;

    near = math::max( math::max( txmin, tymin ), math::max( tzmin, tMin ) );
    SIMDFloat4 far = math::min( math::min( math::min( txmax, tymax ), tzmax ), tMax );

    // near holds the entry distance for each AABB whenever (near <= far).
    return near <= far;
}
  9. I prefer the physically correct attenuation. It's still possible to do light culling with unbounded-range lights: pick a threshold light intensity that is considered perceptually negligible (e.g. 0.01) and solve the light attenuation equation for the distance where the intensity drops below that value (see the sketch below). With this method, dim lights automatically get culled with small radii, while bright lights are allowed to shine farther. If the epsilon is chosen carefully (e.g. accounting for the effects of shading/exposure/tonemapping), the result should be indistinguishable from no culling at all.
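As a rough sketch, assuming simple inverse-square falloff and a scalar light intensity (both illustrative assumptions), the culling radius follows from solving intensity / d^2 = epsilon for d:

#include <cmath>

// Culling radius for a point light with inverse-square attenuation:
//   intensity / (d * d) = epsilon   =>   d = sqrt( intensity / epsilon ).
// 'intensity' is the light's maximum channel intensity; 'epsilon' is the
// perceptually negligible threshold (e.g. 0.01), ideally chosen with
// exposure and tonemapping in mind.
float computeCullingRadius( float intensity, float epsilon = 0.01f )
{
    return std::sqrt( intensity / epsilon );
}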
  10. Better to use the LAB color space, which was designed for color comparisons (hence its wide use in computer vision). In LAB space, the Euclidean distance corresponds closely to the perceptual color difference. Here is the conversion to/from LAB and sRGB:

/// Convert the gamma from sRGB to linear RGB space.
template < typename T >
T gamma_sRGB_RGB( T x )
{
    return (x <= T(0.0404482362771076)) ? (x / T(12.92)) : math::pow( (x + T(0.055))/T(1.055), T(2.4) );
}

/// Convert the gamma from linear RGB to sRGB space.
template < typename T >
T gamma_RGB_sRGB( T x )
{
    return (x <= T(0.0031306684425005883)) ? (x * T(12.92)) : (T(1.055)*math::pow( x, T(1)/T(2.4) ) - T(0.055));
}

template < typename T >
T LAB_f( T x )
{
    return (x > T(8.85645167903563082e-3)) ? math::pow( x, T(1)/T(3) ) : (T(841)/T(108))*x + T(4)/T(29);
}

template < typename T >
T LAB_f_inverse( T x )
{
    return (x > T(0.206896551724137931)) ? (x*x*x) : (T(108)/T(841))*(x - T(4)/T(29));
}

/// Convert from the sRGB color space to the linear RGB color space.
template < typename T >
void convert_sRGB_RGB( const T srgb[3], T rgb[3] )
{
    rgb[0] = gamma_sRGB_RGB(srgb[0]);
    rgb[1] = gamma_sRGB_RGB(srgb[1]);
    rgb[2] = gamma_sRGB_RGB(srgb[2]);
}

/// Convert from the linear RGB color space to the sRGB color space.
template < typename T >
void convert_RGB_sRGB( const T rgb[3], T srgb[3] )
{
    srgb[0] = gamma_RGB_sRGB(rgb[0]);
    srgb[1] = gamma_RGB_sRGB(rgb[1]);
    srgb[2] = gamma_RGB_sRGB(rgb[2]);
}

/// Convert from the linear RGB color space to the XYZ color space.
template < typename T >
void convert_RGB_XYZ( const T rgb[3], T xyz[3] )
{
    xyz[0] = T(0.4123955889674142161)*rgb[0] + T(0.3575834307637148171)*rgb[1] + T(0.1804926473817015735)*rgb[2];
    xyz[1] = T(0.2125862307855955516)*rgb[0] + T(0.7151703037034108499)*rgb[1] + T(0.07220049864333622685)*rgb[2];
    xyz[2] = T(0.01929721549174694484)*rgb[0] + T(0.1191838645808485318)*rgb[1] + T(0.9504971251315797660)*rgb[2];
}

/// Convert from the XYZ color space to the linear RGB color space.
template < typename T >
void convert_XYZ_RGB( const T xyz[3], T rgb[3] )
{
    rgb[0] = T(3.2406)*xyz[0] + T(-1.5372)*xyz[1] + T(-0.4986)*xyz[2];
    rgb[1] = T(-0.9689)*xyz[0] + T(1.8758)*xyz[1] + T(0.0415)*xyz[2];
    rgb[2] = T(0.0557)*xyz[0] + T(-0.2040)*xyz[1] + T(1.0570)*xyz[2];

    // Ensure positive numbers.
    const T minRGB = math::min( rgb[0], math::min( rgb[1], rgb[2] ) );
    if ( minRGB < T(0) )
    {
        rgb[0] -= minRGB;
        rgb[1] -= minRGB;
        rgb[2] -= minRGB;
    }
}

/// Convert from the XYZ color space to the CIE LAB color space.
template < typename T >
void convert_XYZ_LAB( const T xyz[3], T lab[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    // Divide by the XYZ color of the D65 white point and apply the LAB f function.
    T temp0 = LAB_f(xyz[0] / whitePoint[0]);
    T temp1 = LAB_f(xyz[1] / whitePoint[1]);
    T temp2 = LAB_f(xyz[2] / whitePoint[2]);
    lab[0] = T(116)*temp1 - T(16);
    lab[1] = T(500)*(temp0 - temp1);
    lab[2] = T(200)*(temp1 - temp2);
}

/// Convert from the CIE LAB color space to the XYZ color space.
template < typename T >
void convert_LAB_XYZ( const T lab[3], T xyz[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    T y = (lab[0] + T(16))/T(116);
    T x = y + lab[1]/T(500);
    T z = y - lab[2]/T(200);
    xyz[0] = LAB_f_inverse(x)*whitePoint[0];
    xyz[1] = LAB_f_inverse(y)*whitePoint[1];
    xyz[2] = LAB_f_inverse(z)*whitePoint[2];
}

/// Convert from the linear RGB color space to the CIE LAB color space.
template < typename T >
void convert_RGB_LAB( const T rgb[3], T lab[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    T xyz[3];
    convert_RGB_XYZ( rgb, xyz );
    convert_XYZ_LAB( xyz, lab, whitePoint );
}

/// Convert from the CIE LAB color space to the linear RGB color space.
template < typename T >
void convert_LAB_RGB( const T lab[3], T rgb[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    T xyz[3];
    convert_LAB_XYZ( lab, xyz, whitePoint );
    convert_XYZ_RGB( xyz, rgb );
}

/// Convert from the sRGB color space to the CIE LAB color space.
template < typename T >
void convert_sRGB_LAB( const T srgb[3], T lab[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    T rgb[3];
    convert_sRGB_RGB( srgb, rgb );
    convert_RGB_LAB( rgb, lab, whitePoint );
}

/// Convert from the CIE LAB color space to the sRGB color space.
template < typename T >
void convert_LAB_sRGB( const T lab[3], T srgb[3], const VectorND<T,3>& whitePoint = WHITE_POINT_D65 )
{
    T xyz[3];
    T rgb[3];
    convert_LAB_XYZ( lab, xyz, whitePoint );
    convert_XYZ_RGB( xyz, rgb );
    convert_RGB_sRGB( rgb, srgb );
}
  11. Why not memory-map the files and then read the characters directly from the mapping? That will give the fastest read performance and avoid many string allocations. (A POSIX sketch is below.)
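A minimal POSIX-only sketch using mmap (on Windows the equivalent is CreateFileMapping/MapViewOfFile). Note that the mapping is not null-terminated, so the size must be tracked explicitly rather than treating the pointer as a C string.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a whole file into memory read-only and return a pointer to its bytes.
// Returns nullptr on failure; the caller must munmap(data, size) and close(fd) when done.
const char* mapFile( const char* path, std::size_t& size, int& fd )
{
    fd = open( path, O_RDONLY );
    if ( fd < 0 )
        return nullptr;

    struct stat st;
    if ( fstat( fd, &st ) != 0 )
    {
        close( fd );
        return nullptr;
    }
    size = static_cast<std::size_t>( st.st_size );

    void* data = mmap( nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( data == MAP_FAILED )
    {
        close( fd );
        return nullptr;
    }
    return static_cast<const char*>( data );
}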
  12. In my engine, I have a separate thread that deals with the OS GUI's event queue. I listen for events in the WndProc (on Windows; on OS X, in an NSWindow subclass) and convert the OS-specific event format into one that is platform-independent. The events are then pushed to window-specific delegate callback methods (std::function objects) that respond to things like mouse motion, mouse wheel, mouse buttons, keyboard, etc. This is all done on the OS thread. Eventually, after passing through the GUI hierarchy, the events are queued in an atomically double-buffered array within the engine's input system (sketched below). On each update of the engine's main thread, the previously queued events from the OS thread are consumed by the input system, the array is cleared, and the buffers are swapped atomically. Events carry a time stamp relative to the epoch so that I can maintain ordering and detect stale events.
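A much-simplified sketch of such a double-buffered queue, using a mutex around the swap rather than a lock-free scheme, and a placeholder Event type (both are assumptions, not the engine's actual code):

#include <mutex>
#include <vector>

struct Event { double timestamp; int type; /* ... payload ... */ };

// The OS/GUI thread pushes into the write buffer; the engine's main thread swaps
// buffers once per update and processes the previously queued events without
// holding the lock while it does so.
class EventQueue
{
public:
    // Called on the OS thread for each translated event.
    void push( const Event& e )
    {
        std::lock_guard<std::mutex> lock( mutex );
        buffers[writeIndex].push_back( e );
    }

    // Called once per engine update on the main thread. Returns the events queued
    // since the last swap; the reference stays valid until the next call to swap().
    std::vector<Event>& swap()
    {
        std::lock_guard<std::mutex> lock( mutex );
        const int readIndex = writeIndex;
        writeIndex = 1 - writeIndex;
        buffers[writeIndex].clear();
        return buffers[readIndex];
    }

private:
    std::mutex mutex;
    std::vector<Event> buffers[2];
    int writeIndex = 0;
};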
  13. I'm going to throw my hat in the ring for #2; I use that pattern extensively in lots of very performance-intensive code. I never use #1 because it has few benefits I can see over #2 (other than satisfying the traditional mathematical definition of a function, or a personal preference for functional programming). #2 is far more flexible than #1, since it allows the caller to either accumulate items into a non-empty vector or provide an empty one. To implement that with #1 would require a separate copy of the returned values into the accumulation vector (plus the allocation cost for the returned vector's contents). #2 also gives control over the allocation of the vector to the caller, which allows the caller to either heap- or stack-allocate the vector and to reuse it over many function calls. #2 also frees up the return value to indicate the status of the function call, and it will probably be more efficient, since it doesn't have to construct a vector at all (if the vector is reused). #2, thanks to its named output parameter, is also more self-documenting than a return value with ambiguous usage. I agree #2 is slightly ambiguous, but it doesn't take much work to write a comment like "does not clear the vector, adds to the end". If you want to enforce special usage of the vector, wrap it in a type that only allows certain operations. A common real-world example would be adding draw commands to a render queue: pass a reference to the queue to the function, which can add commands as needed (a sketch of both patterns is below). It would be ludicrous to use #1 in that case. Another option I've used in the past for small types/arrays is a custom type ShortVector<T, size_t localCapacity>, where for small arrays the elements (up to localCapacity) are stored using placement new within a buffer of bytes inside the vector object itself, while larger arrays behave like a standard vector. This avoids the allocation cost for small arrays when using #1.
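For concreteness, a hypothetical sketch of the two patterns being compared; the DrawCommand type and function names are illustrative only, not from the original thread.

#include <vector>

struct DrawCommand { int sortKey; /* ... */ };

// Pattern #1: return a freshly constructed vector each call.
std::vector<DrawCommand> collectDrawCommands1()
{
    std::vector<DrawCommand> result;
    result.push_back( DrawCommand{ 0 } );
    return result; // a new vector (and, typically, a heap allocation) per call
}

// Pattern #2: append into a caller-provided vector.
// Does not clear the vector, adds to the end; the return value is free to report status.
bool collectDrawCommands2( std::vector<DrawCommand>& outCommands )
{
    outCommands.push_back( DrawCommand{ 0 } );
    return true;
}

void example()
{
    std::vector<DrawCommand> renderQueue;
    renderQueue.reserve( 1024 ); // allocated once by the caller

    for ( int frame = 0; frame < 2; frame++ )
    {
        renderQueue.clear(); // capacity is kept, so the queue is reused every frame
        collectDrawCommands2( renderQueue ); // several systems can append to the same queue
        // ... sort and submit renderQueue ...
    }
}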
  14. I can't post the code; I checked alignment on everything, and I am looking at the raw assembly output. The funny thing is that it works with T = SIMDFloat4 (a wrapper of __m128) but not with T = SIMDArray<Float,4> (an array of wrappers of __m128, i.e. nested arrays cause strangeness). The real culprit of the slowdown is more likely a strange "vector constructor" function call (in a loop) that the compiler inserted each time I created an uninitialized array (which happened for every single math operation); I have no idea what it was doing, and the assembly is cryptic. To fix this I changed the internal storage type of the array to UByte rather than T and added casts to (T*) everywhere. I finally got it working at GCC speeds, with nice-looking assembly generated, by using template recursion.
  15. I tried O2 and "full optimization"; both do the same thing. I am trying out the recursive templates, and they turned out not to be that verbose if you put all operations in one class, but it's still a pain...