You can further accelerate it by integrating the frustum culling into the collision detection: thanks to the broad phase, a large number of objects will be spared the test against the frustum, although things might get weird.
Is my frustum culling slow ?
@galop1n you mean going through every object first and culling 4 boxes at once?
But if I were to do that I'd need 2 loops through every object, since I need to go through them again right after to actually draw them...
What I do right now is go through every object, check if it should be culled, and then either draw it or continue to the next.
Also, I was trying to avoid making separate containers, since doing 10k push_backs doesn't work that great.
I could use a boolean maybe... I would still have to go through all entities again, but I wouldn't have the overhead of push_back.
If you have 1 loop, I'm guessing your 'object' contains the AABB data, the data for drawing, etc. The downside of this is that at the hardware level, when you read the AABB into the CPU cache, you're also reading in the object's other (unrelated) members. If the AABB test fails, you've spent time loading data into the cache for no reason. Also, when you then switch from calling the AABB testing code to calling the drawing code, you're loading all sorts of other drawing-related data into the cache, which might evict other data and will likely prevent your AABB testing loop from prefetching effectively.
If you instead have an array of struct{ AABB a; Drawable* d }, an output array of Drawable* culled results, and an array of actual Drawable instances, then the two loops you mention may actually be more efficient than your existing single loop.
If you're going to be calling push_back a lot, make sure you call reserve first with a sensible guess of the memory requirement, or you can use a raw array (with your own debug error checking, of course ;-) ).
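To make the two-loop layout concrete, here is a minimal sketch. Every name in it (CullEntry, Drawable, isVisible, cull) is made up for illustration, and the visibility test is only a stand-in that checks the box against an x-range instead of a real frustum. The point is just that the first loop touches nothing but the compact entries and that reserve keeps push_back from reallocating.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct AABB { float min[3]; float max[3]; };

// Stand-in for the heavy per-object render data (hypothetical).
struct Drawable { int id; };

// Compact record: the only thing the culling loop reads.
struct CullEntry {
    AABB bounds;
    Drawable* drawable;
};

// Placeholder visibility test (assumption): an x-range [lo, hi]
// plays the role of the frustum to keep the sketch short.
inline bool isVisible(const AABB& b, float lo, float hi) {
    return b.max[0] >= lo && b.min[0] <= hi;
}

// Loop 1: walk the compact entries and collect visible drawables.
// Loop 2 (not shown) would iterate over 'visible' and draw.
inline void cull(const std::vector<CullEntry>& entries, float lo, float hi,
                 std::vector<Drawable*>& visible) {
    visible.clear();
    visible.reserve(entries.size());  // worst case, so push_back never regrows
    for (const CullEntry& e : entries) {
        assert(e.drawable != nullptr);
        if (isVisible(e.bounds, lo, hi)) {
            visible.push_back(e.drawable);
        }
    }
}
```

If the output vector is kept around between frames, clear() keeps its capacity, so after the first frame the reserve call normally does not allocate at all.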
It's been a long time since my last reply here on gamedev.net.
@lipsryme: Happy to know that my blog post actually helped someone. Unfortunately, there seems to be an error in the code, so the culling is probably incorrect. Since you haven't noticed it yet, I'd assume that you are just rendering more objects than needed.
The error is in:
__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
__m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
Can you spot it?
xmm_d_m_r should subtract r from d, not add it! It should be:
__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
I don't have the project anymore so I'd assume it's just a blog post typo and it didn't affect the timings.
On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly
Hope this doesn't ruin your benchmarks.
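For anyone who finds the intrinsics hard to read, here is a scalar sketch of what the two values are for. It is not the code from the blog post: the struct names, the center/half-extent box layout, and the convention that plane normals point into the frustum are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

// Plane as dot(n, p) + d = 0; normal assumed to point into the frustum.
struct Plane { float nx, ny, nz, d; };

// Box stored as center plus half-extents (assumed layout).
struct Box { float cx, cy, cz; float ex, ey, ez; };

enum class Side { Outside, Intersecting, Inside };

inline Side classify(const Box& b, const Plane& p) {
    assert(b.ex >= 0.0f && b.ey >= 0.0f && b.ez >= 0.0f);
    // d: box center projected onto the plane normal.
    const float d = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz;
    // r: half-extents projected onto the absolute normal.
    const float r = std::fabs(p.nx) * b.ex
                  + std::fabs(p.ny) * b.ey
                  + std::fabs(p.nz) * b.ez;
    const float d_p_r = d + r + p.d;  // corner farthest along the normal
    const float d_m_r = d - r + p.d;  // corner nearest along the normal: minus!
    if (d_p_r < 0.0f) return Side::Outside;       // even the far corner is behind
    if (d_m_r < 0.0f) return Side::Intersecting;  // box straddles the plane
    return Side::Inside;
}
```

In this sketch, computing d_m_r with a plus would make it equal to d_p_r, so a box straddling the plane would be reported as fully inside.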
@Hellraizer
"On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly"
Now I am confused, as the 4-box version says:
// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.
at the spot where the part of the code you mentioned would be.
Also, for some reason it fails in Debug mode (VS 2012) with "Unhandled exception...reading location 0x0"
while working fine in Release?
@galop1n
I don't understand why it is not already aligned. I have an array of planes:
struct _Plane
{
    float nx;
    float ny;
    float nz;
    float d;
};
std::array<_Plane, 6> vFrustum;
Do I need to force alignment or something?
Nothing in the standard library or the C++ standard enforces alignment beyond the natural alignment of the types (4 bytes for float). new and malloc are also unaware of it. Put a __declspec(align(16)) on the _Plane declaration and then just instantiate a _Plane vFrustum[6] without std::array. But remember that if you put the array in dynamically allocated memory, you may still fail the alignment constraint, so you will need to manually align the root object's memory.
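As a side note, C++11 has a standard spelling for the same request, alignas. The sketch below uses it instead of __declspec(align(16)); older Visual Studio versions such as the one used in this thread may not support it, in which case the __declspec form is the one to use. The names Plane, isAligned16 and planeData are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct mirroring _Plane, with a 16-byte alignment request.
struct alignas(16) Plane {
    float nx, ny, nz, d;
};

static_assert(alignof(Plane) == 16, "Plane must be 16-byte aligned");
static_assert(sizeof(Plane) == 16, "four packed floats");

// True if the address is a multiple of 16.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Returns a pointer suitable for an aligned 4-float load,
// checking in debug builds that the address really is aligned.
inline const float* planeData(const Plane& p) {
    assert(isAligned16(&p));
    return &p.nx;
}
```

With the attribute on the type, a plain Plane vFrustum[6] on the stack or in globals gets every element on a 16-byte boundary, because the element size is also 16.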
I've added this:
__declspec(align(16))
struct _Plane
{
    float nx;
    float ny;
    float nz;
    float d;
};
and now it works in Debug. But I don't understand why it isn't already 16-byte aligned, as the data is 4 floats?
sizeof(float) * 4 == sizeof(_Plane)
Class and type sizes are unrelated to instance addresses. If you instantiate a class or type on the stack or in the globals (data, rodata and bss), the compiler and linker know the required alignment and can do the work for you if you tag things correctly, but with memory allocated on the heap you depend on the vendor's implementation of the allocator. On x64, dynamic allocations come back 16-byte aligned, but on 32-bit x86 the guarantee is weaker (8 bytes with MSVC's malloc, for example). Of course, you are free to override the allocator with a custom one.
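A small sketch of both points: the first part shows that an untagged struct of four floats has size 16 but only the alignment of float, and the second part aligns heap memory by hand with std::align (C++11, from <memory>) on top of an over-allocated malloc block. PlainPlane, AlignedPlanes, allocPlanes and freePlanes are hypothetical names, and this is only one of several ways to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical struct without any alignment attribute.
struct PlainPlane { float nx, ny, nz, d; };

static_assert(sizeof(PlainPlane) == 16, "size is 16 bytes");
static_assert(alignof(PlainPlane) == alignof(float),
              "but the required alignment is only that of float");

// Keeps the raw pointer for std::free next to the aligned view.
struct AlignedPlanes {
    void* raw;           // what malloc returned
    PlainPlane* planes;  // first 16-byte aligned address inside the block
};

inline AlignedPlanes allocPlanes(std::size_t count) {
    const std::size_t bytes = count * sizeof(PlainPlane);
    std::size_t space = bytes + 15;  // room for worst-case padding
    void* raw = std::malloc(space);
    if (raw == nullptr) return { nullptr, nullptr };
    void* p = raw;
    // std::align bumps p forward to the next multiple of 16 if it fits.
    void* aligned = std::align(16, bytes, p, space);
    assert(aligned != nullptr);  // always fits thanks to the +15
    return { raw, static_cast<PlainPlane*>(aligned) };
}

inline void freePlanes(AlignedPlanes& a) {
    std::free(a.raw);  // free the original pointer, not the aligned one
    a.raw = nullptr;
    a.planes = nullptr;
}
```

The important detail is to free the pointer malloc returned, not the aligned one. Platform helpers such as _aligned_malloc on MSVC do the same bookkeeping for you.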