my SIMD implementation is very slow :(

18 comments, last by Aressera 9 years, 7 months ago

Hellloooooo,

First post here; I heard that this is a good forum to get help on this sort of issue.

Basically, I'm building a raytracer/pathtracer which is coming along nicely, but I wanted to speed it up using SIMD intrinsics. I have unit tested each part and it produces the correct results - but it has slowed my raytracer to a crawl.

This seems to be an issue on heavy triangle based scenes (so memory transfer intensive I guess) - my scene of implicit spheres is roughly the same performance (which is still disappointing), but my scene with a mesh tank is now very very slow.

Fired up vTune and I discovered that I'm getting some horrendous bottlenecks in areas of code which make no sense!

[Screenshot: vTune hotspot view of the triangle lookup code, with assembly alongside]

So running it on my triangle based scene for about 25 seconds, according to vTune grabbing the second index is taking almost 10 seconds in total! Yet the other indices are fine! It also shows some assembly to the left side, but I don't know anything about assembly... I have similar bottlenecks elsewhere in my code, which are not there in my non-SIMD version.

What am I doing wrong? Is there something I should look out for in my code or is there something that I shouldn't be doing with SIMD code?

Any help would be much appreciated!

Karl

From the example you've posted my guess is that you've blown out a cache line on the CPU, causing it to have to go and fetch the value from main memory (an incredibly slow operation compared to accessing the cache).

Things to check:
  • Instead of using double-indirection to get at your triangle "m_triBuffer[idToIntersect[i]]" can you pre-sort your array so you only iterate on it front-to-back? In other words, try to reduce that line to just "sortedBuffer[i]", which is more cache friendly.
  • How big is your tri struct? Does it contain extra padding you can get rid of by rearranging the variables? Remember that C/C++ orders your variables in memory in the order that you put them in code (barring access specifiers) and will have to insert padding around smaller variables to make sure things are aligned.
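
The padding point in the last bullet can be checked directly with sizeof. Here is a minimal sketch; the two structs and their member names are made up for illustration and are not the poster's actual triangle struct. The exact sizes depend on the platform's alignment rules, so treat the numbers in the comments as typical x86-64 values rather than guarantees.

```cpp
#include <cstdint>

// Hypothetical layout: a small member on each side of a large one.
struct Loose {
    std::uint8_t flagA;  // 1 byte, followed by padding so 'value' is aligned
    double       value;  // 8 bytes
    std::uint8_t flagB;  // 1 byte, followed by tail padding
};                       // typically 24 bytes on x86-64

// Same members, largest first, so the two small members share one padded tail.
struct Tight {
    double       value;
    std::uint8_t flagA;
    std::uint8_t flagB;
};                       // typically 16 bytes on x86-64

// Reordering never changes what is stored, only how much padding is inserted.
static_assert(sizeof(Tight) <= sizeof(Loose), "reordering should not grow the struct");
```

Printing sizeof(Loose) and sizeof(Tight) for your own structs is a quick way to see whether rearranging members is worth it.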

have you profiled your code _before_ you started your low level SIMD optimizations? you're indeed totally memory bound and your SIMD code seems to have no effect on that area.

based on your assembly code, your compiler decided to load "indice1" first, then "indice2" and then "indices0".
the profiler shows that it mostly spends time in line 602, as this line waits (read: stalls) for the previous load command to finish (loading the "indice1"), which simply means a cache miss.

if you look at the assembly, there are 3 more stalls right after "movaps" which are the loads for your 3 vertices.

so, overall, your SIMD code doesn't really do anything useful, as it is probably executed out of order while the CPU is waiting for some load instructions to complete.

the proper way to optimize this would be:
-improve data structures (e.g. pad them to make them always fit into cache lines)
-enforce alignment (64byte) to make them really be in cache lines after the first access
-pool data (you should NEVER call 'new' that goes into system for every single vertex/triangle/boundingbox/...) but allocate from pools e.g. 4kb block chunks
-compress data (e.g. do you really need 32bit indices? are maybe 16bit enough? can you use one big index and two 8bit offsets maybe? could you compress your vertices? e.g. float16 instead of float32? could you maybe even quantize those into int8_t? ...)
-pool work in a meaningful way e.g. trace rays not scanline by scanline, but e.g. zigzag ordered or tile based, as nearby rays most likely touch similar nodes of your acceleration tree
-do the same kind of work across all cores. if one core is texturing and another core is traversing the hierarchy, both will effectively have just half the cache and, even worse, both will evict data that is useful for the other core; you might end up slower than with fewer threads. so all threads should work on the same data.
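
To make the tile-based suggestion from the list above concrete, here is a rough sketch of generating a pixel order that finishes one small tile before moving to the next. The names Pixel and tiledPixelOrder are hypothetical, and the tile size is an assumption to tune; this is only one simple way to group nearby rays, not the scheme any particular renderer uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Pixel { int x; int y; };

// Returns every pixel of a width x height image exactly once, visiting one
// tileSize x tileSize tile completely before moving on to the next tile.
// Assumes tileSize > 0. Tiles on the right/bottom edge are clipped.
std::vector<Pixel> tiledPixelOrder(int width, int height, int tileSize) {
    std::vector<Pixel> order;
    order.reserve(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    for (int ty = 0; ty < height; ty += tileSize) {
        for (int tx = 0; tx < width; tx += tileSize) {
            const int yEnd = std::min(ty + tileSize, height);
            const int xEnd = std::min(tx + tileSize, width);
            for (int y = ty; y < yEnd; ++y) {
                for (int x = tx; x < xEnd; ++x) {
                    order.push_back({x, y});
                }
            }
        }
    }
    return order;
}
```

Tracing primary rays in this order instead of row by row means consecutive rays start from neighbouring pixels in both x and y, so they are more likely to walk the same parts of the acceleration structure while those are still in cache.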

if your amount of data is insanely high, use maybe an LOD, visual quality might not be hurt but performance could rise a lot.

and check out Intel's Embree http://embree.github.io/ it's like the best practice guide of path tracing (or tracing in general). they also have some presentations on the net, talking about which optimizations they've removed to avoid quality issues, which might save you some headache.

Ah I see now, I had suspected it was something to do with memory. I'll have to play around and do a bit of re-factoring to make it work nicely again.

@SmkViper

Yeah the double indirection looks kinda bad! I'll try to put it into one nice buffer. My tri is 384 bytes, which is quite large, so I may compress some of the data. My SIMD Vec4 uses a __m128, so maybe I could use a __m64 instead? Hopefully that will still be enough precision for the raytracer!

@Krpt0n

I had profiled it before and it never stalled on those areas, I think my new Vec4 may be using too much data. My original Vec3 had no alignment (or SIMD) and used scalar calculations. Thanks for deciphering the assembly for me, I don't understand assembly yet!

Tips on the optimisation look good too - to be honest I have been putting this off, and was trying to shortcut it by squeezing SIMD in for a speed up, but alas there are many problems with my program...

The link is really useful too, I'll have a look :)

Thank you both for your responses, it's really helpful at pointing me in the right direction. I haven't been programming for that long so I didn't know how much memory could mess things up! I'll try to keep you posted if I make any progress

vec3 to vec4 shouldn't be that bad, just 4bytes. but double checking your code, it seems like you're using fat vertices (as you access position by .v).
split data by usage! for tracing you only need the position, so create a buffer just with vertex position and put all the other data like normals, uvs, tangents etc. into another buffer. that simple change might already reduce your stalls to half.

the problem is that you maybe read 64byte per vertex into your cache line, but you just use the position, if it's not well aligned, you might even load two cache lines. packing that data more tightly, you'll always load 4 vec4 positions in one go, which is way cache friendlier.
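
A rough sketch of the split described above, assuming a 64-byte fat vertex. FatVertex, Position, Attributes, SplitMesh and splitVertices are all hypothetical names, and the attribute set is a guess at what a fat vertex might carry, not the original poster's layout.

```cpp
#include <cstddef>
#include <vector>

// Assumed fat vertex: position plus shading data, 64 bytes in total.
struct FatVertex {
    float position[4];
    float normal[4];
    float tangent[4];
    float uv[4];
};

// Hot data: the only thing the intersection test reads.
struct alignas(16) Position { float x, y, z, w; };

// Cold data: only touched after a hit, during shading.
struct Attributes {
    float normal[4];
    float tangent[4];
    float uv[4];
};

static_assert(sizeof(Position) == 16, "four positions per 64-byte cache line");

struct SplitMesh {
    std::vector<Position>   positions;   // traversal/intersection buffer
    std::vector<Attributes> attributes;  // shading buffer, same indexing
};

SplitMesh splitVertices(const std::vector<FatVertex>& fat) {
    SplitMesh out;
    out.positions.reserve(fat.size());
    out.attributes.reserve(fat.size());
    for (const FatVertex& v : fat) {
        out.positions.push_back({v.position[0], v.position[1], v.position[2], v.position[3]});
        Attributes a{};
        for (std::size_t i = 0; i < 4; ++i) {
            a.normal[i]  = v.normal[i];
            a.tangent[i] = v.tangent[i];
            a.uv[i]      = v.uv[i];
        }
        out.attributes.push_back(a);
    }
    return out;
}
```

Note that std::vector only guarantees the alignment of the element type (16 bytes here). If the buffer itself should start on a 64-byte boundary, an aligned allocator would be needed on top of this.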

also, optimize your vertices/indices for gpu caches e.g. with NVTriStrip http://www.nvidia.de/object/nvtristrip_library.html so that vertices that belong to nearby triangles will be closer in memory and you'll get more cache hits.

also, consider creating a separate index buffer for your triangles if you decide to split vertices (into positions+rest), as fat vertices tend to be split and have redundant data (e.g. a box might have 24 vertices) while you could remove duplicate positions and therefore improve cache hit rates once again (e.g. a box just needs 8 positions).
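
One possible way to build that position-only index buffer, as a sketch: it merges positions that compare exactly equal and assumes the input contains no NaNs. IndexedPositions and deduplicatePositions are hypothetical names; a real mesh pipeline might prefer a hash map or a distance tolerance instead of exact matching.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Position = std::array<float, 3>;

struct IndexedPositions {
    std::vector<Position>      positions;  // unique positions only
    std::vector<std::uint32_t> remap;      // remap[i] = unique index of input vertex i
};

// Collapses exactly-equal positions. std::array's operator< gives the map a
// lexicographic ordering, which is valid as long as there are no NaNs.
IndexedPositions deduplicatePositions(const std::vector<Position>& input) {
    IndexedPositions out;
    std::map<Position, std::uint32_t> seen;
    out.remap.reserve(input.size());
    for (const Position& p : input) {
        auto it = seen.find(p);
        if (it == seen.end()) {
            const auto index = static_cast<std::uint32_t>(out.positions.size());
            it = seen.emplace(p, index).first;
            out.positions.push_back(p);
        }
        out.remap.push_back(it->second);
    }
    return out;
}
```

The tracing index buffer is then the original index buffer passed through remap, while the shading side keeps using the original fat-vertex indices.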


My tri is 384 bytes

No wonder you're memory bound. I second taking a look at the Embree library. If you check it out you will find they pack their triangle structure in a grand total of 16 floats: that's 9 floats for the vertices, 3 floats for the normal, 3 packed ints that reference the triangle mesh for later shading and texturing, and four bytes of padding. All properly aligned, SIMD-friendly, nicely fitting in a modern CPU's 64 byte cache line. Separate geometry data from rendering data. Ray tracing time is absolutely dominated by traversal cost anyway, so it makes no sense to bloat your mesh data with stuff that is only used during 1% of the time yet sits in cache taking up valuable space 100% of the time :)
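
For illustration, a struct following the 16-float breakdown described in this reply could look like the sketch below. This is not Embree's actual definition; TraceTriangle and the member names (in particular the three integer ids) are assumptions made up for the example.

```cpp
#include <cstdint>

// 9 + 3 floats, 3 ints and 4 bytes of padding = 16 four-byte slots = 64 bytes.
struct alignas(64) TraceTriangle {
    float v0[3];               // vertex positions: 9 floats
    float v1[3];
    float v2[3];
    float normal[3];           // geometric normal: 3 floats
    std::uint32_t meshId;      // hypothetical ids used to look up
    std::uint32_t primId;      // shading data after a hit
    std::uint32_t materialId;
    std::uint32_t pad;         // explicit padding up to 64 bytes
};

static_assert(sizeof(TraceTriangle) == 64, "one triangle per cache line");
static_assert(alignof(TraceTriangle) == 64, "never straddles two cache lines");
```

When such over-aligned objects are heap allocated, the 64-byte alignment is only honoured automatically from C++17 on (aligned operator new); with older standards an aligned allocator would be needed.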

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

Yeesh, yeah, a tri structure that big is not doing you any favors. The other two here seem to have you well covered so I'll just exit stage left ;)

but I don't know anything about assembly...


For this kind of low level fiddling this WILL be a problem; in order to get the best from SIMD you are going to need to have some understanding of the underlying assembly code, cache lines and memory fetches.

Just throwing intrinsics at the problem won't help without a consideration of how it fits in with the data flow, as we've already seen with regard to your huge data structure.

@krypt0n @Bacterius

Yes, you're right, accessing the fat vertex wasn't doing me any favours.

I have reduced the raytracing triangle down to 128 bytes ( I could do this further, but with more refactoring) with a slight improvement, but again, the exact line is getting killed by grabbing the data (even if I reduce it to one buffer, rather than the double indirection I had before). I'm still getting the major stalls - and I'm certain that this did not happen before I started doing the SIMD version of my Vec4...

Once it is traced it grabs a different structure for shading, which is 256 bytes with bi-normals, tangents etc etc. There is no performance hit there when getting this data.

@phantom

Yeah I'm starting to think this may have been a terrible idea, I don't really have the experience with this sort of stuff. Do you have any good sources to look over some of the basics of assembly? I'm just so surprised at the performance hit at introducing the SIMD version in, I naively thought it would be a bit smoother than this!!

Either way, I really appreciate you guys helping out, it clearly seems to be a problem with my general program data management and just lack of experience. That Intel Embree looks incredible though...

:)


I'm certain that this did not happen before I started doing the SIMD version of my Vec4

Random aside: this is one of the reasons I suggest everyone use version control software (even for local development), as it lets you walk back a few days in your project's history and make an honest-to-god comparison.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

