Karlos

my SIMD implementation is very slow :(

This topic is 1358 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hellloooooo,

 

First post here, heard that this is a good forum to get help on this sort of issue.

 

Basically, I'm building a raytracer/pathtracer which is coming along nicely, but I wanted to speed it up using SIMD intrinsics. I have unit tested each part and it produces the correct results - but it has slowed my raytracer to a crawl.

 

This seems to be an issue on heavy triangle based scenes (so memory transfer intensive I guess) - my scene of implicit spheres is roughly the same performance (which is still disappointing), but my scene with a mesh tank is now very very slow.

 

Fired up VTune and discovered that I'm getting some horrendous bottlenecks in areas of code which make no sense!

 

[Image: VTune profiler screenshot (eT30VJw.jpg)]

 

So running it on my triangle based scene for about 25 seconds, according to VTune, grabbing the second index is taking almost 10 seconds in total! Yet the other indices are fine! It also shows some assembly on the left side, but I don't know anything about assembly... I have similar bottlenecks elsewhere in my code, which are not there in my non-SIMD version.

 

What am I doing wrong? Is there something I should look out for in my code or is there something that I shouldn't be doing with SIMD code?

 

Any help would be much appreciated!

 

Karl

 

 

From the example you've posted my guess is that you've blown out a cache line on the CPU, causing it to have to go and fetch the value from main memory (an incredibly slow operation compared to accessing the cache).

Things to check:
  • Instead of using double indirection to get at your triangle ("m_triBuffer[idToIntersect]"), can you pre-sort your array so you only iterate over it front to back? In other words, try to reduce that line to just "sortedBuffer[i]", which is more cache friendly.
  • How big is your tri struct? Does it contain extra padding you can get rid of by rearranging the variables? Remember that C/C++ orders your variables in memory in the order that you put them in code (barring access specifiers) and will have to insert padding around smaller variables to make sure things are aligned.
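
To make the padding point concrete, here is a minimal sketch. The two structs and their member names are hypothetical (they are not taken from the original poster's code), and the exact sizes depend on the compiler and target, so the only thing asserted is that the reordered version is never larger.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layouts for illustration only; not the poster's actual struct.

// Members declared small-large-small-large: the compiler has to insert
// padding after each one-byte field so the following double stays aligned.
struct TriLoose {
    std::uint8_t flags;
    double       area;
    std::uint8_t materialId;
    double       invArea;
};

// Same members, largest first: the two one-byte fields sit next to each
// other and share a single run of tail padding.
struct TriTight {
    double       area;
    double       invArea;
    std::uint8_t flags;
    std::uint8_t materialId;
};

// Reordering does not change what is stored, only how much padding is needed.
static_assert(sizeof(TriTight) <= sizeof(TriLoose),
              "sorting members by size should not increase the struct size");
```

On a typical 64-bit compiler the first layout comes out at 32 bytes and the second at 24, but it is worth checking with sizeof on your own toolchain rather than trusting those numbers.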


Ah I see now, I had suspected it was something to do with memory. I'll have to play around and do a bit of re-factoring to make it work nicely again.

 

@SmkViper 

 

Yeah the double indirection looks kinda bad! I'll try to put it into one nice buffer. My tri is 384 bytes, which is quite large, so I may compress some of the data. My SIMD Vec4 uses a __m128, so maybe I could use a __m64 instead? Hopefully that will still be enough precision for the raytracer!

 

@krypt0n

 

I had profiled it before and it never stalled on those areas, I think my new Vec4 may be using too much data. My original Vec3 had no alignment (or SIMD) and used scalar calculations. Thanks for deciphering the assembly for me, I don't understand assembly yet!

 

The tips on optimisation look good too. To be honest I have been putting this off, and was trying to shortcut it by squeezing SIMD in for a speed up, but alas there are many problems with my program...

 

The link is really useful too, I'll have a look :)

 

Thank you both for your responses, it's really helpful in pointing me in the right direction. I haven't been programming for that long so I didn't know how much memory could mess things up! I'll try to keep you posted if I make any progress

vec3 to vec4 shouldn't be that bad, just 4 bytes. but double checking your code, it seems like you're using fat vertices (as you access position by .v).
split data by usage! for tracing you only need the position, so create a buffer just with vertex position and put all the other data like normals, uvs, tangents etc. into another buffer. that simple change might already reduce your stalls to half.

the problem is that you maybe read 64 bytes per vertex into your cache line, but you just use the position. if it's not well aligned, you might even load two cache lines. by packing that data more tightly, you'll always load 4 vec4 positions in one go, which is way more cache friendly.
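
The split described above might look roughly like the sketch below. All type and function names are hypothetical (nothing here is from the original poster's code), and the 64-byte cache line is an assumption that holds on most current x86 CPUs but should be checked for your target.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; names are not from the poster's code.

// 16 bytes, 16-byte aligned: the natural footprint of one SSE register.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Assumed cache line size; 64 bytes is common on current x86 CPUs.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kPositionsPerLine = kCacheLine / sizeof(Vec4);

// "Fat" vertex: position plus everything shading needs later.
// Reading one position drags a whole line of unrelated data into cache.
struct FatVertex {
    Vec4 position;
    Vec4 normal;
    Vec4 tangent;
    Vec4 uv;
};

// Cold data: only touched once a hit has been found.
struct VertexAttributes {
    Vec4 normal;
    Vec4 tangent;
    Vec4 uv;
};

// Split layout: tracing reads `positions`, shading reads `attributes`.
struct SplitMesh {
    std::vector<Vec4>             positions;   // hot: read for every ray test
    std::vector<VertexAttributes> attributes;  // cold: read once per hit
};

// Copy a fat vertex array into the hot/cold layout. Indices stay the same,
// so positions[i] and attributes[i] still describe the same vertex.
inline SplitMesh splitVertices(const std::vector<FatVertex>& fat) {
    SplitMesh mesh;
    mesh.positions.reserve(fat.size());
    mesh.attributes.reserve(fat.size());
    for (const FatVertex& v : fat) {
        mesh.positions.push_back(v.position);
        mesh.attributes.push_back({v.normal, v.tangent, v.uv});
    }
    return mesh;
}
```

With this layout one cache line holds four positions instead of one position plus three attributes the intersection test never looks at.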

also, optimize your vertices/indices for gpu caches, e.g. with NVTriStrip http://www.nvidia.de/object/nvtristrip_library.html, so that vertices that belong to nearby triangles will be closer in memory and you'll get more cache hits.

also, consider creating a separate index buffer for your triangles if you decide to split vertices (into positions + rest), as fat vertices tend to be split and have redundant data (e.g. a box might have 24 vertices), while you could remove duplicate positions and therefore improve cache hit rates once again (e.g. a box just needs 8 positions).
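
As a rough illustration of the box example, the sketch below collapses duplicate positions and builds a remap table from fat vertex index to unique position index. The names weldPositions and makeFatCube are made up for this example, and it only merges exact duplicates; a real mesh may need an epsilon-based weld instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Position = std::array<float, 3>;

struct IndexedPositions {
    std::vector<Position>      positions;  // unique positions only
    std::vector<std::uint32_t> remap;      // fat vertex index -> position index
};

// Collapse bit-identical positions into one entry each (hypothetical helper).
inline IndexedPositions weldPositions(const std::vector<Position>& fatPositions) {
    IndexedPositions out;
    std::map<Position, std::uint32_t> seen;  // position -> index in out.positions
    out.remap.reserve(fatPositions.size());
    for (const Position& p : fatPositions) {
        auto it = seen.find(p);
        if (it == seen.end()) {
            const auto newIndex = static_cast<std::uint32_t>(out.positions.size());
            it = seen.emplace(p, newIndex).first;
            out.positions.push_back(p);
        }
        out.remap.push_back(it->second);
    }
    assert(out.remap.size() == fatPositions.size());
    return out;
}

// Example input: a unit cube stored as 24 fat vertices (4 corners per face).
inline std::vector<Position> makeFatCube() {
    std::vector<Position> verts;
    for (int axis = 0; axis < 3; ++axis)
        for (int side = 0; side < 2; ++side)
            for (int a = 0; a < 2; ++a)
                for (int b = 0; b < 2; ++b) {
                    Position p{};
                    p[axis]           = static_cast<float>(side);
                    p[(axis + 1) % 3] = static_cast<float>(a);
                    p[(axis + 2) % 3] = static_cast<float>(b);
                    verts.push_back(p);
                }
    return verts;
}
```

Running weldPositions on the cube from makeFatCube leaves 8 unique positions, and the triangle index buffer for tracing can then be rewritten through remap.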



My tri is 384 bytes

 

No wonder you're memory bound. I second taking a look at the Embree library. If you check it out you will find they pack their triangle structure into a grand total of 16 floats: 9 floats for the vertices, 3 floats for the normal, 3 packed ints that reference the triangle mesh for later shading and texturing, and four bytes of padding. All properly aligned, SIMD-friendly, and fitting nicely in a modern CPU's 64-byte cache line. Separate geometry data from rendering data. Ray tracing time is absolutely dominated by traversal cost anyway, so it makes no sense to bloat your mesh data with stuff that is only used 1% of the time yet sits in cache taking up valuable space 100% of the time :)
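
A struct following that description could be sketched as below. This is not Embree's actual definition, just a layout that matches the numbers in the post; the field names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// A sketch of the layout described above, NOT Embree's actual definition.
// Field names are hypothetical.
struct alignas(64) PackedTriangle {
    float         v0[3];       // 9 floats: the three vertex positions
    float         v1[3];
    float         v2[3];
    float         normal[3];   // 3 floats: geometric normal
    std::int32_t  meshId;      // 3 ints: references back to the mesh data
    std::int32_t  triId;       //         used later for shading and texturing
    std::int32_t  materialId;
    std::uint32_t pad;         // 4 bytes of explicit padding
};

// 36 + 12 + 12 + 4 = 64 bytes, so each triangle occupies exactly one
// 64-byte cache line and never straddles two.
static_assert(sizeof(PackedTriangle) == 64, "triangle should be one cache line");
static_assert(alignof(PackedTriangle) == 64, "triangle should start on a line boundary");
```

One caveat: storing an over-aligned type like this in a std::vector relies on aligned allocation, which the standard library handles from C++17 onward; with older standards an aligned allocator is needed.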

Yeesh, yeah, a tri structure that big is not doing you any favors. The other two here seem to have you well covered so I'll just exit stage left ;)


@krypt0n @Bacterius

 

Yes you're right, accessing the fat vertex wasn't doing me any favours.

 

I have reduced the raytracing triangle down to 128 bytes (I could reduce it further, but with more refactoring) with a slight improvement, but again, the exact same line is getting killed by grabbing the data (even if I reduce it to one buffer, rather than the double indirection I had before). I'm still getting the major stalls, and I'm certain that this did not happen before I started doing the SIMD version of my Vec4...
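
For reference, going from an index lookup to one contiguous buffer can be done once, before tracing, roughly as in the sketch below. The function name flattenInVisitOrder is made up for this example, and it assumes the visit order (for instance the order in which an acceleration structure's leaves list their triangles) is known up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy items into a new buffer in the order they will be visited, so the
// hot loop can read flat[i] sequentially instead of items[order[i]].
// Hypothetical helper; assumes every entry in `order` is a valid index.
template <typename T>
std::vector<T> flattenInVisitOrder(const std::vector<T>& items,
                                   const std::vector<std::uint32_t>& order) {
    std::vector<T> flat;
    flat.reserve(order.size());
    for (std::uint32_t id : order) {
        assert(id < items.size());
        flat.push_back(items[id]);
    }
    return flat;
}
```

The cost is an extra copy of the triangle data at build time; the payoff is that neighbouring reads in the intersection loop land on neighbouring addresses, which is what the hardware prefetcher is good at.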

 

Once it is traced it grabs a different structure for shading, which is 256 bytes with bi-normals, tangents etc etc. There is no performance hit there when getting this data.

 

@phantom

 

Yeah I'm starting to think this may have been a terrible idea; I don't really have the experience with this sort of stuff. Do you have any good sources for learning the basics of assembly? I'm just so surprised at the performance hit from introducing the SIMD version. I naively thought it would be a bit smoother than this!!

 

Either way, I really appreciate you guys helping out, it clearly seems to be a problem with my general program data management and just a lack of experience. That Intel Embree looks incredible though...

 

:)



I'm certain that this did not happen before I started doing the SIMD version of my Vec4

Random aside: this is one of the reasons I suggest everyone use version control software (even for local development), as it lets you walk back a few days in your project's history and make an honest-to-god comparison.
