ari556655

assembly?


I have a really expensive raytrace routine that's called many times every frame, and I feel like I've trimmed all the fat off of it. I'm now wondering if I could benefit from rewriting the routine in assembly. I know nothing of assembly language, or whether it can still produce sizable speed improvements, but if it can, this routine would certainly benefit.

Writing faster code than the compiler requires deep knowledge of memory, the CPU, and assembly. I have only used ASM a couple of times for fun, but everybody here says that it's highly improbable you'll do a better job than the compiler.
The exceptions are when you know you have a special case that the compiler can't know about (magic numbers or some other trick) or when you want to use new CPU features. In the latter case, you may want to use intrinsics instead of assembly to take advantage of SSE instructions where possible.

By recoding in SIMD/SSE2+ assembly, you might be able to increase the speed of those routines 2x to 4x. I wrote ASM SIMD/SSE2+ transformation routines for my 3D game/graphics/simulation engine, and realized a 4x to 5x speed increase (in those routines).

I have not written a ray-tracer for my 3D engine, but I did write a sophisticated optical design program that ray-traces flat/spherical/aspheric optical surfaces (lenses and mirrors). If your equations and algorithms are similar, which I expect they are, I would expect somewhere around 2x improvement if you write your code carefully.

I know most programmers strongly disagree, but I am quite happy I took the time to learn SIMD/SSE2+ assembly language. If you want a couple sample functions, send me a message with your email address via gamedev PM (personal message). You must agree not to send this code to others, however (this is just to help you learn).

However, now I wonder whether I misunderstand your question. Are your raytrace routines running on your GPU, and you're asking whether you should switch to shader assembly language? If so, don't bother... the GPU compilers are quite excellent at optimizing shader source code for whatever GPU you have. Even more important, GPU architectures changed significantly recently, and might change again in a couple of years. That could very well make your carefully crafted shader assembly code much worse than compiled shader source code. Are you ray tracing on CPU or GPU?

You probably have a multicore CPU. Make your algorithm utilize multiple cores, and convert your code to use vector intrinsics (SSE, etc.).

Then you can examine the assembly the compiler generated and see if you can find faults that can be corrected to make it go faster. Simply writing it in assembly without knowing why the current version is slow isn't likely to yield something faster, especially if you don't have much experience.

Examine the cache behavior and see if you can use things like explicit cache prefetches to do a better job.

It's possible to write very slow code in asm.

Quote:
Original post by maxgpgpu
By recoding in SIMD/SSE2+ assembly, you might be able to increase the speed of those routines 2x to 4x. I wrote ASM SIMD/SSE2+ transformation routines for my 3D game/graphics/simulation engine, and realized a 4x to 5x speed increase (in those routines).

I have not written a ray-tracer for my 3D engine, but I did write a sophisticated optical design program that ray-traces flat/spherical/aspheric optical surfaces (lenses and mirrors). If your equations and algorithms are similar, which I expect they are, I would expect somewhere around 2x improvement if you write your code carefully.


Actually, I'm not working on a 3D ray-tracer engine; rather, I'm working on a 2D isometric game that traces a number of rays from every surface to every light source in order to achieve a particular type of shadowing effect with penumbras. The routine in question just steps along a ray checking for object collisions.

Quote:
Original post by maxgpgpu
However, now I wonder whether I misunderstand your question. Are your raytrace routines running on your GPU, and you're asking whether you should switch to shader assembly language? If so, don't bother... the GPU compilers are quite excellent at optimizing shader source code for whatever GPU you have. Even more important, GPU architectures changed significantly recently, and might change again in a couple of years. That could very well make your carefully crafted shader assembly code much worse than compiled shader source code. Are you ray tracing on CPU or GPU?


I'm doing everything on the cpu, no shaders or anything.

Also, what exactly are vector intrinsics? I've been unable to find a decent explanation.

Intrinsics on MSDN

In short, they are a special set of functions built into the compiler. Some of them let you use SSE vector instructions without the need for ASM.
In addition, IIRC inline ASM isn't available for x64 with the MS compiler, so unless you plan to drop VC++ support, using intrinsics might be your only choice in that environment.

Quote:
Original post by ari556655
Actually I'm not working on a 3d ray-tracer engine, rather I'm working on a 2d isometric game that traces a number of rays from every surface to every light source in order to achieve a particular type of shadowing effect with penumbras. The routine in question just steps along a ray checking for object collisions.
While you are displaying an isometric view of the environment, I assume you are tracing the light rays through 3D space, right? Otherwise the lighting would be the same from floor to ceiling everywhere in the environment.

Quote:
Also what exactly are vector intrinsics, I've been unable top find a decent explanation?
They are another device to lure you into not writing portable code. And if that isn't sufficient, McSoft added another barrier - their compilers do not support inline assembly in 64-bit mode (Linux compilers still do). I advise against intrinsics. Write assembly language or stick with source code. In 32-bit or 64-bit mode you can write assembly language functions that your source code can call. That's what I did. Always write a source code function that works first, then write an exactly equivalent assembly language function.

Quote:
Original post by ari556655
that's called many times every frame and I feel like I've trimmed all the fat off of the routine.


Is it too much of a secret to post it?

Quote:
but if it could, this routine would certainly benefit.


Assembly != speed improvement.
Assembly == maintenance nightmare.

All the suggestions above are about using a different algorithm altogether, one which just happens to require either a GPU or specialized instructions that aren't available directly in many languages (and consequently need assembly to express them).


I have also written about, and benchmarked, considerable improvements gained by merely rearranging data in memory. One example in C# decreased running time by 3-4 times.

The main issue is that the hardware has changed a lot with time.

On old hardware, there was no processor cache, and in many cases not even any kind of memory management unit. In fact, bit shifting could take more cycles than reading/writing something from/into RAM. Compilers weren't very good either. So really your only major bottleneck was the rest of the hardware, which would force the processor to wait (because it either had a slower reaction time or had some other issue eating its bandwidth). Back then, using assembler you could easily write very fast code, especially if you abused instructions that aren't available in high level languages (again, compilers weren't very good at optimizing either).

New hardware is a completely different thing, though. Processors are very cache-dependent, and accessing memory can hurt performance badly. There's also the ability to run more than one instruction at the same time if they don't depend on each other, the whole multicore thing, and so on. Compilers also got very good, especially at handling those specific details. In short, a compiler will most likely detect more optimization opportunities than a human, and the compiler's output will be faster almost all of the time.

That said, the only reason to use assembler on modern hardware is either to use stuff that simply isn't available in high level languages (if you've ever looked into operating system development you'll know what I mean) or because you want a challenge. Unless your brain is a beast machine >_> <_< >_>

On the intrinsics topic, I thought it was just a matter of arranging the instructions in such a way that it's clear to the compiler it can use SIMD(-like) operations on them? If so, shouldn't they be extremely portable? At worst the compiler would just generate code as normal =P

_You_ may have optimised your routine as much as _you_ can, but I'm pretty sure you'd find several others who know a number of cunning optimisations in the original language that would beat even your assembly attempt.
You should post the routine if possible; some of us may just surprise you.
What language, btw?

I managed to optimise my own software 3D engine to get the perspective floating-point divide running in parallel with the filling of the next pixel span, without resorting to assembly at all. I moved code around so that in theory the CPU could do this, and even though it actually had to execute one more divide than before, BAM, I suddenly got about another 10fps.

Quote:
Original post by RDragon1
You probably have a multicore CPU. Make your algorithm work to utilize multiple cores

This this this. Now that there are intuitive technologies like OpenMP, there's simply no excuse for not parallelizing inherently parallel tasks. The potential speed gain is orders of magnitude greater than anything you could get out of ASM tweaking.

Quote:
Original post by Sik_the_hedgehog
On the intrinsics topic, I thought it was just reordering the instructions in such a way to make it clear to the compiler that it can use SIMD(-like) operations on them?


I don't think MS's compiler will ever generate SIMD (MMX, SSE, etc.) instructions without using intrinsics. It's possible I'm mistaken, or that newer versions support some level of auto-vectorization. Does anyone know if this has been added yet?

I'm not sure if GCC will, either.
