# Raycasting performance?


## Recommended Posts

Hello everybody! I'm currently coding a simple raycasting engine (raycasting != raytracing, remember Wolfenstein 3D?) and I was wondering how power-hungry such an algorithm should be. Has anybody else gotten into raycasting? At the moment, on a P4, it runs at 30 fps in 1280x1024, and at 60 fps on a C2D. It's a high-resolution raycaster (there's a ray for each pixel of width) and I think it's far too power-hungry, since Wolf3D ran on a 33 MHz machine in 320x240 (my engine runs at 1500 fps at 320). What do you think? I haven't calculated the complexity, but it should be O(n²) with n being the width resolution. My algorithm has some room for optimisation, like using look-up tables for sqrt & cos (which should give about a 25% increase), but I still find it quite slow. If any of you have already tried coding a raycasting demo, it would be nice to share your experience. Any advice is welcome. Thx :).

##### Share on other sites
Hmm, are you doing the shadowing via ray casting, or the whole rendering?

Never mind; I'd suggest you use a kd-tree for superior intersection calculations on static scenes, or the Bounding Interval Hierarchy (Wächter, Keller) if you have animated stuff, or a combination of both. Plus: think about the Multi-Level Ray Tracing Algorithm, MLRTA for short (Reshetov and company).

For discussions on those topics, why not visit ompf (in my signature)?

##### Share on other sites
Thank you for your reply, but it's about raycasting, not raytracing. Even if it's a much simpler type of raytracing, I don't think kd-trees and other structures would help me, since I'm just trying to render a 2D array as an orthogonal world with walls of the same height. Nothing really advanced, as you can see.

Thanks anyway.

##### Share on other sites
Hmm, but on the other hand, you can use kd-trees/BIHs in two dimensions, too.

As long as you're looking for intersections between some rays and some objects (let's say 20 objects or more), spatial ordering would increase performance. And instead of creating 1024x1024 texels once at startup, you could recreate them on the fly, on demand.

Hmm, but I guess I got confused by the frequently overused term "ray-casting". That's not your fault; it's simply that some say ray-casting for tracing only primary rays, some say ray-casting for 2D shadow-map generators (where you "shoot" a ray for each texel, advance it one step at a time, and check whether it intersects your landscape or whatever), and so forth.

##### Share on other sites
Quote:
 Original post by MrHelmut: ...like using look-up tables for sqrt & cos (should be able to get a 25% increase)...

I'm pretty certain raycasting doesn't need sqrt or cos at all.

Anyway, are you sure these are responsible for 25% of the CPU time? Also, lookup tables might not be the right way to optimize them. But first of all, check whether you can eliminate sqrt and cos altogether.

Finally, why do you even worry about 30 FPS at 1280x1024? That's very smooth, so don't waste time optimizing when you could be doing more interesting things.
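As a minimal sketch of what "no sqrt or cos at all" can look like in practice (this assumes a grid-DDA caster with the usual side-distance accumulators; the names here are illustrative, not the OP's actual code), the perpendicular wall distance can be read straight off the accumulators after the stepping loop:

```cpp
#include <cassert>

// Sketch: after a grid DDA finishes, the side-distance in the axis that was
// last stepped has overshot the hit by exactly one delta step, so the
// perpendicular distance needs neither sqrt() nor cos().
double perpWallDist(double sideDistX, double deltaDistX,
                    double sideDistY, double deltaDistY, int side) {
    return (side == 0) ? (sideDistX - deltaDistX)   // hit a vertical grid line
                       : (sideDistY - deltaDistY);  // hit a horizontal grid line
}
```

The projected wall-slice height then follows from a single divide (screenHeight / perpWallDist), with no trig in the per-ray path.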

##### Share on other sites
If you're looking for the oldschool tricks used to optimize raycasting, I can suggest several books when I get home.

Consider things like fixed point vs. floating point, using crude approximations for any calculations (a 2-3 pixel error isn't a big deal when you're one of the first pseudo-3D games), and using inline assembly to get the maximum work per cycle. Also, the Wolfenstein 3D source code is available from id Software if you want to try to take it apart.

##### Share on other sites
Keep in mind that the Wolfenstein engine had several major restrictions with regard to what it could render. You could only have walls at 90° angles, and all the floor and ceiling heights were the same. So essentially the levels were arranged on a regular 2D grid. This allows for all sorts of optimized calculations. It becomes more like "scanning" than raycasting.

I believe Andre Lamothe talks about and includes such a raycaster/scanner with similar restrictions in his old book Tricks of the Game Programming Gurus. It's actually pretty neat.

##### Share on other sites
Quote:
 Original post by Extrarius: ...optimize... things like fixed point vs floating point

Nowadays CPUs are faster at floating-point math than integer math, at least for maths-heavy code (think of SSE) ;)

[Edited by - greenhybrid on June 11, 2007 12:23:23 PM]

##### Share on other sites
Quote:
Original post by greenhybrid
Quote:
 Original post by Extrarius: ...optimize... things like fixed point vs floating point

Nowadays CPUs are faster at floating-point math than integer math, at least for maths-heavy code (think of SSE) ;)
Is SSE floating point faster than SSSE3 fixed point? Could be, but I somehow doubt it. Either way, architecture-specific optimizations always run the risk of no longer being optimizations when the architecture changes.

I downloaded the Wolf3D source just for fun and noticed that Carmack himself says that these days (meaning when they released the source) raycasting was not the fastest way to draw a 3D environment, but it was simpler to implement.
If I remember correctly, Michael Abrash's Graphics Programming Black Book (ISBN 1576101746) gets a textured triangle renderer down to 9 cycles per pixel on a Pentium. It might be possible to get even lower than that on today's hardware using SIMD instruction sets.

##### Share on other sites
If you haven't seen it already, Permadi's raycasting tutorial might be of interest.

It was many years ago that I used it for my own Wolf3D/raycasting demo, but I seem to remember that once I'd implemented the basic algorithm, plenty of opportunities for optimisation became apparent. I'm pretty sure it was possible to remove nearly all the trig, but sadly it was too long ago to remember the details.

Perhaps you should post some code; it would give people the opportunity to make specific recommendations for optimisation.

##### Share on other sites
Thank you for all the replies.

So, about sqrt & cos: I managed to get rid of the sqrt, but I still can't find how to remove the cos (I have 2 cos calls per cycle).

Yes, 30 fps is smooth enough for 1280x1024; I was just thinking it could be faster for such an old trick.

For float vs. fixed: Core 2 Duos are really powerful with floats, but that's not the case for P4s and AMD CPUs (at 3 GHz, a C2D runs my bench at 350 fps, a P4 at 140 fps, and an AMD X2 at 140 fps too). Actually, most of my calculations are in float.

I'm aware of the Wolfenstein engine's restrictions, and the engine I intend to code has pretty much the same ones. I'm not looking for advanced stuff; it's just for my personal entertainment :p. So, yes, my engine also has orthogonal walls of the same height. I'm just adding textures to the ceiling and floor (compared to the Wolf engine's features).

For Permadi's documentation: my engine is mostly based on it, and also on Lode's tutorial: http://student.kuleuven.be/~m0216922/CG/raycasting.html.

I will run a test in fixed point, look into getting rid of the last 2 cos calls (they are used to calculate the hypotenuse of a triangle), and inline what I can.

Since I'm not so familiar with ASM, I don't think looking into the Wolf3D source code would help (I already looked at an enhanced version a little, since the original source code is no longer on id's website).

I think it could be interesting to get a book, so if someone can confirm that one of the books you guys have mentioned has something about raycasting, that would be nice.

Thx again, all of you.

##### Share on other sites
How are you rendering?

My guess is that the actual rendering of each vertical strip is what will eat the most time.

##### Share on other sites
Quote:
 Original post by Rockoon1: How are you rendering? My guess is that the actual rendering of each vertical strip is what will eat the most time.

Yes, there's one cycle per pixel of width, so 1280 cycles at 1280xWhatever screen resolution. The purpose is to optimize the single vertical line process.
I know I could introduce a quality level by approximating more than one line in a single pass, but I would like to optimize my calculation algorithm first.

##### Share on other sites
Quote:
Original post by MrHelmut
Quote:
 Original post by Rockoon1: How are you rendering? My guess is that the actual rendering of each vertical strip is what will eat the most time.

Yes, there's one cycle per pixel of width, so 1280 cycles at 1280xWhatever screen resolution.

I don't think you understood.

The rendering speed of a vertical strip in a software raycaster is bound by its height. This will be true unless you hardware-accelerate the rendering of the strips, which is why I was asking. The inner loop isn't the raycasting; it's the rendering:

for each strip
..for each pixel in the strip

In software rendering it is usually the per-pixel work that nails you (each pixel gets pushed across the AGP/PCIe bus, which is significantly slower than your system's memory bus or CPU).

Quote:
 Original post by MrHelmut The purpose is to optimize the single vertical line process.

Well, Wolf3D used a 2D voxel grid and simply cast rays through it. The speed of the casting was bound by the distance to the intersected wall.

##### Share on other sites
Ok, I think I understand. Thank you for pointing it out.
Well, I'm not rendering a strip each cycle; I'm writing into a buffer and rendering the whole scene at the end. I know that the rendering takes much of the time, but I can't do anything more (or I haven't found how), since I'm already using hardware acceleration for the buffer and the rendering. But, as you said, there's always a transfer from the CPU to the graphics RAM for each pixel.

##### Share on other sites
It's probably worthwhile, ultimately, to remove those cosines if possible; however, optimizing the strip-drawing will certainly yield larger gains.

If I understand what you are using the cosines for (to calculate the projected distance to the intersected wall, to avoid the fish-bowl effect?), then what you really want is the projection of the ray onto the view vector, from which you take the length. It's 4 multiplies and a divide, so I'm not sure if it's faster than the trig. I'm certain you could work out an incremental algorithm for a correction factor as well, which is probably the best approach.
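That projection idea might be sketched like this (illustrative names, assuming a 2D setup where the hit point is given relative to the eye; not the OP's actual code):

```cpp
#include <cmath>

// Sketch: project the eye-to-hit vector onto the view direction and take the
// length of that projection. This gives the perpendicular ("projected")
// distance without the fish-bowl curvature; if the view vector is unit
// length, the sqrt disappears and it reduces to a dot product.
double projectedDistance(double hitX, double hitY,    // ray from eye to wall hit
                         double viewX, double viewY)  // view direction
{
    double dot = hitX * viewX + hitY * viewY;
    double viewLenSq = viewX * viewX + viewY * viewY;
    return dot / std::sqrt(viewLenSq);
}
```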

There are a number of neat tricks and optimizations you can perform to improve both image quality and speed. One thing you can do is transpose your textures to improve caching behavior. This alone should be a good benefit, since it's likely that you're currently causing a cache miss on every texture sample. Cache misses and pipeline stalls have a huge negative impact on the P4 because its pipeline is much longer than other processors'. Another thing you can do is implement mip-mapping to reduce the number of texture samples you need to take. You can combine mip-mapping with bilinear filtering (or even trilinear, if you're feeling saucy) to improve image quality (and possibly speed, depending on how you sample colors now). If you want to move beyond Wolf3D and towards Doom, a quad-tree, portals, or a similar structure might help as more geometry is added to the level.
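The transposition trick could look something like this sketch (assuming 32-bit texels stored row-major to begin with; names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Sketch: store each texture column contiguously, so that walking down a
// vertical wall strip walks forward through memory instead of striding a
// full row width per texel (one likely cache miss per sample).
std::vector<uint32_t> transposeTexture(const std::vector<uint32_t>& rowMajor,
                                       int w, int h) {
    std::vector<uint32_t> colMajor(rowMajor.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            colMajor[x * h + y] = rowMajor[y * w + x];
    return colMajor;
}
```

After transposing, column x of the texture is the contiguous run colMajor[x*h .. x*h + h - 1], which matches the access order of a strip renderer exactly.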

##### Share on other sites
Quote:
 Original post by ravyne2001: [...] Then what you really want is the projection of the ray onto the view vector, and from that you take the length. Its 4 multiplies and a divide, so I'm not sure if its faster than the trig. I'm certain you could work out an incremental algorithm for a correction-factor as well, which is probably the best approach. [...]
Since the rays always have the same direction relative to each other, shouldn't there be a constant factor for each column that you could put into a simple lookup table? (It should be faster than a cosine lookup table, since you don't need to do any math and you get sequential access.)
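One way that per-column table might be built (a sketch under the assumption of a symmetric horizontal FOV; the function and parameter names are illustrative):

```cpp
#include <cmath>
#include <vector>

// Sketch: the angle of each screen column's ray relative to the view axis
// never changes between frames, so the fish-eye correction factor
// cos(columnAngle) can be precomputed once at startup and read sequentially.
std::vector<double> buildCorrectionTable(int screenW, double fovRadians) {
    std::vector<double> table(screenW);
    double halfTan = std::tan(fovRadians / 2.0);
    for (int x = 0; x < screenW; ++x) {
        // Map column x to [-1, 1] across the view, then take its ray angle.
        double angle = std::atan((2.0 * x / (screenW - 1) - 1.0) * halfTan);
        table[x] = std::cos(angle); // multiply the raw ray length by this
    }
    return table;
}
```

The centre column gets a factor of 1 and the edge columns get the smallest factors, which is exactly the fish-bowl correction applied without any per-frame trig.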

##### Share on other sites
I think that we're talking about the same calculation, and you're saying that it can be stored in a table since it won't change between frames, correct?

I'm not sure if the table or the calculation would be better. If it can be transformed into a simple incremental algorithm, then it might actually be faster to calculate it rather than read it out of memory. I'm not certain, but I think it could be reduced to just a couple of additions per strip, with some setup amortized up front.

That said, for simplicity I think your suggestion wins out for the OP's needs, as it's much clearer and you don't have to worry about performance when filling out the table.

##### Share on other sites
Lookup table vs. a trig call... moot.

Keep some perspective (no pun intended) ..

1280 strips
75 frames per second
------------------------
96000 strips per second

Surely we are not worried about 96,000 trig calls per second on even low-end machines like a P3? It makes no sense to consider this a candidate for optimization: code readability and extensibility are far more important than whatever nearly worthless gain can be had.

On an underpowered handheld device I could see a reasonable argument for optimizing a per-strip trig call (although at lower screen resolutions things move back towards moot).

The first candidates for optimization are in the complexity of the casting itself or in the rendering of each pixel.

For a voxel approach, the per-frame casting complexity is something like O(screenwidth * voxeltests), and the voxel tests *could* be minimized to some extent (each voxel could store the minimum distance to the next non-empty voxel).
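That per-voxel skip idea can be sketched along a single axis like this (a simplified illustration with made-up names; a real engine would store a per-direction or conservative radial distance per cell):

```cpp
#include <vector>

// Sketch: precompute, for each empty cell, how many cells lie between it and
// the next non-empty cell (scanning in one direction). A caster can then
// leap over runs of empty space instead of testing every cell it crosses.
std::vector<int> buildSkipMap(const std::vector<int>& cells) {
    std::vector<int> skip(cells.size(), 1);
    int run = 1;
    for (int i = static_cast<int>(cells.size()) - 1; i >= 0; --i) {
        if (cells[i] != 0) { skip[i] = 0; run = 1; } // wall: no skipping
        else              { skip[i] = run++; }       // safe cells to jump
    }
    return skip;
}
```

The trade-off is that any edit to the map invalidates the precomputed distances, so this fits static levels better than highly dynamic ones.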

For a non-voxel approach I wouldn't settle for much worse than O(screenwidth * log2 wallcount) in casting complexity. The wall count itself can be minimized with techniques such as a PVS. I'd be careful not to overcomplicate in the name of efficiency, because complications often mean limitations (supporting a highly dynamic environment might be a concern, for instance). Twiddling optimizations will not be important.

As far as rendering the strips goes, per frame we are talking about O(width * height) unless we dive into hardware acceleration. So here it is twiddling optimizations rather than algorithmic ones, and the external bus speed of the CPU will represent a very hard limit. Probably only L2 cache misses (which might be on the order of 100 clock cycles each) could represent a comparable limit.

Personally, I would dive into hardware acceleration of the strips unless there was a very good reason not to (perhaps some per-pixel work that can't be implemented on a GPU).

##### Share on other sites
Sounds interesting.

I'm not using the cos to avoid the fisheye effect; I'm already working with vectors. The 2 cos calls are used to calculate the initial X and Y lengths of the ray vector in the starting box (since at initialisation, I only know the angle of the ray).
Then I "walk" through the 2D array, increasing the length until I enter a "wall box".

Here's the chunk of code for this part:

```cpp
// perform DDA
while (hit == 0)
{
    // jump to next map square, either in x-direction or in y-direction
    if (sideDistX < sideDistY)
    {
        sideDistX += deltaDistX; // the actual X ray length
        mapX += stepX;           // the position in the 2D world array
        side = 0;
    }
    else
    {
        sideDistY += deltaDistY;
        mapY += stepY;
        side = 1;
    }

    // check if the ray has hit a wall
    if (worldMap[mapX][mapY] > 0) hit = 1;
}
```

That's why (as said before) the calculation speed *should* be bound by the distance to the wall.
Along with the cos part, it's also the part I think is the slowest in my algorithm. Casting complexity should depend on the width and the wall distance.
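For what it's worth, one common way to avoid the per-ray trig entirely is the camera-plane formulation from Lode's tutorial (sketched here with illustrative names, not the OP's actual code): derive each ray from a direction vector plus a camera-plane vector, and the per-cell step lengths fall out of the direction components with no cos at all.

```cpp
#include <cmath>

// Sketch: given a camera direction (dirX, dirY) and a perpendicular camera
// plane (planeX, planeY) set up once per frame, each column's ray and its
// DDA step lengths need only adds, multiplies, and divides.
void rayDeltas(double dirX, double dirY, double planeX, double planeY,
               double cameraX,                 // column mapped to [-1, 1]
               double& deltaDistX, double& deltaDistY) {
    double rayDirX = dirX + planeX * cameraX;
    double rayDirY = dirY + planeY * cameraX;
    deltaDistX = std::fabs(1.0 / rayDirX);     // ray length per unit step in x
    deltaDistY = std::fabs(1.0 / rayDirY);     // ray length per unit step in y
}
```

Rotating the view then means rotating dirX/dirY and planeX/planeY once per frame (two sin/cos calls total), rather than two cos calls per ray.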

But I noticed something very interesting: when the viewport is near a wall, the fps decreases, and the farther I am from a wall, the better the fps.

And now I'm thinking that Rockoon1 is completely right. Cos and the rest are not that much of a concern. The things killing the speed are the cache misses and, obviously, the buffer writing.

It would still be a good thing to get rid of the cos and some other calculations, but I think the graphics API is the priority.

Actually, I'm using SDL... and well, I now understand its speed reputation: even when using hardware surfaces, most SDL work seems to be done in software. (I should have considered this from the beginning...)

So, next step, switching to a 3D API.

Well, thank you everybody! It was quite interesting. If you have more recommendations, explanations, tricks, or tips to share, feel free :).

##### Share on other sites
Just a thought, what compiler are you using?

For many years I used my cheap and cheerful MSVC++ 6.0 and was very happy. I downloaded the new MSVC2005 shortly after MS released the 'free for a year' version (now extended indefinitely), but it took me a long time to actually move projects over to it.

Then earlier this year I was doing some pixel-level routines for grabbing images from a camera and applying image-processing techniques to them. I had invested a considerable amount of time in MSVC6 optimising the code to great effect, even getting into asm, only to discover that recompiling the exact same code in MSVC2005 resulted in similar or better performance!

Looking through the assembly output from both versions, it became apparent that MSVC6 literally examined the code line by line, extremely sub-optimally, whilst MSVC2005 would examine whole programs, functions, or several lines at once. Essentially, MSVC2005 came up with optimisations similar to those I had to handcraft back in MSVC6.

So, the moral of this story: if you happen to still be using MSVC6, switch to MSVC2005 now ;)

Also, if you are still concerned with performance, isn't it about time you started profiling your code to see exactly where the bottlenecks are?

Off-hand, I'd say that ravyne2001's suggestion of transposing the textures/bitmaps sounds like a great idea; definitely something I wish I'd considered back when I had a play around with raycasting, and so bloody obvious now it's been mentioned ;)

I'm not sure why you'd want to switch to a 3D API; doesn't that defeat the purpose of doing raycasting in the first place?

##### Share on other sites
I'm already using VS2005. And yes, the Microsoft compiler is known to be very good at code optimisation. Anyway, thanks for the recommendation.

I was considering a 3D API because I'm a little familiar with OpenGL and know how hardware-accelerated texturing and filtering work. It would obviously be a bit of a defeat...

I don't know how to take advantage of hardware acceleration without using a 3D API. Documentation is indeed welcome. I'll check Google, thanks.

PS: I'm actually at work. I'll upload some code later if anybody is interested in looking at it.

##### Share on other sites
For 2D APIs there is always DirectDraw. Many people will tell you not to use it, but it's still supported, and it will continue to be supported until there is a reasonable alternative for hardware overlays (used in almost all video players) in widespread use. I figure it will be at least 8 more years before you find gaming/media rigs that don't support it.

Just the same, since you do not need overlays, you might as well use a 3D API for the strip rendering. As for texture filtering: it's going to be difficult to manage bi/trilinear filtering properly (how wide, in texels, is this strip? What about at a boundary condition?).

Reflection, refraction, and transparency will be fairly simple to implement (transform the ray and continue casting).

Then there is all that post-processing stuff (blur tricks and so forth) being done fairly fast on GPUs.

Also, deferred shading? Might be something to look into.

For good measure, since you are using a 3D API, you can mix in some real 3D stuff (models, volumetric fog, and so forth).

##### Share on other sites
A few points:

- I may be missing something here, but couldn't you render your pixels/strips to either a texture or a buffer (and then copy the buffer to a texture)? Or is that what everyone else is suggesting? :)
- The cache miss/indirection for a lookup table may be more expensive than just calling trig functions directly on modern systems. Profile it to make sure your performance info is up to date. Heck, make sure it's a big bottleneck first.