• 15
• 15
• 11
• 9
• 10

Yet another Ray-Triangle Intersection Test Question

This topic is 4855 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Recommended Posts

That is interesting, I note though that theres nothing obvious in your method which gives u,v bilinear barycentrics (or some other parameterization), can this be added without breaking the performance? (obviously applications vary, but it's not a fair comparison since the moller-trumbore above gives those coordinates as a by-product).

[Edited by - JuNC on December 6, 2004 3:24:02 PM]

Share on other sites
I also listed code to calculate t for r0 + t(r1 - r0), which is used to easily calculate the exact intersection point. Do you need u and v to calculate anything other than the intersection point? Also, remember that (in general) the number of tests will be much larger than the number of intersection results, so it is more important, speed-wise, to make the test code fast.

Straying from mathematics and delving into processor design, the moller-trombore "trivial reject test" is not so trivial. By the time you compute (u < 0.0f || u > 1.0f), you've already done a cross product, 2 dots, 3 vector subs, and a scalar-vec multiply. THEN, the 'trivial reject' is taken approximately 2/3 of the time, which is really bad for branch-predictors. The great irony is that the case where moller-trombore might be faster, where the majority of triangles tested intersect successfully, the div lies on the branch-predict path, which means that if the 'trivial reject' does succeed, the pipeline has to wait for the div to finish before it can flush! That's going to be a huge penalty, probably ballpark 50-cycles... so it's really no win unless the % of successful intersection results is up around 80-90%.

I could actually do a benchmark, but it would be pretty meaningless to test an optimized version of one algorithm against the textbook version of a different one.

 And although, I wish I were smart enough to discover something like this, it's not "my" method :) I originally saw it in an article by segura and feito , although I think they made errors in their paper that I describe in another post, and I suspect that this method turns out to be fundamentally equivelent to plucker space... the equations are suspiciously similar.[/edit]

[Edited by - ajas95 on December 6, 2004 3:11:40 PM]

Share on other sites
Yeah I noticed the t code, but obviously for a raytracer you'll need some parameterization of the triangle as well. As I said, it depends on your application whether you need coordinates on the triangle and comparing an algorithm which doesn't give you them 'for free' and one that does isn't very fair.

If I didn't need the u,v then 67 cycles without branches sounds very attractive, with that requirement then it'll take a bit more convincing - I haven't checked the math so it's possible that some parameterization falls out automatically from the dets and you can just store them.

EDIT: I'm not arguing that the method you gave isn't faster - I'm sure it is. Also your point about not needing u,v for a large number of the tests is taken. In fact, I think I'll actually incorporate the method for my large scale mesh intersections - is your SSE optimized code available or do I have to do it myself? :)

Share on other sites
ajas95: I think Möller uses his code mainly for ray-tracing. He probably does some sort of Octree first so he increases the possibility of an intersection and this limits his test to the triangles in a few voxels. I guess that he also had in mind the possibility of saving results from previous tests and update does with the same basic transformations as for the objects. Also, if many rays are parallel to each other and then checked against the same triangle then the code before the first rejection test needs to calculated only once.

/__fold

Share on other sites
JuNC,
what you said about intermediate determinants being re-used, that may very well be true. If you notice, my code to calculate t is just the top equation of moller's intermediate:

[t u v] = 1.0f / det(-d, e1, e2) * [ det(s,e1,e2), det(s,e2,-d), det(-d,e1,s) ]

Also, it's somewhat misleading to say that Moller's method calculates u,v as a "bi-product" of intersection. Mollers method IS to calculate u,v :) do you see what I mean? You could more accurately say that intersection is a biproduct of determining u and v (by checking 0 <= u,v <= 1). Also, just because u and v are calculated sequentially, (i.e. implying that you don't have to calculate v if u is outside a range), that is not how you would actually do it in an optimized version. Dots and crosses each require about 5 instructions, each of which is dependent on the previous one. So you would calculate these things interleaved. By the end, I can already see that you're not saving anything by trying to branch halfway through. You might save yourself a div, but if you understand my note above, that savings might be an illusion.

Anyway, segura's method can also use pre-computed info to speed it up. It's equivalent (I suspect) to storing the plucker coords of each edge. My implementation assumes that you don't have this info, as my mesh changes every frame. If you're doing many tests over the same mesh every frame, then it would be a benefit to calculate the cross product for each vertex (i.e. v0 x v1, v1 x v2 and v2 x v0). Then you would only need to compute 3 dot products to test intersection... hmmm, I should really try that. However your storage requirements skyrocket, it completely negates the benefit of indexed tri-lists. Although you really only need 2 of the cross-products per triangle... Or you really only need one cross-product per edge, but then you'd need to also store directionality of each edge for each triangle.

Alright, enough stream of conciousness :) It would be complicated to do it in a space-efficient manner. If you do it in a non-space-efficient manner, you risk cache-effects, although it becomes computationally very fast.

Hard to say without actually doing it.

Yeah, I can email you the code. The intersection code is commented, but unless you have a thorough understanding of SSE and the algorithm itself, it will be very difficult to follow, as the operations are heavily interwoven.

__fold : when I wrote it, it was with the assumption that there was no precomputed info. It seemed Moller was making the same assumption with his (no precomputed normal). He also seemed aware that branching might not be such a benefit ("However, on modern CPUs where conditional statements tend to be fairly expensive, this test may actually decrease performance" -- Moller, RTR 2nd Ed.)

Share on other sites
Quote:
 Also, it's somewhat misleading to say that Moller's method calculates u,v as a "bi-product" of intersection. Mollers method IS to calculate u,v :) do you see what I mean? You could more accurately say that intersection is a biproduct of determining u and v (by checking 0 <= u,v <= 1). Also, just because u and v are calculated sequentially, (i.e. implying that you don't have to calculate v if u is outside a range), that is not how you would actually do it in an optimized version. Dots and crosses each require about 5 instructions, each of which is dependent on the previous one. So you would calculate these things interleaved. By the end, I can already see that you're not saving anything by trying to branch halfway through. You might save yourself a div, but if you understand my note above, that savings might be an illusion.

Yes, the method IS to calculate u,v - so therefore having them is 'free'. If I don't have them I'll have to do extra calculations to work them out (admittedly I've never bothered checking the performance of doing it this way since higher-level optimizations tend to make more difference in the field I'm interested in).

I'm totally in sync with what you're saying about branching - my point is more that can I get the exact same information out of your algorithm as I can out of the MT method with *less* work (in whatever way we want to define that - branching, FP ops or whatever)?

There was some optimization done by Moller based on feedback from various architectures - he reorganized which tests were done first to avoid pipeline stalls. Of course, those kind of optimizations change from CPU to CPU and between generations of the same CPU.

I can't really precompute much information for triangles because I'm aiming at ray tracing (in various guises like photon mapping and MLT) many-hundred-million face meshes and memory is a massive problem (and caches are a pain).

Share on other sites
okay, well we can quantify work based on % of tris tested which intersect the ray (this assumes that there is no aggregate benefit to the 'early' exit test 0 <= u <= 1.0f, which I'm fairly convinced of... on the other hand, if you allow his algorithm to run to completion it can be written with no branches also). Moller's method does 2 crosses, 4 dots, 3 scalar * vector muls, 3 subs, and a div. Segura's method does 2 crosses, 3 dots, 4 subs... a difference of 3 muls, a dot and a div (subs are pretty much free :)

So it really depends on how dependencies and registers work out... but the div is going to cost about 30 cycles, 12 or so for the dot, and another 12 for the three muls. This is totally a ball-park guess, but I would guess Mollers method to take an extra 50 cycles or so (remember the big assumption above, however).

So, if m is tests, and n is probability of intersection:

segura          vs  moller67m + mn * 120   =  120m67 + 120n = 120n = (120 - 67)/120n = .44

So if your probability of intersection is < 44%, it's probably better to run Segura's test, then on each intersecting result, run moller's test to calculate u,v,t. I'll accept that all of this (except the # of individual ops) is open for debate.

100 of millions? whew. I guess your concern is probably creating very cache-efficient index/vertex storage, and space partitioning.

Share on other sites
How many intersections per frame(at 60 fps) are we talking about here (generally speaking)?

Share on other sites
Actually, I just thought of a good optimization for MT to eliminate the div with a branch taken almost everytime... it boils down to calculating s.p and d.q (i.e a*u and a*v), and then checking vs a and 0... i.e.

xmm0 = a*u a*v -a*u -a*v
xmm1 = a a 0 0

cmpps xmm1, xmm0, 1 (less_than)
movmskps eax, xmm1
cmp eax, 1111b
jnz no_intersect

// test u+v < 1
// do div

so, say this might cut MT down to 95 cycles or so (on average, but now it's branching, so it will vary).

Using my math above... that would make the intersection expectation > 27% to make MT worth it. However, because of the branch it becomes MORE expensive when there actually is an intersection :
just a thought

Share on other sites
thanks to everyone for such an enormous response! seems like i got much more answers than i could ask for.

based on your replies, i come to realize that it does make sense rays collinear (grhodes_at_work you're right!) to triangle should not intersect. in which case answers my main question.

as for other better algorithms such as moller and trumbore and the ones using barycentric coordinates, i'm aware of their existence. i actually have the geometric tools book by eberly/schneider (bought it last week) and these two are explained. however, i still don't understand them yet. i am just taking it one at a time though, using the easier to understand (although less efficient) algorithm first.

with regards to possible issues i may encounter when using the algorithm i am using (planes used for intersection), at this moment i am happy i don't encounter any trouble with my intersection function. however, i'll take this into account as soon as i get into much more complicated applications.

again, many thanks and i appreciate your continued support!

[Edited by - harmless on December 7, 2004 12:48:34 AM]