KD-Tree nodes in 4 bytes


I've been thinking about ways to reduce nodes in a kd-tree (as applied to raytracing, but I believe it goes for other applications just as well) to 4 bytes in size, halving them from the currently widely used 8 bytes. Three improvements I've come up with:

- A node is either a leaf or has one of three split axes specified. That's four states, so two bits are needed, not the commonly used 3.

- You don't need nearly as many bits for a child pointer as are commonly used. It's really simple: if your tree has 64k node pairs, so 128k nodes (not a small tree), you need only 16 bits to differentiate between them; just multiply this value by the size of a node pair and add the root node's memory address to get the child's location. However, it can be done even better by using relative pointers, specifying how far away a child is from its parent. Again using 16 bits as an example, a parent and its child are then constrained to being at most 64k node pairs (128k nodes) apart. Without any clever node reorganization, just a naive 'sort all nodes by tree depth', this would mean the maximum width of the tree would be 128k nodes. If kd-trees were balanced this would cap the tree at about 256k nodes; however, they are highly unbalanced, meaning much deeper and hence less broad, so this figure should stretch considerably. I suspect 1M nodes wouldn't be any problem, but this requires some testing. A tree can also be built so as to limit its width.

- Again using 16 bits as an example for pointer data, that leaves us with 14 bits for the split-plane position. As the clever reader notes: you can't fit a float in there :). However, 14 bits is still a resolution of 16k. That's a lot better than an octree with its resolution of 2, and not much worse than what a float provides, with its anisotropic resolution. (A float can only guarantee about 7 decimal digits of precision (correct me if I'm wrong), which isn't that far ahead of the roughly 4-5 digits 14 bits gives you.)
Also, each split plane could be specified relative to its parent, yielding a much better resolution at the lower levels of the tree, where precise positioning actually matters. So it seems that with some adaptation it's quite possible to halve the memory needed for your tree.

There is one main point of concern I have though: fixed-point math and the split-plane position. I must say I'm not too familiar with the precise details of it, and whether it's possible without completely castrating your traversal speed. Or would it, for example, be possible to interpret these 14 bits (or more, for trees that need a smaller maximum depth) as the most significant bits of a float's mantissa? Or if that doesn't work, is there another way to store only part of a float, discarding some precision?

I'm also aware of the problem of specifying a split plane at a lower resolution than that of your vertex coordinates. However, this can be taken into account during tree construction: if the resolution of a plane's position is too low to determine whether an AABB falls on one side or the other, it simply falls in both. It slightly compromises the efficiency of your tree, but:
- Once again, it still beats octrees hands down.
- By doing this you also address numerical instabilities (which a 32-bit float split-plane position also suffers from) that can lead to incorrectly missed triangles. Namely, a vertex will never again lie precisely on a split plane, but always a distance away from it, providing a buffer against numerical instabilities.

Another (experimental) solution: store vertices in 16-bit fixed point. The split-plane position should easily surpass that at leaf level. Since rays get transformed into model space in my implementation, accuracy shouldn't be a problem. However, fixed-point math again might be.

Another experimental idea: make split-position precision dynamic, i.e. minimize and count the maximum number of bits needed per child pointer for each tree level, and use all unneeded bits for the split plane.
That should yield very accurate split planes in most cases, and where it doesn't, well, you'll just suffer some slightly less efficiently constructed trees. Wow, that's a lot of rambling, and there are probably not even a handful of people visiting these boards that care at all about this, but for those that do care and have taken the time to give this some thought: I'd love to hear your vision on this. I think the possibilities are promising, but maybe I'm overlooking something. Is this even worth it, for instance? It must save some memory and memory bandwidth, as well as improve cache performance, but some extra instructions are probably needed here and there, especially concerning the fixed-point math.
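To make the bit budget concrete, here is a minimal sketch of such a 4-byte node in C++ (the exact field layout, tag values, and the [0,1) fixed-point convention are my assumptions for illustration, not a tested design):

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical 4-byte kd-tree node:
//   bits 0-1  : 0,1,2 = split axis (x,y,z), 3 = leaf
//   bits 2-17 : 16-bit child offset, in node pairs, relative to the parent
//   bits 18-31: 14-bit fixed-point split position in [0,1)
struct KdNode {
    uint32_t bits;

    static KdNode makeInner(unsigned axis, unsigned childOffset, float split01) {
        assert(axis < 3 && childOffset < (1u << 16));
        uint32_t split = uint32_t(split01 * 16384.0f) & 0x3FFFu; // 14-bit quantize
        return KdNode{ axis | (childOffset << 2) | (split << 18) };
    }
    static KdNode makeLeaf(uint32_t leafData) {
        assert(leafData < (1u << 30));
        return KdNode{ 3u | (leafData << 2) };
    }

    bool     isLeaf()      const { return (bits & 3u) == 3u; }
    unsigned axis()        const { return bits & 3u; }
    unsigned childOffset() const { return (bits >> 2) & 0xFFFFu; }
    float    split01()     const { return float(bits >> 18) * (1.0f / 16384.0f); }
};
```

During traversal, the parent would add childOffset() (in node pairs) to its own index, and split01() would be mapped into the node's bounding interval along the split axis.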

Very interesting. I was wondering the same thing lately, but I hadn't thought about it in this much detail.


I get around 6 bytes per triangle average storage with my AABB tree. It's really not clever - the main idea is to store several tris at once (2-8) in a leaf, which amortizes the cost of the full 24-byte AABB at each node & leaf.

I could get it much smaller if I compressed the AABB, but it's not a bottleneck right now...

I posted once about how I store my kd nodes in 32 bits. I do the axis/leaf identifier in the lower 2 bits of the split-plane. If the low 2 bits specify 'leaf' then the upper 30 bits are an offset into a list of primitive indexes. The primitive indexes are specified by a count of the primitives in the leaf in the first entry, then a list of unsigned ints that point to the actual leaf data.


_______________ ______ ______________ ______ ______________ ________
| 30-bit offset | leaf || 30-bit offset| leaf || 30-bit split | x-axis |
--------------- ------ -------------- ------ -------------- --------
| /
| /
___|_____ _______ __/______ _____
| # prims | 0 1 2 | # prims | 2 4 |
--------- ------- --------- -----
/ / \ ______/ \ .
/ | \ / \ .
[leaf] [leaf] [leaf] [leaf] [leaf]


And the children of the node stored at m_nodes[i] are implicitly stored at left = m_nodes[2i] and right = m_nodes[2i + 1].
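A sketch of how that node word might be decoded, together with the implicit child addressing (the helper names and exact tag value are my guesses at the scheme described):

```cpp
#include <cstdint>
#include <cstddef>

// Node is one 32-bit word: the low 2 bits encode the axis (0..2) or 3 = leaf.
// For a leaf, the upper 30 bits are an offset into the primitive-index list,
// whose first entry is the primitive count.
inline bool     isLeaf(uint32_t node)  { return (node & 3u) == 3u; }
inline unsigned axis(uint32_t node)    { return node & 3u; }
inline uint32_t upper30(uint32_t node) { return node >> 2; }

// Implicit layout: with 1-based indexing, the children of m_nodes[i]
// live at m_nodes[2i] and m_nodes[2i + 1]; no pointer is stored at all.
inline std::size_t leftChild(std::size_t i)  { return 2 * i; }
inline std::size_t rightChild(std::size_t i) { return 2 * i + 1; }
```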

I do nearly the same thing for my leaves. I also store the number of primitives in a leaf in the node itself; there's plenty of space, and it saves a dereference on empty nodes.

As for implicitly storing nodes: isn't that very memory AND cache inefficient? Note that this is just my reasoning, I haven't tested it, but deeper down the tree it hardly grows exponentially; if anything, it should shrink at the deepest levels. You'll have seas of empty space deep in your tree, and no cache coherency at all.

However, after posting this topic I had a similar idea. To reduce the number of bits needed to specify an offset, you could, instead of specifying the distance from object to object, specify the distance from a trend line, for example j = 2*i, although I think j = i is better on average. If you store the widest level of the tree, you could use a varying trend line depending on whether the level is expected to grow or shrink. That really should bring the number of bits needed for a decent-sized tree back down to something like 14, leaving a nice 16 bits for the split-plane location.

If you store vertices as 16-bit shorts, you only need 16-bit split-plane positions anyway.

However, I'm still not sure about the fixed-point aspect of all the math, and its speed in particular. Maybe I should start a topic on that in the math forums.

Guest Anonymous Poster
By tracking the AABB of a node while traversing the tree, you could use your 16 bits like this:


unsigned int nodeBits = m_nodes[currentNodeIndex];
unsigned int splitIndex = nodeBits >> 16;          // Upper 16 bits are the split position.

float weight = float(splitIndex) * (1.0f / float(0xffff));                 // Get the weight [0, 1]
float splitPos = aabbMax[axis] * weight + aabbMin[axis] * (1.0f - weight); // Get the actual split position



You could probably skip clearing the lower bits, since they would only shift the position very slightly.

Quote:
Original post by Eelco
I also store the number of primitives in a leaf in the node itself; there's plenty of space, and it saves a dereference on empty nodes.
...
As for implicitly storing nodes: isn't that very memory AND cache inefficient? Note that this is just my reasoning, I haven't tested it, but deeper down the tree it hardly grows exponentially; if anything, it should shrink at the deepest levels. You'll have seas of empty space deep in your tree, and no cache coherency at all.


Right, if the leaf is empty I save the offset as 0, so there is no dereference. And as for memory/cache efficiency... those are 2 different questions. Yes, memory usage is exponential, but even at the 20th level it's only 4 megs. Beyond that it gets much less efficient. Phantomus talked about a 45-deep tree for Buddha... I don't have 128 terabytes of RAM, so it's not an option for me :)

As for cache efficiency, that was the whole reason I originally came up with this scheme. It allows determining the addresses of grandchild nodes without dereferencing, and prefetching way ahead of time. Also, grandchildren are stored consecutively in 32 bytes -- the size of a cache line -- so they can all be prefetched in one call.

As it turned out, when I profiled with prefetching turned off, it really didn't spend any time stalling on the d-cache. I'm honestly not sure why. I was kind of mad, because it eliminated what I thought was a huge performance benefit of my design. Oh well.

Yeah, that's sort of what I was thinking. By specifying everything (memory locations and positions in space) relative to a parent, you can achieve the needed resolution in only a couple of bits. However, I think any casts to float ought to be avoided in your inner loop, so another method should be used.

For example, you could store 7 exponent bits and 9 mantissa bits to store a truncated float in the range 0-1.
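One cheap way to store only part of a float, for what it's worth (my sketch, keeping the top 16 bits of the IEEE-754 encoding rather than the exact 7+9 split suggested above):

```cpp
#include <cstdint>
#include <cstring>

// Truncate a float to its 16 most significant IEEE-754 bits (sign,
// exponent, upper mantissa); expand by zero-filling the dropped bits.
// Lossy, but encode/decode are just shifts -- no fixed-point math needed.
inline uint16_t truncFloat(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return uint16_t(u >> 16);
}

inline float expandFloat(uint16_t h) {
    uint32_t u = uint32_t(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

Powers of two survive exactly; other positive values keep about 7 mantissa bits, i.e. a relative error below 1%, and truncation always rounds toward zero.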

Thinking about the implicit storage of nodes again, I just realized a way to do it with minimal memory waste. Say you have a tree level i and j = i+1, with indices I and J into them respectively. For each node in i, count how many leaves came before it in i. Since leaves won't have children, you can then store semi-implicit pointers of the form J = j.start + 2*(I - leavesBeforeI). If you put all node pairs which are both children at the end of a level, I think you could build huge trees with only a couple of bits.
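My reading of that scheme in code form (the names and the prefix-count representation are mine):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Nodes are stored level by level. Within level i, a prefix count of
// leaves lets every inner node locate its child pair in level i+1
// without a stored pointer: leaves have no children, so the I-th node's
// pair is the (I - leavesBefore[I])-th pair in the next level.
struct TreeLevel {
    std::size_t start;                  // index of the level's first node
    std::vector<uint32_t> leavesBefore; // leaves strictly before node I in this level
};

std::size_t childPair(const TreeLevel& cur, const TreeLevel& next, std::size_t I) {
    return next.start + 2 * (I - cur.leavesBefore[I]);
}
```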

Quote:
Original post by ajas95
Quote:
Original post by Eelco
I also store the number of primitives in a leaf in the node itself; there's plenty of space, and it saves a dereference on empty nodes.
...
As for implicitly storing nodes: isn't that very memory AND cache inefficient? Note that this is just my reasoning, I haven't tested it, but deeper down the tree it hardly grows exponentially; if anything, it should shrink at the deepest levels. You'll have seas of empty space deep in your tree, and no cache coherency at all.


Right, if the leaf is empty I save the offset as 0, so there is no dereference. And as for memory/cache efficiency... those are 2 different questions. Yes, memory usage is exponential, but even at the 20th level it's only 4 megs. Beyond that it gets much less efficient. Phantomus talked about a 45-deep tree for Buddha... I don't have 128 terabytes of RAM, so it's not an option for me :)

As for cache efficiency, that was the whole reason I originally came up with this scheme. It allows determining the addresses of grandchild nodes without dereferencing, and prefetching way ahead of time. Also, grandchildren are stored consecutively in 32 bytes -- the size of a cache line -- so they can all be prefetched in one call.

As it turned out, when I profiled with prefetching turned off, it really didn't spend any time stalling on the d-cache. I'm honestly not sure why. I was kind of mad, because it eliminated what I thought was a huge performance benefit of my design. Oh well.

Yeah, I'd also think prefetching grandchildren would be a good idea.

But you're saying it basically didn't matter? Was it only the prefetching that didn't help, or was the switch to 4 bytes simply completely useless?


By the way, even if there is no speed benefit, I still like the idea of fixed-point math. I can never stand floats and the fact that their resolution isn't constant, whereas for such an application it obviously should be.

Quote:
Original post by Eelco
By the way, even if there is no speed benefit, I still like the idea of fixed-point math. I can never stand floats and the fact that their resolution isn't constant, whereas for such an application it obviously should be.


I'm not sure squeezing nodes into 32 bits is a win. First, once you have accessed a node you've already paid for an entire cache line, say 64 bytes (read: locality matters more).

Then, and that's the biggest problem in my eyes: when doing a traversal à la Wald, you'd better have the distance computed as early as possible, because you have few cycles to spare for concurrent computation before you have to decide (that is, use that distance) where you're going.
Going from fixed point to float/vector is going to take time you don't have; unless you make an all-fixed-point raytracer :)


By the way, there's a good reason to spend 3 bits (1 + 2) encoding the leaf flag and axis (as Wald pointed out): with the leaf bit as the MSB, you can just test for signedness to know whether it's a node or a leaf.
Then if you store the axis in the lowest bits, you just need a compact AND with a small immediate to extract it, and you don't need to shift/scale the offset before you use it.
Can't be more computationally efficient.
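In code, the encoding tbp describes might look like this (the exact payload placement is my assumption):

```cpp
#include <cstdint>

// Leaf flag in the sign bit, axis in the two lowest bits: the leaf test
// is a plain signedness check, and extracting the axis is a single AND.
inline bool     isLeaf(int32_t node)  { return node < 0; }            // test MSB via sign
inline unsigned axis(int32_t node)    { return unsigned(node) & 3u; } // low 2 bits
inline uint32_t payload(int32_t node) { return (uint32_t(node) & 0x7FFFFFFFu) >> 2; }
```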

Quote:
Original post by tbp
Quote:
Original post by Eelco
By the way, even if there is no speed benefit, I still like the idea of fixed-point math. I can never stand floats and the fact that their resolution isn't constant, whereas for such an application it obviously should be.


I'm not sure squeezing nodes into 32 bits is a win. First, once you have accessed a node you've already paid for an entire cache line, say 64 bytes (read: locality matters more).

Yeah, that makes sense.

Quote:

Then, and that's the biggest problem in my eyes: when doing a traversal à la Wald, you'd better have the distance computed as early as possible, because you have few cycles to spare for concurrent computation before you have to decide (that is, use that distance) where you're going.
Going from fixed point to float/vector is going to take time you don't have; unless you make an all-fixed-point raytracer :)

Does the smiley indicate you think that's a bad idea? Note that I'm not a numerical guru, but I have the feeling floats are used far too widely, in applications where they shouldn't be. However, going fixed point is often considered a hassle, because everything is built around floats, for seemingly no other reason than that they're assumed to be the better choice.

Why would I want to store vertex positions as floats? I might want to for vertices in camera space, where adaptive resolution makes sense, but I see no compelling reason to let my coordinates in model space have a resolution of 2^-128 near the origin, and a resolution some 120 orders of magnitude coarser just one unit away from it. Floats let me cover a huge range, but my model space (in which I perform all calculations) is very much a bounded domain, so any ability to specify positions beyond it is wasted bits.

Quote:

By the way, there's a good reason to spend 3 bits (1 + 2) encoding the leaf flag and axis (as Wald pointed out): with the leaf bit as the MSB, you can just test for signedness to know whether it's a node or a leaf.
Then if you store the axis in the lowest bits, you just need a compact AND with a small immediate to extract it, and you don't need to shift/scale the offset before you use it.
Can't be more computationally efficient.

Yeah, that makes sense from an efficiency POV. I was assuming simple operations like bit shifting and masking could be scheduled away somehow, but I suppose that's a little optimistic.

So your conclusion is that going 32-bit is a waste of time? I'm certainly inclined to say so after the input I've been given (for which I'd like to thank you all).

However, I'd still like to investigate the possibilities of fixed-point tracing further. I have applied fixed-point math to a raycaster once, though needless to say it was a little less time-critical. I can't remember the specifics, but it was a huge speed boost in texture lookup/interpolation anyway, and in general I don't see why it has to be slower, as long as you stay away from casts. And of course you have to make sure your algorithms are stable with fixed-point math under all circumstances, i.e. no overflows.

Quote:
Original post by Eelco
Does the smiley indicate you think that's a bad idea? Note that I'm not a numerical guru, but I have the feeling floats are used far too widely, in applications where they shouldn't be. However, going fixed point is often considered a hassle, because everything is built around floats, for seemingly no other reason than that they're assumed to be the better choice.

I don't think it's a bad idea in itself, and depending on the platform it can be the only option. But on x86, unless you also schedule concurrent float ops at the same time, you'd have a large part of your CPU idling.
The problem is that crossing the fixed-point/float line is expensive; you can't really afford it in a tight kd-tree traversal.
So it's an all-or-nothing type of deal.

Quote:

Yeah, that makes sense from an efficiency POV. I was assuming simple operations like bit shifting and masking could be scheduled away somehow, but I suppose that's a little optimistic.

To begin computing the distance and decide where to go, you have to extract some stuff and cook some addresses. Basically, you're serialized at that point.
Then before you even think about it you already have to branch away.
There's little time to do fancy stuff :)

Quote:

So your conclusion is that going 32-bit is a waste of time? I'm certainly inclined to say so after the input I've been given (for which I'd like to thank you all).

However, I'd still like to investigate the possibilities of fixed-point tracing further. I have applied fixed-point math to a raycaster once, though needless to say it was a little less time-critical. I can't remember the specifics, but it was a huge speed boost in texture lookup/interpolation anyway, and in general I don't see why it has to be slower, as long as you stay away from casts. And of course you have to make sure your algorithms are stable with fixed-point math under all circumstances, i.e. no overflows.

No, 32-bit nodes aren't a waste of time; I just don't think they suit a vectorized/float approach.
Now, a fixed-point raytracer is something completely different.

Quote:
Original post by tbp
Quote:
Original post by Eelco
Does the smiley indicate you think that's a bad idea? Note that I'm not a numerical guru, but I have the feeling floats are used far too widely, in applications where they shouldn't be. However, going fixed point is often considered a hassle, because everything is built around floats, for seemingly no other reason than that they're assumed to be the better choice.

I don't think it's a bad idea in itself, and depending on the platform it can be the only option. But on x86, unless you also schedule concurrent float ops at the same time, you'd have a large part of your CPU idling.
The problem is that crossing the fixed-point/float line is expensive; you can't really afford it in a tight kd-tree traversal.
So it's an all-or-nothing type of deal.

Yes, exactly; your last phrase captures it well.

As long as there is nothing that really NEEDS floats, you can stay away from them completely, because if you don't, you'll probably face a cast sooner or later, which is out of the question.

By the way, being an ASM noob I don't have a clue, but I could imagine a float*int or float/int not requiring a cast, but having its own single-cycle CPU instruction, since they are pretty straightforward operations. Is this the case, or does the compiler make an implicit cast to float in such a case?

Anyway, I'll give this some thought.

Quote:
Original post by Eelco
Yes, exactly; your last phrase captures it well.

As long as there is nothing that really NEEDS floats, you can stay away from them completely, because if you don't, you'll probably face a cast sooner or later, which is out of the question.

More exactly, on x86 you should also use "floats", but in an orthogonal fashion to the integer part; it's the crossing that's expensive (though it's a bit more subtle than that, because some integer operations, e.g. muls, might use the FPU/vector units, etc.).

Quote:

By the way, being an ASM noob I don't have a clue, but I could imagine a float*int or float/int not requiring a cast, but having its own single-cycle CPU instruction, since they are pretty straightforward operations. Is this the case, or does the compiler make an implicit cast to float in such a case?

Anyway, I'll give this some thought.


The problem is not with the compiler (you could completely reimplement floating-point operations using integers only) but with how it's done on the CPU.
Basically there are 3 register files: general purpose (integers), FPU (represented as a stack) and SSE1/2.
Going from the GPRs to the FPU (and vice versa) has to go through memory; that is, I guess, what you call a cast (fld*, fist*, fst*, etc.).
There are some slow ops to go from GPR to SSE (and vice versa) without touching memory, but that's not recommended; SSE also brings its own set of conversions.

There's one conclusion to draw from all that: crossing register files is a mess :)

Voilà.

Quote:

I've been thinking about ways to reduce nodes in a kd-tree (as applied to raytracing, but I believe it goes for other applications just as well) to 4 bytes in size, halving them from the currently widely used 8 bytes.


At GDC 2003, as part of my talk about Memory Optimization, I presented a highly cache-efficient k-d tree representation that uses 4 bytes per node. You can find the slides from that presentation here:

http://realtimecollisiondetection.net/pubs/
or
http://www.gdconf.com/archives/2003/

I talk about the same representation in some more detail in my collision detection book, but the GDC slides should be enough to figure out the details.

You could easily reduce the memory requirements further if you're willing to represent the axis splitting value with a lower-precision value (such as, say, a 16-bit float or some fixed-point representation). The cache-efficient representation could easily be adapted to a smaller node size.

A somewhat less compact/cache-efficient k-d tree representation is given by Laszlo Szecsi in the book Graphics Programming Methods, that also uses 4-byte nodes.


Christer Ericson
Sony Computer Entertainment, Santa Monica

Christer,
That's awesome. That's exactly the memory layout I use for my kd-tree. The only (small) difference is that I don't store vertex indices, just face indices. The downside is that 2 faces sharing a vertex get duplicated verts, but each face is guaranteed to have all 3 verts in a row.

Anyway, just so you know: when my boss got back from GDC that year, he required all the programmers to read your article. That was long before I'd even heard of a kd-tree, and the first time I ever learned what pointer aliasing was. We were having problems dealing with those tiny 16 KB PS2 caches... realizing that they don't play well with virtual functions :)

Anyway, thanks. Good work.

Thanks for the link, Christer; that aliasing stuff is certainly interesting. Never heard about it before.

Made me realize once again: the more I learn about programming, the more I realize I know nothing yet :P.

Yeah, subtrees. That's definitely a good one. I wouldn't be surprised if it's faster too.

I'm weighing the pros and cons of going to a 2-byte node, but I think it's counterproductive. First of all, it's limiting to store a leaf pointer in 2 bytes, unless you store the leaf data close by in the same pool (that's not a bad idea actually). However, I don't want to go below 16-bit split planes, and you'd need 4 special values to encode leaf & axis, so that could be 0xFFFF..0xFFFC, but that's kind of hacky as well.

Besides, I'm not sure bigger subtrees aren't a slowdown; the potential for wasted space gets bigger.

However, utilizing 64-byte cache lines might be worth it, yielding 15-node subtrees without the drawbacks. One problem I see with your tree is the single child pointer, meaning 8 subtrees need to be allocated regardless of whether they're used (unless I'm missing something). However, optimization is possible here by interleaving half-used child space: say one subtree only uses the first four children and another the last four; you can then give them both the same pointer without trouble. Of course the same goes for more complicated patterns, but I'm sure there is a nice algorithm to fit them tightly.

However, another idea is to store a 3-bit cache-line offset in the node, which is possible since I only want 16-bit fixed split planes anyway (16-bit fixed vertices require no more).

Layout would then be:
bit 31: leaf bit
split node:
  bits 30-29: axis
  bits 26-24: cache-line offset in case it points to a subtree (0-7)
  bits 23-0: split plane
leaf node:
  bits 30-24: leaf size (0-127)
  bits 23-0: leaf pointer (relative to the leaf pool or to the current location; the latter enables storing leaf data in the empty space of a subtree)
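A sketch of that layout as packing/unpacking helpers (my encoding of the fields above; bits 28-27 of a split node are left unused, as in the listing):

```cpp
#include <cstdint>
#include <cassert>

// 32-bit node, per the layout above:
//   bit 31     : leaf flag
//   split node : bits 30-29 axis, 26-24 cache-line offset, 23-0 split plane
//   leaf node  : bits 30-24 leaf size, 23-0 leaf pointer
inline uint32_t makeSplit(unsigned axis, unsigned cacheLine, uint32_t split) {
    assert(axis < 3 && cacheLine < 8 && split < (1u << 24));
    return (axis << 29) | (cacheLine << 24) | split;
}
inline uint32_t makeLeaf(unsigned size, uint32_t ptr) {
    assert(size < 128 && ptr < (1u << 24));
    return 0x80000000u | (size << 24) | ptr;
}
inline bool     isLeaf(uint32_t n)    { return (n >> 31) != 0; }
inline unsigned axis(uint32_t n)      { return (n >> 29) & 3u; }
inline unsigned cacheLine(uint32_t n) { return (n >> 24) & 7u; }
inline uint32_t split24(uint32_t n)   { return n & 0xFFFFFFu; }
inline unsigned leafSize(uint32_t n)  { return (n >> 24) & 0x7Fu; }
inline uint32_t leafPtr(uint32_t n)   { return n & 0xFFFFFFu; }
```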

Quote:

Long live Adaptive Binary Trees.
/me runs away


First, ABTs (unless I've greatly misunderstood something) are nothing but AABB trees, so call them that.

Second, AABB trees (or any other bounding volume hierarchy) are different beasts from spatial partitioning structures (like k-d trees and octrees). The latter partition space itself whereas the former partition that which is contained in space. They therefore serve rather different purposes and one cannot be said to be generically better than the other.

So, yeah, better run! ;P


Christer Ericson
Sony Computer Entertainment, Santa Monica


Quote:
Original post by Christer Ericson
Second, AABB trees (or any other bounding volume hierarchy) are different beasts from spatial partitioning structures (like k-d trees and octrees). The latter partition space itself whereas the former partition that which is contained in space. They therefore serve rather different purposes and one cannot be said to be generically better than the other.


I've tried using a BVH for coherent/mono raytracing: the nodes are larger and the ray/node intersection is a bit more expensive, but the traversal is a tad tighter (BVH in array form with skip pointers); see http://homepages.paradise.net.nz/nickamy/benchmark.html for mono-ray numbers.
It's quite decent.

Of course, the biggest issue for that usage is precisely that a BVH doesn't partition space :)

Your link is broken.

But anyway, unless Havran is a lying piece of ****, BVHs aren't better than kd-trees, are they? Not in terms of performance, and certainly not in terms of memory use.

However, I can see them playing a role in the future for accelerating the raytracing of (semi-)orderly moving primitives (take cloth, for example), since a sphere tree, say, would hardly need rebuilding on such meshes.

Quote:
Original post by Christer Ericson
First, ABTs (unless I've greatly misunderstood something) are nothing but AABB trees, so call them that.

Mr. Ericson! Surprised to see me? (said in the voice of Agent Smith)...

I was hoping you would reply to my alluring bait; this is such a great time to discuss ABTs :). I'm sorry to say that you may be slightly mistaken in your understanding of Adaptive Binary Trees.

ABTs are closely related to both kD-trees and AABB trees. You see, ABTs are essentially adaptive kD-trees that store volumetric information as well. This allows ABT nodes to represent an approximate region in space, in contrast to kd-/oct-/bsp-trees, which describe a distinct region in space.

In fact, a simple ABT could be created from a naïve kD-tree implementation with relative ease. You would simply store an AABB in each node along with the split plane that would define the kD-tree. Aside from that, all you have to do is modify the way you choose the split plane. The heuristic functions that were originally proposed by Yann do a very good job of creating a near-optimal tree for nearly any situation. kD-trees on the other hand (please note: I am referring to kD-trees as discussed in the originally published white paper) are very rigid in nature. They commonly create unbalanced trees, which can degrade performance severely, especially on modern processors.

Quote:
Original post by Christer Ericson
Second, AABB trees (or any other bounding volume hierarchy) are different beasts from spatial partitioning structures (like k-d trees and octrees). The latter partition space itself whereas the former partition that which is contained in space. They therefore serve rather different purposes and one cannot be said to be generically better than the other.


Please see section one. I'm currently writing a paper on the Adaptive Binary Tree algorithm. Unfortunately, I've been rather busy at work lately, so I'm not actually going to finish for quite some time. I can however, explain the algorithm more thoroughly if anyone is interested. I assure you it has many benefits over the standard kD-tree algorithm.

Quote:
Original post by Christer Ericson
So, yeah, better run! ;P


I am a lazy asthmatic programmer... I would much rather post a drawn out retort than leave the comfort of my chair and run. I'm out of breath just thinking about it.

/me codes

Weird. As Havran discussed, if by 'unbalanced' you mean that 'at high levels in the tree, nodes are created that have large volumes of empty space as one child', then yeah... those are good.

The main difference between the two is that a kd-tree puts primitives that overlap the split plane into both children, whereas ABT children are expanded until each primitive is in either one child or the other (so the children have to overlap). But that only serves to increase the probability that both children will have to be traversed...

For example, this case, where only one child is traversed for kd, but both are for ABT:

kd-tree ABT
+-----+ +------+
| | | |
|_____| |------|
\ | \------|
|\ | |\ |
|_\___| |_\____|


The only real advantage ABTs have over kd-trees is that primitives are guaranteed to only be encountered at most once during traversal... but as Phantomus pointed out, a simple unique id can be tagged to the primitives in a kd-tree so that already-seen prims are dismissed immediately, so that pretty much wipes that argument out.

And if your argument is that heuristics can be used to create ABTs but no good heuristics can be used for kd-trees.... well. That's just not an informed argument.

Quote:
Original post by ajas95
The main difference between the two is that a kd-tree puts primitives that overlap the split plane into both children, whereas ABT children are expanded until each primitive is in either one child or the other (so the children have to overlap). But that only serves to increase the probability that both children will have to be traversed...

That is incorrect. An ABT will split faces/primitives that straddle two children, so that each child includes its respective part of the primitive without any overlap. The child-node overlapping is just an additional heuristic that helps weigh partitioning efficiency against an exponentially increasing number of splits.

Quote:
Original post by ajas95
The only real advantage ABTs have over kd-trees is that primitives are guaranteed to be encountered at most once during traversal... but as Phantomus pointed out, a simple unique id can be tagged onto the primitives in a kd-tree so that already-seen prims are dismissed immediately, and that pretty much wipes that argument out.

One word: GPU.

Perhaps the misconceptions come from the way the term "primitive" is defined. Static ABTs are meant to be partitioning systems for low-level geometry, i.e. face by face. Although they will also work with entire objects in their nodes, they yield maximum efficiency when used at the mesh level. If the splitting and partitioning heuristics are well balanced, an ABT can achieve a very tight fit at minimal splitting cost, and guarantee that no primitive will be visited more than once. This guarantee is a requirement for high-performance rendering.

Also, next-generation GPUs will probably feature hardware support for bounding-boxed geometry, i.e. conditional batched rendering. This means that the GPU will take over visibility determination at a higher level, i.e. node frustum culling and high-level occlusion culling, without interference from the CPU. ABTs will blend perfectly into this system without even the slightest modification. Basic kd-trees will not.

As an additional benefit, ABTs can easily be made completely dynamic. They will keep a near-perfect fit, even if the entire geometry they contain moves or changes, by using successive and lazy updates of individual tree branches. This is exactly because of the way they partition both space and geometry, and establish a link between them.

ABTs are not a space-partitioning structure if you use the common definition of "partition": a partition of a set is a collection of subsets such that the union of all subsets is the original set, and for any two distinct subsets S1 and S2 their intersection is empty. That isn't a bad thing per se; considering Yann originally used them to store mostly static geometry for rendering, IIRC, representing empty space is unnecessary.

EDIT: If you make the modifications needed to turn an ABT into a spatial partition, it turns into a kd-tree with a particular algorithm for choosing the split-plane position.

Besides, you can't tag "primitives" with a wasDrawn flag, because that disables fast submission of large batches of tris to the GPU, just like Yann said. So either you clip intersecting tris and live with the extra tris this generates, or you use a sloppy BVH to decrease the number of splits. If you're very geometry-bound (which you are if you're using occlusion culling), generating lots of extra tris isn't really a good thing.

Quote:
Original post by Yann L
Also, next generation GPUs will probably feature hardware support for bounding boxed geometry, ie. conditional batched rendering. This means, that the GPU will take the task of visibility determination on a higher level, ie. node frustum culling and high level occlusion culling, without interference from the CPU. ABTs will perfectly blend into this system, without even the slightest modification. Basic Kd-trees will not.


Well, the NV40 already supports (well, sort of supports) NVX_conditional_render, so it's already here in some form. However, if you restrict all geometry to lie in a (huge) AABB, it's perfectly possible to use a kd-tree and maintain an updated AABB that represents the current level as you traverse.
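Maintaining that AABB during traversal is cheap, since a kd split only moves one face of the box. Here's a minimal sketch of the idea; the names and types are illustrative, not from any real API:

```cpp
// Current-node bounding box, kept up to date while descending from the
// (huge) root box. Each split changes exactly one bound on one axis.
struct AABB {
    float min[3];
    float max[3];
};

// Box of the below-plane child: clamp the max bound on the split axis.
AABB belowChild(AABB box, int axis, float splitPos) {
    box.max[axis] = splitPos;
    return box;
}

// Box of the above-plane child: clamp the min bound on the split axis.
AABB aboveChild(AABB box, int axis, float splitPos) {
    box.min[axis] = splitPos;
    return box;
}
```

So at every node you have an exact AABB for the current subtree "for free" (one float store per descent), which is what you'd hand to conditional rendering instead of a stored per-node box.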

