KD-Tree nodes in 4 bytes

Started by
31 comments, last by Eelco 19 years, 2 months ago
ive been thinking about ways to reduce nodes in a kd-tree (as applied to raytracing, but i believe it goes for other applications just as well) to 4 bytes in size, halving them from the currently widely used 8 bytes. three improvements ive come up with:

- a node is either a leaf or has one of three split axes specified. thats four states, so two bits are needed, not the commonly used 3.

- you dont need nearly as many bits for a child pointer as are commonly used. its really simple: if your tree has 64k node pairs, so 128k nodes (not a small tree), you need only 16 bits to differentiate between them. just multiply this value by the size of a node pair and add the root node's memory address to get the child's offset. however, it can be done even better by using relative pointers, specifying the distance a child is away from its parent. again using 16 bits as an example, a parent and its child are constrained to being 64k node pairs (128k nodes) apart. without any clever node reorganization, just a naive 'sort all nodes by tree depth', this would mean the maximum width of the tree would be 128k nodes. if kd-trees were balanced this would mean a maximum of 256k nodes; however, they are highly unbalanced, meaning much deeper and hence less broad, so this figure will be highly stretchable. i suspect 1M nodes wouldnt be any problem, but this requires some testing. also, a tree can be built so as to limit width.

- again using 16 bits as an example for pointer data, that leaves us with 14 bits for the split-plane position. as the clever reader notes: you cant fit a float in there :). however, 14 bits is still a resolution of 16k nonetheless. thats still a lot better than an octree with its resolution of 2, and not much worse than a float can provide, with its anisotropic resolution. (a float can only guarantee about 7 decimal digits of precision (correct me if im wrong), and the 4-5 digits you get from 14 bits isnt much worse.)

also, each split plane could be specified relative to its parent, yielding a much better resolution at the lower levels of the tree, where precise positioning actually matters. so it seems that with some adaptation its quite possible to halve the memory needed for your tree.

there is one main point of concern i have though: fixed-point math & the split-plane position. i must say im not too familiar with the precise details of it, and whether its possible without completely castrating your traversal speed. or would it for example be possible to interpret these 14 bits (or more, for trees which need a smaller maximum depth) as the most significant bits of a float's mantissa? or if that doesnt work, is there another way to store only part of a float, discarding some precision?

also, im aware of the problem of specifying a split plane at lower resolution than that of your vertex coordinates. however, this can be taken into account during tree construction: if the resolution of the position of a plane is too low to determine whether an AABB falls on one side or the other, it simply falls in both. it slightly compromises the efficiency of your tree, but:

- once again, it still beats octrees hands down.
- by doing this you also address numerical instabilities (which a 32-bit float split-plane position also suffers from) that can lead to incorrectly missed triangles. namely, a vertex will never again lie precisely on a split plane, but always a distance away from it, providing a buffer against numerical instabilities.

another (experimental) solution: store vertices in 16-bit fixed point. the split-plane position should easily surpass that at leaf level. since rays get transformed into model space in my implementation, accuracy shouldnt be a problem. however, fixed-point math again might be.

other experimental idea: make split-position precision dynamic: ie, minimize and count the maximum number of bits needed per child pointer for each tree level, and use all unneeded bits for the split plane. that should yield very accurate split planes in most cases, and where it doesnt, well, youll just suffer a slightly less efficiently constructed tree.

wow, thats a lot of rambling, and there are probably not even a handful of people visiting these boards that care at all about this, but for those that do care and have taken the time to give this some thought: id love to hear your vision on this. i think the possibilities are promising, but maybe im overlooking something. is this even worth it, for instance? it must save some memory and memory bandwidth, as well as improve cache performance, but there are probably some extra instructions needed here and there, especially concerning the fixed-point math.
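as a sketch, the packing im describing could look something like this in C++ (the field names and exact widths are just the example values above, not tested code):

```cpp
#include <cstdint>

// One possible 4-byte layout for the scheme above (field widths and names
// are illustrative, not an existing implementation):
//   bits  0-1 : split axis (0, 1, 2) or 3 = leaf
//   bits  2-17: 16-bit child offset, counted in node pairs relative to the parent
//   bits 18-31: 14-bit fixed-point split-plane position
struct PackedNode { uint32_t bits; };

inline PackedNode packNode(uint32_t axis, uint32_t childOffset, uint32_t splitPos) {
    PackedNode n;
    n.bits = (axis & 0x3u) | ((childOffset & 0xffffu) << 2) | ((splitPos & 0x3fffu) << 18);
    return n;
}

inline uint32_t nodeAxis(PackedNode n)        { return n.bits & 0x3u; }
inline bool     nodeIsLeaf(PackedNode n)      { return (n.bits & 0x3u) == 3u; }
inline uint32_t nodeChildOffset(PackedNode n) { return (n.bits >> 2) & 0xffffu; }
inline uint32_t nodeSplitPos(PackedNode n)    { return n.bits >> 18; }
```

the whole node stays in one 32-bit word, so the leaf test, the child offset and the split position are each a mask and a shift away.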
Very interesting. I was wondering the same thing lately, but I hadn't thought about it in this much detail.


I get around 6 bytes per triangle average storage with my aabb tree. It's really not clever - the main idea is to store several tris at once (2-8) in a leaf, which amortizes the cost of the full 24-byte AABB at each node & leaf.

I could get it much smaller if I compressed the AABB, but it's not a bottleneck right now...
I posted once about how I store my kd nodes in 32 bits. I put the axis/leaf identifier in the lower 2 bits of the split-plane. If the low 2 bits specify 'leaf', then the upper 30 bits are an offset into a list of primitive indices. The primitive indices are specified by a count of the primitives in the leaf in the first entry, then a list of unsigned ints that point to the actual leaf data.

 _______________ ______   ______________ ______   ______________ ________
| 30-bit offset | leaf | | 30-bit offset| leaf | | 30-bit split | x-axis |
 --------------- ------   -------------- ------   -------------- --------
      |                 /      |                /
   ___|_____ _______ __/______ _____
  | # prims | 0 1 2 | # prims | 2 4 |
   --------- ------- --------- -----
         /  /    \    ______/   \
        /   |     \  /           \
    [leaf] [leaf] [leaf] [leaf] [leaf]
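In code, that layout might decode like this (a sketch with illustrative helper names, not the actual implementation):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the 32-bit node described above: the low 2 bits select the
// x/y/z axis (0/1/2) or mark a leaf (3). For a leaf, the upper 30 bits are
// the offset into the primitive-index list; for an inner node they are the
// split position, stored as a float whose two lowest mantissa bits have
// been sacrificed to the tag.
inline bool     isLeafWord(uint32_t word) { return (word & 0x3u) == 3u; }
inline uint32_t leafOffset(uint32_t word) { return word >> 2; }
inline uint32_t splitAxis(uint32_t word)  { return word & 0x3u; }

inline float splitPosition(uint32_t word) {
    uint32_t bits = word & ~0x3u; // clear the tag; only the 2 low mantissa bits are lost
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```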


and the children of the node stored at m_nodes[i] are implicitly stored at left = m_nodes[2i] and right = m_nodes[2i + 1].
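In other words, the child addresses are pure index arithmetic (a trivial sketch, with the root at index 1):

```cpp
#include <cstddef>

// Implicit ("heap-style") layout: with the root at index 1, the children
// of node i sit at 2i and 2i+1, so no child pointer is stored at all.
inline std::size_t leftChild(std::size_t i)  { return 2 * i; }
inline std::size_t rightChild(std::size_t i) { return 2 * i + 1; }
inline std::size_t parentOf(std::size_t i)   { return i / 2; }
```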
i do nearly the same thing for my leafs. i store the number of primitives in a leaf in the node itself as well. plenty of space, and it saves a dereference on empty nodes.

as for implicitly storing nodes: isnt that very memory AND cache inefficient? note that this is just my reasoning, i havent tested this, but deeper down, the tree hardly grows exponentially anymore; even more so, it should shrink in the deepest levels. youll have seas of empty space deep in your tree, and no cache coherency at all.

however, after posting this topic i had a similar idea. to reduce the number of bits needed to specify an offset, you could, instead of specifying the distance from object to object, specify the distance from a trendline, for example j = 2*i, although i think j = i is better on average. if you store the widest level of the tree, you could use a varying trendline depending on whether its expected to grow or shrink. that really should bring the number of bits needed for a decent-sized tree back to something like 14 or so, leaving a nice 16 bits for the split-plane location.
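a quick sketch of what i mean (the names are made up, and the prediction j = 2*i is just the example above):

```cpp
#include <cstdint>

// Trendline-relative child pointers: store the child's index as a signed
// delta from a predicted index j = 2*i rather than as an absolute value.
// As long as the node layout roughly follows the prediction, the delta
// fits in far fewer bits than the index itself.
inline int16_t encodeChildDelta(uint32_t parent, uint32_t child) {
    return (int16_t)((int32_t)child - (int32_t)(2 * parent));
}

inline uint32_t decodeChildDelta(uint32_t parent, int16_t delta) {
    return (uint32_t)((int32_t)(2 * parent) + delta);
}
```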

if you store vertices as 16bit short, you only need 16bit short splitplane positions anyway.

however, im still not sure about the fixed-point aspect of all the maths, and its speed in particular. maybe i should start a topic on that in the math forums.
By tracking the aabb of a node while traversing the tree, you could use your 16-bits like this:

unsigned int nodeBits = m_nodes[currentNodeIndex];
unsigned int splitIndex = nodeBits & 0xffff0000; // Upper 16 bits are the split position
float weight = float(splitIndex) * (1.0f / float(0xffffffff)); // Get the weight [0, 1]
float splitPos = aabbMax[axis] * weight + aabbMin[axis] * (1.0f - weight); // Get the actual split position


You could probably skip clearing the lower bits, since they would only shift the position very slightly.
Quote:Original post by Eelco
i store the number of primitives in a leaf in the node itself as well. plenty of space, and it saves a dereference on empty nodes.
...
as for implicitly storing nodes: isnt that very memory AND cache inefficient? note that this is just my reasoning, i havent tested this, but deeper down, the tree hardly grows exponentially anymore; even more so, it should shrink in the deepest levels. youll have seas of empty space deep in your tree, and no cache coherency at all.


right, if the leaf is empty I save the offset as 0, so there is no dereference. And as for memory/cache efficient... those are 2 different questions. Yes, memory usage is exponential, but even at the 20th level it's only 4 megs. Beyond that it gets much less efficient. Phantomus talked about a 45-deep tree for Buddah... I don't have 128 terabytes of RAM, so it's not an option for me :)

As for cache-efficient, that was the whole reason why I originally came up with this scheme. It lets you determine the addresses of grandchild nodes without dereferencing, and prefetch way ahead of time. Also, the grandchildren are stored consecutively in 32 bytes -- the size of a cache line -- so they can all be prefetched in one call.

As it turned out, when I profiled with prefetching turned off, it really didn't spend any time stalling on d-cache. I'm honestly not sure why this is. I was kind of mad because it eliminated what I thought was a huge performance benefit of my design. Oh well.
yeah, thats sort of what i was thinking. by specifying everything (memory locations and positions in space) relative to a parent, you can achieve the needed resolution in only a couple of bits. however, any casts to float ought to be avoided in your inner loop, so i think another method should be used.

for example, you could store 7 exponent bits and 9 mantissa bits to store a truncated float in the range 0-1.

thinking about the implicit storage of nodes again, i just realized a way to do it with minimal memory waste. say you have a tree level i and j = i+1, with two indices in them, I and J respectively. for each node in i, count how many leafs came before it in i. since leafs wont have children, you can then store semi-implicit pointers of the form J = j.start + 2*(I - leafsBeforeI). if you put all node pairs which are both children at the end of a level, i think you could build huge trees with only a couple of bits.
Quote:Original post by ajas95
Quote:Original post by Eelco
i store the number of primitives in a leaf in the node itself as well. plenty of space, and it saves a dereference on empty nodes.
...
as for implicitly storing nodes: isnt that very memory AND cache inefficient? note that this is just my reasoning, i havent tested this, but deeper down, the tree hardly grows exponentially anymore; even more so, it should shrink in the deepest levels. youll have seas of empty space deep in your tree, and no cache coherency at all.


right, if the leaf is empty I save the offset as 0, so there is no dereference. And as for memory/cache efficient... those are 2 different questions. Yes, memory usage is exponential, but even at the 20th level it's only 4 megs. Beyond that it gets much less efficient. Phantomus talked about a 45-deep tree for Buddah... I don't have 128 terabytes of RAM, so it's not an option for me :)

As for cache-efficient, that was the whole reason why I originally came up with this scheme. It lets you determine the addresses of grandchild nodes without dereferencing, and prefetch way ahead of time. Also, the grandchildren are stored consecutively in 32 bytes -- the size of a cache line -- so they can all be prefetched in one call.

As it turned out, when I profiled with prefetching turned off, it really didn't spend any time stalling on d-cache. I'm honestly not sure why this is. I was kind of mad because it eliminated what I thought was a huge performance benefit of my design. Oh well.

yeah id also think prefetching grandchildren would be a good idea.

but youre saying it basically didnt matter? was it only the prefetching that didnt help, or was the switch to 4 bytes simply completely useless?


btw, even if there is no speed benefit, i still like the idea of fixed-point math. i cant stand floats and the fact that their resolution isnt constant, whereas for such an application it obviously should be.
Quote:Original post by Eelco
btw, even if there is no speed benefit, i still like the idea of fixed-point math. i cant stand floats and the fact that their resolution isnt constant, whereas for such an application it obviously should be.


I'm not sure squeezing nodes to 32bits is a win. First, once you have accessed a node you've wasted an entire cache line for it, say 64 bytes (read locality matters more).

Then, and that's the biggest problem in my eye, when doing a traversal à la Wald you better have the distance computed as early as possible, because you have few cycles to waste on concurrent computation before you have to decide (that is, use that distance) where you're going.
Going from fixed point to float/vector is going to take time you don't have; unless you make an all fixed point raytracer :)


Btw there's a good reason to use 3+1 bits to encode the leaf flag and axis (like Wald pointed out): with the leaf bit as the msb you can just test for signedness to know whether it's a node or a leaf.
Then if you store the axis in the lowest part, you just need a compact 'and' with a small immediate to extract it, and you don't need to shift/scale the offset before you use it.
Can't be more efficient computationally.
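In code, the two tests boil down to this (a sketch, the helper names are illustrative):

```cpp
#include <cstdint>

// Leaf flag in the most significant bit: the leaf test is a plain signed
// comparison. Axis in the two lowest bits: a single 'and' with a small
// immediate extracts it, with no shifting.
inline bool     isLeafNode(uint32_t word) { return (int32_t)word < 0; }
inline uint32_t axisOf(uint32_t word)     { return word & 0x3u; }
```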
Quote:Original post by tbp
Quote:Original post by Eelco
btw, even if there is no speed benefit, i still like the idea of fixed-point math. i cant stand floats and the fact that their resolution isnt constant, whereas for such an application it obviously should be.


I'm not sure squeezing nodes to 32bits is a win. First, once you have accessed a node you've wasted an entire cache line for it, say 64 bytes (read locality matters more).

yeah that makes sense.

Quote:
Then, and that's the biggest problem in my eye, when doing a traversal à la Wald you better have the distance computed as early as possible, because you have few cycles to waste on concurrent computation before you have to decide (that is, use that distance) where you're going.
Going from fixed point to float/vector is going to take time you don't have; unless you make an all fixed point raytracer :)

does the smiley indicate you think thats a bad idea? note that im not a numerical guru, but i have the feeling floats are used far too widely, in applications where they shouldnt be. however, going fixed point is often considered a hassle, because everything is built around floats, for seemingly no other reason than that theyre assumed to be the better choice.

why would i want to store vertex positions as floats? i might want to do so for vertices in camera space, where adaptive resolution makes sense, but i see no compelling reason to let my coordinates in model space have a resolution of 2^-128 near the origin, and a resolution many orders of magnitude coarser just one unit away from it. floats enable me to cover a huge range, but my model space (in which i perform all calculations) is very much a bounded domain, so any ability to specify positions beyond it is wasted bits.

Quote:
Btw there's a good reason to use 3+1 bits to encode the leaf flag and axis (like Wald pointed out): with the leaf bit as the msb you can just test for signedness to know whether it's a node or a leaf.
Then if you store the axis in the lowest part, you just need a compact 'and' with a small immediate to extract it, and you don't need to shift/scale the offset before you use it.
Can't be more efficient computationally.

yeah that makes sense from an efficiency POV. i was assuming simple operations like bitshifting and bitmasking could be scheduled away somehow, but i suppose thats a little optimistic.

so your conclusion is that going 32-bit is a waste of time? i certainly am inclined to say so after the input ive been given (for which id like to thank you all).

however, id still like to investigate the possibilities of fixed-point tracing further. i have applied fixed-point math to a raycaster once, but needless to say it was a little less time critical. i cant remember the specifics, but it was a huge speed boost in texture lookup/interpolation anyway, and in general i dont see why it has to be slower, as long as you stay away from casts. and of course make sure your algorithms are stable with fixed-point math under any circumstances, ie no overflows.

This topic is closed to new replies.
