Excessive copying in a tree structure

12 comments, last by Zahlman 17 years, 1 month ago
Represent your node (if you don't already; it looked to me like you don't, but the description of your struct is sufficiently vague...) as a begin-offset and length count into the file data, plus child nodes; then make a tree of those (rather lighter) objects, and store the whole file in a vector somewhere else, using the offsets and counts to index in. If the format is as simple as I think it is, then the non-container nodes simply have an empty child vector, and the begin-offset and length count identify the location of that node's "unstructured data" - within the master storage vector, rather than in any copy. (BTW, you mentioned a POD? You're not copying into some fixed-size array within the Atom struct, are you? Ewwww....)

This combines with the hold-pointers-to-nodes idea and actually sort of reflects it as well (the offsets are "pointers" in the main vector's "address space"). You might get vastly better results with a memory-mapped file (possibly using some platform-specific hacks?) than with a std::vector for the "master" storage.

(EDIT: Going along with this - if you're reading the whole file at once, then you may as well really read the whole file at once - i.e. in one go, in order, into the master storage; then you can hop around inside the vector to find atom beginnings and endings, instead of seeking and telling around a file. Again, memory mapped files may make that moot.)
Actually, here's the real problem, I think (although the previous idea again works in tandem with this): everything is copied at least once when it's push_back()ed into a vector, but naive recursive algorithms tend to do things in a depth-first rather than a breadth-first way - net result, you copy too much by (especially at the top level) creating the entire object before putting it into the vector, when you could put incomplete objects into the vector first and then complete them. Of course, this will require two-phase construction, and we don't want to expose that ugliness, but we can still handle things quite nicely - something like:

typedef std::vector<char> buffer;

class MP4Atom {
  int begin, end;
  std::vector<MP4Atom> children;
  typedef std::vector<MP4Atom>::iterator iterator;

  MP4Atom(int begin, int end) : begin(begin), end(end) {}

  void loadChildren(const buffer& b) {
    if (should_be_a_leaf()) { return; }
    int cbegin = begin, cend;
    // make proxies first
    while ((cend = nextChildBegin(b)) != end) {
      children.push_back(MP4Atom(cbegin, cend));
      cbegin = cend;
    }
    // then recurse
    for (iterator it = children.begin(); it != children.end(); ++it) {
      it->loadChildren(b);
    }
  }

  public:
  MP4Atom(const buffer& b) : begin(0), end(b.size()) {
    loadChildren(b);
  }
  bool isLeaf() { return children.empty(); }
};

buffer readFile(std::istream& is) {
  buffer result;
  typedef std::istreambuf_iterator<char> in_iterator;
  std::copy(in_iterator(is), in_iterator(),
            std::back_inserter(result));
  return result;
}

class MP4File {
  buffer buf;
  MP4Atom root;
  public:
  // Yes, this will copy the whole big *file buffer* an extra time or two
  // but the fast way can't use an initialization list ;(
  MP4File(std::istream& is) : buf(readFile(is)), root(buf) {}
};
Quote:Original post by Zahlman
Actually, here's the real problem, I think (although the previous idea again works in tandem with this): everything is copied at least once when it's push_back()ed into a vector, but naive recursive algorithms tend to do things in a depth-first rather than a breadth-first way...

*** Source Snippet Removed ***


I'm sorry I don't really understand the problem.

You have a file on disk that contains a collection of nodes. These nodes are hierarchical. The root node is the first node in the tree, and the children follow it? There must be some information in the file to indicate the offsets of the children, as well.

What are you going to be doing with the data - manipulating the tree, or extracting the data and putting it somewhere else?

The three or four times I've had to deal with this kind of structure I've always simply loaded the entire file into memory. Once loaded, iterate over the data creating a tree of objects that simply point to the blocks of memory, and provide accessors ( Node::GetData() const ). The client shouldn't have to know that the data is not in the actual node, nor that it is in a contiguous block. If you need to move blocks, just move the Nodes. When you serialize the data back to disk, walk the tree and write the data out sequentially. If you need to add, just allocate more blocks and store a list of allocated blocks in the container for the tree - this can free the blocks once the tree is destroyed. If you need to remove blocks, just remove the nodes from the tree. If you want to reclaim the memory, write a compact function that copies the data to a new block of memory by serializing the tree to memory, switch the pointers to that, free the old block, and voila.

Maybe I am naive so if I am missing something let me know so that I can see where I am going wrong.

[Edited by - Sphet on March 3, 2007 4:29:33 PM]
Quote:Original post by Sphet
The three or four times I've had to deal with this kind of structure I've always simply loaded the entire file into memory. Once loaded, iterate over the data creating a tree of objects that simply point to the blocks of memory, and provide accessors ( Node::GetData() const ). The client shouldn't have to know that the data is not in the actual node, nor that it is in a contiguous block. If you need move blocks, just move the Nodes.
...
Maybe I am naive so if I am missing something let me know so that I can see where I am going wrong.


What you're missing is:

(a) in my *first* post, I was suggesting exactly that.

(b) in my *second* post, I assumed that "just moving the Nodes" was *still* too expensive, because more Nodes were being moved than necessary. But this isn't "moving" for the purpose of rearranging the data, but for the purpose of constructing the tree in the first place. See, when you put something into a std::vector, it has to be copied in. The tree of objects in the OP's case is an N-ary tree, so a std::vector is being used to hold the children. If you build the children completely before putting them into the std::vector, then for each child, the entire subtree gets copied (recursively and indirectly, via the node's copy constructor, which uses the std::vector copy constructors). The second post is showing how to organize the code so that just the child nodes are constructed and put into the vector, and *then* the contents of the children are recursively constructed. This avoids extra copying work.

