Sign in to follow this  
TheAdmiral

Excessive copying in a tree structure

Recommended Posts

Hi. A project I'm working on needs to parse an MP4 file. The macro-structure is very simple: the file is one big tree of what Apple have called 'atoms'. For those who don't know, an atom is a contiguous block of data which may contain other atoms or other unstructured data (but not both, incidentally). Obviously, this infrastructure lends itself nicely to OOP, so I've proceeded in the most predictable way; with a recursive class:
// Class declaration
class CAtom {
	// Some POD
	// ...
protected:
	vector<CAtom> m_children; // It's not important why this is protected
public:
	// More irrelevant details
	// ...

	// Constructors and operators
	CAtom();
	CAtom(const CAtom &rhs);
	CAtom(ifstream &in_file);
	CAtom& operator=(const CAtom& rhs);
	~CAtom();
};

// Definition of the pertinent constructor
CAtom::CAtom(ifstream &in_file) :
	// ...
{
	if (atom_is_container) {
		while (in_file.tellg() < m_offset + m_size) {
			if (in_file) {
				m_children.push_back(CAtom(in_file));
			}
		}
	} else {
		// Store unstructured data in a buffer
	}
}
The user passes an ifstream to the constructor of a new atom, and this constructor goes on to generate a vector of its own children as necessary, who, in turn, do the same. This works because of the depth-first nature of the atom structure on disk. I'm not convinced that passing a file stream around like this is ideal, so speak up if you have a better idea. Now the whole thing works, but it's really inefficient, due to the copytastic nature of vectors. Bearing in mind that each atom may contain arbitrarily many other atoms, which could feasibly contain several megabytes of data, my copy-constructor is somewhat intensive (deep copying is necessary), and the number of copies being made on even a modest MP4A file doesn't even bear thinking about. So it would make my day if anyone can either: 1) Provide a method of populating a vector without invoking the copy constructor. 2) Suggest a better design. 3) Otherwise chastise me for such awful code [wink]. Admiral

Share this post


Link to post
Share on other sites
Ok, it needs to be asked: [grin]
What about the following changes?
//...
vector<CAtom*> m_children;
//...
m_children.push_back(new CAtom(in_file));
//...


And do you really want to read entire mp4 file to RAM at once?
I don't think anyone would recommend doing that...

Share this post


Link to post
Share on other sites
Quote:
Original post by TheAdmiral
Hi.

A project I'm working on needs to parse an MP4 file. The macro-structure is very simple: the file is one big tree of what Apple have called 'atoms'. For those who don't know, an atom is a contiguous block of data which may contain other atoms or other unstructured data (but not both, incidentally). Obviously, this infrastructure lends itself nicely to OOP, so I've proceeded in the most predictable way; with a recursive class:

*** Source Snippet Removed ***
The user passes an ifstream to the constructor of a new atom, and this constructor goes on to generate a vector of its own children as necessary, who, in turn, do the same. This works because of the depth-first nature of the atom structure on disk. I'm not convinced that passing a file stream around like this is ideal, so speak up if you have a better idea.

Now the whole thing works, but it's really inefficient, due to the copytastic nature of vectors. Bearing in mind that each atom may contain arbitrarily many other atoms, which could feasibly contain several megabytes of data, my copy-constructor is somewhat intensive (deep copying is necessary), and the number of copies being made on even a modest MP4A file doesn't even bear thinking about.

So it would make my day if anyone can either:

1) Provide a method of populating a vector without invoking the copy constructor.
2) Suggest a better design.
3) Otherwise chastise me for such awful code [wink].

Admiral


There's an easier way to do this in languages that support more advanced control structures, but here's my crappy pseudocode solution (using some setter function to actually mutate data):

setter(stuff) {
foo = deep_copy(this);
foo.real_setter(stuff);
foo.rebuild(this) }
rebuild(cursor) {
if(cursor.parent == NULL) return true;
frob = shallow_copy(cursor.parent);
frob.child[cursor.index] = foo;
foo.parent = frob;
rebuild(cursor.parent, frob); }


Basically, you need to be able to travel from any point in the tree to any other, so every node contains not only it's children, but it's parent and its own index into the parent's children vector. This way, when we mutate a node, we only have to do a deep copy of the node itself, and then shallow copies all the way up the tree until we hit the root.

Share this post


Link to post
Share on other sites
Thanks for the prompt replies.

Deffer:

I considered the 'vector of pointers' approach, but I view it as a workaround at best. Things are bound to get pretty messy pretty quickly. That said, none of us learned C++ to avoid mess.

And yes, I do want to load an entire MP4 at once, although maybe what I want to do and what I should be doing are different things. The code will only accept MP4 audio (so nothing above 10MB, for typical popular music files), and I need to act on just about every part of it with effective random-access, so paging back and forth to disk seems like an unnecessary pain. If I get a little time, maybe I'll tell my program to refrain from loading the 'mdat' atoms (i.e. the lion's share) in full.

The Reindeer Effect:

Your idea is interesting, but it looks like there's a lot of housekeeping involved. Unless you know of an existing templated container that supports this functionality, I think I'd prefer the pointer-based approach.

Admiral

Share this post


Link to post
Share on other sites
Can you count the number of sub-atoms in each atom before you've read all its contents (either by there being an explicit count, or by seeking past each sub-atom to find the end)? If so, just use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.

Alternatively, you could make the !atom_is_container case remember only the offset to the data (and its size), so you read the whole file and construct a tree that only contains the control information. Then you can go back and find all the data-offset nodes in your tree and read the data into them, without altering the tree's structure.

If you don't want to seek at all, I can't think of anything other than dynamic allocation, though list<CAtom> might be easier than vector<CAtom*>.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Quote:
Original post by TheAdmiral
The Reindeer Effect:

Your idea is interesting, but it looks like there's a lot of housekeeping involved. Unless you know of an existing templated container that supports this functionality, I think I'd prefer the pointer-based approach.

Admiral


It's not really alot of housekeeping. All that is required is that your atoms have a pointer to the parent and the corresponding index into the parent's children vector. The logic behind rebuilding an atom tree with a copy replacing a specific atom is pretty simple; you could code it up in 5 minutes. And you don't really need setter functions, but you do need to deep copy an atom before making changes, and then rebuild the tree at some point in time. The housekeeping amounts to a single function call; I just chose to add this into setter methods so that the actual housekeeping process is opaque.

But yea, I'm sure that there are simpler ways to do what you want.

Share this post


Link to post
Share on other sites
Quote:
Original post by Excors
Can you count the number of sub-atoms in each atom before you've read all its contents (either by there being an explicit count, or by seeking past each sub-atom to find the end)? If so, just use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.
That wouldn't entirely eliminate the need to copy each atom at least once though, right? I'm already reserveing quite generously to avoid unnecessary reallocations, but push_back necessarily creates a copy. Copying each atom 'once' doesn't seem like such a big deal, but this inevitably involves copying all of the children. The crippling result of this is that the number of copies being made on a given atom grows exponentially with the depth into the tree.
Quote:
Alternatively, you could make the !atom_is_container case remember only the offset to the data (and its size), so you read the whole file and construct a tree that only contains the control information. Then you can go back and find all the data-offset nodes in your tree and read the data into them, without altering the tree's structure.
That's a great idea.

Quote:
If you don't want to seek at all, I can't think of anything other than dynamic allocation...
It's not the seeking that I'm worried about, but the trade-off between memory ping-pong and the absence of true copying. Your previous suggestion fixes both of these.

Thanks for the answers, all. I'll take a long look at my design and choose between my options.

Admiral

Share this post


Link to post
Share on other sites
Quote:
Original post by TheAdmiral
Quote:
Original post by Excors
... use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.
That wouldn't entirely eliminate the need to copy each atom at least once though, right? ...

Ah, I hadn't thought about that case of copying. In general, you can avoid the copying from temporary into vector by doing a children.push_back(CAtom()); children.back().read_from_file(in_file) (in addition to doing reserve so there's no copying caused by reallocation), though it does mean you can't use the CAtom(in_file) constructor and it doesn't feel especially elegant.

Share this post


Link to post
Share on other sites
class Mp4Atom
{
// oh noes, -pointers- to objects!
// yes, because this is a tree structure, pointers to the data
// are the best course of action, as otherwise you will end up
// with a massive amount of copying, as you've already guessed.
//
// I've chosen list here rather than vector because it allows
// constant time insertion and removal, although I don't know the
// Mp4 format, so I don't know if these operations happen often.

// shared pointers good idea? maybe just auto_ptr? (parent pointer
// would have to be raw pointer in that case)
boost::weak_ptr<Mp4Atom> m_parent;
std::list<boost::shared_ptr<Mp4Atom> > m_children;

public:
// all other stuff here
};

// Definition of the pertinent constructor
Mp4Atom::Mp4Atom(ifstream &in_file) :
// ...
{
if (atom_is_container)
{
while (in_file.tellg() < m_offset + m_size)
{
if (in_file)
m_children.push_back(boost::shared_ptr<Mp4Atom>(new Mp4Atom(in_file)));
}
}
else
{
// Store unstructured data in a buffer
}
}


That's how I'd write it. Eliminates redundant copying and sets up a flexible tree format. Also allows for most of what The Reindeer Effect was describing (although I'm not going include the index of an entry in a parent node simply because that immediately breaks all OOP rules).

Share this post


Link to post
Share on other sites
Represent your node (if you don't already; it looked to me like you don't, but the description of your struct is sufficiently vague...) as a begin-offset and length count in the file data, plus child nodes; then make a tree of those (rather lighter) objects, and store the whole file in a vector somewhere else, using the offsets and counts to index in. If the format is as simple as I think it is, then the non-container nodes simply have an empty child-vector, and the begin-offset and length count simply identify the location of that node's "unstructured data" - within the master storage vector, rather than any copy (BTW, you mentioned a POD? You're not copying into some fixed size array within the Atom struct, are you? Ewwww....).

This combines with the hold-pointers-to-nodes idea and actually sort of reflects it as well (the offsets are "pointers" in the main vector's "address space"). You might get vastly better results with a memory-mapped file (possibly using some platform-specific hacks?) than a std::vector for the "master" storage.

(EDIT: Going along with this - if you're reading the whole file at once, then you may as well really read the whole file at once - i.e. in one go, in order, into the master storage; then you can hop around inside the vector to find atom beginnings and endings, instead of seeking and telling around a file. Again, memory mapped files may make that moot.)

Share this post


Link to post
Share on other sites
Actually, here's the real problem, I think (although the previous idea again works in tandem with this): everything is copied at least once when it's push_back()ed into a vector, but naive recursive algorithms tend to do things in a depth-first rather than a breadth-first way - net result, you copy too much by (especially at the top level) creating the entire object before putting it into the vector, when you could put incomplete objects into the vector first and then complete them. Of course, this will require two-phase construction, and we don't want to expose that ugliness, but we can still handle things quite nicely - something like:


typedef std::vector<char> buffer;

class MP4Atom {
int begin, end;
vector<MP4Atom> children;

typedef std::vector<MP4Atom>::iterator iterator;

MP4Atom(int begin, int end) : begin(begin), end(end) {}

void loadChildren(const buffer& b) {
if (should_be_a_leaf()) { return; }
int cbegin = begin, cend;
// make proxies first
while((cend = nextChildBegin(b)) != end) {
children.push_back(MP4Atom(cbegin, cend));
cbegin = cend;
}
// then recurse
for (iterator it = children.begin(); it != children.end(); ++it) {
it->loadChildren(b);
}
}

public:
MP4Atom(const buffer& b) : begin(0), end(b.size()) {
loadChildren(b);
}

bool isLeaf() { return children.empty(); }
};

buffer readFile(istream& is) {
buffer result;
typedef std::istreambuf_iterator<char> in_iterator;
std::copy(in_iterator(is), in_iterator(),
std::back_inserter(result));
return result;
}

class MP4File {
buffer buf;
MP4Atom root;

public:
// Yes, this will copy the whole big *file buffer* an extra time or two
// but the fast way can't use an initialization list ;(
MP4File(istream& is) : buf(readFile(is)), root(buf) {}
};

Share this post


Link to post
Share on other sites
Quote:
Original post by Zahlman
Actually, here's the real problem, I think (although the previous idea again works in tandem with this): everything is copied at least once when it's push_back()ed into a vector, but naive recursive algorithms tend to do things in a depth-first rather than a breadth-first way - net result, you copy too much by (especially at the top level) creating the entire object before putting it into the vector, when you could put incomplete objects into the vector first and then complete them. Of course, this will require two-phase construction, and we don't want to expose that ugliness, but we can still handle things quite nicely - something like:

*** Source Snippet Removed ***


I'm sorry I don't really understand the problem.

You have a file on disk that contains a collection of nodes. These nodes are hiearchical. The root node is the first node in the tree, and the children follow it? There must be some information in the file to indicate the offsets of the children, as well.

What are you going to be doing with the data - manipulating the tree, or extracting the data and putting it somewhere else?

The three or four times I've had to deal with this kind of structure I've always simply loaded the entire file into memory. Once loaded, iterate over the data creating a tree of objects that simply point to the blocks of memory, and provide accessors ( Node::GetData() const ). The client shouldn't have to know that the data is not in the actual node, nor that it is in a contiguous block. If you need move blocks, just move the Nodes. When you serialize the data back to disk, walk to three and write the data out sequentially. If you need to add, just allocate more blocks and store a list of allocated blocks in the container for the tree - this can fee the blocks once the tree is destroyed. If you need to remove blocks just remove the nodes from the tree. If you want to get rid of the memory write a compact function that copies the data to a new block of memory by serializing the tree to memory, and switch the pointers to that - free the old block, and voila.

Maybe I am naive so if I am missing something let me know so that I can see where I am going wrong.

[Edited by - Sphet on March 3, 2007 4:29:33 PM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Sphet
The three or four times I've had to deal with this kind of structure I've always simply loaded the entire file into memory. Once loaded, iterate over the data creating a tree of objects that simply point to the blocks of memory, and provide accessors ( Node::GetData() const ). The client shouldn't have to know that the data is not in the actual node, nor that it is in a contiguous block. If you need move blocks, just move the Nodes.
...
Maybe I am naive so if I am missing something let me know so that I can see where I am going wrong.


What you're missing is:

(a) in my *first* post, I was suggesting exactly that.

(b) in my *second* post, I assumed that "just moving the Nodes" was *still* too expensive, because more Nodes were being moved than necessary. But this isn't "moving" for the purpose of rearranging the data, but for the purpose of constructing the tree in the first place. See, when you put something into a std::vector, it has to be copied in. The tree of objects in the OP's case is an N-ary tree, so a std::vector is being used to hold the children. If you build the children completely before putting them into the std::vector, then for each child, the entire subtree gets copied (recursively and indirectly, via the node's copy constructor, which uses the std::vector copy constructors). The second post is showing how to organize the code so that just the child nodes are constructed and put into the vector, and *then* the contents of the children are recursively constructed. This avoids extra copying work.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this