Excessive copying in a tree structure

Started by
12 comments, last by Zahlman 17 years, 1 month ago
Hi. A project I'm working on needs to parse an MP4 file. The macro-structure is very simple: the file is one big tree of what Apple have called 'atoms'. For those who don't know, an atom is a contiguous block of data which may contain other atoms or other unstructured data (but not both, incidentally). Obviously, this infrastructure lends itself nicely to OOP, so I've proceeded in the most predictable way; with a recursive class:
// Class declaration
class CAtom {
	// Some POD
	// ...
protected:
	vector<CAtom> m_children; // It's not important why this is protected
public:
	// More irrelevant details
	// ...

	// Constructors and operators
	CAtom();
	CAtom(const CAtom &rhs);
	CAtom(ifstream &in_file);
	CAtom& operator=(const CAtom& rhs);
	~CAtom();
};

// Definition of the pertinent constructor
CAtom::CAtom(ifstream &in_file) :
	// ...
{
	if (atom_is_container) {
		while (in_file.tellg() < m_offset + m_size) {
			if (in_file) {
				m_children.push_back(CAtom(in_file));
			}
		}
	} else {
		// Store unstructured data in a buffer
	}
}
The user passes an ifstream to the constructor of a new atom, and this constructor goes on to generate a vector of its own children as necessary, who, in turn, do the same. This works because of the depth-first nature of the atom structure on disk. I'm not convinced that passing a file stream around like this is ideal, so speak up if you have a better idea. Now the whole thing works, but it's really inefficient, due to the copytastic nature of vectors. Bearing in mind that each atom may contain arbitrarily many other atoms, which could feasibly contain several megabytes of data, my copy-constructor is somewhat intensive (deep copying is necessary), and the number of copies being made on even a modest MP4A file doesn't even bear thinking about. So it would make my day if anyone can either: 1) Provide a method of populating a vector without invoking the copy constructor. 2) Suggest a better design. 3) Otherwise chastise me for such awful code [wink]. Admiral
Ring3 Circus - Diary of a programmer, journal of a hacker.
Advertisement
Ok, it needs to be asked: [grin]
What about the following changes?
//...vector<CAtom*> m_children;//...m_children.push_back(new CAtom(in_file));//...


And do you really want to read entire mp4 file to RAM at once?
I don't think anyone would recommend doing that...
Quote:Original post by TheAdmiral
Hi.

A project I'm working on needs to parse an MP4 file. The macro-structure is very simple: the file is one big tree of what Apple have called 'atoms'. For those who don't know, an atom is a contiguous block of data which may contain other atoms or other unstructured data (but not both, incidentally). Obviously, this infrastructure lends itself nicely to OOP, so I've proceeded in the most predictable way; with a recursive class:

*** Source Snippet Removed ***
The user passes an ifstream to the constructor of a new atom, and this constructor goes on to generate a vector of its own children as necessary, who, in turn, do the same. This works because of the depth-first nature of the atom structure on disk. I'm not convinced that passing a file stream around like this is ideal, so speak up if you have a better idea.

Now the whole thing works, but it's really inefficient, due to the copytastic nature of vectors. Bearing in mind that each atom may contain arbitrarily many other atoms, which could feasibly contain several megabytes of data, my copy-constructor is somewhat intensive (deep copying is necessary), and the number of copies being made on even a modest MP4A file doesn't even bear thinking about.

So it would make my day if anyone can either:

1) Provide a method of populating a vector without invoking the copy constructor.
2) Suggest a better design.
3) Otherwise chastise me for such awful code [wink].

Admiral


There's an easier way to do this in languages that support more advanced control structures, but here's my crappy pseudocode solution (using some setter function to actually mutate data):
setter(stuff) {   foo = deep_copy(this);   foo.real_setter(stuff);   foo.rebuild(this) }rebuild(cursor) {   if(cursor.parent == NULL) return true;   frob = shallow_copy(cursor.parent);   frob.child[cursor.index] = foo;   foo.parent = frob;   rebuild(cursor.parent, frob); }


Basically, you need to be able to travel from any point in the tree to any other, so every node contains not only it's children, but it's parent and its own index into the parent's children vector. This way, when we mutate a node, we only have to do a deep copy of the node itself, and then shallow copies all the way up the tree until we hit the root.
Thanks for the prompt replies.

Deffer:

I considered the 'vector of pointers' approach, but I view it as a workaround at best. Things are bound to get pretty messy pretty quickly. That said, none of us learned C++ to avoid mess.

And yes, I do want to load an entire MP4 at once, although maybe what I want to do and what I should be doing are different things. The code will only accept MP4 audio (so nothing above 10MB, for typical popular music files), and I need to act on just about every part of it with effective random-access, so paging back and forth to disk seems like an unnecessary pain. If I get a little time, maybe I'll tell my program to refrain from loading the 'mdat' atoms (i.e. the lion's share) in full.

The Reindeer Effect:

Your idea is interesting, but it looks like there's a lot of housekeeping involved. Unless you know of an existing templated container that supports this functionality, I think I'd prefer the pointer-based approach.

Admiral
Ring3 Circus - Diary of a programmer, journal of a hacker.
Can you count the number of sub-atoms in each atom before you've read all its contents (either by there being an explicit count, or by seeking past each sub-atom to find the end)? If so, just use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.

Alternatively, you could make the !atom_is_container case remember only the offset to the data (and its size), so you read the whole file and construct a tree that only contains the control information. Then you can go back and find all the data-offset nodes in your tree and read the data into them, without altering the tree's structure.

If you don't want to seek at all, I can't think of anything other than dynamic allocation, though list<CAtom> might be easier than vector<CAtom*>.
Quote:Original post by TheAdmiral
The Reindeer Effect:

Your idea is interesting, but it looks like there's a lot of housekeeping involved. Unless you know of an existing templated container that supports this functionality, I think I'd prefer the pointer-based approach.

Admiral


It's not really alot of housekeeping. All that is required is that your atoms have a pointer to the parent and the corresponding index into the parent's children vector. The logic behind rebuilding an atom tree with a copy replacing a specific atom is pretty simple; you could code it up in 5 minutes. And you don't really need setter functions, but you do need to deep copy an atom before making changes, and then rebuild the tree at some point in time. The housekeeping amounts to a single function call; I just chose to add this into setter methods so that the actual housekeeping process is opaque.

But yea, I'm sure that there are simpler ways to do what you want.

Quote:Original post by Excors
Can you count the number of sub-atoms in each atom before you've read all its contents (either by there being an explicit count, or by seeking past each sub-atom to find the end)? If so, just use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.
That wouldn't entirely eliminate the need to copy each atom at least once though, right? I'm already reserveing quite generously to avoid unnecessary reallocations, but push_back necessarily creates a copy. Copying each atom 'once' doesn't seem like such a big deal, but this inevitably involves copying all of the children. The crippling result of this is that the number of copies being made on a given atom grows exponentially with the depth into the tree.
Quote:Alternatively, you could make the !atom_is_container case remember only the offset to the data (and its size), so you read the whole file and construct a tree that only contains the control information. Then you can go back and find all the data-offset nodes in your tree and read the data into them, without altering the tree's structure.
That's a great idea.

Quote:If you don't want to seek at all, I can't think of anything other than dynamic allocation...
It's not the seeking that I'm worried about, but the trade-off between memory ping-pong and the absence of true copying. Your previous suggestion fixes both of these.

Thanks for the answers, all. I'll take a long look at my design and choose between my options.

Admiral
Ring3 Circus - Diary of a programmer, journal of a hacker.
Quote:Original post by TheAdmiral
Quote:Original post by Excors
... use vector::reserve to set the right size before doing any push_backs, and then it will never have to reallocate and copy.
That wouldn't entirely eliminate the need to copy each atom at least once though, right? ...

Ah, I hadn't thought about that case of copying. In general, you can avoid the copying from temporary into vector by doing a children.push_back(CAtom()); children.back().read_from_file(in_file) (in addition to doing reserve so there's no copying caused by reallocation), though it does mean you can't use the CAtom(in_file) constructor and it doesn't feel especially elegant.
Use a list instead of a vector :P Identical code, no reallocations.

Edit: missed the random access bit. Hmm... it's a thought though!
Construct (Free open-source game creator)
class Mp4Atom{    // oh noes, -pointers- to objects!    // yes, because this is a tree structure, pointers to the data    // are the best course of action, as otherwise you will end up    // with a massive amount of copying, as you've already guessed.    //    // I've chosen list here rather than vector because it allows    // constant time insertion and removal, although I don't know the    // Mp4 format, so I don't know if these operations happen often.    // shared pointers good idea? maybe just auto_ptr? (parent pointer    // would have to be raw pointer in that case)    boost::weak_ptr<Mp4Atom> m_parent;    std::list<boost::shared_ptr<Mp4Atom> > m_children;public:    // all other stuff here};// Definition of the pertinent constructorMp4Atom::Mp4Atom(ifstream &in_file) :	// ...{    if (atom_is_container)    {        while (in_file.tellg() < m_offset + m_size)        {            if (in_file)                 m_children.push_back(boost::shared_ptr<Mp4Atom>(new Mp4Atom(in_file)));        }    }    else    {        // Store unstructured data in a buffer    }}


That's how I'd write it. Eliminates redundant copying and sets up a flexible tree format. Also allows for most of what The Reindeer Effect was describing (although I'm not going include the index of an entry in a parent node simply because that immediately breaks all OOP rules).
[ search: google ][ programming: msdn | boost | opengl ][ languages: nihongo ]

This topic is closed to new replies.

Advertisement