Archived

This topic is now archived and is closed to further replies.

Binary trees & large files

This topic is 4997 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Okay I''m working on a program for Markov chains and it works fine except there seems to be a limit on the size of my binary trees - I think I''m running out of stack space (or is the heap?)or something. Is there any way to get around this? I can''t get it to build a tree of Alice in Wonderland but it will work fine for smaller files like my collection of welsh fairy tales. I''d post my tree-building code here but its long.

Share this post


Link to post
Share on other sites
You haven''t given very much to go on. When you say that there''s a limit, you don''t say how you know there''s a limit. You don''t mention what programming language you''re working in, explain how you structure your tree, give a comparison in size of the welsh fairy tales verus "Alice in Wonderland", mention what platform you''re running on, etc.

But let''s start with the basics: what kind of error are you getting?

Share this post


Link to post
Share on other sites
I''m using C++ with Dev-C++.

I''m fairly sure it''s some sort of limit because of the sizes of the files I''m working off - alice.txt is 148 KB, while welshfairytales.txt is 85.2. The program works perfectly unless I try to feed it a large file.

Welshfairytales.txt comes out to about 17,000 nodes. Each node stores one word the word that''s immediately after it. (I''m trying to make so it can lookahead more than one word, but that''s for next time...)

For alice.txt, the program gives me an error around the 31,000st node. For Through the Looking Glass (the sequel), it tops out at a little under 24,500. The disrepancy is probably due to word lengths, but it''s close enough to make me suspect there''s some kind of memory limit.

I can''t for the life of me figure out how to use the Dev-C++ debugger, but I''ve been able to glean that I''m getting an access violation (segmentation fault) when I try to run those programs. My experience with those has been with memory-related errors.

I structure my tree like so:
The typical left-right.
If a child node''s first word matches the first word of the parent, I''ve added a "singleMatch" list - the node goes here. If both words are identical, I add it to a "doubleMatch" list.

Here''s my struct:

struct WordNode{
int nodeNum;
apstring word1, word2;
WordNode *left, *right, *singleMatch, *doubleMatch;
(left out constructors for clarity)
};


apstring is a string class I use, and I''m 99.9% sure there''s nothing wrong with it - it''s just a failsafe wrapper around char*[].

There exists a slight possibility that there really may be something wrong with the way I''m building my tree and the error only pops up under very exceptional circumstances, but I can''t think of what it could be. For now, try to work on what I''ve provided.

Share this post


Link to post
Share on other sites
I''m not sure if you''re saying running the debugger gives you a segfault or your program gives you a segfault. You say your program "gives an error". What exactly do you mean by that. Does it pop up a message box? Does it spit something cryptic at the command prompt? Does it take over your speakers and scream "AIEEE! I''M ON FIRE!"?

However, it''s fairly unlikely that you''re running out of heap memory, unless I''m totally misunderstanding the nature of your binary tree. 31,000 nodes, at even something as large as 4K per node shouldn''t overflow the available user virtual memory of a windows machine.

It may be a stack overflow, depending on how you structured your tree generator. If you''re using Dev C++, then you''re probably using ld as your linker. Try getting at the linker settings and increasing the default stack size. The command line switch is --stack. The default reserve stack size is 2 MB for ld, so try bumping it up to 4 MB.

Share this post


Link to post
Share on other sites
Odo, try appending your welsh fairy tales to itself, thus making a 170-kb file, and running your program on that.


"Sneftel is correct, if rather vulgar." --Flarelocke

Share this post


Link to post
Share on other sites
Are you using a recursive function?

It seems odd that you''re overflowing with only 31000 entries.

How''re you allocating the memory for each node in the tree?
Are you allocating off of the heap? or are you using the stack?

Have you tried running on a machine with more memory?

If you are using a recursive function, you can cut down your stack size significantly by using a loop instead.

Function calls are really costly, since each function call has to store its values onto the stack.

Good luck.

Share this post


Link to post
Share on other sites
The exact type of error is an ''illegal operation'' box that pops up. The details box shows it''s a stack fault in module MSVCRT.DLL.

The weird thing is, I did what Sneftel asked - I had my program read in my welshfairytales.txt file twice (3 & 4 times too actually) and other than a slightly delay when it builds the tree, there''s no problem.

I''m building my tree recursively - pretty much the only way I can think of doing it without having to go to the pain of pushing/popping off a stack manually.

Now I''m starting to think it may be something wrong with the way I''m building my tree...? (because that''s where it gives me the error).


//story: Our main document tree.

//oldWord: The world to be compared, will be inserted as ''word1'' in the new WordNode element.

//newWord: The word we''ve just recieved from text input, its only use now is to fill in ''word2''

// When the new node is created, newWord will go on to become the oldWord for the next

// node. (This takes place in the getFile function though.)

WordNode &buildTree(WordNode &story, const apstring &oldWord, const apstring &newWord, const int &nodeNum){
if(oldWord == story.word1){ //checks if first word is the same

if(newWord == story.word2){ //checks for second word

if(story.doubleMatch == NULL) //double-word match

story.doubleMatch = new WordNode(oldWord, newWord, nodeNum);
else
buildTree(*story.doubleMatch, oldWord, newWord, nodeNum);
}else if(story.singleMatch == NULL){ //single-word match

story.singleMatch = new WordNode(oldWord, newWord, nodeNum); }
else
buildTree(*story.singleMatch, oldWord, newWord, nodeNum);
}else if(oldWord > story.word1){ //checks if larger

if(story.right == NULL)
story.right = new WordNode(oldWord, newWord, nodeNum);
else
buildTree(*story.right, oldWord, newWord, nodeNum);
}else{
if(story.left == NULL) { //if not larger, should be smaller (or equaal to)

story.left = new WordNode(oldWord, newWord, nodeNum);
} else {
buildTree(*story.left, oldWord, newWord, nodeNum); }
}
return story;
}


See if you can take a stab at this and see if there''s a special case that''ll break the tree. I''ve tried to comment and make it as clear as possible.

Share this post


Link to post
Share on other sites
Recursive function + stack error => most likely a stack overflow.

Did you try changing the settings on ld to adjust the executable's stack reserve size, as I mentioned in my earlier post?

edit: actually getting an iterative version of your tree builder isn't that bad.

WordNode & buildTree2(WordNode & story, const apstring & oldWord, const apstring & newWord, const int & nodeNum) {
WordNode * current_node = &story;
for (;;) {
WordNode * WordNode::* path = 0;
if (oldWord == current_node->word1) {
if (newWord == current_node->word2) {
path = &WordNode::doubleMatch;
} else {
path = &WordNode::singleMatch;
}
} else {
if (oldWord > current_node->word1) {
path = &WordNode::right;
} else {
path = &WordNode::left;
}
}

if (current_node->*path == NULL) {
current_node->*path = new WordNode(oldWord, newWord, nodeNum);
return story;
} else {
current_node = current_node->*path;
}
}
}


[edited by - SiCrane on April 9, 2004 4:27:03 AM]

Share this post


Link to post
Share on other sites
Wow, your builder works perfectly, muchas gracias.

One question about your code though.

WordNode * WordNode::* path = 0; 


What is this and what does it do? I''ve never seen anything like it before.

Thank you for helping me rediscover a principle I realized in CS 101 - to make almost any recursive function iterative, all you have to do is add one more variable.

Share this post


Link to post
Share on other sites
quote:
Original post by Odoacer
WordNode * WordNode::* path = 0;  


What is this and what does it do? I''ve never seen anything like it before.


It''s a pointer to a WordNode* member of the WordNode class.
class Foo
{
public:
int Bar;
};

int Foo::* mptr = &Foo::Bar;
Foo object;
object.*mptr = 25;

Foo* pointer = &object;
pointer->*mptr = 30;



“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
— Brian W. Kernighan

Share this post


Link to post
Share on other sites
It''s called a pointer to member. It''s basically the notation for pointer to member functions generalized to class data. Or you can think of it as a specialization of the pointer notation.

I guess the best way to understand it is to look at the equivalent code that doesn''t use pointer to members.

WordNode & buildTree2(WordNode & story, const apstring & oldWord, const apstring & newWord, const int & nodeNum) {
WordNode * current_node = &story;
for (;;) {
WordNode ** path = 0;
if (oldWord == current_node->word1) {
if (newWord == current_node->word2) {
path = ¤t_node->doubleMatch;
} else {
path = ¤t_node->singleMatch;
}
} else {
if (oldWord > current_node->word1) {
path = ¤t_node->right;
} else {
path = ¤t_node->left;
}
}

if (*path == NULL) {
*path = new WordNode(oldWord, newWord, nodeNum);
return story;
} else {
current_node = *path;
}
}
}

In this version I use a pointer to a WordNode pointer for the type of path. So let''s compare the differences. In the first version path is defined with an extra bit that says that it''s a pointer to a member that is a pointer instead of a plain pointer to a pointer.

WordNode * * path;
WordNode * WordeNode::* path; // that''s the extra WordNode:: part

To assign the a pointer to member, instead of taking the address of the a specific instance, I take the address of the class''s variable.

path = ¤t_node->right; // address of instance
path = &WordNode::right; // address of class

To dereference a pointer to member, I need an object of the class instead of just the pointer.

(*path == NULL) // plain dereference
(current_node->*path == NULL) // derefernce as part of object

Basically, I used a pointer to member pointer instead of a regular pointer to pointer in order emphasize that the program iterates along the structure of the tree. There are also (minor) type safety advantages. But mostly it''s a form of commenting that uses the code''s structure as a comment.

Share this post


Link to post
Share on other sites