Binary trees & large files

Started by
11 comments, last by Odoacer 20 years ago
Okay I''m working on a program for Markov chains and it works fine except there seems to be a limit on the size of my binary trees - I think I''m running out of stack space (or is the heap?)or something. Is there any way to get around this? I can''t get it to build a tree of Alice in Wonderland but it will work fine for smaller files like my collection of welsh fairy tales. I''d post my tree-building code here but its long.
Advertisement
No one has anything to say?
If I''m not making my problem clear, please say so and I''ll try to explain the best I can.
You haven''t given very much to go on. When you say that there''s a limit, you don''t say how you know there''s a limit. You don''t mention what programming language you''re working in, explain how you structure your tree, give a comparison in size of the welsh fairy tales verus "Alice in Wonderland", mention what platform you''re running on, etc.

But let''s start with the basics: what kind of error are you getting?
I''m using C++ with Dev-C++.

I''m fairly sure it''s some sort of limit because of the sizes of the files I''m working off - alice.txt is 148 KB, while welshfairytales.txt is 85.2. The program works perfectly unless I try to feed it a large file.

Welshfairytales.txt comes out to about 17,000 nodes. Each node stores one word the word that''s immediately after it. (I''m trying to make so it can lookahead more than one word, but that''s for next time...)

For alice.txt, the program gives me an error around the 31,000st node. For Through the Looking Glass (the sequel), it tops out at a little under 24,500. The disrepancy is probably due to word lengths, but it''s close enough to make me suspect there''s some kind of memory limit.

I can''t for the life of me figure out how to use the Dev-C++ debugger, but I''ve been able to glean that I''m getting an access violation (segmentation fault) when I try to run those programs. My experience with those has been with memory-related errors.

I structure my tree like so:
The typical left-right.
If a child node''s first word matches the first word of the parent, I''ve added a "singleMatch" list - the node goes here. If both words are identical, I add it to a "doubleMatch" list.

Here''s my struct:
struct WordNode{    int nodeNum;    apstring word1, word2;    WordNode *left, *right, *singleMatch, *doubleMatch;    (left out constructors for clarity)}; 


apstring is a string class I use, and I''m 99.9% sure there''s nothing wrong with it - it''s just a failsafe wrapper around char*[].

There exists a slight possibility that there really may be something wrong with the way I''m building my tree and the error only pops up under very exceptional circumstances, but I can''t think of what it could be. For now, try to work on what I''ve provided.
I''m not sure if you''re saying running the debugger gives you a segfault or your program gives you a segfault. You say your program "gives an error". What exactly do you mean by that. Does it pop up a message box? Does it spit something cryptic at the command prompt? Does it take over your speakers and scream "AIEEE! I''M ON FIRE!"?

However, it''s fairly unlikely that you''re running out of heap memory, unless I''m totally misunderstanding the nature of your binary tree. 31,000 nodes, at even something as large as 4K per node shouldn''t overflow the available user virtual memory of a windows machine.

It may be a stack overflow, depending on how you structured your tree generator. If you''re using Dev C++, then you''re probably using ld as your linker. Try getting at the linker settings and increasing the default stack size. The command line switch is --stack. The default reserve stack size is 2 MB for ld, so try bumping it up to 4 MB.
Odo, try appending your welsh fairy tales to itself, thus making a 170-kb file, and running your program on that.

"Sneftel is correct, if rather vulgar." --Flarelocke
Are you using a recursive function?

It seems odd that you''re overflowing with only 31000 entries.

How''re you allocating the memory for each node in the tree?
Are you allocating off of the heap? or are you using the stack?

Have you tried running on a machine with more memory?

If you are using a recursive function, you can cut down your stack size significantly by using a loop instead.

Function calls are really costly, since each function call has to store its values onto the stack.

Good luck.
The exact type of error is an ''illegal operation'' box that pops up. The details box shows it''s a stack fault in module MSVCRT.DLL.

The weird thing is, I did what Sneftel asked - I had my program read in my welshfairytales.txt file twice (3 & 4 times too actually) and other than a slightly delay when it builds the tree, there''s no problem.

I''m building my tree recursively - pretty much the only way I can think of doing it without having to go to the pain of pushing/popping off a stack manually.

Now I''m starting to think it may be something wrong with the way I''m building my tree...? (because that''s where it gives me the error).

//story: Our main document tree.//oldWord: The world to be compared, will be inserted as ''word1'' in the new WordNode element.//newWord: The word we''ve just recieved from text input, its only use now is to fill in ''word2''//         When the new node is created, newWord will go on to become the oldWord for the next //         node. (This takes place in the getFile function though.)WordNode &buildTree(WordNode &story, const apstring &oldWord, const apstring &newWord, const int &nodeNum){    if(oldWord == story.word1){ //checks if first word is the same        if(newWord == story.word2){ //checks for second word            if(story.doubleMatch == NULL) //double-word match                story.doubleMatch = new WordNode(oldWord, newWord, nodeNum);            else                buildTree(*story.doubleMatch, oldWord, newWord, nodeNum);        }else if(story.singleMatch == NULL){ //single-word match            story.singleMatch = new WordNode(oldWord, newWord, nodeNum); }        else            buildTree(*story.singleMatch, oldWord, newWord, nodeNum);     }else if(oldWord > story.word1){ //checks if larger        if(story.right == NULL)             story.right = new WordNode(oldWord, newWord, nodeNum);        else            buildTree(*story.right, oldWord, newWord, nodeNum);    }else{        if(story.left == NULL) { //if not larger, should be smaller (or equaal to)            story.left = new WordNode(oldWord, newWord, nodeNum);        } else {            buildTree(*story.left, oldWord, newWord, nodeNum); }   }    return story;}


See if you can take a stab at this and see if there''s a special case that''ll break the tree. I''ve tried to comment and make it as clear as possible.

Recursive function + stack error => most likely a stack overflow.

Did you try changing the settings on ld to adjust the executable's stack reserve size, as I mentioned in my earlier post?

edit: actually getting an iterative version of your tree builder isn't that bad.
WordNode & buildTree2(WordNode & story, const apstring & oldWord, const apstring & newWord, const int & nodeNum) {  WordNode * current_node = &story  for (;;) {    WordNode * WordNode::* path = 0;    if (oldWord == current_node->word1) {      if (newWord == current_node->word2) {        path = &WordNode::doubleMatch;      } else {        path = &WordNode::singleMatch;      }    } else {      if (oldWord > current_node->word1) {        path = &WordNode::right;      } else {        path = &WordNode::left;      }    }    if (current_node->*path == NULL) {      current_node->*path = new WordNode(oldWord, newWord, nodeNum);      return story;    } else {      current_node = current_node->*path;    }  }}


[edited by - SiCrane on April 9, 2004 4:27:03 AM]
Wow, your builder works perfectly, muchas gracias.

One question about your code though.

WordNode * WordNode::* path = 0; 


What is this and what does it do? I''ve never seen anything like it before.

Thank you for helping me rediscover a principle I realized in CS 101 - to make almost any recursive function iterative, all you have to do is add one more variable.

This topic is closed to new replies.

Advertisement