Word count problem
Hi,
I've been working on this homework problem where I have to write a program which counts the number of unique words in a text file, ignoring punctuation (other than apostrophes) and capitalization.
For example "can't" is a different word from "cant", but "Dog" and "dog" are the same word.
I can't use the STL and the only headers I can use are <stdio.h>, <stdlib.h>, and <memory.h>.
My basic idea to approach this is as follows (pseudo-code):
char c;
wordArray[10000];
wordCount = 0;
while(!eof) {
    while(!whitespace || period || exclamation || questionmark ){
        wordHolder[20];
        arrayCounter = 0;
        c = getC (myFile);
        wordHolder[arrayCounter] = c;
        arrayCounter++;
    }
    CheckForDuplicate();
    wordHolder = wordArray[wordCount]; //I know this won't work...
    wordCount++;
}
I know the above has some problems with it, but that's basically my idea of how to do this program given the parameters of the assignment.
How do I get the character array into the word array as a complete word?
Also are there any other problems you guys can see me running into that I haven't identified? Thanks in advance for any help.
Strings in C are copied using strcpy.
Also, why not directly read the word into the correct array position?
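A minimal sketch of the strcpy suggestion, assuming a fixed table of 10000 slots and words of at most 20 characters as in the pseudo-code above. The names MAX_WORDS, MAX_LEN and storeWord are made up for illustration. Note that strcpy is declared in <string.h> (<cstring> in C++), which is not on the assignment's list of allowed headers, so under that restriction the copy would have to be a hand-written loop or a memcpy.

```cpp
#include <cstring>

const int MAX_WORDS = 10000;  // assumed capacity, taken from the pseudo-code
const int MAX_LEN = 20;       // assumed longest word, not counting the '\0'

char wordArray[MAX_WORDS][MAX_LEN + 1];  // one row per stored word
int wordCount = 0;

// Copy a finished, null-terminated word into the next free row.
// Returns false if the table is full or the word does not fit.
bool storeWord(const char* wordHolder) {
    if (wordCount >= MAX_WORDS || std::strlen(wordHolder) > MAX_LEN) {
        return false;
    }
    std::strcpy(wordArray[wordCount], wordHolder);
    ++wordCount;
    return true;
}
```

Reading characters straight into wordArray[wordCount] instead of a separate wordHolder would avoid the copy altogether, which is what the second suggestion amounts to.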
I wouldn't copy any strings.
I would make it a tree. Each node contains an array whose size equals the number of different characters you allow (A-Z and ' in this case). Just traverse the tree for each word, and when a space is found, check whether that word already existed or whether the node for the last character before the space was newly added (or was never marked as the end of a word). Keep a counter for each new word found.
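One possible shape for that tree (a trie), sketched under a few assumptions: 27 child slots per node (a-z plus the apostrophe), an explicit end-of-word flag so that "do" still counts as new after "dog", and any other character simply skipped. TrieNode, slotFor and insertWord are hypothetical names, and the nodes are never freed here to keep the sketch short.

```cpp
const int ALPHABET = 27;  // 26 letters plus the apostrophe

struct TrieNode {
    TrieNode* child[ALPHABET] = {};  // all slots start out null
    bool endsWord = false;           // true if some word ends at this node
};

// Map a character to a child slot, folding case; -1 means "not part of a word".
int slotFor(char c) {
    if (c == '\'') return 26;
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    return -1;
}

// Walk the trie along the word, creating nodes as needed.
// Returns true only if the word was not in the trie before.
bool insertWord(TrieNode* root, const char* word) {
    TrieNode* node = root;
    for (; *word != '\0'; ++word) {
        int slot = slotFor(*word);
        if (slot < 0) continue;  // ignore punctuation
        if (node->child[slot] == nullptr) {
            node->child[slot] = new TrieNode;  // leaked in this sketch
        }
        node = node->child[slot];
    }
    if (node == root || node->endsWord) return false;  // empty or already seen
    node->endsWord = true;
    return true;
}
```

The unique-word count is then just the number of times insertWord returns true.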
Quote:Original post by ToohrVyk
Strings in C are copied using strcpy.
Also, why not directly read the word into the correct array position?
Well if I was able to create "string" variables, I would just have a "string" holder and set that equal to the current array position, and then keep iterating.
Can that be done in C?
Okay nevermind on the "C" only stuff.
I was told I could use C++ to complete the assignment so that makes it a lot easier.
However, I had a question for the data structures experts in here.
I want to use an array to store my word list, and then check against the array to see if a given word is a duplicate or not.
Is this a good way to go about it?
Is a hash table better???
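A hash table is much better: checking every word against an array is a linear scan per word, which makes the whole count quadratic in the number of words, whereas a hash table makes it linear on average. A trie is also linear, in the total number of characters read.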
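To make the comparison concrete, here is a small sketch of both approaches. It assumes a C++11 compiler (std::unordered_set is the standard hash set from C++11 on) and that the words have already been split and normalized. The function names are invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hash set: each insert is O(1) on average, so the whole pass is about O(n).
std::size_t countUniqueHashed(const std::vector<std::string>& words) {
    std::unordered_set<std::string> seen;
    for (const std::string& w : words) {
        seen.insert(w);  // duplicates are ignored by the set
    }
    return seen.size();
}

// Array scan: every new word is compared against all stored ones, O(n^2) worst case.
std::size_t countUniqueScanned(const std::vector<std::string>& words) {
    std::vector<std::string> unique;
    for (const std::string& w : words) {
        bool found = false;
        for (const std::string& u : unique) {
            if (u == w) { found = true; break; }
        }
        if (!found) unique.push_back(w);
    }
    return unique.size();
}
```

Both return the same count; for a homework-sized file the difference is unlikely to be noticeable, so the array version is fine if it is easier to reason about.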
Edit: OK, I should have read the rest of the thread first; but the fact that they'd even consider telling you what they originally told you sickens me, and I'm already sick.
Quote:Original post by xeloj
Hi,
I've been working on yadda yadda yadda...
I can't use the STL and the only headers I can use are <stdio.h>, <stdlib.h>, and <memory.h>.
My honest recommendation to you is to drop the course and tell everyone you can not to take it. Seriously. I am telling you this as a university graduate (who went through some similarly worthless programming courses) with real-world, professional C++ experience.
Unfortunately, I can't recommend any alternative courses for you at your school or university, and probably couldn't even if I knew what that school or university was. In fact, I can't even recommend, off the top of my head, a school or university that teaches things properly. I really can't. I really, really wish I could, but I can't. The state of things is really that bad.
If you're even *mentioning* the STL, then your course is presumably claiming to teach C++.
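stdio.h and stdlib.h survive in proper, standardized C++ only as deprecated compatibility headers, as of nineteen freakin' ninety-eight, and memory.h was never standard at all. The proper names are <cstdio> and <cstdlib>; the functions memory.h usually declares (memcpy and friends) live in <cstring>, and <memory> is a different header entirely.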
Also, "the STL" is poor phrasing, because what we are really talking about is the standard C++ library. Not all of the STL is available from the standard C++ headers, and there are many other things covered in the C++ standard library (for example, all the *stream headers).
But more to the point, consider that phrase. Standard C++ library. This is as close to built-in-to-the-language as it is possible for code to get. The reason that courses at "educational" institutions make you do things manually is in some vain hope that you will learn something about "how the machine works" by hands-on-and-in experience.
In my mind, it's something like trying to teach about electricity by drawing a diagram of a battery and of a lightbulb on a chalkboard, along with "V = IR"; then putting you in the lab with a beaker of sulfuric acid, some coins, alligator clips, and lots of bare copper wire and electrical tape; and hoping that you will figure out on your own the concept of "insulation", or that sulfuric acid is really not something you want to get on your skin. The main difference being that nothing you type into your computer is likely to injure or kill anyone.
Quote:
My basic idea to approach this is as follows (pseudo-code):
You might not want to make your pseudo-code look so much like the implementation language. :) (Python programmers have an excuse ;) )
Okay so here's what I have so far.
#include <iostream>
#include <fstream>
#include <string>
#include <ctype.h>

using namespace std;

int main()
{
    string filename = "test.txt";
    string wordList[10000];

    //Open File Stream
    ifstream file_stream;
    file_stream.open(filename.c_str());

    //Variable that keeps count of "all" words
    unsigned int total_words = 0;
    //Variable that keeps count of "unique" words
    unsigned int unique_words = 0;

    if ( file_stream.is_open() )
    {
        while ( !file_stream.eof() ) // loop until the end of the file
        {
            string holder; // just a holder
            getline( file_stream, holder ); // read line from the file
            cout << "Current Word: " << holder << endl;
            holder[0] = tolower(holder[0]); //Converts first letter capitals to lower case
            //total_words++; // increment word count
            //cout << "We have " << total_words << " words in our array." << endl;
            for(int i = 0; i<10000; i++)
            {
                if(holder == wordList[unique_words])
                {
                    cout << holder << " creates collision(word is in current index)." << endl;
                }
                else
                {
                    wordList[unique_words] = holder;
                    cout << holder << " is inserted into word list." << endl;
                    unique_words++; // increment word count
                    cout << "Word " << unique_words << " in array: " << wordList[unique_words] << endl;
                    cout << "We have " << unique_words << " words in our array." << endl;
                    break;
                }
            }// End for-loop
        }// End while-loop
    }
    else
    {
        //catch case
        cout << "Could not load file: " << filename << endl;
    }
    return 0;
}
My problem seems to be that when I iterate through the array to check whether the word already exists in it, I get problems. Is it okay to compare the current value of holder, which is a string, against the array elements?
Also, I just noticed that my code only counts the words if each one is on its own line, but I want it to work through an arbitrary paragraph and still count the words.
Any tips would be helpful!
[Edited by - xeloj on April 20, 2007 3:14:17 AM]
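Comparing a std::string against an array element with == is fine. The trouble in the posted loop looks like it comes from elsewhere: it compares against wordList[unique_words] (the next empty slot) rather than wordList[i], getline reads a whole line rather than one word, only the first letter is lower-cased, and looping on eof() processes one extra empty read at the end. A hedged sketch of one way to address those points follows. normalize and countUniqueWords are names made up here, a std::vector stands in for the fixed 10000-element array, and punctuation inside a token is simply dropped (so "well-known" becomes "wellknown"), which is an assumption about what the assignment wants.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Lower-case every letter and keep only letters and apostrophes.
std::string normalize(const std::string& raw) {
    std::string word;
    for (char c : raw) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        if ((c >= 'a' && c <= 'z') || c == '\'') word += c;
    }
    return word;
}

// Count distinct words in a stream of whitespace-separated tokens.
std::size_t countUniqueWords(std::istream& in) {
    std::vector<std::string> wordList;
    std::string token;
    while (in >> token) {  // loop on the read itself, not on eof()
        std::string word = normalize(token);
        if (word.empty()) continue;  // token was all punctuation
        bool seen = false;
        for (std::size_t i = 0; i < wordList.size(); ++i) {
            if (wordList[i] == word) {  // compare against slot i, not the end
                seen = true;
                break;
            }
        }
        if (!seen) wordList.push_back(word);
    }
    return wordList.size();
}
```

Because operator>> splits on any whitespace, this works on a paragraph as well as on one word per line. Passing the already opened ifstream from the posted code to countUniqueWords should be enough to try it on test.txt.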
I personally would create a quick linked list of words and just iterate through them to find uniqueness. I think the linked list is a little more work up front, but much easier to mess with once the groundwork is laid. If you are not allowed to use something like an STL vector, list, or set just create your own quick linked list that is pointer based.
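For what it's worth, a pointer-based list along those lines might look roughly like this. WordNode, addIfNew and freeList are invented names, each word is assumed to arrive already lower-cased and stripped of punctuation, and new words are put at the front of the list because that is the cheapest place to insert.

```cpp
#include <cstring>

struct WordNode {
    char* text;      // heap copy of the word
    WordNode* next;  // next node, or nullptr at the end
};

// Scan the list; if the word is absent, prepend a copy and return true.
bool addIfNew(WordNode*& head, const char* word) {
    for (WordNode* n = head; n != nullptr; n = n->next) {
        if (std::strcmp(n->text, word) == 0) return false;  // duplicate
    }
    WordNode* node = new WordNode;
    node->text = new char[std::strlen(word) + 1];
    std::strcpy(node->text, word);
    node->next = head;
    head = node;
    return true;
}

// Release every node and its string.
void freeList(WordNode*& head) {
    while (head != nullptr) {
        WordNode* next = head->next;
        delete[] head->text;
        delete head;
        head = next;
    }
}
```

The lookup is still a linear scan, so this has the same quadratic behaviour as the array; what it buys is not having to pick a maximum word count up front.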
[Edited by - vtchill on April 20, 2007 4:16:07 PM]