Word count problem

Started by
11 comments, last by Zahlman 17 years ago
Hi,

I've been working on this homework problem where I have to write a program that counts the number of unique words in a text file, ignoring punctuation (other than apostrophes) and capitalization. For example, "can't" is a different word from "cant", but "Dog" and "dog" are the same word.

I can't use the STL and the only headers I can use are <stdio.h>, <stdlib.h>, and <memory.h>.

My basic idea to approach this is as follows(pseudo-code):

    char c;
    wordArray[10000];
    wordCount = 0;
    while(!eof)
    {
        while(!whitespace || period || exclamation || questionmark)
        {
            wordHolder[20];
            arrayCounter = 0;
            c = getC(myFile);
            wordHolder[arrayCounter] = c;
            arrayCounter++;
        }
        CheckForDuplicate();
        wordHolder = wordArray[wordCount]; //I know this won't work...
        wordCount++;
    }

I know the above has some problems with it, but that's basically my idea of how to do this program given the parameters of the assignment. How do I get the character array into the word array as a complete word? Also, are there any other problems you guys can see me running into that I haven't identified?

Thanks in advance for any help.
------------------------Arise,arise, Riders of Theoden! Fell deeds awake: fire and slaughter! spears shall be shaken, shields be splintered, a sword-day, a red day, ere the sun rises! Ride now, ride now! Ride to Gondor!
Strings in C are copied using strcpy.

Also, why not directly read the word into the correct array position?
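A minimal sketch of both suggestions, assuming a fixed table of 10000 words of at most 19 characters each (the sizes from the pseudo-code). The names word_array, store_copy and store_direct are hypothetical. Note that strcpy is declared in <string.h> (<cstring> in C++), which is not on the list of allowed headers, so whether it may be used is an open question for the assignment.

```cpp
#include <cstdio>
#include <cstring>

const int MAX_WORDS = 10000;  // capacity taken from the pseudo-code
const int MAX_LEN = 20;       // longest word, including the terminating '\0'

// Hypothetical fixed-size word table; each row holds one C string.
static char word_array[MAX_WORDS][MAX_LEN];
static int word_count = 0;

// Option 1: copy a finished, null-terminated word into the next free slot.
bool store_copy(const char* holder) {
    if (word_count >= MAX_WORDS || std::strlen(holder) >= MAX_LEN) return false;
    std::strcpy(word_array[word_count], holder);
    ++word_count;
    return true;
}

// Option 2: read straight into the slot, so no temporary holder is needed.
// "%19s" skips leading whitespace and reads at most MAX_LEN - 1 characters.
bool store_direct(std::FILE* in) {
    if (word_count >= MAX_WORDS) return false;
    if (std::fscanf(in, "%19s", word_array[word_count]) != 1) return false;
    ++word_count;
    return true;
}
```

With the second option the slot is only "claimed" by incrementing word_count, so a duplicate can be discarded simply by not incrementing and letting the next read overwrite it.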
I wouldn't copy any strings.

I would make it a tree. Each node contains an array whose size equals the number of different characters you allow (A-Z and ' in this case). Just traverse the tree for each word, and when a space is found, determine whether that word already existed or whether the last character before the space was newly added. (A word that is a prefix of an earlier one, such as "can" after "cant", adds no new node, so each node also needs an end-of-word flag.) Keep a counter for each new word found.
Quote:Original post by ToohrVyk
Strings in C are copied using strcpy.

Also, why not directly read the word into the correct array position?


Well if I was able to create "string" variables, I would just have a "string" holder and set that equal to the current array position, and then keep iterating.

Can that be done in C?
Okay, never mind on the "C"-only stuff.

I was told I could use C++ to complete the assignment so that makes it a lot easier.

However, I had a question for the data structures experts in here.

I want to use an array to store my word list, and then check against the array to see if a given word is a duplicate or not.

Is this a good way to go about it?

Is a hash table better???
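A hash table is much better: checking every new word against an array is quadratic in the number of words, while a hash table makes the whole count linear on average (in the total length of the input). A trie is also linear in complexity.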
Moved to For Beginners.
Edit: OK, I should have read the rest of the thread first; but the fact that they'd even consider telling you what they originally told you sickens me, and I'm already sick.

Quote:Original post by xeloj
Hi,

I've been working on yadda yadda yadda...

I can't use the STL and the only headers I can use are <stdio.h>, <stdlib.h>, and <memory.h>.


My honest recommendation to you is to drop the course and tell everyone you can not to take it. Seriously. I am telling you this as a university graduate (who went through some similarly worthless programming courses) with real-world, professional C++ experience.

Unfortunately, I can't recommend any alternative courses for you at your school or university, and probably couldn't even if I knew what that school or university was. In fact, I can't even recommend, off the top of my head, a school or university that teaches things properly. I really can't. I really, really wish I could, but I can't. The state of things is really that bad.

If you're even *mentioning* the STL, then your course is presumably claiming to teach C++.

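stdio.h and stdlib.h have been deprecated in proper, standardized C++ since nineteen freakin' ninety-eight, and memory.h was never a standard header in the first place. The proper names are <cstdio> and <cstdlib>; what memory.h typically declares (memcpy and friends) lives in <cstring>, while <memory> is an unrelated C++ header.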

Also, "the STL" is poor phrasing, because what we are really talking about is the standard C++ library. Not all of the STL is available from the standard C++ headers, and there are many other things covered in the C++ standard library (for example, all the *stream headers).

But more to the point, consider that phrase. Standard C++ library. This is as close to built-in-to-the-language as it is possible for code to get. The reason that courses at "educational" institutions make you do things manually is in some vain hope that you will learn something about "how the machine works" by hands-on-and-in experience.

In my mind, it's something like trying to teach about electricity by drawing a diagram of a battery and of a lightbulb on a chalkboard, along with "V = IR"; then putting you in the lab with a beaker of sulfuric acid, some coins, alligator clips, and lots of bare copper wire and electrical tape; and hoping that you will figure out on your own the concept of "insulation", or that sulfuric acid is really not something you want to get on your skin. The main difference being that nothing you type into your computer is likely to injure or kill anyone.

Quote:
My basic idea to approach this is as follows(pseudo-code):


You might not want to make your pseudo-code look so much like the implementation language. :) (Python programmers have an excuse ;) )
Okay so here's what I have so far.

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <ctype.h>

    using namespace std;

    int main()
    {
        string filename = "test.txt";
        string wordList[10000];

        //Open File Stream
        ifstream file_stream;
        file_stream.open(filename.c_str());

        //Variable that keeps count of "all" words
        unsigned int total_words = 0;

        //Variable that keeps count of "unique words
        unsigned int unique_words = 0;

        if ( file_stream.is_open() )
        {
            while ( !file_stream.eof() )   // loop until the end of the file
            {
                string holder;  // just a holder
                getline( file_stream, holder );  // read line from the file
                cout << "Current Word: " << holder << endl;
                holder[0] = tolower(holder[0]); //Converts first letter capitals to lower case

                //total_words++;  // increment word count
                //cout << "We have " << total_words << " words in our array." << endl;

                for(int i = 0; i<10000; i++)
                {
                    if(holder == wordList[unique_words])
                    {
                        cout << holder << " creates collision(word is in current index)." << endl;
                    }
                    else
                    {
                        wordList[unique_words] = holder;
                        cout << holder << " is inserted into word list." << endl;
                        unique_words++;  // increment word count
                        cout << "Word " << unique_words << " in array: " << wordList[unique_words] << endl;
                        cout << "We have " << unique_words << " words in our array." << endl;
                        break;
                    }
                }// End for-loop
            }// End while-loop
        }
        else
        {
            //catch case
            cout << "Could not load file: " << filename << endl;
        }
        return 0;
    }


My problem seems to be that iterating through the array to check whether the word already exists in it doesn't work. Is it okay to compare the current value of holder, which is a string, against the array elements?
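Comparing two std::string values with == is fine; it compares their contents. The likely culprit is the loop itself: the posted code compares holder against wordList[unique_words], which is the next empty slot, instead of wordList[i], and it inserts on the first mismatch. One possible fix, sketched with a hypothetical helper add_if_new that takes the capacity as a parameter:

```cpp
#include <cstddef>
#include <string>

// Scans the words stored so far and appends `word` only if none of them matches.
// Returns true if the word was added, false if it is a duplicate or the list is full.
bool add_if_new(std::string word_list[], std::size_t capacity,
                std::size_t& unique_words, const std::string& word) {
    for (std::size_t i = 0; i < unique_words; ++i) {
        if (word_list[i] == word) return false;  // operator== compares contents
    }
    if (unique_words >= capacity) return false;
    word_list[unique_words++] = word;
    return true;
}
```

In the posted program this would be called as add_if_new(wordList, 10000, count, holder), with count declared as a std::size_t rather than an unsigned int.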

Also, I just noticed that my code only counts the words if each one is on its own line, but I want it to work through an arbitrary paragraph and still count the words.

Any tips would be helpful!

[Edited by - xeloj on April 20, 2007 3:14:17 AM]
I personally would create a quick linked list of words and just iterate through them to find uniqueness. I think the linked list is a little more work up front, but much easier to mess with once the groundwork is laid. If you are not allowed to use something like an STL vector, list, or set just create your own quick linked list that is pointer based.
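A pointer-based list along these lines could be sketched as follows; WordNode, add_unique and free_list are hypothetical names. Every lookup still walks the whole list, so the cost is comparable to scanning an array. The main gain is that there is no fixed capacity.

```cpp
#include <string>

// Hypothetical singly linked list node holding one word.
struct WordNode {
    std::string word;
    WordNode* next;
};

// Adds `word` to the front of the list unless it is already present.
// Returns true if a node was added.
bool add_unique(WordNode*& head, const std::string& word) {
    for (const WordNode* n = head; n != nullptr; n = n->next) {
        if (n->word == word) return false;
    }
    head = new WordNode{word, head};
    return true;
}

// Frees every node and leaves `head` null.
void free_list(WordNode*& head) {
    while (head != nullptr) {
        WordNode* next = head->next;
        delete head;
        head = next;
    }
}
```

Counting how often add_unique returns true gives the number of unique words.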


[Edited by - vtchill on April 20, 2007 4:16:07 PM]

This topic is closed to new replies.
