Endar

Simple parsing of a text file


Nothing special so far, except that here is the parse function:
[source="cpp"]

int main()
{
	char s[1024];
	Parser p;
	Parser::Identifier id;

	ifstream file("test.txt");

	while( file.getline(s, 1024) ){
		p.giveLine(s);

		while( id=p.parse() ){
			if( id == -1 || id == -2 )
				break;
			
			if( id == Parser::NUMBER )
				cout << "NUMBER: ";
			else if( id == Parser::STRING )
				cout << "STRING: ";
			else if( id == Parser::QUOTED_STRING )
				cout << "QUOTED_STRING: ";
			else if( id == Parser::OPERATOR )
				cout << "OPERATOR: ";


			// print parser.text
			cout << p.text << endl;
			//cout << (int)p.text[0] << endl;
		}
	}

	cout << endl << endl;

	return 0;
}

enum Identifier{
	NUMBER = (Instruction::NUM_OF_INSTRUCTIONS + 1),	///< Found a number
	STRING,			///< Found a string (letters and numbers)
	QUOTED_STRING,	///< Found a string surrounded by quotes: "hello"
	OPERATOR,		///< Found an operator.

	NUM_OF_IDENTIFIERS	///< Implicit number of identifiers
};


// Note: buffer is a char[1024], text and start are char*
// data members.
Identifier parse()
{
	int i, j;
	Identifier id;
	// Skip all whitespaces
	for(i=0; start[i]==' ' || start[i]=='\t' ; i++ );

	if( start[i] == '\n' || start[i] == '\0' )
		return (Identifier) -1;

	// if first character is a letter
	if( isalpha(start[i]) != 0 ){
		// Copy characters until is not a letter or a number (can have numbers in strings) anymore
		for(j=0; isalpha(start[i])!=0 || isdigit(start[i])!=0 ; j++, i++)
			buffer[j] = start[i];

		id = STRING;
	}

	// If first character is a number
	else if( isdigit(start[i]) != 0 ){
		// Copy character until is not a digit
		for(j=0; isdigit(start[i]) != 0; j++, i++)
			buffer[j] = start[i];

		id = NUMBER;
	}

	// if first character is a "
	else if( start[i] == '"' ){
		// to pass first "
		i++;
		// Copy all characters until another quote
		for( j=0; start[i] != '"' ; j++, i++)
			buffer[j] = start[i];

		id = QUOTED_STRING;
	}

	// if none of them
	else
		return (Identifier) -2;

	// end buffer string
	buffer[j] = '\0';

	// Free the text
	if( text != NULL ){
		free(text);
		text = NULL;
	}

	// Allocate space and copy 'buffer' string to 'text' for outside access
	text = strdup(buffer);

	// advance start pointer to start of next part to parse
	start += (i+1);		// + 1 else the last letter looked at last time will be the first this time
			
	// return the identifier of the token that was found
	return id;
}
[/source]


The point is to be able to supply a line to the parser and then call the parse function several times until it returns -1. Each time parse is called, it works on the next bit of the line until it hits a newline char or a null char. Here is the test file:



hello

67464
hr56355dfsd

dfgs56

"dghd4564m,lh89"





And finally, here is the output:

STRING: hello
STRING: ²²²²
NUMBER: 67464
STRING: ²²²²
STRING: hr56355dfsd
STRING: ²²²²
STRING: dfgs56
STRING: ²²²²
QUOTED_STRING: dghd4564m,lh89
STRING: ²²²²



Can anyone think of a reason why I'm getting these superscript (or sub, not sure which is which) 2's? I assume it's something to do with the attempt to print out the 'text' pointer when it points to nothing, but I'm not sure why it's happening.

I can't tell you too much without seeing the class declaration, but I think you are on the right track for the debugging.

However, better to stop this problem where it starts - get rid of all this weird manipulation of the "text" member, which apparently is a char * (since you're using strdup). In C++ there is generally no good reason for this low-level hackage. Make the member be a std::string, and use its .assign() method to change the contents (or the assignment operator, if you extract a std::string from the input buffer).
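
For what it's worth, here is a minimal sketch of what that could look like, assuming the parser only needs the STRING, NUMBER and QUOTED_STRING cases from the first post. The names LineParser, Token and next() are made up for the example and are not the original class. It keeps an index into a std::string instead of a moving char*, so it cannot step past the end of the line. The original `start += (i+1)` moves past the terminating null whenever a token is the last thing on the line, which is a likely source of the garbage output.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical rework of the Parser from the first post; LineParser,
// Token and next() are illustrative names, not the original API.
enum class Token { End, Unknown, Number, String, QuotedString };

class LineParser {
public:
    std::string text;  // text of the last token, owned by the object

    void giveLine(const std::string& line) { line_ = line; pos_ = 0; }

    Token next() {
        const std::size_t n = line_.size();

        // Skip spaces and tabs.
        while (pos_ < n && (line_[pos_] == ' ' || line_[pos_] == '\t')) ++pos_;
        if (pos_ >= n) return Token::End;

        // The <cctype> functions need a non-negative value, hence the cast.
        auto at = [this](std::size_t i) {
            return static_cast<unsigned char>(line_[i]);
        };

        std::size_t begin = pos_;
        Token id = Token::Unknown;

        if (std::isalpha(at(pos_))) {
            // Letters followed by any mix of letters and digits.
            while (pos_ < n && std::isalnum(at(pos_))) ++pos_;
            id = Token::String;
        } else if (std::isdigit(at(pos_))) {
            while (pos_ < n && std::isdigit(at(pos_))) ++pos_;
            id = Token::Number;
        } else if (line_[pos_] == '"') {
            begin = ++pos_;  // skip the opening quote
            while (pos_ < n && line_[pos_] != '"') ++pos_;
            text.assign(line_, begin, pos_ - begin);
            if (pos_ < n) ++pos_;  // skip the closing quote, if there is one
            return Token::QuotedString;
        } else {
            // Unrecognised character: report it without advancing,
            // like the -2 return in the original code.
            return Token::Unknown;
        }

        // pos_ now sits on the first character after the token and is
        // never moved past the end of the line.
        text.assign(line_, begin, pos_ - begin);
        return id;
    }

private:
    std::string line_;
    std::size_t pos_ = 0;
};
```

The calling loop stays much the same: giveLine() with each line read by std::getline, then call next() until it returns Token::End or Token::Unknown. There is no strdup/free pair to get wrong, because text manages its own memory.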

Better yet, why not let cin do (most of) the parsing for you?

pseudocode:

Define a base class "Token", and subclasses Number, Operator, etc.
(Each of these contains a single data member of the appropriate
type, which holds the information for that token, e.g. the
numeric value of a Number.) The base class instances are empty.

Try to cin into an int variable
If successful: return Number(the int variable)
(otherwise, clear the stream's error state with cin.clear() and...)
cin into a char variable
If it's a valid operator: return Operator(the char variable)
else if it's a double-quote:
    Use the delimiter form of std::getline() to read up to the next double-quote into a std::string: std::getline(std::cin, temp, '"');
    return QuotedString(temp)
(otherwise, we have a normal string...)
Make a new string with the char variable
cin a single word (by cin into a std::string) and append it to the new string
return String(the string)

I don't know, but perhaps Flex & Bison could be of use to you? [smile]

EDIT: Hehe, in case you don't know what Flex & Bison are...
They're parsing utilities. Flex can be used to quickly get tokens from a text file, and combined with Bison, in which you describe how tokens are combined (the grammar), you can create incredibly powerful parsing routines.
Parsing C-style code has never been easier [grin]

my guess is that the line

ifstream file("test.txt");

is opening your text file in binary. in binary mode, on some systems, a newline written in text mode is stored as the character 0x0D followed by 0x0A. the 0x0A character is the actual newline character. this is transparent when reading the file in text mode.

and so your parse function is breaking one char before the newline character (at the 0x0D char), and your STRING test case is accepting the 0x0D char as a one-character string before it actually reaches 0x0A (newline).

to fix this, you could either open the file in text mode, or change the lines in your parse() function from:

if( start[i] == '\n' || start[i] == '\0' )
	return (Identifier) -1;

to

if (start[i] == '\n' || start[i] == '\0' || (start[i] == '\x0d' && start[i+1] == '\x0a'))
	return (Identifier) -1;
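
another option, sketched here under the assumption that lines are held in a std::string (the first post uses a char array instead), is to strip a trailing carriage return right after each line is read, so parse() never sees it. stripTrailingCR and splitLines are hypothetical helper names, and splitLines works on an in-memory string only to keep the example self-contained.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: drop one trailing carriage return (0x0D), if present.
void stripTrailingCR(std::string& line) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
}

// Split text into lines the way std::getline does, then remove the
// '\r' that is left behind when the text uses "\r\n" line endings.
std::vector<std::string> splitLines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {  // getline removes the '\n' itself
        stripTrailingCR(line);
        lines.push_back(line);
    }
    return lines;
}
```

with a file the loop is the same, only the stream is a std::ifstream. note that getline already removes the '\n', so after a stray 0x0D the next character in the buffer is the terminating null rather than 0x0A.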

illone

Some theories...

Quote:

my guess is that the line

ifstream file("test.txt");

is opening your text file in binary.


This should not be the case. File streams open in text mode by default; to open a file in binary mode with fstream, you have to pass ios::binary explicitly, like this:
ifstream file("test.txt", ios::binary);
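
If you want to check what a given mode hands back, a small helper along these lines can show it. This is only a sketch, and firstLine is a name I made up. In binary mode a file with "\r\n" line endings yields a first line that still ends in '\r'; in the default text mode the result depends on the platform, since only some systems translate "\r\n" to '\n'.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: return the first line of a file opened with the
// given mode (std::ifstream always adds std::ios_base::in itself).
std::string firstLine(const char* path, std::ios_base::openmode mode) {
    std::ifstream file(path, mode);
    std::string line;
    std::getline(file, line);  // stops at '\n' and discards it
    return line;
}
```

Calling it once with `std::ios_base::in` and once with `std::ios_base::in | std::ios_base::binary` on the same file, and comparing the length of the two results, shows whether the platform translates line endings in text mode.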


About the OP's code. You write

else if( start[i] == '"' ){
// to pass first "
i++;
...

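I wasn't sure whether the "-sign needs a backslash (escape sequence) there, but inside a character literal it doesn't: '"' and '\"' are the same character, so that line is fine as written.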

