How does C++ read its syntax in?

21 comments, last by GameMasterXL 18 years, 8 months ago
I am just wondering: does C++ read the whole source file into one line inside of a stack, or does it read each individual line? Like this:

#include <iostream>
using namespace std;
int main()
{
cout << "Hello World!!" << endl;
return 0;
}

Would it read this in like this:

#include <iostream>using namespace std;int main(){cout << "Hello World!!" << endl;return 0;}

Or just like this:

#include <iostream> // link code here... after linkage now
using namespace std; // validate this line
// skip white space
int main() // validate this line
{ // read in start block character
cout << "Hello World!!" << endl; // validate this line
return 0; // read in this line
} // read in end block character
Any C++ parser worth its salt completely ignores whitespace (except where it is needed to separate tokens). So the end result is much like your first option, except that of course the preprocessor interprets the # directives and acts accordingly (in this case, by inserting the contents of iostream). Validation doesn't occur until after the code has been converted into internal symbols.
JohnE, Chief Architect and Senior Programmer, Twilight Dragon Media | GCC/MinGW | Code::Blocks IDE | wxWidgets Cross-Platform Native UI Framework
Typically, it's an even more complex version of your second example. [smile]

One thing you need to know, though, is that preprocessor directives are handled before the compiler even sees the code. The compiler wouldn't see #include <iostream>. Instead, the data stream it receives directly includes the contents of the iostream header. Likewise, all macros have already been expanded.
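As a small illustration of that point, here is a sketch with two made-up macros (GREETING and SQUARE are hypothetical names, not from anyone's post). By the time the compiler proper runs, the macro names are gone and only their expansions remain.

```cpp
#include <string>

// Hypothetical macros, used only for illustration.
#define GREETING "Hello World!!"
#define SQUARE(x) ((x) * (x))

// After preprocessing, the compiler sees: return "Hello World!!";
std::string greeting() { return GREETING; }

// After preprocessing, the compiler sees: return ((2 + 3) * (2 + 3));
int squareOfFive() { return SQUARE(2 + 3); }
```

With GCC you can look at the preprocessed text yourself by running g++ -E on the file; the #include line is replaced by the header's contents and the macros are expanded in place.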

Another thing that is important to know is that line breaks are whitespace like spaces and tabs. What matters to the compiler are statements, which are broken by semicolons for individual statements, or bounded by braces for compound statements.

If you want more details you need to get started on compiler theory. [smile]
Check out flex and bison. Even reading the docs should give you some insight.
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
I'm not an expert but I'll try to explain it the best I can =)

It's a bit more complicated than that. The entire source code is first broken down into a list of tokens. So the list might go something like this:


int
main
(
)
{
cout
<<
"Hello World!!"
<<
endl
;
return
0
;
}


The #include <iostream> is a preprocessor directive and should be handled before tokenizing, I believe.

The syntax is then checked by a parser that runs through the tokens and makes sure they form valid syntax according to the grammar. If the syntax is valid, it builds a parse tree, which is then converted into assembly and from there into machine code (or, in languages like Java and C#, into bytecode that a virtual machine runs).

There are different types of parsing techniques, but one called recursive descent is one of the more popular.
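
To give a feel for recursive descent, here is a small sketch for arithmetic expressions rather than for C++ itself. The grammar and the ExprParser class are assumptions made up for this example: expr is term (('+' | '-') term)*, term is factor (('*' | '/') factor)*, and factor is a number or a parenthesised expr. For brevity it computes the value directly instead of building a parse tree.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical recursive-descent parser: one member function per grammar rule.
class ExprParser {
public:
    explicit ExprParser(const std::string& text) : src(text) {}

    // Parse the whole input and return its value; throws on a syntax error.
    int parse() {
        int value = expr();
        skipSpace();
        if (pos != src.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    std::string src;
    std::size_t pos = 0;

    void skipSpace() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    }

    // Next non-whitespace character without consuming it ('\0' at end of input).
    char peek() {
        skipSpace();
        return pos < src.size() ? src[pos] : '\0';
    }

    // expr := term (('+' | '-') term)*
    int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src[pos++];
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // term := factor (('*' | '/') factor)*
    int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src[pos++];
            int rhs = factor();
            if (op == '/' && rhs == 0) throw std::runtime_error("division by zero");
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    // factor := number | '(' expr ')'
    int factor() {
        char c = peek();
        if (c == '(') {
            ++pos;                       // consume '('
            int value = expr();          // recurse: this is the "descent"
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos;                       // consume ')'
            return value;
        }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            int value = 0;
            while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) {
                value = value * 10 + (src[pos++] - '0');
            }
            return value;
        }
        throw std::runtime_error("expected a number or '('");
    }
};
```

For example, ExprParser("2 + 3 * (4 - 1)").parse() returns 11, and it returns the same thing if you spread the expression over several lines, because the parser never looks at lines at all.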

Above is a *very* non-descriptive explanation of how it works to get you started on your journey. You'll want to look into these topics: compiler theory, context-free grammars, recursive descent parsers, syntax/parse trees.

It may help if you look into or have some background in computational theory, which is taught at a lot of schools. The end of that class (for me at least) led right into compiler theory.

Good luck!
@Fruny: The second example implies that the code is parsed line by line, ne? As you then proceeded to state, it's parsed statement by statement (to generalize, of course); that, to me, is more concretely symbolized by the first example...
JohnE, Chief Architect and Senior Programmer, Twilight Dragon Media | GCC/MinGW | Code::Blocks IDE | wxWidgets Cross-Platform Native UI Framework
Quote:Original post by TDragon
that, to me, is more concretely symbolized by the first example...


I think it's closer to the second example, since it implies that there are breaks in the parsing process. It's just that the breaks are on statements, not on new lines.
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
*Shrug*
Gotcha.
OuncleJulien summed it up in layman's terms pretty well, I guess.
JohnE, Chief Architect and Senior Programmer, Twilight Dragon Media | GCC/MinGW | Code::Blocks IDE | wxWidgets Cross-Platform Native UI Framework
Well, I was just interested since I am currently building my own recursive-descent parser [smile] and am having trouble finding a way to validate if, for, and else statements. The way the parser is now, it is supposed to read one line of code from the file, validate it, and then store the data or output the results. But if it is an if statement I would need to get another line of code from the file again and again until I reach my end statement, and I can't figure out how to do this. Since my compiler calls functions within itself, it would just start fresh and read a new line, and it wouldn't know whether it was looking for an end statement or not. Can anyone give me any ideas on how to achieve this?

So the program is read line by line then?

In my book it said that C++ doesn't know what newlines are, so that made me wonder whether it reads all the statements onto one single line inside a buffer for validation.

Quote:Original post by GameMasterXL
Well, I was just interested since I am currently building my own recursive-descent parser [smile] and am having trouble finding a way to validate if, for, and else statements. The way the parser is now, it is supposed to read one line of code from the file, validate it, and then store the data or output the results.

But that's not how a true recursive-descent parser works. A recursive descent parser will read a token at a time, not a line at a time.
Usually how it works is that it tries to build the syntax tree as it reads in individual lexemes. The parser requests lexemes from the lexer regardless of how much whitespace or how many line breaks separate them. For an if/else construct (ignoring comments), what probably happens is that it sees the if, asks for the next lexeme, and if that lexeme is not a ( then it errors out. Then it jumps into a parse expression mode, parses the expression inside the ()s, grabs the closing ) and then asks for another lexeme. If that lexeme is a { then it jumps into a block parsing mode; if it is anything else, it tries to parse it as a statement.
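
Here is a rough sketch of that flow, under some loud assumptions: the input is already a list of tokens, an "expression" is just a single identifier or number, and StatementParser and isValidProgram are names invented for this example. The thing to notice is that the function handling if simply calls statement() again for its body, so the call stack remembers that a closing } or an else is still pending. Nothing has to "start fresh" on a new line.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-language:
//   statement  := 'if' '(' expression ')' statement [ 'else' statement ]
//               | '{' statement* '}'
//               | expression ';'
//   expression := a single identifier or number (a deliberate simplification)
class StatementParser {
public:
    explicit StatementParser(std::vector<std::string> toks) : tokens(std::move(toks)) {}

    // Parse statements until the tokens run out; throws on the first error.
    void parseProgram() {
        while (pos < tokens.size()) statement();
    }

private:
    std::vector<std::string> tokens;
    std::size_t pos = 0;

    // Look at the next token without consuming it.
    const std::string& peek() const {
        static const std::string end = "<end>";
        return pos < tokens.size() ? tokens[pos] : end;
    }

    // Consume the next token, which must be exactly `t`.
    void expect(const std::string& t) {
        if (peek() != t) {
            throw std::runtime_error("expected '" + t + "' but found '" + peek() + "'");
        }
        ++pos;
    }

    void expression() {
        const std::string& t = peek();
        bool ok = !t.empty() &&
                  (std::isalnum(static_cast<unsigned char>(t[0])) || t[0] == '_') &&
                  t != "if" && t != "else";
        if (!ok) throw std::runtime_error("expected an expression but found '" + t + "'");
        ++pos;
    }

    void statement() {
        if (peek() == "if") {
            ++pos;
            expect("(");
            expression();
            expect(")");
            statement();                 // body: a block or a single statement
            if (peek() == "else") {
                ++pos;
                statement();
            }
        } else if (peek() == "{") {
            ++pos;
            // Keep parsing nested statements until the matching '}' shows up.
            while (peek() != "}" && pos < tokens.size()) statement();
            expect("}");
        } else {
            expression();
            expect(";");
        }
    }
};

// Convenience wrapper: true if the token list parses without an error.
bool isValidProgram(const std::vector<std::string>& tokens) {
    try {
        StatementParser(tokens).parseProgram();
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

With this, the tokens for if (x) { y; if (z) w; } else v; are accepted however the original text was split across lines, while a block that is missing its closing } is rejected when the tokens run out.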
