Compiler Design - Lexical & syntactical analysis

Hi all, I have a compiler construction question for you. Can anyone recommend good tutorials/webpages/books that explain how to create a lexical analyser (scanner) and a syntactical analyser (parser)? I know there are several free tools for this, such as Lex, Flex, and Bison, but I want to develop one myself; otherwise I won't learn how to implement it. ;) Thank you, folks.

lex, yacc, flex, bison, ocamlyacc, ocamllex and menhir (to cite only a few) are open source and well-documented. Taking a peek at the source code and documentation would be an excellent learning experience.

Hi Fred

Do you mean this book:
http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811/ref=pd_lpo_k2_dp_k2a_1_img/002-3598064-9116857

I would pick up the first edition of the Dragon Book instead. Most of the changes in the second edition were to do with optimization and more in-depth topics; the chapters on lexing and parsing are mostly the same. Seeing as Amazon has the first edition for $6, it'd be a good deal.

Ok, so it makes no difference whether I get the first or second edition of the Dragon Book.

Is it the most recommended book, then, if you want to learn how to build a compiler?

Quote:
Original post by CodeMachine
Hi all
I have a compiler construction question for you.

I wonder if anyone can recommend good tutorials/webpages/books that explain how to create a lexical & syntactical analyser (scanner)?

I know there are several free tools for this such as Lex, Flex, Bison etc, but I want to develop one myself. Otherwise I do not learn how to implement this. ;)

Thank you folks


Ok, I've done this for 2 years and this is my personal recommendation. The best *overall* book (and yes I know... blasphemy!) is *not* the Dragon Book. The Dragon Book is very unfriendly to someone implementing a compiler and is very, VERY theory driven. The best book IMHO is "Compiler Design in C" by Allen Holub. The source can be downloaded for free from here: http://www.holub.com/software/compiler.design.in.c.html

It very effectively mixes the theory *with* the application and is thus drastically more usable than the paperweight Dragon Book. Also, believe it or not, I wrote a compiler and interpreter (a very elementary one, but a compiler nonetheless) from the Deitel and Deitel book "C: How To Program". They have a series of exercises at the end of the chapters that give enough information to get you started without removing the fun of figuring out the implementation yourself, and they provide a good starting point to go deeper into this topic.

I actually did exactly this and then read the Dragon book for about 2 years until it got into compiler compilers where I wanted to gouge my eyes out with pickaxes! Not a recommended path to glory...

i'll recommend the dragon book as well, 2nd edition. however, if you get that one, i'd also recommend getting another book on automata and grammars, as the dragon book's coverage of topics like subset construction and DFA state minimization is blah. another thing that annoyed me about this book is that there are a lot of errors (especially in the code listings and algorithm descriptions... i remember pg. 224, Algorithm 4.31 really pissed me off) he he he. make sure you download the errata for this book!

i also liked "Engineering a Compiler". it doesn't talk about lex and yacc as much as the aho book does, but it covers automata much better (like how to code them).

Ok, so the recommended books (if you want to learn how to build a parser) are "Engineering a Compiler" and "Compiler Design in C", then.

Would you recommend an AST if you want to implement editor IntelliSense?

an ast is mainly for translating your code into pseudo-assembly... err... three-address code or something like that he he he, which i think goes into your OBJ file (if my memory serves me correctly, it's been a while). it can also be used for type checking/conversion and stuff. it's really cool stuff. i think the symbol table is what you're thinking of for intellisense... it keeps track of the variables and function identifiers, what parent scope they're in, and such...
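(For anyone following along, here is a minimal sketch of what an AST node and a scope-aware symbol table might look like in C++. The class names and members are invented for illustration; they are not taken from any of the books mentioned or from the code posted later in this thread.)

#include <map>
#include <string>
#include <vector>

// One node of an abstract syntax tree.
struct AstNode
{
    std::string kind;               // e.g. "binary-op", "identifier", "literal"
    std::string value;              // operator symbol, identifier name, etc.
    std::vector<AstNode*> children;
};

// A symbol table with nested scopes: each scope maps names to the
// declaring AST node and keeps a pointer to its enclosing scope.
struct SymbolScope
{
    SymbolScope* parent;
    std::map<std::string, AstNode*> symbols;

    explicit SymbolScope(SymbolScope* p) : parent(p) {}

    // Walk outward through enclosing scopes until the name is found.
    AstNode* lookup(const std::string& name)
    {
        for(SymbolScope* s = this; s; s = s->parent)
        {
            std::map<std::string, AstNode*>::iterator it = s->symbols.find(name);
            if(it != s->symbols.end())
                return it->second;
        }
        return 0; // not declared in any visible scope
    }
};

An editor could keep a table like this up to date while you type and call lookup() to drive completion, which matches the symbol-table suggestion above; the AST itself becomes more useful once you start type checking and generating code.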


a scanner is the first part of the compiler. It turns a character stream into a token stream. Most programming languages have pretty simple token syntax, so their tokens can be described with regular expressions. Regular expressions are equivalent to finite state machines, so study up on finite state machines. I recommend using goto to implement your FSM; it will be far more elegant than an object-oriented solution. Just make sure to open a new scope under each label, for example:

state1:
{
// per-state locals go here; the braces give each label its own scope,
// so jumping between labels never crosses an initialization
}

this keeps things from getting ugly.
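(As a concrete illustration of this pattern, here is a minimal goto-driven scanner fragment that collects identifiers and skips everything else. The function and variable names are invented for the example; a real scanner would add states for numbers, strings, operators, and so on.)

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Tiny goto-based FSM: two states, READY and IDENTIFIER.
std::vector<std::string> scan_identifiers(const std::string& src)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;

READY:
    {
        if(i >= src.size())
            return tokens;
        if(std::isalpha(static_cast<unsigned char>(src[i])) || src[i] == '_')
            goto IDENTIFIER;
        ++i;        // skip characters this toy scanner does not handle
        goto READY;
    }

IDENTIFIER:
    {
        std::string value;
        while(i < src.size() &&
              (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
            value.push_back(src[i++]);
        tokens.push_back(value);
        goto READY;
    }

    // not reached: both states loop via goto and READY returns at end of input
    return tokens;
}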

Also, I found compiler construction tools to be worse than useless when designing a new language.

Quote:
Original post by Glak
Regular expressions are equivalent to finite state machines. Study up on finite state machines. I recommend using goto to implement your fsm. It will be far more elegant than an object oriented solution.


If you have access to a smart language (one with tail-call optimization), you can and should use mutually recursive functions, which are safer to manipulate and allow passing arguments. However, this, like goto and switches, is a code-based solution which will get ugly as soon as your fsm reaches a few thousand nodes (as it does in almost any non-trivial programming language), and it will be quite a bit slower than an automaton compiled as data with a tight and small code-bit working on it.
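(For comparison, here is a rough sketch of the data-driven approach being described: the automaton is just a transition table, and one tight loop walks it. The state numbering and table contents are placeholders, not a real lexer for any particular language.)

#include <cstddef>
#include <string>
#include <vector>

// A DFA stored as data: transition[state][byte] gives the next state,
// and -1 means there is no transition on that byte.
struct Dfa
{
    std::vector<std::vector<int> > transition;
    std::vector<bool> accepting;
    int start;
};

// The tight "code-bit" that interprets the data: run the automaton
// over the input and report whether the whole string is accepted.
bool run(const Dfa& dfa, const std::string& input)
{
    int state = dfa.start;
    for(std::size_t i = 0; i < input.size(); ++i)
    {
        state = dfa.transition[state][static_cast<unsigned char>(input[i])];
        if(state < 0)
            return false;   // dead state: no transition on this character
    }
    return dfa.accepting[state];
}

A real scanner would run this repeatedly to find the longest accepting prefix at each position instead of matching the whole input at once, but the shape is the same: all the language-specific detail lives in the table.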

Quote:
Also, I found compiler construction tools to be worse than useless when designing a new language.


I found lexer and parser generation tools to be insanely useful when designing a language of reasonable complexity—they have the inherent advantage of letting you specify what you want done without having to explain how you want it done, and they usually do it in the most efficient way possible. The lexer side of things is not that indispensable (you can get it done quite easily with regular expressions, it's true) but the parser side involves a lot of work to get it working correctly and efficiently without sacrificing functionality.

"which will get ugly as soon as your fsm reaches a few thousand nodes (as it does in almost any non-trivial programming language)"

Are you talking about tokenizing? You need like 15 states max. Oh wait, operators and keywords. What I do is treat keywords like identifiers, so that problem is solved. I am guessing that nearly all compilers do it that way too. Then for operators, what I do is let them build up as one big token until a space or non-operator symbol intervenes. Then I take the resulting string and greedily consume it from the start. I have a vector of strings representing my operators, sorted by length. I check to see if the first element of that vector is a prefix of the string; if so, I make it a token, remove it, and repeat. If not, I go down the vector until a match is found.
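(Here is a minimal sketch of that greedy splitting idea, assuming the operator spellings are sorted longest first; the names are invented for illustration and this is not the actual code posted further down the thread.)

#include <cstddef>
#include <string>
#include <vector>

// Split a run of operator characters, e.g. "+=-", into known operators
// by repeatedly taking the longest spelling that matches the front.
std::vector<std::string> split_operators(std::string run,
                                         const std::vector<std::string>& ops) // sorted longest first
{
    std::vector<std::string> tokens;
    while(!run.empty())
    {
        bool matched = false;
        for(std::size_t i = 0; i < ops.size(); ++i)
        {
            if(run.compare(0, ops[i].size(), ops[i]) == 0)
            {
                tokens.push_back(ops[i]);    // found a prefix, emit it
                run.erase(0, ops[i].size());
                matched = true;
                break;
            }
        }
        if(!matched)
        {
            tokens.clear();                  // unknown operator character: error
            break;
        }
    }
    return tokens;
}

With operators ordered as "<<=", "<<", "<=", "<", "=", the run "<<=" comes out as a single token rather than "<", "<", "=".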

"found lexer and parser generation tools to be insanely useful when designing a language of reasonable complexity—they have the inherent advantage of letting you specify what you want done without having to explain how you want it done, and they usually do it in the most efficient way possible."

Well, maybe for experienced compiler writers. In my case, however, I know that I could not have succeeded with a parser generator like yacc or bison. I also looked into the Boost one, whatever it is called. Still, handwriting a parser was not that difficult; it was actually much shorter than the scanner.

Quote:
Original post by Glak
Are you talking about tokenizing? You need like 15 states max. Oh wait, operators and keywords.


Recognizing integers with an FSM is at least 6 states. Recognizing floating-point numbers with an FSM is at least 15 states. Strings and comments add a dozen states as well. It's true that keywords are often moved out of the lexer automaton, but it's not really the case for operators.

Quote:
Well maybe for experienced compiler writers.


I may sound elitist, but inexperienced compiler writers seldom "develop new languages" as much as they "develop a toy language based on an existing one". The question of whether a tool is adapted to compiler development should be first and foremost applied to people who actually develop compilers as a living—one does not provide single-developer teams as a reason why CVS (or alternatives) is utterly useless in software development.

Quote:
However in my case I know that I could not have succeeded with a parser generator like yacc or bison.


I find it strange that you managed to specify both the how and the what of your parser, yet couldn't specify the what alone. I usually find parser generators to be extremely simple to manipulate, much simpler than writing things by hand.

Quote:
However handwriting a parser was not that difficult. It was actually much shorter than the scanner.


It typically takes me a few hundred lines just to describe what the grammar of the language is, and I have trouble imagining a more concise description than Menhir's. My only guess is that the grammar of your language was extremely short and simple, which is why the parser was as well. For instance, this is a small portion of the Caml language syntax (there are several other bits quite similar to this one in size and complexity), and it can be expressed with Menhir in about as many lines of code as there are grammar alternatives. Unless you're using a very innovative trick, a hand-written parser for a grammar like this one is bound to be much longer.


Quote:
Original post by Glak
handwriting a parser was not that difficult. It was actually much shorter than the scanner.

Very unlikely. Can we see a language description and some source code?

Quote:
Original post by Running_Wild
Ok I've done this for 2 years and this is my personal recommendation. The best *overall* book (and yes I know... Blasphemy!) is *not* the Dragon book. The Dragon book is very unfriendly to someone implementing a compiler and is very VERY theory driven. The best book IMHO is "Compiler Design in C" by Allen Holub. The source can be downloaded for free from here : http://www.holub.com/software/compiler.design.in.c.html

What about error handling during compilation, like a missing semicolon or something more complex? The Dragon Book contains only a few pages on that topic. It seems important, however, to let the compiler continue parsing the source code after an error is found, so that as many errors as possible are reported.

Are there any good books covering this topic?
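(The usual textbook technique here is panic-mode recovery: when the parser detects an error, it reports it, skips tokens until it reaches a synchronizing token such as a semicolon or a closing brace, and then resumes. A rough sketch with a made-up Parser class might look like this.)

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Token { std::string type, value; };

// Hypothetical parser fragment showing panic-mode error recovery.
struct Parser
{
    std::vector<Token> tokens;
    std::size_t pos;
    int error_count;

    Parser() : pos(0), error_count(0) {}

    // Skip ahead to a token that is likely to end the broken statement.
    void synchronize()
    {
        while(pos < tokens.size() &&
              tokens[pos].value != ";" && tokens[pos].value != "}")
            ++pos;
        if(pos < tokens.size())
            ++pos;                     // consume the synchronizing token too
    }

    void error(const std::string& message)
    {
        ++error_count;
        std::cout << "error: " << message << std::endl;
        synchronize();                 // keep going so later errors still get reported
    }
};

Each statement-level parsing routine calls error() instead of giving up, so one compile reports as many problems as it can.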

ok, I'll give you guys some info.

My language has simple syntax. It has a bunch of binary operators and some postfix unary operators. There are three outfix operators, () [] {}, that I call grouping operators in my code. There is also an implied adjacency operator, so the code A B means A adj B. The semicolon is the sequence operator and the comma is the tuple operator. This means that there are no statements; the entire program is always a single expression.

So, for example, f x and f(x) are equivalent expressions; () only affects precedence. However, f x,y and f(x,y) are completely different: the first is a tuple whose first element is a function call, while the second is a function call with a tuple as its argument. Well, that is if f is a function. If f is a type, then the expression f x is a variable declaration. If f is the class keyword, then x had better be a class expression and the overall expression is a class definition. If f is a container type, x had better be an index. It all just depends. Operators were chosen to eliminate ambiguity; that is one reason why there are no prefix operators and why < > is not an outfix operator.

Ok, now to reply to some quotes:

"I may sound elitist, but inexperienced compiler writers seldom "develop new languages" as much as they "develop a toy language based on an existing one". The question of whether a tool is adapted to compiler development should be first and foremost applied to people who actually develop compilers as a living—one does not provide single-developer teams as a reason why CVS (or alternatives) is utterly useless in software development."

No my language is pretty new. All of the features exist in other languages but the combination is pretty unique and I really like it.

"I find it strange that you managed to specify both the how and the what of your parser, yet couldn't specify the what alone. I usually find parser generators to be extremely simple to manipulate, much simpler than writing things by hand."

Now I would probably have no trouble using a parser generator for my language, but at the time I wasn't really sure how my language would look. Sure, I had an idea, but not a formal understanding. Some things changed as I figured out that certain constructs were nearly unparseable. Also, a language that is easy to write a parser for is easy to learn, so I started pretty simple and worked from there. I kept printing out the parse tree and reviewing it; now I am using Graphviz to help visualize things.
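(For anyone curious how that kind of visualization works: you can dump an expression tree to Graphviz's DOT format and render it with the dot tool. A minimal sketch follows, assuming a simple node with a label and left/right children, roughly in the spirit of the Expression type in the code below; it is not the poster's actual code.)

#include <fstream>
#include <string>

// Minimal binary expression node for the example.
struct Node
{
    std::string label;
    Node* left;
    Node* right;
    Node(const std::string& l, Node* a = 0, Node* b = 0) : label(l), left(a), right(b) {}
};

// Emit one DOT node per tree node plus edges to its children.
void write_dot(std::ofstream& out, const Node* n)
{
    if(!n)
        return;
    out << "  n" << (const void*)n << " [label=\"" << n->label << "\"];\n";
    if(n->left)
        out << "  n" << (const void*)n << " -> n" << (const void*)n->left << ";\n";
    if(n->right)
        out << "  n" << (const void*)n << " -> n" << (const void*)n->right << ";\n";
    write_dot(out, n->left);
    write_dot(out, n->right);
}

int main()
{
    Node* tree = new Node("+", new Node("a"), new Node("*", new Node("b"), new Node("c")));
    std::ofstream out("tree.dot");
    out << "digraph ast {\n";
    write_dot(out, tree);
    out << "}\n";   // render with: dot -Tpng tree.dot -o tree.png
    return 0;
}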

I decided to take a year off from working on it, due mainly to work issues and also because I wasn't sure how I was going to handle some things. I am not a professional programmer, nor have I ever been, and I have never completed a program (including this one), so my code will have a rough look. I also don't have much error checking in the compiler (I will add it later); however, given correct input, a correct parse tree is generated.

I have a lot of files; if you want them all I will post them, but for now I will just give you the tokenizer and parser.

First, the tokenizer (btw, I am using the word "rune" to refer to operators):

#include "../Core/global.h"
#include "../CharacterStream/CharacterStream.h"
#include "../Core/Runes and Articles.h"


bool is_prefix_of(const string& substr, const string& str)
{
if( substr.size() > str.size() )
return false;

for(int i=0; i<substr.size(); ++i)
{
if( substr[i] != str[i] )
return false;
}

return true;
}

string remove_prefix_from(const string& substr, const string& str)
{
string s(str.begin()+substr.size(),str.end());

if(substr+s != str)
cout<<"error in remove_prefix_from"<<endl;

return s;
}




vector<Token> AllTokens;

set<char> alpha; //$ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
set<char> numeric; //1234567890
set<char> whitespace; //\t \n
set<char> special; //\"\\\'

set<char> alphanumeric;
set<char> non_rune;

Token current("","");

void InitializeTokenizer(bool testing)
{
char a[] = "$ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";
alpha.insert(a,a+sizeof(a)-1);

char n[] = "1234567890";
numeric.insert(n,n+sizeof(n)-1);

char w[] = "\t \n";
whitespace.insert(w,w+sizeof(w)-1);

char s[] = "\"\\\'";
special.insert(s,s+sizeof(s)-1);

alphanumeric.insert(alpha.begin(),alpha.end());
alphanumeric.insert(numeric.begin(),numeric.end());

non_rune.insert(alphanumeric.begin(),alphanumeric.end());
non_rune.insert(whitespace.begin(),whitespace.end());
non_rune.insert(special.begin(),special.end());


if(testing)
{
set<char>::iterator iter;

cout<<"alpha: ";
for(iter=alpha.begin(); iter!=alpha.end(); ++iter)
cout<<(*iter);
cout<<endl;

cout<<"numeric: ";
for(iter=numeric.begin(); iter!=numeric.end(); ++iter)
cout<<(*iter);
cout<<endl;

cout<<"whitespace: ";
for(iter=whitespace.begin(); iter!=whitespace.end(); ++iter)
cout<<(*iter);
cout<<endl;

cout<<"special: ";
for(iter=special.begin(); iter!=special.end(); ++iter)
cout<<(*iter);
cout<<endl;

cout<<"alphanumeric: ";
for(iter=alphanumeric.begin(); iter!=alphanumeric.end(); ++iter)
cout<<(*iter);
cout<<endl;

cout<<"non_rune: ";
for(iter=non_rune.begin(); iter!=non_rune.end(); ++iter)
cout<<(*iter);
cout<<endl;


}
}

void Tokenize(CharacterStream& stream, bool testing)
{
InitializeTokenizer(false);

//----Ready-State-------------------------------------------------
READY_STATE:
{
//check for tokenization complete
if(stream.empty())
{
return;
}

char c = stream.top();
current.type="";
current.value="";

if(c=='\"')
{
current.type="string";
stream.pop();
goto STRING;
}

if(c=='\'')
{
current.type="char";
stream.pop();
goto CHAR;
}

if(numeric.count(c))
{
current.type="int";
goto NUMBER;
}

if(alpha.count(c))
{
current.type="identifier";
goto IDENTIFIER;
}

if(whitespace.count(c))
{
stream.pop();
goto READY_STATE;
}


current.type="rune"; //not really relevant
goto RUNE;


}

//----Rune-----------------------------------------------------------
RUNE:
{
//check for end of stream before peeking at the next character
if(stream.empty() || non_rune.count(stream.top()))
{
//break up the runes here and push them onto all tokens
//also where all errors are signaled

while(current.value.size())
{
set<Token>::iterator iter;
set<Token>::iterator end=AllRunes.end();

for(iter=AllRunes.begin(); iter!=end; ++iter)
{
if(is_prefix_of((*iter).value,current.value))
{
AllTokens.push_back(*iter);
if(testing)
cout<<*iter<<endl;
current.value=remove_prefix_from((*iter).value,current.value);
goto continue_;
}

}
if(iter==end) //match not found!
{
cout<<"illegal rune("<<current.value<<") size: "<<current.value.size()<<endl;
//cout<<unsigned int(s[0])<<endl;
return;

//cout<<'('<<s<<')'<<endl;
}

continue_:;
}
goto READY_STATE;
}

current.value.push_back(stream.top());
stream.pop();
goto RUNE;

}

//----Identifier-----------------------------------------------------------
IDENTIFIER:
{
char c = stream.top();

if(alphanumeric.count(c))//found
{
current.value.push_back(c);
stream.pop();
goto IDENTIFIER;
}

AllTokens.push_back(current);
if(testing)
cout<<current<<endl;
goto READY_STATE;
}

//----String----------------------------------------------------------
STRING:
{
if(stream.empty())
{
cout<<"error, EOF in string literal";
}
char c = stream.top();
stream.pop();

if(c=='\\')
{
goto STRING_ESCAPE;
}
if(c=='\"')
{
AllTokens.push_back(current);
if(testing)
cout<<current<<endl;
goto READY_STATE;
}

current.value.push_back(c);
goto STRING;


STRING_ESCAPE:
{
//put special cases here
if(stream.empty())
{
cout<<"error, EOF in string literal";
}
char c = stream.top();
stream.pop();

if(c=='\\')
{
current.value.push_back('\\');
goto STRING;
}
if(c=='\"')
{
current.value.push_back('\"');
goto STRING;
}
if(c=='\'')
{
current.value.push_back('\'');
goto STRING;
}
if(c=='t')
{
current.value.push_back('\t');
goto STRING;
}
if(c=='n')
{
current.value.push_back('\n');
goto STRING;
}

cout<<"illegal escape sequence in string literal"<<endl;
return;

}


}


//----Number------------------------------------------------------------
NUMBER:
{

char c = stream.top();

if(numeric.count(c))
{
current.value.push_back(c);
stream.pop();
goto NUMBER;
}

if(c=='.')
{
current.value.push_back(c);
stream.pop();
current.type="float";
goto FLOAT;
}

AllTokens.push_back(current);
if(testing)
cout<<current<<endl;

goto READY_STATE;

FLOAT:
{
if(!stream.empty() && numeric.count(stream.top()))
{
current.value.push_back(stream.top());
stream.pop();
goto FLOAT;
}

if(current.value[current.value.size()-1]=='.')
{
cout<<"error: unfinished floating point literal"<<endl;
}

AllTokens.push_back(current);
if(testing)
cout<<current<<endl;

goto READY_STATE;


}
}

//----Character------------------------------------------------------
CHAR:
{

if(stream.empty())
{
cout<<"error, EOF in char literal";
}
char c = stream.top();
stream.pop();

if(c=='\\')
{
goto CHAR_ESCAPE;
}

current.value.push_back(c);
goto CHAR_END;

CHAR_ESCAPE:
{
//this was left out, I just put it in
if(stream.empty())
{
cout<<"error, EOF in char literal";
}
char c = stream.top();
stream.pop();

if(c=='\\')
{
current.value.push_back('\\');
goto CHAR_END;
}
if(c=='\"')
{
current.value.push_back('\"');
goto CHAR_END;
}
if(c=='\'')
{
current.value.push_back('\'');
goto CHAR_END;
}
if(c=='t')
{
current.value.push_back('\t');
goto CHAR_END;
}
if(c=='n')
{
current.value.push_back('\n');
goto CHAR_END;
}

cout<<"invalid escape character in char literal"<<endl;
}

CHAR_END:
{
if(stream.empty())
{
cout<<"error, EOF in char literal";
}
char c = stream.top();
stream.pop();

if(c=='\'')
{
AllTokens.push_back(current);
if(testing)
cout<<current<<endl;
goto READY_STATE;
}
else
{
//signal error
}
}
}



}





ok here is parser.cpp, the main parsing file. The function "show_top" is a display function.

#include "../Core/global.h"
#include "../Core/Runes and Articles.h"


void FixOperators
(
Iterator begin,
Iterator end,
Token* operators,
int size,
string fixation="infix",
string associativity="left"
);

void Group(Iterator begin, Iterator end);
void HandleAdjacent(Iterator begin, Iterator end, Token* operators, int size);


Iterator next(Iterator iter)
{
return ++iter;
}

void show_top(string text, Iterator begin, Iterator end)
{
cout<<text<<":\n";

++begin;
for(Iterator iter=begin; iter!=end; ++iter)
{
cout<<(*iter)->token.value<<' ';
}
cout<<endl;

}


void Parse(Iterator begin, Iterator end)
{


Token infix[] = {TIMES,MOD,DIV,PLUS,MINUS, //5
SHIFT_LEFT,SHIFT_RIGHT,LESS_THAN,LESS_THAN_EQUAL,//4
GREATER_THAN,GREATER_THAN_EQUAL,IS_EQUAL_TO,//3
NOT_EQUAL_TO,BITWISE_AND,BITWISE_OR,AND,OR,//5
MAKE_EQUAL,PLUS_EQUALS,COMMA,SEMICOLON,//4
INTERSECTION,UNION };//2



if(next(begin)==end)
{
cout<<"nothing to parse here"<<endl;
return;
}




show_top("unparsed",begin,end);


Group(next(begin),end); //also parses each group
show_top("grouped",begin,end);

Token function_operator[] = {ARROW};
FixOperators(next(begin),end,function_operator,1);
show_top("function",begin,end);



Token postfix[] = {INCREMENT,DECREMENT,TO_REFERENCE,TO_RELATION,TO_OPTION};
FixOperators(next(begin),end,postfix,5,"postfix");
show_top("postfix",begin,end);

HandleAdjacent(next(begin),end,infix,23);
show_top("adjacentized",begin,end);






//----standard-binary-operators----------------------------------------


Token scaling[] = {TIMES,MOD,DIV};
FixOperators(next(begin),end,scaling,3);


Token displacing[] = {PLUS,MINUS};
FixOperators(next(begin),end,displacing,2);

Token setwise[] = {INTERSECTION,UNION};
FixOperators(next(begin),end,setwise,2);


Token shifting[] = {SHIFT_LEFT,SHIFT_RIGHT};
FixOperators(next(begin),end,shifting,2);


Token comparison[] = {LESS_THAN,LESS_THAN_EQUAL,
GREATER_THAN,GREATER_THAN_EQUAL};
FixOperators(next(begin),end,comparison,4);


Token equality[] = {IS_EQUAL_TO,NOT_EQUAL_TO};
FixOperators(next(begin),end,equality,2);


Token bitwise_and[] = {BITWISE_AND};
FixOperators(next(begin),end,bitwise_and,1);



Token bitwise_or[] = {BITWISE_OR};
FixOperators(next(begin),end,bitwise_or,1);


Token logical_and[] = {AND};
FixOperators(next(begin),end,logical_and,1);


Token logical_or[] = {OR};
FixOperators(next(begin),end,logical_or,1);

show_top("infix",begin,end);


//----slightly-special-binary-operators------------------------


Token assignment[] = {MAKE_EQUAL,PLUS_EQUALS};
FixOperators(next(begin),end,assignment,2,"infix","right");

show_top("assignment",begin,end);

Token tupling[] = {COMMA};
FixOperators(next(begin),end,tupling,1);

show_top("tupling",begin,end);

Token sequencing[] = {SEMICOLON};
FixOperators(next(begin),end,sequencing,1);


show_top("sequencing",begin,end);

}





adjacent.cpp

#include "../Core/global.h"
#include "../Core/Runes and Articles.h"
#include <algorithm>
using namespace std;

void HandleAdjacent(Iterator begin, Iterator end, Token* operators, int size)
{
Iterator iter=begin;
Iterator next=++begin;

while(next!=end)
{
Token& r = (*next)->token;
Token& l = (*iter)->token;

if(l==PERIOD &&r.type=="identifier")
{
(*iter)->right=(*next);
next=AllExpressions.erase(next);
++iter;
++next;
continue;
}

Expression& right = *(*next);
Expression& left = *(*iter);

bool infix=false;
if(left.match(operators,size))
infix=true;
if(right.match(operators,size))
infix=true;

if(!infix)
{
Expression* p = new Expression(ADJACENCY);//adj
p->left=(*iter);
p->right=(*next);
(*iter)=p;
next=AllExpressions.erase(next);
continue;
}

++iter;
++next;
}

}

//this problem should not come up because it doesn't repeat: t . r y




group.cpp, handles all outfix operators


#include "../Core/global.h"
#include "../Core/Runes and Articles.h"

void Parse(Iterator begin, Iterator end);



//returns true if error
bool ProcessGroup
(
Iterator& iter,
Iterator& end,
Iterator& start,
Token OPEN,
Token CLOSE,
Token RESULT
)
{
int nesting=0;

while(true)
{

if(iter==end)
{
cout<<"Grouping:GROUPING: iter==end, unmatched ("<<endl;
return true;
}

Token& t( (*iter)->token );

if(t==OPEN)
{
++nesting;
++iter;
continue;
}

if(t==CLOSE)
{
if(nesting)
{
--nesting;
++iter;
continue;
}
else
{
Iterator sub_begin=start;
++sub_begin; //points to first element, or to ) if empty
Iterator sub_end=iter;//subend points to )

//Parse(sub_begin,sub_end,parent,"group");
Parse(start,iter);

//ok now sub_begin invalid
//sub_end still points to the valid )
//start still points to (

Iterator sub_expression = AllExpressions.erase(start);
//sub_expression points to the sub_expression
//unless it was empty

if(sub_expression==sub_end) //the group was empty
{
//overwrite the ) with () and contents==null
(*sub_end)= new Expression(RESULT);
}
else //non-empty
{
(*sub_end)->token=RESULT;
(*sub_end)->contents=(*sub_expression);
AllExpressions.erase(sub_expression);
}

return false;//goto READYSTATE;
}
}

//neither ( nor ) found
++iter;
}
}






void Group(Iterator begin, Iterator end)
{
Iterator iter=begin;
Iterator start;
Iterator finish;

while(true) //READYSTATE:
{
if(iter==end)
{
return;
}

cout<<"Group: ready_state"<<endl;
Token& t( (*iter)->token );

//----open-symbol---------------------------
if(t==OPENPAREN)
{
start=iter;
++iter;
if
(
ProcessGroup(iter,end,start,OPENPAREN,CLOSEPAREN,GROUP)
)cout<<"error processing GROUP"<<endl;
continue;
}

if(t==OPENBRACKET)
{
start=iter;
++iter;
if
(
ProcessGroup(iter,end,start,OPENBRACKET,CLOSEBRACKET,BOX)
)cout<<"error processing BOX"<<endl;
continue;
}

if(t==OPENBRACE)
{
start=iter;
++iter;
if
(
ProcessGroup(iter,end,start,OPENBRACE,CLOSEBRACE,BLOCK)
)cout<<"error processing BLOCK"<<endl;
continue;
}

//----close-symbol----------------------------------------------
if(t==CLOSEPAREN || t==CLOSEBRACKET || t==CLOSEBRACE)
{
cout<<"error unexpected closing symbol: "<<t<<endl;
return;
}

//----other-symbol----------------------------------------------
++iter;

}

}




infix.cpp (also handles postfix operators)

#include "../Core/global.h"
#include "../Core/Runes and Articles.h"
//uses rotations to handle right-associative operators
//not sure how this technique would work if there were
//right-associative operators at multiple precedence levels
//currently handles infix and postfix
void FixOperators
(
Iterator begin,
Iterator end,
Token* operators,
int size,
string fixation,
string associativity

)
{
if(fixation!="infix" && fixation!="postfix" && fixation!="prefix")
cout<<"error, improper fixation"<<endl;

//deal with empty list
if(begin==end)
return;

Iterator iter=begin;
if(associativity!="prefix")
++iter;

while(iter!=end)
{
if( (*iter)->match(operators,size) )
{

Iterator left=iter;
Iterator right=iter;
--left;
++right;
if(right==end && fixation!="postfix")
{


//special case, insert empty group: ()
/*if( (*iter)->token==SEMICOLON )
{
AllExpressions.insert(right,&EMPTYGROUP);
right=iter;
++right;
}*/


//special case, drop the comma
//else
if( (*iter)->token==COMMA || (*iter)->token==SEMICOLON)
{
//check this code
iter=AllExpressions.erase(iter);
goto NO_INCREMENT;
}

else
cout<<"FixOperators: no right operand"<<(*iter)->token<<endl;
}

if(fixation!="postfix")
{
(*iter)->right = (*right);
AllExpressions.erase(right);
}
if(fixation!="prefix")
{
(*iter)->left = (*left);
AllExpressions.erase(left);
}

(*iter)->fixation=fixation;


//check for chained right-associative operators
if
(
associativity=="right" &&
fixation=="infix" &&
(*iter)->left->match(operators,size)
)//rotation required
{
Expression* temp=*iter;
*iter=(*iter)->left;
Expression* right_most=*iter;
while(right_most->right->match(operators,size))
right_most=right_most->right;
temp->left=right_most->right;
right_most->right=temp;
}

}

++iter; //FIX ME:this probably goes here, maybe above
NO_INCREMENT:;
}
}




postparse.cpp, restructures things a little, considered part of the parser

#include "../Core/global.h"
#include "../Core/Runes and Articles.h"
#include "../SemanticAnalyzer/Scope.h"

void RemoveParenthesis(Expression* e);
void LexicallyScope(Expression* e, Scope* scope);
void RestructureFunctions(Expression* e);
void SetParent(Expression* e, Expression* parent);
void RestructureClasses(Expression* e);

void PostParse(Expression* e)
{

RemoveParenthesis(e);

SetParent(e,0);
RestructureClasses(e);
RestructureFunctions(e);

SetParent(e,0);

//note that root is a global
root = new Scope(0);
LexicallyScope(e,root);


}

void RemoveParenthesis(Expression* e)
{
if(!e)
return;

if(e->token==GROUP && e->contents)
{
Expression* temp=e->contents;
*e=*temp;
delete temp;
cout<<"removing ()"<<endl;
}

RemoveParenthesis(e->left);
RemoveParenthesis(e->right);
RemoveParenthesis(e->contents);


}

void LexicallyScopeChildren(Expression* e, Scope* scope)
{
LexicallyScope(e->right,scope);
LexicallyScope(e->contents,scope);
LexicallyScope(e->left,scope);
}


void LexicallyScope(Expression* e, Scope* scope)
{
if(!e)
return;

if(!scope)
{
cout<<"error in LexicallyScope, scope==0"<<endl;
return;
}

if(e->token==BLOCK)
{
Scope* new_scope = new Scope(scope);
e->scope=new_scope;
LexicallyScopeChildren(e,new_scope);
return;
}

e->scope=scope;
LexicallyScopeChildren(e,scope);
}




void RestructureFunctions(Expression* e)
{
if(!e)
return;

if //standard function
(
e->token==ADJ &&
e->left->token==ADJ &&
e->left->left->token==ARROW &&
e->right->token==BLOCK
)
{
e->right->left=e->left->left->left;
e->right->right=e->left->left->right;
Expression* p;

p=e->left->left->left;
e->left->left->left=ExtractType(p);

p=e->left->left->right;
e->left->left->right=ExtractType(p);

RestructureFunctions(e->right->contents);
return;
}

if //implicit function (block)
(
e->token==BLOCK &&
!(e->left) &&
!(e->right)
)
{
Expression* e2 = new Expression(MAKE_EQUAL);
e2->right=e->contents;
e2->left= new Expression(Token("identifier","__return_value"));
e->contents=e2;

e->left=new Expression(EMPTYGROUP);

e->right= new Expression(ADJACENCY);
e->right->right = new Expression(Token("identifier","__return_value"));
e->right->left = new Expression(TYPEOF); //creates cycle!

e->right->left->contents=e->contents->right;

RestructureFunctions(e->contents);
return;
}

RestructureFunctions(e->left);
RestructureFunctions(e->right);
RestructureFunctions(e->contents);

}

void LexicallyScope(Expression* e, Scope* scope);

void SetParent(Expression* e, Expression* parent)
{
if(!e)
return;
e->parent=parent;

if(e->token==TYPEOF)
return;

SetParent(e->left,e);
SetParent(e->right,e);
SetParent(e->contents,e);
}

void RestructureClasses(Expression* e)
{
if(!e)
return;

RestructureClasses(e->right);
RestructureClasses(e->contents);
RestructureClasses(e->left);

if(e->token==KEYWORD_CLASS)
{
cout<<"RestructureClasses KEYWORD_CLASS begin"<<endl;
Expression* p=e->parent->parent;
if(p->token != MAKE_EQUAL)
{
cout<<"error, = needed in class definition"<<endl;
}
p->token=CLASS_DEF;
cout<<"RestructureClasses KEYWORD_CLASS end"<<endl;
}

if(e->token==CLASS_DEF)
{
cout<<"RestructureClasses CLASS_DEF begin"<<endl;
Expression* temp=e->left;
e->left=e->left->right;
delete temp->left;
delete temp;

cout<<"RestructureClasses CLASS_DEF end"<<endl;
}
}




/*

void oldLexicallyScope(Expression* e, Scope* scope)
{
if(!e)
return;

if(!scope)
{
cout<<"error in LexicallyScope, scope==0"<<endl;
return;
}

cout<<"LexicallyScope 1"<<endl;

if(e->token==ARROW)
{

Expression* parent=e->parent;
if(!parent)
goto BLOCKLESS;

Expression* brother=parent->right;
if(!brother)
goto BLOCKLESS;
if( brother->token == Token("special","{}") )
{
Scope* new_scope = brother->scope;
e->scope=new_scope;
LexicallyScopeChildren(e,new_scope);
return;
}

Expression* grandparent=parent->parent;
if(!grandparent)
goto BLOCKLESS;

Expression* uncle=grandparent->right;
if(!uncle)
goto BLOCKLESS;
if( uncle->token == Token("special","{}") )
{
Scope* new_scope = uncle->scope;
e->scope=new_scope;
LexicallyScopeChildren(e,new_scope);
return;
}

BLOCKLESS: //just a function, no block
{
Scope* new_scope= new Scope(scope);
e->scope=new_scope;
LexicallyScopeChildren(e,new_scope);
return;
}

}


if(e->token==Token("special","{}"))
{
Scope* new_scope = new Scope(scope);
e->scope=new_scope;
LexicallyScopeChildren(e,new_scope);
return;
}

cout<<"LexicallyScope 3"<<endl;
e->scope=scope;
LexicallyScopeChildren(e,scope);

cout<<"LexicallyScope 4"<<endl;
}*/



