Working control structures

Started by
22 comments, last by Zotoaster 17 years ago
I'm having some trouble getting control structures to work in my scripting system, mainly if statements just now. My main idea was this: If I have a piece of code, like this: print "hello"; if(name == "Zotoaster"){ print "Cool name"; } print "done.."; Everytime you find an if, get whats inside the braces, and use that as the script, which means nested ifs would work too. It all works fine, apart from when printing what's after the if statement, in this case, the print "done.."; part. Anyone have an idea how to fix this, or have a better idea of how to work it? Thanks
Advertisement
This is usually done by making your "if" a statement. A typical grammar looks like this:

'statement' :=  | if ('condition') 'block'  | print 'text' ;'block' :=   | 'statement'  | { 'statement*' }


Then, when executing the if-statement, you'd check the condition and insert the associated block before the remaining statements in the current block. Operational semantics would be:

exec('if(X) Y' :: next,S) =
if eval(X,S) then exec(Y :: next,S)
else exec(next,S)


Where :: is a concatenation operator, Y is the bound statement list, next is the statements to be executed after the current one, and S is the program state.
I don't quite understand that code there fully, but I think I have a rough idea of what you mean, but not enough to try and implement it. Do you mind giving a little more info? :)
Right. So, a few questions:

  • What's your language grammar?
  • How do you construct an AST from the source code?
  • What does your evaluation function look like?
Well, if by grammar you mean syntax, that it's very C like.

I looked up AST, and got abstract syntax tree.. To be honest, I don't know what that is.. I'm fairly new to this whole scripting business, sorry.

Evaluation as in.. maths evaluation? It's a whole class, and it involved the whole code. I can post it if you like :)
Right, so you really do not have a clue. I really advise that you get a book on programming language development. I'll write a quick primer for you, though.

The basic operations behind any programming language is as follows: the input file is read, character by character, and split up into tokens (characters or groups of characters which have particular meanings) by a lexer. The tokens are then transformed into an abstract syntax tree by a parser. This transform is described by means of a grammar which is a complete description of the language syntax.

Note — Some modern parsers are able to handle tokenizing themselves instead, and some complex languages (such as C++) must interpret the language at the same time to build the AST.


Once the abstract syntax tree exists, the compiler performs various static bindings (linking the usage and declaration of variables and functions) and checks (type-safety, existence of the objects), and then moves on to the transform part.

Note — Again, some languages don't do static binding at all (PHP, PERL, SH) while most others delay the actual binding another step (a dedicated linking step). Also, not all perform any checking whatsoever. On the contrary, some perform complete proofs about the language.


The transformation consists in taking the abstract syntax tree, altering it, and outputting an executable graph. This might require several successive transforms (moving from front-end to back-end to binary in GCC), or it might be immediate (for example, in λ-calculus, the AST is executed almost as-is). The result is the executable structure, which is then passed on to an evaluator.

The evaluator is either the processor (when the executable structure is machine code), a virtual machine (when the executable structure is bytecode), or an evaluator defined by means of operational semantics. Note that it's perfectly plausible to define an evaluator through operational semantics, and then implement it as a virtual machine, or by compiling the AST to machine code.

Now, this is the simple part. The first difficult part is providing a grammar which can handle what you want to do. The typical imperative language grammar considers blocks to be statements or lists of statements, and statements to be function calls, assignments, or control sequences (which may include blocks). So, the AST of your example would be as such:

BLOCK|+-STATEMENT-PRINT "hello"|+-STATEMENT-IF (name == Zotoaster)| || \-BLOCK|   ||   \-STATEMENT-PRINT "Cool name"|\-STATEMENT-PRINT "done"


A typical operational semantics definition of such a language is to consider that you are executing a sequence of statements. This sequence is called, say, P (program). Every time you execute a statement, you alter the state of the program (its memory, its output, its input), which we'll call S (state). Then, the execution function here consists in telling the program what the final state is, given an initial state S and an initial program P.

So, here:
execute(program P, state S) is:  execute([],S) = S // no code left? return the state  execute([print x]::rest,S) =     execute(rest,x::S) // we "print x", thereby changing S,                        // and then determine what the rest of the program does  execute([if(x) Y]::rest,S) =    if eval(x,S) // Looking at the state of the program,                 // is the expression 'x' true or false?    then execute(Y::rest,S) // It's true: execute the if-statement code                            // 'Y' before the rest of the code    else execute(rest,S) // It's false, don't bother with 'Y' and move on                         // to the rest


Note that the execution function never forgets about 'rest', which is what code will be executed after the current statement, even if the current statement is an if-clause.

Now, once this is known, you translate the above execution system into whatever language and/or VM you wish to use.

Now, I go to sleep. I'll check again on this thread in 8 hours or so.
Oh, I really am far behind aren't I? O_o

Atleast I understand that syntax tree thing, which should help.

I admit this scripting system wont be the best around, but it's fun to try and make it heheh.
Well, it's not so much that you're far behind. Language semantics is an entire field of study which has been around for quite a few decades now, so people have naturally evolved tools and mindsets towards this.

Look at it this way. You don't use punch cards anymore to give orders, and your CPU doesn't respond with a 100dpi printer: your computer has a screen and keyboard and mouse, because these are simpler to use when communicating with the machine. You don't write all your code in assembly language anymore, you write most simple programs in medium-level or high-level languages because it's simpler to use and it saves time.

In the same way, language developers and semanticians have their own tools: lexer generators (because writing an equivalent lexer by hand takes years of experience and a couple of days), parser generators (because writing an equivalent parser takes years of experience and a couple of weeks), adapted tree manipulation languages (just because C++, C# or Java definitely won't cut it until they have access to pattern matching and/or garbage collection), as well as many human-spoken languages to communicate the various concepts (tokens, grammar, operational/denotational semantics).

The idea is this: where the average language can be written by hand in a month from scratch, these tools allow us to whip up a working prototype in half a day, which leaves us the rest of the month to think about the actually smart things (such as improving the speed of compiled code by 10% in addition to what the optimizer may do, or perhaps writing a program that proves your code doesn't contain any bugs, or allowing the programmer to express 1500 lines of C in 10 lines of the prototype language, or allowing the code to be distributed to a hundred machines at compile time, and so on). Besides, the tools are so flexible, we can change the grammar in five seconds tops. Which is why, although I consider a fine occupation to reinvent these wheels, I usually advise people to use them if they intend to get work done quick.

So, let's consider your example language, and see how it would fit. This is more of an untested outline, but the idea is here. First, you seem to want a language that has a C-like syntax, and we'll add a few concepts of our own:

  • Imperative language: functions are sequences of statements. Statements are either control structures (if, while, return), function calls, printing, reading, or assigning a variable a value.

  • Our values are NULL, strings, integers and arrays. We'll go for a PHP-like approach where arrays are in fact hash tables, and conversion between strings and integers is automatic. Variables can be used before being initialized (they're initially NULL).

  • We use no overloading (since we have a single type) and I'm too lazy to do argument-count-based overload.

  • The language is interpreted by a virtual machine. It is not type-checked at all.


So, the first step is to decide which grammar we'll use. The typical grammar is split into three sections: expressions (things which have a value), statements (things which perform actions) and definitions (things which explain what a function is). Here, I'll use Menhir, which is a very nice parser generator for OCaml, to describe the grammar. The idea is that I define non-terminals to represent each of the "entities" that can appear in the language, using other non-terminals (for instance, an expression can be an expression plus another expression) and terminals (the symbols read from the file, also called tokens).

// --- First, I describe the tokens I'll use...// Tokens which are not associated with any value%token   TIf     // 'if'  TWhile  // 'while'  TFunc   // 'function'  TSet    // '='  TReturn // 'return'  TPrint  // 'print'  TRead   // 'read'  TBBlock // '{'  TSBlock // '}'  TBParen // '('  TEParen // ')'   TBSubs  // '['  TESubs  // ']'  TSColon // ';'  TComma  // ','  TNull   // 'null'  TAdd TMin TMul TDiv         // + - * /  TAnd TOr  TNot              // && || !  TEqu TGt TLt TGet TLet TNeq // == > < >= <= !=  TEof    // End of file// Tokens which are bound to a string value%token<string>  TIdent  // Identifier (foobar,x,Value)  TString // String literal "hello"// Tokens which are bound to an integer value%token<int>  TInt    // Integer literal 3// --- Then, I describe the priorities of the operators%left TMul TDiv // highest%left TAdd TMin %left TEqu TGt TLt TGet TLet TNeq %unary TNot%left TAnd%left TOr // lowest// --- Finally, the non-terminal that corresponds to a program%start program%%// --- I now describe how the non-terminals are defined.// Between braces at the end of each possible definition is the// OCaml code executed when this definition is encountered.// A program is a list of function definitionsprogram:  | defs = list(definition) TEof { AST.register (defs) };// A definition is "function", a name, an argument list and a bodydefinition:  | TFunc     name = TIdent     TBParen    args = separated_list(TIdent,TComma)    TEParen    body = block   { AST.make_function (name,args,body) };// A block is either a single statement, or a list of statements// between bracesblock:  | x = statement { [x] }  | TBBrace    statements = statement*    TEBrace { statements };// A statement can be an assignment, reading, printing, or // a control sequence (or an expression with ignored value)statement:  | into = assignable TSet valu = expression TSColon       { AST.make_assign (into,valu) }  | TRead into = assignable TSColon      { AST.make_read (into) }  | TPrint valu = expression TSColon      { AST.make_print (valu) }  | TIf     TBParen cond = expression TEParen    exec = block      { AST.make_if (cond,exec) }  | TWhile    TBParen cond = expression TEParen    exec = block      { AST.make_while (cond,exec) }  | TReturn TSColon      { AST.return_nothing }  | TReturn valu = expression TSColon      { AST.return (valu) }  | expr = expression      { AST.eval (expr) };// An assignable expression is either a variable, or // a member of an assignable expressionassignable:   | name = TIdent       { AST.make_assignable_var (name) }  | array = assignable     TBSubs index = expression TESubs      { AST.make_assignable_sub (array,index) };// An expression works by prioritizing: first, we group together// literals, calls and parantheses as "simple" expressionssimpleExpression:  | TBParen expr = expression TEParen      { expr }  | valu = TInt       { AST.int_literal (valu) }  | valu = TString      { AST.string_literal (valu) }  | TNull       { AST.null }  | name = TIdent     TBParen args = separated_list(expression,TComma) TEParen      { AST.make_call (name,args) };// Then, using simple expressions, we apply subscriptingapplyExpression:  | same = simpleExpression {same}  | array = applyExpression     TBSubs index = expression TESubs       { AST.make_sub (array,index) };// Then, once we have these, we can apply our operators// happilyexpression:  | same = applyExpression  {same}  | a = expression     op = binary    b = expression      { AST.make_operation2 (a,op,b) }  | TNot e = expression      { AST.make_operation1 ("!",e) };%inline binary:  | TAdd { "+" }  | TMul { "*" }  | TSub { "-" }  | TDiv { "/" }  | TGt  { ">" }  | TLt  { "<" }  | TGet { ">=" }  | TLet { "<=" }  | TEqu { "==" }  | TNeq { "!=" }  | TAnd { "&&" }  | TOr  { "||" };%%// --- There we go, this should describe the grammar nicely.


Sure, there are several niceties missing above, because I'm short on time (for instance, we don't have an unary minus, or an 'else' clause, or a 'break' statement, or a 'for' loop, or a 'switch' statement, oh well). But these are quickly added with minor alteration of the grammar. Also, I might have overlooked some stupid mistakes, but well, that happens.

Now, the next step is writing a lexer, which we'll plug into the parser. Ideally, I'll be using ocamllex, which is a lexer generator for OCaml:

// --- We define the different tokens as regular expressions// Between braces, the associated tokenrule token =  // Ignore spaces and comments  | '\t'   | ' '   | '\n'   | "//" [^ '\n']                                   { token lexbuf }  // An integer literal  | '-' ? [ '0' - '9' ] + as int                                      { Parser.TInt (int_of_string int) }  // A string literal  | '"' ( ([^ '"' '\\'] | '\\' _) * as lit ) '"'                                  { Parser.TString (lit) }  // An identifier  | [ 'a'-'z' 'A'-'Z' '_' ] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as id                                  { Parser.TIdent (id) }  // Various keywords  | ";"         { Parser.TSColon }  | ","         { Parser.TComma }  | "if"        { Parser.TIf }  | "while"     { Parser.TWhile }  | "function"  { Parser.TFunc }  | "("         { Parser.TSParen }  | ")"         { Parser.TEParen }  | "["         { Parser.TSSub }  | "]"         { Parser.TESub }  | "{"         { Parser.TSBlock }  | "}"         { Parser.TEBlock }  | "+"         { Parser.TAdd }  | "-"         { Parser.TSub }  | "*"         { Parser.TMul }  | "/"         { Parser.TDiv }  | "=="        { Parser.TEqu }  | "!="        { Parser.TNeq }  | "<="        { Parser.TLet }  | ">="        { Parser.TGet }  | "<"         { Parser.TLt }  | ">"         { Parser.TGt }  | "="         { Parser.TSet }  | "print"     { Parser.TPrint }  | "read"      { Parser.TRead }  | "return"    { Parser.TReturn }  | "null"      { Parser.TNull }// End of file  | eof         { Parser.TEof }// An error!  | _ { failwith "Lexing error" }


This lexer should transform happily any input file (or standard input) into a sequence of parser-recognized tokens, which we then plug into the parser to construct our AST! So, there we go: a few hundred lines of description and we've got a near-optimal lexer and parser.

Now, the fun part is defining the AST. We'll actually define a few distinct parts: statements, assignable expressions, and expressions. These work differently: statements are executed, assignable expressions are modified and expressions are evaluated. Our abstract tree should reflect this. I'll also write the generation functions which are used by the parser, so everything works.

(* What our values are *)type value =  | Val_Null  | Val_Array of (value,value) Hashtbl.t  | Val_String of string  | Val_Int of int(* Our tree definition! *)type statement =  | Set of assignable * expression  | Return of expression  | If of expression * (statement list)  | While of expression * (statement list)  | Eval of expression  | Print of expression  | Read of assignableand assignable =  | AVariable of string  | AIndexed of assignable * expressionand expression =  | Variable of string  | Literal of value  | Call of string * (expression list)  | Binary of (value -> value -> value) * expression * expression  | Unary of (value -> value) * expression  | Indexed of expression * expressiontype function =  {    name : string;    args : string list;    body : statement list;  }  (* Our building functions *)let make_function (name,args,body) =  { name = name; args = args; body = body }  let make_assign (into,value)  = Set (into,value)let make_read (into)          = Read (into)let make_print (value)        = Print (value)let make_if (cond,exec)       = If (cond,exec)let make_while (cond,exec)    = While (cond,exec)let return_nothing            = Return (Literal Val_Null)let return (valu)             = Return (valu)let eval (expr)               = Eval (expr)let make_assignable_var (n)   = AVariable (n)let make_assignable_sub (e,i) = AIndexed (e,i)let int_literal (i)           = Literal (Val_Int i)let string_literal (s)        = Literal (Val_String s)let null                      = Literal (Val_Null)let make_call (f,arg)         = Call (f,arg)let make_sub (e,i)            = Indexed (e,i)let make_operation2 (a,op,b) =  Binary ((func_of_op2 op),a,b)  let make_operation1 (op,e) =  Unary ((func_of_op1 op),e)  (* We have a list of loaded functions, which is accessible   by name. Registering the functions consists in binding   them to their own names. *)let functions = Hashtbl.create 10let register =  List.iter    (fun f ->       try         ignore (Hashtbl.find functions f.name);        failwith ("Function '"^f.name^"' defined twice!");      with Not_found ->        Hashtbl.add functions f.name f)        let get_function (name) =  try    Hashtbl.find functions name  with Not_found ->     failwith "Could not find function '"^name^"'!"


Now, our AST is up and running, and can be generated in a breeze simply by launching the parser on a token stream. We can start defining operational semantics for the language. We basically want to implemnt two functions: eval and exec. Eval reads in an expression, and returns the value of that expression. Exec runs a list of statements, altering the world along the way, and returns the value returned by the function either when a "return" is encountered, or when it runs out of statements (implicit 'return;'). First, I'll assume for the sake of the eval function that we have access to a 'call(f,args)' function which calls a function with arguments and returns the return value. So, let's write that function:

(* First, we implement the operators. *)let do_op2 ~op_n ~op_ii ~op_is ~op_ss = fun  (* All null values *)  | Val_Null o  | o Val_Null -> op_n o    (* All array values *)  | (Val_Array _) _ -> | _ (Val_Array _) -> Val_Null    (* The rest *)  | (Val_Int a) (Val_Int b) -> op_ii a b  | (Val_Int a) (Val_array _) -> op_ii 0 a  | (Val_Int a) (Val_String b)   | (Val_string b) (Val_int a) -> op_is a b  | (Val_String a) (Val_String b) -> op_ss a b  let null1 = fun _ -> Val_Nulllet null2 = fun _ _ -> Val_Nulllet of_bool (b) = if b then Val_Int 1 else Val_Null  let op_add =  do_op2    ~op_n  = null1    ~op_ii = (fun a b -> Val_Int (a+b) )     ~op_ss = (fun a b -> Val_String (a^b) )    ~op_is = (fun i s ->       if s = "" then Val_string (string_of_int i)      else try Val_Int (int_of_string s + i) with _ -> (Val_int i)let op_sub =  do_op2    ~op_n  = null1    ~op_ii = (fun a b -> Val_Int (a-b))     ~op_ss = null2    ~op_is = null2      let op_mul =  do_op2    ~op_n  = null1    ~op_ii = (fun a b -> Val_Int (a*b))     ~op_ss = null2    ~op_is = null2    let op_div =  do_op2    ~op_n  = null1    ~op_ii = (fun a b -> Val_Int (a/b))     ~op_ss = null2    ~op_is = null2        let op_and =  do_op2    ~op_n  = null1    ~op_ii = (fun a b -> of_bool (not (a*b = 0)))     ~op_ss = (fun a b -> of_bool (not (a^b = "")))    ~op_is = (fun i s -> of_bool (not (a = 0 || b = "")))    let op_or =  do_op2     ~op_n  = (fun o -> op_and o o)    ~op_ii = (fun a b -> of_bool (a*b <> 0))    ~op_ss = (fun a b -> of_bool (a^b <> ""))    ~op_is = (fun i s -> of_bool (not (a = 0 && b = "")))    let op_equ = fun i j -> of_bool (i = j)let op_neq = fun i j -> of_bool (i <> j)let comp (c) =  do_op     ~op_n = null1    ~op_ii = (fun i j -> of_bool (c i j))    ~op_ss = (fun i j -> of_bool (c i j))    ~op_is = null2    let op_lt = comp (<)let op_gt = comp (>)let op_let = comp (>=)let op_get = comp (<=)let op_not x = op_neq (op_and x x) Val_Null(* These operators are automatically inserted into the AST by the    binary operation, which calls func_of_op* *)   let func_of_op1 = function  | "!" -> op_not  | op -> failwith ("Unknown unary operator '"^op^"'!")let func_of_op2 = function  | "+" -> op_add  | "-" -> op_sub  | "*" -> op_mul  | "/" -> op_div  | "&&" -> op_and  | "||" -> op_or  | "==" -> op_equ  | "!=" -> op_neq  | "<" -> op_lt  | ">" -> op_gt  | ">=" -> op_get  | "<=" -> op_let  | op -> failwith ("Unknown binary operator '"^op^"'!")    (* For the sake of it, finding a variable in a    (name|value)-variable association *)   let find_var (vars) (name) =  try     Hashtbl.find vars name  with Not_found -> Var_Null    (* Finally, we implement the evaluation function. Vars is a   hashtable that binds variables to values. *)let rec eval (vars) = function  | Variable name    -> find_var vars name  | Literal v        -> v  | Call (func,args) -> call func (List.map (eval vars) args)  | Binary (f,a,b)   -> f (eval vars a) (eval vars b)  | Unary (f,a)      -> f (eval vars a)  | Indexed (a,i)    ->     begin match (eval vars a) with      | Val_Array a -> find_var a (eval vars b)      | _ -> ignore (eval vars b); Var_Null    end


The choices of interaction are fairly arbitrary (I decided that an array
automatically turned into a 'null' whenever it was involved in an
operation, except for equality) but the functions are there and can be
amended to perform whatever you wish. The point is, describing the
operations in a loosely typed language such as this one is necessarily
a long deal, since there are N² combinations to consider for N types.

Once the evaluation function is done, we can write a mutually recursive
execution function. The execution function concerns itself
with one call only. It keeps a list of stack positions, represented
as the code left to be executed within each block it is present in.
It then pops an element from the topmost block, executes it, and moves
on to the rest (evaluating objects in the process, and altering the
memory state if necessary).

(* Execute a stack of operations given the name-value    association 'vars' *)and exec (vars) = function  | []            -> Val_Null (* Nothing left to execute *)  | []::above     -> exec vars above (* Block finished, move up one. *)  | (h::t)::above ->   (* We still have commands here: h is the next command, t is the rest     of the commands in this block, above is the blocks above this one. *)  begin match h with     | Set (a,e)    -> assign vars a (eval vars e); eexec vars (t::above)    | Return e     -> eval vars e    | If (e,l)     ->       if (op_and (eval vars e) <> Val_Null)       then exec vars (l::t::above)      else exec (t::above)    | While (e,l) ->      if (op_and (eval vars e) <> Val_Null)       then exec vars (l::(h::t)::above)      else exec (t::above)    | Eval e -> ignore (eval vars e); exec vars (t::above)    | Print e -> print_val (eval vars e); exec vars (t::above)    | Read a -> assign vars a (read_val ()); exec vars (t::above)  end  (* Call a function given by a name *)and call (func) (args) =  (* Try to find the function by name *)  let f = get_function func in    (* Create an environment from the argument list *)  let environment = Hashtbl.create 10 in  List.iter2     (Hashtbl.add environment)    (f.args)    (args);      (* Execute the function *)  exec environment [f.body];;


I'll leave to you to choose how to implement print_val and read_val, and I'll concentrate on the assignment function instead. The objective of this function is to alter the variable or object described by its second argument (as part of the environment passed as first variable) and set its value to its third argument. Since we want the left-hand object to be created (if it's an array-of-array-of-array-of-...) we'll write a function which builds the object (if it doesn't exist) and returns the current value and a function to modify it.

and assign vars target value =  let rec helper = function      | AVariable v ->       (* Return a function which performs the modification *)      (find_var vars v),      (* Value modification *)      (fun x -> Hashtbl.remove vars v; Hashtbl.add vars v x)          | AIndexed (a,i) ->       (* Evaluate the index *)      let i = eval vars i in      (* Build whatever is below *)      let (value,modify) = helper a in      begin match value with        (* Value is alredy an array, so keep it and add to it! *)        | Val_array a ->           (find_var a i),(fun x -> Hashtbl.remove a i; Hashtbl.add a i x)                  (* Value is not an array, we need to index into it so we alter           it.  *)        | _ -> let a = Hashtbl.create 10 in          modify (Val_array a);          (Val_Null),(fun x -> Hashtbl.add a i x)      end  in     (* Retrieve the modification function and call it on the value. *)  snd (helper target) value


This should probably work. It'd probably take one more hour to correct all the various syntax errors, typos and typical mistakes (the code isn't tested), as well as binding code to execute the main function on startup and the usual command line arguments to specify which script to load. Perhaps adding a subscripting option for strings might be a good idea as well. Total time: three hours. Note that the values are passed by reference, but only arrays are mutable, so you can pass a mutable reference by wrapping it in an array, as such:

function alter(foo){  foo[0] = "Ha!";}function main(){  foo[0] = null;  alter(foo);  print foo[0]; // Prints "Ha!"    alter(bar);  print bar[0]; // Prints "null", since bar wasn't an array when it was passed to 'alter'}

Wow. That's alot of stuff. I really appreciate the help. I'm going to properly read over this :)

p.s. Is any of it actually coded in C++ ? :P
Quote:p.s. Is any of it actually coded in C++ ? :P


C++ is a pretty unexpressive, verbose and slow language to do these things in. So no, I'd rather not [wink]

This topic is closed to new replies.

Advertisement