# Formal Definitions

## Language Definition

In the beginning, we need to define our language. For a start it is going to be fairly simple (hopefully we will extend it further): our expressions will consist only of numbers (integer values), operators (for the beginning we're fine with +, -, * and /) and parentheses.

## Backus-Naur Form

For the language definition we are going to use so-called BNF (Backus-Naur Form), which is a method to describe a context-free grammar. The form is a set of simple rules written as:
```
<symbol> ::= expression
```

Where the value on the left side is a non-terminal symbol, and the right side is a sequence of terminal symbols (if any) and non-terminal symbols (if any). On the right side we can have multiple expressions separated by | (which means we can replace the non-terminal on the left with any one of the expressions on the right).
Parts of the expression can be enclosed in '[' and ']' brackets followed by '*' (which means that this expression is repeated N times, where N >= 0), or '+' (which means that this expression is repeated N times, where N >= 1), or '^N' where N is a number (which means that this expression is repeated exactly N times).
An example language is:
```
<integer> ::= [0..9]+
<add_operation> ::= + | -
<mul_operation> ::= * | /
<factor> ::= ( <expression> ) | <integer>
<term> ::= <factor> [<mul_operation> <factor>]*
<expression> ::= <term> [<add_operation> <term>]*
```

## Automata

With this grammar, we can generate any expression that is valid in the language, which is great, but... we do not need to generate all valid inputs of our language - that is left to the user. We need to *accept* the language, and during that process generate code that is simple enough for some sort of virtual machine to execute. As the language is context-free, it is possible to use a recursive descent parser to accept (i.e. it is a valid expression) or reject (i.e. it is an invalid expression) the input. I'm not going to write too much about parsers in this series - I challenge you to try writing them yourself. Of course, you are free to use Flex or any other generator (but I do not recommend this unless you know exactly how you would parse something - these production tools are awesome, but only once you know how the actual parsing works).

What is a recursive descent parser? It is a top-down parser built from mutually recursive procedures. Each of these procedures typically implements one (and sometimes more) of the productions of the grammar it recognizes. The exciting thing about this is that it closely resembles the BNF!

## Compilation and Execution

Compilation is the process of producing machine code from some other language. Let me present a simple abstraction here. Suppose we have a 'program': "Compute three plus four". Such a sentence is, well... not really useful; we need to translate it into something we can work with. One such formal language is math. We translate the sentence into:

```
3+4
```

We know a way to compute this (there are multiple computation models that can do it); let me present one:
```
MOVE reg0 3
MOVE reg1 4
ADD reg0 reg1
PRINT reg0
```

Assume a machine with 2 registers and 3 instructions, where:
- Instruction MOVE takes the 2nd argument (a value) and moves it into the register named by the 1st argument.
- Instruction ADD takes the value stored in the register named by the 2nd argument and adds it to the value stored in the register named by the 1st argument.
- Instruction PRINT prints the value of the register named by the 1st argument onto the screen.

```
INSTRUCTION   | REGISTERS | SCREEN
              | [-, -]    | []
MOVE reg0 3   | [3, -]    | []
MOVE reg1 4   | [3, 4]    | []
ADD reg0 reg1 | [7, 4]    | []
PRINT reg0    | [7, 4]    | [7]
```
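The trace above can be checked mechanically. Here is a minimal interpreter sketch for the three instructions in Python (the `execute` name and dictionary-based registers are illustrative choices, not part of the machine definition):

```python
def execute(program):
    # Two registers (initially empty) and a screen, as in the trace above
    regs = {"reg0": None, "reg1": None}
    screen = []
    for line in program:
        op, *args = line.split()
        if op == "MOVE":
            regs[args[0]] = int(args[1])    # move immediate value into register
        elif op == "ADD":
            regs[args[0]] += regs[args[1]]  # add 2nd register into 1st
        elif op == "PRINT":
            screen.append(regs[args[0]])    # "print" register value onto the screen
        else:
            raise ValueError(f"unknown instruction: {op}")
    return screen

print(execute(["MOVE reg0 3", "MOVE reg1 4", "ADD reg0 reg1", "PRINT reg0"]))  # [7]
```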

Such an approach is known as imperative execution (we are giving the computing machine orders about what to do). This paradigm has one large advantage over other variants of execution: it is rather simple to build a machine that executes an imperative language.
# Simple calculator

Let us begin with something simple - and what is the simplest language we can imagine? A calculator, of course! Our goal is to handle 4 operators (+, -, * and /), parentheses and integer numbers. So, let us formally define the language in BNF:
```
<integer> ::= [0..9]+
<add_operation> ::= + | -
<mul_operation> ::= * | /
<factor> ::= ( <expression> ) | <integer>
<term> ::= <factor> [<mul_operation> <factor>]*
<expression> ::= <term> [<add_operation> <term>]*
```

Note: this language is exactly the one from the previous example.
## Compiler

These 6 rules are to be rewritten into source code; I'm going to provide pseudo-code here (you can see full working code in the accompanying project). Let us start with *Integer*. What do we want to do when we hit an integer? That is simple - we want to put its value into a register (i.e. somewhere we can work with it further). Note that in our assembly we move the right value or register into the left one; also, after each operation, the result is stored in the left register.

```
Integer()
{
    printLine("mov.reg.i32 r0 %s", GetValue());
}
```

Where *GetValue* is just a function that takes the next token from the stream and validates that it is an integer (containing only 0..9). Such a function can look like the following:

```
GetValue()
{
    string output;
    int i = 0;
    string token = GetNextToken(); // Gets current token and moves to next
    do
    {
        if (token[i] in [0..9])
        {
            output += token[i];
        }
        else
        {
            Expected("Expected a digit"); // Crash compilation and print error
        }
        i++;
    }
    while (i < token.length);
    Match();
    return output;
}
```
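As a cross-check, the same routine can be written as a small runnable Python function (the `get_value` name is illustrative, and the token stream is replaced by a plain string argument):

```python
def get_value(token):
    # Validate that the token is an integer: one or more characters in 0..9
    if not token:
        raise SyntaxError("expected an integer, got an empty token")
    output = ""
    i = 0
    while True:  # emulates the do-while: the body runs at least once ('+' repetition)
        if token[i] not in "0123456789":
            raise SyntaxError(f"unexpected character {token[i]!r} in integer")
        output += token[i]
        i += 1
        if i >= len(token):
            break
    return output

print(get_value("42"))  # 42
```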

Note: for '+' (i.e. at least one repetition) we use a *do-while* construct. I'm intentionally going to skip the *add_operation* and *mul_operation* rules and jump over to *Factor*.

```
Factor()
{
    if (NextToken is LEFT-PARENTHESIS)
    {
        Match(LEFT-PARENTHESIS); // Eats current token, matches it against argument and goes to next token
        Expression();
        Match(RIGHT-PARENTHESIS); // Eats current token, matches it against argument and goes to next token
    }
    else
    {
        Integer();
    }
}
```

This was quite obvious - in parentheses we always have another expression, outside we have just an integer. The following two, *Term* and *Expression*, are the interesting ones:

```
Term()
{
    Factor();
    while (NextToken is MULTIPLICATION | DIVISION)
    {
        printLine("push.i32 r0");
        if (NextToken is MULTIPLICATION)
        {
            Match(MULTIPLICATION); // Eats current token, matches it against argument and goes to next token
            Factor();
            printLine("pop.i32 r1");
            printLine("mul.i32 r0 r1");
        }
        else if (NextToken is DIVISION)
        {
            Match(DIVISION); // Eats current token, matches it against argument and goes to next token
            Factor();
            printLine("pop.i32 r1");
            printLine("div.i32 r1 r0");
            printLine("mov.reg.reg r0 r1");
        }
        else if (NextToken is NULL) // If there is no next token
        {
            // Crash compilation and print error
            Expected("Expected multiplication or division operation");
        }
    }
}

Expression()
{
    Term();
    while (NextToken is ADDITION | SUBTRACTION)
    {
        printLine("push.i32 r0");
        if (NextToken is ADDITION)
        {
            Match(ADDITION); // Eats current token, matches it against argument and goes to next token
            Term();
            printLine("pop.i32 r1");
            printLine("add.i32 r0 r1");
        }
        else if (NextToken is SUBTRACTION)
        {
            Match(SUBTRACTION); // Eats current token, matches it against argument and goes to next token
            Term();
            printLine("pop.i32 r1");
            printLine("sub.i32 r0 r1");
            printLine("neg.i32 r0");
        }
        else if (NextToken is NULL) // If there is no next token
        {
            // Crash compilation and print error
            Expected("Expected addition or subtraction operation");
        }
    }
}
```
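For reference, the whole pseudo-code compiler fits into a short runnable Python sketch. The `tokenize` helper and the `Compiler` class are my own illustrative names; the accompanying project's real code will differ in details:

```python
import re

def tokenize(src):
    # Split the input into integer and operator/parenthesis tokens
    return re.findall(r"\d+|[+\-*/()]", src)

class Compiler:
    def __init__(self, tokens):
        self.tokens, self.pos, self.out = tokens, 0, []

    def next(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected=None):
        # Eats current token, matches it against the argument and goes to next token
        token = self.next()
        if expected is not None and token != expected:
            raise SyntaxError(f"expected {expected}, got {token}")
        self.pos += 1
        return token

    def integer(self):
        value = self.match()
        if value is None or not value.isdigit():
            raise SyntaxError(f"expected integer, got {value}")
        self.out.append(f"mov.reg.i32 r0 {value}")

    def factor(self):
        if self.next() == "(":
            self.match("(")
            self.expression()
            self.match(")")
        else:
            self.integer()

    def term(self):
        self.factor()
        while self.next() in ("*", "/"):
            self.out.append("push.i32 r0")  # save left operand across the recursion
            if self.match() == "*":
                self.factor()
                self.out += ["pop.i32 r1", "mul.i32 r0 r1"]
            else:
                self.factor()
                self.out += ["pop.i32 r1", "div.i32 r1 r0", "mov.reg.reg r0 r1"]

    def expression(self):
        self.term()
        while self.next() in ("+", "-"):
            self.out.append("push.i32 r0")  # save left operand across the recursion
            if self.match() == "+":
                self.term()
                self.out += ["pop.i32 r1", "add.i32 r0 r1"]
            else:
                self.term()
                self.out += ["pop.i32 r1", "sub.i32 r0 r1", "neg.i32 r0"]

compiler = Compiler(tokenize("(3 + 4 * (7 - 2)) / 2"))
compiler.expression()
print("\n".join(compiler.out))
```

Running it on `(3 + 4 * (7 - 2)) / 2` prints the same assembly that we derive by hand in a moment.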

What happens here? I know it can seem confusing at first - but we are doing exactly what the BNF rules tell us. Note that we handled *add_operation*/*mul_operation* here in the *while* condition, and based upon which operation we encounter, we do different things. We have actually started handling operator precedence (that is why we always push the register value onto the stack and pop it before working with it); the rest should be clear. The addition and subtraction instructions are inside *Expression* because they have lower priority than the multiplication and division handled in *Term* (in the recursion, the deeper we are, the higher the operator precedence) - so resolving parentheses and actual values has the highest precedence, multiplication and division are in the middle, and addition and subtraction have the lowest precedence. I know this is not the easiest thing to understand, and I highly encourage you to run the compiler in debug mode so you can actually see what is happening (and that precedence is resolved correctly this way). When implementing the compiler like this and running it on an input like:

```
(3 + 4 * (7 - 2)) / 2
```

We obtain the following assembly:
```
mov.reg.i32 r0 3
push.i32 r0
mov.reg.i32 r0 4
push.i32 r0
mov.reg.i32 r0 7
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
sub.i32 r0 r1
neg.i32 r0
pop.i32 r1
mul.i32 r0 r1
pop.i32 r1
add.i32 r0 r1
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
div.i32 r1 r0
mov.reg.reg r0 r1
```

Let us demonstrate what the execution would look like (by stack we mean a LIFO stack):
```
mov.reg.i32 r0 3 // Registers [ 3, -] Stack []
push.i32 r0 // Registers [ 3, -] Stack [ 3]
mov.reg.i32 r0 4 // Registers [ 4, -] Stack [ 3]
push.i32 r0 // Registers [ 4, -] Stack [ 3, 4]
mov.reg.i32 r0 7 // Registers [ 7, -] Stack [ 3, 4]
push.i32 r0 // Registers [ 7, -] Stack [ 3, 4, 7]
mov.reg.i32 r0 2 // Registers [ 2, -] Stack [ 3, 4, 7]
pop.i32 r1 // Registers [ 2, 7] Stack [ 3, 4]
sub.i32 r0 r1 // Registers [-5, 7] Stack [ 3, 4]
neg.i32 r0 // Registers [ 5, 7] Stack [ 3, 4]
pop.i32 r1 // Registers [ 5, 4] Stack [ 3]
mul.i32 r0 r1 // Registers [20, 4] Stack [ 3]
pop.i32 r1 // Registers [20, 3] Stack []
add.i32 r0 r1 // Registers [23, 3] Stack []
push.i32 r0 // Registers [23, 3] Stack [23]
mov.reg.i32 r0 2 // Registers [ 2, 3] Stack [23]
pop.i32 r1 // Registers [ 2, 23] Stack []
div.i32 r1 r0 // Registers [ 2, 11] Stack []
mov.reg.reg r0 r1 // Registers [11, 11] Stack []
```
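The hand-execution above can be automated in a few lines of Python. This is only a sketch of a string-processing virtual machine (the next section argues for a binary encoding instead), and `run` is an illustrative name:

```python
ASM = """mov.reg.i32 r0 3
push.i32 r0
mov.reg.i32 r0 4
push.i32 r0
mov.reg.i32 r0 7
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
sub.i32 r0 r1
neg.i32 r0
pop.i32 r1
mul.i32 r0 r1
pop.i32 r1
add.i32 r0 r1
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
div.i32 r1 r0
mov.reg.reg r0 r1"""

def run(asm_text):
    regs = {"r0": 0, "r1": 0}
    stack = []  # LIFO stack, as in the trace above
    for line in asm_text.strip().splitlines():
        op, *args = line.split()
        if op == "mov.reg.i32":
            regs[args[0]] = int(args[1])
        elif op == "mov.reg.reg":
            regs[args[0]] = regs[args[1]]
        elif op == "push.i32":
            stack.append(regs[args[0]])
        elif op == "pop.i32":
            regs[args[0]] = stack.pop()
        elif op == "add.i32":
            regs[args[0]] += regs[args[1]]
        elif op == "sub.i32":
            regs[args[0]] -= regs[args[1]]
        elif op == "neg.i32":
            regs[args[0]] = -regs[args[0]]
        elif op == "mul.i32":
            regs[args[0]] *= regs[args[1]]
        elif op == "div.i32":
            regs[args[0]] //= regs[args[1]]  # integer division
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs["r0"]

print(run(ASM))  # 11
```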

I recommend checking a few more examples and hand-computing the register and stack values as you go. It will definitely help you understand why operator precedence works here and how the calculator works.
## Disassembler

While I already mentioned how the computation works, I don't really recommend writing a virtual machine that processes strings. We can do better! Performing a very simple disassembly and storing everything in binary form is a far better way - all you need is to think up opcodes for your instructions, assign IDs to your registers, and store all these numbers in binary form. Note that this way you could actually perform disassembly into x86 or x64 if you wish, but that is not necessary. Implementing a disassembler is very straightforward, and I recommend looking into the accompanying source code; even though the implementation there is not highly efficient, it is easy to understand.

## Virtual Machine

Up to this point we could only execute the code "by hand", but that ends now! What does our virtual machine need? It needs some memory, where we can store the stack and the actual code to be executed, and at least 4 registers - our code works with 2 (r0 and r1), and we need 2 more: one for the stack pointer (as we need the stack as temporary memory for our computation) and one for the instruction pointer (so we know what we are executing right now). Now we read our application (in binary) into memory, place the instruction pointer at the address where our application begins, and the stack pointer right after the end of our application in memory. E.g.:

```
|O|U|R|_|A|P|P|L|I|C|A|T|I|O|N|_|S|O|U|R|C|E|_|C|O|D|E| | | | | | | | | | | |
 ^                                                      ^
 IP                                                     SP
```

Where IP represents the memory address the instruction pointer points to, and SP represents the memory address where the stack pointer points to; in this case IP=0 and SP=27.
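As a sketch of how loading could look in code (the opcode IDs, register IDs and byte layout below are made-up choices - the actual numbering is entirely up to you):

```python
import struct

# Hypothetical opcode and register numbering - pick your own in a real project
OPCODES = {"mov.reg.i32": 0, "mov.reg.reg": 1, "push.i32": 2, "pop.i32": 3,
           "add.i32": 4, "sub.i32": 5, "neg.i32": 6, "mul.i32": 7, "div.i32": 8}
REGISTERS = {"r0": 0, "r1": 1}

def assemble(lines):
    # 1-byte opcode, 1 byte per register operand, 4-byte little-endian i32 immediate
    blob = b""
    for line in lines:
        op, *args = line.split()
        blob += bytes([OPCODES[op]])
        for arg in args:
            if arg in REGISTERS:
                blob += bytes([REGISTERS[arg]])
            else:
                blob += struct.pack("<i", int(arg))
    return blob

code = assemble(["mov.reg.i32 r0 3", "push.i32 r0"])
memory = bytearray(code) + bytearray(32)  # the application, followed by room for the stack
ip = 0          # instruction pointer: start of the application
sp = len(code)  # stack pointer: right after the end of the application
print(ip, sp)   # 0 8
```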
As each sub-computation stores its result in r0 (register 0), the result of the whole computation will also end up in register 0. Note that I'm actually printing both registers in the accompanying source code once the computation finishes.