Inside a Modern Programming Language
I'll be investigating the entire process from the ground up: from the way the parser works, to the way the code is compiled, and finally, how the virtual machine handles the whole process. It'll be a long ride, but if you have any interest whatsoever in the way programming languages are implemented, it should be an exciting one.
Part One: Parsing
Taking raw source code and turning it into something the computer can use is actually a pretty involved process. The first stage of this process involves parsing the code: simply examining it one word or symbol at a time, and turning that into a representation that can be further worked on by the compiler.
Parser theory is incredibly complex and full of all kinds of interesting research; so I won't delve into it too much here. (If you're interested in some further reading, check out "dynamic parsers".) Suffice it to say we have an interesting beast in Epoch: the programming language can actually teach itself new syntactical constructs!
Right now the facility for dynamic parsing is fairly limited; it's actually not accessible from Epoch code just yet, so extending the Epoch syntax is done from external C++ libraries. At some point, though, it will be entirely possible to define syntax chunks in an Epoch program itself, and then use those syntax elements in the program.
So what exactly does this mean for us, practically speaking, as programmers who want to write a game or whatnot in Epoch? Happily, there's not much to it from the end-user side: you just write code that "feels" natural, and the internal library and compiler systems do the rest.
Let's check out a basic Epoch program that uses a conditional - just your vanilla, run-of-the-mill if/elseif/else statement.
entrypoint : () -> ()
debugwritestring("Enter a number:")
integer(foo, cast(integer, debugreadstring()))
if(foo == 0)
debugwritestring("You entered zero")
elseif(foo == 1)
debugwritestring("You entered one")
elseif(foo == 2)
debugwritestring("You entered two")
debugwritestring("You entered something else")
It should be pretty clear what's going on here. We write a simple prompt to the user, asking for a number. Then we read a string from the console, turn it into an integer, and lastly print a message based on the value of that integer. The syntax may be a little bit alien, but the basic elements should be familiar to anyone who's used C, C++, Java, PHP, or any of a host of other languages in that syntax family.
For the parser, there's actually a lot going on behind the scenes here. When the program is first read from disk, the Epoch Standard Library is attached to it. This involves loading a C++ DLL, calling a bunch of functions, and generally setting up a bunch of parser tables.
You may be interested to know that the only thing the Epoch language recognizes in the above program is the general function definition for the entrypoint function. Everything else - from the debug function calls, to the if statements, right down to the integer variable declaration and the typecast - is defined in the standard library. This means that the grammar for the Epoch parser is very simple: it basically just recognizes functions, then delegates individual bits of parsing off to the standard library. This also means that by replacing the standard library, we can create a very, very different language with comparatively little effort.
Part of the standard library's initialization pass includes registering some rules with the parser. Basically this is just a way of saying "hey, when you see this sequence of tokens in the code, do such and such." I use boost::spirit::classic for the dynamic parser, and stored_rule for the actual dynamic rule sets.
After winding through a short maze of function calls and redirects, the standard library's rules actually make it to the spirit parser. In essence, it all boils down to function calls like this one:
void AddInlineEntity(const std::string& entityname)
EntityIdentifier = boost::spirit::classic::strlit<>(entityname.c_str()) | EntityIdentifier.copy();
Part Two: Semantic Actions and Entities
The process of taking the parsed input and working on it is generally accomplished with what are referred to as semantic actions. These are basically just pattern matching systems: when a pattern of code is recognized in the input, a corresponding action is taken. (This is a gross oversimplification, so parser theory buffs, please be gentle with me [wink] Again you can find out a lot by doing some research on parser theory, if this bit interests you.)
Epoch code is built out of two primary blocks: statements and entities. A statement is just like in any other imperative programming language: you write some code, the program does something. Statements pretty much boil down to one or more function calls in Epoch; pretty much everything is implemented as a function at some layer, including variable constructors, casts, comparisons, arithmetic, and so on.
The second block is the interesting one. Entities are basically just a sequence of statements inside a lexical scope, but with one important addition: they have meta-control semantics. These semantics define how an entity actually works. For example, if statements in Epoch are entities; but instead of just always executing the statements inside the entity, they conditionally execute or skip their code blocks based on a condition parameter.
Entities can take any type or number of parameters. Most importantly, they can also take zero parameters, which makes it possible to do things like implement else statements as entities. When the entity is reached in the code, the Epoch language invokes some meta-control code for the entity, which examines the parameters (if any), and tells the virtual machine how to proceed.
Another important feature of entities is chaining. Chaining is the process by which multiple entities can be linked together to selectively handle a blob of code. In our example program, the if/elseif/else blocks are all entities which form part of a single entity chain. The meta-control code for these entities selects the appropriate code to execute based on the condition parameters, and then runs it. Once a matching entity is found in the chain, the matching entity's code is executed, and the chain is exited.
Note that it is possible to implement a lot of flow control constructs with this mechanism. For instance, loops can be implemented by instructing the virtual machine to repeat the chain until the meta-control code decides to stop the loop. This can also be used to implement switch or select-case style statements, exception handlers, and so on. In a nutshell, the entity system is incredibly powerful and gives Epoch the potential to do a huge number of things within a uniform framework. This framework makes it possible to extend and upgrade the language with a minimum of work required in updating the parser itself, since everything can be done with dynamic rules from the standard library.
Part Three: Compiler Output and the Virtual Machine
So now we know that the code is turned into a chain of entities. What does the compiler actually produce to represent all this? Let's take a peek at the bytecode for the virtual machine, which is the stuff the compiler puts out and the VM later can execute.
00000000 POOL_STR 26 You entered something else
0000003b POOL_STR 25 @@anonscope@@3
0000005e POOL_STR 24 You entered two
00000083 POOL_STR 23 @@anonscope@@2
000000a6 POOL_STR 22 You entered one
000000cb POOL_STR 21 @@anonscope@@1
000000ee POOL_STR 20 You entered zero
00000115 POOL_STR 19 @@anonscope@@0
00000138 POOL_STR 18 foo
00000145 POOL_STR 17 Enter a number:
0000016a POOL_STR 16 entrypoint
00000185 POOL_STR 15 cast
00000194 POOL_STR 14 else
000001a3 POOL_STR 13 elseif
000001b6 POOL_STR 12 if
000001c1 POOL_STR 11 ==
000001cc POOL_STR 10 /
000001d5 POOL_STR 9 *
000001de POOL_STR 8 -
000001e7 POOL_STR 7 +
000001f0 POOL_STR 6 cast@@string_to_integer
00000225 POOL_STR 5 cast@@integer_to_string
0000025a POOL_STR 4 string
0000026d POOL_STR 3 integer
00000282 POOL_STR 2 debugreadstring
000002a7 POOL_STR 1 debugwritestring
000002ce INVOKE 16
000002d4 ENTITY 1 16
000002dd PUSH_STR 17
000002e6 INVOKE 1
000002eb PUSH_STR 18
000002f4 PUSH_STR 3
000002fd INVOKE 2
00000302 INVOKE 6
00000307 INVOKE 3
0000030d READ 18
00000312 PUSH_INT 0
0000031b INVOKE 11
00000320 ENTITY 17 19
00000329 PUSH_STR 20
00000332 INVOKE 1
00000338 READ 18
0000033d PUSH_INT 1
00000346 INVOKE 11
0000034b ENTITY 18 21
00000354 PUSH_STR 22
0000035d INVOKE 1
00000363 READ 18
00000368 PUSH_INT 2
00000371 INVOKE 11
00000376 ENTITY 18 23
0000037f PUSH_STR 24
00000388 INVOKE 1
0000038e ENTITY 19 25
00000397 PUSH_STR 26
000003a0 INVOKE 1
000003a9 SCOPE 16 0 1 18 4 0
000003c2 SCOPE 19 16 0
000003cf SCOPE 21 16 0
000003dc SCOPE 23 16 0
000003e9 SCOPE 25 16 0
There's a lot going on here! The first section of the code simply sets up a pool of string literals, which are just string data used by the program that is expected to be available at runtime. Note that there are several function names and even the keyword integer in there - this is the result of the standard library being in control of so much of the language. Since all those constructs are defined by the standard library, they are considered runtime entities on par with anything else in the language, or in the program that you write.
Next we see hiding in there two "bootstrap" instructions: INVOKE 16 and HALT. These instructions simply tell the virtual machine to find the function with ID 16 (look in the string table to find that this corresponds to "entrypoint") and execute it, then stop the program. entrypoint is Epoch's equivalent of main; it is the first code to run in the program, and unless something goes catastrophically wrong, it is often the last. (This is another vast oversimplification, but I won't get into details on the exceptions to that statement, for sake of space and time.)
Now we see the good stuff: an ENTITY instruction. This corresponds to the entry point function directly. Inside here is a bunch of goodies, and then the program ends with some metadata about lexical scopes and such, which we can safely ignore for the purposes of this investigation.
Look in the code for offset 0x30c and the CHAIN instruction. This signals the beginning of an entity chain. After that, we read a variable (foo, in this case) from its storage place on the stack, push the integer value 0 onto the stack, and then invoke the == comparison function. (Note again that comparisons are functions in the standard library; the VM does not have a comparison instruction.)
Now we see another entity, which corresponds directly to the first if in the code sample. This entity is special: it performs some meta-control code. (That isn't immediately obvious from the bytecode; it's controlled by the standard library once again. The entity type tag "17" is our only hint as to what kind of meta-control code will be involved; note that entity type tags aren't the same as string IDs!)
Under the hood, the standard library takes the result of the == function that was called earlier, and examines it; based on the value of that comparison, it will inform the VM on what to do next.
If the condition is true, the meta-control code tells the VM to go ahead and execute the entity in question, then exit the chain by skipping down to the corresponding ENDCHAIN instruction (in this case at offset 0x3a6). Otherwise, the meta-control code says to pass on the execution control to the next chunk of code within the chain - offset 0x338. Here we see another set of instructions which calls == once again, this time comparing to the value 1 instead of 0.
This chaining process repeats again, until we reach the final entity (type 19, at 0x38e) which is our else statement. This one has no parameters, so there's no code between the end of the previous entity and this one. The meta-control code for else always executes the code body attached to the entity. (This works because if the condition was true, we would have skipped past here to the end of the chain, so we'll never accidentally run the if and the else in the same pass.)
Now we're all done, and we ENDCHAIN and then RETURN. The return from entrypoint takes us back up to the original INVOKE, and execution reaches the instruction at 0x2d3 - HALT. Program's done!
So now we've seen the entire process, from start to finish, of how an Epoch program is converted from raw source code into a useful, running piece of software. Of course this is a pretty simplistic example just for illustrative purposes, but hopefully it gives you a decent taste of the way the language is built and how it all works under the surface.
Look for Epoch Release 10 soon, featuring all this good stuff and more; stay tuned to this space for updates on R10 and a release announcement. You can also follow development of the language on the Epoch web site and our LinkedIn group.