How to build a virtual machine

7 comments, last by assainator 9 years, 7 months ago

I want to build a virtual machine because I want to really LEARN deeply how a virtual machine works under the hood. I have a year and a half of experience programming games in Java and love building software. I think my problem-solving skills and curiosity can be transferred to tackling some of the complexity in building a VM.

I read that I might need to communicate with the CPU, registers, and RAM in some shape or form to create some features of a virtual machine. Does Java have a library that can communicate with the hardware components of the system unit?

When I say build a virtual machine, I mean building the actual software using a library.

I need guidance.

Well, the first question you want to ask yourself is what is the purpose of the VM? Or what do you want to emulate?

Writing a VM doesn't necessarily mean writing assembly code. In the context of games, there are two common emulation concepts:

  1. Emulating hardware - this could be an old game system (an NES emulator, for example) or another computer architecture such as the Commodore 64. For this type you really need to know the architecture you are planning to emulate - basically learn its assembly language, how instructions are encoded, the memory architecture, the video and sound systems, etc. Depending on your choice of HW to emulate, this can become really tricky.
  2. Emulating game engines - this is the kind of thing ScummVM does. In this case, you write an interpreter for the engine's scripting language. There are some similarities to the first kind, but the main difference is that you don't need to understand the hardware internals, which greatly simplifies the problem. For old game systems (SCUMM, SCI), it's a simple mapping of the scripting instructions to some high-level Java/C++ calls, with no need for assembly.
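
As a rough illustration of the "simple mapping" in item 2, here is a minimal C++ sketch. The opcodes (`WalkTo`, `Say`, `End`), their byte layout, and the `Engine` type are all invented for this example; they are not real SCUMM or SCI opcodes, and a real engine would do actual work instead of recording a log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical script opcodes; real SCUMM/SCI opcodes differ.
enum class ScriptOp : std::uint8_t { WalkTo, Say, End };

// Stand-in for the high-level engine; it only records what was requested.
struct Engine {
  std::vector<std::string> log;
  void walkTo(int x, int y) {
    log.push_back("walk " + std::to_string(x) + "," + std::to_string(y));
  }
  void say(int lineId) { log.push_back("say " + std::to_string(lineId)); }
};

// Reads one opcode byte plus its operand bytes and forwards to the engine.
void runScript(std::vector<std::uint8_t> const& script, Engine& engine) {
  std::size_t pc = 0;
  while (pc < script.size()) {
    auto op = static_cast<ScriptOp>(script[pc++]);
    switch (op) {
    case ScriptOp::WalkTo:
      assert(pc + 2 <= script.size()); // two operand bytes: x, y
      engine.walkTo(script[pc], script[pc + 1]);
      pc += 2;
      break;
    case ScriptOp::Say:
      assert(pc + 1 <= script.size()); // one operand byte: line id
      engine.say(script[pc]);
      pc += 1;
      break;
    case ScriptOp::End:
      return;
    default:
      return; // unknown opcode: stop rather than guess its length
    }
  }
}
```

Running the byte sequence `0, 10, 20, 1, 3, 2` through `runScript` would ask the engine to walk to (10, 20), say line 3, and then stop. No knowledge of the original hardware is involved.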

There are also other kinds of VMs - for example, one can write a virtual machine for HLSL/GLSL shader execution. Very simple, but limited and pretty useless without proper context.

There's really no reason to learn x86 assembly (at least not at the beginning). If you are targeting some old architecture/game engine - your current PC is thousands of times more powerful, so even a decent interpreter would do.

Java will do just fine (in fact, Java itself runs on a VM with JIT compilation).

You can take a look at the ScummVM source code and use it as an example (C++, uses OpenGL for the video system).

If you are just starting, I suggest that you:

- Work on Scumm. You have reference code for it, which means it's well documented.

- Start with a basic interpreter, just to make sure you understand how the system works.

- Once it's working, identify the performance bottlenecks and optimize them (use HW accelerators for graphics/sound system, learn and apply x86 optimizations, etc.).

- That should basically do. If you still feel unsatisfied, start learning JIT (just-in-time) compilation. This takes you to a whole new level of low-level x86 understanding.

The source code for QEMU is available, but studying it is no easy task.

Most scripting languages include some form of VM (some just directly execute parse trees, but those are dying out).

The gist is pretty simple. You can build a naive VM for a custom language using a structure similar to:

#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

enum class OpCode : std::uint8_t {
  Noop,
  LoadConstant,
  Add,
  Subtract,
  JumpIfEqual,
  JumpIfNotEqual,
  // etc.
  NumOpCodes
};

using RegisterName = std::uint8_t;

// one register per possible RegisterName value, so an index is never out of range
constexpr std::size_t NumRegisters = 256;

struct Instruction final {
  OpCode opCode = OpCode::Noop;
  RegisterName dest = 0;
  RegisterName src1 = 0;
  RegisterName src2 = 0;
  union {
    double dImmediate = 0.0; // constant operand for LoadConstant
    std::size_t iImmediate;  // target instruction index for jumps
  };
};

class Context final {
  std::array<double, NumRegisters> m_Registers{};
  std::vector<Instruction> m_Instructions;
  std::size_t m_CurrentInstruction = 0;

public:
  bool LoadAssembly(std::string const& path);
  void Execute();
};

bool Context::LoadAssembly(std::string const& path) {
  // parse out a very simple assembly language here using istream and friends,
  // appending each decoded instruction to m_Instructions
  (void)path;           // unused until the parser is written
  bool success = false; // set to true once the file has been parsed
  return success;
}

void Context::Execute() {
  while (m_CurrentInstruction < m_Instructions.size()) {
    // fetch, then advance so that jumps can simply overwrite the index
    Instruction const& inst = m_Instructions[m_CurrentInstruction++];
    switch (inst.opCode) {
    case OpCode::Noop:
      break; // do nothing
    case OpCode::LoadConstant:
      m_Registers[inst.dest] = inst.dImmediate;
      break;
    case OpCode::Add:
      m_Registers[inst.dest] = m_Registers[inst.src1] + m_Registers[inst.src2];
      break;
    case OpCode::JumpIfEqual:
      if (m_Registers[inst.src1] == m_Registers[inst.src2])
        m_CurrentInstruction = inst.iImmediate;
      break;
    // etc.
    default:
      throw std::runtime_error("Unknown instruction");
    }
  }
}

You can build from there to include more than just double types, to include a stack for function calls, and so on. You can read the specification for an existing bytecode format (say, Java) and load that instead of a custom format (though often your VM will need to be specially tailored for that specific bytecode, since the format places requirements on the VM). You can study language parsing and compiling to get your own high-level language (or another high-level language) compiling into your VM's bytecode format. With some work on the above you can optimize your binary format so you're not wasting so much space in Instruction when it's not needed, you can implement better instruction dispatch, and so on.
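
To make one of those extensions concrete, here is a small self-contained sketch of a stack for function calls. It is not an extension of the exact classes above: the `Op`, `Instr`, `run`, and `makeDemoProgram` names are made up for the example, register indices are not range-checked, and a real VM would probably also save registers or allocate a frame per call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : std::uint8_t { LoadConstant, Add, Call, Return, Halt };

struct Instr {
  Op op;
  std::uint8_t dest;
  std::uint8_t src1;
  std::uint8_t src2;
  double dImmediate;  // used by LoadConstant
  std::size_t target; // used by Call
};

// Runs the program and returns the value left in register 0.
double run(std::vector<Instr> const& code) {
  double regs[8] = {};
  std::vector<std::size_t> callStack; // return addresses only
  std::size_t pc = 0;
  while (pc < code.size()) {
    Instr const& in = code[pc++];
    switch (in.op) {
    case Op::LoadConstant:
      regs[in.dest] = in.dImmediate;
      break;
    case Op::Add:
      regs[in.dest] = regs[in.src1] + regs[in.src2];
      break;
    case Op::Call:
      callStack.push_back(pc); // pc already points at the next instruction
      pc = in.target;
      break;
    case Op::Return:
      assert(!callStack.empty()); // Return without a matching Call
      pc = callStack.back();
      callStack.pop_back();
      break;
    case Op::Halt:
      return regs[0];
    }
  }
  return regs[0];
}

// Example program: load 2 and 3, call a function that adds them,
// double the result after returning, then halt.
std::vector<Instr> makeDemoProgram() {
  return {
      {Op::LoadConstant, 1, 0, 0, 2.0, 0}, // 0: r1 = 2
      {Op::LoadConstant, 2, 0, 0, 3.0, 0}, // 1: r2 = 3
      {Op::Call, 0, 0, 0, 0.0, 5},         // 2: call the function at 5
      {Op::Add, 0, 0, 0, 0.0, 0},          // 3: r0 = r0 + r0
      {Op::Halt, 0, 0, 0, 0.0, 0},         // 4: stop, result in r0
      {Op::Add, 0, 1, 2, 0.0, 0},          // 5: r0 = r1 + r2
      {Op::Return, 0, 0, 0, 0.0, 0},       // 6: back to instruction 3
  };
}
```

`run(makeDemoProgram())` evaluates to 10: the called function computes 2 + 3, and the caller doubles it after the return.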

Your VM's design is going to vary based on whether you're interpreting a dynamic language, a static language, a "trusted" language, etc. Lua's simple (and open source) VM is based on a dynamic trusted bytecode (scripts are normally compiled from source when loaded and the VM does not verify precompiled bytecode, so it simply assumes it will never encounter malformed or malicious instructions). Java (and there are open source JVMs you can inspect) uses a static untrusted bytecode. .NET/C#/VB (Mono is an open source implementation of a .NET VM) uses a more advanced static untrusted bytecode.
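
To show what "untrusted" means in practice, here is a hedged sketch of a verification pass that runs before execution. The instruction layout, `kNumRegisters`, `verify`, and `runVerified` are assumptions made for this example; real verifiers such as the JVM's also check types and stack shapes, which this does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : std::uint8_t { LoadConstant, Add, JumpIfEqual, NumOps };

struct Instr {
  std::uint8_t op;    // raw byte from the file; may not be a valid Op
  std::uint8_t dest;
  std::uint8_t src1;
  std::uint8_t src2;
  std::size_t target; // jump target, used by JumpIfEqual
  double imm;         // constant, used by LoadConstant
};

constexpr std::size_t kNumRegisters = 16;

// Returns true only if every instruction has a known opcode, in-range
// registers, and (for jumps) a target inside the program.
bool verify(std::vector<Instr> const& code) {
  for (Instr const& in : code) {
    if (in.op >= static_cast<std::uint8_t>(Op::NumOps)) return false;
    if (in.dest >= kNumRegisters || in.src1 >= kNumRegisters ||
        in.src2 >= kNumRegisters)
      return false;
    if (static_cast<Op>(in.op) == Op::JumpIfEqual && in.target >= code.size())
      return false;
  }
  return true;
}

// The interpreter can then skip per-instruction checks in its hot loop.
double runVerified(std::vector<Instr> const& code) {
  assert(verify(code)); // caller must reject bad input before running it
  double regs[kNumRegisters] = {};
  std::size_t pc = 0;
  while (pc < code.size()) {
    Instr const& in = code[pc++];
    switch (static_cast<Op>(in.op)) {
    case Op::LoadConstant:
      regs[in.dest] = in.imm;
      break;
    case Op::Add:
      regs[in.dest] = regs[in.src1] + regs[in.src2];
      break;
    case Op::JumpIfEqual:
      if (regs[in.src1] == regs[in.src2]) pc = in.target;
      break;
    case Op::NumOps:
      break; // unreachable once verify() has passed
    }
  }
  return regs[0];
}
```

A "trusted" design skips `verify` entirely and relies on the bytecode only ever coming from its own compiler.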

You can then look into libjit or various related libraries for generating raw machine instructions and avoid much of the need for a custom VM.

For a top-of-the-line VM with JITting, you'll want to start reading the source code to things like LuaJIT or the various JavaScript engine implementations. The state of the art is very advanced and complex, but thankfully there are a hobajillion articles on the Web about all the advancements made in the open source JavaScript VMs and LuaJIT and even the experimental Java VMs. The tricks used to enable JIT support in a dynamic language are very non-trivial, so you might find it easier to implement a static-type VM if you're interested in exploring JITting, at least at first.

Sean Middleditch – Game Systems Engineer – Join my team!

Depends entirely how you define 'VM' -- I assume what you're talking about is "something like the Java Virtual Machine" -- and even within the context of something like the JVM, there are several distinct approaches. The easiest approach would be some kind of high-level language or byte-code interpreter -- in that approach, the JVM is not unlike the BASIC programming languages of yore, or many of the scripting languages popular today. Byte-code interpreters serve basically the same function, but are attractive because of the increased performance of applications running on the VM, and code-density/code-hiding -- but now you also need to write some kind of compiler, in addition to your VM, both of which are big tasks.
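
To give a feel for the "compiler plus VM" split, here is a deliberately tiny sketch: a recursive-descent compiler that turns expressions such as `1+2*3` into stack bytecode, and a loop that runs it. The grammar, the `Op`/`Instr` layout, and the `compile`/`run` names are assumptions for this example; a real language needs error reporting, whitespace handling, variables, and much more.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class Op { Push, Add, Mul };
struct Instr {
  Op op;
  double value; // only meaningful for Push
};

// Grammar: expr := term ('+' term)*     term := number ('*' number)*
// Numbers are non-negative integers; whitespace is not handled.
class Compiler {
  std::string src_;
  std::size_t pos_ = 0;
  std::vector<Instr> out_;

  bool atDigit() const {
    return pos_ < src_.size() &&
           std::isdigit(static_cast<unsigned char>(src_[pos_]));
  }
  void number() {
    assert(atDigit()); // a real compiler would report a syntax error
    double v = 0;
    while (atDigit()) v = v * 10 + (src_[pos_++] - '0');
    out_.push_back({Op::Push, v});
  }
  void term() {
    number();
    while (pos_ < src_.size() && src_[pos_] == '*') {
      ++pos_;
      number();
      out_.push_back({Op::Mul, 0.0}); // emitted after both operands
    }
  }
  void expr() {
    term();
    while (pos_ < src_.size() && src_[pos_] == '+') {
      ++pos_;
      term();
      out_.push_back({Op::Add, 0.0});
    }
  }

public:
  explicit Compiler(std::string const& src) : src_(src) {}
  std::vector<Instr> compile() {
    expr();
    return out_;
  }
};

std::vector<Instr> compile(std::string const& source) {
  return Compiler(source).compile();
}

// The "VM" half: a stack machine that executes the compiled bytecode.
double run(std::vector<Instr> const& code) {
  std::vector<double> stack;
  for (Instr const& in : code) {
    if (in.op == Op::Push) {
      stack.push_back(in.value);
      continue;
    }
    assert(stack.size() >= 2);
    double rhs = stack.back(); stack.pop_back();
    double lhs = stack.back(); stack.pop_back();
    stack.push_back(in.op == Op::Add ? lhs + rhs : lhs * rhs);
  }
  assert(stack.size() == 1);
  return stack.back();
}
```

`compile("1+2*3")` produces five instructions (Push 1, Push 2, Push 3, Mul, Add), and `run` on that bytecode yields 7. Operator precedence is settled once at compile time, so the VM never has to re-parse the text.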

A state-of-the-art VM today uses just-in-time (or JIT) compilation to get best performance. In this setup, there's typically a compiler that takes a high-level language and translates it into Intermediate Language (or IL), a lower-level representation (which often, but not always, looks similar to assembly language of popular CPUs); this compiler only concerns itself with high-level, algorithmic optimizations. Then, the JIT-employing VM performs a final phase of compilation in which the VM IL is compiled down to actual machine code that can be run on the host processor directly, but still within the sandbox defined by the VM platform.

Much of the job of writing a usable (professional) VM today is about security of the sandbox environment, things that enhance programmer productivity -- like garbage collection, and JIT performance. Within that sphere of influence, the part you probably think about as "the VM" is mostly about defining what the virtual hardware platform and IL look like -- things like whether the machine architecture is register-based/stack-based/direct-memory/something-else, to tedious and not-immediately-obvious concerns like what the threading model/shared-memory/message-passing/memory-coherence models are. That's for a 'production quality' kind of VM -- odds are, if you want to keep it simple, your VM will look more like a scripting language.
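
As a small illustration of the register-based versus stack-based choice, the sketch below evaluates `(a + b) * c` on both kinds of machine. Both instruction sets and all names are invented for the comparison and do not mirror the JVM, Lua, or any other real VM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stack-based: operands are implicit (top of stack), so each instruction is
// small, but more instructions are needed.
enum class StackOp : std::uint8_t { LoadLocal, Add, Mul };
struct StackInstr {
  StackOp op;
  std::uint8_t local; // only used by LoadLocal
};

double runStack(std::vector<StackInstr> const& code,
                std::vector<double> const& locals) {
  std::vector<double> stack;
  for (StackInstr const& in : code) {
    if (in.op == StackOp::LoadLocal) {
      assert(in.local < locals.size());
      stack.push_back(locals[in.local]);
      continue;
    }
    assert(stack.size() >= 2);
    double rhs = stack.back(); stack.pop_back();
    double lhs = stack.back(); stack.pop_back();
    stack.push_back(in.op == StackOp::Add ? lhs + rhs : lhs * rhs);
  }
  assert(!stack.empty());
  return stack.back();
}

// Register-based: every instruction names its operands explicitly, so each
// one is larger, but fewer are needed.
enum class RegOp : std::uint8_t { Add, Mul };
struct RegInstr {
  RegOp op;
  std::uint8_t dest, src1, src2;
};

// Takes the registers by value and returns the last register as the result.
double runRegister(std::vector<RegInstr> const& code, std::vector<double> regs) {
  for (RegInstr const& in : code) {
    assert(in.dest < regs.size() && in.src1 < regs.size() &&
           in.src2 < regs.size());
    double lhs = regs[in.src1], rhs = regs[in.src2];
    regs[in.dest] = in.op == RegOp::Add ? lhs + rhs : lhs * rhs;
  }
  assert(!regs.empty());
  return regs.back();
}

// (a + b) * c with a, b, c in locals/registers 0, 1, 2; register 3 is scratch.
std::vector<StackInstr> const kStackProgram = {
    {StackOp::LoadLocal, 0}, {StackOp::LoadLocal, 1}, {StackOp::Add, 0},
    {StackOp::LoadLocal, 2}, {StackOp::Mul, 0}};
std::vector<RegInstr> const kRegProgram = {{RegOp::Add, 3, 0, 1},
                                           {RegOp::Mul, 3, 3, 2}};
```

With a = 2, b = 3, c = 4, both programs produce 20. The stack version takes five two-byte instructions and the register version two four-byte ones, which is the usual trade-off between instruction count and instruction size.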

throw table_exception("(ノ ゜Д゜)ノ ︵ ┻━┻");

This article explains how Virtual Machines work: http://gameprogrammingpatterns.com/bytecode.html

http://amzn.com/1931841578

This is an older book, and has a terrible name. It actually teaches how to build a scripting language compiler and virtual machine from the ground up. It's written in C++, and not without some bugs in the code, but it is still a great book.

I used this to make a scripting language, compiler, and virtual machine for my final project for my programming degree. I was able to get math, variable, and recursive functions working, as well as embedding the language in C++ and adding C++ extensions to call C++ from the scripting language.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

Thanks

My recommendation is to look up simple scripting languages on the internet and read their source. I found IonScript to be particularly useful: https://github.com/keebus/IonScript

"What? It disintegrated. By definition, it cannot be fixed." - Gru - Despicable Me

"Dude, the world is only limited by your imagination" - Me

This topic is closed to new replies.