Compiling to register-based instructions

Started by
0 comments, last by Kylotan 14 years, 5 months ago
Most VMs use a stack-based approach to interpret code, such as Java and the .NET runtime. Some however have started using register based instructions, because they tend to take up less opcodes, and don't have a load of implicit operands. One reason why I'm interested in doing this is that it would make JIT compilation of my scripts (and perhaps even full compilation) easier. However, I do not know how to achieve a good method of doing this. Today I tried something that 'simulates' stack based instructions, but using registers, as if the register file is a stack. Register 1 is the bottom, register 2 is above that, etc. Here's the code (Python):

# Tokenizer
class Tokenizer:

    def __init__( self, source ):
        self.src    = source + " ";
        self.sym    = ["+", "-", "*", "/", "(", ")", "=" ]

    def IsAlpha( self, char ):
        char = ord( char )
        return ( char >= ord( 'a' ) and char <= ord( 'z' ) ) or ( char >= ord( 'A' ) and char <= ord( 'Z' ) ) or char == ord( '_' )

    def IsDigit( self, char ):
        char = ord( char )
        return char >= ord( '0' ) and char <= ord( '9' )

    def IsOther( self, char ):
        return not ( self.IsAlpha( char ) or self.IsDigit( char ) )

    def IsSymbol( self, tok ):
        return tok in self.sym;

    def IsIdent( self, tok ):

        if not self.IsAlpha( tok[0] ):
            return False

        for i in range( 1, len( tok ) ):
            if self.IsOther( tok ):
                return False

        return True

    def IsNumber( self, tok ):
        for i in tok:
            if not self.IsDigit( i ):
                return False

        return True

    def IsToken( self, tok ):
        return self.IsNumber( tok ) or self.IsIdent( tok ) or self.IsSymbol( tok )

    def Tokenize( self ):
        tok = []

        a = 0
        while ( a < len( self.src ) ):

            b = len( self.src ) - 1
            while ( b > a ):

                substr = self.src[a:b]
                if self.IsToken( substr ):
                    tok += [substr]
                    a = b - 1
                    break
                
                b -= 1

            a += 1

        return tok




# Compiler
class Compiler:

    def __init__( self, src ):
        self.tok = Tokenizer( src )
        self.toks   = self.tok.Tokenize()

    def Compile( self ):

        instr = []
        stack = []
        stackcount = 1
        maxstack = 0
        
        for cur in self.toks:

            # Number or variable
            if ( self.tok.IsDigit( cur[0] ) ):
                instr += [[ "load", stackcount, int( cur ) ]]
                stackcount += 1
                if stackcount > maxstack:
                    maxstack = stackcount
            else:
                if ( not ( cur == "(" or cur == ")" ) ):
                    stack += cur
                else:
                    if ( cur == "(" ):
                        stack += cur
                    else:
                        while ( stack[-1] != "(" ):
                            stackcount -= 2
                            curi = []
                            if ( stack[-1] == "+" ):
                                curi = ["add"]
                            if ( stack[-1] == "-" ):
                                curi = ["sub"]
                            if ( stack[-1] == "*" ):
                                curi = ["mul"]
                            if ( stack[-1] == "/" ):
                                curi = ["div"]

                            curi += [stackcount, stackcount, stackcount + 1]
                            stackcount += 1

                            instr += [curi]
                            
                            stack.pop()
                        stack.pop()

        if ( len( stack ) > 0 ):
            while ( len ( stack ) > 0 ):
                stackcount -= 2
                curi = []
                if ( stack[-1] == "+" ):
                    curi = ["add"]
                if ( stack[-1] == "-" ):
                    curi = ["sub"]
                if ( stack[-1] == "*" ):
                    curi = ["mul"]
                if ( stack[-1] == "/" ):
                    curi = ["div"]

                curi += [stackcount, stackcount, stackcount + 1]
                stackcount += 1

                instr += [curi]
                            
                stack.pop()

        instr.insert( 0, [ ".maxstack", maxstack ] )
        instr += [["print", 1]]
        return instr



# Virtual Machine
class VM:

    def __init__( self, src ):
        self.comp   = Compiler( src )
        self.inst   = self.comp.Compile()

    def Display( self ):

        for i in self.inst:
            curinst = i[0]

            if curinst == "load":
                print curinst + "\tR" + str( i[1] ) + "\t" + str( i[2] )

            if curinst == "add" or curinst == "sub" or curinst == "mul" or curinst == "div":
                print curinst + "\tR" + str( i[1] ) + "\tR" + str( i[2] ) + "\tR" + str( i[3] )

            if curinst == "print":
                print curinst + "\tR" + str( i[1] )

            if curinst == ".maxstack":
                print "register count:\t" + str( i[1] )

        print "--------------------------"

    def Run( self ):

        reg = []
        for i in self.inst:

            inst = i[0]

            if ( inst == ".maxstack" ):
                reg = [0] * i[1]

            if ( inst == "load" ):
                reg[i[1]] = i[2]

            if ( inst == "add" ):
                reg[i[1]] = reg[i[2]] + reg[i[3]]

            if ( inst == "sub" ):
                reg[i[1]] = reg[i[2]] - reg[i[3]]

            if ( inst == "mul" ):
                reg[i[1]] = reg[i[2]] * reg[i[3]]

            if ( inst == "div" ):
                reg[i[1]] = reg[i[2]] / reg[i[3]]

            if ( inst == "print" ):
                print reg[i[1]]
            

eqn = raw_input( "Enter an equation [i.e., (4-2)*3]: " )
print "--------------------------"
vm = VM( eqn )
vm.Display()
vm.Run()

Give it a shot. However, we learned in our computing systems lecture that this is just an easy way to do this, and 'real' compilers actually do it a different way. My way IS easy because it works like a stack, and it's easy to deal with dependencies. The 'other' way seems to have a completely different structure to the actual format that source is programmed in, and seems to make dependencies more tricky. Does anyone know a good method of doing this? Thanks
Advertisement
You've still got more of an byte-code interpreter than a compiler there, since you're mapping high level operations pretty much one-to-one to the instructions. A real compiler will typically have to perform decomposition of a high level statement into several low level instructions.

Instead of being able to arbitrarily choose how many registers you want, typically you're limited to a given number of them. (In real compilers this is down to the hardware, but you get to choose how many you want.) The compiler for a high level language then decides which registers are available for each instruction and allocates them accordingly in the generated machine code. (It's possible to run out of registers but you can always overflow onto the stack in that case.) These algorithms are supposedly quite complex. Maybe the Wikipedia page on register allocation will give you some pointers.

Sometimes register choice is further complicated since instructions are hard-wired (literally, at least in older hardware) to use certain registers, meaning the compiler sometimes has to shuffle things around to get them to the correct register. This is probably less of a problem on RISC architectures than on CISC and not a problem at all on soft architectures such as a bespoke VM because you can simply avoid writing such instructions should you so choose.

It also has to consider generating instructions to save and restore registers when calling functions so that instructions in the nested function don't overwrite values in the current function. You could save and restore all registers for every function call, but technically a function only has to preserve the registers it uses and restore them at the end, so there is scope for optimisation there.

However, many of these issues are extra concerns that machine code compilers have to address in order to work with the specific hardware in question. Your VM doesn't have the same physical restrictions and therefore much of this complexity can be ignored.

I hope all of that's correct, as it's a long time since I last worked with compilers or assembly.

This topic is closed to new replies.

Advertisement