Loop unrolling

Hello everybody,

I'm working on a small compiler + 'VM' thingy: the compiler parses the source code and compiles it to bytecode, and the 'tinyvm' interprets the bytecode. I'm down to the 'optimizations' part of the compiler. For example, instead of

pushf 2
pushf 3
add

the compiler generates

pushf 5

I also implemented loop unrolling: for simple loops like for(i=0;i<10;i++) { code }, the compiler generates the bytecode corresponding to i = 0; { code }; i = 1; { code }; ... But I've started wondering whether loop unrolling is such a good idea here, because branching in the 'tinyvm' isn't the same as branching on a real processor.

So, here's the question: besides the fact that I get more bytecode, does anyone see any other advantages or disadvantages to loop unrolling for a system like mine? (Remember: the VM interprets the bytecode, it doesn't JIT it, and the compiler doesn't generate real assembly code.) Thanks.
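For concreteness, a minimal sketch of the constant-folding case above. The node types (ConstNode, AddNode) are invented for illustration; the thread doesn't show the compiler's real AST classes.

[source language="c#"]
// Hypothetical AST node types, invented for illustration only.
abstract class Node { }

class ConstNode : Node
{
    public double Value;
    public ConstNode(double value) { Value = value; }
}

class AddNode : Node
{
    public Node Left, Right;
    public AddNode(Node left, Node right) { Left = left; Right = right; }
}

static class ConstantFolder
{
    // Fold children first; if both operands end up constant, the whole
    // addition becomes a single constant, i.e. one "pushf" instead of
    // "pushf a; pushf b; add".
    public static Node Fold(Node n)
    {
        AddNode add = n as AddNode;
        if (add == null)
            return n;

        Node left = Fold(add.Left);
        Node right = Fold(add.Right);

        ConstNode a = left as ConstNode;
        ConstNode b = right as ConstNode;
        if (a != null && b != null)
            return new ConstNode(a.Value + b.Value);   // emit: pushf (a+b)

        return new AddNode(left, right);
    }
}
[/source]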
Q: How many programmers does it take to write a nice piece of software? A: MORE.
Probably not: you're only saving a comparison, which is too high-level to handle anyway. Besides, you would need some measure of speed, so it would be best to leave this optimization to the VM (some way of compiling an often-repeated operation to faster-executing code).

I'd say the best thing to do at optimization time is to add a repeat-n-times primitive operation to the VM (so the for loop itself is executed in VM code instead of bytecode) and compile known-interval loops in terms of this operation.
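A rough sketch of what such a primitive could look like inside an interpreter; every name below is invented, and the opcode pair mirrors the srpt/erpt mnemonics used in the pseudo-assembly later in this thread.

[source language="c#"]
using System.Collections.Generic;

// Illustrative only: not the actual tinyvm's types.
class RepeatFrame
{
    public int Remaining;
    public int BodyStart;
    public RepeatFrame(int remaining, int bodyStart)
    {
        Remaining = remaining;
        BodyStart = bodyStart;
    }
}

class RepeatSupport
{
    public int ProgramCounter;                       // index of the next bytecode
    readonly Stack<RepeatFrame> repeats = new Stack<RepeatFrame>();

    // "srpt n": executed once when the loop is entered (assumes n >= 1).
    public void StartRepeat(int n)
    {
        repeats.Push(new RepeatFrame(n, ProgramCounter));
    }

    // "erpt": executed after each iteration; no comparison bytecode is
    // ever emitted inside the loop body.
    public void EndRepeat()
    {
        RepeatFrame frame = repeats.Peek();
        if (--frame.Remaining > 0)
            ProgramCounter = frame.BodyStart;        // branch back to the body
        else
            repeats.Pop();                           // finished, fall through
    }
}
[/source]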
Quote:Original post by ToohrVyk
Probably not: you're only saving a comparison, which is too high-level to handle anyway. Besides, you would need some measure of speed, so it would be best to leave this optimization to the VM (some way of compiling an often-repeated operation to faster-executing code).

I'd say the best thing to do at optimization time is to add a repeat-n-times primitive operation to the VM (so the for loop itself is executed in VM code instead of bytecode) and compile known-interval loops in terms of this operation.


Mmmh, interesting. The only problem with that is that you may turn the machine from a RISC one into a CISC one. But those are hardware considerations that are probably far from cobru's concerns (and from mine too, for sure =).

cobru: to execute the bytecode, your VM probably already executes a lot of branches (the big switch, for example) for every instruction, no? A branching instruction just adds a few more (with luck, just one). You could calculate the percentage of branches added; that would be a good starting measure.

In the end, the answer depends on what your goals for the VM are. Do you want to write a VM that would theoretically map well onto hardware, or a VM that is fast on a high-end Core 2 Duo?
Quote:Original post by Woodchuck
Mmmh, interesting. The only problem with that is that you may turn the machine from a RISC one into a CISC one. But those are hardware considerations that are probably far from cobru's concerns (and from mine too, for sure =).


Right, you're working on a VM, not a processor. Besides, it's always possible to further compile such instructions to assembly...

srpt [n]   ; for(i = 0; i < n; ++i)
           ; {
dwim       ;   stuff
erpt       ; }


to (approximately):

mov  ecx,[n] ; srpt [n]
loop:        ; dwim
sub  ecx,1   ;
cmp  ecx,0   ; erpt
jne  loop    ;


In my project, I am not going to unroll any loops.

If you have some free time, you can unroll trivial 1x loops, and probably 2x ones.
Thanks for the replies...

Well, the problem is a bit more 'complicated'. I've suspended the loop unrolling for now, until I finish work on the VM; then I will do some benchmarks with normal loops and with unrolled loops.

Basically, ToohrVyk is right: I'm only saving a comparison. But things get a bit complicated because the VM (and compiler) is written in C# (I needed speed of development for this one; it's not going to be used for a game, but as part of a business process), so the extra comparison can add some overhead.

For example, here's a simple loop in my language:

--- romanian programming language code ;) ---
var i : intreg;
var k : numeric;

pentru i dela 1 la 10
k := k + i;
--- end code ---

The 'for' is "pentru i dela 1 la 10", i.e. for(i=1;i<=10;i++).

here's the normal bytecode generated by the compiler:
.pushint 1
.stvar i
:l1
.ldvar i // (is an Integer)
.pushint 10
.isless_eq
.jmp_true l2
.pushint 10
.stvar i
.jmp l3
:l2
.ldvar k // (is a Numeric)
.ldvar i // (is an Integer)
.cast_to Numeric
.add
.stvar k
.ldvar i // (is an Integer)
.pushint 1
.add
.stvar i
.jmp l1
:l3



Basically, by unrolling the loop I save the ":l1 ldvar i; pushint 10; isless_eq; jmp_true" part and also the 'jmp l3'. The jumps are pretty simple, very little overhead (the VM just does ProgramCounter = new_offset), but for the comparison the VM does something like this:

[source language="c#"]
// ldvar i
StackVar tmp = new StackVar(Vars); stack.Push(tmp);

// pushint 10
StackVar tmp = new StackVar(10); stack.Push(tmp);

// isless_eq
StackVar b = stack.Pop();
StackVar a = stack.Pop();
StackVar res = new StackVar(compare_lesseq(a, b));
stack.Push(res);

// jmp_true
StackVar b = stack.Pop();
if (b.boolVal) ProgramCounter = arg1; // arg1 = the opcode argument
[/source]


The VM works with several kinds of primitives (integers, floats, strings, tables and multivalues; it's a domain-specific language anyway), and in order to keep the big switch() smaller I didn't make isless_eq_integer, isless_eq_numeric etc.; there is a single isless_eq instruction that does the comparison based on the types of the operands.

The compare_lesseq() function does something like:

switch (a.type)
{
    case integer:
        switch (b.type)
        {
            case integer: return (a.intvalue <= b.intvalue);
            // ... the other type combinations ...
        }
        break;
    // ... the other types of 'a' ...
}
So the 'extra compare' adds quite some overhead...
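For comparison, here is a minimal sketch of the specialized variant mentioned above (a dedicated isless_eq_integer handler, the option I decided against); it assumes StackVar exposes the intvalue field and the constructors used in the earlier snippet.

[source language="c#"]
using System.Collections.Generic;

static class SpecializedOps
{
    // isless_eq_integer: both operands are known to be integers, so there
    // is no nested type switch at run time. StackVar is assumed to match
    // the earlier snippet (intvalue field, bool-taking constructor).
    public static void IsLessEqInteger(Stack<StackVar> stack)
    {
        StackVar b = stack.Pop();
        StackVar a = stack.Pop();
        stack.Push(new StackVar(a.intvalue <= b.intvalue));
    }
}
[/source]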

But as I said, I'll do the benchmarks and decide afterwards (that seems like the most reasonable thing to do). Also, the 'size' of the bytecode is of little importance; the final goal of the language is to allow a consultant to write business rules in a simpler way than writing them in C# (which he doesn't know).
Q: How many programmers does it take to write a nice piece of software? A: MORE.
How about you make your compiler figure out whether it is smaller to unroll or not, and always take whichever is shortest? A bytecode instruction of course takes much longer to execute than an asm instruction, so the fewer bytecode instructions your VM has to execute, the better!

btw, it will only tend to be shorter to unroll when the loop contains very little and only executes a few times.

Also, you might want to see if you can further optimise when unrolling: for example, when generating each {code} section, see if the fact that (i=1), say, has any implications. e.g.:
for(i=0;i<4;i++) if (i!=1) {more_code} else {other_code}
In this case you can do the 'if' test at compile time if unrolling.
This may be quite tricky though, so only attempt if you are confident you can do it.
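A small sketch of that idea, using invented node types rather than cobru's actual AST; it assumes the unroller has already substituted the iteration's value of i, so the test has folded down to a constant.

[source language="c#"]
// Invented node types, for illustration only.
abstract class Node { }
class ConstBool : Node { public bool Value; }
class IfNode : Node { public Node Test, Then, Else; }

static class UnrollFolding
{
    // Called while emitting one unrolled iteration. If the test is already
    // a known constant (e.g. i != 1 with i substituted by 1), only the live
    // branch is emitted and the 'if' disappears from the bytecode.
    public static Node ResolveKnownIf(IfNode node)
    {
        ConstBool test = node.Test as ConstBool;
        if (test == null)
            return node;                             // value unknown: keep the if
        return test.Value ? node.Then : node.Else;   // drop the dead branch
    }
}
[/source]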
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Well iMalc, the 'figuring out' part was already in the compiler when I implemented loop unrolling. The language grammar defines the loop as:

"pentru" <integer_variable>
"dela" <integer_or_numeric_expression>
"la" <integer_or_numeric_expression>
[ "pas" <integer_constant> ]

It ain't exactly C or C#. I force the "step" to be a constant in the grammar, so you can write
"pentru i dela 1 la 10 pas 2" or "pentru i dela 10 la 1 pas -2", but you can't write "pentru i dela 1 la 10 pas k".

At compile time, the parser creates a 'for' node with children: integerVariable (the counter), begin (expression), end (expression) and step (integerConstant). If the 'begin' and 'end' nodes are constants, I can unroll the loop (and I can also decide to unroll only when (end-begin)/step is less than some threshold value; a sketch of that check is below). But if I do unroll the loop, I think the benefits start appearing for large loops, not small ones (for large loops you save more comparisons, and as I said, I don't care how big the bytecode is).
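A sketch of that decision; all names here are invented, since the post only describes the node's children, not their actual types.

[source language="c#"]
// Invented types, for illustration only.
abstract class Node { }
class ConstIntNode : Node { public int Value; }

class ForNode
{
    public string Counter;        // integerVariable
    public Node Begin, End;       // expressions
    public int Step;              // the grammar forces "pas" to be a constant
}

static class UnrollDecision
{
    const int Threshold = 8;      // arbitrary example value

    // Assumes Step != 0 (the grammar should reject a zero step anyway).
    public static bool ShouldUnroll(ForNode loop)
    {
        ConstIntNode begin = loop.Begin as ConstIntNode;
        ConstIntNode end = loop.End as ConstIntNode;
        if (begin == null || end == null)
            return false;                                 // bounds unknown at compile time

        // Works for negative steps too ("pas -2"), as long as the loop is well formed.
        int iterations = (end.Value - begin.Value) / loop.Step + 1;
        return iterations > 0 && iterations <= Threshold;
    }
}
[/source]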

As for the kind of optimization you suggested, I thought of it, but I would need to introduce tracking of variables in the compiler (basically, the compiler should try to execute the code it compiles, where possible), and I'm afraid I don't have the time for that kind of exotic optimization.

What I mean by 'exotic' optimizations:

k := 5; // mark k as having the value "5"
if (k < 0) { code } // check whether we know the value of k at compile time; we do, so we can do the test here and generate nothing for this instruction, because 5 < 0 evaluates to false

of course, something like this can't be optimized:
k := get_some_value(); // returns an external value, so k is 'unknown'
if (k < 12) // again, can't optimize, we don't know the value of k
k := 5; // we can't rely on k being 5 afterwards, because the assignment depends on the if above

but this is another story and quite exotic for what I need at the moment... maybe later, when I have time ;)
Q: How many programmers does it take to write a nice piece of software? A: MORE.
If speed is an issue, then you probably shouldn't be using a switch anyway. I would personally eliminate it and increase performance (and reduce memory usage) by doing the following: first, create a 'bytecode instruction' object. Every time a bytecode instruction is read from file, you create the corresponding bytecode object, and send it to an instruction factory, which returns an IInstruction that corresponds to the described operation. Make sure that each unique instruction is only instantiated once (otherwise, return the existing one) so that memory usage is reduced.

An IInstruction has an Execute operation, which can alter the virtual machine state on the fly. Each implementation provides its own action when executed. The execution loop then consists of extracting the current instruction and calling Execute on it. This does away with the switch, and makes it easy to register new instruction types (bind a type to a string in a hashtable in the factory).

The memory weight is exactly one reference per instruction, with identical instructions being shared (all 'Add' operations use the same memory).
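A minimal sketch of that layout; the type names (IInstruction, VmState, InstructionFactory) are invented, and StackVar is assumed to be the existing value type from the snippets above.

[source language="c#"]
using System.Collections.Generic;

// Placeholder for whatever state an instruction may alter.
class VmState
{
    public int ProgramCounter;
    public Stack<StackVar> Stack = new Stack<StackVar>();
}

interface IInstruction
{
    void Execute(VmState vm);
}

// One concrete instruction; integer-only here to keep the sketch short.
class AddInstruction : IInstruction
{
    public void Execute(VmState vm)
    {
        StackVar b = vm.Stack.Pop();
        StackVar a = vm.Stack.Pop();
        vm.Stack.Push(new StackVar(a.intvalue + b.intvalue));
    }
}

// Binds a mnemonic to a single shared instance, so every ".add" in the
// loaded bytecode refers to the same object.
class InstructionFactory
{
    readonly Dictionary<string, IInstruction> cache =
        new Dictionary<string, IInstruction> { { "add", new AddInstruction() } };

    public IInstruction Get(string mnemonic) { return cache[mnemonic]; }
}

// The execution loop then becomes roughly:
//   while (running) { program[vm.ProgramCounter++].Execute(vm); }
[/source]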

[necessary comment about how ridiculous the commands sound in Romanian]
I was planning to try both the switch and the approach you suggested, but my concern is the values on the stack: each value is an object, and most instructions pop some of them from the stack and push (new Value())... I'll try to optimize that part first.
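One possible direction for that (not something the thread settles on): make the stack value a value type, so pushing an intermediate result copies a small struct instead of allocating a new object. Everything below is an invented sketch.

[source language="c#"]
// Invented sketch: a struct instead of a class, so integer and numeric
// intermediates no longer allocate on the heap.
enum ValueKind { Integer, Numeric, String, Table, MultiValue }

struct StackValue
{
    public ValueKind Kind;
    public long IntValue;
    public double NumValue;
    public object Ref;          // strings, tables and multivalues still live on the heap

    public static StackValue FromInt(long v)
    {
        return new StackValue { Kind = ValueKind.Integer, IntValue = v };
    }
}
[/source]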
Q: How many programmers does it take to write a nice piece of software? A: MORE.
