• Create Account

### #ActualHodgman

Posted 29 December 2012 - 01:02 AM

Hodgman, are your states then inheritted from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions on something so low level on the engine.
Yeah they're pretty much implementations of a base-class interface, but they don't use virtual.

In theory, virtual shouldn't be that slow, so it's important to understand what makes it "slow" in practice.
When calling a regular function, the compiler can hard-code a "jump to this address" instruction -- a completely predictable branch, with the main performance pitfall that the jump may cause an i-cache miss, if you haven't called that function recently (i.e. if the function's code isn't already in the cache).
When calling a virtual function, it needs instructions to read the first word of the object (the vtable address), then add some hard-coded offset to that word to get the address of the vtable entry for the function we're after, then fetch the word at that address, and finally jump to the address in that word.
This has the same performance issue as above (an i-cache miss if the function isn't resident), but also the branch is harder to predict because the branch-target isn't known until right before you perform the jump (this matters more on PPC than x86). Further, we need to perform two additional reads from memory, which could each cause d-cache misses -- the first one (the vtable address fetch) is likely if you've not read a member from this object recently, and the second one (the function pointer fetch) is likely to miss if you've not called a virtual function on this type of object recently. If the function-call has to read members of the object, then the 1st one doesn't matter, as you've just moved the cache miss from inside the function to right before it! Cache misses are really slow (hundreds of cycles), so they're a much bigger deal than the few extra instructions or the branch misprediction.

So, the main reason that virtual functions are slow, is that you might cause a cache-miss when fetching a function pointer from a vtable. There's not much you can do about this, because the compiler implements vtables behind the scenes, leaving you with no control over where they live in RAM.

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same "vtable" -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but the following commands are less likely to cause the same cache-miss.
e.g.
namespace Commands
{
enum Type
{
Foo,
Bar0,
Bar1,
Bar2,
};
}

struct Command
{
u8 id;
};
struct Foo
{
u8 id;
int value;
};
struct Bar
{
u8 id;
int value;
};

void Submit_Foo(Device& device, Command& command)
{
assert( command.id == Commands::Foo );
Foo& foo = *(Foo*)&command;
device.DoFoo( foo.value );
}

void Submit_Bar(Device& device, Command& command)
{
assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );
Bar& bar = *(Bar*)&command;
device.SetBarSlot( bar.id - Commands::Bar0, bar.value );
}

typedef void(*PFnSubmit)(Device&, Command&);
static PFnSubmit g_CommandTable[] =
{
&Submit_Foo,//Commands::Foo
&Submit_Bar,//Commands::Bar0
&Submit_Bar,//Commands::Bar1
&Submit_Bar,//Commands::Bar2
};

inline void SubmitCommand(Device& device, Command& command)
{
g_CommandTable[command.id](device, command);
}

//then, if you're really fancy, you can cut down on cache misses by pre-fetching or unrolling when executing a batch of commands
void SubmitCommands(Device& device, CommandIterator commands)
{//n.b. pseudo code
prefetch g_CommandTable
foreach command in commands
prefetch next command.id
SubmitCommand( device, command )
}

### #2Hodgman

Posted 29 December 2012 - 01:01 AM

Hodgman, are your states then inheritted from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions on something so low level on the engine.
Yeah they're pretty much implementations of a base-class interface, but they don't use virtual.

In theory, virtual shouldn't be that slow, so it's important to understand what makes it "slow" in practice.
When calling a regular function, the compiler can hard-code a "jump to this address" instruction -- a completely predictable branch, with the main performance pitfall that the jump may cause an i-cache miss, if you haven't called that function recently (i.e. if the function's code isn't already in the cache).
When calling a virtual function, it needs instructions to read the first word of the object (the vtable address), then add some hard-coded offset to that word to get the address of the vtable entry for the function we're after, then fetch the word at that address, and finally jump to the address in that word.
This has the same performance issue as above (an i-cache miss if the function isn't resident), but also the branch is harder to predict because the branch-target isn't known until right before you perform the jump (this matters more on PPC than x86). Further, we need to perform two additional reads from memory, which could each cause d-cache misses -- the first one (the vtable address fetch) is likely if you've not read a member from this object recently, and the second one (the function pointer fetch) is likely to miss if you've not called a virtual function on this type of object recently. If the function-call has to read members of the object, then the 1st one doesn't matter, as you've just moved the cache miss from inside the function to right before it! Cache misses are really slow (hundreds of cycles), so they're a much bigger deal than the few extra instructions or the branch misprediction.

So, the main reason that virtual functions are slow, is that you might cause a cache-miss when fetching a function pointer from a vtable. There's not much you can do about this, because the compiler implements vtables behind the scenes, leaving you with no control over where they live in RAM.

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same "vtable" -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but the following commands are less likely to cause the same cache-miss.
e.g.
namespace Commands
{
enum Type
{
Foo,
Bar0,
Bar1,
Bar2,
};
}

struct Command
{
u8 id;
};
struct Foo
{
u8 id;
int value;
};
struct Bar
{
u8 id;
int value;
};

void Submit_Foo(Device& device, Command& command)
{
assert( command.id == Commands::Foo );
Foo& foo = *(Foo*)&command;
device.DoFoo( foo.value );
}

void Submit_Bar(Device& device, Command& command)
{
assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );
Bar& bar = *(Bar*)&command;
device.SetBarSlot( bar.id - Commands::Bar0, bar.value );
}

typedef void(*PFnSubmit)(Device&, Command&);
static PFnSubmit g_CommandTable[] =
{
&Submit_Foo,//Commands::Foo
&Submit_Bar,//Commands::Bar0
&Submit_Bar,//Commands::Bar1
&Submit_Bar,//Commands::Bar2
};

inline void SubmitCommand(Device& device, Command& command)
{
g_CommandTable[command.id](device, command);
}

//then, if you're really fancy, you can cut down on cache misses by pre-fetching or unrolling when executing a batch of commands
void SubmitCommands(Device& device, CommandIterator commands)
{
prefetch g_CommandTable
foreach command in commands
prefetch next command.id
SubmitCommand( device, command )
}

### #1Hodgman

Posted 29 December 2012 - 12:51 AM

Hodgman, are your states then inheritted from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions on something so low level on the engine.

Yeah they're pretty much implementations of a base-class interface, but they don't use virtual.

In theory, virtual shouldn't be that slow, so it's important to understand what makes it "slow" in practice.
When calling a regular function, the compiler can hard-code a "jump to this address" instruction -- a completely predictable branch, with the main performance pitfall that the jump may cause an i-cache miss, if you haven't called that function recently (i.e. if the function's code isn't already in the cache).
When calling a virtual function, it needs instructions to read the first word of the object (the vtable), then add some hard-coded offset to that value to get the address of the vtable entry for the function we're after, then fetch the word at that address, and jump to the address in that word.
This has the same performance issue as above (an i-cache miss if the function isn't resident), but also the branch is harder to predict because the branch-target isn't known until right before you perform the jump (this matters more on PPC than x86). Further, we need to perform two additional reads from memory, which could cause d-cache misses -- the first one (the vtable) is likely to miss if you've not called a virtual function on this type of object recently, and the second one is likely if you've not read a member from this object recently. If the function-call has to read members of the object, then the 2nd one doesn't matter, as you've just moved the cache miss from inside the function to right before it. Cache misses are really slow (hundreds of cycles), so they're a much bigger deal that the few extra instructions.

So, the main reason that virtual functions are slow, is that you might cause a cache-miss when fetching a vtable. There's not much you can do about this, because the compiler implements vtables behind the scenes, leaving you with no control over where they live in RAM.

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same vtable -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but it will hopefully still be in the cache when the rest of the commands are executed immediately afterwards.

e.g.

namespace Commands
{
enum Type
{
Foo,
Bar0,
Bar1,
Bar2,
};
}

struct Command
{
u8 id;
};
struct Foo
{
u8 id;
int value;
};
struct Bar
{
u8 id;
int value;
};

void Submit_Foo(Device& device, Command& command)
{
assert( command.id == Commands::Foo );
Foo& foo = *(Foo*)&command;
device.DoFoo( foo.value );
}

void Submit_Bar(Device& device, Command& command)
{
assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );
Bar& bar = *(Bar*)&command;
device.SetBarSlot( bar.id - Commands::Bar0, bar.value );
}

typedef void(*PFnSubmit)(Device&, Command&);
PFnSubmit g_CommandTable[] =
{
&Submit_Foo,//Commands::Foo
&Submit_Bar,//Commands::Bar0
&Submit_Bar,//Commands::Bar1
&Submit_Bar,//Commands::Bar2
};

void SubmitCommand(Device& device, Command& command)
{
g_CommandTable[command.id](device, command);
}

PARTNERS