How to use small datatypes efficiently

28 comments, last by Infinisearch 8 years, 7 months ago

Some systems crash, yes. On others you incur an invisible performance penalty.

But, and I hate to drag this line out: unless you have profiled a real-world application and identified this as a genuine bottleneck, such optimisation is absurd.

I'm just being meticulous, as I always am; it's more for fun than utility. ^-^



Once that 64 byte buffer is loaded, anything you do within that 64 byte buffer is very nearly free. Thanks to the magic of the Out Of Order core and processor caches, doing one operation on one item in that block is very nearly the same clock time as doing one operation on sixteen items in that block.

Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

"I would try to find halo source code by bungie best fps engine ever created, u see why call of duty loses speed due to its detail." -- GettingNifty



For individual objects defined on the stack, no, probably not. For objects stored in arrays/vectors/etc., quite possibly. The vast majority of the time on modern CPUs, the biggest computational bottleneck will be fetching data from memory. Designing objects so they fit nicely within cache line boundaries can yield significant performance increases.
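
To make that concrete, here is a minimal C++ sketch (not from the thread; the Particle fields and the kCacheLine constant are invented for illustration) of sizing and aligning an element so each one occupies exactly one cache line:

#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes on most current x86/ARM parts.
constexpr std::size_t kCacheLine = 64;

// Aligning (and thereby padding) the element to the cache-line size means
// an element never straddles two lines, so touching it costs one line fill.
struct alignas(kCacheLine) Particle
{
    float         x, y, z;    // 12 bytes
    float         vx, vy, vz; // 12 bytes
    std::uint16_t flags;      // 2 bytes; the rest of the 64 bytes is padding
};

static_assert(sizeof(Particle) == kCacheLine, "one element per cache line");

The flip side is that the padding wastes space, so this only pays off when whole elements are accessed together; when they are not, splitting the data into separate arrays (discussed below) is usually the better tool.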

Correct. It is about locality of data for cache use.

If you can design your system so you are constantly reading a series of sequential memory addresses -- basically operating on densely packed linear arrays or a structure of arrays -- the CPU will see this pattern and prefetch all your data, giving a great processing experience.

If you are reading objects one at a time scattered across memory, or striding through memory in large steps as with an array of big structures, this can cause a huge number of cache misses and stalls as data is loaded from hither and yon.
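
As a rough illustration (a hedged sketch; the entity fields are invented for the example), compare updating positions through an array of structures versus a structure of arrays:

// Array of structures: updating positions also drags name/AI bytes into the
// cache, wasting most of each 64-byte line.
struct EntityAoS
{
    float x, y, z;
    float vx, vy, vz;
    char  name[32];
    int   aiState;
};

void UpdateAoS(EntityAoS* e, int count, float dt)
{
    for (int i = 0; i < count; ++i)
    {
        e[i].x += e[i].vx * dt;   // touches a whole EntityAoS per element
        e[i].y += e[i].vy * dt;
        e[i].z += e[i].vz * dt;
    }
}

// Structure of arrays: the same update streams through tightly packed,
// sequential floats, which the hardware prefetcher handles very well.
struct EntitiesSoA
{
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
    // name, aiState, etc. live in their own arrays, untouched here
};

void UpdateSoA(EntitiesSoA& e, int count, float dt)
{
    for (int i = 0; i < count; ++i)
    {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}

In the second version the loop only ever pulls in the position and velocity arrays, so nearly every byte of every fetched cache line is useful -- exactly the access pattern the prefetcher rewards.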


Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

Memory layout is important to a very large degree. In truth, most developers today should be far more concerned about how much cache activity their program causes than about how many CPU cycles it burns. You can, of course, swamp the CPU cycles such that you have effectively infinite memory bandwidth, or swamp the memory such that you have effectively infinite CPU cycles. Still, most people today approach optimization naively from the standpoint of minimizing CPU work first and then fitting memory accesses to its needs, when they really should be doing the opposite -- start with dataflow, and fit CPU usage to it -- that's how you get the best net performance (assuming that in doing so you don't end up with something algorithmically untenable).

It's not strictly about packing variables more tightly, though that's part of it -- what you really want to think is something like "I'm working on several entities (say, elements in an array) in a loop; how can I get more loop iterations done before I need to load the next cache line?" Sometimes using smaller variables or packing bits is (part of) the solution; other times you might split monolithic entities into two or more (say, instead of an array of complex structures, you have several arrays of simpler datatypes taken from the original structure) so that you are only iterating over the data that you need *right now* in this very loop.
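
As a back-of-the-envelope example of the "more iterations per cache line" point (a sketch, with made-up function names): on a machine with 64-byte lines, an int array delivers 16 elements per line fill, while an unsigned char array delivers 64.

#include <cstdint>

// With 64-byte cache lines (typical, not guaranteed):
//   64 / sizeof(int32_t) = 16 elements per line fill
//   64 / sizeof(uint8_t) = 64 elements per line fill
// If the values genuinely fit in a byte, the narrow loop needs a quarter of
// the memory traffic for the same element count.
int SumWide(const std::int32_t* v, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += v[i];
    return sum;
}

int SumNarrow(const std::uint8_t* v, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += v[i];
    return sum;
}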

There's a branch of software design dedicated to this approach, called Data-Oriented Design (DOD) (Google "Mike Acton DOD" for a good start), which has been growing in mainstream popularity/acceptance in performance-minded crowds after being rebooted by game developers (it was first popularized on vector machines like Crays, then adopted quite aggressively by game developers during the Xbox 360/PS3 generation, though people have been doing it here and there all along).

By the way, a very good lesson to learn from this thread as a whole is that modern processors don't do what you think they do -- what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it, let alone from C code or other high-level languages. These days, what you and I know as x64, ARM, or PowerPC architecture is, for the most part, an abstraction solely for our benefit -- not the machine's. There was a time when the visible architecture literally was the machine, but that's not the case any more. Chip designers spend far more transistors, easily an order of magnitude more, implementing caches, micro-ops, register renaming, and other fancy tricks that help hide memory latency than they do implementing anything you would directly recognize as the architecture of the machine you think you're programming.

Today, only in the world of microcontrollers and DSPs do you see silicon that actually resembles the architecture it purports to be. Even many of those on the higher-end are becoming less immediately recognizable.

A string of C or assembly code doesn't tell you anything about how a modern CPU does the work you ask it to do -- it only guarantees logical equivalency and makes for a familiar design language that won't turn your brain to mush. You can't know anything about how the machine *actually* works without diving into details of the true, underlying architecture.


I did a test; the results show that they are nearly identical in efficiency, in both 32-bit and 64-bit builds. Here is my test code:

#include <iostream>   // cout
#include <cstdlib>    // rand, srand
#include <ctime>      // time
#include <windows.h>  // QueryPerformanceCounter / QueryPerformanceFrequency
using namespace std;

union UValue
{
    void SetVal2(int val);
    int GetVal2(void);
    struct
    {
        char m_nVal1;
        char m_nVal3;
        short m_nVal2;
    };
    int m_nVal;
} uValue;

void UValue::SetVal2(int val)
{
    // keep the low 16 bits, replace the high 16 bits
    m_nVal = (m_nVal & 65535) | (val << 16);
}

int UValue::GetVal2(void)
{
    return m_nVal >> 16;
}

#define nCount (1024 * 1024 * 128)

UValue arUVal[nCount];
unsigned int nSum;

int main(void)
{
    cout << sizeof(UValue) << endl;
    int n = arUVal[0].GetVal2();   // touch the array once; otherwise unused
    LARGE_INTEGER frequency;       // ticks per second
    LARGE_INTEGER t1, t2;          // ticks
    double elapsedTime;
    srand((unsigned int)time(NULL));
    QueryPerformanceFrequency(&frequency);
    //----------------------------------------------------------------------------------------------------
    // fill, increment, and sum m_nVal2 through the member directly
    QueryPerformanceCounter(&t1);
    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = rand();
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = arUVal[i].m_nVal2 + 1;
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].m_nVal2;
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    cout << "sum of direct access: " << nSum << " milliseconds used: " << elapsedTime << endl;
    //----------------------------------------------------------------------------------------------------
    // the same work, but going through the mask-and-shift accessors
    QueryPerformanceCounter(&t1);
    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].SetVal2(rand());
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].SetVal2(arUVal[i].GetVal2() + 1);
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].GetVal2();
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    cout << "sum of function access: " << nSum << " milliseconds used: " << elapsedTime << endl;
}
And the order of these two code blocks matters: whichever comes later is slightly faster. The result is the same even if I change the union to this:
#pragma pack(1)
union UValue
{
    void SetVal2(int val);
    int GetVal2(void);
    struct
    {
        char m_nVal1;
        short m_nVal2;
        char m_nVal3;
    };
    int m_nVal;
} uValue;
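
If you want to confirm what the pragma actually did to the layout, something like this (a small sketch in the same style as the sizeof print above; offsetof on the anonymous-struct members relies on the same compiler extension the union itself already uses) shows the member offsets directly:

#include <cstddef>
#include <iostream>

// Prints 4 for the size in both versions shown; the offset of m_nVal2 is 2
// in the first union and 1 with #pragma pack(1), i.e. the short becomes
// unaligned.
void PrintLayout()
{
    std::cout << "sizeof(UValue): " << sizeof(UValue) << std::endl;
    std::cout << "offsetof m_nVal2: " << offsetof(UValue, m_nVal2) << std::endl;
}
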
Look at the assembly being produced by your compiler. I would assume that compilers are smart enough to emit the equivalent of your set/get mask-and-shift routines where appropriate, and also to emit load-small-datatype instructions where appropriate.

If you're writing low-level code like this, you need to have an intuition of what kind of asm the compiler will be producing from your code.
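
For instance (a sketch, not code from the thread), a bit-field version of the earlier union hands the packing to the compiler, and the disassembly of the field accesses is where you'd check that it emits the same kind of mask-and-shift sequence as the hand-written SetVal2/GetVal2:

// Same 32-bit budget as the union above, but the compiler generates the
// masking/shifting for each field access itself. (Exact bit placement is
// implementation-defined, so don't rely on it matching m_nVal exactly.)
struct PackedValue
{
    unsigned m_nVal1 : 8;
    unsigned m_nVal3 : 8;
    unsigned m_nVal2 : 16;
};

// Typical use -- each access compiles down to load/AND/OR/shift as needed:
//   PackedValue p = {};
//   p.m_nVal2 = 1234;        // read-modify-write with a mask and a shift
//   int v = p.m_nVal2;       // load plus shift (and mask)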

what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it

QFT. It can be pretty important to know which asm operations are actually microcoded as super-expensive algorithms (e.g. on PowerPC, the asm for "1<<x" looks super simple, but it's actually a microcoded loop equivalent to "result=1; for(i=0..x) result<<=1;"), or whether asm registers actually exist or are just mapped onto the stack (e.g. what looks like cheap asm code for moving a value to a register might actually be a routine that calculates a memory address and then stores a value there).

In fact, the assembly generated by the direct access is slightly simpler; it's better to let the compiler do this kind of optimization.

I thought I would tip you off to these excellent videos that explain what you are asking about. You will see why the data is probably already in the processor by the time the CPU executes the instructions needing those registers.

