How to use small datatypes efficiently

28 comments, last by Infinisearch 8 years, 7 months ago

Some systems crash, yes. On others you incur an invisible performance penalty.

But, and I hate to drag this line out: unless you have profiled a real-world application and identified this as a genuine bottleneck, such optimisation is absurd.

I'm just being meticulous, as I always am; it's more for fun than utility. ^-^



Once that 64 byte buffer is loaded, anything you do within that 64 byte buffer is very nearly free. Thanks to the magic of the Out Of Order core and processor caches, doing one operation on one item in that block is very nearly the same clock time as doing one operation on sixteen items in that block.

Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

"I would try to find halo source code by bungie best fps engine ever created, u see why call of duty loses speed due to its detail." -- GettingNifty



For individual objects defined on the stack, no, probably not. For objects stored in arrays/vectors/etc., quite possibly. The vast majority of the time on modern CPUs, the biggest computational bottleneck will be fetching data from memory. Designing objects so they fit nicely within cache line boundaries can yield significant performance increases.
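
To make that concrete, here is a minimal C++ sketch (not from the thread; the Particle fields and the kCacheLine constant are invented for illustration) of sizing and aligning an element so each one occupies exactly one cache line:

#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes on most current x86/ARM parts.
constexpr std::size_t kCacheLine = 64;

// Aligning (and thereby padding) the element to the cache-line size means
// an element never straddles two lines, so touching it costs one line fill.
struct alignas(kCacheLine) Particle
{
    float         x, y, z;    // 12 bytes
    float         vx, vy, vz; // 12 bytes
    std::uint16_t flags;      // 2 bytes; the rest of the 64 bytes is padding
};

static_assert(sizeof(Particle) == kCacheLine, "one element per cache line");

The flip side is that the padding wastes space, so this only pays off when whole elements are accessed together; when they are not, splitting the data into separate arrays (discussed below) is usually the better tool.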

Correct. It is about locality of data for cache use.

If you can design your system so you are constantly reading a series of sequential memory addresses -- basically operating on densely packed linear arrays or a structure of arrays -- the CPU will see this pattern and prefetch all your data, giving a great processing experience.

If you are reading objects one at a time scattered across memory, or striding through memory in large steps as with an array of big structures, this can cause a huge number of cache misses and stalls as data is loaded from hither and yon.
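
As a rough illustration (a hedged sketch; the entity fields are invented for the example), compare updating positions through an array of structures versus a structure of arrays:

// Array of structures: updating positions also drags name/AI bytes into the
// cache, wasting most of each 64-byte line.
struct EntityAoS
{
    float x, y, z;
    float vx, vy, vz;
    char  name[32];
    int   aiState;
};

void UpdateAoS(EntityAoS* e, int count, float dt)
{
    for (int i = 0; i < count; ++i)
    {
        e[i].x += e[i].vx * dt;   // touches a whole EntityAoS per element
        e[i].y += e[i].vy * dt;
        e[i].z += e[i].vz * dt;
    }
}

// Structure of arrays: the same update streams through tightly packed,
// sequential floats, which the hardware prefetcher handles very well.
struct EntitiesSoA
{
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
    // name, aiState, etc. live in their own arrays, untouched here
};

void UpdateSoA(EntitiesSoA& e, int count, float dt)
{
    for (int i = 0; i < count; ++i)
    {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}

In the second version the loop only ever pulls in the position and velocity arrays, so nearly every byte of every fetched cache line is useful -- exactly the access pattern the prefetcher rewards.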


Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

Memory layout is important to a very large degree. In truth, most developers today should be far more concerned about how much cache activity their program causes than about how many CPU cycles it burns. You can, of course, swamp the CPU cycles such that you have effectively infinite memory bandwidth, or swamp the memory such that you have effectively infinite CPU cycles. Still, most people today approach optimization naively from the standpoint of minimizing CPU work first and then fitting memory accesses to its needs, when they really should be doing the opposite -- start with dataflow, and fit CPU usage to it -- that's how you get the best net performance (assuming that in doing so you don't end up with something algorithmically untenable).

It's not strictly about packing variables more tightly, though that's part of it -- what you really want to think is something like "I'm working on several entities (say, elements in an array) in a loop; how can I get more loop iterations done before I need to load the next cache line?" Sometimes using smaller variables or packing bits is (part of) the solution; other times you might split monolithic entities into two or more (say, instead of an array of complex structures, you have several arrays of simpler datatypes taken from the original structure) so that you are only iterating over the data that you need *right now* in this very loop.
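
As a back-of-the-envelope example of the "more iterations per cache line" point (a sketch, with made-up function names): on a machine with 64-byte lines, an int array delivers 16 elements per line fill, while an unsigned char array delivers 64.

#include <cstdint>

// With 64-byte cache lines (typical, not guaranteed):
//   64 / sizeof(int32_t) = 16 elements per line fill
//   64 / sizeof(uint8_t) = 64 elements per line fill
// If the values genuinely fit in a byte, the narrow loop needs a quarter of
// the memory traffic for the same element count.
int SumWide(const std::int32_t* v, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += v[i];
    return sum;
}

int SumNarrow(const std::uint8_t* v, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += v[i];
    return sum;
}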

There's a branch of software design dedicated to this approach, called Data-Oriented Design (DOD) (Google "Mike Acton DOD" for a good start), which has been growing in mainstream popularity/acceptance in performance-minded crowds after being rebooted by game developers (it was first popularized on vector machines like Crays, then adopted quite aggressively by game developers during the Xbox 360/PS3 generation, though people have been doing it here and there all along).

By the way, a very good lesson to learn from this thread as a whole is that modern processors don't do what you think they do -- what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it, let alone from C code or other high-level languages. These days, what you and I know as x64, ARM, or PowerPC architecture is, for the most part, an abstraction solely for our benefit -- not the machine's. There was a time when the visible architecture literally was the machine, but that's not the case any more. Chip designers spend far more transistors, easily an order of magnitude more, implementing caches, micro-ops, register renaming, and other fancy tricks that help hide memory latency than they do implementing anything you would directly recognize as the architecture of the machine you think you're programming.

Today, only in the world of microcontrollers and DSPs do you see silicon that actually resembles the architecture it purports to be. Even many of those on the higher-end are becoming less immediately recognizable.

A string of C or assembly code doesn't tell you anything about how a modern CPU does the work you ask it to do -- it only guarantees logical equivalency and makes for a familiar design language that won't turn your brain to mush. You can't know anything about how the machine *actually* works without diving into details of the true, underlying architecture.


I did a test; the results show that they are nearly identical in efficiency, in both 32-bit and 64-bit builds. Here is my test code:

#include <iostream>   // cout
#include <cstdlib>    // rand, srand
#include <ctime>      // time
#include <windows.h>  // QueryPerformanceCounter / QueryPerformanceFrequency
using namespace std;

union UValue
{
    void SetVal2(int val);
    int GetVal2(void);
    struct
    {
        char m_nVal1;
        char m_nVal3;
        short m_nVal2;
    };
    int m_nVal;
} uValue;

void UValue::SetVal2(int val)
{
    // keep the low 16 bits, replace the high 16 bits
    m_nVal = (m_nVal & 65535) | (val << 16);
}

int UValue::GetVal2(void)
{
    return m_nVal >> 16;
}

#define nCount (1024 * 1024 * 128)

UValue arUVal[nCount];
unsigned int nSum;

int main(void)
{
    cout << sizeof(UValue) << endl;
    int n = arUVal[0].GetVal2();   // touch the array once; otherwise unused
    LARGE_INTEGER frequency;       // ticks per second
    LARGE_INTEGER t1, t2;          // ticks
    double elapsedTime;
    srand((unsigned int)time(NULL));
    QueryPerformanceFrequency(&frequency);
    //----------------------------------------------------------------------------------------------------
    // fill, increment, and sum m_nVal2 through the member directly
    QueryPerformanceCounter(&t1);
    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = rand();
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = arUVal[i].m_nVal2 + 1;
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].m_nVal2;
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    cout << "sum of direct access: " << nSum << " milliseconds used: " << elapsedTime << endl;
    //----------------------------------------------------------------------------------------------------
    // the same work, but going through the mask-and-shift accessors
    QueryPerformanceCounter(&t1);
    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].SetVal2(rand());
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].SetVal2(arUVal[i].GetVal2() + 1);
    }
    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].GetVal2();
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    cout << "sum of function access: " << nSum << " milliseconds used: " << elapsedTime << endl;
}
And the order of these two code blocks matters: whichever comes later is slightly faster. The result is the same even if I change the union to this:
#pragma pack(1)
union UValue
{
    void SetVal2(int val);
    int GetVal2(void);
    struct
    {
        char m_nVal1;
        short m_nVal2;
        char m_nVal3;
    };
    int m_nVal;
} uValue;
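
If you want to confirm what the pragma actually did to the layout, something like this (a small sketch in the same style as the sizeof print above; offsetof on the anonymous-struct members relies on the same compiler extension the union itself already uses) shows the member offsets directly:

#include <cstddef>
#include <iostream>

// Prints 4 for the size in both versions shown; the offset of m_nVal2 is 2
// in the first union and 1 with #pragma pack(1), i.e. the short becomes
// unaligned.
void PrintLayout()
{
    std::cout << "sizeof(UValue): " << sizeof(UValue) << std::endl;
    std::cout << "offsetof m_nVal2: " << offsetof(UValue, m_nVal2) << std::endl;
}
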
Look at the assembly being produced by your compiler. I would assume that compilers are smart enough to emit the equivalent of your set/get mask-and-shift routines where appropriate, and also to emit load-small-datatype instructions where appropriate.

If you're writing low-level code like this, you need to have an intuition of what kind of asm the compiler will be producing from your code.
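
For instance (a sketch, not code from the thread), a bit-field version of the earlier union hands the packing to the compiler, and the disassembly of the field accesses is where you'd check that it emits the same kind of mask-and-shift sequence as the hand-written SetVal2/GetVal2:

// Same 32-bit budget as the union above, but the compiler generates the
// masking/shifting for each field access itself. (Exact bit placement is
// implementation-defined, so don't rely on it matching m_nVal exactly.)
struct PackedValue
{
    unsigned m_nVal1 : 8;
    unsigned m_nVal3 : 8;
    unsigned m_nVal2 : 16;
};

// Typical use -- each access compiles down to load/AND/OR/shift as needed:
//   PackedValue p = {};
//   p.m_nVal2 = 1234;        // read-modify-write with a mask and a shift
//   int v = p.m_nVal2;       // load plus shift (and mask)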

what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it

QFT. It can be pretty important to know which asm operations are actually microcoded as super-expensive algorithms (e.g. on PowerPC, the asm for "1<<x" looks super simple, but it's actually a microcoded loop equivalent to "result=1; for(i=0..x) result<<=1;"), or whether asm registers actually exist or are just mapped onto the stack (e.g. what looks like cheap asm code for moving a value to a register might actually be a routine that calculates a memory address and then stores a value there).

In fact, the assembly generated by the direct access is slightly simpler; it's better to let the compiler do this kind of optimization.

I thought I would tip you off to these excellent videos that explain what you are asking about. You will see why the data is probably already in the processor by the time the CPU executes the instructions needing those registers.

