
How to use small datatypes efficiently


I used to use small data types (short, char) to save memory, but these types may need two bus transactions for some operations. Take this union, for example:

 

union UValue
{
    struct
    {
         short    m_nVal1;
         short    m_nVal2;
    };
 
    int   m_nVal;
} uValue;
 
Two bus transactions are needed to read m_nVal2 into the CPU (though I am not sure whether this is the case for m_nVal1, since it is four-byte aligned). Will this be extremely slow? Is it wise to operate on m_nVal2 with some special tactic? Say, to set uValue.m_nVal2 to 77, do it like this:
 
 uValue.m_nVal = (uValue.m_nVal & 65535)  | (77 << 16);
 
Will this be much faster than uValue.m_nVal2 = 77? I am sure we all like the feeling that we are coding in the best way.
 
It can be encapsulated in a member function
 
void UValue::SetVal2(int val)
{
    m_nVal = (m_nVal & 65535) | (val << 16);
}


Probably not.

Getting data from memory into cache will be the bottleneck. How you access the members once they are on the CPU is unlikely to make any difference to a real-world application.

The problem lies exactly in getting data from memory into cache. Because m_nVal2 is not aligned, it will need two bus transactions to complete, so my question is whether reading m_nVal instead would be much faster.



The problem lies exactly in getting data from memory into cache. Because m_nVal2 is not aligned, it will need two bus transactions to complete, so my question is whether reading m_nVal instead would be much faster.

No, not at all.

 

Also m_nVal2 is properly aligned. Unions don't break alignment requirements.  They are somewhat dangerous and platform specific, so they generally shouldn't be used unless you know exactly what you are doing and why you are doing it.  
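In case it helps to see that spelled out, here is a quick sanity check of the layout (a minimal sketch, assuming a typical platform where short is 2 bytes and int is 4, and relying on the same anonymous-struct extension the union above already uses):

#include <cstddef>   // offsetof

union UValue
{
    struct
    {
        short m_nVal1;   // bytes 0-1
        short m_nVal2;   // bytes 2-3
    };
    int m_nVal;          // bytes 0-3
};

static_assert(sizeof(UValue) == 4, "no padding is inserted");
static_assert(alignof(UValue) == alignof(int), "the union is aligned like its int member");
static_assert(offsetof(UValue, m_nVal2) == 2, "m_nVal2 is naturally aligned for a short");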

 

It doesn't make much sense to do it in this case.


 


Excellent explanation, frob. I did ignore caches. But two bus transactions may still be possible, if less likely. Here is a snippet from this link: https://msdn.microsoft.com/en-us/library/ee418650.aspx
 
Reads and writes of types that are not naturally aligned—for instance, writing DWORDs that cross four-byte boundaries—are not guaranteed to be atomic. The CPU may have to do these reads and writes as multiple bus transactions, which could allow another thread to modify or see the data in the middle of the read or write.
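For what it's worth, here is a minimal sketch of the kind of layout that note is warning about (the struct and field names are made up for illustration); with #pragma pack(1) the 32-bit member is no longer naturally aligned, so a single instance can straddle a four-byte boundary:

#include <cstddef>   // offsetof
#include <cstdint>

#pragma pack(push, 1)
struct PackedRecord
{
    std::uint8_t  tag;     // offset 0
    std::uint32_t value;   // offset 1: not naturally aligned
};
#pragma pack(pop)

static_assert(offsetof(PackedRecord, value) == 1, "value starts at byte 1");
// A write to 'value' can cross a four-byte boundary, so (per the note above)
// it is not guaranteed to be atomic and may take more than one bus transaction.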
 

 


There would be no reason for the compiler to break boundaries for that specific union.  There are other unions you can build that will break boundaries, but that one -- two shorts unioned with an int -- will not break any boundaries.


Yes, this quote doesn't match my example: it talks about crossing four-byte boundaries, while my example is merely not four-byte aligned. Sorry for that.

By the way, it seems that operating on unaligned data on mobile platforms can crash the application.

Some systems crash, yes. On others you incur an invisible performance penalty.
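To make that concrete, here is a minimal sketch (the names are mine) of the kind of unaligned access that faults on some ARM/MIPS cores and merely runs slower on x86, together with the memcpy form that is safe everywhere:

#include <cstddef>
#include <cstdint>
#include <cstring>

std::uint32_t read_u32_at(const unsigned char* buffer, std::size_t byte_offset)
{
    // *reinterpret_cast<const std::uint32_t*>(buffer + byte_offset) is undefined
    // behaviour when the address isn't 4-byte aligned; some mobile CPUs raise an
    // alignment fault, others silently pay a penalty.
    std::uint32_t value;
    std::memcpy(&value, buffer + byte_offset, sizeof(value));  // compiler emits safe loads
    return value;
}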

But, I hate to drag this line out: unless you have profiled a real-world application and identified this as a genuine bottleneck, such optimisation is absurd.


Some systems crash, yes. On others you incur an invisible performance penalty.

But, I hate to drag this line out: unless you have profiled a real-world application and identified this as a genuine bottleneck, such optimisation is absurd.

I am as meticulous as I always am; it's more for fun than utility. ^-^



Once that 64 byte buffer is loaded, anything you do within that 64 byte buffer is very nearly free.  Thanks to the magic of the Out Of Order core and processor caches, doing one operation on one item in that block is very nearly the same clock time as doing one operation on sixteen items in that block.

Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?


 


Once that 64 byte buffer is loaded, anything you do within that 64 byte buffer is very nearly free.  Thanks to the magic of the Out Of Order core and processor caches, doing one operation on one item in that block is very nearly the same clock time as doing one operation on sixteen items in that block.

Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

 

 

For individual objects defined on the stack, no, probably not. For objects stored in arrays/vectors/etc., quite possibly. The vast majority of the time on modern CPUs, the biggest computational bottleneck will be fetching data from memory. Designing objects so they fit nicely on cache line boundaries can yield significant performance increases.
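As a minimal sketch of that last point (the type and its fields are illustrative, assuming the usual 64-byte cache line):

struct alignas(64) Particle          // hypothetical example type
{
    float px, py, pz;                // position
    float vx, vy, vz;                // velocity
    float mass;
    // alignas(64) rounds sizeof(Particle) up to 64, so in an array every
    // element owns exactly one cache line and none straddles two.
};

static_assert(sizeof(Particle) == 64 && alignof(Particle) == 64,
              "one Particle per cache line");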


Correct. It is about locality of data for cache use.

 

If you can design your system so you are constantly reading a series of sequential memory addresses -- basically operating on densely packed linear arrays or a structure of arrays -- the CPU will see this pattern and prefetch all your data, giving a great processing experience.

 

If you are reading objects one at a time scattered across memory, or reading along distant steps like an array of structures, this can cause a huge number of cache misses and stalls as data is loaded from hither and yon. 
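A minimal sketch of the two access patterns being contrasted (the function names are mine):

#include <numeric>
#include <vector>

// Prefetch-friendly: one dense array, read front to back.
long long sum_dense(const std::vector<int>& values)
{
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Cache-hostile: the same ints, but reached through pointers scattered across the heap.
long long sum_scattered(const std::vector<const int*>& pointers)
{
    long long total = 0;
    for (const int* p : pointers)
        total += *p;   // each dereference can be a cache miss the prefetcher cannot hide
    return total;
}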



Would it actually be beneficial to pack our variables more tightly together (use char/short where possible) in order to reduce block loads?

 

Memory layout is important to a very large degree. In truth, most developers today should be far more concerned about how much cache activity their program causes than about how many CPU cycles it burns. You can, of course, swamp the CPU cycles such that you have effectively infinite memory bandwidth, or swamp the memory such that you have effectively infinite CPU cycles. Still, most people today approach optimization naively from the standpoint of minimizing CPU work first, then fitting memory accesses to its needs, when they really should be doing the opposite -- start with dataflow, and fit CPU usage to it -- that's how you get the best net performance (assuming that in doing so you don't end up with something algorithmically untenable).

 

It's not strictly about packing variables more tightly, though that's part of it -- what you really want to think is something like "I'm working on several entities (say, elements in an array) in a loop; how can I get more loop iterations done before I need to load the next cache line?" Sometimes using smaller variables or packing bits is (part of) the solution; other times, you might split monolithic entities into two or more (say, instead of an array of complex structures, you have several arrays of simpler datatypes from the original structure) so that you are only iterating over the data that you need *right now* in this very loop.
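For example, here is a rough sketch of that split, with made-up entity fields (an array of structures on one side, a structure of arrays on the other):

#include <cstddef>
#include <vector>

// Array of structures: updating positions drags health, flags, etc. through the cache too.
struct EntityAoS { float x, y, z, vx, vy, vz; int health, flags; };

// Structure of arrays: each loop streams only the fields it actually needs right now.
struct EntitiesSoA
{
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int>   health, flags;
};

void integrate(EntitiesSoA& e, float dt)
{
    for (std::size_t i = 0; i < e.x.size(); ++i)
    {
        e.x[i] += e.vx[i] * dt;   // only the position/velocity arrays are touched here
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}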

 

There's a branch of software design dedicated to this approach, called Data-Oriented Design (DOD) (Google: Mike Acton DOD for a good start) which has been growing in mainstream popularity/acceptance in performance-minded crowds after being rebooted by game developers (first popularized on vector-machines like Crays, then adopted by game developers quite aggressively during the Xbox360/PS3 generation though people have been doing it here and there all along).


By the way, a very good lesson to learn from this thread as a whole is that modern processors don't do what you think they do -- what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it, let alone from C code or other high-level languages. These days, what you and I know as x64, ARM, or PowerPC architecture is, for the most part, an abstraction solely for our benefit -- not the machine's. There was a time when the visible architecture literally was the machine, but that's not the case any more. Chip designers spend far more transistors, easily an order of magnitude more, implementing caches, micro-ops, register renaming, and other fancy tricks that help hide memory latency than they do implementing anything you would directly recognize as the architecture of the machine you think you're programming.

 

Today, only in the world of microcontrollers and DSPs do you see silicon that actually resembles the architecture it purports to be. Even many of those on the higher-end are becoming less immediately recognizable.

 

A string of C or assembly code doesn't tell you anything about how a modern CPU does the work you ask it to do -- it only guarantees logical equivalency and makes for a familiar design language that won't turn your brain to mush. You can't know anything about how the machine *actually* works without diving into details of the true, underlying architecture. 


I did a test. The result shows that they are nearly identical in efficiency, in both 32-bit and 64-bit builds. Here is my test code:

 

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <windows.h>    // QueryPerformanceCounter / QueryPerformanceFrequency

using namespace std;

union UValue
{
    void SetVal2(int val);
    int  GetVal2(void);

    struct
    {
        char  m_nVal1;
        char  m_nVal3;
        short m_nVal2;
    };

    int m_nVal;
} uValue;

void UValue::SetVal2(int val)
{
    m_nVal = (m_nVal & 65535) | (val << 16);
}

int UValue::GetVal2(void)
{
    return m_nVal >> 16;
}

#define nCount (1024 * 1024 * 128)
UValue arUVal[nCount];          // 128M four-byte unions, 512 MB in total
unsigned int nSum;

int main(void)
{
    cout << sizeof(UValue) << endl;
    int n = arUVal[0].GetVal2();

    LARGE_INTEGER frequency;    // ticks per second
    LARGE_INTEGER t1, t2;       // ticks
    double elapsedTime;

    srand(time(NULL));

    QueryPerformanceFrequency(&frequency);

    //------------------------------------------------------------------
    // Direct member access
    QueryPerformanceCounter(&t1);

    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = rand();
    }

    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].m_nVal2 = arUVal[i].m_nVal2 + 1;
    }

    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].m_nVal2;
    }

    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;

    cout << "sum of direct access: " << nSum << "    milliseconds used: " << elapsedTime << endl;

    //------------------------------------------------------------------
    // Access through the shift/mask member functions
    QueryPerformanceCounter(&t1);

    for (unsigned int i = 0; i < nCount; ++i)
    {
        arUVal[i].SetVal2(rand());
    }

    for (unsigned int i = 1; i < nCount; ++i)
    {
        arUVal[i].SetVal2(arUVal[i].GetVal2() + 1);
    }

    for (unsigned int i = 1; i < nCount; ++i)
    {
        nSum += arUVal[i].GetVal2();
    }

    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;

    cout << "sum of function access: " << nSum << "    milliseconds used: " << elapsedTime << endl;
}
 
The order of the two code blocks also matters: whichever comes later is slightly faster. And the result is the same even when I changed the union to this:
#pragma pack(1)
union UValue
{
    void SetVal2(int val);
    int  GetVal2(void);

    struct
    {
        char  m_nVal1;
        short m_nVal2;   // with pack(1) this now sits at offset 1, i.e. misaligned
        char  m_nVal3;
    };

    int m_nVal;
} uValue;
Look at the assembly being produced by your compiler. I would assume that compilers are smart enough to emit the equivalent of your set/get mask-and-shift routines where appropriate, and also to emit load-small-datatype instructions where appropriate.

If you're writing low-level code like this, you need to have an intuition of what kind of asm the compiler will be producing from your code.
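For instance, here is a small pair of functions you could feed to a compiler (with optimisation on, e.g. -O2, then inspect the output with -S or an online disassembly viewer). The names are mine, and the comments describe what typical x86-64 compilers tend to emit rather than a guarantee:

#include <cstdint>

struct Halves { std::int16_t lo, hi; };
union  Packed { Halves parts; std::int32_t whole; };

void set_hi_direct(Packed& p, std::int16_t v)
{
    p.parts.hi = v;                               // usually a single 16-bit store
}

void set_hi_masked(Packed& p, std::int32_t v)
{
    p.whole = (p.whole & 0xFFFF) | (v << 16);     // load + and + or + store, unless the
                                                  // compiler narrows it to the same store
}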

what actually happens under the hood is pretty far removed from even the assembly instructions you're feeding it

QFT. It can be pretty important to know which asm operations are actually microcoded as super-expensive algorithms (e.g. on PowerPC, the asm for 1<<x looks super simple, but it's actually a microcode loop equivalent to result=1; for(i=0..x) result<<=1;), or whether asm registers actually exist, or whether they're just mapped onto the stack (e.g. what looks like cheap asm code for moving a value to a register might actually be a routine that calculates a memory address and then stores a value there).


I thought I would tip you off to these excellent videos that explain what you are asking about. You will see why the data probably is already in the processor by the time the CPU executes the instructions needing those registers.

 

https://www.youtube.com/watch?v=qZTt8JtQOmw

 

https://www.youtube.com/watch?v=TJq5FE7jGzU



If you can design your system so you are constantly reading a series of sequential memory addresses -- basically operating on densely packed linear arrays or a structure of arrays -- the CPU will see this pattern and prefetch all your data, giving a great processing experience.
 
If you are reading objects one at a time scattered across memory, or reading along distant steps like an array of structures, this can cause a huge number of cache misses and stalls as data is loaded from hither and yon. 

I read somewhere that modern hardware prefetchers don't necessarily need adjacent addresses; they can deal with stride. So your array-of-structures example would be doable, if true.


The prefetcher can do stride, yes, but linear access (both forward and backward) is dead simple for it to do. I'm not sure whether the prefetcher is looking at the addresses of the elements or at which cache lines are loaded (it seems to me the latter would be less expensive, but the benefit might justify the higher cost of the former). If it is looking at cache lines rather than addresses, certain stride sizes could make the pattern hard to predict (e.g. a stride of 1.5 cache lines, 96 bytes, would have a pattern of read cache line A, read B, skip C, read D, read E, skip F, and so on). The prefetcher would still help, assuming it notices that it should try something even if imperfect, but it won't get the maximum benefit.

 

The other issue is that the prefetcher itself doesn't directly benefit throughput -- it only improves latency. If you have a 96 byte structure and you only touch 12-16 bytes (say, an object position), then you're loading one cache-line per object and reducing latency of that cache line being ready only helps so much. On the other hand, if you have tightly-packed elements accessed linearly, it often works out that the prefetcher entirely (or almost-entirely) hides the latency behind the work you're doing on the previously-loaded elements. In fact, I've seen a graph (I believe it was in a talk by Herb Sutter) that showed how, for linear (forward or reverse) accesses, the prefetcher behaves very predictably and significantly better than other access patterns (with random access being worst-case, and strided access in between) -- the prefetcher looks very much like an infinitely-large last-level cache for linear accesses.


They'll still be prefetching cache-line-sized chunks (probably 64 bytes), so you'd want any gaps / the stride to be a multiple of 64 bytes.

This is true, although I wonder whether the prefetcher accelerates both prefetches for unaligned accesses that straddle cache lines. It most probably does, but I wonder how much faster it turns out.

 


The other issue is that the prefetcher itself doesn't directly benefit throughput -- it only improves latency. If you have a 96 byte structure and you only touch 12-16 bytes (say, an object position), then you're loading one cache-line per object and reducing latency of that cache line being ready only helps so much.

Repeated latency savings amount to an increase in bandwidth, but yeah, you're right: using only portions of a cache line reduces effective bandwidth.


Repeated latency savings amount to an increase in bandwidth,


I know what you're trying to say, and it's even mathematically true (throughput is composed of two terms, Elements processed over Time, where in terms of caches the Elements are the size of the cache line and the Time is derived from latency). That said, I find it more productive to think about throughput and latency as orthogonal most of the time.

They certainly are related, but you typically have much more control over what elements of a cache line are next to each other (useful elements) than you do over when those elements are lower in the cache (or how many times you need to look at the same/duplicate data). That is, you can't actually reduce the Time term except by using more of the elements in close proximity. Thus, latency only really matters for loading something new (not yet first-level cached).

