How to use small datatypes efficiently

Started by
28 comments, last by Infinisearch 8 years, 6 months ago

I used to use small datatypes (short, char) to save memory, but these types will likely need two bus transactions for operations. Take this union for example:

union UValue
{
    struct
    {
        short m_nVal1;
        short m_nVal2;
    };
    int m_nVal;
} uValue;
Two bus transactions are needed to read m_nVal2 into the CPU (though I am not sure if this is the case for m_nVal1, since it is four-byte aligned). Will this be extremely slow? Is it wise to operate on m_nVal2 with some special tactic? Say, to set uValue.m_nVal2 to 77, do it like this:
uValue.m_nVal = (uValue.m_nVal & 0xFFFF) | (77 << 16);
Will this be much faster than uValue.m_nVal2 = 77? I am sure we all like the feeling that we are coding in the best way.
It can be encapsulated in a member function:
void UValue::SetVal2(int val)
{
    m_nVal = (m_nVal & 0xFFFF) | (val << 16);
}
Probably not.

Getting data from memory into cache will be the bottleneck. How you access the members once they are on the CPU is unlikely to make any difference to a real-world application.

Two bus transactions are needed to read m_nVal2 into the CPU (though I am not sure if this is the case for m_nVal1, since it is four-byte aligned). Will this be extremely slow? Is it wise to operate on m_nVal2 with some special tactic, say, to set uValue.m_nVal2 to 77, do it like this

No, it wouldn't take two bus transactions.

Memory works in blocks. Large blocks of memory are paged in and out of the L3 cache, smaller blocks are paged in and out of the L2 cache, and smaller blocks still are paged in and out of the L1 cache. You don't control those block sizes.

On recent hardware, the smallest cache block size is typically 64 bytes, called a cache line.

Once that 64 byte buffer is loaded, anything you do within that 64 byte buffer is very nearly free. Thanks to the magic of the Out Of Order core and processor caches, doing one operation on one item in that block is very nearly the same clock time as doing one operation on sixteen items in that block.

The exact speed of operations depends on the exact chip since they are adjusted and tuned for each processor. An i7 4790k will have slightly different characteristics from an i7 5930k. There are trends over time, but because of the nature of the out of order core and the huge variability in what chips people have installed, you cannot really know exactly how long any instruction is going to take. Even running the exact same code on a HT chip can have somewhat different timings because the CPU may be more busy or less busy with other threads.

Finally, many times dealing with smaller data sizes (16-bit or 8-bit) requires a little more CPU work than dealing with a full word size (32 bits). Thankfully the timing differences are USUALLY so small it doesn't matter, but if it does matter it is often best to work with the natural word size of the processor.

Probably not.

Getting data from memory into cache will be the bottleneck. How you access the members once they are on the CPU is unlikely to make any difference to a real-world application.

The problem lies exactly in getting data from memory into cache: since m_nVal2 is not aligned, it will need two bus transactions to complete. So my question is, will reading m_nVal instead be much faster?


The problem lies exactly in getting data from memory into cache: since m_nVal2 is not aligned, it will need two bus transactions to complete. So my question is, will reading m_nVal instead be much faster?

No, not at all.

Also, m_nVal2 is properly aligned. Unions don't break alignment requirements. They are somewhat dangerous and platform-specific, so they generally shouldn't be used unless you know exactly what you are doing and why you are doing it.

It doesn't make much sense to do it in this case.


Excellent explanation, frob. I did ignore caches. But two bus transactions may be possible, if less likely. Here is a snippet from this link: https://msdn.microsoft.com/en-us/library/ee418650.aspx
Reads and writes of types that are not naturally aligned—for instance, writing DWORDs that cross four-byte boundaries—are not guaranteed to be atomic. The CPU may have to do these reads and writes as multiple bus transactions, which could allow another thread to modify or see the data in the middle of the read or write.

It may not be as serious as I thought, and that's what I wanted to know. Thanks.

There would be no reason for the compiler to break boundaries for that specific union. There are other unions you can build that will break boundaries, but that one -- two shorts unioned with an int -- will not break any boundaries.

Yes, this quote doesn't match my example: it talks about crossing four-byte boundaries, while my m_nVal2 is merely not four-byte aligned and doesn't cross one. Sorry for that.

By the way, it seems that operating on unaligned data on some mobile platforms will crash the application.

Some systems crash, yes. On others you incur an invisible performance penalty.

But, I hate to drag this line out, unless you have profiled a real world application and identified this as a genuine bottleneck, such optimisation is absurd.

This topic is closed to new replies.
