# Multiplication and division of four chars packed in one int

This topic is 678 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hello,

I would like to store four c++ chars in one int (or other int data types compatibile with each other in such way) and perform basic operations on them in such way that the operation would be done for all of the elements at once (to speed up computations). Like in the example below:

union{
uint8_t i8[4];
uint32_t i64;
} A,B,C;

A.i8[0]=10;
A.i8[1]=14;
A.i8[2]=2;
A.i8[3]=5;

B.i8[0]=8;
B.i8[1]=4;
B.i8[2]=2;
B.i8[3]=2;

cout<<"A = "<<short(A.i8[0])<<" "<<short(A.i8[1])<<" "<<short(A.i8[2])<<" "<<short(A.i8[3])<<endl;
cout<<"B = "<<short(B.i8[0])<<" "<<short(B.i8[1])<<" "<<short(B.i8[2])<<" "<<short(B.i8[3])<<endl;
C.i64=A.i64+B.i64;
cout<<"A+B="<<short(C.i8[0])<<" "<<short(C.i8[1])<<" "<<short(C.i8[2])<<" "<<short(C.i8[3])<<endl;
C.i64=A.i64-B.i64;
cout<<"A-B="<<short(C.i8[0])<<" "<<short(C.i8[1])<<" "<<short(C.i8[2])<<" "<<short(C.i8[3])<<endl;
C.i64=A.i64*B.i64;
cout<<"A*B="<<short(C.i8[0])<<" "<<short(C.i8[1])<<" "<<short(C.i8[2])<<" "<<short(C.i8[3])<<endl;
C.i64=A.i64/B.i64;
cout<<"A/B="<<short(C.i8[0])<<" "<<short(C.i8[1])<<" "<<short(C.i8[2])<<" "<<short(C.i8[3])<<endl;

The addition and subtraction works just fine, however I don't really have any idea to how to perform multiplication and division in such way.

I was planning to use SSE, however it turned out that the target platform might not have ability to use SSE functionality. How to force C++ to multiply and divide properly four chars one by one like it happens in SSE?

Edited by Misery

##### Share on other sites

I'm not aware of any software algorithm to do this, it takes specialized hardware instructions to do this kind of thing -- even add and subtract don't work as you think they do, they'll fail if any of the 4 bytes underflow or overflow.

You could pack fewer bytes into a word to avoid this, but I suspect you'd spend as many instructions shifting and masking as you're spending on adding and subtracting individual bytes.

What hardware are you targeting that there's no SIMD (SSE, or something like it)? What actual algorithms are you trying to speed up? There might be better ways of optimizing them.

##### Share on other sites

It is supposed to be an image filter for smartphone. I am not worry about overflow or underflow as I am able to control it.

##### Share on other sites

Multiplication of 2 bytes gives 16 bit (that is, the result of A * B is 'number_of_bits(A) + number_of_bits(B)' bits long).

Shifts are added  (A << x) * (B << y) = A * 2^x * B * 2^y = A * B 2^(x+y) = (A * B) << (x + y).

One byte multiplying with byte 0 and byte 2 would thus give you 2 shorts

A * (B + (C << 16)) = A*B + ((A * C) << 16) where both A*B and A*C are 16 bit.

##### Share on other sites

For multiply operations you need a 16bit result potentially, unless you know your operands have a small magnitude. If you know that the left-most bit positions carrying a 1 in each operand sum to less than 8, a multiply won't overflow (e.g. 10000000 x 00000001 won't overflow, 00001111 x 00001111 won't overflow, as a kind of special case even 00010000 x 00001111 won't overflow -- but 00010001 x 00001111 does). To be suitable for larger operands, you need to provide more than 8 result bits -- at least temporarily if you know you can get the result back in range by the end of the algorithm. These extra bits would take the form of the most significant result bits.

Division is similar, but you need at least an extra result bit on top of the 16 bit result multiplication needs and all the extra bits would take the form of the least significant result bits.

One technique you could try would be to transform the algorithm algebraically to see if you can bend it into a form that's more suitable for your hardware -- e.g. replace divisions with multiplications by the inverse (which can win if you only need to calculate the inverse once and its used multiple times, or if you can accept a faster approximation -- a trick Quake used, IIRC). There are other old-school tricks that could help -- for example, multiplying and dividing by powers of two can be replaced with shifts, which are often but not always faster than multiply and usually faster and never slower than divide.

If your hardware really lacks SIMD (The ARM--I'm assuming, since its a smartphone--equivalent would be NEON, or VFP on very low-end devices, but I'm not sure how much either of them supports integer math) it might also have no/slow hardware divide instruction, in which case eliminating divides alone would be a huge gain. If your hardware does turn out to support NEON (and maybe VFP) you might find that its actually faster to use SIMD floating point math instead to take advantage of those instruction units, converting from and to your 8.8.8.8 bit format at the beginning and the end of your algorithm (or doing away with it in your filtering pipeline entirely).

##### Share on other sites

What OS is it for?

On Android there's Renderscript that is specifically made for things like image filters that might work. In general you should be able to do a lot of stuff in GLSL shaders. I would investigate that possibility as it could save you a lot of work and improve performance.

##### Share on other sites
I think what you are looking for is SWAR (SIMD within a register)? That was hype in the mid-90s, and I think there is something about it on one of the bit hack / bit twiddling sites (Aggregate?), but I'm not sure how useful it is. With the stuff like SSE and Neon becoming mainstream, these ideas became pretty superfluous for almost everybody. You might just be lucky and find something under that name,anyway. Worth a try.

##### Share on other sites

This topic is 678 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Create an account

Register a new account

• ### Forum Statistics

• Total Topics
628662
• Total Posts
2984095

• 10
• 9
• 9
• 10
• 21