Decept

Saturating in asm

Hi, I'm currently writing inline asm in VC2003 using MMX/SSE/SSE2 instructions. I have four 32-bit signed integers in a 128-bit xmm register, that is, they are packed dword integers. I need to convert these values into four 16-bit unsigned integers. The 32-bit signed integers should already be in the 16-bit unsigned range, but due to possible errors they may fall just outside it, though not by much.

To make them unsigned 16-bit I must of course make sure that they are >= 0x0000 and <= 0xFFFF. That is the problem: how? I can't find any instruction that does just that. I have already tried a bunch of different approaches, but none worked. For example, I thought of using the MAX and MIN instructions against the extreme values, but those instructions only handle words and bytes, not double words. That is the case with every solution I have come up with... Please, if anyone knows a way...

You could always trick the CPU into thinking you have floats, then use minps/maxps. The latencies on these instructions will suck, though:


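xmm0: your values
xmm1: 0x3f800000 in each dword (1.0 in float representation)
xmm2: 0x3f80ffff in each dword (max value)
xmm4: 0x0000ffff in each dword (mask for the final result)

orps xmm0, xmm1   ; set exponent bits so each lane is a normalized float
maxps xmm0, xmm1  ; clamp from below first: a negative input is a NaN or a negative float here, and maxps returns xmm1 for a NaN
minps xmm0, xmm2  ; clamp from above
andps xmm0, xmm4  ; strip the exponent bits again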


How did you end up with signed 32-bit ints in xmm registers anyway? If you'd put them in mm registers, it'd be much easier.
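For anyone who wants to try the float trick from C rather than inline asm, here is a minimal sketch using SSE/SSE2 intrinsics. The function name `clamp_u16_via_float` is made up for illustration, and the cast intrinsics assume a reasonably modern compiler. It issues maxps before minps on purpose: a small negative input such as -3 has all its upper bits set, so it is still a NaN after the orps, and minps/maxps return their second (source) operand when either input is a NaN. Clamping from below first turns that NaN into 1.0, which masks back to 0. The sketch also assumes positive inputs stay below 2^23, so they do not collide with the exponent bits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */
#include <stdint.h>

/* Hypothetical helper: clamp four signed 32-bit values to [0, 0xFFFF]
   using the orps/maxps/minps/andps sequence. Results stay in 32-bit lanes. */
static void clamp_u16_via_float(const int32_t in[4], int32_t out[4])
{
    __m128 v    = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)in));
    __m128 one  = _mm_castsi128_ps(_mm_set1_epi32(0x3f800000)); /* 1.0f bit pattern */
    __m128 top  = _mm_castsi128_ps(_mm_set1_epi32(0x3f80ffff)); /* 1.0f with low 16 bits set */
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x0000ffff));

    v = _mm_or_ps(v, one);    /* orps: force the exponent of 1.0 into every lane */
    v = _mm_max_ps(v, one);   /* maxps: NaN or negative lanes become 1.0 */
    v = _mm_min_ps(v, top);   /* minps: lanes above the limit become the limit */
    v = _mm_and_ps(v, mask);  /* andps: keep only the low 16 bits */

    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(v));
}
```

Calling it on { -3, 0, 1234, 70000 } should leave { 0, 0, 1234, 0xFFFF } in the output array. The values still need a pack step afterwards to become real 16-bit words.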

I can't afford that latency, as this code is in a VERY time-critical place.

The code mainly operates on 32-bit ints; it's only the output that needs to be 16-bit. SIMD is perfect for my code, and using 128-bit registers for four 32-bit ints is great. It solves a lot of problems, as well as being fast.
One of the problems I had without 128-bit registers was that the asm code would be too long.

Why would the mm registers be easier?

Or you could use SSE2's extended MMX instructions (pretty much identical to the 64-bit versions, except that they operate on the SSE registers).

The tricky part is handling the underflow (since you're mixing signed and unsigned integers).
One way would be to begin by subtracting 0x8000, perform a signed 32-to-16 pack (packssdw), and add 0x8000 back on afterwards. Or you could create a comparison mask of where the original values are above zero and apply it to the integers (masking out negative values completely). Yet another method would be to add a bias value (0x8000) to the 32-bit integers and subtract the same amount with unsigned saturation to keep them above zero.

I can't quite figure out any way of doing it without wasting two instructions but maybe they could be combined with some previous code, or you could study the instruction set since I have probably overlooked some helpful instructions.
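The first method (subtract 0x8000, signed pack, add 0x8000 back) can be sketched with SSE2 intrinsics as below. The intrinsics correspond to psubd, packssdw and paddw; the helper name `pack_s32_to_u16_sat` is made up for illustration and is not from the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: saturate four signed 32-bit values to [0, 0xFFFF]
   and pack them into four unsigned 16-bit values. */
static void pack_s32_to_u16_sat(const int32_t in[4], uint16_t out[4])
{
    __m128i v      = _mm_loadu_si128((const __m128i *)in);
    __m128i bias32 = _mm_set1_epi32(0x8000);
    __m128i bias16 = _mm_set1_epi16(-32768);   /* bit pattern 0x8000 per word */

    v = _mm_sub_epi32(v, bias32);  /* psubd: map [0, 65535] onto [-32768, 32767] */
    v = _mm_packs_epi32(v, v);     /* packssdw: signed saturation to 16 bits */
    v = _mm_add_epi16(v, bias16);  /* paddw: wrapping add undoes the bias */

    /* The low 64 bits now hold the four 16-bit results. */
    _mm_storel_epi64((__m128i *)out, v);
}
```

For example, -3 becomes -32771, saturates to -32768, and comes back as 0; 70000 becomes 37232, saturates to 32767, and comes back as 0xFFFF. In a real loop the second operand of the pack would normally be a second register of four values, so one pack produces eight outputs.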

what does the rest of the loop look like? You might be able to schedule away the latencies...

If the values were in mm registers, you'd be able to use pminsw/pmaxsw and then pshufw the 2 registers to get 4 16-bit values into one mm reg. It depends on what you want to do next with it.

Really the only reason that you can't just maxps without orps'ing first is that your numbers are denormalized as floats... can't have a 0 exponent :(

I posted that you'd have to orps 0x3f800000 and mask it back off to compare with the SSE minps/maxps functions. This is because 0x0000xxxx is a denormalized value in the float standard... but it seems like SSE might not care about that. It might just do the compare with fewer bits of precision (which is what you want anyway).

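So, can you try the algorithm I posted without the orps/andps? It will cut your latency down a lot (by about 10 cycles). Just maxps with 0x00000000 and then minps with 0x0000ffff (maxps first, because a negative int looks like a NaN or a negative float, and maxps against zero turns both into 0), and you might be golden.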
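A quick way to try this from C is sketched below. It assumes the MXCSR denormals-are-zero bit is off, which is the usual default; with that bit on, the small positive values would all be treated as zero and the compare would no longer work. The helper name `clamp_u16_denormal` is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */
#include <stdint.h>

/* Hypothetical helper: clamp four signed 32-bit values to [0, 0xFFFF] by
   comparing their raw bit patterns as floats, with no orps/andps. */
static void clamp_u16_denormal(const int32_t in[4], int32_t out[4])
{
    __m128 v    = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)in));
    __m128 zero = _mm_setzero_ps();
    __m128 top  = _mm_castsi128_ps(_mm_set1_epi32(0x0000ffff)); /* a denormal */

    v = _mm_max_ps(v, zero);  /* maxps: NaN and negative lanes become 0 */
    v = _mm_min_ps(v, top);   /* minps: positive floats order like their bit patterns */

    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(v));
}
```

This only shows that the clamp gives the right values under those assumptions; it says nothing about how fast the compares are on denormal inputs, so that part still has to be measured.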

I should also say that this kind of stuff is really bad... maybe for consoles it's okay, but you have a whole set of MMX registers that are supposed to deal with integers. The only way to end up with an int in an xmm register is to load it there on purpose or to or/and/nor/xor it there, in which case MMX is probably better.

So think about that, but also let me know how the SSE goes :)

Hi
sorry for the late reply, I have so much to do right now. I still haven't tried your suggestions. But I will try it and reply later this week.

The reason I use the xmm registers is that I can manipulate more variables at the same time. Actually I can manipulate exactly the number of variables that I need. And if I were to use the mm registers I would run out of them very quickly.

Hi
my code is still not in a runnable state, so I can't try your suggestion yet. But I will get back to you on that as soon as possible.
