mul24 / mul_hi replacements

-1 comments, last by MxADD 9 years, 1 month ago
Hi,
I need to optimize a nasty part of a vertex shader :)
The original code:
uint4 Ctrl; // x - divider [1, 2, 3, or 4]
// y - Mask
// z - unused
// w - unused
uint VtxInstc; // InstanceIndex (SV_InstanceID) [will be in range 1 - 256]
...
uint InstanceIndex = VtxInstc / Ctrl.x;
uint ShadowMapIndex = VtxInstc % Ctrl.x;
ShadowMapIndex = (Ctrl.y >> (ShadowMapIndex * 2)) & 0x3;
...
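To make the packing concrete, here is a minimal CPU-side sketch of what the block above computes. The struct and function names are hypothetical and only for illustration; `divider` stands in for Ctrl.x and `mask` for Ctrl.y, which is assumed to hold one 2-bit entry per slot.

```cpp
#include <cassert>
#include <cstdint>

struct InstanceLookup {
    uint32_t instanceIndex;   // VtxInstc / Ctrl.x
    uint32_t shadowMapIndex;  // 2-bit entry picked from Ctrl.y
};

// CPU reference for the shader block (hypothetical helper, not from the post).
// divider plays the role of Ctrl.x, mask the role of Ctrl.y.
InstanceLookup ReferenceLookup(uint32_t divider, uint32_t mask, uint32_t vtxInstc) {
    assert(divider >= 1 && divider <= 4);
    const uint32_t slot = vtxInstc % divider;  // which 2-bit entry to read
    return { vtxInstc / divider, (mask >> (slot * 2)) & 0x3u };
}
```

For example, with a divider of 3 and a mask of 0x24 (entries 0, 1, 2 for slots 0, 1, 2), instance 7 falls in slot 1 and reads entry 1.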
Looking at the ISA (on GCN 1.1, Radeon HD 7870), I was shocked by how many instructions this block takes (around 42 :/)
So after a while ...
in C++, add these to Ctrl:
static const uint32_t OPSGInfoLockZ[] =
{
1,
1,
0xAAAAAAAB,
1,
};
static const uint32_t OPSGInfoLockW[] =
{
0,
1,
31, // would be 33 in C++, which has no mul_hi and shifts the full 64-bit product instead
2,
};
Ctrl.z = OPSGInfoLockZ[Ctrl.x-1];
Ctrl.w = OPSGInfoLockW[Ctrl.x-1];
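The multiplier table can be sanity-checked on the CPU. The sketch below assumes a plain 64-bit multiply, so it uses a shift of 33 for the divide-by-3 entry (as the comment in the table suggests) instead of 31; the names `kMagic`, `kShift64`, `DivideByTable` and `TableMatchesDivision` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Index = divider - 1. kMagic mirrors OPSGInfoLockZ from the post;
// kShift64 is an assumed 64-bit variant of OPSGInfoLockW (33 instead of 31).
static const uint32_t kMagic[4]   = {1, 1, 0xAAAAAAABu, 1};
static const uint32_t kShift64[4] = {0, 1, 33, 2};

// Divide v by divider (1..4) with one 64-bit multiply and one shift.
uint32_t DivideByTable(uint32_t v, uint32_t divider) {
    assert(divider >= 1 && divider <= 4);
    const uint64_t product = uint64_t(kMagic[divider - 1]) * v;
    return uint32_t(product >> kShift64[divider - 1]);
}

// True if the table reproduces v / divider over the range stated in the post.
bool TableMatchesDivision() {
    for (uint32_t divider = 1; divider <= 4; ++divider)
        for (uint32_t v = 1; v <= 256; ++v)
            if (DivideByTable(v, divider) != v / divider) return false;
    return true;
}
```

The divide-by-3 entry works because 0xAAAAAAAB equals (2^33 + 1) / 3, so shifting the product right by 33 rounds down to v / 3.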
and in PSSL (PS4):
uint InstanceIndex = mul24(VtxInstc, Ctrl.z) >> Ctrl.w; // For divide by 3 it will be canceled out to 0
InstanceIndex |= mul_hi(VtxInstc, Ctrl.z) >> 1; // Non-zero only for divide by 3, where Ctrl.z is large enough; 0 for the other dividers
uint ShadowMapIndex = VtxInstc - mul24(InstanceIndex, Ctrl.x); // VtxInstc % Ctrl.x;
ShadowMapIndex = (Ctrl.y >> (ShadowMapIndex * 2)) & 0x3;
... ok ... 12 instructions ... not bad
Now time for DX11 (Shader Model 5)
... and here is the problem and my question: is it possible to port the code above to HLSL (DX11)?
I could live without mul24 (multiplies the low 24 bits of two ints; this takes 1 cycle, compared to 2 cycles for an ordinary multiply),
but not without mul_hi (multiplies two 32-bit ints and returns the upper 32 bits of the 64-bit result).
Basically, I want to write something like this in HLSL (shown here in C++):
uint InstanceIndex = (uint64(Ctrl.z) * VtxInstc) >> Ctrl.w; // Ctrl.w would be 33 for divide by 3
but there is no uint64 type, and using the double type is overkill, not an optimization :/
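One possible direction, sketched here in C++ and not taken from the original post: the upper 32 bits of a 32x32-bit product can be assembled from 16-bit halves using only 32-bit multiplies, shifts, masks and adds. The helper names are hypothetical. The last function is a further assumption: since VtxInstc stays in the range 1 - 256, a narrower constant whose product fits in 32 bits might avoid the high multiply entirely.

```cpp
#include <cassert>
#include <cstdint>

// Upper 32 bits of a 32x32-bit unsigned product, using only 32-bit math.
// Hypothetical helper; every partial sum below stays under 2^32.
uint32_t MulHiU32(uint32_t a, uint32_t b) {
    const uint32_t aLo = a & 0xFFFFu, aHi = a >> 16;
    const uint32_t bLo = b & 0xFFFFu, bHi = b >> 16;
    const uint32_t lowProduct = aLo * bLo;
    const uint32_t mid1 = aHi * bLo + (lowProduct >> 16);
    const uint32_t mid2 = aLo * bHi + (mid1 & 0xFFFFu);
    return aHi * bHi + (mid1 >> 16) + (mid2 >> 16);
}

// Divide by 3 with the post's constant: (v * 0xAAAAAAAB) >> 33.
uint32_t DivideBy3ViaMulHi(uint32_t v) {
    return MulHiU32(v, 0xAAAAAAABu) >> 1;
}

// Assumed alternative for small inputs: 0xAAAB = (2^17 + 1) / 3, and the
// product v * 0xAAAB fits in 32 bits as long as v < 2^16.
uint32_t DivideBy3Small(uint32_t v) {
    assert(v < 65536u);
    return (v * 0xAAABu) >> 17;
}
```

Both variants use only operations that exist on 32-bit unsigned integers, so they should translate to HLSL uint math more or less line by line; whether either one beats the original 42 instructions would need to be checked against the compiled ISA.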
