Efficient 24/32-bit sRGB to linear float image conversion on CPU

3 comments, last by Hodgman 9 years ago

Does anyone know of some efficient ways of converting 24/32-bit sRGB to linear floating point on the CPU? I haven't got access to a CPU with AVX2 instructions yet, but I am intrigued by the new gather instructions. I was thinking that these could possibly be used for this type of conversion, such as in this example below. The LUT would be 256x4 bytes, so I imagine it would fit entirely into L1 data cache.


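#include <immintrin.h>

// Converts two RGBA8 pixels (8 bytes) to eight floats via a 256-entry LUT.
__m256 RGBA8toRGBA32F(const char* pixel_data, const float* LUT)
{
    // Load only the 8 bytes that _mm256_cvtepu8_epi32 consumes; no alignment required.
    __m128i bytes = _mm_loadl_epi64((const __m128i*)pixel_data);
    return _mm256_i32gather_ps(LUT, _mm256_cvtepu8_epi32(bytes), 4);
}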

For a given sRGB (sR, sG, sB) and Gamma, the RGB (R, G, B) will be:

R=pow(sR,1/Gamma)

G=pow(sG,1/Gamma)

B=pow(sB,1/Gamma)

But I don't think you can do this with AVX2's functions.

sRGB isn't a pure gamma curve, see: http://entropymine.com/imageworsener/srgbformula/

Read http://en.wikipedia.org/wiki/Gamma_correction#Windows.2C_Mac.2C_sRGB_and_TV.2Fvideo_standard_gammas, which says:

"Unlike most other RGB color spaces, the sRGB gamma cannot be expressed as a single numerical value. The overall gamma is approximately 2.2, consisting of a linear (gamma 1.0) section near black, and a non-linear section elsewhere involving a 2.4 exponent and a gamma (slope of log output versus log input) changing from 1.0 through about 2.3."

The second paragraph of http://en.wikipedia.org/wiki/SRGB does say that sRGB doesn't use a single gamma, and that GPUs use a table to convert the output RGB to sRGB. The link you sent itself uses 2.4 as the gamma, though.

It involves a gamma curve, but isn't a gamma curve - pow2.2 is a good approximation, but for accuracy it's important to use the real formula with the linear tail.
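$$
C_\mathrm{linear} =
\begin{cases}
\dfrac{C_\mathrm{srgb}}{12.92}, & C_\mathrm{srgb} \le 0.04045 \\
\left(\dfrac{C_\mathrm{srgb} + a}{1 + a}\right)^{2.4}, & C_\mathrm{srgb} > 0.04045
\end{cases}
$$
where $a = 0.055$.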

I'd implement your look-up-table version, and a plain ALU version and profile them in a real usage situation. The LUT version's performance will depend heavily on how much pressure is on the cache.

For the ALU version you can do both sides of the discontinuity and then select the correct side branchlessly.
a = srgb*(1/12.92);                     // linear segment near black
b = pow((srgb+0.055)*(1/1.055),2.4);    // power segment
rgb = srgb <= 0.04045 ? a : b;          // select the side of the threshold

^ That final ternary statement can be implemented with conditional moves/shuffles, masking and adding (ANDing and ORing), etc...

...but the pow is costly, so maybe you do want to use a real branch if any elements in the vec4/vec8 need it.

n.b. to SSEize the pow, you can use exp/log instead:
b = exp(log((srgb+0.055)*(1/1.055))*2.4);
...and get an exp/log implementation from a library like http://gruntthepeon.free.fr/ssemath/

[edit]

To write this kind of SIMD code, I've recently been using the ISPC language, which lets you write the algorithm once and then compile it to SSE2/AVX/AVX2/etc... Gathers/scatters will be emulated on the older instruction sets though, of course.
