Some HDR questions

4 comments, last by Hodgman 11 years, 3 months ago
Hi

I am trying to figure out HDR so I can implement it in my engine. Please let me try to explain what I know; correct any mistakes and answer the questions along the way.

So let me see if I get the pipeline right...

Normally I would render a scene to the backbuffer, which is in D3DFMT_X8R8G8B8 format. The texture sampler used to color meshes returns ARGB float values between 0.0f and 1.0f, since colors in HLSL do not range from 0 to 255 but from 0 to 1. So if the sampled texture is in D3DFMT_A16B16G16R16 format, the 32-bit float in the shader will be able to hold all that information. The 32-bit channel float must then be truncated down to an 8-bit channel to fit the backbuffer, which happens after the fragment shader?

Now in HDR (deferred shading) I want to create a render-target texture in D3DFMT_A16B16G16R16 format and render the scene to it. My shaders still output a color between 0 and 1, but the precision is a 32-bit float for each channel. That 32-bit float output must then be truncated down to fit in a 16-bit channel in the texture render target?

The pipeline is:
Render Scene to HDR texture -> Render quad with the HDR texture to backbuffer.

The colors are still between 0 and 1 in both situations, but in two different precisions, making room for smoother color transitions. So even with 100 light sources the intensity will never go beyond the value 1.0f?

Now, to get those neat effects I need to follow this pipeline:
Render Scene to HDR -> Tone-mapping HDR texture -> Render quad with the HDR texture.
By tone-mapping we take the HDR texture and scale values down using the exponential function
float4 exposed = 1.0 - exp( -(unexposed * K ) );
where unexposed is the raw channel value between 0 and 1, and exposed is also a value between 0 and 1? From reading guides I thought that the unexposed value from the HDR texture would be between 0 and infinity... K is the luminance adaptation value.

K is obtained from the previous frame by making a second HDR render target, outputting Log2(exposed) in the fragment shader and mipmapping that down to a 1x1 texture. This 1x1 texture holds RGB values, which are weighted-averaged and then exponentiated to get K.
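In HLSL I imagine that luminance pass would look roughly like this (the sampler and function names are just placeholders)?

sampler2D hdrScene;   // assumed: the HDR scene render target from the first pass

float4 LogLuminancePS( float2 uv : TEXCOORD0 ) : COLOR0
{
    float3 hdr = tex2D( hdrScene, uv ).rgb;
    // weighted average as a rough luminance measure
    float lum = dot( hdr, float3( 0.333, 0.333, 0.333 ) );
    float logLum = log2( lum + 0.0001 );   // small epsilon avoids log2(0) on black pixels
    return float4( logLum, logLum, logLum, 1.0 );
}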

Can K be obtained by having a global variable instead of another render target that each fragment adds its Log2(exposed) value to? This value would then be divided by the number of pixels in the application?

The Render Scene to HDR -> Tone-mapping HDR texture stages can be merged into one pass by just adding the above code at the end of the pixel shader.

The pipeline is now:
Render Scene to HDR with Tone-mapped values -> Render quad with the HDR texture.

This pipeline will now bring out dark and light areas in the scene, instead of having the scene over- or under-exposed.


To get that Bloom effect...
This is simply done by taking a copy of the current scene, blurring it, and adding the blurred version back to the original. To make this fast, you usually scale down the scene to a texture that is, for example, 1/4 of the original, and then blur that. Another thing is that you only want the brightest pixels to glow, not the entire scene; therefore, when scaling down the scene, you usually also subtract a factor from the original pixels, for example 1.0, but you can use whatever looks good. The smaller this factor is, the more parts of your scene will glow, and vice versa.
So I guess I need to make another HDR render target and output a darkened color to it. Mipmap it down and perform a horizontal Gaussian blur pass and a vertical Gaussian blur pass. Then add the two Gaussian passes in a third/fourth pass to produce the final scene on the backbuffer.

The pipelines are now:
RenderTarget0 HDR (Tone-mapped with K from previous frame) -> HDR scene
RenderTarget1 HDR (Luminance/Log2/K) -> mipmap down to 1x1 -> exp(dot( color.rgb, float3( 0.333, 0.333, 0.333 ) )) -> K
RenderTarget2 HDR (mipmapped for bloom) -> Mipmap 1/4 size with darkened colors -> Vertical blur
RenderTarget3 HDR (mipmapped for bloom) -> Mipmap 1/4 size with darkened colors -> Horizontal blur

BackBuffer = HDR scene + BilinearUpscale(Vertical blur + Horizontal blur)

This sums up to 3 or 4 passes?


Woah! Am I thinking correctly? Please let me know what I have done wrong and possibly suggest better pipelines.
create a render-target texture in D3DFMT_A16B16G16R16 format and render the scene to it. My shaders still output a color between 0 and 1, but the precision is a 32-bit float for each channel. That 32-bit float output must then be truncated down to fit in a 16-bit channel in the texture render target?
Yes, your shaders always output 32-bit floats, which are then truncated to the precision of the render-target.
The colors are still between 0 and 1 in both situations, but in two different precisions, making room for smoother color transitions. So even with 100 light sources the intensity will never go beyond the value 1.0f?
D3DFMT_A16B16G16R16 still only stores values from 0-1, but it does so in increments of 1/65536th instead of the usual increments of 1/256th.
With this format, you have to be careful with your lighting values so that they never reach 1.0f, because if they do, then you won't be able to represent anything brighter, and you get strange results where that happens.

D3DFMT_A16B16G16R16F is probably what you want -- it stores floating point numbers, but in the compact "half float" format, instead of full 32-bit floats. This lets you store numbers from around 0-65519, with a decent amount of fractional precision. This means that if you've got 10 lights with brightness of '100' overlapping, then there will be no clamping of the result. You'll be able to store 1000.0f in the render-target!
From reading guides I thought that the unexposed value from the HDR texture would be between 0 and infinity... K is the luminance adaptation value.
As above, I'd recommend using the 16F format so that your HDR texture is between 0 and big-enough-to-pretend-it's-infinity ;)

If you do use a format that's between 0 and 1, you can use some kind of hand-picked multiplier value as your brightest HDR value. e.g. if you use "unexposed * 1000" in your tone-mapper, and "return output * 0.001" in your lighting shaders, then your render target is basically storing values from 0 to 1000, in increments of 1/65536.
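As a rough sketch (the names and the 1000 scale factor below are just made-up examples, not anything you have to use):

static const float HDR_SCALE = 1000.0;   // hand-picked brightest value you expect to need

// At the end of a lighting shader writing to D3DFMT_A16B16G16R16:
float4 EncodeHDR( float3 litColour )
{
    return float4( litColour / HDR_SCALE, 1.0 );   // 0..1000 stored as 0..1
}

// In the tone-mapper, before applying the exposure curve:
float3 DecodeHDR( float3 stored )
{
    return stored * HDR_SCALE;                     // back to 0..1000
}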
Can K be obtained by having a global variable instead of another render target that each fragment adds its Log2(exposed) value to?
In D3D9, shaders can't write to shared global variables like that. The "unordered access view" makes this possible in D3D11 only.
To get that Bloom effect ... This sums up to 3 or 4 passes?
Usually you'd do your "darken" pass during down-sampling, separate from the blurring. The reason is that you want the input texture to the blur to be low-res as well as the output texture (otherwise you waste bandwidth).
So:
1) render HDR scene
2) output Log luminance (and mipmap or downsample to 1x1)
3) downsample HDR scene and apply a darkening factor
4) blur downsampled scene vertically
5) blur the previous horizontally
6) sum together the HDR scene and the blur-result, and tone-map the result (output as 0-1 to an 8-bit target)

In my last game, we did it a bit differently though for performance reasons (we were targeting old GPUs where 64bpp textures were slow):
1-5) as above, except #3's "darkening factor" is a tone-mapper (outputting 0-1 to an 8-bit target), and 4/5 are done in 8-bit.
6) tone-map the HDR scene (output as 0-1 to an 8-bit target)
7) screen-blend the blur-result over the scene
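"Screen" here is the usual Photoshop-style blend mode; as a minimal shader sketch it's just:

// screen blend: result = 1 - (1 - scene) * (1 - bloom), i.e. scene + bloom - scene*bloom
float3 ScreenBlend( float3 scene, float3 bloom )
{
    return 1.0 - (1.0 - scene) * (1.0 - bloom);
}

If I remember right, you can also get the same result from the fixed-function blender, using D3DBLEND_ONE as the source factor and D3DBLEND_INVSRCCOLOR as the destination factor.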

Thanks for answering, I am still not in the clear though :)

I did not notice that there were two formats D3DFMT_A16B16G16R16 and D3DFMT_A16B16G16R16F. You say I need to be careful to not reach 1.0f with the former, but what about reaching 65504 (max half precision) with the latter?

And say my brightest pixel is 1000.0f, won't then all values between 1000.0f and 65504 be wasted? I thought that HDR was meant to utilize the whole range, e.g. 0-1 with D3DFMT_A16B16G16R16 in increments of 1/65536.

1) render HDR scene
2) output Log luminance (and mipmap or downsample to 1x1)
3) downsample HDR scene and apply a darkening factor
4) blur downsampled scene vertically
5) blur the previous horizontally
6) sum together the HDR scene and the blur-result, and tone-map the result (output as 0-1 to an 8-bit target)

Let me try to understand what you are saying here.

These are produced in the first pass:

RenderTarget0 (D3DFMT_A16B16G16R16F): Render scene with light values between 0.0f and 65504? (or just ignore 65504 and possibly output a value like 2344534756348.0f)

RenderTarget1 (D3DFMT_A16B16G16R16F): Render the same scene but output log luminance for the pixel values instead, and mipmap it.

Get the K in application by reading lowest mip surface?

This is produced in the second pass:

3. "downsample/mipmap" my first render target? or render a new frame (RenderTarget2) on another smaller texture using the HDR scene as texture to a quad and output a darker color e.g: out.color *= 0.75?

These are produced in the third pass:

4. RenderTarget3 (D3DFMT_A16B16G16R16F): Blur the second pass scene horizontally to a new render target

5. RenderTarget4 (D3DFMT_A16B16G16R16F): Blur second pass scene vertically to a new render target

Fourth pass:

6. Render to backbuffer: Take the first result (RenderTarget0) and draw it as a quad together with the blurred targets; this means three samples, one from each. Apply tone-mapping to the result.

On the "third pass" above, can you mix this into the same rendertarget? Or do I split each blur pass into two passes?

You say I need to be careful to not reach 1.0f with the former, but what about reaching 65504 (max half precision) with the latter?

Yeah, that's just as bad, but it's a lot harder to do.

And say my brightest pixel is 1000.0f, won't then all values between 1000.0f and 65504 be wasted?

Yes, this is true whether you use integer-fractional or floating-point formats.
However, floating-point formats have logarithmic precision: as numbers get bigger, they become less precise. So you can give yourself a bit of extra headroom without sacrificing precision in the common case.
e.g. say your scene brightness is usually from 0-1k, but in very rare cases some pixels might have values of 60k. With ABGR16F, the usual numbers have great precision, and the rare numbers are still valid but slightly less accurate (you might get some colour banding if you're unlucky, but that is much better than clamping).
With ABGR16 (integer-fractional), you'd have to either set your scale factor to 1000 and suffer from clamping issues, or you'd set it to 60000 and suffer reduced precision across both the 'usual' brightnesses and the 'rare' brightnesses.
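To put rough numbers on "slightly less accurate" (assuming standard IEEE half precision, i.e. a 10-bit mantissa): around a value of 1.0 the representable steps are about 0.001, around 1000 they are 0.5, and around 60000 they are 32. With the integer format scaled so that 1.0 maps to 60000, every brightness in the scene is stored in steps of roughly 0.9, which hurts the common low-brightness values much more.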

Depending on your game, one may be better than the other, but I'd recommend 16F just for simplicity.
On my last game, we actually used 10bit integer-fractional on one platform (with a scale factor), and 16bit float on other platforms, depending on which one was more optimal. For us, 16F "just worked", whereas 10-bit integer was a headache that we had to constantly tweak per time-of-day to get acceptable results.

I thought that HDR was meant to utilize the whole range, e.g. 0-1 with D3DFMT_A16B16G16R16 in increments of 1/65536.

The point is to not have a limit to your brightness values (i.e. the 0 to infinity thing). Infinity doesn't work in practice, so the point is to have a large limit ;)

If you're doing a physical simulation, then in the general case it's hard to predict what your maximum brightness value is. e.g. direct sunlight might be 1000 times brighter than fluorescent lighting, and an object 1cm from a light-bulb might be 1000 times brighter than an object 1m from it. So for a general solution, full 32-bit floating point would be preferable, but it's quite expensive to use F32 formats!

These are produced in the first pass:
RenderTarget0 (D3DFMT_A16B16G16R16F): Render scene with light values between 0.0f and 65504? (or just ignore 65504 and possibly output a value like 2344534756348.0f)
RenderTarget1 (D3DFMT_A16B16G16R16F): Render the same scene but output log luminance for the pixel values instead, and mipmap it.

Get the K in application by reading lowest mip surface?

These are two passes.

After drawing the HDR scene to RT0, you then render to RT1 using RT0 as input. The shader reads RT0 and uses a luminance function to find the luminance of each pixel.
RT1 is monochromatic, so you could use the D3DFMT_R16F format for it. In my last game, we actually compressed the data down to just 8-bits, but a half-float format is easier.
Instead of reading the K value back in the application, you can keep everything on the GPU-side. If RT1 is an auto mip-mapped texture, you can later use tex2Dlod(rt1, float4(0.5,0.5,0,9000)).r to read the 1x1 pixel result in your tone-mapping shader.
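In shader terms that read is only a couple of lines (assuming ps_3_0, where tex2Dlod is usable in pixel shaders, and that rt1 stores log2 luminance as described earlier):

// sample the 1x1 mip of the luminance target; the huge LOD value just clamps to the smallest level
float avgLogLum = tex2Dlod( rt1, float4( 0.5, 0.5, 0, 9000 ) ).r;
float K = exp2( avgLogLum );   // inverse of the log2 written in the luminance pass
                               // (or feed this average into whatever adaptation formula you prefer)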

Each of these is also an individual "pass":
3. Output to RT2 (lower resolution), read RT0 as input. The shader reads RT0 and multiplies by some fraction to darken it (or does (input-threshold)*fraction if you want to have a sharper cut-off to the effect). Now you've got your "downsampled" and darkened version of the scene.
You'll want to make sure you can tweak this fraction value (and possibly threshold value) at run-time, to fine-tune the effect. You probably want to use quite small numbers, like 0.1, but that of course depends on the scene/lighting/game.
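A minimal sketch of that pass (made-up uniform names; you'd draw a full-screen quad into the low-res target with it):

sampler2D rt0;            // full-res HDR scene
float bloomThreshold;     // e.g. 1.0 -- tweak at run-time
float bloomFraction;      // e.g. 0.1 -- tweak at run-time

float4 BrightPassPS( float2 uv : TEXCOORD0 ) : COLOR0
{
    float3 hdr = tex2D( rt0, uv ).rgb;                              // bilinear filtering does the downsample
    float3 darkened = max( hdr - bloomThreshold, 0.0 ) * bloomFraction;
    return float4( darkened, 1.0 );
}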

4. Output to RT3 (lower resolution), read RT2 as input several times with different horizontal offsets. Add all the samples together using different weights for each sample. Now you've got a horizontally blurred image.
5. Output to RT2 (RT2's data isn't required any more, so let's recycle that target!), read RT3 as input several times with different vertical offsets, and add/weight as before. Now you've got a vertically blurred version of a horizontally blurred image -- this produces a circularly blurred image!
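For reference, a sketch of the horizontal pass from step 4 (the vertical pass is the same shader with the offset applied to .y; the weights are just a rough 5-tap Gaussian that sums to 1):

sampler2D rt2;        // quarter-res darkened scene
float2 texelSize;     // 1.0 / RT2 dimensions

float4 BlurHorizontalPS( float2 uv : TEXCOORD0 ) : COLOR0
{
    static const float offsets[5] = { -2, -1, 0, 1, 2 };
    static const float weights[5] = { 0.06, 0.24, 0.40, 0.24, 0.06 };

    float3 sum = 0;
    for( int i = 0; i < 5; ++i )
        sum += weights[i] * tex2D( rt2, uv + float2( offsets[i] * texelSize.x, 0 ) ).rgb;
    return float4( sum, 1.0 );
}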

6. Read RT2 and RT0, combine them, and tone-map the result.
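And a sketch of step 6 (names are placeholders again; sampling the small blur target at full resolution gives you the bilinear upscale for free):

sampler2D rt0;   // full-res HDR scene
sampler2D rt2;   // blurred quarter-res bloom
float K;         // exposure / adaptation value, e.g. fetched as shown above

float4 CombinePS( float2 uv : TEXCOORD0 ) : COLOR0
{
    float3 hdr = tex2D( rt0, uv ).rgb + tex2D( rt2, uv ).rgb;   // scene + bloom
    float3 exposed = 1.0 - exp( -hdr * K );                     // the exposure curve from earlier
    return float4( exposed, 1.0 );                              // 0..1, ready for the 8-bit backbuffer
}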

So in all, 6 passes just for HDR, add some more passes for shadowing and we are up to maybe 10+ passes.

Why can't the first two passes be done in one draw call? To me it seems logical to output to two render targets, one color for the HDR scene and one for the luminance. Is this supported in D3D9, having an MRT with one target monochromatic/mip-mapped and the other not? Or will this "optimization" defeat its purpose when the mipmap chain generation takes 4 times longer with four 16F channels instead of one? Will an auto mip-mapped texture generate its chain when the scene is done, or do I need to call GenerateMipSubLevels() before trying to sample it in the second pass?


tex2Dlod(rt1, float4(0.5,0.5,0,9000)).r

I assume this reads the last mip surface in the chain, even though the chain is much shorter.

What are the pros/cons of splitting the blurs into two passes instead of using two render targets? I would assume you save one texture fetch using two RTs, since the center texel fetch is the same for both horizontal and vertical. Won't there also be overhead from having two passes?

You seem to be worried about counting "passes", but all you're really counting here is how many times you change the current render-target, which is just a state-change like any other.

e.g. here's the thumbnail view of about 70 passes from my last game:

[image: thumbnail view of the frame's ~70 render passes]

Why can't the first two passes be done in one draw call? To me it seems logical to output to two render targets, one color for the HDR scene and one for the luminance. Is this supported in D3D9, having an MRT with one target monochromatic/mip-mapped and the other not? Or will this "optimization" defeat its purpose when the mipmap chain generation takes 4 times longer with four 16F channels instead of one?

Yes, that approach would work :) and yes, the mip-map generation will require 4x the bandwidth, but...

You mentioned deferred shading earlier, so --

During your lighting pass, if a pixel is covered by multiple lights, then that pixel is drawn to multiple times (once per light), with additive blending.

With the MRT approach, each pixel write is 16 bytes of data (ABGR16*2). Say we've got 4 lights per pixel on average for a 720p scene; that gives us 56.25MiB of data being written.

If you don't use an MRT, and instead compute luminosity at the end in another "pass", then when drawing the scene you've only got 28.125MiB of bandwidth, and then the luminosity pass reads ABGR16 and writes R16 once per pixel, which is another ~8.8MiB, for a total of ~37MiB of bandwidth.
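Spelling that arithmetic out (assuming 1280x720):

MRT approach:        1280 x 720 x 4 lights x 16 bytes             = 56.25 MiB
Scene only:          1280 x 720 x 4 lights x  8 bytes             = 28.125 MiB
Luminance pass:      1280 x 720 x (8 bytes read + 2 bytes write)  ≈  8.8 MiB
Total without MRT:   28.125 + 8.8                                 ≈ 37 MiB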

Actually... log(light1) + log(light2) != log( light1 + light2 ), so the first approach would give incorrect results for a deferred rendering set-up anyway ;)

Will an auto mip-mapped texture generate its chain when the scene is done, or do I need to call GenerateMipSubLevels() before trying to sample it in the second pass?

Yes, auto-mipped textures automatically generate their chains at some unspecified time after they've been rendered to, but before they're used in texture-sampler slots.

The function to manually generate their chain is just a hint that says "now would be an optimal time to generate your mips" -- you don't have to call it in D3D9 (some other APIs do require you to manually perform mipmap generation).

I assume this reads the last mip surface in the chain, even though the chain is much shorter.

Exactly. If you know how many levels there are in the mip chain, you can use the correct number. I just used 9000 as an in-source-code joke (the mip level "is over 9000!"), and because the code should work for any resolution...

What are the pros/cons of splitting the blurs into two passes instead of using two render targets?

Blurring the original scene horizontally and vertically, like you described earlier, doesn't give the correct result. It will give you a plus-sign shaped blur, instead of a circular blur.

What we're doing is a Gaussian blur, which in its basic form is a single-pass effect (read in the original scene, output a circularly-blurred scene), but in its traditional form it requires us to sample in a large box around each pixel, e.g. 5x5 = 25 samples per pixel.

However, the Gaussian blur has a special property of being "separable", which means that instead of doing it in one 5x5 pass, we can instead do a 5x1 pass, then use that result as the input to a 1x5 pass, and the math works out to give us the exact same final result as the traditional method (but with just 10 samples instead of 25).
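The reason this works is that the 2D Gaussian kernel factors into two 1D kernels (ignoring the normalisation constant):

G(x, y) = exp( -(x^2 + y^2) / (2*s^2) ) = exp( -x^2 / (2*s^2) ) * exp( -y^2 / (2*s^2) ) = g(x) * g(y)

so the weighted 5x5 sum is just a 5-tap horizontal sum of values that have each already been summed vertically with 5 taps.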

The earliest article that I know of on real-time bloom is here, and does a decent job explaining this.
