Mipmapping 3D texture


Hello,

I'm currently trying to mipmap 3D textures in D3D12. The basic idea is to build 2 miplevels at once, since I'm running 4x4x4 threads per group, and to use groupshared memory for some optimizations. The sad part (of the mipmap generation) is that only 8 threads do any work for the 1st mip calculation, and just a single thread for the 2nd mip; all of them are occupied only while filling data into groupshared memory. If you're curious, the compute shader is here: https://pastebin.com/4DYWbsDZ (it was too long to dump it here)

To generate the complete mip chain, I use a loop like:


	// Initial setup
	int todo = mMiplevels - 1;
	int base = 0;
	int dimension = mDimensions;

	// As long as there are some mip-levels to generate
	while (todo != 0)
	{
		// Always generate 2 miplevels at once (as long as we can);
		// if only one level remains, re-generate the two smallest ones
		int mipLevels = 2;
		if (todo == 1)
		{
			todo++;
			base--;
			dimension *= 2;
		}

		// Record what we need to do into cmd list
		context->SetConstants(0, Engine::DWParam(base), Engine::DWParam(mipLevels), Engine::DWParam(1.0f / (float)dimension));
		context->SetDescriptorTable(1, mColorTexture->GetSRV());
		context->SetDescriptorTable(2, mColorTexture->GetUAV(base + 1));
		context->SetDescriptorTable(3, mColorTexture->GetUAV(base + 2));
		context->Dispatch(dimension / 4, dimension / 4, dimension / 4);

		todo -= mipLevels;
		base += mipLevels;
		dimension >>= mipLevels;
	}

After this loop there is a 'Finish' call, which executes the command lists, and of course a barrier, as the volume is about to be used. Note that the volume is generated and used within a single frame.

The question (to those who have most likely already tried something similar) is: can this be done in a better way?

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

I don't know how large your texture is or how many of them you have, but one way to limit the 'less work at the top of the tree' problem would be:

Dispatch: Generate level 1 from 0
Barrier
Dispatch: Generate level 2 from 1
Barrier
Dispatch: Generate level 3 from 2
Barrier
Dispatch: Generate levels 4 to 8 within a single workgroup

The last step saves you 4 dispatches and the 4 barriers between them.
I've got speedups of 4x(!) with similar problems.
I've done three mips at a time in 2D with similar issues - 8x8 threads gather 8x8 pixels from mip 0, 4x4 threads write mip 1, 2x2 write mip 2, 1x1 writes mip 3... At the time I tested it against three separate dispatches and it was significantly faster, so I left it at that...
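For what it's worth, a minimal 2D sketch of that scheme (illustrative resource names and a plain box filter; each active thread writes its result back to its own LDS slot, so reads and writes never overlap between threads):

Texture2D<float4>   srcMip  : register(t0);
RWTexture2D<float4> dstMip1 : register(u0);
RWTexture2D<float4> dstMip2 : register(u1);
RWTexture2D<float4> dstMip3 : register(u2);

groupshared float4 tile[8][8];

[numthreads(8, 8, 1)]
void GenerateMips2D(uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID)
{
	// All 64 threads load one mip-0 texel each into LDS.
	tile[GTid.y][GTid.x] = srcMip[DTid.xy];
	GroupMemoryBarrierWithGroupSync();

	// The 4x4 threads with even x and y average their 2x2 neighbourhood and write mip 1.
	if (((GTid.x | GTid.y) & 1) == 0)
	{
		float4 c = 0.25f * (tile[GTid.y][GTid.x] + tile[GTid.y][GTid.x + 1] +
		                    tile[GTid.y + 1][GTid.x] + tile[GTid.y + 1][GTid.x + 1]);
		dstMip1[DTid.xy / 2] = c;
		tile[GTid.y][GTid.x] = c;
	}
	GroupMemoryBarrierWithGroupSync();

	// The 2x2 threads with x and y divisible by 4 combine mip-1 values at stride 2.
	if (((GTid.x | GTid.y) & 3) == 0)
	{
		float4 c = 0.25f * (tile[GTid.y][GTid.x] + tile[GTid.y][GTid.x + 2] +
		                    tile[GTid.y + 2][GTid.x] + tile[GTid.y + 2][GTid.x + 2]);
		dstMip2[DTid.xy / 4] = c;
		tile[GTid.y][GTid.x] = c;
	}
	GroupMemoryBarrierWithGroupSync();

	// One thread combines the mip-2 values at stride 4 and writes mip 3.
	if ((GTid.x | GTid.y) == 0)
	{
		dstMip3[DTid.xy / 8] = 0.25f * (tile[0][0] + tile[0][4] + tile[4][0] + tile[4][4]);
	}
}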

I guess you could just increase the workload per thread to reduce the number of idle threads.
e.g. 8x8 threads read 16x16 pixels from mip 0, 8x8 threads write 8x8 to mip 1, etc...
Yep, I agree with this. You should probably do more than the two levels at once that you currently do.
The ALU work is nothing, so it does not hurt when threads become idle.
You save further dispatches and barriers, and also read bandwidth.

Thanks for the responses. I'm setting up timer queries around all the dispatches now to measure the changes I'm about to try (there's probably no easier way than to try & fail modifications to this approach). I hope to have results shortly - I'll post them here.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

The ALU work is nothing, so it does not hurt when threads become idle.

This was my experience too. I had another case where I had to mipmap a 64bpp gbuffer for a deferred impostor rendering system. I used the same algorithm to generate three mip-levels at a time, which results in a lot of idle threads.
I managed to speed it up by completely re-packing the gbuffer into 32bpp, which involved adding A LOT of ALU operations for shifting/masking/OR'ing bits around the place, not to mention that I had to implement the 2x2 box filter with ALU instead of using a texture filter... So even with that very ALU-heavy mipmapping filter, the only real cost was memory bandwidth.

Alright, so I've set up timer queries wherever I needed them and profiled the hell out of several variations of the code. First, some notes about the setup:

The application generates a dynamic voxel representation of Crytek's Sponza at 512x512x512 resolution, which is then mipmapped. To verify that my mipmaps are correct, I calculate ambient occlusion using voxel cone tracing. The timestamps are obtained before and after each Dispatch during mipmap generation, and also right before and after the loop that generates the mipmaps (so I can calculate the call overhead).

And here are my test scenarios:

1. Naive generation

Using a 2x2x2 workgroup, I generate each lower miplevel from the one above it. The kernel stores 2x2x2 voxels in groupshared memory and calculates the lower-miplevel voxel using only the 1st thread in the workgroup (a sketch of this kernel follows the timing results below). The results are (I've picked 3 samples of the profiling output):

Dispatch[0] (256 256 256): 31.613440ms
Dispatch[1] (128 128 128): 4.366080ms
Dispatch[2] (64 64 64): 0.542720ms
Dispatch[3] (32 32 32): 0.068000ms
Dispatch[4] (16 16 16): 0.008640ms
Dispatch[5] (8 8 8): 0.002080ms
Dispatch[6] (4 4 4): 0.000480ms
Dispatch[7] (2 2 2): 0.000320ms
Total Time: 36.616640ms
Call overhead: 0.014880ms

Dispatch[0] (256 256 256): 29.836800ms
Dispatch[1] (128 128 128): 3.298880ms
Dispatch[2] (64 64 64): 0.412480ms
Dispatch[3] (32 32 32): 0.052000ms
Dispatch[4] (16 16 16): 0.007040ms
Dispatch[5] (8 8 8): 0.001280ms
Dispatch[6] (4 4 4): 0.000800ms
Dispatch[7] (2 2 2): 0.003520ms
Total Time: 33.616160ms
Call overhead: 0.003360ms

Dispatch[0] (256 256 256): 31.044640ms
Dispatch[1] (128 128 128): 3.807680ms
Dispatch[2] (64 64 64): 0.485600ms
Dispatch[3] (32 32 32): 0.066720ms
Dispatch[4] (16 16 16): 0.006560ms
Dispatch[5] (8 8 8): 0.001280ms
Dispatch[6] (4 4 4): 0.000800ms
Dispatch[7] (2 2 2): 0.004000ms
Total Time: 35.420960ms
Call overhead: 0.003680ms
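For reference, a minimal sketch of this naive kernel (a hypothetical reconstruction with illustrative names; the real filter divides by the number of occupied voxels rather than by 8, as described further down):

Texture3D<float4>   srcLevel : register(t0);
RWTexture3D<float4> dstLevel : register(u0);

groupshared float4 tmp[8];

[numthreads(2, 2, 2)]
void GenerateMipmap1(uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID)
{
	// All 8 threads fetch one source texel into groupshared memory...
	tmp[GI] = srcLevel[DTid];
	GroupMemoryBarrierWithGroupSync();

	// ...then the 1st thread alone reduces them to the destination texel.
	if (GI == 0)
	{
		float4 sum = 0.0f;
		for (uint i = 0; i < 8; i++)
		{
			sum += tmp[i];
		}
		dstLevel[Gid] = sum / 8.0f;
	}
}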

2. Generate 2 levels at once

Using a 4x4x4 workgroup, 2 miplevels are always generated at once. If one last level remains (odd number of mipmaps), the two smallest are re-generated. Groupshared memory is used - for the first level the 1st thread of each 2x2x2 sub-group does the actual mipmapping, for the second the 1st thread of the whole workgroup:

Dispatch[0] (128 128 128): 8.049120ms
Dispatch[1] (32 32 32): 0.125120ms
Dispatch[2] (8 8 8): 0.004320ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 8.180960ms
Call overhead: 0.002080ms

Dispatch[0] (128 128 128): 8.042560ms
Dispatch[1] (32 32 32): 0.125600ms
Dispatch[2] (8 8 8): 0.004160ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 8.175040ms
Call overhead: 0.002400ms

Dispatch[0] (128 128 128): 7.860160ms
Dispatch[1] (32 32 32): 0.123840ms
Dispatch[2] (8 8 8): 0.004000ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 7.990560ms
Call overhead: 0.002240ms

3. Generate 3 levels at once

As there is an improvement, the logical step is to go further and attempt to generate 3 levels at once, right? So I dispatch an 8x8x8 workgroup and generate 3 miplevels at once (the magic constants for masking are starting to be a bit of black magic here). The total number of Dispatch calls is down to 3:

Dispatch[0] (64 64 64): 18.375040ms
Dispatch[1] (8 8 8): 0.035040ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 18.414240ms
Call overhead: 0.003840ms

Dispatch[0] (64 64 64): 16.648960ms
Dispatch[1] (8 8 8): 0.035360ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 16.686240ms
Call overhead: 0.001600ms

Dispatch[0] (64 64 64): 20.219360ms
Dispatch[1] (8 8 8): 0.048960ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 20.270560ms
Call overhead: 0.001920ms

Summary

As you can notice, the timing got worse. I think I have an explanation for this (though I'm not absolutely sure it is the case): in the variant that generates 2 miplevels, I use a 4x4x4 workgroup - each thread first stores its data in groupshared memory, and then values are retrieved from the appropriate locations and stored again (at the thread_id location). As each per-channel array is just 64 floats (256 bytes), these accesses stay close together and are efficient.

For 8x8x8, when generating 3 miplevels at once, each groupshared array has to be 512 entries (2 kB per channel, 8 kB total), so reading and writing it at scattered locations may cause the slowdown.

The current code generates the mipmaps for a 512^3 volume in 8ms on an AMD RX 480 GPU. I still think this is way too much, but maybe it is fast enough!

What next?

Thanks for the advice (any further hints are of course welcome!). I should also note that my mip generation function is not just a sum - it counts the voxels that contain something and uses that value as the divisor - yet it is still just a few operations. I might try to do the sum/counting with a parallel-reduction approach, but that will introduce even more memory barriers, which I suspect will make performance drop (it is worth trying, though).

EDIT: Side note - parallel reduction won't work; adding more memory barriers will further decrease performance. The kernel is obviously memory bound (those 7 additions and 1 division have literally no impact).

Lower resolutions?

Note: just out of curiosity, I took the variant that generates 2 levels at once and ran it on a 256^3 volume. The results are:

Dispatch[0] (64 64 64): 0.988000ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 1.007360ms
Call overhead: 0.002080ms

Dispatch[0] (64 64 64): 0.974880ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 0.994240ms
Call overhead: 0.002080ms

Dispatch[0] (64 64 64): 0.985120ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 1.004640ms
Call overhead: 0.002240ms

That is about 1ms for the whole mip chain (and 8 times faster than the 512^3 volume).

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Thanks for sharing the results, always interesting.

I'm not good at predicting performance on volumes, but I would expect no more than 2 ms in the end.


Using a 2x2x2 workgroup

That's only 8 threads - the other 24 (of a 32-wide warp on NV) or 56 (of a 64-wide wavefront on AMD) will be idle. You always want a workgroup size of at least 64.

I'll use ATI terminology: one wavefront has 64 threads operating in lockstep, and a wavefront cannot be shared between multiple workgroups - AFAIK the API or driver does not fix a mistake like this for you.

You need to process at least 4*4*4 texels per workgroup.


For 8x8x8, when generating 3 miplevels at once, each groupshared array has to be 512 entries (2 kB per channel, 8 kB total)

According to your pastebin you have float4 texels, so that's 8*8*8 * sizeof(float4) = 8192 bytes, right?

For 8^3 threads you join 512/64 = 8 wavefronts into a single workgroup, which runs on one CU with 64 kB of LDS. You want high occupancy - as many workgroups in flight per CU as possible - and at 8000 bytes of LDS per workgroup up to 8 of them still fit, which is totally fine. (You could even store more texels per thread, which only becomes interesting for my suggestion in the first reply.)

LDS is very fast (14 times faster than main memory on GCN), and reading/writing one float4 per thread is a good access pattern.

Interestingly, in OpenCL AMD only allows joining 4 wavefronts at most (256 threads). Maybe there is indeed a higher cost to reading LDS written by another wavefront if you go beyond that, but I don't think this could explain why 3 levels are so much slower.

Can you post your code for the 3 levels?

Edit: doing some bandwidth math:

RX480 = 320 GB/s = ~5 GB per frame at 60 fps.

512^3 * sizeof(float4) = 2 GB just to read the top mip (yikes - that's why I don't like volumes), so 8 ms makes total sense :(

Thanks for the explanation. Here is the code:


cbuffer InputDimensions : register(b0)
{
	uint3 dimensions;
}

cbuffer InputMiplevels : register(b1)
{
	uint srcMiplevel;
	uint miplevels;
	float texelSize;
}

SamplerState srcSampler : register(s0);
Texture3D<float4> srcLevel : register(t0);
RWTexture3D<float4> mipLevel1 : register(u0);
RWTexture3D<float4> mipLevel2 : register(u1);
RWTexture3D<float4> mipLevel3 : register(u2);

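// 8x8x8 workgroup = 512 entries per array; the RGBA channels are split into four separate (SoA) arrays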
groupshared float tmpR[512];
groupshared float tmpG[512];
groupshared float tmpB[512];
groupshared float tmpA[512];

void StoreColor(uint idx, float4 color)
{
	tmpR[idx] = color.r;
	tmpG[idx] = color.g;
	tmpB[idx] = color.b;
	tmpA[idx] = color.a;
}

float4 LoadColor(uint idx)
{
	return float4(tmpR[idx], tmpG[idx], tmpB[idx], tmpA[idx]);
}

float HasVoxel(float4 color)
{
	return color.a > 0.0f ? 1.0f : 0.0f;
}

[numthreads(8, 8, 8)]
void GenerateMipmap3(uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
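	// Phase 0: every thread samples one texel of the source mip and stores it into groupshared memory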
	float4 src[8];
	float3 uvw = (DTid.xyz + 0.5f) * texelSize;
	src[0] = srcLevel.SampleLevel(srcSampler, uvw, (float)srcMiplevel);
	StoreColor(GI, src[0]);
	GroupMemoryBarrierWithGroupSync();

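	// GI = x + y*8 + z*64; 0x49 sets bits 0, 3 and 6, so (GI & 0x49) == 0
	// selects threads with even x, y and z - one per 2x2x2 block. The offsets
	// below address that block's other 7 texels in LDS.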
	if ((GI & 0x49) == 0)
	{
		src[1] = LoadColor(GI + 0x01);
		src[2] = LoadColor(GI + 0x08);
		src[3] = LoadColor(GI + 0x09);
		src[4] = LoadColor(GI + 0x40);
		src[5] = LoadColor(GI + 0x41);
		src[6] = LoadColor(GI + 0x48);
		src[7] = LoadColor(GI + 0x49);

		float div = 0.0f;
		for (int i = 0; i < 8; i++)
		{
			div += HasVoxel(src[i]);
		}

		if (div == 0.0f)
		{
			src[0] = 0.0f;
		}
		else
		{
			src[0] = (src[0] + src[1] + src[2] + src[3] + src[4] + src[5] + src[6] + src[7]) / div;
		}

		// Store value + write into shared memory
		mipLevel1[DTid / 2] = src[0];
		StoreColor(GI, src[0]);
	}

	GroupMemoryBarrierWithGroupSync();

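	// 0xDB sets bits 0, 1, 3, 4, 6 and 7, selecting one thread per 4x4x4
	// block (x, y and z divisible by 4); the offsets step by 2 texels to pick
	// up the mip-1 results the previous phase left in LDS.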
	if ((GI & 0xDB) == 0)
	{
		src[1] = LoadColor(GI + 0x02);
		src[2] = LoadColor(GI + 0x10);
		src[3] = LoadColor(GI + 0x12);
		src[4] = LoadColor(GI + 0x80);
		src[5] = LoadColor(GI + 0x82);
		src[6] = LoadColor(GI + 0x90);
		src[7] = LoadColor(GI + 0x92);

		float div = 0.0f;
		for (int i = 0; i < 8; i++)
		{
			div += HasVoxel(src[i]);
		}

		if (div == 0.0f)
		{
			src[0] = 0.0f;
		}
		else
		{
			src[0] = (src[0] + src[1] + src[2] + src[3] + src[4] + src[5] + src[6] + src[7]) / div;
		}

		mipLevel2[DTid / 4] = src[0];
		StoreColor(GI, src[0]);
	}

	GroupMemoryBarrierWithGroupSync();

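	// Finally, a single thread combines the eight mip-2 results (stride 4 in
	// each axis) into the mip-3 texel.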
	if (GI == 0)
	{
		src[1] = LoadColor(GI + 0x04);
		src[2] = LoadColor(GI + 0x20);
		src[3] = LoadColor(GI + 0x24);
		src[4] = LoadColor(GI + 0x100);
		src[5] = LoadColor(GI + 0x104);
		src[6] = LoadColor(GI + 0x120);
		src[7] = LoadColor(GI + 0x124);

		float div = 0.0f;
		for (int i = 0; i < 8; i++)
		{
			div += HasVoxel(src[i]);
		}

		if (div == 0.0f)
		{
			src[0] = 0.0f;
		}
		else
		{
			src[0] = (src[0] + src[1] + src[2] + src[3] + src[4] + src[5] + src[6] + src[7]) / div;
		}

		mipLevel3[DTid / 8] = src[0];
	}
}

EDIT: As per your calculation, I could technically switch to a smaller type than float4 per voxel - which I might do in the end, along with lowering the resolution of the 3D texture. While this won't solve the problem itself, it will move all the timings into the acceptable frame-time budget I'd like to achieve.

I'm also looking forward to trying to put all the voxels into an octree instead and performing the cone tracing with that. I suspect that octree generation will use less bandwidth than mipmapping a 3D texture.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Why don't you store texels as float4 in LDS? Separating rgba into unique arrays will cause a slowdown due to fragmentation.
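Something like this (a sketch - same 512 entries, just one array):

groupshared float4 tmp[512];

void StoreColor(uint idx, float4 color)
{
	tmp[idx] = color;
}

float4 LoadColor(uint idx)
{
	return tmp[idx];
}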

Reading the rest of the code...

float4 src[8];

You're creating an array in registers, and that can be very slow. Dynamic array indexing is fine for memory, but not for registers.

The compiler could fail to optimize this away. (The AMD compiler is fast and NV is very slow for me in Vulkan, so it probably does fewer optimizations.)

I would not use 8 registers at all - just one, adding the values sequentially. (Although register pressure should not be the issue here.)

8*float4 is already 32 VGPRs, but you want at most 24 for 100% occupancy, which is the most important thing in a bandwidth-limited situation.
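For example, the first reduction step could accumulate into a single register instead of the src[8] array (just a sketch - kOffsets is my illustrative name, and the [unroll] turns the indexing into immediate offsets, so nothing is dynamically indexed):

// LDS offsets of the 7 neighbours within a 2x2x2 block (GI = x + y*8 + z*64)
static const uint kOffsets[7] = { 0x01, 0x08, 0x09, 0x40, 0x41, 0x48, 0x49 };

float4 sum = LoadColor(GI);
float div = HasVoxel(sum);

[unroll]
for (uint i = 0; i < 7; i++)
{
	float4 c = LoadColor(GI + kOffsets[i]);
	sum += c;
	div += HasVoxel(c);
}

float4 result = (div > 0.0f) ? (sum / div) : float4(0.0f, 0.0f, 0.0f, 0.0f);
mipLevel1[DTid / 2] = result;
StoreColor(GI, result);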


if ((GI & 0x49) == 0)

Does this mean that not all threads have work even on the lowest level? They should.

