Subsurface scattering in Vulkan path tracing

68 comments, last by taby 1 month, 3 weeks ago

Vilem Otte said:
Of course you could just dump it either into float4 buffer, or use 3 buffers - one for matrices, another for offsets and another for sizes.

Sometimes it's clear what's better, AoS or SoA.

If we are in a compute shader, and each thread reads from such structs in ordered sequence, like so:

vec2 mOffset = bufferOffsets[x + localThreadIndex];

This SoA layout is faster, because threads read memory sequentially.

By contrast, if we used AoS like so:

vec2 mOffset = structuredBufferShadowTiles[x + localThreadIndex].mOffset;

It's slower, because the stride of the memory accesses becomes larger; the reads are no longer tightly and sequentially packed.

(My example assumes we currently only need mOffset to illustrate the difference.)

I have seen cases where the performance difference between AoS vs. SoA was ten times (!), so memory access patterns really matter, and thus we should design memory layout carefully.
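To make the stride argument concrete, here is a small Python sketch that counts how many cache lines 32 consecutive threads touch when each one reads only an 8-byte mOffset field. All sizes here (record layout, cache line size) are illustrative assumptions, not any particular GPU's figures.

```python
# Sketch: byte addresses touched by 32 consecutive threads when each
# thread reads only the 8-byte mOffset field (vec2) of a tile record.
# All sizes are illustrative assumptions, not a specific GPU's layout.

THREADS = 32
VEC2_BYTES = 8             # mOffset: two 32-bit floats
RECORD_BYTES = 8 + 8 + 64  # assumed AoS record: offset + size + 4x4 matrix

# SoA: a dedicated offsets buffer, so the reads are tightly packed.
soa_addresses = [t * VEC2_BYTES for t in range(THREADS)]

# AoS: mOffset is embedded in a larger struct, so the stride is the
# full record size and most of each fetched cache line is wasted.
aos_addresses = [t * RECORD_BYTES for t in range(THREADS)]

span_soa = soa_addresses[-1] + VEC2_BYTES  # 256 bytes touched
span_aos = aos_addresses[-1] + VEC2_BYTES  # 2488 bytes touched

CACHE_LINE = 128
lines_soa = len({a // CACHE_LINE for a in soa_addresses})
lines_aos = len({a // CACHE_LINE for a in aos_addresses})
print(lines_soa, lines_aos)  # 2 vs. 20 cache lines for the same data
```

With the packed SoA buffer the 32 reads fit into two cache lines; with the assumed 80-byte AoS record the same data spreads across twenty, which is exactly the effect described above.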
I remember GCN was fast for small data types like float, vec2 and vec4; for larger things like a 4x4 matrix the benefit shrinks quickly.

But ofc. this mostly applies to compute shaders in practice, because there we have precise control over the access pattern of all threads.

Assuming ray generation shaders are grouped in tiles, SoA might be a win here too, but ideally we would know and replicate the grouping of threads the GPU does for ray generation shaders precisely, so the access is as sequential as possible over the whole SM / CU.

It also sucks that trying out different layouts is always a lot of work.

That said, even if there is a way that an SSBO can win over local arrays, it may be hard to find, even for an experienced expert. And worse: it may differ across various chips.

Btw, besides AoS vs. SoA I also tried to see a difference from using SSBOs vs. images, but that difference was zero in my case.


Thanks for the input you guys! I really appreciate it.

I'm one step closer to some kind of solution. Here's a screenshot showing the opacity of the ray march, in grayscale.

I think that it goes to show that my code and logic are on the right track.

bool is_inside_mesh(const vec3 location, inout vec3 collision_pos, inout vec3 collision_dir, inout vec3 collision_colour)
{
	// Save the payload, since this probe ray reuses the global rayPayload.
	RayPayload r = rayPayload;
	bool is_outside = false;

	const vec3 direction = RandomUnitVector(prng_state);//vec3(0, 1, 0);

	traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, location, 0.001, direction, 10000.0, 0);

	// For a closed mesh, the point is inside only if the probe ray hits
	// something and that first hit is back-facing (normal roughly aligned
	// with the ray direction). A miss, or a front-facing hit, means outside.
	if(rayPayload.dist == -1.0 || dot(direction, rayPayload.normal) <= 0.0)
		is_outside = true;

	if(rayPayload.dist == -1.0)
	{
		collision_pos = location;// + direction;
	}
	else
	{
		collision_pos = location + direction*rayPayload.dist;
	}

	collision_dir = direction;
	collision_colour = rayPayload.color;

	// Restore the caller's payload before returning.
	rayPayload = r;
	return !is_outside;
}
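The inside/outside logic above has a well-known 2D analogue that may help picture it: cast a ray from the query point and look at what it crosses. Here is a minimal Python sketch using even-odd crossing counting; the shader's backface test is a cheaper variant of the same idea, since from inside a closed mesh the first surface hit is always back-facing.

```python
# A 2D analogue of the inside/outside test above: cast a horizontal
# ray from the query point and count boundary crossings. An odd count
# means the point is inside.

def point_in_polygon(px, py, poly):
    """Even-odd ray casting against a closed polygon (list of (x, y))."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does the ray from (px, py) towards +x cross edge i?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```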


int do_random_walk_until_exits_mesh2(const vec3 velocity, const float velocity_constant, const float sss_constant, const float sss_density, const int max_walk_length, const vec3 pos, inout vec3 final_pos, inout vec3 final_dir, inout float base_colour, inout float base_opacity, const float hue)
{
	final_pos = pos;
	
	base_opacity = 0.0;
	base_colour = 0.0;

	int i = 1;

	vec3 collision_pos = vec3(0, 0, 0);
	vec3 collision_dir = vec3(0, 0, 0);
	vec3 collision_colour = vec3(0, 0, 0);

	// Random walk: keep stepping while the walker remains inside the mesh,
	// mixing a directed drift (velocity) with an isotropic scattering term.
	while(is_inside_mesh(final_pos, collision_pos, collision_dir, collision_colour) && i <= max_walk_length)
	{	
		final_pos += (1.0 - sss_density)*velocity*velocity_constant;
		final_pos += (1.0 - sss_density)*sss_constant*RandomUnitVector(prng_state)*velocity_constant;

		// Project the hit colour onto the hue mask for this wavelength bin.
		const vec3 mask = hsv2rgb(vec3(hue, 1.0, 1.0));
		float colour = collision_colour.r*mask.r + collision_colour.g*mask.g + collision_colour.b*mask.b;

		// Accumulate opacity linearly per step; the commented-out trans
		// factor would give front-to-back compositing instead.
		const float trans = 1.0 - clamp(base_opacity, 0.0, 1.0);
		base_colour += colour;
		base_opacity += 0.01;//*trans;

		i++;
	}

	base_opacity = pow(clamp(base_opacity, 0, 1), 1.0); // exponent 1.0 is a placeholder for tuning

	return i;
}
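A side note on the opacity accumulation in the walk loop above: the code currently adds a flat 0.01 per step and clamps, while the commented-out trans factor would give front-to-back compositing, which converges smoothly towards 1 and never overshoots, making the clamp unnecessary. A small Python sketch of the difference; the 0.01 step size is taken from the shader, everything else is illustrative.

```python
# The two opacity-accumulation schemes: the linear ramp the shader
# uses now vs. the front-to-back compositing the commented-out
# `trans` factor would give.

STEP_ALPHA = 0.01  # per-step opacity, as in the shader

def linear_opacity(steps):
    return min(steps * STEP_ALPHA, 1.0)    # needs the clamp

def composited_opacity(steps):
    opacity = 0.0
    for _ in range(steps):
        trans = 1.0 - opacity              # remaining transmittance
        opacity += STEP_ALPHA * trans      # equals 1 - (1 - a)^n
    return opacity

print(round(linear_opacity(150), 3))       # 1.0 (hard clip at step 100)
print(round(composited_opacity(150), 3))   # 0.779, still approaching 1
```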

JoeJ said:
That said, even if there is a way that an SSBO can win over local arrays, it may be hard to find, even for an experienced expert. And worse: it may differ across various chips.

Not may - will.

Even if you compare constant buffers vs. structured buffers (SSBOs), you're going to see numbers going up and down depending on architecture.

At one point I made decisions which result in non-optimal code everywhere. That costs some performance, but at least it's manageable for me. I don't have enough developers to maintain a path per hardware architecture - and by myself I don't see a point in doing so. Additionally, keeping the code base sane and readable is far superior to small gains in performance.

Which again - will completely change each time a new architecture is released! So there is even less reason to do such a thing!

JoeJ said:
Btw, besides AoS vs. SoA I also tried to see a difference from using SSBOs vs. images, but that difference was zero in my case.

This depends on architecture - new architectures like RDNA (and even GCN) will see ZERO gains from that. Meanwhile, some older architectures like Fermi will see gains from it.

I remember that from my thesis, where I could gain some more performance by pushing something through the texture cache (L2-TEX) instead of the standard cache (L2-L1). I'm not sure whether it was some geometry data or the BVH at the time. It also worked on AMD GPUs back then. It gave me quite a boost in performance - at the cost of code complexity.

Nowadays I don't think any new architecture has a separate texture cache - RDNA doesn't, Ada Lovelace doesn't either. I haven't looked at Intel's architecture, but I somewhat doubt their Xe (or whatever they call it) has one.

JoeJ said:
Sometimes it's clear what's better, AoS or SoA.

It depends on many things - in the mentioned case, you always use the whole structure. Could SoA be faster? Hard to tell; I never benchmarked it - there was no reason to (not a bottleneck).

It's likely that even a completely different approach could work better (you are very likely to use only a small subset of the buffer for the whole thread group). Determining that subset and fetching it into groupshared memory might even be superior … might not. The problem with these things is, as you said, that they need to be benchmarked. And there is no point in doing so when it's not a bottleneck.

JoeJ said:
I have seen cases where the performance difference between AoS vs. SoA was ten times (!), so memory access patterns really matter, and thus we should design memory layout carefully.

It can get even funnier - I don't remember the exact numbers, but a long time back, when I first got my hands on an 8-core CPU, my CPU ray tracer went from a static image every few seconds to interactive just by going AoS → SoA.

And a LOT of SIMD intrinsics.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
Which again - will completely change each time a new architecture is released! So there is even less reason to do such a thing!

Depends. Back then, I often tried alternative ways to do this or that anyway, to find out what is fast and what is not.
I had multiple GPUs around, saw results diverging, and so I kept two code branches, one for AMD and one for NV. Not much extra work, and totally worth it.

It did not seem necessary to have multiple branches within a vendor's chip generations. And interestingly, NV and AMD became more similar over the years. Kepler (and Fermi too) behaved very differently from GCN (much slower in general), but for Pascal the GCN branch was better than the Kepler branch, and performance was fine too. So I concluded there is no more need for multiple branches, which is great.

Thus, currently I would just pick the ‘best’ GPU architecture and optimize for that, assuming other vendors catch up and converge to a similar architecture.

I would not optimize RT for AMD GPUs, for example, because they will move to fixed-function traversal too, if they want to be competitive.

But ofc. console platforms deserve their specific optimizations no matter what, if that's a target.

Vilem Otte said:
Meanwhile some older architecture like Fermi will have gains from that.

Pretty sure I did those tests on Fermi too, actually, but the difference was negligible in my case as well.
I do not remember what exactly I did, but probably BVH and random access.
So we got differing impressions even from working on similar things, which is no surprise.

Vilem Otte said:
It depends on many things - in the mentioned case, you always use whole structure. Could SoA be faster? Hard to tell, I never benchmarked it - there was no reason why (not a bottleneck).

I think the same. So at least one uncertainty less.

Vilem Otte said:
Nowadays I don't think any new architecture has a separate texture cache

It seems GPUs have become much better with non-sequential access in general, so I assume the AoS vs. SoA topic is no longer that important, hopefully.

It surely still makes a big difference in some cases and is worth knowing about, but I do not intend to cause even more uncertainty for people.

I'm not really a fan of low-level optimizations. There is not so much to learn here in the long run, imo.
And memory access is actually a problem the HW industry has to solve. We should not need to make a science out of working around their limitations.

But well, reality differs from such reasonable ideals.
Working on games we often have no choice, and low-level optimizations are needed whether we like it or not.

It's getting better.

Looks like you've added the blur : )

Now I wonder: could you add air scattering as well, so the bunny light stops feeling too sharp? And could you do it so gently that the rest of the scene remains almost as sharp?

Or should this be a matter of bloom, which happens in the eye, so that doing it cheaply in screen space would actually be more correct?

Or would we need to do both, probably?

I'm never really sure about this question.
Many games do lens flare effects, which is related. But it always looks like camera lens flares, caused by lenses and shutter.
I want those irregular diffraction patterns happening in my eyes around the sun instead.

And often I see a kind of glitch that looks like a view through a microscope. I guess that's dust particles on the eye. This probably goes too far. : )

JoeJ said:
Many games do lens flare effects, which is related. But it always looks like camera lens flares, caused by lenses and shutter. I want those irregular diffraction patterns happening in my eyes around the sun instead.

You probably want something like this paper, which I believe was used in Lugaru 2.

In my project I'm currently just doing bloom, but it has to be implemented correctly to look accurate. It's not sufficient to blur once at a single scale; you need to apply multiple blurs of different sizes and add them together to get a bloom result that works for light sources of any size and brightness. I'm not sure the fancy human-eye lens flare calculations are worth doing over proper bloom.

A bit better!

LOL, now I can't turn it off.

@Aressera I remember also reading one (I think from the same uni) about accurate simulation of lens flares.

https://resources.mpi-inf.mpg.de/lensflareRendering/pdf/flare.pdf

I think that's it - but I'm not sure. It focused on generating correct lens flares based on your optical system definition - the paper even shows some optical setups and results.


