theagentd

Member Since 15 Nov 2010

#5226781 Performance of drawing vegetation; overdraw, alpha testing... Alpha cutout mo...

Posted by theagentd on 01 May 2015 - 06:57 PM

Hello, everyone.
 
I'm still struggling with getting good vegetation rendering performance. Basically, what I have is this:
Mwu4Q8M.jpg

 

With the right ground texture and lighting, it looks... okay.

 

g1mvKyp.jpg

 

 

Sadly, the draw distance is limited as hell, and because of this we basically need to make the grass extremely boring and featureless so that the fading in the distance isn't noticeable. I want faster, taller, thicker, more distinct grass with a higher render distance, without the performance cost. The current grass takes around 2 milliseconds, half of that being drawing the grass to the G-buffer (including the depth pre-pass) and the other half being the SRAA pass. Basically, the grass is a bunch of flat triangles.

 

What I've learned so far:

 

 - Do a depth prepass if you're not doing MSAA. It saves a SHITLOAD of time since depth-only + alpha-test is really cheap, and GL_EQUAL depth testing in the second G-buffer pass is literally 4 times as fast. This basically doubles my performance, but it isn't usable in the SRAA (MSAA) pass. (There's a sketch of this setup right after this list.)

 

 - The shape of the grass meshes matters a lot. At first we had randomly rotated flat meshes, but these looked like crap and tended to lump together. We switched to 3 intersecting quads rotated 120 degrees from each other, which had a more volumetric and even look, but it looked like crap from above. In the end, we went with http://i.imgur.com/h0LilV0.jpg, which looks good from most angles, especially above.

 

 - The area of the mesh is what matters. Even transparent fragments that get discarded by the alpha test are essentially full-cost fragments. We "optimized" our textures to contain as little transparent cutout and as much grass as possible, to minimize the number of wasted fragments.

 

 - Fading the alpha value of the grass looks like crap with alpha testing, and even worse with alpha-to-coverage. A better solution was to simply slowly sink the grass into the ground, which had less popping and flickering.

 

 - Shadows are completely out of the question. Don't draw shadows for the grass.
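
A minimal sketch of the depth-prepass setup from the first point (LWJGL-style Java; drawGrassDepthOnly() and drawGrassGBuffer() are placeholders for whatever submits the grass geometry, not actual engine code):

// Pass 1: depth-only + alpha test. No color writes, normal depth test and writes.
glColorMask(false, false, false, false);
glDepthMask(true);
glDepthFunc(GL_LESS);
drawGrassDepthOnly();   // cheap fragment shader: sample alpha, discard, write nothing else

// Pass 2: full G-buffer shading, only for the fragments that survived the prepass.
glColorMask(true, true, true, true);
glDepthMask(false);     // depth is already laid down
glDepthFunc(GL_EQUAL);  // only shade fragments whose depth matches the prepass
drawGrassGBuffer();

// Restore default depth state.
glDepthFunc(GL_LESS);
glDepthMask(true);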

 

 

 

With all these tricks, our grass went from 20ms to 2ms at... "acceptable" quality. I still don't like it, but frankly, I'm at a loss as to what to do next. Then I see something like this:

 

ibupedgVMNqjdr.gif

 

And I'm just like "What the hell". They have a significantly longer render distance than us, but don't seem to have any significant performance problems. The grass is also significantly taller, and in Crysis 3 you even walk around with the camera inside it. My guess is that they are rendering those blades as actual meshes, not alpha-tested billboards. I would like to test that out anyway.

 

Are there any tools out there that can generate a triangle mesh from an image with alpha cutouts? Preferably one that allows multiple LODs of the same model to be generated.




#5219985 SLI with an advanced OpenGL game?

Posted by theagentd on 29 March 2015 - 10:36 AM

I think I have the basics figured out. To get SLI working with an OpenGL game, you need to use NVIDIA Inspector.

 

1. Create a new profile for your game.

2. Add the game's executable(s) to the profile.

3. Set "SLI rendering mode" to "SLI_RENDERING_MODE_FORCE_AFR2", hex value 0x00000003. Do NOT touch the "NVIDIA predefined---" values, and there is no need to choose a GPU count.

4. (Optional) Enable the SLI indicator to show the (pretty worthless) SLI scaling indicator, which can at least tell you if SLI is enabled at all.

 

At this point, you can try running your game. Make sure you run the game in true fullscreen, not just a borderless window covering the whole screen, or SLI won't scale (but an empty SLI indicator will still show!). If all goes well, you should see a 90% boost in FPS (assuming you're GPU limited) and the SLI indicator should fill with green. However, many functions can inhibit SLI performance (most notably FBO rendering to textures, mipmap generation, etc.), and in some cases the driver may completely kill your scaling by forcing synchronization between the GPUs, often leading to negative scaling. In this case, there is a special compatibility setting you can set which seems to disable most SLI synchronization and give you proper scaling. If you see no scaling, try these last few steps:

 

5. Click the "Show unknown settings from NVIDIA predefined profiles" button on the menu bar (the icon with two cogwheels and a magnifying glass).

6. Scroll down pretty far until you reach a category called "Unknown".

7. Find the setting called "MULTICHIP_OGL_OPTIONS (0x209746C1)". Change it from the default 0x00000000 to 0x00000002 by typing in the value by hand.

 

Explanation:

 

The MULTICHIP_OGL_OPTIONS setting seems to serve the same function for OpenGL as the SLI compatibility bits do for DirectX games. I tried all of the predefined values in the dropdown list for that setting, but many either gave no scaling or produced graphical artifacts. What I realized was that changing it to a value that is NOT on the list seems to disable the synchronization, regardless of which value is chosen. I was expecting each bit to serve some specific function, but that does not seem to be the case. The hex value may be some kind of hash code or something that alters the driver's behavior. Setting it to anything that isn't predefined (0x00000002 being the first "free" value) seems to disable all synchronization between the GPUs.

 

Sadly, I still haven't figured out the GL_TEXTURE_2D_ARRAY problem. It seems to be a driver problem where the driver does not copy the generated mipmaps to both GPUs when glGenerateMipmap() is called. This may be intended behavior, but the same function certainly works for normal GL_TEXTURE_2D textures.




#5218452 *solved* Too slow for full resolution, want to improve this SSAO code

Posted by theagentd on 23 March 2015 - 07:28 AM

I've decided to write a small summary of the most important optimizations I added to get to this point.

 

 

1. I was reconstructing the eye space position of each SSAO sample using the hardware depth buffer and the inverse projection matrix. Switching to reconstructing the position using a linear depth buffer and a frustum corner vector saves an almost ridiculous number of instructions.

 

The matrix version does

 - 3xMAD (convert coords from [0.0 1.0] to [-1.0 1.0])

 - 12xMAD (matrix multiply)

 - 1xRCP + 3xMUL (W divide)

= 15xMAD + 3xMUL + 1xRCP = 18 instructions plus an RCP, which is even slower

 

The frustum corner version only takes 2xMAD + 5xMUL = 7 instructions and no RCP, saving a huge amount of ALU time. This simple change alone brought my shader from 1.78ms to 0.84ms.
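
Roughly, the frustum corner version looks like this (a sketch rather than my exact shader; it assumes linear depth is stored as eye-space Z divided by the far-plane distance, and frustumCorner is the eye-space position of the top-right corner of the far plane):

uniform vec3 frustumCorner;
uniform sampler2D linearDepthTex;

vec3 reconstructEyePos(vec2 texCoord){
	float linearDepth = texture(linearDepthTex, texCoord).r;            // [0, 1], 1.0 = far plane
	vec2 ndc = texCoord * 2.0 - 1.0;                                    // 2x MAD
	return vec3(frustumCorner.xy * ndc, frustumCorner.z) * linearDepth; // 5x MUL, no RCP
}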

 

 

2. The second biggest bottleneck is cache coherency. This can be solved by mipmapping the linear depth buffer and picking a LOD level for each sample based on the sample offset distance. Basically, as the samples get more and more spread out, we counter the reduced cache coherency by moving to smaller mip levels, bringing cache coherency up again. Visually, the result is identical. I cannot see any difference whatsoever. Mipmapping the depth buffer is fast, generally taking under 0.1ms. When using extremely large sample radii, this technique brought my SSAO shader from over 20ms down to 0.84ms constant, regardless of sample radius, well worth the cost of generating the mipmaps.
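
A sketch of the LOD selection (the 8.0 below is an arbitrary tuning constant, not necessarily the value I use): the further a sample is from the center pixel, the smaller the mip level it reads, so neighbouring pixels keep hitting the same cache lines.

uniform sampler2D linearDepthTex;   // the mipmapped linear depth buffer
uniform vec2 screenSize;

float sampleLinearDepth(vec2 centerTexCoord, vec2 sampleOffset){
	float offsetPixels = length(sampleOffset * screenSize);
	float lod = max(0.0, log2(offsetPixels / 8.0));   // drop one mip level each time the offset doubles
	return textureLod(linearDepthTex, centerTexCoord + sampleOffset, lod).r;
}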

 

 

3. Keep the blur simple. My blur was both depth and normal-aware. I've removed the normal awareness as the performance cost was not worth the minor improvement it achieved. Secondly, make sure you're only doing one texture sample per blur sample. I had one GL_R8 texture for the SSAO value from the SSAO pass and a GL_R32F texture for depth, and the shader was completely bottlenecked by the number of texture samples. I changed the texture format of the SSAO result texture to GL_RG16F and packed the SSAO value in the red channel and the depth in the green channel. The blur shader then only had to do one texture sample to get both the SSAO value and the depth. At the end, it outputs both the blurred value and the unmodified center depth for the next blur pass. This almost doubled the blur performance, although writing the extra depth value has a small amount of overhead.
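
A sketch of the packed single-fetch blur (placeholder names, and the weighting is simplified compared to my actual shader): one fetch returns both the SSAO value and the depth, and the unmodified center depth is written out again for the next pass.

in vec2 texCoord;                   // from the fullscreen-quad vertex shader
uniform sampler2D ssaoDepthTex;     // GL_RG16F: R = SSAO, G = linear depth
uniform vec2 blurStep;              // one texel along X for the first pass, along Y for the second
uniform float depthRejectionScale;  // how aggressively to reject samples across depth edges
out vec2 fragColor;                 // R = blurred SSAO, G = unmodified center depth

void main(){
	vec2 center = texture(ssaoDepthTex, texCoord).rg;
	float totalAO = center.r;
	float totalWeight = 1.0;
	for(int i = -4; i <= 4; i++){
		if(i == 0) continue;
		vec2 s = texture(ssaoDepthTex, texCoord + blurStep * float(i)).rg;
		float w = max(0.0, 1.0 - abs(s.g - center.g) * depthRejectionScale); // depth-aware weight
		totalAO += s.r * w;
		totalWeight += w;
	}
	fragColor = vec2(totalAO / totalWeight, center.g);
}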

 

 

 

Here are my benchmark results for 16 samples with two 9x9 blur passes applied.

 

BEFORE (best case scenario):

 
                engine.post.SSAORenderer : 3.424ms
                    Downsample/pack buffers : 0.323ms
                    Clear buffers : 0.112ms
                    Render SSAO : 1.783ms
                    Blur : 1.201ms
 
The "Render SSAO" pass would skyrocket to over 20ms when the sample radius got over ~75 pixels.
 
 
AFTER:
 
                engine.post.SSAORenderer : 1.498ms
                    Generate depth buffer mipmaps : 0.09ms
                    Render SSAO : 0.826ms
                    Blur : 0.578ms

 

 

Improvement results:

 - Precomputation: 0.435 --> 0.090 = 4.83x improvement

 - SSAO pass: 1.783 --> 0.826 = 2.16x improvement (best case scenario for the old algorithm, in practice closer to 5x to 30x improvement)

 - Blur: 1.201 --> 0.578 = 2.08x improvement

 - Total: 3.424 --> 1.498 = 2.29x improvement

 

Quality-wise, the new algorithm looks identical, except the improved cache locality allows for much larger sample radii, which allows for higher quality without having to resort to clamping the sampling radius or other hacks.

 

 

The only thing left to investigate now is compute shaders, which isn't something I can prioritize since our engine must run on OGL3 hardware.

 

 

Again, thanks everyone! I hope that someone finds this useful.




#5218313 *solved* Too slow for full resolution, want to improve this SSAO code

Posted by theagentd on 22 March 2015 - 02:23 PM

Now we just need some screenshots.

Ah, of course. Here you go. =3

 

http://screenshotcomparison.com/comparison/117813/picture:0 (Don't mind the FPS counter on these two.)

http://screenshotcomparison.com/comparison/117813/picture:1 (More representative FPS.)




#5218284 *solved* Too slow for full resolution, want to improve this SSAO code

Posted by theagentd on 22 March 2015 - 12:07 PM

Great success!

 

I've written a small sample location generator that I can use to generate sample locations for any sample count. I have good sample distributions for 8, 16, 24 and 32 samples. 

 

I added the simple one-line mipmap generation code. I don't generate that many mipmaps, but the generation clocks in at around 0.09 - 0.10ms. Pretty awesomely, the SSAO pass now runs in constant time regardless of sample radius. With the maximum sample radius set to 1000 pixels I get the following results when I stuff the camera into some grass:

 

32 samples, 1000 pixel sample radius, GL_R32F depth buffer, with random sample locations:

 

Old:

                engine.post.SSAORenderer : 25.532ms
                    Render SSAO : 24.705ms
                    Blur : 0.825ms

 

New:

                engine.post.SSAORenderer2 : 2.325ms
                    Generate depth buffer mipmaps : 0.094ms
                    Render SSAO : 1.401ms
                    Blur : 0.826ms

 

 

Best of all, the 1.4ms performance at 32 samples is constant regardless of sample radius. Even more amazing, the image quality is identical. I can't even see any difference at all, and even if I can spot some difference by flipping between the old and the new algorithm, it doesn't even look worse, just noisy in a different way. At 24 samples and 2 blur passes, I get pretty good quality and 2ms performance at 1920x1080. I will most likely limit the sample radius slightly, simply to avoid artifacts when the samples end up outside the screen, since I don't have a "guard band" that provides information outside the screen. I've opted not to go with the 2x2 bilateral blur they do in the shader using dFdx/dFdy, as it caused block artifacts on my vegetation.

 

Simply brute-forcing it all does not seem to be very feasible. One blur pass costs about the same as 8 additional samples, and having at least one (preferably two) blur passes improves quality a lot.

 

@kalle_h

The only code I used from the SAO paper was the depth buffer mipmap generation shader; for the SSAO code I simply plugged in the LOD level calculation and switched from texture() to textureLod() when sampling. Since I already use normalized texture coordinates, I didn't need to change anything else or mess with integer texture coordinates, so the ALU cost of the new version is barely affected.
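
For anyone curious, the minification pass in the SAO paper looks roughly like this (written from memory of the paper as a sketch, not my exact shader): each texel of the new mip level takes one of the four source texels on a rotated grid instead of averaging them.

#version 330

uniform sampler2D linearDepthTex;
uniform int previousMip;   // the mip level being read; the FBO renders into previousMip + 1
out float minifiedDepth;

void main(){
	ivec2 p = ivec2(gl_FragCoord.xy);
	ivec2 src = clamp(p * 2 + ivec2(p.y & 1, p.x & 1),
	                  ivec2(0), textureSize(linearDepthTex, previousMip) - ivec2(1));
	minifiedDepth = texelFetch(linearDepthTex, src, previousMip).r;
}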

 

I'm pretty satisfied with the current results, so I think I'll just go with this. I might tweak the radius and fall-off function or so, but it'll mostly be aesthetic tweaks from now on.




#5205373 Smart compute shader box blur is slower?!

Posted by theagentd on 19 January 2015 - 02:05 PM

Hey!

So I decided to try some fun little compute shader experiments and see if I could make a simple box filter faster by using shared memory. I now have 4 short compute shaders that do the same thing using different techniques, but the performance I'm getting is quite baffling. The visual result of all 4 methods is exactly the same. Varying the work group and filter sizes affects performance of course, but the relative performance between the different techniques is pretty much constant.

1. The first implementation simply gathers all samples using imageLoad() and was intended to be some kind of reference implementation.

#version 430

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	
	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		total += imageLoad(inputImg, pixelPos + ivec2(i, 0)).rgb;
	}
	
	total /= FILTER_RADIUS*2+1;
	
	imageStore(outputImg, pixelPos, vec4(total, 1));
}

2. The second one is identical, except it reads from a texture using texelFetch() instead of using imageLoad() to take advantage of the texture cache.
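
For reference, version 2 looks roughly like this (a sketch; it's the same loop as version 1, with the input bound as a regular texture and read through texelFetch() so it goes through the texture cache):

#version 430

layout (binding = 0) uniform sampler2D inputTex;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	
	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		total += texelFetch(inputTex, pixelPos + ivec2(i, 0), 0).rgb;
	}
	
	total /= FILTER_RADIUS*2+1;
	
	imageStore(outputImg, pixelPos, vec4(total, 1));
}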

 

3. After that, I implemented a more advanced version based on http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Efficient%20Compute%20Shader%20Programming.pps which caches the values in shared memory.

#version 430

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

#define CACHE_SIZE (WORK_GROUP_SIZE+FILTER_RADIUS*2)
shared vec3[CACHE_SIZE] cache;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	int localIndex = int(gl_LocalInvocationID.x);
	
	for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
		cache[i] = imageLoad(inputImg, pixelPos + ivec2(i-localIndex - FILTER_RADIUS, 0)).rgb;
	}
	
	barrier();
	
	vec3 total = vec3(0);
	for(int i = 0; i <= FILTER_RADIUS*2; i++){
		total += cache[localIndex + i];
	}
	total /= FILTER_RADIUS*2+1;
	imageStore(outputImg, pixelPos, vec4(total, 1));
}

4. The last one is exactly the same as above, but just like the 2nd one uses texelFetch() instead of imageLoad().

 

 

 

The performance of the four techniques using a 256x1x1 sized work-group and a 64 radius filter on my GTX 770 at 1920x1080 is:

 

1) 38 FPS.

2) 414 FPS (!)

3) 223 FPS

4) 234 FPS

 

As you can see, manually caching values isn't helping at all. Changing the cache array to a vec4[] to improve the memory layout only marginally improved performance (230 --> 240 FPS or so). Frankly, I'm at a loss. Is texture memory simply so fast and so well cached that using shared memory for performance has become redundant? Am I doing something clearly wrong?




#5185398 Particle alpha blending

Posted by theagentd on 06 October 2014 - 04:39 PM

You can do blending in the fullscreen pass as well, but you need to change the blend function you use when rendering the particles.

 

The problem is that with the most common blend mode (glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)), you accidentally square the source alpha, resulting in incorrect destination alpha. This can be avoided by using

 

glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA);

 

instead. This should give the resulting FBO correct alpha values.

 

EDIT: Also, in the fullscreen copy pass, your pixels essentially have premultiplied color, so glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) is the correct one to use.
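
Putting the two blend states together (LWJGL-style sketch; drawParticles() and drawFullscreenQuad() are placeholders):

// Rendering particles into the offscreen FBO: blend color as usual, but accumulate
// alpha as ONE / ONE_MINUS_SRC_ALPHA so the destination alpha ends up correct.
glEnable(GL_BLEND);
glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
drawParticles();

// Compositing the FBO over the scene in the fullscreen pass: the FBO's colors are
// effectively premultiplied by alpha at this point.
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
drawFullscreenQuad();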




#5185386 Particle alpha blending

Posted by theagentd on 06 October 2014 - 03:56 PM

I don't think you're actually getting dark halos; I think it's kind of an optical illusion, caused by the change in the light's color gradient when it reaches the edge. Try squaring the alpha value like this:

float alpha = 1.0f - clamp(length(position), 0.0f, 1.0f);
frag_color = vec4(1.0f, 1.0f, 1.0f, alpha*alpha);

See if it looks more like you expect.




#5182092 Compute shader runs more than once?!

Posted by theagentd on 22 September 2014 - 07:45 AM

Hello.

 

I'm having problems with a compute shader. It seems like the compute shader is randomly run more than once, which screws up my test shader. Note that I'm using Java, so the syntax of some commands (glMapBufferRange(), for example) is slightly different.

 
 
 
 

 

I have a persistently mapped coherent buffer which I use for uploads and downloads:

		buffer = glGenBuffers();
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);

The buffer length is 16 and attribute size is 4, to fit 16 integers.

 

 

Each frame, the buffer is initialized to all 0:

		//Reset persistent buffer to 0
		int total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = 0;
			mappedBuffer.putInt(v);
			total += v;
		}
		System.out.println("Before: " + total); //prints 0
		mappedBuffer.clear(); //Resets the Java ByteBuffer wrapper around the pointer

I then run my compute shader:

		//Add 1 to first 8 values in buffer.
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);

I wait for the GPU to finish running the shader.

		//Wait for the GPU to finish running the compute shader
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		
		glFinish(); //Should not be needed, but there just in case for now.

And finally I read back the data:

		//Read back result from persistent buffer
		System.out.println("Result:");
		total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = mappedBuffer.getInt();
			total += v;
			System.out.println(v); //Print value
		}
		System.out.println("After: " + total);
		mappedBuffer.clear(); //Reset Java wrapper around pointer

And here's my compute shader:

#version 430

layout (binding = 0, rgba16f) uniform image2D img;

layout(std430, binding = 0) buffer Data{

	int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	//int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
	
	if(offset < 8){
		//dataBuffer.data[offset]++;
		atomicAdd(dataBuffer.data[offset], 1);
	}
}

 
 
Summary:
 - I have a persistently mapped coherent buffer which I try to update using a compute shader.
 - I initialize this 16-int buffer to all zeroes.
 - I call the compute shader with 1x1x1 work groups = 1 work group, and each work group has a work group size of 16x1x1, i.e. a single line of 16 invocations.
 - The shader increments the first 8 elements of the buffer by 1.
 - I correctly wait for the results and everything, but the compute shader randomly seems to be running twice.
 - I read back and print the result from the buffer.

 

 

 

The result is a buffer which 99% of the time contains the value 2 instead of 1!!!

 

 

Before: 0

Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

 

This randomly occurs regardless of whether the shader uses atomicAdd() or not. It seems like the compute shader is actually run twice for each element in the buffer instead of once, but I see no possible way this could happen. What is going on?!




#5159233 Temporal Subpixel Reconstruction Antialiasing

Posted by theagentd on 09 June 2014 - 05:25 AM

Hello!

 

I came up with a way of extending Subpixel Reconstruction Antialiasing with a temporal component to improve subpixel accuracy and achieve antialiasing of shading as well, while also completely preventing ghosting.

 

Here's a comparison of no antialiasing, SRAA and TSRAA to help catch your interest. =P

 

1x+shading+zoom.png   8x+SRAA+zoom.png   8x+TSRAA+zoom.png

 

Here's a link to the article I wrote about it.




#5155317 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 08:18 PM

Are you sure that you're not simply prematurely optimizing? What exactly does your situation look like? Have you identified the bottleneck?

 

Slightly off-topic: I was inspired by your post and decided to try out glMultiDrawElementsIndirect() since I identified a part in my engine where I simply called glDrawElementsInstancedBaseVertex() in a loop. This was for shadow rendering, so no texture switches were required. Depending on how many types of tiles that were visible, around 20 draw calls were issued in a row, which I replaced with a single glMultiDrawElementsIndirect() call instead. That left my code with 3 different modes, depending on OpenGL support.

 

 

OGL3: Although all the instance data for all draw calls is packed into the same VBO, the vertex attribute pointer needs to be updated before each draw call so that it reads the correct subset of instances from that buffer.

			glVertexAttribPointer(instancePositionLocation, 3, GL_FLOAT, false, 0, baseInstance * 12);
			glDrawElementsInstancedBaseVertex(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex);

ARB_base_instance: If ARB_base_instance is supported, I can instead simply pass in a base instance instead of modifying the instance data pointer, removing the last set of state change from the mesh rendering loop:

glDrawElementsInstancedBaseVertexBaseInstance(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex, baseInstance);

ARB_multi_draw_indirect: If ARB_multi_draw_indirect is supported, I can pack together the above data into an array (an IntBuffer in my case since I'm using Java, hence the weird code), and draw them all with a single draw call:

//In the mesh "rendering" loop
multiDrawBuffer.put(numIndices).put(numInstances).put(baseIndex).put(baseVertex).put(baseInstance);
multiDrawCount++;

//After the loop:
ARBMultiDrawIndirect.glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, multiDrawBuffer, multiDrawCount, 0);
multiDrawBuffer.clear();
multiDrawCount = 0;

 

 

Performance:
OGL3: 56 FPS

ARB_base_instance: 56 FPS (seems like the overhead of glVertexAttribPointer() is extremely low)

ARB_multi_draw_indirect: 62 FPS

 

The scene used was a purposely CPU-intensive scene with 1944 shadow maps being rendered (extremely low resolution, and most simply had no shadow casters that passed frustum culling). The resolution was intentionally kept very low and the GPU load was at around 69-71%. My Java code was NOT the bottleneck; my OpenGL commands take approximately 8.5 ms to execute, and then an additional ~8 ms is spent blocking on buffer swap (= waiting for the driver to complete the queued commands, e.g. C code (or something) in the driver). My conclusion is that glMultiDrawElementsIndirect() effectively reduced the load on the driver thread significantly, even when batching together just 10-20 draw calls into each glMultiDrawElementsIndirect() command.




#5155294 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 04:59 PM

You won't gain anything by simply replacing each glDrawElements() call with a glMultiDrawElementsIndirect(). The whole point of glMultiDrawElementsIndirect() is to allow you to upload everything you need for all your draw calls to the GPU (using uniform buffers, texture buffers, bindless textures, sparse textures, etc) and then replace ALL your glDrawElements() calls with a single glMultiDrawElementsIndirect() call. As far as I know, glMultiDrawElementsIndirect() is not faster than glDrawElements() when simply used as a replacement for the latter.

 

I strongly recommend you take a look at this presentation http://www.slideshare.net/CassEveritt/beyond-porting which explains really well both the problems and how to solve them.




#5146973 where to start Physical based shading ?

Posted by theagentd on 14 April 2014 - 02:11 PM

Holy shit! Enabling the Toksvig map in this test application makes it look extremely good during magnification! Something that's always bothered me was how smooth surfaces far away looked, but with a Toksvig map it looks almost exactly as I'd expect during minification! The shader there seems to be tuned for Blinn-Phong, which uses a specular exponent instead of roughness. How do I do this for Cook-Torrance and roughness?

 

EDIT: I NEED THIS. I've been staring at this for so long now.




#5145540 where to start Physical based shading ?

Posted by theagentd on 08 April 2014 - 09:04 PM

Let me see if I have understood this correctly. To implement Cook-Torrance I need to:

 

1. Modify my G-buffer to store specular intensity (AKA ref_at_norm_incidence in the article) and a roughness value.

2. Normalize the function by multiplying the diffuse term by (1 - (specular intensity AKA ref_at_norm_incidence)).

3. Bathe in the glory of physical based shading.

 

My bet is that this is 100x easier to tweak to realistic results compared to my current lighting.
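
For reference, a compact sketch of one common Cook-Torrance formulation (GGX distribution, Schlick Fresnel, Schlick-style geometry term), written against the two G-buffer values from step 1; the article's exact D and G terms may differ:

// F0 = specular intensity at normal incidence (ref_at_norm_incidence), r = roughness.
vec3 shade(vec3 albedo, float F0, float r, vec3 N, vec3 L, vec3 V){
	vec3 H = normalize(L + V);
	float NdotL = max(dot(N, L), 0.0);
	float NdotV = max(dot(N, V), 1e-4);
	float NdotH = max(dot(N, H), 0.0);
	float VdotH = max(dot(V, H), 0.0);

	float a  = r * r;                                   // perceptual roughness -> alpha
	float a2 = a * a;
	float d  = NdotH * NdotH * (a2 - 1.0) + 1.0;
	float D  = a2 / (3.14159265 * d * d);               // GGX normal distribution

	float F  = F0 + (1.0 - F0) * pow(1.0 - VdotH, 5.0); // Schlick Fresnel

	float k  = a * 0.5;                                 // Schlick-GGX geometry approximation
	float G  = (NdotL / (NdotL * (1.0 - k) + k)) * (NdotV / (NdotV * (1.0 - k) + k));

	vec3 specular = vec3(D * F * G / max(4.0 * NdotL * NdotV, 1e-4));
	vec3 diffuse  = albedo * (1.0 - F0) / 3.14159265;   // step 2: energy-conserving diffuse
	return (diffuse + specular) * NdotL;                // multiply by light color/intensity outside
}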




#5137031 Average luminance (2x downsample) filter kernel

Posted by theagentd on 06 March 2014 - 06:48 PM

In your current implementation, you've written:

out.vColor += Sample(Luminance, in.vTex0 + float2(vTexel.y, vTexel.y));

 

You're using vTexel.y twice. ^^





