
theagentd

Member Since 15 Nov 2010
Offline Last Active Oct 22 2014 03:06 PM

#5185398 Particle alpha blending

Posted by theagentd on 06 October 2014 - 04:39 PM

You can do blending in the fullscreen pass as well, but you need to change the blend function you use when rendering the particles.

 

The problem is that with the most common blend mode (glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)), you accidentally square the source alpha, resulting in incorrect destination alpha. This can be avoided by using

 

glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA);

 

instead. This should give the resulting FBO correct alpha values.

 

EDIT: Also, in the fullscreen copy pass, your pixels essentially have premultiplied color, so glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) is the correct one to use.
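To see why the standard blend mode squares the alpha, here's a small CPU-side sketch (plain Java, not GL code) of what the hardware blend equation does to the destination alpha channel when a particle with alpha a is drawn onto an FBO whose alpha was cleared to 0:

```java
public class BlendAlphaDemo {
    // glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) applied to alpha:
    // dstA = srcA * srcA + dstA * (1 - srcA)  -> source alpha gets squared
    static float naiveDstAlpha(float srcA, float dstA) {
        return srcA * srcA + dstA * (1.0f - srcA);
    }

    // glBlendFuncSeparate(..., GL_ONE, GL_ONE_MINUS_SRC_ALPHA) for alpha:
    // dstA = srcA * 1 + dstA * (1 - srcA)     -> standard "over" coverage
    static float separateDstAlpha(float srcA, float dstA) {
        return srcA + dstA * (1.0f - srcA);
    }

    public static void main(String[] args) {
        float a = 0.5f;
        System.out.println("naive:    " + naiveDstAlpha(a, 0.0f));    // 0.25
        System.out.println("separate: " + separateDstAlpha(a, 0.0f)); // 0.5
    }
}
```

With the naive function a half-transparent particle leaves alpha 0.25 in the FBO instead of 0.5, which is exactly the error the separate alpha blend factors fix.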




#5185386 Particle alpha blending

Posted by theagentd on 06 October 2014 - 03:56 PM

I don't think you're actually getting dark halos; I think it's kind of an optical illusion, caused by the change in the light's color gradient at its edge. Try squaring the alpha value like this:

float alpha = 1.0f - clamp(length(position), 0.0f, 1.0f);
frag_color = vec4(1.0f, 1.0f, 1.0f, alpha*alpha);

See if it looks more like you expect.
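A quick numerical sketch (plain Java, just illustrating the falloff curves) of why the squared alpha helps: the linear falloff 1 - r still has a slope of about 1 right at the particle edge, so the eye picks up the abrupt gradient change as a dark ring, while (1 - r)^2 eases out to a slope of nearly 0 there:

```java
public class FalloffDemo {
    static float linear(float r)  { return 1.0f - Math.min(Math.max(r, 0.0f), 1.0f); }
    static float squared(float r) { float a = linear(r); return a * a; }

    // forward-difference slope magnitude just inside the edge (r = 1)
    static float edgeSlope(java.util.function.DoubleUnaryOperator f) {
        double h = 1e-3;
        return (float) Math.abs((f.applyAsDouble(1.0) - f.applyAsDouble(1.0 - h)) / h);
    }

    public static void main(String[] args) {
        System.out.println(edgeSlope(r -> linear((float) r)));  // ~1.0
        System.out.println(edgeSlope(r -> squared((float) r))); // ~0.001
    }
}
```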




#5182092 Compute shader runs more than once?!

Posted by theagentd on 22 September 2014 - 07:45 AM

Hello.

 

I'm having problems with a compute shader. It seems like the compute shader is randomly run more than once, which screws up my test shader. Note that I'm using Java, so the syntax of some commands (glMapBufferRange(), for example) is slightly different.

I have a persistently mapped coherent buffer which I use for uploads and downloads:

		buffer = glGenBuffers();
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);

The buffer length is 16 and attribute size is 4, to fit 16 integers.

 

 

Each frame, the buffer is initialized to all 0:

		//Reset persistent buffer to 0
		int total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = 0;
			mappedBuffer.putInt(v);
			total += v;
		}
		System.out.println("Before: " + total); //prints 0
		mappedBuffer.clear(); //Resets the Java ByteBuffer wrapper around the pointer

I then run my compute shader:

		//Add 1 to first 8 values in buffer.
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);

I wait for the GPU to finish running the shader.

		//Wait for the GPU to finish running the compute shader
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		
		glFinish(); //Should not be needed, but there just in case for now.

And finally I read back the data:

		//Read back result from persistent buffer
		System.out.println("Result:");
		total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = mappedBuffer.getInt();
			total += v;
			System.out.println(v); //Print value
		}
		System.out.println("After: " + total);
		mappedBuffer.clear(); //Reset Java wrapper around pointer

And here's my compute shader:

#version 430

layout (binding = 0, rgba16f) uniform image2D img;

layout(std430, binding = 0) buffer Data{

	int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	//int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
	
	if(offset < 8){
		//dataBuffer.data[offset]++;
		atomicAdd(dataBuffer.data[offset], 1);
	}
}

 
 
Summary:
 - I have a persistently mapped coherent buffer which I try to update using a compute shader.
 - I initialize this 16-int buffer to all zeroes.
 - I call the compute shader with 1x1x1 work groups = 1 work group, and each work group has a size of 16x1x1, i.e. a single line of 16 invocations.
 - The shader increments the first 8 elements of the buffer by 1.
 - I correctly wait for the results and everything, but the compute shader randomly seems to be running twice.
 - I read back and print the result from the buffer.
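For reference, here's a CPU re-run (plain Java) of what the dispatch *should* do. One 16x1x1 work group dispatched once means gl_GlobalInvocationID.x covers 0..15 exactly once, so after the atomicAdd the first 8 ints should be 1 and the total should be 8 (not the 2s and 16 the readback sometimes shows):

```java
public class DispatchSim {
    // Simulates glDispatchCompute(groupsX, 1, 1) with local_size_x = localSizeX
    static int[] simulate(int groupsX, int localSizeX) {
        int[] data = new int[16];
        for (int gx = 0; gx < groupsX; gx++) {
            for (int lx = 0; lx < localSizeX; lx++) {
                int offset = gx * localSizeX + lx; // gl_GlobalInvocationID.x
                if (offset < 8) data[offset]++;    // the shader's atomicAdd
            }
        }
        return data;
    }

    public static void main(String[] args) {
        int[] data = simulate(1, 16);
        int total = 0;
        for (int v : data) total += v;
        System.out.println("After: " + total); // 8
    }
}
```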

The result is a buffer which 99% of the time contains the value 2 instead of 1!!!

 

 

Before: 0

Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

 

This randomly occurs regardless of whether the shader uses atomicAdd() or not. It seems like the compute shader actually runs twice for each element in the buffer instead of once, but I see no possible way this could happen. What is going on?!




#5159233 Temporal Subpixel Reconstruction Antialiasing

Posted by theagentd on 09 June 2014 - 05:25 AM

Hello!

 

I came up with a way of extending Subpixel Reconstruction Antialiasing with a temporal component to improve subpixel accuracy and achieve antialiasing of shading as well, while also completely preventing ghosting.

 

Here's a comparison of no antialiasing, SRAA and TSRAA to help catch your interest. =P

 

[Images: 1x+shading+zoom.png, 8x+SRAA+zoom.png, 8x+TSRAA+zoom.png]

 

Here's a link to the article I wrote about it.




#5155317 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 08:18 PM

Are you sure that you're not simply prematurely optimizing? How exactly is your situation looking? Have you identified the bottleneck?

 

Slightly off-topic: I was inspired by your post and decided to try out glMultiDrawElementsIndirect(), since I identified a part in my engine where I simply called glDrawElementsInstancedBaseVertex() in a loop. This was for shadow rendering, so no texture switches were required. Depending on how many types of tiles were visible, around 20 draw calls were issued in a row, which I replaced with a single glMultiDrawElementsIndirect() call instead. That left my code with 3 different modes, depending on OpenGL support.

 

 

OGL3: Although all the instance data for all draw calls is packed into the same VBO, the vertex attribute pointer needs to be updated before each draw call so that it reads the correct subset of instances from that buffer.

			glVertexAttribPointer(instancePositionLocation, 3, GL_FLOAT, false, 0, baseInstance * 12);
			glDrawElementsInstancedBaseVertex(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex);

ARB_base_instance: If ARB_base_instance is supported, I can instead simply pass in a base instance instead of modifying the instance data pointer, removing the last set of state changes from the mesh rendering loop:

glDrawElementsInstancedBaseVertexBaseInstance(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex, baseInstance);

ARB_multi_draw_indirect: If ARB_multi_draw_indirect is supported, I can pack together the above data into an array (an IntBuffer in my case since I'm using Java, hence the weird code), and draw them all with a single draw call:

//In the mesh "rendering" loop
multiDrawBuffer.put(numIndices).put(numInstances).put(baseIndex).put(baseVertex).put(baseInstance);
multiDrawCount++;

//After the loop:
ARBMultiDrawIndirect.glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, multiDrawBuffer, multiDrawCount, 0);
multiDrawBuffer.clear();
multiDrawCount = 0;
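The five ints written per command above follow the standard indirect command layout for glMultiDrawElementsIndirect: { count, instanceCount, firstIndex, baseVertex, baseInstance }, packed back to back (20 bytes per command when stride is 0). A standalone sketch of the packing, with made-up mesh numbers:

```java
import java.nio.IntBuffer;

public class IndirectPacking {
    static final int INTS_PER_COMMAND = 5;

    // Appends one DrawElementsIndirectCommand worth of ints to the buffer
    static void put(IntBuffer buf, int numIndices, int numInstances,
                    int baseIndex, int baseVertex, int baseInstance) {
        buf.put(numIndices).put(numInstances).put(baseIndex)
           .put(baseVertex).put(baseInstance);
    }

    public static void main(String[] args) {
        IntBuffer multiDrawBuffer = IntBuffer.allocate(2 * INTS_PER_COMMAND);
        put(multiDrawBuffer, 36, 10, 0, 0, 0);  // first mesh (made-up numbers)
        put(multiDrawBuffer, 24, 5, 36, 8, 10); // second mesh (made-up numbers)
        int multiDrawCount = multiDrawBuffer.position() / INTS_PER_COMMAND;
        multiDrawBuffer.flip();
        System.out.println("commands: " + multiDrawCount);                    // 2
        System.out.println("2nd baseInstance: " + multiDrawBuffer.get(9));    // 10
    }
}
```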

 

 

Performance:
OGL3: 56 FPS

ARB_base_instance: 56 FPS (seems like the overhead of glVertexAttribPointer() is extremely low)

ARB_multi_draw_indirect: 62 FPS

 

The scene used was a purposely CPU-intensive scene with 1944 shadow maps being rendered (extremely low resolution, and most simply had no shadow casters that passed frustum culling). The resolution was intentionally kept very low, and the GPU load was at around 69-71%. My Java code was NOT the bottleneck; my OpenGL commands take approximately 8.5 ms to execute, and then an additional ~8 ms is spent blocking on buffer swap (= waiting for the driver to complete the queued commands). My conclusion is that glMultiDrawElementsIndirect() significantly reduced the load on the driver thread, even when batching together just 10-20 draw calls into each glMultiDrawElementsIndirect() command.




#5155294 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 04:59 PM

You won't gain anything by simply replacing each glDrawElements() call with a glMultiDrawElementsIndirect(). The whole point of glMultiDrawElementsIndirect() is to allow you to upload everything you need for all your draw calls to the GPU (using uniform buffers, texture buffers, bindless textures, sparse textures, etc) and then replace ALL your glDrawElements() calls with a single glMultiDrawElementsIndirect() call. As far as I know, glMultiDrawElementsIndirect() is not faster than glDrawElements() when simply used as a replacement for the latter.

 

I strongly recommend you take a look at this presentation http://www.slideshare.net/CassEveritt/beyond-porting which explains really well both the problems and how to solve them.




#5146973 where to start Physical based shading ?

Posted by theagentd on 14 April 2014 - 02:11 PM

Holy shit! Enabling the Toksvig map in this test application makes it look extremely good during magnification! Something that's always bothered me was how smooth surfaces far away looked, but with a Toksvig map it almost looks exactly as I'd expect during minification! The shader there seems to be tuned for Blinn-Phong, which uses a specular exponent instead of roughness. How do I do this for Cook-Torrance and roughness?

 

EDIT: I NEED THIS. I've been staring at this for so long now.




#5145540 where to start Physical based shading ?

Posted by theagentd on 08 April 2014 - 09:04 PM

Let me see if I have understood this correctly. To implement Cook-Torrance I need to:

 

1. Modify my G-buffer to store specular intensity (AKA ref_at_norm_incidence in the article) and a roughness value.

2. Normalize the function by multiplying the diffuse term by (1 - (specular intensity AKA ref_at_norm_incidence)).

3. Bathe in the glory of physical based shading.

 

My bet is that this is 100x easier to tweak to realistic results compared to my current lighting.




#5137031 Average luminance (2x downsample) filter kernel

Posted by theagentd on 06 March 2014 - 06:48 PM

In your current implementation, you've written:

out.vColor += Sample(Luminance, in.vTex0 + float2(vTexel.y, vTexel.y));

 

You're using vTexel.y twice. ^^




#5136974 Motion vector calculation problem

Posted by theagentd on 06 March 2014 - 03:48 PM

@osmanb

you might reduce the issue by dividing by abs(w), as most of the problem arises due to the flip of x and y when you divide by something negative. it's still not correct, but far less noticeable and you get away with per-vertex cost.

I tried that after I posted my first post, but it didn't help. I also tried with -abs(w). It seems like the problem is that the motion vector becomes extremely large when a vertex lies very close to (linear) Z=0. Direction isn't really the main problem; it's the length of the vector. I clamp my motion vectors so my motion blur doesn't explode, but the motion vectors are several orders of magnitude too large even for unnoticeable movement.
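The blow-up near Z=0 is easy to reproduce numerically (plain Java sketch, made-up clip-space values): the screen-space position is clip.xy / clip.w, so as a vertex's w approaches zero the projected position, and hence the frame-to-frame delta, grows without bound, whether you divide by w or abs(w):

```java
public class MotionVectorDemo {
    // abs(w) fixes the sign flip for w < 0, but not the magnitude near w = 0
    static float projectedX(float clipX, float clipW) {
        return clipX / Math.abs(clipW);
    }

    public static void main(String[] args) {
        float clipX = 0.5f;
        System.out.println(projectedX(clipX, 1.0f));   // 0.5, reasonable
        System.out.println(projectedX(clipX, 0.001f)); // ~500, vector explodes
    }
}
```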




#5109480 Efficient frustum culling of static terrain and dynamic objects

Posted by theagentd on 15 November 2013 - 09:03 AM

I finally had time to finish my implementation!

 

 

Heavy test

Old culling: 73 FPS

New culling: 89 FPS

 

Extreme test

Old culling: 13 FPS (36 with multithreading)

New culling: 20 FPS (45 with multithreading)

(The multithreading is not optimized for the new culling method yet.)

 

 

 

Old implementation:

1. Update bounds of all objects.

2. For each frustum, test all objects against frustum.

 

New implementation:

1. Update bounds of all objects.

2a. Update the bounds of the octree to perfectly fit all objects based on their bounds.

2b. Clear all entries in the octree. (2a and 2b are done at the same time by traversing the tree.)

3. For each frustum, query the octree for intersecting objects.

 

There are many parameters to tweak, but so far an octree with 4 or 5 levels seems optimal, and it can be updated in around 0.6 milliseconds. Although a tree with 5 levels was overall around 10-20% slower in both updating and culling compared to a 4-level tree, in threaded tests 5 levels was actually slightly faster, which I believe is due to hyperthreading helping hide cache misses when traversing the deeper tree. Performance was actually a bit worse for very simple scenes running at >150 FPS, but in those cases performance was not a problem in the first place. Octrees seem to, like deferred shading, sacrifice performance in simple scenes to improve scaling to complex ones, an excellent trade-off in my opinion.

 

I will profile my octree updating a bit to see if the automatic bounds computing really is worth it, etc. When I optimize my threading code to run other tasks in parallel with the octree updating I expect threaded performance to improve quite a bit there too, and octree updating overhead to have a smaller impact on overall performance.




#5003103 EVSM performance tip!

Posted by theagentd on 21 November 2012 - 09:09 PM

Hello. I have a small tip for anyone using EVSMs. I don't know if it's something completely obvious or so, but here it is anyway.

The papers I've found recommend that, as a shadow map, you render to an MSAA 32-bit float RGBA render target to store the two moments (m1, m1*m1, m2, m2*m2). You also need a depth buffer, since the RGBA buffer can't be used as one. This is incredibly expensive: for 4xMSAA, we get 16+4 bytes per MSAA sample. That's 80MB just for a 4-sample 1024^2 variance map! We also need a resolved variance map, so add another 16MB there. In total: 96MB just for a 1024^2 shadow map. Don't you dare try a resolution of 2048^2...

However, I found that I can reduce this memory footprint a lot. We don't have to calculate the moments until we resolve the MSAA texture! Instead of a 32-bit float RGBA texture + a depth texture, we can get by with only a 32-bit float depth texture. By outputting view-space depth and modifying the depth range, we can just pass it down to gl_FragDepth (in the case of OpenGL) in the shadow rendering shader. When resolving, we simply read the depth samples, calculate the moments and average them together! The result is that we only need 4 bytes per MSAA sample, period. That's 16MB for a 4xMSAA 1024^2 variance map. The resolved variance texture is identical to before, so that's another 16MB. In total: 32MB, which is a LOT better than 96MB.
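The memory figures above can be reproduced with a few lines of arithmetic (a sketch; the byte sizes assume RGBA32F at 16 bytes per pixel and a 32-bit depth format at 4):

```java
public class EvsmMemory {
    static long mib(long bytes) { return bytes / (1024 * 1024); }

    // Standard EVSM: MSAA RGBA32F moments + MSAA depth, plus a resolve target
    static long standardBytes(int res, int samples) {
        long pixels = (long) res * res;
        return pixels * samples * (16 + 4) // MSAA RGBA32F + 32-bit depth
             + pixels * 16;                // resolved RGBA32F variance map
    }

    // Optimized: a single MSAA 32-bit float depth texture, plus the same resolve target
    static long optimizedBytes(int res, int samples) {
        long pixels = (long) res * res;
        return pixels * samples * 4        // MSAA depth only
             + pixels * 16;                // resolved RGBA32F variance map
    }

    public static void main(String[] args) {
        System.out.println(mib(standardBytes(1024, 4)) + " MiB");  // 96 MiB
        System.out.println(mib(optimizedBytes(1024, 4)) + " MiB"); // 32 MiB
    }
}
```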

This not only reduces VRAM usage a lot, it also massively reduces the bandwidth needed for the shadow map rendering and resolve passes. On my GTX 295 (only one GPU used, equal to a GTX 260/275), the performance is a lot better with my optimization. I'm getting 240 FPS using the standard technique (1024^2 + 4xMSAA) and 440 FPS with my new one, which is almost twice as fast. Quality is identical to the normal technique, since everything after resolving the MSAA texture is identical to the standard EVSM pipeline.

I hope someone finds this useful, and if something's unclear feel free to ask!

EDIT: Fun fact: it's not possible to create an 8xMSAA 32-bit float RGBA texture, but it is possible to create an 8xMSAA 32-bit float depth buffer, so my technique works with 8xMSAA while the standard technique does not (at least on my hardware). Even funnier: mine with 8xMSAA is 50% faster than the original with 4xMSAA and uses half as much memory.


#4948098 Rendering to texture/FBO issue

Posted by theagentd on 11 June 2012 - 03:21 AM

I think the rendering is working fine, but you're not clearing the FBO, causing the depth test to fail after the first frame. Make sure that all FBOs are cleared properly, both the shadow map and the shadow post-processing buffer. For example, I see that you call glViewport(), then glClear(), and THEN glBindFramebuffer(). I don't think you wanted to clear a shadow-map-sized part of the screen, right? =S


#4947891 Rendering to texture/FBO issue

Posted by theagentd on 10 June 2012 - 05:52 AM

You don't have a depth buffer in your FBO, so the geometry just appears in the order it's drawn. You need a depth buffer in step 2 too, since you're trying to draw all the geometry again, and you obviously need the closest depth there.

I get the feeling you're trying to do a screen-space blur of the shadow map to get smooth shadows, correct? This won't look very good and will be very inaccurate, especially at steep angles, since it's uniformly blurred. You'll also possibly get light bleeding where you're not supposed to unless you take the depth of neighboring pixels into account. Anyway, it's far from optimal. The easiest way to do soft shadows is PCF, which is basically sampling lots of nearby depth values from the shadow map and checking how many of them pass. This has problems with self-shadowing, though, so you need a high depth bias, which in turn pretty much removes self-shadowing. If you're feeling really adventurous, look up variance shadow maps.
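The PCF idea can be sketched on the CPU (plain Java, hypothetical depth values; a real implementation would do this in the fragment shader with shadow-map lookups): sample a neighborhood of shadow-map depths and return the fraction that pass the depth test, which becomes the soft shadow factor:

```java
public class PcfDemo {
    // shadowMap: row-major depths in [0,1]; (x,y): center texel;
    // receiverDepth: depth of the surface being shaded;
    // bias: offset against self-shadowing; radius: half-width of the kernel.
    static float pcf(float[][] shadowMap, int x, int y,
                     float receiverDepth, float bias, int radius) {
        int passed = 0, total = 0;
        for (int dy = -radius; dy <= radius; dy++) {
            for (int dx = -radius; dx <= radius; dx++) {
                int sx = Math.min(Math.max(x + dx, 0), shadowMap[0].length - 1);
                int sy = Math.min(Math.max(y + dy, 0), shadowMap.length - 1);
                if (receiverDepth - bias <= shadowMap[sy][sx]) passed++;
                total++;
            }
        }
        return (float) passed / total; // lit fraction in [0,1]
    }

    public static void main(String[] args) {
        float[][] map = {
            {0.2f, 0.2f, 0.9f},
            {0.2f, 0.9f, 0.9f},
            {0.9f, 0.9f, 0.9f},
        };
        // Receiver at depth 0.5 sitting on a shadow edge: partially lit (6/9)
        System.out.println(pcf(map, 1, 1, 0.5f, 0.01f, 1));
    }
}
```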


#4905792 Unlimited Detail Technology

Posted by theagentd on 24 January 2012 - 08:48 AM


They are showing that:
a) It can easily be animated
b) It has zero problems with memory size issues
c) It is unlimited

So seems all blaming comments here was just jealous.


Yeah, let me know when they release a white paper.

Also, if anyone is looking for a challenging drinking game, watch that video and take a shot every time you hear the word 'unlimited'.


... drinking gamo reason ...

FTFY



