Member Since 15 Nov 2010
Offline Last Active Feb 05 2016 06:39 AM

Topics I've Started

[SOLVED] Uniform buffer actually viable?

03 January 2016 - 09:08 AM



I've just implemented a pretty nice batch renderer, but I'm struggling with uniform buffers. I have a great system which only maps a single buffer (built on a lot of experience with doing fast buffer mapping), and I place all my uniform block data in a single uniform buffer. I was assuming that this would be the fastest way to handle uniform changes, but it turns out that uniform buffers have some fatal drawbacks. For example, changing a single variable in a block forces me to allocate a new one with a minimum size of 256 bytes, which is HUGE. I can barely get over 64 bytes right now, so there's a lot of wasted space which seems to inhibit performance a lot, especially for simpler shaders with few uniforms. In almost all cases I change some kind of uniform value between draw calls, and in many cases I end up 


I had this idea that I would split up uniforms into different blocks so that I only had to update the ones that change (view+projection matrices in one block, materials in one block, etc), but as it is now the winning move is to just pack everything into one block so that I don't waste that much space, and to reupload stuff that hasn't changed, to avoid the risk of having to update two smaller blocks with even more padding. It's getting to a point where I think it would be faster to just build a list of glUniform**() calls to do instead of bothering with uniform buffers.


Are uniform buffers just nonviable for real-life usage? Can I work around the offset alignment problem to reduce the padding? Is glUniform() simply superior in most cases and on most drivers?



EDIT: After googling a bit, I want to clarify that my buffer handling is very effective. I place all my uniform data in a single mapped buffer (persistently coherently mapped if possible, otherwise cycling unsynchronized), so there's only a single map operation done per frame. The problem is that the data uploading is simply really slow when the padding is added (can't batch upload it), and the buffers get really big. I think I'm gonna implement some hacky glUniform() calls to compare performance.


Also, this is OpenGL for PC, tested on an Nvidia card.
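To put numbers on the padding cost described above, here is a minimal sketch in Python (plain arithmetic, no OpenGL calls). The 256-byte alignment and the 64-byte block size are the figures from the post; on a real system the alignment would be queried through GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. The helper names align_up and padded_layout are made up for this illustration.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def padded_layout(block_sizes, alignment):
    """Place blocks back to back, each starting at an aligned offset.

    Returns (offsets, total_bytes, wasted_fraction).
    """
    offsets = []
    cursor = 0
    for size in block_sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    total = align_up(cursor, alignment)
    used = sum(block_sizes)
    wasted = (total - used) / total if total else 0.0
    return offsets, total, wasted


# Assumed values taken from the post: four 64-byte blocks, 256-byte alignment.
offsets, total, wasted = padded_layout([64] * 4, 256)
print(offsets, total, wasted)
```

With these assumed numbers, three quarters of the buffer is padding, which is in line with the complaint that small blocks waste most of the uploaded space.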

FBX importing problems

22 September 2015 - 05:35 PM

Hello. I'm trying to convert binary FBX files to my own format, which means extracting joints, vertices and bone bindings/weights from them, but I can't seem to get it right.


1. There is a lot of transformation data in there. PreRotation, Lcl Translation/Rotation/Scaling, GeometricTranslation/Rotation/Scaling, PostRotation... I have no idea which transforms apply to what.


2. Due to the above problems, I'm having huge problems with importing joints/bones. The objects have a number of the above transforms, but in addition each deformer has both a "Transform" and a "TransformLink". None of these match each other. What am I supposed to base my joint transforms (bind pose) on?


3. My vertex positions and normals seem to match what I expect to get without any transformation applied to them, but I've found sources saying that they should be affected by the Lcl and Geometric transforms. With those applied, though, the model is rotated in ways I'm not expecting.


4. My model seems to only have bone bindings for a small subset of the model. Only 176 out of 750 vertices have any bone bindings at all. This may of course be a problem with the model, but it is being animated correctly in Maya, which confuses me even more.


I would really appreciate any help I could get. Thanks!

Slerp over a triangle?

29 June 2015 - 08:25 PM



I have a rather interesting problem. I have a triangle with 3 corners, each holding a unit vector (say a normal). I also have two alpha values, a0 and a1. Together, I calculate a point on the triangle like this:

result = v0 + (v2 - v0) * a0 + (v1 - v0) * a1

Although this correctly produces a new unit vector, the result is skewed by the linear interpolation. What I actually want to do is spherical linear interpolation (slerp) between these 3 vectors based on a0 and a1, but I have no idea how to slerp between more than 2 vectors. I can't really find any information on it either. This is to be done at load time once, so performance is not much of an issue here.
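One possible way to extend slerp to three vectors is to nest two ordinary slerps: first interpolate along the v1-v2 edge, then interpolate from v0 towards that edge point. The Python sketch below follows the weighting of the formula above (a0 weights v2, a1 weights v1). It is only one option, not an established answer: the function names are made up, it assumes the inputs are unit vectors that are not antipodal, and the result is not symmetric in the three corners, so a different nesting order gives a slightly different vector.

```python
import math


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)


def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b.

    Assumes a and b are not antipodal (angle well below 180 degrees).
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly identical vectors: fall back to a normalized lerp.
        return normalize(tuple(x + (y - x) * t for x, y in zip(a, b)))
    s = math.sin(theta)
    wa = math.sin((1.0 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return tuple(wa * x + wb * y for x, y in zip(a, b))


def triangle_slerp(v0, v1, v2, a0, a1):
    """Nested slerp: a0 is the weight towards v2, a1 the weight towards v1."""
    total = a0 + a1
    if total < 1e-12:
        return v0
    # Point on the v1-v2 edge, split according to the ratio of the two weights.
    edge = slerp(v1, v2, a0 / total)
    # Move from v0 towards that edge point by the combined weight.
    return slerp(v0, edge, total)


v0, v1, v2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(triangle_slerp(v0, v1, v2, 0.25, 0.25))
```

Since this only runs once at load time, the extra trigonometry should not matter. If a symmetric result is important, an iterative spherical weighted average would be worth looking into instead.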

Nvidia GLSL shader compiler bug? Massive register usage for unrolled loop.

08 June 2015 - 07:30 AM



I have a shader that compiles to a horrible mess. It's a depth-of-field shader which requires a lot of random samples to be taken around each pixel, but it has excellent quality with enough samples. The problem is that the Nvidia compiler seems to choke and generate extremely bad assembler code for it that uses a huge number of temporary registers. Since the shader does so many texture reads, that limits the ability of the GPU to hold multiple shader invocations in registers at the same time for memory latency hiding, so the performance of the shader tanks as it gets stuck on texture loads (at least this is my theory).

#version 150

#define PI 3.1415926535897932384626433832795

//Modify sample count here, above 204 gives me an unspecified internal compiler exception
#define DOF_SAMPLES 64
vec2 offsets[252] = vec2[252](vec2(0.0, 0.0), vec2(-0.05094843, -0.027418049), vec2(0.050125718, -0.068688385), vec2(0.09910661, 0.036509473), vec2(0.016413035, 0.121766396), vec2(-0.10062634, 0.09444325), vec2(-0.15083583, -0.015798414), vec2(-0.103096426, -0.1277946), vec2(0.009343931, -0.17559767), vec2(0.12595633, -0.13791811), vec2(0.19384219, -0.035696544), vec2(0.18766445, 0.08716766), vec2(0.1124681, 0.18475763), vec2(-0.004027978, 0.22524856), vec2(-0.12445892, 0.19807222), vec2(-0.21423309, 0.113127284), vec2(-0.2502723, -0.005099549), vec2(-0.22513011, -0.12629732), vec2(-0.1462998, -0.22181903), vec2(-0.03268847, -0.27112693), vec2(0.09111107, -0.26504686), vec2(0.20011634, -0.20609483), vec2(0.2741352, -0.106516264), vec2(0.3004192, 0.01483083), vec2(0.27541655, 0.13634056), vec2(0.20423868, 0.2381278), vec2(0.099106446, 0.30425382), vec2(-0.023290485, 0.32530487), vec2(-0.14467937, 0.299009), vec2(-0.24792022, 0.22988579), vec2(-0.31921172, 0.12802827), vec2(-0.3495787, 0.007491847), vec2(-0.3357994, -0.11608094), vec2(-0.28017777, -0.22740729), vec2(-0.19014229, -0.3131003), vec2(-0.076512635, -0.36374018), vec2(0.04749505, -0.37400672), vec2(0.16820504, -0.3432542), vec2(0.27250993, -0.2753727), vec2(0.3498804, -0.17790553), vec2(0.39280573, -0.061176278), vec2(0.3974986, 0.06335222), vec2(0.36392722, 0.1831637), vec2(0.2956421, 0.28734145), vec2(0.1994358, 0.3662899), vec2(0.08417271, 0.41331866), vec2(-0.03999258, 0.42461538), vec2(-0.16186737, 0.39958513), vec2(-0.27175704, 0.3405859), vec2(-0.3603246, 0.252969), vec2(-0.42078885, 0.14401685), vec2(-0.44863185, 0.02257885), vec2(-0.44203544, -0.10178065), vec2(-0.401791, -0.21976513), vec2(-0.33128855, -0.3224197), vec2(-0.23587917, -0.40254584), vec2(-0.12263095, -0.4545548), vec2(2.9601651E-4, -0.47500983), vec2(0.124502406, -0.46272635), vec2(0.24105985, -0.41890258), vec2(0.3426444, -0.34664255), vec2(0.42262557, -0.25086743), vec2(0.47584936, -0.13816687), vec2(0.49925408, -0.015681067), 
vec2(0.49160993, 0.10858902), vec2(0.45357937, 0.22741406), vec2(0.38769093, 0.33335632), vec2(0.29811624, 0.4201616), vec2(0.1903407, 0.48286253), vec2(0.07078616, 0.5180246), vec2(-0.053854316, 0.5238695), vec2(-0.17626365, 0.5002464), vec2(-0.28965878, 0.44875896), vec2(-0.38829878, 0.37214747), vec2(-0.4665091, 0.27498433), vec2(-0.52034074, 0.16267747), vec2(-0.54730046, 0.04076701), vec2(-0.54602325, -0.08385942), vec2(-0.5167307, -0.20529567), vec2(-0.46119738, -0.3169049), vec2(-0.38233075, -0.41343606), vec2(-0.28394252, -0.4903723), vec2(-0.17137805, -0.543774), vec2(-0.04968076, -0.57146436), vec2(0.07505839, -0.57217383), vec2(0.19722003, -0.54598904), vec2(0.31062397, -0.4944529), vec2(0.4106351, -0.41991213), vec2(0.49273396, -0.32579467), vec2(0.55314887, -0.21663895), vec2(0.5893967, -0.09741622), vec2(0.60010576, 0.027145972), vec2(0.58489394, 0.1507767), vec2(0.54447216, 0.2690125), vec2(0.48081723, 0.3763101), vec2(0.39661273, 0.46847865), vec2(0.29568815, 0.5415854), vec2(0.18178858, 0.5930285), vec2(0.060128435, 0.6205564), vec2(-0.06461978, 0.6233034), vec2(-0.18747027, 0.601262), vec2(-0.30358085, 0.55540854), vec2(-0.40820295, 0.48783192), vec2(-0.49784634, 0.4009216), vec2(-0.5688769, 0.29824063), vec2(-0.6186902, 0.1837656), vec2(-0.6455222, 0.061860383), vec2(-0.6484957, -0.06289835), vec2(-0.62763125, -0.18589757), vec2(-0.5838239, -0.3026598), vec2(-0.51859844, -0.4092575), vec2(-0.4345059, -0.5016188), vec2(-0.33465162, -0.5765445), vec2(-0.22264922, -0.63150495), vec2(-0.10247934, -0.6647122), vec2(0.021983659, -0.6751602), vec2(0.14614245, -0.6625285), vec2(0.26578295, -0.6274031), vec2(0.37719408, -0.5709465), vec2(0.47639263, -0.49525908), vec2(0.56038487, -0.4027192), vec2(0.62624264, -0.2966684), vec2(0.6719576, -0.18066056), vec2(0.69624704, -0.058175553), vec2(0.6983263, 0.06685303), vec2(0.6782938, 0.18977612), vec2(0.63671046, 0.307704), vec2(0.57510304, 0.41631395), vec2(0.49554783, 0.51231426), vec2(0.400357, 0.5930643), 
vec2(0.29244864, 0.65609974), vec2(0.1754449, 0.69941944), vec2(0.052591655, 0.72192955), vec2(-0.07208396, 0.722997), vec2(-0.19556342, 0.7026118), vec2(-0.313208, 0.66164815), vec2(-0.42258692, 0.6010672), vec2(-0.5199313, 0.5229809), vec2(-0.602774, 0.42951193), vec2(-0.66860044, 0.3236692), vec2(-0.7159336, 0.20787737), vec2(-0.7431986, 0.0860637), vec2(-0.74982345, -0.03867427), vec2(-0.73572564, -0.16251041), vec2(-0.7012463, -0.28273726), vec2(-0.64756024, -0.3953836), vec2(-0.5760888, -0.49775696), vec2(-0.4887943, -0.58711183), vec2(-0.38836306, -0.6608815), vec2(-0.27691227, -0.7175607), vec2(-0.15805756, -0.75535756), vec2(-0.03424813, -0.773538), vec2(0.09054814, -0.77156425), vec2(0.21345562, -0.7496163), vec2(0.33131352, -0.70830643), vec2(0.4411171, -0.64873886), vec2(0.5398127, -0.572728), vec2(0.6255269, -0.48178065), vec2(0.69582444, -0.37842688), vec2(0.74885577, -0.26563898), vec2(0.7836738, -0.14555883), vec2(0.79928285, -0.021499978), vec2(0.7953974, 0.10310024), vec2(0.7721461, 0.22596616), vec2(0.7302907, 0.3433905), vec2(0.6706387, 0.45328656), vec2(0.59503376, 0.5523862), vec2(0.5048565, 0.63898236), vec2(0.40249664, 0.7107442), vec2(0.29076397, 0.7658893), vec2(0.17159884, 0.8035328), vec2(0.048127197, 0.8226634), vec2(-0.07678943, 0.82290584), vec2(-0.20028447, 0.8043227), vec2(-0.31955224, 0.7674063), vec2(-0.43191117, 0.7130601), vec2(-0.53486264, 0.64257175), vec2(-0.62641716, 0.5572826), vec2(-0.7040037, 0.4596876), vec2(-0.76645505, 0.35137343), vec2(-0.8121836, 0.23506467), vec2(-0.8402563, 0.11334115), vec2(-0.8501355, -0.011126165), vec2(-0.8416843, -0.13563512), vec2(-0.8151627, -0.25751156), vec2(-0.77104247, -0.3745396), vec2(-0.7103987, -0.48384058), vec2(-0.6345899, -0.58313197), vec2(-0.5452804, -0.670368), vec2(-0.44439965, -0.74377894), vec2(-0.33371583, -0.8020702), vec2(-0.2162917, -0.84372354), vec2(-0.09380041, -0.868235), vec2(0.030842794, -0.87501734), vec2(0.15550463, -0.863951), vec2(0.27683, -0.83542097), 
vec2(0.39319742, -0.78989804), vec2(0.50187993, -0.72844803), vec2(0.60106504, -0.6520949), vec2(0.6881655, -0.56294334), vec2(0.76210856, -0.46222004), vec2(0.8211906, -0.35224876), vec2(0.86445415, -0.23481578), vec2(0.89084774, -0.11304182), vec2(0.90013, 0.011521529), vec2(0.8920367, 0.1364757), vec2(0.8668728, 0.2585674), vec2(0.8250396, 0.37631813), vec2(0.7674823, 0.4870859), vec2(0.6950991, 0.58915573), vec2(0.6095126, 0.6802521), vec2(0.5124028, 0.75872827), vec2(0.40564424, 0.8231951), vec2(0.2908514, 0.8726881), vec2(0.17055117, 0.90612817), vec2(0.046985038, 0.9229954), vec2(-0.07801756, 0.9230508), vec2(-0.20173839, 0.90630436), vec2(-0.3223663, 0.87301296), vec2(-0.43729547, 0.8239084), vec2(-0.5444945, 0.7599695), vec2(-0.64209765, 0.68242854), vec2(-0.7287248, 0.59239984), vec2(-0.80278486, 0.4914286), vec2(-0.86252344, 0.38208303), vec2(-0.9074107, 0.2655509), vec2(-0.93659306, 0.14384396), vec2(-0.9494719, 0.019528575), vec2(-0.94592875, -0.10522286), vec2(-0.9260293, -0.2287058), vec2(-0.8902057, -0.34835088), vec2(-0.83917457, -0.4621388), vec2(-0.7736365, -0.5685537), vec2(-0.6949583, -0.66541266), vec2(-0.6041978, -0.7514446), vec2(-0.503229, -0.82490075), vec2(-0.3933879, -0.88483435), vec2(-0.2773674, -0.9299018), vec2(-0.15581436, -0.9598745), vec2(-0.031619634, -0.97397035), vec2(0.09315319, -0.9720673), vec2(0.21691974, -0.9542121), vec2(0.3372007, -0.92078584), vec2(0.4520694, -0.872441), vec2(0.5605, -0.8095382), vec2(0.6595406, -0.7338182), vec2(0.7484103, -0.6460213), vec2(0.8256419, -0.5475068), vec2(0.8895131, -0.4406578), vec2(0.93975735, -0.3259703), vec2(0.9751435, -0.2060976), vec2(0.9952242, -0.0829647));

uniform sampler2D tileBuffer;
uniform sampler2D inputBuffer;
uniform sampler2D packedCoCDepthBuffer;

in vec2 texCoords;

out vec4 fragColor;

const float f = 0.5;
const float piff = PI*f*f;

float cocAlpha(float coc){
	return 1.0 / max(piff, (PI * coc * coc));
}

float rand(vec2 xy){
    return fract(sin(dot(xy, vec2(12.9898, 78.233))) * 43758.5453);
}

#define TILE_BIT_SHIFT 4 //Usually defined by the program depending on tile size

#pragma optionNV(unroll all)
void main(){
	ivec2 tileCoords = ivec2(gl_FragCoord.xy) >> TILE_BIT_SHIFT;
	vec2 tile = texelFetch(tileBuffer, tileCoords, 0).xy;
	float maxCoC = tile.x;
	float minDepth = tile.y;
	vec2 center = gl_FragCoord.xy;
	vec4 centerColor = texelFetch(inputBuffer, ivec2(center), 0);
	float angle = rand(gl_FragCoord.xy) * 2 * PI;
	float c = cos(angle);
	float s = sin(angle);
	mat2 rotation = mat2(
		c, -s,
		s,  c
	);
	vec2 centerData = texelFetch(packedCoCDepthBuffer, ivec2(center), 0).xy;
	float centerCoC = centerData.x;
	float centerDepth = centerData.y;
	vec4 foreground = vec4(0);
	vec4 background = vec4(0);
	float centerCoCAlpha = cocAlpha(centerCoC);
	float centerCoC1 = centerCoC + 1;
	for(int i = 0; i < DOF_SAMPLES; i++){
		ivec2 coords = ivec2(center + (rotation*offsets[i])*maxCoC);
		float distance = length(offsets[i])*maxCoC;
		vec2 cocDepth = texelFetch(packedCoCDepthBuffer, coords, 0).rg;
		float coc = cocDepth.x;
		float depth = cocDepth.y;
		float backAlpha = clamp(depth - centerDepth + 1, 0.0, 1.0);
		float foreAlpha = 1.0 - backAlpha;
		float coc1 = clamp(1.0 + centerCoC1 - distance, 0.0, 1.0);
		backAlpha *= centerCoCAlpha * coc1;
		float coc2 = clamp(1.0 + coc - distance, 0.0, 1.0);
		foreAlpha *= cocAlpha(coc) * coc2;
		vec3 color = texelFetch(inputBuffer, coords, 0).rgb;
		vec4 f = vec4(color, 1.0) * foreAlpha;
		vec4 b = vec4(color, 1.0) * backAlpha;
		//vec4 b = vec4(color, 1.0) * (backAlpha + foreAlpha * 0.000001); //Using this line instead of the one above reduces the register count a bit.

		foreground += f;
		background += b; //Comment out this line and the number of registers drops to 7 regardless of sample count.
	}
	float alpha = clamp(foreground.a / (cocAlpha(maxCoC)*DOF_SAMPLES), 0.0, 1.0);
	foreground.rgb /= max(foreground.a, 0.000000001);
	background.rgb /= max(background.a, 0.000000001);
	vec3 color = mix(background.rgb, foreground.rgb, alpha);
	fragColor = vec4(color, centerColor.a);
}

Excuse the offset array...


The shader is relatively simple. For each sample, it samples the depth and CoC (circle of confusion) of the sample and calculates a background alpha and a foreground alpha. It then fetches the color of the sample and accumulates the foreground color and background color in two vec4 vectors. At the end I blend between the two.


The problem here is that the register count completely depends on the sample count, which it shouldn't. Using GPU ShaderAnalyzer to compile the shader for AMD GPUs gives a register count of 16, regardless of sample count. On Nvidia, it simply explodes. Dumping the result of glGetProgramBinary() as text, I can see the temp register line:

TEMP R0, R1, R2, R3, ..., R64, R65; //66 registers

 - 32 samples: 12.3ms, 66 registers.

 - 64 samples: 52.2ms, 130 registers.

 - 128 samples: 263.3ms, 258 registers.

 - 204 samples: 924.2ms, 410 registers.


Above 204 samples I get an unspecified internal compiler exception and the shader won't compile.
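The four measurements above can be checked for a pattern with a few lines of Python. This sketch only reprocesses the numbers reported in this post; the function name is made up. The data fits a straight line of two temporaries per sample plus two, which would be consistent with the compiler keeping two values alive per unrolled iteration, although the dump alone does not prove that this is what happens.

```python
# (samples, frame time in ms, temp registers) as reported in the post.
measurements = [
    (32, 12.3, 66),
    (64, 52.2, 130),
    (128, 263.3, 258),
    (204, 924.2, 410),
]


def registers_per_sample(data):
    """Slope and intercept of the line through the first and last data points."""
    (s0, _, r0), (s1, _, r1) = data[0], data[-1]
    slope = (r1 - r0) / (s1 - s0)
    intercept = r0 - slope * s0
    return slope, intercept


slope, intercept = registers_per_sample(measurements)

# Check that the two middle measurements lie on the same line.
fits_all = all(abs(slope * s + intercept - r) < 1e-9 for s, _, r in measurements)

# Cost per sample: constant if the shader scaled linearly with sample count.
ms_per_sample = [t / s for s, t, _ in measurements]

print(slope, intercept, fits_all)
print([round(x, 2) for x in ms_per_sample])
```

The time per sample also rises with every step, so the shader gets slower per sample as the register count grows, which is what the latency-hiding theory would predict.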


Obviously, I can't have this kind of performance at 1920x1080. The shader should be runnable with far fewer registers, but the assembly is a complete mess. If I make some subtle changes, I can reduce the register count somewhat, making 32 and 64 samples viable. If I change the calculation of the "b" temp variable in the loop from <backAlpha> to <(backAlpha + foreAlpha*0.000001)>, the shader improves massively.


 - 32 samples, 1.8ms, 43 registers (23 less)

 - 64 samples, 3.8ms, 91 registers (39 less)


Commenting out <background += b;> completely leaves all the texture fetches intact, and the register count drops to a constant 7 regardless of sample count, but the result obviously becomes incorrect.


 - 204 samples, 16.5ms, 7 registers (403 less)


The thing is that the background variable and its calculations should only consume 2-3 registers. I'm really at a loss as to what to do.



Performance of drawing vegetation; overdraw, alpha testing... Alpha cutout model genera...

01 May 2015 - 06:57 PM

Hello, everyone.
I'm still struggling with getting good vegetation rendering performance. Basically, what I have is this:


With the right ground texture and lighting, it looks... okay.





Sadly, the draw distance is limited as hell, and due to this, we basically need to make the grass extremely boring and feature-less so the fading in the distance isn't noticeable. I want faster, taller, thicker, more distinct grass with higher render distance without the performance cost. The current grass takes around 2 milliseconds, half being drawing the grass to the G-buffer (including depth pre-pass) and the other half being the SRAA pass. Basically, the grass is a bunch of flat triangles.


What I've learned so far:


 - Do a depth prepass if you're not doing MSAA. It saves a SHITLOAD of time since depth-only+alpha-test is really cheap, and GL_EQUAL depth testing in the second G-buffer pass is literally 4 times as fast. Basically doubles my performance, but it isn't usable in the SRAA (MSAA) pass.


 - The shape of the grass meshes matters a lot. At first we had randomly rotated flat meshes, but these looked like crap and tended to lump together. We switched to 3 intersecting quads rotated 120 degrees from each other, which had a more volumetric and even look, but it looked like crap from above. In the end, we went with http://i.imgur.com/h0LilV0.jpg, which looks good from most angles, especially above.


 - The area of the mesh is what matters. Even transparent fragments that get discarded by the alpha test are essentially full-cost fragments. We "optimized" our textures to be as little transparent cutout as possible and as much grass as possible to minimize the number of wasted fragments.


 - Fading the alpha value of the grass looks like crap with alpha testing, and even worse with alpha-to-coverage. A better solution was to simply slowly sink the grass into the ground, which had less popping and flickering.


 - Shadows are completely out of the question. Don't draw shadows for the grass.




With all these tricks, our grass went from 20ms to 2ms at... "acceptable" quality. I still don't like it, but frankly, I'm at a loss as to what to do next. Then I see something like this:




And I'm just like "What the hell". They have a significantly longer render distance than us, but don't seem to have any significant performance problems. The grass is also significantly taller, and in Crysis 3 you even walk around with the camera inside it. My guess is that they are rendering those blades as meshes, not alpha-tested billboards. I would like to test that out anyway.


Are there any tools out there to generate a triangle mesh from an image with alpha cutouts? Preferably one that allows for multiple LODs of the same model to be generated.
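To make the idea concrete, here is a rough Python sketch of the simplest version of such a tool: take the convex hull of all opaque texels and triangulate it as a fan, so that fewer discarded fragments get rasterized than with a full quad. This is not an existing tool. All function names are made up, the alpha mask is a tiny hand-written example, and a real solution would also need concave outlines, several separate pieces per texture and LOD generation.

```python
def opaque_points(alpha, threshold=0.5):
    """Corner points of every texel whose alpha passes the threshold.

    alpha is a list of rows, each a list of floats in [0, 1].
    """
    points = set()
    for y, row in enumerate(alpha):
        for x, a in enumerate(row):
            if a >= threshold:
                points.update({(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)})
    return sorted(points)


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain on sorted points.

    Returns the hull counter-clockwise for x-right, y-up coordinates.
    """
    if len(points) < 3:
        return list(points)
    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula."""
    n = len(poly)
    total = 0
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def fan_triangles(hull):
    """Triangulate a convex polygon as a fan around its first vertex."""
    return [(hull[0], hull[i], hull[i + 1]) for i in range(1, len(hull) - 1)]


# Hand-written 4x4 mask standing in for a grass texture's alpha channel.
alpha = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
hull = convex_hull(opaque_points(alpha))
print(hull, polygon_area(hull), len(fan_triangles(hull)))
```

For this mask the hull covers 11.5 texels instead of the quad's 16, at the cost of three triangles instead of two. Whether that trade pays off on real grass textures would have to be measured.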