Member Since 15 Nov 2010

Topics I've Started

FBX importing problems

22 September 2015 - 05:35 PM

Hello. I'm trying to convert binary FBX files to my own format, which means extracting joints, vertices and bone bindings/weights from them, but I can't seem to get it right.


1. There is a lot of transformation data in there: PreRotation, Lcl Translation/Rotation/Scaling, GeometricTranslation/Rotation/Scaling, PostRotation... I have no idea which transforms apply to what.


2. Because of the above, I'm having a lot of trouble importing joints/bones. The objects have a number of the above transforms, and in addition each deformer has both a "Transform" and a "TransformLink". None of these match each other. What am I supposed to base my joint transforms (bind pose) on?


3. My vertex positions and normals seem to match what I expect to get without any transformation applied to them, but I've found sources saying that they should be affected by the Lcl and Geometric transforms. With those applied, though, the model ends up rotated in ways I don't expect.


4. My model seems to only have bone bindings for a small subset of the model. Only 176 out of 750 vertices have any bone bindings at all. This may of course be a problem with the model, but it is being animated correctly in Maya, which confuses me even more.
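
For question 2, the FBX SDK documentation describes a cluster's "Transform" as the global transform of the mesh node at bind time and "TransformLink" as the global transform of the bone (link) node at bind time, which would explain why neither matches the node's Lcl values. Under that reading, an inverse bind matrix can be sketched as inverse(TransformLink) * Transform. The snippet below is a minimal illustration in plain Python, assuming column-vector 4x4 matrices; all helper names are hypothetical and not part of any FBX API, and any Geometric transform on the mesh node is ignored.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists (row-major storage)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_inverse(m):
    """Invert a 4x4 matrix with Gauss-Jordan elimination and partial pivoting."""
    n = 4
    aug = [[float(x) for x in m[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def translation(x, y, z):
    """Build a translation matrix (column-vector convention, assumed here)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def inverse_bind_matrix(transform, transform_link):
    """Map mesh-space positions into the bone's space at bind time.

    transform:      mesh node's global matrix at bind time (cluster "Transform")
    transform_link: bone node's global matrix at bind time (cluster "TransformLink")
    """
    return mat_mul(mat_inverse(transform_link), transform)

def transform_point(m, p):
    """Apply a 4x4 matrix to a 3D point (w = 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))

# Made-up example values: mesh placed at x = 1, bone placed at (1, 2, 0).
mesh_global = translation(1.0, 0.0, 0.0)
bone_global = translation(1.0, 2.0, 0.0)
ibm = inverse_bind_matrix(mesh_global, bone_global)
print(transform_point(ibm, (0.0, 2.0, 0.0)))
```

In this example a mesh-space vertex that sits exactly on the bone's bind position maps to the origin of the bone's space. At animation time the skinning matrix would then be the bone's current global matrix multiplied by this inverse bind matrix.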


I would really appreciate any help I could get. Thanks!

Slerp over a triangle?

29 June 2015 - 08:25 PM



I have a rather interesting problem. I have a triangle with 3 corners, each holding a unit vector (say a normal). I also have two alpha values, a0 and a1. From these, I calculate a point on the triangle like this:

result = v0 + (v2 - v0) * a0 + (v1 - v0) * a1

Although this correctly produces a new unit vector, the result is skewed by the linear interpolation. What I actually want to do is spherical linear interpolation (slerp) between these 3 vectors based on a0 and a1, but I have no idea how to slerp between more than 2 vectors. I can't really find any information on it either. This is to be done at load time once, so performance is not much of an issue here.
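
One possible approach, sketched here under stated assumptions, is to nest two ordinary slerps: first along the v0-v1 edge, then from that edge point towards v2. The weights are the barycentric weights implied by the formula above (w0 = 1 - a0 - a1, w1 = a1, w2 = a0). This is only an approximation of a true spherical barycentric blend, because the result depends on the order in which the corners are combined. The function names are hypothetical, and the corners are assumed to be unit length and not antipodal.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b.

    Assumes a and b are not antipodal (the arc is undefined in that case).
    """
    d = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(d)
    if theta < 1e-6:
        # Nearly parallel: fall back to a normalized linear blend.
        return normalize(tuple(x + (y - x) * t for x, y in zip(a, b)))
    s = math.sin(theta)
    wa = math.sin((1.0 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return tuple(wa * x + wb * y for x, y in zip(a, b))

def triangle_slerp(v0, v1, v2, a0, a1):
    """Nested slerp over a triangle; order-dependent approximation."""
    # Barycentric weights implied by v0 + (v2 - v0) * a0 + (v1 - v0) * a1.
    w0, w1, w2 = 1.0 - a0 - a1, a1, a0
    if w0 + w1 < 1e-12:
        return v2  # all weight on v2
    edge = slerp(v0, v1, w1 / (w0 + w1))
    return slerp(edge, v2, w2)
```

At the corners this reproduces v0, v1 and v2 exactly, and every interior result stays on the unit sphere. If order independence matters, an iterative spherical weighted average would be the thing to look into instead; that is not shown here.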

Nvidia GLSL shader compiler bug? Massive register usage for unrolled loop.

08 June 2015 - 07:30 AM



I have a shader that compiles to a horrible mess. It's a depth-of-field shader which requires a lot of random samples to be taken around each pixel, but it has excellent quality with enough samples. The problem is that the Nvidia compiler seems to choke and generate extremely bad assembler code for it that uses a huge number of temporary registers. Since the shader does so many texture reads, that limits the ability of the GPU to hold multiple shader invocations in registers at the same time for memory latency hiding, so the performance of the shader tanks as it gets stuck on texture loads (at least this is my theory).

#version 150

#define PI 3.1415926535897932384626433832795

//Modify sample count here, above 204 gives me an unspecified internal compiler exception
#define DOF_SAMPLES 64
vec2 offsets[252] = vec2[252](vec2(0.0, 0.0), vec2(-0.05094843, -0.027418049), vec2(0.050125718, -0.068688385), vec2(0.09910661, 0.036509473), vec2(0.016413035, 0.121766396), vec2(-0.10062634, 0.09444325), vec2(-0.15083583, -0.015798414), vec2(-0.103096426, -0.1277946), vec2(0.009343931, -0.17559767), vec2(0.12595633, -0.13791811), vec2(0.19384219, -0.035696544), vec2(0.18766445, 0.08716766), vec2(0.1124681, 0.18475763), vec2(-0.004027978, 0.22524856), vec2(-0.12445892, 0.19807222), vec2(-0.21423309, 0.113127284), vec2(-0.2502723, -0.005099549), vec2(-0.22513011, -0.12629732), vec2(-0.1462998, -0.22181903), vec2(-0.03268847, -0.27112693), vec2(0.09111107, -0.26504686), vec2(0.20011634, -0.20609483), vec2(0.2741352, -0.106516264), vec2(0.3004192, 0.01483083), vec2(0.27541655, 0.13634056), vec2(0.20423868, 0.2381278), vec2(0.099106446, 0.30425382), vec2(-0.023290485, 0.32530487), vec2(-0.14467937, 0.299009), vec2(-0.24792022, 0.22988579), vec2(-0.31921172, 0.12802827), vec2(-0.3495787, 0.007491847), vec2(-0.3357994, -0.11608094), vec2(-0.28017777, -0.22740729), vec2(-0.19014229, -0.3131003), vec2(-0.076512635, -0.36374018), vec2(0.04749505, -0.37400672), vec2(0.16820504, -0.3432542), vec2(0.27250993, -0.2753727), vec2(0.3498804, -0.17790553), vec2(0.39280573, -0.061176278), vec2(0.3974986, 0.06335222), vec2(0.36392722, 0.1831637), vec2(0.2956421, 0.28734145), vec2(0.1994358, 0.3662899), vec2(0.08417271, 0.41331866), vec2(-0.03999258, 0.42461538), vec2(-0.16186737, 0.39958513), vec2(-0.27175704, 0.3405859), vec2(-0.3603246, 0.252969), vec2(-0.42078885, 0.14401685), vec2(-0.44863185, 0.02257885), vec2(-0.44203544, -0.10178065), vec2(-0.401791, -0.21976513), vec2(-0.33128855, -0.3224197), vec2(-0.23587917, -0.40254584), vec2(-0.12263095, -0.4545548), vec2(2.9601651E-4, -0.47500983), vec2(0.124502406, -0.46272635), vec2(0.24105985, -0.41890258), vec2(0.3426444, -0.34664255), vec2(0.42262557, -0.25086743), vec2(0.47584936, -0.13816687), vec2(0.49925408, -0.015681067), 
vec2(0.49160993, 0.10858902), vec2(0.45357937, 0.22741406), vec2(0.38769093, 0.33335632), vec2(0.29811624, 0.4201616), vec2(0.1903407, 0.48286253), vec2(0.07078616, 0.5180246), vec2(-0.053854316, 0.5238695), vec2(-0.17626365, 0.5002464), vec2(-0.28965878, 0.44875896), vec2(-0.38829878, 0.37214747), vec2(-0.4665091, 0.27498433), vec2(-0.52034074, 0.16267747), vec2(-0.54730046, 0.04076701), vec2(-0.54602325, -0.08385942), vec2(-0.5167307, -0.20529567), vec2(-0.46119738, -0.3169049), vec2(-0.38233075, -0.41343606), vec2(-0.28394252, -0.4903723), vec2(-0.17137805, -0.543774), vec2(-0.04968076, -0.57146436), vec2(0.07505839, -0.57217383), vec2(0.19722003, -0.54598904), vec2(0.31062397, -0.4944529), vec2(0.4106351, -0.41991213), vec2(0.49273396, -0.32579467), vec2(0.55314887, -0.21663895), vec2(0.5893967, -0.09741622), vec2(0.60010576, 0.027145972), vec2(0.58489394, 0.1507767), vec2(0.54447216, 0.2690125), vec2(0.48081723, 0.3763101), vec2(0.39661273, 0.46847865), vec2(0.29568815, 0.5415854), vec2(0.18178858, 0.5930285), vec2(0.060128435, 0.6205564), vec2(-0.06461978, 0.6233034), vec2(-0.18747027, 0.601262), vec2(-0.30358085, 0.55540854), vec2(-0.40820295, 0.48783192), vec2(-0.49784634, 0.4009216), vec2(-0.5688769, 0.29824063), vec2(-0.6186902, 0.1837656), vec2(-0.6455222, 0.061860383), vec2(-0.6484957, -0.06289835), vec2(-0.62763125, -0.18589757), vec2(-0.5838239, -0.3026598), vec2(-0.51859844, -0.4092575), vec2(-0.4345059, -0.5016188), vec2(-0.33465162, -0.5765445), vec2(-0.22264922, -0.63150495), vec2(-0.10247934, -0.6647122), vec2(0.021983659, -0.6751602), vec2(0.14614245, -0.6625285), vec2(0.26578295, -0.6274031), vec2(0.37719408, -0.5709465), vec2(0.47639263, -0.49525908), vec2(0.56038487, -0.4027192), vec2(0.62624264, -0.2966684), vec2(0.6719576, -0.18066056), vec2(0.69624704, -0.058175553), vec2(0.6983263, 0.06685303), vec2(0.6782938, 0.18977612), vec2(0.63671046, 0.307704), vec2(0.57510304, 0.41631395), vec2(0.49554783, 0.51231426), vec2(0.400357, 0.5930643), 
vec2(0.29244864, 0.65609974), vec2(0.1754449, 0.69941944), vec2(0.052591655, 0.72192955), vec2(-0.07208396, 0.722997), vec2(-0.19556342, 0.7026118), vec2(-0.313208, 0.66164815), vec2(-0.42258692, 0.6010672), vec2(-0.5199313, 0.5229809), vec2(-0.602774, 0.42951193), vec2(-0.66860044, 0.3236692), vec2(-0.7159336, 0.20787737), vec2(-0.7431986, 0.0860637), vec2(-0.74982345, -0.03867427), vec2(-0.73572564, -0.16251041), vec2(-0.7012463, -0.28273726), vec2(-0.64756024, -0.3953836), vec2(-0.5760888, -0.49775696), vec2(-0.4887943, -0.58711183), vec2(-0.38836306, -0.6608815), vec2(-0.27691227, -0.7175607), vec2(-0.15805756, -0.75535756), vec2(-0.03424813, -0.773538), vec2(0.09054814, -0.77156425), vec2(0.21345562, -0.7496163), vec2(0.33131352, -0.70830643), vec2(0.4411171, -0.64873886), vec2(0.5398127, -0.572728), vec2(0.6255269, -0.48178065), vec2(0.69582444, -0.37842688), vec2(0.74885577, -0.26563898), vec2(0.7836738, -0.14555883), vec2(0.79928285, -0.021499978), vec2(0.7953974, 0.10310024), vec2(0.7721461, 0.22596616), vec2(0.7302907, 0.3433905), vec2(0.6706387, 0.45328656), vec2(0.59503376, 0.5523862), vec2(0.5048565, 0.63898236), vec2(0.40249664, 0.7107442), vec2(0.29076397, 0.7658893), vec2(0.17159884, 0.8035328), vec2(0.048127197, 0.8226634), vec2(-0.07678943, 0.82290584), vec2(-0.20028447, 0.8043227), vec2(-0.31955224, 0.7674063), vec2(-0.43191117, 0.7130601), vec2(-0.53486264, 0.64257175), vec2(-0.62641716, 0.5572826), vec2(-0.7040037, 0.4596876), vec2(-0.76645505, 0.35137343), vec2(-0.8121836, 0.23506467), vec2(-0.8402563, 0.11334115), vec2(-0.8501355, -0.011126165), vec2(-0.8416843, -0.13563512), vec2(-0.8151627, -0.25751156), vec2(-0.77104247, -0.3745396), vec2(-0.7103987, -0.48384058), vec2(-0.6345899, -0.58313197), vec2(-0.5452804, -0.670368), vec2(-0.44439965, -0.74377894), vec2(-0.33371583, -0.8020702), vec2(-0.2162917, -0.84372354), vec2(-0.09380041, -0.868235), vec2(0.030842794, -0.87501734), vec2(0.15550463, -0.863951), vec2(0.27683, -0.83542097), 
vec2(0.39319742, -0.78989804), vec2(0.50187993, -0.72844803), vec2(0.60106504, -0.6520949), vec2(0.6881655, -0.56294334), vec2(0.76210856, -0.46222004), vec2(0.8211906, -0.35224876), vec2(0.86445415, -0.23481578), vec2(0.89084774, -0.11304182), vec2(0.90013, 0.011521529), vec2(0.8920367, 0.1364757), vec2(0.8668728, 0.2585674), vec2(0.8250396, 0.37631813), vec2(0.7674823, 0.4870859), vec2(0.6950991, 0.58915573), vec2(0.6095126, 0.6802521), vec2(0.5124028, 0.75872827), vec2(0.40564424, 0.8231951), vec2(0.2908514, 0.8726881), vec2(0.17055117, 0.90612817), vec2(0.046985038, 0.9229954), vec2(-0.07801756, 0.9230508), vec2(-0.20173839, 0.90630436), vec2(-0.3223663, 0.87301296), vec2(-0.43729547, 0.8239084), vec2(-0.5444945, 0.7599695), vec2(-0.64209765, 0.68242854), vec2(-0.7287248, 0.59239984), vec2(-0.80278486, 0.4914286), vec2(-0.86252344, 0.38208303), vec2(-0.9074107, 0.2655509), vec2(-0.93659306, 0.14384396), vec2(-0.9494719, 0.019528575), vec2(-0.94592875, -0.10522286), vec2(-0.9260293, -0.2287058), vec2(-0.8902057, -0.34835088), vec2(-0.83917457, -0.4621388), vec2(-0.7736365, -0.5685537), vec2(-0.6949583, -0.66541266), vec2(-0.6041978, -0.7514446), vec2(-0.503229, -0.82490075), vec2(-0.3933879, -0.88483435), vec2(-0.2773674, -0.9299018), vec2(-0.15581436, -0.9598745), vec2(-0.031619634, -0.97397035), vec2(0.09315319, -0.9720673), vec2(0.21691974, -0.9542121), vec2(0.3372007, -0.92078584), vec2(0.4520694, -0.872441), vec2(0.5605, -0.8095382), vec2(0.6595406, -0.7338182), vec2(0.7484103, -0.6460213), vec2(0.8256419, -0.5475068), vec2(0.8895131, -0.4406578), vec2(0.93975735, -0.3259703), vec2(0.9751435, -0.2060976), vec2(0.9952242, -0.0829647));

uniform sampler2D tileBuffer;
uniform sampler2D inputBuffer;
uniform sampler2D packedCoCDepthBuffer;

in vec2 texCoords;

out vec4 fragColor;

const float f = 0.5;
const float piff = PI*f*f;

float cocAlpha(float coc){
	return 1.0 / max(piff, (PI * coc * coc));
}

float rand(vec2 xy){
    return fract(sin(dot(xy, vec2(12.9898, 78.233))) * 43758.5453);
}

#define TILE_BIT_SHIFT 4 //Usually defined by the program depending on tile size

#pragma optionNV(unroll all)
void main(){
	ivec2 tileCoords = ivec2(gl_FragCoord.xy) >> TILE_BIT_SHIFT;
	vec2 tile = texelFetch(tileBuffer, tileCoords, 0).xy;
	float maxCoC = tile.x;
	float minDepth = tile.y;
	vec2 center = gl_FragCoord.xy;
	vec4 centerColor = texelFetch(inputBuffer, ivec2(center), 0);
	float angle = rand(gl_FragCoord.xy) * 2 * PI;
	float c = cos(angle);
	float s = sin(angle);
	mat2 rotation = mat2(
		c, -s,
		s,  c
	);
	vec2 centerData = texelFetch(packedCoCDepthBuffer, ivec2(center), 0).xy;
	float centerCoC = centerData.x;
	float centerDepth = centerData.y;
	vec4 foreground = vec4(0);
	vec4 background = vec4(0);
	float centerCoCAlpha = cocAlpha(centerCoC);
	float centerCoC1 = centerCoC + 1;
	for(int i = 0; i < DOF_SAMPLES; i++){
		ivec2 coords = ivec2(center + (rotation*offsets[i])*maxCoC);
		float distance = length(offsets[i])*maxCoC;
		vec2 cocDepth = texelFetch(packedCoCDepthBuffer, coords, 0).rg;
		float coc = cocDepth.x;
		float depth = cocDepth.y;
		float backAlpha = clamp(depth - centerDepth + 1, 0.0, 1.0);
		float foreAlpha = 1.0 - backAlpha;
		float coc1 = clamp(1.0 + centerCoC1 - distance, 0.0, 1.0);
		backAlpha *= centerCoCAlpha * coc1;
		float coc2 = clamp(1.0 + coc - distance, 0.0, 1.0);
		foreAlpha *= cocAlpha(coc) * coc2;
		vec3 color = texelFetch(inputBuffer, coords, 0).rgb;
		vec4 f = vec4(color, 1.0) * foreAlpha;
		vec4 b = vec4(color, 1.0) * backAlpha;
		//vec4 b = vec4(color, 1.0) * (backAlpha + foreAlpha * 0.000001); //Using this line instead of the one above reduces the register count a bit.

		foreground += f;
		background += b; //Comment out this line and the number of registers drops to 7 regardless of sample count.
	}
	float alpha = clamp(foreground.a / (cocAlpha(maxCoC)*DOF_SAMPLES), 0.0, 1.0);
	foreground.rgb /= max(foreground.a, 0.000000001);
	background.rgb /= max(background.a, 0.000000001);
	vec3 color = mix(background.rgb, foreground.rgb, alpha);
	fragColor = vec4(color, centerColor.a);
}

Excuse the offset array...


The shader is relatively simple. For each sample, it samples the depth and CoC (circle of confusion) of the sample and calculates a background alpha and a foreground alpha. It then fetches the color of the sample and accumulates the foreground color and background color in two vec4 vectors. At the end I blend between the two.


The problem here is that the register count completely depends on the sample count, which it shouldn't. Using GPU ShaderAnalyzer to compile the shader for AMD GPUs gives a register count of 16, regardless of sample count. On Nvidia, it simply explodes. Dumping the result of glGetProgramBinary() as text, I can see the temp register line:

TEMP R0, R1, R2, R3, ..., R64, R65; //66 registers

 - 32 samples: 12.3ms, 66 registers.

 - 64 samples: 52.2ms, 130 registers.

 - 128 samples: 263.3ms, 258 registers.

 - 204 samples: 924.2ms, 410 registers.


Above 204 samples I get an unspecified internal compiler exception and the shader won't compile.
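
As a quick sanity check on the figures above, the four reported register counts fit registers = 2 * samples + 2 exactly. The sketch below only verifies that arithmetic. The linear model is an observation fitted to these four data points, and reading it as "roughly two live registers per unrolled sample" is a guess, not something the driver output confirms.

```python
# (samples, registers) pairs as reported in the post
measurements = [(32, 66), (64, 130), (128, 258), (204, 410)]

def predicted_registers(samples):
    """Hypothetical linear model fitted to the four measurements."""
    return 2 * samples + 2

def fits_linear_model(data):
    """Return True if every measurement matches the model exactly."""
    return all(predicted_registers(s) == r for s, r in data)

print(fits_linear_model(measurements))  # prints True
```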


Obviously, I can't have this kind of performance at 1920x1080. The shader should be runnable with far fewer registers, but the generated assembly is a complete mess. With some subtle changes, I can reduce the register count somewhat, making 32 and 64 samples viable. If I change the calculation of the "b" temp variable in the loop from <backAlpha> to <(backAlpha + foreAlpha*0.000001)>, the shader improves massively.


 - 32 samples: 1.8ms, 43 registers (23 fewer)

 - 64 samples: 3.8ms, 91 registers (39 fewer)


Commenting out <background += b;> entirely leaves all the texture fetches intact and drops the register count to a constant 7 regardless of sample count, but the result obviously becomes incorrect.


 - 204 samples: 16.5ms, 7 registers (403 fewer)


The thing is that the background variable and its calculations should only consume 2-3 registers. I'm really at a loss as to what to do.



Performance of drawing vegetation; overdraw, alpha testing... Alpha cutout model genera...

01 May 2015 - 06:57 PM

Hello, everyone.
I'm still struggling with getting good vegetation rendering performance. Basically, what I have is this:


With the right ground texture and lighting, it looks... okay.





Sadly, the draw distance is limited as hell, and because of this we basically need to make the grass extremely boring and featureless so the fading in the distance isn't noticeable. I want faster, taller, thicker, more distinct grass with a higher render distance without the performance cost. The current grass takes around 2 milliseconds, half of which is drawing the grass to the G-buffer (including the depth pre-pass) and the other half the SRAA pass. Basically, the grass is a bunch of flat triangles.


What I've learned so far:


 - Do a depth prepass if you're not doing MSAA. It saves a SHITLOAD of time since depth-only+alpha-test is really cheap, and GL_EQUAL depth testing in the second G-buffer pass is literally 4 times as fast. Basically doubles my performance, but it isn't usable in the SRAA (MSAA) pass.


 - The shape of the grass meshes matters a lot. At first we had randomly rotated flat meshes, but these looked like crap and tended to lump together. We switched to 3 intersecting quads rotated 120 degrees from each other, which had a more volumetric and even look, but it looked like crap from above. In the end, we went with http://i.imgur.com/h0LilV0.jpg, which looks good from most angles, especially above.


 - The area of the mesh is what matters. Even transparent fragments that get discarded by the alpha test are essentially full-cost fragments. We "optimized" our textures to be as little transparent cutout as possible and as much grass as possible to minimize the number of wasted fragments.


 - Fading the alpha value of the grass looks like crap with alpha testing, and even worse with alpha-to-coverage. A better solution was to simply slowly sink the grass into the ground, which had less popping and flickering.


 - Shadows are completely out of the question. Don't draw shadows for the grass.




With all these tricks, our grass went from 20ms to 2ms at... "acceptable" quality. I still don't like it, but frankly, I'm at a loss as to what to do next. Then I see something like this:




And I'm just like "What the hell". They have a significantly longer render distance than us, but don't seem to have any significant performance problems. The grass is also significantly taller, and in Crysis 3 you even walk around with the camera inside it. My guess is that they are rendering those blades as meshes, not alpha-tested billboards. I would like to test that out anyway.


Are there any tools out there that generate a triangle mesh from an image with alpha cutouts? Preferably one that allows multiple LODs of the same model to be generated.

SLI with an advanced OpenGL game?

26 March 2015 - 11:49 AM



I have a rather advanced game engine utilizing deferred shading, lots of post-processing, etc. I also have two GTX 770s in SLI, and for some time I was able to get around 1.9x scaling using a small SLI profile I made using Nvidia Inspector, but after adding some new special effects and postprocessing it no longer works. In addition, the last time it DID work it seemed to corrupt the GL_TEXTURE_2D_ARRAY texture used for particles somehow...


The point is that the entire engine was made with SLI in mind. No framebuffers are reused between frames, except for my temporal supersampling, which has a system for buffering frames so that each frame reuses the texture from N frames earlier, compensating for N GPUs running in parallel using Alternate Frame Rendering (AFR). All I want to do is disable all driver-side synchronization of framebuffers etc., but no matter what SLI compatibility settings I use, the game won't scale beyond a single GPU.
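
The frame-buffering scheme described above can be sketched as a ring of N history slots indexed by frame number. This is a minimal illustration rather than the engine's actual code: the class and method names are hypothetical, textures are stood in for by plain values, and it assumes the driver runs frame f on GPU f % N, which is ultimately the driver's decision.

```python
class AfrHistoryRing:
    """Per-GPU history slots for a temporal effect under AFR (hypothetical helper)."""

    def __init__(self, gpu_count):
        self.gpu_count = gpu_count
        self.slots = [None] * gpu_count

    def slot_for(self, frame_index):
        # Assumption: frame f is rendered by GPU f % N, so each slot
        # is only ever touched by one GPU.
        return frame_index % self.gpu_count

    def previous(self, frame_index):
        """History to read for this frame: the result from N frames earlier."""
        return self.slots[self.slot_for(frame_index)]

    def store(self, frame_index, texture):
        """Save this frame's result for the frame N frames later."""
        self.slots[self.slot_for(frame_index)] = texture


ring = AfrHistoryRing(gpu_count=2)
ring.store(0, "frame0")
ring.store(1, "frame1")
print(ring.previous(2))  # prints frame0
```

Because frame f only reads what frame f - N wrote, no history texture has to move between GPUs, which is the property the driver would need to recognize to skip synchronization.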


What can I do to disable all synchronization? Are there specific compatibility bits to do this for OpenGL games? Is there some way of debugging the behavior of the driver to find out what's going on?