Matias Goldberg

Member Since 02 Jul 2006


Posted by Matias Goldberg on 11 November 2015 - 10:06 AM

There is one solution that can work for you.

Your problem obviously is that you're running out of virtual address space (past 2GB).

You can refactor your code so that you use expanded memory instead. The idea is to work through a small window of virtual address space at a time while the physical memory backing it is much larger. So you can use, for example, 6 GB of physical memory but only access 128 MB of it through pointers at any given time.

Read "Myth: Without /3GB a single program can't allocate more than 2GB of virtual memory" to see how to do that.

#5261194 Software skinning for a precise AABB

Posted by Matias Goldberg on 09 November 2015 - 01:35 PM

I wouldn't assume that animation blending is only used in 1% of cases...
Normally you're blending many animations together in order to animate a skeleton, and normally one skeleton and its associated animations will be used with different meshes too.

And sometimes you interpolate between such incompatible poses that a stored per-frame AABB becomes useless. If you need access to individual bone OBBs in your game, it is OK to use the idea for frustum culling, but do so after the occlusion check against the explicit AABB containing all animations first. You then still have a chance to not render; whether you check it or not is up to you, but it is not that expensive if you arrive at the point of frame construction anyway.

Interpolate AABB for each animation, then merge (enlarge) the interpolated AABBs of the animations running concurrently.
The point is to avoid enlarging the AABB in each animation; otherwise you'll get extremely conservative AABBs by the time you're past the 50% of the animation.
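A minimal sketch of that order of operations (interpolate per animation, then merge across animations). The Vec3 and Aabb types and the function names are assumptions for illustration, not part of any particular engine.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

static Vec3 lerp(const Vec3& a, const Vec3& b, float t)
{
    return { a.x + (b.x - a.x) * t,
             a.y + (b.y - a.y) * t,
             a.z + (b.z - a.z) * t };
}

// Interpolates between two baked keyframe boxes of the SAME animation.
Aabb interpolate(const Aabb& a, const Aabb& b, float t)
{
    return { lerp(a.min, b.min, t), lerp(a.max, b.max, t) };
}

// Smallest box that contains both inputs; used across DIFFERENT animations.
Aabb merge(const Aabb& a, const Aabb& b)
{
    return { { std::min(a.min.x, b.min.x), std::min(a.min.y, b.min.y), std::min(a.min.z, b.min.z) },
             { std::max(a.max.x, b.max.x), std::max(a.max.y, b.max.y), std::max(a.max.z, b.max.z) } };
}
```

For two animations playing at once you would call interpolate once per animation with that animation's own keyframes and time, then merge the two results. The merged box is recomputed every frame, so nothing accumulates over the length of an animation.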

Going off-topic, can you tell a difference in performance between using and not using __restrict in your whole project? (maybe FPS changes?)

There is no definitive answer. It can be as much as a 4-10x improvement, or no improvement at all (with the added risk that if the pointers do alias you get incorrect behavior).
It depends on how much the compiler must preserve the intended behavior, how much information the compiler already had, whether strict aliasing is enforced, what work is being done (e.g. the amount of pointers floating around), the architecture being targeted, where the bottleneck currently is, and how the code (as in the style) was written.

A prime example is matrix multiplication in the form A = B * C. If A may alias either B or C, you need to do a lot of copies, often causing register spilling. Assuming they're all independent gives a noticeable performance boost.
But if the compiler already knows they're not aliased (e.g. A, B & C are local variables on the stack) then adding __restrict won't improve anything. Similarly, if the pointers were just malloc'ed and stored to member variables of a class, the compiler already knows the pointers do not alias. But if in the middle you call a function whose definition is unknown, the compiler might have to assume this function could've modified the malloc'ed pointers; hence adding __restrict or moving that function call after the multiplication boosts performance.
One simple line of code can break a lot of assumptions, so the impact of __restrict needs to be evaluated on a per case basis.
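To make the A = B * C case concrete, here is a minimal sketch. The function name mul4x4 and the row-major 4x4 layout are assumptions for illustration; __restrict is a compiler extension (MSVC, GCC and Clang accept it), not standard C++.

```cpp
// Computes A = B * C for row-major 4x4 matrices.
// __restrict is a promise made by the CALLER that a does not overlap b or c.
// Without it, the compiler must assume every store to a[] may have changed
// b[] or c[], and reload them from memory.
void mul4x4(float* __restrict a, const float* __restrict b, const float* __restrict c)
{
    for (int row = 0; row < 4; ++row)
    {
        for (int col = 0; col < 4; ++col)
        {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += b[row * 4 + k] * c[k * 4 + col];
            a[row * 4 + col] = sum;
        }
    }
}
```

Calling mul4x4( m, m, n ) would break the promise and is undefined behavior, which is exactly the "incorrect behavior" risk mentioned above. Whether this version is actually faster than the unqualified one still has to be measured case by case.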


Posted by Matias Goldberg on 07 November 2015 - 09:58 PM

AFAIK when you turn on /LARGEADDRESSAWARE, you still have to use the "/3GB" boot option unless you're on a 64-bit OS.


But if you're on a 64-bit OS, build a 64-bit exe instead, which is the proper way to solve your problem.

#5260939 Software skinning for a precise AABB

Posted by Matias Goldberg on 07 November 2015 - 04:11 PM

@Matias Thanks for the code snippet. Is poseMatrix the Bones[] array I pass to the vertex shader per frame?

You're welcome. Yes, poseMatrix is the one you would pass to the vertex shader, but with the world matrix set to identity (i.e. position = 0, scale = 1, orientation = identity).

Btw, what's up with all those __restricts and reinterpret casts? tongue.png

Restrict is a performance optimization. It is a promise to the compiler that the pointers won't alias each other; otherwise any write through one of those pointers means the data behind other pointers that may alias it needs to be reloaded from memory (e.g. the compiler can't assume that blendWeight & blendIndex point to different address locations, thus changing what blendWeight points to could change the value behind blendIndex; that's mostly because blendIndex is a uint8_t*, and as such it can alias anything).

The reinterpret cast is just to get access to the vertex data using the stride (i.e. an offset in bytes) from the current vertex, just like when you specify a vertex format. I was careful not to break strict aliasing by using a uint8_t pointer for vertexData (char can alias any other data type).


It is often recommended to put the __restrict keyword inside a macro (e.g. MY_RESTRICT), so that if you're having a horrible bug, you can turn all __restrict off and see if the bug persists (if it goes away, you broke the promise by accident).
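A minimal sketch of such a macro. MY_RESTRICT comes from the text above; the MY_NO_RESTRICT switch and the scaleInto function are hypothetical names used only for this example.

```cpp
#include <cstddef>

// Define MY_NO_RESTRICT (hypothetical name) on the compiler command line to
// strip every restrict qualifier at once while hunting an aliasing bug.
#if defined(MY_NO_RESTRICT)
    #define MY_RESTRICT
#else
    #define MY_RESTRICT __restrict
#endif

// Example user of the macro: dst and src are promised not to overlap.
void scaleInto(float* MY_RESTRICT dst, const float* MY_RESTRICT src,
               std::size_t count, float scale)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = src[i] * scale;
}
```

If a bug disappears when building with MY_NO_RESTRICT defined, some caller is most likely passing overlapping pointers.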

#5260824 Software skinning for a precise AABB

Posted by Matias Goldberg on 06 November 2015 - 04:19 PM

All the listed options are incredibly slow for having many instances in real time running at the same time (except the ones that suggest baking the AABB data).
The best solution that is good enough for 99% of cases is to save the AABBs at different frame intervals (granted, you need to do software skinning first), store them (e.g. in the mesh file), and then linearly interpolate between those AABBs.
Btw sw skinning takes no sh@t time to implement, here:
// Inputs, to be filled in by the caller.
const Matrix4x4 * __restrict poseMatrix;
uint8_t const  * __restrict vertexData;
size_t posStride;          // byte offset of the position inside a vertex
size_t blendIndexStride;   // byte offset of the blend indices inside a vertex
size_t blendWeightStride;  // byte offset of the blend weights inside a vertex
size_t numVertices;
size_t bytesPerVertex;
int numWeightsPerVertex;

// Start with an inverted box so the first vertex sets both corners.
// Note: numeric_limits<float>::min() is the smallest POSITIVE float, so max() is used for both.
Vector3 max( -std::numeric_limits<float>::max() );
Vector3 min(  std::numeric_limits<float>::max() );

for( size_t i=0; i<numVertices; ++i )
{
	const Vector3 * __restrict vPos = reinterpret_cast<const Vector3*>( vertexData + posStride );
	const uint8_t * __restrict blendIndex = reinterpret_cast<const uint8_t*>( vertexData + blendIndexStride );
	// Not used below: bounding every per-bone position is conservative as long as
	// the weights are non-negative and sum to 1.
	const float * __restrict blendWeight = reinterpret_cast<const float*>( vertexData + blendWeightStride );
	for( int j=0; j<numWeightsPerVertex; ++j )
	{
		const Vector3 translatedPos = poseMatrix[blendIndex[j]] * (*vPos);
		max.x = std::max( max.x, translatedPos.x );
		max.y = std::max( max.y, translatedPos.y );
		max.z = std::max( max.z, translatedPos.z );
		min.x = std::min( min.x, translatedPos.x );
		min.y = std::min( min.y, translatedPos.y );
		min.z = std::min( min.z, translatedPos.z );
	}
	vertexData += bytesPerVertex;
}
There. I wrote it in under 5 minutes. Now you have no excuse.
poseMatrix is the matrix you would normally pass to the vertex shader but without the world space component (so that the operation is done in local space).

#5260575 GLSL not behaving the same between graphics cards

Posted by Matias Goldberg on 04 November 2015 - 08:48 PM

I see that "gl_FragColor = v_fragmentColor; // Produces a white box as expected, no problems!"


I wouldn't be so sure about that. Try a better colour, like (0.6, 0.4, 0.2, 1.0) (orange).

If something goes wrong the driver may fall back to an all-white or an all-black pixel shader.


Also, needless to say, running OpenGL inside a VM is far from optimal (rather... extremely suboptimal).


Another problem is that this isn't the code you're executing because, from what you're saying, cocos-2d is adding more code. It's possible these additions are making the shader non-compliant with the standard.

For example, some ancient code we had in our OGRE engine added code before the user-declared #extension; but by spec, #extension directives must be declared before anything else. We never noticed until the Mesa SW implementation (and also the Mesa Intel drivers) started complaining, because it was the only OpenGL implementation that actually enforced it; the others silently accepted the faulty shader.

#5260514 Unrestricted while loop?

Posted by Matias Goldberg on 04 November 2015 - 09:58 AM

It's allowed in Shader Model 4+.

But beware: if the loop never ends, or it takes too long (e.g. more than a couple of seconds; the default Windows timeout is 2 seconds), TDR (Timeout Detection and Recovery) will kick in, affecting your process, or eventually bringing the system down with a BSOD (if you trigger a few more TDRs in a row).


TDR can be disabled via regedit, but that is absolutely not recommended (unless you plan to use your own computer for intensive simulations).


On Shader Model 3 it is not allowed.

#5260171 Who ate all the memory?

Posted by Matias Goldberg on 02 November 2015 - 01:05 PM

Without a thorough analysis of the program it is hard to say.

Languages like C# are designed around the notion that RAM is unlimited or nearly unlimited, thus skyrocketing memory usage when dealing with large objects like an 8092x8092 texture doesn't surprise me. One wrong usage and you can get multiple duplicates (assuming there are no leaks).


The best advice I can give is to start removing code until you see a major change in RAM usage. That narrows the search and locates the offending snippets of code, so they can be better analyzed.

#5260105 How can I declare a nan literal?

Posted by Matias Goldberg on 02 November 2015 - 07:58 AM

I should mention that defaulting your floats to NaN is a bad idea.


They have the habit of spreading around like a plague (since almost any operation with a NaN returns a NaN). One forgotten NaN can cause you to see thousands of NaNs everywhere, in an almost completely unrelated section of code.

Furthermore, NaNs often cause physics engines to crash. So you will get cryptic crashes inside a physics engine for which you probably won't have the source code (e.g. PhysX, Havok), with little clue that you're the one causing the problem with a NaN initialized a long time ago.


But the worst sin of all: NaNs are incredibly useful for finding uninitialized variables. I often override all allocators (new, malloc, etc.) and initialize the whole range of memory I return to signaling NaNs. Then I make sure the floating-point control word is set to raise exceptions on any NaN.

When an uninitialized value is used, bam! exception raised, problem found.


If you use NaNs explicitly around all of your codebase, this technique becomes useless since every time you initialize a float with a NaN you'll get a false positive.

Also, NaNs can hurt performance. CPUs rarely operate at 100% speed when NaNs are involved.


Use a more sane value, like 0 or 1.

#5260013 Anyway to retrive value from fragment shader back to .cpp file?

Posted by Matias Goldberg on 01 November 2015 - 12:55 PM

While there are GPU debuggers out there (RenderDoc, Visual Studio, apitrace, NSight, GPU PerfStudio, Intel GPA), they're often nowhere near the state of CPU debugging (especially for OpenGL).


What we often do is debug by colour. Change the shader, output the value to the screen, and either intuitively accept it (e.g. it should look yellow -> looks yellow, good enough for me) or use those GPU debuggers to retrieve the value of the texture from a FLOAT32 RTT.


Debugging by colour is a necessary skill. Learn to "read" XYZ from red, green and blue.
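To illustrate how XYZ maps to red, green and blue, here is a small sketch of one common convention: remap each component from [-1, 1] to [0, 1] before writing it out, and invert the mapping when reading a pixel back. The convention and the names below are assumptions; a FLOAT32 target can also store the raw values without any remapping.

```cpp
struct Vec3 { float x, y, z; };
struct Rgb  { float r, g, b; };

// What the shader would output: X -> red, Y -> green, Z -> blue,
// with each component remapped from [-1, 1] to [0, 1].
Rgb encodeAsColour(const Vec3& v)
{
    return { v.x * 0.5f + 0.5f, v.y * 0.5f + 0.5f, v.z * 0.5f + 0.5f };
}

// What you do in your head (or on the CPU) when reading the pixel back.
Vec3 decodeFromColour(const Rgb& c)
{
    return { c.r * 2.0f - 1.0f, c.g * 2.0f - 1.0f, c.b * 2.0f - 1.0f };
}
```

Under this convention a vector pointing straight along +Y shows up as (0.5, 1.0, 0.5), a light green, while mid grey (0.5, 0.5, 0.5) means the zero vector.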

#5259416 ETC and PVRTC dead to an unified compression ?

Posted by Matias Goldberg on 28 October 2015 - 08:28 AM

The problem is not API support or GPU support. The problem is that all these formats are patented.

PVRTC1/2 is awesome, but it is patented and exclusive to PowerVR GPUs.

BC1-BC3 is okay-ish; it is patented, but licensed in the desktop world.

BC4-BC7 is cool; it is patented, but licensed in the desktop world.

ETC is not good, but it's free and "what we've got on mobile".


ASTC is really good (possibly the best) and royalty-free, but it is very recent and adoption is far from widespread yet.


Btw. Metal does not currently support BC on mobile. Patents. I suggest you look at the support chart (page 87).

#5259358 Antialiasing will stay in the future ?

Posted by Matias Goldberg on 27 October 2015 - 08:57 PM

I disagree with MarkS. Even if you have pixels that are not individually discernible, aliasing can introduce visible artifacts, e.g. moiré patterns. A pixel should ideally be of the color that is the average color of the area it covers, and a single sample is a poor estimate of the average.

I disagree with your disagreement. Aliasing occurs due to low Visual Acuity (jagged lines) or high visual acuity while still not reaching Vernier Acuity / Nyquist frequency (moiré patterns). The Visual & Vernier Acuity are affected by pixel density (resolution), display size, and distance from our eyes to the display.

Once the pixel density is high enough, the colour itself our eyes perceive will be a "blended average" done in analog form.
However, if we ever reach the processing power to render natively at a resolution high enough to match the Vernier Acuity, then it will almost certainly be far more useful and cost-effective to lower the display resolution and use MSAA or SSAA instead.

It should also be noted that 4K is far from enough to reach that state (especially if vendors insist on selling >42" displays), so AA is still going to be needed.

#5258433 Hardware instancing speed check

Posted by Matias Goldberg on 21 October 2015 - 04:05 PM

I wasn't after an exact science-based response, hence my "I appreciate this can be a subjective question" and "Seems alright doesn't it...?" lines. I think I've supplied enough information for someone to say "yes that seems fair" or "no, I can get 500 times more than that on my similar machine". I've implied that I'm just rendering 100 simple trees, I didn't want to bore everyone or waste anyone's time with reams of shader code or other info to scan through.

The problem is that you can get MAJOR differences.

Rendering 30M triangles with 7M vertices is very different from rendering 30M triangles with 90M vertices (very high vertex reuse vs 3 unique vertices per triangle). We're talking about nearly 13x (90M / 7M ≈ 12.9) the amount of data that needs to be processed by the vertex shader.

Sending the world-view-projection matrix as a single matrix is not the same as sending each matrix separately and concatenating them in the shader. Likewise, sending the world, view & projection matrices per tree is not the same as sending the view & projection matrices per camera and the world matrix per tree.

Having many triangles that occupy less than a pixel can cause you up to 4x slowdown.

I'm not talking about pasting huge loads of shader code (you're right in that no one's gonna read that), but at least give us a rough idea of the complexity. Outputting a fixed colour is the simplest pixel shader, then there's a diffuse texture with a simple Blinn-Phong BRDF... and then there's a shader that can do normal mapping, specular mapping, and uses a GGX with Fresnel BRDF to light it.

And last but not least, instancing at 100 instances is barely going to make a difference. If you have noticeable performance differences with an older implementation that didn't use instancing, then you're (or were) doing it wrong.
Instancing will fix your CPU bottlenecks, which often begin to take a toll once you exceed 1000 draw calls (depends on the API; e.g. in DX11 you can make 50k draw calls like nothing if you're careful enough; you're tagging this thread as D3D9 so I assume you're using D3D9, and that API often begins to show its problems between 1k - 3k draws).

#5258412 Hardware instancing speed check

Posted by Matias Goldberg on 21 October 2015 - 02:06 PM

Your measuring metrics are very poor.


You're just considering number of triangles and framerate, whereas any meaningful evaluation would require:

  • Vertex size in bytes / vertex description
  • Number of vertex attributes
  • Number of vertex buffers
  • Number of vertices
  • Number of triangles
  • Complexity of vertex shader. Number of uniforms
  • Number of interpolants exported to pixel shader
  • Complexity of pixel shader
  • How many pixels the average triangle occupies
  • Whether they're rendered front to back or back to front
  • Frametime in milliseconds (rather than FPS)
  • A bit of source code to get a rough idea of some of the above (e.g. complexity of shaders, etc)
  • HW and OS you're running on


With the information provided in your post we have absolutely no idea if your numbers are good or not.

#5257493 GLSL Error C1502 (Nvidia): "index must be constant expression"

Posted by Matias Goldberg on 16 October 2015 - 08:56 AM

Welcome to game development! Where there's no best choice and your options suck!

You just (re)discovered the problems of traditional Forward rendering (as opposed to Deferred Shading, or Forward+, or Clustered Forward). Best way to learn is the hard way.

What you're doing is called Forward shading, and no, you're not missing anything.

You either:

  1. Create a buffer with a fixed maximum number of lights (e.g. 8), use the same lights for all objects.
  2. Create a megabuffer of N*L where N is the number of objects, and L the max number of lights per object (e.g. 1000 objects, 8 lights per object = 8000)
  3. Use a non-forward solution

Forward+ and clustered forward, for example, create a single buffer with all lights in the scene; then, via a special algorithm, they generate a second buffer with the indices of the lights being used by each tile or cluster. E.g. send all 500 lights in the scene; then a special pass determines that a tile of pixels (or a 'cluster') uses lights 0, 12 and 485. Then the whole tile/cluster is shaded by those three lights:

for( int i=0; i<lightsInTile; ++i )
     colour += shade( lights[tileList[i]] );
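The same loop as a CPU-side C++ sketch, to show the two-buffer layout (all lights, plus a per-tile list of indices into them). The Light struct, the shade function and shadeTile are placeholders made up for this example; in practice this runs in the pixel shader and shade would evaluate a real BRDF.

```cpp
#include <cstdint>
#include <vector>

struct Light { float intensity; };   // placeholder light description

// Hypothetical stand-in for the real per-light shading function.
float shade(const Light& light)
{
    return light.intensity;
}

// lights:   every light in the scene (the single big buffer).
// tileList: indices into 'lights' for the lights that touch this tile/cluster.
float shadeTile(const std::vector<Light>& lights,
                const std::vector<std::uint32_t>& tileList)
{
    float colour = 0.0f;
    for (std::uint32_t lightIndex : tileList)
        colour += shade(lights[lightIndex]);
    return colour;
}
```

With 500 lights in the scene and a tile list of { 0, 12, 485 }, only three lights are evaluated for that tile regardless of how many exist in total.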