Ciprian Stanciu

Member Since 12 Jul 2002
Offline Last Active Feb 06 2015 10:31 AM

Posts I've Made

In Topic: Generating Barycentric Coordinates similar to hardware tesselation ?

22 November 2014 - 06:30 PM

I recently added instanced tessellation to my app, Relative Benchmark for iOS, under OpenGL ES 3: https://itunes.apple.com/us/app/relative-benchmark/id560637086?mt=8

In Topic: Overhead of subroutines in arrays when multidrawing?

02 June 2014 - 10:28 AM

The problem with saying '5% less performance' is we don't know if you were CPU or GPU bound in your tests.

If you were CPU bound then it's possible the extra calls made the difference.
If you were GPU bound then it might be a GPU resource allocation problem.
Or it could be anything in between :)

If you could run the tests via AMD's profiling tools and get some solid numbers that would be useful.
Even their off-line shader analyser might give some decent guideline numbers as that'll tell you VGPR/SGPR usage and you can dig from there to see what is going on.


I'm GPU bound, as I barely do anything on the CPU and I usually have near 99% GPU usage. I was at 95%+ GPU usage with my old HD7850 when I ran the test, but now with my 280X I'm at around 70% GPU usage for the same scene and the same code, while the DX11 code is still at 99%+. Thanks for the heads up about the tools, I'll check them out whenever I can experiment with that code :).

In Topic: Overhead of subroutines in arrays when multidrawing?

31 May 2014 - 02:23 PM

Great theoretical talks, but as no one provided any experimental data I'll chime in.


I tested this out hoping to get better performance than switching between around 80 different shaders, which are essentially just one uber shader compiled with various macros. The macros include VS skinning, diffuse texture sampling, alpha testing, lighting calculations (if the object has a normal) and of course interpolating extra attributes in case it has normals or texcoords (most of my objects did).


So my shader had, for example, a VS skinning subroutine uniform for which I provided 2 versions: one in which skinning takes place, multiplying 4 matrices, and one in which nothing happens. Texturing was the same: one subroutine would sample a texture, the other would do nothing.


My result on an AMD HD7850 was 5% less performance with a single program than with ~80 different variants, with the same number of draw calls, multiple glUniformSubroutinesuiv calls per object, and ~100 objects. I don't know if the perf degradation came from the extra glUniformSubroutinesuiv calls. Subroutine uniforms are rather odd: once you use them for a draw call, you have to set them again before the next use because the state is forgotten (they say it's a D3D11 artifact).


However, this was just one way to use them, and micro jumps don't seem to be faster than switching between multiple shaders. I'm thinking of using them differently in the future: make each subroutine variant an entire VS/PS/etc., then index it based on DrawID or similar and unify draw calls that use different shaders. Adding bindless textures to the mix and merging VB/IB/CB together would make it possible to have just a single draw call per scene :).

In Topic: SPEED: glUniform Vs. uniform buffer objects.

13 May 2014 - 09:01 PM

What about having a UBO which is an array of blocks? (I'm not even sure you can use uniforms for indexing like that.)

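const uint MAX_INSTANCES = 1024u;
uniform uint currentIndex;   // updated per instance drawn
struct Matrices
{
    mat4 modelViewProj;
    mat4 modelView;
};
layout (std140, binding = 0) uniform MatrixBlocks
{
    Matrices matrices[MAX_INSTANCES];
};
void main ()
{
    // read this instance's data via matrices[currentIndex]
}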
Would that make the matrix array more tightly packed (instead of having one matrix, then another matrix at the 256 byte mark, and so on)? You'd update "currentIndex" per instance drawn.


MAX_INSTANCES could be a constant or also a uniform (i.e., the instance count for this frame).


Interesting idea to do a single buffer binding AND do a glUniform call per object, using the best of both worlds I guess.


And yes, you can index into block variables, though the drivers have to support it and it's a bit off spec. I did that on AMD, and once I'm at around 80% of the max 16K per block it sometimes crashes. On GLES3, the last time I tried it, it crashed the shader compilation on Adreno 320. I told PowerVR about it and they had the same issue; it's supposedly fixed now (on the reference driver at least), so I just have to get a GLES3 iOS device and see for myself :).
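To put the 16K per-block figure in context, here is a minimal sketch in C of how many array elements fit in one uniform block. It assumes the std140 size of the quoted Matrices struct is 128 bytes (two mat4 at 64 bytes each) and a 16384-byte block limit; on a real driver the limit would be queried with glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, ...). The names MATRICES_STD140_SIZE and max_block_elements are hypothetical, not from the original posts.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed std140 size of the Matrices struct: two mat4, 64 bytes each. */
enum { MATRICES_STD140_SIZE = 128 };

/* Number of whole elements of `element_bytes` that fit in a uniform
 * block limited to `max_block_bytes` (hypothetical helper). */
static size_t max_block_elements(size_t max_block_bytes, size_t element_bytes)
{
    assert(element_bytes > 0);
    return max_block_bytes / element_bytes;
}
```

Under these assumptions a 16384-byte block holds 128 such structs, so an array of 1024 of them (131072 bytes) would only fit on drivers that report a much larger block size.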


I tried the glBindBufferRange thing on my Nexus 4, btw, and I'm seeing a serious 25% performance degradation (again with absolutely no updates per frame). I also can't put one of my structures inside the block: it gives off some stupid errors. I think it ignores some variables and then ends up with different blocks pretending to be the same one, so I had to put my light properties (PS only) in standard glUniform calls. On the Nexus 4 the offset alignment is 4 bytes, btw, pretty small in comparison.

In Topic: SPEED: glUniform Vs. uniform buffer objects.

13 May 2014 - 08:06 AM

There are also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks, from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).



It's in bytes and it applies to the offset only, so if it says 256 you can bind at any multiple of 256:

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );


glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, Range3 );
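As a rough sketch of that offset rule in C (not code from this thread): each per-object block is padded up to the reported alignment, and the offset of the n-th block is n times that padded stride. The helpers align_up and block_offset are hypothetical names. In a real program the alignment would come from glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...) rather than being hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `alignment`.
 * Uses division, so `alignment` need not be a power of two. */
static size_t align_up(size_t size, size_t alignment)
{
    assert(alignment > 0);
    return ((size + alignment - 1) / alignment) * alignment;
}

/* Byte offset of the `index`-th block when every block of `block_size`
 * bytes is padded to the offset alignment. The result is always a
 * multiple of `alignment`, which is what glBindBufferRange requires
 * for its offset argument. */
static size_t block_offset(size_t index, size_t block_size, size_t alignment)
{
    return index * align_up(block_size, alignment);
}
```

With a 256-byte alignment, 128-byte blocks land at 0, 256, 512 and so on. With a 4-byte alignment, as reported for the Nexus 4, the same blocks can be packed back to back at 0, 128, 256.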



You talked about fewer updates. I do no updates at all, just bindings, and it's still slower than glUniform calls. I'll probably try persistent mapping in GL next as well.