cippyboy

Members
  • Content count

    373

Community Reputation

223 Neutral

About cippyboy

  • Rank
    Member
  1. I recently added instanced tessellation to my app, Relative Benchmark for iOS, under OpenGL ES 3: https://itunes.apple.com/us/app/relative-benchmark/id560637086?mt=8
  2. I'm GPU bound, as I barely do anything on the CPU, and I usually see near-99% GPU usage. I was getting 95%+ GPU usage with my old HD7850 when I ran the test, but with my 280X I'm at around 70% GPU usage with the same scene and the same code; the DX11 path is still at 99%+. Thanks for the heads-up on the tools, I'll check them out whenever I can experiment with that code :).
  3. Great theoretical talks, but since no one provided any experimental data, I'll chime in.

     I tested this in hopes of getting better performance compared to switching between around 80 different shaders, which are essentially just one uber shader compiled with various macros. The macros toggle VS skinning, diffuse texture sampling, alpha testing, lighting calculations (if the mesh has normals), and of course interpolating more attributes in case the mesh has normals or texcoords (though most of my objects did).

     So my shaders had, for example, a skinning subroutine uniform with 2 variants: one in which skinning takes place, blending 4 matrices, and one in which nothing happens. The same for texturing: one subroutine samples a texture, the other does nothing.

     My result on an AMD HD7850 was 5% less performance with a single program than with ~80 different variants, with the same number of draw calls, multiple glUniformSubroutinesuiv calls per object, and ~100 objects. I don't know if the degradation came from the extra glUniformSubroutinesuiv calls; subroutine uniforms are rather odd in that once you use them for a draw call, you have to set them again before the next draw because the state is forgotten (they say it's a D3D11 artifact).

     However, this was just one way to use them, as micro jumps don't seem faster than switching between multiple shaders. I'm thinking of using them differently in the future: make each subroutine variant an entire VS/PS/etc., index it based on a draw ID or similar, and unify draw calls that use different shaders. Adding bindless textures to the mix and merging VB/IB/CB together would make it possible to have just a single draw call per scene :).
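A minimal sketch of the subroutine setup described above (all names are illustrative, not from the actual shaders; a vertex stage with one subroutine type for skinning and an "on"/"off" variant pair):

```glsl
#version 400 core
// Hypothetical uber-shader vertex stage using GL_ARB_shader_subroutine.
in vec4 InPosition;
in ivec4 BoneIndices;
in vec4 BoneWeights;
uniform mat4 ViewProjection;
uniform mat4 Bones[64];

subroutine vec4 SkinningFunc(vec4 position);

subroutine(SkinningFunc)
vec4 SkinOn(vec4 position)
{
    // Blend the position by up to 4 bone matrices.
    vec4 p = vec4(0.0);
    for (int i = 0; i < 4; ++i)
        p += (Bones[BoneIndices[i]] * position) * BoneWeights[i];
    return p;
}

subroutine(SkinningFunc)
vec4 SkinOff(vec4 position)
{
    return position; // the "does nothing" variant
}

subroutine uniform SkinningFunc Skinning;

void main()
{
    gl_Position = ViewProjection * Skinning(InPosition);
}
```

On the host side the variant is selected with glGetSubroutineIndex/glUniformSubroutinesuiv, and, as noted above, the selection has to be re-issued before every draw because the subroutine state is lost after each draw call.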
  4. Interesting idea to do a single buffer binding AND a glUniform call per object, using the best of both worlds, I guess.

     And yes, you can index into block variables; the driver has to support it, though, and it's a bit off-spec. I did that on AMD, and once I'm at around 80% of the 16 KB block-size limit, it sometimes crashes. On GLES3, the last time I tried it, it crashed the shader compiler on Adreno 320. I told PowerVR about it and they exhibited the same issue; it's now supposedly fixed (on the reference driver at least), so I just have to get a GLES3 iOS device and see for myself :).

     By the way, I tried the glBindBufferRange approach on my Nexus 4, and I'm seeing a serious 25% performance degradation (again with absolutely no updates per frame). I also can't put one of my structures inside the block; it gives some odd errors. I think it ignores some variables and then ends up with different blocks pretending to be the same one, so I had to keep my light properties (PS only) in standard glUniform calls. On the Nexus 4 the alignment offset is 4 bytes, by the way, pretty small in comparison.
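The block-variable indexing mentioned above reads roughly like this (a GLES3 sketch with hypothetical names; std140 layout assumed, with the per-draw index set via glUniform1i):

```glsl
#version 300 es
// Illustrative sketch: per-object data packed as a struct array in one
// uniform block, selected by a per-draw index.
precision highp float;

struct ObjectData
{
    mat4 World;
    vec4 Tint;
};

layout(std140) uniform PerObjectBlock
{
    ObjectData Objects[128]; // 128 * 80 bytes, well under the 16 KB limit
};

uniform int ObjectIndex; // updated once per draw call

in vec4 InPosition;
uniform mat4 ViewProjection;

void main()
{
    gl_Position = ViewProjection * (Objects[ObjectIndex].World * InPosition);
}
```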
  5. It's bytes, and it refers to the offset only, so if it reports 256 you can do:

     glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );
     glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );
     ...
     glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, Range3 );

     @mhagain You talked about fewer updates; I do no updates, just bindings, and it's still slower than glUniforms. I'll probably try persistent mapping in GL next as well.
  6. I implemented this just a couple of days ago :) But the only samples I found on the matter used a uniform block to store the "pointer":

     #extension GL_ARB_bindless_texture : require
     layout(binding = 0) uniform BindlessTexturesBuffer
     {
         uniform uvec2 Texture0;
     };

     One thing that made my AMD driver corrupt pixels on my desktop was calling glMakeTextureHandleResidentARB more than once; I now check with glIsTextureHandleResidentARB before doing that. Other than that, it works OK on my AMD 280X, and I get a +6% performance boost just from using bindless textures.

     One more thing: subsequent calls to glTexParameter will fail. I think you can't modify the sampler state after making the handle resident, so I don't even call glActiveTexture/glBindTexture anymore when using bindless textures.
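For reference, a fragment-stage sketch of the pattern above, constructing the sampler from the stored handle (names are illustrative):

```glsl
#version 450 core
#extension GL_ARB_bindless_texture : require
// The uvec2 holds the 64-bit handle returned by glGetTextureHandleARB;
// ARB_bindless_texture allows constructing a sampler from it in-shader.
layout(binding = 0, std140) uniform BindlessTexturesBuffer
{
    uvec2 Texture0;
};

in vec2 TexCoord;
out vec4 FragColor;

void main()
{
    FragColor = texture(sampler2D(Texture0), TexCoord);
}
```

Note that the sampler state is baked into the handle when it's created, which matches the observation above that glTexParameter calls fail once the handle is resident.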
  7. Yeah, but sharing a small UBO between all programs while still having a second UBO per program is kind of bad too. I haven't tested this in GL, but I did it in D3D11: I had 3 CBs per object with different update frequencies (per frame, per material, and per object), and using 3 CBs per object was always slower than using just one. This happened on both my old HD5770 and my newer HD7850; I initially blamed the first-generation DX11 drivers, but apparently that wasn't it. Some still argue that they see benefits from sharing some buffers, but I haven't found a case in that favor yet. I tried with 30-bone skinned meshes too, and it was still faster to keep all the constant data in one CB than to split the bones into one CB and the rest into another (so that I could, for example, reuse the bones CB in a shadow pass where I needed just that).

     UPDATE: I just implemented the global uniform buffer with glBindBufferRange calls. I discovered that the offset you pass to glBindBufferRange needs to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and I still observed a performance penalty of about 1% for using a dozen glBindBufferRange calls compared to plain glUniform calls. I'll also add that for glUniform calls I have a state minimizer that compares the data with what was previously set (e.g. from a previous object), and if it's the same, it skips the glUniform call. The comparison was done standing still, so I don't even map the buffer to update it, because no data is moving/changing.

     UPDATE 2: I did it on D3D11.1 using VSSetConstantBuffers1 and PSSetConstantBuffers1, and there is indeed a difference of approximately <=0.4%. I already had 99% GPU usage, though, so that's not all that unexpected; on GL I had around ~63% GPU usage while doing mostly the same thing. Perhaps it's because I have relatively small constant buffers? I have around 464 bytes per object, which has to be rounded up to 512 for alignment purposes.
  8. @mhagain I thought I was doing something wrong as well when I saw they're slower (because nobody will tell you that, just that using them is cool and fast). I too turned all my uniforms into a single UBO, updating and binding it on a per-object basis (just like in D3D11), and it's slower. I even made it so that if the camera doesn't move, the update doesn't even happen, but binding 100 UBOs (one per draw call) was still slower than using a plain dozen glUniform calls, around 1-3% slower.

     I was planning to do #3 as described above when I found this topic :).
  9. Updating my post because I'd been wondering for a long time how Nvidia fares with separate shader objects. Turns out, not that great. On a GTX 650 with the latest 335 drivers I get 290 FPS with SSO on and 308 FPS without, so there's still a noticeable performance penalty. And I'm creating one pipeline object per shader pair.

     I'm also seeing some weird behavior with objects that don't use textures.
  10. GL_TEXTURE_SPARSE_ARB on Nvidia

    After a couple of hours of swearing, I realized the code from the slides is just wrong! Thanks to Christophe Riccio and his samples ( http://ogl-samples.g-truc.net/ ), the code on the slides should've been:

    glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_X_ARB, 1, &page_sizes_x[0] );

    The difference being that it's GL_RGBA8 and then GL_VIRTUAL_PAGE_SIZE_X_ARB, and not the other way around!
  11. GL_TEXTURE_SPARSE_ARB on Nvidia

    I can confirm that it doesn't work for me either. I tried texture 2D and texture 2D arrays, and tried calling glTexStorage2D/3D both before and after; I'm on 335.23 too, on Windows 8.1 with a GTX 650.
  12. I missed that. That doesn't sound right at all. You could try posting on the AMD devguru website, but I wouldn't hold my breath for a response. I was having so many issues with AMD and OpenGL that I just gave up and bought a new GPU from Nvidia.

      So what is your performance difference on Nvidia hardware with separate shader objects? The same? Faster/slower?
  13. Uhm, no. http://www.opengl.org/registry/specs/ARB/separate_shader_objects.txt

      Subroutines are available in 4.0 and separate shader objects in 4.1, so if you're already using one, it's not much of a stretch to use the other.

      Have you ever used separate shader objects? I already said in this thread that I get a 50% performance penalty if I use them on AMD hardware.
  14. I already have macros; that's my standard way of getting 50 different shader combinations out of an uber shader that has it all. The number is larger purely because in GL you need a monolithic program; in DX it's more like 30.

      Ideally, the subroutines would just be jumps in the shader code, or even better, when the shader code is copied from GPU memory to the L1 cache (or wherever) for instruction decoding, only the currently bound subroutines' code would be copied.
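The macro approach reads roughly like this: the same source is compiled once per macro combination, with the relevant #define lines prepended at compile time (a hypothetical skeleton, not the actual uber shader):

```glsl
#version 330 core
// Fragment stage of a hypothetical uber shader; each variant comes from
// compiling this once with a different set of #defines prepended, e.g.
// "#define USE_TEXTURE" and/or "#define USE_ALPHA_TEST".
in vec2 TexCoord;
uniform vec4 DiffuseColor;
#ifdef USE_TEXTURE
uniform sampler2D DiffuseTexture;
#endif
out vec4 FragColor;

void main()
{
    vec4 color = DiffuseColor;
#ifdef USE_TEXTURE
    color *= texture(DiffuseTexture, TexCoord);
#endif
#ifdef USE_ALPHA_TEST
    if (color.a < 0.5)
        discard;
#endif
    FragColor = color;
}
```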
  15. OK, so after a lot more work I managed to set up subroutines for all of my 60 or so monolithic shaders. So instead of swapping between 60 different shaders, it uses a single shader with around 5 subroutine types (skinning on/off, normals/lighting on/off, texturing on/off, and alpha testing on/off), 2 in the VS and 3 in the FS (the normals one is in both the VS and FS). My conclusion is that it's slower, even though I get 98% GPU usage: I now get 118 FPS with subroutines versus 163 FPS without them.

      However, there are some slight differences, especially in the skinning department. Since the shader takes bone indices and weights for ALL objects, I assume there's some wasted caching involved when the current object doesn't provide indices & weights; even though they're not used by the active subroutines, they might still be fetched.

      I also tried testing the shader without any object skinning, so the only difference in inputs is with and without normals/texcoords. Still about 50+ shader swaps, with a different camera angle though; the performance is 148 FPS with subroutines vs 180 without. Still a net loss, so I'm really wondering what the threshold for a performance gain is, or whether the AMD driver is just really poor.