space_cadet

OpenGL ES 2.0 on Android: how to render 500 cubes effectively

Hi,

I am trying to simulate a starfield of rotating cubes moving towards the user:

[img]http://i.imgur.com/7d9zu.png[/img]

I am struggling to maintain a smooth animation on Android. I can't see why a Dual-core 1 GHz Cortex-A9 with a ULP GeForce GPU should not be able to do that. Here's what I do:

INITIALIZATION:

- Create a VBO containing 36 cube position vertices and 36 cube normal vertices, interleaved
- Set up a simple shader that takes positions, normals, color, MV matrix, MVP matrix, light position
- Connect the VBO to the shader's position and normal attribute respectively

EACH FRAME:

a] Movement (for each cube)

- calculate scaling matrix, rotation matrix A, translation matrix, rotation matrix B
- multiply all of the above matrices to obtain model matrix
- multiply model matrix and view matrix to obtain MV matrix
- multiply MV matrix and projection matrix to obtain MVP matrix

b] Draw (for each cube)

- hook up cube's MV matrix with shader (glUniformMatrix4fv)
- hook up cube's MVP matrix with shader (glUniformMatrix4fv)
- hook up cube's color constant with shader (glVertexAttrib4f)
- draw cube (glDrawArrays)
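
For readers following along, here is a minimal sketch of the per-cube matrix chain listed above, written in plain Python so the arithmetic is easy to check. The function names, the choice of z-axis rotations and the multiplication order (scale, rotation A, translation, rotation B, with column vectors) are assumptions for illustration; the original post does not say which axes or order are used.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def model_matrix(scale, spin, position, orbit):
    # Column-vector convention: the rightmost matrix is applied first, so a
    # vertex is scaled, spun (rotation A), moved, then orbited (rotation B).
    m = mat_mul(rotation_z(spin), scaling(scale))
    m = mat_mul(translation(*position), m)
    return mat_mul(rotation_z(orbit), m)

def mv_and_mvp(model, view, projection):
    # MV = view * model, MVP = projection * MV
    mv = mat_mul(view, model)
    return mv, mat_mul(projection, mv)

def transform_point(m, p):
    """Apply a 4x4 matrix to a 3D point with w = 1."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(4))
```

Under these assumptions each cube costs five 4x4 multiplications per frame (three for the model matrix, two for MV and MVP), so 500 cubes need 2,500 of them before a single draw call is issued.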

I have done a lot of profiling and timing analysis, so I don't think I have any unintended performance leaks in my code. Rather, I believe there is a more general design flaw, and I hope someone can help me fix it.

Do you have any profiling tools that allow you to measure CPU and GPU timings independently? The first step will be determining which processor is the bottleneck, so you can focus your optimisations usefully.

That said, a solution will probably involve calling glDraw* less often than once per cube.
On the CPU side, every "batch" of geometry has a certain amount of overhead, so you want to reduce the total number of [font=courier new,courier,monospace]gl*[/font] calls, particularly [font=courier new,courier,monospace]glDraw*[/font] calls.
On the GPU side, you want every [font=courier new,courier,monospace]glDraw*[/font] "batch" to contain as many vertices and as many pixels as possible. It seems like many of your cubes cover hardly any pixels, which may greatly exaggerate the per-draw overheads.
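
To put rough numbers on the call overhead, here is a small back-of-the-envelope sketch. The function name is made up, and it assumes that each batch needs a fixed number of state-setting calls plus exactly one draw call, which is a simplification.

```python
import math

def gl_calls_per_frame(num_cubes, cubes_per_batch, state_calls_per_batch):
    """Estimate GL calls per frame: state calls plus one draw call per batch."""
    batches = math.ceil(num_cubes / cubes_per_batch)
    return batches * (state_calls_per_batch + 1)

# One cube per draw, with the three state calls from the original post
# (two glUniformMatrix4fv and one glVertexAttrib4f):
per_cube = gl_calls_per_frame(500, 1, 3)

# A hypothetical batch size of 16 cubes with the same three state calls:
batched = gl_calls_per_frame(500, 16, 3)
```

With these assumptions the per-cube approach issues 2,000 calls per frame, while batches of 16 bring that down to 128.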

[quote name='Hodgman' timestamp='1353758793' post='5003722']That said, a solution will probably involve calling glDraw* less often than once per cube.[/quote]

True, but [i]how[/i]? Since every cube needs its own scaling, rotation and translation, how can I combine glDrawArrays() calls?

[quote name='space_cadet' timestamp='1353759859' post='5003725']
[quote name='Hodgman' timestamp='1353758793' post='5003722']That said, a solution will probably involve calling glDraw* less often than once per cube.[/quote]

True, but [i]how[/i]? Since every cube needs its own scaling, rotation and translation, how can I combine glDrawArrays() calls?
[/quote]

Option 1: Software transform the vertices. That is, do the matrix multiply on the CPU, then you can send all the cubes in one go.
Option 2: 'Hardware skinning' style solution. Instead of doing one cube at a time, put 16* cubes into your VBO. The vertices for each cube have indices, which you use in your vertex shader to look up into an array of matrix uniforms.

From experience doing similar things on iOS, I'd expect option 1 to be the better choice.

*You'll probably want this number to be as large as possible within the constraints of the maximum amount of uniform space your GPU supports.
Edited by C0lumbo
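
A minimal sketch of option 1, assuming row-major 4x4 model matrices stored as nested lists and positions only. The helper names are hypothetical; in the real application the resulting flat array would be uploaded to a dynamic VBO (for example with glBufferData or glBufferSubData) and drawn with a single glDrawArrays call.

```python
def transform_point(m, p):
    """Apply a row-major 4x4 matrix (nested lists) to a 3D point, w = 1."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3]
                 for i in range(3))

def build_batch(cube_vertices, model_matrices):
    """Pre-transform every cube on the CPU and pack one flat vertex array."""
    batch = []
    for m in model_matrices:          # one matrix per cube
        for p in cube_vertices:       # the same source vertices for every cube
            batch.extend(transform_point(m, p))
    return batch

def floats_per_frame(num_cubes, vertices_per_cube=36, components=3):
    """Size of the position data that has to be rewritten each frame."""
    return num_cubes * vertices_per_cube * components
```

For 500 cubes of 36 vertices this means transforming 18,000 positions (54,000 floats) per frame on the CPU; normals would need the same treatment with the translation left out.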

You don't need a matrix multiply to translate, for example:
a 1D, 2D or 3D translation is 3, 6 and 9 flops respectively.

Rotation is harder to optimize, but a 2D rotation is a lot cheaper.
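
As an illustration of that point, a short sketch with made-up helper names: a translation is just a component-wise addition, and a rotation about a single axis only touches two coordinates.

```python
import math

def translate(p, offset):
    """Translate a 3D point with three additions instead of a 4x4 multiply."""
    return (p[0] + offset[0], p[1] + offset[1], p[2] + offset[2])

def rotate_2d(p, angle):
    """Rotate a 3D point about the z axis; only x and y change."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])
```

With the sine and cosine computed once per cube, rotate_2d costs 4 multiplications and 2 additions per point, compared with 16 multiplications and 12 additions for a full 4x4 matrix times a 4-component vector.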

I know this thread is a month old, but I have invested serious thought and work into the project, and I'd like to post an update to show that I appreciate your answers, and to help other people with similar issues.

PROFILING

[quote name='Hodgman' timestamp='1353758793' post='5003722']Do you have any profiling tools that allow you to measure CPU and GPU timings independently? The first step will be determining which processor is the bottleneck, so you can focus your optimisations usefully.[/quote]

I believe that you have a very good point here; I have mostly been "optimizing blindly", which is considered bad practice. I have tried / considered the following options:

1. Android SDK Tools: the profiler shipped with the SDK shows CPU time, but not GPU time. Also, it only profiles my own application, whereas I would like to see what's going on in the whole system. Google has recognized this and published a tool for system-wide profiling called systrace, but it's only available for Jelly Bean. The same goes for dumpsys gfxinfo, which in combination with the Jelly Bean developer option "Profile GPU Rendering" outputs statistics on how much time is spent processing, drawing and swapping the UI. See Chet Haase and Romain Guy's excellent presentation about Android UI performance for more information about those tools. For me, they are not an option, because I am stuck with Honeycomb for various reasons.

2. Log output: yes, I know it is stone age, but I thought it would be interesting to see how much time my application spends in my move() (CPU) and draw() (GPU) methods. The results are not very conclusive; I guess this has to do with multithreading and the way Android handles vsync.

3. NVIDIA Tools: there is a tool for Tegra 2 platforms called PerfHUD ES that looks very promising: it shows detailed information about draw calls and lots of other GPU-related data. I am currently trying to get this running on Honeycomb. Any help is appreciated.

OPTIMIZING

[quote name='C0lumbo']
Option 1: Software transform the vertices. That is, do the matrix multiply on the CPU, then you can send all the cubes in one go.
Option 2: 'Hardware skinning' style solution. Instead of doing one cube at a time, put 16* cubes into your VBO. The vertices for each cube have indices, which you use in your vertex shader to look up into an array of matrix uniforms.
[/quote]

Both of your options seem very reasonable approaches to take. I decided to implement option 2 first. The OpenGL ES specification calls this method "Dynamic Indexing".  It took me an hour to rearrange my application accordingly, and a whole day to find out how to get hold of the vertex index inside the shader and use it to address my transformation matrices. It's not straightforward on OpenGL ES 2.0 because for some fucking reason they decided to leave out the crucial gl_VertexID variable there. Sorry, but this tiny little detail really drove me mad. Anyways, the solution is quite simple, once you know how to do it. Anton Holmquist has a short, to-the-point blog post about it, which I wish I'd found earlier.
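
For anyone facing the same gl_VertexID problem, here is a sketch of one common workaround, which is not necessarily the one from the blog post: store the cube index as an extra per-vertex attribute in the VBO. The function name is made up, and the index is kept as a float on the assumption that the shader declares a float attribute and casts it with int() before indexing the uniform array.

```python
VERTICES_PER_CUBE = 36  # from the original post: 36 vertices per cube

def build_index_attribute(cubes_per_batch, vertices_per_cube=VERTICES_PER_CUBE):
    """Return one float per vertex holding the index of the cube it belongs to."""
    # ES 2.0 vertex attributes are floating point, hence float indices.
    return [float(cube)
            for cube in range(cubes_per_batch)
            for _ in range(vertices_per_cube)]

# Hypothetical batch of 16 cubes: 576 vertices, the first 36 tagged 0.0,
# the next 36 tagged 1.0, and so on up to 15.0.
index_attribute = build_index_attribute(16)
```

This attribute array is static, so it only has to be uploaded once alongside the cube positions and normals.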

 

The one big drawback of this method is, as you pointed out, the limited uniform space. For those who don't have a clue what that is (I didn't either): it is the space available for declaring uniform variables in the shader. I read somewhere that this space relates to the number / size of registers the GPU has - correct me if I'm wrong here. For anyone interested, calling glGetIntegerv() with GL_MAX_VERTEX_UNIFORM_VECTORS will return a number that says how much uniform space you have on your system. The specification for OpenGL ES 2.0 says it has to be at least 128. The number is expressed in 4-component vectors, and since each 4x4 matrix takes up 4 vectors, that means you could declare a maximum of 32 matrix variables or, in my case, two arrays of 16 matrices. In other words, I can batch a maximum of 16 cubes now.
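
The budget arithmetic can be written down as a tiny sketch. The function and parameter names are made up, and it assumes the shader compiler packs uniforms exactly as counted here, which a real driver is not guaranteed to do.

```python
VECTORS_PER_MAT4 = 4  # a 4x4 matrix occupies four 4-component vectors

def max_batch_size(max_uniform_vectors, matrix_arrays=2, reserved_vectors=0):
    """Cubes per batch when each cube needs one mat4 in every uniform array."""
    available = max_uniform_vectors - reserved_vectors
    return available // (VECTORS_PER_MAT4 * matrix_arrays)

two_arrays = max_batch_size(128)                    # MV and MVP arrays
one_array = max_batch_size(128, matrix_arrays=1)    # a single matrix array
with_light = max_batch_size(128, reserved_vectors=1)
```

With the guaranteed minimum of 128 vectors this gives 16 cubes for two arrays and 32 for one. If one vector is set aside for another uniform, such as the light position, the count for two arrays drops to 15 on a device that reports exactly 128.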

 

 

NEXT

Dynamic indexing has certainly improved performance, but I am not entirely happy yet. I will implement the above-mentioned option 1 next, hoping to improve performance by shifting work from the GPU to the CPU and, as a further step, parallelize the move() and draw() operations.

Doing the matrix calculations in software shouldn't be too bad. But if you are CPU limited (which may be the case on Android), then here is an alternative that does the matrix calculation on the GPU:

Pass a vec4 Rotation (rotX, rotY, rotZ, angle) to your vertex shader. Then calculate the rotation matrix in your shader:

 

[code]
mat4 transform = mat4(1.0); // identity
// ... make rotation matrix from axis and angle ...
gl_Position = projection * view * transform * a_position;
[/code]
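
The elided "make rotation matrix from axis and angle" step is usually done with the standard axis-angle (Rodrigues) formula. Below is a sketch of that formula in Python so it can be checked on the CPU first; the function name is made up, and porting it to the vertex shader is left as an exercise, not something the post above spells out.

```python
import math

def rotation_from_axis_angle(axis, angle):
    """Return a row-major 3x3 rotation matrix for a rotation about `axis`."""
    x, y, z = axis
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length  # normalize the axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def rotate(m, v):
    """Apply a 3x3 matrix to a 3D vector."""
    return tuple(m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2]
                 for i in range(3))
```

Note that this costs one sine and one cosine per vertex when done in the vertex shader, so whether it beats the CPU version will depend on the device.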

 

Theoretically you could put the Position and Normal attributes into a separate static VBO since they are unchanged (and update the view matrix instead). If you fill all your instance data in one go, you could end up with a single draw call per frame.

I do think the software approach will beat the vertex shader approach, so you should probably go ahead and give that a shot.

But - I think you should be able to manage more than 16 cubes per batch. Firstly, you don't need a full 4x4 matrix - you can switch to 4x3 matrices and implicitly assume that the last row is 0, 0, 0, 1. You might need to transpose your matrices to achieve this.

Also, I don't think you need two arrays of matrices. I assume one of the matrices is for your normals, and one for the positions. But you can usually use the same matrix for your positions and your normals and simply ignore the translation for the normals (this holds as long as the scaling is uniform, and the normal is renormalized afterwards).

So, by my reckoning you should be able to manage 128/3=42 cubes each batch. Your GLSL code might end up looking a bit like this (untested, not compiled):

[code]
uniform vec4 g_vMatrices[42*3]; // three vec4 rows per cube

...

// ES 2.0 has no integer attributes, so declare the cube index as a float and cast it.
int iRow = int(iCubeIndexAttribute) * 3;
vec4 vPos = vec4(vPositionAttribute, 1.0); // w = 1.0: translation applies
vec4 vNrm = vec4(vNormalAttribute, 0.0);   // w = 0.0: translation is ignored
vWorldPos.x = dot(g_vMatrices[iRow], vPos);
vWorldPos.y = dot(g_vMatrices[iRow+1], vPos);
vWorldPos.z = dot(g_vMatrices[iRow+2], vPos);
vWorldNormal.x = dot(g_vMatrices[iRow], vNrm);
vWorldNormal.y = dot(g_vMatrices[iRow+1], vNrm);
vWorldNormal.z = dot(g_vMatrices[iRow+2], vNrm);
[/code]


Oops - just realised you'd then need to transform the positions by the viewproj matrix which will take up a few more uniforms, so you'll end up with only 41 cubes per batch. Edited by C0lumbo
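
The same idea can be checked on the CPU with a small sketch. The helper names are made up, matrices are assumed to be row-major with the translation in the fourth column, and the budget function assumes exactly 4 vectors are reserved for the view-projection matrix.

```python
def pack_rows(m):
    """Keep the first three rows of a row-major 4x4 matrix (last row is 0,0,0,1)."""
    return [tuple(m[0]), tuple(m[1]), tuple(m[2])]

def dot4(a, b):
    return sum(x * y for x, y in zip(a, b))

def transform(rows, v, w):
    """Use w = 1.0 for positions and w = 0.0 for normals (drops the translation)."""
    v4 = (v[0], v[1], v[2], w)
    return tuple(dot4(row, v4) for row in rows)

def cubes_per_batch(max_uniform_vectors=128, reserved_vectors=4):
    """Three vec4 rows per cube, after reserving space for other uniforms."""
    return (max_uniform_vectors - reserved_vectors) // 3
```

With nothing reserved, 128 vectors hold 42 cubes; reserving 4 vectors for the view-projection matrix leaves room for 41, which matches the correction above.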
