Sign in to follow this  
space_cadet

OpenGL OpenGL ES 2.0 on Android: how to render 500 cubes effectively

Recommended Posts

space_cadet    143
Hi,

I am trying to simulate a starfield of rotating cubes moving towards the user:

[img]http://i.imgur.com/7d9zu.png[/img]

I am struggling to maintain a smooth animation on Android. I can't see why a Dual-core 1 GHz Cortex-A9 with a ULP GeForce GPU should not be able to do that. Here's what I do:

INITIALIZATION:

- Create a VBO containing 36 cube position vertices and 36 cube normal vertices, interleaved
- Set up a simple shader that takes positions, normals, color, MV matrix, MVP matrix, light position
- Connect the VBO to the shader's position and normal attribute respectively

EACH FRAME:

a] Movement (for each cube)

- calculate scaling matrix, rotation matrix A, translation matrix, rotation matrix B
- multiply all of the above matrices to obtain model matrix
- multiply model matrix and view matrix to obtain MV matrix
- multiply MV matrix and projection matrix to obtain MVP matrix

b] Draw (for each cube)

- hook up cube's MV matrix with shader (glUniformMatrix4fv)
- hook up cube's MVP matrix with shader (glUniformMatrix4fv)
- hook up cube's color constant with shader (glVertexAttrib4f)
- draw cube (glDrawArrays)

I have done a lot of profiling and timing analyzation, so I don't think I have any unintended performance leaks in my code. Rather, I believe there is a more general design flaw, and I hope someone can help me improve it.

Share this post


Link to post
Share on other sites
Hodgman    51223
Do you have any profiling tools that allow you to measure CPU and GPU timings independently? The first step will be determining which processor is the bottleneck, so you can focus your optimisations usefully.

That said, a solution will probably involve call glDraw less often than once-per-cube.
On the CPU side, every "batch" of geometry has a certain amount of overhead, so you want to reduce the total amount of [font=courier new,courier,monospace]gl*[/font] calls, particularly [font=courier new,courier,monospace]glDraw*[/font] calls.
On the GPU side, you want every [font=courier new,courier,monospace]glDraw*[/font] "batch" to contain as many vertices and as many pixels as possible. It seems like many of your cubes cover hardly any pixels, which may greatly exaggerate the per-draw overheads.

Share this post


Link to post
Share on other sites
space_cadet    143
[quote name='Hodgman' timestamp='1353758793' post='5003722']That said, a solution will probably involve call glDraw less often than once-per-cube.[/quote]

True, but [i]how[/i] ? Since every cube needs its own scaling, rotation and translation, how can I combine glDrawArrays() calls?

Share this post


Link to post
Share on other sites
C0lumbo    4411
[quote name='space_cadet' timestamp='1353759859' post='5003725']
[quote name='Hodgman' timestamp='1353758793' post='5003722']That said, a solution will probably involve call glDraw less often than once-per-cube.[/quote]

True, but [i]how[/i] ? Since every cube needs its own scaling, rotation and translation, how can I combine glDrawArrays() calls?
[/quote]

Option 1: Software transform the vertices. That is, do the matrix multiply on the CPU, then you can send all the cubes in one go.
Option 2: 'Hardware skinning' style solution. Instead of doing one cube at a time, put 16* cubes into your VBO. The vertices for each cube have indices, which you use in your vertex shader to look up into an array of matrix uniforms.

From experience doing similar things on iOS, I'd expect option 1 to be the better choice.

*Probably you'll want this number to be as large as possible within the constaints of the max amount of uniform space your GPU supports. Edited by C0lumbo

Share this post


Link to post
Share on other sites
space_cadet    143

I know this thread is a month old, but I have invested serious thought and work into the project, and I'd like to post an update to show that I aprecciate your answers, and to help other people with similar issues.

 

 

PROFILING

 

 

Do you have any profiling tools that allow you to measure CPU and GPU timings independently? The first step will be determining which processor is the bottleneck, so you can focus your optimisations usefully.

 

 

 

I believe that you have a very good point here; I have mostly been "optimizing blindly", which is considered bad practice. I have tried / considered the following options:

 

 

1. Android SDK Tools: the profiler shipped with the SDK shows CPU time, but not GPU time. Also, it profiles my own application, but I would like to see what's going on in the whole system. Google has recognized this and published a tool for system-wide profiling called systrace, but it's only available for Jelly Bean. The same goes for dumpsys gfxinfo which in combination with the Jelly Bean developer option "Profile GPU Rendering" outputs a stat about how much time is spent processing, drawing and swapping UI. See Chet Haase and Guy Romain's excellent presentation about Android UI performance for more information about those tools. For me, they are not an option, I am stuck with Honeycomb for various reasons.

 

 

 

2. Log output: yes I know it is stone age, but I thought it would be interesting to see how much time my application spends in my move() (CPU) and draw() (GPU) methods. The results are not very conclusive; I guess this has to to with multithreading and the way Android handles vsync.

 

 

3. NVIDIA Tools: there is a tool for Tegra 2 platforms called PerfHUD ES that looks very promising: detailed information about draw calls and lots of other GPU-related information. I am currently trying to get this running on honeycomb. Any help aprecciated.

 

 

OPTIMIZING

 

 

 

Option 1: Software transform the vertices. That is, do the matrix multiply on the CPU, then you can send all the cubes in one go.
Option 2: 'Hardware skinning' style solution. Instead of doing one cube at a time, put 16* cubes into your VBO. The vertices for each cube have indices, which you use in your vertex shader to look up into an array of matrix uniforms.

 

 

 

Both of your options seem very reasonable approaches to take. I decided to implement option 2 first. The OpenGL ES specification calls this method "Dynamic Indexing".  It took me an hour to rearrange my application accordingly, and a whole day to find out how to get hold of the vertex index inside the shader and use it to address my transformation matrices. It's not straightforward on OpenGL ES 2.0 because for some fucking reason they decided to leave out the crucial gl_VertexID variable there. Sorry, but this tiny little detail really drove me mad. Anyways, the solution is quite simple, once you know how to do it. Anton Holmquist has a short, to-the-point blog post about it, which I wish I'd found earlier.

 

The one big drawback about this method is, like you pointed out, the limited uniform space. For those who don't have a clue what that is (like I did): it is the space available for declaring uniform variables in the shader. I read somewhere that this space relates to the amount / size of registers the GPU has - correct me if I'm wrong here. For anyone interested, calling glGetIntegerv(GL_MAX_VERTEX_UNIFORM_VECTORS) will return a number that says how much uniform space you have on your system. The specification for OpenGL ES 2.0 says it has to be at least 128. The number is expressed in vectors, and since each matrix has 4 vectors, that means you could declare a maximum of 32 matrix variables or, in my case, two arrays of 16 matrices. In other words, I can batch a maximum of 16 cubes now. 

 

 

NEXT

 

Dynamic indexing has certainly improved performance, but I am not entirely happy yet. I will implement the abovementioned option 1, hoping to improve performance by shifting work from the GPU to the CPU and, as a next step, parallize the move() and draw() operations.

Share this post


Link to post
Share on other sites
mattdesl    176

Doing the the matrix calculations in software shouldn't be too bad. But if you are CPU limited (maybe the case on Android) then here is an alternative that does the matrix calculation on the GPU:

 

Pass a vec4 Rotation (rotX, rotY, rotZ, angle) to your vertex shader. Then calculate the rotation matrix in your shader:

 

[code]mat4 transform = mat4(1.0); //identity   ... make rotation matrix from axis and angle ...   gl_Position = projection * view * transform * a_position;[/code]

 

Theoretically you could put the Position and Normal attributes into a separate static VBO since they are unchanged (and update the view matrix instead). If you fill all your instance data in one go, you could end up with a single draw call per frame.

Share this post


Link to post
Share on other sites
C0lumbo    4411
I do think the software approach will beat the vertex shader approach, so you should probably go ahead and give that a shot.

But - I think you should be able to manage more than 16 cubes per batch. Firstly, you don't need a full 4x4 matrix - you can switch to 4x3 matrices and implicitly assume that the last row is 0, 0, 0, 1. You might need to transpose your matrices to achieve this.

Also, I don't think you need two arrays of matrices. I assume one of the matrices is for your normals, and one for the positions. But you can usually use the same matrix for your positions and your normals and simply ignore the position for the normals.

So, by my reckoning you should be able to manage 128/3=42 cubes each batch. Your GLSL code might end up looking a bit like this (untested, not compiled):

uniform float4 g_vMatrices[42*3];

...

vWorldPos.x = dot(g_vMatrices[iCubeIndexAttribute*3], float4(vPositionAttribute, 1.0));
vWorldPos.y = dot(g_vMatrices[iCubeIndexAttribute*3+1], float4(vPositionAttribute, 1.0));
vWorldPos.z = dot(g_vMatrices[iCubeIndexAttribute*3+2], float4(vPositionAttribute, 1.0));
vWorldNormal.x = dot(g_vMatrices[iCubeIndexAttribute*3], float4(vNormalAttribute, 0.0));
vWorldNormal.y = dot(g_vMatrices[iCubeIndexAttribute*3+1], float4(vNormalAttribute, 0.0));
vWorldNormal.z = dot(g_vMatrices[iCubeIndexAttribute*3+2], float4(vNormalAttribute, 0.0));


Oops - just realised you'd then need to transform the positions by the viewproj matrix which will take up a few more uniforms, so you'll end up with only 41 cubes per batch. Edited by C0lumbo

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this  

  • Similar Content

    • By Kjell Andersson
      I'm trying to get some legacy OpenGL code to run with a shader pipeline,
      The legacy code uses glVertexPointer(), glColorPointer(), glNormalPointer() and glTexCoordPointer() to supply the vertex information.
      I know that it should be using setVertexAttribPointer() etc to clearly define the layout but that is not an option right now since the legacy code can't be modified to that extent.
      I've got a version 330 vertex shader to somewhat work:
      #version 330 uniform mat4 osg_ModelViewProjectionMatrix; uniform mat4 osg_ModelViewMatrix; layout(location = 0) in vec4 Vertex; layout(location = 2) in vec4 Normal; // Velocity layout(location = 3) in vec3 TexCoord; // TODO: is this the right layout location? out VertexData { vec4 color; vec3 velocity; float size; } VertexOut; void main(void) { vec4 p0 = Vertex; vec4 p1 = Vertex + vec4(Normal.x, Normal.y, Normal.z, 0.0f); vec3 velocity = (osg_ModelViewProjectionMatrix * p1 - osg_ModelViewProjectionMatrix * p0).xyz; VertexOut.velocity = velocity; VertexOut.size = TexCoord.y; gl_Position = osg_ModelViewMatrix * Vertex; } What works is the Vertex and Normal information that the legacy C++ OpenGL code seem to provide in layout location 0 and 2. This is fine.
      What I'm not getting to work is the TexCoord information that is supplied by a glTexCoordPointer() call in C++.
      Question:
      What layout location is the old standard pipeline using for glTexCoordPointer()? Or is this undefined?
       
      Side note: I'm trying to get an OpenSceneGraph 3.4.0 particle system to use custom vertex, geometry and fragment shaders for rendering the particles.
    • By markshaw001
      Hi i am new to this forum  i wanted to ask for help from all of you i want to generate real time terrain using a 32 bit heightmap i am good at c++ and have started learning Opengl as i am very interested in making landscapes in opengl i have looked around the internet for help about this topic but i am not getting the hang of the concepts and what they are doing can some here suggests me some good resources for making terrain engine please for example like tutorials,books etc so that i can understand the whole concept of terrain generation.
       
    • By KarimIO
      Hey guys. I'm trying to get my application to work on my Nvidia GTX 970 desktop. It currently works on my Intel HD 3000 laptop, but on the desktop, every bind textures specifically from framebuffers, I get half a second of lag. This is done 4 times as I have three RGBA textures and one depth 32F buffer. I tried to use debugging software for the first time - RenderDoc only shows SwapBuffers() and no OGL calls, while Nvidia Nsight crashes upon execution, so neither are helpful. Without binding it runs regularly. This does not happen with non-framebuffer binds.
      GLFramebuffer::GLFramebuffer(FramebufferCreateInfo createInfo) { glGenFramebuffers(1, &fbo); glBindFramebuffer(GL_FRAMEBUFFER, fbo); textures = new GLuint[createInfo.numColorTargets]; glGenTextures(createInfo.numColorTargets, textures); GLenum *DrawBuffers = new GLenum[createInfo.numColorTargets]; for (uint32_t i = 0; i < createInfo.numColorTargets; i++) { glBindTexture(GL_TEXTURE_2D, textures[i]); GLint internalFormat; GLenum format; TranslateFormats(createInfo.colorFormats[i], format, internalFormat); // returns GL_RGBA and GL_RGBA glTexImage2D(GL_TEXTURE_2D, 0, internalFormat, createInfo.width, createInfo.height, 0, format, GL_FLOAT, 0); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST); DrawBuffers[i] = GL_COLOR_ATTACHMENT0 + i; glBindTexture(GL_TEXTURE_2D, 0); glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i, textures[i], 0); } if (createInfo.depthFormat != FORMAT_DEPTH_NONE) { GLenum depthFormat; switch (createInfo.depthFormat) { case FORMAT_DEPTH_16: depthFormat = GL_DEPTH_COMPONENT16; break; case FORMAT_DEPTH_24: depthFormat = GL_DEPTH_COMPONENT24; break; case FORMAT_DEPTH_32: depthFormat = GL_DEPTH_COMPONENT32; break; case FORMAT_DEPTH_24_STENCIL_8: depthFormat = GL_DEPTH24_STENCIL8; break; case FORMAT_DEPTH_32_STENCIL_8: depthFormat = GL_DEPTH32F_STENCIL8; break; } glGenTextures(1, &depthrenderbuffer); glBindTexture(GL_TEXTURE_2D, depthrenderbuffer); glTexImage2D(GL_TEXTURE_2D, 0, depthFormat, createInfo.width, createInfo.height, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST); glBindTexture(GL_TEXTURE_2D, 0); glFramebufferTexture(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, depthrenderbuffer, 0); } if (createInfo.numColorTargets > 0) glDrawBuffers(createInfo.numColorTargets, DrawBuffers); else glDrawBuffer(GL_NONE); if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) std::cout << "Framebuffer Incomplete\n"; glBindFramebuffer(GL_FRAMEBUFFER, 0); width = createInfo.width; height = createInfo.height; } // ... // FBO Creation FramebufferCreateInfo gbufferCI; gbufferCI.colorFormats = gbufferCFs.data(); gbufferCI.depthFormat = FORMAT_DEPTH_32; gbufferCI.numColorTargets = gbufferCFs.size(); gbufferCI.width = engine.settings.resolutionX; gbufferCI.height = engine.settings.resolutionY; gbufferCI.renderPass = nullptr; gbuffer = graphicsWrapper->CreateFramebuffer(gbufferCI); // Bind glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo); // Draw here... // Bind to textures glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, textures[0]); glActiveTexture(GL_TEXTURE1); glBindTexture(GL_TEXTURE_2D, textures[1]); glActiveTexture(GL_TEXTURE2); glBindTexture(GL_TEXTURE_2D, textures[2]); glActiveTexture(GL_TEXTURE3); glBindTexture(GL_TEXTURE_2D, depthrenderbuffer); Here is an extract of my code. I can't think of anything else to include. I've really been butting my head into a wall trying to think of a reason but I can think of none and all my research yields nothing. Thanks in advance!
    • By Adrianensis
      Hi everyone, I've shared my 2D Game Engine source code. It's the result of 4 years working on it (and I still continue improving features ) and I want to share with the community. You can see some videos on youtube and some demo gifs on my twitter account.
      This Engine has been developed as End-of-Degree Project and it is coded in Javascript, WebGL and GLSL. The engine is written from scratch.
      This is not a professional engine but it's for learning purposes, so anyone can review the code an learn basis about graphics, physics or game engine architecture. Source code on this GitHub repository.
      I'm available for a good conversation about Game Engine / Graphics Programming
    • By C0dR
      I would like to introduce the first version of my physically based camera rendering library, written in C++, called PhysiCam.
      Physicam is an open source OpenGL C++ library, which provides physically based camera rendering and parameters. It is based on OpenGL and designed to be used as either static library or dynamic library and can be integrated in existing applications.
       
      The following features are implemented:
      Physically based sensor and focal length calculation Autoexposure Manual exposure Lense distortion Bloom (influenced by ISO, Shutter Speed, Sensor type etc.) Bokeh (influenced by Aperture, Sensor type and focal length) Tonemapping  
      You can find the repository at https://github.com/0x2A/physicam
       
      I would be happy about feedback, suggestions or contributions.

  • Popular Now