Jump to content

  • Log In with Google      Sign In   
  • Create Account

We're offering banner ads on our site from just $5!

1. Details HERE. 2. GDNet+ Subscriptions HERE. 3. Ad upload HERE.


[iOS] Drawing Anything Causes Terrible Performance


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
10 replies to this topic

#1 L. Spiro   Crossbones+   -  Reputation: 14198

Like
0Likes
Like

Posted 21 September 2012 - 01:14 AM

I have been trying to find the cause of our terrible iOS performance for a very long time, and have in the past narrowed it down to the presentRenderbuffer: signal.
However, my efforts were in vain because that was only the place where the problem was manifesting itself, not where the actual problem is.
The actual problem has to do with the inner workings of ::glFlush(), which is called by presentRenderbuffer:. I can replace presentRenderbuffer: with ::glFlush() and get the same results.

So what is the actual problem?
In my extremely exhaustive search for the problem, I stripped down the code to the point where it is issuing no draw calls. It does everything else, such as activate shaders, set/unset textures, change render targets, etc.
The result is an expected 60 FPS.

I then enabled a single draw call. A triangle strip with just 4 vertices. The shader for it simply reads a texture and returns the result. The purpose is simply to copy from the 480×320 back-buffer texture to the 960×640 render buffer.
Suddenly it drops to 30 FPS. A single draw call adds 16.6 milliseconds to my frame time.
That is ridiculous.


The renderer is running on a separate thread from the main iOS thread, but all the OpenGL ES 2 calls are on that thread.
This should not be a problem because my past engine was rendering from a second thread and could always reach 60 FPS in OpenGL ES 1.0.

Also, executing the draw command inside the OpenGL ES 2 driver does not take 16.6 milliseconds.
My current engine works the same way (rendering 3D to a back buffer and then copying it to the render buffer before calling presentRenderbuffer:) yet I can run physically based shaders and post-processing at 55-60 FPS, so I certainly won’t accept that, “switching render targets is just that slow,” or, “the fill-rate is just that bad.”


The render call is leaving behind side-effects that cause ::glFinish to stall unreasonably.
I can add back the rest of the render calls and it stays at 30 FPS. So it isn’t that any draw command itself is actually taking a long time, it is just causing an unnecessary stall.

I would normally assume it is waiting for V-blank, but that doesn’t fit, since it stalls for up to 32 milliseconds.

Here are the actual render commands I issue in one frame (resulting in 30 FPS):
#0 GL_NO_ERROR <- glGetError()
#1 GL_NO_ERROR <- glGetError()
#2 GL_NO_ERROR <- glGetError()
#3 glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, 8, 0)
#4 glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, 4)
#5 glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_STENCIL_ATTACHMENT, GL_RENDERBUFFER, 0)
#6 glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, {GL_DEPTH_ATTACHMENT})
#7 glClearDepthf(0.0000000)
#8 glClearStencil(0)
#9 glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, {GL_COLOR_ATTACHMENT0})
#10 glClear(GL_COLOR_BUFFER_BIT)
#11 glDepthMask(1)
#12 glClear(GL_DEPTH_BUFFER_BIT)
#13 glDepthMask(0)
#14 glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, {GL_STENCIL_ATTACHMENT})
#15 glClear(GL_STENCIL_BUFFER_BIT)
#16 glDepthMask(1)
#17 glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, 1)
#18 glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, 2)
#19 glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_STENCIL_ATTACHMENT, GL_RENDERBUFFER, 2)
#20 glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, {GL_DEPTH_ATTACHMENT})
#21 glBindTexture(GL_TEXTURE_2D, 8)
#22 glVertexAttribPointer(2, 3, GL_FLOAT, 0, 36, 0x07709cac)
#23 glVertexAttribPointer(1, 4, GL_FLOAT, 0, 36, 0x07709cb8)
#24 glVertexAttribPointer(0, 2, GL_FLOAT, 0, 36, 0x07709cc8)
#25 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR)
#26 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR)
#27 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE)
#28 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE)
#29 glUseProgram(79)
#30 glDrawArrays(GL_TRIANGLE_STRIP, 0, 4)
#31 glUseProgram(82)
#32 glUniformMatrix4fv(asTexStage[2], 1, 0, {-2.4142134, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 3.2390072, 0.0000904, -0.4472136, 0.0000000, -1.6195037, 0.0001807, -0.8944272, 0.0000000, 0.0000000, 9.8761530, 1118.0339355})
#33 glUniform4fv(csvConst_1, 3, {1.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.0000000, 0.0000000})
#34 glUniform4fv(7, 1, {1.0000000, 1.0000000, 1.0000000, 1.0000000})
#35 glVertexAttribPointer(1, 3, GL_FLOAT, 0, 52, 0x0885b990)
#36 glVertexAttribPointer(0, 4, GL_UNSIGNED_BYTE, 1, 52, 0x0885b9a8)
#37 glDepthMask(0)
#38 glUseProgram(110)
#39 glUniform4fv(csvConst_1, 2, {0.0020833, -0.0031250, 1.0000000, 1.0000000, -1.0000000, 1.0000000, 0.0000000, 0.0000000})
#40 glVertexAttribPointer(1, 4, GL_FLOAT, 0, 28, 0x0cbe3240)
#41 glVertexAttribPointer(0, 4, GL_UNSIGNED_BYTE, 1, 28, 0x0cbe3250)
#42 glBindTexture(GL_TEXTURE_2D, 2)
#43 glUseProgram(111)
#44 glUniform4fv(csvConst_1, 2, {0.0020833, -0.0031250, 1.0000000, 1.0000000, -1.0000000, 1.0000000, 0.0000000, 0.0000000})
#45 glVertexAttribPointer(1, 4, GL_FLOAT, 0, 28, 0x0cc3a040)
#46 glVertexAttribPointer(0, 4, GL_UNSIGNED_BYTE, 1, 28, 0x0cc3a050)
#47 glVertexAttribPointer(2, 2, GL_FLOAT, 0, 28, 0x0cc3a054)
#48 glDepthMask(1)
#49 glDepthMask(0)
#50 glUseProgram(82)
#51 glUniformMatrix4fv(asTexStage[2], 1, 0, {-0.6600000, 0.0000000, 0.0000000, 0.0000000, 0.2657602, 1.1154090, 0.0000596, -0.2951610, 0.5315203, 0.0171107, 0.0001193, -0.5903219, -432.1875000, -373.9130249, 10.0050621, 480.0000000})
#52 glUniform4fv(csvConst_1, 3, {-0.6600000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.5903220, -0.2951610, 0.0000000, 0.0000000, -0.2951610, -0.5903219, 0.0000000})
#53 glUniform4fv(7, 1, {1.0000000, 1.0000000, 1.0000000, 1.0000000})
#54 glVertexAttribPointer(1, 3, GL_FLOAT, 0, 52, 0x0ccd8860)
#55 glVertexAttribPointer(0, 4, GL_UNSIGNED_BYTE, 1, 52, 0x0ccd8878)
#56 [0x0000000002480500 presentRenderbuffer:GL_RENDERBUFFER]

This is with only the 1 draw call being allowed (line #30)
What could be causing such unreasonable stalls here?
Buffer resolves? Buffer restores?
Anyone ever see this before?


L. Spiro
It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

Sponsor:

#2 freakchild   Members   -  Reputation: 557

Like
1Likes
Like

Posted 21 September 2012 - 12:24 PM

Quick check...but was your renderbuffer storage allocated at a time where your orientation was the opposite of what your game is generally running? Even if orientation did not change (program might start out thinking it's portrait when it's not)? If so, reallocate it. I've seen large frame rate drops for that reason, although not as low as 30fps. It's worth ruling out though given it's a similar enough problem to yours.

A quick way of checking this would be via looking at the code or recollection from memory of course, but also don't overlook the Core Animation profiler and seeing if the non-fast path rendering is kicking in, not just for this reason but for others too. Definitely check that tool out because it's often overlooked in favor of the more comprehensive profiling options yet a handful of things it might highlight won't show up elsewhere.

Edited by freakchild, 21 September 2012 - 12:27 PM.


#3 L. Spiro   Crossbones+   -  Reputation: 14198

Like
0Likes
Like

Posted 21 September 2012 - 03:45 PM

I am about 99% sure it is allocated after the device is rotated, since the image is not distorted (in the past it was not allocated at the appropriate time and the content was stretched as a result), but how do I check for sure (other than the fact that I am 100% that the device orientation is “landscape” at the time of allocation)?

I mainly spend my time with Core Animation, Time Profiler, and CPU Analyzer, but I don’t know what you mean by the non-fast path? Where does the Core Animation profiler show that?
By the way, CPU Analyzer shows that both the main iOS thread and our main game thread are spending an unreasonable amount of time idle. In fact,more time in the idle state than in any other state. Likely because of ::glFlush(), since that is the only place our main thread can be set to sleep/wait.


L. Spiro
It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#4 freakchild   Members   -  Reputation: 557

Like
1Likes
Like

Posted 21 September 2012 - 05:09 PM

I don't think you can tell based on whether or not it is distorted or whether or not other things are seemingly correct - I seem to think your other gl calls will correct distortion alone regardless of whether it was allocated/re-allocated for the current orientation or not and this particular oddity is very much a non-obvious under the hood thing anyway.

If you think everything is okay, it probably isn't this problem, but yes if that's where you think you are then the easiest way to rule it out is to do this very specific profile. This 'fast path or not' test is in the Core animation profiler. You'll want to be looking at the core animation options (not the time profiling ones) and under that, the debug options. Check the 'Color OpenGL Fast Path blue' option and run the app. Everything that is rendered via the fast path will be colored with a blue tint. You might however also like the other profile options in here too, but I'd flick them on and off one at a time only...you'll see why.

I believe this profiling tool only works on the main renderbuffer - not an off screen target so if you're rendering to an off-screen buffer then doing a copy/present (as you appear to be) then only the whole screen will be blue (or not), but that's enough to expose this problem in this case. For direct to renderbuffer rendering this is actually a pretty useful tool as it will highlight individual polygons that for some reason have fallen back to the slower path.

Bear in mind that if you are on the slow path then it can be for other reasons too, not just this - but that's where you'd go back to the source code and double check.

#5 joew   Crossbones+   -  Reputation: 3678

Like
1Likes
Like

Posted 21 September 2012 - 10:40 PM

Are you using any of the GLSL commands that disable optimizations for depth buffering/tiling at all? Calling discard in a shader is a quick way to drop from 60 to 30 in a single call.

#6 Hodgman   Moderators   -  Reputation: 31785

Like
1Likes
Like

Posted 21 September 2012 - 11:27 PM

N.B. I've never touched iOS dev.
From your description -- the CPU idling in flush -- means that simply the GPU is spending a lot of time performing your draw-call, so when you flush, you're waiting for the GPU.

However... only a "finish" should wait for the GPU -- a "flush" should only allow the GPU to begin work on the newly pushed commands. It could be that the iOS GPU driver always does a finish before a flush... or there are reasons when a flush could block until more GPU work is done, such as if the GPU's command buffer was full.

Also, are you using vsync? Dropping from 60Hz to 30Hz sounds like your frame-times have been pushed from just below 16.6ms to just over 16.6ms, which causes vsync to "round up" to 33.3ms.

Does Apple have any GPU profiling utilities, so you can inpsect how much time each draw-call is actually taking on the GPU? Ideally, it would be able to give you an even more specific breakdown, showing you how much pressure the draw-call was putting on rasterization, texture fetching, ROP, etc...
Or does it support GPU timing events, where you can insert a timing marker before/after your draw-calls to time the GPU usage yourself?

Your draw-call currently requires about 3MB of combined input and output -- is the performance still just as bad if you resize your quad so that it only reads a single texel and ouputs a single pixel?

Do you have access to low-level details on your shader (such as the number of cycles required), on the driver implementation (such as how commands are buffered, and the regular GPU/CPU latency) and specifications on the hardware (such as clock speed, texture cache size, shading registers, memory bandwith, ROP bandwidth, etc...) that you could use to construct a theoretical best-case time for your draw-call?

Edited by Hodgman, 21 September 2012 - 11:36 PM.


#7 L. Spiro   Crossbones+   -  Reputation: 14198

Like
0Likes
Like

Posted 22 September 2012 - 12:25 AM

Are you using any of the GLSL commands that disable optimizations for depth buffering/tiling at all? Calling discard in a shader is a quick way to drop from 60 to 30 in a single call.

In the example I gave above where I enable only 1 draw call, that shader is basically just this:
vec4 vOut = texture2D( tex, coord );
gl_FragColor.xyz = sqrt( vOut.xyz );
gl_FragColor.w = vOut.w;

Which makes it more unreasonable that I am getting this behavior.



However... only a "finish" should wait for the GPU -- a "flush" should only allow the GPU to begin work on the newly pushed commands. It could be that the iOS GPU driver always does a finish before a flush... or there are reasons when a flush could block until more GPU work is done, such as if the GPU's command buffer was full.

::glFlush() and ::glFinish() work as you described. In a flush, the CPU shouldn’t be stalling, though it obviously is. However performing a “finish” is about twice as slow.



Also, are you using vsync? Dropping from 60Hz to 30Hz sounds like your frame-times have been pushed from just below 16.6ms to just over 16.6ms, which causes vsync to "round up" to 33.3ms.

The way iOS works is that internally no matter what you do it is double-buffering (in addition to the renderbuffer you use as your drawing surface), and it normally only stalls the CPU when it is out of buffers.
So it is normal to get every number of FPS smoothly between 30 and 60. For example, my personal engine can run physically based shaders at 50-60 FPS, then drops to 35-40 when I enable shadows.

That (over 1,000 total ::gl* calls, render-target switches, extremely complex shaders, etc.) is considerably more work than the drawing I am doing in my first post, which is just 57 ::gl* calls total.



Does Apple have any GPU profiling utilities, so you can inpsect how much time each draw-call is actually taking on the GPU? Ideally, it would be able to give you an even more specific breakdown, showing you how much pressure the draw-call was putting on rasterization, texture fetching, ROP, etc...
Or does it support GPU timing events, where you can insert a timing marker before/after your draw-calls to time the GPU usage yourself?

They have a ton of profiling tools and I have been using them extensively, but they aren’t helping so much.
For one, the OpenGL ES Analyzer gives performance warnings related to logical buffer loads with advice on how to stop them, but no matter what I try they don’t go away.
I tried what it told me to do, then checked online for deeper advice and tried that. Still there.
I wouldn’t be surprised though if that is heavily related to this issue I am having.

Another thing is that I haven’t had any time to do deeper profiling with these tools since I stripped the drawing down to 0 or 1 draw calls. I will need to check the differences again Monday between “0 draw calls and 60 FPS” and “1 draw call and 30 FPS”.


Your draw-call currently requires about 3MB of combined input and output -- is the performance still just as bad if you resize your quad so that it only reads a single texel and ouputs a single pixel?

I will have to try Monday, but my feeling is that it will not change the performance, because in the example I mentioned with my own engine and shadows/physically based rendering, a lot more data was transferred than 3 megabytes.


Do you have access to low-level details on your shader (such as the number of cycles required), on the driver implementation (such as how commands are buffered, and the regular GPU/CPU latency) and specifications on the hardware (such as clock speed, texture cache size, shading registers, memory bandwith, ROP bandwidth, etc...) that you could use to construct a theoretical best-case time for your draw-call?

No tools for any of that, but past experiences with my own engines makes me 100% certain that the single draw call I am issuing should run at 60 FPS.


L. Spiro

Edited by L. Spiro, 22 September 2012 - 12:27 AM.

It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#8 return0   Members   -  Reputation: 444

Like
0Likes
Like

Posted 22 September 2012 - 04:07 AM

Are you performing a fullscreen clear at the beginning of each frame? Are you calling glDiscardFrameBufferEXT at the end of each frame?

If you are and GL ES driver profiler is still complaining of expensive logical loads etc you probably have state corruption. Check vertex attribute bindings are valid, etc.

Problem will be ogl shuttling the rendered tiles back to cpu.

#9 L. Spiro   Crossbones+   -  Reputation: 14198

Like
0Likes
Like

Posted 22 September 2012 - 08:53 AM

Are you performing a fullscreen clear at the beginning of each frame? Are you calling glDiscardFrameBufferEXT at the end of each frame?

Yes and yes.
I have tested all combinations of clearing and discarding, and while I have found some better than others, none of them prevent this issue I am having now.


If you are and GL ES driver profiler is still complaining of expensive logical loads etc you probably have state corruption. Check vertex attribute bindings are valid, etc.

Problem will be ogl shuttling the rendered tiles back to cpu.

While I am fairly confident that they are all valid, I did not write the rendering pipeline in this case, so I will add that to my list of things to check Monday morning.
But every OpenGL call was listed in my first post and all the pointers to vertex attributes appear to be valid. Do you have a way to check for valid pointers, because handling exceptions via ::sys* calls on iOS does not work. The only way to check pointers for being valid is via exception handlers, and, outside of a debugger, exception handlers work fine, but without a debugger I can’t get the information I need. Apple was extremely idiotic on this point.



L. Spiro
It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#10 return0   Members   -  Reputation: 444

Like
0Likes
Like

Posted 22 September 2012 - 02:05 PM

I don't have any tooling suggestions, sorry; it just sounds really similar to the symptoms and profiler behaviour to a problem I have seen before. I guess you could skip your existing rendering abstraction layer and manually hack in hooking up your buffers? Good luck tracking it down!

#11 L. Spiro   Crossbones+   -  Reputation: 14198

Like
1Likes
Like

Posted 24 September 2012 - 02:59 AM

I was testing on an iPhone 4.
view.contentScaleFactor was 2.
Changing it to 1, or basically disabling retina mode, resolved the issue.

Does anyone know if this is specific to iPhone 4 (iPhone 4S did not seem to be affected in this way) and if there is anything we can do about it?


L. Spiro
It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums




Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS