CPU + GPU rendering with OpenGL

15 comments, last by Dawoodoz 7 years, 6 months ago

I just started making a game in OpenGL but everything OpenGL specific is in one tiny C module so that I can easily change to Vulkan when more graphics cards are supported on Linux.

I want a faster texture upload so that I can draw many tiny sprites on the CPU with full control over the depth buffer, and then add deferred normal mapping, global volume light, turbulence, bloom, fog, water and gamma correction on the GPU. The problem is that rendering a texture uploaded from the CPU is very slow. This is probably because the OpenGL driver keeps the texture in memory optimized for frequent writes, whereas memory optimized for frequent reads would be slow to upload to instead.

CPU rasterization alone, without GPU upload, takes 0.3 ms for hard clipping and 2.0 ms using alpha filtering. This is without multithreading or SIMD optimizations.

Software rasterization + upload + sampling from write-often memory on the GPU takes 10.0 ms, which barely makes the 15 ms deadline.

Only GPU rendering with fixed textures takes 4.0 ms which is okay for OpenGL but then I cannot write freely to the depth buffer unless there is an extension for that. Copying back from fake depth buffers all the time would stall the GPU while waiting for the output as the next input texture.

Is there a memory trick that I can use in OpenGL to avoid stalling on sampling an uploaded texture?

Right now I just upload the software rasterized result to an existing texture ID using glTexImage2D.

Before you point out the obvious: my game would probably be much faster with hand-coded DSP assembly on a Snapdragon 820 SoC with a unified memory architecture and an HVX-capable mDSP, but I don't even like playing mobile games, and it would have to be signed as firmware by the hardware vendor to go beyond root access.

Check your parameters to glTexImage2D - it's probable that the driver is having to do a format conversion before it can upload, e.g. if you're using GL_RGB (which doesn't actually exist in hardware).
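One way to check which format argument matches the CPU buffer is to look at the bytes one pixel occupies in memory. The sketch below is only an illustration and assumes the software rasterizer stores each pixel as a 32-bit 0xAARRGGBB value on a little-endian CPU; pack_argb and pixel_bytes are hypothetical helper names, not part of OpenGL. Under that assumption the bytes come out in the order B, G, R, A, which is the layout described by GL_BGRA with GL_UNSIGNED_BYTE rather than GL_RGBA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack four 8-bit channels into one 0xAARRGGBB pixel. */
static uint32_t pack_argb(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Hypothetical helper: copy the pixel's bytes in memory order, which is
   the order a GL_UNSIGNED_BYTE upload reads them in. */
static void pixel_bytes(uint32_t pixel, uint8_t out[4])
{
    assert(out != NULL);
    memcpy(out, &pixel, sizeof pixel);
}
```

If the bytes of your own buffer come out as R, G, B, A instead, GL_RGBA would be the matching format argument. Whether the driver still converts internally is something only profiling can tell.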

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Is there a memory trick that I can use in OpenGL to avoid stalling on sampling an uploaded texture? Right now I just upload the software rasterized result to an existing texture ID using glTexImage2D.

Maybe try triple buffering, so that you round-robin rotate between 3 different texture IDs on different frames. It's possible that the current usage is creating a sync point where the GPU has to wait for it to finish with the previous texture before it can upload the new contents.
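A minimal sketch of the round-robin rotation, assuming three texture IDs have already been created elsewhere (for example with glGenTextures). TextureRing, ring_init and ring_acquire are hypothetical names used only for illustration, and the IDs are plain integers so the snippet compiles without OpenGL headers.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 3  /* assumption: three textures in rotation */

typedef struct {
    unsigned ids[SLOT_COUNT];  /* texture IDs created elsewhere */
    size_t next;               /* slot the next frame will use */
} TextureRing;

/* Store the texture IDs and start the rotation at the first slot. */
static void ring_init(TextureRing *ring, const unsigned ids[SLOT_COUNT])
{
    assert(ring != NULL && ids != NULL);
    for (size_t i = 0; i < SLOT_COUNT; ++i)
        ring->ids[i] = ids[i];
    ring->next = 0;
}

/* Return the texture ID to upload into and draw from this frame,
   then advance so the following frame gets a different texture. */
static unsigned ring_acquire(TextureRing *ring)
{
    unsigned id = ring->ids[ring->next];
    ring->next = (ring->next + 1) % SLOT_COUNT;
    return id;
}
```

Each frame would call ring_acquire once, bind the returned ID, upload, and draw. A given texture is then only overwritten again three frames later, which gives the GPU time to finish reading it.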

@ mhagain:

I have tried both
"glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, (unsigned char*)cpuBuffer);"

and

"glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA, GL_UNSIGNED_BYTE, (unsigned char*)cpuBuffer);"

@ C0lumbo:

Thanks, I will look into triple buffering. Maybe I can set up producer-consumer multithreading if SDL has something like that.

Double buffering the upload saved one millisecond. Enough to fill the screen with sprites on the CPU each frame. :)

Increasing to three buffers did not improve speed for now, but it might with other memory flags.

Drawing the image that I started uploading the previous frame did not improve performance, so I probably have to force the memory into a slow-upload but fast-sampling mode somehow.
If I had a separate CPU thread for handling OpenGL, I should be able to do heavy CPU work without stalling the GPU, since the frame gets 1.5 ms shorter if I stop drawing on the CPU and just upload and draw the background every frame.

I tried many types of multithreading for CPU rendering and GPU uploads, but none gave any performance increase.

I still get the sum of CPU and GPU rendering times instead of the maximum of both.

Either some resource is stalling or OpenGL already did the same thing for me.

Game loop with multiple CPU draw targets: (1 ms slower)

start rendering thread writing to output[(i + 1) modulo 2] on the CPU

upload output[i modulo 2] to the GPU and draw to the screen

wait for the rendering thread

i = (i + 1) modulo 2

Game loop with copy from CPU draw target to extra buffer: (same speed)

outputB = copy of outputA

start rendering thread writing to outputA on the CPU

upload outputB to the GPU and draw to the screen

wait for the rendering thread
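For reference, the first loop (multiple CPU draw targets) could be sketched in C with POSIX threads roughly as follows. This is an illustration under assumptions, not the actual game code: render_thread stands in for the CPU rasterizer, upload_and_draw stands in for glTexImage2D plus the draw call, and both names, the tiny buffer size and run_frames are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

enum { PIXELS = 16 };  /* tiny stand-in for width * height */

typedef struct {
    uint32_t *target;  /* buffer the CPU rasterizer writes to */
    uint32_t frame;    /* frame number, used here as the fill value */
} RenderJob;

/* Stand-in for the CPU rasterizer: fills the target with the frame number. */
static void *render_thread(void *arg)
{
    RenderJob *job = arg;
    for (int p = 0; p < PIXELS; ++p)
        job->target[p] = job->frame;
    return NULL;
}

/* Stand-in for the upload and draw: reports the first pixel it was given. */
static uint32_t upload_and_draw(const uint32_t *pixels)
{
    return pixels[0];
}

/* Runs the loop for `frames` frames and returns what the last frame showed. */
static uint32_t run_frames(uint32_t frames)
{
    static uint32_t output[2][PIXELS];
    uint32_t shown = 0;
    unsigned i = 0;

    memset(output, 0, sizeof output);
    for (uint32_t frame = 1; frame <= frames; ++frame) {
        RenderJob job = { output[(i + 1) % 2], frame };
        pthread_t worker;

        /* Start rendering into the buffer that is not being uploaded. */
        int rc = pthread_create(&worker, NULL, render_thread, &job);
        assert(rc == 0);
        (void)rc;

        /* Meanwhile, upload and draw the buffer finished last frame. */
        shown = upload_and_draw(output[i % 2]);

        /* Wait for the rendering thread before swapping roles. */
        pthread_join(worker, NULL);
        i = (i + 1) % 2;
    }
    return shown;
}
```

Depending on the toolchain this may need to be built with -pthread. The sketch also makes the one-frame latency visible: the image shown in frame N is the one rendered during frame N - 1. It says nothing about why the timings still add up instead of overlapping, since the blocking most likely happens inside the real upload or draw call.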

Have you tried glTexSubImage2D? It may be faster as it doesn't need to respecify the texture each time.

Thanks, I will try that. :)

The performance was about the same with either glTexImage2D or glTexSubImage2D.

I might have to try pixel buffer objects and fence objects.

You'll need to use a PBO, keeping it mapped with persistent storage and using fences for synchronization.

There is one thing I don't get though:

Only GPU rendering with fixed textures takes 4.0 ms which is okay for OpenGL but then I cannot write freely to the depth buffer unless there is an extension for that. Copying back from fake depth buffers all the time would stall the GPU while waiting for the output as the next input texture.

I assume you don't know about gl_FragDepth?

What is the depth buffer manipulation you are doing? Why do you need it? What are you trying to achieve?
