Efficiency of combining CPU and GPU with multiple threads

5 comments, last by rollo 18 years, 5 months ago
I want to know how efficient it is to combine the CPU and GPU using multiple threads. The idea is that while the CPU performs some computation (unrelated to the thread), a separate thread runs a complex shader on the GPU. The shader does a lot of math and texture accesses and takes 0.008 seconds to finish on my machine (WinXP, NVIDIA Quadro FX1000). Below is the pseudo-code for the thread:

    Thread() {
        initGlew();
        initGlut();
        initOpenGL();
        while (1) {
            DownloadDataToGPU();  // 48 bytes of data
            DrawQuad();           // a pixel shader with math and texture access
            glFinish();
            ReadBackData();       // using glReadPixels()
        }
    }

I have run many test instances, and it seems that with this kind of threading the CPU slows down by about 15%, and the call to glFinish() consumes most of that time. My question is: when glFinish() is called, the GPU should begin to work and the thread should be blocked, so control is handed back to the CPU. That way, the call to glFinish() shouldn't cost much time. Any hints for me?
First of all, you shouldn't need glFinish() at all. When you do the readback, GL will make sure that all the operations it depends on have already been performed.

Second, using glFinish() really kills parallelism between the CPU and GPU: glFinish() means the CPU should stall and wait for the GPU to finish executing all your commands. Unless DownloadDataToGPU() needs the result from ReadBackData() right away, I think you could get much better performance by using PBOs instead of a plain glReadPixels() and simply removing the call to glFinish() (see the sketch after the spec quote below).

Quote: From the PBO spec:

Asynchronous glReadPixels: If an application needs to read back a number of images and process them with the CPU, the existing GL interface makes it nearly impossible to pipeline this operation. The driver will typically send the hardware a readback command when glReadPixels is called, and then wait for all of the data to be available before returning control to the application. Then, the application can either process the data immediately or call glReadPixels again; in neither case will the readback overlap with the processing. If the application issues several readbacks into several buffer objects, however, and then maps each one to process its data, then the readbacks can proceed in parallel with the data processing.

(see Example 3 at the bottom of the document too).
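
To make that concrete, here is a minimal sketch of the PBO readback path, not your actual code: the 512x512 RGBA float frame size is an assumption, and DrawQuad(), DownloadDataToGPU(), DoOtherCPUWork() and ProcessOnCPU() are placeholders standing in for your own routines.

    #include <GL/glew.h>

    // Placeholders for the routines from the original loop / your own CPU work.
    void DownloadDataToGPU();
    void DrawQuad();
    void DoOtherCPUWork();
    void ProcessOnCPU(const float* data);

    const int W = 512, H = 512;      // assumed render target size
    GLuint readbackPbo = 0;

    void InitReadbackPBO()
    {
        glGenBuffers(1, &readbackPbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo);
        // Room for one RGBA float frame; GL_STREAM_READ hints GPU-writes / CPU-reads usage.
        glBufferData(GL_PIXEL_PACK_BUFFER, W * H * 4 * sizeof(float), 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    void Frame()
    {
        DownloadDataToGPU();          // 48 bytes of data, as in the original loop
        DrawQuad();                   // run the pixel shader

        // With a buffer bound to GL_PIXEL_PACK_BUFFER, glReadPixels returns
        // immediately and the copy proceeds asynchronously.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo);
        glReadPixels(0, 0, W, H, GL_RGBA, GL_FLOAT, 0);   // last arg is an offset into the PBO

        DoOtherCPUWork();             // this now overlaps with the transfer

        // Mapping is the first point that may block, and only until the copy is done.
        const float* data = (const float*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (data) {
            ProcessOnCPU(data);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

Note there is no glFinish() anywhere; the map call is the only place the CPU may wait, and only for the readback itself.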
Sorry, I have done more experiments, and it seems that glReadPixels() is the bottleneck. It consumes about 5/6 of the overhead introduced by the thread. I should switch to PBO.

Are there any other ways to speed up the readback?
It is possible that some pixel formats will perform better than others; you would have to experiment to find out which one is best for you.
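
If you want to test that systematically, something like this rough timing loop would do. The particular format/type combinations listed and the use of std::chrono are just illustrative choices, not recommendations for your hardware:

    #include <GL/glew.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    void BenchmarkReadbackFormats(int w, int h)
    {
        struct Fmt { GLenum format, type; const char* name; size_t bytesPerPixel; };
        const Fmt fmts[] = {
            { GL_RGBA, GL_UNSIGNED_BYTE, "RGBA / UNSIGNED_BYTE", 4  },
            { GL_BGRA, GL_UNSIGNED_BYTE, "BGRA / UNSIGNED_BYTE", 4  },  // often matches the native framebuffer layout
            { GL_RGBA, GL_FLOAT,         "RGBA / FLOAT",         16 },
        };

        for (const Fmt& f : fmts) {
            std::vector<unsigned char> buf(w * h * f.bytesPerPixel);
            glFinish();    // isolate the readback itself in the timing
            auto t0 = std::chrono::high_resolution_clock::now();
            glReadPixels(0, 0, w, h, f.format, f.type, buf.data());
            auto t1 = std::chrono::high_resolution_clock::now();
            std::printf("%s: %.3f ms\n", f.name,
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
    }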
I read the PBO specification, and it seems I have to combine the PBO with a pbuffer or an FBO. Otherwise, if I render to the normal frame buffer, the values will be clamped to [0,1].
Is that correct?
That is true. If you need values outside [0,1] you either have to render to a floating-point texture, or be clever and pack your float result into the normal RGB components. I have never done that myself, but unless you are targeting old hardware I would just go with a floating-point texture for simplicity.

I really recommend going with FBOs instead of pbuffers; they are a lot easier to work with and more efficient. Just make sure you have updated drivers. A sketch of the setup follows.
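
For reference, a minimal FBO-with-float-texture setup could look like this. It assumes EXT_framebuffer_object and ARB_texture_float are available (exposed through GLEW), and the 512x512 size is again just an example:

    #include <GL/glew.h>

    const int W = 512, H = 512;       // assumed render target size
    GLuint fbo = 0, colorTex = 0;

    void InitFloatFBO()
    {
        // Floating-point color texture: values written by the shader are not clamped to [0,1].
        glGenTextures(1, &colorTex);
        glBindTexture(GL_TEXTURE_2D, colorTex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, W, H, 0, GL_RGBA, GL_FLOAT, 0);

        // Attach it as the color buffer of an FBO.
        glGenFramebuffersEXT(1, &fbo);
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, colorTex, 0);

        if (glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT) != GL_FRAMEBUFFER_COMPLETE_EXT) {
            // handle an incomplete framebuffer (unsupported format, etc.)
        }
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
    }

While the FBO is bound, DrawQuad() renders into colorTex, and glReadPixels() with GL_FLOAT (ideally into a PBO as in the earlier sketch) reads the unclamped values back.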
In case you are still working on this, I saw a good paper on NVIDIA's developer site that you probably want to read: Fast Texture Downloads and Readbacks using Pixel Buffer Objects in OpenGL.
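
The core pattern for readbacks there is to alternate between two PBOs: read the current frame into one while mapping and processing the previous frame from the other. A rough sketch of that pattern, with the same assumed sizes and placeholder helpers as in the earlier sketch:

    #include <GL/glew.h>

    void DrawQuad();                        // placeholders, as before
    void ProcessOnCPU(const float* data);

    const int W = 512, H = 512;
    const size_t FRAME_BYTES = W * H * 4 * sizeof(float);
    GLuint pbos[2];
    int writeIdx = 0;                       // PBO receiving this frame's readback

    void InitPingPongPBOs()
    {
        glGenBuffers(2, pbos);
        for (int i = 0; i < 2; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, FRAME_BYTES, 0, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    void Frame()
    {
        int readIdx = 1 - writeIdx;         // PBO holding the previous frame's result

        DrawQuad();

        // Start this frame's asynchronous readback into one PBO.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[writeIdx]);
        glReadPixels(0, 0, W, H, GL_RGBA, GL_FLOAT, 0);

        // Process the previous frame from the other PBO; its transfer has had a
        // whole frame to complete, so the map rarely stalls.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[readIdx]);
        const float* data = (const float*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (data) {
            ProcessOnCPU(data);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

        writeIdx = readIdx;                 // swap roles for the next frame
    }

The trade-off is that the CPU results lag the GPU by one frame, which is fine as long as DownloadDataToGPU() does not depend on the very latest readback.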
