I have a difficult problem that I cannot figure out an efficient way to solve. Part of my difficulty is that I'm not intimately familiar with every last nook and cranny of OpenGL and GPU pipelines, and this problem can obviously be solved in several ways. I need to find the most efficient (fastest-running) solution, not the simplest, because millions of vertices must be processed each frame.
Here is a generic statement of what needs to happen.
#1: The application program on the CPU contains an array of roughly 2 billion vertices in world coordinates, each with an RGB color and possibly a couple of other items of information. The original coordinates are spherical (2 angles + distance), but separate files can be created with x,y,z Cartesian coordinates to speed processing by the CPU and/or GPU if appropriate.
#2: Before the application runs, the vertex data is transferred from disk into CPU memory. This will consume several gigabytes of RAM, but will fit into memory without swapping.
#3: However, the entirety of the vertex data will not fit into the RAM of current-generation GPUs. We can assume the application is only run on GPUs that contain gigabytes of internal RAM, with at least 1 gigabyte always allocated to this vertex data.
#4: The data on disk is organized in a manner analogous to a cube-map to make the following processes efficient. The data for each face of the cube-map are subdivided into a 1024x1024 array of subsections called "fields", each of which can be easily and efficiently located and accessed independently by the application program to support efficient culling.
#5: The vertex data for all currently visible fields will presumably be held in a fixed number of dedicated VBO/VAOs in GPU memory. For normal viewport angles, probably a few million to several million vertices will be in these VBO/VAOs, and need to be processed and displayed each frame.
#6: When the camera/viewpoint rotates more than about 0.1 degree, some "fields" will no longer be visible in the viewport, and other "fields" will become visible. When this happens, the application will call OpenGL functions to write the vertex data of newly visible vertices over the vertex data of no longer visible vertices, so no reallocation of VBO/VAOs is ever required.
#7: Each frame, the entire scene except for this vertex data is first rendered into the framebuffer and depthbuffer.
#8: The OpenGL state can be modified as necessary before rendering the vertex data. New vertex and fragment shader programs can be enabled to implement rendering of the vertex data in the required manner.
#9: All of this special vertex data is now rendered.
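To make the data layout in #1 and #4 concrete, here is a minimal CPU-side sketch in Python of converting one spherical coordinate to Cartesian and finding which cube-map "field" a direction falls into. The angle convention, the face numbering, the (u, v) orientation on each face, and the function names are all assumptions for illustration; they are not the actual file format.

```python
import math

FIELDS_PER_EDGE = 1024  # each cube face is a 1024x1024 grid of fields

def spherical_to_cartesian(azimuth, elevation, distance):
    """Assumed convention: azimuth measured about +z from +x,
    elevation measured up from the xy-plane, both in radians."""
    ce = math.cos(elevation)
    return (distance * ce * math.cos(azimuth),
            distance * ce * math.sin(azimuth),
            distance * math.sin(elevation))

def field_of(x, y, z):
    """Return (face, row, col) for a nonzero direction (x, y, z).
    Hypothetical face numbering: 0..5 = +x, -x, +y, -y, +z, -z."""
    ax, ay, az = abs(x), abs(y), abs(z)
    # The axis with the largest magnitude selects the cube face.
    if ax >= ay and ax >= az:
        face, major, u, v = (0 if x > 0 else 1), ax, y, z
    elif ay >= az:
        face, major, u, v = (2 if y > 0 else 3), ay, x, z
    else:
        face, major, u, v = (4 if z > 0 else 5), az, x, y
    # u/major and v/major lie in [-1, 1]; map them to grid indices 0..1023.
    col = min(FIELDS_PER_EDGE - 1, int((u / major * 0.5 + 0.5) * FIELDS_PER_EDGE))
    row = min(FIELDS_PER_EDGE - 1, int((v / major * 0.5 + 0.5) * FIELDS_PER_EDGE))
    return face, row, col
```

With a lookup like this, culling reduces to working out which (face, row, col) cells intersect the view frustum and loading only those fields.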
-----
The above is just background. The tricky question for OpenGL gurus is how to most efficiently render the vertex data in the following manner:
#1: Each vertex is a point. No lines or triangles are rendered.
The following requirements are what make this problem difficult...
#2: For each vertex, find the value in the depth buffer that corresponds to where this vertex/point would be displayed. The nearest point is what we need to find, since we must assume for this step that the actual physical size of each point is zero (infinitesimal).
#3: If the depth buffer value indicates the depth buffer (and corresponding pixel in color-buffer) has already been written this frame, then no further action may be taken for this vertex (the color-buffer and depth-buffer are not modified). In effect, we want to discard this vertex and not perform any subsequent processes for this vertex. Otherwise perform the following steps.
#4: Based upon the brightness and color of the vertex data (the RGB or ARGB values in each vertex structure), render a "blob" of the appropriate brightness and size (the maximum size will be somewhere between 16x16 and 64x64 screen pixels), centered on the screen pixel where the depth-buffer was tested.
NOTE: The most desirable way to render the image for this "blob" is "procedurally". In other words, all screen pixels within 1 to 32 pixels of the computed screen pixel would be rendered by a fragment shader program that knows the vertex data (brightness and color), plus how far away each pixel is from the center of the image (dx,dy). Based on that information, the fragment shader code would appropriately render its pixel of the blob image.
Alternatively, a "point sprite texture" could be selected from an array of tiny textures (or a computed subregion of one texture) based on the brightness and color information. Then a point sprite of the appropriate size and brightness could be rendered, centered on the screen pixel computed for the original vertex.
In either case, the RGB of each screen pixel must be summed (framebuffer RGB = framebuffer RGB + new RGB).
The depth buffer must not be updated.
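To pin down the intended result of steps #2 through #4, here is a slow CPU reference in Python. It is only a statement of the desired semantics, not a proposed GPU implementation. The linear falloff in `blob_intensity`, the buffer layout (nested lists), and the function names are hypothetical, and the sketch assumes that once the center test passes, every pixel of the blob is summed without any further per-pixel depth test.

```python
import math

def blob_intensity(dx, dy, radius):
    """Hypothetical falloff: 1 at the center, fading linearly to 0 at `radius` pixels."""
    return max(0.0, 1.0 - math.hypot(dx, dy) / radius)

def draw_point(color, depth, px, py, rgb, radius):
    """color: rows of [r, g, b] lists; depth: rows of floats, where 1.0 means
    "not written this frame". (px, py) is the pixel the point projects to.
    Returns True if the blob was drawn, False if the vertex was discarded."""
    height, width = len(depth), len(depth[0])
    if not (0 <= px < width and 0 <= py < height):
        return False
    # Step #3: one test at the center pixel decides the fate of the whole blob.
    if depth[py][px] < 1.0:
        return False
    # Step #4: additive blend over the blob footprint (clipped to the screen).
    # The depth buffer is only read above and is never written.
    for y in range(max(0, py - radius), min(height, py + radius + 1)):
        for x in range(max(0, px - radius), min(width, px + radius + 1)):
            w = blob_intensity(x - px, y - py, radius)
            for c in range(3):
                color[y][x][c] += w * rgb[c]
    return True
```

The key property is that the decision is made once per vertex and the blob is then drawn in full or not at all.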
-----
The above is what needs to happen.
What makes these requirements so problematic?
#1: We need to make the rendering of an extended region of the screen conditional upon whether a single vertex/point in the depth-buffer has been written. I infer that the many pixels in a point-sprite rendering are independently subjected to depth tests by their fragment shaders, and therefore the entire point-sprite would not be extinguished just because the center happened to be obscured. Similarly, I do not see a way for a vertex shader or geometry shader to discard the original vertex before it invokes a whole bunch of independent fragment shaders (either to render the pixels in the point-sprite, or to execute a procedural routine).
#2: It appears to me that vertex-shaders and geometry-shaders cannot determine which framebuffer pixel corresponds to a vertex, and therefore cannot test the depth-buffer for that pixel (and discard the vertex and stop subsequent steps from happening).
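For reference, the mapping from a transformed vertex to a window pixel is the standard perspective divide followed by the viewport transform. Below is a minimal CPU-side sketch of that arithmetic in Python, assuming a viewport of (0, 0, width, height); the function name is hypothetical, and whether a shader stage before rasterization is allowed to do this and then read the depth buffer is exactly the part I am unsure about.

```python
def clip_to_pixel(clip, width, height):
    """Map a clip-space position (x, y, z, w) to an integer window pixel (px, py).
    Returns None if the point is behind the eye or outside the viewport.
    Assumes the viewport origin is (0, 0) and its size is width x height."""
    x, y, z, w = clip
    if w <= 0.0:
        return None  # behind the eye; no meaningful pixel
    # Perspective divide: clip space -> normalized device coordinates in [-1, 1].
    ndc_x, ndc_y = x / w, y / w
    if not (-1.0 <= ndc_x < 1.0 and -1.0 <= ndc_y < 1.0):
        return None  # off screen
    # Viewport transform: [-1, 1) -> [0, width) and [0, height).
    px = int((ndc_x * 0.5 + 0.5) * width)
    py = int((ndc_y * 0.5 + 0.5) * height)
    return px, py
```

For example, `clip_to_pixel((0.0, 0.0, 0.0, 1.0), 800, 600)` lands on the center pixel of an 800x600 viewport.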
For the last couple of days I've been reading the new SuperBible and the latest OpenGL specs... and they are so chock full of very cool, very flexible, very powerful capabilities and features. At many points I thought I found a way to meet these requirements, but... always came up short. In some cases I could see that a feature won't work, but in other cases I had to make assumptions about subtle details of how GPU pipelines work and what the OpenGL specification guarantees. So... I'm not convinced there isn't some way to accomplish what I need to accomplish. In fact, I have the distinct feeling "there must be"... but I just can't find it or figure it out.
But I bet someone who has been mucking around in OpenGL, GLSL and the details of the GPU pipeline understands how everything works in sufficient detail that... they'll immediately flash on the solution! And tell me!
I'd hate for this process to require two passes. I suppose we could create a first pass that simply writes or clears a single bit (in a texture or ???) that corresponds to each vertex, indicating whether the screen pixel where the vertex would be displayed as a point is currently written or not (assuming the fragment shader can read the depth buffer, and assuming 1.0 means "not drawn"). Then on the second pass the vertex shader could discard any vertex with bit=0 in that texture. Oops! Wait... can vertex shaders discard vertices? Or maybe a whole freaking extra geometry shader would be needed for this. Gads, I hate multiple passes, especially for something this trivial.
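The logic of that two-pass fallback, written out as a CPU-side Python sketch so the data flow is unambiguous. The function names are hypothetical, the per-vertex pixel coordinates are assumed to be already computed and on screen, and nothing here says how either pass would map onto actual GPU stages.

```python
def visibility_pass(pixels, depth):
    """Pass 1: one flag per vertex. True when the depth buffer at the vertex's
    pixel still holds 1.0, i.e. nothing has been drawn there this frame."""
    return [depth[py][px] >= 1.0 for (px, py) in pixels]

def compact(vertices, flags):
    """Input for pass 2: keep only the vertices whose flag is set, so that the
    blob rendering stage never sees the discarded ones."""
    return [v for v, keep in zip(vertices, flags) if keep]
```

Pass 2 would then render a full blob for every vertex that survives `compact`, with no further depth test.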
I'll even be happy if the solution requires a very recent version of OpenGL or GLSL.
Who has the solution?