I need to draw hundreds of thousands of cubes, each centered at a different location and of different size.
I first drew each cube as 12 triangles (2 per cube face), using glDrawArrays. With this approach I need to pass 12*9 = 108 floating-point coordinates per cube to the graphics card (12 triangles, 3 vertices each, 3 coordinates per vertex).
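For reference, the expansion from one cube to 108 floats looks roughly like the sketch below. The names (expand_cube, CUBE_TRIANGLES) are made up for illustration, "size" is assumed to be the edge length, and I have not been careful about triangle winding, so it assumes face culling is off.

```c
/* 12 triangles * 3 vertices * 3 coordinates = 108 floats per cube. */
enum { CUBE_VERTS = 36, CUBE_FLOATS = CUBE_VERTS * 3 };

/* Corner indices (0..7) for the 12 triangles, 2 per face.
 * Bit 0 of a corner index selects +x, bit 1 selects +y, bit 2 selects +z. */
static const unsigned char CUBE_TRIANGLES[CUBE_VERTS] = {
    0, 4, 6,  0, 6, 2,   /* -x face */
    1, 3, 7,  1, 7, 5,   /* +x face */
    0, 1, 5,  0, 5, 4,   /* -y face */
    2, 6, 7,  2, 7, 3,   /* +y face */
    0, 2, 3,  0, 3, 1,   /* -z face */
    4, 5, 7,  4, 7, 6    /* +z face */
};

/* Hypothetical helper: writes the 108 coordinates of one cube into out.
 * Assumption: size is the edge length, so corners sit at center +/- size/2. */
void expand_cube(const float center[3], float size, float out[CUBE_FLOATS])
{
    const float h = 0.5f * size;
    for (int v = 0; v < CUBE_VERTS; ++v) {
        const unsigned c = CUBE_TRIANGLES[v];
        out[3 * v + 0] = center[0] + ((c & 1u) ? h : -h);
        out[3 * v + 1] = center[1] + ((c & 2u) ? h : -h);
        out[3 * v + 2] = center[2] + ((c & 4u) ? h : -h);
    }
}
```

Filling one large array this way for all cubes gives the buffer that a single glDrawArrays(GL_TRIANGLES, ...) call can draw.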
I then thought of speeding this up using the following algorithm:
1) Use glVertexPointer on a standard unit cube centered at zero
2) Then, for each cube, I pass its center and a scaling parameter with glUniform1fv. I use a vertex shader that reads these 4 numbers and scales and translates each coordinate component accordingly. The glDrawArrays command always draws the same unit cube, and the shader program takes care of placing each vertex in its proper location.
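The two steps above can be sketched as follows. This is a simplified illustration, not my exact code: the uniform name cubeParams, the GLSL version, and the helper place_vertex are assumptions for the example. The C function only mirrors on the CPU what the shader computes per vertex, so the arithmetic can be checked without a GL context.

```c
/* Vertex shader sketch (legacy GLSL, matching the glVertexPointer pipeline).
 * cubeParams holds {center.x, center.y, center.z, scale}; the name is
 * hypothetical. It would be set per cube with glUniform1fv(loc, 4, params). */
const char *const CUBE_VERTEX_SHADER =
    "#version 120\n"
    "uniform float cubeParams[4];\n"
    "void main() {\n"
    "    vec3 center = vec3(cubeParams[0], cubeParams[1], cubeParams[2]);\n"
    "    vec3 p = gl_Vertex.xyz * cubeParams[3] + center;\n"
    "    gl_Position = gl_ModelViewProjectionMatrix * vec4(p, 1.0);\n"
    "}\n";

/* CPU mirror of the shader's scale-then-translate step for one vertex of the
 * unit cube. params = {cx, cy, cz, scale}. */
void place_vertex(const float unit[3], const float params[4], float out[3])
{
    for (int i = 0; i < 3; ++i) {
        out[i] = unit[i] * params[3] + params[i];
    }
}

/* Per-cube draw loop as described in the steps (GL calls shown as comments,
 * since they need a live context):
 *
 *   for each cube:
 *       glUniform1fv(loc, 4, params);       // 4 floats: center + scale
 *       glDrawArrays(GL_TRIANGLES, 0, 36);  // same unit cube every time
 */
```

For example, the unit-cube corner (0.5, -0.5, 0.5) with params {1, 2, 3, 2} ends up at (2, 1, 4).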
The algorithm works: I do get my cubes with the right size and at the location I want them. And I am passing only 4 floating-point numbers per cube, rather than 108. Yet this approach is roughly 100 times slower than the original.
I narrowed the problem down to the call to glUniform1fv. If I call it just once, rather than once per cube, I get my performance back. Of course, the cubes are not in the right location and are not of the right size, but at least I know the culprit.
Can anyone shed some light on why there is such a loss of performance in this function? And, better yet, a suggestion on how to really improve my performance with an algorithm like this to the point that it's better than my original triangulation?
I'm puzzled by this loss of performance when I'm sending fewer points to the graphics card.