spek

OpenGL Grouping/batching to optimize rendering


Hey,

As we all know, batching things is good for performance. However, there are several ways to sort/group things, and I'm wondering if I'm going the right way. I have 3 examples from my engine, and would like to know if the sorting can be done more efficiently.

* For the record: this is mainly an indoor situation, and I'm using OpenGL, though I assume the principles are the same regardless of the rendering platform.


Example1: Objects
----------------------------------
Several rooms are visible, each with props like furniture, boxes, lamps, decorations, and so on. Typically, such an object has 500 to 2,000 triangles and 1 "material". A material is:
- a vertex / fragment shader
- Sometimes a geometry shader (rare)
- Shader parameters (textures, color-vectors, factors, ...)

Let's say I have 100 objects to render. Currently, I sort on material. So if there are 50 boxes and 50 barrels, the rendering loop could look as follows:

1- apply material "box" (shaders + set textures/parameters)
2- for each box in boxes
       glPushMatrix;
       glMultMatrix( box.matrix );
       box.renderVBO;   // Render the raw mesh via a VBO
       glPopMatrix;
3- apply material "barrel"
4- for each barrel .... see step 2

So, we only had to switch materials/textures twice instead of 100 times. However, I could also sort on mesh VBO. Now for each object, the VBO mesh gets set (glBindBufferARB), rendered, unset.
* DirectX has "instancing"... can that be compared to this?

In that case I only have to activate the VBO twice, then repeat the draw call, but obviously this causes more material switches. In practice all the objects with the same VBO have the same material as well, but the reverse isn't true: 4 different box models may still share the same material, and thus get grouped together if I sort per material.
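To make it concrete, the kind of render queue I have in mind is roughly this (just a C-ish sketch; the DrawItem struct and the id fields are made up, not actual engine code):

typedef struct {
    unsigned short materialId;   /* hypothetical material handle */
    unsigned short vboId;        /* hypothetical VBO handle      */
    const void    *object;       /* the object to draw           */
} DrawItem;

/* material in the high bits, VBO in the low bits */
static unsigned int sortKey(const DrawItem *d)
{
    return ((unsigned int)d->materialId << 16) | d->vboId;
}

static int compareDrawItems(const void *a, const void *b)
{
    unsigned int ka = sortKey((const DrawItem *)a);
    unsigned int kb = sortKey((const DrawItem *)b);
    return (ka > kb) - (ka < kb);
}

/* usage: qsort(items, itemCount, sizeof(DrawItem), compareDrawItems);
   then walk the sorted list and only re-apply the material / re-bind the
   VBO when the id actually changes */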

Anyway, how would you sort / batch it?



Example2: Static geometry (walls, floors, pillars, ...)
----------------------------------
Since I'm using a portal-rendering approach, the world gets split up into rooms. Currently each room then gets split up into material groups. For example:
1- All polygons using material "wood floor"
2- All polygons using material "brick walls"
3- All polygons using material "metal support bars"
4- All polygons using material "concrete ceiling"

Usually the polycount for each "polygon chunk" is very small, since walls/floors are usually simple shapes. A 6-face box room would have 2 tris for group 1 (wood floor), 8 tris for the walls, and so on. In practice the polycount is a bit higher in my case; the average room uses ~2,000 polygons in total. And usually 1 to 10 rooms are visible at the same time.

Before rendering, all the chunks from all visible rooms are grouped together. So if there are 10 rooms visible, all using the same brick walls, all those small chunks end up in one group. The render loop would look like this:

1- material "wood floor" apply (set shaders, textures, parameters)
2- for each floorChunk_using_Wood
       chunk.render;   // glDrawArrays (vertices, normals, tangents, weights, texcoords)

This minimizes the number of material switches. But when multiple rooms are visible, I'm rendering quite a lot of small geometry pieces using glDrawArrays (no VBOs).


Another method I used is rendering per room, with one VBO per room. First the VBO that contains all the geometry (vertices, normals, tangents, weights, texcoords) for the entire room gets activated. Then for each material, I apply the material and render a piece of that VBO via indices:

for each visibleRoom
    room.activateVBO;
    for each chunkSortedOnMaterial in roomChunks
        chunk.material.apply;
        chunk.renderVBO_viaIndices;
    room.deactivateVBO;

Geometry-wise, this seems a smarter way, although I'm not sure the relatively small polycounts (~2k per room) make a real difference. The big disadvantage, however, is that when there are 10 rooms visible, we need to repeat this 10 times too, even if all rooms are using the same materials.

* Got to mention that each surface (floor, wall, ...) can use quite a lot of texture data (up to 16 MB altogether).

In both cases, I'm still rendering small parts of the room. Combining rooms and thus their geometry would fix that, but of course that doesn't fit the portal-rendering approach.
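A middle ground I can imagine (rough sketch; the Chunk struct and the applyMaterial helper are made up) would be to throw the chunks of all visible rooms into one list, sort on material first and room VBO second, and only switch state when an id actually changes:

typedef struct {
    GLuint        roomVbo;      /* VBO holding the whole room's geometry */
    int           materialId;   /* hypothetical material handle          */
    const GLuint *indices;      /* client-side indices of this chunk     */
    GLsizei       indexCount;
} Chunk;

/* 'chunks' is assumed to be pre-sorted on (materialId, roomVbo) */
void drawVisibleChunks(Chunk *chunks, int count)
{
    int    lastMaterial = -1;
    GLuint lastVbo      = 0;
    int    i;

    for (i = 0; i < count; ++i)
    {
        if (chunks[i].materialId != lastMaterial) {
            applyMaterial(chunks[i].materialId);     /* hypothetical helper */
            lastMaterial = chunks[i].materialId;
        }
        if (chunks[i].roomVbo != lastVbo) {
            glBindBufferARB(GL_ARRAY_BUFFER_ARB, chunks[i].roomVbo);
            /* re-specify glVertexPointer / glNormalPointer / ... here */
            lastVbo = chunks[i].roomVbo;
        }
        glDrawElements(GL_TRIANGLES, chunks[i].indexCount,
                       GL_UNSIGNED_INT, chunks[i].indices);
    }
}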



Example3: Transparent objects (trees, plants, windows, bottles, ...)
----------------------------------
This one is nasty. I could sort in similar ways, but obviously that causes depth issues. If I sort purely on depth, batching is nearly impossible; if I sort on material and/or mesh, the depth order gets screwed up.

Fortunately, I have few transparent objects. But I wonder how a game like Crysis renders its jungle. I mean, that uses a lot of transparent surfaces, right?


Cheers,
Rick

Hi Rick!


So, we only had to switch materials/textures twice instead of 100 times. However, I could also sort on mesh VBO. Now for each object, the VBO mesh gets set (glBindBufferARB), rendered, unset.
* DirectX has "instancing"... can that be compared to this?

OpenGL has instancing, too (since 3.1 or 3.2, if I recall correctly). With instancing you only submit one draw call to render many objects (which keeps the command buffer smaller). You can add some variation by reading the instance ID in the shader and then fetching from another texture or using other constants for colors.
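A rough sketch of what the draw call looks like (core GL 3.1 naming; with the EXT_draw_instanced extension the function gets an EXT suffix; the buffer names are placeholders):

glBindBuffer(GL_ARRAY_BUFFER, boxVbo);            /* your box mesh VBO    */
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, boxIbo);    /* and its index buffer */
/* ... set up the vertex attribute pointers as usual ... */

/* one call draws numBoxes instances; the shader distinguishes them
   via gl_InstanceID (GLSL) or the INSTANCEID semantic (Cg) */
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, numBoxes);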


Anyway, how would you sort / batch it?

I have the feeling that material switches are more expensive, since more stuff has to be set (probably several states, uniforms and, worst of all, textures). Perhaps you could profile to find out which costs more? I think it depends on the size of the textures and VBOs, so it can't be generalized. :)

If you group by material you could batch geometry with the same material into a few VBOs, probably somehow clustered so that you can cull them. I think I would place objects that appear only once in the room at the beginning of the VBO (all of those could be drawn with a single draw call if they are all visible) and objects that can be drawn instanced at the end; this way you can still use instancing by drawing only from parts of the VBO.


In both cases, I'm still rendering small parts of the room. Combining rooms and thus their geometry would fix that, but of course that doesn't fit the portal-rendering approach.

You could probably find out which objects appear multiple times in the scene and draw them with instancing (even if they are in different rooms). It gets a little more complicated that way, and to be honest I'm not sure whether it is worth the trouble. :)


Fortunately, I have few transparent objects. But I wonder how a game like Crysis renders its jungle. I mean, that uses a lot of transparent surfaces, right?

Most titles try to avoid alpha-blended objects if possible (also because they don't work well with deferred shading...). For foliage one would rather use alpha to coverage. I'm not sure whether it is in the GL standard by now, but there are Nvidia extensions that do the work for you. If you don't want to use them, you can build the sample mask based on your alpha value in the fragment shader yourself and write it to gl_SampleMask (the more opaque, the more sample bits are 1). With this you (hopefully) only have a few "true" alpha objects left (so batching wouldn't be of much use).
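The fixed-function route is just a render state, by the way (a sketch; it assumes you are already rendering into a multisampled framebuffer):

glEnable(GL_MULTISAMPLE);
glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE);   /* coverage mask is derived from the fragment's alpha */
/* ... draw the foliage with its alpha textures as usual ... */
glDisable(GL_SAMPLE_ALPHA_TO_COVERAGE);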

Cheers! :)

OpenGL 3... I was afraid I'd have to take that step some day. Oh shit, I just discovered OpenGL 4 is there as well! Will it ever stop? I'm spending more time on upgrading graphics / shaders / sound / physics libraries than on the game itself these days :D

by reading the instance ID in the shader and then fetching from another texture or using other constants

In my oldskool knowledge, you can only pass 16 textures at a time to a shader (unless you pack more into a texture array). But a short while ago I read some stuff about c- and tbuffers. Does that mean I can let an instance decide which textures (loaded in video memory) to grab before it renders? Same thing for constants: I have, for example, 1,000 different materials. If each had 4 parameters, it could fit in one cbuffer. But are the shaders/instances flexible enough to decide, on the GPU, where to find the correct parameters? And moreover, does it really gain some speed? I think yes, because currently the CPU has to set all parameters each time before rendering something. But just checking...

Excuse me, but I'm a bit prehistoric with OpenGL 2.x, Delphi 7, a 32-bit computer, and an older Cg shader language!

build the sample mask based on your alpha value in the fragment shader yourself and set it to gl_SampleMask

Another new thing. I've never looked at multisample demos, so could you please explain what's going on here? From my limited understanding, you don't use traditional glBlend / alpha testing, but sort it out yourself by defining a number of layers and keeping track of previous results somehow? That means the shader does the (limited) depth sort? Not very precise, but who will notice between a few million tree leaves / grass blades... If so, I don't have to bother about rendering order, and I can try to implement foliage in the deferred pipeline. Although I still wonder how the edges around leaves / metal fences / barbed wire can be handled, since blending values in a deferred renderer isn't really an option.


OpenGL 3... I was afraid I'd have to take that step some day. Oh shit, I just discovered OpenGL 4 is there as well! Will it ever stop?


You could just use extensions if you prefer that; no need to switch OpenGL versions for one feature. Look at the EXT_draw_instanced extension (or NV_draw_instanced for the even older Nvidia version of the extension). Any Nvidia card that supports the NV version will support the EXT version as well, unless you use a very old driver.
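Checking for it at startup on a GL2 context is something like this (just a sketch; needs <string.h> for strstr):

const char *extensions = (const char *)glGetString(GL_EXTENSIONS);
if (extensions && strstr(extensions, "GL_EXT_draw_instanced"))
{
    /* safe to use glDrawArraysInstancedEXT / glDrawElementsInstancedEXT */
}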

Hi,


In my oldskool knowledge, you can only pass 16 textures at a time to a shader (unless you pack more in a texArray).

The actual number of active texture units you can have depends a little on the graphics card. (On current hardware it goes up to about 160, see here.)


But a short while ago I read some stuff about c- and tbuffers. Does that mean I can let an instance decide which textures (loaded in video memory) to grab before it renders? Same thing for constants: I have, for example, 1,000 different materials. If each had 4 parameters, it could fit in one cbuffer. But are the shaders/instances flexible enough to decide, on the GPU, where to find the correct parameters?

The shader decides from which texture to fetch. Therefore each possible texture must be bound to a texture unit. In Cg you have a special input semantic (INSTANCEID) in the vertex shader. You can use this value to index into a sort of material index buffer.

int myTexIndex = materialIndexBuffer[instanceID];           // do this in the vertex shader
float4 color = tex2D( inputTexture[myTexIndex], texcoord ); // do this in the fragment shader

Same thing for reading colors and transformations.
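On the host side you would upload the per-instance indices once per batch, for instance as a uniform int array (a GLSL-style sketch; with the Cg runtime you would use its parameter-array setter instead). materialIndexBuffer matches the name in the snippet above, the rest is made up:

GLint loc = glGetUniformLocation(program, "materialIndexBuffer");
glUniform1iv(loc, numInstances, materialIndices);   /* one material index per instance */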


And moreover, does it really gain some speed? I think yes, because currently the CPU has to define all parameters each time before rendering something. But just checking...

The binding time is nearly the same, since you have to bind everything at once. But the rendering should be faster (fewer draw calls). Though it is only really beneficial when you have many instances.


Another new thing. I've never looked at multisample demos, so could you please explain what's going on here?

Multi-sampling is an approach to do anti-aliasing. For each pixel a certain number of subpixels is evaluated. The positions of the subpixels are randomized (but stratified) and are not specified in the GL standard, which means they are up to the hardware vendors. Though there is one "default" pattern of subpixels that is the same on all cards.
For 4xMSAA it could look like this:
|__x____|
|______x|
|x______|
|____x__|   // Positions of subpixels in a pixel.

You can use multi-sampling in two ways: shading at sample-frequency or pixel-frequency. Sample-frequency means the fragment shader is executed for every subpixel (storing color and depth). In the end, once all objects have been rendered, the subpixels are averaged to give the final pixel color. You should prefer this mode if you have strong discontinuities in your textures, since shading at pixel-frequency only anti-aliases polygon edges.
The other mode is shading at pixel-frequency. Here, the fragment shader is executed just once per pixel(!) and only depth values and a coverage mask are generated for the subpixels. (Coverage is a bit vector telling, for each subpixel, whether the geometry covers it or not.)
|..x../_|
|..../_x|
|x../___|
|../_x__|   // In this example only two subpixels cover the object.

With alpha to coverage you modify the coverage vector by writing to gl_SampleMask (only possible when running at pixel-frequency). This means, for instance, that when you have an alpha value under 25% you set one bit to 1, so the fragment's color is used for only that one subpixel. The other three subpixels can be covered by objects behind it. In the end all values get averaged and you get smooth borders. Of course this is only a fake and not as precise as sorting all fragments and blending them correctly, but it still gives good results.

Cheers!

Extensions of course. I wonder though, does the usage of extensions have a negative impact somehow, compared to GL 3.x or 4.x where a lot of functions are part of the core now? I need to enhance the performance of the engine, so every little thing that can help should be considered.


On current hardware it goes up to about 160, see here.

Really? By passing I mean binding a texture to one of the channels (glActiveTextureARB( channelNum )). If I let a (Cg) shader explicitly refer to "TEXUNIT16" or higher, it crashes AFAIK. Tried that on a ~1.5 year old nVidia card btw. Correct me if I'm wrong, but the table you showed is the maximum number of texture reads within a shader program, not the number of active textures, right?
If so, that still means I'm limited to 16 different textures when rendering (instanced) objects...

Well, it doesn't really matter. I assume instancing only makes sense if you have a lot of the same objects, which is usually not the case for my maps and their contents. Although plants or floor junk could share the same material (texture set) to minimize the switches, then get sorted on VBO instead and rendered in packs with instancing. A chair that appears 4 times probably doesn't benefit much from instancing, so I keep it sorted on material instead.


Constant-buffers
You said binding time is nearly the same. Could be, but the main difference is that I only have to fill the constant buffer once, while with traditional parameters I have to pass the parameters each time I apply a shader. For example, I have a material "wood1a". It uses 2 textures and has a specularColor float4 vector. When I apply "wood1a", I have to bind the 2 textures and pass the vector, each time again. Thousands of parameter passes happen each render cycle currently.

Aside from global dynamic stuff like the camera position or light color, those parameters never change though, so isn't it possible (in OpenGL) to create a buffer that contains *all* static parameters and upload it once, so I don't have to pass them anymore? I read somewhere that such a buffer (in DirectX) holds 4096 floats or vectors. Dunno if that is enough to store all numeric parameters from all materials I have, but it is quite a lot. As you showed, I could then do something like:
- float4 parameter1 = cbuffer[ material_offsetIndex + 0 ];
- float4 parameter2 = cbuffer[ material_offsetIndex + 1 ];
- Use the parameters, whatever they are
Then I only have to bind that cbuffer once, and pass the "material_offsetIndex" parameter for each different material. Would that make sense (if possible at all)?



Multi-Sample
I'm a slow learner, but that gives a better understanding. There are plenty of MSAA demos, so I should look there, and hopefully find one that uses the masking/coverage technique. Faulty ordering doesn't matter as long as the viewer doesn't really see it, which I doubt they will with foliage. I guess it only works properly for "black-white" transparency (a metal fence, for example) and not for translucent surfaces such as glass. But that's ok for grass and the like.

I still wonder though how edges will look when using it in a deferred rendering pipeline. Averaging colors is not a problem, but averaging normals, and especially positions/depth, creates weird results. I could let only the (near-)opaque pixels write normals/depth, but that gives a blocky edge when applying a light to it, right?



- Thanks again for the explanation!!
Rick

Hey Rick,


Extensions of course. I wonder though, does the usage of extensions have a negative impact somehow, compared to GL 3.x or 4.x where a lot of functions are part of the core now? I need to enhance the performance of the engine, so every little thing that can help should be considered.

Extensions are probably not that well optimized. (But I don't think that it makes a big difference.)


On current hardware it goes up to about 160, see here.

Really? By passing I mean binding a texture to one of the channels (glActiveTextureARB( channelNum )). If I let a (Cg) shader explicitly refer to "TEXUNIT16" or higher, it crashes AFAIK. Tried that on a ~1.5 year old nVidia card btw. Correct me if I'm wrong, but the table you showed is the maximum number of texture reads within a shader program, not the number of active textures, right?
If so, that still means I'm limited to 16 different textures when rendering (instanced) objects...
Hm, the OpenGL docs say you can bind with glActiveTexture exactly that number of textures.
You can test yourself how many are supported on your machine:
GLint value;
glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS, &value);


I just looked in the Cg docs for shader model 5 and surprisingly they really only list 16, as you said.
The number of shader resource views that can be bound in DirectX is also much higher than 16.


Well, it doesn't really matter. I assume instancing only makes sense if you have a lot of the same objects, which is usually not the case for my maps and their contents. Although plants or floor junk could share the same material (texture set) to minimize the switches, then get sorted on VBO instead and rendered in packs with instancing. A chair that appears 4 times probably doesn't benefit much from instancing, so I keep it sorted on material instead.

Makes sense, considering the time it would take to set up the instancing. This would have been my choice here, too.


Aside from global dynamic stuff like the camera position or light color, those parameters never change though, so isn't it possible (in OpenGL) to create a buffer that contains *all* static parameters and upload it once, so I don't have to pass them anymore? I read somewhere that such a buffer (in DirectX) holds 4096 floats or vectors.

You can create constant buffers (in OpenGL they are called uniform buffer objects) that are even bigger than 64 KB (= 4096 32-bit 4-component constants). But you can only bind 64 KB of it at a time.
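Creating and binding one is only a few calls (a sketch, GL 3.1 / ARB_uniform_buffer_object; "Materials" and allMaterialParams are placeholders, not from your engine):

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(allMaterialParams),
             allMaterialParams, GL_STATIC_DRAW);            /* upload once */

glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);                 /* bind (a range of) it to binding point 0 */

GLuint blockIndex = glGetUniformBlockIndex(program, "Materials");
glUniformBlockBinding(program, blockIndex, 0);               /* the "Materials" block reads from point 0 */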


Dunno if that is enough to store all numeric parameters from all materials I have, but it is quite a lot. As you showed, I could then do something like:
- float4 parameter1 = cbuffer[ material_offsetIndex + 0 ];
- float4 parameter2 = cbuffer[ material_offsetIndex + 1 ];
- Use the parameters, whatever they are
Then I only have to bind that cbuffer once, and pass the "material_offsetIndex" parameter for each different material. Would that make sense (if possible at all)?

Perhaps. Reading from a constant buffer is very fast (unless each fragment/vertex tries to read from a different location; that's why there are tbuffers, because constant buffers suffer from constant waterfalling). Though you have one more indirection: reading from the materialIndex uniform. I think you have to test whether it is faster, but it would definitely decrease the number of state changes.
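For completeness, the tbuffer counterpart in GL is a buffer texture (GL 3.1 / ARB_texture_buffer_object). A sketch, with paramBuffer as a placeholder array of float4 parameters:

GLuint tbo, tboTex;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, sizeof(paramBuffer), paramBuffer, GL_STATIC_DRAW);

glGenTextures(1, &tboTex);
glBindTexture(GL_TEXTURE_BUFFER, tboTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);   /* the shader reads it with texelFetch */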


Faulty ordering doesn't matter as long as the viewer doesn't really see it, which I doubt they will with foliage. I guess it only works properly for "black-white" transparency (a metal fence, for example) and not for translucent surfaces such as glass. But that's ok for grass and the like.

Indeed, that’s where stochastic transparency comes in. Unfortunately it is still too expensive for practical usage.


I still wonder though how edges will look when using it in a deferred rendering pipeline. Averaging colors is not a problem, but averaging normals, and especially positions/depth, creates weird results. I could let only the (near-)opaque pixels write normals/depth, but that gives a blocky edge when applying a light to it, right?

You could take a look at Johan Andersson's slides. Aside from other cool stuff (like tile-based deferred shading) he has code for the alpha to coverage technique in the slides (see page 52 for a comparison :)). Yes, deferred shading can look crappy at the boundaries if you shade at pixel frequency. The slides I linked also have a note (page 18) on how this is resolved. (They adaptively switch to sample-frequency at boundaries.)

Cheers! :)

With deferred rendering you would never resolve (average) your G-Buffer contents. You would render the G-Buffer with MSAA enabled, which would give you unique G-Buffer samples in the subsamples belonging to pixels along triangle edges. Then in your lighting you would light each of those subsamples individually, and either write them all out to an MSAA render target (to be resolved later) or resolve it on the fly (which can cause artifacts for certain cases, see the article in ShaderX7 for details). What this really boils down to is a form of selective supersampling, where you have the same memory footprint but only supersample the shading on triangle edges (actually with deferred rendering you can even limit supersampled shading to edges with depth, normal, or material discontinuities to avoid wasted work inside mesh silhouettes). This extends naturally to alpha testing, where you can supersample the alpha test in your shader and use that to have per-subsample visibility rather than per-pixel visibility. After that point everything just works, and when you resolve you get properly antialiased edges. There are no issues with sorting or blending, since each subsample is still depth tested.

Alpha to coverage is similar, but a little different. Instead of supersampling the alpha test you use the alpha value to drive a dither pattern that works across subsamples. But after that point it's the same as alpha test: the G-Buffer will contain normal/depth/albedo data for multiple subsamples, you'll light each subsample individually, and then resolve the result. Again there's no issues with sorting or blending, since each subsample is being treated as opaque and is properly depth tested. You can think of it as if you rendered the whole screen at a higher resolution, and the resolve would then be a downscale to the display resolution.
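To give an idea of the setup on the GL side (just a sketch using GL 3.2 / ARB_texture_multisample; the sample count, formats, and names are examples, not a prescription):

GLuint gbufTex, depthRb, fbo;

glGenTextures(1, &gbufTex);
glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, gbufTex);
glTexImage2DMultisample(GL_TEXTURE_2D_MULTISAMPLE, 4, GL_RGBA16F,
                        width, height, GL_TRUE);             /* one 4x MSAA G-Buffer target */

glGenRenderbuffers(1, &depthRb);
glBindRenderbuffer(GL_RENDERBUFFER, depthRb);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_DEPTH_COMPONENT24,
                                 width, height);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D_MULTISAMPLE, gbufTex, 0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                          GL_RENDERBUFFER, depthRb);
/* lighting pass: bind gbufTex as a sampler2DMS and read each subsample with texelFetch */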

There should be no performance impact from use of extensions vs core OpenGL.

In many cases a GL_ARB_ extension is promoted directly to the core API completely unmodified.

In recent times features from newer GL_VERSIONs have been back-ported to extension status so they could be used on older hardware.

In some cases an extension is ubiquitous but never promoted to core - anisotropic filtering and S3TC texture compression are two examples; they can't go to core because of patent/IP/legal crap, but everyone uses them and there is no performance impact due to their extension status.

In all cases the extension version should run just as well as the core version. The exception is where the driver exposes an extension but the hardware doesn't actually support it (i.e. the driver emulates it in software) - the GeForce FX series notoriously exposed GL_ARB_texture_non_power_of_two in order to claim OpenGL 2.0 status but would drop you right back to full software emulation if you actually tried to use it. So long as you catch instances of that happening (and you'll know it pretty quick when your performance falls to less than 1 fps) you'll be OK. Otherwise - no cause to be concerned about using extensions.

Ah, I'm getting well served here :)

As for OpenGL versions, that means I can safely keep using GL2 for now. Maybe it's time to drop the ancient OpenGL 1997 way of thinking and force myself to use VBOs and shaders for everything (as far as I'm not already doing that). But... upgrading is boring, so I might delay it a bit longer, hehe.

Texture amount
Maybe Cg is still a bit behind on that then? I can probably get by with 16 textures though. But having more allows some more flexibility, and less multi-passing in some cases maybe.

Uniform Buffer Objects
I don't yet understand the difference between c- and tbuffers (I thought tbuffer = texture buffer), but in practice the parameters will be grouped together in the buffer, and the materialIndex normally does not vary per pixel. One object or one wall normally has 1 materialID.
-edit-
Setting up a UBO in GL is child's play, but... how the hell do I pass & use it in Cg?
-edit again-
Last February, Cg 3.1 was released, which supports UBOs. Now it's just a matter of updating my Delphi header for the DLL :|


Deferred lighting
Thanks for the Battlefield paper, awesome stuff. The tiled deferred rendering also caught my interest (I hate it when you want to try 100 things at the same time). Little question about that... is it better to make a different shader for each possible light count ("shader_1lamp", "shader_2lamps", ...) or is looping through a variable-sized list totally ok these days?

As for alpha to coverage... I could ask, but it's better just to try it myself first :) I can see why you have to take multiple samples along edges and light them individually, then combine them into one in the end. But the techniques/steps to do that efficiently are new to me. So again, does someone know a good (OpenGL) demo or paper on alpha to coverage?

Thank you all!
