Treesong

Member
  • Content Count

    22
Community Reputation

177 Neutral

About Treesong

  • Rank
    Member

Personal Information

  • Website
  • Role
    Art Director
    Creative Director
    Technical Director
  • Interests
    Art
    Audio
    Business
    Design
    DevOps
    Production
    Programming
    QA

  1. Thanks Greedy Goblin. The GPU is great for much more than just graphics-related processing. Compute shaders are an easy way to do generic tasks when you're already familiar with shaders.
  2. The problem

     You’re building a game world that is big, so big in fact that not all of it can be loaded into memory at once. You also don’t want to introduce portals or level loading; you want the player to have an uninterrupted experience. For true continuous streaming, a typical scenario would be something like this:

     - The world is partitioned into tiles (quad-tree).
     - When the camera moves, tile data is read from disk and pre-processed in the background.
     - We need to render meshes for each tile.
     - There can be more than 1,000 tiles in the AOI, more than 100 different meshes and up to 10,000 instances per mesh on one tile.

     How do we improve from a worst case of 1,000,000,000 draw calls to a best case of 1 draw call?

     Introduction

     To focus on the render-data preparation specifically, I assume the reader is familiar with the following concepts:

     - Instanced mesh rendering
     - Compute shaders
     - AOI (Area Of Interest)
     - Quad-tree tile-based space partitioning

     For an introduction I recommend this BLOG entry on our website: http://militaryoperationshq.com/dev-blog-a-global-scope/

     I will use OpenGL to demonstrate details because we use it ourselves and because it is the platform-independent alternative. The technique, however, can be adapted for any modern graphics API that supports compute shaders.

     The solution

     The solution is to do the work on the GPU. This is the type of processing a GPU is particularly good at. The diagrams below show the memory layout. Each colour represents a different type of instance data, stored non-interleaved: for example position, texture-array layer-index or mesh scale-factor. Within each instance-data-type (colour) range, a sub-range (grey) will be used for storing data for instances of a particular mesh. In this example, there are 4 different meshes that can be instanced. Within each sub-range, there is room to store instance data for a “budget” amount of instances. After loop-stage step 4, we know exactly where to store instance data of each type (pos, tex-index, scale, etc.) for a particular mesh type. In this example, the scene contains no mesh-type 2 instances and many mesh-type 3 instances.

     Prepare once at start-up

     1. Load all mesh data of the models you want to be able to show into one buffer.
     2. Prepare GL state by creating a Vertex Array Object containing all bindings.
     3. Create a command-buffer containing Indirect-Structures, one structure for each mesh that you want to be able to render (a C++ sketch of this structure and buffer follows below).
     4. Fill the Indirect-Structure members that point to (non-instance) mesh vertex data.

     Steps for one new tile entering the AOI

     1. Read geometry from disk.
     2. Rasterize geometry into a material map.
     3. Generate instance points covering the tile. Select a grid density and randomise points inside their grid cell to make it look natural if you’re doing procedural instancing. Whole papers have been written about this topic alone.
     4. Sample from the material map at the grid point to cull points and decorate data.
     5. Store the result in a buffer per tile. Keep the result buffer of a tile for as long as it is in the AOI.

     Steps 1, 2, 3 and 4 may well be replaced by simply loading points from disk if they are pre-calculated offline. In our case we cover the entire planet, so we need to store land-use data in vector form and convert it into raster data online, to keep the install size manageable.
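     The Indirect-Structure referred to above is the standard OpenGL DrawElementsIndirectCommand: five uints per mesh, of which the second (instance count) and the fifth (base instance) are the fields the compute shaders below keep rewriting every loop. As a minimal sketch of the start-up step, the command buffer could be built like this; the MeshInfo record and its fields are hypothetical stand-ins for whatever mesh bookkeeping your engine already has:

     #include <cstdint>
     #include <vector>
     #include <GL/glew.h>   // or any other GL loader

     // Matches the layout OpenGL expects for glMultiDrawElementsIndirect
     struct DrawElementsIndirectCommand
     {
         uint32_t count;          // Number of indices of the mesh
         uint32_t instanceCount;  // Rewritten on the GPU every loop (offset 1 in the shaders)
         uint32_t firstIndex;     // Offset into the shared index buffer
         uint32_t baseVertex;     // Offset into the shared vertex buffer
         uint32_t baseInstance;   // Rewritten on the GPU every loop (offset 4 in the shaders)
     };

     // Hypothetical per-mesh bookkeeping kept on the CPU
     struct MeshInfo
     {
         uint32_t index_count;
         uint32_t first_index;
         uint32_t base_vertex;
     };

     GLuint create_command_buffer(const std::vector<MeshInfo>& p_meshes)
     {
         std::vector<DrawElementsIndirectCommand> l_commands(p_meshes.size());
         for (size_t i = 0; i < p_meshes.size(); ++i)
         {
             l_commands[i].count         = p_meshes[i].index_count;
             l_commands[i].instanceCount = 0;   // Filled in by the count/update shaders
             l_commands[i].firstIndex    = p_meshes[i].first_index;
             l_commands[i].baseVertex    = p_meshes[i].base_vertex;
             l_commands[i].baseInstance  = 0;   // Filled in by the update shader
         }

         GLuint l_buffer = 0;
         glGenBuffers(1, &l_buffer);
         glBindBuffer(GL_DRAW_INDIRECT_BUFFER, l_buffer);
         // Immutable, GPU-only storage (see "Memory allocation" further below)
         glBufferStorage(GL_DRAW_INDIRECT_BUFFER,
                         l_commands.size() * sizeof(DrawElementsIndirectCommand),
                         l_commands.data(), 0);
         glBindBuffer(GL_DRAW_INDIRECT_BUFFER, 0);
         return l_buffer;
     }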
     Steps for each loop

     This is where things get interesting.

     1. Do frustum and other culling of the tiles so you know which tiles are visible and contain meshes that need rendering.
     2. Clear the instance-count and base-instance fields of the indirect-structures in the command buffer. Run a simple compute shader for this. If you were to map the buffer or use glBufferData to allow access from the CPU, you would introduce an expensive upload and synchronisation, which we want to prevent.
     3. Run a compute shader over the tile set in view to determine which meshes to render. Just count instances per mesh in the instance-count member of the Indirect-Structure. This may require sampling from the material map again, or doing other calculations to pick a mesh LOD or reflect game state. It may very well require procedural math to “randomly” spawn meshes. This all depends on your particular game requirements.
     4. Fill in the base-instance member of the Indirect-Structures by launching a single compute shader instance.
     5. Run a compute shader to prepare render data. Do the calculations that determine which mesh to select, again. Claim a slot in the vertex-attributes buffer and store the render data. Since at this point we already know exactly how many instances of each mesh will need rendering (all counts and offsets), we know in what range a particular mesh instance needs to store its instance data. The order within the range for a particular mesh doesn’t matter.

     The important, maybe counterintuitive, thing here is that we do all the calculations that determine which mesh to instance twice. We don’t store the results from the first time. It would be complicated, memory-consuming and slow to remember which mesh instance of which tile ends up at which vertex-data location in the render buffer, just so we can look up an earlier stored result. It may feel wasteful to do the same work twice, but that is procedural thinking. On the GPU it is often faster to recalculate something than to store and read back an earlier result.

     Now everything is done on the GPU, and we only need to add some memory barriers to make sure data is actually committed before the next step uses it.

     Atomic operations

     Note that steps 3 and 5 of the loop stage require the compute shader to use atomic operations. They are guaranteed not to conflict with other shader instances when writing to the same memory location.

     Instance budget

     You need to select a budget for the maximum number of meshes that can be drawn at once. It defines the size of the instance-data buffer. This budget may not cover certain extreme situations, which means we need to make sure we do not exceed it. Step 4 updates the base-instance of the indirect-structure. At that point, we can detect whether we exceed the budget. We can simply force instance counts to zero when we reach the budget, but this will have the effect that potentially very visible mesh instances are included in or excluded from the render set each loop. To solve this, sort the indirect-structures, representing meshes, in the command buffer from high to low detail. This is only needed once at start-up (a small sketch follows below). That way the first meshes that will be dropped are low LOD and should have the least impact. If you’re using tessellation to handle LOD, you’ll have to solve this differently or make sure your budget can handle the extreme cases.
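     That one-time sort happens on the CPU before the command buffer is built. A minimal sketch, assuming a hypothetical detail_rank field on whatever per-mesh record the command buffer is generated from (lower rank meaning higher detail):

     #include <algorithm>
     #include <vector>

     // Hypothetical per-mesh record used to build the command buffer
     struct MeshRecord
     {
         unsigned detail_rank;   // 0 = highest detail; larger values = lower LOD
         // ... index counts, buffer offsets, etc.
     };

     void sort_meshes_by_detail(std::vector<MeshRecord>& p_meshes)
     {
         // High-detail meshes first, so that when the budget runs out it is the
         // low-LOD meshes at the end of the command buffer that get dropped.
         std::sort(p_meshes.begin(), p_meshes.end(),
                   [](const MeshRecord& a, const MeshRecord& b)
                   {
                       return a.detail_rank < b.detail_rank;
                   });
     }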
     Ready to render

     We now have one buffer containing all the render data needed to render all instances of all meshes on all tiles, in one call. We simply do a single render call using the buffer that contains the indirect-structures. In fact, we render all possible meshes. If, for the current situation, some meshes do not need to be rendered, their instance count in the indirect-structure will be zero and they will be skipped with very little to no overhead.

     How it used to be

     In a traditional scenario, we might have filled an indirect-structure buffer with only the structures for the meshes that need rendering, then copied all render data into a vertex-attribute buffer in an order matching the order of the indirect-structures in the command buffer, which means we need a sorting step. Next, an upload of this render data to the GPU is required. Since the upload is slow, we would probably need to double or, even better, triple buffer it to hide transfer and driver stages, so we don’t end up waiting before we can access/render a buffer.

     Summary

     Key points
     - Preprocess and store intermediate data on the GPU.
     - Make the mesh-instance render order fixed and always render all meshes.
     - Use a two-pass approach: first count instances, so we can predict the memory layout the second time.

     Benefits
     - No upload of render data each render loop
     - No need to refill/reorder the command buffer (indirect-structures) every loop
     - No sort needed to pack vertex data for mesh instances
     - No need for double/triple buffering

     Improvements

     The most performance is gained by selecting the best design for your software implementation. Nevertheless, there is some low-hanging fruit to be picked before going into proper profiling and tackling bottlenecks.

     Memory allocation

     You should use glBufferStorage to allocate GPU memory for buffers. Since we never need to access GPU memory from the CPU, we can do this:

     glBufferStorage(GL_ARRAY_BUFFER, data_size, &data[0], 0);

     The last parameter tells the GL how we want to access the memory. In our case we simply pass 0, meaning we will only access it from the GPU. This can make a substantial difference depending on vendor, platform and driver. It allows the implementation to make performance-critical assumptions when allocating GPU memory.

     Storing data

     This article describes vertex data that is stored non-interleaved. We are in massively parallel country now, not OO. Without going into details, the memory access patterns make it more efficient if, for example, all positions of all vertices/instances are packed. The same goes for all other attributes.

     Frustum culling

     In this example, tile frustum culling is done on the CPU. A tile, however, may contain many mesh instances, and it makes good sense to do frustum culling for those as well. It can easily be integrated into step 3 and performed on the GPU.

     Launching compute shaders

     The pseudo-code example shows how compute shaders are launched for the number of points on each tile. This means the number of points needs to be known on the CPU. Even though this is determined in the background, downloading this information requires an expensive transfer/sync. We can store this information on the GPU instead and use a glDispatchComputeIndirect call that reads the launch size from GPU memory.
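     A minimal sketch of such an indirect dispatch, assuming an earlier shader has already written the per-tile workgroup count into a GPU buffer (p_dispatch_buffer is a placeholder name):

     #include <GL/glew.h>

     // Matches the layout OpenGL expects for glDispatchComputeIndirect
     struct DispatchIndirectCommand
     {
         GLuint num_groups_x;
         GLuint num_groups_y;
         GLuint num_groups_z;
     };

     // Launch a compute shader whose workgroup count lives in GPU memory,
     // so the CPU never has to read the per-tile point count back.
     void dispatch_for_tile(GLuint p_shader, GLuint p_dispatch_buffer)
     {
         glUseProgram(p_shader);
         glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, p_dispatch_buffer);
         glDispatchComputeIndirect(0);   // Byte offset 0 into the bound dispatch buffer
         glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
     }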
     A trend

     This article shows how more work can be pushed to the GPU. We took this notion to the extreme by designing an engine from the ground up that runs completely on the GPU. The CPU only does I/O, user interaction and starting GPU jobs. You can read more about our “Metis Tech” on our blog page: http://militaryoperationshq.com/blog/ The main benefits are the lack of large data up/downloads and profiting from the huge difference in processing power between GPU and CPU. At some point, this gap will become a limiting bottleneck. According to NVidia, the GPU is expected to be 1000x more powerful than the CPU by 2025! (https://www.nextplatform.com/2017/05/16/embiggening-bite-gpus-take-datacenter-compute/)

     Appendix A - Pseudo C++ code

     //! \brief Prepare render data to draw all static meshes in one draw call
     void prepare_instancing(
           const std::vector<int32_t>& p_tiles_in_view   // Points per tile
         , const std::vector<GLuint>&  p_point_buffer
         , int32_t                     p_mesh_count
         , GLuint                      p_scratch_buffer
         , GLuint                      p_command_buffer
         , GLuint                      p_render_buffer
         , GLuint                      p_clear_shader
         , GLuint                      p_count_shader
         , GLuint                      p_update_shader
         , GLuint                      p_prepare_shader)
     {
         glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, p_command_buffer);
         glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, p_scratch_buffer);
         glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, p_render_buffer);

         // 2. Clear instance base and count
         glUseProgram(p_clear_shader);
         glDispatchCompute(p_mesh_count, 1, 1);
         glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

         // 3. Count instances per mesh
         glUseProgram(p_count_shader);
         for (size_t l_tile_index = 0; l_tile_index < p_tiles_in_view.size(); ++l_tile_index)
         {
             glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, p_point_buffer[l_tile_index]);
             glDispatchCompute(p_tiles_in_view[l_tile_index], 1, 1);
             glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
         }

         // 4. Update instance base
         glUseProgram(p_update_shader);
         glDispatchCompute(1, 1, 1);
         glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

         // 5. Prepare render data
         glUseProgram(p_prepare_shader);
         for (size_t l_tile_index = 0; l_tile_index < p_tiles_in_view.size(); ++l_tile_index)
         {
             glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, p_point_buffer[l_tile_index]);
             glDispatchCompute(p_tiles_in_view[l_tile_index], 1, 1);
             glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
         }

         glUseProgram(0);
         glBindBuffersBase(GL_SHADER_STORAGE_BUFFER, 0, 4, nullptr);
     }

     //! \brief Render all instances of all meshes on all tiles in one draw call
     void render_instanced(
           GLuint  p_vao
         , GLuint  p_command_buffer
         , GLuint  p_render_shader
         , int32_t p_mesh_count)   // Number of different meshes that can be shown
     {
         glBindVertexArray(p_vao);
         glUseProgram(p_render_shader);
         glBindBuffer(GL_DRAW_INDIRECT_BUFFER, p_command_buffer);

         glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, p_mesh_count, 0);

         glBindBuffer(GL_DRAW_INDIRECT_BUFFER, 0);
         glUseProgram(0);
         glBindVertexArray(0);
     }

     Appendix B - Compute-shader pseudo-code

     2. Clear shader

     //****************************************************************************
     //! \brief 2. p_clear_shader: Clear counts and offsets
     //****************************************************************************

     // The local workgroup size
     layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

     // Input: Contains the indirect-structs for rendering the meshes
     layout(std430, binding=0) buffer p_command_buffer
     {
         uint p_indirect_structs[];
     };

     // IO: Contains uints for counting point-instance-data per mesh, and claiming slots
     layout(std430, binding=2) buffer p_scratch_buffer
     {
         uint p_instance_counts[];   // Globally for all tiles. Size = number of mesh variants
     };

     void main()
     {
         uint l_invocation_id = gl_GlobalInvocationID.x;

         p_indirect_structs[l_invocation_id * 5 + 1] = 0;   // 5 uints per struct; the second is the instance-count
         p_instance_counts[l_invocation_id] = 0;
     }
     3. Count shader

     //****************************************************************************
     //! \brief 3. p_count_shader: Count instances
     //****************************************************************************

     // The local workgroup size
     layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

     // Output: Contains the indirect-structs for rendering the meshes
     layout(std430, binding=0) buffer p_command_buffer
     {
         uint p_indirect_structs[];   // Globally for all tiles
     };

     // Input: Contains point data for the current tile
     layout(std430, binding=1) buffer p_point_buffer
     {
         uint p_point_data[];
     };

     void main()
     {
         uint l_invocation_id = gl_GlobalInvocationID.x;

         //! \note What p_point_data contains is application specific. Probably at least a tile-local position.
         uint l_data = p_point_data[l_invocation_id];

         //! \todo Use data in p_point_data to determine which mesh to render, if at all.
         uint l_mesh_index = 0;

         // Count per instance; the second uint of each indirect-struct is the instance-count
         atomicAdd(p_indirect_structs[l_mesh_index * 5 + 1], 1);
     }

     4. Update shader

     //****************************************************************************
     //! \brief 4. p_update_shader: Update instance base
     //****************************************************************************

     // The local workgroup size
     layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

     // Input: Contains the indirect-structs for rendering the meshes
     layout(std430, binding=0) buffer p_command_buffer
     {
         uint p_indirect_structs[];
     };

     uniform uint g_indirect_struct_count = 0;
     uniform uint g_instance_budget = 0;

     void main()
     {
         uint l_invocation_id = gl_GlobalInvocationID.x;

         // This compute-shader should have been launched with 1 global instance!
         if (l_invocation_id > 0)
         {
             return;
         }

         // Update base-instance values in the DrawElementsIndirectCommand structs
         p_indirect_structs[0 * 5 + 4] = 0;   // The first entry starts at base-instance zero

         for (uint l_index = 1; l_index < g_indirect_struct_count; ++l_index)
         {
             uint l_n = l_index - 1;   // Index of the previous indirect-struct
             uint l_base_instance = p_indirect_structs[l_n * 5 + 4] + p_indirect_structs[l_n * 5 + 1];

             // If the budget is exceeded, set the instance count to zero
             if (l_base_instance >= g_instance_budget)
             {
                 p_indirect_structs[l_index * 5 + 1] = 0;
                 p_indirect_structs[l_index * 5 + 4] = p_indirect_structs[l_n * 5 + 4];
             }
             else
             {
                 p_indirect_structs[l_index * 5 + 4] = l_base_instance;
             }
         }
     }
     5. Prepare shader

     //****************************************************************************
     //! \brief 5. p_prepare_shader: Prepare render data
     //****************************************************************************

     // The local workgroup size
     layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

     // Input: Contains the indirect-structs for rendering the meshes
     layout(std430, binding=0) buffer p_command_buffer
     {
         uint p_indirect_structs[];
     };

     // Input: Contains point data for the current tile
     layout(std430, binding=1) buffer p_point_buffer
     {
         uint p_point_data[];
     };

     // IO: Contains mesh-counts for claiming slots
     layout(std430, binding=2) buffer p_scratch_buffer
     {
         uint p_instance_counts[];   // Globally for all tiles. Size = g_indirect_struct_count
     };

     // Output: Contains render data
     layout(std430, binding=3) buffer p_render_buffer
     {
         uint p_render_data[];
     };

     uniform uint g_indirect_struct_count = 0;
     uniform uint g_instance_budget = 0;

     void main()
     {
         uint l_invocation_id = gl_GlobalInvocationID.x;

         uint l_data = p_point_data[l_invocation_id];

         //! \todo Use data in p_point_data to determine which mesh to render, if at all. Again.
         uint l_mesh_index = 0;

         // This should never happen!
         if (l_mesh_index >= g_indirect_struct_count)
         {
             return;
         }

         // Only process meshes that have an instance count > 0
         if (p_indirect_structs[l_mesh_index * 5 + 1] == 0)
         {
             return;
         }

         // Reserve a slot to copy the instance data to
         uint l_slot_index = atomicAdd(p_instance_counts[l_mesh_index], 1);

         // From mesh-local to global instance-index
         l_slot_index += p_indirect_structs[l_mesh_index * 5 + 4];

         // Make sure not to trigger rendering for more instances than there is budget for.
         if (l_slot_index >= g_instance_budget)
         {
             return;
         }

         //! \todo Write any data you prepare for rendering to p_render_data using l_slot_index
     }

     PDF format: Efficient instancing in a streaming scenario.pdf [Wayback Machine Archive]
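     The article leaves open how the vertex stage consumes p_render_data. One possible setup, an assumption for illustration rather than necessarily what the engine above does, is to expose the render buffer as an instanced vertex attribute; glMultiDrawElementsIndirect applies each command's base-instance to instanced attribute fetches, which is exactly the offset the update shader computed:

     #include <GL/glew.h>

     // Bind one stream of per-instance data (here a single vec4) from the render
     // buffer as an instanced attribute inside the prebuilt VAO.
     void setup_instance_attribute(GLuint p_vao, GLuint p_render_buffer, GLuint p_attrib_location)
     {
         glBindVertexArray(p_vao);
         glBindBuffer(GL_ARRAY_BUFFER, p_render_buffer);

         glEnableVertexAttribArray(p_attrib_location);
         glVertexAttribPointer(p_attrib_location, 4, GL_FLOAT, GL_FALSE, sizeof(float) * 4, (void*)0);
         glVertexAttribDivisor(p_attrib_location, 1);   // Advance once per instance

         glBindBuffer(GL_ARRAY_BUFFER, 0);
         glBindVertexArray(0);
     }

     When compute-shader output is consumed as vertex attributes like this, the memory barrier before drawing should include GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT rather than only GL_SHADER_STORAGE_BARRIER_BIT.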
  3. Treesong

    Military Operations

     Military Operations is an operational-level wargame set in World War II. The game features large-scale battles taking place in continuous time, with an emphasis on manoeuvrability, managing services and effective order execution, all with a focus on realism. The game is played on highly detailed topographic maps, allowing commanders to zoom in between different levels of the military hierarchy. Military Operations is based on a unique game engine, which uses a spherical Earth model to visualize the environment. For more information see: MilOpsHQ website, Steam Store page, Development BLOG, Discussion FORUM
  4. Treesong

    GLSL vertex displacement slow

     Vertex displacement was not a critical feature of the project at that time, so I didn't bother to investigate further. Testing was limited to an 8-bit luminance texture format; no idea how the results relate to other formats, sorry. Regards, Serge van Keulen
  5. Treesong

    Splitscreen problems

     Normally you would render the main view first, then change the viewport and scissor to restrict rendering to a specific subset of the framebuffer. If the main view's section is convex, you can change its viewport and scissor settings to exclude the region of the framebuffer that will be written by the second rendering, thereby preventing overdraw. Take a look at glScissor; a small sketch follows below.
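     A minimal sketch of that viewport/scissor switching; the window size and the sub-view rectangle are illustrative assumptions:

     #include <GL/glew.h>

     // Render the main view into the whole framebuffer, then restrict the second
     // view to a sub-rectangle with viewport + scissor so it cannot overdraw it.
     void render_split(int p_window_w, int p_window_h)
     {
         // Main view: full framebuffer
         glDisable(GL_SCISSOR_TEST);
         glViewport(0, 0, p_window_w, p_window_h);
         // ... draw main view ...

         // Second view: for example the top-right quarter of the window
         glEnable(GL_SCISSOR_TEST);
         glViewport(p_window_w / 2, p_window_h / 2, p_window_w / 2, p_window_h / 2);
         glScissor (p_window_w / 2, p_window_h / 2, p_window_w / 2, p_window_h / 2);
         glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);   // The clear is clipped by the scissor
         // ... draw second view ...

         glDisable(GL_SCISSOR_TEST);
     }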
  6. Treesong

    GLSL vertex displacement slow

     I noticed that sampling textures in vertex shaders in GLSL is slow in comparison to Cg for anything that is not the latest generation of GPU. I suspect that it is a driver-development trade-off. I tested this on a GF7950. The same test took several seconds per frame with GLSL, where Cg could manage interactive framerates (15 FPS and up). I did the same test on a GF8800: both GLSL and Cg caused no noticeable impact on the framerate (over 300 fps). Stressing the test a bit more by increasing the number of samples per frame ultimately resulted in lower framerates, but Cg and GLSL did not differ much. To be honest I did not repeat these tests on AMD hardware, but I still suspect I would find similar results.
  7. Treesong

    WxWidgets and OpenGL Canvases

     wxWidgets/wxGLCanvas can be used just fine for OpenGL-based tooling. I've used it in several projects, from simple tools to complex level editors, and it works just fine. It basically does the context setup, after which you are free to do whatever you want with the context, including using extensions or extension libraries like GLEW. It is also safe to use multiple contexts and to share resources between them. Context sharing used to be a bit of a gamble in the past, but that was due to poor support for it in drivers (talking about a few years back). Of course, using one context and splitting views using scissors works fine also. There are a few small issues, as can be expected in any library, like having a wxAUI manager manage a wxGLCanvas, and some different behaviour between platforms, like adding scroll bars to a wxGLCanvas on OS X. But in general I can recommend wxWidgets and wxGLCanvas for 3D tool development. There are also good WYSIWYG editors for wxWidgets, like DialogBlocks, that handle wxGLCanvas fine.
  8. Treesong

    Terrain Realism

     I guess you read the PDF document? Blending three texels per pixel on a GeForce MX (actually a TNT2) is possible. In fact, I was using a TNT2 when I was developing the technique, and it was fast enough too. Sure, you can just select which texture to use based on elevation values. However, at least two issues arise:

     1. Since you don't want to (or can't) use shaders, you can only switch textures at a vertex. This means that texture transitions will follow triangle edges. Unless you use a very high vertex density (which you probably shouldn't), this will not look good (jaggy).
     2. To make it look natural, you probably want a nicely blended transition when switching textures. This means blending at least two textures over some overlapping area. If you use blending factors per vertex, you will need several triangles to make any kind of gradual transition, and again, triangle density determines quality. (Unless you use only elevation for texture selection (only vertical transitions), you will need more transition possibilities. The moment more than two materials (for example grass, rock and sand) meet at one vertex, you need more blending possibilities.)

     To get rid of the vertex-density dependency and to have more freedom in applying textures and making transitions, I developed the technique described in the PDF. My first attempt was based on per-vertex blend factors, and it is a good place to start. I would recommend you try the vertex approach too. Maybe after that you will feel up to the challenge of trying the blend-factor-per-texel approach. (The blend map could be generated automatically based on anything you want, including elevation.) For the blend-factors-per-vertex approach, the Delphi3D article as suggested by "ViLiO" is very good. Regards, Serge van Keulen
  9. Treesong

    Texturing terrain slopes

     Simply put, the technique from 3b combines planar mapping from 5 directions using 2 different textures. This minimizes stretching. It is very much like cubemapping: the surface normal determines which side of the cube to use. To get nice transitions, the normal is also used to calculate blend factors for blending the textures. It's a relatively new idea to use for terrain texturing. Without branching it demands three samples per fragment just for the diffuse colour of the fragment. Regards, Serge van Keulen
  10. Treesong

    Terrain Realism

     I just replied to a similar post. You may also find this document (PDF) on our old terrain engine useful. Regards, Serge van Keulen
  11. Treesong

    Texturing terrain slopes

     Sorry, option 2a got mangled:

     2a. Blend texturing
     Blend up to three tiling textures using the RGBA components of the vertex colors.
     Pros
     - No hand-made blend texture needed; blend info can be calculated and stored in the vertex color
     - No shaders needed
     - Compatible with older hardware
     - High-resolution texturing possible
     Cons
     - The blending is only as good as the vertex density. This is less of a problem with high-density regular grids; with terrain meshes that use LOD techniques this may be a less suitable approach
     - Only three different textures possible
     - May be slow on older hardware
     - Stretching of texels on steep slopes

     Hope it helps. Serge van Keulen
  12. Treesong

    Texturing terrain slopes

     Several terrain texturing approaches are possible. I will explain a few of the more widely used techniques in short.

     1. Planar mapping
     The simplest way is to apply one planar-mapped texture.
     Pros
     - Very simple to implement
     - Fast
     - No advanced techniques needed (shaders etc.)
     Cons
     - Texture must be hand-made to fit the terrain
     - On large terrain the quality suffers
     - On steep areas the texels stretch

     2a. Blend texturing
     Blend up to three tiling textures using the RGBA components of the vertex colors.
     Pros
     - No hand-made blend texture needed; blend info can be calculated and stored in the vertex color
     - No shaders needed
     - Compatible with older hardware
     - High-resolution texturing possible
     Cons
     - The blending is only as good as the vertex density. This is less of a problem with high-density regular grids; with terrain meshes that use LOD techniques this may be a less suitable approach
     - Only three different textures possible
     - May be slow on older hardware
     - Stretching of texels on steep slopes

     2b. Blend texturing
     Blend several tiling textures using blend info from a terrain-covering map. Map one blend-info map on the entire terrain, each component (RGBA) containing a blend factor.
     Pros
     - Blend resolution can be as good as the blend-info map's resolution (vertex-density independent)
     - Compatible with older hardware
     - No shaders needed
     - High-resolution texturing
     Cons
     - Limited to three textures
     - May not be fast on older hardware
     - Blend-info map must be hand-made for the terrain
     - Texel stretching on steep slopes

     3a. Detail texturing
     Map one big texture on the terrain containing simple colors. The color from the texture is used to select a specific (tiling) detail texture. The detail texture is combined with the color map. Transitions between detail maps are also possible using the color info.
     Pros
     - Many different detail maps are possible
     - Good quality results (Far Cry uses this technique)
     - Not extremely difficult to implement
     Cons
     - Shaders needed
     - Slow on older hardware
     - Global color map must be made by hand to match the terrain
     - Texel stretching on steep slopes

     3b. Detail texturing
     In the fragment shader, use the elevation and the components of the surface normal to calculate blend factors for blending (tiling) detail textures. The surface normal manipulates the blend factor for steep or flat terrain sections; the elevation selects different textures to match the altitude (a small sketch of this blend-factor math follows below).
     Pros
     - No hand-made terrain-matching texture needed
     - Very good results are possible
     - High-resolution texturing
     - Alternative info can be used to improve/alter detail texture selection
     - No texel stretching
     - As many different textures as the hardware allows are possible
     Cons
     - Shaders needed
     - Only for modern hardware
     - More difficult to implement
     - All possible detail textures must be available in the fragment shader

     4. A combination of the above-mentioned techniques
     For example, the blend map from option 2b could be generated using the techniques from 3b. It's up to your own inventiveness to come up with a clever combination that suits your goals.

     Serge van Keulen
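     A minimal sketch of the blend-factor math behind option 3b; the thresholds and texture pairings are made-up illustration values:

     #include <algorithm>

     // Blend factor between a "flat" texture (e.g. grass) and a "steep" texture
     // (e.g. rock), derived from the up-component of the normalised surface normal.
     float slope_blend(float normal_y)
     {
         const float steep_limit = 0.6f;   // Below this: fully steep texture
         const float flat_limit  = 0.9f;   // Above this: fully flat texture
         float t = (normal_y - steep_limit) / (flat_limit - steep_limit);
         return std::clamp(t, 0.0f, 1.0f); // 1 = flat texture, 0 = steep texture
     }

     // Blend factor between a "low" texture (e.g. sand) and a "high" texture
     // (e.g. snow), derived from elevation in metres.
     float elevation_blend(float elevation)
     {
         const float low_limit  = 800.0f;
         const float high_limit = 1200.0f;
         float t = (elevation - low_limit) / (high_limit - low_limit);
         return std::clamp(t, 0.0f, 1.0f);
     }

     The same math translates directly to the fragment shader, where the resulting factors feed the texture blends.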
  13. Treesong

    3ds max to opengl

    For reading, writing and creating .3ds files an official toolkit is available here. It may not be the most user friendly library, but it is the most complete 3ds library as far as I know. Serge van Keulen
  14. Treesong

    The big red Book

     If you really have to pick one, and you are beginning to learn OpenGL/graphics programming, I guess you should pick the SuperBible. But the Red Book is something that every OpenGL programmer should have within reach, especially when you start using more advanced techniques.
  15. Treesong

    User Interface

     For a simple way to include menus in your OpenGL or Direct3D view, take a look at: http://www.antisphere.com/Wiki/tools:anttweakbar