Fen

1. Originally posted at Bartek's Code and Graphics blog.

Let's say we have the following code:

    LegacyList* pMyList = new LegacyList();
    ...
    pMyList->ReleaseElements();
    delete pMyList;

In order to fully delete the object we need to perform an additional action. How can we make this more C++11? How can we use unique_ptr or shared_ptr here?

Intro

We all know that smart pointers are really nice things and we should be using them instead of raw new and delete. But what if deleting the pointer is not the only thing we need to do before the object is fully destroyed? In our short example we have to call ReleaseElements() to completely clear the list.

Side note: we could simply redesign LegacyList so that it properly clears its data inside its destructor. But for this exercise we have to assume that LegacyList cannot be changed (it's legacy, hard-to-fix code, or it might come from a third-party library). ReleaseElements is only my invention for this article. Other things might be involved here instead: logging, closing a file, terminating a connection, returning an object to a C-style library... or in general: any resource-releasing procedure, RAII.

To give more context to my example, let's discuss the following use of LegacyList:

    class WordCache
    {
    public:
        WordCache() { m_pList = nullptr; }
        ~WordCache() { ClearCache(); }

        void UpdateCache(LegacyList *pInputList)
        {
            ClearCache();
            m_pList = pInputList;
            if (m_pList)
            {
                // do something with the list...
            }
        }

    private:
        void ClearCache()
        {
            if (m_pList)
            {
                m_pList->ReleaseElements();
                delete m_pList;
                m_pList = nullptr;
            }
        }

        LegacyList *m_pList; // owned by the object
    };

You can play with the source code here: using Coliru online compiler.

This is a bit of an old-style C++ class. The class owns the m_pList pointer, so it must be initialized in the constructor and released in the destructor. To make life easier there is a ClearCache() method that is called from the destructor or from UpdateCache(). The main method, UpdateCache(), takes a pointer to a list and takes ownership of that pointer. The pointer is deleted in the destructor or when we update the cache again.

Simplified usage:

    WordCache myTestClass;
    LegacyList* pList = new LegacyList();
    // fill the list...
    myTestClass.UpdateCache(pList);

    LegacyList* pList2 = new LegacyList();
    // fill the list again
    // pList should be deleted, pList2 is now owned
    myTestClass.UpdateCache(pList2);

With the above code there shouldn't be any memory leaks, but we need to pay careful attention to what's going on with the pList pointer. This is definitely not modern C++! Let's update the code so that it is modernized and properly uses RAII (smart pointers in this case). Using unique_ptr or shared_ptr seems easy, but here we have a slight complication: how do we execute the additional code that is required to fully delete LegacyList? What we need is a Custom Deleter.

Custom Deleter for shared_ptr

I'll start with shared_ptr because this type of pointer is more flexible and easier to use. What should you do to pass a custom deleter? Just pass it when you create the pointer:

    std::shared_ptr<int> pIntPtr(new int(10),
        [](int *pi) { delete pi; }); // deleter

The above code is quite trivial and mostly redundant. In fact, it is more or less the default deleter, because it just calls delete on the pointer. But basically, you can pass any callable thing (lambda, functor, function pointer) as a deleter while constructing a shared pointer.
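As a quick illustration (a minimal sketch of my own, not from the original post), here are the three kinds of callables used as deleters for the same pointee type:

    // Hedged sketch: three equivalent ways to express a custom deleter.
    #include <memory>

    void deleteInt(int* p) { delete p; }            // plain function

    struct IntDeleter                                // functor
    {
        void operator()(int* p) const { delete p; }
    };

    int main()
    {
        std::shared_ptr<int> a(new int(1), deleteInt);                 // function pointer
        std::shared_ptr<int> b(new int(2), IntDeleter());              // functor
        std::shared_ptr<int> c(new int(3), [](int* p) { delete p; });  // lambda
        // note: all three variables have the same type, std::shared_ptr<int>,
        // because the deleter is type-erased inside the control block
    }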
In the case of LegacyList let's create a function:

    void DeleteLegacyList(LegacyList* p)
    {
        p->ReleaseElements();
        delete p;
    }

The modernized class is super simple now:

    class ModernSharedWordCache
    {
    public:
        void UpdateCache(std::shared_ptr<LegacyList> pInputList)
        {
            m_pList = pInputList;
            // do something with the list...
        }

    private:
        std::shared_ptr<LegacyList> m_pList;
    };

No need for a constructor - the pointer is initialized to nullptr by default. No need for a destructor - the pointer is cleared automatically. No need for the ClearCache helper - just reset the pointer and all the memory and resources are properly released.

When creating the pointer we need to pass that function:

    ModernSharedWordCache mySharedClass;
    std::shared_ptr<LegacyList> ptr(new LegacyList(), DeleteLegacyList);
    mySharedClass.UpdateCache(ptr);

As you can see there is no need to take special care of the pointer: just create it (remembering to pass a proper deleter) and that's all.

Where is the custom deleter stored?

When you use a custom deleter it won't affect the size of your shared_ptr type. If you remember, that should be roughly 2 x sizeof(ptr) (8 or 16 bytes)... so where does this deleter hide? shared_ptr consists of two things: a pointer to the object and a pointer to the control block (which contains, for example, the reference counter). The control block is created only once per given pointer, so two shared pointers to the same object will point to the same control block. Inside the control block there is space for the custom deleter and the allocator.

Can I use make_shared?

Unfortunately you can pass a custom deleter only in the constructor of shared_ptr; there is no way to use make_shared. This might be a bit of a disadvantage because, as I described in "Why create shared_ptr with make_shared?" on my old blog, make_shared allocates the object and its control block next to each other in memory. Without make_shared you get two, probably separate, blocks of allocated memory.

Update: I got a very good comment on reddit from quicknir saying that I am wrong on this point and that there is something you can use instead of make_shared. Indeed, you can use allocate_shared and get both a custom deleter and a single shared memory block. However, that requires you to write a custom allocator, so I considered it too advanced for the original article.

Custom Deleter for unique_ptr

With unique_ptr there is a bit more complication. The main thing is that the deleter type is part of the unique_ptr type. By default we get std::default_delete:

    template <class T, class Deleter = std::default_delete<T>>
    class unique_ptr;

The Deleter is part of the pointer type, so a heavy deleter (in terms of memory consumption) means a larger pointer type.

What to choose as the deleter?

What is best to use as a deleter? Let's consider the following options:

  - std::function
  - function pointer
  - stateless functor
  - stateful functor
  - lambda

What is the smallest size of unique_ptr with each of the above deleter types? Can you guess? (Answer at the end of the article.)

How to use it?

For our example problem let's use a functor:

    struct LegacyListDeleterFunctor
    {
        void operator()(LegacyList* p)
        {
            p->ReleaseElements();
            delete p;
        }
    };

And here is its usage in the updated class:

    class ModernWordCache
    {
    public:
        using unique_legacylist_ptr =
            std::unique_ptr<LegacyList, LegacyListDeleterFunctor>;

    public:
        void UpdateCache(unique_legacylist_ptr pInputList)
        {
            m_pList = std::move(pInputList);
            // do something with the list...
        }

    private:
        unique_legacylist_ptr m_pList;
    };
The code is a bit more complex than the version with shared_ptr - we need to define a proper pointer type. Below I show how to use the new class:

    ModernWordCache myModernClass;
    ModernWordCache::unique_legacylist_ptr pUniqueList(new LegacyList());
    myModernClass.UpdateCache(std::move(pUniqueList));

All we have to remember, since it's a unique pointer, is to move the pointer rather than copy it.

Can I use make_unique?

Similarly to shared_ptr, you can pass a custom deleter only in the constructor of unique_ptr, and thus you cannot use make_unique. Fortunately, make_unique is only for convenience (wrong!) and doesn't give any performance/memory benefits over normal construction.

Update: I was too confident about make_unique :) There is always a purpose for such functions. Look at GotW #89 Solution: Smart Pointers - guru question 3. make_unique is important because, first of all: "Guideline: Use make_unique to create an object that isn't shared (at least not yet), unless you need a custom deleter or are adopting a raw pointer from elsewhere." Secondly, make_unique gives exception safety: see "Exception safety and make_unique". So, by using a custom deleter we lose a bit of safety. It's worth knowing the risk behind that choice. Still, a custom deleter with unique_ptr is far better than playing with raw pointers.

Things to remember: custom deleters give a lot of flexibility that improves resource management in your apps.

Summary

In this post I've shown you how to use custom deleters with the C++ smart pointers shared_ptr and unique_ptr. Those deleters can be used in all the places where a 'normal' delete ptr is not enough: when you wrap FILE*, or some kind of C-style structure (SDL_FreeSurface, free(), destroy_bitmap from the Allegro library, etc.). Remember that proper cleanup is not only about destroying memory; often some other actions need to be invoked. With custom deleters you have that option.

Gist with the code is located here: fenbf/smart_ptr_deleters.cpp

Let me know what your common problems with smart pointers are. What blocks you from using them?

References

  - Item 18, 19, 21 from Effective Modern C++ by Scott Meyers
  - The C++ Standard Library, 2nd Edition, by Nicolai M. Josuttis (my review)
  - Smart pointer gotchas
  - More C++ Idioms/Resource Acquisition Is Initialization
  - StackOverflow: C++ std::unique_ptr: Why isn't there any size fee with lambdas?
  - StackOverflow: How to pass deleter to make_shared?

Answer to the question about pointer size

1. std::function - heavy stuff; on 64-bit gcc it showed me 40 bytes.
2. Function pointer - it's just a pointer, so now unique_ptr contains two pointers: one for the object and one for that function... so 2 * sizeof(ptr) = 8 or 16 bytes.
3. Stateless functor (and also a stateless lambda) - this is actually a very tricky case. You would probably say "two pointers"... but no. Thanks to empty base optimization (EBO) the final size is just the size of one pointer, the smallest possible thing.
4. Stateful functor - if there is some state inside the functor then we cannot do any optimization, so it will be sizeof(ptr) + sizeof(functor).
5. Lambda (stateful) - similar to a stateful functor.
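Those numbers are easy to verify yourself. Here is a minimal sketch (my own addition) that prints the sizes on your platform; exact values are implementation-specific:

    // Hedged sketch: measure unique_ptr sizes for different deleter kinds.
    // On 64-bit gcc expect roughly 8, 16, 16 and 40 bytes respectively.
    #include <cstdio>
    #include <functional>
    #include <memory>

    struct StatelessDeleter
    {
        void operator()(int* p) const { delete p; }
    };

    struct StatefulDeleter
    {
        int state = 0; // any state disables the EBO trick
        void operator()(int* p) const { delete p; }
    };

    int main()
    {
        std::printf("stateless functor: %zu\n",
            sizeof(std::unique_ptr<int, StatelessDeleter>));
        std::printf("function pointer:  %zu\n",
            sizeof(std::unique_ptr<int, void(*)(int*)>));
        std::printf("stateful functor:  %zu\n",
            sizeof(std::unique_ptr<int, StatefulDeleter>));
        std::printf("std::function:     %zu\n",
            sizeof(std::unique_ptr<int, std::function<void(int*)>>));
    }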
2. It depends on the size of the part that you want to update. If it's small, it may be better to use another approach: use a small buffer only for the changing part... and then copy that part (after the update) into the full texture - but do it on the GPU (which offers better bandwidth). On the other hand, if the region to update is relatively large (maybe 60%... or more), a persistent mapped buffer would be better.
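For the small-region case, a minimal sketch (my own interpretation of the advice; 'tex', the dirty rectangle and 'pixels' are hypothetical names assumed to exist in the surrounding code) could upload just the changed rectangle instead of the whole texture:

    // Hedged sketch: update only a small dirty rectangle of an existing
    // RGBA8 texture, leaving the rest untouched on the GPU.
    glBindTexture(GL_TEXTURE_2D, tex);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
    glTexSubImage2D(GL_TEXTURE_2D, 0,
                    dirtyX, dirtyY,            // offset inside the texture
                    dirtyW, dirtyH,            // size of the changed region
                    GL_RGBA, GL_UNSIGNED_BYTE,
                    pixels);                   // CPU-side data for that region only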
3. It seems that it's not easy to efficiently move data from CPU to GPU. Especially if we'd like to do it often - every frame, for example. Fortunately, OpenGL (since version 4.4) gives us a new technique to fight this problem. It's called persistent mapped buffers and it comes from the ARB_buffer_storage extension. Let us revisit this extension. Can it boost your rendering code?

Note 1: Originally posted at the Code And Graphics blog.
Note 2: This post is an introduction to the persistent mapped buffers topic; see the second part with benchmark results at my blog.

Contents: Intro, Moving Data, Synchronization, Double (Multiple) Buffering/Orphaning, Persistent Mapping, Demo, Summary, References.

Intro

The first thing I'd like to mention is that there are already a decent number of articles describing persistent mapped buffers. I've learned a lot especially from "Persistent mapped buffers" @ferransole.wordpress.com and "Maximizing VBO upload performance!" @Java-Gaming.org. This post serves as a summary and a recap of modern techniques used to handle buffer updates. I've used those techniques in my particle system - please wait a bit for the upcoming post about renderer optimizations.

OK... but let's talk about our main hero in this story: the persistent mapped buffer technique. It appeared in ARB_buffer_storage and became core in OpenGL 4.4. It allows you to map a buffer once and keep the pointer forever. No need to unmap it and release the pointer to the driver... all the magic happens underneath.

Persistent mapping is also included in the modern OpenGL set of techniques called "AZDO" - Approaching Zero Driver Overhead. As you can imagine, by mapping the buffer only once we significantly reduce the number of heavy OpenGL function calls and, what's more important, fight synchronization problems.

One note: this approach can simplify the rendering code and make it more robust; still, try to stay as much as possible on the GPU side only. Any CPU-to-GPU data transfer will be much slower than GPU-to-GPU communication.

Moving Data

Let's now go through the process of updating the data in a buffer. We can do it in at least two different ways: glBuffer*Data and glMapBuffer*. To be precise: we want to move some data from app memory (CPU) to the GPU so that the data can be used in rendering. I'm especially interested in the case where we do it every frame, like in a particle system: you compute the new positions on the CPU, but then you want to render them. A CPU-to-GPU memory transfer is needed. An even more complicated example would be updating video frames: you load data from a media file, decode it and then modify the texture data that is then displayed. Often such a process is referred to as streaming.

In other terms: the CPU is writing data, the GPU is reading. Although I mention 'moving', the GPU can actually read directly from system memory (using GART). So there is no need to copy data from one buffer (on the CPU side) to a buffer on the GPU side. In that approach we should rather think of it as 'making data visible' to the GPU.

glBufferData/glBufferSubData

Those two procedures (available since OpenGL 1.5!) will copy your input data into pinned memory. Once that's done, an asynchronous DMA transfer can be started and the invoked procedure returns. After that call you can even delete your input memory chunk. The "theoretical" flow for this method: data is passed to a glBuffer*Data function and then internally OpenGL performs a DMA transfer to the GPU.

Note: glBufferData invalidates and reallocates the whole buffer. Use glBufferSubData to update only the data inside.
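A minimal sketch of this classic path (my own example; 'vbo', 'Vec4' and 'positions' are assumed to exist in the surrounding code):

    // Hedged sketch: per-frame streaming with glBufferSubData.
    // 'vbo' was created earlier with glBufferData(..., GL_STREAM_DRAW).
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // copy this frame's particle positions; the driver owns the copy
    // from here on, so 'positions' may be reused immediately
    glBufferSubData(GL_ARRAY_BUFFER, 0,
                    particleCount * sizeof(Vec4), positions.data());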
glMap*/glUnmap*

With the mapping approach you simply get a pointer to pinned memory (this might depend on the actual implementation!). You can copy your input data and then call glUnmap to tell the driver that you are finished with the update. So it looks like the approach with glBufferSubData, but you manage the copying yourself. Plus you get some more control over the entire process. The "theoretical" flow for this method: you obtain a pointer to (probably) pinned memory, then you can copy your original data (or compute it), and at the end you have to release the pointer via glUnmapBuffer.

... All the above methods look quite easy: you just pay for the memory transfer. It could be that way if only there were no such thing as synchronization...

Synchronization

Unfortunately life is not that easy: you need to remember that the GPU and CPU (and even the driver) run asynchronously. When you submit a draw call it will not be executed immediately... it will be recorded in the command queue but will probably be executed much later by the GPU. When we update buffer data we might easily get a stall - the GPU will wait while we modify the data. We need to be smarter about it.

For instance, when you call glMapBuffer the driver can create a mutex so that the buffer (which is a shared resource) is not modified by the CPU and GPU at the same time. If that happens often, we'll lose a lot of GPU power. The GPU can be blocked even in a situation when your buffer is only recorded to be rendered and not currently read.

Generally, when the GPU and CPU need to synchronize - wait for each other - we get gaps in the pipeline. In a real-life scenario those gaps might have different sizes and there might be multiple sync points in a frame. The less waiting, the more performance we can get. So, reducing synchronization problems is another incentive to have everything happening on the GPU.

Double (Multiple) Buffering/Orphaning

A quite recommended idea is to use double or even triple buffering to solve the synchronization problem:

  - create two buffers
  - update the first one
  - in the next frame update the second one
  - swap buffer IDs...

That way the GPU can draw (read) from one buffer while you update the next one. How can you do that in OpenGL?

  - Explicitly use several buffers and a round-robin algorithm to update them.
  - Use glBufferData with a NULL pointer before each update:
      - the whole buffer will be recreated, so we can store our data in a completely new place
      - the old buffer will still be used by the GPU - no synchronization will be needed
      - the GPU will probably figure out that the subsequent buffer allocations are similar, so it will reuse the same memory chunks. I remember that this approach was not suggested in older versions of OpenGL.
  - Use glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT:
      - additionally use the UNSYNCHRONIZED bit and perform synchronization on your own
      - there is also a procedure called glInvalidateBufferData that does the same job

A sketch of the orphaning approach follows this list.
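Both variants in one hedged snippet (my own; 'vbo', 'bufSize' and 'newData' are assumed from context):

    // Hedged sketch: buffer orphaning.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // variant A: orphan the storage, then fill the fresh block
    glBufferData(GL_ARRAY_BUFFER, bufSize, nullptr, GL_STREAM_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, bufSize, newData);

    // variant B: map-and-invalidate; the driver may hand us new storage
    // while the GPU keeps reading the old one
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(dst, newData, bufSize);
    glUnmapBuffer(GL_ARRAY_BUFFER);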
Triple buffering

The GPU and CPU run asynchronously... but there is also another factor: the driver. It may happen (and on desktop driver implementations it happens quite often) that the driver also runs asynchronously. To solve this even more complicated synchronization scenario, you might consider triple buffering:

  - one buffer for the CPU
  - one for the driver
  - one for the GPU

This way there should be no stalls in the pipeline, but you need to sacrifice a bit more memory for your data.

More reading on the @hacksoflife blog: "Double-Buffering VBOs", "Double-Buffering Part 2 - Why AGP Might Be Your Friend", "One More On VBOs - glBufferSubData".

Persistent Mapping

OK, we've covered common techniques for data streaming, but now let's talk about the persistent mapped buffer technique in more detail.

Assumptions: GL_ARB_buffer_storage must be available, or OpenGL 4.4.

Creation:

    glGenBuffers(1, &vboID);
    glBindBuffer(GL_ARRAY_BUFFER, vboID);
    flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glBufferStorage(GL_ARRAY_BUFFER, MY_BUFFER_SIZE, 0, flags);

Mapping (only once, after creation...):

    flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    myPointer = glMapBufferRange(GL_ARRAY_BUFFER, 0, MY_BUFFER_SIZE, flags);

Update:

    // wait for the buffer
    // just take your pointer (myPointer) and modify the underlying data...
    // lock the buffer

As the name suggests, this allows you to map the buffer once and keep the pointer forever. At the same time you are left with the synchronization problem - that's why there are comments about waiting for and locking the buffer in the code above.

The flow: first we need to get a pointer to the buffer memory (but we do that only once), then we can update the data (without any special calls to OpenGL). The only additional action we need to perform is synchronization, i.e. making sure that the GPU does not read while we write. All the needed DMA transfers are invoked by the driver.

The GL_MAP_COHERENT_BIT flag makes your changes in memory automatically visible to the GPU. Without this flag you would have to set a memory barrier manually. Although it looks like GL_MAP_COHERENT_BIT should be slower than explicit, custom memory barriers and syncing, my first tests did not show any meaningful difference. I need to spend more time on that... Maybe you have some more thoughts on it? BTW: even in the original AZDO presentation the authors mention using GL_MAP_COHERENT_BIT, so this shouldn't be a serious problem :)

Syncing

    // waiting for the buffer
    GLenum waitReturn = GL_UNSIGNALED;
    while (waitReturn != GL_ALREADY_SIGNALED &&
           waitReturn != GL_CONDITION_SATISFIED)
    {
        waitReturn = glClientWaitSync(syncObj, GL_SYNC_FLUSH_COMMANDS_BIT, 1);
    }

    // lock the buffer:
    glDeleteSync(syncObj);
    syncObj = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

When we finish writing to the buffer we place a sync object. Then, in the following frame, we wait until this sync object is signaled. In other words, we wait until the GPU processes all the commands submitted before that sync was set.

Triple buffering with persistent mapping

But we can do better: by using triple buffering we can be sure that the GPU and CPU will not touch the same data in the buffer:

  - allocate one buffer 3x the original size
  - map it forever
  - bufferID = 0
  - update/draw:
      - update only the bufferID range of the buffer
      - draw only that range
      - bufferID = (bufferID + 1) % 3

That way, in the next frame you update another part of the buffer, so there will be no conflict. Another way would be to create three separate buffers and update them in a similar way. A minimal sketch of the round-robin update loop follows.
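Here is a hedged sketch of that loop (my own; 'myPointer' and 'syncObj[3]' come from the setup above, while 'WaitForFence' and 'FillParticleData' are hypothetical helpers - the first wrapping the glClientWaitSync loop shown earlier):

    // Hedged sketch: per-frame update of one third of a persistently
    // mapped buffer, with a fence per section.
    const size_t sectionSize = MY_BUFFER_SIZE / 3;
    int bufferID = 0;

    while (rendering)
    {
        // wait until the GPU is done with THIS section (see Syncing above)
        WaitForFence(syncObj[bufferID]);

        // write new data straight through the persistent pointer
        float* dst = (float*)((char*)myPointer + bufferID * sectionSize);
        FillParticleData(dst, sectionSize);

        // draw only that range of the buffer
        glDrawArrays(GL_POINTS, bufferID * particlesPerSection,
                     particlesPerSection);

        // place a fence for this section and advance to the next one
        glDeleteSync(syncObj[bufferID]);
        syncObj[bufferID] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        bufferID = (bufferID + 1) % 3;
    }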
Demo

I've forked the demo application from Ferran Sole's example and extended it a bit. Here is the github repo: fenbf/GLSamples. Features:

  - configurable number of triangles
  - configurable number of buffers: single/double/triple
  - optional syncing
  - optional debug flag
  - benchmark mode
  - output: number of frames, plus a counter that is incremented each time we wait for the buffer

Full results will be published in a follow-up post at my blog.

Summary

This was a long article, but I hope I explained everything in a decent way. We went through the standard approaches to buffer updates (buffer streaming) and saw our main problem: synchronization. Then I described the usage of persistent mapped buffers.

Should you use persistent mapped buffers? Here is a short summary:

Pros:

  - easy to use
  - the obtained pointer can be passed around in the app
  - in most cases gives a performance boost for very frequent buffer updates (when data comes from the CPU side): it reduces driver overhead and minimizes GPU stalls
  - advised for AZDO techniques

Cons:

  - do not use it for static buffers or buffers that do not require updates from the CPU side
  - best performance comes with triple buffering (which might be a problem when you have large buffers, because you need a lot of memory for them)
  - you need to do explicit synchronization
  - requires OpenGL 4.4, so only recent GPUs support it

In the next post I'll share my results from the demo application. I've compared the glMapBuffer approach with glBuffer*Data and persistent mapping.

Interesting questions:

  - Is this extension better or worse than AMD_pinned_memory?
  - What if you forget to sync, or do it in a wrong way? I did not get any app crashes and hardly saw any artifacts, but what is the expected result of such a situation?
  - What if you forget to use GL_MAP_COHERENT_BIT? Is there that much of a performance difference?

References

  - OpenGL Insights, Chapter 28 - Asynchronous Buffer Transfers by Ladislav Hrabcak and Arnaud Masserann, a free chapter [PDF] from OpenGL Insights (http://openglinsights.com/)
  - Persistent mapped buffers @ferransole.wordpress.com
  - Maximizing VBO upload performance! @Java-Gaming.org forum
  - Buffer Object @OpenGL Wiki
  - Buffer Object Streaming @OpenGL Wiki
  - persistent buffer mapping - what kind of magic is this? @OpenGL Forum

Article Update Log

4th Feb 2015: Initial release
4. As a software/game developer, you usually want more and more... of everything, actually! More pixels, more triangles, more FPS, more objects on the screen, bots, monsters. Unfortunately you don't have endless resources and you end up with some compromises. The optimization process can help reduce performance bottlenecks and free up power hidden in the code.

Optimization shouldn't be based on random guesses: "oh, I think that if I rewrite this code to SIMD the game will run a bit faster." How do you know that "this code" causes real performance problems? Is investing there a good option? Will it pay off? It would be nice to have a clear guide, a direction.

In order to better understand what to improve, you need to establish a baseline for the system/game. In other words, you need to measure the current state of the system and find hot spots and bottlenecks. Then think about the factors you would like to improve... and then... start optimizing the code! Such a process might not be perfect, but at least you will minimize potential errors and maximize the outcome. Of course, the process will not be finished in only one iteration. Every time you make a change, the process starts from the beginning. Do one small step at a time. Iteratively.

At the end your game/app should still work (without new bugs, hopefully) and it should run X times faster. The factor X can even be measured accurately, if you do the optimization right.

The Software Optimization Process

According to this and this book, the process should look like this:

  1. Benchmark
  2. Find hot spots and bottlenecks
  3. Improve
  4. Test
  5. Go back

The whole process should not start after the whole implementation (when usually there is no time to do it), but should be executed throughout the project's lifetime. In the case of our particle system I tried to think about possible improvements up front.

1. The benchmark

Having a good benchmark is crucial. If you get it wrong then the whole optimization process can even be a waste of time. From The Software Optimization Cookbook: "The benchmark is the program or process used to: objectively evaluate the performance of an application; provide repeatable application behavior for use with performance analysis tools."

The core, required attributes:

  - Repeatable - gives the same results every time you run it.
  - Representative - uses a large portion of the main application's use cases. It would be pointless to focus only on a small part of it. For a game, such a benchmark could include the most common scene, or a scene with the maximum number of triangles/objects (that way simpler scenes will also work faster).
  - Easy to run - you don't want to spend hours setting up and running the benchmark. A benchmark is definitely harder to make than a unit test, but it would be nice if it ran as fast as possible. Another point is that it should produce easy-to-read output: for instance an FPS report, a timing report, simple logs... but not hundreds of lines of messages from internal subsystems.
  - Verifiable - make sure the benchmark produces valid and meaningful results.

2. Find hot spots and bottlenecks

When you run your benchmark you will get some output. You can also run profiling tools and get more detailed results about how the application is performing. But having the data is one thing; it is more important to understand it, analyze it and draw good conclusions. You need to find the problem that blocks the application from running at full speed.
Just to summarize:

  - bottleneck - a place in the system that makes the whole application slower, like the weakest link of a chain. For instance, you can have a powerful GPU, but without fast memory bandwidth you will not be able to feed this GPU monster with data - it will wait.
  - hot spot - a place in the system that does crucial, intensive work. If you optimize such a module then the whole system should work faster. For instance, if the CPU is too busy then maybe offload some work to the GPU (if it has free compute resources available).

This part may be the hardest. In a simple system it is easy to see the problem, but in large-scale software it can be quite tough. Sometimes it is only one small function, or the whole design, or some algorithm used. Usually it is better to use a top-down approach. For example: your framerate is too low. Measure your CPU/GPU utilization. Then go to the CPU or GPU side. If CPU: think about your main subsystems: is it the animation module, AI, physics? Or maybe your driver cannot process so many draw calls? If GPU: vertex or fragment bound... Then go down into the details.

3. Improve

Now the fun part! Improve something and the application should work better :) What you can improve:

  - at the system level - look at the utilization of your whole app. Are any resources idle? (Is the CPU or GPU waiting?) Do you use all the cores?
  - at the algorithmic level - do you use proper data structures/algorithms? Maybe instead of an O(n) solution you can reduce it to O(log n)?
  - at the micro level - the 'funniest' part, but do it only when the first two levels are satisfied. If you are sure nothing more can be designed better, you need to use some dirty code tricks to make things faster.

One note: instead of rewriting everything to assembler, use your tools first. Today's compilers are powerful optimizers as well. Another issue here is portability: one trick might not work on another platform.

4. Test

After you make a change, test how the system behaves. Did you get a 50% speed increase? Or maybe it is even slower? Besides performance testing, please make sure you are not breaking anything! I know that making a system 10% faster is nice, but your boss will not be happy if, thanks to this improvement, you introduce several hard-to-find bugs!

5. Go back

After you are sure everything works even better than before... just run your benchmark and repeat the process. It is better to make small, simple changes rather than big, complex ones. With smaller moves it is harder to make a mistake. Additionally, it is easy to revert the changes.

Profiling Tools

Main methods:

  - custom timers/counters - you can create a separate configuration (based on Release mode) and enable a set of counters or timers. For instance, you can place them in every function of a critical subsystem. You can generate a call hierarchy and analyse it further. A minimal sketch of such a timer appears right after this list.
  - instrumentation - the tool adds special fragments of code to your executable so that it can measure the execution process.
  - interception - the tool intercepts API calls (for instance OpenGL - glIntercept, or DirectX) and later analyses the recorded log.
  - sampling - the tool stops the application at specific intervals and analyses the function stack. This method is usually much lighter than instrumentation.
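Here is a minimal sketch (my own addition) of the custom-timer idea using C++11's <chrono>:

    // Hedged sketch: a scope timer that prints elapsed milliseconds
    // when it goes out of scope.
    #include <chrono>
    #include <cstdio>

    class ScopeTimer
    {
    public:
        explicit ScopeTimer(const char* name)
            : m_name(name)
            , m_start(std::chrono::high_resolution_clock::now()) {}

        ~ScopeTimer()
        {
            auto end = std::chrono::high_resolution_clock::now();
            auto ms = std::chrono::duration<double, std::milli>(end - m_start).count();
            std::printf("%s: %.3f ms\n", m_name, ms);
        }

    private:
        const char* m_name;
        std::chrono::high_resolution_clock::time_point m_start;
    };

    void UpdateParticles()
    {
        ScopeTimer t("UpdateParticles"); // reports on scope exit
        // ... real work ...
    }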
Below is a list of professional tools that can help:

  - Intel(R) VTune(TM) Amplifier
  - Visual Studio Profiler
  - AMD CodeXL - FREE. AMD created a good, easy-to-use profiling tool for CPU and GPU alike. It does the best job when you also have an AMD CPU (which I don't have ;/), but for Intel CPUs it will at least give you timing reports.
  - Valgrind - runs your app on a virtual machine and can detect various problems: from memory leaks to performance issues.
  - gprof - Unix; uses a hybrid of sampling and instrumentation.
  - Lots of others... see the list on Wikipedia.

Something more

Automate. I probably do not need to write this... but the more you automate, the easier your job will be. This rule applies, nowadays, to almost everything: testing, application setup, running the application, etc.

Have fun! The above process sounds very 'professional' and 'boring'. There is another factor that plays an important role when optimizing code: just have fun! You want to make mistakes, you want to guess what to optimize and you want to learn new things. In the end, you will still gain new experience (even if you optimized the wrong method). You might not have enough time for this at your day job, but what about a hobby project? The more experience with the optimization process you have, the faster your code can run.

References

  - The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition, Intel Press (December 2005) - contains lots of useful information, written in a light way. I won it at GDC Europe 2011 :)
  - Video Game Optimization, by Eric Preisz - another good book, quite unique in this area. I would like to see a second edition - improved, updated and maybe extended.
  - C++ For Game Programmers (Game Development Series)
  - Game Coding Complete, Fourth Edition
  - Agner's optimization manuals
  - Understanding Profiling Methods @MSDN
  - Sampling vs Instrumentation - oktech-profiler docs
  - Code and Graphics: particle system implementation series

Article Update Log

17th August 2014: Initial version, based on a post from the Code and Graphics blog
5. One of the most crucial parts of a particle system is the container for all the particles. It has to hold all the data that describes the particles, it should be easy to extend, and it should be fast enough. In this post I will write about the choices, problems and possible solutions for such a container.

The Series

  - Introduction
  - Particle Container 1 - problems (this post)
  - Particle Container 2 - implementation
  - Generators & Emitters
  - Updaters
  - Renderer
  - Tools
  - Optimizations
  - SIMD Optimizations
  - Renderer Optimizations

Introduction

What is wrong with this code?

    class Particle
    {
    public:
        bool m_alive;
        Vec4d m_pos;
        Vec4d m_col;
        float time;
        // ... other fields
    public:
        // ctors...
        void update(float deltaTime);
        void render();
    };

and then the usage of this class:

    std::vector<Particle> particles;

    // update function:
    for (auto &p : particles)
        p.update(dt);

    // rendering code:
    for (auto &p : particles)
        p.render();

Actually one could say that it is OK. And for some simple cases indeed it is. But let us ask several questions:

  - Are we OK with the SRP principle here?
  - What if we would like to add one field to the particle? Or have one particle system with pos/col and another with pos/col/rotation/size? Is our structure capable of such a configuration?
  - What if we would like to implement a new update method? Should we implement it in some derived class?
  - Is the code efficient?

My answers:

  - It looks like SRP is violated here. The Particle class is responsible not only for holding the data but also for updates, generation and rendering. Maybe it would be better to have one configurable class for storing the data, some other system/module for updates and another for rendering? I think that option is much better designed.
  - With the Particle class built that way we are blocked from adding new properties dynamically. The problem is that we use an AoS (Array of Structs) pattern rather than SoA (Structure of Arrays). In SoA, when you want one more particle property you simply create/add a new array.
  - As I mentioned in the first point: we are violating SRP, so it is better to have a separate system for updates and rendering. For simple particle systems our original solution will work, but when you want modularity/flexibility/usability it will not be good.
  - There are at least three performance issues with the design:
      - The AoS pattern might hurt performance.
      - In the update code for each particle we have not only the computation code, but also a (virtual) function call. We will see almost no difference for 100 particles, but when we aim for 100k or more it will be visible for sure.
      - The same problem goes for rendering. We cannot render each particle on its own; we need to batch them into a vertex buffer and make as few draw calls as possible.

All of the above problems must be addressed in the design phase.

Add/Remove Particles

It was not visible in the above code, but another important topic for a particle system is the algorithm for adding and killing particles:

    void kill(particleID) { ?? }
    void wake(particleID) { ?? }

How do we do this efficiently?

First thing: Particle Pool

It looks like particles need a dynamic data structure - we would like to dynamically add and delete particles. Of course we could use a list or std::vector and change it every time, but would that be efficient? Is it good to reallocate memory often (each time we create a particle)? One thing we can initially assume is that we can allocate one huge buffer that will contain the maximum number of particles. A sketch of such an SoA-style pool follows.
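Here is a minimal sketch of that idea (my own; 'Vec4' and the property names are placeholders, not the article's final design), pre-allocating SoA arrays for a fixed maximum count:

    // Hedged sketch: a pre-allocated SoA particle pool.
    #include <memory>

    struct Vec4 { float x, y, z, w; }; // placeholder math type

    class ParticleData
    {
    public:
        explicit ParticleData(size_t maxCount)
            : m_count(maxCount)
            , m_countAlive(0)
            , m_pos(new Vec4[maxCount])
            , m_col(new Vec4[maxCount])
            , m_time(new float[maxCount]) {}

        // adding one more particle property = adding one more array here

        size_t m_count;       // capacity, fixed after construction
        size_t m_countAlive;  // alive particles live in [0, m_countAlive)
        std::unique_ptr<Vec4[]>  m_pos;
        std::unique_ptr<Vec4[]>  m_col;
        std::unique_ptr<float[]> m_time;
    };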
With one big pre-allocated buffer we do not need memory reallocations all the time. We solved one problem - numerous buffer reallocations - but, on the other hand, we now face a problem with fragmentation: some particles are alive and some of them are not. So how do we manage them in one single buffer?

Second thing: Management

We can manage the buffer in at least two ways:

  - Use an alive flag and, in the for loop, update/render only active particles. This unfortunately causes another problem with rendering, because there we need a continuous buffer of things to render. We cannot easily check whether a particle is alive or not. To solve this we could, for instance, create another buffer and copy alive particles into it every time before rendering.
  - Dynamically move killed particles to the end, so that the front of the buffer contains only alive particles: when we decide that a particle needs to be killed, we swap it with the last active one. This method is faster than the first idea: when we update particles there is no need to check whether each one is alive - we update only the front of the buffer - and there is no need to copy alive particles to some other buffer.

A minimal sketch of this swap trick follows.
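Here it is (my own sketch, continuing the hypothetical ParticleData pool from above; a real implementation would swap every property array):

    // Hedged sketch: kill/wake via swap-with-last, keeping alive
    // particles packed at the front of the pool.
    #include <utility>

    void kill(ParticleData& d, size_t id)
    {
        if (d.m_countAlive == 0) return;
        --d.m_countAlive;
        // swap the dead particle with the last alive one, per property array
        std::swap(d.m_pos[id],  d.m_pos[d.m_countAlive]);
        std::swap(d.m_col[id],  d.m_col[d.m_countAlive]);
        std::swap(d.m_time[id], d.m_time[d.m_countAlive]);
    }

    void wake(ParticleData& d)
    {
        if (d.m_countAlive < d.m_count)
            ++d.m_countAlive; // slot m_countAlive-1 is now live; initialize it
    }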
What's Next

In this article I've introduced several problems we can face when designing a particle container. Next time I will show my implementation of the system and how I solved the described problems. BTW: do you see any more problems with the design? Please share your opinions in the comments.

Links

  - Coding: AoS & SoA Explorations Part 1, Part 2, Part 3 and Four

14 Jun 2014: Initial version, reposted from the Code and Graphics blog

6. There are at least several questions about using smart pointers in modern C++11:

  - Why is auto_ptr deprecated?
  - Why does unique_ptr finally work well?
  - How to use arrays with unique_ptr?
  - Why create shared_ptr with make_shared?
  - How to use arrays with shared_ptr?
  - How to pass smart pointers to functions?

While learning how to use the new C++ standard I came across several issues with smart pointers. In general you can mess up a lot less using those helper objects, and thus you should use them in your code instead of raw pointers. Unfortunately there are some topics you have to understand to take full advantage of them. As in most cases, when you get a new tool to solve your problems, the tool introduces other problems as well.

Some predefines

Let us take a simple Test class with one member field to present further concepts:

    class Test
    {
    public:
        Test() : m_value(0) { std::cout << "Test::Test" << std::endl; }
        ~Test() { std::cout << "Test::~Test" << std::endl; }

        int m_value;
    };

Why create shared_ptr with make_shared?

    std::shared_ptr<Test> sp1(new Test());
    auto sp2 = std::make_shared<Test>();

[Figure: Locals view in VS 2012, comparing sp1 and sp2]

Above you can see a picture with the Locals view in VS 2012. Compare the addresses of the object data and the reference counter block. For sp2 we can see that they are very close to each other. To be sure I got the proper results I even asked a question on Stack Overflow: http://stackoverflow.com/questions/14665935/make-shared-evidence-vs-default-construct

How to use arrays with shared_ptr?

Arrays with shared_ptr are a bit trickier than with unique_ptr, but we can use our own deleter and have full control over them as well:

    std::shared_ptr<Test> sp(new Test[2],
        [](Test *p) { delete [] p; });

We need to use a custom deleter (here as a lambda expression). Additionally, we cannot use the make_shared construction. Unfortunately, using shared pointers for arrays is not so nice. I suggest taking boost instead. For instance: http://www.boost.org/doc/libs/1_52_0/libs/smart_ptr/shared_array.htm

How to pass smart pointers to functions?

We should use smart pointers as first-class objects in C++, so in general we should pass them by value to functions. That way the reference counter will increase/decrease correctly. But we can also use some other constructions, which can seem a bit misleading. Here is some code:

    void testSharedFunc(std::shared_ptr<Test> sp)
    {
        sp->m_value = 10;
    }

    void testSharedFuncRef(const std::shared_ptr<Test> &sp)
    {
        sp->m_value = 10;
    }

    void SharedPtrParamTest()
    {
        std::shared_ptr<Test> sp = std::make_shared<Test>();

        testSharedFunc(sp);
        testSharedFuncRef(sp);
    }

The above code works as expected, but in testSharedFuncRef we get no benefit from using shared pointers at all! Only testSharedFunc will increase the reference counter. For some performance-critical code we additionally need to note that passing by value copies the whole shared_ptr and touches the reference counter, so maybe it is even better to use a raw pointer there.

But perhaps the second option (with a reference) is better? It depends. The main question is whether you want to have full ownership of the object. If not (for instance, you have some generic function that calls methods of the object) then we do not need ownership... simply passing by reference is a good and fast method.

It is not only me who got confused. Even Herb Sutter paid some attention to this problem, and here is his post on the matter: http://herbsutter.com/2012/06/05/gotw-105-smart-pointers-part-3-difficulty-710/
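The code above covers shared_ptr parameters only; for unique_ptr the usual pattern (a sketch of my own, in the spirit of the first StackOverflow link below, with a hypothetical 'Widget' type) is to take it by value and move, which expresses an ownership transfer:

    // Hedged sketch: a 'sink' function that takes over ownership.
    #include <iostream>
    #include <memory>
    #include <utility>

    struct Widget { int m_value = 0; }; // stand-in type for the example

    void consume(std::unique_ptr<Widget> p)
    {
        p->m_value = 10;
    } // the Widget is destroyed here - the callee owned it

    int main()
    {
        std::unique_ptr<Widget> up(new Widget());
        consume(std::move(up));  // must move: unique_ptr is not copyable
        std::cout << (up == nullptr) << "\n"; // prints 1 - 'up' is empty now
    }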
Some additional comments

Smart pointers are very useful, but we, as users, also need to be smart :) I am not as experienced with smart pointers as I would like to be. For instance, sometimes I am tempted to use raw pointers: I know what will happen, and at that time I can guarantee it will not mess with the memory. Unfortunately, this can become a potential problem in the future. When the code changes, my assumptions may no longer be valid and new bugs may occur. With smart pointers it is not so easy to break things.

This whole topic is a bit complicated but, as usual in C++, we get something at a price. We need to know what we are doing to fully utilize the particular feature.

Code for the article: https://github.com/fenbf/review/blob/master/smart_ptr.cpp

Links

  - "The C++ Standard Library", Second Edition - the main reference for this post
  - http://stackoverflow.com/questions/8114276/how-do-i-pass-a-unique-ptr-argument-to-a-constructor-or-a-function
  - http://stackoverflow.com/questions/14027079/stop-heap-allocation-via-make-shared
  - http://stackoverflow.com/questions/9302296/is-make-shared-really-more-efficient-than-new
  - http://ootips.org/yonat/4dev/smart-pointers.html

This article is also hosted on www.codeproject.com. Reprinted with permission from Bartłomiej Filipek's blog.