Fen

Members
  • Content count

    72
  • Joined

  • Last visited

Community Reputation

882 Good

About Fen

  • Rank
    Member

Personal Information

  1. Originally posted at Bartek's Code and Graphics blog.

     Let's say we have the following code:

        LegacyList* pMyList = new LegacyList();
        ...
        pMyList->ReleaseElements();
        delete pMyList;

     In order to fully delete the object we need to perform an additional action. How can we make this more C++11? How can we use unique_ptr or shared_ptr here?

     Intro

     We all know that smart pointers are really nice things and that we should be using them instead of raw new and delete. But what if deleting a pointer is not the only thing we need to do before the object is fully destroyed? In our short example we have to call ReleaseElements() to completely clear the list.

     Side note: we could simply redesign LegacyList so that it properly clears its data inside its destructor. But for this exercise we need to assume that LegacyList cannot be changed (it's legacy, hard-to-fix code, or it might come from a third-party library). ReleaseElements is just my invention for this article. Other things might be involved instead: logging, closing a file, terminating a connection, returning an object to a C-style library... or, in general, any resource-releasing procedure: RAII.

     To give more context to my example, let's discuss the following use of LegacyList:

        class WordCache {
        public:
            WordCache() { m_pList = nullptr; }
            ~WordCache() { ClearCache(); }

            void UpdateCache(LegacyList *pInputList) {
                ClearCache();
                m_pList = pInputList;
                if (m_pList) {
                    // do something with the list...
                }
            }

        private:
            void ClearCache() {
                if (m_pList) {
                    m_pList->ReleaseElements();
                    delete m_pList;
                    m_pList = nullptr;
                }
            }

            LegacyList *m_pList; // owned by the object
        };

     You can play with the source code using the Coliru online compiler.

     This is a bit of an old-style C++ class. The class owns the m_pList pointer, so it has to delete it in the destructor. To make life easier there is a ClearCache() method that is called from the destructor or from UpdateCache(). The main method, UpdateCache(), takes a pointer to a list and takes ownership of that pointer. The pointer is deleted in the destructor or when we update the cache again. Simplified usage:

        WordCache myTestClass;
        LegacyList* pList = new LegacyList();
        // fill the list...
        myTestClass.UpdateCache(pList);

        LegacyList* pList2 = new LegacyList();
        // fill the list again
        // pList should be deleted, pList2 is now owned
        myTestClass.UpdateCache(pList2);

     With the above code there shouldn't be any memory leaks, but we have to pay careful attention to what's going on with the pList pointer. This is definitely not modern C++! Let's update the code so that it's modernized and properly uses RAII (smart pointers, in this case).

     Using unique_ptr or shared_ptr seems easy, but here we have a slight complication: how do we execute the additional code that is required to fully delete a LegacyList? What we need is a custom deleter.

     Custom Deleter for shared_ptr

     I'll start with shared_ptr because this type of pointer is more flexible and easier to use. What should you do to pass a custom deleter? Just pass it when you create the pointer:

        std::shared_ptr<int> pIntPtr(new int(10),
                                     [](int *pi) { delete pi; }); // deleter

     The above code is quite trivial and mostly redundant. In fact, it's more or less the default deleter, because it just calls delete on the pointer. But basically, you can pass any callable thing (a lambda, a functor, a function pointer) as the deleter while constructing a shared pointer.
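     For instance (my own illustration, not from the original article), the same pattern wraps a C-style FILE* in one line. A minimal sketch, assuming the usual <cstdio> API:

        #include <cstdio>
        #include <memory>

        int main() {
            // fclose as the custom deleter; the lambda guards against fopen
            // returning nullptr (shared_ptr still invokes the deleter even
            // for a null stored pointer).
            std::shared_ptr<FILE> file(std::fopen("data.txt", "r"),
                                       [](FILE* f) { if (f) std::fclose(f); });
            // use file.get() with the normal C API...
        }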
     In the case of LegacyList let's create a function:

        void DeleteLegacyList(LegacyList* p) {
            p->ReleaseElements();
            delete p;
        }

     The modernized class is super simple now:

        class ModernSharedWordCache {
        public:
            void UpdateCache(std::shared_ptr<LegacyList> pInputList) {
                m_pList = pInputList;
                // do something with the list...
            }
        private:
            std::shared_ptr<LegacyList> m_pList;
        };

     No need for a constructor - the pointer is initialized to nullptr by default. No need for a destructor - the pointer is cleared automatically. No need for a helper ClearCache - just reset the pointer and all the memory and resources are properly released.

     When creating the pointer we need to pass that function:

        ModernSharedWordCache mySharedClass;
        std::shared_ptr<LegacyList> ptr(new LegacyList(), DeleteLegacyList);
        mySharedClass.UpdateCache(ptr);

     As you can see, there is no need to take care of the pointer: just create it (remembering to pass a proper deleter) and that's all.

     [subheading]Where is the custom deleter stored?[/subheading]

     When you use a custom deleter it won't affect the size of your shared_ptr type. If you remember, that should be roughly 2 x sizeof(ptr) (8 or 16 bytes)... so where does this deleter hide? shared_ptr consists of two things: a pointer to the object and a pointer to the control block (which contains, for example, the reference counter). The control block is created only once per given pointer, so two shared pointers (to the same pointer) will point to the same control block. Inside the control block there is space for the custom deleter and the allocator.

     [subheading]Can I use make_shared?[/subheading]

     Unfortunately, you can pass a custom deleter only in the constructor of shared_ptr - there is no way to use make_shared. This might be a bit of a disadvantage because, as I described in "Why create shared_ptr with make_shared?" on my old blog, make_shared allocates the object and its control block next to each other in memory. Without make_shared you get two, probably separate, blocks of allocated memory.

     Update: I got a very good comment on reddit from quicknir saying that I am wrong on this point and that there is something you can use instead of make_shared. Indeed, you can use allocate_shared and get both a custom deleter and the shared memory block. However, that requires you to write a custom allocator, so I considered it too advanced for the original article.

     Custom Deleter for unique_ptr

     With unique_ptr there is a bit more complication. The main thing is that the deleter type is part of the unique_ptr type. By default we get std::default_delete:

        template <class T, class Deleter = std::default_delete<T>>
        class unique_ptr;

     The deleter is part of the pointer, so a heavy deleter (in terms of memory consumption) means a larger pointer type.

     [subheading]What to choose as the deleter?[/subheading]

     What is best to use as a deleter? Let's consider the following options:

       • std::function
       • function pointer
       • stateless functor
       • stateful functor
       • lambda

     What is the smallest size of unique_ptr with each of the above deleter types? Can you guess? (Answer at the end of the article.)

     [subheading]How to use it?[/subheading]

     For our example problem let's use a functor:

        struct LegacyListDeleterFunctor {
            void operator()(LegacyList* p) {
                p->ReleaseElements();
                delete p;
            }
        };

     And here is the usage in the updated class:

        class ModernWordCache {
        public:
            using unique_legacylist_ptr =
                std::unique_ptr<LegacyList, LegacyListDeleterFunctor>;
        public:
            void UpdateCache(unique_legacylist_ptr pInputList) {
                m_pList = std::move(pInputList);
                // do something with the list...
            }
        private:
            unique_legacylist_ptr m_pList;
        };
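     Because the deleter is baked into the type, construction is a natural thing to centralize. A small factory helper - my own sketch, not part of the original article - keeps the raw new paired with its deleter in one place:

        // Hypothetical helper: callers never pair new with the deleter themselves.
        ModernWordCache::unique_legacylist_ptr MakeLegacyList() {
            return ModernWordCache::unique_legacylist_ptr(new LegacyList());
        }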
     The code is a bit more complex than the version with `shared_ptr` - we need to define a proper pointer type. Below I show how to use the new class:

        ModernWordCache myModernClass;
        ModernWordCache::unique_legacylist_ptr pUniqueList(new LegacyList());
        myModernClass.UpdateCache(std::move(pUniqueList));

     All we have to remember, since it's a unique pointer, is to move the pointer rather than copy it.

     [subheading]Can I use make_unique?[/subheading]

     Similarly to shared_ptr, you can pass a custom deleter only in the constructor of unique_ptr, and thus you cannot use make_unique. Fortunately, make_unique is only for convenience (wrong!) and doesn't give any performance/memory benefits over normal construction.

     Update: I was too confident about make_unique :) There is always a purpose for such functions. Look at GotW #89 Solution: Smart Pointers, guru question 3. make_unique is important because: first, "Guideline: Use make_unique to create an object that isn't shared (at least not yet), unless you need a custom deleter or are adopting a raw pointer from elsewhere"; second, make_unique gives exception safety (see "Exception safety and make_unique"). So, by using a custom deleter we lose a bit of safety. It's worth knowing the risk behind that choice. Still, a custom deleter with unique_ptr is far better than playing with raw pointers.

     Things to remember: custom deleters give you a lot of flexibility, which improves resource management in your apps.

     Summary

     In this post I've shown you how to use custom deleters with the C++ smart pointers shared_ptr and unique_ptr. These deleters can be used in all the places where a 'normal' delete ptr is not enough: when you wrap FILE*, or some kind of C-style resource (SDL_FreeSurface, free(), destroy_bitmap from the Allegro library, etc.). Remember that proper garbage collection is not only about memory destruction; often other actions need to be invoked as well. With custom deleters you have that option.

     A gist with the code is located here: fenbf/smart_ptr_deleters.cpp

     Let me know what your common problems with smart pointers are. What blocks you from using them?

     References

       • Items 18, 19, 21 from Effective Modern C++ by Scott Meyers
       • The C++ Standard Library, 2nd Edition, by Nicolai M. Josuttis (my review)
       • Smart pointer gotchas
       • More C++ Idioms/Resource Acquisition Is Initialization
       • StackOverflow: C++ std::unique_ptr - why isn't there any size increase with lambdas?
       • StackOverflow: How to pass deleter to make_shared?

     Answer to the question about pointer size:

       1. std::function - heavy stuff; on 64-bit gcc it showed me 40 bytes.
       2. Function pointer - it's just a pointer, so unique_ptr now contains two pointers, one for the object and one for the function: 2 * sizeof(ptr) = 8 or 16 bytes.
       3. Stateless functor (and also a stateless lambda) - this is actually a very tricky case. You would probably say "two pointers"... but no: thanks to the empty base optimization (EBO) the final size is just the size of one pointer, the smallest possible.
       4. Stateful functor - if there is some state inside the functor then we cannot do any optimization, so the size will be sizeof(ptr) + sizeof(functor).
       5. Lambda (stateful) - similar to a stateful functor.
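     If you want to verify those numbers on your own compiler, a minimal sketch like this (mine, not from the article) prints the sizes directly:

        #include <cstdio>
        #include <functional>
        #include <memory>

        struct Stateless { void operator()(int* p) const { delete p; } };
        struct Stateful  { long state = 0; void operator()(int* p) const { delete p; } };

        int main() {
            std::printf("default:       %zu\n", sizeof(std::unique_ptr<int>));
            std::printf("std::function: %zu\n",
                sizeof(std::unique_ptr<int, std::function<void(int*)>>));
            std::printf("function ptr:  %zu\n",
                sizeof(std::unique_ptr<int, void(*)(int*)>));
            std::printf("stateless:     %zu\n", sizeof(std::unique_ptr<int, Stateless>));
            std::printf("stateful:      %zu\n", sizeof(std::unique_ptr<int, Stateful>));
        }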
  2. It depends on the size of the part you want to update. If it's small, it may be better to use another approach: use a small buffer only for the changing part... and then copy that part (after the update) into the full texture - but do it on the GPU (which offers better bandwidth); see the sketch below. On the other hand, if the region to update is relatively large (maybe 60%... or more), a persistent mapped buffer would be better.
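     A rough sketch of that GPU-side copy, assuming a GL 4.x context where `pbo` is a small pixel-unpack buffer holding the freshly written region (all names and offsets here are placeholders of mine):

        // Copy the staged sub-region into the full texture without a CPU round trip.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBindTexture(GL_TEXTURE_2D, fullTexture);
        glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE, nullptr); // source: bound PBO
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);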
  3. It seems that it's not easy to move data from CPU to GPU efficiently, especially if we want to do it often - every frame, for example. Fortunately, OpenGL (since version 4.4) gives us a new technique to fight this problem. It's called persistent mapped buffers and it comes from the ARB_buffer_storage extension. Let us revisit this extension: can it boost your rendering code?

     Note 1: Originally posted at the Code And Graphics blog.
     Note 2: This post is an introduction to the persistent mapped buffers topic; see the second part with benchmark results @myblog.

     [alink=1]Intro[/alink] [alink=2]Moving Data[/alink] [alink=4]Synchronization[/alink] [alink=5]Double (Multiple) Buffering/Orphaning[/alink] [alink=6]Persistent Mapping[/alink] [alink=7]Demo[/alink] [alink=8]Summary[/alink] [alink=9]References[/alink]

     [aname=1]Intro

     The first thing I'd like to mention is that there is already a decent number of articles describing persistent mapped buffers. I've learned a lot especially from "Persistent mapped buffers" @ferransole.wordpress.com and "Maximizing VBO upload performance!" @Java-Gaming.org. This post serves as a summary and a recap of modern techniques used to handle buffer updates. I've used these techniques in my particle system - please wait a bit for the upcoming post about renderer optimizations.

     OK... but let's talk about the main hero of this story: the persistent mapped buffer technique. It appeared in ARB_buffer_storage and became core in OpenGL 4.4. It allows you to map a buffer once and keep the pointer forever. No need to unmap it and release the pointer to the driver... all the magic happens underneath.

     Persistent mapping is also included in the modern set of OpenGL techniques called "AZDO" - Approaching Zero Driver Overhead. As you can imagine, by mapping the buffer only once we significantly reduce the number of heavy OpenGL function calls and, more importantly, fight synchronization problems.

     One note: this approach can simplify the rendering code and make it more robust; still, try to stay as much as possible on the GPU side only. Any CPU-to-GPU data transfer will be much slower than GPU-to-GPU communication.

     [aname=2]Moving Data

     Let's now go through the process of updating the data in a buffer. We can do it in at least two different ways: glBuffer*Data and glMapBuffer*. To be precise: we want to move some data from app memory (CPU) to the GPU so that it can be used in rendering. I'm especially interested in the case where we do it every frame, as in a particle system: you compute new positions on the CPU, but then you want to render them, so a CPU-to-GPU memory transfer is needed. An even more complicated example is updating video frames: you load data from a media file, decode it, and then modify the texture data that is then displayed. Such a process is often referred to as streaming. In other terms: the CPU is writing data, the GPU is reading it.

     Although I say 'moving', the GPU can actually read directly from system memory (using GART). So there is no need to copy data from a buffer on the CPU side into a buffer on the GPU side; in that approach we should rather think of 'making the data visible' to the GPU.

     glBufferData/glBufferSubData

     These two procedures (available since OpenGL 1.5!) copy your input data into pinned memory. Once that's done, an asynchronous DMA transfer can be started and the invoked procedure returns. After that call you can even delete your input memory chunk (a minimal sketch follows).
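     As an illustration (mine; `vbo`, `data`, and `dataSize` are placeholders), a per-frame streaming update with glBufferSubData looks roughly like this:

        // The driver copies `data` into pinned memory, so the source
        // array can be reused or freed immediately after the call.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, data);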
     The 'theoretical' flow for this method: data is passed to a glBuffer*Data function, and then internally OpenGL performs a DMA transfer to the GPU.

     Note: glBufferData invalidates and reallocates the whole buffer. Use glBufferSubData to update only the data inside.

     glMap*/glUnmap*

     With the mapping approach you simply get a pointer to pinned memory (this might depend on the actual implementation!). You can copy your input data and then call glUnmap to tell the driver that you are finished with the update. So it looks like the approach with glBufferSubData, but you manage the copying yourself, and you get a bit more control over the entire process. The 'theoretical' flow here: you obtain a pointer to (probably) pinned memory, then you can copy your original data (or compute it), and at the end you have to release the pointer via glUnmapBuffer.

     All the above methods look quite easy: you just pay for the memory transfer. And it could be that way, if only there were no such thing as synchronization...

     [aname=4]Synchronization

     Unfortunately, life is not that easy: you need to remember that the GPU and CPU (and even the driver) run asynchronously. When you submit a draw call it will not be executed immediately; it is recorded in the command queue and will probably be executed much later by the GPU. When we update the buffer data we might easily get a stall: the GPU will wait while we modify the data. We need to be smarter about it.

     For instance, when you call glMapBuffer the driver may create a mutex so that the buffer (which is a shared resource) is not modified by the CPU and GPU at the same time. If that happens often, we lose a lot of GPU power. The GPU can be blocked even in a situation where your buffer is only recorded to be rendered and not currently read.

     In a real-life scenario the waiting gaps between CPU and GPU can have different sizes, and there may be multiple sync points in a frame. The less waiting, the more performance we can get. So reducing synchronization problems is another incentive to have everything happening on the GPU.

     [aname=5]Double (Multiple) Buffering/Orphaning

     A recommended idea is to use double or even triple buffering to solve the synchronization problem:

       • create two buffers
       • update the first one
       • in the next frame update the second one
       • swap the buffer IDs...

     That way the GPU can draw from (read) one buffer while you update the next one. How can you do that in OpenGL?

       • Explicitly use several buffers and a round-robin algorithm to update them.
       • Use glBufferData with a NULL pointer before each update: the whole buffer is recreated, so we can store our data in a completely new place while the old buffer is still used by the GPU - no synchronization is needed. The GPU will probably figure out that the subsequent buffer allocations are similar, so it will reuse the same memory chunks. (I remember that this approach was not suggested in older versions of OpenGL.)
       • Use glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT; additionally use the UNSYNCHRONIZED bit and perform the synchronization on your own. (There is also a procedure called glInvalidateBufferData that does the same job.)

     Both orphaning variants are sketched below.
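     A hedged sketch of the two orphaning variants (mine; `vbo`, `data`, and `bufferSize` are placeholders, error handling omitted):

        // Variant 1: re-specify the data store. The driver hands out fresh
        // memory for the same buffer ID while the GPU keeps reading the old one.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, bufferSize, nullptr, GL_STREAM_DRAW); // orphan
        glBufferSubData(GL_ARRAY_BUFFER, 0, bufferSize, data);              // refill

        // Variant 2: map with the invalidate bit and write in place.
        void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize,
                        GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
        memcpy(dst, data, bufferSize);
        glUnmapBuffer(GL_ARRAY_BUFFER);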
     Triple buffering

     The GPU and CPU run asynchronously... but there is another factor: the driver. It may happen (and on desktop driver implementations it happens quite often) that the driver also runs asynchronously. To solve this even more complicated synchronization scenario, you might consider triple buffering:

       • one buffer for the CPU
       • one for the driver
       • one for the GPU

     This way there should be no stalls in the pipeline, but you need to sacrifice a bit more memory for your data.

     More reading on the @hacksoflife blog: "Double-Buffering VBOs", "Double-Buffering Part 2 - Why AGP Might Be Your Friend", "One More On VBOs - glBufferSubData".

     [aname=6]Persistent Mapping

     OK, we've covered common techniques for data streaming, but now let's talk about the persistent mapped buffer technique in more detail.

     Assumptions: GL_ARB_buffer_storage must be available, or OpenGL 4.4.

     Creation:

        glGenBuffers(1, &vboID);
        glBindBuffer(GL_ARRAY_BUFFER, vboID);
        flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
        glBufferStorage(GL_ARRAY_BUFFER, MY_BUFFER_SIZE, 0, flags);

     Mapping (only once, after creation...):

        flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
        myPointer = glMapBufferRange(GL_ARRAY_BUFFER, 0, MY_BUFFER_SIZE, flags);

     Update:

        // wait for the buffer
        // just take your pointer (myPointer) and modify the underlying data...
        // lock the buffer

     As the name suggests, it allows you to map the buffer once and keep the pointer forever. At the same time you are left with the synchronization problem - that's why there are comments about waiting for and locking the buffer in the code above.

     The flow: first we need to get a pointer to the buffer memory (but we do it only once), then we can update the data (without any special calls to OpenGL). The only additional action we need to perform is synchronization - making sure the GPU does not read while we write. All the needed DMA transfers are invoked by the driver.

     The GL_MAP_COHERENT_BIT flag makes your changes in memory automatically visible to the GPU. Without this flag you would have to set a memory barrier manually. Although it looks like GL_MAP_COHERENT_BIT should be slower than explicit, custom memory barriers and syncing, my first tests did not show any meaningful difference. I need to spend more time on that... maybe you have some thoughts on it? BTW: even the original AZDO presentation mentions using GL_MAP_COHERENT_BIT, so this shouldn't be a serious problem :)

     Syncing:

        // waiting for the buffer
        GLenum waitReturn = GL_UNSIGNALED;
        while (waitReturn != GL_ALREADY_SIGNALED &&
               waitReturn != GL_CONDITION_SATISFIED)
        {
            waitReturn = glClientWaitSync(syncObj, GL_SYNC_FLUSH_COMMANDS_BIT, 1);
        }

        // lock the buffer:
        glDeleteSync(syncObj);
        syncObj = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

     When we write to the buffer we place a sync object. Then, in the following frame, we need to wait until this sync object is signaled. In other words, we wait until the GPU has processed all the commands issued before that sync was set.

     Triple buffering

     But we can do better: by using triple buffering we can be sure that the GPU and CPU never touch the same data in the buffer:

       • allocate one buffer with 3x the original size
       • map it forever
       • bufferID = 0
       • update/draw:
         - update the bufferID range of the buffer
         - draw only that range
         - bufferID = (bufferID + 1) % 3

     That way, in the next frame you update another part of the buffer, so there is no conflict. Another way would be to create three separate buffers and update them in a similar way. A consolidated sketch of this scheme follows.
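     Putting the pieces together, here is a hedged per-frame sketch of the triple-buffered scheme (my consolidation of the snippets above, not code from the demo; names like gSyncs/gPtr are placeholders and error handling is omitted):

        const int kBuffers = 3;
        GLsync gSyncs[kBuffers] = {};  // one fence per buffer section
        float* gPtr = nullptr;         // from glMapBufferRange, mapped once
        int    gIndex = 0;

        void UpdateAndDraw(size_t rangeSize)
        {
            // 1. Wait until the GPU has finished reading this section.
            if (gSyncs[gIndex]) {
                GLenum r = GL_UNSIGNALED;
                while (r != GL_ALREADY_SIGNALED && r != GL_CONDITION_SATISFIED)
                    r = glClientWaitSync(gSyncs[gIndex],
                                         GL_SYNC_FLUSH_COMMANDS_BIT, 1);
                glDeleteSync(gSyncs[gIndex]);
            }

            // 2. Write this frame's data through the persistent pointer.
            float* dst = gPtr + gIndex * (rangeSize / sizeof(float));
            // ... fill dst ...

            // 3. Draw from that section, then fence it for the next round.
            // glDrawArrays(...);  // using the gIndex section
            gSyncs[gIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
            gIndex = (gIndex + 1) % kBuffers;
        }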
     [aname=7]Demo

     I've forked Ferran Sole's demo application and extended it a bit. Here is the GitHub repo: fenbf/GLSamples. It features:

       • a configurable number of triangles
       • a configurable number of buffers: single/double/triple
       • optional syncing
       • an optional debug flag
       • a benchmark mode
       • output: the number of frames, and a counter that is incremented each time we wait for the buffer

     Full results will be published in a post at my blog.

     [aname=8]Summary

     This was a long article, but I hope I explained everything in a decent way. We went through the standard approach to buffer updates (buffer streaming) and saw our main problem: synchronization. Then I described the use of persistent mapped buffers.

     Should you use persistent mapped buffers? Here is a short summary.

     Pros:

       • Easy to use
       • The obtained pointer can be passed around in the app
       • In most cases gives a performance boost for very frequent buffer updates (when the data comes from the CPU side): it reduces driver overhead and minimizes GPU stalls
       • Advised for AZDO techniques

     Cons:

       • Do not use it for static buffers or buffers that do not require updates from the CPU side
       • Best performance comes with triple buffering (which might be a problem when you have large buffers, because you need a lot of memory to allocate)
       • Requires explicit synchronization
       • Core only in OpenGL 4.4, so only recent GPUs support it

     In the next post I'll share my results from the demo application, comparing the glMapBuffer approach with glBuffer*Data and persistent mapping.

     Interesting questions:

       • Is this extension better or worse than AMD_pinned_memory?
       • What if you forget to sync, or do it the wrong way? I did not get any app crashes and hardly saw any artifacts, but what is the expected result of such a situation?
       • What if you forget to use GL_MAP_COHERENT_BIT? Is there that much of a performance difference?

     [aname=9]References

       • [PDF] OpenGL Insights, Chapter 28 - Asynchronous Buffer Transfers, by Ladislav Hrabcak and Arnaud Masserann; a free chapter from OpenGL Insights (http://openglinsights.com/)
       • Persistent mapped buffers @ferransole.wordpress.com
       • Maximizing VBO upload performance! @Java-Gaming.org forum
       • Buffer Object @OpenGL Wiki
       • Buffer Object Streaming @OpenGL Wiki
       • persistent buffer mapping - what kind of magic is this? @OpenGL Forum

     Article Update Log
     4th Feb 2015: Initial release
  4. @all - thanks for your positive feedback!   @NightCreature83: Actually, I will not spend much time writing about profiling tools. The next article will be about some compiler tweaks, and then about how to use SIMD - so it's fairly basic stuff. Maybe next time, with some other example.   VTune is awesome but, unfortunately, it costs a decent amount of money.
  5. As a software/game developer, you usually want more and more... of everything, actually! More pixels, more triangles, more FPS, more objects on the screen, bots, monsters. Unfortunately, you don't have endless resources, so you end up making compromises. The optimization process can help reduce performance bottlenecks and may free up power hidden in the code.

     Optimization shouldn't be based on random guesses: "oh, I think that if I rewrite this code to SIMD, the game will run a bit faster." How do you know that "this code" causes real performance problems? Is investing there a good option? Will it pay off? It would be nice to have a clear guide, a direction.

     In order to get a better understanding of what to improve, you need to establish a baseline for the system/game. In other words, you need to measure the current state of the system and find the hot spots and bottlenecks. Then think about the factors you would like to improve... and then... start optimizing the code! Such a process might not be perfect, but at least you will minimize potential errors and maximize the outcome. Of course, the process will not be finished in only one iteration: every time you make a change, the process starts from the beginning. Do one small step at a time. Iteratively. At the end your game/app should still work (without new bugs, hopefully) and it should run X times faster. The factor X can even be measured accurately, if you do the optimization right.

     The Software Optimization Process

     According to this and this book, the process should look like this:

       1. Benchmark
       2. Find hot spots and bottlenecks
       3. Improve
       4. Test
       5. Go back

     The whole process should not start after the whole implementation (when usually there is no time left to do it), but should be executed throughout the project's lifetime. In the case of our particle system I tried to think about possible improvements up front.

     1. The benchmark

     Having a good benchmark is crucial. If you get it wrong, the whole optimization process can even be a waste of time. From The Software Optimization Cookbook: "The benchmark is the program or process used to: objectively evaluate the performance of an application; provide repeatable application behavior for use with performance analysis tools."

     The core, required attributes:

       • Repeatable - gives the same results every time you run it.
       • Representative - exercises a large portion of the main application's use cases. It would be pointless to focus on only a small part of it. For a game, such a benchmark could include the most common scene, or a scene with the maximum number of triangles/objects (that way simpler scenes will also run faster).
       • Easy to run - you don't want to spend hours setting up and running the benchmark. A benchmark is definitely harder to make than a unit test, but it is nice if it runs as fast as possible. Another point is that it should produce easy-to-read output: for instance an FPS report, a timing report, simple logs... not hundreds of lines of messages from internal subsystems. (A tiny FPS-counter sketch follows this list.)
       • Verifiable - make sure the benchmark produces valid and meaningful results.
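     As an illustration of such easy-to-read output (my own sketch; the per-frame hook is hypothetical):

        #include <chrono>
        #include <cstdio>

        // Call once per frame; prints an FPS line roughly every second.
        void OnFrame()
        {
            using clock = std::chrono::steady_clock;
            static auto last = clock::now();
            static int frames = 0;
            ++frames;
            auto now = clock::now();
            if (now - last >= std::chrono::seconds(1)) {
                std::printf("FPS: %d\n", frames);
                frames = 0;
                last = now;
            }
        }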
     2. Find hot spots and bottlenecks

     When you run your benchmark you will get some output. You can also run profiling tools and get more detailed results about how the application is performing. But having the data is one thing; it is more important to understand it, analyze it, and draw good conclusions. You need to find the problem that blocks the application from running at full speed.

     Just to summarize:

       • Bottleneck - a place in the system that makes the whole application slower, like the weakest link of a chain. For instance, you can have a powerful GPU, but without fast memory bandwidth you will not be able to feed this GPU monster with data - it will wait.
       • Hot spot - a place in the system that does crucial, intensive work. If you optimize such a module then the whole system should work faster. For instance, if the CPU is doing too much then maybe offload some work to the GPU (if it has free compute resources available).

     This part may be the hardest. In a simple system it is easy to see the problem, but in large-scale software it can be quite tough. Sometimes it is only one small function, or the whole design, or some algorithm used. Usually it is better to use a top-down approach. For example: your framerate is too low. Measure your CPU/GPU utilization, then go to the CPU or GPU side. If CPU: think about your main subsystems - is it the animation module, AI, physics? Or maybe your driver cannot process that many draw calls? If GPU: are you vertex- or fragment-bound?... Go down into the details.

     3. Improve

     Now the fun part! Improve something and the application should work better :) What you can improve:

       • At the system level - look at the utilization of your whole app. Are any resources idle (CPU or GPU waiting)? Do you use all the cores?
       • At the algorithmic level - do you use the proper data structures/algorithms? Maybe instead of an O(n) solution you can reduce it to O(log n)?
       • At the micro level - the 'funniest' part, but do it only when the first two levels are satisfied. If you are sure that nothing more can be designed better, you need to use some dirty code tricks to make things faster.

     One note: instead of rewriting everything in assembler, use your tools first - today's compilers are powerful optimizers as well. Another issue here is portability: one trick might not work on another platform.

     4. Test

     After you make a change, test how the system behaves. Did you get a 50% speed increase? Or maybe it is even slower? Besides performance testing, please make sure you are not breaking anything! I know that making a system 10% faster is nice, but your boss will not be happy if, thanks to this improvement, you introduce several hard-to-find bugs!

     5. Go back

     After you are sure everything works even better than before... just run your benchmark and repeat the process. It is better to make small, simple changes rather than big, complex ones. With smaller moves it is harder to make a mistake. Additionally, it is easy to revert the changes.

     Profiling Tools

     Main methods (a minimal timer sketch follows this list):

       • Custom timers/counters - you can create a separate configuration (based on Release mode) and enable a set of counters or timers, for instance in every function of a critical subsystem. You can generate a call hierarchy and analyze it further.
       • Instrumentation - the tool adds special fragments of code to your executable so that it can measure the execution process.
       • Interception - the tool intercepts API calls (for instance OpenGL - glIntercept, or DirectX) and later analyzes the recorded log.
       • Sampling - the tool stops the application at specific intervals and analyzes the function stack. This method is usually much lighter than instrumentation.
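     A minimal scoped-timer sketch for the first method (my own code, not from the article; drop an instance into a function you suspect is hot and it logs the elapsed time on scope exit):

        #include <chrono>
        #include <cstdio>

        struct ScopedTimer {
            const char* name;
            std::chrono::steady_clock::time_point start;
            explicit ScopedTimer(const char* n)
                : name(n), start(std::chrono::steady_clock::now()) {}
            ~ScopedTimer() {
                auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                    std::chrono::steady_clock::now() - start).count();
                std::printf("%s: %lld us\n", name, static_cast<long long>(us));
            }
        };

        void UpdateParticles() // hypothetical hot function
        {
            ScopedTimer t("UpdateParticles");
            // ... work ...
        }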
     Below is a list of professional tools that can help:

       • Intel(R) VTune(TM) Amplifier
       • Visual Studio Profiler
       • AMD CodeXL - FREE. AMD created a good, easy-to-use profiling tool for both CPU and GPU. It does its best job when you also have an AMD CPU (which I don't ;/), but for Intel CPUs it will at least give you timing reports.
       • Valgrind - runs your app on a virtual machine and can detect various problems, from memory leaks to performance issues.
       • GProf - Unix; uses a hybrid of sampling and instrumentation.
       • Lots of others... see the list on Wikipedia.

     Something more

     Automate. I probably do not need to write this... but the more you automate, the easier your job will be. This rule applies, nowadays, to almost everything: testing, application setup, running the application, etc.

     Have fun! The above process sounds very 'professional' and 'boring'. There is another factor that plays an important role when optimizing code: just have fun! You want to make mistakes, you want to guess what to optimize, and you want to learn new things. In the end, you will still gain new experience (even if you optimized the wrong method). You might not have enough time for this at your day job, but what about a hobby project? The more experience with the optimization process you have, the faster your code can run.

     References

       • The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition, Intel Press (December 2005) - contains lots of useful information, written in a light way. I won it at GDC Europe 2011 :)
       • Video Game Optimization, by Eric Preisz - another good book, quite unique in this area. I would like to see a second edition: improved, updated, and maybe extended.
       • C++ For Game Programmers (Game Development Series)
       • Game Coding Complete, Fourth Edition
       • Agner's optimization manuals
       • Understanding Profiling Methods @MSDN
       • Sampling vs Instrumentation - oktech-profiler docs
       • Code and Graphics: particle system implementation series

     Article Update Log
     17th August 2014: Initial version, based on a post from the Code and Graphics blog