
  1. JoshuaWaring

    Cache In A Multi-Core Environment

    @frob I think you've just picked the topic of the next article.
  2. JoshuaWaring

    Cache In A Multi-Core Environment

    I've removed the paragraph about volatiles. I was hesitant to put them in at first because they're not exactly cache-related, but I felt the article needed something more than just False Sharing. For those interested, I ran a benchmark and found there's no real reason to use a volatile as a flag. The test would read/write 1 million unsigned integers and take the high_resolution_clock time before and after.

    Results:
    Atomic reads: 371776
    Volatile reads: 349741
    Atomic writes: 7860217
    Volatile writes: 345923
  3. Cache In A Multi-Core Environment

    In my previous article I discussed what cache is and some practices that can provide increased performance. I also stated that cache in a multicore environment is a whole other topic, so I've written this article to cover the different considerations that come along with multicore programming.

    Why does it matter if we're using two cores?

    Cache comes in levels, typically three, each with its own group of cores that can access it. L1 cache is visible to only a single core, with each core having its own private copy, and it is the fastest of all the caches. L2 cache is usually shared by a small group of cores; for instance, the AMD 8150 shares L2 cache between pairs of cores. Finally there's L3 cache, which is accessible to all cores and is the slowest of the caches, but still much faster than RAM.

    Now that we know there are separate banks of cache for each core, what happens when two cores access the same memory? If there were no system in place, both cores would cache the memory; if one core then wrote to it, the write would be visible in memory, but the other core would still hold its cached copy of the old value. To solve this, when a core writes to cached memory, any other core holding that cache line has its copy invalidated or updated, and this is where our problem comes into play. Suppose two integers sit next to each other in an array, on a single cache line, and each core writes to one of them. They aren't the same variable, so no incorrect results occur, but because they share a cache line, every time one core writes to its integer the other core loses its cached copy. This is referred to as False Sharing, and there's a simple solution to it; the hardest part is determining whether you're having this problem.

    False Sharing

    False Sharing can hinder the performance of any program.
    For this example I'll go through the optimisations I made to a single-producer single-consumer queue and provide a few steps for solving most of your False Sharing problems. To test the queue I have two threads: one writing integers from 0 to 1 million, and another reading them and checking that they all arrive in order. The queue doesn't undergo any resizing and is allocated with enough capacity for 1 million objects.

    template <typename T>
    class alignas(64) Queue {
        T* data_;
        size_t push_position_;
        size_t pop_position_;
        std::atomic<size_t> size_;
        size_t capacity_;
    };

    The problem with this code is that all the variables are packed together with no spacing between them; the whole structure fits on at most two cache lines. This is perfect in a single-core environment, but separate cores access pop_position_ and push_position_, so there's high contention on these cache lines in a multicore environment.

    I break the problem down into a shared-read section, a shared-write section, and one section for each thread. A section may be larger than a single cache line and may require two cache lines to implement; it's for this reason I call them sections. Our first step is to determine which memory belongs to which section. data_ and capacity_ are both shared but rarely written to, so they belong on the shared-read cache line; size_ is the only shared-write variable; and push_position_ and pop_position_ each get their own cache line, as each thread uses one. In this example that leaves us with four cache lines:

    template <typename T>
    class alignas(64) Queue {
        // Producer's C-Line
        size_t push_position_;
        char pad_p[64 - sizeof(size_t)];
        // Consumer's C-Line
        size_t pop_position_;
        char pad_c[64 - sizeof(size_t)];
        // Shared Read C-Line
        T* data_;
        size_t capacity_;
        char pad_sr[64 - sizeof(size_t) - sizeof(T*)];
        // Shared Write C-Line
        std::atomic<size_t> size_;
        char pad_sw[64 - sizeof(std::atomic<size_t>)];
    };

    Notice the alignas(n) specifier, a keyword added in C++11.
    The keyword ensures that the structure is aligned to a multiple of n bytes in memory, which lets us assume our first variable is placed at the start of a cache line; this is vital for our separation. Before accounting for False Sharing, it took 60ms to push and pop 1 million integers; after accounting for it, that's reduced to 34ms on an Intel Core i5 3210M @ 2.5GHz. The majority of the time comes from the atomic access, which we use to check whether there's room to push and anything to pop. You could potentially optimise most of the atomic accesses away by remembering how many objects can be pushed or popped until your next size check; this lowers the atomic access count and dramatically improves performance again. Example Source

    While on the subject of False Sharing, another example occurs when storing data in an array that a number of threads access. Think of a pool of threads that each keep count of how much work they've done and store it in an array, because we need access to these counts while the pool is running:

    int work_done[n];

    This is an easy mistake to make, but it results in a plethora of cache misses: as each core goes to increment its work_done entry, it invalidates the other cores' caches. A solution is to turn the array into an array of pointers, each storing the address of a variable local to its thread; this requires passing a pointer to the work_done slot so the thread can populate it with the address of its local variable.

    In a synthetic test where the worker threads only iterate on work_done, over 5 seconds across 4 cores we get ~890M iterations per core; once we account for False Sharing and use local variables, we get ~1.8B iterations per core, a ~2x improvement on the i5 3210M @ 2.5GHz.
    The same test on an AMD 8150 @ 4.2GHz reached 44M iterations with False Sharing and 2.3B iterations without, a shocking ~52x improvement in speed; I had to double-check this result because it left me in disbelief**. In this case we use a local variable instead of padding between all the variables to save space, but both work equally well. Example Source

    Summary

    Keep classes with multicore access segmented by cache lines to eliminate False Sharing.
    Local variables are preferred over sharing data outside of the thread.

    Conclusion

    False Sharing can be a problematic side effect of multicore programming and should be a consideration whenever two cores use data in proximity to one another. From these tests on an Intel 3210M we can see that eliminating False Sharing gives a ~2x performance boost; obviously this will differ on other hardware.

    Notes

    * The AMD 8150 was tested on Windows 8.1 with Visual Studio 2013; the Intel 3210M was tested on OSX 10.10 with LLVM 6.1.0.

    ** After seeing such a large difference, I went looking for the cause of such dramatic performance loss. I found that the L3 cache on the Bulldozer architecture is broken into 2MB per module (2 cores) that cannot be accessed by other modules [1]. Sharing results in a cache miss all the way to RAM, while the i5 3210M shares its L3 cache between all cores and only needs to go to L3 in the case of a cache miss. This wouldn't be a problem if the operating system had placed the threads onto the two cores of one module. I kept running the test with 2 threads until the result went from 44M to 1.2B per thread, assuming that in those cases the threads were placed on the same module and therefore shared L3 cache. This is a perfect example of the importance of testing your software on different hardware.

    [1] isscc.org/doc/2011/isscc2011.advanceprogrambooklet_abstracts.pdf pg. 40
  4. JoshuaWaring

    Cache In A Multi-Core Environment

    I clarified the use of a volatile, what it means for the compiler, and underlined the assumption that's required to use it instead of an atomic. It's a pretty big assumption that I wouldn't think applies to many things apart from Boolean flags.
  5. Why Cache Matters

    Simply put, cache is a small amount of memory located on your CPU that is used to reduce latencies to main memory (RAM). We have no control over the cache, but programming and storing data in a cache-friendly manner is imperative for performance improvements and, in turn, a reduction in power consumption. Without realising it, you may have already used a form of cache before: when reading a file from a HDD, you load it into RAM, manipulate it there, and then write back the changes; you've cached the file in RAM. Although this example is slightly different, and technically a buffer, the similarities are present.

    How Cache Works

    When your CPU requests a piece of memory, it goes out to the RAM to fetch it. During that time the CPU waits while the request passes through the memory controller and off the CPU chip to the RAM to be served and returned. This is a waste of possible compute time and therefore a reduction in power efficiency. Cache sits on this line of communication and listens in on the requests to RAM; if it already has the memory, it stops the request and responds with what it has. This is referred to as a Cache Hit; if it's unable to serve the request because it doesn't have the requested memory, it's a Cache Miss. There are systems in place to ensure that, when writing to memory, both the cache and memory are updated, but I'll save those for another article, as they are more important when working with a multithreaded application.

    The part that we can assist

    The cache is always trying to guess what memory you'll need before you request it; this prediction is called Cache Prefetching. This is why, when working on an array, it's best to go through it in sequential order instead of randomly jumping around, as the prefetcher will be able to guess what you're using and have it ready before you need it. Cache loads things in groups of 64 bytes.
    The size is CPU-dependent and can be checked in your CPU's specification under cache line size, although it's typically 64 bytes. This means that if you have an array of integers and you grab one of them, the cache has also grabbed the cache line it sits on; grabbing the next integer stored beside it will be a Cache Hit and subsequently extremely fast. The alignment of a cache line is always a multiple of the cache line size, meaning that if you fetch memory at 0x00 (0), then everything between 0x00 (0) and 0x40 (64) is cached, and if you fetch something at 0x4F (79), then you'll get everything between 0x40 (64) and 0x80 (128).

    Use Case

    Arrays store their data sequentially, so they're ideal for cache-friendly access. In this example, the first loop fixes a column and walks down it (the row index varies fastest), while the second loop walks across each row before moving to the next. Each dimension of the array is 10,000 integers.

    for (unsigned x = 0; x < ARRAY_SIZE; ++x)
        for (unsigned y = 0; y < ARRAY_SIZE; ++y)
            total += array_[y][x];

    for (unsigned x = 0; x < ARRAY_SIZE; ++x)
        for (unsigned y = 0; y < ARRAY_SIZE; ++y)
            total += array_[x][y];

    The only difference between these loops is array_[y][x] versus array_[x][y], but the results are extreme: on my AMD 8150, the first loop finished in 1,501ms while the second took only 281ms, a ~5x speed increase, simply because of the prefetcher-friendly access. Although the first loop may look sequential, its accesses are actually extremely fragmented. A 2D array stores its data in one large block, with each row laid out immediately after the previous one (row-major order). In the diagram below we have the access patterns of the two loops, showing the order of their accesses and the values returned. The first loop accesses its data in a predictable way, but not one the prefetcher is able to prefetch for.
    The second loop accesses its memory sequentially, which allows the CPU to prefetch. Example source: https://gist.github.com/Joshhua5/c0fe80d67d3990c528e7

    (Access) 1 2 3 4 5 6 7 8 9
    (Loop 1) 1 4 7 2 5 8 3 6 9
    (Loop 2) 1 2 3 4 5 6 7 8 9

    One more example of an optimisation is the layout of a class, assuming your compiler doesn't optimise it for you.

    struct Foo {
        bool value1;
        char data[64];
        bool value2;
    };

    Let's say we want to check value1 and value2 before accessing any data inside the array, and let's assume for simplicity that Foo is stored at 0x00. When we access value1, we'll have a Cache Miss as it's our first time accessing this piece of memory, but now we have that line stored for us. We then check value2, which also results in a Cache Miss, because data has polluted our cache line and pushed value2 into the next one, meaning we now have to wait twice. An alternative is to store value1 and value2 next to each other, so that we only have the first Cache Miss and a Cache Hit for the second; assuming we don't access data[62] and upwards, we won't have to deal with another Cache Miss. Optimally, we could store value1 and value2 in the same byte, because we only need one bit to store a Boolean and could technically fit eight Booleans into a single byte. If we performed this optimisation and used bitwise shifting to access each bit, we could squeeze another byte of the array into our cache line. Example source: https://gist.github.com/Joshhua5/83da475070174f309ff2

    With this on the AMD 8150 we received 47ms for the first loop and 31ms for the second, a 1.5x performance increase. Notice that we have to multiply the array index by 2 to break up sequential access, because otherwise there would be no performance difference between the two loops: if we always fetch 64 bytes, then we're bound to fetch more than needed for value2 and thereby cache the next object in the array unintentionally.
    Another interesting behaviour that shows the function of the prefetcher: if we access the array sequentially, so the CPU can prefetch, it takes only 15ms to go through the entire array, versus 31ms to go through half the array 'randomly'.

    Summary

    Optimally, you want objects to be as small as possible so that as many as possible fit in a cache line; you can fit two shorts in an integer, and therefore 16 integers in a cache line, or 32 shorts.
    Consider a pointer: in the worst case, fetching the pointer from memory is one Cache Miss and fetching the object it points to is a second.
    Keep objects smaller than the cache line, or divisible by it, to prevent a single object spreading over two cache lines. C++11 added the alignas keyword, which makes sure objects are aligned in memory to the value passed; this can be beneficial in stopping an object being placed across two cache lines.

    Conclusion

    Cache is more important than ever, with the extreme speed of CPUs and relatively slow RAM; great speed improvements or power efficiency can be gained from utilising it correctly, which is always welcome in the world of game creation. For a deeper look into how you can change your design practices for better cache utilisation, I'd highly recommend watching this video: https://vimeo.com/97337258 - Scott Meyers - Norwegian Developers Conference

    Article Update Log

    23 March 2015: Initial release
    7 April 2015: Added source examples
  6. JoshuaWaring

    Writing Efficient Endian-Independent Code in C++

    You can accelerate the swapping of bytes by using bswap; you should be able to swap 2, 4, or 8 bytes at a time.

    http://www.c-jump.com/CIS77/ASM/DataTypes/T77_0230_bswap.htm

    Edit: ARM also has instructions to swap the byte order: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABJJDDB.html
  7. JoshuaWaring

    Cache And How To Work For It

    I'll add this to the article, but a good Component-Based design practice for cache is to store all components of a single type in one sequential block of memory, instead of allocating each component separately on the heap. Somewhere there's an example of someone who did this and saw major improvements yet again, simply from storing components next to each other.  Edit: (http://gameprogrammingpatterns.com/data-locality.html) this isn't the original page where I read this, and after skimming it I don't think it actually states how much faster it was, but from memory I believe it was at least 20x.
  8. There isn't one that I know of; OpenGL doesn't provide information about memory usage on the GPU. The only way we're able to know is through vendor-specific extensions, and I don't think Intel has such an extension, or at least not one that I can find.
  9. With OpenGL you can poll the video card memory with these; they're vendor-specific, so you can use glGetString(GL_VENDOR) to decide which functions to call.

    Nvidia (http://developer.download.nvidia.com/opengl/specs/GL_NVX_gpu_memory_info.txt)
    AMD (https://www.opengl.org/registry/specs/ATI/meminfo.txt)

    An example for AMD would be:

    GLint vbo_free[4];
    GLint texture_free[4];
    GLint render_buffer_free[4];
    // Poll VRAM usage
    glGetIntegerv(GL_VBO_FREE_MEMORY_ATI, vbo_free);
    glGetIntegerv(GL_TEXTURE_FREE_MEMORY_ATI, texture_free);
    glGetIntegerv(GL_RENDERBUFFER_FREE_MEMORY_ATI, render_buffer_free);

    You'll notice each call fills 4 integers; here's what they are:
    param[0] - total memory free in the pool
    param[1] - largest available free block in the pool
    param[2] - total auxiliary memory free
    param[3] - largest auxiliary free block
  10. JoshuaWaring

    Why a Game Development Degree?

    I'm doing a double major in CS and Games Programming at a full length university. How can I go wrong :)
  11. There are some changes, such as new APIs, the shift from single-threaded to multithreaded games, and from OO to entity-based games. Obviously not all apply to everything, if any.