SOA and cache locality


Recommended Posts

I've read in several places at this point that to get the most out of your CPU's cache, it's important to pack relevant data together, and access it in a linear fashion. That way the first access to that region of memory loads it into a cache line, and the remaining accesses will be much cheaper. One thing that isn't clear to me, however, is how many cache lines you can have "active" at once.

 

So for example, if you have a series of 3D vectors, and you lay them out like this:

[xxxx...]
[yyyy...]
[zzzz...]

And then you access your data as:

// x, y, and z are separate contiguous arrays, each of length len
for (std::size_t i = 0; i < len; ++i) {
    auto x_i = x[i];
    auto y_i = y[i];
    auto z_i = z[i];

    // Do something with x_i, y_i, and z_i
}

Does each array get its own cache line? Or does accessing the 'y' element push 'x' out of the cache, and then accessing the 'z' element push 'y' out of the cache? If you were to iterate backwards, would that cause more cache misses than iterating forwards?

 

On another note, while I try to follow best practices for this stuff where possible, I literally have zero idea how effective (or ineffective) it is, since I have no tools for profiling it, and I don't have time to write one version of something and then test it against another. Are there any free (or free for students) tools for cache profiling on Windows? I'd love to use Valgrind, but I don't have anything that can run Linux that is also powerful enough to run my game.

 

Thanks!

Edited by Easy Rider


The CPU will have a particular cache line size, an L1 cache for each core, an L2 cache (usually also per core), and possibly even a shared L3 cache.

 

e.g. the Intel i7 6850K has 64-byte cache lines, 64KiB of L1 cache per core, 256KiB of L2 cache per core, and 15MiB of shared L3 cache.

 

When you read through a pointer, the lower 6 bits of the address are masked to zero, which has the effect of rounding it down to the nearest multiple of 64 (64 is 1000000b). The CPU then reads the 64 bytes starting at that address into the L1 cache (which will in turn try to read the data from the L2 cache, which will try to read it from the L3 cache, which will read it from RAM).

Because there's 64KiB of L1 cache and a cache-line takes up 64 bytes, you can fit 1024 lines into the L1 cache (and 4096 lines into L2, and ~a quarter million lines in L3).
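To make that concrete, here is a small sketch of the address and capacity math described above (it assumes 64-byte lines and the i7 6850K sizes quoted earlier; the variable names are just for illustration):

#include <cstdint>
#include <iostream>

constexpr std::uintptr_t kLineSize = 64; // bytes per cache line (assumed)

int main() {
    int value = 42;
    auto addr = reinterpret_cast<std::uintptr_t>(&value);

    // Clearing the lower 6 bits rounds the address down to the start of its cache line.
    std::uintptr_t lineBase = addr & ~(kLineSize - 1);
    std::cout << std::hex << "address 0x" << addr
              << " is in the line starting at 0x" << lineBase << '\n';

    // How many 64-byte lines fit in each cache level (sizes from the example above):
    std::cout << std::dec
              << "L1: " << (64 * 1024) / kLineSize << " lines, "
              << "L2: " << (256 * 1024) / kLineSize << " lines, "
              << "L3: " << (15 * 1024 * 1024) / kLineSize << " lines\n";
}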

 

The way that a CPU decides where to put each line in the cache is called associativity.

One simple scheme is to divide the RAM address by the cache size and use the remainder as the cache address. e.g. if the cache size were 1000 and you had items in RAM at the addresses a=1337, b=2004, c=1300001, then their local cache addresses would be a=337, b=4, c=1. This can lead to "conflicts" -- e.g. if a=1042 and b=99042, then they both map to the local cache address 42! If you used a and then b, then b would evict a from the cache (and the next time you use a it would evict b from the cache).

There are also many other (smarter) ways that CPUs can translate between local cache addresses and RAM addresses...
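As a toy illustration of that simple remainder scheme (real CPUs index by cache line and set rather than by raw byte address, so take this purely as a sketch of the conflict idea):

#include <cstdint>
#include <iostream>

// Toy direct-mapped cache: every RAM address maps to slot (address % cacheSize).
constexpr std::uintptr_t kCacheSize = 1000;

std::uintptr_t cacheSlot(std::uintptr_t ramAddress) {
    return ramAddress % kCacheSize;
}

int main() {
    // The addresses from the example above:
    std::cout << cacheSlot(1337) << '\n';    // 337
    std::cout << cacheSlot(2004) << '\n';    // 4
    std::cout << cacheSlot(1300001) << '\n'; // 1

    // A conflict: both addresses land in slot 42, so each use evicts the other.
    std::cout << cacheSlot(1042) << " " << cacheSlot(99042) << '\n'; // 42 42
}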

 

Iterating backwards and forwards should have similar performance. The CPU has a predictive prefetcher that will try to guess your memory access patterns (e.g. "iterating linearly forwards") and start moving data from RAM into cache before you even ask for it. Both "iterating linearly forwards" and "iterating linearly backwards" are common patterns that it will look for.

On less smart CPUs, programmers had to put special "prefetch" instructions into their code to let the CPU know what data was going to be used in the near future. You still sometimes see these used on modern desktop CPUs, but they're almost always unnecessary these days.
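For reference, a manual prefetch on x86 looks roughly like this (a sketch only; the function, the lookahead distance, and the workload are made up, and on a modern desktop CPU the hardware prefetcher would normally make this hint redundant):

#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

float sumWithPrefetch(const float* data, std::size_t len) {
    float total = 0.0f;
    for (std::size_t i = 0; i < len; ++i) {
        // Hint that data some distance ahead will be needed soon (T0 = pull into all cache levels).
        if (i + 64 < len)
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + 64]), _MM_HINT_T0);
        total += data[i];
    }
    return total;
}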


If you're curious about your own system then you can check your cache sizes with a program called CPU-Z:

http://www.cpuid.com/softwares/cpu-z.html

 

The address mapping conflict with cache lines that Hodgman talked about is called "cache thrashing" in case you want to do more research on it. It's not something that you need to worry about under normal circumstances, but it's good to be aware of it and understand how to deal with it in case you do encounter a performance issue. Generally locality (keep related data close together in memory) and predictable access patterns are enough to keep things running smoothly. In a contiguous container the size of the elements also begins to matter since it determines how many elements will fit in a cache line.
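As a rough illustration of that last point (the struct here is hypothetical, and exact sizes can vary by compiler and platform):

#include <cstdio>

struct Particle {      // 6 floats = 24 bytes on typical platforms
    float px, py, pz;
    float vx, vy, vz;
};

int main() {
    constexpr int kLineSize = 64; // typical cache line size in bytes
    // A 64-byte line holds 16 floats, but only 2 whole 24-byte Particles.
    std::printf("floats per line:    %d\n", kLineSize / (int)sizeof(float));
    std::printf("Particles per line: %d\n", kLineSize / (int)sizeof(Particle));
}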


In your particular case, however, the access in that loop is not cache friendly, because the arrays are stored one after another if they are declared and created at the same time. Had you looped over all the x's, then all the y's, then all the z's, you would have had cache-friendly access. The trick is to lay out your data in such a way that, when you need it, you access all or most of it in the same loop.
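What's being suggested looks roughly like this (a sketch; the function and the operation it performs are made up for illustration):

#include <cstddef>

// One pass per array: each loop walks a single contiguous block of memory from start to end.
void scaleAll(float* x, float* y, float* z, std::size_t len, float s) {
    for (std::size_t i = 0; i < len; ++i) x[i] *= s;
    for (std::size_t i = 0; i < len; ++i) y[i] *= s;
    for (std::size_t i = 0; i < len; ++i) z[i] *= s;
}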

 

Herb Sutter has done a talk about the benefit of looping through your data in order, with slides that show what happens and why (https://channel9.msdn.com/Events/Build/2014/2-661 -- there is more in this talk than just that): https://view.officeapps.live.com/op/view.aspx?src=http%3a%2f%2fvideo.ch9.ms%2fsessions%2fbuild%2f2014%2f2-661.pptx

 

PS: on most PC hardware the cache line size is 64 bytes.


The structure of arrays is ONE useful solution, but like every solution it can be misapplied.

 

The fundamental issue is how the data is accessed. The pattern matches one efficient way to access the data but it is not the only way, and if your access pattern doesn't match the data layout it will backfire.

 

Does each array get its own cache line?

A cache line has been 64 bytes for many years now.  It will likely eventually become 128 bytes.

 

 

 

Or does accessing the 'y' element push 'x' out of the cache, and then accessing the 'z' element push 'y' out of the cache?

 

Probably not. Modern processors have large caches, but the inner workings of how they load and evict contents is an implementation detail. There is no guarantee that they work any particular way.

 

There is a prediction system that can evaluate the current memory accesses and the memory addresses coming through the pipeline (about 30 instructions currently). It can start prefetching your data quite early.

 

Object sizes still play a role. If your data is not at a 64-byte stride or something else easily predictable (96 bytes, 128 bytes, 160 bytes, etc.) then it cannot be readily predicted. If the next item fits on the same cache line the processing is very nearly free. If it doesn't fit on the cache line but the prefetcher is able to predict it, the cost is more than if it were already in L1, but far less than a full cache miss.
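One way to get such a predictable stride is to pad or align the element type (a sketch; the struct and the choice of 64 bytes are made up for illustration, and padding does trade away some cache capacity):

#include <cstdio>

// 40 bytes of payload...
struct Unpadded {
    float data[10];
};

// ...aligned so that consecutive array elements sit at a regular 64-byte stride.
struct alignas(64) Padded {
    float data[10];
};

int main() {
    std::printf("Unpadded stride: %zu bytes\n", sizeof(Unpadded)); // 40
    std::printf("Padded stride:   %zu bytes\n", sizeof(Padded));   // 64
}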

 

It is the cache prediction that gives the big speed advantage in this case.  If it cannot predict what is next there will probably be a cache miss.

 

 

 

If you were to iterate backwards, would that cause more cache misses than iterating forwards?

 

The cache predictor is usually good about those things.

 

 

 

On another note, while I try to follow best practices for this stuff where possible, I literally have zero idea how effective (or ineffective) it is, since I have no tools for profiling it, and I don't have time to write one version of something and then test it against another. Are there any free (or free for students) tools for cache profiling on Windows?

 

AMD's profiler CodeXL can do it.  It was made open source and is freely available on GitHub.  If you haven't used a profiler before you will likely have a bit of a learning curve, but there are many online guides.

 

 

 

Be aware that a structure of arrays really only helps when you are accessing your data sequentially as data streams.  SoA doesn't provide any benefit if you are jumping around the array, or if you are using individual elements as objects; in that case an array of structures can be better, because you're touching all of an element's data members at once.   The key detail is that whatever format you have for the data, it should be accessed at a regular interval to enable prediction, and packed tightly enough that multiple reads land on the same cache line.
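To make the contrast concrete, here is a minimal sketch of the two layouts (the names and fields are made up for illustration):

#include <cstddef>
#include <vector>

// Array of structures (AoS): good when you touch every member of one element at a time.
struct ParticleAoS { float x, y, z, mass; };
std::vector<ParticleAoS> particlesAoS;

// Structure of arrays (SoA): good when you stream through one member across all elements.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};
ParticlesSoA particlesSoA;

// SoA shines here: only the x values are read, and they sit contiguously in memory.
float sumX(const ParticlesSoA& p) {
    float total = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i) total += p.x[i];
    return total;
}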


It's worth keeping in mind that one usually applies cache optimisations very late in the optimisation process, after one has made all the algorithmic improvements one can.

 

You usually have a very clear picture of your data access patterns by that point (and they aren't going to change very much in the future). Then and only then should you start applying things like SoA, and only where it matches the actual data access pattern.
