
How to cache?



#1 JohnnyCode   Members   -  Reputation: 334


Posted 04 September 2013 - 09:31 PM

Hi.

 

I have trouble understanding how a cache is implemented. I understand the low-level aspects of digital computation, yet this is totally out of my scope. Is there a way to implement a cache outside of the CPU's own management? Like, if there is a part of memory that is going to get read from, how would I cache it? The question is not whether it would benefit me; the question is why it would not benefit me. My more precise problem is the GPU cache. There are multiple SIMD (single instruction, multiple data) threads running on my side, but they often read from the same memory. This causes huge stalls if the memory is accessed by many SIMD threads. So my understanding is that the cache is a memory of every individual SIMD thread that gets populated with memory that SIMD thread is likely to access. Is that true? My problem is how to make sure that every SIMD thread reads from its own spot in memory. The only way I can distinguish between SIMD threads is the index of the pixel they are computing, which differs for every SIMD thread, running or about to run. My only idea for implementing a cache for those threads would be:

 

cache = new [numofthreads * cachesize]   // data likely to be read by all parallel threads (say 80, that means 80 identical copies often)

 

thread

{

.........

     var some=cache[threadID][particulardata];

........

}

 

Is this how it works?

Thanks a lot for any clarifications!




#2 HappyCoder   Members   -  Reputation: 2886


Posted 04 September 2013 - 10:50 PM

When programming applications, what you need to know about the cache is that reusing the same area of memory will give much better performance than jumping around in memory. When you request something from memory, there is a bit of a process that goes on to retrieve it.

Let's say you started up the application and requested memory at 0x2000, and let's say this memory has not been requested yet. First of all, the CPU doesn't directly access physical memory at address 0x2000; 0x2000 is a virtual address that has to be translated into a physical address. I won't go into a lot of detail, but I will try to briefly explain how the translation happens. The first time you request the memory, the CPU will see that a page of memory has not been assigned to that address range, so the operating system finds a free page of memory, maps it to the address region you requested, and the program continues on. This process is known as a page fault. Every time memory is accessed using its address, this address is actually the virtual address. The virtual address needs to be translated to a physical address to be requested from the RAM. There are caches specifically used to store information about how to translate virtual addresses. That cache is called the translation lookaside buffer (TLB). Every time the program has a page fault, it actually sets aside the entire region of memory that the requested address lies in. The TLB cannot store all of the virtual memory translation tables, so only the most recently used entries are cached.

So the takeaway of what I was talking about above is to do your calculations in a localized area of memory. This means you should favor allocating things together in contiguous memory when possible, instead of allocating separate blocks. Also, when looping over a 2D array, be sure you loop over the rightmost index in the inner loop, like this:

int array[4][4];

for (int y = 0; y < 4; ++y)
{
  for (int x = 0; x < 4; ++x)
  {
     array[y][x] = x * y;
  }
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.

I don't know much about the other types of caches, other than that the same rules of locality apply. Also, using the stack when possible in C++ leads to better cache performance, because the same region of memory is reused between function calls and so the same parts of memory stay in the cache.

As for how the cache is implemented: the way I understand it, each cache element stores the address, the value stored at that address, its status, and other information needed to function. When an address is requested, it is presented to the cache and every cache line simultaneously checks whether it stores that address; if it does, it signals that it has the address and outputs the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value comes back. I don't think that is a very clear explanation and I don't really know many details of how it works, but I hope that somewhat answers your question.

#3 frob   Moderators   -  Reputation: 22791


Posted 04 September 2013 - 11:47 PM

As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.

A lot of experts have spent decades trying to get processors to automatically detect and prefetch your data. The processors do a wonderful job of it when you access your data sequentially.

That said, there are a few hardware-specific prefetch instructions available at a low level. Most programmers will never use them. They are generally used in extreme cases, and only when you have a fixed target and know exactly what processor you are using. Not "exact processor" as in "I'm using an x86 chip" or "I'm using an i7 processor"; they are only useful as a final level of optimization on an exact processor, like "this will only run on the 4960X" or "this will only run on the ARM946E-S". Even then, they would only be used after nearly every other type of algorithmic and structural optimization has been exhausted. In games this typically means they are limited to consoles with known chipsets.
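For illustration, here is a minimal C++ sketch of what such a hint looks like on x86 using the _mm_prefetch intrinsic; the prefetch distance below is a made-up value that would have to be tuned for one exact chip, which is exactly why this is a last-resort technique:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Sum an array while hinting upcoming cache lines into L1.
// kPrefetchDistance is an arbitrary example value; the right distance
// depends entirely on the exact processor and the workload.
float sum_with_prefetch(const float* data, int count)
{
    const int kPrefetchDistance = 16;  // 16 floats = one 64-byte cache line ahead
    float total = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        if (i + kPrefetchDistance < count)
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + kPrefetchDistance]), _MM_HINT_T0);
        total += data[i];
    }
    return total;
}

On most modern hardware the automatic prefetcher already handles a sequential loop like this, so the explicit hint often makes no difference or even hurts.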



#4 King Mir   Members   -  Reputation: 2050


Posted 05 September 2013 - 12:38 AM

Hi.
 
I have trouble understanding how a cache is implemented. I understand the low-level aspects of digital computation, yet this is totally out of my scope. Is there a way to implement a cache outside of the CPU's own management? Like, if there is a part of memory that is going to get read from, how would I cache it? The question is not whether it would benefit me; the question is why it would not benefit me.

Caching means storing something when it's accessed, so that next time the same thing is used, you don't have to fetch it again.

The CPU cache caches data that's located in main memory on the CPU itself. That way, accessing the same memory location twice in a row, or near to it, is faster.

The GPU cache is the same, for the GPU.

You can implement a cache in memory, but it wouldn't gain you anything, because accessing the cache would be just as slow as accessing the thing being cached. You could of course implement a cache of something other than memory, and that might be helpful; for example, your web browser has a cache for visited webpages.
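As a rough C++ sketch of that idea, here is a tiny software cache in front of a hypothetical expensive_lookup function (standing in for a disk read or web request; the names are made up):

#include <string>
#include <unordered_map>

// Hypothetical slow operation we want to avoid repeating (disk, network, ...).
std::string expensive_lookup(const std::string& key);

std::string cached_lookup(const std::string& key)
{
    // The cache itself lives in ordinary memory, so it only pays off when
    // the thing being cached is much slower than a memory access.
    static std::unordered_map<std::string, std::string> cache;

    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                      // hit: reuse the stored result

    std::string value = expensive_lookup(key);  // miss: fetch it and remember it
    cache.emplace(key, value);
    return value;
}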

My more precise problem is the GPU cache. There are multiple SIMD (single instruction, multiple data) threads running on my side, but they often read from the same memory. This causes huge stalls if the memory is accessed by many SIMD threads. So my understanding is that the cache is a memory of every individual SIMD thread that gets populated with memory that SIMD thread is likely to access. Is that true? My problem is how to make sure that every SIMD thread reads from its own spot in memory. The only way I can distinguish between SIMD threads is the index of the pixel they are computing, which differs for every SIMD thread, running or about to run. My only idea for implementing a cache for those threads would be:
 
cache = new [numofthreads * cachesize]   // data likely to be read by all parallel threads (say 80, that means 80 identical copies often)
 
thread
{
.........
     var some=cache[threadID][particulardata];
........
}
 
Is this how it works?
Thanks a lot for any clarifications!

So in the GPU there is actually not one cache but two (though this may vary between GPUs). The L1 cache is per core; it is a cache of a per-processor L2 cache. When two cores want to access the same memory address at the same time, and one of them is a write, the GPU has to work to ensure that they both see a consistent view of memory. So it must propagate writes to the L2 cache. And when a core attempts to access a memory location, it must check that the value hasn't been changed in the L2 cache. If it has, then it has to get the new value from the L2 cache; if it hasn't, then it can get the value from the L1 cache. That's why multiple cores writing to the same memory location is slower than when they don't share any data.

The CPU works the same way, except there are generally 3 levels of cache.

Note however that this only applies when at least one thread is writing to memory. Having multiple cores read from the same location is not a problem.

So the best way to ensure that the L1 cache is used efficiently is not to share mutable data between multiple cores. I presume you could either partition the data smartly between cores, or buffer the data so that a texture is not being written to and read from at the same time.
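On the CPU side, here is a minimal C++ sketch of that advice: each thread below writes only to its own counter, and alignas(64) pads each counter onto its own cache line (64 bytes is an assumption, but it is the usual line size) so the cores never fight over a shared line ("false sharing"):

#include <thread>

// Each counter gets its own 64-byte cache line. Without alignas(64) the two
// counters would likely share one line, and every write by one thread would
// invalidate the other core's copy of that line.
struct alignas(64) PaddedCounter
{
    long value = 0;
};

PaddedCounter counters[2];

void worker(int index)
{
    for (int i = 0; i < 100000000; ++i)
        counters[index].value += 1;   // each thread touches only its own line
}

int main()
{
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return 0;
}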

#5 King Mir   Members   -  Reputation: 2050


Posted 05 September 2013 - 01:09 AM

Let's say you started up the application and requested memory at 0x2000, and let's say this memory has not been requested yet. First of all, the CPU doesn't directly access physical memory at address 0x2000; 0x2000 is a virtual address that has to be translated into a physical address. I won't go into a lot of detail, but I will try to briefly explain how the translation happens. The first time you request the memory, the CPU will see that a page of memory has not been assigned to that address range, so the operating system finds a free page of memory, maps it to the address region you requested, and the program continues on. This process is known as a page fault. Every time memory is accessed using its address, this address is actually the virtual address. The virtual address needs to be translated to a physical address to be requested from the RAM. There are caches specifically used to store information about how to translate virtual addresses. That cache is called the translation lookaside buffer (TLB). Every time the program has a page fault, it actually sets aside the entire region of memory that the requested address lies in. The TLB cannot store all of the virtual memory translation tables, so only the most recently used entries are cached.

This is all true, but it is not related to the CPU cache at all, except that both do caching of a sort. Pages and page faults are handled in software by the operating system, though the TLB is a CPU-assisted part. The CPU cache is implemented entirely in hardware on the CPU (and the same may exist for the GPU).

Operating system paging can be viewed as a fully-associative cache of the page file, but the more common way of looking at it is that the page file (or swap partition) is a backup to main memory.

So the takeaway of what I was talking about above is to do your calculations in a localized area of memory. This means you should favor allocating things together in contiguous memory when possible, instead of allocating separate blocks. Also, when looping over a 2D array, be sure you loop over the rightmost index in the inner loop, like this:
 

int array[4][4];

for (int y = 0; y < 4; ++y)
{
  for (int x = 0; x < 4; ++x)
  {
     array[y][x] = x * y;
  }
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.

Despite the inaccuracy of the above, this is good advice. The advantage of accessing data near to prior accesses applies to the CPU cache even more than it does to page files. (Paging is not really relevant to modern programming, unless you're implementing a database, or should be using one.)
 

As for how the cache is implemented: the way I understand it, each cache element stores the address, the value stored at that address, its status, and other information needed to function. When an address is requested, it is presented to the cache and every cache line simultaneously checks whether it stores that address; if it does, it signals that it has the address and outputs the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value comes back. I don't think that is a very clear explanation and I don't really know many details of how it works, but I hope that somewhat answers your question.

That's not how it works, because CPU caches are not fully associative. They are typically direct-mapped (non-associative) or set associative. You can look those terms up for more info.

I don't know the associativity of GPUs.
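To make "set associative" a bit more concrete, here is a C++ sketch of how an address would be split for a hypothetical 32 KB, 8-way cache with 64-byte lines (the sizes are invented but typical); on a lookup only the 8 lines of one set are compared against the tag, not every line in the cache:

#include <cstdint>
#include <cstdio>

// Hypothetical geometry: 32 KB total, 8 ways per set, 64-byte lines -> 64 sets.
constexpr uint64_t kLineSize  = 64;
constexpr uint64_t kWays      = 8;
constexpr uint64_t kCacheSize = 32 * 1024;
constexpr uint64_t kSets      = kCacheSize / (kLineSize * kWays);

struct CacheAddress
{
    uint64_t tag;     // identifies which memory block occupies a line
    uint64_t set;     // selects which set of kWays lines to search
    uint64_t offset;  // byte within the 64-byte line
};

CacheAddress split(uint64_t address)
{
    CacheAddress a;
    a.offset = address % kLineSize;
    a.set    = (address / kLineSize) % kSets;
    a.tag    = address / (kLineSize * kSets);
    return a;
}

int main()
{
    CacheAddress a = split(0x2040);
    std::printf("tag=%llu set=%llu offset=%llu\n",
                (unsigned long long)a.tag, (unsigned long long)a.set, (unsigned long long)a.offset);
    return 0;
}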

Edited by King Mir, 05 September 2013 - 01:28 AM.


#6 AllEightUp   Moderators   -  Reputation: 4273


Posted 05 September 2013 - 01:59 AM

No offense to anyone, but the GPU cache is something "you" don't optimize for in terms of shader code. Some interesting information has been posted, but unless you are working on a specific console (fixed hardware) you simply don't have any way to really optimize your cache usage in shaders properly, because each GPU is very different. More importantly, the big cache items are way beyond your control and you simply cannot "optimize" them. The big items, in order of cost: texture lookups, stream data lookups (vertex/index buffers), then the very small number of variables you use in your shaders and the shader code itself. (NOTE: "texture" also includes render targets, including the current backbuffer.)

Textures are swizzled in memory for (usually) bilinear lookup, so the swizzles are typically at least 2x2. Based on fetch sizes, though, some use 4x4 and some use 8x8. Trying to optimize your code for 2x2 would degrade performance on a card that swizzles 4x4. So, in general, don't try to optimize this.
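For context only, here is a C++ sketch of one common textbook swizzle, Morton (Z-order) indexing, which interleaves the bits of x and y so that 2x2 (and 4x4, 8x8, ...) neighborhoods of texels land next to each other in memory; real vendor layouts vary and are usually undocumented:

#include <cstdint>

// Interleave the low 16 bits of x and y into a Morton (Z-order) index.
// Texels that are close together in 2D end up close together in the
// swizzled 1D layout, which is what makes bilinear fetches cache-friendly.
uint32_t morton_index(uint32_t x, uint32_t y)
{
    auto spread = [](uint32_t v) {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}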

Geometry data is the next biggest hit, but it is an order of magnitude smaller than the textures. You "can" get good benefits by simply running a model through a generic 128-byte cache line optimizer. On GPUs with smaller cache lines it still optimizes reasonably compared to raw unsorted data, and on machines with larger cache lines it reduces a significant number of hits. But... this is actually pretty trivial in comparison to texture costs.

Your shaders: other than the linear work they do per pixel, there is nearly no way to optimize them for cache coherency in such an environment. Most communication is through "registers" in the shader code (i.e. the limit on the number of items you can send to a shader), and as such those are not cached at all; they are "in GPU" items and there simply is no cache to optimize for. So, other than reducing instruction counts and using the best operations to reduce the GPU's notion of cycle counts, you really can't "optimize" beyond supplying the smallest and most cycle-efficient code.

Give me a CPU and I'll take what Happy posted and blow your mind. With a GPU, you are exceptionally limited, because GPUs vary so much, even when made by the same company, that your optimizations will likely die horribly on other GPUs.

#7 DareDeveloper   GDNet+   -  Reputation: 978


Posted 05 September 2013 - 03:17 AM

Not sure if this is helpful at all, but Java's ThreadLocal came to mind when I read the SIMD stuff in the original post.

 

It kinda abstracts away the part where you link the thread id to the data for one thread. You have one ThreadLocal instance that all threads use, but get() returns an object for the current thread (making sure that threads don't share objects). Not sure if the same thing can be done with C++ templates or with whatever language you use ... but here is basically what the ThreadLocal needs:

 

T get()

Returns the value in the current thread's copy of this thread-local variable.
 

protected T initialValue()

Returns the current thread's "initial value" for this thread-local variable.
 

void remove()

Removes the current thread's value for this thread-local variable.
 

void set(T value)

Sets the current thread's copy of this thread-local variable to the specified value.
 

 

How does it work?

There is a map where the thread ids are linked to the value. The map is managed in the get() and remove() functions.

If there is no value when get() is called it uses initialValue() to create a new object and to store it in the map.
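For what it's worth, a rough C++ sketch of the same idea: a toy ThreadLocal-style template keyed by std::this_thread::get_id(), with a factory function standing in for initialValue() (in practice C++11's thread_local keyword gives you this behavior directly):

#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>

// Toy equivalent of Java's ThreadLocal<T>: one shared object, but get()
// hands each thread its own value, created on first use by the factory.
template <typename T>
class ThreadLocalValue
{
public:
    explicit ThreadLocalValue(std::function<T()> initial) : initial_(std::move(initial)) {}

    T& get()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto id = std::this_thread::get_id();
        auto it = values_.find(id);
        if (it == values_.end())
            it = values_.emplace(id, initial_()).first;  // first access: create this thread's copy
        return it->second;
    }

    void remove()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.erase(std::this_thread::get_id());
    }

private:
    std::function<T()> initial_;
    std::mutex mutex_;
    std::unordered_map<std::thread::id, T> values_;
};

Usage would look like ThreadLocalValue<int> counter([] { return 0; });, with each thread seeing its own value from counter.get().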

 

Not sure if this is what you are asking for or if that is even a valid solution when it comes to performance ... I'll just dump those thoughts here.

 

Guess it might not even work in some languages, because you want to be able to extend ThreadLocal and override initialValue() ... but that means mixing static and dynamic polymorphism?


Edited by DareDeveloper, 05 September 2013 - 03:18 AM.



#8 Satharis   Members   -  Reputation: 1281


Posted 05 September 2013 - 09:44 AM

Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about. In terms of game development the farthest engines tend to go with it is things like changing how memory is allocated in order to create contiguous blocks. Of course you can stick to more specific optimization techniques; there are quite a few related to the GPU that I'm not experienced enough in to comment on.

Essentially it works the same way as HappyCoder's array example, in that the objects are allocated as arrays and are thus in contiguous blocks of memory. A good way to think of it is similar to how a hard drive reads data: you want to prevent the read head from having to "jump" all over to different blocks just to read a small section.

Dealing with it on the GPU is a bit of a different story since it basically relates to how you allocate and send data to the video card and the relationship between it and the CPU.

In short, it's not really one broad thing to do; it's more a bunch of smaller techniques for getting a bit more speed out of sections of code, with critical sections obviously benefiting more, because code that runs repeatedly multiplies whatever time you save in a single pass.

Edited by Satharis, 05 September 2013 - 09:45 AM.


#9 JohnnyCode   Members   -  Reputation: 334


Posted 05 September 2013 - 03:06 PM

 

Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about.

So, at the level of the OS and compiler, nothing is done towards this maintenance, except keeping memory accesses "as close together in time as possible"?

And is accessing memory to read from by multiple threads performance-safe? And if so, why? Is there a point in making every thread have its own memory pool, populated even with duplicated virtual RAM, or not?



#10 Satharis   Members   -  Reputation: 1281


Posted 05 September 2013 - 03:51 PM

So, at the level of the OS and compiler, nothing is done towards this maintenance, except keeping memory accesses "as close together in time as possible"?

It depends on the compiler, of course, but most of a compiler's optimization repertoire is a lot of little techniques for either reducing instruction counts and redundancies or helping with cache misses by aligning memory or instructions. There are a lot of techniques, really, and they all come down to how the code is translated, so it really isn't something the programmer themselves deals with.
 

And is accessing memory to read from by multiple threads performance-safe? And if so, why? Is there a point in making every thread have its own memory pool, populated even with duplicated virtual RAM, or not?

It's often debatable at best in performance terms. Accessing memory from two different cores usually doesn't help performance, for two reasons:

Processor design: Processors vary a LOT in how they use cache. Many use separate L1 caches for each core and may use shared cache at L2 or L3, but there's nothing saying that HAS to be true. It's certainly possible not to use shared cache at all; sharing is often done just to save chip space. In the case of each core having its own cache, that means even if they all accessed the same data they would -all- make a copy into their own cache, which is very redundant.

Shared cache itself: Shared cache can vary in its performance based on the type of operations being performed. Shared cache is often L2 or L3 cache, so it is explicitly slower than L1; in most cases the core will get the most bang for the buck out of the non-shared cache anyway. Utilizing the shared cache can even be slower than reaching out to RAM, depending on how many cores are accessing a particular cache line and what operations they're doing. In general it's better to have each core working on a different piece of data or instruction in the shared cache anyway.

In terms of overall performance, the biggest real way to increase performance is to align data in memory and then work on it in cache-line-sized segments. In that case you could theoretically see gains if you, say, took a loop over a set of data and split it into 4 smaller loops. But actually coding for something like this would be a bit of a waste of time: it may be a compiler optimization already, it varies based on the CPU, and it would be meaningless in most cases unless it was done in extremely performance-critical sections, where the compiler is likely to make a lot of optimizations anyway.
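For what it's worth, a rough C++ sketch of that kind of split (whether this beats a single loop depends entirely on the CPU and what the compiler already does; the four-way split is just the example from the paragraph above):

#include <cstddef>
#include <thread>
#include <vector>

// Process a big array in four contiguous chunks, one per thread, so each
// thread streams through its own range of cache lines instead of
// interleaving its accesses with the other threads'.
void scale_all(std::vector<float>& data, float factor)
{
    const std::size_t chunk = data.size() / 4;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == 3) ? data.size() : begin + chunk;
        workers.emplace_back([&data, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= factor;   // sequential access within the chunk
        });
    }
    for (auto& w : workers)
        w.join();
}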

Edited by Satharis, 05 September 2013 - 03:52 PM.


#11 frob   Moderators   -  Reputation: 22791


Posted 05 September 2013 - 04:27 PM


And is accessing memory to read from by multiple threads performance-safe?
Thread safety for memory is accomplished through memory barriers. 

 

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions which have great performance, or through more complex locks which require additional CPU time.
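In C++ terms, a minimal sketch of the two options mentioned above (an atomic "interlocked" increment versus a mutex-protected one); the counters are just illustrative:

#include <atomic>
#include <mutex>

std::atomic<int> interlocked_counter{0};

int plain_counter = 0;
std::mutex counter_mutex;

// "Interlocked" style: a single atomic read-modify-write that also provides
// the required memory ordering.
void bump_interlocked()
{
    interlocked_counter.fetch_add(1);
}

// Lock style: heavier, but works for arbitrary multi-step updates of shared state.
void bump_locked()
{
    std::lock_guard<std::mutex> lock(counter_mutex);
    ++plain_counter;
}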




#12 King Mir   Members   -  Reputation: 2050


Posted 05 September 2013 - 11:32 PM

Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about. In terms of game development the farthest engines tend to go with it is things like changing how memory is allocated in order to create contiguous blocks.

I disagree that people writing performance-critical code should not think about cache behavior. Quite the opposite: it can be the greatest bottleneck in a memory-rich application. At the least, one should be aware that you should avoid accessing disparate memory, and that cache lines are 64 bytes.
 

And is accessing memory to read from by multiple threads performance-safe? And if so, why? Is there a point in making every thread have its own memory pool, populated even with duplicated virtual RAM, or not?

Accessing memory to read from by multiple threads is safe and fast. In rare cases it can even lead to superlinear speedup, because one thread can pull a piece of data from memory into the L3 cache, and that data can then be used by another thread. This holds as long as no thread is writing to that memory. Reading the same piece of memory is safe without any synchronization, because the data cannot change.
 

And is accessing memory to read from by multiple threads performance-safe?

Thread safety for memory is accomplished through memory barriers.

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions which have great performance, or through more complex locks which require additional CPU time.

As per above, this is not true for constant data. If no thread is writing to shared data, then no synchronization is needed on that data.

Edited by King Mir, 05 September 2013 - 11:52 PM.


#13 LorenzoGatti   Crossbones+   -  Reputation: 2779


Posted 06 September 2013 - 01:50 AM

Apart from sequential memory accesses, there's a complementary way to take advantage of cache: doing as much computation as possible with the same data before loading other data.

 

For example, instead of looping (in the best possible sequential fashion) through an array of guided missiles to update their positions, then again to compute collisions, then again to compact the array after some missiles have exploded, then again to compute their acceleration from the guidance system, then again to compute their new speed, looping through them only twice is going to reduce loads and stores, and the ensuing cache misses, by 60%.
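A condensed C++ sketch of that kind of loop fusion (the Missile fields and update steps are invented for illustration):

#include <cstddef>
#include <vector>

struct Missile
{
    float px, py, vx, vy;   // position and velocity
    float ax, ay;           // acceleration from the guidance system
    bool  exploded;
};

// One fused pass touches each missile once while it is hot in the cache,
// instead of several separate passes that each reload the whole array.
void update_missiles(std::vector<Missile>& missiles, float dt)
{
    std::size_t alive = 0;
    for (const Missile& m : missiles)
    {
        Missile u = m;
        u.vx += u.ax * dt;           // new speed
        u.vy += u.ay * dt;
        u.px += u.vx * dt;           // new position
        u.py += u.vy * dt;
        // (collision tests and guidance updates would be folded in here too)
        if (!u.exploded)
            missiles[alive++] = u;   // compact the array in the same pass
    }
    missiles.resize(alive);
}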



#14 King Mir   Members   -  Reputation: 2050


Posted 06 September 2013 - 08:47 AM

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.
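A C++ sketch of what that ordering might look like (the Entity members and the hot/cold split are made up for illustration):

#include <cstddef>
#include <cstdint>

struct Entity
{
    // Hot: touched every frame, kept within the first 64 bytes (one cache line).
    float    px, py, pz;
    float    vx, vy, vz;
    uint32_t flags;
    uint32_t material_id;
    float    radius;

    // Cold: touched rarely (editor/debug paths), pushed past the first line.
    char     debug_name[64];
    uint64_t creation_time;
};

// Sanity check that the hot members really fit in the first cache line.
static_assert(offsetof(Entity, debug_name) <= 64, "hot members should fit in one cache line");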

#15 fir   Members   -  Reputation: -456


Posted 06 September 2013 - 10:31 AM

I read somewhere that current systems have not one but many hardware prefetchers, and that you can (if I remember correctly) even experiment with disabling and enabling them (probably in the BIOS) to check their influence on performance. Does someone here maybe know something about it and could say something? (Maybe check whether you have something like that in your BIOS? I've got an old machine.)



#16 Ravyne   GDNet+   -  Reputation: 8186


Posted 06 September 2013 - 12:12 PM

On the GPU it's hard to really take any concrete action at the very low level. For the most part, you can't rely on any set of GPUs behaving in the same way with respect to caching, because each vendor has their own secret sauce. Newer GPUs are becoming more homogenized across generations and even across vendors as GPU compute becomes more and more of a factor in their hardware designs, but there's still quite a bit of variance today. I predict they will continue to trend closer and closer, but I doubt we'll ever see the kind of homogeneity that exists between, say, AMD and Intel processors.

 

What you can do instead is be less worried about specific caching behaviors and more aware of the capabilities defined by the different Direct3D versions -- D3D specifies the number of instruction slots, registers, and the amount of shared memory available, among other things. You also need to be aware of how GPU execution differs (e.g. a branch executes both sides if even one element goes in each direction, with the results masked and merged -- and this grows exponentially with nested branches). Optimal GPU code loads some memory that no other thread group wants to touch at the same time, parties on it for a bit, writes back the result, and moves on to the next bit of memory.



#17 fir   Members   -  Reputation: -456


Posted 06 September 2013 - 02:28 PM

As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.

 

A couple of days ago there was a topic on that; I was asking mainly about two kinds of prefetching that could be expressed with a simple example like this:

 

for(int i = 0; i < N; i++)

    A[i*4] = B[i*3]; 

 

1) many interleaved arrays being accessed - will prefetching work here?

2) prefetching by stride - will this work?

 

I found some hints on the net which say:

 

1) it seems that many interleaved streams can be prefetched (I do not know if there is some limit; I suspect there is)

 

2) prefetchers can work by stride - probably they do it in a strided way, that is, they are able to get some bytes, skip a gap, get some bytes, skip a gap, not just all contiguous memory (that strided access was in the suggestions I was reading about it - but if the cache line to fetch is 64 bytes, it would seem that they will probably get at least 64 bytes - I don't know)

 

 

[If so, thinking about the examples in that previous topic, this is the outcome I'd expect: when scanning the full table of used and unused records, the prefetcher will prefetch it all - that is a bigger cache stream (used and unused records) but with no cache misses. In the second example, when reaching only for the used records, it would be a much smaller cache stream, but I think the prefetcher could not work there - if so, I could suspect that the first example can be better.]

- they cannot work when the stride is big, around 1000 or something, and in such a case there would be a stall

That's what I found - it is important knowledge to consider, imo, if someone is interested in optimization.


Edited by fir, 06 September 2013 - 11:50 PM.


#18 fir   Members   -  Reputation: -456


Posted 06 September 2013 - 11:56 PM

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.

 

I also wonder if they should sometimes not be aligned to 64 bytes. Is the 64B cache line that gets loaded an aligned one, or just a 64B-long unaligned one?



#19 King Mir   Members   -  Reputation: 2050


Posted 07 September 2013 - 01:04 PM

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.

 
I also wonder if they should sometimes not be aligned to 64 bytes. Is the 64B cache line that gets loaded an aligned one, or just a 64B-long unaligned one?

Yes, it does need to be 64-byte aligned. So you want padding after large classes to make their size divisible by 64 bytes.
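A C++ sketch of that alignment and padding; alignas(64) makes each object start on a cache line boundary, and the compiler pads sizeof up to a multiple of the alignment, so in an array the hot leading members never straddle two lines (the Particle fields are illustrative):

// alignas(64) aligns every Particle to a cache line boundary and pads its
// size to a multiple of 64, so array elements never straddle two lines.
struct alignas(64) Particle
{
    // hot
    float px, py, pz;
    float vx, vy, vz;
    // cold
    float spawn_time;
    int   owner_id;
};

static_assert(alignof(Particle) == 64, "Particle should be cache-line aligned");
static_assert(sizeof(Particle) % 64 == 0, "size should be a whole number of cache lines");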

Edited by King Mir, 07 September 2013 - 01:08 PM.


#20 fir   Members   -  Reputation: -456


Posted 07 September 2013 - 01:32 PM

 

 

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.

 
I also wonder if they should sometimes not be aligned to 64 bytes. Is the 64B cache line that gets loaded an aligned one, or just a 64B-long unaligned one?

 

Yes, it does need to be 64-byte aligned. So you want padding after large classes to make their size divisible by 64 bytes.

 

 

Does someone maybe know: when the CPU reads some byte (and it goes through the cache), does it always get such an aligned 64B line (that is, some bytes forward and usually some bytes backward of the address)? In other words, when I read some byte, may I assume that the whole aligned line around it is in the cache?





