faster than memset?

Started by
8 comments, last by Ului 20 years ago
Anyone know of a routine that is faster than memset, or even better for that matter? ZeroMemory, as far as I know, uses the same thing. Not that I'd be calling this a million times, but it's always good to know you have the fastest thing out there.

Dun mes wit me!
Memset is pretty fast. Fast enough to use everywhere, and if you profile and determine that it's a problem somewhere, you can try to replace it with something else.
Never mind, I worked it out. I coded a faster one for what I need to do; as long as the size is divisible by 64 it's fine.




mov eax,dword ptr [nSize]; // byte count (must be a multiple of 64)
mov edx, dword ptr [buffer]; // pointer to beginning of array
myloop: // start of loop
sub eax,64; // decrease counter by 64 bytes

mov dword ptr [edx+eax],0;
mov dword ptr [edx+eax+4],0;
mov dword ptr [edx+eax+8],0;
mov dword ptr [edx+eax+12],0;
mov dword ptr [edx+eax+16],0;
mov dword ptr [edx+eax+20],0;
mov dword ptr [edx+eax+24],0;
mov dword ptr [edx+eax+28],0;
mov dword ptr [edx+eax+32],0;
mov dword ptr [edx+eax+36],0;
mov dword ptr [edx+eax+40],0;
mov dword ptr [edx+eax+44],0;
mov dword ptr [edx+eax+48],0;
mov dword ptr [edx+eax+52],0;
mov dword ptr [edx+eax+56],0;
mov dword ptr [edx+eax+60],0;

cmp eax,0; // loop until counter reaches zero
jne myloop;


Is there any way to optimize this further?

Dun mes wit me!
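For readers who don't follow the assembly, here is a hypothetical C sketch of the same idea: an unrolled fill that writes sixteen 32-bit zeros (64 bytes) per iteration. The function name is mine, not from the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Unrolled zero fill, as in the assembly above.
   `n` must be a multiple of 64 and `buf` 4-byte aligned. */
static void zero_by_64(void *buf, size_t n)
{
    uint32_t *p = (uint32_t *)buf;
    assert(n % 64 == 0);
    for (size_t i = 0; i < n / 4; i += 16) {
        /* sixteen 4-byte stores = 64 bytes per iteration */
        p[i + 0] = 0;  p[i + 1] = 0;  p[i + 2] = 0;  p[i + 3] = 0;
        p[i + 4] = 0;  p[i + 5] = 0;  p[i + 6] = 0;  p[i + 7] = 0;
        p[i + 8] = 0;  p[i + 9] = 0;  p[i + 10] = 0; p[i + 11] = 0;
        p[i + 12] = 0; p[i + 13] = 0; p[i + 14] = 0; p[i + 15] = 0;
    }
}
```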
Yes. First, you must decide whether you want to load the area you are filling into the memory caches. Typically, when you are filling memory you do NOT also want to fill the caches, but what you are doing there most certainly fills the caches.

For when you desire/need cache pollution and the destination is 32-bit aligned and 512 bytes or less:

sub eax, eax           ; eax = 0 (the value to store)
mov edi, StartAddress  ; edi = destination pointer
mov ecx, BytesToFill/4 ; ecx = number of dwords to store
rep stosd              ; store eax at [edi], ecx times, advancing edi


In all other cases, use MMX instructions. The AMD optimisation guide (available online) has a nice memory-fill routine for MMX that is also near-optimal on Intel machines.

- Rockoon
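The cache-bypassing fills in the AMD guide use non-temporal stores. A rough sketch of the same idea using SSE2 intrinsics (this is my own illustration, not the guide's code; the function name is made up):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Zero `n` bytes with non-temporal (streaming) stores, which write
   to memory without pulling the destination into the data cache.
   `dst` must be 16-byte aligned and `n` a multiple of 64. */
static void stream_zero(void *dst, size_t n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    assert(((uintptr_t)dst % 16) == 0 && n % 64 == 0);
    for (size_t i = 0; i < n / 16; i += 4) {
        _mm_stream_si128(p + i,     zero);
        _mm_stream_si128(p + i + 1, zero);
        _mm_stream_si128(p + i + 2, zero);
        _mm_stream_si128(p + i + 3, zero);
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

Streaming stores only pay off for fills much larger than the cache; for small buffers, ordinary stores (or plain memset) are usually better.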
Sweet. So could you say what each command does in what you said? I haven't learnt all of ASM yet.
I can't find the Microsoft MASM documentation.


Dun mes wit me!
I tried compiling code that calls memset with g++ on Intel.

If you call memset() without optimisation enabled, it calls memset from the C library.

If you enable optimisation, it plants code similar to Anonymous Poster's - that's to say, it uses rep stosl.

So just call memset, if you enable optimisation it should be as fast as any. I fail to see how it can be any faster than a rep stosl instruction.

Mark
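To make the advice concrete, a minimal sketch of simply deferring to memset (the wrapper name `clear_buffer` is invented for illustration):

```c
#include <string.h>

/* Just call memset and let the compiler decide. With optimisation
   enabled, gcc/g++ typically inline small fixed-size calls as
   rep stos or wide stores; without it, the libc routine is called. */
static void clear_buffer(void *buf, size_t n)
{
    memset(buf, 0, n);
}
```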
Thanks. As far as cache pollution goes, what is that?
How does it work? Was I filling the caches as well as memory, and which part?
This way I can learn and avoid possible disaster.
Also, is cache pollution a bad thing?

Dun mes wit me!
When we first start thinking about it, unrolling loops so we don't have the extra increment, compare, and jump every iteration sounds like it'd be faster (fewer ops have to be done), but in most cases this doesn't work.

The CPU has two caches, a data cache and a program cache. As our program is run, the CPU loads chunks of it into the program cache and then runs the instructions from there. We get a speed increase from this because reading from cache is much faster than reading from RAM.

When we unroll a loop we may lower the number of ops the CPU has to perform, but we also greatly increase the size of the program code. With increased program code we can end up with a much higher number of cache misses, which is when the code needed isn't in cache.

Every time there is a cache miss the CPU has to flush the cache and refill it from RAM, which we know is quite slow, and while it's waiting for the cache to be refilled the CPU has no choice but to sit and do nothing. To make matters worse, the increase in program size means there is more code that must be loaded into cache every time we have a cache miss.

We know that we are going to have cache misses. A couple dozen cache misses per frame isn't a big deal, but it isn't hard to imagine how bad it would be if we caused a cache miss for every pixel we plot, every polygon we render, or anything else we do thousands of times per frame. Typically games are made of a couple of small pieces of code that are run thousands of times in a row each frame, and with a little care we can get them to fit in the cache, giving us some great performance.

Now back to unrolling loops... The only time we tend to gain anything from unrolling a loop is when we have a small loop (only a few lines of code and few iterations) that gets run a large number of times (such as a loop that does some simple op for each vertex in a tri and gets run for every tri in the scene). Large loops, or loops with a large number of iterations (such as copying memory byte by byte), almost always cause more penalties through cache misses and code size than they could ever hope to gain by unrolling.


And a bit off topic... Optimizing by unrolling loops is one of the last things you want to do. If you find you are spending a lot of time in a particular loop, switching to a better algorithm, cleaning up the code inside the loop, and minimizing the number of times the loop is called will have a much greater effect.


Drakonite

Shoot Pixels Not People
Maybe a better optimization would be to not call memset so much. The quickest operation is the one you don't perform.
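One common way to "not perform" the clear at all is lazy clearing with a frame stamp: instead of zeroing a large buffer every frame, tag each cell with the frame it was last written in and treat stale cells as empty. A hypothetical sketch (all names invented here):

```c
#include <stddef.h>
#include <stdint.h>

#define CELLS 1024

static uint32_t value[CELLS];   /* the payload */
static uint32_t stamp[CELLS];   /* frame each cell was last written */
static uint32_t frame = 1;      /* current frame; starts above the 0 stamps */

/* "Clears" the whole buffer in O(1): every cell's stamp is now stale. */
static void begin_frame(void) { frame++; }

static void cell_set(size_t i, uint32_t v)
{
    value[i] = v;
    stamp[i] = frame;
}

/* A cell not written this frame reads as zero. */
static uint32_t cell_get(size_t i)
{
    return stamp[i] == frame ? value[i] : 0;
}
```

The trade-off is one extra stamp per cell and a compare per read, in exchange for never touching the whole buffer at once.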
Very good, cheers! Clap clap clap.

!o)

This topic is closed to new replies.
