D-loop: New method for optimizing loops

24 comments, last by d-loop 20 years ago
quote:Original post by Tramboi
quote:
That's right. This is why you have a complete mathematical analysis.


Agreed, but I'm not sure it is really relevant to what happens in a CPU, with the I$/D$ misses, out-of-order execution, pipeline stalls, pipeline flushes and so on.

quote:
Could you justify all this with VTune on your platform and hardware counters?

Already done. I must say I'm not sure I really understand your point of view.


Basically I'm trying to see whether this optimisation is worth it (if so, I could use it, of course) or whether it is just noise in the signal of all the other optimisations that the compiler and the programmer (masks, sentinels, precomputed tables) apply to the routines...
I tend to think you mixed too many things together here, but maybe I'm wrong.



I understand.

Apart from the mask technique, which is not really a good optimisation in most cases, you should be satisfied using D-loop when you know the size of the table and don't want to use asm (for example SSE).







I compiled this on a Sun 420R server with 4 processors and 4 GB of memory running Solaris 8, using the Sun Workshop 6.1 ANSI C++ compiler.

Lord Bart
quote:Original post by Lord Bart
I compiled this on a Sun 420R server with 4 processors and 4 GB of memory running Solaris 8, using the Sun Workshop 6.1 ANSI C++ compiler.

Lord Bart


I just tried your version. It's slower than the function I gave (by about 20%). In fact I had to rework it for Intel's compiler. It doesn't like things like:

while(char_mask[*(++str1)]==0)

but it is fine with:

while(char_mask[*(++str1)]==0)
{
}

Well, not all compilers like D-loop code. For example, VC6 doesn't like it (I don't know about VC7.1, but I heard that MS did a good job of optimising ALU instructions).
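For anyone following along, here is a minimal sketch of the kind of mask scan being discussed. It is only an illustration under assumptions, not the code from the paper: `build_mask` and `scan_next_marked` are hypothetical names, and marking the terminating '\0' in the table as a sentinel is my assumption about how the loop is meant to stop.

```c
#include <string.h>

/* Hypothetical helper: mark every byte of `set` in a 256-entry table.
   Entry 0 is marked as well, so a scan always stops at the end of the string. */
static void build_mask(unsigned char mask[256], const char *set)
{
    memset(mask, 0, 256);
    mask[0] = 1;                          /* sentinel: terminating '\0' */
    for (; *set != '\0'; ++set)
        mask[(unsigned char)*set] = 1;
}

/* Pre-increment scan in the style quoted above, with the empty body
   written out explicitly. Testing starts at s[1], so the caller must
   guarantee that s[0] is not the terminator. Returns a pointer to the
   first marked byte after s[0] (possibly the terminating '\0'). */
static const unsigned char *scan_next_marked(const unsigned char *s,
                                             const unsigned char mask[256])
{
    while (mask[*(++s)] == 0)
    {
        /* empty on purpose: all the work is in the condition */
    }
    return s;
}
```

The sentinel entry is what lets the loop get away with a single test per byte: there is no separate end-of-string check in the condition.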




quote:Original post by d-loop
quote:Original post by Lord Bart
I compiled this on a Sun 420R server with 4 processors and 4 GB of memory running Solaris 8, using the Sun Workshop 6.1 ANSI C++ compiler.

Lord Bart


I just tried your version. It's slower than the function I gave (by about 20%). In fact I had to rework it for Intel's compiler. It doesn't like things like:

while(char_mask[*(++str1)]==0)

but it is fine with:

while(char_mask[*(++str1)]==0)
{
}

Well, not all compilers like D-loop code. For example, VC6 doesn't like it (I don't know about VC7.1, but I heard that MS did a good job of optimising ALU instructions).



Yep, I forgot to put the {} braces on the while loop, sorry.

Strange that it slows down on Intel.

Incrementing the pointer and dereferencing it should be faster than assigning the dereferenced value to a char and then incrementing the pointer.

Your first loop:
while(char_mask[car]==0)
{
car = *(unsigned char*)str1++; // deref, assign, inc
}

My first loop:
while(char_mask[*(++str1)]==0) // inc, deref, no assign
{
}

And st3[i] should work out to *(st3+i), which, if you replace it with *(++st3) for your check, should be faster, since you increment the pointer instead of doing a pointer addition.

Yours:
while((unsigned char)st3[i]==(unsigned char)str2[i])
{
 i++; // extra increment of the variable needed for the dereference above
}

Mine:
while(*(++st3)==*(++st2))
{
 // no need for i; increment the pointers instead and dereference
}

My code keeps four things in use in the main part: str1, str2, st2 and st3, all of which are pointers and are most likely kept in registers.

Your code has five in use in the main part: car, str1, str2, st3 and i. They should also be kept in registers, maybe; I'm not sure on an Intel box.

But I believe that compiler and processor differences matter most here. The Sun processor has lots of registers; I'm not sure how many pipelines.

Maybe I'll track down the gcc compiler on my box and compile it with gcc. But I won't get to it until Thursday.

I also need to figure out a way to see the asm code from the Sun compiler.

Anyway, it sped things up on the Sun :), but slowed things down on Intel. :(

Well, anyway, nice little paper, by the way. :)

Lord Bart :)

[edited by - lord bart on March 23, 2004 6:22:02 PM]

[edited by - lord bart on March 23, 2004 6:23:22 PM]
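To make the indexed-versus-pointer comparison above concrete, here is a small self-contained sketch of the two loop shapes. The function names are hypothetical, and the explicit '\0' check is my addition so the sketch is safe to run on its own; it is not part of either poster's loop. Which form is faster depends on the compiler and CPU, and many optimizers will turn the indexed form into the pointer form anyway, so treat this as something to benchmark rather than a result.

```c
#include <stddef.h>

/* Indexed form: the index i is an extra live variable in the loop. */
static size_t match_len_indexed(const unsigned char *a, const unsigned char *b)
{
    size_t i = 0;
    while (a[i] != '\0' && a[i] == b[i])
    {
        i++;
    }
    return i;
}

/* Pointer form: both pointers advance, no separate index is kept. */
static size_t match_len_pointer(const unsigned char *a, const unsigned char *b)
{
    const unsigned char *start = a;
    while (*a != '\0' && *a == *b)
    {
        ++a;
        ++b;
    }
    return (size_t)(a - start);
}
```

Both functions return the length of the common prefix of two null-terminated strings, so they can be swapped freely in a timing harness to see what a given compiler makes of each.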
OMFG... dost thou not believe that those kinds of micro-optimizations are best left in the hands of the compiler writer?
The code is a little too unreadable for me, I'm afraid.
quote:Original post by Tramboi
By the way,

0d :
cmp ebx,edx
jnz 14 :
add eax,01h
14 :
movzx ebx,BYTE PTR [ecx+01h]
add ecx,01h
test ecx,01h >> ebx?
jnz 0d :

seems wrong...


I've checked, and

test ecx,01h

should be replaced by

test ebx,ebx

Thx Tramboi.
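For readers who don't read x86, here is my reading of the corrected fragment as C. This is a sketch under assumptions: I take eax to be a match counter, ebx the current byte, ecx the string pointer and edx the byte being searched for, and `count_byte` is a hypothetical name. The guard for an empty string is my addition, since the fragment does not show how the loop is entered.

```c
/* Counts how many bytes of the null-terminated string s equal target. */
static unsigned count_byte(const unsigned char *s, unsigned char target)
{
    unsigned count = 0;          /* eax */
    unsigned cur = *s;           /* ebx, assumed loaded before the loop */

    if (cur == 0)                /* guard added for the sketch only */
        return 0;

    do {
        if (cur == target)       /* cmp ebx,edx / jnz 14 */
            count += 1;          /* add eax,01h */
        cur = *++s;              /* movzx ebx,[ecx+01h] / add ecx,01h */
    } while (cur != 0);          /* test ebx,ebx / jnz 0d */

    return count;
}
```

Seen this way, the fix makes sense: `test ebx,ebx` checks whether the byte just loaded is the terminator, whereas `test ecx,01h` would test the low bit of the pointer, so the loop would stop depending on the address rather than on the data.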

