Archived

This topic is now archived and is closed to further replies.

d-loop

D-loop : New method for optimizing loop

Recommended Posts

davepermen    1047
I'm interested in how this could work together with the while() { switch() {} } construct to optimize loops.

An interesting read, for sure. Thanks for sharing.




If that's not the help you're after then you're going to have to explain the problem better than what you have. - joanusdmentia

davepermen.net

davepermen    1047
Duff's device, I believe it's called.





davepermen    1047
Yes, and for making it have fewer branches, too.

With the help of Duff's device, you can do something 8 times, then re-loop if needed, and do it 8 times again. You just have to jump to the right starting point so that, in the end, you process all the items you want.

That way, you have only 1/8th of a branch "per iteration".

Of course, you can unroll by 8, 16, or only 4.
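A minimal sketch of the construct described above (Duff's device): the copy body is unrolled 8 times inside a switch, and the switch jumps to the right starting case so that exactly n items are processed with only one loop branch per 8 items. The name duff_copy is illustrative, not from the article.

```c
/* Duff's device sketch: unroll by 8, enter the unrolled body at the
   offset that makes the total item count come out to exactly n. */
static void duff_copy(char *dst, const char *src, unsigned n)
{
    if (n == 0)
        return;
    unsigned passes = (n + 7) / 8;  /* number of 8-item passes */
    switch (n % 8) {                /* jump to the right starting point */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0); /* roughly 1/8th of a branch per item */
    }
}
```

On modern compilers this is often no faster than a plain loop, since they unroll and predict simple loops well on their own, a point made later in this thread.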


vaneger    100
So let's "d-loopify" this function:


#include <math.h>

typedef unsigned long ulong;

inline bool isPrime(ulong n)
{
    if (n < 2)
        return false;
    if (n == 2 || n == 5)   /* guard: 2 and 5 are themselves prime */
        return true;
    if ((n % 2 == 0) || (n % 5 == 0))
        return false;
    ulong sqr = (ulong)sqrt((double)n);
    for (ulong x = 3; x <= sqr; x += 2)
        if (n % x == 0)
            return false;
    return true;
}

d-loop    122
quote:
Original post by davepermen
I'm interested in how this could work together with the while() { switch() {} } construct to optimize loops.

An interesting read, for sure. Thanks for sharing.






Hi,

As I wrote in the PDF, using d-loop is not always simple or a good idea. With a switch condition (or nested ifs...) I doubt you will gain execution time in most cases, but you can make a d-loop version. It will look like this:

while(x)
{
    switch(n)
    {
    case a :
    case b :
    case ... :
    }
}


becomes (something like this):

while(x)
{
    while(a)
    {
    }
    switch(n)
    {
    case b :
    case ... :
    }
}

where a is the most frequently taken case.
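A concrete sketch of the rewrite outlined above, with the hot case hoisted out of the switch into its own tight loop so the common path pays no switch dispatch. The function and the choice of 'a' as the frequent case are invented for illustration, not code from the PDF.

```c
/* Count characters other than 'a' in a string where 'a' is assumed
   to be by far the most common character. */
static int count_non_a(const char *s)
{
    int count = 0;
    /* naive form: while (*s) { switch (*s) { case 'a': break;
                                              default: count++; } s++; } */
    while (*s) {
        while (*s == 'a')   /* hot case: tight loop, no switch dispatch */
            s++;
        if (*s == '\0')
            break;
        switch (*s) {       /* the rarer cases keep the switch */
        default:
            count++;
            break;
        }
        s++;
    }
    return count;
}
```

Whether this wins depends entirely on how skewed the case distribution is, which matches the caveat above that the transformation is not always a good idea.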



d-loop    122
quote:
Original post by vaneger
So let's "d-loopify" this function: (the isPrime code above)



Hi,

D-loop is useful when the inner loop is the bottleneck; in your case, the bottleneck is the modulo operation. A second point is that d-loop works when a sequential read of a table occurs; here there is no sequential read because there is no table.

davepermen    1047
quote:
Original post by d-loop
With switch condition (or nested if ...) I doubt in most cases you will gain in execution time, but you can make a d-loop version.


Uhm, yes, you do gain execution time: you can cut the while(something) tests down to 1/n as many, which can help.

Depending on the situation, you could possibly even combine the two approaches.

Then again, the newest processors have more and more logic for handling standard loops directly. That's one of the situations where Prescott really shines: about 0% misprediction in simple loops, so the loop test will not really hurt.





Lord Bart    226
Hello d-loop,

I tried your new_strstr on a Sun multiprocessor system using the Sun compiler.
I used a 10 MB allocated buffer, memset to 'A', and then copied the substring I would look for in toward the end.

I ran a loop of 1000 times for both the standard strstr and new_strstr and printed out the start and end time for each.
I found no speed-up at all.
new_strstr was about 0.45 seconds slower.

I don't have a compiler on the PC so I can't test it there, but from looking at the code it should be faster.
I am not sure why it isn't on the Sun; still looking at it.

I typed the code just as you had it in the PDF.

I am not sure how the standard strstr is implemented on a Sun.

Lord Bart

d-loop    122
quote:
Original post by Lord Bart
I tried your new_strstr on a Sun multiprocessor system... I found no speed-up at all; new_strstr was about 0.45 seconds slower.




For the new_strstr, did you use the code in listing 12? (It uses the optimized d-loop.)

Tramboi    122
I tried the count_char on my PS2 devkit, out of curiosity.

I compared the original version, the new version, and an asm version I coded with conditional moves to eliminate inner-loop branching.
I used a 10000-char array out of the cache.

I used hardware counters to count branch mispredictions.
I found that the gain from your optimisation was only due to better CPU issue behaviour once compiled (MWCW used conditional sets and was nearly as fast as my MIPS-coded routine, which was surprising but reassuring), not to a few avoided branch mispredictions...

The outer loop only issues one misprediction (the first time, of course)...
And the inner loop entirely depends on your data.

So I guess you thought you had invented something great a bit too fast.

Could you justify all this with VTune on your platform and hardware counters?

My guideline is: replace branching with conditional opcodes (sets and moves) and eliminate branches where you can, instead of trying to tweak the branch predictor!
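A minimal sketch of that guideline applied to a count_char-style routine: the inner-loop branch is replaced by adding the comparison result directly, which compilers typically lower to a conditional set rather than a jump. This is an illustration of the technique, not the routine from the PDF.

```c
/* Count occurrences of c in the first n bytes of s with no
   data-dependent branch in the loop body. */
static unsigned count_char(const char *s, unsigned n, char c)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++)
        count += (s[i] == c);  /* adds 0 or 1: comparison, not a jump */
    return count;
}
```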

Bertrand

d-loop    122
quote:
Original post by Tramboi
I used hardware counters to count branch mispredictions, and found that the gain from your optimisation was only due to better CPU issue behaviour once compiled, not to a few avoided branch mispredictions...


In fact, as I said in the PDF, it's not exactly the reduction in mispredicted branches; it's the reduced instruction count and the extra parallelism you get.

quote:
The outer loop only issues one misprediction (the first time, of course)...
And the inner loop entirely depends on your data.


That's right. This is why there is a complete mathematical analysis.

quote:
So I guess you thought you had invented something great a bit too fast.


You're guessing wrong. It's not the first time in 20 years that I've invented something, but it is the first time I've published something. It's not an important thing to me; what is important is whether it can help some devs, that's all.

quote:
Could you justify all this with VTune on your platform and hardware counters?


Already done. I must say I'm not sure I really understand your point of view.




[edited by - d-loop on March 23, 2004 11:15:31 AM]

Tramboi    122
quote:

That's right. This is why you have a complete mathematical analysis.



Agreed, but I'm not sure it is really relevant to what happens in a CPU with I$/D$ misses, out-of-order execution, pipeline stalls, pipeline flushes, and so on.

quote:

Could you justify all this with VTune on your platform and hardware counters?

Already done. I must say I'm not sure I really understand your point of view.



Basically, I'm trying to see whether this optimisation is worth it (if so, I could use it, of course) or whether it is just noise amid all the other optimisations the compiler and the programmer (masks, sentinels, precomputed tables) apply to these routines...
I tend to think you mixed too many things together here, but maybe I'm wrong.


Tramboi    122
By the way,

0d:
    cmp   ebx, edx
    jnz   14
    add   eax, 01h
14:
    movzx ebx, BYTE PTR [ecx+01h]
    add   ecx, 01h
    test  ecx, 01h    ; >> ebx?
    jnz   0d

seems wrong...

Lord Bart    226
Hello d-loop,

OK, I implemented the listing 12 code, but it is slower than the listing 4 code.

I made improvements that speed it up, which you might want to use.

I took listing 12 and made it use pointers only.


#include <string.h>

const char* new_strstr12(const char* str1, const char* str2)
{
    unsigned char char_mask[256];
    if (*str2 == '\0') return str1;
    if (*(str2+1) == '\0') return strchr(str1, *str2);
    const unsigned char *st2 = (const unsigned char*)str2 + 1, *st3;
    memset(char_mask, 0, sizeof(char_mask));
    char_mask[0] = 1;
    char_mask[(unsigned char)*str2] = 1;
    while (char_mask[(unsigned char)*(++str1)] == 0) // inc pointer first, then use the value to index char_mask -- big speed-up here
    {                                                // (note: as written, a match starting at str1[0] is never checked)
    }
    while (*str1 != 0) // we hit a char_mask of 1, which is either NUL or the first byte of str2
    {
        if (*(++str1) == (char)*st2) // check the 2nd bytes
        {
            if (*(st2+1) == 0)       // str2 is exactly 2 chars long: full match already
                return str1 - 1;
            st3 = (const unsigned char*)str1; // st3 points at the 2nd byte of the candidate substring in str1
            while (*(++st3) == *(++st2))
            {
                // this loop continues while *st3 == *st2,
                // so check the next byte of st2 for NUL
                if (*(st2+1) == 0)
                {
                    return str1 - 1; // str1 points at the 2nd letter of the match, so back up 1 to its start
                }
            }
            st2 = (const unsigned char*)(str2+1); // not our substring; reset st2 to the 2nd byte of str2
        }
        while (char_mask[(unsigned char)*(++str1)] == 0) // inc pointer first, then check char_mask
        {
        }
    }
    return NULL; // substring not found
}


You don't need a car variable and to keep assigning to it all the time.
Use the power of pointers: pre-increment (*++p) should be faster.

Try this; I bet you will see at least a 10% speed-up over the original listing 12.

By the way, I did the same pointer trick in listing 4 and got a speed-up on that one too. And in my tests, listing 4 beat listing 12 with my changes.

Lord Bart

[edited by - lord bart on March 23, 2004 12:10:47 PM]

d-loop    122
quote:
Original post by Tramboi
By the way,

0d:
    cmp   ebx, edx
    jnz   14
    add   eax, 01h
14:
    movzx ebx, BYTE PTR [ecx+01h]
    add   ecx, 01h
    test  ecx, 01h    ; >> ebx?
    jnz   0d

seems wrong...



Strange; I will check that. It must be a typo.

d-loop    122
quote:
Original post by Lord Bart
I implemented the listing 12 code, but it is slower than the listing 4 code. I made improvements that speed it up, which you might want to use. (new_strstr12 code above)


That's great.

I will try it tonight. Which compiler did you use? And which processor?

d-loop    122
quote:
Original post by Tramboi
Basically I'm trying to see if this optimisation is worth it, or if it is just noise amid all the other optimisations the compiler and the programmer apply... I tend to think you mixed too many things together here, but maybe I'm wrong.




I understand.

Apart from the use of the mask technique, which is not really a good optimisation in most cases, you should be satisfied using d-loop when you know the size of the table and don't want to use asm (for example, SSE).







d-loop    122
quote:
Original post by Lord Bart
I compiled this on a Sun 420R server with 4 processors and 4 GB of memory running Solaris 8, using the Sun Workshop 6.1 ANSI C++ compiler.

Lord Bart


I just tried your version. It's slower than the function I gave (by 20%). In fact, I had to rework it for Intel's compiler. It doesn't like things like:

while(char_mask[*(++str1)]==0)

but prefers this:

while(char_mask[*(++str1)]==0)
{
}

Well, not every compiler likes d-loop code. For example, VC6 doesn't like it (I don't know about VC7.1, but I heard that MS did a good job optimising ALU instructions).




Lord Bart    226
quote:
Original post by d-loop
I just tried your version. It's slower than the function I gave (by 20%). In fact, I had to rework it for Intel's compiler; it doesn't like an empty-bodied while without {} braces. Not every compiler likes d-loop code.




Yeah, I forgot to put the {} brackets on the while loop, sorry.

Strange that it slows down on Intel.

Incrementing the pointer and dereferencing should be faster than assigning the dereferenced value to a char and then incrementing the pointer.

Your first loop:

while(char_mask[car]==0)
{
    car = (unsigned char)*str1++; // deref, assign, inc
}

My first loop:

while(char_mask[*(++str1)]==0) // inc, deref, no assign
{
}

And st3[i] should work out to *(st3+i), which, if you replace it with *(++st3) for your check, should be faster, since you increment the pointer instead of doing a pointer addition.

Yours:

while((unsigned char)st3[i]==(unsigned char)str2[i])
{
    i++; // extra increment of the index, needed for the dereference above
}

Mine:

while(*(++st3)==*(++st2))
{
    // no need for i; increment the pointers instead and dereference
}

My code keeps four things in use in the main part: str1, str2, st2, st3, all of which are pointers and are most likely kept in registers.

Your code has five: car, str1, str2, st3, and i in the main part, but they should also be kept in registers... maybe; I'm not sure on an Intel box.

But I believe compiler and processor differences matter most here. The Sun processor has lots of registers; I'm not sure how many pipelines.

Maybe I'll track down the gcc compiler on my box and compile it with gcc, but I won't get to it until Thursday.

I also need to figure out a way to see the asm code from the Sun compiler.

Anyway, it sped things up on the Sun but slowed things down on Intel.

Well, anyway, nice little paper by the way.

Lord Bart

[edited by - lord bart on March 23, 2004 6:22:02 PM]

[edited by - lord bart on March 23, 2004 6:23:22 PM]

psamty10    148
OMFG... dost thou not believe that those kinds of micro-optimizations are best left in the hands of the compiler writer?
The code is a little too unreadable for me, I'm afraid.
