# loop break

This topic is 1356 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Note that these are just examples; I don't know what you or your co-workers would find more readable, since the post-decrement loop is very readable to me.

It's a bit deeper than that. The coding standards are legal standards, and go off to a lab to get certed. You can break a rule, but it means getting in both reviewers, and  senior manager  booked for a meeting(costly, and pisses people off, although there's biscuits), then a document is written justifying the deviation from standard (costly and boring, massive waste of time and resource), then the certification lab want to check it but charge more (maybe £6k). Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick. They canned 12 guys on NYE no warning, another 4 last week, all for bad business practice, they were probably adequate coders, just didn't understand the industry.

There are times when you get pulled in to make something uber efficient, like context switching on a blistering fast fpga, but mostly, write it like a muppet can read it and you keep on getting paid.

I do like that loop though :)

##### Share on other sites

...Isn't that the point, that counting down doesn't fetch on every iteration?

No, that's not the point.  This was never about counting down versus counting up, you're picking a completely different fight here, one that nobody else has even gone near talking about, and you're treating everyone else as if they're having the same fight as you whereas they're not.  This is about comparing the cutesy construct given in the OP with a more idiomatic for loop.

I.e compare this:

for (int i = 10; i--;)

To this:

for (int i = 10; i; i--)

That's what the point is.

##### Share on other sites

This is really just an obfuscated version of the idiom:

while (i--)
{
/* do something */
}

Where i is a preexisting variable.

Not really complete unless you show the  'i' variable getting its initial value.

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the codes obvious operation and add to the potential for a error).

##### Share on other sites

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

Counting up wouldn't have fetched each iteration if you'd cached he volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

I posted the ASM. You can see exactly what happened.

and it's not necessarily the same thing that the optimizer would've actually done in real code. Your written an example that does exactly what you assume the compiler will do, but can't possibly show us what the compiler would actually do in a real situation.
You can't use a forced example to demonstrate the optimizer's actual behavior! ;-)

In other words, allow the compiler to see that it's an unchanging number and inline the constant limit (I've tried that).

You can use a global, or a parameter from another function, or get the value from another translation unit, etc... That will stop the compiler from hard doing the loop limit in he same way that it does in real code -- rather than forcing its hand with code that gives it zero leeway.

...In other words, just don't use signed values, so that I can feel good about counting up all the time? I fail to see the point of actually modifying the use case to suit the algorithm, rather than the other way around.

unsigneds work for both up and down. If you want to compare those two loops fairly then you've got to make them as equivalent as possible -- if one uses "not zero", the other should use "not limit", etc. Neither use negative values so signed/unsigned is nuetral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).
IMO it's also more common to write !=limit and !=0 rather than <limit and >0...

I don't mean to be abrasive, but did you read the assembly output? It clearly moves zero to each of the array elements, exactly as the C code says (array is volatile qualified). It's guaranteed to have not been optimized out, because it says right in the resulting code that it is doing it.

If you're testing whether the compiler can transform between up/down, then the use of volatile on the output array completely voids your test -- volatile writes can't be reordered, so you've told the compiler explicitly not to modify your iteration order!
So to test this behavior of the optimizer, you need to remove volatile, and then add an alternative method of stopping the code from being optimized away entirely.

##### Share on other sites

There is also that thing that Ive seen more than a little where in the test clause the constant is put first

for(i=0; 10 > i; i++)

{

}

or more commonly

if (10 > varx)  {  }

or

if (Mode1 == varx)   {     }

which usually makes me double check rather more than the classic  i<10  style

##### Share on other sites

This is really just an obfuscated version of the idiom:

while (i--)
{
/* do something */
}

Where i is a preexisting variable.

Not really complete unless you show the  'i' variable getting its initial value.

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the codes obvious operation and add to the potential for a error).

Yes, the point is a lot of the time the bounds you are interested in are already in a local variable (or a function parameter). Not that I would do this unless it was clear from context what I was doing. But in my opinion, it's much better than using a for loop with its iteration function (i--) in its termination condition.

##### Share on other sites

Ectara, you don't cache the limit variable into a local in the up case, so it has to be fetched every iteration, which doesn't happen in your down case.

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}

and

for(int i=0;i<10;i++){}

however by including violatile, you effectively shut off the optimizer from doing it's thing.

yea, there are edge cases where this would not be true(and you were trying to show that edge case).  but that's just all it is, edge cases where yes you are correct, a count down version would theoretically generate tinier code.  but I think if you get to the point where your profiling indicates that this one extra instruction is your bottlenexk, you likely have too weak of hardware for w/e your trying to do.

Edited by slicer4ever

##### Share on other sites

I.e compare this:

for (int i = 10; i--;)

To this:

for (int i = 10; i; i--)

I have no idea where you're getting that, as the second version never appeared in the thread. Furthermore, that's not even the same loop! Think about it: i is 10 at the start, passes the non-zero check, then it is 10 as it goes into the loop body...

I have no idea what you're on about.

Counting up wouldn't have fetched each iteration if you'd cached he volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

Clearly, your definition of "local" and mine don't align. To me, a local variable is one of automatic storage duration defined with in a function or deeper scope. Volatile doesn't change that.

Moreover, it seems like you're arguing my method, rather than my result. I don't know why it isn't obvious that using two temporary variables (that the compiler can't optimize away to a constant) instead of one increases register pressure, and I need to prove it with numerous examples.

Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick.

Someday, people won't always assume that I'm not employed as a software engineer right now, and tell me the dictionary definition of obvious terms.

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

rather than forcing its hand with code that gives it zero leeway.

Because the point was that if it isn't put into a register, it can result in larger code. You are making all suggestions that would either optimize it to a constant or cache it in a register, which would still be no faster than counting down; at best, it'd break even. My argument from the beginning is reducing register pressure and comparing against zero, while your suggestions are allowing the compiler to assume there's no register pressure and use all registers at once.

Neither use negative values so signed/unsigned is nuetral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).

Post results, I'm curious.

IMO it's also more common to write !=limit and !=0 rather than 0...

In my and my co-worker's opinions, it isn't.

If you're testing whether the compiler can transform between up/down

I'm not.

I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}

and

for(int i=0;i<10;i++){}

I posted three examples that disagreed with the consensus, and I got refutation about none of them in particular, so I chose the one that was most likely to make the biggest difference.

but that's just all it is, edge cases where yes you are correct, a count down version would theoretically generate tinier code.

I hardly consider any of my original three examples an edge case. I have a lot of loops that does a lot of permutations on every iteration, and the counter spills to RAM.

but I think if you get to the point where your profiling indicates that this one extra instruction is your bottlenexk

No, I never said that. People said they would be the same, and I posted an example where they weren't, with proof. This and many other arguments in this thread is a complete and utter strawman argument.

Really, I should have known better than to come into a coding horror thread, and post a use case for the idiom that people are dumping on.

##### Share on other sites

Moreover, it seems like you're arguing my method, rather than my result.

Exactly. I'm not attacking you or arguing the results. I'm honestly interested in what the results are. No need to get super defensive.
...I was just pointing out that by using volatile, you've forced the compiler's hand, so you may as well have just hand-written the assembly; it's not a test of the compiler, so seeing that the method is artificial, the results are also artificially achieved.

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

Not quite. I was suggesting that your example forces the compiler to only produce very specific assembly -- that your use of volatile has explicitly told the compiler exactly how it's allowed to construct the loops, and that it's not allowed to make any changes/optimizations.
Real production code will not use the volatile keyword like this (in 99.99% of cases) so it's not at all natural code. Write it naturally and perhaps the compiler can convert a forwards loop into a superior, optimal backwards loop?

If you're testing whether the compiler can transform between up/down

I'm not.

Oh, you threw me off with this where you tried to prove that the up/down versions produced different code

You do however realize that an optimizing compiler will produce exactly the same code whether you use the obfuscated version or not?

[up.c ... down.c ...] They aren't identical, and the counting down version has one fewer instruction (compiled with gcc -O3 -S). Yes, cleaner code is better. However, it is all too easy to say that they compile to the same instructions.

##### Share on other sites

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke. Sadly, all that matters is whether or not some totally irrelevant (and likely not even competent) person gets an itch, because when that happens you simply don't get the "certified" stamp which is soooooooooooo important. But again, sadly, it's really important because the next incompetent idiot will only buy a product that is certified, no matter whether it's good or bad, and he simply won't buy yours if you don't have the stamp...

##### Share on other sites

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke.

mostly our stuff is safety critical, it's a legal matter, as in you go to jail or pulled up on corporate manslaughter if mistakes are made. Reviews are logged with your name at official bodies, every last test result is recorded and signed in triplicate. Every line of code is scrutinised. The restrictions on what you can do with the languages are huge. It can be miserable at times.

Edited by catslap

##### Share on other sites

ah, sorry Samoth, I misread that as if you weren't sure how quality standards are met. Now I see you do! It's been a long day

##### Share on other sites

In addition to making sure that it doesn't all get optimized out, the volatile limit simulates register pressure; it is entirely possible for that counter to spill to RAM.

It's actually really cute what happens when you remove the volatile from each case:
// down.c
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
movl	10000, %eax ## imm = 0x2710 xorps %xmm0, %xmm0 .align 4, 0x90 LBB0_1: movups %xmm0, -10032(%rbp,%rax) addq-16, %rax
jne	LBB0_1
// up.c
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
leaq	-10016(%rbp), %rdi
movl	$10000, %esi ## imm = 0x2710 callq ___bzero It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop. #### Share this post ##### Link to post ##### Share on other sites volatile simulates register pressure [...] possible for the counter to spill to RAM There is a lot more to volatile than just spilling the contents to RAM. The compiler is told to assume that the contents of the memory location can be changed externally, at any time. Which means that it cannot reorder reads and writes, and cannot coalesce fetches. It must fetch the value exactly when it is used, not any earlier or later, and it will completely stall the processor's pipeline doing so. If the loop counter has to be written to RAM because of register pressure, the compiler hides the latency by fetching the value early, so it happens while other instructions execute (if the loop is not totally trivial of course, but a trivial loop won't have high register pressure). #### Share this post ##### Link to post ##### Share on other sites In addition to making sure that it doesn't all get optimized out, the volatile limit simulates register pressure; it is entirely possible for that counter to spill to RAM. It's actually really cute what happens when you remove the volatile from each case: // down.c .cfi_offset %rbx, -24 movq ___stack_chk_guard@GOTPCREL(%rip), %rbx movq (%rbx), %rax movq %rax, -16(%rbp) movl$10000, %eax            ## imm = 0x2710
xorps	%xmm0, %xmm0
.align	4, 0x90
LBB0_1:
movups	%xmm0, -10032(%rbp,%rax)
addq	$-16, %rax jne LBB0_1 // up.c .cfi_offset %rbx, -24 movq ___stack_chk_guard@GOTPCREL(%rip), %rbx movq (%rbx), %rax movq %rax, -16(%rbp) leaq -10016(%rbp), %rdi movl$10000, %esi            ## imm = 0x2710
callq	___bzero
It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop.

Amazing what happens when you actually compare like-with-like, isn't it?

Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.

##### Share on other sites

Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.

The zero case is kind of degenerate, anyways. Assuming we assign the index variable to the matching array slot, and print a random value from the array after the loop to prevent it being optimised away, the results are even stranger...

At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164.

Now, my SSE isn't up to scratch enough to tell which of these is more efficient, but it's fairly obvious which one is simpler

// up.c
#include <stdio.h>
#include <stdlib.h>

int main() {
char buf[10000];
const unsigned int limit = 10000;

for(unsigned int i = 0; i != limit; ++i){
buf[i] = i;
}

printf("%d\n", buf[rand() % 10000]);
}
// down.c
#include <stdio.h>
#include <stdlib.h>

int main() {
char buf[10000];
const unsigned int limit = 10000;

for(unsigned int i = limit; i--;){
buf[i] = i;
}

printf("%d\n", buf[rand() % 10000]);
}
// up.s
.section	__TEXT,__text,regular,pure_instructions
.section	__TEXT,__const
.align	4
LCPI0_0:
.byte	0                       ## 0x0
.byte	1                       ## 0x1
.byte	2                       ## 0x2
.byte	3                       ## 0x3
.byte	4                       ## 0x4
.byte	5                       ## 0x5
.byte	6                       ## 0x6
.byte	7                       ## 0x7
.byte	8                       ## 0x8
.byte	9                       ## 0x9
.byte	10                      ## 0xa
.byte	11                      ## 0xb
.byte	12                      ## 0xc
.byte	13                      ## 0xd
.byte	14                      ## 0xe
.byte	15                      ## 0xf
LCPI0_1:
.byte	16                      ## 0x10
.byte	17                      ## 0x11
.byte	18                      ## 0x12
.byte	19                      ## 0x13
.byte	20                      ## 0x14
.byte	21                      ## 0x15
.byte	22                      ## 0x16
.byte	23                      ## 0x17
.byte	24                      ## 0x18
.byte	25                      ## 0x19
.byte	26                      ## 0x1a
.byte	27                      ## 0x1b
.byte	28                      ## 0x1c
.byte	29                      ## 0x1d
.byte	30                      ## 0x1e
.byte	31                      ## 0x1f
.section	__TEXT,__text,regular,pure_instructions
.globl	_main
.align	4, 0x90
_main:                                  ## @main
.cfi_startproc
## BB#0:                                ## %vector.ph
pushq	%rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq	%rbx
subq	10008, %rsp ## imm = 0x2718 Ltmp6: .cfi_offset %rbx, -24 movq ___stack_chk_guard@GOTPCREL(%rip), %rbx movq (%rbx), %rax movq %rax, -16(%rbp) xorl %eax, %eax movdqa LCPI0_0(%rip), %xmm0 movdqa LCPI0_1(%rip), %xmm1 .align 4, 0x90 LBB0_1: ## %vector.body ## =>This Inner Loop Header: Depth=1 movd %eax, %xmm2 punpcklbw %xmm2, %xmm2 ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7] punpcklbw %xmm2, %xmm2 ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7] pshufd0, %xmm2, %xmm2        ## xmm2 = xmm2[0,0,0,0]
movdqa	%xmm2, %xmm3
movdqa	%xmm3, -10016(%rbp,%rax)
movdqa	%xmm2, -10000(%rbp,%rax)
addq	$32, %rax cmpq$9984, %rax             ## imm = 0x2700
jne	LBB0_1
xorl	%eax, %eax
.align	4, 0x90
LBB0_3:                                 ## %scalar.ph
## =>This Inner Loop Header: Depth=1
movb	%al, -32(%rbp,%rax)
incq	%rax
cmpq	$16, %rax jne LBB0_3 ## BB#4: callq _rand cltq imulq$1759218605, %rax, %rcx ## imm = 0x68DB8BAD
movq	%rcx, %rdx
shrq	$63, %rdx sarq$44, %rcx
imull	$10000, %ecx, %ecx ## imm = 0x2710 subl %ecx, %eax cltq movsbl -10016(%rbp,%rax), %esi leaq L_.str(%rip), %rdi xorl %eax, %eax callq _printf movq (%rbx), %rax cmpq -16(%rbp), %rax jne LBB0_6 ## BB#5: xorl %eax, %eax addq$10008, %rsp            ## imm = 0x2718
popq	%rbx
popq	%rbp
ret
LBB0_6:
callq	___stack_chk_fail
.cfi_endproc

.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
.asciz	"%d\n"

.subsections_via_symbols
// down.s
.section	__TEXT,__text,regular,pure_instructions
.section	__TEXT,__const
.align	4
LCPI0_0:
LCPI0_1:
LCPI0_2:
LCPI0_3:
LCPI0_4:
LCPI0_5:
LCPI0_6:
LCPI0_7:
.byte	15                      ## 0xf
.byte	14                      ## 0xe
.byte	13                      ## 0xd
.byte	12                      ## 0xc
.byte	11                      ## 0xb
.byte	10                      ## 0xa
.byte	9                       ## 0x9
.byte	8                       ## 0x8
.byte	7                       ## 0x7
.byte	6                       ## 0x6
.byte	5                       ## 0x5
.byte	4                       ## 0x4
.byte	3                       ## 0x3
.byte	2                       ## 0x2
.byte	1                       ## 0x1
.byte	0                       ## 0x0
.section	__TEXT,__text,regular,pure_instructions
.globl	_main
.align	4, 0x90
_main:                                  ## @main
.cfi_startproc
## BB#0:                                ## %vector.ph
pushq	%rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq	%rbx
subq	$10136, %rsp ## imm = 0x2798 Ltmp6: .cfi_offset %rbx, -24 movq ___stack_chk_guard@GOTPCREL(%rip), %rbx movq (%rbx), %rax movq %rax, -16(%rbp) movl$10000, %eax            ## imm = 0x2710
movq	$-1, %rcx movd %rcx, %xmm0 pslldq$8, %xmm0
movdqa	LCPI0_1(%rip), %xmm9
movdqa	LCPI0_2(%rip), %xmm10
movdqa	LCPI0_3(%rip), %xmm11
movdqa	LCPI0_4(%rip), %xmm12
movdqa	LCPI0_5(%rip), %xmm13
movdqa	LCPI0_6(%rip), %xmm14
movdqa	LCPI0_7(%rip), %xmm15
.align	4, 0x90
LBB0_1:                                 ## %vector.body
## =>This Inner Loop Header: Depth=1
leaq	-1(%rax), %rcx
movd	%rcx, %xmm2
movlhps	%xmm2, %xmm2            ## xmm2 = xmm2[0,0]
movaps	%xmm2, %xmm3
movaps	%xmm2, %xmm4
movaps	%xmm2, %xmm5
movaps	%xmm2, %xmm6
movaps	%xmm2, %xmm7
movaps	%xmm2, %xmm1
movaps	%xmm2, %xmm8
movdqa	%xmm2, -10144(%rbp)
movdqa	%xmm8, -10080(%rbp)
punpcklbw	%xmm2, %xmm8    ## xmm8 = xmm8[0],xmm2[0],xmm8[1],xmm2[1],xmm8[2],xmm2[2],xmm8[3],xmm2[3],xmm8[4],xmm2[4],xmm8[5],xmm2[5],xmm8[6],xmm2[6],xmm8[7],xmm2[7]
movdqa	%xmm1, -10112(%rbp)
movdqa	%xmm7, -10048(%rbp)
punpcklbw	%xmm1, %xmm7    ## xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1],xmm7[2],xmm1[2],xmm7[3],xmm1[3],xmm7[4],xmm1[4],xmm7[5],xmm1[5],xmm7[6],xmm1[6],xmm7[7],xmm1[7]
punpcklbw	%xmm8, %xmm7    ## xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
movdqa	%xmm6, -10128(%rbp)
movdqa	%xmm5, -10064(%rbp)
movdqa	%xmm4, -10096(%rbp)
movdqa	%xmm3, -10032(%rbp)
punpcklbw	%xmm6, %xmm5    ## xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1],xmm5[2],xmm6[2],xmm5[3],xmm6[3],xmm5[4],xmm6[4],xmm5[5],xmm6[5],xmm5[6],xmm6[6],xmm5[7],xmm6[7]
punpcklbw	%xmm4, %xmm3    ## xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
punpcklbw	%xmm5, %xmm3    ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
punpcklbw	%xmm7, %xmm3    ## xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3],xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7]
movd	-10136(%rbp), %xmm1
movd	-10072(%rbp), %xmm2
punpcklbw	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
movd	-10104(%rbp), %xmm1
movd	-10040(%rbp), %xmm4
punpcklbw	%xmm1, %xmm4    ## xmm4 = xmm4[0],xmm1[0],xmm4[1],xmm1[1],xmm4[2],xmm1[2],xmm4[3],xmm1[3],xmm4[4],xmm1[4],xmm4[5],xmm1[5],xmm4[6],xmm1[6],xmm4[7],xmm1[7]
punpcklbw	%xmm2, %xmm4    ## xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1],xmm4[2],xmm2[2],xmm4[3],xmm2[3],xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
movd	-10120(%rbp), %xmm1
movd	-10056(%rbp), %xmm2
punpcklbw	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
movd	-10088(%rbp), %xmm1
movd	-10024(%rbp), %xmm5
punpcklbw	%xmm1, %xmm5    ## xmm5 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3],xmm5[4],xmm1[4],xmm5[5],xmm1[5],xmm5[6],xmm1[6],xmm5[7],xmm1[7]
punpcklbw	%xmm2, %xmm5    ## xmm5 = xmm5[0],xmm2[0],xmm5[1],xmm2[1],xmm5[2],xmm2[2],xmm5[3],xmm2[3],xmm5[4],xmm2[4],xmm5[5],xmm2[5],xmm5[6],xmm2[6],xmm5[7],xmm2[7]
punpcklbw	%xmm4, %xmm5    ## xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
punpcklbw	%xmm5, %xmm3    ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
pshufb	%xmm15, %xmm3
movdqu	%xmm3, -10032(%rbp,%rax)
addq	$-16, %rax jne LBB0_1 ## BB#2: ## %middle.block callq _rand cltq imulq$1759218605, %rax, %rcx ## imm = 0x68DB8BAD
movq	%rcx, %rdx
shrq	$63, %rdx sarq$44, %rcx
imull	$10000, %ecx, %ecx ## imm = 0x2710 subl %ecx, %eax cltq movsbl -10016(%rbp,%rax), %esi leaq L_.str(%rip), %rdi xorl %eax, %eax callq _printf movq (%rbx), %rax cmpq -16(%rbp), %rax jne LBB0_4 ## BB#3: ## %middle.block xorl %eax, %eax addq$10136, %rsp            ## imm = 0x2798
popq	%rbx
popq	%rbp
ret
LBB0_4:                                 ## %middle.block
callq	___stack_chk_fail
.cfi_endproc

.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
.asciz	"%d\n"

.subsections_via_symbols

##### Share on other sites

At this stage I feel a new rule of the internet coming on.

"There is no coding standard so reprehensible, so obviously wrong-looking, that somebody somewhere won't try to defend it."  (See also comma-first.)

##### Share on other sites

It's actually really cute what happens when you remove the volatile from each case:

Heck, may as well write

;

for both, and watch the compiler optimize away.

##### Share on other sites

Amazing what happens when you actually compare like-with-like, isn't it?

That's not what that is comparing. That's pretty much saying x*0 vs. x - x, where the observer can clearly see that the calculation is useless, and thus do anything they want. If I turn the optimization higher, without volatile, it emits an empty program.

Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.

How is a case where the compiler is permitted to do nothing at all not contrived, again? It feels like you're agreeing with anything that faintly sounds against my argument.

It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop.

I acknowledge that the compiler is intelligent enough to convert this to a bzero() call (how it doesn't discard the array entirely is interesting). However, writing zeroes is an example, as the contents of the loop aren't the important part. The fact that the compiler can cheat the test and remove the loop if the loop looks like a certain pattern is not an argument for either loop preference.

To be honest, this doesn't feel like a valid debate anymore. I'm content with its current course of analyzing optimizer behavior.

##### Share on other sites

At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164.

That's scary, where an optimizer can take nearly equal code, and make the one that performs better in theory perform worse in practice. However, this does assert a monopoly of the registers, again.

##### Share on other sites

At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164.

That's scary, where an optimizer can take nearly equal code, and make the one that performs better in theory perform worse in practice. However, this does assert a monopoly of the registers, again.

At this stage I think you really need to read this: http://blogs.msdn.com/b/oldnewthing/archive/2004/12/16/317157.aspx

The technical specifics in it are of course out of date, but the lesson remains valid, and it's instructive to follow him walking through an optimization scenario where he produces code that - going by the disassembly - should be faster in theory, but turns out to be slower in practice because it ends up totally blowing the return address predictor stack.

This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way.

##### Share on other sites

It's actually really cute what happens when you remove the volatile from each case:

Heck, may as well write

;

for both, and watch the compiler optimize away.

printf("%d\n", buf[rand() % 10000]);

That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, contrary to using volatile on the loop counter, does not include some very different (and wrong) assumptions about the variable and its contents and does not screw up the optimizer.

The difference is that Swiftcoder's program is non-trivial (well it is of course kind of "trivial", but it's not trivial insofar as it actually does something by consuming a value of the array) whereas your solution with volatile is a trivial program (one that basically does nothing at all) that inhibits optimization of that variable.

It would probably be entirely legal for a very aggressively optimizing compiler to optimize out the array alltogether, and only count the volatile loop counter up (or down). Because, in the end, whether or not there is an array at all makes no externally observable difference, it's not consumed anywhere, and the program ends right after initializing it.

Edited by samoth

##### Share on other sites

This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way.

I know who Raymond Chen is, and I've read that article years ago.

That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, contrary to using volatile on the loop counter, does not include some very different (and wrong) assumptions about the variable and its contents and does not screw up the optimizer.

Okay, his program is more proper. However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided.

##### Share on other sites

However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided.

Again, easily avoided:
// up.c
#include <stdio.h>
#include <stdlib.h>

int main() {
char buf[10000];
unsigned int limit;
printf("enter a number up to 10000: ");
scanf("%u", &limit);

for(unsigned int i = 0; i != limit; ++i){
buf[i] = i;
}

printf("%d\n", buf[rand() % 10000]);
}
And the same of the other program.

On my system:
$gcc -S -O2 *.c -std=c99 down.c: In function ‘main’: down.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result] up.c: In function ‘main’: up.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result]$ diff {up,down}.s
1c1
< 	.file	"up.c"
---
> 	.file	"down.c"
33,34c33,34
< 	movl	24(%esp), %eax
< 	testl	%eax, %eax
---
> 	movl	24(%esp), %edx
> 	testl	%edx, %edx
36c36
< 	subl	$1, %eax --- > xorl %eax, %eax 41,42c41,42 < subl$1, %eax
< 	cmpl	$-1, %eax --- > addl$1, %eax
> 	cmpl	%edx, %eax

##### Share on other sites

On my system:

Those two look equal in length, and time complexity, so I have no issues with that.