Followers 0

# loop break

## 48 posts in this topic

Note that these are just examples; I don't know what you or your co-workers would find more readable, since the post-decrement loop is very readable to me.

It's a bit deeper than that. The coding standards are legal standards, and go off to a lab to get certed. You can break a rule, but it means getting in both reviewers, and  senior manager  booked for a meeting(costly, and pisses people off, although there's biscuits), then a document is written justifying the deviation from standard (costly and boring, massive waste of time and resource), then the certification lab want to check it but charge more (maybe £6k). Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick. They canned 12 guys on NYE no warning, another 4 last week, all for bad business practice, they were probably adequate coders, just didn't understand the industry.

There are times when you get pulled in to make something uber efficient, like context switching on a blistering fast fpga, but mostly, write it like a muppet can read it and you keep on getting paid.

I do like that loop though :)

1

##### Share on other sites

...Isn't that the point, that counting down doesn't fetch on every iteration?

No, that's not the point.  This was never about counting down versus counting up, you're picking a completely different fight here, one that nobody else has even gone near talking about, and you're treating everyone else as if they're having the same fight as you whereas they're not.  This is about comparing the cutesy construct given in the OP with a more idiomatic for loop.

I.e compare this:

for (int i = 10; i--;)

To this:

for (int i = 10; i; i--)

That's what the point is.

1

##### Share on other sites

This is really just an obfuscated version of the idiom:

while (i--)
{
/* do something */
}


Where i is a preexisting variable.

Not really complete unless you show the  'i' variable getting its initial value.

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the codes obvious operation and add to the potential for a error).

0

##### Share on other sites

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

Counting up wouldn't have fetched each iteration if you'd cached he volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

I posted the ASM. You can see exactly what happened.

and it's not necessarily the same thing that the optimizer would've actually done in real code. Your written an example that does exactly what you assume the compiler will do, but can't possibly show us what the compiler would actually do in a real situation.
You can't use a forced example to demonstrate the optimizer's actual behavior! ;-)

In other words, allow the compiler to see that it's an unchanging number and inline the constant limit (I've tried that).

You can use a global, or a parameter from another function, or get the value from another translation unit, etc... That will stop the compiler from hard doing the loop limit in he same way that it does in real code -- rather than forcing its hand with code that gives it zero leeway.

...In other words, just don't use signed values, so that I can feel good about counting up all the time? I fail to see the point of actually modifying the use case to suit the algorithm, rather than the other way around.

unsigneds work for both up and down. If you want to compare those two loops fairly then you've got to make them as equivalent as possible -- if one uses "not zero", the other should use "not limit", etc. Neither use negative values so signed/unsigned is nuetral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).
IMO it's also more common to write !=limit and !=0 rather than <limit and >0...

I don't mean to be abrasive, but did you read the assembly output? It clearly moves zero to each of the array elements, exactly as the C code says (array is volatile qualified). It's guaranteed to have not been optimized out, because it says right in the resulting code that it is doing it.

If you're testing whether the compiler can transform between up/down, then the use of volatile on the output array completely voids your test -- volatile writes can't be reordered, so you've told the compiler explicitly not to modify your iteration order!
So to test this behavior of the optimizer, you need to remove volatile, and then add an alternative method of stopping the code from being optimized away entirely.
1

##### Share on other sites

There is also that thing that Ive seen more than a little where in the test clause the constant is put first

for(i=0; 10 > i; i++)

{

}

or more commonly

if (10 > varx)  {  }

or

if (Mode1 == varx)   {     }

which usually makes me double check rather more than the classic  i<10  style

0

##### Share on other sites

This is really just an obfuscated version of the idiom:

while (i--)
{
/* do something */
}


Where i is a preexisting variable.

Not really complete unless you show the  'i' variable getting its initial value.

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the codes obvious operation and add to the potential for a error).

Yes, the point is a lot of the time the bounds you are interested in are already in a local variable (or a function parameter). Not that I would do this unless it was clear from context what I was doing. But in my opinion, it's much better than using a for loop with its iteration function (i--) in its termination condition.

0

##### Share on other sites

Ectara, you don't cache the limit variable into a local in the up case, so it has to be fetched every iteration, which doesn't happen in your down case.

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}


and

for(int i=0;i<10;i++){}


however by including violatile, you effectively shut off the optimizer from doing it's thing.

yea, there are edge cases where this would not be true(and you were trying to show that edge case).  but that's just all it is, edge cases where yes you are correct, a count down version would theoretically generate tinier code.  but I think if you get to the point where your profiling indicates that this one extra instruction is your bottlenexk, you likely have too weak of hardware for w/e your trying to do.

Edited by slicer4ever
0

##### Share on other sites

I.e compare this:

for (int i = 10; i--;)

To this:

for (int i = 10; i; i--)

I have no idea where you're getting that, as the second version never appeared in the thread. Furthermore, that's not even the same loop! Think about it: i is 10 at the start, passes the non-zero check, then it is 10 as it goes into the loop body...

I have no idea what you're on about.

Counting up wouldn't have fetched each iteration if you'd cached he volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

Clearly, your definition of "local" and mine don't align. To me, a local variable is one of automatic storage duration defined with in a function or deeper scope. Volatile doesn't change that.

Moreover, it seems like you're arguing my method, rather than my result. I don't know why it isn't obvious that using two temporary variables (that the compiler can't optimize away to a constant) instead of one increases register pressure, and I need to prove it with numerous examples.

Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick.

Someday, people won't always assume that I'm not employed as a software engineer right now, and tell me the dictionary definition of obvious terms.

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

rather than forcing its hand with code that gives it zero leeway.

Because the point was that if it isn't put into a register, it can result in larger code. You are making all suggestions that would either optimize it to a constant or cache it in a register, which would still be no faster than counting down; at best, it'd break even. My argument from the beginning is reducing register pressure and comparing against zero, while your suggestions are allowing the compiler to assume there's no register pressure and use all registers at once.

Neither use negative values so signed/unsigned is nuetral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).

Post results, I'm curious.

IMO it's also more common to write !=limit and !=0 rather than 0...

In my and my co-worker's opinions, it isn't.

If you're testing whether the compiler can transform between up/down

I'm not.

I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}

and

for(int i=0;i<10;i++){}

I posted three examples that disagreed with the consensus, and I got refutation about none of them in particular, so I chose the one that was most likely to make the biggest difference.

but that's just all it is, edge cases where yes you are correct, a count down version would theoretically generate tinier code.

I hardly consider any of my original three examples an edge case. I have a lot of loops that does a lot of permutations on every iteration, and the counter spills to RAM.

but I think if you get to the point where your profiling indicates that this one extra instruction is your bottlenexk

No, I never said that. People said they would be the same, and I posted an example where they weren't, with proof. This and many other arguments in this thread is a complete and utter strawman argument.

Really, I should have known better than to come into a coding horror thread, and post a use case for the idiom that people are dumping on.

0

##### Share on other sites

Moreover, it seems like you're arguing my method, rather than my result.

Exactly. I'm not attacking you or arguing the results. I'm honestly interested in what the results are. No need to get super defensive.
...I was just pointing out that by using volatile, you've forced the compiler's hand, so you may as well have just hand-written the assembly; it's not a test of the compiler, so seeing that the method is artificial, the results are also artificially achieved.

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

Not quite. I was suggesting that your example forces the compiler to only produce very specific assembly -- that your use of volatile has explicitly told the compiler exactly how it's allowed to construct the loops, and that it's not allowed to make any changes/optimizations.
Real production code will not use the volatile keyword like this (in 99.99% of cases) so it's not at all natural code. Write it naturally and perhaps the compiler can convert a forwards loop into a superior, optimal backwards loop?

If you're testing whether the compiler can transform between up/down

I'm not.

Oh, you threw me off with this where you tried to prove that the up/down versions produced different code

You do however realize that an optimizing compiler will produce exactly the same code whether you use the obfuscated version or not?

[up.c ... down.c ...] They aren't identical, and the counting down version has one fewer instruction (compiled with gcc -O3 -S). Yes, cleaner code is better. However, it is all too easy to say that they compile to the same instructions.
1

##### Share on other sites

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke. Sadly, all that matters is whether or not some totally irrelevant (and likely not even competent) person gets an itch, because when that happens you simply don't get the "certified" stamp which is soooooooooooo important. But again, sadly, it's really important because the next incompetent idiot will only buy a product that is certified, no matter whether it's good or bad, and he simply won't buy yours if you don't have the stamp...
0

##### Share on other sites

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke.

mostly our stuff is safety critical, it's a legal matter, as in you go to jail or pulled up on corporate manslaughter if mistakes are made. Reviews are logged with your name at official bodies, every last test result is recorded and signed in triplicate. Every line of code is scrutinised. The restrictions on what you can do with the languages are huge. It can be miserable at times.

Edited by catslap
0

##### Share on other sites

ah, sorry Samoth, I misread that as if you weren't sure how quality standards are met. Now I see you do! It's been a long day

0

##### Share on other sites
volatile simulates register pressure [...] possible for the counter to spill to RAM

There is a lot more to volatile than just spilling the contents to RAM. The compiler is told to assume that the contents of the memory location can be changed externally, at any time.

Which means that it cannot reorder reads and writes, and cannot coalesce fetches. It must fetch the value exactly when it is used, not any earlier or later, and it will completely stall the processor's pipeline doing so.

If the loop counter has to be written to RAM because of register pressure, the compiler hides the latency by fetching the value early, so it happens while other instructions execute (if the loop is not totally trivial of course, but a trivial loop won't have high register pressure).

1

##### Share on other sites

In addition to making sure that it doesn't all get optimized out, the volatile limit simulates register pressure; it is entirely possible for that counter to spill to RAM.

It's actually really cute what happens when you remove the volatile from each case:
// down.c
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
movl	10000, %eax ## imm = 0x2710 xorps %xmm0, %xmm0 .align 4, 0x90 LBB0_1: movups %xmm0, -10032(%rbp,%rax) addq-16, %rax
jne	LBB0_1
// up.c
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
leaq	-10016(%rbp), %rdi
movl	10000, %esi ## imm = 0x2710 callq ___bzero It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop. Amazing what happens when you actually compare like-with-like, isn't it? Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison. 0 #### Share this post ##### Link to post ##### Share on other sites Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison. The zero case is kind of degenerate, anyways. Assuming we assign the index variable to the matching array slot, and print a random value from the array after the loop to prevent it being optimised away, the results are even stranger... At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164. Now, my SSE isn't up to scratch enough to tell which of these is more efficient, but it's fairly obvious which one is simpler // up.c #include <stdio.h> #include <stdlib.h> int main() { char buf[10000]; const unsigned int limit = 10000; for(unsigned int i = 0; i != limit; ++i){ buf[i] = i; } printf("%d\n", buf[rand() % 10000]); } // down.c #include <stdio.h> #include <stdlib.h> int main() { char buf[10000]; const unsigned int limit = 10000; for(unsigned int i = limit; i--;){ buf[i] = i; } printf("%d\n", buf[rand() % 10000]); } // up.s .section __TEXT,__text,regular,pure_instructions .section __TEXT,__const .align 4 LCPI0_0: .byte 0 ## 0x0 .byte 1 ## 0x1 .byte 2 ## 0x2 .byte 3 ## 0x3 .byte 4 ## 0x4 .byte 5 ## 0x5 .byte 6 ## 0x6 .byte 7 ## 0x7 .byte 8 ## 0x8 .byte 9 ## 0x9 .byte 10 ## 0xa .byte 11 ## 0xb .byte 12 ## 0xc .byte 13 ## 0xd .byte 14 ## 0xe .byte 15 ## 0xf LCPI0_1: .byte 16 ## 0x10 .byte 17 ## 0x11 .byte 18 ## 0x12 .byte 19 ## 0x13 .byte 20 ## 0x14 .byte 21 ## 0x15 .byte 22 ## 0x16 .byte 23 ## 0x17 .byte 24 ## 0x18 .byte 25 ## 0x19 .byte 26 ## 0x1a .byte 27 ## 0x1b .byte 28 ## 0x1c .byte 29 ## 0x1d .byte 30 ## 0x1e .byte 31 ## 0x1f .section __TEXT,__text,regular,pure_instructions .globl _main .align 4, 0x90 _main: ## @main .cfi_startproc ## BB#0: ## %vector.ph pushq %rbp Ltmp3: .cfi_def_cfa_offset 16 Ltmp4: .cfi_offset %rbp, -16 movq %rsp, %rbp Ltmp5: .cfi_def_cfa_register %rbp pushq %rbx subq10008, %rsp            ## imm = 0x2718
Ltmp6:
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
xorl	%eax, %eax
movdqa	LCPI0_0(%rip), %xmm0
movdqa	LCPI0_1(%rip), %xmm1
.align	4, 0x90
LBB0_1:                                 ## %vector.body
## =>This Inner Loop Header: Depth=1
movd	%eax, %xmm2
punpcklbw	%xmm2, %xmm2    ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklbw	%xmm2, %xmm2    ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshufd	$0, %xmm2, %xmm2 ## xmm2 = xmm2[0,0,0,0] movdqa %xmm2, %xmm3 paddb %xmm0, %xmm3 paddb %xmm1, %xmm2 movdqa %xmm3, -10016(%rbp,%rax) movdqa %xmm2, -10000(%rbp,%rax) addq$32, %rax
cmpq	9984, %rax ## imm = 0x2700 jne LBB0_1 ## BB#2: ## %scalar.ph.preheader xorl %eax, %eax .align 4, 0x90 LBB0_3: ## %scalar.ph ## =>This Inner Loop Header: Depth=1 movb %al, -32(%rbp,%rax) incq %rax cmpq16, %rax
jne	LBB0_3
## BB#4:
callq	_rand
cltq
imulq	$1759218605, %rax, %rcx ## imm = 0x68DB8BAD movq %rcx, %rdx shrq$63, %rdx
sarq	$44, %rcx addl %edx, %ecx imull$10000, %ecx, %ecx      ## imm = 0x2710
subl	%ecx, %eax
cltq
movsbl	-10016(%rbp,%rax), %esi
leaq	L_.str(%rip), %rdi
xorl	%eax, %eax
callq	_printf
movq	(%rbx), %rax
cmpq	-16(%rbp), %rax
jne	LBB0_6
## BB#5:
xorl	%eax, %eax
addq	10008, %rsp ## imm = 0x2718 popq %rbx popq %rbp ret LBB0_6: callq ___stack_chk_fail .cfi_endproc .section __TEXT,__cstring,cstring_literals L_.str: ## @.str .asciz "%d\n" .subsections_via_symbols // down.s .section __TEXT,__text,regular,pure_instructions .section __TEXT,__const .align 4 LCPI0_0: .quad -8 ## 0xfffffffffffffff8 .quad -9 ## 0xfffffffffffffff7 LCPI0_1: .quad -4 ## 0xfffffffffffffffc .quad -5 ## 0xfffffffffffffffb LCPI0_2: .quad -12 ## 0xfffffffffffffff4 .quad -13 ## 0xfffffffffffffff3 LCPI0_3: .quad -2 ## 0xfffffffffffffffe .quad -3 ## 0xfffffffffffffffd LCPI0_4: .quad -10 ## 0xfffffffffffffff6 .quad -11 ## 0xfffffffffffffff5 LCPI0_5: .quad -6 ## 0xfffffffffffffffa .quad -7 ## 0xfffffffffffffff9 LCPI0_6: .quad -14 ## 0xfffffffffffffff2 .quad -15 ## 0xfffffffffffffff1 LCPI0_7: .byte 15 ## 0xf .byte 14 ## 0xe .byte 13 ## 0xd .byte 12 ## 0xc .byte 11 ## 0xb .byte 10 ## 0xa .byte 9 ## 0x9 .byte 8 ## 0x8 .byte 7 ## 0x7 .byte 6 ## 0x6 .byte 5 ## 0x5 .byte 4 ## 0x4 .byte 3 ## 0x3 .byte 2 ## 0x2 .byte 1 ## 0x1 .byte 0 ## 0x0 .section __TEXT,__text,regular,pure_instructions .globl _main .align 4, 0x90 _main: ## @main .cfi_startproc ## BB#0: ## %vector.ph pushq %rbp Ltmp3: .cfi_def_cfa_offset 16 Ltmp4: .cfi_offset %rbp, -16 movq %rsp, %rbp Ltmp5: .cfi_def_cfa_register %rbp pushq %rbx subq10136, %rsp            ## imm = 0x2798
Ltmp6:
.cfi_offset %rbx, -24
movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
movq	(%rbx), %rax
movq	%rax, -16(%rbp)
movl	$10000, %eax ## imm = 0x2710 movq$-1, %rcx
movd	%rcx, %xmm0
pslldq	8, %xmm0 movdqa LCPI0_1(%rip), %xmm9 movdqa LCPI0_2(%rip), %xmm10 movdqa LCPI0_3(%rip), %xmm11 movdqa LCPI0_4(%rip), %xmm12 movdqa LCPI0_5(%rip), %xmm13 movdqa LCPI0_6(%rip), %xmm14 movdqa LCPI0_7(%rip), %xmm15 .align 4, 0x90 LBB0_1: ## %vector.body ## =>This Inner Loop Header: Depth=1 leaq -1(%rax), %rcx movd %rcx, %xmm2 movlhps %xmm2, %xmm2 ## xmm2 = xmm2[0,0] movaps %xmm2, %xmm3 movaps %xmm2, %xmm4 movaps %xmm2, %xmm5 movaps %xmm2, %xmm6 movaps %xmm2, %xmm7 movaps %xmm2, %xmm1 movaps %xmm2, %xmm8 paddq %xmm13, %xmm8 paddq %xmm14, %xmm2 movdqa %xmm2, -10144(%rbp) movdqa %xmm8, -10080(%rbp) punpcklbw %xmm2, %xmm8 ## xmm8 = xmm8[0],xmm2[0],xmm8[1],xmm2[1],xmm8[2],xmm2[2],xmm8[3],xmm2[3],xmm8[4],xmm2[4],xmm8[5],xmm2[5],xmm8[6],xmm2[6],xmm8[7],xmm2[7] paddq %xmm11, %xmm7 paddq %xmm12, %xmm1 movdqa %xmm1, -10112(%rbp) movdqa %xmm7, -10048(%rbp) punpcklbw %xmm1, %xmm7 ## xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1],xmm7[2],xmm1[2],xmm7[3],xmm1[3],xmm7[4],xmm1[4],xmm7[5],xmm1[5],xmm7[6],xmm1[6],xmm7[7],xmm1[7] punpcklbw %xmm8, %xmm7 ## xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7] paddq %xmm0, %xmm3 paddq LCPI0_0(%rip), %xmm4 paddq %xmm9, %xmm5 paddq %xmm10, %xmm6 movdqa %xmm6, -10128(%rbp) movdqa %xmm5, -10064(%rbp) movdqa %xmm4, -10096(%rbp) movdqa %xmm3, -10032(%rbp) punpcklbw %xmm6, %xmm5 ## xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1],xmm5[2],xmm6[2],xmm5[3],xmm6[3],xmm5[4],xmm6[4],xmm5[5],xmm6[5],xmm5[6],xmm6[6],xmm5[7],xmm6[7] punpcklbw %xmm4, %xmm3 ## xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7] punpcklbw %xmm5, %xmm3 ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7] punpcklbw %xmm7, %xmm3 ## xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3],xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7] movd -10136(%rbp), %xmm1 movd -10072(%rbp), %xmm2 punpcklbw %xmm1, %xmm2 ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7] movd -10104(%rbp), %xmm1 movd -10040(%rbp), %xmm4 punpcklbw %xmm1, %xmm4 ## xmm4 = xmm4[0],xmm1[0],xmm4[1],xmm1[1],xmm4[2],xmm1[2],xmm4[3],xmm1[3],xmm4[4],xmm1[4],xmm4[5],xmm1[5],xmm4[6],xmm1[6],xmm4[7],xmm1[7] punpcklbw %xmm2, %xmm4 ## xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1],xmm4[2],xmm2[2],xmm4[3],xmm2[3],xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7] movd -10120(%rbp), %xmm1 movd -10056(%rbp), %xmm2 punpcklbw %xmm1, %xmm2 ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7] movd -10088(%rbp), %xmm1 movd -10024(%rbp), %xmm5 punpcklbw %xmm1, %xmm5 ## xmm5 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3],xmm5[4],xmm1[4],xmm5[5],xmm1[5],xmm5[6],xmm1[6],xmm5[7],xmm1[7] punpcklbw %xmm2, %xmm5 ## xmm5 = xmm5[0],xmm2[0],xmm5[1],xmm2[1],xmm5[2],xmm2[2],xmm5[3],xmm2[3],xmm5[4],xmm2[4],xmm5[5],xmm2[5],xmm5[6],xmm2[6],xmm5[7],xmm2[7] punpcklbw %xmm4, %xmm5 ## xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7] punpcklbw %xmm5, %xmm3 ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7] pshufb %xmm15, %xmm3 movdqu %xmm3, -10032(%rbp,%rax) addq-16, %rax
jne	LBB0_1
## BB#2:                                ## %middle.block
callq	_rand
cltq
imulq	$1759218605, %rax, %rcx ## imm = 0x68DB8BAD movq %rcx, %rdx shrq$63, %rdx
sarq	$44, %rcx addl %edx, %ecx imull$10000, %ecx, %ecx      ## imm = 0x2710
subl	%ecx, %eax
cltq
movsbl	-10016(%rbp,%rax), %esi
leaq	L_.str(%rip), %rdi
xorl	%eax, %eax
callq	_printf
movq	(%rbx), %rax
cmpq	-16(%rbp), %rax
jne	LBB0_4
## BB#3:                                ## %middle.block
xorl	%eax, %eax
addq	$10136, %rsp ## imm = 0x2798 popq %rbx popq %rbp ret LBB0_4: ## %middle.block callq ___stack_chk_fail .cfi_endproc .section __TEXT,__cstring,cstring_literals L_.str: ## @.str .asciz "%d\n" .subsections_via_symbols 2 #### Share this post ##### Link to post ##### Share on other sites At this stage I feel a new rule of the internet coming on. "There is no coding standard so reprehensible, so obviously wrong-looking, that somebody somewhere won't try to defend it." (See also comma-first.) 0 #### Share this post ##### Link to post ##### Share on other sites It's actually really cute what happens when you remove the volatile from each case: Heck, may as well write ;  for both, and watch the compiler optimize away. 0 #### Share this post ##### Link to post ##### Share on other sites Amazing what happens when you actually compare like-with-like, isn't it? That's not what that is comparing. That's pretty much saying x*0 vs. x - x, where the observer can clearly see that the calculation is useless, and thus do anything they want. If I turn the optimization higher, without volatile, it emits an empty program. Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison. How is a case where the compiler is permitted to do nothing at all not contrived, again? It feels like you're agreeing with anything that faintly sounds against my argument. It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop. I acknowledge that the compiler is intelligent enough to convert this to a bzero() call (how it doesn't discard the array entirely is interesting). However, writing zeroes is an example, as the contents of the loop aren't the important part. The fact that the compiler can cheat the test and remove the loop if the loop looks like a certain pattern is not an argument for either loop preference. To be honest, this doesn't feel like a valid debate anymore. I'm content with its current course of analyzing optimizer behavior. 0 #### Share this post ##### Link to post ##### Share on other sites At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164. That's scary, where an optimizer can take nearly equal code, and make the one that performs better in theory perform worse in practice. However, this does assert a monopoly of the registers, again. 0 #### Share this post ##### Link to post ##### Share on other sites At -O2 optimisation, both loops get unrolled, and converted to SSE. However, the up case weights in at 120 lines of ASM, whereas the down case is a whopping 164. That's scary, where an optimizer can take nearly equal code, and make the one that performs better in theory perform worse in practice. However, this does assert a monopoly of the registers, again. At this stage I think you really need to read this: http://blogs.msdn.com/b/oldnewthing/archive/2004/12/16/317157.aspx The technical specifics in it are of course out of date, but the lesson remains valid, and it's instructive to follow him walking through an optimization scenario where he produces code that - going by the disassembly - should be faster in theory, but turns out to be slower in practice because it ends up totally blowing the return address predictor stack. This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way. 0 #### Share this post ##### Link to post ##### Share on other sites It's actually really cute what happens when you remove the volatile from each case: Heck, may as well write ;  for both, and watch the compiler optimize away. No. Please note this line: printf("%d\n", buf[rand() % 10000]); That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, contrary to using volatile on the loop counter, does not include some very different (and wrong) assumptions about the variable and its contents and does not screw up the optimizer. The difference is that Swiftcoder's program is non-trivial (well it is of course kind of "trivial", but it's not trivial insofar as it actually does something by consuming a value of the array) whereas your solution with volatile is a trivial program (one that basically does nothing at all) that inhibits optimization of that variable. It would probably be entirely legal for a very aggressively optimizing compiler to optimize out the array alltogether, and only count the volatile loop counter up (or down). Because, in the end, whether or not there is an array at all makes no externally observable difference, it's not consumed anywhere, and the program ends right after initializing it. Edited by samoth 1 #### Share this post ##### Link to post ##### Share on other sites This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way. I know who Raymond Chen is, and I've read that article years ago. That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, contrary to using volatile on the loop counter, does not include some very different (and wrong) assumptions about the variable and its contents and does not screw up the optimizer. Okay, his program is more proper. However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided. 0 #### Share this post ##### Link to post ##### Share on other sites However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided. Again, easily avoided: // up.c #include <stdio.h> #include <stdlib.h> int main() { char buf[10000]; unsigned int limit; printf("enter a number up to 10000: "); scanf("%u", &limit); for(unsigned int i = 0; i != limit; ++i){ buf[i] = i; } printf("%d\n", buf[rand() % 10000]); }  And the same of the other program. On my system: $ gcc -S -O2 *.c -std=c99
down.c: In function ‘main’:
down.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result]
up.c: In function ‘main’:
up.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result]

$diff {up,down}.s 1c1 < .file "up.c" --- > .file "down.c" 33,34c33,34 < movl 24(%esp), %eax < testl %eax, %eax --- > movl 24(%esp), %edx > testl %edx, %edx 36c36 < subl$1, %eax
---
> 	xorl	%eax, %eax
41,42c41,42
< 	subl	$1, %eax < cmpl$-1, %eax
---
> 	cmpl	%edx, %eax

0

##### Share on other sites

On my system:

Those two look equal in length, and time complexity, so I have no issues with that.

0

## Create an account

Register a new account