catslap

loop break

48 posts in this topic

Note that these are just examples; I don't know what you or your co-workers would find more readable, since the post-decrement loop is very readable to me.

 

It's a bit deeper than that. The coding standards are legal standards, and go off to a lab to get certed. You can break a rule, but it means getting both reviewers and a senior manager booked for a meeting (costly, and pisses people off, although there's biscuits), then a document is written justifying the deviation from the standard (costly and boring, a massive waste of time and resources), then the certification lab wants to check it but charges more (maybe £6k). Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick. They canned 12 guys on NYE with no warning, another 4 last week, all for bad business practice; they were probably adequate coders, they just didn't understand the industry.

 

There are times when you get pulled in to make something uber-efficient, like context switching on a blisteringly fast FPGA, but mostly, write it like a muppet can read it and you keep on getting paid.

 

I do like that loop though :)


...Isn't that the point, that counting down doesn't fetch on every iteration?

 

No, that's not the point.  This was never about counting down versus counting up; you're picking a completely different fight here, one that nobody else has even gone near, and you're treating everyone else as if they're having the same fight as you when they're not.  This is about comparing the cutesy construct given in the OP with a more idiomatic for loop.

 

I.e. compare this:

 

for (int i = 10; i--;)

 

To this:

 

for (int i = 10; i; i--)

 

That's what the point is.


This is really just an obfuscated version of the idiom:

while (i--)
{
    /* do something */
}

Where i is a preexisting variable.

 

Not really complete unless you show the  'i' variable getting its initial value.

 

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the code's obvious operation and add to the potential for an error).


...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

Counting up wouldn't have fetched each iteration if you'd cached the volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

I posted the ASM. You can see exactly what happened.

and it's not necessarily the same thing that the optimizer would've actually done in real code. You've written an example that does exactly what you assume the compiler will do, but it can't possibly show us what the compiler would actually do in a real situation.
You can't use a forced example to demonstrate the optimizer's actual behavior! ;-)
 

In other words, allow the compiler to see that it's an unchanging number and inline the constant limit (I've tried that).

You can use a global, or a parameter from another function, or get the value from another translation unit, etc... That will stop the compiler from hard-coding the loop limit, so it behaves the same way that it does in real code -- rather than forcing its hand with code that gives it zero leeway.
 

...In other words, just don't use signed values, so that I can feel good about counting up all the time? I fail to see the point of actually modifying the use case to suit the algorithm, rather than the other way around.

Unsigned values work for both up and down. If you want to compare those two loops fairly then you've got to make them as equivalent as possible -- if one uses "not zero", the other should use "not limit", etc. Neither uses negative values, so signed/unsigned is neutral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).
IMO it's also more common to write !=limit and !=0 rather than <limit and >0...
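Roughly this shape, say (just a sketch of the pairing I mean, not code lifted from any earlier post):

#include <stddef.h>

/* the "as equivalent as possible" pair: unsigned counters, != comparisons,
   one testing against zero and the other against the limit */
void fill_down(char *buf, size_t limit)
{
    for (size_t i = limit; i != 0; --i)
        buf[i - 1] = 0;
}

void fill_up(char *buf, size_t limit)
{
    for (size_t i = 0; i != limit; ++i)
        buf[i] = 0;
}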
 

I don't mean to be abrasive, but did you read the assembly output? It clearly moves zero to each of the array elements, exactly as the C code says (array is volatile qualified). It's guaranteed to have not been optimized out, because it says right in the resulting code that it is doing it.

If you're testing whether the compiler can transform between up/down, then the use of volatile on the output array completely voids your test -- volatile writes can't be reordered, so you've told the compiler explicitly not to modify your iteration order!
So to test this behavior of the optimizer, you need to remove volatile, and then add an alternative method of stopping the code from being optimized away entirely.
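For example, something like this (only a sketch of the kind of change I mean, not code from the thread): take the limit at run time and consume a result afterwards, so nothing can be folded to a constant or thrown away.

#include <stdio.h>

int main(void)
{
    char buf[10000];
    unsigned int limit;

    /* the limit is only known at run time, so it can't be folded into a constant */
    if (scanf("%u", &limit) != 1 || limit == 0 || limit > sizeof buf)
        return 1;

    for (unsigned int i = limit; i--; )
        buf[i] = 0;

    /* consume a value so the loop can't be discarded entirely */
    printf("%d\n", buf[limit / 2]);
    return 0;
}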

There is also that thing that I've seen more than a little, where the constant is put first in the test clause:

 

for(i=0; 10 > i; i++)

   {

   }

 

or more commonly

 

if (10 > varx)  {  }

 

or

 

if (Mode1 == varx)   {     }

 

which usually makes me double-check rather more than the classic i<10 style does.
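(For what it's worth, the justification usually given for the constant-first style, and this is only my guess at the intent here, is that a mistyped assignment then refuses to compile:)

/* my illustration, not the poster's code */
int check(int varx)
{
    /* if (varx = 10) { }    compiles: assigns 10, so the condition is always true */
    /* if (10 = varx) { }    compile error: cannot assign to a constant            */

    if (10 == varx)          /* the intended comparison, written either way round  */
        return 1;
    return 0;
}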


 

This is really just an obfuscated version of the idiom:

while (i--)
{
    /* do something */
}

Where i is a preexisting variable.

 

Not really complete unless you show the  'i' variable getting its initial value.

 

Somehow it would have to get the right value previously (and having it set somewhere far above would obscure the code's obvious operation and add to the potential for an error).

 

 

Yes, the point is that a lot of the time the bounds you're interested in are already in a local variable (or a function parameter). Not that I would do this unless it was clear from context what I was doing, but in my opinion it's much better than using a for loop with its iteration expression (i--) in its termination condition.
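For example, something along these lines (a sketch of what I mean, not code from anywhere in the thread):

#include <stddef.h>

/* the count arrives as a parameter, so while (n--) needs no separate setup */
void zero_fill(char *buf, size_t n)
{
    while (n--)
        buf[n] = 0;
}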


 


Ectara, you don't cache the limit variable into a local in the up case, so it has to be fetched every iteration, which doesn't happen in your down case.

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

 

I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}

and

for(int i=0;i<10;i++){}

however, by including volatile, you effectively prevent the optimizer from doing its thing.

 

Yeah, there are edge cases where this would not be true (and you were trying to show that edge case), but that's all it is: edge cases where, yes, you are correct, a count-down version would theoretically generate tinier code. But I think if you get to the point where your profiling indicates that this one extra instruction is your bottleneck, you likely have too weak hardware for whatever you're trying to do.

Edited by slicer4ever


I.e. compare this:



for (int i = 10; i--;)



To this:



for (int i = 10; i; i--)

I have no idea where you're getting that, as the second version never appeared in the thread. Furthermore, that's not even the same loop! Think about it: i is 10 at the start, passes the non-zero check, then it is 10 as it goes into the loop body...
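A quick sketch of the difference (the value of i as seen inside each body):

#include <stdio.h>

int main(void)
{
    printf("for (int i = 10; i--;)   -> ");
    for (int i = 10; i--;)
        printf("%d ", i);        /* prints 9 8 7 6 5 4 3 2 1 0  */

    printf("\nfor (int i = 10; i; i--) -> ");
    for (int i = 10; i; i--)
        printf("%d ", i);        /* prints 10 9 8 7 6 5 4 3 2 1 */

    printf("\n");
    return 0;
}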

 

I have no idea what you're on about.

 


Counting up wouldn't have fetched each iteration if you'd cached the volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

Clearly, your definition of "local" and mine don't align. To me, a local variable is one of automatic storage duration defined within a function or deeper scope. Volatile doesn't change that.

 

Moreover, it seems like you're arguing my method, rather than my result. I don't know why it isn't obvious that using two temporary variables (that the compiler can't optimize away to a constant) instead of one increases register pressure, or why I need to prove it with numerous examples.

 


Basically if you pull that 'clever' shit you can consider yourself unemployed pretty quick.

 

Someday, people will stop assuming that I'm not currently employed as a software engineer, and stop telling me the dictionary definitions of obvious terms.

 


You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

 


rather than forcing its hand with code that gives it zero leeway.

Because the point was that if it isn't put into a register, it can result in larger code. You are making suggestions that would either optimize it to a constant or cache it in a register, which would still be no faster than counting down; at best, it'd break even. My argument from the beginning has been about reducing register pressure and comparing against zero, while your suggestions let the compiler assume there's no register pressure and use all the registers at once.

 


Neither uses negative values, so signed/unsigned is neutral, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).

Post results, I'm curious.

 


IMO it's also more common to write !=limit and !=0 rather than <limit and >0...

In my and my co-workers' opinions, it isn't.

 


If you're testing whether the compiler can transform between up/down

I'm not.

 


I thought the original point was the compiler's optimizer would essentially create the same exact code for:

for(int i=10;i--;){}

and

for(int i=0;i<10;i++){}

I posted three examples that disagreed with the consensus, and got no refutation of any one of them in particular, so I chose the one that was most likely to make the biggest difference.

 


but that's all it is: edge cases where, yes, you are correct, a count-down version would theoretically generate tinier code.

I hardly consider any of my original three examples an edge case. I have a lot of loops that do a lot of permutations on every iteration, and the counter spills to RAM.

 


but I think if you get to the point where your profiling indicates that this one extra instruction is your bottleneck

No, I never said that. People said they would be the same, and I posted an example where they weren't, with proof. This and many other arguments in this thread are complete and utter strawman arguments.

 

Really, I should have known better than to come into a coding horror thread, and post a use case for the idiom that people are dumping on.


Moreover, it seems like you're arguing my method, rather than my result.

Exactly. I'm not attacking you or arguing the results. I'm honestly interested in what the results are. No need to get super defensive.
...I was just pointing out that by using volatile, you've forced the compiler's hand, so you may as well have just hand-written the assembly. It's not a test of the compiler, and since the method is artificial, the results are artificial too.
 

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

Not quite. I was suggesting that your example forces the compiler to only produce very specific assembly -- that your use of volatile has explicitly told the compiler exactly how it's allowed to construct the loops, and that it's not allowed to make any changes/optimizations.
Real production code will not use the volatile keyword like this (in 99.99% of cases), so it's not natural code at all. Write it naturally, and perhaps the compiler can convert a forwards loop into a superior, optimal backwards loop?
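I.e. something like this (a sketch only; whether the optimizer reverses it, unrolls it, or swaps in memset is left entirely up to the compiler):

/* the "natural" forward version, with nothing forcing the compiler's hand */
void clear(char *buf, unsigned int limit)
{
    for (unsigned int i = 0; i < limit; ++i)
        buf[i] = 0;
}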
 

If you're testing whether the compiler can transform between up/down

I'm not.

Oh, you threw me off with this, where you tried to prove that the up/down versions produced different code :P

You do however realize that an optimizing compiler will produce exactly the same code whether you use the obfuscated version or not?

[up.c ... down.c ...] They aren't identical, and the counting down version has one fewer instruction (compiled with gcc -O3 -S). Yes, cleaner code is better. However, it is all too easy to say that they compile to the same instructions.

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke. Sadly, all that matters is whether or not some totally irrelevant (and likely not even competent) person gets an itch, because when that happens you simply don't get the "certified" stamp which is soooooooooooo important. But again, sadly, it's really important because the next incompetent idiot will only buy a product that is certified, no matter whether it's good or bad, and he simply won't buy yours if you don't have the stamp... :(

 

getting certed [...] documents [...] meetings [...] lots of shit [...] huge extra cost
If I didn't know that this is exactly true and if it wasn't so sad, I'd say congrats on an excellent joke. 

 

 

Mostly our stuff is safety-critical; it's a legal matter, as in you go to jail or get pulled up on corporate manslaughter charges if mistakes are made. Reviews are logged with your name at official bodies, and every last test result is recorded and signed in triplicate. Every line of code is scrutinised. The restrictions on what you can do with the languages are huge. It can be miserable at times.

Edited by catslap

Ah, sorry Samoth, I misread that as if you weren't sure how quality standards are met. Now I see you do! It's been a long day :)

volatile simulates register pressure [...] possible for the counter to spill to RAM

There is a lot more to volatile than just spilling the contents to RAM. The compiler is told to assume that the contents of the memory location can be changed externally, at any time.

Which means that it cannot reorder reads and writes, and cannot coalesce fetches. It must fetch the value exactly when it is used, not any earlier or later, and it will completely stall the processor's pipeline doing so.

 

If the loop counter has to be written to RAM because of register pressure, the compiler hides the latency by fetching the value early, so it happens while other instructions execute (if the loop is not totally trivial of course, but a trivial loop won't have high register pressure).
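A small sketch of the difference (my own example, not taken from the earlier posts): with the plain pointer the compiler may load *p once and reuse it; with the volatile pointer it must emit every load, in order, exactly where it occurs.

/* plain: the four reads of *p may be collapsed into a single load */
int sum_plain(const int *p)
{
    int s = 0;
    for (int i = 0; i < 4; ++i)
        s += *p;
    return s;
}

/* volatile: four separate loads must be performed, in program order */
int sum_volatile(const volatile int *p)
{
    int s = 0;
    for (int i = 0; i < 4; ++i)
        s += *p;
    return s;
}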


 

In addition to making sure that it doesn't all get optimized out, the volatile limit simulates register pressure; it is entirely possible for that counter to spill to RAM.

It's actually really cute what happens when you remove the volatile from each case:
// down.c
	.cfi_offset %rbx, -24
	movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
	movq	(%rbx), %rax
	movq	%rax, -16(%rbp)
	movl	$10000, %eax            ## imm = 0x2710
	xorps	%xmm0, %xmm0
	.align	4, 0x90
LBB0_1:
	movups	%xmm0, -10032(%rbp,%rax)
	addq	$-16, %rax
	jne	LBB0_1
// up.c
	.cfi_offset %rbx, -24
	movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
	movq	(%rbx), %rax
	movq	%rax, -16(%rbp)
	leaq	-10016(%rbp), %rdi
	movl	$10000, %esi            ## imm = 0x2710
	callq	___bzero
It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop.

 

 

Amazing what happens when you actually compare like-with-like, isn't it?

 

Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.


Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.

The zero case is kind of degenerate, anyways. Assuming we assign the index variable to the matching array slot, and print a random value from the array after the loop to prevent it being optimised away, the results are even stranger...

At -O2 optimisation, both loops get unrolled and converted to SSE. However, the up case weighs in at 120 lines of ASM, whereas the down case is a whopping 164.

Now, my SSE isn't up to scratch enough to tell which of these is more efficient, but it's fairly obvious which one is simpler :)
 
// up.c
#include <stdio.h>
#include <stdlib.h>

int main() {
	char buf[10000];
	const unsigned int limit = 10000;
	
	for(unsigned int i = 0; i != limit; ++i){
		buf[i] = i;
	}
	
	printf("%d\n", buf[rand() % 10000]);
}
// down.c
#include <stdio.h>
#include <stdlib.h>

int main() {
	char buf[10000];
	const unsigned int limit = 10000;
	
	for(unsigned int i = limit; i--;){
		buf[i] = i;
	}
	
	printf("%d\n", buf[rand() % 10000]);
}
// up.s
	.section	__TEXT,__text,regular,pure_instructions
	.section	__TEXT,__const
	.align	4
LCPI0_0:
	.byte	0                       ## 0x0
	.byte	1                       ## 0x1
	.byte	2                       ## 0x2
	.byte	3                       ## 0x3
	.byte	4                       ## 0x4
	.byte	5                       ## 0x5
	.byte	6                       ## 0x6
	.byte	7                       ## 0x7
	.byte	8                       ## 0x8
	.byte	9                       ## 0x9
	.byte	10                      ## 0xa
	.byte	11                      ## 0xb
	.byte	12                      ## 0xc
	.byte	13                      ## 0xd
	.byte	14                      ## 0xe
	.byte	15                      ## 0xf
LCPI0_1:
	.byte	16                      ## 0x10
	.byte	17                      ## 0x11
	.byte	18                      ## 0x12
	.byte	19                      ## 0x13
	.byte	20                      ## 0x14
	.byte	21                      ## 0x15
	.byte	22                      ## 0x16
	.byte	23                      ## 0x17
	.byte	24                      ## 0x18
	.byte	25                      ## 0x19
	.byte	26                      ## 0x1a
	.byte	27                      ## 0x1b
	.byte	28                      ## 0x1c
	.byte	29                      ## 0x1d
	.byte	30                      ## 0x1e
	.byte	31                      ## 0x1f
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_main
	.align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## BB#0:                                ## %vector.ph
	pushq	%rbp
Ltmp3:
	.cfi_def_cfa_offset 16
Ltmp4:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp5:
	.cfi_def_cfa_register %rbp
	pushq	%rbx
	subq	$10008, %rsp            ## imm = 0x2718
Ltmp6:
	.cfi_offset %rbx, -24
	movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
	movq	(%rbx), %rax
	movq	%rax, -16(%rbp)
	xorl	%eax, %eax
	movdqa	LCPI0_0(%rip), %xmm0
	movdqa	LCPI0_1(%rip), %xmm1
	.align	4, 0x90
LBB0_1:                                 ## %vector.body
                                        ## =>This Inner Loop Header: Depth=1
	movd	%eax, %xmm2
	punpcklbw	%xmm2, %xmm2    ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
	punpcklbw	%xmm2, %xmm2    ## xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
	pshufd	$0, %xmm2, %xmm2        ## xmm2 = xmm2[0,0,0,0]
	movdqa	%xmm2, %xmm3
	paddb	%xmm0, %xmm3
	paddb	%xmm1, %xmm2
	movdqa	%xmm3, -10016(%rbp,%rax)
	movdqa	%xmm2, -10000(%rbp,%rax)
	addq	$32, %rax
	cmpq	$9984, %rax             ## imm = 0x2700
	jne	LBB0_1
## BB#2:                                ## %scalar.ph.preheader
	xorl	%eax, %eax
	.align	4, 0x90
LBB0_3:                                 ## %scalar.ph
                                        ## =>This Inner Loop Header: Depth=1
	movb	%al, -32(%rbp,%rax)
	incq	%rax
	cmpq	$16, %rax
	jne	LBB0_3
## BB#4:
	callq	_rand
	cltq
	imulq	$1759218605, %rax, %rcx ## imm = 0x68DB8BAD
	movq	%rcx, %rdx
	shrq	$63, %rdx
	sarq	$44, %rcx
	addl	%edx, %ecx
	imull	$10000, %ecx, %ecx      ## imm = 0x2710
	subl	%ecx, %eax
	cltq
	movsbl	-10016(%rbp,%rax), %esi
	leaq	L_.str(%rip), %rdi
	xorl	%eax, %eax
	callq	_printf
	movq	(%rbx), %rax
	cmpq	-16(%rbp), %rax
	jne	LBB0_6
## BB#5:
	xorl	%eax, %eax
	addq	$10008, %rsp            ## imm = 0x2718
	popq	%rbx
	popq	%rbp
	ret
LBB0_6:
	callq	___stack_chk_fail
	.cfi_endproc

	.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
	.asciz	"%d\n"


.subsections_via_symbols
// down.s
	.section	__TEXT,__text,regular,pure_instructions
	.section	__TEXT,__const
	.align	4
LCPI0_0:
	.quad	-8                      ## 0xfffffffffffffff8
	.quad	-9                      ## 0xfffffffffffffff7
LCPI0_1:
	.quad	-4                      ## 0xfffffffffffffffc
	.quad	-5                      ## 0xfffffffffffffffb
LCPI0_2:
	.quad	-12                     ## 0xfffffffffffffff4
	.quad	-13                     ## 0xfffffffffffffff3
LCPI0_3:
	.quad	-2                      ## 0xfffffffffffffffe
	.quad	-3                      ## 0xfffffffffffffffd
LCPI0_4:
	.quad	-10                     ## 0xfffffffffffffff6
	.quad	-11                     ## 0xfffffffffffffff5
LCPI0_5:
	.quad	-6                      ## 0xfffffffffffffffa
	.quad	-7                      ## 0xfffffffffffffff9
LCPI0_6:
	.quad	-14                     ## 0xfffffffffffffff2
	.quad	-15                     ## 0xfffffffffffffff1
LCPI0_7:
	.byte	15                      ## 0xf
	.byte	14                      ## 0xe
	.byte	13                      ## 0xd
	.byte	12                      ## 0xc
	.byte	11                      ## 0xb
	.byte	10                      ## 0xa
	.byte	9                       ## 0x9
	.byte	8                       ## 0x8
	.byte	7                       ## 0x7
	.byte	6                       ## 0x6
	.byte	5                       ## 0x5
	.byte	4                       ## 0x4
	.byte	3                       ## 0x3
	.byte	2                       ## 0x2
	.byte	1                       ## 0x1
	.byte	0                       ## 0x0
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_main
	.align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## BB#0:                                ## %vector.ph
	pushq	%rbp
Ltmp3:
	.cfi_def_cfa_offset 16
Ltmp4:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp5:
	.cfi_def_cfa_register %rbp
	pushq	%rbx
	subq	$10136, %rsp            ## imm = 0x2798
Ltmp6:
	.cfi_offset %rbx, -24
	movq	___stack_chk_guard@GOTPCREL(%rip), %rbx
	movq	(%rbx), %rax
	movq	%rax, -16(%rbp)
	movl	$10000, %eax            ## imm = 0x2710
	movq	$-1, %rcx
	movd	%rcx, %xmm0
	pslldq	$8, %xmm0
	movdqa	LCPI0_1(%rip), %xmm9
	movdqa	LCPI0_2(%rip), %xmm10
	movdqa	LCPI0_3(%rip), %xmm11
	movdqa	LCPI0_4(%rip), %xmm12
	movdqa	LCPI0_5(%rip), %xmm13
	movdqa	LCPI0_6(%rip), %xmm14
	movdqa	LCPI0_7(%rip), %xmm15
	.align	4, 0x90
LBB0_1:                                 ## %vector.body
                                        ## =>This Inner Loop Header: Depth=1
	leaq	-1(%rax), %rcx
	movd	%rcx, %xmm2
	movlhps	%xmm2, %xmm2            ## xmm2 = xmm2[0,0]
	movaps	%xmm2, %xmm3
	movaps	%xmm2, %xmm4
	movaps	%xmm2, %xmm5
	movaps	%xmm2, %xmm6
	movaps	%xmm2, %xmm7
	movaps	%xmm2, %xmm1
	movaps	%xmm2, %xmm8
	paddq	%xmm13, %xmm8
	paddq	%xmm14, %xmm2
	movdqa	%xmm2, -10144(%rbp)
	movdqa	%xmm8, -10080(%rbp)
	punpcklbw	%xmm2, %xmm8    ## xmm8 = xmm8[0],xmm2[0],xmm8[1],xmm2[1],xmm8[2],xmm2[2],xmm8[3],xmm2[3],xmm8[4],xmm2[4],xmm8[5],xmm2[5],xmm8[6],xmm2[6],xmm8[7],xmm2[7]
	paddq	%xmm11, %xmm7
	paddq	%xmm12, %xmm1
	movdqa	%xmm1, -10112(%rbp)
	movdqa	%xmm7, -10048(%rbp)
	punpcklbw	%xmm1, %xmm7    ## xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1],xmm7[2],xmm1[2],xmm7[3],xmm1[3],xmm7[4],xmm1[4],xmm7[5],xmm1[5],xmm7[6],xmm1[6],xmm7[7],xmm1[7]
	punpcklbw	%xmm8, %xmm7    ## xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
	paddq	%xmm0, %xmm3
	paddq	LCPI0_0(%rip), %xmm4
	paddq	%xmm9, %xmm5
	paddq	%xmm10, %xmm6
	movdqa	%xmm6, -10128(%rbp)
	movdqa	%xmm5, -10064(%rbp)
	movdqa	%xmm4, -10096(%rbp)
	movdqa	%xmm3, -10032(%rbp)
	punpcklbw	%xmm6, %xmm5    ## xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1],xmm5[2],xmm6[2],xmm5[3],xmm6[3],xmm5[4],xmm6[4],xmm5[5],xmm6[5],xmm5[6],xmm6[6],xmm5[7],xmm6[7]
	punpcklbw	%xmm4, %xmm3    ## xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
	punpcklbw	%xmm5, %xmm3    ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
	punpcklbw	%xmm7, %xmm3    ## xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3],xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7]
	movd	-10136(%rbp), %xmm1
	movd	-10072(%rbp), %xmm2
	punpcklbw	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
	movd	-10104(%rbp), %xmm1
	movd	-10040(%rbp), %xmm4
	punpcklbw	%xmm1, %xmm4    ## xmm4 = xmm4[0],xmm1[0],xmm4[1],xmm1[1],xmm4[2],xmm1[2],xmm4[3],xmm1[3],xmm4[4],xmm1[4],xmm4[5],xmm1[5],xmm4[6],xmm1[6],xmm4[7],xmm1[7]
	punpcklbw	%xmm2, %xmm4    ## xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1],xmm4[2],xmm2[2],xmm4[3],xmm2[3],xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
	movd	-10120(%rbp), %xmm1
	movd	-10056(%rbp), %xmm2
	punpcklbw	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
	movd	-10088(%rbp), %xmm1
	movd	-10024(%rbp), %xmm5
	punpcklbw	%xmm1, %xmm5    ## xmm5 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3],xmm5[4],xmm1[4],xmm5[5],xmm1[5],xmm5[6],xmm1[6],xmm5[7],xmm1[7]
	punpcklbw	%xmm2, %xmm5    ## xmm5 = xmm5[0],xmm2[0],xmm5[1],xmm2[1],xmm5[2],xmm2[2],xmm5[3],xmm2[3],xmm5[4],xmm2[4],xmm5[5],xmm2[5],xmm5[6],xmm2[6],xmm5[7],xmm2[7]
	punpcklbw	%xmm4, %xmm5    ## xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
	punpcklbw	%xmm5, %xmm3    ## xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
	pshufb	%xmm15, %xmm3
	movdqu	%xmm3, -10032(%rbp,%rax)
	addq	$-16, %rax
	jne	LBB0_1
## BB#2:                                ## %middle.block
	callq	_rand
	cltq
	imulq	$1759218605, %rax, %rcx ## imm = 0x68DB8BAD
	movq	%rcx, %rdx
	shrq	$63, %rdx
	sarq	$44, %rcx
	addl	%edx, %ecx
	imull	$10000, %ecx, %ecx      ## imm = 0x2710
	subl	%ecx, %eax
	cltq
	movsbl	-10016(%rbp,%rax), %esi
	leaq	L_.str(%rip), %rdi
	xorl	%eax, %eax
	callq	_printf
	movq	(%rbx), %rax
	cmpq	-16(%rbp), %rax
	jne	LBB0_4
## BB#3:                                ## %middle.block
	xorl	%eax, %eax
	addq	$10136, %rsp            ## imm = 0x2798
	popq	%rbx
	popq	%rbp
	ret
LBB0_4:                                 ## %middle.block
	callq	___stack_chk_fail
	.cfi_endproc

	.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
	.asciz	"%d\n"


.subsections_via_symbols

At this stage I feel a new rule of the internet coming on.

 

"There is no coding standard so reprehensible, so obviously wrong-looking, that somebody somewhere won't try to defend it."  (See also comma-first.)



It's actually really cute what happens when you remove the volatile from each case:

Heck, may as well write

;

for both, and watch the compiler optimize away.



Amazing what happens when you actually compare like-with-like, isn't it?

That's not what that is comparing. That's pretty much saying x*0 vs. x - x, where the observer can clearly see that the calculation is useless, and thus can do anything they want. If I turn the optimization higher, without volatile, it emits an empty program.

 


Of course a contrived case can be set up to prove any point, so it is important to do an equivalent comparison.

How is a case where the compiler is permitted to do nothing at all not contrived, again? It feels like you're agreeing with anything that faintly sounds against my argument.

 


It appears that the optimiser does not know that the two loops are identical, as the more normal upwards loop is optimised to a call to bzero(), whereas the downwards case actually generates a loop.

I acknowledge that the compiler is intelligent enough to convert this to a bzero() call (how it doesn't discard the array entirely is interesting). However, writing zeroes is an example, as the contents of the loop aren't the important part. The fact that the compiler can cheat the test and remove the loop if the loop looks like a certain pattern is not an argument for either loop preference.

 

To be honest, this doesn't feel like a valid debate anymore. I'm content with its current course of analyzing optimizer behavior.



At -O2 optimisation, both loops get unrolled and converted to SSE. However, the up case weighs in at 120 lines of ASM, whereas the down case is a whopping 164.

That's scary, where an optimizer can take nearly equal code and make the one that performs better in theory perform worse in practice. However, this does assume a monopoly on the registers, again.


 


At -O2 optimisation, both loops get unrolled and converted to SSE. However, the up case weighs in at 120 lines of ASM, whereas the down case is a whopping 164.

That's scary, where an optimizer can take nearly equal code and make the one that performs better in theory perform worse in practice. However, this does assume a monopoly on the registers, again.

 

 

At this stage I think you really need to read this: http://blogs.msdn.com/b/oldnewthing/archive/2004/12/16/317157.aspx

 

The technical specifics in it are of course out of date, but the lesson remains valid, and it's instructive to follow him walking through an optimization scenario where he produces code that - going by the disassembly - should be faster in theory, but turns out to be slower in practice because it ends up totally blowing the return address predictor stack.

 

This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way.


 


It's actually really cute what happens when you remove the volatile from each case:

Heck, may as well write

;

for both, and watch the compiler optimize away.

 

No. Please note this line:

printf("%d\n", buf[rand() % 10000]);

That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, unlike using volatile on the loop counter, does not introduce some very different (and wrong) assumptions about the variable and its contents, and does not screw up the optimizer.

 

The difference is that Swiftcoder's program is non-trivial (well it is of course kind of "trivial", but it's not trivial insofar as it actually does something by consuming a value of the array) whereas your solution with volatile is a trivial program (one that basically does nothing at all) that inhibits optimization of that variable.

 

It would probably be entirely legal for a very aggressively optimizing compiler to optimize out the array altogether and only count the volatile loop counter up (or down). Because, in the end, whether or not there is an array at all makes no externally observable difference; it's not consumed anywhere, and the program ends right after initializing it.
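Roughly this shape, as a sketch of the point (my reconstruction, not the actual code from earlier in the thread):

int main(void)
{
    char buf[10000];
    volatile unsigned int i;    /* the counter's reads/writes are the only observable behaviour */

    for (i = 10000; i--; )
        buf[i] = 0;             /* buf is never read again: dead stores, legally removable */

    return 0;                   /* the array is not consumed anywhere */
}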

Edited by samoth




This is the same Raymond Chen who has credits on early releases of the Linux kernel, by the way.

I know who Raymond Chen is, and I read that article years ago.

 


That is the proper way (or one proper way) of preventing the compiler from optimizing out the loop. Which, unlike using volatile on the loop counter, does not introduce some very different (and wrong) assumptions about the variable and its contents, and does not screw up the optimizer.

Okay, his program is more proper. However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided.


However, that still ignores the fact that the compiler is optimizing his limit away to a constant (cheating the test), which my volatile limit avoided.

Again, easily avoided:
// up.c
#include <stdio.h>
#include <stdlib.h>

int main() {
	char buf[10000];
	unsigned int limit;
	printf("enter a number up to 10000: ");
	scanf("%u", &limit);
	
	for(unsigned int i = 0; i != limit; ++i){
		buf[i] = i;
	}
	
	printf("%d\n", buf[rand() % 10000]);
}
And the same for the other program.

On my system:
$ gcc -S -O2 *.c -std=c99
down.c: In function ‘main’:
down.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result]
up.c: In function ‘main’:
up.c:9:7: warning: ignoring return value of ‘scanf’, declared with attribute warn_unused_result [-Wunused-result]

$ diff {up,down}.s
1c1
< 	.file	"up.c"
---
> 	.file	"down.c"
33,34c33,34
< 	movl	24(%esp), %eax
< 	testl	%eax, %eax
---
> 	movl	24(%esp), %edx
> 	testl	%edx, %edx
36c36
< 	subl	$1, %eax
---
> 	xorl	%eax, %eax
41,42c41,42
< 	subl	$1, %eax
< 	cmpl	$-1, %eax
---
> 	addl	$1, %eax
> 	cmpl	%edx, %eax



On my system:

Those two look equal in length and time complexity, so I have no issues with that.

