using static initialisation for parallelisation?

Started by
26 comments, last by Hodgman 9 years, 10 months ago

What if the CPU reorders the first two reads, as it is allowed to do...?


Is an x86 processor allowed to reorder reads?
IIRC, the x86 memory model is strongly ordered, so machine code generated as such for x86 is safe, I believe. For weakly-ordered architectures like ARM, equivalent machine code would not be safe without adding memory fences.

In high-level code, double-checked locking is an anti-pattern because, I believe, it's non-portable for this very reason.

On x86 you can even (ab)use 'volatile' to achieve a kind of threading synchronization. This works fine on strongly-ordered systems, but again fails on weakly-ordered ones. Tons of Windows (1st- and 3rd-party) code relied on this, to the extent that the Microsoft compiler gained switches to toggle how volatile behaves after C++11 -- their historical implementation intersected with strongly-ordered systems in a way that was stricter than necessary, and the switches aid porting of legacy Windows code to ARM.
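A minimal sketch of that legacy pattern (the names g_ready/g_data are illustrative, not from any real codebase): a volatile bool abused as a ready flag. It happens to behave on x86 under MSVC's historical /volatile:ms semantics, but standard C++ gives volatile no inter-thread ordering guarantees at all.

```cpp
#include <cstdio>

// Legacy Windows-style pattern: 'volatile' (ab)used as a synchronization flag.
// Only "works" on strongly-ordered x86 with MSVC's old volatile semantics;
// standard C++ makes this a data race with no ordering guarantees.
volatile bool g_ready = false;
int g_data = 0;

void producer() {
    g_data = 42;      // plain write
    g_ready = true;   // volatile write: NOT a release barrier in standard C++
}

int consumer() {
    while (!g_ready)  // volatile read: NOT an acquire barrier in standard C++
        ;
    return g_data;    // may legally observe a stale value on weak hardware
}
```

The portable replacement is std::atomic<bool> with release/acquire ordering, which expresses the same intent without relying on the hardware's memory model.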

throw table_exception("(? ???)? ? ???");


Don't mean to second-guess the GCC authors here, but isn't that the "double-checked locking" anti-pattern?


In high-level code, double-checked locking is an anti-pattern because, I believe, it's non-portable for this very reason.

After a quick search online I'm left with the impression that double-checked locking is not an issue for x86 or x86-64 and that it can be implemented safely (at a high level) in C++11.

After a quick search online I'm left with the impression that double-checked locking is not an issue for x86 or x86-64 and that it can be implemented safely (at a high level) in C++11


Right, if you *assume* that your high-level double-checked locking code will never be compiled for a weakly-ordered system, it should work. But of course the trouble with high-level code is that any fool can unknowingly do just that, and then be subjected to strange and intermittent bugs. That's why it's an anti-pattern.

throw table_exception("(? ???)? ? ???");

After a quick search online I'm left with the impression that double-checked locking is not an issue for x86 or x86-64 and that it can be implemented safely (at a high level) in C++11


Right, if you *assume* that your high-level double-checked locking code will never be compiled for a weakly-ordered system, it should work. But of course the trouble with high-level code is that any fool can unknowingly do just that, and then be subjected to strange and intermittent bugs. That's why it's an anti-pattern.

No, C++11 offers portable high-level double-checked locking as described here: http://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/
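For reference, a sketch of the portable C++11 pattern from the linked article (Widget and the globals are stand-ins for whatever is being lazily created): an atomic pointer carries the acquire/release ordering, and a mutex serializes the one-time construction.

```cpp
#include <atomic>
#include <mutex>

// Sketch of C++11-portable double-checked locking; names are illustrative.
struct Widget { int value = 7; };

std::atomic<Widget*> g_widget{nullptr};
std::mutex g_widget_mutex;

Widget* getWidget() {
    Widget* p = g_widget.load(std::memory_order_acquire);   // first check, lock-free
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_widget_mutex);
        p = g_widget.load(std::memory_order_relaxed);       // second check, under the lock
        if (p == nullptr) {
            p = new Widget();
            g_widget.store(p, std::memory_order_release);   // publish the fully-constructed object
        }
    }
    return p;
}
```

The acquire load pairs with the release store, so a thread that sees a non-null pointer is also guaranteed to see the constructed Widget behind it -- on any architecture, not just x86.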

Ah, right. I see. Despite best efforts, I still sometimes live in a hazy world of what C++ was, what C++11 is supposed to be, and what the various compilers actually are today. It's hard to keep it all straight.

I stand corrected.

However, for the sake of the thread, it's worth clarifying that the old, non-portable pattern that happened to work on strongly-ordered systems (that is, the naive 'it looks right if you're unaware of weakly-ordered systems' pattern) is still broken -- you *can* do portable double-checked locking in C++11, but you have to do it right.
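To make that concrete, here is a sketch of the naive pre-C++11 pattern being warned about (Config and the globals are illustrative). It compiles and even appears to work on x86, which is exactly what makes it dangerous:

```cpp
#include <mutex>

// The naive, NON-portable double-checked locking pattern -- shown only to
// illustrate what is broken. Do not use this.
struct Config { int retries = 3; };

Config* g_naive = nullptr;   // plain pointer: the bug
std::mutex g_naive_mutex;

Config* getConfigBroken() {
    if (g_naive == nullptr) {                        // unsynchronized read: a data race
        std::lock_guard<std::mutex> lock(g_naive_mutex);
        if (g_naive == nullptr)
            g_naive = new Config();                  // nothing stops the pointer store from
    }                                                // being published before Config's fields
    return g_naive;                                  // are visible on a weakly-ordered CPU
}
```

The first read of g_naive is not synchronized with the write inside the lock, so in standard C++ this is undefined behavior regardless of the target architecture.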

throw table_exception("(? ???)? ? ???");

But I use tens of these is_initialised flags in every program and I get no slowdown from them... Does the C++ standard mean that if I use any static variable, it will guard it all with locks? Hell no, I hope.


C++11 changed the required behavior here. Some compilers support much of C++11 but not this feature. Other compilers have compile options to turn it off.

Instead of guessing what the compiler is doing, _look at the assembly output_. I can't stress this enough. Real engineers delve into how the boxes they build off of are constructed.

Consider:


#include <stdlib.h>

int foo() {
  static int bar = rand();
  return bar;
}
On GCC 4.9 with full optimizations, this produces:


foo():
	cmp	BYTE PTR guard variable for foo()::bar[rip], 0
	je	.L2
	mov	eax, DWORD PTR foo()::bar[rip]
	ret
.L2:
	sub	rsp, 24
	mov	edi, OFFSET FLAT:guard variable for foo()::bar
	call	__cxa_guard_acquire
	test	eax, eax
	jne	.L4
	mov	eax, DWORD PTR foo()::bar[rip]
	add	rsp, 24
	ret
.L4:
	call	rand
	mov	edi, OFFSET FLAT:guard variable for foo()::bar
	mov	DWORD PTR [rsp+12], eax
	mov	DWORD PTR foo()::bar[rip], eax
	call	__cxa_guard_release
	mov	eax, DWORD PTR [rsp+12]
	add	rsp, 24
	ret
It won't take a lock every single time, but it does check a global boolean. The gist is something like:


if not initialized
  lock
  if not initialized
    set initial value
    initialized = true
  end if
  unlock
end if
C++11 only requires that function-scope static initialization is thread-safe, so different compilers or different runtimes may implement this less efficiently.

Note that this only applies to initialization of function-local static variables (to non-zero values). The following bit of code can have the lock optimized away with no non-standard effects:


#include <stdlib.h>

bool foo() {
  static bool bar = false;
  if (!bar)
    bar = rand() == 0;
  return bar;
}
Compiles to:


foo():
	movzx	eax, BYTE PTR foo()::bar[rip]
	test	al, al
	je	.L7
	ret
.L7:
	sub	rsp, 8
	call	rand
	test	eax, eax
	sete	al
	mov	BYTE PTR foo()::bar[rip], al
	add	rsp, 8
	ret
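The reason no guard appears there is that initialization to a constant (including zero) is performed at load time, before any thread runs, so there is nothing left to synchronize. A small illustrative sketch:

```cpp
// Constant initialization: 'n' lives in zero-initialized static storage,
// set up before main() starts, so the compiler emits no __cxa_guard_* calls.
int next_id() {
    static int n = 0;   // constant-initialized: no runtime guard needed
    return ++n;         // note: the increment itself is still NOT thread-safe
}
```

Only statics with a runtime (non-constant) initializer pay for the guard check; any later mutation of the variable is the programmer's problem, as with bar above.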

Alright, I seem to understand, though one thing still gives me a little trouble.

I understand it just means that these static initializers are a handy method for 'called-once' functions -- the same thing I often simulate by hand in my code.

The unpleasant thing is that it is serialized implicitly (at least by default). I would prefer a keyword like serialize or something:

void foo()

{

serialize static int f = f();

}

to control it by hand. (C++ goes the wrong way here, though that is no news; as I said, I have been working for years on my own C2 dialect that would mend some things.)

After all, it is still not clear what makes 50 KB of bloat in my app when

turning "static int f = 0; f = f();" into "static int f = f();", when at runtime

this lock should be touched only once.

Is it possible that when the compiler finds this line, it switches the whole application into some multithreaded compilation mode and puts more

locks over other parts of my code, or what?

Or does this bloat come from compiling in some code for this MT support in the background of my binary, with the slowdown coming indirectly from the bloat?

What if the CPU reorders the first two reads, as it is allowed to do...?

Is an x86 processor allowed to reorder reads?

x86 includes an LFENCE instruction, which tells the CPU explicitly NOT to reorder reads past other reads, so I assumed so...

But... the spec says "Reads are not reordered with other reads"... So I guess the point of LFENCE is just to ensure that a read is not moved earlier such that it might occur out of order with some particular write (which itself might be constrained from being moved too, with an SFENCE)?
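In portable C++ you would not issue LFENCE/SFENCE directly anyway; the same constraints are expressed with std::atomic_thread_fence, and the compiler picks whatever instructions (possibly none, on x86) the target needs. A sketch with illustrative names:

```cpp
#include <atomic>

// Release/acquire fences: the portable way to express the ordering that
// SFENCE/LFENCE-style reasoning is reaching for.
std::atomic<bool> g_ready{false};
int g_payload = 0;

void producer() {
    g_payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // payload write can't sink below
    g_ready.store(true, std::memory_order_relaxed);
}

bool consumer(int* out) {
    if (!g_ready.load(std::memory_order_relaxed))
        return false;                                     // not published yet
    std::atomic_thread_fence(std::memory_order_acquire);  // payload read can't hoist above
    *out = g_payload;                                     // guaranteed to see 42
    return true;
}
```

On x86 both fences typically compile to nothing (ordinary loads and stores are already ordered strongly enough); on ARM they become real barrier instructions -- which is the whole portability point of this thread.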

This topic is closed to new replies.
