Don't try to implement that unless you're aware of the memory, cache, atomicity and reordering behaviours of your target CPU and compiler.

"I also found a platform-independent way of doing it: the Peterson's algorithm"
If you implement that code as-is with plain variables on any modern multi-core CPU, it isn't guaranteed to work as a critical section. You'd have to insert memory fences (or use atomic operations with suitable ordering) to make it function correctly.
For example, your compiler or CPU might reorder the writes to flag[n] and turn, or the write that sets flag[n] to false at the end might become visible to the other thread before the changes made inside the critical section do!
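
To make that concrete, here's a rough sketch of Peterson's algorithm for two threads using C++11 atomics. It assumes a C++11 (or later) compiler and relies on sequentially consistent operations on flag and turn so that neither the compiler nor the CPU may reorder them. The names PetersonLock and run_counter_demo are made up for illustration and aren't from the code being discussed. Treat it as a demonstration of where the ordering matters, not as something to use in place of std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical two-thread Peterson lock; thread ids must be 0 or 1.
struct PetersonLock {
    std::atomic<bool> flag[2] = {{false}, {false}};
    std::atomic<int> turn{0};

    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        // These two stores must not be reordered with each other or with
        // the loads below; seq_cst forbids that for compiler and CPU alike.
        flag[self].store(true, std::memory_order_seq_cst);
        turn.store(other, std::memory_order_seq_cst);
        while (flag[other].load(std::memory_order_seq_cst) &&
               turn.load(std::memory_order_seq_cst) == other) {
            std::this_thread::yield();  // busy-wait politely
        }
    }

    void unlock(int self) {
        // This store also acts as a release: writes made inside the
        // critical section become visible before flag[self] reads false.
        flag[self].store(false, std::memory_order_seq_cst);
    }
};

// Illustrative helper: two threads increment a plain (non-atomic) counter
// under the lock and the final value is returned.
long run_counter_demo(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;  // the critical section
            lock.unlock(self);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

If you swap the atomics for plain bool and int, the same code is a data race and the counter can come out wrong, which is exactly the failure described above.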