Those articles are from 2006. That's when multicore CPUs just started appearing. There's nothing weird about bugs in new technology. There hasn't been a problem for 10 years since then.
As I tried to explain, there is a problem in 2017, and the problem is that QueryPerformanceCounter is implemented incorrectly.
If I break in the debugger and single-step instructions on my machine, I get this:
0x401613 callq *0x1bd53(%rip) <__imp_QueryPerformanceCounter>
...
0x771559a0 jmp 0x771559a8 <QueryPerformanceCounter+8>
...
0x771559a8 jmpq *0x882a2(%rip)
...
0x77389fd0 sub $0x28,%rsp
0x77389fd4 testb $0x1,0x7ffe02ed
0x77389fdc mov %rcx,%r9
0x77389fdf je 0x773edde0 <ntdll!EtwEventSetInformation+64608>
0x77389fe5 mov 0x7ffe03b8,%r8
0x77389fed rdtsc
0x77389fef movzbl 0x7ffe02ed,%ecx
0x77389ff7 shl $0x20,%rdx
0x77389ffb or %rdx,%rax
0x77389ffe shr $0x2,%ecx
0x7738a001 add %r8,%rax
0x7738a004 shr %cl,%rax
The translation of that is:
jump around
jump around
jump around
rdtsc
do some shit
return
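To spell out the "do some shit" part: as far as I can read the listing, the instructions after rdtsc combine edx:eax into one 64-bit value, add a bias loaded from 0x7ffe03b8, and shift the sum right by the upper six bits of the byte at 0x7ffe02ed. Here is a minimal sketch of just that arithmetic in C. The function and parameter names are made up for illustration, and the values are passed in as arguments rather than read from those fixed addresses, so this is a model of the listing, not the actual ntdll source.

```c
#include <stdint.h>

/* Hypothetical model of the arithmetic after rdtsc in the listing above.
 * tsc   : raw 64-bit counter, (rdx << 32) | rax
 * bias  : the 64-bit value the listing loads from 0x7ffe03b8
 * flags : the byte the listing loads from 0x7ffe02ed
 *         (bit 0 is what testb checks, the shift count is flags >> 2)
 */
static uint64_t qpc_from_tsc(uint64_t tsc, uint64_t bias, uint8_t flags)
{
    unsigned shift = (unsigned)(flags >> 2);  /* shr $0x2,%ecx  */
    return (tsc + bias) >> shift;             /* add %r8,%rax ; shr %cl,%rax */
}
```

For example, with flags = 0x29 the shift count is 10, so a biased counter of 1024 ticks comes out as 1. Nothing in this arithmetic constrains when the rdtsc itself executes relative to the surrounding instructions.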
The implementation uses rdtsc and does not serialize. rdtsc is not a synchronizing instruction, so the correct pattern prior to the availability of rdtscp was cpuid; rdtsc. cpuid is the only other instruction available in user mode that does that job, although it is a tad expensive (an extra 30 or so cycles). Therefore no measurement that you make is accurate to anywhere near the presumed precision.
The pipeline will be full or half-full or empty, depending on what processor you run on, what its pipeline depth is, what instructions were executed prior to calling QueryPerformanceCounter, and whether those three jumps incidentally caused enough delay to retire all in-flight operations. If you only care about millisecond or possibly microsecond resolution then that's of course alright, because in that case... who cares anyway. But if you talk in terms of ten-nanosecond resolution like QPC does, then this is just shit. It is nothing more and nothing less than an incorrect implementation.
If you use the rdtscp instruction instead, you don't save anything performance-wise compared to calling into ntdll, even though the code is inline (being half-serializing is still surprisingly expensive). What matters is that your measurements are correct. That is, the point in time that your measurement refers to is well-defined, not random.
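For reference, here is roughly what I mean, as a minimal sketch for an x86-64 compiler (GCC, Clang or MSVC). __rdtscp and _mm_lfence are the standard compiler intrinsics. The helper name read_tsc_ordered is made up, and the trailing lfence is my own addition, on the assumption that you also want to stop later instructions from starting before the counter is read. The result is in raw TSC ticks, not QPC units.

```c
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtscp, _mm_lfence on MSVC */
#else
#include <x86intrin.h>   /* __rdtscp, _mm_lfence on GCC/Clang */
#endif

/* Hypothetical helper: read the time stamp counter at a well-defined point. */
static inline uint64_t read_tsc_ordered(void)
{
    unsigned int aux;               /* receives IA32_TSC_AUX, unused here */
    uint64_t tsc = __rdtscp(&aux);  /* waits until earlier instructions have executed */
    _mm_lfence();                   /* keeps later instructions from starting early */
    (void)aux;
    return tsc;
}
```

Call it once before and once after the code under test and subtract the two values. Converting ticks to seconds needs the TSC frequency, which this sketch does not try to determine.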