Firstly, what makes it different to debugging regular code? There are a few categories of bugs that simply don't exist in regular code. The prime one is the "race condition," a bug that only occurs when things happen in a particular order. You've also got deadlock, priority inversion, etc. These are all related to the fact that more than one thing is happening at once.
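To make "race condition" concrete, here's a toy sketch (all names my own): a shared counter incremented by two "threads", where the increment is really two steps - read, then write - and a Python generator's yield marks where a context switch could land.

```python
# Toy race condition: a shared counter incremented by two "threads".
# The increment is non-atomic - read, then write - and the yield
# marks the spot where a context switch could occur.

counter = 0

def increment():
    global counter
    tmp = counter       # step 1: read the shared value
    yield               # ...a context switch may happen here...
    counter = tmp + 1   # step 2: write it back

def run(schedule):
    """Replay a schedule like ['A', 'B', 'A', 'B'], stepping the
    named thread by one instruction per entry."""
    global counter
    counter = 0
    gens = {"A": increment(), "B": increment()}
    for name in schedule:
        try:
            next(gens[name])
        except StopIteration:
            pass
    return counter

print(run(["A", "A", "B", "B"]))  # serial order: prints 2, as intended
print(run(["A", "B", "A", "B"]))  # both read 0, both write 1: prints 1
```

Same code, same inputs - the only thing that changed between the two runs is the order of events, which is exactly what makes this class of bug so slippery.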
The problem is that these bugs are difficult to debug precisely because more than one thing is happening at once, and the order in which things happen is unpredictable. If that file load had taken just 5ms longer, it wouldn't have finished by the time the resource manager tried reading it, and so on. If you serialise things - just run each thread one after another - the problem will likely disappear, because nothing overlaps any more and the troublesome interactions never occur. Can we reconcile the two? I smell a tool.
Say you've got three processes that might be running simultaneously; let's just consider them as functions. If you wanted to test for interactions between them (and error states arising from such interactions), you might write a simple unit test that kicks off the three threads, lets them run to completion, checks for errors, and does this 1000 times in the hope of catching a reasonable number of different sequences. If we assume we're running on a single CPU, then our functions are actually being serialised as they execute - thread1, thread1, thread1, contextswitch, thread2, thread2, contextswitch, thread3 - and we have no control over how these instructions and context switches are generated and interleaved. We run the test 1000 times and hope that the resulting sequences represent a good portion of all the possible ways things could happen.
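That brute-force approach might look something like the sketch below (the worker and the invariant are placeholders of my own). Note that here the counter is lock-protected, so the test should always pass - which neatly illustrates the weakness: a quiet run proves very little about the schedules you didn't happen to hit.

```python
import threading

counter = 0
lock = threading.Lock()

def worker():
    # placeholder workload: 1000 lock-protected increments
    global counter
    for _ in range(1000):
        with lock:
            counter += 1

def stress_once():
    """One iteration: spawn three threads, join, check the invariant."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter == 3000   # the invariant under test

# Repeat and count failures, hoping the OS scheduler produces
# enough distinct interleavings to expose a bug (if there is one).
failures = sum(1 for _ in range(100) if not stress_once())
print(failures)  # 0 for this (safe) worker; an unsafe one *might* fail
```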
I'm not content with that. If the CPU can take machine code from three threads and interleave them with context switches in between, why can't we? That way we'd be able to control how and when the switches happen; we could execute one instruction from each thread at a time, or five from the first for every one of the second, or we could simulate the effect of a context switch /right in the middle/ of a particularly critical operation...
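A sketch of that idea, assuming we model each thread as a Python generator that yields once per simulated instruction (the names here are my own, not a real API). The pattern says how many steps of each thread to run before switching:

```python
def steps(name, n):
    # a fake "thread": n instructions, yielding a label for each one
    for i in range(n):
        yield f"{name}{i}"

def interleave(threads, pattern):
    """Drive each generator `pattern[name]` steps at a time,
    round-robin, until every thread is exhausted."""
    trace, done = [], set()
    while len(done) < len(threads):
        for name, gen in threads.items():
            if name in done:
                continue
            for _ in range(pattern[name]):
                try:
                    trace.append(next(gen))
                except StopIteration:
                    done.add(name)
                    break
    return trace

# two instructions of A for every one of B:
threads = {"A": steps("A", 4), "B": steps("B", 2)}
print(interleave(threads, {"A": 2, "B": 1}))
# → ['A0', 'A1', 'B0', 'A2', 'A3', 'B1']
```

Changing the pattern changes exactly where the simulated context switches fall, which is the control the real scheduler denies us.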
Even better, we could generate sequences programmatically. Say we want to run 3 functions, A, B, and C, and each function is 100 instructions long, with no odd critical sections or anything like that. We start with function A in its plain form, take function B, and squash bits of it in between A's instructions. How many possible ways can you interleave two 100-instruction functions, you may well ask? I'm not sure. It's probably quite a lot, involving 100-factorial terms. If anyone figures it out, tell me, please. Anyway, once you've done that you've got a 200-instruction sequence (+ context switches) and can interleave C in the same way. The number of sequences increases exponentially with each additional thread, I think... there are probably some degenerate cases that could be eliminated that haven't occurred to me yet.
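For what it's worth, this is a standard counting problem: the number of order-preserving interleavings of an m-instruction thread with an n-instruction thread is the binomial coefficient C(m+n, n), and each extra thread multiplies in another binomial factor. A quick check of the numbers:

```python
import math

# interleavings of two 100-instruction threads: C(200, 100),
# i.e. 200! / (100! * 100!) - the "100-factorial terms" hunch is right
two = math.comb(200, 100)

# merging a third 100-instruction thread into the 200-step result
# adds a factor of C(300, 100), for 300! / (100!^3) in total
three = two * math.comb(300, 100)

print(f"two threads:   {two:.3e}")
print(f"three threads: {three:.3e}")
```

C(200, 100) is already on the order of 10^58, so exhaustively enumerating every sequence is hopeless at this scale; any practical tool would have to sample, or cut the space down by only switching at interesting points.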
You take your billions of generated sequences (yeah, it probably is billions, at least) and run each one through the unit tests. Any test that fails, you note down, recording the sequence that failed it. You'd probably need some smartness in the harness so it can figure out when something has become stuck in deadlock and abort, instead of waiting forever for the test to complete. You could probably also group failing sequences by failure point to produce more useful information.
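A miniature version of that pipeline, using a toy non-atomic counter increment as the system under test: enumerate every schedule of two 2-step threads, replay each one, and group the bad schedules by the final value they produce. (A real harness would also need the step budget or timeout mentioned above to catch deadlocked schedules; this sketch omits it.)

```python
from collections import defaultdict
from itertools import permutations

counter = 0

def increment():
    # non-atomic increment: read, possible context switch, write
    global counter
    tmp = counter
    yield
    counter = tmp + 1

def run(schedule):
    """Replay one schedule, stepping the named thread per entry."""
    global counter
    counter = 0
    gens = {"A": increment(), "B": increment()}
    for name in schedule:
        try:
            next(gens[name])
        except StopIteration:
            pass
    return counter

# every distinct schedule of two 2-step threads, tested exhaustively;
# failures are grouped by outcome, standing in for "failure point"
failures = defaultdict(list)
for sched in sorted(set(permutations("AABB"))):
    result = run(sched)
    if result != 2:               # the invariant: both increments land
        failures[result].append("".join(sched))

print(dict(failures))
# → {1: ['ABAB', 'ABBA', 'BAAB', 'BABA']}
```

Four of the six possible schedules lose an update, and the grouping tells you at a glance that every failure is the same lost-write bug - which is precisely the kind of actionable output the 1000-random-runs approach can't give you.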