I was somewhat hoping the old "let's put each module in a separate thread" approach had died out a while ago, when developers realized that having all these threads constantly compete for the same resources is a messy nightmare of synchronization and race-condition bugs. The one exception is when you can make sure that all the modules working in parallel only read the shared data. Otherwise, the alternatives usually involve either a lot of copying or a lot of locking, so don't be surprised if the overhead ends up making the whole thing pointless.
A first step is to simply use constructs like parallel_for to split one huge chunk of work into pieces that are processed in parallel (note: make sure each element can be processed independently of the others). After that, look into task-based parallelism. Intel's TBB library, and especially its documentation, might be a good place to start.
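To make the parallel_for idea concrete, here is a rough sketch using only the C++ standard library. It is not TBB's API: `parallel_for_sketch` and `squares` are hypothetical names made up for illustration, and a real library would add work stealing, a reusable thread pool and exception handling that this version leaves out. The point is only to show the shape: split an index range into chunks, and let each thread touch only its own elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not TBB's API: calls body(i) for every i in [begin, end),
// splitting the range into contiguous chunks, one chunk per worker thread.
template <typename Body>
void parallel_for_sketch(std::size_t begin, std::size_t end, Body body) {
    assert(begin <= end);
    const std::size_t count = end - begin;
    if (count == 0) return;

    // hardware_concurrency() may return 0 when the value is unknown.
    std::size_t workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;  // arbitrary fallback, an assumption
    workers = std::min(workers, count);

    const std::size_t chunk = (count + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> threads;
    threads.reserve(workers);

    for (std::size_t start = begin; start < end; start += chunk) {
        const std::size_t stop = std::min(start + chunk, end);
        threads.emplace_back([start, stop, &body] {
            for (std::size_t i = start; i < stop; ++i) body(i);
        });
    }
    // Wait for every chunk before returning, so callers see all results.
    for (auto& t : threads) t.join();
}

// Example use: each element is written by exactly one thread and depends
// on no other element, so no locking is needed.
std::vector<long> squares(std::size_t n) {
    std::vector<long> out(n);
    parallel_for_sketch(0, n, [&out](std::size_t i) {
        out[i] = static_cast<long>(i) * static_cast<long>(i);
    });
    return out;
}
```

Note that the lambda only writes to `out[i]` for its own `i`. As soon as the body needs to read or write a neighbouring element or some shared counter, you are back to the locking and copying problems described above.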