I am under the impression that std::async and std::mutex can solve every basic-to-intermediate multi-threading problem.
Multi-threading (in this scope) is used to
1. increase performance on multi-core machines (to use all CPU cores)
2. avoid CPU stalls, e.g. L2 cache misses, especially when executing a ton of indirection (a->b->c->d)
Here is an example :-
std::vector<std::future<void>> v;
v.push_back(std::async( &A::doThis, pointerToAInstance ));
v.push_back(std::async( &B::doThis, pointerToBInstance ));
v[0].wait();
v[1].wait();
This can easily be applied to a scenario where I have a long array of objects that need to be expensively processed.
If I want to split jobs again and again, I can call std::async within std::async within std::async ...... looks nice.
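To make the "std::async within std::async" idea concrete, here is a minimal sketch of a recursive divide-and-conquer sum. The function name `parallelSum` and the cutoff value are my own illustrative choices, not anything standard:

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Recursively split [first, last) and sum it with nested std::async calls.
// Below a cutoff we sum sequentially, so the recursion does not spawn an
// unbounded number of threads.
long long parallelSum(const std::vector<int>& data, std::size_t first,
                      std::size_t last) {
    const std::size_t cutoff = 1024;  // arbitrary tuning constant
    if (last - first <= cutoff)
        return std::accumulate(data.begin() + first, data.begin() + last, 0LL);

    const std::size_t mid = first + (last - first) / 2;
    // std::async within std::async: the left half runs on another thread
    // while this thread recurses into the right half.
    auto left = std::async(std::launch::async, parallelSum, std::cref(data),
                           first, mid);
    long long right = parallelSum(data, mid, last);
    return left.get() + right;
}
```

One caveat: each level of the recursion creates a real thread (because of `std::launch::async`), so the cutoff matters; without it, thread-creation overhead can easily eat the speedup.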
By the way, I heard that OpenMP is a popular library for multi-core programming (https://en.wikipedia.org/wiki/OpenMP).
However, as I research it, I find that most of its features can be replaced by the simpler std::async and std::mutex.
Is OpenMP just a heavyweight library that provides syntactic sugar?
I feel that I missed something.
Edit
@Josh Petrie and @SeanMiddledtich, thanks for the useful information.
Here is my specific experiment :-
My application is a component-based-architecture game engine that I created.
I have little experience with multi-threading, but the architecture tends to support it inherently - it divides jobs into many systems (about 13).
Each system manages and accesses at most 2-3 components (the whole game = 20 components).
Therefore, most of the systems can be executed in parallel.
Step 1
After I learned a bit about std::async and std::mutex, I used them to parallelize the engine.
I create a std::future from each system, then wait() on all of them at the end of each timestep.
The modification is very conservative and highly aware of concurrent modification and order of execution, e.g. physics cannot be grouped with graphics.
Just this rough modification (4 chosen systems run in parallel; the other systems are executed one by one as before) increased my program's performance from 40 to 50 fps.
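The per-timestep scheme in Step 1 can be sketched roughly as below. `System` and its `update()` are hypothetical stand-ins for the engine's real systems, which I have not seen:

```cpp
#include <future>
#include <vector>

// Hypothetical stand-in for one engine system; the real update() would do
// the expensive per-frame work.
struct System {
    int updates = 0;
    void update() { ++updates; }
};

void runTimestep(std::vector<System>& systems) {
    std::vector<std::future<void>> futures;
    futures.reserve(systems.size());
    // Launch every independent system as its own task...
    for (System& s : systems)
        futures.push_back(std::async(std::launch::async, &System::update, &s));
    // ...then block at the end of the timestep, so the next frame starts
    // from a consistent world state.
    for (std::future<void>& f : futures)
        f.wait();
}
```

Each `System` is touched by exactly one thread per timestep, so no mutex is needed here; the `wait()` loop is the only synchronization point.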
Step 2
When I found a bottleneck in a certain system, I divided all components of a certain type into groups (about 4), then processed each group on its own thread.
When a group has to access some shared data, which is not often, I use std::mutex.
Again, after modifying just the single most CPU-intensive function, overall performance went from 50 to 65 fps.
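A minimal sketch of the Step 2 pattern, assuming the components can be modeled as a flat array and the shared data as one guarded accumulator (both are my own placeholders, not the engine's real types):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <mutex>
#include <numeric>
#include <vector>

// Hypothetical shared state; the mutex guards the rare shared writes.
struct Shared {
    std::mutex mtx;
    long long total = 0;
};

void processGroup(const std::vector<int>& components, std::size_t begin,
                  std::size_t end, Shared& shared) {
    // The heavy per-component work runs lock-free on this group's slice...
    long long local = std::accumulate(components.begin() + begin,
                                      components.begin() + end, 0LL);
    // ...and only the occasional shared-data access takes the mutex.
    std::lock_guard<std::mutex> lock(shared.mtx);
    shared.total += local;
}

void processAll(const std::vector<int>& components, Shared& shared,
                std::size_t groups = 4) {
    std::vector<std::future<void>> futures;
    const std::size_t chunk = (components.size() + groups - 1) / groups;
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t begin = std::min(g * chunk, components.size());
        const std::size_t end = std::min(begin + chunk, components.size());
        futures.push_back(std::async(std::launch::async, processGroup,
                                     std::cref(components), begin, end,
                                     std::ref(shared)));
    }
    for (auto& f : futures) f.wait();
}
```

The key to it feeling "free" is that the lock is taken once per group, not once per component; if the shared access were frequent, the mutex would serialize the threads again.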
These two experiments created the feeling that the speedup is free for me. It is so easy ... which looks like a trap.
The reason might be that I had tried to optimize my program using pooling for a few weeks and gained only 5%, while this multi-core work took only 1 day to learn & code.
Edit 2: Yay, thanks everyone for the great knowledge! I should have posted this sooner. Spamming +1.