Option number 2 can be a performance disaster if those variables are updated often. Since the elements of the array are tightly packed, they are likely to live on the same cache line, which means that the synchronization between threads can get very expensive quickly.
Good point about false-sharing!
I was only thinking about user-implemented indexing vs automatic "thread local" indexing.
My point was that the thread-local storage provided by threading APIs (or compiler annotations as in the OP, which just call these APIs behind the scenes for you) is that they are necessarily complex because they need to work in extremely general cases. They don't know how many threads will need storage for your variables, when those threads will be initialized, how many of these thread-local variables there will be, etc...
In the best case, you've probably got a reserved register in the CPU, which holds a pointer (specific to your thread) to a table of pointers, which you then fetch another pointer out of to another table of pointers, which then points to your variable. Each access will do some branching/tests to handle initialization. That's a bunch of indirections and branches on every single access of your variable, in exchange for having a general purpose solution. It's not ideal. If you're using one of these systems, you should treat that variable as being special, and optimize for reduced accesses to it -- e.g. instead of continually summing results into it in a loop, you'd sum results into a local, which is summed into the TLS variable at the end of the loop.
In the worst case, your architecture doesn't even support TLS, or if it does, it supports it in some horrible way that imposes a cost across the board, like C++ exceptions or RTTI...
If you do implement a specific version yourself, it can be a lot simpler and predictable
The other issue with the OP is that he's using a static variable (which also may insert hidden mutable state and branching to handle initialization, etc). It would be much better if the function was fairly pure, such as:
void f(int thread_id, int num_threads, int* output)
{
//cost of using *output is the same as any other pointer-variable, no platform-specific background magic going on here!
}
You can make the above work with an extremely simple "thread locals" system such as below:
//you know what kind of data your threads need:
struct ThreadLocals
{
int a, b, c;
};
// you know how many of them there are, and that you can pre-initialize them all at once:
ThreadLocals** init( int num_threads )
{
ThreadLocals** thread_locals = malloc( sizeof(ThreadLocals*) * num_threads );
int cache_line_size = GetCacheLineSize();//<-- this function is the only platform-specific bit
int size = max( sizeof(ThreadLocals), cache_line_size );
int align = max( alignof(ThreadLocals), cache_line_size );
for( int i=0; i!=num_threads; ++i )
thread_locals[i] = malloc_aligned( size, align );//align and round-up the allocation to whole cache lines to avoid any false sharing
return thread_locals;
}
void run_job( int num_threads )
{
ThreadLocals** thread_locals = init( num_threads );
for( int i=0; i!=num_threads; ++i )
launch[i] f( i, num_threads, &thread_locals[i]->a ); // "launch[x] Func(a)" is psuedo for have thread #x run Func(a)
sync();//when all threads are done
int total = 0;
for( int i=0; i!=num_threads; ++i )
total += thread_locals[i]->a;
printf( "Total was %d", total );
shutdown( thread_locals, num_threads );//opposite of the init func
}
The other issue with TLS APIs is that often, loops like the "total += ..." one at the end of the above code block -- where one thread collects the results of all the other workers -- sometimes aren't possible (i.e. you can't ever access the "thread local" values from other threads), so you're required to use ugly work-arounds, such as extra tasks per thread that read their thread-local values and copy them into values that have been allocated using the above method anyway!
With this cross-platform KISS approach, initialization/de-initialization/collation/syncing is all straightforward and explicit.
[edit]
Getting off topic, but when you've got functions that take the thread-index and number-of-threads as arguments, I find it very handy to have a function such as below, which lets you determine a range of inputs for processing:
inline void DistributeTask( int workerIndex, int numWorkers, int items, int* begin, int* end )
{
int perWorker = items / numWorkers;
*begin = perWorker * workerIndex;
*end = (workerIndex==numWorkers-1)
? items //ensure last thread covers whole range
: *begin + perWorker;
*begin = perWorker ? *begin : min(workerIndex, items); //special case,
*end = perWorker ? *end : min(workerIndex+1, items); //less items than workers
}
//e.g.
void sum(int thread_id, int num_threads, int* output, int inputs[], int num_inputs)
{
int begin, end;// determine the range of the inputs that this thread will process
DistributeTask( thread_id, num_threads, num_inputs, &begin, &end );
int total = 0;
for( int i=begin; i!=end; ++i )//process this range only
total += inputs[i];
*output = total;
}