per thread statics

Started by
4 comments, last by fir 10 years, 2 months ago

I was reading that when I write

void f()
{
  __declspec(thread) static int value = 2;
}

value here is a 'per thread' static.

1) does it really make as many copies as there are threads calling f()?

2) is this slower than doing, for example:

void f(int t)
{
  static int value[4];
}

where t is some ordinal for the thread (here reserved for 4 threads)?

Recently I was rewriting some raytracing code from 1 to 2 threads, and I changed the thread statics to arrays in this second way, indexing into them by thread ordinals (I was unaware of possibility 1 at that moment).
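For reference, each thread that uses such a variable sees its own copy of it. A minimal sketch of the same idea using standard C++11 thread_local (the portable spelling of MSVC's __declspec(thread)); the names here are only illustrative:

#include <cstdio>
#include <thread>

void f()
{
  thread_local int value = 2; // each thread gets its own copy of 'value'
  ++value;
  std::printf( "value in this thread: %d\n", value ); // prints 3 in every thread, since no thread sees another's increment
}

int main()
{
  std::thread a( f ), b( f );
  a.join();
  b.join();
  f(); // the main thread has yet another copy
}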
Option number 2 is indeed better, as long as you know how many threads there will be at most, and you can give them all an index from 0 to max-1.

The first method is handy if you don't know how many threads will be running, or if you can't index them... but the magic behind the scenes to make it work is quite complicated (extremely slow and complicated on some non-x86 CPUs).

Alright, thanks for the info. Usually there is a way of checking how many cores will be running, and indexing the threads by ordinals is usually possible, yes?

like:

for( int i=0; i<16; i++ )
  RunThread( function, i );

though I'm not sure if passing the thread ordinal through the whole call tree is the best practice... I have no experience with that yet.
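e.g. a minimal sketch of that pattern with std::thread, where worker is a hypothetical per-thread function and std::thread::hardware_concurrency is only a hint (it may return 0):

#include <thread>
#include <vector>

void worker( int thread_id, int num_threads )
{
  // per-thread work goes here, using thread_id to index any per-thread data
}

void run_all()
{
  int num_threads = (int)std::thread::hardware_concurrency(); // hint for how many hardware threads exist
  if( num_threads == 0 )
    num_threads = 4; // fall back to something sensible if the hint is unavailable
  std::vector<std::thread> threads;
  for( int i=0; i!=num_threads; ++i )
    threads.emplace_back( worker, i, num_threads ); // pass the thread ordinal at creation
  for( auto& t : threads )
    t.join(); // wait for all workers to finish
}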

Option number 2 is indeed better, as long as you know how many threads there will be at most, and you can give them all an index from 0 to max-1.

The first method is handy if you don't know how many threads will be running, or if you can't index them... but the magic behind the scenes to make it work is quite complicated (extremely slow and complicated on some non-x86 CPUs).


Careful! Option number 2 can be a performance disaster if those variables are updated often. Since the elements of the array are tightly packed, they are likely to live on the same cache line, which means that the synchronization between threads can get very expensive quickly. I have seen this problem slow down a multi-threaded chess program to a crawl, because someone put the counters of nodes visited (updated millions of times per second by each thread) in an array indexed by thread, similarly to option 2.

You should have per-thread data living in separate cache lines. Whoever provides your thread library also provides a mechanism to create thread-local storage, and that mechanism should take care of this low-level issue for you.
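If you do keep a hand-rolled array indexed by thread ordinal, one common fix is to pad each slot out to a whole cache line so that neighbouring slots never share one. A minimal sketch, assuming a 64-byte cache line:

struct alignas(64) PerThreadCounter // alignas(64) assumes a 64-byte cache line
{
  long long nodes_visited; // sizeof(PerThreadCounter) is rounded up to a multiple of the alignment,
};                         // so each array element occupies its own cache line

PerThreadCounter counters[16]; // one slot per thread: adjacent slots can no longer falsely share a line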

Option number 2 can be a performance disaster if those variables are updated often. Since the elements of the array are tightly packed, they are likely to live on the same cache line, which means that the synchronization between threads can get very expensive quickly.

Good point about false-sharing!

I was only thinking about user-implemented indexing vs automatic "thread local" indexing.
My point was that the thread-local storage provided by threading APIs (or by compiler annotations as in the OP, which just call those APIs behind the scenes for you) is necessarily complex, because it has to work in extremely general cases. It doesn't know how many threads will need storage for your variables, when those threads will be initialized, how many of these thread-local variables there will be, etc...
In the best case, you've probably got a reserved register in the CPU which holds a pointer (specific to your thread) to a table of pointers; from that you fetch another pointer into yet another table of pointers, which finally points to your variable. Each access will also do some branching/tests to handle initialization. That's a bunch of indirections and branches on every single access of your variable, in exchange for having a general-purpose solution. It's not ideal. If you're using one of these systems, you should treat that variable as being special and optimize for reduced accesses to it -- e.g. instead of continually summing results into it in a loop, you'd sum results into a local, which is summed into the TLS variable at the end of the loop.
In the worst case, your architecture doesn't even support TLS, or if it does, it supports it in some horrible way that imposes a cost across the board, like C++ exceptions or RTTI...
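A sketch of that "reduce accesses" suggestion; the thread-local counter and the loop here are only illustrative:

thread_local long long g_total; // assumed TLS counter, paying the slow path described above on every access

void accumulate( const int* values, int count )
{
  long long local = 0;          // cheap: lives in a register or on the stack
  for( int i=0; i!=count; ++i )
    local += values[i];
  g_total += local;             // touch the thread-local variable once, not once per iteration
}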

If you do implement a specific version yourself, it can be a lot simpler and more predictable.

The other issue with the OP is that he's using a static variable (which also may insert hidden mutable state and branching to handle initialization, etc). It would be much better if the function was fairly pure, such as:


void f(int thread_id, int num_threads, int* output)
{
  //cost of using *output is the same as any other pointer-variable, no platform-specific background magic going on here!
}

You can make the above work with an extremely simple "thread locals" system such as below:


//you know what kind of data your threads need:
struct ThreadLocals
{
  int a, b, c;
};
// you know how many of them there are, and that you can pre-initialize them all at once:
ThreadLocals** init( int num_threads )
{
  ThreadLocals** thread_locals = (ThreadLocals**)malloc( sizeof(ThreadLocals*) * num_threads );
  int cache_line_size = GetCacheLineSize();//<-- GetCacheLineSize/malloc_aligned are the only platform-specific bits
  int size  = max( (int)sizeof(ThreadLocals),  cache_line_size );
  int align = max( (int)alignof(ThreadLocals), cache_line_size );
  for( int i=0; i!=num_threads; ++i )
    thread_locals[i] = (ThreadLocals*)malloc_aligned( size, align );//align and round up the allocation to whole cache lines to avoid any false sharing
  return thread_locals;
}
 

void run_job( int num_threads )
{
  ThreadLocals** thread_locals = init( num_threads );
  for( int i=0; i!=num_threads; ++i )
    launch[i] f( i, num_threads, &thread_locals[i]->a ); // "launch[x] Func(a)" is pseudo-code for "have thread #x run Func(a)"
 
  sync();//when all threads are done
  int total = 0;
  for( int i=0; i!=num_threads; ++i )
    total += thread_locals[i]->a;
  printf( "Total was %d", total );

  shutdown( thread_locals, num_threads );//opposite of the init func
}

The other issue with TLS APIs is that loops like the "total += ..." one at the end of the above code block -- where one thread collects the results of all the other workers -- often aren't possible (i.e. you can't access the "thread local" values from other threads), so you're required to use ugly workarounds, such as extra tasks per thread that read their own thread-local values and copy them into storage that has been allocated using the above method anyway!

With this cross-platform KISS approach, initialization/de-initialization/collation/syncing is all straightforward and explicit.

[edit]

Getting off topic, but when you've got functions that take the thread-index and number-of-threads as arguments, I find it very handy to have a function such as below, which lets you determine a range of inputs for processing:


inline void DistributeTask( int workerIndex, int numWorkers, int items, int* begin, int* end )
{
	int perWorker = items / numWorkers;
	*begin = perWorker * workerIndex;
	*end = (workerIndex==numWorkers-1)
		? items                      //ensure last thread covers whole range
		: *begin + perWorker;
	*begin = perWorker ? *begin : min(workerIndex, items);   //special case:
	*end   = perWorker ? *end   : min(workerIndex+1, items); //fewer items than workers
}
//e.g.
void sum(int thread_id, int num_threads, int* output, int inputs[], int num_inputs)
{
  int begin, end;// determine the range of the inputs that this thread will process
  DistributeTask( thread_id, num_threads, num_inputs, &begin, &end );

  int total = 0;
  for( int i=begin; i!=end; ++i )//process this range only
    total += inputs[i];
  *output = total;
}
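As a quick sanity check of DistributeTask: with 10 items and 3 workers, worker 0 gets the range [0,3), worker 1 gets [3,6), and worker 2 gets [6,10) (the last worker picks up the remainder); with 2 items and 4 workers it falls into the special case, so workers 0 and 1 get one item each and workers 2 and 3 get empty ranges.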

Careful! Option number 2 can be a performance disaster if those variables are updated often. Since the elements of the array are tightly packed, they are likely to live on the same cache line, which means that the synchronization between threads can get very expensive quickly. I have seen this problem slow down a multi-threaded chess program to a crawl, because someone put the counters of nodes visited (updated millions of times per second by each thread) in an array indexed by thread, similarly to option 2.

You should have per-thread data living in separate cache lines. Whoever provides your thread library also provides a mechanism to create thread-local storage, and that mechanism should take care of this low-level issue for you.

I was using this static table[2] with the elements lying close to one another, and probably colliding very often. They were { intersection.point, intersection.normal, intersection.distance } structures, updated in almost every per-pixel operation (thread 1 was rendering the upper part of the screen, thread 2 the lower part), so it should have been colliding on maybe 200,000 writes per frame, or somewhat less if not every pixel hit an intersection. (No, sorry, that would be only 20k pixels for half the screen; I forgot I run at a low resolution of 200x200 pixels. But it gets multiplied by the number of intersection tests, because I test all figures in a brute-force loop, of which I have 20, so in theory 200x200x20, in practice probably about 10k writes to this static per frame. Well, not important; I also have other statics in these ray-pixel routines whose writes are hard to count but frequent.)

If you say so, it should be colliding heavily on both reads and writes, but I got no slowdown: my frame time with one thread was exactly 37.1 ms, and with two threads I got exactly 21.4 ms (of which 2 ms was a base sleep() plus maybe 0.5 ms for the loop), so it worked quite ok. With an ideal division by 2 it would be 20 ms, so it seems I was only about 2 ms, 10%, away from ideal in my case.

