TSP brute force state change algorithm


[quote name='stu2000']Thanks for that. I did notice that I missed quite a few and it was flawed. Unfortunately your suggestion won't work for me, for two reasons.
1) I didn't understand what you meant when it came to "set index := index%4". Is that different from "index = index%4"?[/quote]
No, the two should be considered the same.


[quote name='stu2000']2) I'm actually using this for CUDA and will have a single thread calculating each and every permutation. The index will be generated from threadIdx.x + blockIdx.x etc.[/quote]That's really not how you're supposed to use CUDA threads. If I recall, even the latest modern hardware limits you to about 1500 threads total in a kernel. CUDA uses for-loops just like everyone else.

[quote name='stu2000']function CalculatePermutation( int index )

I wouldn't need to pass the final index from that into the next call to CalculatePermutation, would I?

i.e. can I run something like:

for ( int i = 0; i < 25; i++ )
{
    CalculatePermutation( i );
}
[/quote]
Yeah, but it would be silly to do so, because of the extra work compared to just finding the next permutation.
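If it helps, here's roughly what that decoding looks like using the factorial number system (factoradic). The n parameter and the perm output array are additions for the sketch, and it only works for n up to about 12, since n! has to fit in an int:

/* Decode permutation number `index` (0 .. n!-1) of the items 0..n-1
   into perm[], using the factorial number system (factoradic).
   Only valid for n <= 12 or so, since n! must fit in an int. */
void CalculatePermutation( int index, int n, int *perm )
{
    int elems[16];                 /* pool of not-yet-used elements */
    int fact = 1;                  /* will hold (n-1)! */
    int remaining = n;

    for ( int i = 0; i < n; i++ )
        elems[i] = i;
    for ( int i = 2; i < n; i++ )
        fact *= i;

    for ( int i = 0; i < n; i++ )
    {
        int pos = index / fact;    /* which unused element comes next */
        index %= fact;
        perm[i] = elems[pos];
        for ( int j = pos; j < remaining - 1; j++ )   /* close the gap */
            elems[j] = elems[j + 1];
        remaining--;
        if ( remaining > 0 )
            fact /= remaining;
    }
}

Each call does O(n^2) work from scratch; that's the extra work compared to just stepping from one permutation to the next.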

[quote]That's really not how you're supposed to use CUDA threads. If I recall, even the latest modern hardware limits you to about 1500 threads total in a kernel. CUDA uses for-loops just like everyone else.[/quote]

From reading CUDA by Example, you are limited to 512 threads per block and are allowed to run up to 65535 blocks. From everything I read, CUDA works best the more parallel blocks/threads you can have. CUDA applications only show a huge performance increase over CPUs when you launch hundreds or thousands of parallel actions. I probably should not have used the word thread earlier, but it seemed more appropriate than explaining launching many parallel kernels/blocks with many threads. I'm still not 100% on it, as you can see.

Stu


[quote name='stu2000']From reading CUDA by Example, you are limited to 512 threads per block and are allowed to run up to 65535 blocks.[/quote]


Yes, you can have up to 512 threads per block. Yes, the API allows up to 65535 blocks per grid dimension. No, that does not mean 65535 blocks of 512 threads each all run at the same time.

[quote name='stu2000']From everything I read, CUDA works best the more parallel blocks/threads you can have.[/quote]
Of course that's true. But think about it logically. Your GPU does not have 65535*512 ≈ 33.5 million processing elements in it. It cannot do 33.5 million things at the same time. Why would it? Your monitor doesn't even have that many pixels.

A CUDA thread is not intended to process exactly one thing and then die. The overhead involved in the creation of a thread is small, but it is greater than the overhead of processing exactly one thing.

Bottom line: You will have thousands of threads. You will not have millions. Each thread should do thousands or millions of things, not one thing.
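To make that concrete, here's a sketch of the pattern I mean: a modest, fixed-size grid where every thread strides through many permutation indices. EvaluateTours and tourLength are made-up names for the sketch, and CalculatePermutation is assumed to be the index-to-permutation decoder from earlier, compiled as a __device__ function:

/* Hypothetical helper: total length of the tour described by perm[],
   using a row-major n x n distance matrix. */
__device__ float tourLength( const float *dist, const int *perm, int n )
{
    float len = 0.0f;
    for ( int i = 0; i < n; i++ )
        len += dist[ perm[i] * n + perm[(i + 1) % n] ];
    return len;
}

/* Sketch: each thread evaluates MANY permutations, striding by the total
   number of threads in the grid, and records its own best length.
   numPerms is n!, so this only works for n <= 12 with int indices. */
__global__ void EvaluateTours( const float *dist, int n, int numPerms,
                               float *bestPerThread )
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int perm[16];
    float best = 3.402823466e+38f;        /* FLT_MAX */

    for ( int i = tid; i < numPerms; i += stride )
    {
        CalculatePermutation( i, n, perm );   /* decode index -> tour */
        float len = tourLength( dist, perm, n );
        if ( len < best )
            best = len;
    }
    bestPerThread[tid] = best;
}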

EDIT: Incidentally, one of the most important aspects of CUDA is that threads within a block do NOT have to be fully independent. That's going to become important when you want to figure out which path produces the lowest cost: Without intra-block communication through shared memory, you'll need to go to global (GPU) memory for each step in the reduction, which will be SLOOOOOOOOOOOOOOW.
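A standard intra-block min-reduction looks something like this (a sketch; it assumes blockDim.x is a power of two, at most 512, and that each thread's best length has already been written to an input array):

/* Sketch: reduce each block's candidate lengths to a single minimum in
   shared memory, so only one global-memory write per block is needed. */
__global__ void BlockMinReduce( const float *perThreadBest,
                                float *perBlockBest )
{
    __shared__ float best[512];            /* one slot per thread */
    int tid = threadIdx.x;

    best[tid] = perThreadBest[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    /* halve the number of active threads every step */
    for ( int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if ( tid < s && best[tid + s] < best[tid] )
            best[tid] = best[tid + s];
        __syncthreads();
    }

    if ( tid == 0 )
        perBlockBest[blockIdx.x] = best[0];
}

The handful of per-block results can then be reduced cheaply on the CPU.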



Ah, OK, thanks for that. I had just figured that the algorithm to generate the permutation from a single index would be several steps, not just one instruction, and that even about 300 of those running at the same time would be faster than a single CPU thread that has an easier time going from one permutation to another. I read somewhere that CUDA works well because it hides memory access latencies by running other threads in the meantime, so even though memory is slow, the latency is 'hidden'. That is probably referring to accessing the GPU local/texture memory rather than the global memory; I don't know.

However, you have a very good point about intra-block communication; after all, there's no point making it slower when it can be faster, even if it is simpler to implement/imagine with each thread being completely independent.



[quote name='stu2000']I read somewhere that CUDA works well because it hides memory access latencies by running other threads in the meantime, so even though memory is slow, the latency is 'hidden'. That is probably referring to accessing the GPU local/texture memory rather than the global memory; I don't know.[/quote]


Unless the card on which it's running contains a flux capacitor or graviton generator, the laws of physics still hold. Even on GPU, memory access patterns are crucial. Morton order, Sloan or Cuthill-McKee ordering, tricks like that all matter.
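For example, a 2-D Morton (Z-order) index just interleaves the bits of the two coordinates, so points that are close together in 2-D tend to end up close together in memory. The usual bit-spreading trick for 16-bit coordinates:

/* Interleave the lower 16 bits of x and y into a 32-bit Morton code:
   x ends up in the even bit positions, y in the odd ones. */
unsigned int MortonEncode2D( unsigned int x, unsigned int y )
{
    x &= 0x0000FFFF;                      /* keep only 16 bits */
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;

    y &= 0x0000FFFF;
    y = (y | (y << 8)) & 0x00FF00FF;
    y = (y | (y << 4)) & 0x0F0F0F0F;
    y = (y | (y << 2)) & 0x33333333;
    y = (y | (y << 1)) & 0x55555555;

    return x | (y << 1);
}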

GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.

For most problems, algorithmic improvements matter most, and those rely on branching to cull away infeasible paths. That is something general-purpose CPUs shine at, with branch prediction, deep pipelines, large caches, and so on. For most realistic computational problems, CPUs and GPUs perform about the same when comparing equivalent implementations.

[quote]GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.[/quote]


Complicated, but not intractable. Synchronization within a thread block is much cheaper than CPU thread synchronization, and direct access to intra-block shared memory instead of an L1 cache gives much faster and more dependable concurrent access.

[quote]Unless the card on which it's running contains a flux capacitor or graviton generator, the laws of physics still hold. Even on GPU, memory access patterns are crucial. Morton order, Sloan or Cuthill-McKee ordering, tricks like that all matter.[/quote]

I wasn't referring to anything amazing in the field of physics, just to this section that I had read:


[quote]One key difference is in how the two architectures address the issue of accessing off-chip memory, a very expensive operation with hundreds of clock cycles of latency. CPUs devote a lot of transistors to on-chip caches in order to reduce as much as possible the overall latency caused by off-chip memory accesses. For applications with no or very little parallelism, this latency reduction is the best strategy. For applications with a high number of parallel computations, another strategy is to ensure that the processor is always busy with some computations while other computations are waiting on memory accesses or at synchronization points. In other words, latency is “hidden” rather than reduced. This latency hiding strategy adopted by GPUs is schematized in Figure 1. Latency hiding requires the ability to quickly switch from one computation to another. A GPU multiprocessor (i.e. a compute unit in OpenCL terminology) is therefore designed to support hundreds of active threads at once and, unlike CPUs, the cost of switching from one thread to another is insignificant.[/quote]
src: http://www.nvidia.co...ingOverview.pdf





[quote]GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.

For most problems, algorithmic improvements matter most, and those rely on branching to cull away infeasible paths. That is something general-purpose CPUs shine at, with branch prediction, deep pipelines, large caches, and so on. For most realistic computational problems, CPUs and GPUs perform about the same when comparing equivalent implementations.[/quote]


Yes, I had been building upon this point (that GPUs work best with independent state) for this algorithm, stating that I had wanted to create permutations from a continually increasing index. This way each thread is given its own integer index and creates its own permutation, from which it calculates the route length (TSP). This will be many more than one instruction. However, it seems that it is much more calculation-intensive to create a permutation from an index than to carry on in a recursive fashion from the previous permutation, and that's where this whole thing about shared communication/memory popped out from.

I would still like to see whether running lots of completely independent but more calculation-intensive threads on a GPU (generating permutations from an index) is, as a whole, faster or slower than a single-threaded recursive algorithm on a CPU.
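For reference, the sequential CPU step I mean is the classic next-lexicographic-permutation algorithm (the same thing std::next_permutation does); a plain C version looks like this:

/* Advance perm[0..n-1] to the next permutation in lexicographic order.
   Returns 0 when perm was already the last permutation. */
int NextPermutation( int *perm, int n )
{
    int i = n - 2;
    while ( i >= 0 && perm[i] >= perm[i + 1] )   /* find the pivot */
        i--;
    if ( i < 0 )
        return 0;                                /* last permutation */

    int j = n - 1;
    while ( perm[j] <= perm[i] )                 /* rightmost element > pivot */
        j--;

    int t = perm[i]; perm[i] = perm[j]; perm[j] = t;   /* swap */

    for ( int a = i + 1, b = n - 1; a < b; a++, b-- )  /* reverse the tail */
    {
        t = perm[a]; perm[a] = perm[b]; perm[b] = t;
    }
    return 1;
}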

Stu
Muchos kudos goes out to Panayiotis Tsamouris for supplying me with this via email. :D
http://code.google.com/p/mestrado-biocomp/source/browse/trunk/prefix-transposition-all/factoradic.c?spec=svn19&r=19

This is exactly what I needed and now can move on.

Stu

This topic is closed to new replies.
