TSP brute force state change algorithm


Hello all!

I've been working on this for about two days, whenever I have spare time, and it's making me feel dumb because I get the feeling the answer should be obvious (probably just a few well-placed modulus '%' signs).

I'm trying to figure out a way to get between states (orderings of the nodes) in the TSP (travelling salesperson problem). I know many of you will say I should use AI and that it's the best/quickest way to get a solution, but I've already done that at university.

Example

state 1 -> A B C D
state 2 -> B A C D
state 3 -> A C B D
etc


The state number continuously increases, and from that number a unique state can be reconstructed. It might be easier to imagine the nodes as A B C D A instead of A B C D; you already know that you have to end up back at the head of the list at the end.

The key is that the state should be represented by an integer so that it can be incremented. Each state should preferably be unique, but I would be happy to settle for an algorithm that occasionally produces the same actual state from two different state numbers.

It's easy to get all the states produced by a single swap, but getting all the states produced by performing two swaps is hard,
e.g. going from A B C D to C A B D.

It can be quite hard to get my point across sometimes, so if you don't understand, that's normal; please just ask about the relevant bit.

Thanks in advance,
Stu

So far I have come up only with the following algorithm.

Treat it like you're trying to brute-force a bicycle spin lock (3 or 4 wheels numbered 0-9; remember those?).
Instead of numbers, you have letters representing the nodes.
Since no letter can be used twice, treat each wheel as a 'priority'.
You keep spinning the leftmost wheel one notch (letter/number) at a time to change the priority; once that wheel has done a full turn, you turn the second wheel one notch. You keep repeating this until the second wheel has turned fully, then turn the third wheel one notch, and so on. Once the final (rightmost) wheel has done a full circle, all the possible outcomes should have been produced.

Diagram shown.
[URL=http://www.UploadScreenshot.com/image/243176/1336380][IMG]http://img1.UploadScreenshot.com/images/main/2/4617251627.jpg[/IMG][/URL]

Whenever a 'wheel' wants to output a letter that has already been placed, it tries the next-priority letter instead, repeating until it places an unused letter.
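In code, the idea is roughly this (a quick and dirty sketch in plain C; the 4-node set and all the names are just my example):

// Sketch of the wheel idea: each wheel holds a priority 0..3, the
// leftmost wheel spins fastest, and a wheel pointing at an
// already-placed letter skips forward to the next free one.
#include <stdio.h>

#define N 4

static void wheelsToTour(const int wheels[N], char tour[N + 1])
{
    const char nodes[N] = { 'A', 'B', 'C', 'D' };
    int used[N] = { 0 };
    for (int w = 0; w < N; ++w) {
        int p = wheels[w];
        while (used[p % N]) ++p;    /* skip letters already placed */
        used[p % N] = 1;
        tour[w] = nodes[p % N];
    }
    tour[N] = '\0';
}

int main(void)
{
    int wheels[N] = { 0 };
    for (int i = 0; i < N * N * N * N; ++i) {
        char tour[N + 1];
        wheelsToTour(wheels, tour);
        printf("%s\n", tour);
        /* spin the leftmost wheel; carry into the next wheel on wrap */
        for (int w = 0; w < N && ++wheels[w] == N; ++w)
            wheels[w] = 0;
    }
    return 0;
}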

Unfortunately, since only one letter remains by the time the last wheel places its letter, the last wheel's priority has no effect on the output, so spinning it is a wasted series.

Is this method complete (does it produce every state)? I know it's not perfect, as there will be duplicates =(

[quote name='Sneftel' timestamp='1297894402' post='4775143']
[url="http://en.wikipedia.org/wiki/Permutation#Systematic_generation_of_all_permutations"]clicky[/url]
[/quote]

Cheers for that; looking through it. The [url="http://en.wikipedia.org/wiki/Steinhaus%E2%80%93Johnson%E2%80%93Trotter_algorithm"]Steinhaus-Johnson-Trotter algorithm[/url] looked the most interesting, but then I realised it would be hard to reach a given state without going through every previous state. I need an algorithm that doesn't rely on working through the previous states, because this will be programmed in parallel, not serially on a single thread.

Read more of that page. It describes how an index can be used to generate the corresponding permutation in lexicographic order.

[quote name='Sneftel' timestamp='1297904340' post='4775199']
Read more of that page. It describes how an index can be used to generate the corresponding permutation in lexicographic order.
[/quote]

Are you referring to the page you linked or the page I linked?

Are you referring to this section of the page you linked:

"An obvious way to generate permutations of [i]n[/i] is to generate values for the Lehmer code (possibly using the [url="http://en.wikipedia.org/wiki/Factorial_number_system"]factorial number system[/url] representation of integers up to [i]n[/i]!), and convert those into the corresponding permutations. However the latter step, while straightforward, is hard to implement efficiently, because it requires [i]n[/i] operations each of selection from a sequence and deletion from it, at an arbitrary position; of the obvious representations of the sequence as an [url="http://en.wikipedia.org/wiki/Array_data_structure"]array[/url] or a [url="http://en.wikipedia.org/wiki/Linked_list"]linked list[/url], both require (for different reasons) about [i]n[/i][sup]2[/sup]/4 operations to perform the conversion. With [i]n[/i] likely to be rather small (especially if generation of all permutations is needed) that is not too much of a problem, but it turns out that both for random and for systematic generation there are simple alternatives that do considerably better. For this reason it does not seem useful, although certainly possible, to employ a special data structure that would allow performing the conversion from Lehmer code to permutation in [url="http://en.wikipedia.org/wiki/Big_O_notation"][i]O[/i]([i]n[/i] log [i]n[/i])[/url] time."
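If I am reading that right, the factorial number system step on its own would be something like this (my own toy example for n = 4 and index 6, not code from the article):

// Convert index 6 to its factorial-base digits (the Lehmer code),
// most significant digit first: 6 -> (1, 0, 0, 0), which selects
// B, then A, then C, then D: permutation B A C D.
#include <stdio.h>

#define N 4

int main(void)
{
    int index = 6;
    int digits[N];
    for (int k = 0; k < N; ++k) {    /* least significant digit first */
        digits[N - 1 - k] = index % (k + 1);
        index /= k + 1;
    }
    for (int i = 0; i < N; ++i)
        printf("%d ", digits[i]);    /* prints: 1 0 0 0 */
    printf("\n");
    return 0;
}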
Stu

I will try tonight with the set A B C D:
first letter chosen = index % 4, second = index % 3, third = index % 2.

Each time a letter is chosen it is removed from the base list A B C D, and the base list is always kept in alphabetical order.

e.g.
if index 6 produces 2, 0, 0, the result is C A B D
if index 3 produces 3, 0, 1, the result is D A C B
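Something like this is what I plan to try (plain C; removing a chosen letter is just an array shift):

// Test harness for the scheme above: picks = index % 4, index % 3,
// index % 2, each pick removed from the alphabetical base list.
#include <stdio.h>

int main(void)
{
    for (int index = 0; index < 24; ++index) {
        char pool[5] = "ABCD";
        int picks[3] = { index % 4, index % 3, index % 2 };
        char out[5];
        int len = 4;
        for (int i = 0; i < 3; ++i) {
            out[i] = pool[picks[i]];
            for (int j = picks[i]; j < len - 1; ++j)    /* remove chosen letter */
                pool[j] = pool[j + 1];
            --len;
        }
        out[3] = pool[0];    /* the last letter is forced */
        out[4] = '\0';
        printf("%2d -> %s\n", index, out);
    }
    return 0;
}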

Has anyone tried this? Does anyone know any fatal flaws with it?

Well, in exactly that form it doesn't work... you can see that the period of the sequence is 12 (the lcm of 4, 3 and 2), but there are 24 possible permutations, so the algorithm only visits half of them.

A similar algorithm works fine: the first letter chosen is index/6 (6 = 3!), then set index := index%6. The second letter chosen is index/2 (2 = 2!), then set index := index%2. And so on, dividing by the next smaller factorial each time. This is equivalent in function to the factorial numbering explained in the wikipedia page.

Don't do it for every permutation, though. Do it once for each thread to identify the first permutation for it to work on (for 24 permutations and 4 threads, the first thread starts at permutation 0, the second at permutation 6, the third at permutation 12...), and from there just use the standard next-permutation iteration. That's where the lexicographic ordering comes in handy.
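In code, the whole scheme looks roughly like this (an untested host-side C++ sketch; the 4-city setup and all names are just for illustration, and the same decode works inside a kernel):

// Unrank a starting permutation via the factorial number system,
// then step through the rest lexicographically.
#include <algorithm>
#include <cstdio>

const int N = 4;

// Build the index-th permutation (in lexicographic order) of ABCD.
void unrank(int index, char out[N])
{
    char pool[N] = { 'A', 'B', 'C', 'D' };
    const int fact[N] = { 6, 2, 1, 1 };    // (N-1)!, (N-2)!, 1!, 0!
    int len = N;
    for (int i = 0; i < N; ++i) {
        int d = index / fact[i];           // factoradic digit
        index %= fact[i];
        out[i] = pool[d];
        for (int j = d; j < len - 1; ++j)  // delete the chosen letter
            pool[j] = pool[j + 1];
        --len;
    }
}

int main()
{
    const int total = 24, threads = 4, chunk = total / threads;
    for (int t = 0; t < threads; ++t) {    // simulate each thread's slice
        char perm[N];
        unrank(t * chunk, perm);           // decode once per thread...
        for (int k = 0; k < chunk; ++k) {  // ...then iterate cheaply
            printf("thread %d: %.4s\n", t, perm);
            std::next_permutation(perm, perm + N);
        }
    }
}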

[quote name='Sneftel' timestamp='1297970036' post='4775527']
Well, in exactly that form it doesn't work... you can see that the period of the sequence is 12 (the lcm of 4, 3 and 2), but there are 24 possible permutations, so the algorithm only visits half of them.

A similar algorithm works fine: the first letter chosen is index/6 (6 = 3!), then set index := index%6. The second letter chosen is index/2 (2 = 2!), then set index := index%2. And so on, dividing by the next smaller factorial each time. This is equivalent in function to the factorial numbering explained in the wikipedia page.

Don't do it for every permutation, though. Do it once for each thread to identify the first permutation for it to work on (for 24 permutations and 4 threads, the first thread starts at permutation 0, the second at permutation 6, the third at permutation 12...), and from there just use the standard next-permutation iteration. That's where the lexicographic ordering comes in handy.
[/quote]

Thanks for that. I did notice that I missed quite a few and it was flawed. Unfortunately, your suggestion won't work for me either, for two reasons:
1) I didn't understand what you meant by 'set index := index%6'. Is that different from index = index%6? I don't understand the ':'.


2) I'm actually using this for CUDA and will have a single thread calculating each and every permutation. The index will be generated from threadIdx.x + blockIdx.x etc. This way I am trying to use the huge power of the GPU's parallel computation: even though calculating each permutation is slower on its own, having hundreds or thousands running in parallel at a time should make it faster overall.

Is your algorithm summed up like this? (pseudocode)


function CalculatePermutation( int index )
{
    int firstLetter = index / 6;    // 6 = 3!
    char firstLetterChar = GetCharFromNumber( firstLetter );    // picks from the letters not yet used, and removes it
    index = index % 6;

    int secondLetter = index / 2;    // 2 = 2!
    char secondLetterChar = GetCharFromNumber( secondLetter );
    index = index % 2;

    int thirdLetter = index / 1;    // 1 = 1!
    char thirdLetterChar = GetCharFromNumber( thirdLetter );

    // the fourth letter is whichever one remains
}

I wouldn't need to pass the final index from that into the next call to CalculatePermutation, would I?

i.e. can I run something like:
for ( int i = 0; i < 24; i++ )
{
    CalculatePermutation( i );
}

[quote name='stu2000' timestamp='1297975874' post='4775572']
Thanks for that. I did notice that I missed quite a few and it was flawed. Unfortunately, your suggestion won't work for me either, for two reasons:
1) I didn't understand what you meant by 'set index := index%6'. Is that different from index = index%6?[/quote]No, the two should be considered the same; ':=' is just pseudocode notation for assignment.


[quote]2) I'm actually using this for CUDA and will have a single thread calculating each and every permutation. The index will be generated from threadIdx.x + blockIdx.x etc.[/quote]That's really not how you're supposed to use CUDA threads. If I recall correctly, even the latest hardware only keeps around 1500 threads resident per multiprocessor at any one time. CUDA code uses for-loops just like everyone else's.

[quote]function CalculatePermutation( int index )
I wouldn't need to pass the final index from that into the next call to CalculatePermutation, would I?

i.e. can I run something like:
for ( int i = 0; i < 24; i++ )
{
    CalculatePermutation( i );
}
[/quote]
Yeah, but it would be silly to do so, because of the extra work compared to just finding the next permutation.

[quote name='Sneftel' timestamp='1297978973' post='4775597']
[quote name='stu2000' timestamp='1297975874' post='4775572']
Thanks for that. I did notice that I missed quite a few and it was flawed. Unfortunately, your suggestion won't work for me either, for two reasons:
1) I didn't understand what you meant by 'set index := index%6'. Is that different from index = index%6?[/quote]No, the two should be considered the same; ':=' is just pseudocode notation for assignment.


[quote]2) I'm actually using this for CUDA and will have a single thread calculating each and every permutation. The index will be generated from threadIdx.x + blockIdx.x etc.[/quote]That's really not how you're supposed to use CUDA threads. If I recall correctly, even the latest hardware only keeps around 1500 threads resident per multiprocessor at any one time. CUDA code uses for-loops just like everyone else's.
[/quote]

From reading CUDA by Example, you are limited to 512 threads per block and are allowed up to 65535 blocks. From everything I have read, CUDA works best the more parallel blocks/threads you have; CUDA applications only show a huge performance increase over CPUs when you launch hundreds or thousands of parallel actions. I probably shouldn't have used the word 'thread' earlier, but it seemed more appropriate than explaining launching many parallel blocks each with many threads. I'm still not 100% on it, as you can see.

Stu

[quote name='stu2000' timestamp='1298037737' post='4775863']
From reading CUDA by Example, you are limited to 512 threads per block and are allowed up to 65535 blocks.[/quote]

Yes, you can have up to 512 threads per block. Yes, the API allows up to 65535 blocks per grid dimension. No, that does not mean 65535 blocks of 512 threads all run simultaneously.

[quote]From everything I have read, CUDA works best the more parallel blocks/threads you have.[/quote]
Of course that's true. But think about this logically. Your GPU does not have 65535*512 (about 33.5 million) processing elements in it. It cannot do 33.5 million things at the same time. Why would it? Your monitor doesn't even have that many pixels.

A CUDA thread is not intended to process exactly one thing and then die. The overhead involved in creating a thread is small, but it is greater than the overhead of processing exactly one thing.

Bottom line: you will have thousands of threads. You will not have millions. Each thread should do thousands or millions of things, not one thing.

EDIT: Incidentally, one of the most important aspects of CUDA is that threads within a block do NOT have to be fully independent. That's going to become important when you want to figure out which path produces the lowest cost: without intra-block communication through shared memory, you'd have to go to global (GPU) memory for each step in the reduction, which would be SLOW.
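To make that concrete, here is a rough, untested sketch; tourCost, the 4-city size, and the 256-thread block are all assumptions for illustration, not a worked-out implementation:

// Each thread evaluates many permutations in a grid-stride loop,
// then the block min-reduces its results through shared memory.
// Assumes blockDim.x == 256 (a power of two) and N == 4 cities.
#define N 4

__device__ float tourCost(const float *dist, unsigned long long idx)
{
    const int fact[N] = { 6, 2, 1, 1 };    // (N-1)!, ..., 0!
    int pool[N] = { 0, 1, 2, 3 };
    int tour[N];
    int len = N;
    for (int i = 0; i < N; ++i) {          // factoradic unrank of idx
        int d = (int)(idx / fact[i]);
        idx %= fact[i];
        tour[i] = pool[d];
        for (int j = d; j < len - 1; ++j)  // remove the chosen city
            pool[j] = pool[j + 1];
        --len;
    }
    float c = 0.0f;
    for (int i = 0; i < N; ++i)            // closed tour, back to the start
        c += dist[tour[i] * N + tour[(i + 1) % N]];
    return c;
}

__global__ void bestTour(const float *dist, float *blockBest,
                         unsigned long long numPerms)
{
    __shared__ float best[256];
    float myBest = 1e30f;

    // Grid-stride loop: far more permutations than threads.
    for (unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numPerms;
         i += (unsigned long long)gridDim.x * blockDim.x)
        myBest = fminf(myBest, tourCost(dist, i));

    // Min-reduction in shared memory rather than global memory.
    best[threadIdx.x] = myBest;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            best[threadIdx.x] = fminf(best[threadIdx.x],
                                      best[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockBest[blockIdx.x] = best[0];
}

The host would then launch something like bestTour<<<numBlocks, 256>>>(...) and take the minimum over blockBest on the CPU.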

[quote name='Sneftel' timestamp='1298043762' post='4775917']

Yes, you can have up to 512 threads per block. Yes, the API allows up to 65535 blocks per grid dimension. No, that does not mean 65535 blocks of 512 threads all run simultaneously.

[quote]From everything I have read, CUDA works best the more parallel blocks/threads you have.[/quote]
Of course that's true. But think about this logically. Your GPU does not have 65535*512 (about 33.5 million) processing elements in it. It cannot do 33.5 million things at the same time. Why would it? Your monitor doesn't even have that many pixels.

A CUDA thread is not intended to process exactly one thing and then die. The overhead involved in creating a thread is small, but it is greater than the overhead of processing exactly one thing.

Bottom line: you will have thousands of threads. You will not have millions. Each thread should do thousands or millions of things, not one thing.

EDIT: Incidentally, one of the most important aspects of CUDA is that threads within a block do NOT have to be fully independent. That's going to become important when you want to figure out which path produces the lowest cost: without intra-block communication through shared memory, you'd have to go to global (GPU) memory for each step in the reduction, which would be SLOW.
[/quote]

Ah, OK, thanks for that. I had figured that the algorithm to generate a permutation from a single index would take several steps, not just one instruction, and that even around 300 of those running at the same time would be faster than a single CPU thread, which has an easier time stepping from one permutation to the next. I read somewhere that CUDA works well because it hides memory access latencies by running other threads in the meantime, so even though memory is slow, this is 'hidden'; though that is probably referring to accessing the GPU local/texture memory rather than global memory, I don't know.

However, you have a very good point about intra-block communication; after all, there's no point making it slower when it can be faster, even if it is simpler to implement and imagine with each thread being completely independent.


[quote name='stu2000' timestamp='1298056459' post='4776034']
I read somewhere that CUDA works well because it hides memory access latencies by running other threads in the meantime, so even though memory is slow, this is 'hidden'; though that is probably referring to accessing the GPU local/texture memory rather than global memory, I don't know.[/quote]

Unless the card on which it's running contains a flux capacitor or a graviton generator, the laws of physics still hold. Even on a GPU, memory access patterns are crucial. Morton order, Sloan or Cuthill-McKee ordering, tricks like that all still matter.

GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.

For most problems, algorithmic improvements matter most, and those rely on branching to cull away infeasible paths. That is something general-purpose CPUs shine at, with branch prediction, deep pipelines, large caches, and so on. For most realistic computational problems, CPUs and GPUs perform about the same when comparing equivalent implementations.

[quote name='Antheus' timestamp='1298068835' post='4776110']
GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.[/quote]

Complicated, but not intractable. Synchronization within a thread block is much cheaper than CPU thread synchronization, and direct access to intra-block shared memory instead of an L1 cache gives much faster and more dependable concurrent access.

[quote name='Antheus' timestamp='1298068835' post='4776110']
[quote name='stu2000' timestamp='1298056459' post='4776034']
I read somewhere that CUDA works well because it hides memory access latencies by running other threads in the meantime, so even though memory is slow, this is 'hidden'; though that is probably referring to accessing the GPU local/texture memory rather than global memory, I don't know.[/quote]

Unless the card on which it's running contains a flux capacitor or graviton generator, then the laws of physics still hold. Even on GPU, memory access patterns are crucial. Morton order, Sloan or Cuthill-McKee ordering, tricks like that all matter.
[/quote]

I wasn't referring to anything amazing in the field of physics, just to this section that I had read:


One key difference is in how the two architectures address the issue of accessing off-chip memory, a very expensive operation with hundreds of clock cycles of latency. CPUs devote a lot of transistors to on-chip caches in order to reduce as much as possible the overall latency caused by off-chip memory accesses. For applications with no or very little parallelism, this latency reduction is the best strategy. For applications with a high number of parallel computations, another strategy is to ensure that the processor is always busy with some computations while other computations are waiting on memory accesses or at synchronization points. In other words, latency is "hidden" rather than reduced. This latency hiding strategy adopted by GPUs is schematized in Figure 1. Latency hiding requires the ability to quickly switch from one computation to another. A GPU multiprocessor (i.e. a compute unit in OpenCL terminology) is therefore designed to support hundreds of active threads at once and, unlike CPUs, the cost of switching from one thread to another is insignificant.
src: [url="http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingOverview.pdf"]http://www.nvidia.co...ingOverview.pdf[/url]




[quote name='Antheus' timestamp='1298068835' post='4776110']
GPUs are great for things which have very small and fully independent state. For example, updating 20 million particles. But as soon as you start dealing with any kind of global or shared state, things get complicated.

For most problems, algorithmic improvements matter most, and those rely on branching to cull away infeasible paths. That is something general-purpose CPUs shine at, with branch prediction, deep pipelines, large caches, and so on. For most realistic computational problems, CPUs and GPUs perform about the same when comparing equivalent implementations.
[/quote]

Yes, I had been building on this point (that GPUs work best with small, fully independent state) for this algorithm, which is why I wanted to create permutations from a continually increasing index. That way each thread is given its own integer index and creates its own permutation, from which it calculates the route length (TSP); that is many more than one instruction. However, it seems that creating a permutation from an index is much more calculation-intensive than stepping on from the previous permutation, and that is where this whole discussion of shared communication and memory came from.

I would still like to see whether running lots of completely independent but more calculation-intensive GPU threads (each generating a permutation from an index) is, as a whole, faster or slower than a single CPU thread stepping from one permutation to the next.

Stu

Muchos kudos goes out to Panayiotis Tsamouris for supplying me with this via email. :D
[url="http://code.google.com/p/mestrado-biocomp/source/browse/trunk/prefix-transposition-all/factoradic.c?spec=svn19&r=19"]http://code.google.com/p/mestrado-biocomp/source/browse/trunk/prefix-transposition-all/factoradic.c?spec=svn19&r=19[/url]

This is exactly what I needed, and now I can move on.

Stu
