Most efficient way to find out if two words have the same letters within them

Started by
26 comments, last by Antheus 13 years, 4 months ago
The topic title might sound a little confusing, but basically I want to unscramble one word by comparing it to another word. Right now I'm checking if both words have the same length then try to see if the scrambled word contains all the letters the normal word contains. At least that's what I'm trying to do. I'm not sure how I can tell if scrambled word contains the same letters as the original. I've looked online for a while now and a solution that seems promising (but also doesn't seem too good if there are special characters used i.e. # ¿ ♠) is to assign each letter a prime number and multiply these numbers together to get a unique number for that specific word.
example:
a = 2 b = 3 c = 5 d = 7 ect...
so the "word" abc would equal 30 where as the word cda would equal 70
This solution makes sense but is this the most efficient way of solving this problem? Is there any other way to do this?

Another way might be using the ASCII character values but...well..I'm not entirely sure how I would use them to get the results I want.

Thanks.
Advertisement
If I understand what you're asking, you can make a copy of both strings, sort both of the copies and see if the copies are equal.
I'm not completely sure but here is an idea,

Let A be some string of characters
Let B be some string of characters

To check if A and B has the same character you can do the following,

  • radixSort(A) # using the ascii values of each letter in A -- O(n)

  • radixSort(B) # using the ascii values of each letter in B -- O(n)

  • compare each letter by letter in A and B



Not sure if thats what your trying to achieve.
Edge cases will show your design flaws in your code!
Visit my site
Visit my FaceBook
Visit my github
You could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.
e.g.
//assuming your "words" are less than 256 characters, we can use single-byte counterstypedef unsigned char number;//assuming basic ASCII strings, we only have 256 different characters to considerconst int numCharacters = 256;number word1Count[numCharacters];number word2Count[numCharacters];int sizeInBytes = sizeof(number)*numCharacters;memset( word1Count, 0, sizeInBytes );memset( word2Count, 0, sizeInBytes );for( const char* str = word1; *str; ++str )  ++word1Count[*str];for( const char* str = word2; *str; ++str )  ++word2Count[*str];bool bothCountsAreEqual = (memcmp(word1Count, word2Count, sizeInBytes)==0);
The problem with the prime number approach is you have a limited amount of characters you can check per number of bits available to store the multiplied primes. Assuming a 32-bit unsigned integer and 26 letters, 'z' would equal '101'. The string "zzzzzz" won't fit in 32-bits. In reality you only get like 5 characters that way. Even going up to 128 bits you only get up to 20 characters.

I'd go with SiCrane's or Hodgman's ideas.

C++: A Dialog | C++0x Features: Part1 (lambdas, auto, static_assert) , Part 2 (rvalue references) , Part 3 (decltype) | Write Games | Fix Your Timestep!

Quote:Original post by Hodgman
You could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.
e.g.*** Source Snippet Removed ***


this should be pretty efficient.

You can check first if the lengths of the strings are equal and if they are, you can use the same loop to count the occurrences.
[size="2"]I like the Walrus best.
@nobodynews: Good catch there, I didn't even notice that error.
@Hodgman: I like that idea. I'm gonna go with your approach on this problem.

Thank you for all the replies. Learn something new everyday :)
Here's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.

def partition(A):  if len(A) == 1: return  pivot = A[0]  cursor = 0  for i in range(len(A)):    if A <= pivot:      A[cursor],A = A,A[cursor]      cursor += 1  return cursordef quickcomp(A, B):  if len(A) != len(B): return false  if len(A) <= 1:     if len(A) == 0: return true    else: return (A[0] == B[0])  a_pivloc = partition(A)  b_pivloc = partition(B)  if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false  if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false  return true


Without thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.
If your goal is to find all possible words based on scrambled word by searching some sort of database of words, presented solution taken as is could be to slow.
In such a case I would recommend preprocessing your words database first in the following manner:

1. You take each word from the database, sort it first, and next copy it to other database, remembering original word and sorted one. This new database should be sorted by sorted names (sounds strange ;))
For example, if you have words: 'cat', 'dog', 'god', 'good', 'ape', that would be:
act->cat
aep->ape
dog->dog, god
doog->good

It is better to perform this step previously and next work on preprocessed database, as when comparing scrambled name you will not have to sort each word from database during comparison, so it will be much faster.
If you fear it will take too long, consider that it has to be done only once and if you won't do this such way, you will have to perform this task during each search.

2. When searching for true word from the scrambled one, this task will be just simple searching word from the database, what could be obtained either by some sort of logarithmic search, or by using hash-arrays, or by constructing tree from the database.

Good luck
Pomnico

Edit:
By database I don't mean SQL necessary.
Quote:Original post by Sneftel
Here's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.

def partition(A):  if len(A) == 1: return  pivot = A[0]  cursor = 0  for i in range(len(A)):    if A <= pivot:      A[cursor],A = A,A[cursor]      cursor += 1  return cursordef quickcomp(A, B):  if len(A) != len(B): return false  if len(A) <= 1:     if len(A) == 0: return true    else: return (A[0] == B[0])  a_pivloc = partition(A)  b_pivloc = partition(B)  if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false  if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false  return true


Without thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.
That's a cool algorithm idea. I decided to implement it myself and one problem I ran into is that you need to ensure that the partitioning step picks the same item for both a and b. If it picks a different pivot then a different number of items will be less than it.

SiCrane mentioned my idea.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

This topic is closed to new replies.

Advertisement