# Most efficient way to find out if two words have the same letters within them

## Recommended Posts

SonicD007    464
The topic title might sound a little confusing, but basically I want to unscramble one word by comparing it to another word. Right now I'm checking if both words have the same length then try to see if the scrambled word contains all the letters the normal word contains. At least that's what I'm trying to do. I'm not sure how I can tell if scrambled word contains the same letters as the original. I've looked online for a while now and a solution that seems promising (but also doesn't seem too good if there are special characters used i.e. # ¿ ♠) is to assign each letter a prime number and multiply these numbers together to get a unique number for that specific word.
example:
a = 2 b = 3 c = 5 d = 7 ect...
so the "word" abc would equal 30 where as the word cda would equal 70
This solution makes sense but is this the most efficient way of solving this problem? Is there any other way to do this?

Another way might be using the ASCII character values but...well..I'm not entirely sure how I would use them to get the results I want.

Thanks.

##### Share on other sites
SiCrane    11839
If I understand what you're asking, you can make a copy of both strings, sort both of the copies and see if the copies are equal.

##### Share on other sites
Concentrate    181
I'm not completely sure but here is an idea,

Let A be some string of characters
Let B be some string of characters

To check if A and B has the same character you can do the following,

• radixSort(A) # using the ascii values of each letter in A -- O(n)

• radixSort(B) # using the ascii values of each letter in B -- O(n)

• compare each letter by letter in A and B

Not sure if thats what your trying to achieve.

##### Share on other sites
Hodgman    51231
You could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.
e.g.
//assuming your "words" are less than 256 characters, we can use single-byte counterstypedef unsigned char number;//assuming basic ASCII strings, we only have 256 different characters to considerconst int numCharacters = 256;number word1Count[numCharacters];number word2Count[numCharacters];int sizeInBytes = sizeof(number)*numCharacters;memset( word1Count, 0, sizeInBytes );memset( word2Count, 0, sizeInBytes );for( const char* str = word1; *str; ++str )  ++word1Count[*str];for( const char* str = word2; *str; ++str )  ++word2Count[*str];bool bothCountsAreEqual = (memcmp(word1Count, word2Count, sizeInBytes)==0);

##### Share on other sites
nobodynews    3126
The problem with the prime number approach is you have a limited amount of characters you can check per number of bits available to store the multiplied primes. Assuming a 32-bit unsigned integer and 26 letters, 'z' would equal '101'. The string "zzzzzz" won't fit in 32-bits. In reality you only get like 5 characters that way. Even going up to 128 bits you only get up to 20 characters.

I'd go with SiCrane's or Hodgman's ideas.

##### Share on other sites
owl    376
Quote:
 Original post by HodgmanYou could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.e.g.*** Source Snippet Removed ***

this should be pretty efficient.

You can check first if the lengths of the strings are equal and if they are, you can use the same loop to count the occurrences.

##### Share on other sites
SonicD007    464
@nobodynews: Good catch there, I didn't even notice that error.
@Hodgman: I like that idea. I'm gonna go with your approach on this problem.

Thank you for all the replies. Learn something new everyday :)

##### Share on other sites
Sneftel    1788
Here's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.

def partition(A):  if len(A) == 1: return  pivot = A[0]  cursor = 0  for i in range(len(A)):    if A[i] <= pivot:      A[cursor],A[i] = A[i],A[cursor]      cursor += 1  return cursordef quickcomp(A, B):  if len(A) != len(B): return false  if len(A) <= 1:     if len(A) == 0: return true    else: return (A[0] == B[0])  a_pivloc = partition(A)  b_pivloc = partition(B)  if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false  if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false  return true

Without thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.

##### Share on other sites
Pomnico    110
If your goal is to find all possible words based on scrambled word by searching some sort of database of words, presented solution taken as is could be to slow.
In such a case I would recommend preprocessing your words database first in the following manner:

1. You take each word from the database, sort it first, and next copy it to other database, remembering original word and sorted one. This new database should be sorted by sorted names (sounds strange ;))
For example, if you have words: 'cat', 'dog', 'god', 'good', 'ape', that would be:
act->cat
aep->ape
dog->dog, god
doog->good

It is better to perform this step previously and next work on preprocessed database, as when comparing scrambled name you will not have to sort each word from database during comparison, so it will be much faster.
If you fear it will take too long, consider that it has to be done only once and if you won't do this such way, you will have to perform this task during each search.

2. When searching for true word from the scrambled one, this task will be just simple searching word from the database, what could be obtained either by some sort of logarithmic search, or by using hash-arrays, or by constructing tree from the database.

Good luck
Pomnico

Edit:
By database I don't mean SQL necessary.

##### Share on other sites
iMalc    2466
Quote:
 Original post by SneftelHere's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.def partition(A): if len(A) == 1: return pivot = A[0] cursor = 0 for i in range(len(A)): if A[i] <= pivot: A[cursor],A[i] = A[i],A[cursor] cursor += 1 return cursordef quickcomp(A, B): if len(A) != len(B): return false if len(A) <= 1: if len(A) == 0: return true else: return (A[0] == B[0]) a_pivloc = partition(A) b_pivloc = partition(B) if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false return trueWithout thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.
That's a cool algorithm idea. I decided to implement it myself and one problem I ran into is that you need to ensure that the partitioning step picks the same item for both a and b. If it picks a different pivot then a different number of items will be less than it.

SiCrane mentioned my idea.

##### Share on other sites
69mij    132
Well another way to test if the letters of word2 are the ones in word1 you can do the following:

1. Have an array A[26]. Initialize it to 0

2. For every letter of word1, set A[letter] = 1

3. set sum=0

4. For every letter of word2 check if A[letter_from_word2]==1. If yes sum++

if sum equals the length of word2 then its true

##### Share on other sites
ne0_kamen    371
The counting method is pretty common and effecient.
However,I'm particulary fond of this method :
bool equal(char* str1,char* str2){   if ( *str1 == '\0' && *str2 == '\0')       return true;   char* s = strchr(str2,*str1);   if (!s)      return false;   swap(*s,*str2);   return equal(str1+1,str2+1);}

Example
1)"TomMarvoloRiddle" "IamLordVoldemort" -> "TomMarvoloRiddle" "TamLordVoldemorI"
2)"omMarvoloRiddle" "amLordVoldemorI" -> "omMarvoloRiddle" "TmLardVoldemorI"
3)"mMarvoloRiddle" "mLardVoldemorI" -> "mMarvoloRiddle" "mLordVoldemorI"
4)"MarvoloRiddle" "LardVoldemorI" -> "MarvoloRiddle" "mardVoldeLorI"
5)"arvoloRiddle" "ardVoldeLorI" -> "arvoloRiddle" "ardVoldeLorI"
6)"rvoloRiddle" "rdVoldeLorI" -> "arvoloRiddle" "ardVoldeLorI"
7)"voloRiddle" "dVoldeLorI" -> "voloRiddle" "VdoldeLorI"
8)"oloRiddle" "doldeLorI" ->"oloRiddle" "odldeLorI"
9)"loRiddle" "dldeLorI" -> "loRiddle" "lddeLorI"
10)"oRiddle" "ddeLorI" -> "oRiddle" "odeLdrI"
11)"Riddle" "deLdrI" -> "Riddle" "reLddI"
12)"iddle" "eLddI" -> "iddle" "iLdde"
13)"ddle" "Ldde" -> "ddle" "dlde"
14)"dle" "lde" -> "dle" "dle"
15)"le" "de" -> "le" "le"
16)"e" "e" -> "e" "e"
"" "" -> True

##### Share on other sites
Sneftel    1788
Quote:
 Original post by iMalcone problem I ran into is that you need to ensure that the partitioning step picks the same item for both a and b. If it picks a different pivot then a different number of items will be less than it.

Wow, I can't believe I didn't think of that. Nice catch.
def partition(A, pivot):  if len(A) == 1: return  cursor = 0  for i in range(len(A)):    if A[i] <= pivot:      A[cursor],A[i] = A[i],A[cursor]      cursor += 1  return cursordef quickcomp(A, B):  if len(A) != len(B): return false  if len(A) <= 1:     if len(A) == 0: return true    else: return (A[0] == B[0])  pivot = A[0]  a_pivloc = partition(A, pivot)  b_pivloc = partition(B, pivot)  if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false  if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false  return true

##### Share on other sites
Antheus    2409
I was going to pitch my solution, but the prob is, i am already a manager.

Ah, the joys of technical interviews... And people wonder why 99% can't solve FizBuzz.

##### Share on other sites
Steve132    433
The best way for just two strings is the counting sort suggestion that was mentioned. It is O(n) in time and O(1) in memory.

If the problem is "find the matching palindromic string in a database" then the problem is harder...doing just the counting sort suggestion on each string in the database sequentially would be O(nm) where n is the max string size and m is the size of the database.
Using a binary search algorithm, this could be significantly reduced to O(n log m).

However, there's a different algorithm for the database approach. Using a Trie you can solve the database lookup in O(n log n) time. To construct the trie, sort each string in the database and add it to the trie. This is O(n log n m) time. Then, to look up whether a string is in the database, sort the string ( O(n log n) ) and traverse the trie (O(n)) to see if it exists.

##### Share on other sites
Antheus    2409
Quote:
 Original post by Steve132The best way for just two strings is the counting sort suggestion that was mentioned. It is O(n) in time and O(1) in memory.

Counting sort is vastly inferior:
- O(1) space is very big, wchar_t can be a 32-bit number, some encodings of chars can be even longer
- it cannot detect equal characters with different encodings
- Each call must reset the buckets, which will be up to 2^32 writes and will dwarf running time of the rest of algorithm, the n^2 algorithm with perhaps early exit will complete faster for typical strings

The space/initialization problem can be solved using different lookup structure, but those have logn complexity in general, so the end result is n logn.

Unless the "best" way is academia or interview question, which are completely disjoint from real world issues.

##### Share on other sites
alvaro    21246
For the search-in-a-database problem you could also compute a hash of the char counts in O(n) and then index the database by it in O(1).

EDIT: Take a look at this code:
#include <iostream>#include <string>typedef unsigned long long u64;u64 char_hash(char c) {  return (c * 0xa946bc89239fd1c5ull) ^ (c * 0x2772eae805def16bull);}u64 commutative_hash(std::string const &s) {  u64 result = 0ull;  for (std::string::const_iterator it = s.begin(), end=s.end();       it != end; ++it) {    result += char_hash(*it);  }  return result;}int main() {  std::cout << commutative_hash("tommarvoloriddle") << '\n';  std::cout << commutative_hash("iamlordvoldemort") << '\n';}

Of course, having the same commutative_hash doesn't mean that the strings contain the same letters, but it is a necessary condition and collisions should be exceedingly rare. For many applications this might be enough. In the case of a database it would at least dramatically narrow the set of words you have to compare against.

[Edited by - alvaro on December 3, 2010 1:43:42 PM]

##### Share on other sites
Serdar-X    102
you can constantly shuffle the second string and check if it is equal to the first
if it is there you go
if it is not try again

after some hours if you still could not detected their equality than you can be %99.99 sure that they do not include same letters

this is not a deterministic algorithm but I think it will work on most cases

##### Share on other sites
alvaro    21246
Quote:
 Original post by Serdar-Xyou can constantly shuffle the second string and check if it is equal to the firstif it is there you goif it is not try againafter some hours if you still could not detected their equality than you can be %99.99 sure that they do not include same lettersthis is not a deterministic algorithm but I think it will work on most cases

I can't figure out if this is a joke or a horrible idea.

##### Share on other sites
Antheus    2409
Quote:
 Original post by alvaroI can't figure out if this is a joke or a horrible idea.

If the shuffles are permutations without repetition, then it's a valid solution.

If shuffling can result in repeated strings, then it's not guaranteed to ever stop. It would be possible to derive the probability out of that, but that is not a deterministic algorithm anymore.

While poor fit for Von Neumann architecture, it's probably fine for quantum computers. And maybe even for genetic programming (the biological kind).

##### Share on other sites
SonicD007    464
Wow. Well I was able to use Hodgman's solution but damn I didn't realize there were this many possibilities. Well now that I feel intellectually inferior to all of you (haha) I don't think it would hurt to ask where Steve got these (algorithms?) O(nm),O(n log m), O(n log n m) ect. from.

##### Share on other sites
Concentrate    181
Another idea:

Insert string1 into a set. Then start inserting string2 into the same set, check for size increase, if so, then return false.

##### Share on other sites
SiCrane    11839
The set idea doesn't sound like it'd work. Consider the two strings "aaab" "bbba".

##### Share on other sites
owl    376
I think this is what Concentrate meant:

#include <iostream>#include <map>using namespace std;int main(){    string word1 = "holamundo";    string word2 = "doh1";    map<char,char> chars;    unsigned int prior_size=0;    for (unsigned int i=0; i<word1.length(); i++)        chars.insert(make_pair<char,char>(word1.at(i),0));    prior_size = chars.size();    for (unsigned int i=0; i<word2.length(); i++)    {        chars.insert(make_pair<char,char>(word2.at(i),0));        if (prior_size<chars.size())        {            cout << "Not all letters of word2 are in word1";            return 0;        }    }    cout << "Letters of word2 are in word1";    return 0;}

##### Share on other sites
Concentrate    181
No I think SiCrane is right. When inserting "aaab" into a set/map, the map/set size will be 2, since there are only 2 distinct characters. Therefore if you check its size increase when inserting any regular language a*b* the check for size increase will return false, because set/map doesn't allow same keys, hence your program will assume that a*b* has the same letters as "aabb". Unless, I', missing something.

## Create an account or sign in to comment

You need to be a member in order to leave a comment

## Create an account

Sign up for a new account in our community. It's easy!

Register a new account