Sign in to follow this  

Most efficient way to find out if two words have the same letters within them

This topic is 2566 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

The topic title might sound a little confusing, but basically I want to unscramble one word by comparing it to another word. Right now I'm checking if both words have the same length then try to see if the scrambled word contains all the letters the normal word contains. At least that's what I'm trying to do. I'm not sure how I can tell if scrambled word contains the same letters as the original. I've looked online for a while now and a solution that seems promising (but also doesn't seem too good if there are special characters used i.e. # ¿ ♠) is to assign each letter a prime number and multiply these numbers together to get a unique number for that specific word.
example:
a = 2 b = 3 c = 5 d = 7 ect...
so the "word" abc would equal 30 where as the word cda would equal 70
This solution makes sense but is this the most efficient way of solving this problem? Is there any other way to do this?

Another way might be using the ASCII character values but...well..I'm not entirely sure how I would use them to get the results I want.

Thanks.

Share this post


Link to post
Share on other sites
If I understand what you're asking, you can make a copy of both strings, sort both of the copies and see if the copies are equal.

Share this post


Link to post
Share on other sites
I'm not completely sure but here is an idea,

Let A be some string of characters
Let B be some string of characters

To check if A and B has the same character you can do the following,

  • radixSort(A) # using the ascii values of each letter in A -- O(n)

  • radixSort(B) # using the ascii values of each letter in B -- O(n)

  • compare each letter by letter in A and B



Not sure if thats what your trying to achieve.

Share this post


Link to post
Share on other sites
You could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.
e.g.
//assuming your "words" are less than 256 characters, we can use single-byte counters
typedef unsigned char number;

//assuming basic ASCII strings, we only have 256 different characters to consider
const int numCharacters = 256;
number word1Count[numCharacters];
number word2Count[numCharacters];

int sizeInBytes = sizeof(number)*numCharacters;
memset( word1Count, 0, sizeInBytes );
memset( word2Count, 0, sizeInBytes );

for( const char* str = word1; *str; ++str )
++word1Count[*str];

for( const char* str = word2; *str; ++str )
++word2Count[*str];

bool bothCountsAreEqual = (memcmp(word1Count, word2Count, sizeInBytes)==0);

Share this post


Link to post
Share on other sites
The problem with the prime number approach is you have a limited amount of characters you can check per number of bits available to store the multiplied primes. Assuming a 32-bit unsigned integer and 26 letters, 'z' would equal '101'. The string "zzzzzz" won't fit in 32-bits. In reality you only get like 5 characters that way. Even going up to 128 bits you only get up to 20 characters.

I'd go with SiCrane's or Hodgman's ideas.

Share this post


Link to post
Share on other sites
Quote:
Original post by Hodgman
You could use something similar to the counting sort algorithm. Count how many times each ASCII character appears in each string - if the counts are equal, then both strings contain the same characters.
e.g.*** Source Snippet Removed ***


this should be pretty efficient.

You can check first if the lengths of the strings are equal and if they are, you can use the same loop to count the occurrences.

Share this post


Link to post
Share on other sites
@nobodynews: Good catch there, I didn't even notice that error.
@Hodgman: I like that idea. I'm gonna go with your approach on this problem.

Thank you for all the replies. Learn something new everyday :)

Share this post


Link to post
Share on other sites
Here's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.


def partition(A):
if len(A) == 1: return
pivot = A[0]
cursor = 0
for i in range(len(A)):
if A[i] <= pivot:
A[cursor],A[i] = A[i],A[cursor]
cursor += 1
return cursor

def quickcomp(A, B):
if len(A) != len(B): return false
if len(A) <= 1:
if len(A) == 0: return true
else: return (A[0] == B[0])

a_pivloc = partition(A)
b_pivloc = partition(B)

if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false
if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false

return true


Without thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.

Share this post


Link to post
Share on other sites
If your goal is to find all possible words based on scrambled word by searching some sort of database of words, presented solution taken as is could be to slow.
In such a case I would recommend preprocessing your words database first in the following manner:

1. You take each word from the database, sort it first, and next copy it to other database, remembering original word and sorted one. This new database should be sorted by sorted names (sounds strange ;))
For example, if you have words: 'cat', 'dog', 'god', 'good', 'ape', that would be:
act->cat
aep->ape
dog->dog, god
doog->good

It is better to perform this step previously and next work on preprocessed database, as when comparing scrambled name you will not have to sort each word from database during comparison, so it will be much faster.
If you fear it will take too long, consider that it has to be done only once and if you won't do this such way, you will have to perform this task during each search.

2. When searching for true word from the scrambled one, this task will be just simple searching word from the database, what could be obtained either by some sort of logarithmic search, or by using hash-arrays, or by constructing tree from the database.

Good luck
Pomnico

Edit:
By database I don't mean SQL necessary.

Share this post


Link to post
Share on other sites
Quote:
Original post by Sneftel
Here's an idea: Quickcomp. (Quickeq? I can't think of a good name.) Like quicksort, but you sort the two arrays in parallel, and early-out if the partition sizes differ.


def partition(A):
if len(A) == 1: return
pivot = A[0]
cursor = 0
for i in range(len(A)):
if A[i] <= pivot:
A[cursor],A[i] = A[i],A[cursor]
cursor += 1
return cursor

def quickcomp(A, B):
if len(A) != len(B): return false
if len(A) <= 1:
if len(A) == 0: return true
else: return (A[0] == B[0])

a_pivloc = partition(A)
b_pivloc = partition(B)

if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false
if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false

return true


Without thinking about it too hard, it seems like this should perform in average time O(n) if the sets differ, and O(n log n) if they are the same. My intuition is that that's the best you can do without non-comparison-based stuff.
That's a cool algorithm idea. I decided to implement it myself and one problem I ran into is that you need to ensure that the partitioning step picks the same item for both a and b. If it picks a different pivot then a different number of items will be less than it.

SiCrane mentioned my idea.

Share this post


Link to post
Share on other sites
Well another way to test if the letters of word2 are the ones in word1 you can do the following:

1. Have an array A[26]. Initialize it to 0

2. For every letter of word1, set A[letter] = 1

3. set sum=0

4. For every letter of word2 check if A[letter_from_word2]==1. If yes sum++

if sum equals the length of word2 then its true

Share this post


Link to post
Share on other sites
The counting method is pretty common and effecient.
However,I'm particulary fond of this method :

bool equal(char* str1,char* str2){
if ( *str1 == '\0' && *str2 == '\0')
return true;
char* s = strchr(str2,*str1);
if (!s)
return false;
swap(*s,*str2);
return equal(str1+1,str2+1);
}




Example
1)"TomMarvoloRiddle" "IamLordVoldemort" -> "TomMarvoloRiddle" "TamLordVoldemorI"
2)"omMarvoloRiddle" "amLordVoldemorI" -> "omMarvoloRiddle" "TmLardVoldemorI"
3)"mMarvoloRiddle" "mLardVoldemorI" -> "mMarvoloRiddle" "mLordVoldemorI"
4)"MarvoloRiddle" "LardVoldemorI" -> "MarvoloRiddle" "mardVoldeLorI"
5)"arvoloRiddle" "ardVoldeLorI" -> "arvoloRiddle" "ardVoldeLorI"
6)"rvoloRiddle" "rdVoldeLorI" -> "arvoloRiddle" "ardVoldeLorI"
7)"voloRiddle" "dVoldeLorI" -> "voloRiddle" "VdoldeLorI"
8)"oloRiddle" "doldeLorI" ->"oloRiddle" "odldeLorI"
9)"loRiddle" "dldeLorI" -> "loRiddle" "lddeLorI"
10)"oRiddle" "ddeLorI" -> "oRiddle" "odeLdrI"
11)"Riddle" "deLdrI" -> "Riddle" "reLddI"
12)"iddle" "eLddI" -> "iddle" "iLdde"
13)"ddle" "Ldde" -> "ddle" "dlde"
14)"dle" "lde" -> "dle" "dle"
15)"le" "de" -> "le" "le"
16)"e" "e" -> "e" "e"
"" "" -> True

Share this post


Link to post
Share on other sites
Quote:
Original post by iMalc
one problem I ran into is that you need to ensure that the partitioning step picks the same item for both a and b. If it picks a different pivot then a different number of items will be less than it.

Wow, I can't believe I didn't think of that. Nice catch.

def partition(A, pivot):
if len(A) == 1: return
cursor = 0
for i in range(len(A)):
if A[i] <= pivot:
A[cursor],A[i] = A[i],A[cursor]
cursor += 1
return cursor

def quickcomp(A, B):
if len(A) != len(B): return false
if len(A) <= 1:
if len(A) == 0: return true
else: return (A[0] == B[0])

pivot = A[0]
a_pivloc = partition(A, pivot)
b_pivloc = partition(B, pivot)

if not quickcomp(A[:a_pivloc-1], B[:a_pivloc-1]): return false
if not quickcomp(A[a_pivloc+1:], B[a_pivloc+1:]): return false

return true



Share this post


Link to post
Share on other sites
The best way for just two strings is the counting sort suggestion that was mentioned. It is O(n) in time and O(1) in memory.

If the problem is "find the matching palindromic string in a database" then the problem is harder...doing just the counting sort suggestion on each string in the database sequentially would be O(nm) where n is the max string size and m is the size of the database.
Using a binary search algorithm, this could be significantly reduced to O(n log m).

However, there's a different algorithm for the database approach. Using a Trie you can solve the database lookup in O(n log n) time. To construct the trie, sort each string in the database and add it to the trie. This is O(n log n m) time. Then, to look up whether a string is in the database, sort the string ( O(n log n) ) and traverse the trie (O(n)) to see if it exists.

Share this post


Link to post
Share on other sites
Quote:
Original post by Steve132
The best way for just two strings is the counting sort suggestion that was mentioned. It is O(n) in time and O(1) in memory.

Counting sort is vastly inferior:
- O(1) space is very big, wchar_t can be a 32-bit number, some encodings of chars can be even longer
- it cannot detect equal characters with different encodings
- Each call must reset the buckets, which will be up to 2^32 writes and will dwarf running time of the rest of algorithm, the n^2 algorithm with perhaps early exit will complete faster for typical strings

The space/initialization problem can be solved using different lookup structure, but those have logn complexity in general, so the end result is n logn.


Unless the "best" way is academia or interview question, which are completely disjoint from real world issues.

Share this post


Link to post
Share on other sites
For the search-in-a-database problem you could also compute a hash of the char counts in O(n) and then index the database by it in O(1).

EDIT: Take a look at this code:
#include <iostream>
#include <string>

typedef unsigned long long u64;

u64 char_hash(char c) {
return (c * 0xa946bc89239fd1c5ull) ^ (c * 0x2772eae805def16bull);
}

u64 commutative_hash(std::string const &s) {
u64 result = 0ull;
for (std::string::const_iterator it = s.begin(), end=s.end();
it != end; ++it) {
result += char_hash(*it);
}
return result;
}

int main() {
std::cout << commutative_hash("tommarvoloriddle") << '\n';
std::cout << commutative_hash("iamlordvoldemort") << '\n';
}




Of course, having the same commutative_hash doesn't mean that the strings contain the same letters, but it is a necessary condition and collisions should be exceedingly rare. For many applications this might be enough. In the case of a database it would at least dramatically narrow the set of words you have to compare against.



[Edited by - alvaro on December 3, 2010 1:43:42 PM]

Share this post


Link to post
Share on other sites
you can constantly shuffle the second string and check if it is equal to the first
if it is there you go
if it is not try again

after some hours if you still could not detected their equality than you can be %99.99 sure that they do not include same letters

this is not a deterministic algorithm but I think it will work on most cases

Share this post


Link to post
Share on other sites
Quote:
Original post by Serdar-X
you can constantly shuffle the second string and check if it is equal to the first
if it is there you go
if it is not try again

after some hours if you still could not detected their equality than you can be %99.99 sure that they do not include same letters

this is not a deterministic algorithm but I think it will work on most cases


I can't figure out if this is a joke or a horrible idea.

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro

I can't figure out if this is a joke or a horrible idea.


If the shuffles are permutations without repetition, then it's a valid solution.

If shuffling can result in repeated strings, then it's not guaranteed to ever stop. It would be possible to derive the probability out of that, but that is not a deterministic algorithm anymore.


While poor fit for Von Neumann architecture, it's probably fine for quantum computers. And maybe even for genetic programming (the biological kind).

Share this post


Link to post
Share on other sites
Wow. Well I was able to use Hodgman's solution but damn I didn't realize there were this many possibilities. Well now that I feel intellectually inferior to all of you (haha) I don't think it would hurt to ask where Steve got these (algorithms?) O(nm),O(n log m), O(n log n m) ect. from.

Share this post


Link to post
Share on other sites
I think this is what Concentrate meant:



#include <iostream>
#include <map>

using namespace std;

int main()
{
string word1 = "holamundo";
string word2 = "doh1";
map<char,char> chars;
unsigned int prior_size=0;

for (unsigned int i=0; i<word1.length(); i++)
chars.insert(make_pair<char,char>(word1.at(i),0));

prior_size = chars.size();

for (unsigned int i=0; i<word2.length(); i++)
{
chars.insert(make_pair<char,char>(word2.at(i),0));
if (prior_size<chars.size())
{
cout << "Not all letters of word2 are in word1";
return 0;
}
}

cout << "Letters of word2 are in word1";
return 0;
}

Share this post


Link to post
Share on other sites
No I think SiCrane is right. When inserting "aaab" into a set/map, the map/set size will be 2, since there are only 2 distinct characters. Therefore if you check its size increase when inserting any regular language a*b* the check for size increase will return false, because set/map doesn't allow same keys, hence your program will assume that a*b* has the same letters as "aabb". Unless, I', missing something.

Share this post


Link to post
Share on other sites

This topic is 2566 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this