Jump to content
  • Advertisement
Sign in to follow this  
Headkaze

[.net] Fuzzy String Matching

This topic is 3840 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I'm using a Levenshtein Distance in C# algorthm to do a fuzzy matching of a game name from a list of games.
public string GetClosestMatch(string str)
{
	string strMatch = str;
	int minDistance = Int32.MaxValue;

	if (String.IsNullOrEmpty(str))
		return strMatch;

	for (int i = 0; i < GameArray.Length; i++)
	{
		string fileName = GameArray;
		int strDistance = LevenshteinDistance.Compute(str, fileName);

		if (strDistance < minDistance)
		{
			minDistance = strDistance;
			strMatch = fileName;
		}
	}

	return strMatch;
}
The problem is, this algorithm while it does work in most cases, there are cases where this algorithm doesn't work so well. For example, consider a game "APB" I want to match it from a list of games. But it is returning "Qix" as a match instead of "APB - All Points Bulletin". Is there a more efficient algorithm for my needs, or any suggestions on how to change the current algorithm to be more suited to my needs.

Share this post


Link to post
Share on other sites
Advertisement
Quote:
For example, consider a game "APB" I want to match it from a list of games. But it is returning "Qix" as a match instead of "APB - All Points Bulletin".


Because toward "ABP" the string "Qix" has a Levensthein distance of 3 and "APB - All Points Bulletin" has one of 23.

Quote:
Is there a more efficient algorithm for my needs, or any suggestions on how to change the current algorithm to be more suited to my needs.


How about splitting up the game titles from the list into words and then calculating the minimal Levensthein distance to those instead of the whole string? That'd leave only the problem of the user entering multiple words.

Share this post


Link to post
Share on other sites
Quote:
Original post by kolrabi
Because toward "ABP" the string "Qix" has a Levensthein distance of 3 and "APB - All Points Bulletin" has one of 23.


I understand why it's choosing the wrong one. For longer game names the algorithm works quite well. I'm hoping I can modify my method to improve the matching for shorter names that are susceptible to the above problem.

Quote:
Original post by kolrabi
How about splitting up the game titles from the list into words and then calculating the minimal Levensthein distance to those instead of the whole string? That'd leave only the problem of the user entering multiple words.


This is not a search thing where users type in words, it's a program that is renaming a game list to match a special file naming convention. I'm not sure if splitting the strings up into words would improve the matching in this case?

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
Not sure how well it'd work in your case, but the Smith-Waterman algorithm for sequence alignment might be a good bet.


Quote:
The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.


Woah sounds a little overkill for my needs. I couldn't find any practical examples of it being using for comparing strings either.

Share this post


Link to post
Share on other sites
Well I improved it a little bit for those it was missing but at the same time it slightly increases the number of bad matches.

This time if it's a bad match I compare the least number of characters with non-alpha numerics removed.

public string GetClosestMatch(string str)
{
string strMatch = null;
double minDistance = Double.MaxValue;

if (String.IsNullOrEmpty(str))
return strMatch;

for (int i = 0; i < GameArray.Length; i++)
{
string fileName = Convert.RemoveBrackets(GameArray);
double strDistance = LevenshteinDistance.Compute(str, fileName);
double meanDistancePerLetter = (strDistance / str.Length);

if (meanDistancePerLetter <= 0.6)
{
if (meanDistancePerLetter < minDistance)
{
minDistance = meanDistancePerLetter;
strMatch = fileName;
}
}
else
{
string str2 = Convert.RemoveNonAlphaNumeric(str).ToLower();
string fileName2 = Convert.RemoveNonAlphaNumeric(fileName).ToLower();
int strLength = Math.Min(str2.Length, fileName2.Length);

if (str2.Substring(0, strLength) == fileName2.Substring(0, strLength))
strMatch = fileName;
}
}

return strMatch;
}

Share this post


Link to post
Share on other sites
For my program, users could type a word instead of picking from the list. I wanted to make sure that they didn't enter in a word that was on the list already, so I would use a NYSIIS Code to match strings in the list and present them to the user to make sure what they typed was correct or did they mistype.

Here is my code for the NYSIIS Code, it is by no means probably clean and probably use some improvement on the code, but it works.

public static string NYSIISCode(string item)
{
StringBuilder itemSB = new StringBuilder(item.ToUpper());

char firstChar = itemSB[0];

// remove all 'S' and 'Z' chars from the end of the surname
for (int charOffset = itemSB.Length - 1; charOffset > -1; charOffset--)
{
if (itemSB[charOffset] == 'S' || itemSB[charOffset] == 'Z')
itemSB.Remove(charOffset, 1);
else
break;
}

itemSB.Replace(" ", "");

// transcode initial strings MAC => MC
if (itemSB.Length > 2)
itemSB.Replace("MAC", "MC", 0, 3);

// transcode initial strings PF => F
if (itemSB.Length > 1)
itemSB.Replace("PF", "F", 0, 2);


// Transcode trailing strings as follows,
// IX => IC
// EX => EC
// E,EE,IE => Y
// NT,ND => D
if (itemSB.Length > 1)
{
itemSB.Replace("IX", "IC", (itemSB.Length - 2), 2);
itemSB.Replace("EX", "EC", (itemSB.Length - 2), 2);
itemSB.Replace("YE", "Y", (itemSB.Length - 2), 2);
itemSB.Replace("EE", "Y", (itemSB.Length - 2), 2);
itemSB.Replace("IE", "Y", (itemSB.Length - 2), 2);
itemSB.Replace("NT", "D", (itemSB.Length - 2), 2);
itemSB.Replace("ND", "D", (itemSB.Length - 2), 2);
}

// transcode 'EV' to 'EF' if not at start of name
if (itemSB.Length > 1)
itemSB.Replace("EV", "EF", 1, (itemSB.Length - 1));

// remove any 'W' that follows a vowel
itemSB.Replace("AW", "A");
itemSB.Replace("EW", "E");
itemSB.Replace("IW", "I");
itemSB.Replace("OW", "O");
itemSB.Replace("UW", "U");

// replace all vowels with 'A'
//itemSB.Replace("A", "A");
itemSB.Replace("E", "A");
itemSB.Replace("I", "A");
itemSB.Replace("O", "A");
itemSB.Replace("U", "A");

// transcode 'GHT' to 'GT'
itemSB.Replace("GHT", "GT");

// transcode 'DG' to 'G'
itemSB.Replace("DG", "G");

// transcode 'PH' to 'F'
itemSB.Replace("PH", "F");

// if not first character, eliminate all 'H' preceded or followed by a vowel
itemSB.Replace("AH", "A");
if (itemSB.Length > 2)
itemSB.Replace("HA", "A", 1, (itemSB.Length - 1));

// change 'KN' to 'N', else 'K' to 'C'
itemSB.Replace("KN", "N");
itemSB.Replace("K", "C");

// if not first character, change 'M' to 'N'
if (itemSB.Length > 1)
itemSB.Replace("M", "N", 1, (itemSB.Length - 1));

// if not first character, change 'Q' to 'G'
if (itemSB.Length > 1)
itemSB.Replace("Q", "G", 1, (itemSB.Length - 1));

// transcode 'SH' to 'S'
itemSB.Replace("SH", "S");

// transcode 'SCH' to 'S'
itemSB.Replace("SCH", "S");

// transcode 'YW' to 'Y'
itemSB.Replace("YW", "Y");

// if not first or last character, change 'Y' to 'A'
if (itemSB.Length > 2)
itemSB.Replace("Y", "A", 1, (itemSB.Length - 2));

// transcode 'WR' to 'R'
itemSB.Replace("WR", "R");

// if not first character, change 'Z' to 'S'
if (itemSB.Length > 1)
itemSB.Replace("Z", "S", 1, (itemSB.Length - 1));

// transcode terminal 'AY' to 'Y'
if (itemSB.Length > 2)
itemSB.Replace("AY", "Y", 1, (itemSB.Length - 1));

// remove traling vowels
// remove all 'S' and 'Z' chars from the end of the surname
for (int charOffset = itemSB.Length - 1; charOffset > -1; charOffset--)
{
if (itemSB[charOffset] == 'A')
itemSB.Remove(charOffset, 1);
else
break;
}

// collapse all strings of repeated characters
// (also removes non-alpha chars)
char previousChar = ' ';
for (int charOffset = 0; charOffset < itemSB.Length; charOffset++)
{
if(previousChar == itemSB[charOffset] || itemSB[charOffset] < 'A' || itemSB[charOffset] > 'Z')
itemSB.Remove(charOffset, 1);
else
previousChar = itemSB[charOffset];
}

// if first char of original surname is a vowel, prepend to code (or replace initial transcoded 'A')
bool prependVowel = false;
if (firstChar == 'A' || firstChar == 'E' || firstChar == 'I' || firstChar == 'O' || firstChar == 'U')
{
if (itemSB[0] == 'A')
itemSB[0] = firstChar;
else
prependVowel = true;
}

if (prependVowel)
return firstChar.ToString() + itemSB.ToString();
else
return itemSB.ToString();
}




Here is just a quick fuzzy string matching I wrote with weights based on if the character was off in the string. It looks similar to what you posted. I don't use this one, I use the NYSIIS Code. This just returns a weighted value on how close of a match it is.


public static double FuzzySearch(string listitem, string item)
{
double value = 0;

if (listitem == null || item == null)
return value;

int length = (listitem.Length < item.Length) ? listitem.Length : item.Length;

for (int i = 0; i < length; i++)
{
if (listitem == item)
value += 1;
else if ((i != 0) && (item == listitem[i - 1]))
value += 0.75;
else if ((i != (length - 1)) && (item == listitem[i + 1]))
value += 0.75;
}

value /= length;
return value;
}

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!