# [.net] Fuzzy String Matching

This topic is 3840 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I'm using a Levenshtein Distance in C# algorthm to do a fuzzy matching of a game name from a list of games.
public string GetClosestMatch(string str)
{
string strMatch = str;
int minDistance = Int32.MaxValue;

if (String.IsNullOrEmpty(str))
return strMatch;

for (int i = 0; i < GameArray.Length; i++)
{
string fileName = GameArray;
int strDistance = LevenshteinDistance.Compute(str, fileName);

if (strDistance < minDistance)
{
minDistance = strDistance;
strMatch = fileName;
}
}

return strMatch;
}
The problem is, this algorithm while it does work in most cases, there are cases where this algorithm doesn't work so well. For example, consider a game "APB" I want to match it from a list of games. But it is returning "Qix" as a match instead of "APB - All Points Bulletin". Is there a more efficient algorithm for my needs, or any suggestions on how to change the current algorithm to be more suited to my needs.

##### Share on other sites
Quote:
 For example, consider a game "APB" I want to match it from a list of games. But it is returning "Qix" as a match instead of "APB - All Points Bulletin".

Because toward "ABP" the string "Qix" has a Levensthein distance of 3 and "APB - All Points Bulletin" has one of 23.

Quote:
 Is there a more efficient algorithm for my needs, or any suggestions on how to change the current algorithm to be more suited to my needs.

How about splitting up the game titles from the list into words and then calculating the minimal Levensthein distance to those instead of the whole string? That'd leave only the problem of the user entering multiple words.

##### Share on other sites
Quote:
 Original post by kolrabiBecause toward "ABP" the string "Qix" has a Levensthein distance of 3 and "APB - All Points Bulletin" has one of 23.

I understand why it's choosing the wrong one. For longer game names the algorithm works quite well. I'm hoping I can modify my method to improve the matching for shorter names that are susceptible to the above problem.

Quote:
 Original post by kolrabiHow about splitting up the game titles from the list into words and then calculating the minimal Levensthein distance to those instead of the whole string? That'd leave only the problem of the user entering multiple words.

This is not a search thing where users type in words, it's a program that is renaming a game list to match a special file naming convention. I'm not sure if splitting the strings up into words would improve the matching in this case?

##### Share on other sites
Not sure how well it'd work in your case, but the Smith-Waterman algorithm for sequence alignment might be a good bet.

##### Share on other sites
Quote:
 Original post by SpoonbenderNot sure how well it'd work in your case, but the Smith-Waterman algorithm for sequence alignment might be a good bet.

Quote:
 The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.

Woah sounds a little overkill for my needs. I couldn't find any practical examples of it being using for comparing strings either.

##### Share on other sites
Well I improved it a little bit for those it was missing but at the same time it slightly increases the number of bad matches.

This time if it's a bad match I compare the least number of characters with non-alpha numerics removed.

public string GetClosestMatch(string str){	string strMatch = null;	double minDistance = Double.MaxValue;	if (String.IsNullOrEmpty(str))		return strMatch;	for (int i = 0; i < GameArray.Length; i++)	{		string fileName = Convert.RemoveBrackets(GameArray);		double strDistance = LevenshteinDistance.Compute(str, fileName);		double meanDistancePerLetter = (strDistance / str.Length);		if (meanDistancePerLetter <= 0.6)		{			if (meanDistancePerLetter < minDistance)			{				minDistance = meanDistancePerLetter;				strMatch = fileName;			}		}		else		{			string str2 = Convert.RemoveNonAlphaNumeric(str).ToLower();			string fileName2 = Convert.RemoveNonAlphaNumeric(fileName).ToLower();			int strLength = Math.Min(str2.Length, fileName2.Length);			if (str2.Substring(0, strLength) == fileName2.Substring(0, strLength))				strMatch = fileName;		}	}	return strMatch;}

##### Share on other sites
For my program, users could type a word instead of picking from the list. I wanted to make sure that they didn't enter in a word that was on the list already, so I would use a NYSIIS Code to match strings in the list and present them to the user to make sure what they typed was correct or did they mistype.

Here is my code for the NYSIIS Code, it is by no means probably clean and probably use some improvement on the code, but it works.
public static string NYSIISCode(string item)        {            StringBuilder itemSB = new StringBuilder(item.ToUpper());                        char firstChar = itemSB[0];            // remove all 'S' and 'Z' chars from the end of the surname            for (int charOffset = itemSB.Length - 1; charOffset > -1; charOffset--)            {                if (itemSB[charOffset] == 'S' || itemSB[charOffset] == 'Z')                    itemSB.Remove(charOffset, 1);                else                    break;            }            itemSB.Replace(" ", "");            // transcode initial strings MAC => MC             if (itemSB.Length > 2)                itemSB.Replace("MAC", "MC", 0, 3);                        // transcode initial strings PF => F             if (itemSB.Length > 1)                itemSB.Replace("PF", "F", 0, 2);            // Transcode trailing strings as follows,	            // IX => IC             // EX => EC             // E,EE,IE => Y             // NT,ND => D             if (itemSB.Length > 1)            {                itemSB.Replace("IX", "IC", (itemSB.Length - 2), 2);                itemSB.Replace("EX", "EC", (itemSB.Length - 2), 2);                itemSB.Replace("YE", "Y", (itemSB.Length - 2), 2);                itemSB.Replace("EE", "Y", (itemSB.Length - 2), 2);                itemSB.Replace("IE", "Y", (itemSB.Length - 2), 2);                itemSB.Replace("NT", "D", (itemSB.Length - 2), 2);                itemSB.Replace("ND", "D", (itemSB.Length - 2), 2);            }            // transcode 'EV' to 'EF' if not at start of name            if (itemSB.Length > 1)                itemSB.Replace("EV", "EF", 1, (itemSB.Length - 1));            // remove any 'W' that follows a vowel            itemSB.Replace("AW", "A");            itemSB.Replace("EW", "E");            itemSB.Replace("IW", "I");            itemSB.Replace("OW", "O");            itemSB.Replace("UW", "U");            // replace all vowels with 'A'            //itemSB.Replace("A", "A");            itemSB.Replace("E", "A");            itemSB.Replace("I", "A");            itemSB.Replace("O", "A");            itemSB.Replace("U", "A");            // transcode 'GHT' to 'GT'            itemSB.Replace("GHT", "GT");            // transcode 'DG' to 'G'            itemSB.Replace("DG", "G");            // transcode 'PH' to 'F'            itemSB.Replace("PH", "F");            // if not first character, eliminate all 'H' preceded or followed by a vowel            itemSB.Replace("AH", "A");            if (itemSB.Length > 2)                itemSB.Replace("HA", "A", 1, (itemSB.Length - 1));            // change 'KN' to 'N', else 'K' to 'C'            itemSB.Replace("KN", "N");            itemSB.Replace("K", "C");            // if not first character, change 'M' to 'N'            if (itemSB.Length > 1)                itemSB.Replace("M", "N", 1, (itemSB.Length - 1));            // if not first character, change 'Q' to 'G'            if (itemSB.Length > 1)                itemSB.Replace("Q", "G", 1, (itemSB.Length - 1));            // transcode 'SH' to 'S'            itemSB.Replace("SH", "S");            // transcode 'SCH' to 'S'            itemSB.Replace("SCH", "S");            // transcode 'YW' to 'Y'            itemSB.Replace("YW", "Y");            // if not first or last character, change 'Y' to 'A'            if (itemSB.Length > 2)                itemSB.Replace("Y", "A", 1, (itemSB.Length - 2));            // transcode 'WR' to 'R'            itemSB.Replace("WR", "R");            // if not first character, change 'Z' to 'S'            if (itemSB.Length > 1)                itemSB.Replace("Z", "S", 1, (itemSB.Length - 1));            // transcode terminal 'AY' to 'Y'            if (itemSB.Length > 2)                itemSB.Replace("AY", "Y", 1, (itemSB.Length - 1));            // remove traling vowels            // remove all 'S' and 'Z' chars from the end of the surname            for (int charOffset = itemSB.Length - 1; charOffset > -1; charOffset--)            {                if (itemSB[charOffset] == 'A')                    itemSB.Remove(charOffset, 1);                else                    break;            }                       // collapse all strings of repeated characters            // (also removes non-alpha chars)            char previousChar = ' ';            for (int charOffset = 0; charOffset < itemSB.Length; charOffset++)            {                if(previousChar == itemSB[charOffset] || itemSB[charOffset] < 'A' || itemSB[charOffset] > 'Z')                    itemSB.Remove(charOffset, 1);                else                    previousChar = itemSB[charOffset];            }                        // if first char of original surname is a vowel, prepend to code (or replace initial transcoded 'A')            bool prependVowel = false;            if (firstChar == 'A' || firstChar == 'E' || firstChar == 'I' || firstChar == 'O' || firstChar == 'U')            {                if (itemSB[0] == 'A')                    itemSB[0] = firstChar;                else                    prependVowel = true;            }            if (prependVowel)                return firstChar.ToString() + itemSB.ToString();            else                return itemSB.ToString();        }

Here is just a quick fuzzy string matching I wrote with weights based on if the character was off in the string. It looks similar to what you posted. I don't use this one, I use the NYSIIS Code. This just returns a weighted value on how close of a match it is.

public static double FuzzySearch(string listitem, string item)        {            double value = 0;            if (listitem == null || item == null)                return value;            int length = (listitem.Length < item.Length) ? listitem.Length : item.Length;                       for (int i = 0; i < length; i++)            {                if (listitem == item)                    value += 1;                else if ((i != 0) && (item == listitem[i - 1]))                    value += 0.75;                else if ((i != (length - 1)) && (item == listitem[i + 1]))                    value += 0.75;            }            value /= length;            return value;        }

1. 1
2. 2
Rutin
16
3. 3
4. 4
5. 5

• 26
• 11
• 9
• 9
• 11
• ### Forum Statistics

• Total Topics
633703
• Total Posts
3013454
×