Project.GoToStart() - Test Adventure Dev 03

A4L

G'Day....

I've been a little slack and haven't spent a lot of time on the project the last few days, but after an extended session last night and today I think it may be time to strip the entire project back and have another whack at it by building fresh code with a clearer idea of what I want to do. I have a lot of stuff in the project that is wrong or needs re-jigging, and I am starting to feel that if I just restarted with all the lessons learned I could make a much better version of the application without having to constantly "work around" my old junk code and systems. So basically I am going back to the drawing board. I also wanted to go through my command system idea in a way that I hope will spell it out for others, so they can give me input, and for myself, so it can act as notes for my rebuild.

The Main Goal of this Project

The real focus of this project is not to make a working game, but to learn programming and get an idea of what it takes to realise an application. Obviously I want to make a working game, but what it is really about is setting myself a challenge and following it through. To do this I chose a "Text Adventure Game" written in C# using the console, but I wanted to build it in a way that was more interactive than a simple two-word parser that only accepts "get hat", "look room", "move north" or whatever. I wanted it to be able to understand commands like "Pick up the hat and put it on" or "Smash the lock with the pickaxe and look inside", stuff like that.

So while, yes, I plan to finish this as a game, restarting is not an issue for me as it is the "path" to get to the end result that matters, the learning and experimenting, not so much the game itself.

Tokenise Code

The first part of the new build is going to focus on the user input. I wanted to rebuild what I have done, but in a cleaner way. This game will take lines of user input and parse them into commands. To do that we are going to be using this tokeniser code.

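        // Splits a line of input into lower-case words with unwanted characters removed.
        public static List<string> TokenizeStringList (string input)
        {
            List<string> cleanedInputList = new List<string>();
            // Split(null) splits on any whitespace character.
            string[] raw_cleanedInputString = input.ToLower().Trim().Split(null);

            foreach (string word in raw_cleanedInputString)
            {
                // Repeated spaces produce empty tokens, so skip them.
                if (word != "")
                {
                    string s = CleanedWord(word);
                    // A word made only of punctuation cleans down to "", so skip that too.
                    if (s != "") cleanedInputList.Add(s);
                }
            }
            return cleanedInputList;
        }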

So this code simply takes in an input string and spits out a List<string> with one word from the string in each element.

It also sets the string to lower case and trims off any leading or trailing spaces before splitting it on whitespace. Each token is then tested to see if it is an empty string, because a run like (space)(space) splits into an empty token between the two spaces. Long story short, by the time we are inside the if statement we are working with a single word, all in lower case, with no leading or trailing spaces and never a blank string.

Once we have that word we do some quick cleaning of unwanted characters.

        public static string CleanedWord(string word)
        {
            char[] whiteList = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'".ToArray();
            char[] rawChars = word.ToArray();
            List<char> cleanedChars = new List<char>();
            for (int i = 0; i < rawChars.Length; i++)
            {
                foreach (char c in whiteList)
                {
                    if (rawChars[i] == c)
                    { cleanedChars.Add(c); }
                }
            }
            if (cleanedChars.Count == 0)
            { return ""; }
            else
            {
                char[] c = cleanedChars.ToArray();
                string s = new string(c);
                return s;
            }
        }

My original code, as you can see, was a lot more verbose than what I ended up using, but it worked fine. I had some issues later on with reference vs copying of List<string>s, and while I was trying to fix that problem I found a neat couple of lines on the internet that basically do the same thing.

        public static string CleanedWord(string word)
        {
            var banList = "~`!@#$%^&*()_+{}|[]\\:;\",<.>/?".ToCharArray();
            return string.Join("", word.Where(s => !banList.Contains(s)));
        }

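What both these code blocks are doing with CleanedWord(string) is removing the non-alphanumeric chars from the word, except the ' (apostrophe). So I needed to get rid of brackets, dollar signs, exclamation points and all that, but keep the apostrophe and any letters and numbers. One difference worth noting: the ban list version only removes what is on the list, so characters like - and = still get through, whereas the white list version drops anything that is not explicitly allowed.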

So after all that we end up with a List<string> with a cleaned word in each element.
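For reference, here is a minimal sketch of how the whole tokenise step could be collapsed into one LINQ pipeline. It is only one possible arrangement, not the code used in the project: the class name TokenizerSketch is made up for the example, it uses a white list like the first CleanedWord above, and ToLowerInvariant() and StringSplitOptions.RemoveEmptyEntries are swapped in as assumptions of mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class name, used only for this sketch.
public static class TokenizerSketch
{
    // Keep ASCII letters, digits and the apostrophe; drop everything else.
    public static string CleanedWord(string word)
    {
        return new string(word.Where(c =>
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') || c == '\'').ToArray());
    }

    public static List<string> TokenizeStringList(string input)
    {
        // A null separator array means "split on whitespace";
        // RemoveEmptyEntries discards the empty tokens that repeated spaces produce.
        return input.ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => CleanedWord(w))
            .Where(w => w != "")   // words made only of punctuation clean down to ""
            .ToList();
    }
}
```

Calling TokenizerSketch.TokenizeStringList("Pick up the  hat, and put it on!") gives the eight tokens pick, up, the, hat, and, put, it, on.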

Stemming the Words

What I am trying to do is distil an unknown list of words entered by the user into usable keywords I can match to my word lists. One of the problems is that words have a great many forms. I believe it is called morphology. So...

  • Looks
  • Looking
  • Looked

Are all forms of the word look. In a "real" NLP system you need to understand these different versions of the word look, but in my little application I can get away with reducing these words down to their base form. It really makes no difference in a practical way if the user types "Give the deer her freedom" or "free deer". By stemming the word Freedom to Free, I reduce the keyword search and understanding requirements. But here is the crux. It is my belief that if they type "Give the deer her freedom" and the game only understands that as "free deer" due to my keyword system, they will get the impression that the application is understanding more than it really does. The important thing here is to have the program respond in a predictable way.

Now stemming words is a lot harder than it seems. I originally was just removing things like "ing" from any words I found, but there were a lot of words that came out wrong. I did end up having a pretty decent code block for it though. It ignored words below a certain length, wouldn't remove an "s" from the back of a word if "s" was the previous letter, stuff like that. I felt the stemming was working very well, but occasionally in my testing I would find a problem and have to add an exception for that problem.
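To give an idea of the kind of rules described above, here is a small sketch of a naive suffix stripper. It is not the RoughStem(string) code from the project (that code is not shown here); the class name RoughStemSketch, the minimum length of 4 and the suffix list are all assumptions made up for illustration.

```csharp
using System;

// Hypothetical class; a naive suffix stripper for illustration only.
public static class RoughStemSketch
{
    // Assumed threshold: words shorter than this are returned unchanged.
    private const int MinLength = 4;

    public static string Stem(string word)
    {
        if (word.Length < MinLength) return word;

        // Try "ing" then "ed", but only if at least three letters would remain.
        foreach (string suffix in new[] { "ing", "ed" })
        {
            if (word.EndsWith(suffix, StringComparison.Ordinal) &&
                word.Length - suffix.Length >= 3)
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }

        // Strip a trailing "s" unless the previous letter is also "s" (e.g. "glass").
        if (word.EndsWith("s", StringComparison.Ordinal) &&
            !word.EndsWith("ss", StringComparison.Ordinal))
        {
            return word.Substring(0, word.Length - 1);
        }
        return word;
    }
}
```

With these rules "looks", "looking" and "looked" all come back as "look", and "glass" and "ring" are left alone, but "taking" becomes "tak" and "this" becomes "thi". Those are exactly the sort of cases that end up needing one exception after another.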

What I ended up with was a very rough bit of code I called RoughStem(string)... still, during my googling I found out about a super duper, professionally made "algorithm" that does exactly what I am looking for, called "Porter2". I found it immensely satisfying that the Porter2 algorithm was basically doing everything like I was, but, well, better and in a more robust way. It uses sub-string lists to find common suffixes, then uses logic to determine if that part of the word is normally removed or not. What it did really differently to my experiments was to split the word before stemming and then stitch the parts back together. I was happy I was on the right track at least.

  • You can read all about the algorithm <here>

I got about halfway through building this algorithm in my noobish ways when I found some code on GitHub, written in C# and under the MIT licence, that implements the Porter2 algorithm and that I could just plug directly into my application.

  • You can see the GitHub Page <here>

After some back and forth I decided to abandon my "mostly" working RoughStem(string) method and my out-of-my-depth attempts to build the algorithm myself, and just use this GitHub code by "Nemec_".

So I added a function that simply calls the Porter2 code and stems all the words in a List<string>.

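        // Note: this overwrites the entries of the list passed in and returns that same list.
        public static List<string> StemWordList (List<string> wordlist)
        {
            for (int i = 0; i < wordlist.Count; i++)
            {
                // StemWord is the Porter2 stemmer from the GitHub code.
                string StemValue = StemWord.Stem(wordlist[i]).Value;
                wordlist[i] = StemValue;
            }
            return wordlist;
        }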
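Because a List<string> is passed by reference, the function above changes the caller's list as well as returning it. If the cleaned words need to survive alongside the stemmed ones, a variant that builds a new list is one way to do it. This is only a sketch: the class name StemListSketch is hypothetical, and the stemmer is passed in as a Func<string, string> so the example does not depend on any particular library. With the Porter2 code used above, the argument would presumably be something like w => StemWord.Stem(w).Value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class name, used only for this sketch.
public static class StemListSketch
{
    // stemWord is a stand-in for whatever stemmer is plugged in.
    // The input sequence is only read; a brand new list is returned.
    public static List<string> StemWordList(
        IEnumerable<string> words, Func<string, string> stemWord)
    {
        return words.Select(stemWord).ToList();
    }
}
```

For example, with a toy stemmer that only trims a trailing "s":

```csharp
var cleaned = new List<string> { "looking", "hats" };
var stemmed = StemListSketch.StemWordList(cleaned, w => w.TrimEnd('s'));
// cleaned still holds "looking", "hats"; stemmed holds "looking", "hat".
```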

Results

So after all that I think my tokenisation code is complete and finalised. It takes any string input from the user and builds 3 variables from it, which are stored until a new string input is entered.

  • string rawString - Single String that contains the entire line entered with no modifications.
  • List<string> cleanedInputTokens - This is a List<string> that has a cleaned word in every element.
  • List<string> stemmedInputTokens - This is a List<string> that has all the cleaned words stemmed and ready to be sent to Processing.

I may produce a 4th list that contains all the words but not cleaned or stemmed, but I think for now the default input string and the cleaned but not stemmed list are enough. I can use both of these, or just the cleaned list, to build response lines to feed back to the player, and use the stemmed list to actually process the command.

The only other modification I may make before I continue is to dump the List<string>s completely. As in, once I have built them, return them as string[] arrays. This would make them a little easier to work with, I think.

Well that's that!

Ok.. so that is the tokeniser!! It seems to be working very well, and I am glad I am rebuilding the project. It is a slow way to go but it really seems to be helping me get everything back into action.
