A Python Program that can Interpret a sentence (help)

7 comments, last by Tutorial Doctor 9 years, 4 months ago
I have made a simple Python program with no dependencies (only string manipulation) that can interpret a sentence based on word associations.
It sorta feels good right now, but every word with punctuation attached is interpreted as one word, and I can't interpret the word and its punctuation separately.

Now this is a good thing in many ways, except for the fact that the last word in a sentence ending in a period is interpreted as that word plus the period.

But perhaps that is also a good thing, so that "What?" is interpreted differently from "What." But most sentences shouldn't need a separate interpretation every time they end with a period, or should they? Perhaps I should make a function for words ending with punctuation.

The bypass() function bypasses all unassociated words in the sentence, so I guess this works?

Can anyone find any examples where this system will not work?


Edit: My code is much shorter now:

sentence = 'Hello World'

association = {
    'Hello':'Hola',
    'World':'Mundo'
}

# Replace every associated word that occurs in the sentence.
for word in association:
    if word in sentence:
        sentence = sentence.replace(word,association[word])
print(sentence)
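
On the question of where this breaks: str.replace matches substrings, so 'World' inside 'Worldwide' would also be replaced, and a trailing period stays glued to the word. Below is a minimal sketch of one possible workaround, assuming whole-word lookup with trailing punctuation split off first. The helper names split_punctuation and interpret are made up for this illustration and are not part of the original program.

```python
import string

# Same association table as in the post.
association = {'Hello': 'Hola', 'World': 'Mundo'}

def split_punctuation(word):
    """Split a word into (core, trailing punctuation)."""
    core = word.rstrip(string.punctuation)
    return core, word[len(core):]

def interpret(sentence, table):
    """Replace whole words only, re-attaching any trailing punctuation."""
    result = []
    for word in sentence.split():
        core, tail = split_punctuation(word)
        # Unassociated words pass through unchanged.
        result.append(table.get(core, core) + tail)
    return ' '.join(result)

print(interpret('Hello World.', association))     # Hola Mundo.
print(interpret('Hello Worldwide', association))  # Hola Worldwide
```

If "What?" and "What." should still be interpreted differently, the tail returned by split_punctuation could be looked up as well instead of being copied through.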

They call me the Tutorial Doctor.

At the bare minimum, you need: http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)

Tokenization on English isn't too hard unless you need to handle input with lots of typos. Let's ignore typos for now. The fast-and-easy way is: Treat a continuous series of alphanumeric characters (also known as "word characters") as one token and treat all other characters as individual tokens. Discard whitespace after tokenization if it's not important to the rest of your code.

Tokenization for other languages (mainly those without spaces) is MUCH harder: http://en.wikipedia.org/wiki/Text_segmentation


So: "What's yours is mine, and what's mine is yours." would turn into:


What
'
s
yours
is
mine
,
and
what
'
s
mine
is
yours
.
NOTE: Contractions can be ambiguous. "I'd" could mean either "I would" or "I had", and perhaps other possibilities. I wouldn't deal with them at this point unless they're completely unambiguous, since you'll need a lot more information in order to decide which possibility is the most likely.
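
The fast-and-easy rule described above can be sketched with a single regular expression. This is only an illustration, assuming Python's re module; the function name tokenize is made up here. Note that \w also matches the underscore, and that whitespace is dropped during matching rather than afterwards.

```python
import re

def tokenize(text):
    """Runs of word characters become one token; every other
    non-whitespace character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("What's yours is mine, and what's mine is yours.")
print(tokens)
# ['What', "'", 's', 'yours', 'is', 'mine', ',', 'and',
#  'what', "'", 's', 'mine', 'is', 'yours', '.']
```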

Shouldn't it output:

"What's mine is yours and what's yours is mine?"

I am just hoping it took the sentence as what it was somehow.

I actually had looked into tokenization, and couldn't come to grips with how it worked. It seemed too complicated, so I tried my own ideas.

I do need a way to determine multiple associations. But I wonder if basic English grammar, or some way to look at context, will help. Right now, I don't know how I would do that.

I do have a script that spell checks input from a sentence. Let me see if I can find it....

They call me the Tutorial Doctor.

Okay, the following code corrects a string based on some sample text.


#http://norvig.com/spell-correct.html
# By Peter Norvig
 
import re, collections
 
 
sample_text = 'you are not fair'
misspelled = 'yu ar nt far'
 
def words(text): return re.findall('[a-z]+', text.lower()) 
 
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
 
NWORDS = train(words(sample_text))
 
alphabet = 'abcdefghijklmnopqrstuvwxyz'
 
def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
 
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
 
def known(words): return set(w for w in words if w in NWORDS)
 
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
 
# Correct each word of the misspelled sentence and print the result on one line.
print(' '.join(correct(word) for word in misspelled.split()))
 

Could I mix this in somehow?

>> you are not fair

They call me the Tutorial Doctor.

I need to be able to do Phrases as well, not just words. Any tips?

They call me the Tutorial Doctor.

I need to be able to do Phrases as well, not just words. Any tips?


It really depends on what you want to do with the phrases, and how tolerant you want it to be of malformed input.

There are tons of possible places for you to start. I personally have done some GLR parsing with some success, but I'm not really satisfied with it since it's tough to deal with intentionally malformed sentences.

http://en.wikipedia.org/wiki/Markov_algorithm (this most closely resembles what you're doing right now)

http://en.wikipedia.org/wiki/GLR_parser
read: http://en.wikipedia.org/wiki/LR_parser first


http://en.wikipedia.org/wiki/Head-driven_phrase_structure_grammar (this is interesting from the generative point of view but is harder to use from the parsing perspective)

Ahh, good ol Markov chains. Thanks Nypyren. I will look into all of this.

I do have some pretty interesting stuff going now. I can get phrases, but I can't get super phrases of a phrase in a sentence. I will post what I have when I am on the laptop.

They call me the Tutorial Doctor.

Ahh, good ol Markov chains. Thanks Nypyren. I will look into all of this.


The Markov algorithm and Markov chains are unrelated: Markov chains have to do with sequential probabilities, whereas a Markov algorithm is a string rewriting system. You can use Markov *chains* to assist with parsing when trying to determine how to solve ambiguities by analyzing probabilities of each different option, but you'll need a string rewriter or a parser to actually figure out the structure of the sentence.
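
To make the difference concrete, here is a rough sketch of both ideas side by side. It is a simplification under stated assumptions: the rewriter leaves out the terminating rules that a full Markov algorithm has, and the chain part only counts which word follows which. The names markov_rewrite and transition_counts are invented for this example.

```python
from collections import Counter, defaultdict

def markov_rewrite(text, rules, max_steps=1000):
    """String rewriting: apply ordered (pattern, replacement) rules.

    On each step the first rule whose pattern occurs in the text is applied
    to its leftmost occurrence, then scanning restarts from the first rule.
    """
    for _ in range(max_steps):
        for pattern, replacement in rules:
            if pattern in text:
                text = text.replace(pattern, replacement, 1)
                break
        else:
            return text  # no rule applied, so rewriting is finished
    raise RuntimeError('rules did not terminate')

def transition_counts(tokens):
    """Markov chain statistics: how often each word follows another."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

rules = [('Hello', 'Hola'), ('World', 'Mundo')]
print(markov_rewrite('Hello World', rules))  # Hola Mundo

tokens = "what's mine is yours and what's yours is mine".split()
counts = transition_counts(tokens)
print(dict(counts['is']))
```

The rewriter changes the string; the counts only describe it, which is why they help rank options but cannot produce a structure by themselves.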

I do have some pretty interesting stuff going now. I can get phrases, but I can't get super phrases of a phrase in a sentence. I will post what I have when I am on the laptop.


Most parsers work by grouping words into phrases, combining phrases, then returning a single sentence, using a bottom-up approach. If you were to do this with a string rewriter, it's kind of like grouping words with parentheses and then treating those as a single unit from then on. A really simplified example of what a parser will try to do is:


What's mine is yours and what's yours is mine
A = (what's mine) -> A is yours and what's yours is mine
B = (A is yours) -> B and what's yours is mine
C = (what's yours) -> B and C is mine
D = (C is mine) -> B and D
E = (B and D) -> E
Done.
Of course, in reality the problem is that "and" is ambiguous and is going to cause the parser to have to try multiple options:


What's mine is yours and what's yours is mine
A = (what's mine) -> A is yours and what's yours is mine
B = (A is yours) -> B and what's yours is mine
C = (what's yours) -> B and C is mine
D = (B and C) -> D is mine
E = (D is mine) -> E
Done.
Things like GLR parsers are neat because they will parse *both* ways at the same time, and then you decide which possibilities to keep. Some sentences in fact are intended to have multiple meanings at the same time.
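
The grouping idea can be sketched as a tiny bottom-up reducer. Everything in it is an assumption made for illustration: the word categories, the three grammar rules, and the function names are hypothetical and cover only the example sentence. It is also deterministic, so it commits to the first reading, (B and D), and never tries the second one the way a GLR parser would.

```python
# Hypothetical toy lexicon and grammar, just enough for the example sentence.
LEXICON = {"what's": 'WH', 'mine': 'PRON', 'yours': 'PRON',
           'is': 'IS', 'and': 'AND'}
RULES = [
    ('NP', ['WH', 'PRON']),        # (what's mine)
    ('S', ['NP', 'IS', 'PRON']),   # (A is yours)
    ('S', ['S', 'AND', 'S']),      # (B and D)
]

def category(item):
    """A grouped phrase is a tuple whose first element is its label."""
    return item[0] if isinstance(item, tuple) else LEXICON[item.lower()]

def reduce_once(items):
    """Apply the first rule that matches, at its leftmost position."""
    cats = [category(item) for item in items]
    for label, pattern in RULES:
        n = len(pattern)
        for i in range(len(items) - n + 1):
            if cats[i:i + n] == pattern:
                group = (label,) + tuple(items[i:i + n])
                return items[:i] + [group] + items[i + n:]
    return None  # nothing left to group

def parse(sentence):
    """Keep grouping until no rule applies; return the remaining items."""
    items = sentence.split()
    while True:
        reduced = reduce_once(items)
        if reduced is None:
            return items
        items = reduced

tree = parse("What's mine is yours and what's yours is mine")
print(tree)
```

The groups are built in a slightly different order than in the walkthrough above (both "what's ..." phrases are grouped first), but the final nesting is the same as the first reading.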
I've been working on several small systems dealing with interpreting (working on a fuzzy logic one now). I like your suggestion above. I do need one that can answer questions (mine is just trying to interpret statements.)

I like that setup too.

I could have a set of question words, and then somehow create a system on how to interpret a question.
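
A first rough cut at the question-word idea might look like the sketch below. The word list and the name is_question are assumptions for illustration only, and the check is deliberately naive: it looks at the first word and at a trailing question mark, nothing more.

```python
# Hypothetical starter set of question words.
QUESTION_WORDS = {'who', 'what', 'when', 'where', 'why', 'how', 'which'}

def is_question(sentence):
    """Guess whether a sentence is a question."""
    text = sentence.strip()
    words = text.split()
    if not words:
        return False
    # Strip punctuation so that "What?" still matches 'what'.
    first = words[0].lower().strip('?!.,')
    return text.endswith('?') or first in QUESTION_WORDS

print(is_question('What is yours?'))         # True
print(is_question('Where are you'))          # True
print(is_question('You are not fair.'))      # False
print(is_question("What's yours is mine."))  # False
```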

I think I almost have a supposition system (it can suppose something based on a condition). But I need to add a bit of probability, so I am simulating set notation in Python. Etc.

Cool stuff nevertheless.

Decided to check Stack Overflow for help:

# Sets
#import dis

A = [1,2,3]
B = [4,5,6,3,2]
C = ['a','b',3]

### BORROWED CODE
def getIntersection(*S):
    # Convert every input list to a set, then intersect them one by one.
    sets = iter(map(set, S))
    result = next(sets)
    for s in sets:
        result = result.intersection(s)
    return list(result)
print(getIntersection(A,B,C))  # [3]
### END BORROWED CODE


def getUnion(*S):
    # set().union accepts any number of iterables.
    result = set().union(*S)
    return list(result)
print(getUnion(A,B,C))

They call me the Tutorial Doctor.

This topic is closed to new replies.
