Generating language

2 comments, last by spraff 15 years, 3 months ago
Hi, everyone. I'd like to play around with the statistical generation of sentences. I've heard of a couple of fairly old real-world examples (MegaHAL, for instance), but I'd like some advice on how to proceed with my own implementation. I want to incrementally feed the algorithm sentences and see it incrementally improve in sophistication. Not a new problem, right? For the next couple of paragraphs I'll be thinking out loud, so bear with me. :)

I've seen some cute stuff on N-grams, but they don't scale: a dense 4-gram table over the Brown corpus's vocabulary would run to thousands of petabytes. So something based on sparse data structures is clearly necessary. But if the data set is growing incrementally, how do we decide whether to include a new word or related group of words if we haven't been storing all of them? This looks like a dead end to me.

I'm familiar with the basics of formal grammars, and I expect plenty of people have already tried loading a grammar with topical words on the fly to make the generated sentences more focused and less arbitrary. Some of the cutting-edge stuff looks a bit intimidating and might be overkill for my little experiments.

Do any of you have hands-on experience with this? Am I being too ambitious? Is it worth the time and energy, or should I take someone else's open library and start twiddling? If so, which libraries do you think get good results?

Thanks for reading this far.
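For what it's worth, the sparse-storage problem is less of a dead end than it looks: if you only store the contexts you have actually seen, memory grows with the corpus, not with vocabulary size raised to the N-gram order. Here's a minimal sketch (not MegaHAL's actual algorithm — just a plain trigram Markov model; the class and method names are invented for illustration) that trains incrementally, one sentence at a time:

```python
import random
from collections import defaultdict

class TrigramModel:
    """Sparse trigram model: only observed (w1, w2) contexts are stored,
    so memory grows with the training data, not with vocab**3."""

    def __init__(self):
        # (w1, w2) -> {next_word: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sentence):
        """Feed one sentence at a time; the model improves incrementally."""
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for a, b, c in zip(words, words[1:], words[2:]):
            self.counts[(a, b)][c] += 1

    def generate(self, max_len=20):
        """Walk the chain from the start context, sampling by count."""
        context, out = ("<s>", "<s>"), []
        for _ in range(max_len):
            options = self.counts.get(context)
            if not options:
                break
            words = list(options)
            weights = [options[w] for w in words]
            nxt = random.choices(words, weights=weights)[0]
            if nxt == "</s>":
                break
            out.append(nxt)
            context = (context[1], nxt)
        return " ".join(out)

model = TrigramModel()
model.train("the cat sat on the mat")
model.train("the cat ate the fish")
print(model.generate())
```

Every new sentence just increments counts, so there is never a moment where you have to decide up front which words to keep — rare contexts simply occupy one dictionary entry each.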
Of course, writing your own library is always a plus: you get more control and can add any features you want.
You should also consider how much you're really going to use this library. If it's just for one project, it might be way too much work.

I've never written such a library myself, so I don't know exactly how much effort it would take.
In the realm of natural language, I made a framework to which you would give a structure of important objects and their places in the sentence, and it would try its damnedest to produce a natural sentence that was also disambiguated. I never got it past "room" description generation (it was, of course, for interactive fiction), but it did pretty well at creating disambiguations. If you wanted two apples listed in the generated text, and they had different property sets, it would figure out which differences between their adjectives mattered, so you would get sentences like this:

"There is a red apple on a table, a green apple on a table, and a green apple on a bookshelf."

Unfortunately, it didn't handle collating well, which gave it a bad habit of repeating itself, and it had a tendency to keep reiterating disambiguating details that were no longer necessary. If you took the same scene that generated the sentence above, but added the bookshelf and the worm to the list of things to mention — with the worm marked as "important" to set it apart and given a mood-broadcasting component — it would give you the following:

"There is an oak bookshelf, a red apple on the table, a green apple on the table, a green apple on the oak bookshelf. There is also a worm sitting on the green apple on the oak bookshelf, looking at you over the top of his glasses."

Someday I'll get back to it, have it remember the disambiguations it has already made, and try to handle scenes in nested order, so you would instead get this:

"There is a table with a red apple and a green apple on it, and an oak bookshelf with a green apple on it. Sitting on the apple on the bookshelf, there is a worm looking at you over the top of his glasses."
RIP GameDev.net: launched two unusably-broken forum engines in as many years, and now has ceased operating as a forum at all, happy to remain naught but an advertising platform with an attached social media presence, headed by a staff who by their own admission have no idea what their userbase wants or expects. Here's to the good times; shame they exist in the past.
Cute. Do you have code for that?

