- A near-unlimited number of names can be generated
- Names look and sound reasonable
- Names can be generated in a variety of languages/dialects
So I decided that a better avenue might be to explore a pronunciation model, where we generate names based on phonemes, and then translate the results back into text (i.e. graphemes).
I've built a system that does braindead-simple (but surprisingly effective) text->phoneme conversion in order to train the model, and then uses a pair of Markov chain models to actually generate the text. The first Markov model generates procedural strings of phonemes, and the second translates each string into a legal grapheme sequence (i.e. text). Here are a few examples of the output:
# trained with the 1,000 most popular American baby names of 2011
kegan declan maver elisay dan jared mikay peytonykeelynn klare adwin

# trained with the 360 most popular Spanish names
vicen fidela ysaas uma rosa valentin florena gabrah wendra dino

# trained with the ~2,000 most popular Russian names
vilma soree pera ovarsonaya tonya stopolina belga alina sascha prosdoia

# and for completeness, ~200 names from Tolkien
aldor anardil acainaro atanamir barahir aldaron baggins bregil arthedain bifhad
Apart from a tendency to spit out verbatim names from the training set, and the (fairly rare) long junk name, this seems to work pretty well. But I'd like to make it better - which is where you guys come in.
I don't have enough of a background in linguistics to know where I should be looking for techniques/approaches to improve my algorithm. I'd be very appreciative of any links to relevant literature in the field, or just general brainstorming on approaches I could apply...
Attached: fabula.zip (20.56KB)
(Python source and word lists are attached, if you want to have a poke around)
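If you'd rather not open the zip, here is a rough sketch of the two-model idea described above. It is not the attached code, just a stripped-down illustration: the `CLUSTERS` and `SINGLES` tables, the function names, the order-2 phoneme chain, and the `max_len` cap are all assumptions made up for this example, and the training list is a tiny stand-in for a real word list.

```python
import random
from collections import defaultdict

# Hypothetical toy tables (assumptions for this sketch, not a real phoneme set):
# grapheme cluster -> phoneme symbol.
CLUSTERS = {"sch": "S", "sh": "S", "ch": "C", "th": "T", "ph": "f", "ck": "k",
            "ee": "i", "ea": "i", "ey": "i", "oo": "u", "ay": "e", "ai": "e"}
SINGLES = {"c": "k", "q": "k", "y": "i"}
START, END = "^", "$"


def to_phonemes(name):
    """Greedy longest-match split of a name into (phoneme, grapheme) pairs."""
    name, pairs, i = name.lower(), [], 0
    while i < len(name):
        for size in (3, 2):
            chunk = name[i:i + size]
            if chunk in CLUSTERS:
                pairs.append((CLUSTERS[chunk], chunk))
                i += len(chunk)
                break
        else:
            # No cluster matched: treat the single letter as its own phoneme.
            ch = name[i]
            pairs.append((SINGLES.get(ch, ch), ch))
            i += 1
    return pairs


def train(names, order=2):
    """Build the phoneme chain and the phoneme -> grapheme spelling model."""
    phoneme_chain = defaultdict(list)  # phoneme context -> next phonemes
    spelling = defaultdict(list)       # (previous grapheme, phoneme) -> graphemes
    fallback = defaultdict(list)       # phoneme -> graphemes, ignoring context
    for name in names:
        pairs = to_phonemes(name)
        phonemes = [START] * order + [p for p, _ in pairs] + [END]
        for i in range(order, len(phonemes)):
            phoneme_chain[tuple(phonemes[i - order:i])].append(phonemes[i])
        prev = START
        for phoneme, grapheme in pairs:
            spelling[(prev, phoneme)].append(grapheme)
            fallback[phoneme].append(grapheme)
            prev = grapheme
    return phoneme_chain, spelling, fallback


def generate(model, rng, order=2, max_len=12):
    """Walk the phoneme chain, then spell each phoneme given the previous grapheme."""
    phoneme_chain, spelling, fallback = model
    context, phonemes = (START,) * order, []
    while len(phonemes) < max_len:  # crude cap against long junk names
        nxt = rng.choice(phoneme_chain[context])
        if nxt == END:
            break
        phonemes.append(nxt)
        context = context[1:] + (nxt,)
    prev, out = START, []
    for phoneme in phonemes:
        # Prefer spellings seen after the previous grapheme; otherwise any spelling.
        options = spelling.get((prev, phoneme)) or fallback[phoneme]
        prev = rng.choice(options)
        out.append(prev)
    return "".join(out)


# Stand-in training data; a real run would load one of the word lists instead.
names = ["megan", "declan", "jared", "peyton", "rosa", "valentin", "tonya", "alina"]
model = train(names)
rng = random.Random(0)
print([generate(model, rng) for _ in range(5)])
```

With a list this small the output will mostly be training names or near-copies, which is the same verbatim-name tendency mentioned above; the second model only starts to matter once several spellings compete for the same phoneme.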