- A near-unlimited number of names can be generated
- Names look and sound reasonable
- Names can be generated in a variety of languages/dialects
I started out with a simple Markov generator (because Markov chains seem to be the weapon of choice in the text generation arena), but I quickly ran into a balance problem. Order 2-3 Markov chains often return character sequences that are impossible in valid English spelling, while order 4 chains degenerate into just returning selections from their input text.
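To make that trade-off concrete, here is a minimal sketch of a character-level Markov generator. It is not the attached code: the function names, the `^`/`$` boundary markers and the six-name training list are all assumptions made up for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # boundary markers (an assumption of this sketch)

def train(names, order):
    """Map each `order`-character context to the characters seen after it."""
    model = defaultdict(list)
    for name in names:
        padded = START * order + name.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order, rng=random, max_len=20):
    """Walk the chain from the start context until END or max_len is reached."""
    context, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        out.append(nxt)
        context = (context + nxt)[-order:]
    return "".join(out)

# Tiny illustrative training set, not the real word lists.
names = ["declan", "jared", "peyton", "keegan", "daniel", "darren"]
low = train(names, order=2)   # short contexts: free to recombine, sometimes oddly
high = train(names, order=4)  # long contexts: little room to deviate from the input

rng = random.Random(0)
print("order 2:", [generate(low, 2, rng) for _ in range(5)])
print("order 4:", [generate(high, 4, rng) for _ in range(5)])
```

On a list this small, no two names share a 4-character context, so the order 4 model can only reproduce its training names, while the order 2 model is free to splice fragments together.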
So I decided that a better avenue might be to explore a pronunciation model, where we generate names based on phonemes, and then translate the results back into text (i.e. graphemes).
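As a rough illustration of that idea (again, not the attached code), the sketch below converts names to phonemes, runs a Markov chain over the phonemes, and then spells each generated phoneme with a spelling seen for it in training. The `GRAPHEMES` table is a made-up toy rather than a real phoneme inventory, the greedy two-letter matching is an assumption, and the spelling step is simplified to a per-phoneme lookup.

```python
import random
from collections import defaultdict

# Hypothetical toy table: spelling -> phoneme symbol. Letters not listed
# here stand for themselves. Not a real phoneme inventory.
GRAPHEMES = {"sh": "S", "ch": "C", "th": "T", "ph": "f", "ck": "k",
             "ee": "i", "ey": "i", "ay": "A", "c": "k", "y": "i"}

def to_phonemes(word):
    """Greedy longest-match split into (phoneme, spelling) pairs."""
    pairs, i = [], 0
    while i < len(word):
        chunk = word[i:i + 2] if word[i:i + 2] in GRAPHEMES else word[i]
        pairs.append((GRAPHEMES.get(chunk, chunk), chunk))
        i += len(chunk)
    return pairs

def train(names, order=2):
    chain = defaultdict(list)      # phoneme context -> phonemes seen next
    spellings = defaultdict(list)  # phoneme -> spellings seen for it
    for name in names:
        pairs = to_phonemes(name.lower())
        phones = ["^"] * order + [p for p, _ in pairs] + ["$"]
        for i in range(len(phones) - order):
            chain[tuple(phones[i:i + order])].append(phones[i + order])
        for phoneme, spelling in pairs:
            spellings[phoneme].append(spelling)
    return chain, spellings

def generate(chain, spellings, order=2, rng=random, max_len=12):
    """Generate a phoneme string, spelling each phoneme as it is chosen."""
    context, out = ("^",) * order, []
    while len(out) < max_len:
        phoneme = rng.choice(chain[context])
        if phoneme == "$":
            break
        out.append(rng.choice(spellings[phoneme]))
        context = context[1:] + (phoneme,)
    return "".join(out)

# Tiny illustrative training set, not the real word lists.
names = ["kegan", "declan", "jared", "peyton", "mikayla", "chuck"]
chain, spellings = train(names)
rng = random.Random(0)
print([generate(chain, spellings, rng=rng) for _ in range(5)])
```

Because "c", "k" and "ck" all collapse to the same phoneme in this toy table, the chain sees more shared contexts than a plain character chain of the same order, and the spelling step can then pick any of the observed spellings.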
I've built a system that does braindead-simple (but surprisingly effective) text->phoneme conversion in order to train the model, and then uses a pair of Markov chain models to actually generate the text. The first Markov model procedurally generates strings of phonemes, and the second translates each string into legal grapheme sequences (i.e. text). Take a look at a few examples of the output...
# trained with the 1,000 most popular American baby names of 2011
kegan
declan
maver
elisay
dan
jared
mikay
peytonykeelynn
klare
adwin
# trained with the 360 most popular Spanish names
vicen
fidela
ysaas
uma
rosa
valentin
florena
gabrah
wendra
dino
# trained with the ~2,000 most popular Russian names
vilma
soree
pera
ovarsonaya
tonya
stopolina
belga
alina
sascha
prosdoia
# and for completeness, ~200 names from Tolkien
aldor
anardil
acainaro
atanamir
barahir
aldaron
baggins
bregil
arthedain
bifhad
Apart from a tendency to spit out verbatim names from the training set, and the (fairly rare) long junk name, this seems to work pretty well. But I'd like to make it better - which is where you guys come in.
I don't have enough of a background in linguistics to know where I should be looking for techniques/approaches to improve my algorithm. I'd be very appreciative of any links to relevant literature in the field, or just general brainstorming on approaches I could apply...
[attachment=10206:fabula.zip]
(Python source and word lists are attached, if you want to have a poke around)