Better procedural name generation

7 comments, last by alvaro 11 years, 8 months ago
I'm busy working on techniques for procedural name generation, because, quite frankly, existing approaches seem to leave a lot to be desired. My requirements are:

  • Near unlimited number of names can be generated
  • Names look and sound reasonable
  • Names can be generated in a variety of languages/dialects

I started out with a simple Markov generator (because Markov chains seem to be the weapon of choice in the text generation arena), but I quickly found a lack of balance between order 2-3 Markov chains, which often return impossible (in the context of valid English spelling) character sequences, and degenerate order 4 chains, which just return selections from their input text.
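For concreteness, a minimal order-n character Markov generator of the kind described here might look like this (a sketch, not the attached source; function names are illustrative):

```python
import random
from collections import defaultdict

def train(names, order=3):
    """Count which characters follow each length-`order` context."""
    chains = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"   # ^ = start pad, $ = end marker
        for i in range(len(padded) - order):
            chains[padded[i:i + order]].append(padded[i + order])
    return chains

def generate(chains, order=3, max_len=12):
    """Walk the chain from the start context until the end marker."""
    context, out = "^" * order, []
    while len(out) < max_len:
        ch = random.choice(chains[context])
        if ch == "$":
            break
        out.append(ch)
        context = context[1:] + ch
    return "".join(out)
```

At order 2-3 the contexts are short enough that the chain happily glues together sequences no English speaker would spell; at order 4, with a small training set, most contexts have exactly one continuation, so the walk just replays input names.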

So I decided that a better avenue might be to explore a pronunciation model, where we generate names based on phonemes, and then translate the results back into text (i.e. graphemes).

I've built a system that does braindead-simple (but surprisingly effective) text->phoneme conversion in order to train the model, and then uses a pair of Markov chain models to actually generate the text. The first Markov model is used to generate procedural strings of phonemes, and the second is used to translate that string into legal grapheme sequences (i.e. text). Look at a few examples of the output...

# trained with the 1,000 most popular American baby names of 2011
kegan
declan
maver
elisay
dan
jared
mikay
peytonykeelynn
klare
adwin


# trained with the 360 most popular Spanish names
vicen
fidela
ysaas
uma
rosa
valentin
florena
gabrah
wendra
dino


# trained with the ~2,000 most popular Russian names
vilma
soree
pera
ovarsonaya
tonya
stopolina
belga
alina
sascha
prosdoia


# and for completeness, ~200 names from Tolkien
aldor
anardil
acainaro
atanamir
barahir
aldaron
baggins
bregil
arthedain
bifhad


Apart from a tendency to spit out verbatim names from the training set, and the (fairly rare) long junk name, this seems to work pretty well. But I'd like to make it better - which is where you guys come in.
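For reference, the two-stage pipeline described above can be sketched roughly as follows. The phoneme inventory and spellings here are invented toy data, and the second stage is reduced to a plain lookup; the attached source uses a real English grapheme/phoneme table and a second Markov model for the spelling step.

```python
import random
from collections import defaultdict

# hypothetical mini-inventory: phoneme -> possible spellings
GRAPHEMES = {"k": ["c", "k", "ck"], "ae": ["a"], "t": ["t", "tt"], "iy": ["ee", "y"]}

def train_phoneme_chain(pron_lexicon):
    """pron_lexicon: list of phoneme sequences, e.g. [["k", "ae", "t"], ...]."""
    chain = defaultdict(list)
    for phones in pron_lexicon:
        seq = ["<s>"] + phones + ["</s>"]
        for a, b in zip(seq, seq[1:]):
            chain[a].append(b)
    return chain

def generate_name(chain, max_phones=8):
    # stage 1: walk the chain to produce a phoneme string
    phones, cur = [], "<s>"
    while len(phones) < max_phones:
        cur = random.choice(chain[cur])
        if cur == "</s>":
            break
        phones.append(cur)
    # stage 2: spell each phoneme (here an unweighted choice; the real
    # system uses a second Markov model over grapheme sequences)
    return "".join(random.choice(GRAPHEMES[p]) for p in phones)
```

The appeal of the split is that illegal sound sequences get filtered in stage 1, while illegal spellings get filtered in stage 2, so each model can stay low-order.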

I don't have enough of a background in linguistics to know where I should be looking for techniques/approaches to improve my algorithm. I'd be very appreciative of any links to relevant literature in the field, or just general brainstorming on approaches I could apply...

[attachment=10206:fabula.zip]
(python source and word lists are attached, if you want to have a poke around)

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Your mapping to phonemes and back to text doesn't seem to work very well for Spanish:

  • ysaas : "aa" is extremely rare in Spanish (I can only think of "Aaron" and the last name "Saavedra") and no words would start with "ys" (are you using "y" as a vowel?), at least with modern spelling.
  • gabrah: The ending "h" is not common either.
  • wendra: "W" is almost not part of the language. There are a few Spanish names with "w" but they come from Gothic (Wamba, Wenceslao...).

I have had reasonable results using a weighted average of two-letter and three-letter Markov chains, plus a filter to remove the occasional really long name. You may also want to filter out names that were in the input.

I've tried with Spanish, English and Japanese toponyms, and also with a list of Star Wars character names (all around 500 inputs). I was pretty satisfied with the results, but perhaps I am more easily satisfied than you are.
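A rough sketch of this blended approach, assuming simple frequency counts and illustrative weights (the 0.7 blend, length limits, and retry count are guesses, not values from the post):

```python
import random
from collections import Counter, defaultdict

def build(names, order):
    chain = defaultdict(Counter)
    for n in names:
        s = "^" * order + n.lower() + "$"
        for i in range(len(s) - order):
            chain[s[i:i + order]][s[i + order]] += 1
    return chain

def blended_next(c2, c3, context, w3=0.7):
    """Mix the order-3 and order-2 next-letter distributions."""
    weights = Counter()
    d2 = c2.get(context[-2:], Counter())
    t2 = sum(d2.values())
    for ch, n in d2.items():
        weights[ch] += (1 - w3) * n / t2
    d3 = c3.get(context[-3:], Counter())
    t3 = sum(d3.values())
    for ch, n in d3.items():
        weights[ch] += w3 * n / t3
    chars, probs = zip(*weights.items())
    return random.choices(chars, weights=probs)[0]

def make_name(names, min_len=3, max_len=10, tries=100):
    c2, c3 = build(names, 2), build(names, 3)
    for _ in range(tries):
        ctx, out = "^^^", []
        while len(out) <= max_len:
            ch = blended_next(c2, c3, ctx)
            if ch == "$":
                break
            out.append(ch)
            ctx = ctx[1:] + ch
        name = "".join(out)
        # reject junk lengths and verbatim copies of the input
        if min_len <= len(name) <= max_len and name not in names:
            return name
    return None
```

The order-2 component keeps the chain from dead-ending on rare trigram contexts, while the order-3 component supplies most of the local plausibility.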

Your mapping to phonemes and back to text doesn't seem to work very well for Spanish

Yeah, my phoneme map is currently hard-coded for English. In theory that should be as simple as loading up a new map for each language - but English is the only language that I was able to find a decent pre-built table for.

I have had reasonable results using a weighted average of two-letter and three-letter Markov chains, plus a filter to remove the occasional really long name. You may also want to filter out names that were in the input.

I've tried with Spanish, English and Japanese toponyms, and also with a list of Star Wars character names (all around 500 inputs). I was pretty satisfied with the results, but perhaps I am more easily satisfied than you are.
That is roughly what I tried as a first attempt (and it seems to be the most common approach). My main issue with this (and it may be partly an artefact of my implementation) is that even with a relatively large input text, a three-letter Markov chain basically ends up stringing together discrete 3-letter chunks that appear often.


What do you want to focus on improving the most? Weeding out junk? Reducing verbatim repetition? Ensuring few duplicates are generated? Better quality in the names which are generated?

I wonder whether some research into the way names evolved may help. For example roots in professions (Baker, Miller, etc), linguistic simplifications/shifts, family traditions (Donald -> MacDonald, Ander -> Anderson). It may give you a way to generate plausible names from actual roots.

Apart from that your system seems pretty good. Maybe adaptively change the probabilities. For example, temporarily reduce the probability of a phoneme that has been used in the same name/past 10 names.

What do you want to focus on improving the most? Weeding out junk? Reducing verbatim repetition? Ensuring few duplicates are generated? Better quality in the names which are generated?

Junk is the biggest overall problem - I want this to run without human intervention, so I won't be able to filter out junk. That said, my current system only generates junk when it generates extremely short/long names, which provides a fairly easy way to filter them out.

After that, quality is the most important. Apart from incorrect phoneme maps, the Spanish/Russian/Tolkien generators do fairly well, but the English one really kind of sucks, I think because my training data contains many non-traditional English names (i.e. America is a melting pot, so many cultures' names are in the sample). I can probably solve this with a more traditionally English data set, but I haven't had time to find/construct one.

Verbatim repetition is not really that much of a problem. I don't want the entire world to be populated by Johns and Daves, but if there are a few of them it won't break immersion.

I wonder whether some research into the way names evolved may help. For example roots in professions (Baker, Miller, etc), linguistic simplifications/shifts, family traditions (Donald -> MacDonald, Ander -> Anderson). It may give you a way to generate plausible names from actual roots.
A simple probability-driven table of prefixes/suffixes could do wonders for the 'mac, mc, son' case. A table of professions and a mutator function could serve for the other. I like these ideas.
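That affix table could be as small as this; the rules and probabilities below are invented examples, not anything from the attached source:

```python
import random

# (probability, rule) pairs; probabilities must sum to 1.0
PATRONYMICS = [
    (0.15, lambda n: "Mac" + n.capitalize()),
    (0.10, lambda n: "Mc" + n.capitalize()),
    (0.20, lambda n: n.capitalize() + "son"),
    (0.55, lambda n: n.capitalize()),          # leave the base name unchanged
]

def surname_from(base):
    """Pick one rule by cumulative probability and apply it to the base name."""
    r, acc = random.random(), 0.0
    for p, rule in PATRONYMICS:
        acc += p
        if r < acc:
            return rule(base)
    return base.capitalize()
```

Feeding it Markov-generated base names rather than real ones would give surnames that inherit the flavor of the training language.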

This is good stuff, keep it coming!



Apart from incorrect phoneme maps, the Spanish/Russian/Tolkien generators do fairly well, but the English one really kind of sucks, I think because my training data contains many non-traditional English names (i.e. America is a melting pot, so many cultures' names are in the sample). I can probably solve this with a more traditionally English data set, but I haven't had time to find/construct one.


I think "the English one really kind of sucks" primarily because you are more familiar with English names and you are therefore harder to fool in English. My first language is Spanish and I think your generated Spanish names are horrible. :)

Just as it is in English, many last names in Spanish are professions, toponyms, adjectives describing physical attributes ("Delgado", "Gordo"), or are derived from other names ("Fernández" comes from "Fernando", meaning something like "Fernandoson"). A majority of Spanish given names are names of saints or biblical figures, so if you try to name a population by just imitating the frequency of letter combinations, the names will not be believable as Spanish names.
Glad that helped... now, take two. I would suggest a larger data set and more segmentation. For example, break names up into first and last names, and break first names into male and female; it's common for male vs. female names to have different suffixes, hardness/softness, etc. Provide a set of mappings from male first names to last names (mac, son, etc.).

It may be worth intersecting the English dataset with other language sets. Classify each phoneme by its chance of being in each language, and scale the probabilities by how likely they are to be from the same language set. Better yet, pick a language set before constructing a name and adjust probabilities accordingly, so there is whole-name consistency. I'm not saying never borrow from another language root, but keep it in check.

Another possibility (although maybe more difficult): some names are newer than others and therefore may have different linguistic frequencies. If you could estimate the age of a name, you could increase the frequency of phonemes from a similar age group. For example, download name guides from a few different years/decades, classify a name's age as the oldest set it appears in, and divide the probability by how many years apart they date from.
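The whole-name consistency idea sketched above might look like this: pick one "home" language per name, then sample letters mostly from that language's chain, borrowing from the others at a reduced weight. The corpora, the bigram model, and the 0.1 borrow weight are all illustrative placeholders.

```python
import random
from collections import Counter, defaultdict

def bigram_counts(names):
    """Per-language letter-bigram counts, with ^/$ start/end markers."""
    c = defaultdict(Counter)
    for n in names:
        s = "^" + n.lower() + "$"
        for a, b in zip(s, s[1:]):
            c[a][b] += 1
    return c

def mixed_name(corpora, borrow=0.1, max_len=10):
    """corpora: {language: [names]}; one language dominates each name."""
    chains = {lang: bigram_counts(ns) for lang, ns in corpora.items()}
    home = random.choice(list(chains))
    ctx, out = "^", []
    while len(out) < max_len:
        weights = Counter()
        for lang, chain in chains.items():
            w = 1.0 if lang == home else borrow   # down-weight foreign letters
            for ch, n in chain[ctx].items():
                weights[ch] += w * n
        if not weights:
            break
        ch = random.choices(*zip(*weights.items()))[0]
        if ch == "$":
            break
        out.append(ch)
        ctx = ch
    return home, "".join(out)
```

Segmenting by gender or by era would work the same way: build one chain per segment, choose a segment up front, and let the others contribute only a small borrow weight.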
Heh, happened to come across this article http://thedailywtf.c...-Generator.aspx, a cautionary tale perhaps?
That was pretty funny. :)
