Storing a massive amount of indexed text.

Started by
7 comments, last by Barius 16 years, 10 months ago
Hey, all. I'm looking for a good way to store a bunch of lines of text external to my game program (that is, not hardcoded) such that the game program can index particular lines of text (random access). I was reading some old tutorials (year 2000!), specifically those by Joseph Farrell, that mention the use of a String Table. I've never worked with resource files in C/C++ before, but they seem like a valuable tool. Reading the articles on this site, a string table was presented as the perfect solution for me: a bunch of strings indexed by integers! Well, I did some more research and actually tried to make and use a string table in a test project (using VC++ 6.0), and it turned out that it's not quite what I had imagined. It seems that the entries in the string table are actually indexed with a "string type" and the integer is more-or-less automatically chosen by the IDE. What I want to do is have all of the game dialog, item names, etc stored so that, using an example from the mentioned articles, a line in a script file such as "ShowText 10 3" would refer to three lines of text, starting at line ten. An obvious and easy alternative is to just have a gigantic text file with each line separated by carriage returns. Almost as obvious are the major problems: in order to have random access, the entire file would have to be read and shoved into an array; also, the data is not encoded at all (that is, anyone could go and edit and/or corrupt it). Please give me any suggestions! Thanks much!
Advertisement
If you're looking to store LARGE amounts of key-value pairs, you may be interested in reading up on Berkeley DB. It's basically a stored btree, with maybe a caching layer on top. It's designed to have super fast random access and range scans (both of which would be extremely useful for storing dialogs, depending on how you name your indexes).

I'm not saying this is the "be-all, end-all" solution, and you'll probably need to create an editor to insert your text and what not, but it's an idea.

EDIT: This might be a bit heavy-handed, even though BDB's typically have a pretty lightweight feel. These things are designed to operate flawlessly with 100's of gigs of data.
When I write "large", I mean several hundred to a few thousand character strings. A text file would be many kilobytes, I suppose. This is an inconsequential amount of space, but it's all arbitrary data. The idea of having tree-like utilities is interesting, though. One could refer to a bit of text through categories, such as ItemNames->WeaponNames->name5 (instead of meaningful names, these categories might just be more numbers, but at least it's more organized!).
A delimited text file seems the simplest way to go. Try it out and see if it's fast enough, or just read it into an array and see if it really takes an unacceptable amount of memory. As for data encoding... well, if it really bothers you that much you can encrypt it. But in my experience, it's not worth it; those who really want to mess with your game will find a way, and those who don't care enough aren't a problem. Besides, modding is a good thing.

If reading a whole text file into memory is unacceptable, then a couple optimizations spring to mind. First, you could just make each "line" in the file a fixed number of characters, say 80. Then, "ShowText 10 3" would just get all the text between 80*10 and 80*13. It'd bloat the file out by a fair amount, but all the empty space should collapse quite nicely with compression.

The other obvious option is to put a header in your text file, giving you the character offset and length of each chunk of text. Then you would just read that into the array, not keeping the entire file in memory. "ShowText 10 3" would look up number's 10, 11 and 12 in the array, find those specific offsets in the file, and read 'em in. The obvious problem with this is you'd have to modify the header every time you changed the text... a special editor would be required.
Hmmm, or you could just have your program read the text file in on start-up, build the array of offsets for each chunk, and just remember them. It'd be slower and more complicated, but it'd also be impervious to change.

Or, just use Berkeley DB. I've used it a few times before, and it is simple, small and useful.

Hope this helps.
-----http://alopex.liLet's Program: http://youtube.com/user/icefox192
You don't need to load the whole file in memory just to get to some data inside of it. You can just set the file pointer (or, position within the file stream, as it is also called), allocate memory just for the data you want, and then read just the data you want.

For a string collection file, two things are important: speed of loading a string and ease of editing.

These two points can be easily addressed by creating two files: the easily editable text file delimited by crlfs, and the binary index file that has information on at what offset the lines in the text file begin, and how long each one is. After this, you could go either of the two ways: make an editor that automatically updates the index file when the text file is modified, or make a command line compilation tool that will make a binary index file out of a given text file. It shouldn't be too much trouble to simply run it every time the text file is changed.

Then in your game, write the code that loads the contents of index file in memory and keeps it there for fast retrieval. Each index entry doesn't have to be more than 8 bytes long - one DWORD for starting offset of a line in the text file, and one DWORD for that line's length. All of the entries are in a simple 1-d array, an index into which is a 0-based text line number.

If you have a lot of entries in the index and you are worried about memory, you can even divide them into groups of, say, 256, and have a 'current group' that is the only one that resides in memory at any time. Since all entries are 8 bytes long, it shouldn't be difficult to calculate which group of 256 the requested string index is in, and if it's not the current one, then discard the current one and load the needed one from index file.

Keep a handle to the text file with strings open at all times (because opening a file is slow), and when you get your index for the line you want, just set the file pointer to starting offset and read from file the specified number of characters. Then return that into a buffer or whatever.

[EDIT] You can make a size of an index entry smaller by specifying the line lengths as a BYTE, effectively limiting the max size of a line to 255 characters. Which is a whole shitload of characters, actually.

Depending on how large your file is going to be, you could also make the line offset a WORD (limiting the size of the file to about 65535 ansi characters / (32767 unicode characters), which is 64K. That need not be a technical limitation, since you could just have several string table files. That way, they will also be easier to maintain too.
Thanks everyone for your input. ValMan, your suggestion sounds like it's just the solution I'm looking for. I will certainly try implementing it!
Another idea is to have a text file where each line is:
12371:the contents of the string\n

This is a relatively simple format to parse. It is human readable and human editable. You can delete strings from the middle of the list without causing problems.

I'd be tempted to make the integers at the front floats or doubles (this allows the user to insert a new string half-way between two existing strings).

Next, build your binary string table dynamically at run time.

You can do a binary dictionary search for the right string from the file. Seek to the middle of the file, advance/retreat to the nearest newline, then read the number. Repeat until you find the correct index.

Cache in memory the index->seek offset map. (also cache invalid indexes)

When you are asked the next index, search in your in-memory map. Find two indexes that bracket your target (or are your target). Do a binary search between them in the file for the index you want.

This will have slighly slower run-time performance, but it means that your data exists in one spot and one spot only. If this run-time building of the map turns out to be too slow, writing a tool that builds a compile-time map is easy (and the map can have the exact same interface: given a string index, return a seek offset).

Errors can creep in: if you have two strings with the same index. But validating the file's format is a simple matter of cut, uniq, sort and diff (to guarantee that all indexes are both unique and sorted).

In short, the problems I have with ValMan's solution are:
"It shouldn't be too much trouble to simply run it every time the text file is changed."
-- this will be the cause of a bug. The index and text file will become out of sync.

"Then in your game, write the code that loads the contents of index file in memory and keeps it there for fast retrieval. Each index entry doesn't have to be more than 8 bytes long - one DWORD for starting offset of a line in the text file, and one DWORD for that line's length. All of the entries are in a simple 1-d array, an index into which is a 0-based text line number."
-- good idea, if your system is a very small one. But today it is often the case that the benefits of making your integers smaller is tiny.

"the easily editable text file delimited by crlfs,"
-- This means that deleting a text line is dangerous, and you can only add text lines to the end. Fragility, thy name is!

His solution is a good one -- don't get me wrong -- I'm just addressing some possible flaws in it.
"this will be the cause of a bug. The index and text file will become out of sync."

Just check date modified on index file to be the same as text file when loading the former. Could go one step beyound to store dates in both the text file and the binary file, but that shouldn't be necessary, as date/times can only get reset if you zip the files and then unzip them. And zipping the files is something you will do as the last step, before publishing, so obviously you would do a preflight before that.

"This means that deleting a text line is dangerous, and you can only add text lines to the end. Fragility, thy name is!"

Nothing dangerous, the good thing is that you can modify them in any way you want. After modification, just run the compilation tool. If you forget, the game will remind you to do it as it starts (by displaying an error when index's modified date is less than text file's modified date). Hell, it could even display a warning and compile the index itself when it detects compilation is required. Then you don't have to lift a finger.

The only flaw I see is that text lines are indexed by line numbers, which is an idea of thread starter, not mine. I would index them by integer, then have an std::map of integers mapped to struct STRING_ENTRY { int nStart; int nLength }; Could also index them by a short string, in which case the lookup would be slower (but probably not slower enough to notice), and you would not need to worry about trying to keep ID numbers sequential. Short names like ERR_FILENOTFOUND and BUTTON_NEWGAME.
You could use MetaKit for this. It is a relatively lightweight embeddable database library.

This topic is closed to new replies.

Advertisement