Storing 8+ GB of data in 64,000 pieces.

I'm developing a world generator targeted at a Minecraft server. Unlike the normal server, however, mine does all of the world generation up front in order to do some more sophisticated simulation work.

The smallish test worlds I'm generating are about 17 km² and take up roughly 4 GB of space. I'm currently saving each chunk (of which there are roughly 64,000) as a separate file on the hard disk, with a cache set up to minimize file I/O by keeping the most-used chunks in memory.
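For illustration, a minimal sketch of that kind of most-used-chunks cache in Java, built on LinkedHashMap's access order for LRU eviction; Chunk, the capacity, and saveToDisk are hypothetical stand-ins, not the poster's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU chunk cache built on LinkedHashMap's access order.
class ChunkCache extends LinkedHashMap<Long, Chunk> {
    private final int capacity;

    ChunkCache(int capacity) {
        super(16, 0.75f, true);   // true = iterate in access order (LRU)
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, Chunk> eldest) {
        if (size() > capacity) {
            saveToDisk(eldest.getKey(), eldest.getValue()); // persist, then evict
            return true;
        }
        return false;
    }

    // Pack the chunk's grid coordinates into a single map key.
    static long key(int cx, int cz) {
        return ((long) cx << 32) | (cz & 0xFFFFFFFFL);
    }

    private void saveToDisk(long key, Chunk chunk) { /* chunk file write here */ }
}

class Chunk { byte[] blocks; } // stand-in for the real per-chunk data
```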

The question I have is: would a database be better? I don't know much about databases, so I'm not sure whether storing the chunk data there would be an improvement. This server isn't going to be nearly as lightweight as the standard Minecraft server, so having a large database engine as part of the system is acceptable.

Thoughts? Need more information?
Using some kind of spatial data structure may be useful to reduce the amount of memory needed to represent the world. You should also consider compressing each file to use less space on disk. I don't think a database is useful in this case.

I should have clarified: file space isn't the issue I'm concerned with, but rather access time. This server will (hopefully) be simulating thousands of NPCs spread out over the entire world, plus players moving around and exploring it, so the ability to quickly get the needed chunks into memory is the main concern.
The filesystem is a (hierarchical) database for byte blobs. The main advantage a database has for your needs is its ACID properties, and you might not need all of them if you aren't running concurrent transactions. In that case, if you take care to handle the files safely to avoid corruption, the filesystem should suffice.

For example, when updating a file, instead of opening it for writing directly:

  • Create a new file in a temporary location.
  • Write the contents to this file.
  • Ensure the file handles are flushed and closed.
  • Use your platform's equivalent to an "atomic rename" to replace the old file.


A true atomic rename isn't available on Windows, but the above is still far more robust than simply opening the file and writing directly to it, where a crash mid-write leaves the file corrupted.



The problem with DB storage is generally access speed. If you are going to be accessing the database frequently, you should probably not store the data in one. If you are only accessing it once in a while, then it shouldn't have a huge performance impact.

One thing worth noting is that if you have 4 GB of data for each chunk and you are grabbing this all with DB queries from your server, you are going to be using a lot of bandwidth. Unless you are doing this locally, you'll probably want to look into compressing your worlds, or storing the base world on the user's disk and just storing diffs on the server, then occasionally updating the user's disk to a new base state built from the diffs.
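A sketch of one way such diffs could work, comparing the current chunk against its reproducible base and keeping only the changed bytes; the names and framing here are illustrative, not the poster's actual scheme:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: record (offset, newValue) pairs where the current chunk
// differs from its reproducible base. Assumes both arrays are the
// same fixed chunk size.
final class ChunkDiff {
    static byte[] diff(byte[] base, byte[] current) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int i = 0; i < base.length; i++) {
            if (base[i] != current[i]) {
                out.writeInt(i);           // offset of the changed byte
                out.writeByte(current[i]); // its new value
            }
        }
        return bytes.toByteArray();
    }
}
```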

I don't have enough Minecraft knowledge, but I feel like you should be able to cut your 4 GB by a large margin at least.
For one thing, you don't store the data like this.

In the same way that the fastest rendering algorithm is not drawing something, the most space efficient way of storing something is not storing it.

For each "lump" of terrain, you have a generation algorithm. If this is repeatable (usually, derived from a psuedorandom sequence derived from the position). Then you can use the fact that if the terrain has not been modified since it was generated (there was no tunnelling through it) you do not need to store it. Why? Because you can just recreate it on demand. Areas of the world which have not been seen (the insides of mountains, for example) will never even get generated.

As for storing the data, don't put things like this in a DB; databases are not good at managing this sort of data. I would do it like this:

Store the data in a large memory-mapped file. Give each lump of terrain a flag: if any block in it is modified, set the flag. When the lump drops out of visibility, or when you want to save the game, check the flag. If the flag is clear, just drop the data. If not, see whether the lump has a serial number. If it does, save it into the large file at that position. If it doesn't, give it the serial number "existing lumps + 1" and save it at that position, effectively appending it to the end of the file.
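A sketch of that save path; the lump size, names, and visibility tracking are illustrative assumptions:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of the dirty-flag + serial-number save scheme. LUMP_BYTES and
// the Lump fields are stand-ins for the real data layout.
final class LumpStore {
    static final int LUMP_BYTES = 65_536;

    private final FileChannel channel;
    private int nextSerial; // == number of lumps already in the file

    LumpStore(String path, int existingLumps) throws IOException {
        channel = new RandomAccessFile(path, "rw").getChannel();
        nextSerial = existingLumps;
    }

    // Called when a lump drops out of visibility or on a world save.
    void save(Lump lump) throws IOException {
        if (!lump.dirty) return;          // never modified: just drop it
        if (lump.serial < 0)
            lump.serial = nextSerial++;   // append: new slot at end of file
        // Mapping past EOF in READ_WRITE mode grows the file as needed.
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE,
                (long) lump.serial * LUMP_BYTES, LUMP_BYTES);
        buf.put(lump.data);
        lump.dirty = false;
    }
}

class Lump {
    byte[] data = new byte[LumpStore.LUMP_BYTES];
    boolean dirty;    // set when any block in the lump is modified
    int serial = -1;  // -1 = never persisted
}
```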

That's all fairly simple.

How do you get from an XYZ lump position to the serial number (or the lack of one)? Well, that's a hash; that's another data structure. For now, I would be tempted to put that into a second file. Later on, you can "steal" serial-numbered blocks from the stream in the main file and use those slots to hold your hash, but that's getting a little complex to start with.
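As a sketch, the index could be as simple as a packed-coordinate key into a hash map; persisting it to the second file is omitted, and the bit widths are an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the position -> serial-number index kept in a second file
// (its load/save is omitted here). Coordinates pack into one long key.
final class LumpIndex {
    private final Map<Long, Integer> serials = new HashMap<>();

    static long key(int x, int y, int z) {
        // 21 bits per axis is plenty for a 17 km world of 16 m lumps.
        return ((long) (x & 0x1FFFFF) << 42)
             | ((long) (y & 0x1FFFFF) << 21)
             |  (long) (z & 0x1FFFFF);
    }

    // Returns the lump's serial number, or -1 if it was never stored.
    int serialOf(int x, int y, int z) {
        return serials.getOrDefault(key(x, y, z), -1);
    }

    void assign(int x, int y, int z, int serial) {
        serials.put(key(x, y, z), serial);
    }
}
```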



You misunderstand: it's not 4 GB per chunk, but rather 4 GB for the entire world, which consists of 65,536 chunks at 65 KB each.




The system you're describing seems to be a generate-it-as-you-need-it system, much like what Minecraft uses. The issue, however, is that I'm doing a lot of processing over the entire world before the server proper even starts: rainfall simulation, plant growth, animal populations, and tribe/village placement and growth.


4 GB isn't that large; you can keep it all in memory and every so often create a snapshot of the changes and persist it as a binary blob in the DB. If you're looking for concurrent random access to that world from literally thousands of client threads, you'd be better off keeping it all in memory rather than trying some complicated caching scheme that pulls and pushes blocks from the DB. Now, at 400 GB you'd need a smart caching scheme :)


A simple delta scheme: calculate a hash code for each, say, 10x10 sector, and if it has changed since the last snapshot, persist that sector during that persistence cycle. The larger issue is when you have multiple servers hosting the same world; in that case, how do you keep the changes in sync, especially when those servers are separated by a large distance? I'm sure it doesn't apply to your case, but it is an interesting problem in MMO design.
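A minimal sketch of that delta check, using CRC32 as the hash; the persist step and sector layout are assumptions:

```java
import java.util.zip.CRC32;

// Sketch: hash each sector and persist only the ones whose hash
// changed since the last snapshot.
final class SectorSnapshotter {
    private final long[] lastHashes;

    SectorSnapshotter(int sectorCount) {
        lastHashes = new long[sectorCount];
    }

    void snapshot(byte[][] sectors) {
        for (int i = 0; i < sectors.length; i++) {
            CRC32 crc = new CRC32();
            crc.update(sectors[i]);
            long h = crc.getValue();
            if (h != lastHashes[i]) {
                persist(i, sectors[i]); // only changed sectors hit the DB
                lastHashes[i] = h;
            }
        }
    }

    private void persist(int sector, byte[] data) { /* DB write here */ }
}
```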

Good Luck!


-ddn

This topic is closed to new replies.
