GunBlade

Manipulating extremely large data models.


Hey all, I need to load and manipulate very large data sets. By large I mean bigger than the machine's RAM capacity. Saving it to the hard drive is a possibility, but that doesn't allow quick interaction with the data. I wish to load my entire data into some data structure and use it online. Is there a way to make use of the virtual memory? What is the best way to handle such data? Thanks.

Virtual memory is the operating system's concern. If you load the files and they can't all fit in RAM, some of the data will be paged out to disk. Note, however, that you have no control over what gets paged out and when, so you can suffer from hard disk access times. Also be aware that 32-bit Windows limits a process to 2 GB of user address space by default (under 64-bit Windows, a large-address-aware 32-bit process can use up to 4 GB, and a native 64-bit process gets far more).

However, there may be more efficient ways of handling the file than simply loading it all into memory (can you tell in advance which parts of the file will be needed when?).

It will be very hard to know which parts of the file I'll need at any given time. I need the data arranged in a C++ map (or other search structure), and for that the data has to live in main memory. I did try letting virtual memory handle the data, but it wasn't enough; now I know why (the limit you mentioned).
So, do you know of a more convenient way to handle such data?

Thanks.

Look into the Memory Management page of the Windows Platform SDK (online, but incomplete; try an older MSDN if you have it: Link).
There are a few ways to address the limits, depending on what you want (VirtualAlloc, AllocateUserPhysicalPages, etc.).
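To make that concrete, here is a minimal sketch of VirtualAlloc's reserve/commit pattern (the sizes are arbitrary, chosen only for illustration; check the SDK docs for the details):

#include <windows.h>
#include <cstdio>

int main()
{
    const SIZE_T reserveSize = 1024u * 1024u * 1024u;   // 1 GB of address space
    const SIZE_T commitSize  = 64u * 1024u * 1024u;     // 64 MB actually backed

    // Reserve address space without backing it with physical memory/pagefile.
    void* region = VirtualAlloc(NULL, reserveSize, MEM_RESERVE, PAGE_NOACCESS);
    if (!region) { std::printf("reserve failed\n"); return 1; }

    // Commit only the chunk you are about to touch.
    void* chunk = VirtualAlloc(region, commitSize, MEM_COMMIT, PAGE_READWRITE);
    if (!chunk) { std::printf("commit failed\n"); return 1; }

    // ... work with the committed pages ...

    // Decommit a chunk you no longer need, keeping the reservation intact.
    VirtualFree(chunk, commitSize, MEM_DECOMMIT);
    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}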

Quote:
Original post by GunBlade
I need the data arranged in a C++ map (or other search structure). For that I need my data to exist in main memory.

I don't know the type of data you are using, but a solution often used is to create an index.
Since you are using a map, there is some key-value relation. So maybe you could prepare your data by walking through the file(s) sequentially, noting the offset in the file of each key's record. Your map would then be of the form std::map< key, offset >. A lookup gives you the offset in the file, and you can load the data from there without having to keep unused data in memory.
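For illustration, a minimal sketch of that idea, assuming a hypothetical record layout (32-bit key, 32-bit payload length, then the payload bytes):

#include <cstdint>
#include <fstream>
#include <map>
#include <vector>

struct Record { std::uint32_t key; std::vector<char> payload; };

// One sequential pass: remember where each key's record starts in the file.
std::map<std::uint32_t, std::streamoff> BuildIndex(std::istream& in)
{
    std::map<std::uint32_t, std::streamoff> index;
    for (;;) {
        std::streamoff pos = in.tellg();
        std::uint32_t key = 0, len = 0;
        if (!in.read(reinterpret_cast<char*>(&key), sizeof key)) break;
        if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) break;
        index[key] = pos;
        in.seekg(len, std::ios::cur);   // skip the payload; keep only the offset
    }
    return index;
}

// Load a single record on demand, given its offset from the index.
bool LoadRecord(std::istream& in, std::streamoff pos, Record& out)
{
    in.clear();
    in.seekg(pos);
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&out.key), sizeof out.key)) return false;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
    out.payload.resize(len);
    return len == 0 || static_cast<bool>(in.read(out.payload.data(), len));
}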

Maybe the above isn't applicable to the data you're using, but it's hard to tell without more information.

How big is the amount of data you want to manage?
You said it is bigger than the amount of RAM you have, but is it also bigger than the virtual address space available to your process? If not, you can just map the data into memory (if it is a file).
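Even if it doesn't all fit in the address space, you can still map a sliding window of the file. A rough sketch with the Win32 file-mapping API (file name and window size are made up for illustration):

#include <windows.h>

int main()
{
    // Open the data file and create a read-only mapping object for it.
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map only a 64 MB window of the file; the offset must be a multiple of
    // the system allocation granularity (64 KB on typical systems).
    unsigned long long offset = 0;          // start of the window
    const SIZE_T window = 64 * 1024 * 1024;
    void* view = MapViewOfFile(mapping, FILE_MAP_READ,
                               (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFFu),
                               window);
    if (view) {
        // ... read the mapped bytes as if they were in memory; the OS pages
        //     them in from the file on demand ...
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}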
The idea of using an index is also a good solution. Maybe you want to enhance it with smart pointers, so that only a certain portion of the data set is held in memory at a time (using an LRU scheme that writes out/reads in data automatically).
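A minimal sketch of such a cache (Block and LoadBlock are hypothetical placeholders for whatever unit of the data set you page in from disk):

#include <cstdint>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct Block { /* one decoded chunk of the data set */ };

// Placeholder: real code would read and decode the block from the file.
std::shared_ptr<Block> LoadBlock(std::uint64_t /*id*/)
{
    return std::make_shared<Block>();
}

class BlockCache {
public:
    explicit BlockCache(std::size_t capacity) : capacity_(capacity) {}

    std::shared_ptr<Block> Get(std::uint64_t id)
    {
        auto it = map_.find(id);
        if (it != map_.end()) {
            // Cache hit: mark the block as most recently used.
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (map_.size() >= capacity_) {
            // Cache full: evict the least recently used block.
            map_.erase(lru_.back().first);
            lru_.pop_back();
        }
        std::shared_ptr<Block> block = LoadBlock(id);
        lru_.emplace_front(id, block);
        map_[id] = lru_.begin();
        return block;
    }

private:
    using Entry = std::pair<std::uint64_t, std::shared_ptr<Block>>;
    std::size_t capacity_;
    std::list<Entry> lru_;                                          // front = most recent
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> map_;
};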

Quote:
Original post by GunBlade
I wish to load my entire data into some data structure and use it online. Is there a way to make use of the virtual memory? What is the best way to handle such data?


So, more than 4 gigs?

And a map/hash table?

I'm not sure what you want to use this data for, or what it is, but SQL databases were designed for exactly this.

They also come with indexing, paging, and other optimization algorithms built in.
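For example, with SQLite (my pick here as an embedded SQL engine; the file, table, and key are made up for illustration), a keyed lookup pulls only the matching row into memory:

#include <cstdio>
#include <sqlite3.h>

int main()
{
    // Open (or create) an on-disk database; the engine handles paging and caching.
    sqlite3* db = nullptr;
    if (sqlite3_open("huge_dataset.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records (key INTEGER PRIMARY KEY, value BLOB);",
        nullptr, nullptr, nullptr);

    // Indexed lookup by key.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT value FROM records WHERE key = ?;", -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, 12345);                 // hypothetical key
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const void* blob = sqlite3_column_blob(stmt, 0);
        int size = sqlite3_column_bytes(stmt, 0);
        std::printf("found %d bytes at %p\n", size, blob);
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}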

The only other solution is, obviously, not to load all the data into memory. Virtual memory is of little help here, since it uses generic rules for deciding what to swap out, whereas only you know your data's access patterns.

If you explain what this data is and how it's used, I'm sure someone will come up with a perfect solution which won't require much memory.

Thanks for all the replies.
I'm going to try a few solutions. The first is a cool project I was referred to:
http://i10www.ira.uka.de/dementiev/stxxl/doxy/html/index.html
It's a library (STXXL) that provides STL-style containers and algorithms with out-of-core support.
If that doesn't work, I will check MSDN like ttdeath suggested, and go on from there.

Thanks again.

Berkeley DB fits the "huge friggin map<key, data>" paradigm perfectly, and 4-5GB is considered "no big deal". If you need even more solid access (at the possible expense of some performance), go the relational route with some form of SQL db.
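A rough sketch of what that looks like with the Berkeley DB C API (file name, key, and value are made up; this assumes the 4.x-style interface):

#include <cstring>
#include <db.h>

int main()
{
    // Open (or create) a B-tree database backed by a file on disk.
    DB* dbp = nullptr;
    if (db_create(&dbp, nullptr, 0) != 0) return 1;
    if (dbp->open(dbp, nullptr, "huge_map.db", nullptr, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    // Store one key/value pair.
    int k = 42;
    const char v[] = "payload";
    DBT key, data;
    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    key.data = &k;                 key.size = sizeof k;
    data.data = (void*)v;          data.size = sizeof v;
    dbp->put(dbp, nullptr, &key, &data, 0);

    // Look it up again; the library pages only what it needs into its cache.
    DBT result;
    std::memset(&result, 0, sizeof result);
    if (dbp->get(dbp, nullptr, &key, &result, 0) == 0) {
        // result.data / result.size now describe the stored value.
    }

    dbp->close(dbp, 0);
    return 0;
}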

The best way to handle it is to use less memory.

Have you considered breaking the data up into chunks and using data compression? It is a rare data set that cannot be compressed (typically only already-compressed data has no exploitable redundancy), and decompression is often significantly faster than random-access hard drive I/O for the same buffer size.

The generic route for this is to use something like zlib, but a more specific route is often significantly better.
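As a minimal sketch of the generic route, using zlib's one-shot API (how you chunk the data and where you store the original size are up to you):

#include <vector>
#include <zlib.h>

// Compress one chunk of the data set.
std::vector<Bytef> CompressChunk(const std::vector<Bytef>& src)
{
    uLongf destLen = compressBound(static_cast<uLong>(src.size()));
    std::vector<Bytef> dest(destLen);
    if (compress2(dest.data(), &destLen, src.data(),
                  static_cast<uLong>(src.size()), Z_BEST_SPEED) != Z_OK)
        dest.clear();               // compression failed
    else
        dest.resize(destLen);       // shrink to the actual compressed size
    return dest;
}

// Decompress a chunk; the caller must have stored the uncompressed size
// alongside the compressed bytes.
std::vector<Bytef> DecompressChunk(const std::vector<Bytef>& src, uLong originalSize)
{
    std::vector<Bytef> dest(originalSize);
    uLongf destLen = originalSize;
    if (uncompress(dest.data(), &destLen, src.data(),
                   static_cast<uLong>(src.size())) != Z_OK)
        dest.clear();
    return dest;
}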

