Manipulating extremely large data models.

Started by GunBlade. 9 comments, last by Rockoon1 16 years, 10 months ago.
Hey all, I need to load and manipulate very large data sets. By large I mean bigger than the available RAM. Saving to the hard drive is a possibility, but that doesn't let me interact with the data quickly. I want to load my entire data set into some data structure and use it online. Is there a way to make use of virtual memory? What is the best way to handle such data? Thanks.
Virtual memory is the operating system's concern. If you load the files and they can't fit in RAM, some data will be paged out to disk. Note, however, that you have no control over what is paged out or when, so you can suffer from hard disk access times. Also keep the address-space limit in mind: 32-bit Windows gives each process 2 GB of user address space by default (a 32-bit process on 64-bit Windows can get up to 4 GB if it is built large-address aware).

However, there may be more efficient ways of handling the file than simply loading it all into memory (can you tell when parts of the file will be needed?).
[TheUnbeliever]
It will be very hard to know which parts of the file I will need at any given time. I need the data arranged in a C++ map (or another search structure), and for that the data needs to exist in main memory. I did try letting virtual memory handle it, but that wasn't enough, and now I know why (you said there is a limit).
So do you know of a more convenient way to handle such data?

Thanks.
Look into the Memory Management page of the Windows Platform SDK (online, but incomplete; try an older MSDN if you have it: Link).
There are a few ways to address the limits, depending on what you want (VirtualAlloc, AllocateUserPhysicalPages, etc.).
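As a rough illustration of the VirtualAlloc route (sizes and usage below are made up, not anything from the SDK page):

```cpp
// Minimal Win32 sketch: reserve a large range of address space up front and
// commit physical pages only as they are actually needed. Sizes are illustrative.
#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    const SIZE_T reserveSize = 512u * 1024 * 1024;  // 512 MB of address space
    const SIZE_T commitSize  = 64 * 1024;           // one 64 KB chunk

    // Reserve the range without backing it with physical memory yet.
    void* base = VirtualAlloc(NULL, reserveSize, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { std::printf("reserve failed: %lu\n", GetLastError()); return 1; }

    // Commit (and make accessible) just the first chunk of the reservation.
    void* chunk = VirtualAlloc(base, commitSize, MEM_COMMIT, PAGE_READWRITE);
    if (!chunk) { std::printf("commit failed: %lu\n", GetLastError()); return 1; }

    std::memset(chunk, 0, commitSize);  // safe: these pages are committed

    VirtualFree(base, 0, MEM_RELEASE);  // release the whole reservation
    return 0;
}
```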
Quote:Original post by GunBlade
I need the data arranged in a C++ map (or other search structure). For that I need my data to exist in main memory.

I don't know the type of data you're using, but a solution often used is to create an index.
Since you are using a map, there is some key-value relation. You could prepare your data by walking through the file(s) sequentially, noting the offset in the file of each key's value. Your map then becomes std::map<key, offset>: a lookup gives you the offset in the file, and you load the data from there, without keeping unused data in memory (see the sketch below).

Maybe the above isn't applicable to the data you're using, but it's hard to tell without more information.
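A minimal sketch of that index approach, assuming a hypothetical fixed-size record format; the scanning code would have to match the actual file layout:

```cpp
// Sketch of the offset-index idea above. The record format (a fixed-size
// struct of int key + payload) is invented; adapt the parsing to the real file.
#include <cstdint>
#include <fstream>
#include <map>
#include <string>

struct Record
{
    std::int32_t key;
    char payload[256];
};

// Pass 1: scan the file once, remembering where each key's record starts.
// Only (key, offset) pairs stay in RAM; the payloads stay on disk.
std::map<std::int32_t, std::streamoff> buildIndex(const std::string& path)
{
    std::map<std::int32_t, std::streamoff> index;
    std::ifstream in(path.c_str(), std::ios::binary);
    Record rec;
    std::streamoff offset = 0;
    while (in.read(reinterpret_cast<char*>(&rec), sizeof rec))
    {
        index[rec.key] = offset;
        offset += sizeof rec;
    }
    return index;
}

// Lookup: seek to the stored offset and read just that one record.
bool loadRecord(const std::string& path,
                const std::map<std::int32_t, std::streamoff>& index,
                std::int32_t key, Record& out)
{
    std::map<std::int32_t, std::streamoff>::const_iterator it = index.find(key);
    if (it == index.end())
        return false;
    std::ifstream in(path.c_str(), std::ios::binary);
    in.seekg(it->second);
    in.read(reinterpret_cast<char*>(&out), sizeof out);
    return !in.fail();
}
```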
How big is the amount of data you want to manage?
You said it is bigger than the amount of RAM you have, but is it also bigger than the virtual memory available to your process? If not, you can simply map that data into memory (if it is a file).
The idea of using an index is also a good solution. You could enhance it with smart pointers so that only a certain portion of the data set is held in memory at once (using an LRU scheme and writing out/reading in data automatically).
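For the mapping suggestion, a rough Win32 sketch that maps a view of a (hypothetical) file rather than loading it; names and sizes are illustrative:

```cpp
// Win32 sketch: map a window ("view") of a huge file into the address space
// instead of loading it. The file name and view size are invented; on 32-bit
// you map and unmap windows like this rather than the whole file at once.
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE file = CreateFileA("huge_data.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    // Size 0 means "as large as the file".
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map a 64 MB view at offset 0. The offset must be a multiple of the
    // allocation granularity (usually 64 KB) and the view must lie inside the file.
    const DWORD offsetHigh = 0;               // high 32 bits of the offset (files > 4 GB)
    const DWORD offsetLow  = 0;               // low 32 bits of the offset
    const SIZE_T viewSize  = 64 * 1024 * 1024;
    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, offsetHigh, offsetLow, viewSize));
    if (view)
    {
        std::printf("first byte: %d\n", view[0]); // the OS pages data in on demand
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```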
Quote:Original post by GunBlade
I wish to load my entire data into some data structure and use it online. Is there a way to make use of the virtual memory? What is the best way to handle such data?


So, more than 4 gigs?

And a map/hash table?

I'm not sure what you want to use this data for, or what it is, but SQL was designed for this.

It also comes with indexing, paging and other optimization algorithms built in.

The only other solution is, obviously, not to load all of the data into memory. Virtual memory is useless for this, since it uses generic rules for swapping data out, while only you know your data's access patterns.

If you explain what this data is and how it's used, I'm sure someone will come up with a perfect solution that won't require much memory.
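The thread only says "SQL" in general, but as one concrete illustration, an embeddable engine such as SQLite could serve as the on-disk map; the schema below is invented:

```cpp
// Sketch using SQLite as the on-disk key/value store; table layout and file
// name are invented for illustration.
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = NULL;
    if (sqlite3_open("data.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS items (key INTEGER PRIMARY KEY, value BLOB);",
        NULL, NULL, NULL);

    // Indexed lookup: the engine reads only the pages it needs from disk.
    sqlite3_stmt* stmt = NULL;
    sqlite3_prepare_v2(db, "SELECT value FROM items WHERE key = ?;", -1, &stmt, NULL);
    sqlite3_bind_int64(stmt, 1, 123456);
    if (sqlite3_step(stmt) == SQLITE_ROW)
    {
        const void* blob = sqlite3_column_blob(stmt, 0);
        int size = sqlite3_column_bytes(stmt, 0);
        std::printf("found %d bytes at %p\n", size, blob);
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```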
How about this one?
http://dtemplatelib.sourceforge.net/
Thanks for all the replies.
I'm going to try a few solutions. The first of which is a cool project I was referred to:
http://i10www.ira.uka.de/dementiev/stxxl/doxy/html/index.html
It's a library that lets you use STL-style containers and algorithms with out-of-core (external memory) support.
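For reference, basic usage of the STXXL external vector looks roughly like the sketch below; this is based on the library's documented VECTOR_GENERATOR interface, so check the STXXL docs for the exact parameters and the disk configuration it expects:

```cpp
// Rough sketch of STXXL's external-memory vector. The generator chooses block
// sizes and paging policy; the elements live on disk, with only a few blocks
// cached in RAM at any time. See the STXXL docs for configuration details.
#include <stxxl/vector>
#include <iostream>

int main()
{
    typedef stxxl::VECTOR_GENERATOR<int>::result vector_type;
    vector_type v;

    for (int i = 0; i < 1000 * 1000 * 1000; ++i)  // ~4 GB of ints, beyond a 32-bit heap
        v.push_back(i);

    std::cout << "size: " << v.size() << ", v[42] = " << v[42] << std::endl;
    return 0;
}
```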
If that doesn't work, I will check MSDN like ttdeath suggested, and go on from there.

Thanks again.
Berkeley DB fits the "huge friggin map<key, data>" paradigm perfectly, and 4-5GB is considered "no big deal". If you need even more solid access (at the possible expense of some performance), go the relational route with some form of SQL db.
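A quick sketch of that "huge map" usage with Berkeley DB's C API (callable from C++); the key/value format and file name are invented for illustration:

```cpp
// Sketch of the "huge map<key, data>" idea with Berkeley DB's C API.
#include <db.h>
#include <cstdio>
#include <cstring>

int main()
{
    DB* db = NULL;
    if (db_create(&db, NULL, 0) != 0)
        return 1;

    // A B-tree on disk: behaves like a persistent map<key, data>.
    if (db->open(db, NULL, "data.bdb", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    int key = 42;
    const char* value = "payload bytes";

    DBT k, v;
    std::memset(&k, 0, sizeof k);
    std::memset(&v, 0, sizeof v);
    k.data = &key;
    k.size = sizeof key;
    v.data = const_cast<char*>(value);
    v.size = static_cast<u_int32_t>(std::strlen(value) + 1);

    db->put(db, NULL, &k, &v, 0);               // insert

    DBT out;
    std::memset(&out, 0, sizeof out);
    if (db->get(db, NULL, &k, &out, 0) == 0)    // fetch by key
        std::printf("found: %s\n", static_cast<const char*>(out.data));

    db->close(db, 0);
    return 0;
}
```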

