How to work with a file system in memory?

This topic is 4679 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Does anyone know how to work with filesystems in memory? Say, if I open a zipped folder into memory and I wanted to access the file "abc.txt", how would I access it? I tried googling a bit, but it's kind of pointless when you don't know what to look for!

The issues of compression and file structure are separate. Here's something I have based my file system on: http://www.gamers.org/dEngine/quake/spec/quake-spec34/qkspec_3.htm

Do note that the author has several discrepancies and at times cannot put together a sentence correctly, but it's a step in the right direction for putting together a pack file that will let you have directories, sub-directories, and files all together, just like Windows. For compression, I'm using zlib (http://www.gzip.org/zlib/). The DLL interface isn't too bad, and if you search Google you can find some examples of compression/decompression with it (http://www.w3.org/config/deflate.c).

Well, since this is pretty much the same topic, I have a question as well. I have my file system all spec'd out, but am stuck on one thing. When I open an existing pack file to add a new file, I'd like to be able to re-use the same file rather than create a new one and rename it to the original file name. WinRAR, WinZip, and 7-Zip all do this. 7-Zip is open source, but since the code isn't commented or documented very well, it's hard to follow. I just can't figure out how to do it other than to parse the file up to where I want to insert my new file, copy the rest of the file out to a temporary file, insert the new file, then copy the rest of the data back into the original file. This could be very disk-intensive and slow. Ideas?

Nah, I think you got it right the first time. I want to be able to unzip a folder programmatically (I'm using zlib), and I want to navigate the folder and read certain files while the unzipped folder is in memory.

Thinking about filesystem design is strange; thinking about file and folder structure as bytes and files is an exercise in recursive thinking ;)

OK, what do you guys think of the design I have so far? I decided to allow only one level of folders to save on complexity, and random reading and writing to disk isn't important:


struct File
{
    char fileName[32];
    long fileSize;     // size of the file in bytes
    long fileLocation; // offset of the file's data from the start of the data buffer
};

struct Folder
{
    char folderName[32];
    int numFiles;
    File* files;       // array of numFiles entries
};

struct MountPoint
{
    int numFolders;
    int numFiles;
    int totalFiles;
    int headerSize;
    Folder* folders;   // array of numFolders entries
    File* files;       // array of numFiles entries
    char* dataBuffer;
};



Then I can calculate the size of the header section of the file (where the file and folder objects are stored) with:

long headerSize = (numFolders * sizeof(Folder)) + (totalFiles * sizeof(File)) + (4 * sizeof(int));

File.fileLocation is relative to the start of the data buffer, so finding the real location is done with:

fileLoc = File.fileLocation + headerSize;

I think I've thought that through pretty well; what do you guys think?

Quote:
Original post by Eps
Well, since this is pretty much the same topic, I have a question as well. I have my file system all spec'd out, but am stuck on one thing. When I open an existing pack file to add a new file, I'd like to be able to re-use the same file rather than create a new one and rename it to the original file name. WinRAR, WinZip, and 7-Zip all do this. 7-Zip is open source, but since the code isn't commented or documented very well, it's hard to follow. I just can't figure out how to do it other than to parse the file up to where I want to insert my new file, copy the rest of the file out to a temporary file, insert the new file, then copy the rest of the data back into the original file. This could be very disk-intensive and slow. Ideas?


Wouldn't that require an insert operation for whatever filesystem the OS uses? I vaguely remember hearing about something like that being available in reiser4!

Well, it should certainly work, but you may have overcomplicated things a bit. I'd have to mull over the details of your design to discover any pros/cons, so instead I'll show you what I came up with and you can compare for yourself.


// Pack Entry
struct PackEntry_T
{
    PackEntry_T() { memset(this, 0, sizeof(PackEntry_T)); }

    short type;               // Type of entry: Folder(0) or File(1).
    unsigned char string[54]; // Name of the entry, 52 chars, padded with '\0'.
    long size;                // Uncompressed size of the entry in PACK file if a file, or number of entries if a folder.
    long csize;               // Compressed size of the file.
};


That's it. Using this structure I can have any number of levels of directories and any number of files. If I set an entry to type folder and write it out, it tells me that the next 'x' entries read belong to that folder. That could include another folder, inside of which could be more folders. If I want to insert a file into a folder, I read through the file until the correct folder is found, increment its file count, then immediately write in the new file.

The first thing in my file is always a root folder, under which I can keep track of everything that is going on. This is a permanent folder that cannot be removed.

You may notice I'm not keeping track of file offsets within the pack. This is because I'm writing things out in the "header-buffer-header-buffer" fashion instead of "header-header-buffer-buffer". This lowers the maintenance I need to do when writing/removing.

Your structs are 4-byte aligned, which is good; if they weren't, they would get padded by the default compiler options (in VS, anyhow), and when trying to parse the file by the size of the structs you'd run into problems.



If anyone read all that, let me know if you see any issues with my design, as I think I have thought it out fairly well too :)

Quote:
Wouldn't that require an insert operation for whatever filesystem the OS uses? I vaguely remember hearing about something like that being available in reiser4!


Couldn't find this feature for Reiser in just a couple of minutes of searching, but I'm on a Windows box anyhow. Any idea if a similar feature is available there (whatever it actually does)?

Quote:
Original post by Eps
Well, it should certainly work, but you may have overcomplicated things a bit. I'd have to mull over the details of your design to discover any pros/cons, so instead I'll show you what I came up with and you can compare for yourself.


// Pack Entry
struct PackEntry_T
{
    PackEntry_T() { memset(this, 0, sizeof(PackEntry_T)); }

    short type;               // Type of entry: Folder(0) or File(1).
    unsigned char string[54]; // Name of the entry, 52 chars, padded with '\0'.
    long size;                // Uncompressed size of the entry in PACK file if a file, or number of entries if a folder.
    long csize;               // Compressed size of the file.
};




Would yours be as efficient for navigating through multiple directories (not that mine can, anyhow)? It seems like you would have to parse through a large amount of data to get to a file that's a few folders down the tree. I'm actually getting very interested in this project now; I never thought I'd find myself doing filesystem design, no matter how elementary!

Quote:

Your structs are 4-byte aligned, which is good; if they weren't, they would get padded by the default compiler options (in VS, anyhow), and when trying to parse the file by the size of the structs you'd run into problems.


I didn't even think of that, so it's purely coincidental. Is there a preprocessor directive that turns this off?

Quote:
Original post by Eps
Couldn't find this feature for Reiser in just a couple of minutes of searching, but I'm on a Windows box anyhow. Any idea if a similar feature is available there (whatever it actually does)?


I don't think NTFS has this feature. From what I've read about Reiser4, it's meant to be the bee's knees of filesystems; I have no idea how true that is, though!

Quote:
Original post by flukus
Would yours be as efficient for navigating through multiple directories (not that mine can, anyhow)? It seems like you would have to parse through a large amount of data to get to a file that's a few folders down the tree. I'm actually getting very interested in this project now; I never thought I'd find myself doing filesystem design, no matter how elementary!


True, laying it out in "header-header-buffer-buffer" results in finding files faster. However, I'm doing a few things to speed mine up, for instance keeping the pack file open at all times, with the file pointer at the start of the current directory. Subsequent calls to my change-directory, list-directory, and extract-file functions are then initiated at the appropriate file position. My goal is to save myself the headaches that come with maintenance of the "header-header-buffer-buffer" format (which I've dealt with on a few projects) by sacrificing some speed. In reality, though, I can't imagine it would have a large enough impact for anyone to complain.

Quote:
Original post by flukus
I didn't even think of that, so it's purely coincidental. Is there a preprocessor directive that turns this off?


I'm pretty sure there is. Just about all my work has been done with VS, however, where the option is set in a menu, so I couldn't tell you what it is off-hand.


As for my problem of the icons moving around: I finally got my system up and running a few minutes ago, and it turns out it isn't a problem at all. I created a new file, copied over all the contents of the pack file as well as a newly compressed file, then deleted the original and renamed the new file. The icon for the pack file never moved; Windows is doing something that fixes it, apparently. I did notice that when I was in debug mode and stepped through the function a line at a time it _did_ move, and thought maybe the elapsed time would cause the problem when compressing large files, but I've tested and retested with 500 MB files and the icon doesn't move.

I'm using zlib for pak files in my little filesystem. I also have a directory structure similar to Quake 3 and Doom 3, with a "base/" root folder and both paks and plain folders inside it.

I'm actually building a file table of everything the root folder contains. If a file with the same name exists in two paks, the second overwrites the first. So when I want to load a file, the filesystem finds it in the table and checks its location (pak or real folder). If it's in a pak, it opens that pak (the file entry points to a "directory" structure which holds the pak's name) and calls the zlib functions to seek to that file and load it up.

It's not hard to create a tree-like system from this table (it's just a linear linked list with files sorted alphabetically). Just find the first '/' and your root folder name is everything before it. Continue to the next slash and you have the name of your first sub-directory, and so on. Then you'd just create a type for these folders/nodes that would contain the information and pointers for the files and folders inside it.

Did I even get the point correctly? Or..?

ch.

I thought I'd bring this thread back from the dead just to make one point.

We were debating the usefulness of a table of headers with all the buffers at the end of the file versus each file header immediately followed by its buffer. Everyone agreed that a table of contents at the start of the file was faster for finding things within the file. Well, it turns out they may be identical in speed. I was benchmarking how long it takes to generate a directory listing using my method of file storage, and it seems that seeking through a file is essentially instantaneous. I could open any file, seek 500 MB, and profile it, and every time it reported that 0 milliseconds had passed. I thought I was doing something wrong at first, until I realized that Windows can do a very fast addition and jump right to that spot on the hard drive.

So the result is that a TOC and my method are the same in speed. In both you must read several headers to get to the one you are searching for, and it's the same number of read operations if the tree is built the same way. The difference is that with the TOC method you read-read-read-seek, while with mine you read-seek-read-seek.

However, multiple sequential reads are MUCH faster than reading and seeking, since the buffer and cache help ease hard-drive access. Your app could have hard-drive thrashing problems if virtual memory starts to kick in. You want to read the TOC into memory and then never have to read it again; there is no reason to constantly re-read the TOC to find a file, since the pak will remain static. If you plan on allowing the game to modify the pak, you may wish to reconsider: games that modify their own paks are not the best implementation. Instead, multiple paks plus support for native directories (i.e., files and directories not in the pak but in the game's directory) are a better option.

Don't be naive and think that getting 0 ms means your method is just as fast. Sure, hard drives can seek pretty fast, but burn your file to a CD-ROM and try again; that gives a better idea of the time differences, since the access time is slower and easier to measure. Also, use QueryPerformanceCounter so you have a more reliable timer.

By the way, having the entire TOC at the end is best if you wish to add files to the archive. Parts of the TOC scattered throughout is not the best way if you want speed. Basically, a header at the front tells you to seek to the start of the TOC (at the end of the pak, if you wish), which is one huge block that can be read and parsed.

Just remember that in a game environment the pak file should not be modified at all by the user, so the TOC at the start of the pak makes more sense. Optimizations that allow you to add files will only make debugging "easier"; it's better to allow your virtual file system to access the native directory structure within the game's base directory instead.

I second qazlop's points. And I'd like to add that even manipulating archives is no problem: in my VFS implementation I just had a Flush() method that kept the on-disk and in-memory TOCs in sync. Deleting a file was as simple as removing the corresponding entry from the TOC in memory (so that subsequent read requests would fail, even though the data was physically present) and calling Flush() on write/close.

Get it to do a few trillion reads/writes and you'll have yourself a nice little number which says how fast it is. (Time it, then output the number; you then shift the decimal point to do the divide. It's usually an insanely small number per read/write.)

What I would do is store files as chunks. These chunks are 1 KB in size.

You separate the chunks and the headers, i.e., put them in separate arrays. (The chunks would take up a few hundred MB, while the headers would take up 1 MB max.)

There are two different types of headers: file headers and folder headers. File headers have an array of pointers, which lists every full chunk. You also have another pointer, which points to the last chunk, and an int which says how many bytes are in it.

Folder headers basically have a list of all of their subfolders (and pointers to them) and all of their files (and pointers to them). You use binary searches to find them in O(log2 n) time.

You have an array of search states. Each state records what directory you are in, what file (if any) you are reading, and how far you are into the file.

You give the rest of the program a pointer to one as it asks for a new one.

You use that with operations like cd: you use the pointer and change the directory (including the pointer) inside the record that the pointer points to.

When you are reading N bytes of data, you first divide N by 1K; that's just a bitshift (N >> 10). You pass a pointer to an array you allocated on the heap, which receives the data, and memcpy each full chunk (N / 1K of them) into your array. You then compute N & (1K - 1) to find the number of bytes you have to read from the last chunk, go to that chunk, and memcpy that data as well.

The entire thing can be done in times ranging from nanoseconds to milliseconds (if you have to do a lot of data copying; normally there would be no data copying, because you would only take a KB or two at a time).

The easy way to tell if there needs to be any full-chunk copying only requires a logical AND, so there's no problem there.

This would probably be of use only once you hit the multi-gig mark, but it will probably be faster all round (assuming you also use the arrays to copy data around; the reason I do this is that you can rearrange the parts of files, allowing very fast read/write access).

There are also fast ways to delete data. You just take the end record and copy it over the one that you want to delete, then update the pointers. You do the same thing when you need to add a file. Or you can have links to files (in the Linux sense) really easily.

hth.
From,
Nice coder

Guest Anonymous Poster
You didn't mention a programming language, but if it's C++, look into deriving a custom streambuf class so you can access your decompressed-to-memory files through istreams. Most useful.

Quote:
Original post by Nice Coder
What I would do is store files as chunks.
These chunks are 1 KB in size.

...

There are also fast ways to delete data. You just take the end record and copy it over the one that you want to delete, then update the pointers. You do the same thing when you need to add a file. Or you can have links to files (in the Linux sense) really easily.


Seems a bit... too much. Of course it's a balancing act between how much you're going to update the file and how fast you want to read, but first of all, for your method to be fast, you'd have to read everything into memory. Second, unless you don't care that after a thousand modifications the file will be on the order of gigabytes anyhow, it's not very efficient. Deleting will also mean deleting things from the file and re-arranging it and the pointers associated with it.

The way I do it is really simple. The header consists of files and folders. Every folder entry consists of its name and the number of sub-folders and files in it. Every file entry consists of its name, the packed and unpacked sizes, a checksum, and its offset relative to the end of the header. Every folder header is followed first by the file headers in that folder and then by the folder headers for the sub-folders, which allows for easy recursive reading.

When I add a file, first I update the file system in memory, then turn it into binary data to see how much more room is required for the header. I then go into the file and, from the end backwards, shift all the data in the file forward. Finally I add the compressed data for the new file at the end of the file, and voilà.

Deleting is a little more difficult because of the relatively chaotic structure of the file itself. I solved it by mapping file offsets to the file headers. When I need to delete a file, I just traverse the list and modify all headers whose offset is higher than that of the file I'm deleting.

Reading the files can become quite memory-intensive, as the API returns either an array of bytes with the contents of the entire file or a MemoryStream over those bytes. But generally, with file caching on and so forth, it's pretty fast, because the dictionary is read only once, and from there it's one seek operation and then one read operation.

Of course, this isn't the fastest way, but my archives aren't meant to be modified that much, as long as reading files is fast. Oh, I use gzip for compression, by the way.

