How to work with a file system in memory?

17 comments, last by KorbenDallas 19 years, 2 months ago
Quote: Original post by Eps
Well, it should certainly work, but you may have overcomplicated things a bit. I'd have to mull over the details of your design to discover any pros/cons, so instead I'll show you what I came up with and you can compare it for yourself.

#include <cstring> // for memset

// Pack Entry
struct PackEntry_T
{
	PackEntry_T() { memset(this, 0, sizeof(PackEntry_T)); }
	short type;                  // Type of entry: Folder(0) or File(1).
	unsigned char string[54];    // Name of the entry, 52 chars, padded with '\0'.
	long size;                   // Uncompressed size of the entry in PACK file if a file, or number of entries if a folder.
	long csize;                  // Compressed size of the file.
};

Would yours be as efficient for navigating through multiple directories (not that mine can anyhow)? It seems like you would have to parse through a large amount of data to get to a file that's a few folders down the tree. I'm actually getting very interested in this project now; I never thought I'd find myself doing filesystem design, no matter how elementary!

Quote:
Your structs are 4-byte aligned, which is good; if they weren't, they would get padded by the default compiler options (in VS anyhow), and when trying to parse the file by the size of the structs you'd run into problems.


I didn't even think of that, so it's purely coincidental. Is there a preprocessor directive that turns this off?

Quote: Original post by Eps
Couldn't find this feature for reiser in just a couple of minutes of searching, but I'm on a Windows box anyhow. Any idea if there is a similar feature available (whatever it does)?


I don't think NTFS has this feature. From what I've read about reiser4, it's meant to be the bee's knees of filesystems, though I have no idea how true that is!
Quote: Original post by flukus
Would yours be as efficient for navigating through multiple directories (not that mine can anyhow)? It seems like you would have to parse through a large amount of data to get to a file that's a few folders down the tree. I'm actually getting very interested in this project now; I never thought I'd find myself doing filesystem design, no matter how elementary!


True, laying it out in "header-header-buffer-buffer" results in finding files faster. However, I'm doing a few things to speed mine up, for instance keeping the pack file open at all times, with the file pointer at the start of the current directory. Subsequent calls to my change directory, list directory, and extract file functions are then initiated at the appropriate file position. My goal is to save myself the headaches that come with maintenance of the "header-header-buffer-buffer" format (which I've dealt with on a few projects) by sacrificing some speed. In reality, though, I can't imagine it would have a large enough impact for anyone to complain.
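Roughly, the navigation side looks something like this (just a sketch with made-up names, not my exact code):

// Sketch only: keep the pack open for the lifetime of the VFS and remember
// where the current directory's entries start.
#include <cstdio>

struct OpenPack
{
	FILE* fp;           // pack file stays open at all times
	long  currentDir;   // byte offset of the current directory's first entry
};

// Called before any list/extract operation so reads begin at the right spot;
// PackEntry_T records are then read sequentially from there.
bool SeekToCurrentDir(OpenPack& pack)
{
	return fseek(pack.fp, pack.currentDir, SEEK_SET) == 0;
}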

Quote: Original post by flukus
I didn't even think of that, so it's purely coincidental. Is there a preprocessor directive that turns this off?


I'm pretty sure there is. Just about all my work has been done with VS, however, where the option is turned off in a menu, so I couldn't tell you what it is.
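For what it's worth, I believe the directive in question is #pragma pack, along these lines (untested sketch, constructor and such omitted, struct name made up):

// Force 1-byte packing for the on-disk structs, then restore the old setting.
#pragma pack(push, 1)
struct PackedEntry
{
	short type;
	unsigned char string[54];
	long size;
	long csize;
};
#pragma pack(pop)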


As for my problem of the icons moving around, I finally got my system up and running a few minutes ago and it turns out it isn't a problem at all. I created a new file, copied over all the contents of the pack file as well as a newly compressed file, then deleted the original and renamed the new file. The icon for the pack file never moved; Windows is apparently doing something that fixes it. I did notice that when I was in debug mode and stepped through the function a line at a time it _did_ move, and thought maybe the elapsed time would cause that and I would experience the problem when compressing large files, but I've tested and retested with 500MB files and the icon doesn't move.
I'm using ZLib for pak files in my little filesystem. I also have a directory structure similar to Quake 3 and Doom 3, with a "base/" root folder and paks and discrete folders inside of it.

I'm actually building a file table of everything that the root folder contains.
If there's a file with the same name in two paks, the second overwrites the first. So when I want to load a file, the filesystem finds it in the table and checks its location (pak or real folder). If it's in a pak, it opens that pak (the file entry points to a "directory" structure which has the name of the pak) and calls the ZLib functions to go to that file and load it up.
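In rough C++ terms the table part is just this (sketch, names made up):

// Sketch: map a virtual file name to wherever it actually lives.
#include <map>
#include <string>

struct FileLocation
{
	bool        inPak;     // true = inside a pak, false = loose file on disk
	std::string pakName;   // which pak to open (when inPak)
	std::string realPath;  // path on disk (when !inPak)
};

std::map<std::string, FileLocation> g_fileTable;

// Mounting paks in order means a later pak simply overwrites an
// earlier entry with the same name, so the last one mounted wins.
void AddEntry(const std::string& name, const FileLocation& loc)
{
	g_fileTable[name] = loc;
}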

It's not hard to create a "tree"-like system from this table (it's just a linear linked list with files sorted alphabetically). Just find the first '/' and your root folder name is before that. Continue to the next slash and you have the name of your first sub-directory, etc. Then you'd just create a type for these folders/nodes that would contain the information and pointers to the files and folders inside it.
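Splitting the names up is trivial, something like this (sketch):

// Sketch: break "base/textures/wall.tga" into its folder names plus file name.
#include <string>
#include <vector>

std::vector<std::string> SplitPath(const std::string& path)
{
	std::vector<std::string> parts;
	std::string::size_type start = 0, slash;
	while ((slash = path.find('/', start)) != std::string::npos)
	{
		parts.push_back(path.substr(start, slash - start)); // a folder name
		start = slash + 1;
	}
	parts.push_back(path.substr(start));                    // the file name
	return parts;
}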

Did I even get the point correctly? Or..?

ch.
I thought I'd bring this thread back from the dead just to make one point.

We were debating the usefulness of creating a table of headers with all the buffers at the end of the file versus having a file header with its buffer immediately following. Everyone agreed that having a table of contents at the start of the file was faster for finding things within the file. Well, it turns out they may be identical in speed. I was benchmarking how long it would take to generate a directory listing using my method of file storage. It seems that seeking through a file is essentially an instantaneous process. I could open any file, seek 500MB in, and profile it; every time it would return that 0 milliseconds had passed. I thought I was doing something wrong at first until I realized that Windows can do a very fast addition and jump right to that spot on the hard drive.

So the result is that a TOC and my method are the same in speed. In both you must read several headers to get to the one you are searching for, and this will be the same number of read operations if the tree is built the same way. The difference is that with the TOC method you read-read-read-seek, while with mine you read-seek-read-seek.
However, multiple sequential reads are MUCH faster than reading and seeking, since you have the buffer and cache to help ease hard drive access. Your app could have problems with hard drive thrashing if virtual memory starts to kick in. You want to read the TOC into memory and then never have to read it again; there is no reason to constantly re-read the TOC to find a file, since the pak will remain static. If you plan on allowing the game to modify the pak, you may wish to reconsider this. Games that modify their own paks are not the best implementation; instead, multiple paks and support for native directories (i.e. files and directories not in the pak but in the game's directory) are a better option.

Don't be naive and think that getting 0ms means that your method is just as fast. Sure, hard drives can seek pretty fast, but burn your file to a CD-ROM and try again. That gives a better idea of the time differences, since the access time is a bit slower and easier to measure. Also use QueryPerformanceCounter, so you have a more reliable timer.
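Something along these lines (sketch):

// Sketch: time a read with QueryPerformanceCounter instead of a millisecond timer.
#include <windows.h>
#include <cstdio>

void ProfileRead(FILE* fp, void* buffer, size_t bytes)
{
	LARGE_INTEGER freq, start, end;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&start);
	fread(buffer, 1, bytes, fp);
	QueryPerformanceCounter(&end);
	double seconds = double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
	printf("read took %f seconds\n", seconds);
}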

BTW, having the entire TOC at the end is best if you wish to add files to the archive. Having parts of the TOC scattered throughout is not the best way if you want speed. Basically, a header at the front tells you to seek to the start of the TOC (at the end of the pak if you wish), which is in one huge block that can be read and parsed.
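In other words, something like this (sketch, not any particular format):

// Sketch: a tiny fixed header at the front of the pak points at one big TOC block.
struct PakHeader
{
	char id[4];        // e.g. "PACK"
	long tocOffset;    // where the TOC block starts (the end of the file if appending)
	long tocEntries;   // how many entries to read in one go
};
// Read the PakHeader, fseek to tocOffset, read tocEntries records into memory
// in a single block, and never touch the on-disk TOC again.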

Just remember that in a game environment the pak file should not be modified at all by the user, thus the TOC at the start of the pak makes more sense. Optimizations that allow you to add files will only make debugging "easier". It's better to allow your virtual file system to access the native directory structure within the game's base directory instead.
I second qazlop's points. And I'd like to add that even manipulating archives is no problem. In my VFS implementation I just had a Flush() method that kept disk and in-memory TOC in sync. Deleting files would be as simple as removing the corresponding entry from the TOC in memory (such that subsequent read requests would fail, even though the file is physically present) and calling Flush() on write/close.
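Very roughly (an outline, not my exact code):

// Outline: the TOC lives in memory; Flush() writes it back to the pak on disk.
#include <map>
#include <string>

struct TocEntry { long offset; long size; };

class Vfs
{
public:
	Vfs() : m_dirty(false) {}
	void Remove(const std::string& name)
	{
		m_toc.erase(name);  // later reads fail even though the data is still physically in the pak
		m_dirty = true;
	}
	void Flush()
	{
		if (m_dirty) { /* rewrite the on-disk TOC from m_toc here */ m_dirty = false; }
	}
private:
	std::map<std::string, TocEntry> m_toc;
	bool m_dirty;
};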
Get it to do a few trillion reads/writes and you'll have yourself a nice little number which says how fast it is (time it, then output the number; you then shift the decimal point to do the divide. It's usually an insanely small number per read/write).

What I would do is store files as chunks.
These chunks are 1K in size.

You separate the chunks and the headers, i.e. put them in separate arrays (the chunks would take up a few hundred MB, while the headers would take up 1MB max).

There are two different types of headers: file headers and folder headers.
File headers have a list of pointers (laid out like a C string), which has every full chunk listed. You also have another pointer, which points to the last chunk, and an int which says how many bytes are in it.

Folder headers basically have a list of all of their subfolders (and pointers to them) and all of their files (and pointers to them). You use binary searches to find them in O(log2 n) time.

You have an array of search states. Each state stores:
what directory you are in,
what file (if any) you are reading, and
how far you are into the file.

You give the rest of the program a pointer to one whenever it asks for a new one.

You use that for operations like cd: you take the pointer and change the directory (including its pointer) inside the record that the pointer points to.

When you are reading N bytes of data, you first divide N by 1K; that's just a bitshift. You pass a pointer to an array you allocated on the heap, which receives the data. You memcpy each full chunk (there are N/1K of them) into your array.
You then take N & (1K - 1) to find out how many bytes you have to read from the last chunk. You then simply go to that chunk and memcpy the data back.

The entire thing can be done in times ranging from nanoseconds to milliseconds (if you have to do a lot of data copying; normally there would be no data copying, because you would only take a KB or two at a time).

The easy way to tell if there needs to be any full-chunk copying only requires a logical AND, so there's no problem there.
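In code, the read would be roughly this (sketch, assuming 1K chunks):

// Sketch: copy N bytes out of an array of 1K chunks into a caller-supplied buffer.
#include <cstring>

const unsigned CHUNK = 1024;

void ReadBytes(unsigned char** chunks, unsigned n, unsigned char* out)
{
	unsigned fullChunks = n >> 10;         // n / 1K is just a bitshift
	unsigned remainder  = n & (CHUNK - 1); // n mod 1K is a logical AND
	for (unsigned i = 0; i < fullChunks; ++i)
		memcpy(out + i * CHUNK, chunks[i], CHUNK);
	if (remainder)
		memcpy(out + fullChunks * CHUNK, chunks[fullChunks], remainder);
}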

This would probably be of use only once you hit the multi-gig mark, but it will probably be faster all round (assuming you also use the arrays to copy data around; the reason I do this is that you can rearrange the parts of files, allowing very fast read/write access).

There are also fast ways to delete data. You just take the end record and copy it over the one that you want to delete, then update the pointers. You do the same thing when you need to add a file. You can also have links to files (in the Linux sense) really easily.

hth.
From,
Nice coder
You didn't mention a programming language, but if it's C++, look into deriving a custom streambuf class so you can access your decompressed-to-memory files through istreams. Most useful.
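The bare minimum looks something like this (sketch):

// Sketch: wrap an already-decompressed buffer in a streambuf so it can be
// read through a normal std::istream.
#include <cstddef>
#include <streambuf>
#include <istream>

class MemoryBuf : public std::streambuf
{
public:
	MemoryBuf(char* data, std::size_t size) { setg(data, data, data + size); }
};

// Usage:
//   MemoryBuf buf(decompressedData, decompressedSize);
//   std::istream in(&buf);
//   in.read(...), operator>>, std::getline, etc. now work on the memory block.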
Quote: Original post by Nice Coder
What I would do is store files as chunks.
These chunks are 1K in size.

...

There are also fast ways to delete data. You just take the end record and copy it over the one that you want to delete, then update the pointers. You do the same thing when you need to add a file. You can also have links to files (in the Linux sense) really easily.


Seems a bit... too much. Of course it's a balancing act between how much you're going to update the file and how fast you want to read, but first of all, for your method to be fast, you'd have to read everything into memory. Second, unless you don't care that after a thousand modifications the file will be on the order of gigabytes anyhow, it's not very efficient. Deleting will also mean deleting things from the file and rearranging it and the pointers associated with it.

The way I do it is real simple. The header consists of files and folders. Every folder entry consists of its name and the number of sub-folders and files in it. Every file entry consists of its name, the packed and unpacked sizes, a checksum, and its offset relative to the end of the header. Then every folder header is followed first by the file headers in that folder, and then by the folder headers for the sub-folders, which allows for easy recursive reading.
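In struct form it's basically this (simplified sketch; the field sizes are made up):

// Simplified sketch of the header layout described above.
struct FolderHeader
{
	char name[64];     // folder name
	long numFiles;     // file headers that follow immediately
	long numFolders;   // sub-folder headers that follow the files
};

struct FileHeader
{
	char          name[64];
	long          packedSize;
	long          unpackedSize;
	unsigned long checksum;
	long          offset;   // relative to the end of the header block
};
// Reading is recursive: read a FolderHeader, read numFiles FileHeaders,
// then recurse numFolders times for the sub-folders.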

When I add a file, first I update the file system in memory, then turn it into binary data to see how much more room is required for the header. I then go into the file and, working backwards from the end, shift all the data forward. Then finally I add the compressed data for the new file at the end of the file, and voila.

Deleting is a little more difficult, because of the relatively chaotic structure of the file itself. I solved it by mapping file offsets to the file headers. Then when I need to delete a file, I just traverse the list and modify all headers whose offset is higher than that of the file I'm deleting.
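Roughly like this (sketch):

// Sketch: after removing a file's data, pull every later offset back by its size.
#include <vector>

struct Entry { long offset; long packedSize; };

void FixOffsets(std::vector<Entry>& entries, long deletedOffset, long deletedSize)
{
	for (size_t i = 0; i < entries.size(); ++i)
		if (entries[i].offset > deletedOffset)
			entries[i].offset -= deletedSize;  // data after the hole has slid forward
}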

Reading the files can become quite memory-intensive, as the API returns either an array of bytes with the contents of the entire file, or a MemoryStream from those bytes. But generally, with file caching on etcetera, it's pretty fast, because the dictionary is read only once, and from there, it's one seek operation, and then one read operation.

Of course, this isn't the fastest way, but my archives aren't meant to be modified that much, as long as reading files is fast. Oh, I use gzip for compression, by the way.

