Fast Searching/Modifying data from Binary Files

hi all,

Basically I want to find the quickest way to search for data in a binary file.

For example my bin file will contain data such as:
0,Hello World,0
0,Another World,1

Now I would like to be able to look up any given line of a .bin file with 10,000 lines (possibly more). Currently I am doing it this way:


void FetchData(const unsigned int& line, int column, std::string& data)
{
    if (file)
    {
        // reset the file pointer to the beginning
        file.clear();
        file.seekg(0, std::ios::beg);
        if (file.good())
        {
            // read lines until the desired one is reached;
            // the last getline leaves that line in 'data'
            for (unsigned int i = 0; i < (line + 1) && !file.eof(); i++)
            {
                std::getline(file, data);
            }
        }
    }
}



This does work, but are there any more efficient approaches I could take?

I am also required to modify this data. At the moment I store the contents of the .bin file in a vector temporarily, modify particular elements of the vector,
and then overwrite the existing .bin file with the latest data. If there is a more efficient approach to this, please let me know.

Also, on a side note, how would performance differ if I used an .xml file instead of a .bin file?
Thank you for any help offered.
From what I can tell, you are working with a text file, not a binary file. Just because the extension is '.bin' does not make it a binary file.

I'm also not clear what exactly it is you are asking. If you really want to detect new lines in your text file, then yes, you have to read each line, interpret it and compare it with the data you already know you have. For something as tiny as 10000 (int, string, int) I see no problem with keeping the data in memory.
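
For illustration, a minimal sketch of that keep-it-in-memory approach, assuming the comma-separated line format from the first post (the struct and function names are just placeholders):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One parsed row; every field is kept as a string, as in the original data.
struct Row
{
    std::vector<std::string> columns;
};

// Read the whole file once and split each line on commas.
std::vector<Row> LoadAll(const std::string& path)
{
    std::vector<Row> rows;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
    {
        Row row;
        std::stringstream fields(line);
        std::string field;
        while (std::getline(fields, field, ','))
            row.columns.push_back(field);
        rows.push_back(row);
    }
    return rows;
}

After that, fetching row N, column M is just rows[N].columns[M], with no file access per lookup.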
If the content of the data is critical I would not just overwrite the file like that. Rename the original file, write out the new file, delete the backup file after it has been ensured the new file is written completely to disk.
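
A minimal sketch of that safer overwrite, assuming C++17's <filesystem> is available; the file names and function name are placeholders:

#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Write the new contents to a temporary file, keep the old file as a backup,
// and only delete the backup once the new file is in place.
bool SaveAll(const std::string& path, const std::vector<std::string>& lines)
{
    const fs::path target(path);
    const fs::path backup(path + ".bak");
    const fs::path temp(path + ".tmp");

    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        for (const std::string& line : lines)
            out << line << '\n';
        if (!out)                 // write failed: the original stays untouched
            return false;
    }                             // stream flushed and closed here

    std::error_code ec;
    if (fs::exists(target))
    {
        fs::rename(target, backup, ec);   // keep the original as a backup
        if (ec)
            return false;
    }
    fs::rename(temp, target, ec);         // move the new file into place
    if (ec)
        return false;
    fs::remove(backup, ec);               // drop the backup; ignore if absent
    return true;
}

Note that the standard library offers no direct way to force data onto the physical disk, so a fully robust version would also use a platform-specific flush/sync before the renames.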

There are probably lots of ways to do things more efficiently but there is too little information about what you really need and which assumptions can be relaxed.
Reading an XML file instead would make things even worse, because the text would additionally have to be parsed as XML.

The constraints to the problem are not clear. There are of course possibilities to make things more efficient, but they may introduce some constraints that may or may not be okay here.

Possible things are
* making the data packets equal in size (see the sketch after this list),
* using a kind of TOC ("table of contents"),
* appending data instead of overwriting,
* using prefetched hash tables (or something) if looking up names is required,
* ...
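
A minimal sketch of the first point, assuming every record is padded to a fixed width so that record i starts at byte i * RECORD_SIZE (the width and function name are placeholders):

#include <fstream>
#include <string>

const std::streamsize RECORD_SIZE = 64;   // placeholder width; every line padded to this size

// With equal-sized records a single seek reaches any record directly;
// no scanning from the start of the file is needed.
std::string ReadRecord(std::ifstream& file, unsigned int index)
{
    std::string record(RECORD_SIZE, ' ');
    file.clear();                          // clear eof/fail bits from earlier reads
    file.seekg(static_cast<std::streamoff>(index) * RECORD_SIZE, std::ios::beg);
    file.read(&record[0], RECORD_SIZE);
    return record;
}

Writing works the same way with seekp, so a single record can be modified in place without rewriting the whole file.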
If this is something that must be done repeatedly on the same file, you may want to look into memory mapping.
Apologies for the lack of detail.

"For something as tiny as 10000 (int, string, int) I see no problem with keeping the data in memory."

I am creating a list for a GUI, and the amount of data in the list could grow even larger than 10,000 rows - unlikely, but certainly possible. On top of this, my list can handle a varied number of columns.

All of the data will actually be strings, even in the example in my first post.

I have not used the file libraries much, but this is the code I wrote to create the file:


std::ofstream file;
file.open("data.bin", std::ios::out | std::ios::binary | std::ios::app);

std::string str = "0,Hello World,0";
if (file)
{
    // append the sample line 10,000 times, one record per line
    for (int i = 0; i < 10000; i++)
    {
        file.write(str.c_str(), str.size());
        file << '\n';
    }
}
file.close();


If this is incorrect please let me know.

Note: I actually won't be creating the binary file myself, only receiving the file and reading the data in FetchData method.

The main problem is that my list works perfectly fine up until I request around the 4,000th row; then it becomes noticeably slower, because I am iterating through the file until I reach the required index.
If the structure of the file is out of your responsibility and you want fast "random" read access without loading the entire file into memory, you can do a pre-pass where the file is scanned through once, and for each line the file offset of that line, as well as any necessary identifiers (e.g. index, name, hash, whatever), are remembered in working memory. Then you can identify a line of interest in memory and use the corresponding file offset to skip into the file and read the line. This is a kind of table of contents, as mentioned earlier, but kept outside the file itself.
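
A minimal sketch of that pre-pass, assuming the line-based file from the first post (the function names are placeholders):

#include <fstream>
#include <string>
#include <vector>

// One pass over the file: remember where every line starts.
std::vector<std::streampos> BuildLineIndex(std::ifstream& file)
{
    std::vector<std::streampos> offsets;
    file.clear();
    file.seekg(0, std::ios::beg);
    std::string line;
    while (true)
    {
        std::streampos pos = file.tellg();
        if (!std::getline(file, line))
            break;
        offsets.push_back(pos);
    }
    return offsets;
}

// Later: jump straight to a line instead of scanning from the beginning.
bool FetchLine(std::ifstream& file,
               const std::vector<std::streampos>& offsets,
               unsigned int lineIndex,
               std::string& data)
{
    if (lineIndex >= offsets.size())
        return false;
    file.clear();
    file.seekg(offsets[lineIndex]);
    return static_cast<bool>(std::getline(file, data));
}

The index costs one sequential scan up front; after that every lookup is a single seek plus one getline, independent of the line number.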
Thanks haegarr,

that sounds like a great solution to my problem. Do you know of any tutorials/examples of how to do this?

"I am creating a list for a GUI, and the amount of data in the list could grow even larger than 10,000 rows - unlikely, but certainly possible. On top of this, my list can handle a varied number of columns."

So you have to display a portion of the list? If so, the number of rows you'll display is most likely limited to a small amount. You can use this to your advantage by mapping out the starting and ending positions of blocks of lines, then referencing those when you want to display the relevant portion.
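
A minimal sketch of that idea, assuming one stored offset per block of BLOCK_SIZE lines so the map stays small, and the same line-based file as before (the names and block size are placeholders):

#include <fstream>
#include <string>
#include <vector>

const unsigned int BLOCK_SIZE = 100;   // placeholder: lines per block

// Remember the file offset of every BLOCK_SIZE-th line only,
// so the map stays small even for very large files.
std::vector<std::streampos> BuildBlockMap(std::ifstream& file)
{
    std::vector<std::streampos> blocks;
    file.clear();
    file.seekg(0, std::ios::beg);
    std::string line;
    unsigned int lineNo = 0;
    while (true)
    {
        std::streampos pos = file.tellg();
        if (!std::getline(file, line))
            break;
        if (lineNo % BLOCK_SIZE == 0)
            blocks.push_back(pos);
        ++lineNo;
    }
    return blocks;
}

// Read the rows currently visible in the GUI: seek to the block containing
// the first visible row, then skip ahead inside that block.
std::vector<std::string> FetchVisibleRows(std::ifstream& file,
                                          const std::vector<std::streampos>& blocks,
                                          unsigned int firstRow,
                                          unsigned int rowCount)
{
    std::vector<std::string> rows;
    const unsigned int block = firstRow / BLOCK_SIZE;
    if (block >= blocks.size())
        return rows;

    file.clear();
    file.seekg(blocks[block]);
    std::string line;
    for (unsigned int i = block * BLOCK_SIZE; i < firstRow + rowCount; ++i)
    {
        if (!std::getline(file, line))
            break;
        if (i >= firstRow)
            rows.push_back(line);
    }
    return rows;
}

For a GUI list this means only the handful of visible rows is read from disk per refresh.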

Keeping 10,000 or even 100,000 strings in memory shouldn't be a problem though, unless you've got some insanely long strings.
