Sign in to follow this  
JakeM

Binary pattern searching

Recommended Posts

How to search for a binary byte pattern quickly in a large 100MB+ file? Hex editors can do it in a matter of seconds, while my first attempt did in a matter of extremely long minutes? This is what I tried. Yeah I know it looks bad because it is. :)
#define NUM_SEARCH_BYTES	8
//
 BYTE search_bytes[8];
 search_bytes[0] = 0x11;
 search_bytes[1] = 0x22;
 search_bytes[2] = 0x33;
 search_bytes[3] = 0x44;
 search_bytes[4] = 0x55;
 search_bytes[5] = 0x66;
 search_bytes[6] = 0x77;
 search_bytes[7] = 0x88;
//
 BYTE b;
//
 // search from start to end of file
 while(EOF) {

    Get_Byte(b);
    if(b == search_bytes[0]) {				// found 1st byte

       Get_Byte(b);
       if(b == search_bytes[1]) {			// found 2nd byte

          Get_Byte(b);
          if(b == search_bytes[2]) {			// found 3nd byte

             Get_Byte(b);
             if(b == search_bytes[3]) {			// found 4th byte

                Get_Byte(b);
                if(b == search_bytes[4]) {		// found 5th byte

                   Get_Byte(b);
                   if(b == search_bytes[5]) {		// found 6th byte

                      Get_Byte(b);
                      if(b == search_bytes[6]) {	// found 7th byte

                         Get_Byte(b);
                         if(b == search_bytes[7]) {	// found 8th byte

                            printf("FOUND IT!\n");
                         }
                      }
                   }
                }
             }
          }
       }
    }
 }
}

Share this post


Link to post
Share on other sites
Have a look at the different types of string searching algorithms that have been developed. They'll work for binary patterns too (really, they are the same thing). If you Google around you'll probably find all the information and source code snippets you need.

Good luck!

[Edited by - Trapper Zoid on September 6, 2005 4:40:06 AM]

Share this post


Link to post
Share on other sites
I'd venture to say that perhaps the file is parsed into a tree of some sort when loaded. If this was done, searching the tree for the requested pattern becomes trivial and very fast! But then again, doing so would require a lot of memory if stored, so either they save the tree to a file or just use something different altogether.

Share this post


Link to post
Share on other sites
Assuming the pattern I want to search is binary, pattern is of any length and of any character (I don't care about text),
searching will be done sequentially in a large 100MB+ file (beginning to end), and I want to access the file
directly using fopen() (don't want to create any temp. copies in memory, file is too large), what would be
the best algorithm for me?

Share this post


Link to post
Share on other sites
Quote:
Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM

Share this post


Link to post
Share on other sites
It would be nice if I could access the file like a buffer as it is. Does CreateFileMapping() have fast file access?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?

Share this post


Link to post
Share on other sites
Quote:
Original post by JakeM
while(EOF) {

Get_Byte(b);
if(b == search_bytes[0]) { // found 1st byte

Get_Byte(b);
if(b == search_bytes[1]) { // found 2nd byte

Get_Byte(b);
if(b == search_bytes[2]) { // found 3nd byte

Get_Byte(b);
if(b == search_bytes[3]) { // found 4th byte

Get_Byte(b);
if(b == search_bytes[4]) { // found 5th byte

Get_Byte(b);
if(b == search_bytes[5]) { // found 6th byte

Get_Byte(b);
if(b == search_bytes[6]) { // found 7th byte

Get_Byte(b);
if(b == search_bytes[7]) { // found 8th byte

printf("FOUND IT!\n");
}
}
}
}
}
}
}
}
}
}
There is never any excuse for writing code with that much duplication. Do that in a job interview and your chances of being hired are slim to none. I'm very pleased that you yourself have recognised that it is bad.

As a matter of fact, not only is it bad, it's actually very wrong!
It may not find a string that is preceeded by a substring of the string you are searching for.
i.e. It will NOT find "abcabcde" in the string "hegafabcabcabcdecfegbadh".

Definately check out the link Trapper Zoid posted.

Share this post


Link to post
Share on other sites
Quote:
Original post by Conner McCloud
Quote:
Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM


fopen() et al do buffer - that's why getch() is generally implemented as a macro that just grabs one byte from the buffer if it's available. I wouldn't rely on its buffering for that size of file though. The fastest way would probably be to use Windows asynchronous file reading calls or using two threads on Unix, so you can be loading the file and scanning through it at the same time.

Share this post


Link to post
Share on other sites
I'm guessing here, but reading byte by byte from file may be the slowest part.

You really should read at least 64k (or other magic number) sized chunks and perform the check on the memory buffer. And since you are dealing with binary data, it may make sense to use the largest binary data type usable, __int32 or __int64.

something like that:


int k, i, j;
__int64 buf[65536], seq;

seq = ...;
while(!EOF)
{
k = fread(buf, 65536, 1, yourfile) / 8; //k read bytes into k/8 _int64
i = 0;
while(i<k)
{
if(buf[i] == seq)break;
i++;
}
if(i<k)_found(i);
}



The above code doesn't take into consideration what happens when the first part of the binary string is found at the end of one read chunk and the second in the next read chunk, nor if the read chunk cannot be divided be 8.

Share this post


Link to post
Share on other sites
I think it's quite obvious that you should use the boyer moore algorithm. It's the most widely used algorithm in pattern searching in editors and viewers.

Share this post


Link to post
Share on other sites
Well, it's not too obvious for someone who has never heard of it since yesterday. :)

What I don't understand is why these searching algorithm docs never mention 'binary' searching.
It's always text searching. I don't want any algorithm to assume to want to search for text
only.

Share this post


Link to post
Share on other sites
It's because text searching and binary searching can use the same algorithms. It just is more intuitive for humans to think about text searching.

Share this post


Link to post
Share on other sites
Is this all of it? Short and sweet.


// Boyer-Moore-Horspool text searching

char
*search( char *pat, char *text, int n) {

int i, j, k, m, skip[MAXCHAR];

m = strlen(pat);
if( m==0 ) return( text );

for( k=0; k<MAXCHAR; k++ ) skip[k] = m;

for( k=0; k<m-1; k++ ) skip[pat[k]] = m-k-1;

for( k=m-1; k < n; k += skip[text[k] & (MAXCHAR-1)] ) {
for( j=m-1, i=k; j>=0 && text[i] == pat[j]; j-- ) i--;
if( j == (-1) ) return ( text+i+1 );
}

return( NULL );
}



As for reading it in, I can probably put that aside until I get this thing working for a binary file, fast or slow. Thanks.

Share this post


Link to post
Share on other sites
Quote:
Original post by JakeM
It would be nice if I could access the file like a buffer as it is. Does CreateFileMapping() have fast file access?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?
I found during profiling that memory mapping files (using CreateFileMapping etc) in wundows using Visual Studio 2002 .Net Academic was around 9 times faster than using the C++ file stream classes, and around 4 times faster than the C fopen/fread/etc set of functions. Then I turned profiling off, and noticed it didn't really make a noticable difference when reading in files around 10MB in size. I also tried loading files around 700MB, but that was a different program that wasn't at all disk-bound in the first place (it ran fine even constantly reading the HD because of buffering) and it had the downside that much RAM was consumed (windows didn't page it out nearly aggressively enough so it actually slowed other parts of the program down slightly)

You could try it though. For great justice, you can use my file map class if you'd like:
/*Memory Mapped File Class
Copyright (c) 2005 Extrarius

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.*/

#include <windows.h>

class FileMap
{
HANDLE m_File;
HANDLE m_Mapping;
unsigned int m_FileLength;
void *m_Data;
public:
FileMap(void): m_File(NULL), m_Mapping(NULL), m_Data(NULL), m_FileLength(0) {}
~FileMap(void)
{
Close();
}
bool Open(const wchar_t *p_Filename)
{
Close();
m_File = CreateFile(p_Filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
if(m_File != INVALID_HANDLE_VALUE)
{
m_Mapping = CreateFileMapping(m_File, NULL, PAGE_READONLY, 0, 0, NULL);
if(m_Mapping != NULL)
{
m_Data = MapViewOfFile(m_Mapping, FILE_MAP_READ, 0, 0, 0);
if(m_Data != NULL)
{
m_FileLength = GetFileSize(m_File, NULL);
return true;
}
}
}
Close();
return false;
}
void Close(void)
{
if(m_Data != NULL)
{
UnmapViewOfFile(m_Data);
m_Data = NULL;
}
if(m_Mapping != NULL)
{
CloseHandle(m_Mapping);
m_Mapping = NULL;
}
if(m_File != NULL)
{
CloseHandle(m_File);
m_File = NULL;
}
m_FileLength = 0;
}
unsigned char *GetPointer(void)
{
return reinterpret_cast<unsigned char *>(m_Data);
}
unsigned int GetLength(void)
{
return m_FileLength;
}
};


Share this post


Link to post
Share on other sites
Thanks, I'll try to do something with it. :)

But, this is what I don't fully understand. Should I pass the entire file buffer to this searching function?
Or should I pass small chunks to it? Either way, the searching function only returns the first match found.
And if I pass small chunks, there's the possibility that my pattern could span between two
chunks and be missed entirely.

Share this post


Link to post
Share on other sites
Wow, I just searched a 100MB+ file in less than a second, and picked off a couple hundred
matches. Blazing hot damn fast. Certainly better than my first attempt almost an agonizing 5 minutes.
It matched the output perfectly.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this