[.net] Effective (i.e. fast) method of searching a pool of files?

Hi. In my p2p app, when a client connects, the server grabs a list of the files the peer is sharing and stores it within the router's connection object for that peer. Now I've come to searching these files on the server (the peer sends a request, the server searches, and then returns the results via UDP), but I'm not sure what technique to use.

Should I tokenize the search string and the file names and use hash checks for speed? Should I just search for the search string within each file name? Is there anything like a binary search tree for strings to speed this up? If the app is popular (I know, big if) it could very well have a million-plus users, all searching, so speed is my biggest concern, not memory consumption. If I have to use 200 MB to speed up the searches 500%, I'll gladly make the trade-off.
One of the simplest ways to speed up a big text search is to create an index. In .NET this can be really easy, depending on your needs. Try using a Dictionary or SortedDictionary (a hash table or binary tree, respectively). Go through each block of searchable text (file names, or contents in your case), identify terms, add each term as a key, and set each value to some reference to the blocks that contain that term (like a list of file names). Keep this Dictionary around. When you want to search: tokenize your input search string and look up each token in your dictionary. Intersect or union the results and return them. There are a number of more complicated things you can do, but this should get you started.
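Here is a minimal sketch of that idea in C# (the class and method names are my own, not from any library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FileIndex
{
    // term -> set of files whose name contains that term
    private readonly Dictionary<string, HashSet<string>> index =
        new Dictionary<string, HashSet<string>>();

    // Split a name like "my test file.exe" into lowercase terms.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', '.', '_', '-' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    public void Add(string fileName)
    {
        foreach (string term in Tokenize(fileName))
        {
            HashSet<string> files;
            if (!index.TryGetValue(term, out files))
                index[term] = files = new HashSet<string>();
            files.Add(fileName);
        }
    }

    // AND semantics: intersect the posting list of every query token.
    public IEnumerable<string> Search(string query)
    {
        IEnumerable<string> results = null;
        foreach (string term in Tokenize(query))
        {
            HashSet<string> files;
            if (!index.TryGetValue(term, out files))
                return Enumerable.Empty<string>(); // some token matched nothing
            if (results == null)
                results = files;
            else
                results = results.Intersect(files);
        }
        return results ?? Enumerable.Empty<string>();
    }
}
```

Note that this only matches whole terms: tokenizing "ThisIsATest.exe" on delimiters yields "thisisatest", so a search for "test" would miss it unless you also split on case changes — which is exactly the worry raised further down the thread.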

hope this helps.
Go for a 3rd-party indexing package; don't even think about trying to roll your own if you are aiming for large numbers of users.

A good bet would be dotlucene. It's free, and really *really* fast. 100% .NET too. It is a port of the Java 'Lucene' project. You can find another port called NLucene. Which is better? Dunno. [wink]
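For flavor, indexing and searching file names looks roughly like this against the old Lucene.Net 2.x API (class and method names vary across versions and ports, so treat this as a sketch rather than gospel):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class LuceneSketch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();

        // Build the index (true = create a fresh index directory).
        var writer = new IndexWriter("file-index", analyzer, true);
        var doc = new Document();
        doc.Add(new Field("name", "ThisIsATest.exe",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Search it.
        var searcher = new IndexSearcher("file-index");
        var query = new QueryParser("name", analyzer).Parse("test");
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Doc(i).Get("name"));
        searcher.Close();
    }
}
```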

You might also want to consider simply keeping the index on each local machine, and doing common-query caching on the server end. Or at least cache whether a client actually has any results at all. Indexes can still get really big, really quick.
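A common-query cache can be as simple as a dictionary from the normalized query to its results with some expiry; a rough sketch (the five-minute TTL and all names here are arbitrary placeholders):

```csharp
using System;
using System.Collections.Generic;

class QueryCache
{
    private struct Entry { public List<string> Results; public DateTime Added; }

    private readonly Dictionary<string, Entry> cache =
        new Dictionary<string, Entry>();
    private readonly TimeSpan ttl = TimeSpan.FromMinutes(5); // arbitrary expiry

    public bool TryGet(string query, out List<string> results)
    {
        Entry e;
        if (cache.TryGetValue(query.ToLowerInvariant(), out e) &&
            DateTime.UtcNow - e.Added < ttl)
        {
            results = e.Results;
            return true;
        }
        results = null;
        return false;
    }

    public void Put(string query, List<string> results)
    {
        cache[query.ToLowerInvariant()] =
            new Entry { Results = results, Added = DateTime.UtcNow };
    }
}
```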
Thanks for the tips, I think I'll check out that indexing package. Atm I just brute-force through every unique file and do a simple string.Contains(searchTerm), so anything has to be an improvement.


My one worry is: if I switch to indexing, will I still be able to search for incomplete strings? I.e. if I have a file called ThisIsATest.exe and I just enter "test", it'll still return the file, right?
Quote:Original post by Anonymous Poster
Thanks for the tips, I think I'll check out that indexing package. Atm I just brute-force through every unique file and do a simple string.Contains(searchTerm), so anything has to be an improvement.


My one worry is: if I switch to indexing, will I still be able to search for incomplete strings? I.e. if I have a file called ThisIsATest.exe and I just enter "test", it'll still return the file, right?


These pre-sorting methods are good if you know the filename and want to get its location, or any other information, fast. This means that you have to know the key (= filename).
When searching in keys, you simply iterate through the keys and try to find the matching substring. So your whole structure would look something like this:
-Have a complex directory structure with a lot of files
-Have a sorting structure for them, to find a file's location fast (2-3 trees and B-trees are quite good for file-system-like jobs)
-Have an array of keys (filenames) to access them fast, linearly. Running a LIKE '%string%' search over an array a few hundred thousand entries long is not a big deal (and it cannot be accelerated with simple techniques either, only with advanced heuristics); see the sketch after this list.
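A minimal sketch of that linear substring scan in C# (the array of filenames and the case-insensitive comparison are my assumptions):

```csharp
using System;
using System.Collections.Generic;

static class LinearSearch
{
    // Scan every filename for the search term, like SQL's LIKE '%term%'.
    public static List<string> Find(string[] fileNames, string term)
    {
        var matches = new List<string>();
        foreach (string name in fileNames)
        {
            if (name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(name);
        }
        return matches;
    }
}
```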

Do you have to do the search on client machines or on a server? If on a server machine, I suggest using a database (MySQL, for example), as databases provide all the functionality you need in this case...
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
Well, after reading what's needed, I'm leaning towards MySQL.
I have followed some MySQL tutorials in C# on the MS website, but I haven't tried coding them by hand.

Can you suggest any good tutorials to help me get up to speed? I did toy with the idea of a database, but since it's so dynamic I thought it would be a slow method.
To create a fast index of names of any kind, here is what I would do.

Create a hashtable or dictionary with a binary tree off of each node.

To make it simple, use 36 entries in the table.

Each entry corresponds to the first letter of the given name.
Hopefully your filenames don't all start with c:\ ... if so, you can probably write a workaround for that.

0, 1, 2, ... 9, a, b, c, d, e, ... x, y, z

Off of each of these table indexes, create a tree.

This will give you the best kind of index that you can code up quickly. Other examples are faster, but likely less elegant.

This also lets you avoid keeping everything in memory: you can save off one letter's index, or load just one letter. There are obvious benefits to scalability across machines as needed here too.

As a side note, do not use recursion in your binary tree, or your memory requirements will explode.

You could even get clever when building the tree and not store the full name in each node, by removing the part of the name that already exists in the parent or root node. I don't usually like clever code, because it is hard to maintain, but it can be worthwhile sometimes.
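A rough C# rendition of that scheme, using SortedDictionary (a red-black tree) as the per-letter tree; every name in this sketch is invented for illustration:

```csharp
using System;
using System.Collections.Generic;

class LetterBucketIndex
{
    // 36 buckets: '0'-'9' then 'a'-'z'. Each bucket is a tree
    // (SortedDictionary) from filename to whatever you track per file,
    // e.g. the owning peer.
    private readonly SortedDictionary<string, string>[] buckets =
        new SortedDictionary<string, string>[36];

    private static int BucketOf(string name) // assumes a non-empty name
    {
        char c = char.ToLowerInvariant(name[0]);
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'z') return 10 + (c - 'a');
        return 0; // dump anything else into bucket 0 as a workaround
    }

    public void Add(string name, string owner)
    {
        int i = BucketOf(name);
        if (buckets[i] == null)
            buckets[i] = new SortedDictionary<string, string>(
                StringComparer.OrdinalIgnoreCase);
        buckets[i][name] = owner;
    }

    // Only the one bucket for the name's first letter needs to be
    // in memory (or loaded from disk) to answer this lookup.
    public bool TryFind(string name, out string owner)
    {
        var bucket = buckets[BucketOf(name)];
        owner = null;
        return bucket != null && bucket.TryGetValue(name, out owner);
    }
}
```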


Quote:Original post by Anonymous Poster
Well, after reading what's needed, I'm leaning towards MySQL.
I have followed some MySQL tutorials in C# on the MS website, but I haven't tried coding them by hand.

Can you suggest any good tutorials to help me get up to speed? I did toy with the idea of a database, but since it's so dynamic I thought it would be a slow method.


If you are familiar with SQL and relational databases, the MSDN tutorials should be easy. I can provide you with sample code if you want to start coding fast. For this job you need really simple queries against one single table with one index, the filename. All you need to do is create a connection object and a command, then call the command's ExecuteReader method and iterate through the results.
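A minimal sketch of that flow, assuming MySQL Connector/NET (the connection string, table, and column names are invented for illustration):

```csharp
using System;
using MySql.Data.MySqlClient; // MySQL Connector/NET

class SearchSketch
{
    static void Main()
    {
        string connStr = "Server=localhost;Database=p2p;Uid=user;Pwd=pass;";
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            // Parameterized LIKE query against an indexed filename column.
            var cmd = new MySqlCommand(
                "SELECT filename FROM shared_files WHERE filename LIKE @term",
                conn);
            cmd.Parameters.AddWithValue("@term", "%test%");
            using (MySqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}
```

One caveat: a leading-wildcard LIKE ('%test%') cannot actually use the index on filename, so the database still scans the column — it just does so in optimized native code rather than in your own loop.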
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
Quote:Original post by BradSnobar
To create a fast index of names of any kind, here is what I would do.

Create a hashtable or dictionary with a binary tree off of each node.

To make it simple, use 36 entries in the table.

Each entry corresponds to the first letter of the given name.
Hopefully your filenames don't all start with c:\ ... if so, you can probably write a workaround for that.

0, 1, 2, ... 9, a, b, c, d, e, ... x, y, z

Off of each of these table indexes, create a tree.

...


What kind of tree would you suggest? I don't really get the idea of "hashing" 100 000 - 1 000 000 filenames into 36 entries and then maintaining a linked tree. I mean, I know what it is good for, but in this case it's not the best solution. AntonyCoder wants this to run on a server machine and be really fast. Implementing a good storage system with fast caching is needless in this case, as a database storage engine can be many times faster with less effort, so he can focus on other problems (anyway, I doubt this data structure would fit his needs best).
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
