[.net] Effective (i.e. fast) method of searching a pool of files?

Hi. In my p2p app, when a client connects, the server grabs a list of the files the peer is sharing and stores it within the router's connection object for that peer. Now I've come to searching these files on the server (the peer sends a request, the server searches, and then returns the results via UDP), but I'm not sure what technique to use.

Should I tokenize the search string and the file names and use hash checks for speed? Should I just search for the search string within each file name? Is there anything like a binary search tree for strings to speed this up? If the app is popular (I know, big if) it could very well have a million-plus users, all searching, so speed is my biggest concern, not memory consumption. If I have to use 200 MB to speed up the searches 500%, I'll gladly make the trade-off.
One of the simplest ways to speed up a big text search is to create an index. In .NET this can be really easy, depending on your needs. Try using a Dictionary or SortedDictionary (a hash table or binary tree, respectively). Go through each block of searchable text (file names, or contents in your case), identify terms, add each term as a key, and set each value to some reference to the blocks that contain that term (like a list of file names). Keep this Dictionary around. When you want to search: tokenize your input search string and look up each token in your dictionary. Intersect or union the results and return them. There are a number of more complicated things you can do, but this should get you started.
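Here is a minimal sketch of that idea in C# (the class and method names are my own, not from any library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FileIndex
{
    // term -> set of files whose name contains that term
    private readonly Dictionary<string, HashSet<string>> index =
        new Dictionary<string, HashSet<string>>();

    // Split a name like "my test file.exe" into lowercase terms.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', '.', '_', '-' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    public void Add(string fileName)
    {
        foreach (string term in Tokenize(fileName))
        {
            HashSet<string> files;
            if (!index.TryGetValue(term, out files))
                index[term] = files = new HashSet<string>();
            files.Add(fileName);
        }
    }

    // AND semantics: intersect the posting list of every query token.
    public IEnumerable<string> Search(string query)
    {
        IEnumerable<string> results = null;
        foreach (string term in Tokenize(query))
        {
            HashSet<string> files;
            if (!index.TryGetValue(term, out files))
                return Enumerable.Empty<string>(); // some token matched nothing
            if (results == null)
                results = files;
            else
                results = results.Intersect(files);
        }
        return results ?? Enumerable.Empty<string>();
    }
}
```

Note that this only matches whole terms: tokenizing "ThisIsATest.exe" on delimiters yields "thisisatest", so a search for "test" would miss it unless you also split on case changes — which is exactly the worry raised further down the thread.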

hope this helps.
Go for a 3rd-party indexing package; don't even think about trying to roll your own if you are aiming for large numbers of users.

A good bet would be dotlucene. It's free, and really *really* fast. 100% .NET too. It is a port of the Java 'Lucene' project. You can find another port called NLucene. Which is better? Dunno. [wink]
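For flavor, indexing and searching file names looks roughly like this against the old Lucene.Net 2.x API (class and method names vary across versions and ports, so treat this as a sketch rather than gospel):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class LuceneSketch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();

        // Build the index (true = create a fresh index directory).
        var writer = new IndexWriter("file-index", analyzer, true);
        var doc = new Document();
        doc.Add(new Field("name", "ThisIsATest.exe",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Search it.
        var searcher = new IndexSearcher("file-index");
        var query = new QueryParser("name", analyzer).Parse("test");
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Doc(i).Get("name"));
        searcher.Close();
    }
}
```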

You might also want to consider simply keeping the index on each local machine, and doing common-query caching on the server end. Or at least cache whether a client actually has any results at all. Indexes can still get really big, really quick.
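A common-query cache can be as simple as a dictionary from the normalized query to its results with some expiry; a rough sketch (the five-minute TTL and all names here are arbitrary placeholders):

```csharp
using System;
using System.Collections.Generic;

class QueryCache
{
    private struct Entry { public List<string> Results; public DateTime Added; }

    private readonly Dictionary<string, Entry> cache =
        new Dictionary<string, Entry>();
    private readonly TimeSpan ttl = TimeSpan.FromMinutes(5); // arbitrary expiry

    public bool TryGet(string query, out List<string> results)
    {
        Entry e;
        if (cache.TryGetValue(query.ToLowerInvariant(), out e) &&
            DateTime.UtcNow - e.Added < ttl)
        {
            results = e.Results;
            return true;
        }
        results = null;
        return false;
    }

    public void Put(string query, List<string> results)
    {
        cache[query.ToLowerInvariant()] =
            new Entry { Results = results, Added = DateTime.UtcNow };
    }
}
```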
Thanks for the tips, I think I'll check out that indexing package. Atm I just brute-force through every unique file and do a simple string.Contains(searchTerm), so anything has to be an improvement.


My one worry is: if I switch to indexing, will I still be able to search for incomplete strings? I.e. if I have a file called ThisIsATest.exe and I just enter "test", it'll still return the file, right?
Quote:Original post by Anonymous Poster
Thanks for the tips, I think I'll check out that indexing package. Atm I just brute-force through every unique file and do a simple string.Contains(searchTerm), so anything has to be an improvement.


My one worry is: if I switch to indexing, will I still be able to search for incomplete strings? I.e. if I have a file called ThisIsATest.exe and I just enter "test", it'll still return the file, right?


These pre-sorting methods are good if you know the filename and want to get its location, or any other information, fast. This means that you have to know the key (= filename).
When searching in keys, you simply iterate through the keys and try to find the matching substring. So your whole structure would look something like this:
-Have a complex directory structure with a lot of files
-Have a sorting structure for them, to find a file's location fast (2-3 trees and B-trees are quite good for file-system-like jobs)
-Have an array of keys (filenames) to access them fast, linearly. Running a LIKE '%string%' search over an array a few hundred thousand entries long is not a big deal (and it cannot be accelerated with simple techniques either, only with advanced heuristics); see the sketch after this list.
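A minimal sketch of that linear substring scan in C# (the array of filenames and the case-insensitive comparison are my assumptions):

```csharp
using System;
using System.Collections.Generic;

static class LinearSearch
{
    // Scan every filename for the search term, like SQL's LIKE '%term%'.
    public static List<string> Find(string[] fileNames, string term)
    {
        var matches = new List<string>();
        foreach (string name in fileNames)
        {
            if (name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(name);
        }
        return matches;
    }
}
```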

Do you have to do the search on client machines or on a server? If on a server machine, I suggest using a database (MySQL, for example), as databases provide all the functionality you need in this case...
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
Well, after reading what's needed, I'm leaning towards MySQL.
I have followed some MySQL tutorials in C# on the MS website, but I haven't tried coding them by hand.

Can you suggest any good tutorials to help me get up to speed? I did toy with the idea of a database, but since it's so dynamic I thought it would be a slow method.
To create a fast index of names of any kind, here is what I would do.

Create a hashtable or dictionary with a binary tree off of each node.

To make it simple, use 36 entries in the table.

Each entry corresponds to the first letter of the given name.
Hopefully your filenames don't all start with c:\ ... if so, you can probably write a workaround for that.

0, 1, 2, ... 9, a, b, c, d, e, ... x, y, z

Off of each of these table indexes, create a tree.

This will give you the best kind of index that you can code up quickly. Other examples are faster, but likely less elegant.

This also lets you avoid keeping everything in memory: you can save off one letter's index, or load just one letter. There are obvious benefits to scalability across machines as needed here too.

As a side note, do not use recursion in your binary tree, or your memory requirements will explode.

You could even get clever when building the tree and not store the full name in each node, by removing the part of the name that already exists in the parent or root node. I don't usually like clever code, because it is hard to maintain, but it can be worthwhile sometimes.
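A rough C# rendition of that scheme, using SortedDictionary (a red-black tree) as the per-letter tree; every name in this sketch is invented for illustration:

```csharp
using System;
using System.Collections.Generic;

class LetterBucketIndex
{
    // 36 buckets: '0'-'9' then 'a'-'z'. Each bucket is a tree
    // (SortedDictionary) from filename to whatever you track per file,
    // e.g. the owning peer.
    private readonly SortedDictionary<string, string>[] buckets =
        new SortedDictionary<string, string>[36];

    private static int BucketOf(string name) // assumes a non-empty name
    {
        char c = char.ToLowerInvariant(name[0]);
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'z') return 10 + (c - 'a');
        return 0; // dump anything else into bucket 0 as a workaround
    }

    public void Add(string name, string owner)
    {
        int i = BucketOf(name);
        if (buckets[i] == null)
            buckets[i] = new SortedDictionary<string, string>(
                StringComparer.OrdinalIgnoreCase);
        buckets[i][name] = owner;
    }

    // Only the one bucket for the name's first letter needs to be
    // in memory (or loaded from disk) to answer this lookup.
    public bool TryFind(string name, out string owner)
    {
        var bucket = buckets[BucketOf(name)];
        owner = null;
        return bucket != null && bucket.TryGetValue(name, out owner);
    }
}
```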


Quote:Original post by Anonymous Poster
Well, after reading what's needed, I'm leaning towards MySQL.
I have followed some MySQL tutorials in C# on the MS website, but I haven't tried coding them by hand.

Can you suggest any good tutorials to help me get up to speed? I did toy with the idea of a database, but since it's so dynamic I thought it would be a slow method.


If you are familiar with SQL and relational databases, the MSDN tutorials should be easy. I can provide you with sample code if you want to start coding fast. For this job you need really simple queries against one single table with one index, the filename. All you need to do is create a connection object and a command, then call the command's ExecuteReader method and iterate through the results.
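A minimal sketch of that flow, assuming MySQL Connector/NET (the connection string, table, and column names are invented for illustration):

```csharp
using System;
using MySql.Data.MySqlClient; // MySQL Connector/NET

class SearchSketch
{
    static void Main()
    {
        string connStr = "Server=localhost;Database=p2p;Uid=user;Pwd=pass;";
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            // Parameterized LIKE query against an indexed filename column.
            var cmd = new MySqlCommand(
                "SELECT filename FROM shared_files WHERE filename LIKE @term",
                conn);
            cmd.Parameters.AddWithValue("@term", "%test%");
            using (MySqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}
```

One caveat: a leading-wildcard LIKE ('%test%') cannot actually use the index on filename, so the database still scans the column — it just does so in optimized native code rather than in your own loop.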
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
Quote:Original post by BradSnobar
To create a fast index of names of any kind, here is what I would do.

Create a hashtable or dictionary with a binary tree off of each node.

To make it simple, use 36 entries in the table.

Each entry corresponds to the first letter of the given name.
Hopefully your filenames don't all start with c:\ ... if so, you can probably write a workaround for that.

0, 1, 2, ... 9, a, b, c, d, e, ... x, y, z

Off of each of these table indexes, create a tree.

...


What kind of tree would you suggest? I don't really get the idea of "hashing" 100 000 - 1 000 000 filenames into 36 entries and then maintaining a linked tree. I mean, I know what it is good for, but in this case it's not the best solution. AntonyCoder wants this to run on a server machine and be really fast. Implementing a good storage system with fast caching is needless in this case, as a database storage engine can be many times faster with less effort, so he can focus on other problems (anyway, I doubt this data structure would fit his needs best).
"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin
