Archived

This topic is now archived and is closed to further replies.

intrest86

Searching List of Strings w/ Wildcards

Recommended Posts

If this is something that is already generally recognized, and I can just Google for a phrase, just tell me what it is. It seems unique enough, though. Basically, I have a large list (Constantly growing, but I have had it up to 200,000) of 9 character strings. Some characters are a wildcard, "_". I then have one string at a time that I need to search the list for, and I am trying to find a fast way to do it, because I will be searching the list a few billion times. Here is what you know about the data: Each character has 20 possible values other than "_" Before having characters replaced with wildcards, the list is sorted lexicographically by ASCII value. Each new string to search for contains no wildcards Each new string is greater than the previous The data in the list is used only here, so it can be deleted or modified I considered a tree implementation, but the data is going to change very fast, so I would have to keep the entire structure in memory all the time. (At least, that is what I think). Right now I am just brute-force searching the entire list for each string, and even though each search doesn''t take too long, it adds up fast. Is there some way of sorting the list to speed up searches? Or is there something else completely I can do to improve this? Thank you for your help ^_^

Share this post


Link to post
Share on other sites
Reeks of homework, but I''ll throw you a bone:

Insertion sort a vector in reverse (you are told each new string is greater than the last, start at the end; if they are lying the insertion sort will make certain everything still works). Then use a binary search on it (e.g. std::lower_bound) with a lexicographical compator (e.g. std::lexicographical_compare).

Share this post


Link to post
Share on other sites
Ugg, I was worried it would sound like homework >_<

Actually, if you check my post record, about half a year ago I mentioned I was working on a scientific problem called the Busy Beaver. Part of my solution involves taking each part of the solution that I solve and compute a "mask", so that in the future I can test another part of the problem to see if I have already solved it. The program is already written, and is working rather well, but searching for a string match is taking a lot of time.

Lexicographically, each new string is not greater than the previous in the list, that is only true BEFORE the wildcard characters are swapped in. That is ultimately where I am confused. How can a binary sort work if the entire list has wildcard characters? It would find an exact match, but not a string with wildcards that match, right?

Another thing we know about the list that I forgot to mention is that every string contains at least 1 wildcard, and many contain multiple. That is why I wouldn''t want to consider simply expanding each statement so that every exact string was in the list. I''m really befuddled by this one, it seems that there is no nice balance between speed and memory efficiency here.

Share this post


Link to post
Share on other sites
Just an update ^_^

Problem solved. What I did was create a node data type for a tree that has a fixed length array of pointers to nodes to point to each of the possible values of the next character in the string, plus another pointer for any strins that have a wildcard in the next position.

To keep memory allocation as simple as possible (saving time), I allocated an array of odes from the beggining and just flaed which nodes were already being used. Whenever I need a new node I just search for one in thelist not being used and add it in.

Search times are reat, and from viewing my data in a tree I am surprised to find that my data is actually very dense, so the tree is not taking much more memory than the string list I was using before. That, and it is much easier to delete strings from the list when I no longer need them.

Overall improvement: 6 to 10 time speed improvement

Share this post


Link to post
Share on other sites