Searching List of Strings w/ Wildcards

Started by
2 comments, last by intrest86 20 years, 5 months ago
If this is something that is already generally recognized, and I can just Google for a phrase, just tell me what it is. It seems unique enough, though. Basically, I have a large list (Constantly growing, but I have had it up to 200,000) of 9 character strings. Some characters are a wildcard, "_". I then have one string at a time that I need to search the list for, and I am trying to find a fast way to do it, because I will be searching the list a few billion times. Here is what you know about the data: Each character has 20 possible values other than "_" Before having characters replaced with wildcards, the list is sorted lexicographically by ASCII value. Each new string to search for contains no wildcards Each new string is greater than the previous The data in the list is used only here, so it can be deleted or modified I considered a tree implementation, but the data is going to change very fast, so I would have to keep the entire structure in memory all the time. (At least, that is what I think). Right now I am just brute-force searching the entire list for each string, and even though each search doesn''t take too long, it adds up fast. Is there some way of sorting the list to speed up searches? Or is there something else completely I can do to improve this? Thank you for your help ^_^
Turring Machines are better than C++ any day ^_~
Advertisement
Reeks of homework, but I''ll throw you a bone:

Insertion sort a vector in reverse (you are told each new string is greater than the last, start at the end; if they are lying the insertion sort will make certain everything still works). Then use a binary search on it (e.g. std::lower_bound) with a lexicographical compator (e.g. std::lexicographical_compare).
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
Ugg, I was worried it would sound like homework >_<

Actually, if you check my post record, about half a year ago I mentioned I was working on a scientific problem called the Busy Beaver. Part of my solution involves taking each part of the solution that I solve and compute a "mask", so that in the future I can test another part of the problem to see if I have already solved it. The program is already written, and is working rather well, but searching for a string match is taking a lot of time.

Lexicographically, each new string is not greater than the previous in the list, that is only true BEFORE the wildcard characters are swapped in. That is ultimately where I am confused. How can a binary sort work if the entire list has wildcard characters? It would find an exact match, but not a string with wildcards that match, right?

Another thing we know about the list that I forgot to mention is that every string contains at least 1 wildcard, and many contain multiple. That is why I wouldn''t want to consider simply expanding each statement so that every exact string was in the list. I''m really befuddled by this one, it seems that there is no nice balance between speed and memory efficiency here.
Turring Machines are better than C++ any day ^_~
Just an update ^_^

Problem solved. What I did was create a node data type for a tree that has a fixed length array of pointers to nodes to point to each of the possible values of the next character in the string, plus another pointer for any strins that have a wildcard in the next position.

To keep memory allocation as simple as possible (saving time), I allocated an array of odes from the beggining and just flaed which nodes were already being used. Whenever I need a new node I just search for one in thelist not being used and add it in.

Search times are reat, and from viewing my data in a tree I am surprised to find that my data is actually very dense, so the tree is not taking much more memory than the string list I was using before. That, and it is much easier to delete strings from the list when I no longer need them.

Overall improvement: 6 to 10 time speed improvement
Turring Machines are better than C++ any day ^_~

This topic is closed to new replies.

Advertisement