Archived

This topic is now archived and is closed to further replies.

Horia Beschea

How to sort a VERY large array

Recommended Posts

Hi, I'm developing an expert system and I'm having trouble accessing data in the knowledge base fast enough. I have an array that grows to hold millions of items. Each item holds responses like yes/no/maybe/na to specific questions. When it comes to finding a specific diagnosis for a chain of symptoms, a question is asked and all 'no' clauses must not be accessed again. The next step is to take the possible clause with the most 'yes' matches and ask the next question. My problem is finding the item with 'the most yes answers', especially in the early stages of the search. I don't think I've been clear enough, but my English is not helping me right now. Thank you for your responses.

Regards,
Horia

Of all the things I've lost I miss my mind the most

Well, assuming you represent the various responses with numerical values (e.g. no = 0, yes = 1, maybe = 2, na = 3), the best sort routines would be either quicksort or a histogram sort: you create a bin per possible value, run through the list sticking items in the appropriate bin, then recombine as appropriate. That's much quicker than quicksort if the range of values is known.
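A minimal sketch of the histogram sort described above, assuming the responses are encoded as small integers (no = 0, yes = 1, maybe = 2, na = 3); the function name and encoding are illustrative, not from the original post:

```python
def histogram_sort(items, num_values=4):
    """Sort a list of small integers in O(n) by binning."""
    # One bin per possible value.
    bins = [[] for _ in range(num_values)]
    for item in items:
        bins[item].append(item)
    # Recombine the bins in value order.
    result = []
    for b in bins:
        result.extend(b)
    return result
```

Because each item is visited a constant number of times, the whole pass is linear in the number of items, regardless of how many millions there are.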

I am not too sure I understand your problem, but I am guessing a nice solution involving trees might be worth looking at...

It also sounds like a Priority Queue would be a good choice for your application... Possibly based on a Binary Heap, possibly based on something more complicated like a Fibonacci Heap.
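One way to sketch the priority-queue idea, assuming each knowledge-base item carries a running count of its 'yes' matches (the item names and counts here are made up for illustration). Python's `heapq` is a binary min-heap, so the count is negated to pop the item with the MOST yes answers first:

```python
import heapq

# Hypothetical (name, yes_count) pairs from the knowledge base.
items = [("flu", 5), ("cold", 3), ("measles", 7)]

# Build a max-heap by negating the yes counts.
heap = [(-yes, name) for name, yes in items]
heapq.heapify(heap)  # O(n)

# Pop the item with the most 'yes' answers in O(log n).
neg_yes, best = heapq.heappop(heap)
```

This avoids sorting the whole array just to find the current best candidate, which matches the original poster's actual need.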

I was told that quicksort was the fastest sorting algorithm, and that there is a proof that you cannot sort an array/list/vector/whatever faster than O(log 2n). But you're saying that a histogram sort is faster?

Quicksort is O(n log n), not O(log n). And big-O doesn't tell you exactly how long it will take; it means that the time taken is -proportional- to some function of n. There could be another sort that also runs in O(n log n) and is faster than quicksort. If one sort has a lower complexity than another, it means that after a sufficient number of terms, the former will beat the latter. For example, a bidirectional selection sort will completely own quicksort up to quite large values of n when sorting an integer array.

The histogram/bucket/radix sort runs in O(n). For several million terms, it's the best option.

Guest Anonymous Poster
If I'm not mistaken, the average-case time complexity of quicksort is Θ(n), with O(n log n) being the worst case. Just something you might want to take into account.

Doesn't quicksort act funny on huge data sets?

"You won't get wise with the sleep still in your eyes." - Neil Peart

Maybe I'm missing the question, but why not use a red-black tree? That way your data items are always sorted (no inserting and then re-sorting), and searches are very fast, eliminating half the tree at each node.

ECKILLER

Thank you for your answers. Although I'm not aware of some of the algorithms you mentioned earlier (like the Fibonacci heap, priority queue, ...), I will start looking into the details.
I don't think anything like a tree would work, because I have 4 possible ways to go down the tree, I have to cut just one branch (the one with the "no"s), and I would have to recreate the tree at each question.
qsort may work, but I think my array is too long, and I don't actually need the array sorted (sorry about the title of the post); I just want to take out the item with the most 'yes's and keep eliminating the 'no's.
BTW, this program is going to run on a 4-processor machine and I would like to make the most of each processor.

Regards,
Horia

Of all the things I've lost I miss my mind the most

Would this work:

A trick is to know which questions you should ask first. If you first ask those questions that will cut the most 'no' answers, you save lots of time. To start, you can simply record how many 'no' answers there are for each question, and ask first the question in your query with the most 'no's. Then the next, and so on.

This doesn't always work well, though. Here is an example:

1)YYYYYYYYYYYYYYYYYNNNNNNNNNNNNNNNN
2)NNNNNNNNNNNNNNNNYYYYYYYYYYYYYYYYY
3)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
4)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
5)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
6)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
7)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY


The algorithm should ask the 1st and 2nd questions first; if it did, only one item would remain. Instead, it asks the 3rd, 4th, 5th, 6th, and 7th questions first, because they have more 'no' answers.

So, you could be fancy and precompute, for each pair of questions, how many items would remain after asking both.

In this example :
1) and 2) : 1 item remains
1) and 3) : 1 item remains
1) and 4) : 1
....
2,3 : lots of them
2,4 : lots
....
3,4 : lots
...
4,5 :lots
.........

So, when you get your query, assuming it has more than one question, you first ask the two questions that are in your query and that TOGETHER have the most 'no' answers.

You can experiment with precomputing 3 or 4 questions at a time; I don't know how much it will improve the speed, I guess it depends on what your database looks like.
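The pair-ranking idea above could be sketched like this. It assumes each item is a string of 'Y'/'N' answers, one character per question (a made-up encoding for illustration); for every pair of questions it counts how many items would survive both, and returns the pair leaving the fewest:

```python
from itertools import combinations

def best_pair(items, num_questions):
    """Find the two questions that together eliminate the most items.

    items: list of answer strings, e.g. "YNY" (one char per question).
    Returns (remaining_count, question1, question2).
    """
    best = None
    for q1, q2 in combinations(range(num_questions), 2):
        # An item survives only if it answers something other
        # than 'no' to BOTH questions.
        survivors = sum(1 for it in items
                        if it[q1] != 'N' and it[q2] != 'N')
        if best is None or survivors < best[0]:
            best = (survivors, q1, q2)
    return best
```

Note this is O(q² · n) over q questions and n items, so for large knowledge bases you'd want to precompute it offline, as the post suggests.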



For a fast intersection, for each question I'd store the set of item indexes that answer yes/maybe/na to that question. You can store these indexes in a sorted tree structure (a B-tree works best for reading from hard disk). When you have to intersect two such sets, you take all the elements from the set with the fewer elements and, for each element, check whether you can find it in the other set. If the two sets have m and n elements, with m > n, the complexity of the intersection will be O(n log m): n because you take all n elements from the small set, and log m for each lookup in the tree. This means that if you can reduce the set of items that can match your query early on, then when you have to intersect the remaining set with a big set from a "mostly yes answers" question, the size of that big set isn't all that important.
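A small sketch of that intersection, assuming each question's matching item indexes are kept as a sorted list (standing in for the tree; `bisect` gives the same O(log m) lookup):

```python
import bisect

def intersect_sorted(a, b):
    """Intersect two sorted lists of item indexes in O(n log m)."""
    # Walk the smaller list, binary-search the larger one.
    if len(a) > len(b):
        a, b = b, a
    result = []
    for x in a:
        i = bisect.bisect_left(b, x)  # O(log m) lookup
        if i < len(b) and b[i] == x:
            result.append(x)
    return result
```

The key point from the post holds here: the cost is driven by the size of the *smaller* set, so shrinking the candidate set early keeps every later intersection cheap.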


Don't know how much my English helped me either.

Guest Anonymous Poster
quote:
Original post by Merge2

I was told that quicksort was the fastest sorting algorithm, and that there is a proof that you cannot sort an array/list/vector/whatever faster than O(log 2n). But you're saying that a histogram sort is faster?


Hmm... and I've always been taught that a tree sort was faster than a quicksort... how weird...


Qsort is the fastest comparison sort. No comparison sort can be faster than O(n log n). However, if you know information about your data, you can do faster sorts. For a crappy example, if you have 1000 distinct data items ranging from 1 to 1000, you can sort them in O(n) time using an array of 1000. Qsort does not act wacky on large data sets, unless there's not enough memory to hold the entire data set; then something like mergesort would be better.

Well, look at come.to/20q; it's an interesting approach to something similar. You think of something, it asks you 20 questions about it, and then it guesses what you've thought of. Maybe it helps.

Quick sort is (on average) O(N log N), worst case is O(N^2)
Radix sort/histogram sort are both O(N), but may have a large memory overhead (if implemented naively, anyway)

Hi,

Diodor, that is a great algorithm and this is definitely the way I'm going to start. The only problem is that in my case most answers are 'not defined = na', because I enter something like "a cold = yes to fever, yes to headache, no to bleeding, maybe to dizziness... etc." and I don't enter yes/no/maybe for every symptom. I guess it will be a pain to get this system accurate while keeping the na answers to a minimum.
Thanks.

Regards,
Horia

Of all the things I've lost I miss my mind the most

I guess I should apologise for the mistakes in my last post.
"Bidirectional" was the wrong word when I was referring to the selection sort; I was thinking of a selection sort where you find both the lowest and the highest values and swap them into place. There was also an error in my quicksort that was slowing it down. I've now compared the enhanced selection sort against a fast quicksort, and it only beats the fast quicksort up to n = 50.

However, for n = 5000, the radix sort beats the fast quicksort 45 times over. For n = 100000, the fast quicksort took 25 seconds, and the radix sort took just under half a second.

The radix sort does not make any assumptions about the data being sorted, but it does require extra memory. In the implementation I have (sorting an integer array), it's at its least efficient; it would certainly be more efficient if it were used to sort linked lists.

In my current implementation of the radix sort (which is non-recursive), four arrays are needed. Two are integer arrays with 256 entries (for a byte; if you're sorting by nibbles, it'd be 16). One is an integer array with the same dimension as the array to be sorted, and the last is an array that must be the same size and shape as the array being sorted. I'll be looking for ways to reduce the memory usage. But in a real situation, if I was planning to use the radix sort, I'd use linked lists.
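The non-recursive, byte-at-a-time radix sort described in this post can be sketched as follows. This is a least-significant-digit variant with 256 bins per pass, and it assumes non-negative integers that fit in `passes` bytes (the poster's version uses preallocated arrays rather than Python lists, so memory behaviour differs):

```python
def radix_sort(arr, passes=4):
    """LSD radix sort for non-negative ints up to 2**(8*passes) - 1."""
    for shift in range(0, passes * 8, 8):
        # 256 bins, one per possible byte value.
        bins = [[] for _ in range(256)]
        for x in arr:
            bins[(x >> shift) & 0xFF].append(x)
        # Recombining in bin order keeps the sort stable,
        # which is what makes multi-pass LSD sorting correct.
        arr = [x for b in bins for x in b]
    return arr
```

Each pass is O(n), so the whole sort is O(n) for fixed-width keys, matching the timings reported above where it beat quicksort by a wide margin on large n.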
