How to sort a VERY large array

Started by
16 comments, last by Horia Beschea 22 years, 11 months ago
Hi, I''m developing an expert system and I''m in trouble accesing data in the knowledge base fast enough. I have an array that grows to hold millions of items. Each item holds responses like yes/no/maybe/na to specific questions. When it comes to find a specific diagnostics to a chain of symptoms, a question is asked and all ''no'' clauses must not be accesed again. The next step is to take the possible clause with the most ''yes'' matches and ask the next question. My problem is finding the item with ''the most yes answers'', especially in early stages of searching. I don''t think I''ve been clear enough, but my English is not helping me right now. Thank you for your responses Regards, Horia Of all the things I''''ve lost I miss my mind the most
Of all the things I''ve lost I miss my mind the most
Advertisement
well assuming you represent the various responses with numerical values (eg no = 0, yes = 1, maybe = 2, na = 3) the best sort routines would be either quick sort or a histogram sort (you create a bin per possible value, run through the list sticking items in the appropriate bin, then recombine as appropriate - much quicker than quicksort if the range of values is known)

I am not too sure I understand your problem, but I am guessing a nice solution involving trees might be worth looking at...
It also sounds like a Priority Queue would be a good choice for your application... Possibly based on a Binary Heap, possibly based on something more complicated like a Fibonacci Heap.
I was told that quicksort was the fastest sorting algorithm, and that there is a proof that proves that you cannot sort an array/list/vector/whatever faster than O(log 2n). But you''re saying that a histogram sort is faster?
Quicksort is O(n log n), not O(log n). And big-O doesn''t tell you exactly how long it would take. It means that the time it takes is -proportional- to some function of n. There could be another sort that still runs O(n log n), and is faster than a quicksort. If one sort has a lower complexity than another, it means that after a sufficient number of terms, the former will beat the latter. For example, a bidirectional selection sort will completely own quicksort up to quite large values of n, when sorting an integer array.

The histogram/bucket/radix sort runs O(n). For several million terms, it''s the best option.
The easiest (and pretty darn fast) way to sort any data, I''ve found, is just the standard library qsort function.

-------------------------------
NeXe: NeHe DirectX-style.

Follow the orange rabbit.
If I''m not mistaken, the average case time complexity on quicksort is @(n) with @ being theta, with O(nlogn) being worst case. Just something you might want to take into account.
doesn''t quicksort act funny on huge data sets?

"You won''t get wise with the sleep still in your eyes." - Neil Peart
"You won't get wise with the sleep still in your eyes." - Neil Peart
Maybe i''m missing the question, but why not use a red black tree ? That way your data items are always sorted (no inserting and then resorting). And searches are very fast, eliminating half the tree at each node.

ECKILLER
ECKILLER
Thank you for your answers. Although I''m not aware of some algorithms you mentioned earlier (like fibbonaci heap, priority queue,..) I will start looking for details.
I don''t think anything like a tree would work because I have 4 possible ways to go down in a tree and I have to cut just a brench (the one with "no"s) and I would have to create the tree at each question.
qsort may work, but I think my array is too long, and I don''t have to have the array sorted (sorry about the title of the post), I just want the take out the item with the most ''yes''s and continue to eliminate the ''no''s.
BTW, this program is going to run on a 4 processor machine and I would like to use the most of each processor.

Regards,
Horia

Of all the things I''''ve lost I miss my mind the most
Of all the things I''ve lost I miss my mind the most

This topic is closed to new replies.

Advertisement