How to sort a VERY large array

Would this work:

A trick is to know which questions to ask first. If you ask the questions that cut the most 'no' answers first, you save lots of time. To start, you can simply count how many 'no' answers there are for each question. Then ask the question in your query with the most 'no's first, then the next, and so on.
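A minimal C++ sketch of that greedy ordering; the Question struct and its precomputed noCount field are illustrative assumptions, not anything from the original post:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical question record: its index and how many items answer "no".
struct Question {
    int index;
    int noCount;  // precomputed: number of items answering "no"
};

// Sort the questions in a query so the one that eliminates the most items
// (the highest "no" count) gets asked first.
void orderByNoCount(std::vector<Question>& query) {
    std::sort(query.begin(), query.end(),
              [](const Question& a, const Question& b) {
                  return a.noCount > b.noCount;
              });
}
```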

This doesn't always work well, though. Here is an example:

1)YYYYYYYYYYYYYYYYYNNNNNNNNNNNNNNNN
2)NNNNNNNNNNNNNNNNYYYYYYYYYYYYYYYYY
3)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
4)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
5)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
6)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
7)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY


The algorithm should ask the 1st and 2nd questions first; if it did, only one item would remain. Instead, it asks the 3rd, 4th, 5th, 6th, and 7th questions first, because they have more 'no' answers.

So, you could get fancier and precompute, for each pair of questions, how many items would remain after asking both.

In this example:
1 and 2: 1 item remains
1 and 3: 1 item remains
1 and 4: 1 item remains
...
2 and 3: lots of them
2 and 4: lots
...
3 and 4: lots
...
4 and 5: lots
...

So, when you get your query, assuming it has more than one question, you first ask the two questions in your query that TOGETHER have the most 'no' answers.
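A rough C++ sketch of that pair selection, assuming a precomputed table remaining[i][j] of how many items survive asking questions i and j together, and a query with at least two questions (all names here are hypothetical):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical precomputed table: remaining[i][j] = number of items still
// matching after asking questions i and j together.
using PairTable = std::vector<std::vector<int>>;

// Pick the two questions from the query whose combined answers leave the
// fewest items, i.e. the pair with the most "no" answers together.
// Assumes query.size() >= 2.
std::pair<int, int> bestPair(const std::vector<int>& query,
                             const PairTable& remaining) {
    std::pair<int, int> best{query[0], query[1]};
    int fewest = remaining[query[0]][query[1]];
    for (std::size_t a = 0; a < query.size(); ++a) {
        for (std::size_t b = a + 1; b < query.size(); ++b) {
            int left = remaining[query[a]][query[b]];
            if (left < fewest) {
                fewest = left;
                best = {query[a], query[b]};
            }
        }
    }
    return best;
}
```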

You can experiment with precomputing triples or quadruples of questions too; I don't know how much that would improve the speed. I guess it depends on what your database looks like.



For a fast intersection, for each question I'd store the set of item indexes that answer Yes/Maybe/Na to that question. You can store these indexes in a sorted tree structure (a B-tree works best for reading from hard disk). To intersect two such sets, take every element from the set with fewer elements and, for each one, check whether it appears in the other set. If the two sets have m and n elements, with m > n, the complexity of the intersection is O(n log m): n because you take all n elements from the small set, and log m for each lookup in the tree. This means that if you can reduce the set of items that can match your query early on, then when you have to intersect the remaining set with the big set of a "mostly yes answers" question, the size of that big set hardly matters.
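A small C++ sketch of that intersection, with std::set (an in-memory balanced tree) standing in for the on-disk B-tree the post suggests:

```cpp
#include <set>
#include <vector>

// Intersect two index sets by walking the smaller one and doing a tree
// lookup in the larger one: O(n log m), with n = |smaller|, m = |larger|.
std::vector<int> intersect(const std::set<int>& a, const std::set<int>& b) {
    const std::set<int>& smaller = (a.size() < b.size()) ? a : b;
    const std::set<int>& larger  = (a.size() < b.size()) ? b : a;
    std::vector<int> result;
    for (int item : smaller)
        if (larger.count(item))   // O(log m) lookup in the tree
            result.push_back(item);
    return result;
}
```

Driving the loop with the smaller set is exactly what makes the size of the big "mostly yes" set nearly irrelevant.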


Don't know how much my English helped me either
quote: Original post by Merge2

I was told that quicksort was the fastest sorting algorithm, and that there is a proof that you cannot sort an array/list/vector/whatever faster than O(n log2 n). But you're saying that a histogram sort is faster?


hmm... And I've always been taught that a Tree Sort was faster than a Quick Sort... how weird...


Sorry about the last message being from an anonymous poster; it was in fact me. I forgot that I had erased my browser's cache. Sorry again...




"And that''s the bottom line cause I said so!"

Cyberdrek
Headhunter Soft
A division of DLC Multimedia

Resist Windows XP's Invasive Product Activation Technology!
Qsort is the fastest comparison sort. No comparison sort can be faster than O(n log(n)): a comparison sort has to distinguish between all n! possible orderings of the input, and a binary decision tree of comparisons needs depth of at least log2(n!), which is on the order of n log n. However, if you know information about your data, you can do faster sorts. For a crappy example, if you have 1000 distinct data items ranging from 1 to 1000, you can sort them in O(n) time using an array of 1000. Qsort does not act wacky on large data sets, unless there's not enough memory to hold the entire data set. Then something like mergesort would be better.
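For that crappy example, a minimal C++ sketch of the array trick (essentially a counting sort; the function name is made up, and it assumes the values really are distinct):

```cpp
#include <vector>

// Counting-style sort for n distinct values known to lie in 1..1000:
// each value is placed directly by its own value, giving O(n) time.
std::vector<int> sortDistinct1To1000(const std::vector<int>& data) {
    std::vector<bool> seen(1001, false);   // one slot per possible value
    for (int v : data)
        seen[v] = true;                    // mark each (distinct) value
    std::vector<int> sorted;
    sorted.reserve(data.size());
    for (int v = 1; v <= 1000; ++v)        // read the marks back in order
        if (seen[v])
            sorted.push_back(v);
    return sorted;
}
```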
Well, have a look at come.to/20q; it's an interesting approach to something similar. You think of something, it asks you 20 questions about it, and then guesses what you thought of. Maybe it helps.
Yesterday we still stood on the verge of the abyss, today we're a step onward! Don't klick me!!!
Quicksort is O(N log N) on average; its worst case is O(N^2).
Radix sort and histogram sort are both O(N), but they may have a large memory overhead (if implemented naively, anyway).
Hi,

Diodor, that is a great algorithm, and this is definitely the way I'm going to start. The only problem is that in my case most answers are 'not defined' (na), because I enter something like "being cold = yes to fever, yes to headache, no to bleeding, maybe dizziness... etc." and I don't enter yes/no/maybe for every symptom. I guess it will be a pain to get this system accurate by having as few na answers as possible.
Thanks.

Regards,
Horia

Of all the things I've lost I miss my mind the most
I guess I should apologise for the mistakes in my last post.
"bidirectional" was the wrong word where I was referring to the selection sort - I was thinking of a selection sort where you find both the lowest and the highest values and swap them into place. And there was an error in my quicksort that was slowing it down. I''ve now compared the enhanced selection sort to a fast quicksort, and it only beats the fast quicksort up to n = 50.

However, for n = 5000, the radix sort beats the fast quicksort by a factor of 45. For n = 100000, the fast quicksort took 25 seconds, and the radix sort took just under half a second.

The radix sort does not make any assumptions about the data being sorted, but it does require extra memory. Under the implementation I have (sorting an integer array), it's at its least efficient. It would certainly be more efficient if it were used to sort linked lists.

Under my current implementation of the radix sort (which is non-recursive), four arrays are needed. Two are integer arrays with 256 entries (for sorting by bytes; if you're sorting by nibbles, it'd be 16). One is an integer array with the dimension of the array to be sorted, and the last is an array which must be the same size and shape as the array being sorted. I'll be looking for ways to reduce the memory usage. But in a real situation, if I was planning to use the radix sort, I'd use linked lists.
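Along the lines of that description, a sketch of a non-recursive, byte-at-a-time (LSD) radix sort on 32-bit unsigned integers in C++; it keeps the two 256-entry arrays and the same-size scratch buffer the post mentions (the separate index array is folded into the scratch buffer here, and the details are my own assumptions):

```cpp
#include <cstdint>
#include <vector>

// LSD radix sort: four passes, one per byte, least significant first.
// Each pass histograms the current byte, turns the counts into starting
// offsets, then stably scatters the values into a scratch buffer.
void radixSort(std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> scratch(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        int count[256] = {0};
        for (std::uint32_t v : data)            // histogram this byte
            ++count[(v >> shift) & 0xFF];
        int offset[256];
        offset[0] = 0;
        for (int i = 1; i < 256; ++i)           // prefix sums -> start slots
            offset[i] = offset[i - 1] + count[i - 1];
        for (std::uint32_t v : data)            // stable scatter
            scratch[offset[(v >> shift) & 0xFF]++] = v;
        data.swap(scratch);                     // scratch becomes the input
    }
}
```

After the four passes (an even number of swaps), the sorted result ends up back in the original vector.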
