How to sort a VERY large array

Would this work:

A trick is to know which questions to ask first. If you ask the questions that cut the most 'no' answers first, you save lots of time. To start, you can simply count how many 'no' answers there are for each question. Then ask the question in your query with the most 'no's first, then the next, and so on.
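A minimal C++ sketch of that greedy ordering; the Question struct and its precomputed noCount field are illustrative assumptions, not anything from the original post:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical question record: its index and how many items answer "no".
struct Question {
    int index;
    int noCount;  // precomputed: number of items answering "no"
};

// Sort the questions in a query so the one that eliminates the most items
// (the highest "no" count) gets asked first.
void orderByNoCount(std::vector<Question>& query) {
    std::sort(query.begin(), query.end(),
              [](const Question& a, const Question& b) {
                  return a.noCount > b.noCount;
              });
}
```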

This doesn't always work well, though. Here is an example:

1)YYYYYYYYYYYYYYYYYNNNNNNNNNNNNNNNN
2)NNNNNNNNNNNNNNNNYYYYYYYYYYYYYYYYY
3)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
4)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
5)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
6)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY
7)NNNNNNNNNNNNNNNNYNNNYYYYYYYYYYYYY


The algorithm should ask the 1st and 2nd questions first; if it did, only one item would remain. Instead, it asks the 3rd, 4th, 5th, 6th, and 7th questions first, because they have more 'no' answers.

So, you could get fancier and precompute, for each pair of questions, how many items would remain after asking both.

In this example:
1 and 2: 1 item remains
1 and 3: 1 item remains
1 and 4: 1 item remains
...
2 and 3: lots of them
2 and 4: lots
...
3 and 4: lots
...
4 and 5: lots
...

So, when you get your query, assuming it has more than one question, you first ask the two questions in your query that TOGETHER have the most 'no' answers.
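A rough C++ sketch of that pair selection, assuming a precomputed table remaining[i][j] of how many items survive asking questions i and j together, and a query with at least two questions (all names here are hypothetical):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical precomputed table: remaining[i][j] = number of items still
// matching after asking questions i and j together.
using PairTable = std::vector<std::vector<int>>;

// Pick the two questions from the query whose combined answers leave the
// fewest items, i.e. the pair with the most "no" answers together.
// Assumes query.size() >= 2.
std::pair<int, int> bestPair(const std::vector<int>& query,
                             const PairTable& remaining) {
    std::pair<int, int> best{query[0], query[1]};
    int fewest = remaining[query[0]][query[1]];
    for (std::size_t a = 0; a < query.size(); ++a) {
        for (std::size_t b = a + 1; b < query.size(); ++b) {
            int left = remaining[query[a]][query[b]];
            if (left < fewest) {
                fewest = left;
                best = {query[a], query[b]};
            }
        }
    }
    return best;
}
```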

You can experiment with precomputing triples or quadruples of questions too; I don't know how much that would improve the speed. I guess it depends on what your database looks like.



For a fast intersection, for each question I'd store the set of item indexes that answer Yes/Maybe/Na to that question. You can store these indexes in a sorted tree structure (a B-tree works best for reading from hard disk). To intersect two such sets, take every element from the set with fewer elements and, for each one, check whether it appears in the other set. If the two sets have m and n elements, with m > n, the complexity of the intersection is O(n log m): n because you take all n elements from the small set, and log m for each lookup in the tree. This means that if you can reduce the set of items that can match your query early on, then when you have to intersect the remaining set with the big set of a "mostly yes answers" question, the size of that big set hardly matters.
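A small C++ sketch of that intersection, with std::set (an in-memory balanced tree) standing in for the on-disk B-tree the post suggests:

```cpp
#include <set>
#include <vector>

// Intersect two index sets by walking the smaller one and doing a tree
// lookup in the larger one: O(n log m), with n = |smaller|, m = |larger|.
std::vector<int> intersect(const std::set<int>& a, const std::set<int>& b) {
    const std::set<int>& smaller = (a.size() < b.size()) ? a : b;
    const std::set<int>& larger  = (a.size() < b.size()) ? b : a;
    std::vector<int> result;
    for (int item : smaller)
        if (larger.count(item))   // O(log m) lookup in the tree
            result.push_back(item);
    return result;
}
```

Driving the loop with the smaller set is exactly what makes the size of the big "mostly yes" set nearly irrelevant.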


Don't know how much my English helped me either
quote: Original post by Merge2

I was told that quicksort was the fastest sorting algorithm, and that there is a proof that you cannot sort an array/list/vector/whatever faster than O(n log2 n). But you're saying that a histogram sort is faster?


hmm... And I've always been taught that a Tree Sort was faster than a Quick Sort... how weird...


Sorry about the last message being from an anonymous poster; it was in fact me. I forgot that I had erased my browser's cache. Sorry again...




"And that''s the bottom line cause I said so!"

Cyberdrek
Headhunter Soft
A division of DLC Multimedia

Resist Windows XP's Invasive Product Activation Technology!
Qsort is the fastest comparison sort. No comparison sort can be faster than O(n log(n)): a comparison sort has to distinguish between all n! possible orderings of the input, and a binary decision tree of comparisons needs depth of at least log2(n!), which is on the order of n log n. However, if you know information about your data, you can do faster sorts. For a crappy example, if you have 1000 distinct data items ranging from 1 to 1000, you can sort them in O(n) time using an array of 1000. Qsort does not act wacky on large data sets, unless there's not enough memory to hold the entire data set. Then something like mergesort would be better.
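For that crappy example, a minimal C++ sketch of the array trick (essentially a counting sort; the function name is made up, and it assumes the values really are distinct):

```cpp
#include <vector>

// Counting-style sort for n distinct values known to lie in 1..1000:
// each value is placed directly by its own value, giving O(n) time.
std::vector<int> sortDistinct1To1000(const std::vector<int>& data) {
    std::vector<bool> seen(1001, false);   // one slot per possible value
    for (int v : data)
        seen[v] = true;                    // mark each (distinct) value
    std::vector<int> sorted;
    sorted.reserve(data.size());
    for (int v = 1; v <= 1000; ++v)        // read the marks back in order
        if (seen[v])
            sorted.push_back(v);
    return sorted;
}
```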
Well, have a look at come.to/20q; it's an interesting approach to something similar. You think of something, it asks you 20 questions about it, and then guesses what you thought of. Maybe it helps.
Yesterday we still stood on the verge of the abyss, today we're a step onward! Don't klick me!!!
Quicksort is O(N log N) on average; its worst case is O(N^2).
Radix sort and histogram sort are both O(N), but they may have a large memory overhead (if implemented naively, anyway).
Hi,

Diodor, that is a great algorithm, and this is definitely the way I'm going to start. The only problem is that in my case most answers are 'not defined' (na), because I enter something like "being cold = yes to fever, yes to headache, no to bleeding, maybe dizziness... etc." and I don't enter yes/no/maybe for every symptom. I guess it will be a pain to get this system accurate by having as few na answers as possible.
Thanks.

Regards,
Horia

Of all the things I've lost I miss my mind the most
I guess I should apologise for the mistakes in my last post.
"bidirectional" was the wrong word where I was referring to the selection sort - I was thinking of a selection sort where you find both the lowest and the highest values and swap them into place. And there was an error in my quicksort that was slowing it down. I''ve now compared the enhanced selection sort to a fast quicksort, and it only beats the fast quicksort up to n = 50.

However, for n = 5000, the radix sort beats the fast quicksort by a factor of 45. For n = 100000, the fast quicksort took 25 seconds, and the radix sort took just under half a second.

The radix sort does not make any assumptions about the data being sorted, but it does require extra memory. Under the implementation I have (sorting an integer array), it's at its least efficient. It would certainly be more efficient if it were used to sort linked lists.

Under my current implementation of the radix sort (which is non-recursive), four arrays are needed. Two are integer arrays with 256 entries (for sorting by bytes; if you're sorting by nibbles, it'd be 16). One is an integer array with the dimension of the array to be sorted, and the last is an array which must be the same size and shape as the array being sorted. I'll be looking for ways to reduce the memory usage. But in a real situation, if I was planning to use the radix sort, I'd use linked lists.
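Along the lines of that description, a sketch of a non-recursive, byte-at-a-time (LSD) radix sort on 32-bit unsigned integers in C++; it keeps the two 256-entry arrays and the same-size scratch buffer the post mentions (the separate index array is folded into the scratch buffer here, and the details are my own assumptions):

```cpp
#include <cstdint>
#include <vector>

// LSD radix sort: four passes, one per byte, least significant first.
// Each pass histograms the current byte, turns the counts into starting
// offsets, then stably scatters the values into a scratch buffer.
void radixSort(std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> scratch(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        int count[256] = {0};
        for (std::uint32_t v : data)            // histogram this byte
            ++count[(v >> shift) & 0xFF];
        int offset[256];
        offset[0] = 0;
        for (int i = 1; i < 256; ++i)           // prefix sums -> start slots
            offset[i] = offset[i - 1] + count[i - 1];
        for (std::uint32_t v : data)            // stable scatter
            scratch[offset[(v >> shift) & 0xFF]++] = v;
        data.swap(scratch);                     // scratch becomes the input
    }
}
```

After the four passes (an even number of swaps), the sorted result ends up back in the original vector.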
