Sign in to follow this  
Tomaz Kristan

ArtificialSort

Recommended Posts

This is interesting. I am wondering if you could write up some information about the algorithm itself, for those who are not familiar with C/C++?

Share this post


Link to post
Share on other sites
Lots of pictures...

But where's the analysis of the algorithm, proving it performs better than QS.

Best case, worst case, average, typical case...

Better than which QS. How is the boundary element for QS chosen? Single, Random? Median? Average? How many samples? etc...

Also, how was QS for comparison implemented? From a very fast and cursory look, it seems that Artificial sort is simply some form of quick sort with inner loop unrolled using insertion sort.

The reason I ask is because asymptotic performance is AS is exactly the same as QS, with exception of a constant.

Also, AS doesn't seem to unroll for array sizes of 3 elements or less, when the insertion sort is completely redundant.

These details are what give the huge running time differences for small number ranges, where QS is known to be suboptimal.

Share this post


Link to post
Share on other sites
The C++ code is on the site I gave. You can cut off the insertion portion if you like. But it gives some additional performance to Artificial, just as it gives some additional performance to Quick sort. That for sorting smaller chunks only.

But both can do almost as well without the Insertion part.

The main difference between those two sorts - Artificial and Quick is the fact, that the Artificial checks the segment's order before performs the sort. That gives it a huge edge quite often. See the table and try it yourself. Use the best known QuickSort you can get.

Share this post


Link to post
Share on other sites
I don't get it. Your "random data" animation shows QuickSort finishing first, and the QuickSort algorithm in your "stair data" animation seems to be using a silly partitioning algorithm (as opposed to Dutch Flag). Could you provide more information about the version of QuickSort you used?

Also, small nitpick: the provided source code is C, not C++.

Bigger nitpick: try to make your web site sound less like a snake oil salesman, and more like a serious computer science discussion. You are, after all, addressing to developers, which have enough knowledge about the problem field to understand what you have to say.

Share this post


Link to post
Share on other sites
For the small random files, Quick may seldom win. Also at the example shown. Facts are facts.

But you have the table for larger files, how they behave. That's important.

You are free to choose any known QuickSort algorithm in C/C++ and compare it to ArtificialSort given.

If you think it's a snake oil, just test it, it's easy to do.

Share this post


Link to post
Share on other sites
I think it should be interesting to see the complexity of this algorithm...

Probably the order is O(nlogn), but the complet complexity can prouve effectively it's faster than the quicksort (in average case).

And I have a question, with your variable "meja", you have this:
meja = ( temp1 + temp2 + 1 ) >> 1;
if ( array[ (left + right) >> 1 ] > meja )
{
meja = array[ (left + right) >> 1 ];
}

If I want to sort... clients by Lastname/Firstname... how, can I do the "( temp1 + temp2 + 1 ) >> 1" ??

Base on this fact, this algorithm, is only good for numbers... the Quicksort, work for every kind of data... and the section above, looks pretty important because it act like the pivot in the quicksort...

It's like the Counting Sort, it's work in linear time, but it's work only for number, and you have to know the range of your data entries...

The ArtificialSort has this problem... unless, someone has a suggestion to adapt the algorithm for every kind of data...

[Edited by - widggman on April 7, 2007 9:54:51 AM]

Share this post


Link to post
Share on other sites
First, a disclaimer: I'm not criticizing your algorithm at all. As a matter of fact, I don't have an opinion about it yet. What I am criticizing, however, is your web site, and the methodology you use to present information to us. It is essential, if you are serious about advertising your algorithm, or the program that generated it, that you make an effort to improve the credibility of your web site in general, and of this article in particular.

Quote:
But you have the table for larger files, how they behave. That's important.


It would have been important, had it made any sense. As it is presented here, the table provides us with:
  • The speed at which your algorithm runs on unspecified hardware.
  • The speed at which an unspecified algorithm (some flavour of QuickSort) runs on unspecified hardware.


So, we do not have access to any absolute measure of performance (such as complexity, average number of swaps, reads, writes or comparisons), which makes it hard to judge your algorithm. But, even worse, we do not have access to any relative measure of performance either, since we do not know which particular algorithm you compared it to. Was it a suboptimal quicksort? A standard library generalistic one not optimized for integers? The seed for your evolution?

Your table provides no relevant information, it's just hand-waving. If you're serious about your algorithm, the first step would be to make the information relevant by adding information about:
  • Hardware (processor, memory size) and OS
  • Compiler, optimization options
  • The source code of the quicksort algorithm used.


Without these, the table is useless. It would be even more useful if it included statistical significance information (p-value for the hypothesis, at least), and compared several more algorithms instead of just two.

Quote:
You are free to choose any known QuickSort algorithm in C/C++ and compare it to ArtificialSort given.


This is an extremely bad thing to say. Remember, you're trying to convince people that your algorithm is faster than QuickSort. So far, you've provided people with your source code, and told them to do the rest themselves. Which 90% of people will not do, because:
  • They are too lazy.
  • They can't do it right now, and will forget about it by the time they can.
  • They wait for someone else to do it.
  • They don't have access to resources for a relevant benchmark.
  • They don't care.


In the current state of your presentation and web site, most people will read over the page, think "Ok." and move on to something else. When convincing people, it is essential that you come up with clear, detailed and cleanly presented facts, and an explanation of why these facts justify your initial assertion. Counting on the addressee to prove your pont himself is argumentative suicide.

Would you be sending incomplete resumes to companies, with the footnote "you are free to examine my background yourself" ? This would certainly not get you hired. Apply the same principle here: provide complete information yourself, don't wait for people to find it on their own.

Quote:
If you think it's a snake oil, just test it, it's easy to do.


Your web site sounds like a snake oil salesman. This is most unfortunate, since it makes you lose your credibility, and even though the statements may be entirely correct, people will tend to disbelieve you.

For instance, if you were to ask a girl out, a good idea would be to explain where you will be going, what movie you will see, what kind of dinner you will eat, and other practical details about the date itself. Would you explain to her that you're not a rapist, a transsexual or already married? Keep your discourse to the bare facts, without meta-reasoning about the state of mind of the reader, philosophical subjects about how the universe works, tangent remarks about trivia, gloating about how clever and great and excellent your approach is, and anything that is not a justification of your initial assertion.

Now, does any serious site mention that the reader might not believe his own eyes? No. Most of them act as if what they were saying was obvious and self-evident truth, and then present justification in a clear and transparent way. Who mentions the reader's disbelief?
  • People who try to build up hype. "So good you won't believe your eyes." This will get the average Joe's attention, but is easily recognizable by people in the same field of competence.
  • People who don't have arguments to justify themselves, and resort to meta-reasoning to distract the reader from the point (the point being the stated facts, and not the user's belief in them).

So, when people read such remarks, they will automatically associate you with these shady types, even if you are in fact honest and innocent.

Reading your main page for Critticall, half of the page is dedicated to information not directly related to the problem at hand. This means that, upon reading a new sentence, there is a 50% chance for me not to care about what that sentence says. This is unusually high. I'm not saying that you're a snake oil salesman, but your site looks like one, and it would be best for you to move away from that image. Again, this is friendly advice about how to improve your communication, and not dismissal of what you have to say.

A great structure for your algorithm presentation would be:
  • Abstract: present in a few broad strokes what your algorithm is about. For instance, "A divide-and-conquer high-performance sort algorithm: presentation, comparison with existing solutions, source code".

  • The algorithm: provide a pseudocode explanation of your algorithm. Something that does not have the readability issues of C code. For instance, a divide-and-conquer algorithm is usually best presented as a recursive solution, not as an imperative-with-stack optimization. Also, the insertion sort optimization is not relevant to the algorithm: its rightful place is in a later section of the document, along with production code. This section is about explaining how the algorithm works, and why. It's also a great place to explain why your algorithm deserves a brand new name, even though the code provided on your page is, in fact, a typical QuickSort algorithm with an unusual partitioning routine.

    This section should not contain even a single allusion to the performance of other algorithms. You're explaining how your algorithm works, don't drown the reader in unstructured information. Performance comes in the next section, not here.

  • The results: this is where the results are. This generally involves one or two quick overviews (graphical animations, your demonstration section), followed by the actual benchmark. The benchmark should state what the hardware, OS, compiler and algorithms compared are. It should also describe the statistical significance of your results (the average time isn't enough, you will need at least the standard deviation, with p-values for the null hypothesis being expected in many cases). Provide links to the appendix.

    If at all possible, provide several benchmark cases: there is more to the world than just integer arrays, performance on string arrays (for example) would be great to have as well. Also, provide more than two algorithms. Try out several quicksort variants, along with bin sort, radix sort and other unrelated sort algorithms. And how does the algorithm compare, on nearly-sorted sets, to insertion sort or bubble sort? To quicksort with median-of-three selection?

  • The conclusion: wrap up your argument. Here, it would be that you have a new algorithm which performs faster than several typical implementations of other known fast algorithms, although it does have some documented weak points.

  • The appendix: the source code for all algorithms included in the benchmark is a must. A makefile and benchmark scaffold would be a definite plus. If you're developer-friendly, providing a generic version of your algorithm in an easily usable form (header file with inlined template implementation) would be much welcome. Make sure you give credit where credit is due, for instance for the code of the other sort algorithms in the benchmarks (including crediting yourself, if you're the author).


If you follow a similar structure, and keep to the point, it would improve credibility, provide the necessary information for judging your algorithm, and be a useful resource to everyone interested.

[Edited by - ToohrVyk on April 7, 2007 9:33:19 AM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Tomaz Kristan
Between 'Jones' and 'Lindberg' - "meja" is 'K'. Between 'ABBA' and 'ABEL', "meja" is 'ABC'. And so on. This way, the Artificial is a general sort.


Works for strings is not the definition of general. A general algorithm is able to work given a type T and a comparison function of the form:

bool operator<(const T&, const T&)

Share this post


Link to post
Share on other sites
When I clicked on the link in the OP, I imediately skimmed the page for pseudo code section or a description of how the sort works. All I found was C++ source.

Then I tried reading trough your source. The source is very hard to read without knowing what the code should be doing. Your variable names are completely non self-explanatory (you have left and right and left1 and right1, that clearly showes you were not taking care of presentation when choosing the variable names) and some will not be understandable to most of the the readers, levi and desni means left and right respectively (in most variant od slavic languages including Croatian) but english speakers will not know that.

I am interested in your sort but please provide a clean pseudo-code and a description of how it works.

Also, ToohrVyk's post is a good description of how a final version of a proper science paper should look like.

However, a few more details about testing and an a pseudocode section (with maybe an analysis of it's complexity: O(?)) would be sufficient, for now.

Share this post


Link to post
Share on other sites
That's a good point... but it's not every kind of data it's possible, or easy, to find a mean between two of them...

And just for fun, have you calculate the complexity of your algorithm? Because for many person, it's a better proof than bench mark. It's completely independant of the langage and the machine (and an human can be considered has a machin...).

And for the complexity...

In the worst case, it looks like O(n²) for every while loop...

And why the variable "temp" is use to choose between your algorithm and the insertion sort, and it is used to swap value ?? I don't see any place where you're "temp" value is restaured to a good value... if I swap the value 15, the next loop, it has some chance to go in the insertion sort !

Share this post


Link to post
Share on other sites
For the people who are too lazy to try and understand the source code on the page, the algorithm appears to be a QuickSort with an unusual partitioning routine (it chooses the pivot as the average of the first two unsorted elements, and early-outs if the segment is sorted).

The pseudocode for the partitioning routine is something like:


// Computes intervals [left,b[ [c,right[ to be sorted
PARTITION(A,left,right)

a := biggest element such that indices
[left,a[ in array A are sorted

if (a = right) return(b=left,c=right) // already sorted

pivot := max ((A[a+1] + A[a])/2, A[ (left+right)/2 ])

// Use Hoare's partition algorithm to position
// the pivot. Technically, it's a Hoare variation, since it
// does not perform the "position pivot" final operation.
pos := PARTITION-HOARE(A,left,right,pivot)

return (b=pos,c=pos+1) // Around the pivot


I do have, however, trouble with the correctness proof for this algorithm. I'm considering an input such as:

1 (n times) 11 (once) 1 (n times)

Why does it terminate? [wink]

Share this post


Link to post
Share on other sites
Quote:
For the people who are too lazy to try and understand the source code on the page, the algorithm appears to be a QuickSort with an unusual partitioning routine (it chooses the pivot as the average of the first two unsorted elements, and early-outs if the segment is sorted).


In fact it's a little more. The main difference between Quick and Artificial is the checking section, where the Artificial first checks the order. When and where it's broken, the new "meja" (Slovenian for border) is calculated.

When the QuickSort always do the partitioning, Artificial doesn't. Instead of (not one but many sessions) of partitioning, it just completes one check in that case.

This explains why it's so much faster for not all random arrays.

Share this post


Link to post
Share on other sites
Quote:
Original post by Tomaz Kristan
In fact it's a little more.


It's not that much more, to be honest. The only difference with usual QuickSort flavors is the partitioning function, the rest of the algorithm being perfectly identical. The partitioning difference is twofold:

  • It chooses the pivot using the first unsorted pair. Quicksort pivot choice heuristics are plenty.
  • It may return zero-size to-sort intervals. The Dutch Flag already does this, albeit in a different situation (all elements being equal).

Share this post


Link to post
Share on other sites
This was originally presented somewhat wrong. The algorithm is quicksort.

But emphasis should have been on the application used to generate it.

I've ran some tests with various quicksort flavors vs. asort, and asort consistently runs in shorter time.

But O() complexity of both is the same, or better yet, asort performs same as hybrid sort which uses optimal algorithm based on interval range. Also, for arrays over 1000 elements (using Algorithms in C++, Addison Wesley, 1998 reference implementation) it's only 50-75% faster on random data.

The tool with which it was obtained is a nice little optimizer which, one way or another (the details are lacking) manages to produce code specifically optimized for given algorithm.

This is also the main reason why there's such a noticable difference between both - asort wins by constant factor, simply due to being, in a way, fully unrolled and tuned for specific input/output range.

The real question, before judging the true value of asort, is how it performs algorithmically. How many swaps and comparisons, best/worst case performance, etc. I would dare claim that in that respect it's not a notable improvement.

Nevertheless, Critticall could be a useful tool, depending on its limitations. Instruction schedulers, pipeline optimizers and similar niche products have their uses. While test examples obviously work, they are somewhat non-representative. How does it handle real-life algorithms? Sorting, while important, usually faces quite different problems, such as storage.

Unfortunately Critticall seems to offer only support for C, which means its use remains quite limited in any real C++/STL codebase.

Rather than advertising the asort, you'd be better off outlining how the optimization process works in Critticall.

Share this post


Link to post
Share on other sites
Quote:
Original post by Tomaz Kristan
Quote:
It's not that much more, to be honest.


Not need to be. The important thing is, it's much faster on average, than ANY QuickSort you can find on the Internet.


Just a note - it runs in shorter time. It hasn't been proven to be better algorithmically.

That's huge difference when talking about algorithms, and something to be careful about, since it can easily cause a lot of misunderstanding.

What is missing before most here are convinced, is the algorithm analysis. Code can always be optimized later.

Share this post


Link to post
Share on other sites
Also, I'm curious about how Critticall ensures that the new algorithm is correct. Do you use some kind of semantical analysis tool? If so, can it generate human-readable correctness proofs? Or do you have to write the proof by hand once the algorithm is found? And, anyway, where can I find the correctness proof for asort? [smile]

Share this post


Link to post
Share on other sites
Quote:
Original post by Tomaz Kristan
Quote:
It's not that much more, to be honest.


Not need to be. The important thing is, it's much faster on average, than ANY QuickSort you can find on the Internet.


Can you prove that? And I mean mathematically. Your personal benchmarks of some quicksort and your algorithm doesn't prove anything. The timing isn't done that well and I doubt the conditions were exactly the same for the two different tests. This doesn't mean your benchmarks are worthless, but they aren't conclusive and you need way more information about them and more benchmarks.

I haven't taken the time to deciphering your badly documented C code (at least give variables descriptive English names). If you were to come up with a decent explanation of the algorithm, some pseudo-code, a proof that it's correct and proofs which proves the asymptotically tight complexity (Θ) preferably along with all the other stuff ToohrVyk mentions then I guess people may take the time to look into your algorithm. I know I would.

Share this post


Link to post
Share on other sites
Every algorithm which comes out of Critticall is conjectural. Highly probable, since it is extensively tested during it's evolution, but still conjectural.

Only the core idea for Artificial was evolved, but it was later re-implemented by humans.

The mathematical proof, with the necessary rigor, will be on the site soon.

I must also say, that Artificial IS algorithmically faster than Quick. At least at those not "totally random" arrays.

Share this post


Link to post
Share on other sites
As soon as you do this:
meja = ( temp1 + temp2 + 1 ) >> 1;

Then the sort is no longer Comparison Based.

Since quicksort is comparison based, and ArtificialSort is not, you can't meaningfully compare them. Quicksort can sort any type of data, but ArtificialSort can only sort integers. It is this loss of generality that has allowed you to find a more efficient algorithm.

If you want to compare ArtificialSort to other algorithms, you have to compare it to other non-comparison-based algorithms such as counting sort, bucket sort and radix sort.

I appreciate your efforts, and I think Crittical is really cool, but please do a little more research before making claims like this.

Share this post


Link to post
Share on other sites
Melekor has a point. It looks like a radix sort.

By the way what do meja, levi, desni and nor translate to?

meja appears to mean middle. nor appears to mean number. levi and desni appear to mean left and right respectively.

And is there a way to ditch the goto? [smile]

Share this post


Link to post
Share on other sites
Quote:
Original post by Melekor
As soon as you do this:
meja = ( temp1 + temp2 + 1 ) >> 1;

Then the sort is no longer Comparison Based.

Can you explain why? It looks to me like this is just a heuristic to pick a pivot value to compare against when partioning. The algorithm certainly does a lot of comparisons, and its time complexity fits O(n log n) better than it fits O(n). I'm pretty certain "ArtificialSort" is just an optimized QuickSort.

Edit: Quick tests on random data show the heuristic is a little better than just picking the middle value. OTOH, the shuffling done at the start of the algorithm only appears to serve to slow it down.

Edit 2: Tests on "mostly sorted" data show that it's a little worse than just picking the middle value.

Additionally, the Dutch national flag code

while ( array[right1] >= meja ) right1--;
while ( array[left1] < meja ) left1++;

can be made, with suitably crafted input, to walk off the end of the array. Needless to say, hilarity ensues.

[Edited by - Nathan Baum on April 7, 2007 5:44:59 PM]

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this