Using 3rd party libraries or code your own?

21 comments, last by Kian 9 years, 9 months ago

but I also have MemCmpF which saves cycles by returning only true or false (as opposed to -1, 0, or 1, which requires a byte-by-byte scan once a discrepancy between the inputs is found).

You'd be at most saving one comparison actually. ... As for the actual topic, there's always a desire to roll your own when you realize you need something. ... The better you can define your requirements, the easier it is to make a decision.
Looks like you might also benefit from taking time to understand the requirements.
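
(For reference, since MemCmpF's body never appears in the thread, a boolean-only compare can be sketched as below. This is an assumption about its shape, not the actual routine; some compilers will even lower the compare-against-zero pattern to an equality-only routine such as bcmp.)

#include <cstring>

// Assumed sketch only; "MemCmpF" is the poster's name for his routine,
// but its real body is never shown in the thread.
inline bool MemCmpF( const void * pvOne, const void * pvTwo, std::size_t sSize ) {
    // Equality only: the caller never pays for working out which side is greater.
    return std::memcmp( pvOne, pvTwo, sSize ) == 0;
}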

While it is true that the return value is a tiny difference, the actual comparisons can be radically different. Parallel search for mismatch is an algorithm family that can have superlinear speedup. So if you split the work into 4 chunks you can do the job much faster than 1/4 the time on average; sometimes dividing into 4 parts and not caring about the exact mismatch position or direction could give an 8x speedup rather than a linear 4x speedup, for example. If you are not that particular about the exact nature of the mismatch and are willing to do some algorithmic transformations, you can get much faster results than a standard library's memory compare in many situations.
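
(A rough sketch of that chunked idea, as my own illustration rather than code from the thread; the chunk count, block size, and thread handling are arbitrary choices:)

#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Sketch of a chunked parallel equality check. Each thread compares its own
// chunk and raises a shared flag on the first mismatch so the others can
// stop early.
bool ParallelEqual( const unsigned char * pucOne, const unsigned char * pucTwo,
                    std::size_t sLen, std::size_t sChunks /* must be >= 1 */ ) {
    std::atomic<bool> bMismatch( false );
    std::vector<std::thread> vThreads;
    const std::size_t sChunkLen = sLen / sChunks;
    for ( std::size_t I = 0; I < sChunks; ++I ) {
        const std::size_t sOff  = I * sChunkLen;
        const std::size_t sThis = ( I + 1 == sChunks ) ? sLen - sOff : sChunkLen; // last chunk takes the tail
        vThreads.emplace_back( [&bMismatch, pucOne, pucTwo, sOff, sThis]() {
            const std::size_t sBlock = 4096;    // compare in blocks so the abort flag gets polled
            for ( std::size_t J = 0; J < sThis; J += sBlock ) {
                if ( bMismatch.load( std::memory_order_relaxed ) ) { return; }
                const std::size_t sN = ( sThis - J < sBlock ) ? sThis - J : sBlock;
                if ( std::memcmp( pucOne + sOff + J, pucTwo + sOff + J, sN ) != 0 ) {
                    bMismatch.store( true, std::memory_order_relaxed );
                    return;
                }
            }
        } );
    }
    for ( std::thread & t : vThreads ) { t.join(); }
    return !bMismatch.load();
}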

Well, if we're getting right down to the stated requirements, L. Spiro said the goal of his memcmp was to save cycles, not time.

The parallel search will have strictly worse cost in cycles spent, just to get the basic boolean result. Trying to get the +1 or -1 result would actually require a significant increase in complexity (compared to checking the first byte that differed).

This is because the single threaded compare only needs to search until it finds a discrepancy, and it can quit early. Parallel compare, on the other hand, has to split the memory into chunks and start every chunk at the same time. Even if you allow for threads to quit as soon as they find a discrepancy, the threads that didn't find one will continue until they are done.

Worst case for the library memcmp is if both strings are equal, since it has to compare everything to return 0. This is the same for the parallel compare, since all the threads will have to do the same number of comparisons. In every other case, every chunk that compares equal has to be run to completion in either parallel or single-threaded compare, but the single-threaded model won't compare any chunks after the first difference, while the parallel compare will have to compare at the very least the first byte of each chunk. A consequence of this, in fact, is that as you increase the number of chunks, your algorithm becomes faster but also more wasteful. Take the best case scenario, for example, where every byte is different. memcmp will check the first byte and return. Parallel compare will check n bytes and return, where n is the number of chunks you have.

And I'm not even comparing the cost of launching all the threads. Even for perfect threads that cost nothing to create, the algorithm is strictly worse for every case in terms of cycles spent.

The parallel search will have strictly worse cost in cycles spent, just to get the basic boolean result. Trying to get the +1 or -1 result would actually require a significant increase in complexity (compared to checking the first byte that differed).

Putting an on-topic spin to this, this is one of the reasons reinventing wheels can be a good thing.

Kian thinks std::memcmp() is so basic, “It’s just comparing bytes until it finds a discrepancy,” but if he had taken the time to write his own routine he would have found it to be at least 4 times slower than std::memcmp().


Even if you don’t think you can write a faster routine, nor think reinventing wheels is a good idea, you should still rewrite some of the standard library routines as a learning exercise. You really lose nothing. If yours is faster, then you’ve just made a faster version (just check it for bugs). If yours is substantially slower, you will be inclined to ask why that is, investigate, and learn.



So, first off, Kian, it is not implemented as a byte-by-byte compare.
Look at these 2 arrays of 8 bytes:


const u8 pu8One[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
const u8 pu8Two[] = { 0, 1, 2, 3, 4, 5, 7, 6 };

You could do this:


for ( u32 I = 0; I < 8; ++I ) {
    if ( (*pu8One++) != (*pu8Two++) ) { /* Found a mismatch. */ }
}

…which takes 7 iterations to discover a mismatch.

Or you could compare either 4 or 8 bytes at a time (depending on the architecture):


const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) { /* Found a mismatch. */ }
}

…which finds a mismatch in 2 iterations on x86 and 1 iteration on x64.


This is just basic but is already a speed-up of 4 or 8 times. When you don’t need to return -1 or 1 on a mismatch, you are done at that point. Since you aren’t done if you do need them, returning -1 or 1 is by definition slower than just returning true or false.

Can you think of a way to make it even faster?


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

^ Nice, but what if pu8One/pu8Two have an odd, non-power-of-two number of elements?
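
(Nobody answers this directly in the thread, but one common shape for it, sketched here as my own assumption, is to compare whole words while they fit and finish the remainder byte by byte:)

#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed sketch, not code from the thread: word-sized compares for the
// bulk, byte compares for an odd tail. memcpy is used for the loads so
// arbitrary alignment is safe.
bool EqualBytes( const unsigned char * pucOne, const unsigned char * pucTwo, std::size_t sLen ) {
    while ( sLen >= sizeof( std::uintptr_t ) ) {
        std::uintptr_t uOne, uTwo;
        std::memcpy( &uOne, pucOne, sizeof( uOne ) );
        std::memcpy( &uTwo, pucTwo, sizeof( uTwo ) );
        if ( uOne != uTwo ) { return false; }
        pucOne += sizeof( std::uintptr_t );
        pucTwo += sizeof( std::uintptr_t );
        sLen   -= sizeof( std::uintptr_t );
    }
    while ( sLen-- ) {                      // non-power-of-two remainder
        if ( *pucOne++ != *pucTwo++ ) { return false; }
    }
    return true;
}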


The main advantage of a 3rd party library is not that the code is written for you. The main advantage is that the code has been tested and debugged for you.

Although this is certainly true, it's also dangerous to assume it is extremely well tested for your exact cases.

Taking in 3rd party code is also a risk.

It could speed up development a lot, but it can also trip you on the finish line when some obscure bug in it appears, and you have to scramble to either have it fixed or work around it.

If you have the source code to it, you are a bit safer, but understanding a 3rd party library enough to safely make changes to it also takes some time.

Just wanted to underline that the choice isn't straightforward...

On big-endian architectures you can actually just do the final comparison for 4/8 byte chunks just as you would for a byte-by-byte approach (of course, most consumer hardware is little-endian). I would also imagine that unless your buffers are in cache to begin with, your implementation is likely going to be memory-bound. In short, memcmp is not the first function I would look at when micro-optimizing, compared to, say, memory allocation patterns, excessive use of virtual functions, poor cache usage, etc... but, yes, it will probably make you a better programmer to really understand how it is typically implemented on a variety of architectures.
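
(To illustrate that big-endian point, a sketch under my own assumptions, not code from the thread:)

#include <cstdint>
#include <cstring>

// On a big-endian machine the first byte in memory is the most significant
// byte of the loaded word, so numeric word order matches byte-by-byte order.
// This is my own illustration; it is WRONG on little-endian, which is
// exactly the trap demonstrated later in the thread.
int CompareWordBigEndianOnly( const unsigned char * pucOne, const unsigned char * pucTwo ) {
    std::uint32_t u32One, u32Two;
    std::memcpy( &u32One, pucOne, sizeof( u32One ) );
    std::memcpy( &u32Two, pucTwo, sizeof( u32Two ) );
    if ( u32One == u32Two ) { return 0; }
    return ( u32One < u32Two ) ? -1 : 1;
}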

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”




Just wanted to underline that the choice isn't straightforward...

Agree one hundred percent.

When people write a library, they write it with a set of assumptions in mind. Those assumptions may or may not be true for your use case.

I've just had to write a workaround for a very well known and respected library. They made an assumption that an animation rig would only have a very small number of float tracks attached to it. This assumption is very wrong in our case. Simply getting an integer from the system using a string as a key was taking several milliseconds, because they just did a linear walk looking for the string in the array.

So every time you used this function you were doing a hell of a lot of string compares.

I simply used a hashmap to replace the linear walk and saved hundreds of milliseconds per frame. A massive speed up for very little code.
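
(The library and its types aren't named in the post, so this is only the shape of such a fix, with invented names: build a hash map once, then each lookup is average O(1) instead of an O(n) string-comparing walk.)

#include <string>
#include <unordered_map>
#include <vector>

// Invented names; only the shape of the fix described above.
struct FloatTrack { std::string name; int value; };

class TrackLookup {
public:
    explicit TrackLookup( const std::vector<FloatTrack> & vTracks ) {
        for ( const FloatTrack & t : vTracks ) {
            m_map.emplace( t.name, t.value );    // O(n) once, up front
        }
    }
    // Average O(1) per query instead of a linear walk with a string
    // compare at every element.
    int Find( const std::string & sName, int iDefault = -1 ) const {
        const auto it = m_map.find( sName );
        return ( it == m_map.end() ) ? iDefault : it->second;
    }
private:
    std::unordered_map<std::string, int> m_map;
};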

Did take a hell of a long time to find the problem though :>

Third party libraries are great as long as you know their limitations.

Kian thinks std::memcmp() is so basic, “It’s just comparing bytes until it finds a discrepancy,” but if he had taken the time to write his own routine he would have found it to be at least 4 times slower than std::memcmp().

Actually, on my first reply to you I clarified:

(not sure if it compares byte by byte or word by word or whatever, but the case is the same regardless of how it does it)

I'm using byte to stand in for whatever your largest integral type is, since regardless of the size of your comparisons, the algorithm is the same. I didn't want to get into architectural details like how many bytes you can check per instruction. The effect is that you are checking for the first bit that differs. A byte just lets you check 8 bits at a time, a 32 bit pointer 32 bits at a time and a 64 bit pointer 64 bits at a time.

I did agree that returning true or false was faster, I just didn't perhaps entirely understand what you meant by needing to check byte by byte to get -1 or +1. With your example:

const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) {
        /* Found a mismatch; a single word comparison decides the sign. */
        if ( *--puptrOne > *--puptrTwo ) {
            return +1;
        }
        else {
            return -1;
        }
    }
}
return 0;
There's no byte-by-byte checking required. As soon as you find a mismatch it takes a single comparison, regardless of how many bits you're checking at a time.

There's no byte-by-byte checking required. As soon as you find a mismatch it takes a single comparison, regardless of how many bits you're checking at a time.


That's actually more checking than you need. You'll note that memcmp (and several other C comparison functions) does NOT return -1, 0, and 1. They return <0, 0, or >0. This is because they are most likely doing the following:


// Byte values promote to int, so the subtraction preserves the sign:
// >0 if the byte from the first buffer is greater, <0 if smaller, 0 if equal.
int result = value1 - value2;
return result;
No conditionals. If you're having to iterate, then you only need a single compare of the result against 0, returning when it is non-zero. That's generally faster than two conditionals and exact -1, 0, 1 return values.
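
(Putting that together, a byte-wise compare built on the subtraction trick might look like the sketch below; this is an illustration, not any particular library's source.)

#include <cstddef>

// Illustration of the subtraction idea, not a real library's memcmp:
// unsigned char promotes to int, so the difference carries the exact sign,
// and the only conditional is a single test against zero.
int MemCmpSketch( const unsigned char * pucOne, const unsigned char * pucTwo, std::size_t sLen ) {
    while ( sLen-- ) {
        const int iDiff = *pucOne++ - *pucTwo++;
        if ( iDiff != 0 ) { return iDiff; }     // <0 or >0, never normalized to -1/+1
    }
    return 0;
}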

I did agree that returning true or false was faster, I just didn't perhaps entirely understand what you meant by needing to check byte by byte to get -1 or +1. With your example:

And herein lies your problem.
Think about what std::memcmp() returns.
It returns a negative or positive value determined by the first byte that mismatches.

A value greater than zero indicates that the first byte that does not match in both memory blocks has a greater value in ptr1 than in ptr2 as if evaluated as unsigned char values; and a value less than zero indicates the opposite.


As Bacterius correctly mentions, you can compare all 4 or 8 bytes at a time on big-endian machines and return the correct value, but I chose the 2 arrays I did intentionally to illustrate the problem you have with little-endian machines (Windows, iOS, Android, Mac OS X…).

If you cast pu8One and pu8Two to DWORDs for the compare you would find a difference between them at reinterpret_cast<const DWORD *>(&pu8One[4]) and reinterpret_cast<const DWORD *>(&pu8Two[4]) (let’s call those addresses pdwOne and pdwTwo).
(*pdwOne) == 117835012 (0x07060504).
(*pdwTwo) == 101123332 (0x06070504).
This means you would return 1, when in fact you should return -1 (on a per-byte compare, the first non-matching byte is greater in pu8Two).

I am not going to just give you the solution. Coming up with the correct solutions on your own is good for training and is exactly the purpose in challenging yourself to rewrite seemingly mundane things such as memory compares, string compares, string lengths, etc.


This doesn’t directly answer the original poster’s question as to why big names rewrite things, but it illustrates that reinventing wheels, even when not necessarily providing a real-world application, is a great way to practice critical thinking and learn computer-specific tricks that will help you forever into the future (or at least until you stop programming).
You quickly brushed off memcmp as trivial, but if you had ever taken the time to try it for yourself you would find its hidden complexities lie in its subtleties.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

This topic is closed to new replies.
