
Using 3rd party libraries or code your own?



So I was looking through the Unreal 4 source and I see that they have everything you would imagine, including a whole string library that is pretty much the same as Boost's. I guess they are just used to writing their own libraries, since libs like Boost haven't been around for that long, but I'm guessing they could knock maybe up to a third of their code off by using libs like Boost and various other highly optimised libraries.

 

So what are your thoughts here: write your own, or use third party?


The decision to use a third party library is never an easy one.

 

I personally don't believe in re-inventing the wheel. If someone has already done a really good job of something, why should I invest the time doing the same?

 

However there are times when "the best out there" isn't good enough. Then you need to get involved and do some serious coding.

 

It's a cost versus advantage equation. 

 

If using an external library saves you three man-months of coding at the cost of 1 ms per loop, it's probably an easy decision ... until you really need that millisecond.

 

We use third party code, and pay a lot for the privilege. It's not always a good call. I've just saved hundreds of milliseconds in a game because I noticed a problem in a third party library.

 

Now this library is very good, don't get me wrong, at times it's brilliant. However it was designed for a small number of animation tracks. We use thousands of animation tracks. Huge numbers of the damn things. So the code was doing something it wasn't designed to do, and hence it was slow.

 

I added three lines of code and increased the speed by several thousand percent. The code was simple; finding the problem was a bitch. I'm talking a 20-year-old bitch with massive PMT.

 

Would we have had that problem if we wrote it ourselves? Hell no!

 

Would we have got the game out on time if we had written it ourselves? Probably not.


The main advantage of a 3rd party library is not that the code is written for you. The main advantage is that the code has been tested and debugged for you. 

 

Do not underestimate that. You might look at a library and be horrified at the bloat and hacks and god knows what else, but most of the time, that bloat and those hacks represent hours/days/weeks of work tracking down obscure bugs in strange environments.

 

Always be aware of that if you decide to write your own code instead of using a library.


The main issue with pre-existing libraries is that they rarely do exactly what you want, and it's frequently the case that said library needs a lot of wrapping before you can even use its unwieldy API.  You end up doing things like writing C interface layers to poorly documented image processing libraries that think they're cute because they use longjmp.

 

That said, as terrible as most libraries tend to be, there tends to be at least one that's a very close fit to your requirements.  The hard part is finding it, and the harder part is testing it, because those libraries aren't typically the massively battle-tested ones.

Edited by SeraphLance


The other thing to consider, as well as the very valid points mentioned above, is that the Unreal 4 code base is based on code that has been around for a long time.  Sure, Epic may claim that it has been rewritten from scratch, but I highly doubt that they wrote every single line.  If they have a string or collection library that they have been using and maintaining since they released Unreal in 1998 then I really doubt they are going to replace it.


but I also have MemCmpF which saves cycles by returning only true or false (as opposed to -1, 0, or 1, which requires a byte-per-byte scan once a discrepancy between the inputs is found).

You'd be at most saving one comparison actually. As soon as you find a byte that's different (not sure if it compares byte by byte or word by word or whatever, but the case is the same regardless of how it does it), you check if the byte is bigger and return based on that.

If you have four bytes 0x01 0x02 0x03 0x04 and compare them to 0x01 0x03 0x03 0x04, as soon as you check the second byte you know the second string is bigger, so you return 1. If the second string was 0x01 0x00 0x03 0x04 you'd return -1 after checking the second byte. The first byte that differs is the most significant byte that differs, so that's all you need.
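A minimal byte-wise sketch of that idea (purely illustrative; real implementations typically compare a machine word at a time, as comes up later in the thread):

#include <cstddef>

int compare_bytes( const unsigned char * pOne, const unsigned char * pTwo, std::size_t n ) {
    for ( std::size_t i = 0; i < n; ++i ) {
        if ( pOne[i] != pTwo[i] ) {
            // The first differing byte is the most significant differing byte,
            // so a single extra comparison picks the sign.
            return pOne[i] < pTwo[i] ? -1 : 1;
        }
    }
    return 0;   // All n bytes matched.
}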

As for the actual topic, there's always a desire to roll your own when you realize you need something. You always need to be aware of the 80/20 rule, though: 80% of the work will be done in 20% of the time, and the last 20% will take 80% of the time. That is to say, you can get most of the way in a short time, which may make you think you saved a lot of time, but actually taking care of all the edge cases, testing, debugging and the like that gets you all the functionality you need will take the most time. Deciding whether to roll your own or use 3rd party libraries needs to be carefully considered; there's no simple answer.

The first thing you should do is understand what your requirements are, exactly. The better you can define your requirements, the easier it is to make a decision. Then look at the documentation for the library. Does it match your requirements? How much work would you need to do to get it all the way there? Look around, too, don't just settle for the first library you find. After this you should have a list of libraries that match your requirements. Compare those to determine which one is the best. Does it have an active community of users (good for getting help if you're stuck, and probably means it is in active development), does it have good documentation, does it have a bug tracker, are there any disqualifying bugs still around, is the interface reasonable and easy to use, is the build system simple, etc. Once you have the best of these, compare what it does to how much effort it would take you to match the parts that you actually need.

If you are convinced you can do it better in less time that it would take you to learn the interface, go for it. Otherwise, use the library.

Of course, writing libraries is great for learning too. If time isn't an issue (meaning, generally, a hobby project), do it yourself if you want to. What I generally do is write from scratch things that I think are interesting, and use libraries for things that are boring.


For me it depends. When it comes to the standard libraries like vector, map, strings etc., I believe I couldn't do it better or more efficiently. When it comes to, for example, graphics handling, I tend to just use the API and create libraries/code around it myself. Not that I can necessarily do that better than other available libraries, but it helps me learn a lot.


but I also have MemCmpF which saves cycles by returning only true or false (as opposed to -1, 0, or 1, which requires a byte-per-byte scan once a discrepancy between the inputs is found).

You'd be at most saving one comparison actually. ... As for the actual topic, there's always a desire to roll your own when you realize you need something. ... The better you can define your requirements, the easier it is to make a decision.
Looks like you might also benefit from taking time to understand the requirements.

While it is true that the return value is a tiny difference, the actual comparisons can be radically different. Parallel search for mismatch is an algorithm family that can have superlinear speedup. So if you split the work into 4 chunks you can, on average, do the job in well under 1/4 of the time; sometimes dividing into 4 parts and not caring about the exact mismatch position or direction could give an 8x speedup rather than a linear 4x speedup, for example. If you are not that particular about the exact nature of the mismatch and are willing to do some algorithmic transformations, you can get much faster results than a standard library's memory compare in many situations.
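A rough sketch of that chunked, equality-only idea, assuming equal-length buffers and a caller-chosen chunk count of at least 1 (whether it actually wins depends on buffer sizes, hardware and thread overhead, as the next reply argues):

#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

bool EqualChunked( const unsigned char * pOne, const unsigned char * pTwo,
                   std::size_t n, std::size_t chunks ) {
    std::vector<std::future<bool>> results;
    const std::size_t per = n / chunks;
    for ( std::size_t c = 0; c < chunks; ++c ) {
        const std::size_t off = c * per;
        const std::size_t len = (c + 1 == chunks) ? n - off : per;  // Last chunk takes the remainder.
        results.push_back( std::async( std::launch::async, [=]() {
            // Equality only: no attempt to report where or in which direction the buffers differ.
            return std::memcmp( pOne + off, pTwo + off, len ) == 0;
        } ) );
    }
    bool equal = true;
    for ( auto & r : results ) {
        equal = r.get() && equal;   // Wait for every chunk; no early exit in this sketch.
    }
    return equal;
}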


Well, if we're getting right down to the stated requirements, L. Spiro said the goal of his memcmp was to save cycles, not time.

 

The parallel search will have strictly worse cost in cycles spent, just to get the basic boolean result. Trying to get the +1 or -1 result would actually require a significant increase in complexity (compared to checking the first byte that differed).

 

This is because the single threaded compare only needs to search until it finds a discrepancy, and it can quit early. Parallel compare, on the other hand, has to split the memory into chunks and start every chunk at the same time. Even if you allow for threads to quit as soon as they find a discrepancy, the threads that didn't find one will continue until they are done.

 

Worst case for the library memcmp is when both strings are equal, since it has to compare everything to return 0. This is the same for the parallel compare, since all the threads will have to do the same number of comparisons. In every other case, every chunk that compares equal has to be run to completion in either the parallel or the single threaded compare, but the single threaded version won't compare any chunks after the first difference, while the parallel compare will have to compare at the very least the first byte of each chunk. A consequence of this, in fact, is that as you increase the number of chunks, your algorithm becomes faster but also more wasteful. Take the best case scenario, for example, where every byte is different: memcmp will check the first byte and return, while the parallel compare will check n bytes and return, where n is the number of chunks you have.

 

And I'm not even counting the cost of launching all the threads. Even with perfect threads that cost nothing to create, the algorithm is strictly worse in every case in terms of cycles spent.


The parallel search will have strictly worse cost in cycles spent, just to get the basic boolean result. Trying to get the +1 or -1 result would actually require a significant increase in complexity (compared to checking the first byte that differed).

Putting an on-topic spin to this, this is one of the reasons reinventing wheels can be a good thing.

Kian thinks std::memcmp() is so basic, “It’s just comparing bytes until it finds a discrepancy,” but if he had taken the time to write his own routine he would have found it to be at least 4 times slower than std::memcmp().


Even if you don’t think you can write a faster routine, nor think reinventing wheels is a good idea, you should still rewrite some of the standard library routines as a learning exercise. You really lose nothing. If yours is faster, then you’ve just made a faster version (just check it for bugs). If yours is substantially slower, you will be inclined to ask why that is, investigate, and learn.



So, first off, Kian, it is not implemented as a byte-by-byte compare.
Look at these 2 arrays of 8 bytes:
 

const u8 pu8One[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
const u8 pu8Two[] = { 0, 1, 2, 3, 4, 5, 7, 6 };

You could do this:

for ( u32 I = 0; I < 8; ++I ) {
    if ( pu8One[I] != pu8Two[I] ) { /* Found a mismatch. */ }
}

…which takes 7 iterations to discover a mismatch.

Or you could compare either 4 or 8 bytes at a time (depending on the architecture):

const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) { /* Found a mismatch. */ }
}

…which finds a mismatch in 2 iterations on x86 and 1 iteration on x64.


This is just basic but is already a speed-up of 4 or 8 times. When you only need to return true or false, you are done as soon as you find a mismatch; when you need to return -1 or 1 you are not, so by definition returning -1 or 1 is slower than just returning true or false.

Can you think of a way to make it even faster?


L. Spiro

Edited by L. Spiro


The main advantage of a 3rd party library is not that the code is written for you. The main advantage is that the code has been tested and debugged for you. 

 

Although this is certainly true, it's also dangerous to assume it is extremely well tested for your exact cases.

Taking in 3rd party code is also a risk.

It could speed up development a lot, but it can also trip you on the finish line when some obscure bug in it appears, and you have to scramble to either have it fixed or work around it.

If you have the source code to it, you are a bit safer, but understanding a 3rd party library enough to safely make changes to it also takes some time.

 

Just wanted to underline that the choice isn't straightforward...

Edited by Olof Hedman


On big-endian architectures you can actually just do the final comparison for 4/8 byte chunks just as you would for a byte-by-byte approach (of course, most consumer hardware is little-endian). I would also imagine that unless your buffers are in cache to begin with, your implementation is likely going to be memory-bound. In short, memcmp is not the first function I would look at when micro-optimizing, compared to, say, memory allocation patterns, excessive use of virtual functions, poor cache usage, etc... but, yes, it will probably make you a better programmer to really understand how it is typically implemented on a variety of architectures.





Just wanted to underline that the choice isn't straightforward...

 

Agree one hundred percent.

 

When people write a library, they write it with a set of assumptions in mind. Those assumptions may or may not be true for your use case.

 

I've just had to write a workaround for a very well known and respected library. They made an assumption that an animation rig would only have a very small number of float tracks attached to it. This assumption is very wrong in our case. Simply getting an integer from the system using a string as a key was taking several milliseconds, because they just did a linear walk looking for the string in the array.

 

So every time you used this function you were doing a hell of a lot of string compares.

 

I simply used a hashmap to replace the linear walk and saved hundreds of milliseconds per frame. A massive speed up for very little code.
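Roughly something along these lines (hypothetical names; the real library and its API aren't shown here):

#include <string>
#include <unordered_map>

// Before: a linear walk over the float tracks, comparing strings on every lookup.
// After: build the map once when the rig is loaded, then look names up in (expected) constant time.
std::unordered_map<std::string, int> trackIndexByName;

int FindTrackIndex( const std::string & name ) {
    auto it = trackIndexByName.find( name );
    return it != trackIndexByName.end() ? it->second : -1;   // -1: no track with that name.
}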

 

Did take a hell of a long time to find the problem though :>

 

Third party libraries are great as long as you know their limitations.


Kian thinks std::memcmp() is so basic, “It’s just comparing bytes until it finds a discrepancy,” but if he had taken the time to write his own routine he would have found it to be at least 4 times slower than std::memcmp().

Actually, on my first reply to you I clarified:

(not sure if it compares byte by byte or word by word or whatever, but the case is the same regardless of how it does it)

I'm using byte to stand in for whatever your largest integral type is, since regardless of the size of your comparisons, the algorithm is the same. I didn't want to get into architectural details like how many bytes you can check per instruction. The effect is that you are checking for the first bit that differs. A byte just lets you check 8 bits at a time, a 32 bit pointer 32 bits at a time and a 64 bit pointer 64 bits at a time.

I did agree that returning true or false was faster, I just didn't perhaps entirely understand what you meant by needing to check byte by byte to get -1 or +1. With your example:

const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) {
        /* Found a mismatch. */
        if ( *--puptrOne > *--puptrTwo ) {
            return +1;
        }
        else {
            return -1;
        }
    }
}
return 0;
There's no byte-by-byte checking required. As soon as you find a mismatch it takes a single comparison, regardless of how many bits you're checking at a time.


Kian thinks std::memcmp() is so basic, “It’s just comparing bytes until it finds a discrepancy,” but if he had taken the time to write his own routine he would have found it to be at least 4 times slower than std::memcmp().

Actually, on my first reply to you I clarified:

(not sure if it compares byte by byte or word by word or whatever, but the case is the same regardless of how it does it)

I'm using byte to stand in for whatever your largest integral type is, since regardless of the size of your comparisons, the algorithm is the same. I didn't want to get into architectural details like how many bytes you can check per instruction. The effect is that you are checking for the first bit that differs. A byte just lets you check 8 bits at a time, a 32 bit pointer 32 bits at a time and a 64 bit pointer 64 bits at a time.

I did agree that returning true or false was faster, I just didn't perhaps entirely understand what you meant by needing to check byte by byte to get -1 or +1. With your example:

const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) {
        /* Found a mismatch. */
        if ( *--puptrOne > *--puptrTwo ) {
            return +1;
        }
        else {
            return -1;
        }
    }
}
return 0;
There's no byte-by-byte checking required. As soon as you find a mismatch it takes a single comparison, regardless of how many bits you're checking at a time.


That's actually more checking than you need. You'll note that memcmp (and several other C comparison functions) does NOT return -1, 0, and 1. They return <0, 0, or >0. This is because they are most likely doing the following:

auto result = value2 - value1;
return result;
No conditionals. If you're having to iterate, then you only need to do a single compare of the result against 0 and return if the result is non-0. That's going to be generally faster than two conditionals and exact -1, 0, 1 return values.


I did agree that returning true or false was faster, I just didn't perhaps entirely understand what you meant by needing to check byte by byte to get -1 or +1. With your example:

And herein lies your problem.
Think about what std::memcmp() returns.
It returns a negative or positive value based on the first byte that mismatches.

A value greater than zero indicates that the first byte that does not match in both memory blocks has a greater value in ptr1 than in ptr2 as if evaluated as unsigned char values; And a value less than zero indicates the opposite.


As Bacterius correctly mentions, you can compare all 4 or 8 bytes at a time on big-endian machines and return the correct value, but I chose the 2 arrays I did intentionally to illustrate the problem you have with little-endian machines (Windows, iOS, Android, Mac OS X…).

If you cast pu8One and pu8Two to DWORD’s for the compare you would find a difference between them on reinterpret_cast<DWORD *>(&pu8One[4]) and reinterpret_cast<DWORD *>(&pu8Two[4]) (let’s call those addresses pdwOne and pdwTwo).
(*pdwOne) == 117835012 (0x07060504).
(*pdwTwo) == 101123332 (0x06070504).
This means you would return 1, when in fact you should return -1 (on a per-byte compare, the first non-matching byte is greater in pu8Two).
 
I am not going to just give you the solution.  Coming up with the correct solution on your own is good training, and is exactly the point of challenging yourself to rewrite seemingly mundane things such as memory compares, string compares, string lengths, etc.
 
 
This doesn’t directly answer the original poster’s question as to why big names rewrite things, but it illustrates that reinventing wheels, even when not necessarily providing a real-world application, is a great way to practice critical thinking and learn computer-specific tricks that will help you forever into the future (or at least until you stop programming).
You quickly brushed off memcmp as trivial, but if you had ever taken the time to try it for yourself you would find its hidden complexities lie in its subtleties.
 
 
L. Spiro

Edited by L. Spiro


No conditionals. If you're having to iterate, then you only need to do a single compare of the result against 0 and return if the result is non-0. That's going to be generally faster than two conditionals and exact -1, 0, 1 return values.

Ah, you're right. I didn't go back to check the spec for memcmp.

You quickly brushed off memcmp as trivial, but if you had ever taken the time to try it for yourself you would find its complexities lie in its hidden subtleties.

I didn't brush it off as trivial. I was confused by you saying that you would need to check byte-by-byte after finding a mismatch. Yes, little endian vs big endian is something I don't often think about, since I'm not generally working at that level.

If we have the pattern 0x01 0x02 0x03 0x04, in a little endian architecture, and compare against 0x02 0x02 0x03 0x03, we would want this to find the second pattern to be larger. But when we read it, they get interpreted as 0x04 0x03 0x02 0x01 and 0x03 0x03 0x02 0x02. So the processor will think the second pattern is smaller.

I suppose that I would then do
const UINT_PTR * puptrOne = reinterpret_cast<const UINT_PTR *>(pu8One);
const UINT_PTR * puptrTwo = reinterpret_cast<const UINT_PTR *>(pu8Two);
for ( UINT_PTR I = 0; I < 8 / sizeof( UINT_PTR ); ++I ) {
    if ( (*puptrOne++) != (*puptrTwo++) ) {
        /* Found a mismatch. */
        auto rPtrOne = reinterpret_cast<const unsigned char*>( puptrOne );
        auto rPtrTwo = reinterpret_cast<const unsigned char*>( puptrTwo );
        // For a UINT_PTR of size 4. I could do a bit of template or macro magic to have it choose at compile time
        // something appropriate for a size 8 UINT_PTR.
        // Rebuild the mismatching word with the first byte in memory as the most significant byte.
        UINT_PTR reverseValueOne = (UINT_PTR( rPtrOne[-4] ) << 24) | (UINT_PTR( rPtrOne[-3] ) << 16) | (UINT_PTR( rPtrOne[-2] ) << 8) | UINT_PTR( rPtrOne[-1] );
        UINT_PTR reverseValueTwo = (UINT_PTR( rPtrTwo[-4] ) << 24) | (UINT_PTR( rPtrTwo[-3] ) << 16) | (UINT_PTR( rPtrTwo[-2] ) << 8) | UINT_PTR( rPtrTwo[-1] );
        return reverseValueOne - reverseValueTwo;
    }
}
I suppose checking byte by byte also works. I'd need to check how the ifs compare, though I generally believe branching logic is much more expensive than following a single path.


Since we've derailed to the point of discussing memory comparison...

 

What about alignment? Imagine I give you starting pointers that are not properly aligned for the architecture; let's say the first is 3 bytes above a 16-byte boundary and the other is 9 bytes above one. Now by blindly treating the memory as larger values you are taking a nasty penalty.  Wrong alignment could destroy your performance far worse than a few additional comparisons.

 

What about the locations being read from? How far apart are they? If by some horrible luck your two blocks are located at a multiple of LLC_size/cache_page_size from each other, at offsets that don't work with the chip's cache associativity, suddenly the chip's cache is completely useless. Every comparison causes a cache invalidation, and cache thrashing will far outweigh just about any other performance problem.

 

How long are the chunks of memory to be compared? Once the length is sufficiently large the overhead of bigger parallelization, either across a single chip or across multiple chips, will eventually cross the threshold and become faster solutions.  ... but the overhead of detecting the situation will require several cycles which are not necessary for very small memory sizes.

 

It is really hard to write code that works well in all cases. If you have some specialized knowledge you can usually write some limited algorithm that works much faster than a general purpose algorithm.  

 

You might be able to craft an algorithm that works best because you know the memory does not overlap, is aligned exactly to a 64-byte boundary, is an exact multiple of 64 bytes in length, and has a total length of at most 4096 bytes. With all those guarantees you can blow away the performance of a general purpose memory comparison function.
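For illustration, a sketch of the kind of specialised, equality-only compare those guarantees allow, assuming both buffers are 64-byte aligned, non-overlapping, and a non-zero multiple of 64 bytes long (an example of the idea, not code from any particular library):

#include <cstddef>
#include <cstdint>

// Relies entirely on the guarantees above: aligned, non-overlapping buffers whose
// length is a multiple of 64, so there is no tail handling and no alignment check.
bool Equal64ByteBlocks( const void * pOne, const void * pTwo, std::size_t bytes ) {
    const std::uint64_t * a = static_cast<const std::uint64_t *>( pOne );
    const std::uint64_t * b = static_cast<const std::uint64_t *>( pTwo );
    for ( std::size_t i = 0; i < bytes / sizeof( std::uint64_t ); ++i ) {
        if ( a[i] != b[i] ) {
            return false;   // Early out on the first differing word.
        }
    }
    return true;
}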

 

 

 

Generally it is faster and easier for programmers to rely on other people's code.

 

You do it all the time.

 

You use the Windows libraries to handle your disk loads rather than writing huge amounts of code to directly handle every possible type of disk driver, attached through any type of connection from SCSI to IDE or SATA to PATA to a bunch of chained USB devices connecting to an arbitrary storage device. You just use FileOpen and let the third party code do the work.

 

You use Direct3D or OpenGL libraries to handle all the graphics work, rather than writing huge amounts of code for every kind of card and chipset that will map memory windows and transfer card-specific data for everything.

 

I remember back in the bad old days having to do much of that myself as a hobby developer in the late 1980s.  Detecting an official SoundBlaster 16 Pro card usually worked, unless the user had one of a long list of inferior cards where detecting it incorrectly could freeze the computer. But maybe they had one of the Turtle Beach cards, so you needed to account for those nicer cards either directly or by using their emulation layers.  Then for graphics you needed to code directly to various EGA banks using a standard set of commands ... except for a short list of incompatible cards that didn't precisely implement the standard.

 

 

 

Use third party libraries, unless you have a good reason to use your own.  Educational reasons are a good reason to use your own.  Documented performance reasons are a good reason to use your own.  Knowledge of the underlying systems is a good reason to use your own. Your ability to make guarantees is a good reason to use your own.

 

The original question was about Unreal using a collection of their own libraries that were similar to (but different from) several existing implementations.  The reason is that Epic, for whatever reason, decided to implement their own. Usually game studios don't rewrite standard libraries unless they have some specific reason to do so. We can speculate for their reasons, or you can look up the exact implementation, see if they left any comments describing the motivation, and maybe even ask on their support forums if you still don't see it.

Edited by frob

I agree. This is especially true for your language's standard libraries. Whoever implemented them for your system will have more information about the environment than you do, and they don't need to worry about portability. If you try to roll your own, however, you need to make sure it runs on every platform you might want to support. It can be a good learning experience, but I wouldn't use it in production code.

