Error handling: To check every little error or confidently skip them?

13 comments, last by taby 5 months, 1 week ago

Say I have these functions as examples:

...
int result = pthread_mutex_init(&mutex, NULL);
int count = printf("I return negative value on error!");
...

As you can see, functions like these (and others in a similar fashion) have their own return-value conventions that indicate errors, success, or an actual result.

My question is: to what extent do I need to check every little error for functions like these? Should I check only whether they succeeded, or skip the checks entirely? For example, many lines in android_native_app_glue.c leave the mutex functions completely unchecked. Is that right, or at least okay? Either way, why?

I honestly don't know the “moral compass” on this, at least by industry standard (of any industry, if different use cases lead to different answers), hence the question. 🙂


Definitely check anything that has any reasonable likelihood of failing, such as file access, or interfacing with the graphics API, or audio devices. I usually try to “bubble up” any errors to the wrapping function, so that they can be handled by code with more context about how to recover from the error. In my mutex wrapper I return a bool from the lock()/unlock() functions, though I don't believe I ever check that return value. On Windows, the critical section functions don't even have return values, so I assume they always succeed.

There are a few ways to handle errors:

  • assert() - this is good for high-performance code or for things that are only likely to fail due to programmer error, such as passing in invalid arguments to a function. These asserts can be omitted in release/optimized builds so that there is no performance hit in released code. A failing assert crashes the program and tells you where it failed, so that you can correct the programming error. Don't use asserts for things that can fail for other reasons. I have asserts turned on during development (even in optimized builds) to catch errors.
  • Return codes/bools - I use these for things that are more likely to fail due to circumstances outside the control of the programmer, such as if a file doesn't exist. With this method, you need to be careful to propagate errors up the call stack manually. Generally I find bool to be better than integer return codes for this. If you need to indicate a specific type of error occurred, print to a centralized logging system.
  • Exceptions - not recommended, due to potential for unexpected behavior. Exceptions pollute your code with try/catch nonsense, and bloat your executable. In code that uses exceptions, there is the possibility that any function you call could throw, which means to be safe you would need try/catch almost everywhere to clean up stuff to avoid resource leaks. To some extent you can avoid this with the RAII paradigm, but it's no panacea. Most game developers and other writers of high-performance code mandate that exceptions be disabled at the compiler level, to avoid their problems and overhead. I saved like 15% on binary size by turning off exceptions, even though I don't use them at all.

In my codebase I heavily check errors for most 3rd party libraries and APIs, and turn those into a simple bool + log if necessary, for ease of error propagation. Usually errors propagate up no more than a few function calls before they can be recovered from, if your code is designed well.

Another pattern that can be useful in the “return code” paradigm is to create a “result” class, which is a simple wrapper around an integer return code or enum. This class has an operator for conversion to bool, which makes it easy to see if the function succeeded, yet allows the specific error to also be known:

class Result
{
public:
    enum { SUCCESS = 0, ERROR, SILENCE };  // example codes; ERROR is used below

    Result( int newCode ) : code( newCode ) {}
    
    // True for anything other than ERROR (e.g. SUCCESS or SILENCE).
    operator bool () const
    {
        return code != ERROR;
    }
    
    int code;
};

I use this in my audio engine, where the process() function on an audio effect can return a few different values depending on what happened: SUCCESS, ERROR, SILENCE. SILENCE is not an error, and indicates that the audio output of the effect should be assumed to be all 0.

Looking at that android code, I'd say it doesn't have enough error checking, in fact there is almost none.

@Aressera So if I can summarize:

  • Assert on errors that can be made by the programmer, so they're caught on the spot. This can stay on in development/optimized builds.
    I think I've done this, but my boundary for using this method drifted into code beyond my control, so I guess that leads to…
  • Return codes/bools for failures that are outside the control of the programmer, propagated up carefully.
    This is probably the boundary I have to remind myself of: when to assert and when to return, which should answer the question now. Also, I'm thinking of using an enum, like Vulkan's VkResult, for example.
  • Exceptions are not recommended.
    I have -fno-exceptions on my test/release builds, so I guess we're on the same page, but it's good to know the actual effect on binary size of doing so.
  • Definitely check anything that has any reasonable likelihood of failing.
  • I may assume a function always succeeds if it doesn't give me any way to handle failure.

Aressera said:

Looking at that android code, I'd say it doesn't have enough error checking, in fact there is almost none.

I knew it! Thanks for looking through it. I keep wondering what kind of confidence you'd need to write code almost without checks like that, relying mostly on trace logs. It seems I have to be careful about learning from code like that.

Aressera said:
Exceptions - not recommended

Floating point exceptions can be very useful. In the physics engine I'm using they are enabled by default (maybe only in debug mode). So whenever some NaN or inf pops up, you see the culprit immediately.
I had to disable this, because my code has too many numerical issues, and for some things infinite numbers just work. I've also learned that things like graphics drivers spill some NaNs internally as well.
But if I had used FP exceptions myself from the start, my code would be better and more robust. So I hope that, after fixing all my issues, I can adopt this practice. During development it seems a handy debugging tool.

arly said:

Say I have these functions as examples:

...
int result = pthread_mutex_init(&mutex, NULL);
int count = printf("I return negative value on error!");
...

I can't make much sense of the printf line, but personally I use a thread-safe version of it (printf often is thread-safe already, but I'm not sure, so I'd rather use a mutex).
I also make sure the file on disk is updated after each print, so if the application crashes I don't miss the latest messages.
And I have a limit on the number of lines to print, so I don't fill my entire HDD with some error which appears too often.
Looks like that:

#include <cstdarg>	// va_list
#include <cstdio>	// vprintf, fflush
#include <mutex>

void SystemTools::Log (const char *text, ...)
{
	static std::mutex mutex;
	std::lock_guard<std::mutex> lock(mutex);

	// stop after a fixed number of lines, so a spammy error can't fill the disk
	static int maxLines = 50000;
	if (maxLines <= 0) return;
	maxLines--;

	va_list vl;
	va_start (vl, text);
	vprintf (text, vl);
	va_end (vl);
	fflush (stdout);	// flush after each print, so a crash doesn't lose the latest messages
}

I feel the amount of error checking I do increases with age, but I still don't do enough of it. : )

arly said:
I honestly don't know the “moral compass” over this at least by industry standard (of any industry, if there are different use cases that lead to different answers), hence the question. 🙂

I work in a business where correctness is much more relevant than speed, and I add assertion-like checks everywhere. They basically cover all values that are not expected and/or not handled in the code. E.g. if a value should be either 1 or 2 and I dispatch on the 2 cases with an IF .. ELSE IF, I add a third ELSE CRASH(); as a check. Alternatively, you can code that as IF value == 1 … ELSE assert(value == 2); …

The same with a switch over all enumeration literals: I add a DEFAULT: CRASH(); so I will catch the case where someone extends the enumeration without updating that switch.

I have been doing that for a long time, and I found it speeds up development and results in more robust code. All the checks basically ensure that you cannot move outside any boundary, since something nearby will crash the program. Obviously a crash is somewhat inconvenient, but “nearby” is key here. Errors in values etc. don't travel to completely different and unrelated parts of the code before eventually crashing the application; otherwise it can be a long way back from the crashing point to the root cause. Not giving errors room to propagate to other code helps a lot.

The second thing is that an error in the code (computing a wrong result) is quite likely to crash the program within the first few tests, often even on the first test. At that point you still remember what you changed, and thus already have a head start in finding the culprit. This also happens when you forget to change some code. For example, you're expanding functionality and you need a new value for something. If that value then arrives at code that should handle the new value but doesn't yet, in my case it will crash there. It happens quite often that I look at a stack dump and think “Oh my, yes, that code needs to be changed too!”.

We keep all checks in place at deployment as well. We don't get many reports of crashes on those checks. In all fairness, our user base is likely much smaller than that of the average game 🙂

EDIT: These checks also extend to variables/data structures, e.g. when adding an assumed-new key/value pair to a dictionary, I check that it is indeed new.

Thanks for further replies, guys! Lots of good points I can further consider to design my code. 🙂

Also

Alberth said:
Not giving errors room for propagating to other code helps a lot.

Noted!

arly said:
I honestly don't know the “moral compass” over this at least by industry standard (of any industry, if there are different use cases that lead to different answers), hence the question. 🙂

We usually flag it in code reviews.

If a function was written in a way that indicates a potential failure, then multiply that by thousands or millions of players playing for hundreds of hours: sooner or later the error will happen, and the code will need to handle it.

Exactly how you handle it depends on the system, the results, and how critical the failure is, but that's beside the point. If you capture the failure and then comment that it's not critical, you can move along.

In your example of pthread_mutex_init() failing, that is a big problem which could result in a crash, or much worse: random data corruption. It would be extremely bad not to capture it. I really like the [[nodiscard]] attribute added in C++17 and expanded in C++20 for conditions that would be a major defect. It was added in a few places like potential memory leaks on operator new, allocate(), and launder(), and obvious bugs like an uncaptured std::async. The addition to empty(), to reveal a very common misuse, was also great. If pthread_mutex_init() were written today, it would be a good fit for [[nodiscard]], since the error is potentially hard-to-detect data corruption.

Two questions you need to ask yourself for every error condition:

  1. Is this error actually something that can happen?
  2. If so, is there anything I can do about it?

If both are true, put in a check. If not, ignore it.

Take printf for example. Can it fail? Maybe in theory, such as when stdout is redirected to a file on a USB stick and you remove the stick. Is there anything you can do about it if it fails? Not really. You could try to notify the user, but in most cases there's no point. Unless you're writing some really important information that absolutely needs to be written, you're better off just ignoring it.

Counterexample: let's say you're writing a saved game file and run out of disk space. Is this something that can happen? Very much so. Is there something you can do about it? Yes, actually. Before overwriting an existing saved game, save the game to a temporary file, then move the temporary file over the existing saved game - but only if writing the temporary file succeeds. You can't guarantee that saving the game succeeds, but you can guarantee that you won't trash your existing saved game when you run out of disk space.

Did you know that C++ has multi-threaded functionality built into the standard?

