Safety vs Efficiency

Started by
30 comments, last by Ravyne 8 years, 9 months ago
I am not sure. If you call abort() you essentially block everyone until you have checked in a fix. Blocking all your co-workers is not a good idea. Losing work and blocking people are both expensive, and neither makes sense to me. Chances are that your bug does not even affect most of the team, so why penalize them as well?

In our system, when an assert is triggered the user can decide to skip this assert once or always. They can then also decide to send out an email with a description of the error. The expectation is to be able to continue in that case. E.g. if a collision routine fails, it is OK for the entity to fall out of the world. I don't see a reason to crash in such a case.
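A minimal sketch of that kind of skippable assert, to make the idea concrete. This is not the poster's actual system: `SoftAssert`, `AssertChoice` and the prompt hook are all made-up names, and the "dialog" is just a function pointer so the sketch can run unattended.

```cpp
#include <cstdio>
#include <set>
#include <string>

// Hypothetical choices a developer could pick in an assert dialog.
enum class AssertChoice { Break, IgnoreOnce, IgnoreAlways };

// Hypothetical hook: a real tool would show a dialog (and maybe offer to send
// an email). Here it is a plain function pointer returning a fixed answer.
using AssertPrompt = AssertChoice (*)(const char* expr, const char* file, int line);

inline AssertChoice defaultPrompt(const char*, const char*, int) {
    return AssertChoice::IgnoreOnce;
}

struct SoftAssert {
    AssertPrompt prompt = defaultPrompt;
    std::set<std::string> ignored;  // call sites muted via "ignore always"
    int reports = 0;                // how many failures were actually reported

    // Returns true if the caller should carry on, false if it should stop.
    bool check(bool ok, const char* expr, const char* file, int line) {
        if (ok) return true;
        const std::string site = std::string(file) + ":" + std::to_string(line);
        if (ignored.count(site) != 0) return true;  // muted earlier
        ++reports;
        std::fprintf(stderr, "ASSERT FAILED: %s at %s\n", expr, site.c_str());
        switch (prompt(expr, file, line)) {
            case AssertChoice::IgnoreAlways: ignored.insert(site); return true;
            case AssertChoice::IgnoreOnce:   return true;
            case AssertChoice::Break:        return false;
        }
        return false;
    }
};
```

The failure is always reported the first time; only the decision to stop is left to whoever hit it, which is the behaviour described above.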


I am not sure. If you call abort() you essentially block everyone until you have checked in a fix. Blocking all your co-workers is not a good idea. Losing work and blocking people are both expensive, and neither makes sense to me. Chances are that your bug does not even affect most of the team, so why penalize them as well?

Fail fast and fail loud, there is no other way :) In my opinion, anything short of a crash is not loud enough. Users will never let you know about the error, or even read the message on the screen.

If you ship with a bug that affects EVERYONE, then you should be punished :) If you ship with a bug that occurs in an edge case, then it hardly blocks everyone. Even then, you can implement auto-save or similar functionality. Anyway, by the time you hit a crashing bug, like a pointer that points to a different structure than it should, you are already in an unstable and undefined state. Trying to save at this point may actually cause lost work, as you don't know what you are saving.


Redesign the way this is used so that the check is never needed

generally speaking, this is probably the best way to code anything - if possible.

theoretically, validation of user input is the only necessary check, assuming the coders do their job correctly.

Norm Barrows

Rockland Software Productions

"Building PC games since 1989"

rocklandsoftware.net


Fail fast and fail loud, there is no other way :)

I recommend rethinking this. I do quite a bit of interviewing these days, and saying something like this in a phone screen or interview would be a dark red flag. I don't know what experience you have in professional development, but maybe think less about shipping and more about daily development in large teams of maybe 100 people. If you crash loud and hard on every little bug because of not programming defensively I don't see such a team working effectively.

Personally, I can't live without the SDL_assert. The feature of having a button to ignore once and another to ignore all future ones is so helpful. You can just skip it. But it's there if you need to know.

As for failing hard, I tend to fail silently in production, and fail with a debug log error that makes it easy to find and fix the error. If I ever get an error and it takes me longer than 5 minutes to figure out what happened, I tweak the message so it's easier to find next time.

Imagine that you've got some alpha versions being tested. You can't just have an abort() every time something goes wrong. Sometimes, knowing that the error happened once or every frame is just the data point you need.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

If you crash loud and hard on every little bug because of not programming defensively I don't see such a team working effectively.

Crashing loud and hard on invalid inputs *is* defensive programming.

Wasting 8 hours debugging incorrect outputs because your co-worker decided to silently ignore invalid inputs - that's the kind of thing that rips a team apart. If he'd thrown an exception, or abort()'d with a suitable error message, I'd instantly know where the bug was.

Edit: I'm not suggesting you ship a build to your customers which crashes every time they move their mouse. But silent errors are never ok - even if you decide to continue running in your production build, you still need to log a stacktrace before doing so.
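A rough sketch of the "never silent" rule: every failed check is logged with its location, and a build switch decides whether the program then aborts or carries on. `GAME_CHECK`, `GAME_FAIL_HARD`, `reportFailure` and `safeDivide` are hypothetical names, not from any library. A real build would also capture a stack trace at the logging point, which is platform-specific and left out here.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical build switch (not a standard macro):
// 1 = abort after logging (internal builds), 0 = log and continue.
#ifndef GAME_FAIL_HARD
#define GAME_FAIL_HARD 0
#endif

static int g_loggedFailures = 0;  // lets tools or tests see that something was logged

inline void reportFailure(const char* expr, const char* file, int line) {
    ++g_loggedFailures;
    std::fprintf(stderr, "[error] check failed: %s (%s:%d)\n", expr, file, line);
    // A stack trace would be captured here; omitted because it is platform-specific.
#if GAME_FAIL_HARD
    std::abort();
#endif
}

// Evaluates to true when the condition holds; otherwise logs and yields false.
#define GAME_CHECK(cond) \
    ((cond) ? true : (reportFailure(#cond, __FILE__, __LINE__), false))

// Example call site: the bad input is never ignored silently.
inline float safeDivide(float a, float b) {
    if (!GAME_CHECK(b != 0.0f)) return 0.0f;  // logged first, then a fallback value
    return a / b;
}
```

Either way the call site is identical; only the build flag decides how hard the failure is.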

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Salty Boyscouts, have you read 'The Pragmatic Programmer'? It talks about this very issue.

Beginner in Game Development?  Read here. And read here.

 

I think we are on the same page here. Loud to me means triggering the assert. This should be a serious enough problem to fix, and I personally feel better fixing something without the pressure of blocking other people.

The SDL assert is a great example and we use this here at work as well. Then the error is communicated to the team. For me this is loud enough. Just don't terminate the program without reason if you don't need to.

I don't want to sound compulsive here. My point is just to think twice before you potentially unnecessarily pull the plug for someone else in your team. I am also thinking more on the daily production cycle than shipping. :)

I just skimmed the thread, but I'm surprised nobody seems to have categorized this as a trust boundary problem yet.


If your API exists at a trust boundary, you damn well better not crash on bad inputs. In fact, you have a sacred duty to the customer to ensure that you are robust in the face of garbage - or even malicious - data forced into your API. At a trust boundary, unit testing is only one part of the reliability picture; you also need fuzz testing and penetration testing. This applies whether you are working in truly "secure" software requirements land, or just writing an open source library. If there is a difference in how much you trust your caller versus how much you trust your callee, you have a trust boundary, and you better be robust in the face of bad data.

On the flip side, if you're contained in a consistent layer of trust, e.g. internal calls that outsiders do not have access to, fail hard and fail fast.


Always know what the failure mode of your code should be. Anyone telling you it's always the same is lying, naive, or both. When people don't understand trust boundaries and how they relate to failure modes, we get software exploits and compromised systems.
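One way to picture the two failure modes side by side. This is only a sketch under assumptions: the packet layout, `parsePacket` and `checksum` are invented for illustration. The boundary function rejects garbage with an error return, while the internal helper treats a broken invariant as a bug and asserts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wire format: [1-byte length N][N payload bytes].
struct Packet {
    std::vector<std::uint8_t> payload;
};

// Trust boundary: the bytes come from outside, so bad data is rejected
// with an error return and must never crash the process.
inline bool parsePacket(const std::uint8_t* data, std::size_t size, Packet& out) {
    if (data == nullptr || size == 0) return false;
    const std::size_t declared = data[0];
    if (declared != size - 1) return false;  // truncated or trailing garbage
    out.payload.assign(data + 1, data + size);
    return true;
}

// Inside the boundary: callers are trusted code, so a violated invariant
// is a bug in that code. Fail hard and fast.
inline std::uint32_t checksum(const Packet& p) {
    assert(p.payload.size() <= 255 && "payload must have come through parsePacket");
    std::uint32_t sum = 0;
    for (std::uint8_t b : p.payload) sum += b;
    return sum;
}
```

The boundary code is also what you would point a fuzzer at; the internal helper only needs ordinary unit tests.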

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

"Fail loud and hard" is wise advice, but I think it really depends on the circumstances. I have worked on systems where failing loud and hard is the preferable choice, but I have also worked on parts that need to fail silently and log the errors in the background, or at least throw a loud warning.

Imagine a scenario where you are making an MMORPG with continuous levels. Depending on the player's position in the game world, you load the adjacent segments, along with all their entities, enemies, etc. However, somebody fucked up placing the entities so that one of them just happens to be inside an enemy, and due to the way the code has been written, this causes the entity to get stuck in the enemy and move along with it.

This error is not catastrophic. Having a tiny butterfly entity stuck inside an ogre's belly isn't that big of a deal; it doesn't break the entire game experience, and the player probably wouldn't notice. Should you fail hard on this type of error? Failing hard here means the game crashes badly and disconnects the players from the game world. That means no player will be able to explore this area at all, just because a tiny butterfly ornament happened to spawn inside an ogre.

So I don't think there is a universal answer to this. An error should indeed be reported loudly, but whether it should fail hard depends on the nature of the error. Does failing hard disrupt the user's experience and flow? Does an error in a single entity matter so much that it should bring the entire level loading system down with it?
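As a sketch of what "loud but not hard" could look like in a segment loader: a badly placed decorative entity is logged and skipped, and only a bad essential entity fails the load. Everything here is assumed for illustration, including the `essential` flag and the bounds check that stands in for the overlap problem described above.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Entity {
    std::string name;
    float x, y;
    bool essential;  // hypothetical flag: the level cannot work without it
};

enum class LoadResult { Ok, OkWithWarnings, Failed };

// Hypothetical validity rule: the entity must sit inside [0, segmentSize).
inline bool isPlacementValid(const Entity& e, float segmentSize) {
    return e.x >= 0.0f && e.x < segmentSize && e.y >= 0.0f && e.y < segmentSize;
}

// Loads valid entities into `out`. Bad placements are always logged; they only
// abort the load when the entity is essential. On Failed, `out` may hold the
// entities loaded before the failure.
inline LoadResult loadSegment(const std::vector<Entity>& src, float segmentSize,
                              std::vector<Entity>& out) {
    LoadResult result = LoadResult::Ok;
    for (const Entity& e : src) {
        if (isPlacementValid(e, segmentSize)) {
            out.push_back(e);
            continue;
        }
        std::fprintf(stderr, "[error] bad placement for '%s' at (%g, %g)\n",
                     e.name.c_str(), e.x, e.y);
        if (e.essential) return LoadResult::Failed;  // this one does matter
        result = LoadResult::OkWithWarnings;         // skip it, keep loading
    }
    return result;
}
```

The butterfly gets dropped with a log line and the area stays playable; the content team still finds out about it.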

This topic is closed to new replies.
