Fail early, fail wisely

Riding my bike home, I often reflect on the day’s work. Part of today had me fixing code that imports data from one system into another. Better said: it should have imported data, but it didn’t anymore.

Drilling into the code handling the import and reading the system’s logs, it soon became clear what caused the problem. The XML data from the source system was malformed, halting the import and making the code responsible for handling the data say “fawk it, I’m out”. This is good. The system shouldn’t try to juggle garbage around until it magically validates and parses. It’s the pragmatic programmer’s mantra ‘fail early, fail often’, right?

Most of my work revolves around getting data from system A to system B. Every day I get tempted to manipulate bad input into an acceptable format, trying to keep everything afloat. Resisting that urge, and failing early instead, is always better. It pays off in the long run. Both system A and system B should work as expected; wrapping duct tape around incorrect input both creates and allows unexpected behaviour on both sides. Basically, you’ve started digging that hole you won’t be able to climb out of in due time. Failing is important, but handling failure is even more so.
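To make that concrete, here’s a minimal sketch of what failing early looks like for an import like this, assuming Python and its standard xml.etree parser (the function name is made up; the real code looks different):

```python
import xml.etree.ElementTree as ET

def parse_source_export(xml_text: str) -> ET.Element:
    """Parse the export from system A, refusing to guess at broken input."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Fail early: don't 'repair' the payload into something that merely
        # looks valid. Stop here and report exactly what was wrong.
        raise ValueError(f"Source data is not well-formed XML: {exc}") from exc
```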

My point in this case: it’s not enough to fail early and often. You should fail wisely. This problem needed a developer (me) digging through code to find the cause. There were application logs, but they’re pretty crowded, and once you find the error it simply says the job couldn’t be finished. There are specific logs for the import, but non-technical end users can hardly tell from them what’s going on.

I’m still not sure how to handle this particular case, but one thing is clear. Failing early is pretty easy: be strict. Failing often will happen by default when you separate your code and keep failing early. Failing wisely is the difficult part. A few questions to get you started (a rough sketch of what I mean follows the list):

  • What does your system have in place for handling the failure?
  • Who will be notified / who keeps track of failures?
  • What other parts depend on your failure?
  • How will end users understand your failure?
  • What will be needed to debug the failure?
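For the import above, failing wisely could look roughly like this sketch, building on the parser from the earlier snippet; the loader and the notification hook are placeholders for whatever your setup actually uses:

```python
import logging

log = logging.getLogger("import.system_a")

def load_into_system_b(document) -> None:
    """Placeholder for the actual loader into system B."""

def notify_functional_owner(message: str) -> None:
    """Placeholder for whatever notification fits: mail, chat, a ticket..."""
    log.critical(message)

def run_import(xml_text: str) -> None:
    try:
        document = parse_source_export(xml_text)  # fails early on malformed input
        load_into_system_b(document)
    except ValueError as exc:
        # Fail wisely: record what failed and why, with enough context to debug,
        # and tell someone about it in words an end user can act on.
        log.error("Import from system A aborted: %s", exc, exc_info=True)
        notify_functional_owner("The data from system A was rejected; nothing was imported.")
        raise
```

Nothing fancy, but it answers most of the questions above: the failure is recorded with context, someone hears about it, and the steps that depend on the import simply don’t run.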

During a previous bike ride an idea dawned on me: Logging SaaS. Basically you’d have a central place which records all your errors, exceptions and the like through its API. Options for notifying people, severity-based actions, grouping of errors, integration with other systems, you name it. Turns out it’s already here and it’s called Sentry. Too bad, but at least I know it was a good idea. As an added bonus, their homepage also makes this entire TL;DR article obsolete by saying “Shit happens, be on top of it”.
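For what it’s worth, wiring something like Sentry into the import above only takes a few lines with their Python SDK; the DSN below is a placeholder, and run_import comes from the earlier sketch:

```python
import sentry_sdk

# Placeholder DSN; the real one comes from your Sentry project settings.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def import_and_report(xml_text: str) -> None:
    try:
        run_import(xml_text)  # from the sketch above
    except Exception as exc:
        # Ship the failure to a central place where it can be grouped,
        # given a severity and routed to the right person.
        sentry_sdk.capture_exception(exc)
        raise
```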