QCon SF 2011: And It All Went Horribly Wrong: Debugging Production Systems

Intro: Maurice Wilkes quote:

"As soon as we started programming, we found out to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."

Presentation slides here.

Debugging through the ages:

Production systems are more complicated (abstraction, componentization, etc.) and less debuggable! When something goes wrong, everything is opaque and it becomes more and more difficult to fix...

How have we made it this far?

we architect ourselves to survive component failure
we forced ourselves into stateless tiers
where there is state, we considered semantics (ACID, BASE) to increase availability
redundant systems
clouds (especially unreliable ones, like Amazon) have extended the architectural imperative to survive data-center failure

Do we still need to care about failure?

Single-component failures still have significant costs (both economic and run-time)... but most dangerously, a single component failure puts the whole system in a more vulnerable mode in which further failures are more likely to happen... This is a cascading failure, and it is what induces failures in mature and reliable systems.
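
A back-of-the-envelope sketch of why one failure leaves the system more vulnerable. It assumes a hypothetical set of redundant replicas that fail independently with the same probability over some time window; neither the numbers nor the independence assumption come from the talk.

```python
def outage_probability(replicas: int, p_fail: float) -> float:
    """Probability that every remaining replica fails in the same window.

    Assumes independent failures (an illustrative assumption only).
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return p_fail ** replicas


if __name__ == "__main__":
    p = 0.01  # hypothetical per-replica failure probability
    healthy = outage_probability(3, p)   # all three replicas still up
    degraded = outage_probability(2, p)  # one replica already lost
    print(f"healthy: {healthy:.2e}, degraded: {degraded:.2e}")
    print(f"risk multiplier after one failure: {degraded / healthy:.0f}x")
```

Under these assumptions, losing one replica does not take the system down, but it multiplies the chance of a full outage by 1/p. Real failures are often correlated, which makes the degraded mode even riskier than this sketch suggests.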

Cascaded failure example 1:

An example: a bridge in Tampa Bay collapsed after a boat with full ballast (which was not supposed to happen) hit the bridge (which was not supposed to happen either), and the bridge had been built by a crooked contractor who played with the sand/cement ratio in the concrete to save money (which was not supposed to happen either - after all, this is Florida, not Quebec!). In the end, it took all of those "unlikely" events occurring together to bring the bridge down.

Wait, it gets worse

this assumes that the failure is fail stop
if the failure is transient, a single component failure can alone induce system failure
monitoring attempts to get at this by establishing liveness criteria for the system - and allowing the operator to turn a transient failure into a fatal failure...
... but if monitoring becomes too sophisticated or invasive, it risks becoming so complicated as to compound failure.
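
The liveness idea above can be sketched as a small watchdog. This is a minimal illustration, not something shown in the talk: the `Watchdog` class, its `timeout_s` parameter and the injectable `on_dead` and `clock` hooks are all hypothetical names. By default a missed deadline calls `os.abort()`, which turns a silent stall into a fatal failure; whether a core dump is actually written depends on the operating system's settings.

```python
import os
import time
from typing import Callable


class Watchdog:
    """Declares a component dead when it has not reported a heartbeat recently."""

    def __init__(self, timeout_s: float,
                 on_dead: Callable[[], None] = os.abort,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_dead = on_dead  # default: abort the process (fatal failure)
        self.clock = clock      # injectable so the check can be tested
        self.last_beat = clock()

    def heartbeat(self) -> None:
        """Called by the monitored component whenever it makes progress."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Return True if live; otherwise escalate the stall to a fatal failure."""
        if self.clock() - self.last_beat > self.timeout_s:
            self.on_dead()
            return False
        return True
```

The check is deliberately tiny: the more logic the monitor carries, the more likely it is to become a source of failure itself.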

Cascaded failure example 2:

An image of 737 rudder PCU schematic (details here). Another example of a cascaded failure that led to B737 landing issues.

Debugging in the modern era

- Failure - even of a single component - erodes overall system reliability

- When a single component fails, we need to understand why and fix it

Debugging fatal component failure

when a component fails fatally, its state is static and invalid
by saving that state (the contents of DRAM, for example) to stable storage, the component can be debugged postmortem
one starts with the invalid state and proceeds backward to find the transition from a valid state to an invalid one
this technique is old: core dumps
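
A rough analogue of the same idea for a dynamic environment, sketched in Python. This is not a core dump and not a tool from the talk: the helper names `snapshot` and `install_dump_hook` are hypothetical, and the sketch records only the stack and a `repr()` of each local variable at the moment of an uncaught exception.

```python
import json
import sys
import traceback


def snapshot(exc: BaseException) -> dict:
    """Collect the stack and locals of a failed program, innermost frame last."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": tb.tb_lineno,
            # repr() keeps the dump serializable for arbitrary objects
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return {"error": repr(exc), "frames": frames}


def install_dump_hook(path: str) -> None:
    """On an uncaught exception, save state to `path` before the process exits."""
    def hook(exc_type, exc, tb):
        with open(path, "w") as out:
            json.dump(snapshot(exc), out, indent=2)
        traceback.print_exception(exc_type, exc, tb)
    sys.excepthook = hook
```

Reading the saved file from the last frame toward the first mirrors the postmortem method described above: start at the invalid state and work backward to where it went wrong.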

Postmortem advantages

no run-time systems overhead
debugging can occur at any time, in parallel with the production system
tooling can be very rich, since it imposes no overhead on the running system

Cascaded failure example 3:

Flight data recorder from the Air France crash (flight 447), found nearly two years after the crash

This recovery definitely permitted postmortem analysis

Postmortem challenges

need the mechanism for saving state on failure (ex: core dumps)
must record sufficient state (program text + program data)
need sufficient state present in DRAM to allow for debugging
must manage state such that storage is not overrun by a repeatedly pathological system
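
The last point can be sketched as a simple retention policy: keep only the newest few dumps so that a component crashing in a loop cannot fill the disk. The function name `prune_dumps` and the `keep` parameter are hypothetical, and real systems may instead rely on operating system facilities for this.

```python
from pathlib import Path


def prune_dumps(dump_dir: str, keep: int) -> list:
    """Delete all but the `keep` most recently modified files in `dump_dir`.

    Returns the names of the files that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    files = [p for p in Path(dump_dir).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)  # newest first
    removed = []
    for stale in files[keep:]:
        stale.unlink()
        removed.append(stale.name)
    return removed
```

Calling this after each dump is written bounds the storage used, at the cost of discarding older evidence.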

Conclusion:

these challenges are real but surmountable - as in some open source systems presented below (MDB, node.js, DTrace, etc.)

Postmortem debugging: MDB

a debugger in the illumos OS (a Solaris derivative)
extensible with custom debugger modules
well advanced for native code, but much less so for dynamic environments such as Java, Python, Ruby, JS, Erlang...
if components going into infrastructure are developed in those languages, it is critical that they support postmortem debugging.

Postmortem debugging: node.js

not really of interest outside JavaScript...
debugging a dynamic environment requires a high degree of VM specificity in the debugger...
see all details on dtrace.org/.../nodejs-v8-postmortem-debugging

Debugging transient component failure

a fatal failure, despite its violence, can be root-caused from a single failure
a non-fatal failure is more difficult to compensate for and to debug

QCon SF 2011

Thursday, November 17, 2011
