Thursday, November 17, 2011

And It All Went Horribly Wrong: Debugging Production Systems

Intro: Maurice Wilkes quote:
"As soon as we started programming, we found out to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."

Presentation slides here.

Debugging through the ages:
Production systems are more complicated (abstraction, componentization, etc.) and less debuggable! When something goes wrong, everything is opaque and increasingly difficult to fix...

How have we made it this far?
  • we architect our systems to survive component failure
  • we force ourselves toward stateless tiers
  • where there is state, we consider semantics (ACID, BASE) to increase availability
  • redundant systems
  • clouds (especially unreliable ones, like Amazon) have expanded the architectural imperative to survive data-center failure
Do we still need to care about failure?

Single component failures still have significant costs (both economic and run-time)... but most dangerously, a single component failure puts the global system in a more vulnerable mode where further failures are more likely to happen... This is a cascading failure - and this is what induces failures in mature and reliable systems.
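
To put a rough number on "more vulnerable mode", here is a small illustrative calculation. It is not from the talk: it assumes that replicas fail independently with the same probability p over some time window, and that the system is down only when every remaining replica is down. The function name and the value of p are hypothetical.

```python
def system_failure_probability(p: float, healthy_replicas: int) -> float:
    """Probability that every remaining replica fails in the window.

    Assumes independent failures with identical probability p, which is
    an illustrative simplification and not a claim from the talk.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if healthy_replicas < 1:
        return 1.0  # nothing is left to serve traffic
    return p ** healthy_replicas


p = 0.01  # hypothetical per-replica failure probability
before = system_failure_probability(p, healthy_replicas=3)
after = system_failure_probability(p, healthy_replicas=2)
risk_multiplier = after / before  # how much worse things got after one loss
```

Under these assumptions, losing one of three replicas makes a full outage about 100 times more likely, even though the system is still "up". Real failures are rarely independent, so this most likely understates the danger.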

Cascaded failure example 1:

An example of a bridge in Tampa Bay that collapsed after a boat with its ballast full (which was not supposed to happen) hit it (which was not supposed to happen either). The bridge had been built by a crooked contractor playing with the sand/cement ratio in the concrete to save money (which was not supposed to happen either - after all, this is Florida, not Quebec!). In the end, it took all of those "unlikely" events occurring together to make the bridge fall.

Wait, it gets worse
  • this assumes that the failure is fail-stop
  • if the failure is transient, a single component failure can by itself induce system failure
  • monitoring attempts to get at this by establishing liveness criteria for the system - and allowing the operator to turn a transient failure into a fatal failure...
  • ... but if monitoring becomes too sophisticated or invasive, it risks becoming so complicated as to compound failure.
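
The liveness idea in the list above can be sketched as a heartbeat check. This is a minimal sketch, not something shown in the talk: the class name, the `on_dead` callback and the injectable `clock` are all hypothetical, and `on_dead` stands in for whatever the operator uses to force a fatal failure (for example, killing the process so that it leaves a dump behind).

```python
import time
from typing import Callable


class LivenessMonitor:
    """Declares a component dead when its heartbeat goes stale.

    Hypothetical illustration: a real monitor would run check() on a
    timer, from a separate process or host.
    """

    def __init__(self, timeout_s: float, on_dead: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_dead = on_dead
        self.clock = clock
        self.last_beat = clock()
        self.dead = False

    def heartbeat(self) -> None:
        """Called by the component whenever it makes progress."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Return True if alive; fire on_dead once if the heartbeat is stale."""
        if not self.dead and self.clock() - self.last_beat > self.timeout_s:
            self.dead = True
            self.on_dead()
        return not self.dead


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
events = []
monitor = LivenessMonitor(5.0, lambda: events.append("killed"),
                          clock=lambda: now[0])
now[0] = 3.0
monitor.heartbeat()
now[0] = 7.0
alive_at_7 = monitor.check()   # 4 s since the last beat: still alive
now[0] = 9.0
alive_at_9 = monitor.check()   # 6 s since the last beat: declared dead
```

The check is deliberately tiny. As the last bullet warns, every extra rule added to a monitor is another place where the monitor itself can fail.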
Cascaded failure example 2:
An image of a 737 rudder PCU schematic (details here). Another example of a cascaded failure, one that led to B737 landing issues.

Debugging in the modern era
  • failure, even of a single component, erodes overall system reliability
  • when a single component fails, we need to understand why and fix it

Debugging fatal component failure
  • when a component fails fatally, its state is static and invalid
  • by saving this state (the contents of DRAM, for example) to stable storage, the component can be debugged postmortem
  • one starts with the invalid state and proceeds backward to find the transition from a valid state to an invalid one
  • this technique is old: core dumps
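
As a loose analogue of the idea in the list above, the following sketch saves a small amount of state to disk when a Python program dies on an uncaught exception. It is far less than a real core dump, which captures the whole process image, and it is not from the talk. The helper names `dump_state` and `install`, the file naming scheme and the `withdraw` example are all hypothetical.

```python
import json
import os
import sys
import tempfile
import time
import traceback


def dump_state(exc_type, exc, tb, directory):
    """Write the traceback and the innermost frame's locals to a JSON file."""
    # Walk to the innermost frame, where the invalid state was detected.
    innermost = tb
    while innermost.tb_next is not None:
        innermost = innermost.tb_next
    record = {
        "time": time.time(),
        "pid": os.getpid(),
        "error": f"{exc_type.__name__}: {exc}",
        "traceback": traceback.format_exception(exc_type, exc, tb),
        # repr() keeps the dump serializable whatever the values are.
        "locals": {k: repr(v) for k, v in innermost.tb_frame.f_locals.items()},
    }
    path = os.path.join(directory, f"dump-{os.getpid()}-{time.time_ns()}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return path


def install(directory):
    """Dump state on any uncaught exception, then defer to the default hook."""
    def hook(exc_type, exc, tb):
        dump_state(exc_type, exc, tb, directory)
        sys.__excepthook__(exc_type, exc, tb)
    sys.excepthook = hook


# Usage: provoke a failure and inspect the saved state afterwards.
def withdraw(balance, amount):
    remaining = balance - amount
    if remaining < 0:
        raise ValueError("negative balance")
    return remaining


crash_dir = tempfile.mkdtemp()
try:
    withdraw(10, 25)
except ValueError:
    dump_path = dump_state(*sys.exc_info(), crash_dir)
with open(dump_path, encoding="utf-8") as f:
    saved = json.load(f)
```

Reading `saved["locals"]` later shows the invalid value of `remaining` without touching the running system, which is the starting point for working backward to the bad transition.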
Postmortem advantages
  • no run-time system overhead
  • debugging can occur at any time, in parallel with the production system
  • tooling can be very rich, since it imposes no overhead on the running system
Cascaded failure example 3:
Flight data recorder from the Air France Flight 447 crash (an Airbus A330), found nearly two years after the crash.
This recovery is what made postmortem analysis possible.

Postmortem challenges
  • need a mechanism for saving state on failure (e.g., core dumps)
  • must record sufficient state (program text + program data)
  • need sufficient state present in DRAM to allow for debugging
  • must manage state such that storage is not overrun by a repeatedly pathological system
These challenges are real but surmountable, as shown by some of the open source systems presented below (MDB, node.js, DTrace, etc.)
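
The last challenge in the list above, keeping a repeatedly crashing component from filling the disk with dumps, can be handled with a retention policy. Below is a minimal sketch that keeps only the newest few files. The function name and the count-based cap are assumptions made for illustration; a size-based or age-based limit would work just as well.

```python
import os
import tempfile


def prune_dumps(directory, max_files):
    """Keep only the newest max_files dumps and delete the older ones.

    Returns the names of the deleted files.
    """
    if max_files < 0:
        raise ValueError("max_files must be >= 0")
    entries = [e for e in os.scandir(directory) if e.is_file()]
    # Newest first by modification time; the name breaks ties deterministically.
    entries.sort(key=lambda e: (e.stat().st_mtime_ns, e.name), reverse=True)
    deleted = []
    for entry in entries[max_files:]:
        os.remove(entry.path)
        deleted.append(entry.name)
    return deleted


# Usage: five dumps with explicit timestamps, keep the two newest.
retention_dir = tempfile.mkdtemp()
for i in range(5):
    path = os.path.join(retention_dir, f"dump-{i}.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write("{}")
    os.utime(path, (1000 + i, 1000 + i))  # dump-4 is the newest
removed = prune_dumps(retention_dir, max_files=2)
remaining = sorted(os.listdir(retention_dir))
```

Keeping the newest dumps is a design choice. For a component that fails in a loop, the first dump is often the most informative, so keeping the oldest one as well may be worth the extra space.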

Postmortem debugging: MDB
  • a debugger in the illumos OS (a Solaris derivative)
  • extensible with custom debugger modules
  • well developed for native code, but much less so for dynamic environments such as Java, Python, Ruby, JS, Erlang...
  • if components going into infrastructure are developed in those languages, it is critical that they support postmortem debugging
Postmortem debugging: node.js
  • not really interesting outside of JavaScript...
  • debugging a dynamic environment requires a high degree of VM specificity in the debugger...
  • see all details on
Debugging transient component failure
  • a fatal failure, despite its violence, can be root-caused from a single failure
  • a non-fatal failure is more difficult to compensate for and to debug
