Thursday, November 17, 2011

Uptime in High Volume Systems - Lessons Learned

Excellent presentation from Urban Airship (UA). What struck me most is how Agile and Lean they seem to be, in the real sense of the terms, at every level of the company, from Engineering to Operations. The speaker had far too many slides for an hour, however, so here is an overview (more detailed at first, then less and less as he accelerated; I will put the slides below for those interested in more details):

Presentation slides here.

About Urban Airship

  • Hosting for mobile services
  • Unified API for services across platforms
  • Content delivery at any scale
  • SLAs for throughput, latency
  • Apple, Android, RIM

UA is a Lean Company

  • Specifically Lean startup
  • From Wikipedia: use of FOSS and employment of Agile techniques; a "ferocious customer-centric rapid iteration company"
  • Attention to Continuous Improvement
  • Value the elimination of waste
  • Transparent, open processes
  • Does not apply to just Engineering but also to Operations

UA by numbers

  • more than 20K developers
  • 300 million active application installs use their APIs, across more than 170 million unique devices
  • 10s of billions of API requests per month
  • 10 million direct socket connections to their servers
  • more than 50TB worth of analytic data
  • 30 software engineers, 5 operations engineers

Obligatory Architecture slide
See slides that will be posted soon
3-tier architecture (using Apache Cassandra, PostgreSQL, Java, Python, HDFS and HBase; they are a big HBase user, like Facebook)

Architecture - General principles

  • Keep everyone moving in the same direction
  • Help discrete teams understand how they interact
  • Think in terms of small discrete services
  • Continuous capacity planning based on real data
  • Avoid local optimization decision making

Architecture - Services
Trending towards a service based architecture
Critical traits of a service

  • Minimal exposed functionality (smallest reasonable surface area for the API; operate on one type of data and do it well)
  • Simple to operate
  • Over-exposure of metrics and stats
  • Discoverable via ZooKeeper (future)
  • Zero visibility into inner workings of other services
  • No shared storage mechanisms across services (each service completely fronts its own dataset) - Motivation for this: security, performance, scalability
  • Minimize shared state - use ZooKeeper if absolutely necessary
  • Consistent logging and configuration properties
  • Consistent implementation idioms
  • Consistent message passing
  • Convention for on-disk layout and structure (directory structure is standard on all their nodes)
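A few of the traits above (small API surface, one type of data, no shared storage, over-exposed stats) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `DeviceTokenService` class and its method names are hypothetical and do not come from the UA slides.

```python
class DeviceTokenService:
    """Hypothetical service that owns exactly one type of data: device tokens."""

    def __init__(self):
        # Private storage: other services only ever go through the methods below.
        self._tokens = {}
        # Counters are cheap, so every operation gets one.
        self._stats = {"register": 0, "lookup": 0, "miss": 0}

    def register(self, app_id, token):
        """Record a device token for an application."""
        self._tokens.setdefault(app_id, set()).add(token)
        self._stats["register"] += 1

    def lookup(self, app_id):
        """Return an immutable view of the tokens known for an application."""
        self._stats["lookup"] += 1
        tokens = self._tokens.get(app_id)
        if tokens is None:
            self._stats["miss"] += 1
            return frozenset()
        return frozenset(tokens)

    def stats(self):
        """Expose a copy of the counters so callers cannot mutate them."""
        return dict(self._stats)
```

The whole public surface is three methods, and callers get copies rather than references to the internal storage, which is one simple way to keep "zero visibility into inner workings".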

Architecture waste reduction

  • All back-end services are in Java and Python
  • All Java services are made to use a single set of operational scripts
  • Always looking for new ways to eliminate waste
  • Architectural waste comes in many forms: lots of data storage engines (PostgreSQL, MongoDB, Cassandra, HBase); a complex, unfamiliar queuing system; a large diversity of approaches for managing services, worker processes, process management, etc.
  • Developer silos - avoid the bus factor - they always have at least 2-3 people per service (and they currently have 35-40 services)

Architecture - fault domains
Essentially, they worked hard to ensure that when 2 resources are completely unrelated, they sit in isolated fault domains if at all possible.

Engineering at UA

  • 46% of the time they develop new features
  • 28% spent sustaining internal support
  • 21% production support
  • 2% - social stuff (beer, ping-pong)

Engineering for iteration

  • Team of about 30 engineers
  • Small teams organized around functional area (DB guys, etc.)
  • Shortest iterations possible - Lean MVP concept (Minimum Viable Product)
  • No formal QA team (!) - and they seem to be happy with giving this responsibility to all their people.
  • Frequently pairing, but not mandatory - this is the choice of the developers themselves
  • They always leave code better than how they found it
  • Every bug fix requires a code review on a review board
  • Large new developments require a sit-down code/design review (a little more formal, but not that much)

Engineering for automation
3 levels of testing: unit, functional (with mocks), and integration. Commits go to a single main git branch.
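To make the difference between the first two levels concrete, here is a small sketch using Python's standard `unittest` and `unittest.mock`. The `Notifier` class, the `build_payload` function and the gateway object are hypothetical names of my own, not UA code; an integration test would exercise the same path against a real gateway instead of a mock.

```python
import unittest
from unittest import mock


def build_payload(alert):
    """Pure function with no dependencies: the natural target of a unit test."""
    return {"aps": {"alert": alert}}


class Notifier:
    """Hypothetical component that depends on an external gateway client."""

    def __init__(self, gateway):
        self.gateway = gateway

    def notify(self, token, alert):
        return self.gateway.send(token, build_payload(alert))


class UnitLevelTest(unittest.TestCase):
    def test_payload_shape(self):
        self.assertEqual(build_payload("hi"), {"aps": {"alert": "hi"}})


class FunctionalLevelTest(unittest.TestCase):
    def test_notify_calls_gateway(self):
        # The external dependency is replaced by a mock, so no network is needed.
        gateway = mock.Mock()
        gateway.send.return_value = True
        self.assertTrue(Notifier(gateway).notify("tok", "hi"))
        gateway.send.assert_called_once_with("tok", {"aps": {"alert": "hi"}})
```

Both test classes can be run with `python -m unittest` on the file that contains them.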

Engineering for simplicity
They simplified metrics and stats capture so that they can do it everywhere!
They capture latency for service-critical operations and external service invocations.
They capture counters for service-critical operations and service faults.
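The talk did not show how UA implements this, but the idea of making capture simple enough to use everywhere can be sketched as follows. The `Metrics` class and the `.calls` / `.faults` naming scheme are my own assumptions for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Minimal in-process registry of counters and latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)  # operation name -> seconds per call

    def incr(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        """Time the wrapped block, count the call, and count a fault if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.incr(name + ".faults")
            raise
        finally:
            self.latencies[name].append(time.perf_counter() - start)
            self.incr(name + ".calls")
```

Wrapping a critical operation or an external call then takes one line, for example `with metrics.timed("push.send"): ...`, which is what keeps the cost of measuring low.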

Engineering for Operations

  • Tests and Deployment scripts are done within the development teams (their definition of done essentially includes the deployment automation).
  • Services deployments done via automation tools and
  • Automation scripts always pull from the prod git branch, once automated and manual tests have passed
  • They apply the "Put the mechanics on the helicopter" principle

Engineering for Responsiveness

  • Low latency, high throughput message paths using an in-house developed RPC system based on Netty and Google Protocol Buffers.
  • They support sync and async clients, journaling of messages
  • Latency-tolerant message paths using Kafka for pub-sub messaging (they generally favor the pub-sub model)
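For readers unfamiliar with the pub-sub model on a log, here is a toy in-memory stand-in. It is not Kafka and does not use any Kafka client API; the `InMemoryLog` class and its methods are hypothetical. It only illustrates why this model tolerates latency: publishers append and move on, and each subscriber reads at its own pace from its own offset.

```python
from collections import defaultdict


class InMemoryLog:
    """Toy append-only log per topic; each subscriber tracks its own offset."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message; the publisher never waits for any subscriber."""
        self._topics[topic].append(message)

    def poll(self, topic, subscriber, max_messages=10):
        """Return the next batch for this subscriber and advance its offset."""
        start = self._offsets[(topic, subscriber)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(topic, subscriber)] = start + len(batch)
        return batch
```

A slow subscriber simply falls behind and catches up later, without affecting the publisher or the other subscribers.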

Engineering for Availability

  • They use dark launches (like Facebook), which essentially means rolling out new functionality to a subset of customers
  • Take a new service in or out of prod with no customer impact (double writes, single reads, migration, cutover, load-balanced HTTP with blended traffic to new and old services)
  • Their service abstraction helps immensely
  • Requires extra discipline for co-existing versions or services
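The two techniques in this list can be sketched roughly as follows. This is my own illustration, not UA's implementation: the `in_dark_launch` function, the `MigratingStore` class and the use of plain dicts as stores are all assumptions.

```python
import hashlib


def in_dark_launch(customer_id, percent):
    """Deterministically place a customer in one of 100 buckets and
    enable the feature for the first `percent` buckets."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


class MigratingStore:
    """Double writes, single reads: write to both stores, read from one."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False  # flipped at cutover

    def put(self, key, value):
        # Double write keeps the new store in sync from now on.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Single read: only one store is authoritative at any time.
        source = self.new if self.read_from_new else self.old
        return source.get(key)

    def backfill(self):
        """Migration step: copy entries that predate the double writes."""
        for key, value in self.old.items():
            self.new.setdefault(key, value)
```

Because the bucket is derived from a hash of the customer id, the same customer stays in or out of the launch across requests, and raising `percent` only ever adds customers.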

Engineering for Continuous Improvement
They use the "5 Whys" approach to get to the root cause of problems.

Operations at UA
A team of 5 operations engineers handles more than 100 servers (mostly bare metal, using EC2 for surge capacity).

Operations for Transparency
They measure absolutely everything, and monitor only the important things

See the slides here: TBD
