QCon SF 2011: Uptime in High Volume Systems

Excellent presentation from Urban Airship (UA). What struck me most is how they seem to be Agile and Lean in the real sense of the terms, at all levels of the company, from Engineering to Operations. The speaker had far too many slides for an hour, however, so this overview is more detailed at first and gets thinner as he accelerated. I will put the slides below for those interested in more details.

Presentation slides here.

About Urban Airship

Hosting for mobile services
Unified API for services across platforms
Content delivery at any scale
SLAs for throughput, latency
Apple, Android, RIM

UA is a Lean Company

Specifically Lean startup
From Wikipedia: use of FOSS and employment of Agile techniques; a "ferocious customer-centric rapid iteration company"
Attention to Continuous Improvement
Value the elimination of waste
Transparent, open processes
Applies not just to Engineering but also to Operations

UA by numbers

more than 20K developers
300 million active application installs use their APIs, across more than 170 million unique devices
10s of billions of API requests per month
10 million direct socket connections to their servers
more than 50TB worth of analytics data
30 software engineers, 5 operations engineers

Obligatory Architecture slide

See slides that will be posted soon

3-tier architecture using Apache Cassandra, PostgreSQL, Java, Python and HDFS; they are also a big HBase user (like Facebook)

Architecture - General principles

Keep everyone moving in the same direction
Help discrete teams understand how they interact
Think in terms of small discrete services
Continuous capacity planning based on real data
Avoid local optimization decision making

Architecture - Services

Trending towards a service based architecture

Critical traits of a service

Minimal exposed functionality (smallest reasonable surface area to the API; operate on one type of data and do it well)
Simple to operate
Over-exposure of metrics and stats
Discoverable via ZooKeeper (future)
Zero visibility into inner workings of other services
No shared storage mechanisms across services (each service completely fronts its own dataset). Motivation for this: security, performance, scalability
Minimize shared state - use ZooKeeper if absolutely necessary
Consistent logging and configuration properties
Consistent implementation idioms
Consistent message passing
Convention for on-disk layout and structure (directory structure is standard on all their nodes)

Architecture waste reduction

All back-end services are in Java and Python
All Java services are made to use a single set of operational scripts
Always looking for new ways to eliminate waste
Architectural waste comes in many forms: lots of data storage engines (PostgreSQL, MongoDB, Cassandra, HBase); a complex, unfamiliar queuing system; a large diversity in approaches for managing services, worker processes, process management, etc.
Developer silos: to avoid the bus factor, they always have at least 2-3 people per service (and they currently have 35-40 services)

Architecture - fault domains

Essentially, they worked hard to ensure that when two resources are completely unrelated, they are in isolated fault domains if at all possible.

Engineering at UA

46% of the time they develop new features
28% spent sustaining internal support
21% production support
2% social stuff (beer, ping-pong)

Engineering for iteration

Team of about 30 engineers
Small teams organized around functional area (DB guys, etc.)
Shortest iterations possible - Lean MVP concept (Minimum Viable Product)
No formal QA team (!) - and they seem to be happy with giving this responsibility to all their people.
Frequently pairing, but not mandatory - this is the choice of the developers themselves
They always leave code better than how they found it
All bug fixes require a code review on a review board
Large new developments require a sit-down code/design review (a little more formal, but not that much)

Engineering for automation

3 levels of testing: unit, functional (with mocks) and integration. Commits are done to a single main git branch.
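To make the first two testing levels concrete, here is a minimal sketch using Python's standard unittest and unittest.mock. The names format_alert, PushService and gateway are hypothetical and are not from the talk. The third level, integration, would run the same service against real deployed dependencies and is not shown.

```python
import unittest
from unittest import mock


def format_alert(app: str, text: str) -> dict:
    """Pure function: the kind of logic a unit test covers."""
    return {"app": app, "alert": text.strip()}


class PushService:
    """Hypothetical service that hands payloads to an external gateway."""

    def __init__(self, gateway):
        self.gateway = gateway

    def send(self, app: str, text: str) -> bool:
        return self.gateway.deliver(format_alert(app, text))


class UnitLevel(unittest.TestCase):
    def test_format_alert_strips_whitespace(self):
        self.assertEqual(format_alert("demo", " hi "),
                         {"app": "demo", "alert": "hi"})


class FunctionalLevel(unittest.TestCase):
    def test_send_calls_gateway_once(self):
        # The external gateway is replaced by a mock, so no network is needed.
        gateway = mock.Mock()
        gateway.deliver.return_value = True
        self.assertTrue(PushService(gateway).send("demo", "hi"))
        gateway.deliver.assert_called_once_with({"app": "demo", "alert": "hi"})


def run_suite() -> unittest.TestResult:
    """Load and run both test levels, returning the result object."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(UnitLevel))
    suite.addTests(loader.loadTestsFromTestCase(FunctionalLevel))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_suite() runs both levels in one go, which is roughly what a commit hook on a single main branch would do before accepting a change.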

Engineering for simplicity

They simplified metrics and stats capture so that they can do it everywhere!

They capture latency for service-critical operations and external service invocations

They capture counters for service-critical operations and service faults
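The talk did not show UA's metrics code, so the following is only a sketch of how cheap this kind of capture can be: one context manager that records latency, a call counter and a fault counter for any operation. The Metrics class and the register_device operation are hypothetical names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process registry for counters and latencies."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)  # operation name -> seconds

    def incr(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        """Record latency and a call count; count a fault if the body raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.incr(name + ".faults")
            raise
        finally:
            self.latencies[name].append(time.perf_counter() - start)
            self.incr(name + ".calls")


metrics = Metrics()


def register_device(token):
    # A "service-critical operation" wrapped in a single line of instrumentation.
    with metrics.timed("register_device"):
        if not token:
            raise ValueError("empty token")
        return token.lower()
```

Because instrumenting an operation costs one with statement, there is little excuse not to do it for every critical operation and every call to an external service.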

Engineering for Operations

Tests and deployment scripts are written within the development teams (their definition of done essentially includes the deployment automation).
Service deployments are done via automation tools
Automation scripts always pull from the prod git branch after passing automated and manual tests
They apply the "Put the mechanics on the helicopter" principle

Engineering for Responsiveness

Low latency, high throughput message paths using an in-house RPC system based on Netty and Google Protocol Buffers.
They support sync and async clients, journaling of messages
Latency tolerant message paths using Kafka for pub-sub messaging (they generally favor pub-sub model)
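Why pub-sub suits the latency tolerant paths can be shown with a toy in-memory broker. This is an illustrative stand-in only, not the Kafka API: messages are appended to a per-topic log, and each subscriber tracks its own offset, so a slow consumer never blocks the publisher or other consumers. All names here are hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a pub-sub log; deliberately not the Kafka API."""

    def __init__(self):
        self.log = defaultdict(list)     # topic -> journaled messages
        self.offsets = defaultdict(int)  # (topic, subscriber) -> next index

    def publish(self, topic, message):
        # Publishing only appends; it never waits for any subscriber.
        self.log[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not seen yet."""
        start = self.offsets[(topic, subscriber)]
        messages = self.log[topic][start:]
        self.offsets[(topic, subscriber)] = len(self.log[topic])
        return messages
```

A subscriber that polls late simply receives a longer batch, which is the trade-off that makes this model unsuitable for the low latency paths handled by RPC.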

Engineering for Availability

They use dark launches (like Facebook), which essentially roll out new functionality to a subset of customers
Take a new service in or out of prod with no customer impact (double writes, single reads, migration, cutover; load-balanced HTTP with blended traffic to the new and old service)
Their service abstraction helps immensely
Requires extra discipline for co-existing versions of services
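The "double writes, single reads, migration, cutover" sequence can be sketched as follows. This is my own minimal illustration, assuming two dict-like stores; the MigratingStore class and its method names are hypothetical and do not come from the talk.

```python
class MigratingStore:
    """Hypothetical wrapper that moves reads from an old store to a new one."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False

    def put(self, key, value):
        # Double writes: every write goes to both stores.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Single reads: only one store is authoritative at any time.
        source = self.new if self.read_from_new else self.old
        return source.get(key)

    def backfill(self):
        # Migration: copy historical data without clobbering fresher
        # values that double writes already placed in the new store.
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def cutover(self):
        # Cutover: flip reads to the new store once the backfill is done.
        self.read_from_new = True
```

Until cutover() is called the new store can be taken out again with no customer impact, since nothing reads from it.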

Engineering for Continuous Improvement

They use the 5 whys approach

Operations at UA

A team of 5 operations engineers handling more than 100 servers (Mostly bare metal, using EC2 for surge capacity)

Operations for Transparency

They measure absolutely everything, and monitor only the important things

See the slides here: TBD

QCon SF 2011

Thursday, November 17, 2011

Uptime in High Volume Systems - Lessons Learned
