- Hosting for mobile services
- Unified API for services across platforms
- Content delivery at any scale
- SLAs for throughput, latency
- Apple, Android, RIM
- Specifically Lean startup
- From Wikipedia: use of FOSS and of Agile techniques; a "ferocious customer-centric rapid iteration company"
- Attention to Continuous Improvement
- Value the elimination of waste
- Transparent, open processes
- Does not apply to just Engineering but also to Operations
- more than 20K developers
- 300 million active application installs use our APIs across
- more than 170 million unique devices
- Tens of billions of API requests per month
- 10 million direct socket connections to our servers
- more than 50 TB of analytics data
- 30 software engineers, 5 operations engineers
3-tier architecture (using Apache Cassandra, PostgreSQL, Java, Python and HDFS; they are also a big HBase user, like Facebook)
Architecture - General principles
Architecture - Services
Trending towards a service based architecture
Critical traits of a service
Architecture - fault domains
Essentially, they worked hard to ensure that when two resources are completely unrelated, they sit in isolated fault domains if at all possible.
Engineering for iteration
Engineering for automation
3 levels of testing: Unit, Functional (with mocks) and Integration. Commits go to a single main git branch.
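To make the first two levels concrete, here is a minimal sketch using Python's standard `unittest` and `unittest.mock`. The `DeviceRegistry` service and its `store` dependency are hypothetical names invented for illustration; the talk did not show UA's actual test code. An integration test would run the same scenario against a real storage backend instead of a mock.

```python
import unittest
from unittest import mock


class DeviceRegistry:
    """Hypothetical service that depends on a storage client."""

    def __init__(self, store):
        self.store = store

    def register(self, device_id):
        if not device_id:
            raise ValueError("device_id is required")
        self.store.put(device_id, {"active": True})
        return device_id


class UnitTest(unittest.TestCase):
    """Unit level: checks one piece of logic in isolation."""

    def test_rejects_empty_id(self):
        registry = DeviceRegistry(store=mock.Mock())
        with self.assertRaises(ValueError):
            registry.register("")


class FunctionalTest(unittest.TestCase):
    """Functional level: exercises the service with its dependency mocked."""

    def test_register_writes_to_store(self):
        store = mock.Mock()
        registry = DeviceRegistry(store)
        self.assertEqual(registry.register("abc"), "abc")
        # The mock records the call, so the interaction can be verified
        # without a real storage backend.
        store.put.assert_called_once_with("abc", {"active": True})
```

Saved as a module, these run with `python -m unittest`.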
Engineering for simplicity
They simplified metrics and stats capture so that they can do it everywhere!
Engineering for Responsiveness
Engineering for Availability
Engineering for Continuous Improvement
They use the 5 whys approach
Operations for Transparency
They measure absolutely everything, and monitor only the important things
- Keep everyone moving in the same direction
- Help discrete teams understand how they interact
- Think in terms of small discrete services
- Continuous capacity planning based on real data
- Avoid local optimization decision making
- Minimal exposed functionality (Smallest reasonable surface area to the API and Operate on one type of data and do it well)
- Simple to operate
- Over exposure of metrics and stats
- Discoverable via ZooKeeper (future)
- Zero visibility into inner workings of other services
- No shared storage mechanisms across services (services are completely fronting their datasets) - Motivation for this: security, performance, scalability
- Minimize shared state - use ZooKeeper if absolutely necessary
- Consistent logging and configuration properties
- Consistent implementation idioms
- Consistent message passing
- Convention for on-disk layout and structure (directory structure is standard on all their nodes)
Architecture waste reduction
- All back-end services are in Java and Python
- All Java services are made to use a single set of operational scripts
- Always looking for new ways to eliminate waste
- Architectural waste comes in many forms: lots of data storage engines (PostgreSQL, MongoDB, Cassandra, HBase), a complex, unfamiliar queuing system, and a large diversity in approaches for managing services, worker processes, process management, etc.
- Developer silos - avoid the bus factor - they always have at least 2-3 people per service (and they currently have 35-40 services)
Engineering at UA
- 46% of the time they develop new features
- 28% spent sustaining internal support
- 21% production support
- 2% - social stuff (beer, ping-pong)
- Team of about 30 engineers
- Small teams organized around functional area (DB guys, etc.)
- Shortest iterations possible - Lean MVP concept (Minimum Viable Product)
- No formal QA team (!) - and they seem to be happy with giving this responsibility to all their people.
- Frequently pairing, but not mandatory - this is the choice of the developers themselves
- They always leave code better than how they found it
- All bug fixes require a code review on a review board
- Large new developments require a sit-down code/design review (a little more formal, but not that much)
They capture latency for service-critical operations and external service invocations
They capture counters for service-critical operations and service faults
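As a rough illustration of capturing both latency and counters around a critical operation, here is a small Python sketch. The `Metrics` class, the metric names and `send_push` are all hypothetical; the talk did not describe UA's actual metrics library, and a real one would export these values rather than hold them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory registry of counters and latency samples (hypothetical)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)

    def incr(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        # Records elapsed time and a call count for every invocation,
        # plus a fault count when the wrapped block raises.
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.incr(name + ".faults")
            raise
        finally:
            self.latencies[name].append(time.perf_counter() - start)
            self.incr(name + ".calls")


metrics = Metrics()


def send_push(device_id):
    # Stand-in for a service-critical operation.
    with metrics.timed("push.send"):
        if not device_id:
            raise ValueError("missing device id")
        return "sent:" + device_id
```

Keeping the capture to a single `with` line is one way to make instrumentation cheap enough to add everywhere.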
Engineering for Operations
- Tests and Deployment scripts are done within the development teams (their definition of done essentially includes the deployment automation).
- Services deployments done via automation tools and
- Automation scripts always pull from prod git branch after passing auto and manual tests
- They apply the "Put the mechanics on the helicopter" principle
- Low latency, high throughput message paths using an in-house developed RPC system based on Netty and Google Protocol Buffers.
- They support sync and async clients, journaling of messages
- Latency tolerant message paths using Kafka for pub-sub messaging (they generally favor pub-sub model)
- They use dark launches (like Facebook), which essentially means rolling out new functionality to a subset of customers
- Take a new service in or out of prod with no customer impact (double writes, single reads, migration, cutover, load-balanced HTTP with blended traffic to the new and old service)
- Their service abstraction helps immensely
- Requires extra discipline for co-existing versions or services
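The "double writes, single reads, migration, cutover" sequence mentioned in the list above can be sketched in Python as follows. This is only one plausible reading of those four words: `DictStore`, `MigratingStore`, `backfill` and `cutover` are hypothetical names, and the toy dictionaries stand in for real old and new services.

```python
class DictStore:
    """Toy stand-in for a storage-backed service (hypothetical)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class MigratingStore:
    """Fronts an old and a new service while data moves between them."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False

    def put(self, key, value):
        # Double writes: every write goes to both services.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key):
        # Single reads: only one service answers reads at any time.
        source = self.new if self.read_from_new else self.old
        return source.get(key)

    def backfill(self, keys):
        # Migration: copy data written before double writes began.
        for key in keys:
            if self.new.get(key) is None:
                self.new.put(key, self.old.get(key))

    def cutover(self):
        # Cutover: switch reads to the new service.
        self.read_from_new = True
```

Callers only ever see `put` and `get`, so the old service can be retired after cutover without any client change.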
Operations at UA
A team of 5 operations engineers handling more than 100 servers (Mostly bare metal, using EC2 for surge capacity)
See the slides here: TBD