Wednesday, November 16, 2011

Scaling Social Computing - Lessons Learned at Facebook

A not-so-interesting presentation covering different aspects of Facebook's growth. Anyhow, here is a summary of the presented content:

Lead in: Scaling is about dealing with failures
Facebook culture: Move fast
Problem domain: social data database (lots of distributed data!)

Context: Facebook usage has grown continuously for the past 7 years, every single week, which leaves the development teams no rest.

Presentation slides here.

Move fast
  • This is seen as an enabler: it lets them try lots and lots of things, adjust more often, and take more risks.
  • At the beginning of a software development project, we have a lot of questions; as we proceed, questions get answered and the product evolves from the accumulated knowledge.
  • When the outcome is uncertain, they use what they call experiments. They can run experiments on a daily basis, which means they can move from question to question quite fast.
Practice #1: Frequent small changes (never a delay waiting for a release, easier to isolate bugs)

Data model: Node graph data model
They consider data objects in isolation less important than the relationships between them. For example, their photo app kind of sucks from a photo perspective (even according to them), yet it works well because you can tag a photo with the people who are in it. The model contains many small objects updated in real time (a typical page contains dozens of them).
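
To make the node graph idea concrete, here is a minimal in-memory sketch of small objects connected by typed relationships, using the photo tagging example. The class name, the ID format and the edge type names ("tagged", "tagged_in") are all made up for illustration; the talk did not describe Facebook's actual storage layer.

```python
from collections import defaultdict


class SocialGraph:
    """Tiny in-memory graph: small nodes plus typed edges between them."""

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(set)   # (source id, edge type) -> target ids

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, edge_type, target):
        self.edges[(source, edge_type)].add(target)

    def neighbors(self, source, edge_type):
        # .get avoids creating empty entries for unknown keys
        return sorted(self.edges.get((source, edge_type), set()))


graph = SocialGraph()
graph.add_node("user:alice", name="Alice")
graph.add_node("user:bob", name="Bob")
graph.add_node("photo:1", caption="Conference")

# The value is in the relationships: who appears in which photo.
graph.add_edge("photo:1", "tagged", "user:alice")
graph.add_edge("photo:1", "tagged", "user:bob")
graph.add_edge("user:alice", "tagged_in", "photo:1")

people_in_photo = graph.neighbors("photo:1", "tagged")
```

The photo node itself carries almost no data; the useful query is a walk along the edges.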

Social data:
  • Many small objects,
  • hard to partition and cluster AND
  • frequently changing (privacy rules need to be evaluated in real-time which seems to be their biggest concern).
  • Data consistency is really important for the user experience (they cannot afford to violate the privacy rules).
Other Technical aspects
  • Their DB seems to be a mix of MySQL (which they consider excellent for random reads - on flash and flash cache), HBase (reference to be found) and NoSQL data.
  • The key/tricky things relate to the data distribution and synchronization across many servers

Scaling is about how machines interact
  • Bottlenecks (Network mostly for them)
  • Handling failures
Principle #2: Once load crosses a certain threshold, network load explodes. The idea is to measure load continuously and act as you approach that threshold, by forcing a load reduction at the cost of increased latency.
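
One way to picture this principle is a small admission gate that tracks in-flight requests and starts refusing work before the hard limit is reached. This is only a sketch under my own assumptions: the `LoadShedder` name, the `capacity` figure and the 80% threshold are invented, and the talk did not say how Facebook measures load.

```python
class LoadShedder:
    """Refuse new work once utilization approaches a known danger threshold."""

    def __init__(self, capacity, shed_above=0.8):
        self.capacity = capacity        # load level at which things blow up (assumed known)
        self.shed_above = shed_above    # start shedding at this fraction of capacity
        self.in_flight = 0

    def utilization(self):
        return self.in_flight / self.capacity

    def try_acquire(self):
        """Admit a request only while utilization is below the threshold."""
        if self.utilization() >= self.shed_above:
            return False                # caller must wait and retry: higher latency
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(capacity=10, shed_above=0.8)
admitted = [shedder.try_acquire() for _ in range(10)]
```

Rejected callers have to retry later, which is exactly the trade described above: less load now, more latency for some requests.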

Principle #3: The rate of exchange between one machine and many other machines should be dictated by the first machine. Facebook uses a "throttle" algorithm on the client side to maximize network switch throughput and thus site responsiveness.
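
The talk did not detail the throttle algorithm, so the following is not Facebook's implementation. It is a generic sketch of the idea that the requesting machine sets the pace: the client caps the number of outstanding requests with a fixed window and only sends more when a response comes back. The `WindowedClient` name and the window size are assumptions.

```python
from collections import deque


class WindowedClient:
    """Client that never has more than `window` requests outstanding."""

    def __init__(self, window):
        self.window = window
        self.pending = deque()      # requests waiting to be sent
        self.outstanding = set()    # requests sent but not yet answered

    def submit(self, request_id):
        self.pending.append(request_id)
        return self._fill_window()

    def on_response(self, request_id):
        self.outstanding.discard(request_id)
        return self._fill_window()

    def _fill_window(self):
        """Send queued requests until the window is full; return what was sent."""
        sent = []
        while self.pending and len(self.outstanding) < self.window:
            request_id = self.pending.popleft()
            self.outstanding.add(request_id)
            sent.append(request_id)
        return sent


client = WindowedClient(window=2)
sent_now = [client.submit(i) for i in range(4)]   # only the first two go out
sent_after_reply = client.on_response(0)          # a reply frees one slot
```

Because the servers can never receive more than `window` requests from this client at once, many servers replying together cannot flood the switch in front of it.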

Handling Failures
Principle #4: You want to intercept small problems before they become bigger (ex: network breakdown)

Principle #5: At Facebook, there is no finger-pointing allowed; you are encouraged to change things even if it occasionally breaks something. Their idea is to do a proper root cause analysis of why it happened in order to prevent it from happening again. They strongly believe that finger-pointing leads people to try things less and less until they do nothing at all.

Single Points of Failure
They consider software a potential single point of failure (SPOF).

Principle #6: You should roll out changes to machines gradually and not restart them all at once (ex: when they updated some memory and restarted all the systems at the same time, they all crashed at the same time a week later and it took a complete day to fix!)

Principle #7: If you lose half your machines, you are still doing well if you serve half your traffic.
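
The gradual rollout idea can be sketched as a rolling restart in small batches, so that any failure tied to the restart (immediate or delayed) never hits the whole fleet at once and most of the capacity keeps serving traffic. The health check between batches is my own addition, and `restart` and `healthy` are hypothetical callbacks supplied by the caller, not a real tool's API.

```python
def rolling_batches(hosts, batch_size):
    """Split hosts into consecutive batches of at most batch_size hosts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


def rolling_restart(hosts, restart, healthy, batch_size):
    """Restart batch by batch; stop at the first batch that fails its health check.

    Returns (hosts restarted successfully, failing batch or None).
    """
    done = []
    for batch in rolling_batches(hosts, batch_size):
        for host in batch:
            restart(host)
        if not all(healthy(host) for host in batch):
            return done, batch      # the remaining hosts are left untouched
        done.extend(batch)
    return done, None


hosts = [f"web{i}" for i in range(10)]
restarted = []
done, failed_batch = rolling_restart(
    hosts,
    restart=restarted.append,               # stand-in for a real restart command
    healthy=lambda host: host != "web4",    # simulate one bad host
    batch_size=2,
)
```

In this run the rollout stops at the third batch, and four of the ten hosts were never touched, so they keep serving their share of the traffic.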

Cultural - the strategy is not avoiding errors, but making them cheap
  1. Test failure conditions (very important, and at the production system level, not just unit tests, etc.). If something keeps you up at night, break it! (At least you choose the time frame instead of inheriting it whenever it happens.)
  2. Monitor everything (be wary of averages, which hide everything! Use peaks, std dev., etc.)
  3. Post mortems (finding root causes is important. This is not about finger-pointing, otherwise people will stop contributing! It is about knowing exactly what caused what and avoiding it in the future). If someone makes no mistakes, it probably means they are doing nothing!
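
The warning about averages in point 2 is easy to show with made-up numbers: 98 fast requests and 2 very slow ones give a mean that looks harmless, while the peak and the 99th percentile expose the problem. The sample values and the `summarize` helper are purely illustrative.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, population std dev, nearest-rank p99 and max of the samples."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer arithmetic.
    p99_rank = -(-99 * len(ordered) // 100)
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": ordered[max(0, p99_rank - 1)],
        "max": ordered[-1],
    }


samples = [10] * 98 + [2000, 3000]   # 98 fast requests, 2 very slow ones
summary = summarize(samples)
```

Here the mean stays under 60 ms even though two users waited two to three seconds, which only the p99 and max values reveal.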
Notes from Q&A session:
  1. HBase very good with big data sets, MySQL better at random reads
  2. no polling in browser, they use a server push system
  3. Weekly push of trunk - no branches! Daily pushes for experiments. - They are tending to move toward a complete daily push.
  4. Their application is abstracted from the DB/hardware optimization so they can adjust the data model as needed without affecting the application code
  5. Code vs data schema migrations: they use no SQL schemas, but the code must still match the data. They try to only add fields so they maintain compatibility (code can work with or without the expected piece of new data).
  6. Security/User privacy settings: they want privacy handled as low as possible in the stack. The settings are essentially modeled with graph locks.
  7. API changes: they would like to do more changes... the API cannot be frozen... they try to manage this interface very carefully...
  8. They use Cassandra - really really good for inbox search. Really nice for distribution and load sharing. Not their primary storage however (because of RAID? unsure...)
  9. Having a good dashboard (which they claim they have) is key to fixing problems faster and more easily. - Is the system OK now? If not, you need maximum data on the situation!
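
As a small illustration of point 5 (only adding fields so that code works with or without the new data), here is a reader that tolerates records written before a field existed. The `nickname` field and the record layout are invented for the example.

```python
def display_name(user_record):
    """Work with records written before or after the `nickname` field was added."""
    nickname = user_record.get("nickname")   # missing in old records
    return nickname if nickname else user_record["name"]


old_record = {"id": 1, "name": "Alice"}                        # pre-migration shape
new_record = {"id": 2, "name": "Robert", "nickname": "Bob"}    # has the added field

names = [display_name(old_record), display_name(new_record)]
```

Since fields are only ever added, old and new code can run against the same data during a rollout without a coordinated migration step.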
