Wednesday, November 16, 2011

Stratos - an Open Source Cloud Platform

Presentation given by WSO2, the producer of Stratos cloud-related products. Here are some pointers from their presentation. I think this should interest all our people working on SOA implementation.

Presentation slides here.

Slide 1:
  • Presentation of Platform as a Service (PaaS)
  • online data doubles every 15 months and the number of apps roughly follows the same trend
  • what the Cloud is really depends on who you are (ex: online music for teenagers, emails for your mom, prospects for a sales guy, etc.)
Slide 2:
  • PaaS (Platform as a Service) is what is positioned between IaaS (Infrastructure as a Service, ex: Amazon) and SaaS (Software as a Service, ex: Google Apps)
  • Stratos tries to insert itself in between the hardware and the software platforms.
Slide 3:
What does Cloud native mean:
  • Distributed/Dynamically wired (i.e. it works properly in the cloud)
  • Elastic (i.e. it uses the cloud efficiently)
  • Multi-tenant (i.e. it only costs when you use it)
  • Self-service (i.e. you put it in the hands of users)
  • Granularly billed (i.e. you pay for just what you use)
  • Incrementally deployed and tested (i.e. it supports seamless live upgrades - continuous updates, side-by-side (SxS) deployment, in-place testing and incremental deployment)
Slide 4:
Apply those concepts to an Enterprise architecture… Cloud is not only about Web apps; it is also about Portals, Queues and Topics, DBs, Registries/Repositories, Rules/CEP queries, Integrations, etc.

Slide 5:
What are the various dimensions to evaluate a PaaS:
  • Which languages and APIs does it support? (Are you locked in?)
  • Can it run on a private cloud? (Are you locked in?)
  • Which services does it offer? (Are you locked in?)
  • Is it open source? (Are you locked in?)
Slide 6:
What are the cloud players:
1- Those without private PaaS
  • Force.com / Heroku
  • Google App Engine
  • Amazon Elastic Beanstalk
2- Those with private PaaS
  • Tibco
  • Microsoft
  • Cloudbees
3- Those with both
  • Stratos (obviously...)
  • others I missed.

Slide 7:
What is Stratos?
  • A product: Open source, based on OSGi
  • Services (based on the product) called Stratos Live
Then we got a live demo of Stratos Live.


For more details, see http://wso2.com/cloud/stratos/

This may be interesting for App Server, ESB, Repositories, etc.; it seems to be a fairly complete, consistent, integrated portfolio of middleware services.

Here's a list of what they claim to offer in Stratos Live:
  1. Application Server as a service
  2. Data as a service
  3. Identity as a service
  4. Governance as a service (this one is for Pascal!)
  5. Business Activity Monitoring as a service
  6. Business Processes as a service
  7. Business Rules as a service
  8. Enterprise Service Bus (ESB) as a service
  9. Message Broker as a service
  10. etc.

JVM performance optimizations at Twitter's scale

A complete presentation about application optimization based on Twitter's experience; it was Java-centric, but I tried to generalize the presented principles.


and for those interested... I still do not like Java!




Takeaway: Twitter went from 60 seconds of Garbage Collection (GC) per hour to 5 seconds per 3.5 days - 3.5 days is about 84 hours, so roughly 0.06 seconds of GC per hour, a thousandfold reduction - mainly because of the internal slab allocator.


Presentation slides here.


Twitter's biggest enemy: Latency
What are the main latency contributors (from worst to least):
  1. Garbage collector (GC)
  2. in-process locking and thread scheduling
  3. I/O
  4. inefficient application algorithms
But only the first one can be addressed by JVM tuning (which happens to be the speaker's area of expertise).

Areas of performance tuning:
  • Memory tuning (footprint, allocation rate, garbage collection rate)
  • Lock contention
Memory footprint tuning
  • Maybe you just use too much data!
  • Maybe your data representation is fat! (bad data model)
  • Maybe you have a memory leak (not covered in remaining presentation)
Global advice for performance tuning:
Profile your applications, especially when 3rd parties are involved and not well known (ex: primitive Scala array types eating lots of memory; another ex: Guava's concurrent map eating a lot of resources - especially if you do not need concurrency).
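
To make the footprint point concrete, here is a minimal, hedged sketch (my own, not from the talk): back-of-the-envelope numbers for why boxed collections inflate memory compared to primitive arrays; the per-element byte counts are typical 64-bit JVM assumptions, not measured values.

```java
// Rough sketch (not Twitter's code): boxed vs. primitive storage cost.
// Byte counts are typical 64-bit JVM estimates, used only for illustration.
public class FootprintSketch {
    public static void main(String[] args) {
        final long n = 10_000_000;

        long primitiveBytes = 4L * n;          // int[]: ~4 bytes per element
        long boxedBytes     = (8L + 16L) * n;  // Integer[]: ~8-byte reference + ~16-byte Integer object

        System.out.printf("int[]     : ~%d MB%n", primitiveBytes / (1024 * 1024));
        System.out.printf("Integer[] : ~%d MB%n", boxedBytes / (1024 * 1024));
        // Under these assumptions the boxed representation is roughly 6x fatter -
        // the kind of "fat data model" the speaker warns about.
    }
}
```

A profiler or heap dump gives the real numbers; the point is simply that representation choices dominate footprint.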

Even more important advice:
Writing a new memory manager is not a good idea unless you have already optimized your data and your JVM... if you find yourself doing this, it is a strong smell that you are doing something wrong!

Fighting latency
Performance tradeoff: Memory vs. Time (a convenient and valid, but oversimplified view); the performance triangle (press memory usage down, push throughput up and keep latency low) is a better model, but still too simple for Twitter's optimization team. In the end they use (optimization gain) = C x T x R, where C = compactness (good data model), T = throughput and R = responsiveness.

The rest of the presentation was about tuning and optimizing Java JVM Garbage Collector (GC) - if you are interested, let me know (I will not document all of this in this post - lucky you!).


Side note:
At Twitter, they also seem to have a big gap between a developer's "done" and "production ready" done.

"DevOps" - Change is NOT the Enemy

From Wikipedia (link):

"DevOps" is an emerging set of principles, methods and practices for communication, collaboration and integration between software development (application/software engineering) and IT operations (systems administration/infrastructure) professionals"

Sounds familiar? It should.

If you've ever been part of one of these groups at some point in your career, you probably said to yourself or heard things such as: "Those developers just want to rapidly push all this new technology on us, but we've yet to handle issues with the currently deployed tech." or maybe "Operations are dinosaurs; they reject any opportunity to use our new software and make us stick to old, outdated tech without realizing the benefits of the new!". In essence, agile development creates change and operations want less change.

As organizations make the move to new technologies (e.g. cloud, SaaS, PaaS, WCF, .NET 4.0, etc...), the role of development versus operations teams is evolving and this change must be properly managed. But change isn't the enemy; the lack of alignment is. Businesses fail to see this and think pushing change down the pipe will magically break their reliance on monolithic applications and help them move towards more modern, distributed, service-based applications. The fact of the matter is, distributed and loosely coupled applications are more complex to manage and also fail in a distributed way!

This type of change has to be monitored, measured and managed so both parties can establish a better communication and collaboration relationship. Can you think of any examples where this has happened to you? (Post them in the comments.)

Obviously, this post merely skims the topic of "DevOps", but I'm hoping it got you thinking a little and wanting to read up on this more. I will be attending a presentation on Thursday, more specifically on "DevOps" applied to addressing performance issues, so hopefully I will have some detailed examples. I'll post a follow-up on this subject then.

Scaling Social Computing - Lessons Learned at Facebook

A not-so-interesting presentation going over different aspects of Facebook's growth. Anyhow, here's an abstract of the presented content:


Lead in: Scaling is about dealing with failures
Facebook culture: Move fast
Problem domain: social data database (lots of distributed data!)

Context: Facebook usage has been continuously growing for the past 7 years, every single week... that really gives no rest to the development teams.

Presentation slides here.

Move fast
  • This is seen as an enabler to try lots and lots of things and to adjust more often; it encourages risk-taking.
  • At the beginning of a software development project, we have a lot of questions and as we proceed, questions get answered and the product evolves from the cumulative knowledge.
  • When the outcome is uncertain, they use what they call experiments. They can run experiments on a daily basis, which means they can proceed from question to question quite fast.
Practice #1: Frequent small changes (never a delay waiting for a release, easier to isolate bugs)

Data model: Node graph data model
They consider that data objects in isolation are not as important as the relationships between them (ex: their photo app, which kind of sucks - even according to them - from a pure photo perspective, still works well because you can tag who is in the photo). The model contains many small objects updated in real time (a typical page contains dozens of these small objects).
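
Here is what such a node-and-association model could look like; this is a hypothetical sketch in the spirit of the talk - the class names and fields are mine, not Facebook's actual schema.

```java
// Hypothetical node/association ("object and relationship") model, for illustration only;
// the names and fields are invented, not Facebook's schema.
import java.util.ArrayList;
import java.util.List;

class Node {                    // a small object: a user, a photo, a comment...
    final long id;
    final String type;
    Node(long id, String type) { this.id = id; this.type = type; }
}

class Association {             // the relationship is the valuable part
    final long fromId;
    final long toId;
    final String type;          // e.g. "tagged_in", "friend_of", "likes"
    final long updatedAt;       // relationships change frequently, in real time
    Association(long fromId, long toId, String type, long updatedAt) {
        this.fromId = fromId; this.toId = toId; this.type = type; this.updatedAt = updatedAt;
    }
}

public class GraphSketch {
    public static void main(String[] args) {
        Node photo = new Node(1, "photo");
        Node alice = new Node(2, "user");
        List<Association> edges = new ArrayList<>();
        // Tagging a mediocre photo still creates a valuable relationship:
        edges.add(new Association(alice.id, photo.id, "tagged_in", System.currentTimeMillis()));
        System.out.println(edges.size() + " association(s)");
    }
}
```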

Social data:
  • Many small objects,
  • hard to partition and cluster AND
  • frequently changing (privacy rules need to be evaluated in real time, which seems to be their biggest concern).
  • Data consistency is really important for the user experience (they can't really infringe the privacy rules).
Other Technical aspects
  • Their DB seems to be a mix of MySQL (which they consider excellent for random reads - on flash and flash cache), HBase (ref to be found) and NoSQL stores.
  • The key/tricky things relate to the data distribution and synchronization across many servers

Scaling is about how machines interact
  • Bottlenecks (Network mostly for them)
  • Handling failures
Principle #2: When you reach a certain load threshold, network load explodes. The idea is to measure the load continuously and act when you are approaching this threshold (by forcing a load reduction, and thus accepting a latency increase).

Principle #3: The rate of the exchange between a certain machine and lots of other machines should be dictated by the first machine (Facebook uses a "throttle" algorithm on the client side to maximize network switch throughput and thus the site's responsiveness).
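
The presenter didn't show code; purely as an illustration of client-side throttling (not Facebook's actual algorithm), here is a minimal token-bucket sketch in which the client caps its own request rate:

```java
// Illustrative client-side token bucket (not Facebook's actual "throttle" algorithm):
// the client caps its own request rate so it never floods the server or the switch.
public class ClientThrottle {
    private final double ratePerSec;   // sustained requests per second allowed
    private final double burst;        // maximum burst size
    private double tokens;
    private long lastNanos = System.nanoTime();

    ClientThrottle(double ratePerSec, double burst) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
        this.tokens = burst;
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(burst, tokens + (now - lastNanos) / 1e9 * ratePerSec);
        lastNanos = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;                   // caller backs off or sheds the request
    }

    public static void main(String[] args) {
        ClientThrottle throttle = new ClientThrottle(100, 20);  // 100 req/s, burst of 20
        int sent = 0;
        for (int i = 0; i < 1000; i++) if (throttle.tryAcquire()) sent++;
        System.out.println("sent " + sent + " of 1000 attempts immediately");
    }
}
```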

Handling Failures
Principle #4: You want to intercept small problems before they become bigger (ex: network breakdown)

Principle #5: At Facebook, there is no finger-pointing allowed; it is encouraged that you change things even if occasionally, it breaks something. Their idea is to do proper root cause analysis of why it happened to try to prevent it from happening again. It is their strong belief that finger-pointing leads people to simply try things less and less until they actually do nothing anymore...

Single Points of Failure
They consider software itself a potential single point of failure (SPOF)

Principle #6: You should roll out machines gradually and not start them all at once (ex: when they updated some memory and restarted the systems all at the same time, they all crashed at the same time a week later and it took a complete day to fix!)

Principle #7: If you lose half your machines, you are still doing well if you serve half your traffic.

Cultural - the strategy is not avoiding errors, but making them cheap
  1. Test failure conditions (very important, at the production system level, not just unit tests, etc.). If something keeps you up at night, break it! (At least you choose the time frame rather than inheriting it whenever it happens.)
  2. Monitor everything (be wary of averages, which hide everything! Use peaks, std dev., etc. - see the sketch after this list).
  3. Post mortems (finding root causes is important - this is not about finger pointing, otherwise people will stop contributing! It is about knowing exactly what caused what and avoiding it in the future). If someone makes no mistakes, it probably means they do nothing!
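
On point 2, here is a small, generic illustration (mine, not from the talk) of why averages hide problems: the mean of a latency sample can look acceptable while the tail is terrible.

```java
// Generic illustration (not from the talk): the average hides the slow tail,
// while the 99th percentile exposes it.
import java.util.Arrays;

public class AveragesHide {
    public static void main(String[] args) {
        double[] latenciesMs = new double[1000];
        Arrays.fill(latenciesMs, 10.0);                            // 980 fast requests...
        for (int i = 980; i < 1000; i++) latenciesMs[i] = 2000.0;  // ...and 20 awful ones

        double mean = Arrays.stream(latenciesMs).average().orElse(0);
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        double p99 = sorted[(int) Math.ceil(0.99 * sorted.length) - 1];

        System.out.printf("mean = %.1f ms  (looks acceptable)%n", mean);
        System.out.printf("p99  = %.1f ms  (users are suffering)%n", p99);
    }
}
```
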
Notes from Q&A session:
  1. HBase very good with big data sets, MySQL better at random reads
  2. no polling in browser, they use a server push system
  3. Weekly push of trunk - no branches! Daily pushes for experiments. - They are tending to move toward a complete daily push.
  4. Their application is abstracted from the DB/hardware optimization so they can adjust the data model as needed without affecting the application code
  5. Code vs. data schema migrations: they use no SQL schemas, but the code must still match the data. They try to only add fields so they maintain compatibility (code can work with or without the expected piece of new data). They mostly only add (see the sketch after this list).
  6. Security/User privacy settings: they want privacy handled as low as possible in the stack. Privacy settings are essentially modeled with graph locks.
  7. API changes: they would like to do more changes... the API cannot be frozen... they try to manage this interface very carefully...
  8. They use Cassandra - really really good for inbox search. Really nice for distribution and load sharing. Not their primary storage however (because of RAID? unsure...)
  9. Having a good dashboard (which they claim they have) is key to fixing problems faster and more easily. - Is the system OK now? If not, you need maximum data on the situation!
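
Regarding point 5, here is a tiny hypothetical sketch (mine, not Facebook's code) of the "only add fields" idea: readers tolerate a missing field by falling back to a default, so old and new records coexist.

```java
// Hypothetical illustration of additive-only schema evolution (point 5 above):
// new fields are optional, so the same code works with old and new records alike.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AdditiveSchema {
    public static void main(String[] args) {
        Map<String, Object> oldRecord = new HashMap<>();
        oldRecord.put("name", "Alice");                 // written before the new field existed

        Map<String, Object> newRecord = new HashMap<>();
        newRecord.put("name", "Bob");
        newRecord.put("timezone", "America/Montreal");  // newly added field, never removed

        for (Map<String, Object> rec : Arrays.asList(oldRecord, newRecord)) {
            // The reader tolerates the missing field instead of requiring a migration:
            String tz = (String) rec.getOrDefault("timezone", "UTC");
            System.out.println(rec.get("name") + " -> " + tz);
        }
    }
}
```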

Being an Architect

Here are a few quips on being an architect:


  • Architects shouldn't work in an ivory tower, they should get involved in daily operations
  • Architects must see the trees without losing view of the forest
  • Architects should be responsible for creating an environment that produces high quality software

What do you think? Post your answers in the comments.


Soup to Nuts - Harnessing Lean to break through the local optimisation

Initial Note on presentation title
What does Soup to Nuts mean?
"Soup to nuts" is an American English idiom conveying the meaning of "from beginning to end". It is derived from the description of a full course dinner, in which courses progress from soup to a dessert of nuts. (see http://en.wikipedia.org/wiki/Soup_to_nuts for more details)

Presentation Lead-in
- The audience was asked "why we (typically IT dept.) develop software?"
- A guy answered "To change the world"...
- "but how asked the presenter again?"... "By developing Value delivered to delighted users".

Presentation slides here.

Introduction on value-driven principles
Our software development actions should be rooted in values (Values lead to Beliefs which lead to Behaviors and eventually to Actions)

The presenter went through many stories he lived as a consultant leading change in big companies and concluded that, in the end, IT departments are not and should not be solely turned toward themselves, waiting for work to do, but should be involved in why the work is actually being requested. We should adopt practices only when they eventually bring real value to our companies (why does the company exist? To do what? etc.). The whole difference is between "do that" vs. "I do that for this reason" - knowing the reason helps in proceeding with the requested change.

Value Statement and Value Assertions as tools
He then suggested that we develop a Value Statement that answers the why portion of any prescribed change, and then develop Value Assertions. The idea is to validate the measurable (automatically or not) portion of the value statement (he introduced the concept of Value Tests as tests against the market and the stakeholders - this is done now, but generally completely outside of IT). The bad news is that the toolset for value assertion/feedback is not good, but if we find a way to deliver smaller increments of value and measure them, we get the benefits (similar to what Facebook seems to be doing with what they call experiments).
The good news: seeing delivered value (aka progress) around us motivates us (even if it is in another team) - this can be a key to motivating teams.


Suggested value-related modifications to the Agile manifesto stating popular values and principles
It seems that the current agile manifesto (http://agilemanifesto.org/) is confusing MEANS with the END - we would need to add value into it

Modified element
- Delivered value software over working software (the manifesto currently reads "working software over comprehensive documentation")

New elements to be added to the manifesto to take value into account
- Technology as a Value Engine, not a Cost Center
- Appreciate and capture diverse stakeholder values
- Do the simplest thing that delivers value (Minimum Valuable Product)

Conclusion
We should be more value-driven in everything we do (track it, measure it, etc.). Why is Apple successful? Could it be because it is essentially value-driven and not technology-driven? IT (software development departments) should stop seeing itself as a slave of the business and instead see itself as an important part of the business that can use technology to deliver business value... a big change!

Always be Failing!

Here is a good change enabler: FAILURE!

Failure is a powerful tool. People learn from failure but unfortunately, the first reflex most of us have is trying to prevent failure. When you think about it, we should be spending more time becoming resilient to failure instead of preventing it; this would be a much more valuable investment of our time.

When you fail, fail quick and fail LOUD so people can see the causal relationship between a change and failure. For example, think about having a big, publicly visible build system monitor or package deployment summary screen that goes RED when something bad happens (broken build, unresolved dependency, failing unit tests, etc..). The instant feedback is key!

In today's fast moving technological world and particularly with software development, change is something that is risky, but necessary. Enabling change by embracing failure can be extremely beneficial  and if failing hurts, DO IT MORE!

  • Push code all the time - It will get less painful
  • Upgrade all the time - It will get less painful
  • Write tests all the time - It will get less painful
  • Do code reviews all the time - It will get less painful
  • Merge branches all the time - It will get less painful
  • Fail all the time, it will get less painful

So go ahead, learn from failure and enable change! Just don't forget to make sure people aren't stigmatized or afraid to fail.


Understanding the Magic of Lean Product Development

Lead-in Quote: Arthur C. Clarke said, “Any sufficiently advanced technology is indistinguishable from magic.”

Presentation slides here.

Main Lean Product Development (LPD) problem: People consider LPD to be magical rather than technological.
  • Why not move a best practice from domain X (ex: Toyota Manufacturing) into another domain (ex: software product development)? Improved performance might transfer from one domain to another... or not. For example, if we apply Toyota's approach to a hospital emergency room, arriving patients would be processed with a FIFO queue, and admissions would be accepted until a preset FIFO limit is reached... not necessarily a good idea, right? So a good practice in one domain might be bad in a different domain.
  • This presentation is mainly about analyzing what could/should be transferred from Lean Manufacturing to Software Product Development (SPD)

The speaker's suggested approach is based on this:
  1. Toyota developed their way by focusing way more on actual practices and their measured implementations than on theoretical analysis (TBD, add Taiichi Ohno related links)
  2. How to turn Magic into Technology: by using some ideas from Lean Manufacturing and adding concepts and science from other domains (ex: queuing theory, traffic flow theory, computer OS design concepts), plus accounting for non-repetitive tasks, high variability and non-homogeneous flows, and the "OODA loop" (TBD) from maneuver warfare (to accelerate the decision-making process).
So here are some ideas the speaker considered good enough to be imported from Lean Manufacturing into the Software Product Development domain:
  1. Queuing theory: queue size vs. resource capacity is not linear - the more you utilize the capacity, the more the queue size explodes (ex: traffic at rush hour: closing one lane out of 4 will cause much more than a 25% increase in queue time)... To produce an economic output, we should take great care in selecting the right amount of capacity and use the economics of queues; we should measure queue sizes in our product development cycle and plot $ vs. excess product development capacity with 3 curves: a) total cost, b) cost of excess capacity and c) cost of delay. Why does this work so naturally in manufacturing? Because queues are physically visible. In Software Product Development they are invisible... however, it is really useful to know how much delaying a certain software release actually costs globally. (A small illustrative calculation follows after this list.)
  2. Batch size is a very important aspect as well (ex: a coffee break with 100 people at the same time vs. 5 coffee breaks with 20 people each) - see the economic lot size equation (created in the 1930s). To select the right batch size, you need to measure. There are huge benefits in small-batch testing, for example (less debug complexity, therefore cheaper debugging; fewer open bugs, therefore cheaper testing; faster cycle time; early feedback, therefore lower-cost changes... in the end, better economics). TBD: a link here with the CI small-increment theory.
  3. WIP (Work In Process) constraints: very powerful to deal with accumulation in states (ex: maximum 10 items across the Coding, Ready to Test and Testing states); see Little's law, which says that the waiting time in a queue is a function of the number of items in the queue and the departure rate of the items. A useful tool here is the visual WIP board - these boards turn non-physical/invisible items into physical tokens, and the horizontal axis changes from the traditional time axis to an axis of workflow states. Another useful tool is a Cumulative Flow Diagram (cumulative quantity vs. time). It lets you see arrivals vs. departures in the queues, and therefore both the time spent in queues (horizontal slices) and the quantities in queues (vertical slices).
  4. Synchronized cadence: it makes wait time predictable (ex: dedicated support time each day vs. a % of support time at an unknown time); cadence sets an upper bound on wait time and reduces context-switching costs.
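
To make points 1 and 3 a bit more concrete, here is a small illustrative calculation (the numbers are mine, not the speaker's), using the textbook M/M/1 queue for the "queues explode with utilization" claim and Little's law (time in queue ≈ items in queue / departure rate) for the WIP point:

```java
// Illustrative only (the numbers are invented, not from the talk).
// Point 1: in an M/M/1 queue, the average number of items waiting is u^2 / (1 - u),
//          where u = utilization; queue size explodes as utilization approaches 100%.
// Point 3: Little's law - average time in system = items in system / departure rate.
public class QueueMath {
    public static void main(String[] args) {
        for (double u : new double[] {0.50, 0.80, 0.90, 0.95, 0.99}) {
            double queued = u * u / (1 - u);   // average number waiting (excluding the one in service)
            System.out.printf("utilization %.0f%% -> ~%.1f items waiting%n", u * 100, queued);
        }

        // Little's law: 30 features in progress, team finishes 5 per month
        double wip = 30, departureRatePerMonth = 5;
        System.out.printf("average cycle time = %.1f months%n", wip / departureRatePerMonth);
    }
}
```
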
And now, the ideas considered toxic by the speaker, which should probably not be imported from Lean Manufacturing into the Software Product Development domain:
  1. Variability: as opposed to Manufacturing, we cannot eliminate variability in Software Development (a concept equivalent to volatility in finance); in fact, variability is what creates options, which present higher-value possibilities; removing this variability from Software Product Development is equivalent to killing ideas and potential value.
  2. Queuing disciplines: many queuing disciplines exist beyond Manufacturing's FIFO: see high cost of delay first (HCDF), minimum slack time first (MSTF), weighted shortest job first (WSJF), etc.; we can also borrow computer OS scheduling techniques. The idea is to pick the right one depending on context (see the WSJF sketch after this list).
  3. Fast feedback: extremely important in software development (in fact, one of the main reasons for small batch sizes and shorter cycle times) - the main idea is that we can quit unproductive paths sooner and save the associated resources (ex: SCRUM daily meetings vs. weekly meetings, or a 2-digit lottery (ref needed)).
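
As an example of one such queuing discipline, here is a hedged WSJF (weighted shortest job first) sketch: jobs are sequenced by cost of delay divided by duration, so short, expensive-to-delay work goes first. The jobs and numbers are invented.

```java
// Illustrative WSJF (weighted shortest job first) ordering; jobs and numbers are made up.
// Priority = cost of delay / job duration: short, expensive-to-delay work goes first.
import java.util.Arrays;
import java.util.Comparator;

public class WsjfSketch {
    static class Job {
        final String name;
        final double costOfDelayPerWeek;
        final double durationWeeks;
        Job(String name, double costOfDelayPerWeek, double durationWeeks) {
            this.name = name;
            this.costOfDelayPerWeek = costOfDelayPerWeek;
            this.durationWeeks = durationWeeks;
        }
        double wsjf() { return costOfDelayPerWeek / durationWeeks; }
    }

    public static void main(String[] args) {
        Job[] backlog = {
            new Job("Compliance fix", 50_000, 1),     // small but urgent
            new Job("Big replatforming", 80_000, 12),
            new Job("Checkout tweak", 20_000, 2),
        };
        Arrays.sort(backlog, Comparator.comparingDouble(Job::wsjf).reversed());
        for (Job j : backlog) {
            System.out.printf("%-18s WSJF = %.0f%n", j.name, j.wsjf());
        }
    }
}
```
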
Conclusion
  1. The importance of math: the underlying mechanisms behind lean methods can be used in Software Product Development; these methods affect more than one measure of performance, so trade-offs are necessary; an economic model gives comparable measures with common units (see life-cycle profit impact, ex: a one-month delay can cost 500K$).
  2. We need to calculate more economically and rely less on intuition! The speaker measured scientifically that any analysis (even a poor one) beats intuition in software development!

Going further references:
Book 1: Developing Products in Half the Time
Book 2: Managing the Design Factory (possibly a little outdated)
Book 3: The Principles of Product Development Flow (second-generation LPD) - excellent!

Objects on Trial Keynote

This morning, objects are being put on trial. Before us is the head Judge and a panel of jurors accusing objects of the following serious charges:
  • Did not deliver the promise of code reuse, reusable component marketplaces and frameworks that would take the drudgery out of programming
  • Inability to communicate intent clearly amongst themselves
  • Insisting on intimately encapsulating behaviour and mutable state and, by having paid inadequate attention to concurrency, leaving the industry vulnerable to what some have dubbed an ill-conceived "Neo-Functional Renaissance"
  • Eschewing static type information, forcing implementers to sacrifice performance in the name of linguistic simplicity
  • High skill and education required to adequately craft state-of-the-art code to a level that is beyond what the industry is prepared to pay for
In the accused box are the following:
  • A Penguin (Tux)
  • A coffee mug (Java)
  • A UML object (a.k.a. person, corporation, domain, identifier, serializable, cloneable, iterable)

Objects did not plead anything upon being faced with the aforementioned accusations. The head of the jury now rises and pronounces the objects GUILTY as charged!

Keynote Live

It begins:

Wednesday's Plan - take 2

Hi
Jean already posted the presentations he will attend today; here are those I will attend:
I will try to summarize them by the end of today; do not hesitate to publish your comments.

Warm San Francisco Welcome

Well, this morning was an interesting one. On the ride from the hotel to the conference, we passed through what the cab driver referred to as "the worst neighborhood in town" seeing a healthy load of bums, a weirdo with a "Jesus Loves You!" placard and two suspicious gentlemen enjoying a "smoke" from a glass pipe.

Traveller Tip: if you go to San Francisco, ALWAYS EXIT A TAXI ON THE CURB SIDE. Failing to do so will get you promptly warned by the trolley drivers that doing so is dangerous. I must've met the nicest one in town, who quickly pulled the brake, yelled and gratuitously called me an a-hole. Awesome! :)

All this aside, if you ever are in town, do visit the Blue Bottle Coffee on Mint Street (map here) as they make some good breakfast foods and a killer cappuccino. Highly recommended.

That's it for now. Getting ready to attend the first keynote.

Wednesday's plan

Here is today's plan following the keynotes. I will try my best to provide a summary of each presentation and provide you relevant information throughout the day. Don't hesitate to give your feedback or start a discussion in the comments, I will definitely be looking at them when I get a chance:

  • SimpleGeo: Staying Agile at Scale - link
  • Max Protect: Scalability and Caching at ESPN.com - link
  • The New Generation of Enterprise Java and .NET: Designing for the Next Big Things - link
  • Exploring Composition and Functional Systems in the Cloud - link
  • Architecting Visa for Massive Scale and Continuous Innovation - link
Enjoy!