Tuesday, May 4, 2010

CAP Theorem: An informal proof

[Attention: Pending peer review]

Prof. Eric Brewer made a conjecture in the keynote speech at the Principles of Distributed Computing (PODC) conference in 2000. It has since been called the CAP theorem and goes as follows:

A distributed system cannot guarantee all three properties, namely, consistency (C), availability (A), and partition tolerance (P) at the same time in the presence of network failures. It can only guarantee any two of the three.

CAP theorem has important implications for web service designers and users.

This was formally proved by Gilbert and Lynch of MIT in 2002. The formal theorem is nice but hard to understand for mere mortals like us. So I have tried to give a more informal proof that is more comprehensible.

For simplicity, let's assume that the distributed system has just one object O - say, the number of tickets available for a movie show at 1 pm today. So it is just one number. We have built an awesome web service that offers just 2 operations: read ticket count, and update ticket count. The web service runs on a distributed system consisting of 100 nodes scattered across the internet.

Some definitions:

Consistency that web service users expect is linearizability, meaning that if there are simultaneous concurrent reads and writes to the object O, the distributed/parallel system should linearize or sequence these operations in such a way that it appears to the user as executing on a single node. In effect, read operation return the value posted by the last update, and updates happen sequentially one after another. For example, if the initial value of O is 1000 and I update it to 999 via node 53, and the next instant someone on node 72 asks for the value of O, the system should return 999 and not 1000. In other words, the users of the web service don't care if there is single server behind the web service or a massive system distributed across the four corners of the planet. They want to see the updates they made most recently, and that's it.

Consistency comes in many forms. One is strong consistency which says that any subsequent read that follows an update will always see the latest update. The other is weak consistency which says that only reads that come after certain elapsed time (called the inconsistency interval or inconsistency window) will see the update. There is no one form of consistency that is suitable for all applications. Banks need strong consistency because they handle money, but airplane reservations are ok with weak consistency because they allow overbooking. (See Eventually Consistent for better definitions).

Availability means that when a read or update operation is invoked on the system, the system will always complete the operation and respond back with a success or failure. To an observer, the web site is available.

Partition tolerance means that if there is a network partition, then all the updates made to the system during the outage will be captured and propagated to all the nodes when the outage completes. That is, no updates are lost. By network partition, we mean that one or more nodes are unable to communicate with one or more other nodes in the distributed system. As a result, and during a network partition, the inconsistency interval increases dramatically - many times the normal inconsistency interval.

Informal proof:

Case 1: A and P are satisfied but C is not.

Let's assume that there is a network partition, i.e., some set of nodes in the distributed systems cannot communicate with another set of nodes, but each node is alive and well.

We will now show that either one of C, A or P is not possible or we arrive at a contradiction.

Let's say the user invokes the web service to make an update during the network partition.

The system now has two choices: to allow the update or not allow the update. If the system decides to not allow the update, then the system is not Available (A).

If the system decides to allow the update, then the system is Available. However, the question is - will the update reach all the nodes?

If the update does not reach all the nodes, then the system is not partition tolerant (P).

However, if the update reaches all the nodes, the system is partition tolerant (P). But when do the updates reach the nodes?

If the updates reach instantly or within the inconsistency interval, then the network is ok and there is no network partition - a contradiction to our assumption we made at the start of the proof.

If the updates do not reach all the nodes instantly or within the inconsistency interval, then some nodes in the system have not been informed of the update. Thus the value of O is different on different nodes after the inconsistency interval has elapsed. Therefore the system is not Consistent (C).

Case 2: If A and C are satisfied by the system, then P is not satisfied.

Let's say there is a network partition in progress. Let's say the user invokes the web service to make an update during the network partition.

The system now has two choices: to allow the update or not allow the update. If the system decides to not allow the update, then the system is not Available (A).

If the system decides to allow the update, then the system is Available. However, the question is - will the update be reflected on all the nodes?

If the update is not reflected on all the nodes, then the system cannot be consistent (C).

If the update is reflected on all the nodes, then the system is consistent (C). However, this implies that communication was possible between the nodes. This contradicts our assumption that there was a network partition.

This implies that the update made during the network partition was not reflected on all the nodes. Since the system is consistent (C), and the only option for it to remain consistent is to abandon the update made during the network partition. This means the system is not Partition Tolerant (P).

Case 3: If C and P are satisfied by the system, A is not satisfied.

Let's say there is a network partition, and let's assume the system is consistent (C).

Since the system is consistent, it must ensure that all the updates are made available to all the nodes.

When there is a network partition, the system has two choices: to ignore the updates made during the network outage, or to save them for later updates.

If the system ignores the updates made during the network outage, then it is not partition tolerant (P).

Therefore the system must save the updates for propagating them to all the nodes when the network outage is done.

Given that the system is consistent (C), it cannot allow reads to proceed anywhere in the system as long as there are pending updates to be propagated.

If the systems allows the reads to proceed before the updates are propagated, then the reads will see old values and not the latest values waiting to be propagated, thus contradicting consistency (C) assumption.

Therefore it must not respond to the read requests coming into the system. This means the system is not Available (A).

End of Informal proof.

Amazon's web services make various tradeoffs between A and C given P. Interested readers can read Werner Vogels' article on being Eventually Consistent. It is vital for web developers to understand the limitations of web services so that they can build more reliable applications that depend on web services.

Thursday, October 29, 2009

Cost of YouTube Service on Amazon S3

YouTube hit a milestone recently - it served 1 billion videos per day. I was wondering how much it would cost Google to serve these videos. It seems to be a lot but how much?

Since we don't know Google's hosting costs, let's calculate it in a different way - from the published rates of Amazon S3 service. Note that this calculation is primarily focused on storage and bandwidth costs.

Here are some stats to work with:
  • average video size on YouTube is 10 MB (maybe slightly dated)
  • videos served per month 30 billion
  • 20 hours of video uploaded each minute, or 864,000 hours of video per month
  • average size of 1 hour of video = 0.5GB (calculated at bit-rate of 1150 kbps for VCD quality vide0)
Data transferred out per month to view the videos is:

30billion * 10MB = 300 billion MB = 292,968750 GB

From the Amazon S3 calculator, this comes to $29 million/month.

Cost of videos transferred in per month is:
0.5GB * 864000 = 432,000 GB

From the Amazon S3 calculator, it is $43,200/month.

Data transfer out dominates the data transfer-in costs.

Storing the new video each month costs $66,000/month. (Cost of holding previous videos is not calculated since I don't have that data).

The grand total is around $29,366,380/month or about $352 million per year. The real cost to Google would be lower than this (perhaps by 10-20%), but $352 million seems to be the upper limit for this year. This is quite similar to the estimates made by Credit Suisse report.

Cost of Amazon s3 vs Apple's MobileMe

I have been using Apple's MobileMe as a backup solution for a while and was wondering if Amazon S3 would work out cheaper than MobileMe. With Firefox extension such as S3Fox it is equally easy to use Amazon s3 for backup.

MobileMe costs $99/year for 20GB storage and 200GB of data transfer per month. Using the Amazon Calculator, I entered the following numbers:

  • Storage: 20 GB-months, i.e., using the full storage of 20GB per month
  • Data transfer in: 100GB/month, and Data transfer out: 100 GB/month. This is assuming my data transfer is equally used between input and output. This is usually not the case but let's assume it for simplicity in calculation.

The Amazon S3 cost comes to $3 for storage/month and $27 for data-transfer/month, for a grand total of $30/month, or $360/year, a whopping 3.6 times more than MobileMe.

More than the storage, it is the data-transfer costs that add up at Amazon. So if you were to go easy on the data-transfer (a typical use case when used as a backup), then the costs will come down.

What is the break even point? After tweaking the numbers, I found that if you store approximately 20GB/month and your data transfer is approximately 40GB/month (in and out included), then the price of Amazon s3 and Apple's MobileMe are the same. Beyond this, MobileMe is a better value.

In other words, Amazon S3 is a better value if you are mainly interested in using it as a storage device and data-transfer is small. On the other hand, if your data-transfer needs are high (e.g., as in hosting a popular video), MobileMe is better value.

Sunday, October 25, 2009

Kindle App on the iPhone

Recently, I was very excited about buying the new Kindle with international wireless. After doing a little bit of research, I stumbled upon the Kindle App for iPhone. Guess what, it is an excellent app. You can easily buy books from the Kindle store and it will appear on your iPhone. The iPhone app was so easy to use that I wonder if it is really necessary to buy the Kindle Reader.

Granted that the Kindle reader has a bigger screen and the battery lasts longer than the iPhone, but the iPhone screen is good enough for me. If you intend to read a book continuously for a long time, the Kindle may be better on your eyes, but for a quick read to kill time, the iPhone is good enough.

One thing I found lacking in the Kindle app for iPhone is the settings button. There is no way to prevent the app from using the wifi/3G/GPRS network for internet access when you don't want it to.

Tuesday, September 15, 2009

Google's Fast Flip or Fast Flop?

Fast Flip is an evolutionary step ahead in browsing for information. To determine if this is going to be a winner, we need to determine if Fast Flip is going to be a win-win-win for readers, publishers, and Google itself.

Who is the target audience?

Fast Flip is similar to a RSS feed reader except that you can also see the page as the website intended you to see it. If you don't care about the visual factor and any formatting is good enough for you, then Fast Flip provides no additional advantage. If you already have a feed reader, would you use Fast Flip? Not likely. So the target audience is mostly consumers who do not use a RSS feed readers - and that is the majority of the internet users.

Is it a win-win-win?

Consumers are certainly pressed for time. They would prefer any service that would eliminate the need for them to visit multiple websites. That's why News aggregators such as Yahoo news, Google news are popular. It gives them a quick overview of the day's news. If users want to know more about an article, they click. So what's the problem that Fast Flip is trying to solve? Is it that people are frustrated about slow websites? This was a problem a decade ago. Most people have broadband now. So the claim that websites are slow doesn't seem to make sense, or does it? We will revisit this a little later.

There is one interesting use case where the slowness becomes apparent. If there is an extremely popular article that everybody wants to read now but the source website can't handle the traffic, then a service such as Fast Flip backed by Google can enable users to read the story without experiencing any delay. In this case, it's a win-win-win for readers, publishers, and Google. But this is such a rare event. Clearly, Google is not banking on this use case for its service.

However, generalizing the above idea raises an interesting thought. What if the publishers "outsource" hosting of their content to Google Fast Flip - not as in web hosting, but in caching the image of their web page? Since most readers flip through information, why waste bandwidth and hosting resources on consumers that simply spend 20 seconds or less on a website. They don't even look at the ads, let alone click on them. However, the users that click through Fast Flip are the truly interested ones. To the publishers, this is the subset of readers that are of high value - they come to the site and stay longer. Displaying ads to this audience can command a much higher premium. Even though the participating websites will see a decline in traffic, the quality of the traffic will increase and thus drive their ad revenues higher. Given that Google is sharing revenue with participating web sites, the revenues are still coming in but slightly less. This is the cost of "outsourcing" the content to Google. But this will be offset by a reduction in infrastructure costs. Overall, it will benefit the publishers, but only if they transfer a significant chunk of their traffic to Google Fast Flip.

In short, Google wants to keep the low value but large number of visitors on Fast Flip, but send the fewer but higher value visitors to the participating website. Is this a good deal for the publishers?

Now let's revisit the "problem" of a website being slow. Given that most consumers spend no more that 20 seconds on a web page, a 10 second staggered wait time to fully load the web page is not uncommon these days. And that's 50% of a visitor's attention to a website. And that's is indeed too long! Fast Flip solves this problem and removes 50% of a reader's time and that is indeed huge. So assuming the content is deep, users will indeed opt to use Fast Flip. I can even see the standard Google search results being replaced by Fast Flip one day.

Thus Fast Flip seems to be a win-win-win for all, consumers, publishers, and Google. It does seem to make business sense. Will it work out? Only time will tell.

Thursday, August 27, 2009

2 Great Tools to Check Global Accessibility to Your Website

Would you like to know how long it takes for your customers/readers across the world to reach your website? Any techie would know that you can use the ping tool from your machine. But how do you reach the four corners of the world to do the ping test?

Try just-ping.com.

It is a web tool that pings your website from all around the world, in fact, from 38 locations around the world at the time of this writing.

Next is the just-traceroute.com , again from the same people.

This has remote terminals across the world and allows you to see the route from the machine to your website. This is particularly useful to debug latency problems, and also DNS problems.

Both tools will give you a glimpse into the performance of your website as seen by your customers. Of course, network latency is only part of the total performance, but it is an important one and cannot be ignored.

Friday, August 21, 2009

Superb free book for Software Architects

If you are a software architect or want to be a software architect, I highly recommend this superb free book from Microsoft. It is really a treasure trove of information for architecting software systems. It is not an academic book, rather, it incorporates wisdom from the field.

Application Architecture Guide 2.0

Although it includes information on how to build applications on the .NET platform, it's utility is not restricted to just .NET. The principles and guidelines can be applied to non-microsoft platforms too.