Caring about Causality - now in Cassandra

Over the past few years, we've spent a bunch of time thinking about and designing scalable systems that provide causally-consistent wide-area replication.  (Here, "we" means the team of Wyatt Lloyd, Michael Freedman, Michael Kaminsky, and myself;  but if you know academia, you wouldn't be surprised that about 90% of the project was accomplished by Wyatt, who's a graduating Ph.D. student at the time of this writing.)  I'm posting this because we've finally entered the realm of the practical, with the release of both the paper (to appear at NSDI'13) and code for our new implementation of causally-consistent replication (we call it Eiger) within the popular Cassandra key-value store.

Why do we care about consistency in wide-area replication?

Because there's a fundamental, unavoidable trade-off between having guaranteed low-latency access (meaning not having to send packets back-and-forth across the country) and making sure that every client sees the same system state at the same time.  In our first work on this subject, COPS, we coined a fancy term for such low-latency-favoring systems, ALPS ("Availability, Low-latency, Partition tolerance, and Scalability"), but the trade-off itself is fundamental.  People familiar with the famous CAP theorem will probably already understand why, but here's an example:

Consider a write at a west-coast datacenter and a concurrent read at an east-coast one.  If you want both the write and the read to have low latency, then you have to satisfy them faster than the information can propagate between the datacenters.  In other words, in some circumstances, you have to let the east-coast read return stale information even after the write has completed on the west coast.  If you don't like this behavior, you could make the write take longer (waiting for it to propagate to the east coast) or the read take longer (sending it to the west coast), but you can't have both.
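To make the trade-off concrete, here's a toy two-datacenter model (purely illustrative, not Eiger code) in which writes apply locally at once and replication to the peer is deferred until `flush()` is called, standing in for the cross-country delay:

```python
# A toy model of two datacenters with asynchronous replication.
class Datacenter:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.peer = None
        self.pending = []          # replication messages "in flight"

    def write(self, key, value):   # low-latency local write
        self.store[key] = value
        self.peer.pending.append((key, value))

    def read(self, key):           # low-latency local read
        return self.store.get(key)

    def flush(self):               # replication finally arrives
        for key, value in self.pending:
            self.store[key] = value
        self.pending = []

west, east = Datacenter("west"), Datacenter("east")
west.peer, east.peer = east, west

west.write("status", "puppy!")     # completes immediately on the west coast
print(east.read("status"))         # None: the east-coast read is stale
east.flush()
print(east.read("status"))         # "puppy!" once replication catches up
```

Both operations were fast, and the price was exactly the stale read described above.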

This is pretty well understood, and was one of the (several) reasons behind the popularity of "eventual consistency", recently popularized by Amazon's Dynamo.  The other, of course, is availability:  In this example, if the two datacenters can't talk to each other, at least one of them must stop processing requests.  Eventual consistency allows the datacenters to each return local results, rapidly, even if the other one is down.  What it sacrifices, of course, is ... consistency:  Queries at different datacenters may see different results, in different orders.

Here's where causality comes in:  You can provide something better than "eventual" consistency without sacrificing availability or low-latency.  That something is causal consistency, and it's provably the strongest form of consistency you can have without giving up low latency.

What is causal consistency?

Causal consistency means that two events that are "causally" related (or potentially so) must appear in the same order.  In other words, if action B occurred in response to seeing action A, then B must appear after A.  As a concrete example, consider replying to a snarky comment on someone's Facebook post:  Your reply should be causally ordered after the snark.  And, indeed, this is exactly what causally consistent replication can provide:  Your reply will never appear to have happened before the snark that triggered it.
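One standard way to capture "causally related (or potentially so)" is with version vectors; the sketch below (illustrative names, not Eiger's actual API) checks whether one event's clock causally precedes another's:

```python
def happens_before(a, b):
    """True if the event with version vector `a` causally precedes `b`:
    every entry of `a` is <= the matching entry of `b`, and a != b."""
    return all(count <= b.get(node, 0) for node, count in a.items()) and a != b

snark = {"alice": 1}                 # Alice posts the snarky comment
reply = {"alice": 1, "bob": 1}       # Bob replies after reading it

assert happens_before(snark, reply)        # the reply depends on the snark
assert not happens_before(reply, snark)    # so it must never sort first
```

Events where neither vector dominates the other are concurrent, and causal consistency allows them to appear in either order.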

Causal consistency is good for users

We've seen out-of-order comments appear on Facebook posts before.  If this was the actual stream of posts:

  1. Woohoo!  I see a cute puppy outside my window!
  2.   [a few minutes later]  Aw, man, it just got eaten by a bear.
  3.   [reply from a friend] OMG, bears attack!
It looks a little weird if what shows up on someone else's screen is:

  1. Woohoo!  I see a cute puppy outside my window!
  2.   [reply from a friend] OMG, bears attack!
There are even better examples, widely used, when talking about access control:

  1. Remove +Wyatt Lloyd from friends list
  2. Post:  "Hey, nobody should hire Wyatt, this causal consistency stuff is bogus!"
If these two actions occur in the wrong order, then my post will not have been hidden from Wyatt as intended.  Bad news bears.

Causal Consistency is good for programmers

A stronger consistency model restricts the number of orderings of events that can show up at a remote datacenter.  This can ease the task of a programmer.  Imagine two causally related events:  Creating a new photo album and then uploading an image to it.  If those events are replicated out-of-order, your code might have to try to cope with the idea of an image being uploaded to a nonexistent photo album.  Or crash, because you never expected it to happen.  In contrast, in a causally consistent system, you might never see the photo upload (or it could be delayed), but it will always occur after the creation of the album.  Easier code, happier programmers.
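A causally consistent replica can enforce exactly this by buffering a remote write until everything it depends on has been applied locally.  A minimal sketch (hypothetical names, not Eiger's actual implementation):

```python
class CausalReplica:
    """Applies a remote write only once all of its causal
    dependencies have already been applied locally."""
    def __init__(self):
        self.store = {}
        self.applied = set()       # ids of writes applied so far
        self.buffered = []         # writes waiting on dependencies

    def receive(self, write_id, key, value, deps):
        self.buffered.append((write_id, key, value, set(deps)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:            # keep applying until nothing is ready
            progress = False
            for w in list(self.buffered):
                write_id, key, value, deps = w
                if deps <= self.applied:
                    self.store[key] = value
                    self.applied.add(write_id)
                    self.buffered.remove(w)
                    progress = True

r = CausalReplica()
# The photo upload arrives first, but it depends on the album creation.
r.receive("photo1", "album/pic", "cute-puppy.jpg", deps={"album1"})
assert "album/pic" not in r.store          # held back: never out of order
r.receive("album1", "album", "Puppies", deps=set())
assert r.store["album/pic"] == "cute-puppy.jpg"   # applied after its dep
```

The application code on top of this replica never sees the photo without its album.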

What's neat about our new work on Eiger

Eiger improves on our earlier COPS system in a few ways:
  1. It doesn't suffer unsightly metadata buildup when a datacenter fails.
  2. It supports write-only transactions as well as read-only transactions.
  3. Its read-only transactions are lower overhead, as is its metadata tracking overall.
  4. It supports Cassandra's rich data model instead of just exact-match key-value.
COPS was based on our FAWN-KV system, which only provides a hash-table-like interface: GET(key) and PUT(key, value).  For Eiger, Wyatt re-implemented everything from scratch inside Cassandra, which provides a rich data model based upon Google's BigTable, allowing range queries, counter columns, multi-key gets and puts, multiple sparse columns per key, and so on.  Supporting Cassandra's data model required an overhaul of the way we tracked causal dependencies, because the way we did it in COPS scaled poorly in a context where a single command could touch thousands of items.

One thing we left as future work in COPS was dealing with datacenter failure.  With the way we tracked dependency metadata, a full DC failure would prevent the other datacenters from garbage collecting metadata until a human administrator stepped in and declared the failed datacenter offline.  This was fine for outages of a few minutes, but poor for big-but-temporary failures.  Eiger's operation-based metadata tracking doesn't suffer this problem.

Write-only transactions are important for being able to consistently update a group of objects.  For example, write-only and read-only transactions can be used together to ensure invariants such as "if A likes B, then B must be liked by A".  This again makes the programmer's life easier, because they can simply write something like:

atomic_mutate( Dave->insert(Assocs:Likes:Wyatt),
               Wyatt->insert(Assocs:LikedBy:Dave) )

and be sure that these two updates will appear together, or not at all.

And it can do all of this without sacrificing low latency or availability, unlike the stronger forms of consistency provided by systems such as Google's Spanner.

Conflicts: What Causally Consistent Replication Can't Do

Causally consistent replication's weakness is that it doesn't prevent conflicting updates.  Imagine running a contest in which the first person to reply to your post would be awarded a pony.  (Many commenters on sites such as Slashdot seem to believe that this occurs regularly.)  In a causally consistent system, a poster on the west coast and a poster on the east coast could both legitimately believe---for a while---that they had posted the first reply.  These two updates would be in conflict, but it would take at least one round-trip-time between the datacenters to discover the conflict and resolve it.
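A common way to make such conflicts resolve the same way everywhere is last-writer-wins with a deterministic tiebreak, roughly the kind of convergent conflict handling our "causal+" model calls for.  A sketch (illustrative names, not Eiger's API):

```python
def resolve(a, b):
    """Each write is (timestamp, replica_id, value); the write with the
    highest (timestamp, replica_id) pair wins, deterministically."""
    return a if (a[0], a[1]) > (b[0], b[1]) else b

west_reply = (1000, "west", "First!")    # concurrent writes: neither
east_reply = (1002, "east", "First!!!")  # causally depends on the other

# Every datacenter computes the same winner, in either argument order.
assert resolve(west_reply, east_reply) == east_reply
assert resolve(east_reply, west_reply) == east_reply
```

Both posters still saw themselves as "first" for a round trip, but once the conflict is discovered, all datacenters converge on the same pony winner.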

Causal consistency does not work for systems that must (at all costs) maintain a global invariant.  One can imagine examples of this from banking:  "User A wishes to withdraw all funds from her account";  it should only be possible to do this once, globally.  In these scenarios, it's probably worth accepting a cross-country RTT slowdown in exchange for never seeing an inconsistent view of the world.  For many other things, such as FB/G+ response threads and likes, causality may be all you really need.

Looking at Eiger in the bigger context

The last few years have seen substantial progress in improving wide-area replicated datastores.  I think of our work, as well as Mike Dahlin's group's concurrent work on Real-Time Causal Consistency, as pushing hard on trying to provide usable, stronger consistency for the "we want low latency!" crowd; Peter Bailis and the folks at Berkeley have been working on providing better transactional semantics for causally consistent systems that will hopefully go beyond Eiger's read-only and write-only transactions;  at the same time, systems such as Walter, Gemini, and, of course, Spanner, have been pushing on improving the common-case behavior and scale of strongly-consistent replication for the "must have consistency!" crowd.  In a sense, there's a line drawn in the sand by the CAP theorem and the incompatibility of low-latency and strongly-consistent replication, and we're all racing towards it from different sides of the line.

The good news is that all of this (will be) good for developers and users:  Faster, more scalable strongly-consistent systems mean that there are fewer times when you need to resort to the more difficult task of programming in a less-consistent environment;  and the bump from eventual to causal ("causal+" or "real-time causal") makes programming in that environment safer and easier.
