A brief response to "COPS and PNUTS"

I stumbled across a review by Peter Bailis of one of our SOSP papers today, and I'm worried that we failed to be clear about a few things in the work, so I figured I'd grab the blog soapbox and try to clarify. First is the relationship of COPS [pdf] to prior work. Peter comments:
I'm not sure this advances much of the state of the art beyond, say, Dynamo, which provided a scalable web store with user-specified conflict handling.
The delta from Dynamo is very simple: Dynamo offers no “guaranteed” consistency semantics at all. Operations that touch different keys/tablets can arrive at replicas in arbitrary order. In COPS, that’s not the case - operations can only arrive in an order allowed by causal consistency. The example of updating two keys, one that controls the permissions of an object, and one that modifies the object, is a good way to illustrate this. In Dynamo, the object modification could be applied at a remote datacenter before the corresponding permissions update. In COPS, it could not. PNUTS is stronger than Dynamo, but weaker than COPS - it provides per-record timeline consistency, but provides no consistency guarantees across different keys. Consider four updates by a single thread:
     write A=1, write B=dog, write B=cow, write A=2
In Dynamo, the order in which these updates are applied is unconstrained: in a remote datacenter, read(A), read(A) might return A=2, then A=1.
In PNUTS, the value of “A” could never go backwards: A=1, A=1, … A=2, A=2, …
But in both Dynamo and PNUTS, the relative values of A and B are unconstrained: either system could return {2, dog} for read(A), read(B), even though the originating thread set B=cow before it set A=2.
In COPS, this latter read sequence could not happen. The allowable reads are:
   {1, ⊥}, {1, dog}, {1, cow}, {2, cow}     (where ⊥ means B has not yet been written)
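To make the example concrete, here is a small Python sketch (my own toy model, not COPS's actual API) that enumerates the (A, B) pairs a causally consistent reader may observe. With a single writer, the causal order is just program order, so the allowed pairs are exactly those visible after some prefix of the write log:

```python
# Writes issued in program order by a single thread.
writes = [("A", 1), ("B", "dog"), ("B", "cow"), ("A", 2)]

def causally_allowed_pairs(log):
    """Return the (A, B) pairs visible after each causal prefix of the log."""
    state = {"A": None, "B": None}
    pairs = [(state["A"], state["B"])]  # state before any write arrives
    for key, value in log:
        state[key] = value
        pairs.append((state["A"], state["B"]))
    # Drop duplicates while preserving order.
    seen, out = set(), []
    for p in pairs:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

print(causally_allowed_pairs(writes))
# [(None, None), (1, None), (1, 'dog'), (1, 'cow'), (2, 'cow')]
```

Note that (2, dog) never appears: once a reader has seen A=2, causality forces it to see B=cow as well. In Dynamo, and across keys in PNUTS, nothing rules that pair out.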
In all three systems (in PNUTS, using Read-any), concurrent updates at different datacenters could cause conflicts that invoke conflict handlers, and in this respect the three systems do not differ - nor do we claim they do.
Causal consistency in this way is not new, and it’s not the contribution of COPS. It’s the same consistency model provided by Bayou and the systems that followed it. The contribution is that COPS does so in a scalable way when your data store is partitioned among many machines in a datacenter. As you shard data onto more and more tablets, those tablets can replicate to their peers in a remote datacenter without all updates having to go through a single point of serialization. The tricky thing in doing so is ensuring that you don't build up an ever-growing pile of metadata in the process, and I think this is COPS's biggest contribution - Wyatt figured out how to do this replication, efficiently identify and garbage collect the dependencies, and so on.
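The core mechanism is easy to sketch in miniature. Here is a toy Python model (invented names; this is a sketch of the idea, not the COPS implementation, and it omits the metadata garbage collection that makes the real system practical): each replicated write carries the versions it causally depends on, and a remote datacenter buffers a write until those dependencies are locally visible.

```python
class RemoteReplica:
    """Toy model of causal replication to a remote datacenter.

    A write arriving over the WAN carries explicit dependencies
    (key, version pairs) and is applied only once all of them have
    been applied locally, so causal order is preserved per-replica
    without a single point of serialization.
    """

    def __init__(self):
        self.store = {}       # key -> (version, value)
        self.applied = set()  # (key, version) pairs already applied
        self.pending = []     # writes still waiting on dependencies

    def deliver(self, key, version, value, deps):
        """Called when a write arrives over the WAN, possibly out of order."""
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                key, version, value, deps = w
                if all(d in self.applied for d in deps):
                    self.store[key] = (version, value)
                    self.applied.add((key, version))
                    self.pending.remove(w)
                    progress = True

# The permissions example from above: the object update depends on the
# ACL update, so even if it arrives first, it is buffered until the ACL
# write lands.
r = RemoteReplica()
r.deliver("photo", 1, "edited", deps=[("acl", 1)])
assert "photo" not in r.store             # buffered: dependency missing
r.deliver("acl", 1, "friends-only", deps=[])
assert r.store["photo"] == (1, "edited")  # now both are visible
```

The hard part, and the part this sketch punts on entirely, is keeping `deps` and `applied` from growing without bound as the system runs - that is the metadata problem the paper addresses.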
I hope that we didn’t convey the impression that we thought “causal+” is a novel model - in fact, as we explicitly point out in section 3.4, it’s exactly the consistency model provided by Bayou and PRACTI. However, in order to prove that COPS’s design preserved the causal ordering, we had to create a formal definition of causal+. As Peter says, we could have called it “employing commutative merge functions to totally order causally divergent versions”, but that’s a mouthful, so we coined a shorthand. It’s not clear to me that giving names to enhance clarity justifies claiming that the “card-carrying club members” of the SOSP community were “misleading” and “posturing”. :-)
Finally, COPS isn't the solution to every problem under the sun. There are times when you want to sacrifice consistency for availability, and for those times, COPS provides strictly stronger consistency semantics than, e.g., Dynamo and PNUTS, with the same availability. But there are also times when you need something stronger, and for that, you have to use something like Megastore or Walter. There's a reason Amazon's S3 and PNUTS offer the choice of strong or eventual consistency, and nothing can change that - it's a fundamental tradeoff. But COPS makes the "available" choice somewhat easier to use.

Comments from the original post

  • Anonymous, December 1 2011, 11:59:19
    Not quite true that Dynamo offers no consistency semantics at all; with W + R > N, serializability is guaranteed, since every read will necessarily pick up the most recently written value.
  • dga
    Hi, Anon@11:59am -
    Kind of, except under failures. Dynamo will dynamically extend the set of W nodes in a way that isn't visible to other readers. So it can happen that a write is still written to 3 nodes, but those 3 nodes aren't the same nodes that the readers will try to read from. That's the place where Dynamo fundamentally chooses which side of the CAP line to be on -- it chooses AP. Even when W + R > N.
    It's worth noting, though, that > 99% of the time, Dynamo with W + R > N behaves like a linearizable system. It's just under failures that it stops doing so.
  • Mike Freedman
    Dynamo's guarantees
Agree with David. In order to guarantee linearizability (i.e., strong consistency), you need to have a strongly consistent notion of group membership (i.e., the quorum view). Because Dynamo has "sloppy quorums", it can't guarantee linearizability. Admittedly, this might not matter too much under normal operating conditions, but linearizability is hard exactly under failure conditions, not in the no-fault case. Hence the complexity of Paxos and similar consensus algorithms.
  • strlen
    Re: 3 nodes
    I think what you're referring to are sloppy quorums. However, sloppy quorums aren't intrinsic to the Dynamo data model. Out of the three popular Dynamo implementations, only one even makes sloppy quorums for reads an option (which isn't enabled by default).
Of course, quorums have their drawbacks. The main issue with Dynamo is the difficulty of repairing a node to a consistent state in the absence of an ordered log. Read-repair works well for transient failures within a datacenter: it is rightly the main approach for restoring consistency, as most failures are indeed transient. On the other hand, read repair is prohibitively expensive across a WAN link (due to high latency), and takes much longer for longer-term failures (nodes being taken out for hours or days). The other approaches (Merkle trees, hinted handoff) work but have their drawbacks.
  • strlen
    Re: 3 nodes
    Typed too quickly:
    s/Dynamo data model/Dynamo distribution model/
    s/quorums have their drawbacks/even strict quorums have their drawbacks/
  • pbailis
    Dave--you make some great clarifying points in this blog post. In my accidentally-public class paper review, I fixated mostly on whether causal+ was a novel contribution by itself. I'm glad to hear that we agree that a formal definition of causal+ is important but the main contribution of the paper is COPS, or, more generally, figuring out how to build a system that delivers scalable causal+ consistency. I also appreciate your magnanimity in receiving my quite unfiltered review!
    I've posted a longer reply regarding the context of my review and some additional thoughts here:

