A Public Review of Cuckoo Cycle

In the ongoing experiment that is crypto-currencies, several "Proof-of-Work" functions are commonly used as mechanisms to decentralize control of the currency.  These PoW functions, as they are typically referred to, require a participant to prove that they expended (in expectation) a certain amount of computational effort, such as finding a partial collision in a cryptographically strong hash function.  Many have been introduced to counter the gradual ASIC-ificiation of the dominant PoW, partial collisions in SHA256, used in Bitcoin.  I've discussed several in prior posts that try to require random memory accesses in order to favor general-purpose CPUs over specialized devices such as GPUs or ASICs.

In this post, I'll take a look at one called Cuckoo Cycle that combines two of my interests:  Cuckoo Hashing and memory-hard PoW functions.  Its author and a few other people asked me to take a look, and it finally seemed like enough people were looking at it to do so.

In the rest of this post, let's assume that the properties of this function are desirable in some context, and put aside the question of whether they are the right design for a cryptocurrency.  As the saying goes, that's above my pay grade.  (Though I will quibble with the article's assertion that GPUs and ASICs lead to high power costs -- on a hash-per-second-per-Watt basis, they're substantially more efficient, and I suspect that on a per-participant basis, the Bitcoin network is using less power post-ASIC than it was pre-ASIC).

Rough idea of CC:  Define N as a number of nodes (the explanation document says < 2^32, but there's no real reason for this) and E as a number of edges  N/4 < E <= N.  Split the nodes in half with N/2+1 on the left (V0) and N/2 -1 on the right (V1).  For nonce 0 <= n < E, add an edge from siphash(k, n)%(N/2+1) in V0 to siphash(k, n)%(N/2-1) in V1.  k is a constant for the block header.

The proof-of-work is to find a cycle of length L in the bipartite graph thus defined.

The author claims this work is tmto-hard (time-memory tradeoff, which I obnoxiously insist on pronouncing "tomato"), in the sense that trying to use less space causes a slowdown of "several orders of magnitude".

Here's a quick start on how I'd attack this beast in parallel using substantially less memory than the author's original design - but not less than O(N) yet.  Note:  This may be wrong!  I haven't implemented it, I'm designing it on paper, and it needs a bit more care with its O(N).  I'm throwing it up in its raw form as a way of continuing the dialog about such proof-of-work functions, and as an illustration of how I start thinking about problems like this.

Let a vertex v be live if it has two or more incident edges.  Otherwise, this vertex cannot be part of a cycle.

Let a nonce n be live iff both of the vertices it connects are live.  Otherwise, it is an edge to nowhere and can be removed from the graph.

Use as a building block an optimized "two-or-more" counting structure such as that which I used for the Momentum PoW.  With proper design, such a structure is relatively bandwidth and CPU-limited in its counting ability, not random access:  On all processors generate a sequence of values v.  Place those values into one of 2^p output buckets based upon the p low-order bits of the value:  bucket[p&0x0000ff], for example, to go into 255 buckets.  When each bucket is full, flush it to a growing list of items in that range.  By sizing p and the buckets appropriately, perhaps with multiple passes, this uses fast local memory to ensure that the writes to slower global memory are fully coalesced.  It's the design I use in my CPU version of Momentum, for which you can see the code here.

To reduce memory use, do this in k passes, where in a pass, you discard hash outputs that don't have the low-order bits you're trying to filter.  This is a standard time/memory tradeoff, requiring k times as many hashes, but reducing the number of output buckets (and thus the amount of memory required for counting of incident edges) by a factor of k.  Appropriate choice of k can reduce the per-pass storage to, e.g., N/log(N), thus o(N) (that's little-oh -- it disappears as N scales).

Maintain a bitvector BV of size N bits that determines the set of live nonces.

For enough passes to shrink the problem size to something you care about, do:
  Test liveness of V0 - mark off nonces.  A nonce is live iff its associated vertex in V0 has two or more incident edges, determined using the duplicate counter.
  Using the live nonces, test liveness of V1.

Using the author's implementation default of E=N/2, this reduces the number of live nonces to a bit less than 1/2 the previous count (the author notes 1/ln(e)) on each test, except for edges that participate in a cycle.

This is a little napkin-sketchy, but it's how I'd start attacking the problem on a platform like a GPU.  My estimate of the cost is that it requires something like N bits of storage, N*log^2(N) siphash computations, and N*log^2(N) bits of global memory bandwidth, if the time/memory tradeoff value is chosen as log^2(N) -- which is maybe feasible with siphash.  Practically speaking, I'd go for more bandwidth and maybe 4-8 executions of siphash, if needed to reduce the partitioning fan-out for duplicate counting.  (That may need to be log^3(N) bits of bandwidth in the most general sense - but not for 512MB)

This attack does leave it with a basic requirement of O(N) bits  (if we're being picky, the bitvector could be stored compressed to save a little more-- but still O(N) and yuch).  As a result, this doesn't eliminate the claim of being an ASIC-unfriendly design, except for the bounded N in the definition of the paper, which isn't fundamental.  Any proof-of-work whose memory requirements, even if time-memory tradeable, scale with increasing difficulty, will present a pain for a baked-in-silicon design.  But 2^32 bits (512MB) is very manageable on almost any good GPU platform today.

(There is a way to slightly reduce the memory demands:  Split the bit vector in half, and record the live bits only for the first half across multiple iterations of the test, using the full second half.  Then store this bit vector in a compact way, and perform the computation on the second half.  This doesn't extend very far past a factor of two reduction, but it does require only a roughly linear increase in the amount of computation and could possibly get the storage needed down to about 300MB for 2^32.)

This proposal has some things I like - it remains a generalization of Momentum's collision finding;  by having a more dense graph structure, it requires a few more iterations through memory vs the amount of hashing.  By using siphash as its memory-filler, it reduces the compute-to-memory (bandwidth or latency) ratio, which is something the SHA512 designs could improve -- their bottleneck is over 90% SHA512 computation time even with vector code.

There may be further progress to be made on this.  In particular, cycles in graphs seem very likely to yield to sampling-based approaches.  Consider, e.g., a 42-cycle:  A sampling of 10% of the edges in the graph is extremely likely to contain at least one of the edges in the 42 cycle.  One might be able to use this as a starting point to solve Cuckoo Cycle in sublinear memory.  I haven't thought enough about it, but it's where I'd go next.  If sampling is effective, it may be that the cycle finding could weaken this proposal beyond Momentum.  If it's not, it may be "stronger" (in terms of memory-to-time).  That'd be interesting to determine.

Popular posts from this blog

Reflecting on CS Graduate Admissions

Minting Money with Monero ... and CPU vector intrinsics

Two examples from the computer science review and publication process