Gaining Momentum: Duplicate Detection in CUDA for better mining
I've been having fun finding "memory-hard" proof-of-work functions and engineering better solutions to solving them, with a focus on doing so for Nvidia's cards. (Why? Because I have some of them, because CUDA's more fun to program in, and because Nvidia's cards are widely regarded as the underdog in the crypto-currency game because of their slower shift and rotate instruction throughput. It's fun to have a challenge and it's nice to use good tools.)
I recently stumbled across a new one, called "Momentum", which requires as its proof of work finding a collision in a 50-bit hash function, given a 26-bit nonce space to play in for each attempt. In other words, to mine a Momentum coin:
for i := 0; i < 2^26; i++ {
    h := hash(i)
    nonces[h] = append(nonces[h], i)
}
for h, noncelist := range nonces {
    if len(noncelist) > 1 {
        // Yay, I found a collision - check it further.
    }
}
Pretty straightforward. The "check it further" involves another SHA256 of the nonces and a Bitcoin-style comparison against a difficulty target to let the proof-of-work grow harder or easier over time, but most of the time is spent generating the nonces and checking the hash table.
The hash itself is actually SHA512, taking each of the eight 64-bit output blocks as a separate hash. So that loop really looks like:
for i := 0; i < 2^26; i += 8 {
    h := hash(i)
    for j := 0; j < 8; j++ {
        ... h[j]
    }
}
The developer of the scheme described it as "GPU-resistant", and that, combined with the relatively small size of the currency, kept people from (publicly) building a good GPU version. Which is odd - the memory to store all of the hashes is only 512MB, they can all be calculated independently, and then you just need a nice parallel design for finding the duplicates.
I looked at this as a possible candidate for putting Bin Fan's Cuckoo Filter in, and I'm still planning on that for the CPU version (next post), but for the GPU, the cost of Cuckooing added too many random memory references, so I went with a design that has worse memory use but accesses only one location per hash: a simple two-bit saturating counter (kinda), updated using CUDA's atomicOr() instruction. (The Cuckoo Filter is most interesting when it lets you fit in cache, which wasn't an option with 2^26 values on a GPU.)
address, bitstring := convert hash value to bucket number and bit
old = atomicOr(address, bitstring);
if (old & bitstring) {
    // Our bit was already set - duplicate detected!
    // Set the second bit to record the collision.
    atomicOr(address, bitstring << 1);
}
This counter jumps from 1 straight to 3, skipping 2, but that's OK: only the high bit matters. Use it like this:
calculate SHA512 values in parallel
foreach hash value, increment the associated counter in parallel
--- barrier: wait for all values to be counted ---
foreach hash value, if its counter is not 3, zero out the entry;
otherwise retain it as a possible collision.
The cost of running a full round of 2^26 nonces thus becomes:
2^23 parallel executions of SHA512
2^26*(1+epsilon) AtomicOr instructions to random locations
2^26 reads of random locations
... some smaller costs to clean up after that
To keep the memory cost down, I execute two rounds of this simple filter. The first round zeros out the non-duplicated entries so that the lookup table is much more sparse (and sees far fewer accesses) in the second round. As a result, it works decently well with 256MB of lookup table (2^31 total bits, or 2^30 two-bit counters, meaning one hash value for every 16 counters). Thus, the second phase issues only about 1/16th as many operations - pretty negligible.
There are a lot of potentially clever optimizations to explore. Many of them require more memory, so I rejected them because my poor little Mac Pro only has 1GB of video RAM. But the core question of approximate duplicate detection with one-sided error on a GPU is great fun!
As a bit of a comparison, I tried the other obvious route to detecting duplicates: Create all 2^26 values, sort them using thrust::sort, detect duplicates, and flush those that aren't duplicated. It runs at about 50-60% the speed of my current implementation.
I've open-sourced my implementation of this, and it seems to have caused a bit of a stir in the protoshares community. There was a persistent rumor that a private GPU implementation existed - and given the relative ease of building one, I would be surprised if it were otherwise - but no version was publicly available in either binary or source form. Since I released my code, clones have cropped up with Windows ports, AMD ports, etc. It's been very entertaining to watch.
Acknowledgements:
- I did a lot of testing on the Susitna GPU cluster, which is part of the NSF-funded PRObE testbed. It's an incredible resource for the community, and being able to stress-test variants of your code on multiple systems at a time is a huge boon. Nvidia donated the Tesla K20c GPUs for this cluster.
- As before, my family and several of my friends (you know who you are) once again put up with me obsessing over a crypto-currency-related problem for weeks on end.