Inside a better CUDA-based scrypt miner
In my previous post, I discussed how I'd written a more efficient NVidia-based scrypt coin miner and used the edge it conferred to (briefly) mine profitably on Amazon EC2 instances. In today's post, I'll break down the algorithmic and engineering details of that improved miner. You can read along in the code I've released on GitHub.
Some terms: GPUs are big vector processors, but they can, at high expense, let the individual items in the vector "diverge" and take different paths through the code. Because of this, NVidia calls the model "SIMT": Single-Instruction, Multiple-Thread. The "Kepler" architecture GPUs my code targets execute 32 threads at a time on a single vector unit; each multiprocessor has 192 cores in total.
The ideal candidate GPU code looks something like:
process_vector(vec) {
for i := 0; i < vecsize; i++ {
do_something_expensive(vec[i])
}
}
where each do_something_expensive call is independent of all of the other vec[i]'s. Think of, say, computing an expensive hash function on thousands or millions of input keys at the same time.
To start, we want to ask: what does it take to run quickly on a GPU? You have to keep all of the cores busy all of the time. This seems obvious, but it has a few implications:
- Don't use too many variables (registers) per thread (vector entry). The GPU has a large but still limited number. Using more than 32 or 64 variables per thread can start to slow things down by restricting the number of threads that can execute concurrently.
- Don't use too much memory bandwidth. The GPU has a lot -- from 80 to 300 GB/sec -- but it's not infinite.
- Use your memory bandwidth well: If each thread reads a totally random location at a time, your code will be slow. If, instead, most threads read adjacent locations so that the overall read is a big sequential one to memory, you will get a lot of bandwidth (see the sketch after this list).
- Don't let the GPU sit idle while it waits on the host to do things.
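To make the third point concrete, here is a pair of toy kernels (not from keplerminer; the names are mine) contrasting the two access patterns:

__global__ void coalesced_read(const int *in, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read adjacent words: the warp's 32 loads merge
    // into a few wide memory transactions.
    out[i] = in[i];
}

__global__ void strided_read(const int *in, int *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read words far apart: each load becomes its own
    // narrow transaction, wasting most of each memory burst.
    out[i] = in[i * stride];
}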
Let's see how we can tackle all of these. I'll take the last one first: keeping the GPU busy. CUDA provides a lot of mechanisms for, e.g., overlapping copies in & out of the GPU and running the parallel code (called a "kernel"). This is a good idea, and it's what Cudaminer did. I decided not to do that; instead, I moved all of the mining functionality into the GPU so that I can invoke it, have it run for a long time, and then very quickly invoke it again with the next (small) job description. This required implementing more stuff on the GPU, but simplified the architecture.
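The resulting host-side structure looks roughly like this. It's a sketch under my own naming (Job, get_next_job, scan_kernel, report are all hypothetical); the real interface, ScanNCoins, appears further down:

// Hypothetical host loop: d_job and d_result are device allocations.
void mine_forever(Job *d_job, int *d_result) {
    const int n_blocks = 1024, n_threads = 256;  // illustrative launch shape
    for (;;) {
        Job job = get_next_job();  // a few hundred bytes of work description
        cudaMemcpy(d_job, &job, sizeof(job), cudaMemcpyHostToDevice);
        // One long-running kernel does the PBKDF2 + scrypt + nonce scan.
        scan_kernel<<<n_blocks, n_threads>>>(d_job, d_result);
        int found_nonce;
        cudaMemcpy(&found_nonce, d_result, sizeof(found_nonce),
                   cudaMemcpyDeviceToHost);
        if (found_nonce != -1)
            report(found_nonce);
    }
}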
To see this, we need to break down what scrypt mining actually does. Before doing anything else, scrypt uses the PBKDF2 key-derivation function (built on HMAC and SHA-256) to turn the input block into 1024 bits of state. It then operates in two passes: the first generates a big state vector, and the second bounces through it pseudo-randomly. The pseudocode looks like this, with some simplification:
hashed_key, saved_keystate := PBKDF2_SHA256_start(key)
// hashed_key is 1024 bits of state data.
for i := 0; i < 1024; i++ {
State[i] = hashed_key // State is 128KB.
hashed_key = salsa8_xor(hashed_key)
}
for i := 0; i < 1024; i++ {
which_state = low order 10 bits of hashed_key
hashed_key = XOR(State[which_state], hashed_key)
hashed_key = salsa8_xor(hashed_key)
}
final_key := PBKDF2_SHA256_finish(hashed_key, saved_keystate)
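In C terms, the "low order 10 bits" step is Percival's Integerify. With these parameters the current state is 32 32-bit words, and the index comes from the first word of the second half (the variable names here are mine):

/* X is the 32-word (1024-bit) current key state; N = 1024. */
uint32_t which_state = X[16] & (1024 - 1);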
The majority of time is spent in salsa8_xor and in filling up and reading from State[i]. For mining, this is all wrapped in a loop:
target := some_big_number;
found := false
// This loop is usually done for many nonces in parallel
for nonce := 0; nonce < max_nonce && !found; nonce++ {
hash := do_scrypt(coin_data, nonce);
if hash < target {
found = true
}
}
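On the GPU, that outer loop dissolves into the thread grid: each thread derives its own nonce from its index. Here's a sketch of the scan_kernel used above, with one thread per nonce for simplicity (keplerminer actually uses four threads per key, as discussed below) and a hypothetical hash_below_target helper:

__global__ void scan_kernel(const Job *job, int *result) {
    uint32_t nonce = job->base_nonce + blockIdx.x * blockDim.x + threadIdx.x;
    // hash_below_target stands in for the whole device-side scrypt pipeline.
    if (hash_below_target(job, nonce))
        *result = (int)nonce;  // benign race: any satisfying nonce will do
}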
In these terms, the prior work, Cudaminer, ran the PBKDF2 code on the host CPU and ran only the scrypt core loops on the GPU. As a result, it had to copy 1024 bits (128 bytes) of state per key into and out of the GPU. My code moves the entire search process into the GPU, returning only a single integer indicating whether a scan of several thousand nonces succeeded.
The use of keplerminer is straightforward, as you can see in hasher_bench.cu:
CudaHasher *h = new CudaHasher();
if (h->Initialize() != 0) {
fprintf(stderr, "Could not initialize hasher. Exiting\n");
delete h;
return(-1);
}
int rc = h->ScanNCoins(job, target, n_attempts, &stop, NULL);
if (rc != -1) {
  printf("Found a hash collision for nonce %d\n", rc);
}
You can see the ScanNCoins function in hasher.cu.
By doing this, each invocation of the kernel requires copying in and out only a few hundred bytes of memory, and the kernel runs for decent fractions of a second, ensuring very high utilization. I implemented PBKDF2 by simply copy/pasting the simple CPU versions from Colin Percival's code. The implementation of PBKDF2 on the GPU is trivial and could no doubt be optimized, but it's only 3-4% of the total runtime, so it didn't seem worth it.
Let's get back to making the core scrypt fast, now that the GPU stays busy without waiting on the host. When parallelizing a trivially parallel algorithm such as coin mining, there are two design options:
- One key per core.
- Spread one key's work across multiple cores.
The drawback to option #1 is that if the work involves a lot of state or memory, you can't run many parallel threads: you use up all of the registers and SRAM on the GPU with relatively few threads. From looking at scrypt above, it should be clear that there's a lot of state for each key being evaluated: about 1024 bits of current key state, plus another 1024 bits read from the state vector. Stored in 32-bit registers, that is over 64 registers per thread - more than the GPU can afford while keeping enough threads in flight, so it starts spilling registers to local memory (backed by the L1 cache) and slowing down.
Of course, taking option #2 requires going quite a bit deeper into the algorithm and extracting more parallelism from it. (This is fun.) Fortunately, there is about a 4x increase in parallelism to be found in scrypt. Interestingly, on most GPUs, scrypt is not memory latency or bandwidth limited: It's compute bound on both CPUs (using L2 cache) and GPUs (using high-bandwidth memory). So if we can parallelize the instruction stream inside it, it may be worthwhile.
So let's look more at salsa8_xor, which is derived from the gorgeously straightforward mixing function from the Salsa20 cipher by D. J. Bernstein. This one mixes 512 bits of input and is done twice to cover the full 1024 bits of state:
x00 = B[0] ^= Bx[0]
x01 = B[1] ^= Bx[1]
...
x15 = B[15] ^= Bx[15]
Then put those 16 integers in a 4x4 matrix (conceptually):
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
and mix them four times:
First, mix within each column. Example from column 1:
x04 ^= (x00+x12)<<<7;
x08 ^= (x04+x00)<<<9;
x12 ^= (x08+x04)<<<13;
x00 ^= (x12 + x08)<<<18;
Then mix within each row. Example from row 1:
x01 ^= (x00+x03)<<<7;
x02 ^= (x01+x00)<<<9;
x03 ^= (x02+x01)<<<13;
x00 ^= (x03+x02)<<<18;
And then sum them with the inputs:
B[0] += x00
...
B[15] += x15
So, this looks cool. The input XORs and the final additions can be done up to 16 ways in parallel. The mixing in the middle can be done by four threads in parallel: each row (and column) can be mixed independently. However, there's a transpose in between: the thread handling the first column must go from operating on x00, x04, x08, x12 to operating on x00, x01, x02, x03.
Enter the NVidia Kepler architecture shuffle instruction. Shuffle lets a thread read registers directly from other threads in its "warp" (a group of 32 threads that execute concurrently). Using this, we can execute the inner-loop transpose in only three instructions. You can see this in hasher.cu in the salsa_xor_core function.
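As a rough illustration of the pattern (a sketch, not the actual salsa_xor_core): if the 4x4 state is stored skewed across the four threads handling one key - a common trick in SIMD Salsa20 implementations - the column-to-row realignment is just a rotation of three registers across lanes:

// lane is this thread's position (0-3) within its 4-thread group,
// e.g. int lane = threadIdx.x & 3;
__device__ void realign(uint32_t &x1, uint32_t &x2, uint32_t &x3, int lane) {
    // __shfl with width 4 exchanges values within each 4-thread group:
    // three shuffles rotate the off-diagonal registers into place.
    x1 = __shfl((int)x1, (lane + 1) & 3, 4);
    x2 = __shfl((int)x2, (lane + 2) & 3, 4);
    x3 = __shfl((int)x3, (lane + 3) & 3, 4);
}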
The next part is to not waste memory bandwidth. A well-known GPU optimization is to issue coalesced memory reads, where the threads of a warp read adjacent values from the same larger block of memory. Recall that, because of the 4-way parallelization, each thread in keplerminer holds eight 32-bit words of the scrypt state that need to be read and written. To speed this up, the code (see the sketch after this list):
- Performs reads and writes using the 128-bit int4 type;
- Thread t recruits thread t+4 to write four of its 32-bit registers for it.
This is in the read_keys and write_keys functions in hasher.cu; a simplified sketch follows.
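Here is a simplified sketch of the store pattern (hypothetical names and layout; the real write_keys also does the thread-t+4 handoff described above, which this sketch omits):

__device__ void write_state(int4 *scratch, int4 lo, int4 hi) {
    // scratch points at this block's region of the scrypt scratchpad.
    int t   = threadIdx.x & 3;       // lane within this key's 4-thread group
    int key = threadIdx.x >> 2;      // which of the block's keys this group handles
    int4 *base = scratch + key * 8;  // eight int4 slots = 1024 bits per key
    // Each group of four threads covers two contiguous 64-byte segments,
    // so the warp's 128-bit stores coalesce into wide transactions.
    base[t]     = lo;
    base[t + 4] = hi;
}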
There are some other small optimizations of loop unrolling, etc., that you can see in the code, but those two high-level speedups give the majority of benefits. The results are nice:
Macbook Pro: 34 kh/s -> 62 kh/s
1/2 NVidia Grid K2: ~150 kh/s -> 220 kh/s
And, by moving all of the PBKDF2 functions into the GPU and reducing the number of host-side kernel invocations, the CPU use drops from 20% down to 0%, freeing up a bit of host CPU for mining if you're so inclined.
The status of the code is so-so: I haven't integrated it with a miner in a nice, usable way yet, so building it is a hack. There's also a known issue (bug?): my code works on my Macbook's GT550m, a Tesla K20c, and Amazon's g2 instances (Grid K2), but on a GTX 650Ti it slows down after 15 minutes of processing for no apparent reason, and only a machine reboot fixes it. I'm not sure if that's a code problem or a GPU problem.
Happy parallel algorithms!