Accelerating cryptocurrency Mining with Intel ISPC

Daniel Lemire and Maxime Chevalier just had a great exchange on Twitter about the state of compilers and being able to automatically vectorize code, such as a scalar product.  Of course, hoping the compiler can correctly vectorize your code can be a bit fragile, and as Maxime points out, writing raw intrinsics results in an unreadable pain in the editor, and a GPU-style implementation in CUDA or OpenCL might be more scalable and maintainable.

A few years ago, some folks at Intel wrote a compiler called ISPC, the Intel SPMD Program Compiler.  A possibly unfairly simple way to describe ISPC is that it's OpenCL-style programming for x86 vector units.  You can write code that looks like this:

export uniform float scalar_product(uniform float a[],
                                    uniform float b[],
                                    uniform int count) {
    float a_times_b = 0.0;
    foreach (i = 0 ... count) {
        a_times_b += a[i] * b[i];
    }
    return reduce_add(a_times_b);
}

After compiling this appropriately with ISPC and linking it in, you'd have an appropriately-vectorized implementation of scalar product for the architecture(s) you specified.  Pretty neat!

(If you can't tell, I'm a fan of ISPC, and I don't know for the life of me why it hasn't received more attention.)

At the end of 2013, sick in bed with bronchitis, I got interested in playing around with proof-of-work functions for cryptocurrencies.  One was Momentum, for which I wrote and released the first publicly-available GPU miner.  As a part of playing with Momentum, I decided to also write an optimized CPU miner.  At the time, the fastest available one was from yvg1900, a (I believe) Ukraine-based developer who writes some of the most astoundingly optimized code I've seen.  The momentum miner was the fastest available, but didn't seem as fast as it could be on Intel Haswell series processors, which support AVX-2 (256 bit integer vector support), so I decided to try to improve upon it.  

The Momentum proof of work relies upon finding partial collisions in a large (2^26) set of hashes.  To improve upon the miner, I used the same fast duplicate detection mechanism I'd devised for the GPU version.  After implementing that, it turned out that the vast majority of time was then spent running the SHA-512 hash function, even after linking against Intel's highly tuned version of it.  The Intel optimized SHA-512 didn't scale too well to AVX2:  Going to SSE-optimization (128 bit vectors) gave it a big speedup, but doubling the vector width to 256 bits didn't come close to doubling the speed.

Fortunately, there was a solution at hand.  In most cryptocurrencies, the hash function can be trivially parallelized, as the input is of the form:  Hash(block stuff, nonce), where the nonce is usually an integer that's incremented for each turn.  So you can compute many hashes in parallel -- this is one of the reasons GPUs and ASICs work so well for cryptocurrency mining.

I turned to ISPC to write an 8-way parallel SHA-512 implementation.  It turns out to have been approximately trivial.  I started with some GPU-style unrolled, very-inlined code that implemented SHA-512 on a restricted sized input, and then wrapped that for ISPC:


export void sha512_ispc(uniform unsigned int64 data[5], 
                        uniform unsigned int64 * uniform buf,
                        uniform int base) {
  for (uniform int i = 0; i < 8; i+= programCount) {
    sha512_block(&buf[i*8], i+programIndex, data, base);
  }
}

In this code, the sha512_block code is not parallelized -- it just implements a single hash at a time.  I've lost some of my original source code for this, but the results were fast:


dga-sha512_ispc:   1.239s
intel-avx2:        2.853

The output of using ispc like this is probably not as fast as expert-optimized code.  I've looked at some of the Intel hand-optimized cryptographic functions, and they're the work of wizards -- they expertly juggle between both the vector units and the scalar units to get maximum throughput out of both.  But for the game I was playing a few years ago, I didn't have to -- I just had to write a fast miner sooner than anyone else did.  ISPC was a great tool for doing so.

(For the record and for fairness, of course, I should note that yvg1900 later released a version of their AVX2 Momentum miner that was faster than mine, also using ISPC for some of the code generation, and also that there are many things that make software good -- the yvg1900 miners are robust and well written, and mine was a quick hack.)

I've put the source code for the AVX2-optimized ISPC SHA512 code on my Github -- but be warned, I didn't extract from the original framework the makerules to build the Intel parts of the code.

(Like what you're reading?  See more in the Archive...)

Comments

Popular posts from this blog

Reflecting on CS Graduate Admissions

Chili Crisp Showdown: Laoganma and Flybyjing

Moving the Pi Searcher from Go to Rust