Integer log10 in Rust and C
Following the stabilization of the integer log10 interface in Rust 1.67, I got interested in seeing whether it could be optimized. There's well-known lore in this area, ranging from the omnipresent Hacker's Delight implementation to more recent work (discussed here). I had a bit of a new idea about how to express it that turned out to be inferior, for reasons I find interesting. Let's dive in.
Most of the approaches involve a two-step process: (1) Use integer log2 to approximate the log10 result; and (2) check to see if the value exceeds the next log10 threshold and increment if so.
To make that concrete, consider the number 14. Its rounded-down log2 is 3, a bucket that covers the numbers 8-15. One would therefore map 3 -> 0 (i.e., "numbers whose log2 is 3 are assumed to have a log10 of 0"), but then test whether the number is >= 10, in which case it gets bumped up by one.
This algorithm was succinctly described by Steve Cannon in a Stack Overflow post, using a lookup table for each step: (1) Map the log2 to a log10 estimate: values 0-3 all map to a log10 of 0, values 4-7 map to a log10 of 1, and so on. (2) Given the log10 estimate, look up a threshold value in a second table and bump the estimate if the number exceeds it.
In Rust, it might look something like this (a sketch; the table names and exact contents here are illustrative and may differ slightly from the repo):
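```rust
// Assumes x > 0 (the log10 of zero is undefined).

/// Step 1: map ilog2(x) (0..=31) to a log10 estimate that is either exact
/// or one too small.
const LOG10_GUESS: [u32; 32] = [
    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,
    4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8,
];

/// Step 2: the estimate is one too small exactly when x exceeds the next
/// power-of-ten boundary. Entry 9 is unreachable padding.
const TEN_THRESHOLDS: [u32; 10] = [
    9, 99, 999, 9_999, 99_999, 999_999,
    9_999_999, 99_999_999, 999_999_999, u32::MAX,
];

pub fn ilog10_tables(x: u32) -> u32 {
    debug_assert!(x > 0);
    let guess = LOG10_GUESS[x.ilog2() as usize];
    guess + (x > TEN_THRESHOLDS[guess as usize]) as u32
}
```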
Advantage: It's pretty easy to read and interpret. But it's not as fast as it could be.
Hank Warren's version in Hacker's Delight improves the guess calculation by replacing the table with a simple multiply-and-shift: (log2 * 9) >> 5. Or, in other words, log2 * 9 / 32, which is a close enough approximation to dividing by log2(10) that it works for the small integer values we're dealing with -- remember that the floor log2 of a nonzero u32 can only be from 0 to 31.
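A sketch of that variant in Rust (again assuming x > 0, and reusing the threshold table from the previous listing):

```rust
pub fn ilog10_mul(x: u32) -> u32 {
    debug_assert!(x > 0);
    let guess = (x.ilog2() * 9) >> 5; // ~ log2(x) / log2(10)
    guess + (x > TEN_THRESHOLDS[guess as usize]) as u32
}
```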
where TEN_THRESHOLDS is the same table from above. Nicely, on x86-64, this compiles to just a few instructions. Multiplying by 9 is a single instruction, lea (x, x, 8), followed by a shift. So while it looks kind of complex, it's actually very fast. The ilog2 is implemented as two instructions: an lzcnt (leading-zeros count) followed by an xor with 31 to compute 31 - leading zeros. Implemented in hand-written assembly to avoid any compiler weirdness, it looks basically like this:
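```nasm
; NASM syntax, argument in edi (assumed nonzero), result in eax.
; Assumes the same ten-entry TEN_THRESHOLDS table and a non-PIC build.
ilog10_mul:
        lzcnt   eax, edi                      ; leading zeros of x
        xor     eax, 31                       ; 31 - lzcnt = floor(log2(x))
        lea     eax, [rax + 8*rax]            ; log2 * 9
        shr     eax, 5                        ; (log2 * 9) / 32 = the guess
        cmp     [TEN_THRESHOLDS + 4*rax], edi ; carry set iff x > threshold[guess]
        adc     eax, 0                        ; bump the guess if so
        ret
```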
Six instructions, all cheap on modern CPUs. Note some of the compiler cleverness: the standard use of lea to implement a shift and an add in one instruction, here synthesizing *9 as (x + (x << 3)), and a very nice trick of using the carry flag after the comparison to get a single-instruction "increment if greater than".
I came up with a (as far as I know) novel approach to computing the guess, which compresses Steve Cannon's table into a single u32:
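```rust
// A reconstruction of the idea: LZ_GUESSMASK has one bit set at each
// leading-zero count where the log10 estimate steps up, so the exact
// constant here is illustrative and may differ from the repo's encoding.
const LZ_GUESSMASK: u32 = 0x0891_2244;

pub fn ilog10_popcount(x: u32) -> u32 {
    debug_assert!(x > 0);
    // Drop the step bits x hasn't reached (positions below its leading-zero
    // count) and count the rest to get the log10 estimate.
    let guess = (LZ_GUESSMASK >> x.leading_zeros()).count_ones();
    guess + (x > TEN_THRESHOLDS[guess as usize]) as u32
}
```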
The LZ_GUESSMASK is an encoding of the places where the log10 value increases by 1, so you can find your log10 value by shifting out the bits you don't want and popcounting the remainder. Because it can use the lzcnt output directly, it saves an instruction compared to the Hacker's Delight version in a tight loop; outside a benchmark, though, it needs an extra mov to load the constant, so it ends up being about a wash -- and only on platforms with a fast popcount. AMD's Zen architecture has such a popcount, but on Intel and ARM, popcount is a few cycles slower.
The Willets approach discussed in Daniel Lemire's blog post above is fast - really fast - in a benchmark loop. It does, however, use a substantially larger lookup table: 256 bytes, vs. only 40 bytes for the others. In a tight loop, the entire lookup table is guaranteed to stay in L1 cache, and you don't have to worry about the L1 pressure from other code. But in real applications, you're less likely to hit in that table, and the extra size (4 cache lines instead of 1) will hurt. This leads me to favor the venerable Hacker's Delight version as a better balance of performance and size. (And, compared to my popcount version, the Hacker's Delight version performs well on any architecture with a leading-zeros-count instruction, instead of only those that also have a fast popcount.) Another win in the books for Hacker's Delight!
Let's look at performance, with execution times in microseconds to calculate ilog10 for all u32's. The x86-64 numbers are compiled with RUSTFLAGS="-C target-feature=+popcnt,+lzcnt", as the default build does not assume the target architecture supports these features. (They've been around for a decade, so using them is reasonable when you care about performance.)
Platform | popcount (µs) | mul (µs) | stdlib (µs) |
---|---|---|---|
Macbook Pro M1 | 2253311 | 2204559 | 4371157 |
AMD EPYC | 2311064 | 2312847 | 5660831 |
Skylake Xeon | 3653631 | 2968868 | 6259066 |
The Hacker's Delight version is the clear winner when you look across architectures, but I'm going to keep my popcount trick in my pocket in case it brings more ideas.
Notably, while I'm comparing with popcnt and lzcnt enabled (or on the ARM M1, where clz and the NEON popcount are available), the existing Rust stdlib version is faster on platforms that don't have a clz/lzcnt instruction, so it's worth keeping around and using architecture-specific flags to pick between them. Now, what we might argue about is whether the default Rust codegen for x86-64 should include lzcnt... but that's a topic for another day. :) And if nothing else, modern ARM users will get a faster integer log10.
As an aside, it's remarkably nice that in 2023 you can exhaustively check all possible values of a u32 in just a few seconds, and if you bring in the Rayon library in Rust, you can trivially parallelize it. User steffhan on the Rust internals forum suggested a one-liner that runs on my MacBook Pro in 0.78 seconds - that's about 11 billion integer log10 calculations per second, using about 7.5 seconds of CPU time.
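It was along these lines (my paraphrase rather than the exact code, assuming Rayon as a dependency and the ilog10_mul sketch from above), sweeping every nonzero u32 and comparing against the standard library:

```rust
use rayon::prelude::*;

fn main() {
    // Check every nonzero u32 in parallel (ilog10 of zero is undefined).
    (1..=u32::MAX).into_par_iter().for_each(|x| {
        assert_eq!(ilog10_mul(x), x.ilog10());
    });
}
```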
I do keep thinking it would be nice to find a multiplicative approach that could both be implemented in a single lea and work from the lzcnt result directly instead of first needing to compute 31 - lzcnt, but I haven't put my thinking cap on enough there yet. It'd save one xor. Maybe you could calculate it backwards, use that to look up in a reversed comparison table, and then compute the corrected version in parallel with the table lookup, shrinking the critical path by one instruction. Hmmm...
Some code is on GitHub. While I've focused on the Rust numbers here, I also provide C (and some assembly) implementations. I'm less confident about the accuracy of the C benchmarks, as I didn't make absolutely sure the compiler couldn't optimize too much of the work away.
This post owes a debt of gratitude to all of the participants in the discussion on the Rust internals forum, where several people contributed optimization ideas that made the above code substantially better in practice and gave very useful feedback.