Moving the Pi Searcher from Go to Rust
Seven years ago (almost to the date), I wrote about my experience moving the Pi Searcher from C++ to Go. In that article, I mentioned that while the core searching code got slower, the system as a whole sped up by avoiding the per-request fork overhead of the earlier CGI-based code.
This year I moved the Pi Searcher to Rust, hoping to get the best of both worlds. The tl;dr is that it worked and gave me and end-to-end throughput increase of at least 22%. But not entirely in the place I'd expected! The code is basically a transliteration of the Go to Rust, except for moving from Go's internal net/http server to actix-web for Rust. The algorithm is the same: sequential search for short search strings, and using a suffix array for longer strings.
Go Version: 8.1 microseconds/search
Rust version: 6.1 microseconds/search
No big deal. And not surprising - this time is bounded by the binary search / indirect memory reference time. With 200M positions in the file, searching takes about log2(200M) = 27 steps. Each of those steps involves two dependent memory lookups, and at about 100ns per memory lookup, that's 27*2*100 = 5.4 microseconds of memory time. That's obviously a handwavy calculation that ignores real memory timings and the fact that the first few lookups will hit in cache, but it suggests that both bits of code are mostly memory-bound.
Rust
Requests per second: 6657.61 [#/sec] (mean)
Time per request: 0.300 [ms] (mean)
Time per request: 0.150 [ms] (mean, across all concurrent requests)
Go
Requests per second: 6074.71 [#/sec] (mean)
Time per request: 0.329 [ms] (mean)
Time per request: 0.165 [ms] (mean, across all concurrent requests)
So far, that doesn't look all that better, and certainly not like it was worth rewriting. But the pi searcher doesn't exist in a vacuum - the backend is accessed through an Apache reverse proxy.
Requests per second: 2811.57 [#/sec] (mean)
Time per request: 0.711 [ms] (mean)
Time per request: 0.356 [ms] (mean, across all concurrent requests)
(The "legacy" bit is because the pi searcher exports both a modern JSON-based API and a legacy interface that looks like the old CGI interface.) I'm sure there are better ways to accomplish what I was trying to do: I'm not a Rust expert by any means; I used this rewrite as a way to dive in a little to see what creating web services was like in Rust so that I could better evaluate the costs/benefits for real work projects.
Rust's binary_search_by_key isn't designed as well for suffix array use. Actually, it's designed and implemented just fine, but the documentation notes:
I've internalized the -1, 0, 1 mapping of C-style comparison functions such as Go's, but I really liked here that it was an explicit type, not an integer.
The 22% speedup is probably under-measuring, but it's also the number I care about: the actual throughput my server can sustain.
Things that were painful: Getting the routing and whatnot set up with actix-web took a lot of work. The documentation isn't comprehensive for weird usecases like mine (which are, admittedly, stupid -- I'm supporting a legacy interface on both GET/POST and a JSON-ish interface that also stupidly takes both GET/POST). Sorting through the twists of using #[get(...)] vs explicitly routing things .route(web::get().to(serve_legacy_get) took a while to get right. Mixing my own "in-band" error handling with Actix's error handling was also a pain. It's clear that the framework is designed for supporting modern API development usecases, and that's not unreasonable.
I miss having explicitly named multiple return values. The Search function returns a tuple of (position, count), both of which are of type usize, and it's nice to have the semantics documented explicitly through return parameter naming in Go.
Things that were awesome: Automatic request parsing into a struct (and automatic marshaling of the struct back to json) was very nice. I'd implemented the JSON marshaling as a wrapper in the Go version, but it was automatic in Rust.
Rust has better support for some functional forms, so, for example, finding the "next search result" (i.e., the element that appears first after position p) was elegant:
I do love that compared to the Go version:
One thing I had been worried about was mmap support, but this was mostly a no-op. The only issue was that I had to recalculate the "mapped as uint32" version in functions where I used it (which is basically nothing), because one doesn't store two pointers to the same thing in a struct in Rust:
No biggie and if I did it in more than two places it'd be an easy helper function.
This year I moved the Pi Searcher to Rust, hoping to get the best of both worlds. The tl;dr is that it worked and gave me and end-to-end throughput increase of at least 22%. But not entirely in the place I'd expected! The code is basically a transliteration of the Go to Rust, except for moving from Go's internal net/http server to actix-web for Rust. The algorithm is the same: sequential search for short search strings, and using a suffix array for longer strings.
Results
Code length isn't too different. I was a little sloppy in writing the Rust version, so I don't think a comparison is entirely fair, but they're within 50% of each other and would probably be closer if I structured the Rust better.Core search times using the suffix array
The core search didn't change by much, as timed on my 2017 i7 Macbook Pro once the cache was fully warmed:Go Version: 8.1 microseconds/search
Rust version: 6.1 microseconds/search
No big deal. And not surprising - this time is bounded by the binary search / indirect memory reference time. With 200M positions in the file, searching takes about log2(200M) = 27 steps. Each of those steps involves two dependent memory lookups, and at about 100ns per memory lookup, that's 27*2*100 = 5.4 microseconds of memory time. That's obviously a handwavy calculation that ignores real memory timings and the fact that the first few lookups will hit in cache, but it suggests that both bits of code are mostly memory-bound.
Direct access over HTTP to the JSON results
What about direct reqs/second, measured using ab -c 2 -n 1000? Measurements on the Pi searcher server (an n1 small GCP instance):Rust
Requests per second: 6657.61 [#/sec] (mean)
Time per request: 0.300 [ms] (mean)
Time per request: 0.150 [ms] (mean, across all concurrent requests)
Go
Requests per second: 6074.71 [#/sec] (mean)
Time per request: 0.329 [ms] (mean)
Time per request: 0.165 [ms] (mean, across all concurrent requests)
So far, that doesn't look all that better, and certainly not like it was worth rewriting. But the pi searcher doesn't exist in a vacuum - the backend is accessed through an Apache reverse proxy.
Access via the Apache reverse proxy
GoRequests per second: 2811.57 [#/sec] (mean)
Time per request: 0.711 [ms] (mean)
Time per request: 0.356 [ms] (mean, across all concurrent requests)
Rust
Requests per second: 3445.73 [#/sec] (mean)
Time per request: 0.580 [ms] (mean)
Time per request: 0.290 [ms] (mean, across all concurrent requests)
Ah-ha! We finally have a speedup. And what's probably happening is that the speedup is a little bigger than it looks here, but we're bottlenecked in our measurements by doing the measurements on the same machine. With the server running both the ab benchmarker, the Apache proxy, and the pi searcher, it's hard to find huge speedups. I'd probably be better off moving to nginx but that's for another year.
Observations
The process of rewriting wouldn't have been all that painful for someone familiar with actix-web. As a newbie, I spent a lot of time fighting with it to be able to use its derived parsers for requests while still being able to handle errors myself in some cases. I still don't like the solution I came up with, where there are two analogous-but-different functions to serve GETs and POSTs:(The "legacy" bit is because the pi searcher exports both a modern JSON-based API and a legacy interface that looks like the old CGI interface.) I'm sure there are better ways to accomplish what I was trying to do: I'm not a Rust expert by any means; I used this rewrite as a way to dive in a little to see what creating web services was like in Rust so that I could better evaluate the costs/benefits for real work projects.
Rust's binary_search_by_key isn't designed as well for suffix array use. Actually, it's designed and implemented just fine, but the documentation notes:
If there are multiple matches, then any one of the matches could be returned.This is an entirely sane decision for an API designer to make, because it gives them freedom to change the implementation in the future. Unfortunately, as an API user, it means I can't use it to find the full range of results in a suffix array, which one does by finding the first and last matching elements. I had to re-implement binary search within my code, which, fortunately, was about 15 lines of very easy-to-read code. This was a place that Rust really shined:
I've internalized the -1, 0, 1 mapping of C-style comparison functions such as Go's, but I really liked here that it was an explicit type, not an integer.
The 22% speedup is probably under-measuring, but it's also the number I care about: the actual throughput my server can sustain.
Things that were painful: Getting the routing and whatnot set up with actix-web took a lot of work. The documentation isn't comprehensive for weird usecases like mine (which are, admittedly, stupid -- I'm supporting a legacy interface on both GET/POST and a JSON-ish interface that also stupidly takes both GET/POST). Sorting through the twists of using #[get(...)] vs explicitly routing things .route(web::get().to(serve_legacy_get) took a while to get right. Mixing my own "in-band" error handling with Actix's error handling was also a pain. It's clear that the framework is designed for supporting modern API development usecases, and that's not unreasonable.
I miss having explicitly named multiple return values. The Search function returns a tuple of (position, count), both of which are of type usize, and it's nice to have the semantics documented explicitly through return parameter naming in Go.
Things that were awesome: Automatic request parsing into a struct (and automatic marshaling of the struct back to json) was very nice. I'd implemented the JSON marshaling as a wrapper in the Go version, but it was automatic in Rust.
Rust has better support for some functional forms, so, for example, finding the "next search result" (i.e., the element that appears first after position p) was elegant:
I do love that compared to the Go version:
One thing I had been worried about was mmap support, but this was mostly a no-op. The only issue was that I had to recalculate the "mapped as uint32" version in functions where I used it (which is basically nothing), because one doesn't store two pointers to the same thing in a struct in Rust:
let idxu32: &[u32] = bytemuck::cast_slice(&self.pi_index_map);
No biggie and if I did it in more than two places it'd be an easy helper function.
Wrapup
A 22% speedup isn't bad, and is probably under-measured because of the other overhead in the system that should also be removed (hi, Apache, I'm looking at you). But it was a good experience, and I'm pleased with the results: this is code that I've poked at on and off for quite a while, and getting a nontrivial speed boost without adding any real complexity or extra lines of code was quite welcome. There are aspects of the Rust ecosystem that remain frustrating -- error handling and documentation primary among them, and Actix is fast and helpful but has a lot of sharp edges -- but it's matured drastically since the last time(s) I've taken a peek at it.
Comments
Post a Comment