A recent blog post asked "Does C++ still deserve its bad rap?"
While that title presumes the answer (after all, a "bad rap" implies that the judgement is undeserved), I think C++'s reputation for making it very difficult to achieve security, memory safety, or thread safety remains well-deserved, even though it's gotten a lot better over the years. I mean that: the C++17 version of that program is years better than it would have been with C++0x, and it let a careful programmer write a nice program.
The post used a simple multi-file word count as an example: Count all words in all ".txt" files in the current directory and all of its subdirectories, where words were defined as things matching the regexp "([a-z]{2,})" in a case-insensitive manner.
First, I'll walk through the example from that blog post and identify some remaining bugs. Then we'll build up a (IMO) better version in Rust that's faster, safer, and more concise. Let's start with the body of the original program, which processes the files and adds the counts to a hash table.
A few things immediately come to mind when examining that code from a safety perspective. First, it uses a filesystem::recursive_directory_iterator to identify the files in the current directory, testing their types and names before processing them. It then opens each file using std::ifstream.
From the way I phrased this, you're probably already guessing that this is a TOCTTOU error -- time-of-check-to-time-of-use. The program validates that the entry is a regular file, but then opens it later and assumes that the result of the check still holds. We should ask ourselves:
- What happens if the file has been deleted between the directory listing and the open?
- What happens if the file has been replaced by, e.g., a pipe or some other non-regular file?
For the second case, it's pretty clear that the program will try to open it and operate on it; this is therefore a bug with respect to the intent of the programmer. Is it a huge bug in this context? Not at all, but analogous bugs have led to serious security problems.
For the first case, I wasn't sure - and I bet many other C++ programmers aren't either. The answer is that the std::ifstream is safe, and will simply behave in the same way as if EOF was reached - but I wasn't certain of that until I googled a lot and wrote a test program to verify it. Accidental correctness is better than being wrong, but we should strive for more in our programs.
Finally, the way it's written isn't very conducive to future multithreading. That wasn't the point of the exercise, but I think it's worth thinking about, as many modern data-processing programs take an evolutionary path from single-threaded to multi-threaded. Constructs like its sort-and-print-the-top-10 code invite threading bugs:
(1) It separates the computation of a safe size range from the use of that range; (2) it destructively modifies the word_array for no reason other than to make it easy to print the first 10 items. And adding multithreading to things in C++ remains somewhat painful.
Here are some of the reasons I'm coming to prefer Rust as a systems language. It's not remarkably shorter - the Python or shell version of all this can be expressed in a few lines of code! But it's a bit shorter, and it's clearer where corners have been cut and where errors are properly handled:
use anyhow::Result;
use lazysort::SortedBy;
use std::collections::HashMap;
use std::io::prelude::*;

fn scanfiles() -> Result<()> {
    let mut wordcounts = HashMap::new();
    let words = regex::Regex::new(r"([a-zA-Z]{2,})")?;
    for file in globwalk::glob("*.txt")?
        .filter_map(Result::ok) // Skip directory entries that errored.
        .filter_map(|dirent| std::fs::File::open(dirent.path()).ok()) // Skip files that fail to open.
        .filter(|file| {
            // Skip anything that isn't a regular file.
            file.metadata()
                .map(|md| md.is_file())
                .unwrap_or(false)
        })
    {
        let reader = std::io::BufReader::new(file);
        for line in reader.lines().filter_map(Result::ok) {
            for word in words.captures_iter(&line) {
                let w = word[0].to_lowercase();
                *wordcounts.entry(w).or_insert(0) += 1;
            }
        }
    }
    let words_sorted = wordcounts.iter().sorted_by(|a, b| b.1.cmp(a.1));
    for kv in words_sorted.take(10) {
        println!("{} {}", kv.0, kv.1);
    }
    Ok(())
}

fn main() {
    if let Err(e) = scanfiles() {
        println!("Error: Something unexpected happened: {:#?}", e);
    }
}
Ok, that's 39 lines of code, not counting the Cargo.toml file, which adds another four non-boilerplate lines (the four dependencies, which are fairly analogous to the #include lines in the C++ version, so they should be counted).
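For completeness, those dependency lines look something like this sketch (the exact version numbers are my guess for the rustc 1.47 era, not taken from the original post):

[dependencies]
anyhow = "1.0"    # convenient error type for the Result<()> return
globwalk = "0.8"  # recursive directory walking with glob patterns
lazysort = "0.2"  # lazy partial sorting (the SortedBy trait)
regex = "1.0"     # the word-matching regular expression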
A little shorter. But what I like about the Rust version is that there's a lot less guessing about the error handling. Did the file fail to open? We know it's skipped, because the filter_map will discard that file if the result of the open is not ok. We know errors in globwalk or in constructing the regex are handled, because the ? will cause the function to return an error if they return one. The more functional .take(N) idiom is a little more idiot-proof than the equivalent code in the C++ version: we know it simply stops early if there are fewer than N items.
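As a toy illustration (mine, not from either post): .take(10) on an iterator with only three items just yields those three and stops.

fn main() {
    let counts = vec![("the", 3), ("of", 2), ("cities", 1)];
    // take(10) stops when the iterator is exhausted; there is no
    // out-of-bounds range to compute incorrectly.
    for kv in counts.iter().take(10) {
        println!("{} {}", kv.0, kv.1);
    }
}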
While modern C++ lets a careful programmer write a good program, it still lets a less-careful programmer do things in ways that create errors. Rust has less of that: it will nag you to cover corner cases more carefully.
Rust doesn't make it magically easier to avoid that TOCTTOU problem - it would have been just as easy to write the code using std::fs::metadata(path) in the same way the C++ version did. But it does improve my confidence that, were that bug present, the code would have handled the "file doesn't exist" problem. It still would have opened a special file or directory with aplomb.
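To make that concrete, here's what the TOCTTOU-prone shape would look like in Rust (an illustrative sketch, not code from either post; open_if_regular is a hypothetical helper):

use std::fs;
use std::io;
use std::path::Path;

// TOCTTOU-prone by construction: the metadata check and the open are two
// separate filesystem operations, and the file can be deleted or replaced
// in between them.
fn open_if_regular(path: &Path) -> io::Result<fs::File> {
    let md = fs::metadata(path)?; // time of check
    if !md.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    fs::File::open(path) // time of use: the check may no longer hold here
}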
It's also very easy to convert to a multithreaded program: just turn the for x in ... loop into a for_each, and use the Rayon crate's .par_bridge() to cause the closure passed to for_each to be executed in parallel.
globwalk::glob("*.txt")?
    .filter_map(Result::ok)
    .filter_map(|dirent| std::fs::File::open(dirent.path()).ok())
    .filter(|file| {
        file.metadata()
            .map(|md| md.is_file())
            .unwrap_or(false)
    })
    .par_bridge() // Hand the iterator off to Rayon's parallel machinery.
    .for_each(|file| {
Of course, that won't compile, because we've forgotten to use any locking on the hash table into which we're inserting counts:
error[E0596]: cannot borrow `wordcounts` as mutable, as it is a captured variable in a `Fn` closure
--> src/main.rs:25:22
|
25 | *wordcounts.entry(w).or_insert(0) += 1;
| ^^^^^^^^^^ cannot borrow as mutable
So then we either switch to a parallel hash table, or we guard it with a mutex. I'll take the simple approach of locking the table before using it, and do just enough optimization to make it faster than the single-threaded version. In this case, I parse each line into lowercased words, which I store in a vector, before locking the hash table and bulk-inserting the line's worth of words:
use anyhow::Result;
use lazysort::SortedBy;
use rayon::prelude::*; // NEW
use std::collections::HashMap;
use std::io::prelude::*;

fn scanfiles() -> Result<()> {
    let mut wordcounts = HashMap::new();
    let wordcounts_locked = std::sync::Mutex::new(&mut wordcounts); // NEW
    let words = regex::Regex::new(r"([a-zA-Z]{2,})")?;
    globwalk::glob("*.txt")?
        .filter_map(Result::ok)
        .filter_map(|dirent| std::fs::File::open(dirent.path()).ok())
        .filter(|file| {
            file.metadata()
                .map(|md| md.is_file())
                .unwrap_or(false)
        })
        .par_bridge() // NEW
        .for_each(|file| { // CHANGED
            let reader = std::io::BufReader::new(file);
            for line in reader.lines().filter_map(Result::ok) {
                // Collect the line's words before taking the lock, so the
                // critical section is just the hash-table inserts.
                let wordlist: Vec<String> = words // NEW DESIGN
                    .captures_iter(&line)
                    .map(|w| w[0].to_lowercase())
                    .collect();
                let mut wordcounts = wordcounts_locked.lock().unwrap();
                for w in wordlist {
                    *wordcounts.entry(w).or_insert(0) += 1;
                }
            }
        });
    let words_sorted = wordcounts.iter().sorted_by(|a, b| b.1.cmp(a.1));
    for kv in words_sorted.take(10) {
        println!("{} {}", kv.0, kv.1);
    }
    Ok(())
}

fn main() {
    if let Err(e) = scanfiles() {
        println!("Error: Something unexpected happened: {:#?}", e);
    }
}
Not bad - about ten lines changed to make a basic parallel version of the code.
There are some advantages to the C++ version that reflect the immaturity of the Rust standard library. The C++ version could easily use a partial_sort to reduce the amount of work done sorting by counts; I had to find the lazysort crate, which does the same thing but isn't part of the standard library.
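For comparison, a stdlib-only version of the top-10 step would look something like this sketch (a full sort, so it does more work than a partial sort over the same data):

// Stdlib-only alternative: sort all (word, count) pairs, O(n log n),
// where a partial sort would only need to place the top 10.
let mut pairs: Vec<_> = wordcounts.iter().collect();
pairs.sort_by(|a, b| b.1.cmp(a.1));
for kv in pairs.iter().take(10) {
    println!("{} {}", kv.0, kv.1);
}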
Both are fast; depending on the C++ compiler, the Rust one may or may not be faster. (Updated: more benchmarks, prompted by @cb321 on Hacker News, and I switched the benchmark to something easily downloaded: Charles Dickens' "A Tale of Two Cities".)
- i7-7920HQ, macOS:
  Rust version: 90ms (rustc 1.47.0 (18bf6b4f0 2020-10-07))
  C++ version: 340ms (Apple clang version 12.0.0 (clang-1200.0.32.2))
- Xeon Gold 6130 (Skylake, Ubuntu 20.04):
  Rust: 76ms (rustc 1.47.0)
  C++: 60ms (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)), -O3, no PGO
With a lot of larger files, the parallel version can wordcount four 2MB sample files in 0.307 seconds using a quad-core CPU - not quite linear scaling, but not bad for a few minutes of tweaking. The single-threaded C++ version, meanwhile, is stuck at about 3.55 seconds on macOS.
Despite its shorter length, the Rust version probably took me longer to write than the C++ version would have. I had to spend more time figuring out what functions returned so I could handle their possible return types, and a bit more time googling for things like the filesystem metadata, probably because I'm newer to the language. But making it correct and making it parallel took less time, even though I'm less experienced with the language. I'll take that tradeoff in a lot of what I do.
-------
Update: Some good feedback from burntsushi about the unnecessary use of globwalk pulling in a lot of extra dependencies. This is a good point about the risks of making it too easy to use random crates off of GitHub -- or, perhaps, about the dangers inherent in having a less fully-fledged stdlib. An updated version of the code that's faster to compile is:
use anyhow::Result;
use lazysort::SortedBy;
use rayon::prelude::*;
use std::collections::HashMap;
use std::io::prelude::*;

fn scanfiles() -> Result<()> {
    let mut wordcounts = HashMap::new();
    let wordcounts_locked = std::sync::Mutex::new(&mut wordcounts);
    let words = regex::Regex::new(r"([a-zA-Z]{2,})")?;
    walkdir::WalkDir::new(".")
        .into_iter()
        .filter_map(Result::ok)
        .filter(|dirent| {
            // Replaces the glob: keep only paths ending in ".txt".
            dirent.path().extension().and_then(std::ffi::OsStr::to_str) == Some("txt")
        })
        .filter_map(|dirent| std::fs::File::open(dirent.path()).ok())
        .filter(|file| {
            file.metadata()
                .map(|md| md.is_file())
                .unwrap_or(false)
        })
        .par_bridge()
        .for_each(|file| {
            let reader = std::io::BufReader::new(file);
            for line in reader.lines().filter_map(Result::ok) {
                let wordlist: Vec<String> = words
                    .captures_iter(&line)
                    .map(|w| w[0].to_lowercase())
                    .collect();
                let mut wordcounts = wordcounts_locked.lock().unwrap();
                for w in wordlist {
                    *wordcounts.entry(w).or_insert(0) += 1;
                }
            }
        });
    let words_sorted = wordcounts.iter().sorted_by(|a, b| b.1.cmp(a.1));
    for kv in words_sorted.take(10) {
        println!("{} {}", kv.0, kv.1);
    }
    Ok(())
}

fn main() {
    if let Err(e) = scanfiles() {
        println!("Error: Something unexpected happened: {:#?}", e);
    }
}
-------
Comments:

Comment: What's the explanation for why the Rust version performs faster? The fastest code is code that never runs: what is it that the Rust version doesn't do?

Reply: > The fastest code is code that never runs: what is it that the Rust version doesn't do?
Not exactly. If we wanted to have as little code as possible, we could use an unordered linked list. Such an implementation would generate much less assembly code, since no hashing or resizing logic would be necessary; however, it would most likely run much slower.
Comment: I can make it faster with dlang..

Comment: as long as they're happy while coding '__') using go or lua or v or anything else..
Comment: The C++ version seems far more readable to me.

Comment: You still have TOCTTOU. It would also be easy to refactor the C++ version to use std::future and std::async for easy asynchronous multi-threading, with no need for a mutex, if each async task uses its own local map and returns it in the future, which the main thread then merges.
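A rough Rust analogue of that suggestion - per-worker local maps merged at the end, with no mutex - can be sketched with Rayon's fold and reduce (an illustrative sketch, not code from the thread; the count function and its inputs are hypothetical):

use rayon::prelude::*;
use std::collections::HashMap;

// Each Rayon worker folds matches into its own local HashMap; the local maps
// are then merged pairwise by reduce, so no shared mutex is needed.
fn count(lines: Vec<String>, words: &regex::Regex) -> HashMap<String, u64> {
    lines
        .par_iter()
        .fold(HashMap::new, |mut local, line| {
            for w in words.captures_iter(line) {
                *local.entry(w[0].to_lowercase()).or_insert(0) += 1;
            }
            local
        })
        .reduce(HashMap::new, |mut a, b| {
            // Merge one worker's map into another.
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        })
}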
Comment: OP author here. If you want to do the nondestructive sorting thing, the C++ standard library also has partial_sort_copy, which does exactly that. On the other hand, if you can use external libraries, there's this: https://godbolt.org/z/PanWf7
I did not write that, though. This was something that was sent to me. I make no guarantees about its behaviour, or whether it even works at all. :)
Comment: Feels like this isn't the complete picture...