Don’t be mean
I got curious about parallelism (in Rust, but the same ideas probably apply elsewhere), comparing the use of multiple CPUs via rayon against SIMD via portable_simd. I suspected that which approach would be fastest would depend heavily on several factors, like the amount and distribution of the data involved and the complexity of the desired computation.
To investigate, I wrote some code to calculate the mean of a collection of 32-bit floating point numbers in four different ways.1 The first is a sequential version of Welford’s online algorithm:
pub fn seq(xs: &[f32]) -> f32 {
    let (mut n, mut mu) = (0.0, 0.0);
    for x in xs {
        // Fold each value into the running mean.
        n += 1.0;
        mu += (x - mu) / n;
    }
    mu
}
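In symbols, after seeing the \(k\)-th value \(x_k\), the running mean becomes \(\mu_k = \mu_{k-1} + (x_k - \mu_{k-1}) / k\); each step folds one new value into the mean without ever materializing the full sum.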
Next, we have a rayon-ified version of Chan et al.’s parallel generalization. It first folds the same sequential update over each chunk in parallel, and then merges the per-chunk results together.2
use rayon::prelude::*;

pub fn par(xs: &[f32]) -> f32 {
    xs.into_par_iter()
        .fold(
            || (0.0, 0.0),
            // The sequential Welford update, run within each chunk.
            |(n_old, mu_old), x| {
                let n = n_old + 1.0;
                let mu = mu_old + (x - mu_old) / n;
                (n, mu)
            },
        )
        .reduce(
            || (0.0, 0.0),
            // Chan et al.’s merge of two partial (count, mean) pairs.
            |(n_a, mu_a), (n_b, mu_b)| {
                let n = n_a + n_b;
                let mu = mu_a + (mu_b - mu_a) * n_b / n;
                (n, mu)
            },
        )
        .1
}
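The merge in reduce is just the weighted average of the two partial means, \(\mu = \frac{n_a \mu_a + n_b \mu_b}{n_a + n_b} = \mu_a + (\mu_b - \mu_a) \frac{n_b}{n_a + n_b}\), rearranged to have the same shape as the sequential update.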
I also developed two SIMD versions of the above, seq_simd<N> and par_simd<N>, for each supported lane count N. These were relatively straightforward to write and test3 based on their non-SIMD counterparts. The only parts that required a little more thought were collecting the N intermediate values and then “cleaning up” at the end: if the number of values isn’t a multiple of the lane count N, we need to pick up the last (potentially up to N - 1) elements.
I won’t reproduce the full SIMD code here, as I’ve already copy-pasted a fair amount of code as it is, but it’s all available on my GitHub.
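That said, a rough sketch of the shape seq_simd<8> takes looks something like the following. This is a simplified reconstruction from the description above, not the code from the repo; the actual lane-merging and cleanup details may differ:

#![feature(portable_simd)]
use std::simd::f32x8;

pub fn seq_simd8(xs: &[f32]) -> f32 {
    // Split the input into full 8-wide chunks plus a leftover tail.
    let mut chunks = xs.chunks_exact(8);
    let mut n = f32x8::splat(0.0);
    let mut mu = f32x8::splat(0.0);
    for chunk in chunks.by_ref() {
        // Run Welford’s update independently in each of the 8 lanes.
        let x = f32x8::from_slice(chunk);
        n += f32x8::splat(1.0);
        mu += (x - mu) / n;
    }
    // Collect the 8 per-lane (count, mean) pairs with the same merge
    // rule used in `par`, skipping lanes that saw no data.
    let (ns, mus) = (n.to_array(), mu.to_array());
    let (mut n_total, mut mu_total) = (0.0, 0.0);
    for lane in 0..8 {
        if ns[lane] > 0.0 {
            n_total += ns[lane];
            mu_total += (mus[lane] - mu_total) * ns[lane] / n_total;
        }
    }
    // Clean up: fold in the last (up to 7) leftover elements sequentially.
    for &x in chunks.remainder() {
        n_total += 1.0;
        mu_total += (x - mu_total) / n_total;
    }
    mu_total
}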
Do you even bench?
Once these four different implementations were ready, I reached for criterion to benchmark them.
I measured seq, par, seq_simd<N>, and par_simd<N> for N in \( \lbrace 8, 16, 32, 64 \rbrace \).
For each power of ten between ten (\(10^1\)) and a billion (\(10^9\)), I generated a vector of that many f32s drawn uniformly from the unit interval, and had criterion measure the time it took each version to calculate the mean of that vector.
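The harness looked roughly like this. It’s a sketch rather than the exact benchmark from the repo: the crate name mean_bench, the group name, and the rand calls (0.8-style API) are all assumptions, and only the seq case is spelled out:

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use mean_bench::seq; // hypothetical crate exposing the functions above
use rand::Rng;
use std::hint::black_box;

fn bench_means(c: &mut Criterion) {
    let mut rng = rand::thread_rng();
    let mut group = c.benchmark_group("mean");
    for exp in 1..=9u32 {
        // A vector of 10^exp f32s drawn uniformly from [0, 1).
        let len = 10usize.pow(exp);
        let xs: Vec<f32> = (0..len).map(|_| rng.gen()).collect();
        // The other three variants get the same treatment as `seq`.
        group.bench_with_input(BenchmarkId::new("seq", len), &xs, |b, xs| {
            b.iter(|| seq(black_box(xs.as_slice())))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_means);
criterion_main!(benches);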
All the usual caveats about this type of benchmarking still apply! I did the minimal due diligence (e.g., no other applications were running on my laptop), but your mileage may vary due to different hardware (in this case, an Apple M3 Max), data, computational load, etc.
Results
After running the benchmarks, I collated and plotted the results. Here’s the plot of #floats vs. nanoseconds (click on it for the full SVG); note the log-log scale.
To my eye, there are a couple of interesting things going on.
When dealing with a small number of floats, the sequential versions look faster; indeed, the relatively simple seq is the fastest at taking the mean of ten values.
SIMD starts being useful pretty quickly, however; all four of the seq_simd<N> versions are faster than seq by the time we’re working with a thousand floats.
Multi-CPU parallelism starts to be useful much later, but eventually wins out: par beats seq by the time we reach \(10^5\), and each par_simd<N> beats all four of the seq_simd<N>s by \(10^7\) and beyond.
Fin.
All in all, the usual wisdom applies: benchmark for your particular use case, realistic workloads don’t look like benchmarks run in a vacuum, etc. That said, this was a fun exercise!
The std::simd APIs are rather pleasant to use; I look forward to them being available on stable instead of only on nightly. Please check out the code for more!