Parallel Means

2025-07-25

Don’t be mean

I got curious about parallelism—in Rust, but the same ideas probably apply elsewhere—specifically, comparing multi-CPU parallelism via rayon with SIMD via portable_simd. I suspected that which one is fastest would depend heavily on several factors, like the amount and distribution of the data involved and the complexity of the desired computation.

To investigate, I wrote some code to calculate the mean of a collection of 32-bit floating-point numbers in four different ways.1 The first is a sequential version of Welford’s online algorithm:

pub fn seq(xs: &[f32]) -> f32 {
    let (mut n, mut mu) = (0.0, 0.0);
    for x in xs {
        // Welford's update: nudge the running mean toward x by 1/n of the gap.
        n += 1.0;
        mu += (x - mu) / n;
    }
    mu
}
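(Why this works: if \(\mu_{k-1}\) is the mean of the first \(k - 1\) values, then \(\mu_k = \frac{(k-1)\,\mu_{k-1} + x_k}{k} = \mu_{k-1} + \frac{x_k - \mu_{k-1}}{k}\), which is exactly the update in the loop, with n playing the role of \(k\).)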

Next, we have a rayon-ified version of Chan et al.’s parallel generalization. It first folds the same sequential update over each chunk in parallel, and then merges the per-chunk results.2

use rayon::prelude::*;

pub fn par(xs: &[f32]) -> f32 {
    xs.into_par_iter()
        .fold(
            || (0.0, 0.0),
            // The same sequential Welford update, run within each chunk.
            |(n_old, mu_old), x| {
                let n = n_old + 1.0;
                let mu = mu_old + (x - mu_old) / n;
                (n, mu)
            },
        )
        .reduce(
            || (0.0, 0.0),
            // Merge two partial (count, mean) pairs using Chan et al.'s formula.
            |(n_a, mu_a), (n_b, mu_b)| {
                let n = n_a + n_b;
                let mu = mu_a + (mu_b - mu_a) * n_b / n;
                (n, mu)
            },
        )
        .1
}
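Why the merge formula works: with \(n = n_a + n_b\), the combined mean satisfies \(n\,\mu = n_a \mu_a + n_b \mu_b\), which rearranges to \(\mu = \mu_a + (\mu_b - \mu_a)\,\frac{n_b}{n}\). In other words, start from the left chunk’s mean and pull it toward the right chunk’s mean in proportion to how much data the right chunk contributed.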

I also developed two SIMD versions of the above, seq_simd<N> and par_simd<N> for supported lane count N. These were relatively straightforward to write and test3 based on their non-SIMD counterparts. The only parts that required a little more thought were collecting the N intermediate values and then “cleaning up” at the end; if the number of values isn’t a multiple of the lane count N, we need to pick up the last (potentially up to N - 1) elements. I won’t include the SIMD code here, as I’ve already copy-pasted a fair amount of code as it is, but it’s all available on my GitHub.
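For a flavor of what that looks like, here’s a rough sketch of a lane-count-8 sequential version (a minimal illustration of the approach described above, not the exact code from the repo; it needs nightly Rust with the portable_simd feature):

#![feature(portable_simd)]
use std::simd::f32x8;

pub fn seq_simd8(xs: &[f32]) -> f32 {
    // Run Welford's update independently in each of the 8 lanes.
    let mut chunks = xs.chunks_exact(8);
    let (mut n, mut mu) = (f32x8::splat(0.0), f32x8::splat(0.0));
    for chunk in &mut chunks {
        let x = f32x8::from_slice(chunk);
        n += f32x8::splat(1.0);
        mu += (x - mu) / n;
    }
    // Collect the 8 per-lane (count, mean) pairs, merging them pairwise
    // with the same formula as par's reduce step.
    let (mut n_acc, mut mu_acc) = (0.0, 0.0);
    for (n_b, mu_b) in n.to_array().into_iter().zip(mu.to_array()) {
        let n_new = n_acc + n_b;
        if n_new > 0.0 {
            mu_acc += (mu_b - mu_acc) * n_b / n_new;
        }
        n_acc = n_new;
    }
    // Clean up: fold in the trailing elements (up to 7 of them).
    for x in chunks.remainder() {
        n_acc += 1.0;
        mu_acc += (x - mu_acc) / n_acc;
    }
    mu_acc
}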

Do you even bench?

Once these four implementations were ready, I reached for criterion to benchmark them. I measured seq, par, and seq_simd<N> and par_simd<N> for N in \( \lbrace 8, 16, 32, 64\rbrace \). For each power of ten from ten (\(10^1\)) to a billion (\(10^9\)), I generated a vector of that many f32s drawn uniformly from the unit interval, and had criterion measure the time it took each version to calculate that vector’s mean.
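The harness looked roughly like this (a simplified sketch rather than the exact benchmark code from the repo; it assumes rand 0.8 for generating the uniform values):

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use rand::Rng;

fn bench_means(c: &mut Criterion) {
    let mut rng = rand::thread_rng();
    let mut group = c.benchmark_group("mean");
    for exp in 1..=9u32 {
        let len = 10usize.pow(exp);
        // f32s drawn uniformly from the unit interval.
        let xs: Vec<f32> = (0..len).map(|_| rng.gen::<f32>()).collect();
        group.bench_with_input(BenchmarkId::new("seq", len), &xs, |b, xs| {
            b.iter(|| seq(xs))
        });
        // ...and likewise for par, seq_simd<N>, and par_simd<N>.
    }
    group.finish();
}

criterion_group!(benches, bench_means);
criterion_main!(benches);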

All the usual caveats about this type of benchmarking still apply! I did the minimal due diligence (no other applications were running on my laptop, for example), but your mileage may vary with different hardware (mine was an Apple M3 Max), data, computational load, etc.

Results

After running the benchmarks, I collated and plotted the results. Here’s the plot of #floats vs. nanoseconds (click on it for the full SVG); note the log-log scale.

To my eye, there are a couple of interesting things going on. For a small number of floats, the sequential versions are faster; indeed, seq, the simplest of the bunch, is the fastest at taking the mean of ten values. SIMD starts paying off pretty quickly, however: all four seq_simd<N> versions are faster than seq by the time we’re working with a thousand floats. Multi-CPU parallelism starts to be useful much later, but eventually wins out: par beats seq by the time we reach \(10^5\), and each par_simd<N> beats all four of the seq_simd<N>s at \(10^7\) and beyond.

Fin.

All in all, the usual wisdom applies: benchmark for your particular use-case, realistic workloads don’t look like benchmarks run in a vacuum, etc. That said, this was a fun exercise! The std::simd APIs are rather pleasant to use; I look forward to them being available on stable instead of only on nightly. Please check out the code for more!

  1. I used f32 instead of f64 so as to fit more values in a single SIMD vector. ↩︎

  2. I smell a semigroup! ↩︎

  3. With both manual unit tests and proptest, of course. ↩︎