
Benchmark results for baseline reducer
Summary
The baseline implementation of the by-band reductions nicely shows how the time per value increases with the number of bands to process.
Addendum (Dec 4, 2025)
The first attempt at speeding things up worked. This attempt
- fully unrolls the inner loop for the common depths of 1 to 4, and
- for bands > 4 uses a generic loop, effectively inlining the function calls of the baseline.
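As a rough illustration, the split between fully unrolled depths and a generic loop could look like the sketch below. All names, the interleaved data layout, and the sum operation are assumptions for illustration, not the actual code.

```c
#include <stddef.h>

/* Hypothetical sketch: reduce each pixel over its bands (depth
 * interleaved values per pixel). The inner band loop is fully
 * unrolled for depths 1-4; depth > 4 falls back to a generic loop
 * with the per-pixel work inlined instead of called. */
static void sum_bands(const double *in, double *out,
                      size_t n_pixels, size_t depth)
{
    switch (depth) {
    case 1:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[i];
        break;
    case 2:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[2 * i] + in[2 * i + 1];
        break;
    case 3:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[3 * i] + in[3 * i + 1] + in[3 * i + 2];
        break;
    case 4:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[4 * i] + in[4 * i + 1]
                   + in[4 * i + 2] + in[4 * i + 3];
        break;
    default: /* depth > 4: generic band loop, no per-pixel call */
        for (size_t i = 0; i < n_pixels; i++) {
            double acc = 0.0;
            for (size_t b = 0; b < depth; b++)
                acc += in[depth * i + b];
            out[i] = acc;
        }
        break;
    }
}
```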
For the fully unrolled forms the new code is pretty much always faster.
For the generic loop the boost is not as strong for the more complex operations (sums (*)); for some ops there is no boost at all.
This change is accepted.
(*) Additions are actually non-trivial because of my use of Kahan summation.
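For reference, Kahan (compensated) summation carries a correction term alongside the running sum, which is why each addition costs several floating-point operations instead of one. A minimal sketch (names are illustrative, not the real code):

```c
/* Kahan compensated summation: comp accumulates the rounding error
 * lost when a small value is added to a large running sum. */
typedef struct { double sum, comp; } kahan_t;

static void kahan_add(kahan_t *k, double v)
{
    double y = v - k->comp;      /* apply the stored correction      */
    double t = k->sum + y;       /* low bits of y are lost here      */
    k->comp = (t - k->sum) - y;  /* recover what was lost            */
    k->sum = t;
}
```

Note that aggressive float optimizations (e.g. `-ffast-math`) can eliminate the compensation term entirely, so the summation code must be compiled with standard-conforming float semantics.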
Ideas to look at for more boosts:
- Inline the code of the Kahan summation.
- Unroll the pixel loop as well to handle more than one pixel per iteration.
Addendum 2 (Dec 4, 2025)
After fixing an issue in the Kahan summation benchmarking, inlining
this functionality (via macros) provides a measurable boost to
all the reductions using it, i.e. sum, sumsquared, mean,
stddev, and variance, for both the baseline and special variants.
The special variants look to get a slightly larger boost, however, as the inlining allows optimization over the entire loop body instead of just inside the pixel function used by the baseline. Remember, these loop bodies have the band loop unrolled.
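A macro form of the Kahan update, sketched below under assumed names, is one way such inlining could look; it exposes the whole update to the optimizer at every call site instead of hiding it behind a function call.

```c
#include <stddef.h>

/* Hypothetical macro version of the Kahan update. Trailing
 * underscores avoid capturing names from the surrounding scope. */
#define KAHAN_ADD(sum, comp, v)               \
    do {                                      \
        double y_ = (v) - (comp);             \
        double t_ = (sum) + y_;               \
        (comp) = (t_ - (sum)) - y_;           \
        (sum) = t_;                           \
    } while (0)

/* Example use in a loop body, as the inlined reductions might. */
static double sum_kahan(const double *a, size_t n)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++)
        KAHAN_ADD(s, c, a[i]);
    return s;
}
```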
This change is accepted.
Now look at unrolling the pixel loop itself too.
Addendum 3 (Feb 7 2026)
Unrolling the pixel loop 4 times looks to be best for most of the operations and depths; the simpler operations benefit the most.
That said, for the high-complexity ops (stddev and variance) the
unrolled form is mostly worse, except for depth 1. For higher depths the
special variant is better.
The sumsquared op sits in a bit of a transition area: special is better
for depths 2 and 3, unrolled otherwise, i.e. depths 1 and 4+, with 4+
showing more variation based on vector length.
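A 4x pixel-loop unroll could be sketched as below, shown here for a depth-2 max reduction (function name, layout, and the choice of op are illustrative assumptions). Handling four independent pixels per iteration gives the compiler more instruction-level parallelism to schedule.

```c
#include <stddef.h>

/* Hypothetical 4x-unrolled pixel loop for a depth-2 max reduction
 * over interleaved band data. */
static void max_bands2_u4(const double *in, double *out, size_t n_pixels)
{
    size_t i = 0;
    for (; i + 4 <= n_pixels; i += 4) {
        out[i]     = in[2*i]     > in[2*i + 1] ? in[2*i]     : in[2*i + 1];
        out[i + 1] = in[2*i + 2] > in[2*i + 3] ? in[2*i + 2] : in[2*i + 3];
        out[i + 2] = in[2*i + 4] > in[2*i + 5] ? in[2*i + 4] : in[2*i + 5];
        out[i + 3] = in[2*i + 6] > in[2*i + 7] ? in[2*i + 6] : in[2*i + 7];
    }
    for (; i < n_pixels; i++)  /* tail: remaining 0-3 pixels */
        out[i] = in[2*i] > in[2*i + 1] ? in[2*i] : in[2*i + 1];
}
```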
Addendum 4 (Feb 7 2026)
The curves showed quite a few differences, even if the general shape looked somewhat the same.
Reworked the benchmarks in the hope of a better estimate of general performance. The previous code initialized the work arrays once, with random data, and then ran all combinations of methods, depths and vector lengths over that fixed set. As such, each full run of the benchmarks sampled only a single point in the input space.
The new code has each iteration for a combination re-initialize the work array. This causes the set of iterations for a combination to take many more samples of the input space, averaging them at the end.
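The reworked measurement loop might look like the following sketch, with the refill outside the timed region; all names and the timing mechanism are assumptions for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Refill the work array with fresh random data. */
static void fill_random(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = rand() / (double)RAND_MAX;
}

/* Time one method/depth/length combination: re-initialize the input
 * before every timed iteration so the run samples many points of the
 * input space, then average at the end. */
static double bench_one(void (*op)(const double *, double *, size_t),
                        double *in, double *out, size_t n, int iters)
{
    double total = 0.0;
    for (int it = 0; it < iters; it++) {
        fill_random(in, n);          /* new sample, not timed */
        clock_t t0 = clock();
        op(in, out, n);
        total += (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    return total / iters;            /* average over the samples */
}
```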
The implementation of the selector, reusing the existing special and unrolled functions and switching between them, is not really working out very well. In quite a few of the cases where special is called, the selector is slower. That said, in the new set of results these are all also closer than before.
On top of that, for the operations in question the 4-unrolled
loops look to be better only for depth == 1, which will not occur
in production, only in benchmarking: for that depth all the
reducers reduce to constant results and are handled by the
simplification rules at construction time.
Scrapping the selector as-is. Simply use special for the complex
operations.
| Op \ depth | 1 | 2 | 3 | 4 | 5+ |
|---|---|---|---|---|---|
| argmax | u4 | u4 | u4 | u4 | u4 |
| argmin | u4 | u4 | u4 | u4 | u4 |
| max | u4 | u4 | u4 | u4 | u4 |
| min | u4 | u4 | u4 | u4 | u4 |
| mean | u4 | u4 | u4 | u4 | u4 |
| sum | u4 | u4 | u4 | u4 | u4 |
| sumsquared | u4 | spec | spec | spec | spec |
| stddev | u4 | spec | spec | spec | spec |
| variance | u4 | spec | spec | spec | spec |
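The final per-op choice in the table reduces to a trivial predicate rather than a runtime selector; a sketch, with invented enum names:

```c
/* Hypothetical encoding of the final choice: the compensated-sum
 * heavy ops use the special (band-unrolled) variant above depth 1,
 * everything else uses the 4x pixel-unrolled form. */
typedef enum { OP_ARGMAX, OP_ARGMIN, OP_MAX, OP_MIN, OP_MEAN,
               OP_SUM, OP_SUMSQUARED, OP_STDDEV, OP_VARIANCE } op_t;

static int use_special(op_t op, int depth)
{
    switch (op) {
    case OP_SUMSQUARED:
    case OP_STDDEV:
    case OP_VARIANCE:
        /* depth 1 never reaches the reducers in production anyway */
        return depth > 1;
    default:
        return 0; /* u4 everywhere else */
    }
}
```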