AKTIVE

Documentation

Benchmark results for baseline reducer

Summary

The baseline implementation of the by-band reductions nicely shows how the time per value increases with the number of bands to process.

Addendum (Dec 4, 2025)

The first attempt at speeding things up worked. This attempt

  1. fully unrolls the inner loop for the common depths of 1 to 4, and
  2. for more than 4 bands uses a generic loop which effectively inlines the function calls of the baseline.
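The split described above could look roughly like this. A minimal sketch, assuming a max reduction over an interleaved band layout; `reduce_max` is a hypothetical name, not AKTIVE's actual API, and only depths 1 and 2 are shown unrolled.

```c
#include <stddef.h>

/* Sketch: per-band max over n pixels with `depth` interleaved bands.
 * Depths 1 and 2 show the fully unrolled bodies; other depths fall
 * through to the generic loop with the per-band work inlined. */
static void reduce_max(const double *src, size_t n, int depth, double *acc)
{
    int b;
    size_t i;

    for (b = 0; b < depth; b++) acc[b] = src[b];   /* seed from first pixel */

    switch (depth) {
    case 1:                                        /* fully unrolled */
        for (i = 1; i < n; i++)
            if (src[i] > acc[0]) acc[0] = src[i];
        break;
    case 2:
        for (i = 1; i < n; i++) {
            const double *p = src + i*2;
            if (p[0] > acc[0]) acc[0] = p[0];
            if (p[1] > acc[1]) acc[1] = p[1];
        }
        break;
    /* depths 3 and 4 would get analogous unrolled cases ... */
    default:                                       /* generic loop, work inlined */
        for (i = 1; i < n; i++) {
            const double *p = src + i * (size_t) depth;
            for (b = 0; b < depth; b++)
                if (p[b] > acc[b]) acc[b] = p[b];
        }
        break;
    }
}
```

The unrolled cases remove both the inner band loop and its bounds checks, which is why they win essentially everywhere.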

For the fully unrolled forms the new code is pretty much always faster.

For the generic loop the boost is not as strong, due to the overhead of the more complex operations (sums (*)); for some ops there is no boost at all.

This change is accepted.

(*) Additions are actually non-trivial because of my use of Kahan summation.
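For context, Kahan (compensated) summation carries a separate error term alongside the running sum, which is why a band "addition" is more than a single `+=`. A generic sketch, with hypothetical names:

```c
/* Compensated summation: c accumulates the rounding error lost by sum,
 * and feeds it back into the next addition. */
typedef struct { double sum, c; } kahan;

static void kahan_add(kahan *k, double x)
{
    double y = x - k->c;      /* apply the stored correction to the input  */
    double t = k->sum + y;    /* big + small: low-order bits of y are lost */
    k->c = (t - k->sum) - y;  /* recover exactly what was lost             */
    k->sum = t;
}
```

Note that the compensation only survives if the compiler is not allowed to reassociate floating-point math (i.e. no `-ffast-math`).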

Ideas to look at for more boosts are noted in the addendums below.

Addendum 2 (Dec 4, 2025)

After fixing an issue in the Kahan summation benchmarking, inlining this functionality (via macros) provides a measurable boost to all the reductions using it, i.e. sum, sumsquared, mean, stddev, and variance, for both the baseline and the special variants.

The special variants appear to get a slightly larger boost, however, as the inlining allows optimization over the entire loop body instead of just inside the pixel function used by the baseline. Remember, these loop bodies have the band loop unrolled.
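A macro form of the compensated add could look like the following; `KAHAN_ADD` and `sum_d2` are hypothetical names, shown with a depth-2 band-unrolled loop body so the compiler can optimize across the whole loop.

```c
#include <stddef.h>

/* Expand the compensated add directly into the caller's loop body. */
#define KAHAN_ADD(sum, c, x) do {          \
        double y_ = (x) - (c);             \
        double t_ = (sum) + y_;           \
        (c)   = (t_ - (sum)) - y_;         \
        (sum) = t_;                        \
    } while (0)

/* Per-band sums for depth 2: band loop unrolled, Kahan add inlined. */
static void sum_d2(const double *src, size_t n, double out[2])
{
    double s0 = 0, c0 = 0, s1 = 0, c1 = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        const double *p = src + i*2;
        KAHAN_ADD(s0, c0, p[0]);  /* band 0 */
        KAHAN_ADD(s1, c1, p[1]);  /* band 1 */
    }
    out[0] = s0;
    out[1] = s1;
}
```

With the function-call version the compensation state would have to round-trip through a pixel function per band; the macro keeps it in locals the optimizer can hold in registers.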

This change is accepted.

Now look at unrolling the pixel loop itself too.

Addendum 3 (Feb 7, 2026)

Unrolling the pixel loop four times looks to be best for most of the operations and depths. The simpler operations benefit the most.

That said, for the high-complexity ops (stddev and variance) the unrolled form is mostly worse, except at depth 1. For higher depths the special variant is better.

The sumsquared op sits in a transition area: special is better for depths 2 and 3, unrolled otherwise, i.e. depths 1 and 4+, with 4+ showing more variation based on vector length.
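The 4x pixel-loop unroll could be sketched as follows, shown for a depth-1 max; `max_u4` is a hypothetical name.

```c
#include <stddef.h>

/* Process four pixels per loop trip, then a scalar tail for the
 * 0..3 remaining pixels. Assumes n >= 1. */
static double max_u4(const double *src, size_t n)
{
    double acc = src[0];
    size_t i = 1;
    for (; i + 4 <= n; i += 4) {          /* main body: 4 pixels per trip */
        if (src[i]   > acc) acc = src[i];
        if (src[i+1] > acc) acc = src[i+1];
        if (src[i+2] > acc) acc = src[i+2];
        if (src[i+3] > acc) acc = src[i+3];
    }
    for (; i < n; i++)                    /* tail */
        if (src[i] > acc) acc = src[i];
    return acc;
}
```

For the complex ops the loop body carries more state (compensation terms, squared sums), so replicating it four times pays off less, matching the results above.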

Addendum 4 (Feb 7, 2026)

The curves showed quite a few differences, even if the general shape looked somewhat the same.

Reworked the benchmarks for a hopefully better estimation of general performance. The previous code initialized the work arrays once, with random data, and then ran all combinations of methods, depths, and vector lengths over that fixed set. As such, each full run of the benchmarks sampled only a single point in the input space.

The new code has each iteration of a combination re-initialize the work array. The set of iterations for a combination thus takes many more samples of the input space, averaging them at the end.
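The reworked harness could be sketched like this; `bench` and `sum_all` are hypothetical, and `clock()` stands in for whatever timer the real benchmarks use.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

typedef double (*reducer)(const double *src, size_t n);

/* Trivial example reducer, for demonstration only. */
static double sum_all(const double *src, size_t n)
{
    double s = 0;
    for (size_t i = 0; i < n; i++) s += src[i];
    return s;
}

/* Refill the work array before every timed iteration, so one run of a
 * method/depth/length combination samples many points of the input
 * space instead of one fixed point. */
static double bench(reducer f, double *work, size_t n, int iters)
{
    double total = 0;
    for (int it = 0; it < iters; it++) {
        for (size_t i = 0; i < n; i++)
            work[i] = rand() / (double) RAND_MAX;   /* fresh random input */
        clock_t t0 = clock();
        volatile double r = f(work, n);             /* keep the call alive */
        total += (double) (clock() - t0) / CLOCKS_PER_SEC;
        (void) r;
    }
    return total / iters;                           /* average the samples */
}
```

Keeping the refill outside the timed region matters; only the reduction itself should contribute to the average.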

The implementation of the selector, reusing the existing special and unrolled functions and switching between them, is not really working out very well. In quite a few of the cases where special is chosen the selector is slower. That said, in the new set of results these are all also closer than before.

Furthermore, for the operations in question the 4-unrolled loops look to be better only for depth == 1, a case which will not occur in production, only in benchmarking: at that depth all of these reducers produce constant results and are handled by the simplification rules at construction time.

Scrapping the selector as-is. Simply use special for the complex operations.

Op \ Depth    1    2     3     4     5+
argmax        u4   u4    u4    u4    u4
argmin        u4   u4    u4    u4    u4
max           u4   u4    u4    u4    u4
min           u4   u4    u4    u4    u4
mean          u4   u4    u4    u4    u4
sum           u4   u4    u4    u4    u4
sumsquared    u4   spec  spec  spec  spec
stddev        u4   spec  spec  spec  spec
variance      u4   spec  spec  spec  spec

(u4 = pixel loop unrolled 4 times, spec = special variant)
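The decision in the table could be encoded as a small dispatch helper; the enum and function names here are hypothetical.

```c
/* Final rule: the complex ops use the special variant for depth >= 2,
 * everything else the 4-unrolled pixel loop. Depth 1 never reaches
 * runtime for the complex ops; it is simplified away at construction
 * time, so the u4 entry for it is benchmark-only. */
typedef enum { OP_ARGMAX, OP_ARGMIN, OP_MAX, OP_MIN, OP_MEAN,
               OP_SUM, OP_SUMSQUARED, OP_STDDEV, OP_VARIANCE } op;
typedef enum { V_UNROLL4, V_SPECIAL } variant;

static variant choose_variant(op o, int depth)
{
    switch (o) {
    case OP_SUMSQUARED:
    case OP_STDDEV:
    case OP_VARIANCE:
        return depth >= 2 ? V_SPECIAL : V_UNROLL4;
    default:
        return V_UNROLL4;
    }
}
```

Unlike the scrapped runtime selector, this choice can be made once, at reducer construction time, so it adds nothing to the per-pixel cost.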

Plots

[One plot per operation: argmax, argmin, max, mean, min, stddev, sum, sumsquared, variance]