
Benchmark results for baseline reducer
Summary
The baseline implementation of the by-band reductions nicely shows how the time per value increases with the number of bands to process.
Addendum (Dec 4, 2025)
The first attempt at speeding things up worked. This attempt
- fully unrolls the inner loop for the common depths of 1 to 4, and
- for bands > 4 uses a generic loop, effectively inlining the function calls of the baseline.
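As a rough illustration, the split between fully unrolled depths and a generic loop could look like the sketch below. All names, the interleaved data layout, and the sum operation are assumptions for illustration, not the actual code.

```c
#include <stddef.h>

/* Hypothetical sketch: reduce each pixel over its bands (depth
 * interleaved values per pixel). The inner band loop is fully
 * unrolled for depths 1-4; depth > 4 falls back to a generic loop
 * with the per-pixel work inlined instead of called. */
static void sum_bands(const double *in, double *out,
                      size_t n_pixels, size_t depth)
{
    switch (depth) {
    case 1:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[i];
        break;
    case 2:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[2 * i] + in[2 * i + 1];
        break;
    case 3:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[3 * i] + in[3 * i + 1] + in[3 * i + 2];
        break;
    case 4:
        for (size_t i = 0; i < n_pixels; i++)
            out[i] = in[4 * i] + in[4 * i + 1]
                   + in[4 * i + 2] + in[4 * i + 3];
        break;
    default: /* depth > 4: generic band loop, no per-pixel call */
        for (size_t i = 0; i < n_pixels; i++) {
            double acc = 0.0;
            for (size_t b = 0; b < depth; b++)
                acc += in[depth * i + b];
            out[i] = acc;
        }
        break;
    }
}
```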
For the fully unrolled forms the new code is pretty much always faster.
For the generic loop the boost is not as strong for the more complex operations (sums (*)); for some ops there is no boost at all.
This change is accepted.
(*) Additions are actually non-trivial because of my use of Kahan summation.
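For reference, Kahan (compensated) summation carries a correction term alongside the running sum, which is why each addition costs several floating-point operations instead of one. A minimal sketch (names are illustrative, not the real code):

```c
/* Kahan compensated summation: comp accumulates the rounding error
 * lost when a small value is added to a large running sum. */
typedef struct { double sum, comp; } kahan_t;

static void kahan_add(kahan_t *k, double v)
{
    double y = v - k->comp;      /* apply the stored correction      */
    double t = k->sum + y;       /* low bits of y are lost here      */
    k->comp = (t - k->sum) - y;  /* recover what was lost            */
    k->sum = t;
}
```

Note that aggressive float optimizations (e.g. `-ffast-math`) can eliminate the compensation term entirely, so the summation code must be compiled with standard-conforming float semantics.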
Ideas to look at for more boosts:
- Inline the code of the Kahan summation.
- Unroll the pixel loop as well to handle more than one pixel per iteration.
Addendum 2 (Dec 4, 2025)
After fixing an issue in the Kahan summation benchmarking, inlining
this functionality (via macros) provides a measurable boost to
all the reductions using it, i.e. sum, sumsquared, mean,
stddev, and variance, for both the baseline and special variants.
The special variants look to get a slightly larger boost, however, as the inlining allows optimization over the entire loop body instead of just inside the pixel function used by the baseline. Remember, these loop bodies have the band loop unrolled.
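A macro form of the Kahan update, sketched below under assumed names, is one way such inlining could look; it exposes the whole update to the optimizer at every call site instead of hiding it behind a function call.

```c
#include <stddef.h>

/* Hypothetical macro version of the Kahan update. Trailing
 * underscores avoid capturing names from the surrounding scope. */
#define KAHAN_ADD(sum, comp, v)               \
    do {                                      \
        double y_ = (v) - (comp);             \
        double t_ = (sum) + y_;               \
        (comp) = (t_ - (sum)) - y_;           \
        (sum) = t_;                           \
    } while (0)

/* Example use in a loop body, as the inlined reductions might. */
static double sum_kahan(const double *a, size_t n)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++)
        KAHAN_ADD(s, c, a[i]);
    return s;
}
```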
This change is accepted.
Now look at unrolling the pixel loop itself too.
Addendum 3 (Feb 7 2026)
Unrolling the pixel loop 4 times looks to be best for most of the operations and depths; the simpler operations benefit the most.
That said, for the high-complexity ops (stddev and variance) the
unrolled form is mostly worse, except for depth 1. For higher depths the
special variant is better.
The sumsquared op sits in a bit of a transition area: special is better
for depths 2 and 3, unrolled otherwise, i.e. depths 1 and 4+, with 4+
showing more variation based on vector length.
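A 4x pixel-loop unroll could be sketched as below, shown here for a depth-2 max reduction (function name, layout, and the choice of op are illustrative assumptions). Handling four independent pixels per iteration gives the compiler more instruction-level parallelism to schedule.

```c
#include <stddef.h>

/* Hypothetical 4x-unrolled pixel loop for a depth-2 max reduction
 * over interleaved band data. */
static void max_bands2_u4(const double *in, double *out, size_t n_pixels)
{
    size_t i = 0;
    for (; i + 4 <= n_pixels; i += 4) {
        out[i]     = in[2*i]     > in[2*i + 1] ? in[2*i]     : in[2*i + 1];
        out[i + 1] = in[2*i + 2] > in[2*i + 3] ? in[2*i + 2] : in[2*i + 3];
        out[i + 2] = in[2*i + 4] > in[2*i + 5] ? in[2*i + 4] : in[2*i + 5];
        out[i + 3] = in[2*i + 6] > in[2*i + 7] ? in[2*i + 6] : in[2*i + 7];
    }
    for (; i < n_pixels; i++)  /* tail: remaining 0-3 pixels */
        out[i] = in[2*i] > in[2*i + 1] ? in[2*i] : in[2*i + 1];
}
```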
Addendum 4 (Feb 7 2026)
The curves showed quite a few differences, even if the general shape looked somewhat the same.
Reworked the benchmarks in the hope of a better estimate of general performance. The previous code initialized the work arrays once, with random data, and then ran all combinations of methods, depths and vector lengths over that fixed set. As such, each full run of the benchmarks sampled only a single point in the input space.
The new code has each iteration for a combination re-initialize the work array. This causes the set of iterations for a combination to take many more samples of the input space, averaging them at the end.
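The reworked measurement loop might look like the following sketch, with the refill outside the timed region; all names and the timing mechanism are assumptions for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Refill the work array with fresh random data. */
static void fill_random(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = rand() / (double)RAND_MAX;
}

/* Time one method/depth/length combination: re-initialize the input
 * before every timed iteration so the run samples many points of the
 * input space, then average at the end. */
static double bench_one(void (*op)(const double *, double *, size_t),
                        double *in, double *out, size_t n, int iters)
{
    double total = 0.0;
    for (int it = 0; it < iters; it++) {
        fill_random(in, n);          /* new sample, not timed */
        clock_t t0 = clock();
        op(in, out, n);
        total += (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    return total / iters;            /* average over the samples */
}
```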
The implementation of the selector, reusing the existing special and unrolled functions and switching between them, is not really working out very well. In quite a few of the cases where special is called, the selector is slower. That said, in the new set of results these are all also closer than before.
On top of that, for the operations in question the 4-unrolled
loops look to be better only for depth == 1, which will not occur
in production, only in benchmarking: for that depth all the
reducers reduce to constant results and are handled by the
simplification rules at construction time.
Scrapping the selector as-is. Simply use special for the complex
operations.
| Op \ depth | 1 | 2 | 3 | 4 | 5+ |
|---|---|---|---|---|---|
| argmax | u4 | u4 | u4 | u4 | u4 |
| argmin | u4 | u4 | u4 | u4 | u4 |
| max | u4 | u4 | u4 | u4 | u4 |
| min | u4 | u4 | u4 | u4 | u4 |
| mean | u4 | u4 | u4 | u4 | u4 |
| sum | u4 | u4 | u4 | u4 | u4 |
| sumsquared | u4 | spec | spec | spec | spec |
| stddev | u4 | spec | spec | spec | spec |
| variance | u4 | spec | spec | spec | spec |
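The final per-op choice in the table reduces to a trivial predicate rather than a runtime selector; a sketch, with invented enum names:

```c
/* Hypothetical encoding of the final choice: the compensated-sum
 * heavy ops use the special (band-unrolled) variant above depth 1,
 * everything else uses the 4x pixel-unrolled form. */
typedef enum { OP_ARGMAX, OP_ARGMIN, OP_MAX, OP_MIN, OP_MEAN,
               OP_SUM, OP_SUMSQUARED, OP_STDDEV, OP_VARIANCE } op_t;

static int use_special(op_t op, int depth)
{
    switch (op) {
    case OP_SUMSQUARED:
    case OP_STDDEV:
    case OP_VARIANCE:
        /* depth 1 never reaches the reducers in production anyway */
        return depth > 1;
    default:
        return 0; /* u4 everywhere else */
    }
}
```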