AKTIVE: Documentation

Benchmark results comparing the 4-unrolled super-scalar loops against highway-based simd loops



Summary (Dec 1, 2025)

First, the highway-based loops generate correct results.

In general, they also perform worse than the 4-unrolled super-scalar loops.

It is suspected that, because we use double as the base type for pixel values, we simply do not have enough vector lanes (two, IIRC) to reach the speed of the super-scalar ops. I.e. the simd loop may be the equivalent of a 2-unrolled super-scalar loop, at best.
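The lane count follows from simple arithmetic: register width divided by element width. A minimal sketch, assuming a hypothetical helper `lanes_per_vector`; the register widths passed in are assumptions for illustration, not measurements of the CPU used here:

```c
#include <stddef.h>

/* Number of elements of a given size that fit into one vector register.
 * vector_bits is an assumed register width, not a measured value. */
size_t lanes_per_vector(size_t vector_bits, size_t element_bytes)
{
    return vector_bits / (element_bytes * 8);
}
```

Under these assumptions a 128-bit register holds two 8-byte doubles, which would match the "two lanes" recollection above, while 4-byte floats would get four lanes in the same register.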

A 2-unrolled simd loop on the other hand might be equivalent to 4-unrolled super-scalar, and a 4-unrolled simd loop might be faster. These are the next experiments to run.
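For readers unfamiliar with the term, here is a minimal sketch of what a 4-unrolled super-scalar loop looks like next to a plain scalar loop. The operation (adding a constant to every value) and the function names are hypothetical and chosen for illustration only; they are not taken from the AKTIVE sources.

```c
#include <stddef.h>

/* Hypothetical example operation: dst[i] = src[i] + offset.
 * Plain scalar loop, one element per iteration. */
void add_scalar_plain(double *dst, const double *src, size_t n, double offset)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] + offset;
    }
}

/* 4-unrolled variant: four independent additions per iteration, which a
 * super-scalar CPU may be able to issue in parallel. */
void add_scalar_unrolled4(double *dst, const double *src, size_t n, double offset)
{
    size_t i = 0;

    /* Main body: handle blocks of four elements. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0] + offset;
        dst[i + 1] = src[i + 1] + offset;
        dst[i + 2] = src[i + 2] + offset;
        dst[i + 3] = src[i + 3] + offset;
    }

    /* Tail: handle the remaining zero to three elements. */
    for (; i < n; i++) {
        dst[i] = src[i] + offset;
    }
}
```

Both functions compute the same result; only the amount of work per loop iteration differs. A simd loop with two double lanes would process two elements per vector instruction, which is why it is compared to a 2-unrolled scalar loop above.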

Note: Suspiciously, for a number of operations the speed curve becomes completely flat from 1000 values upward.

Addendum (Dec 2, 2025)

The experiment with unrolling the simd-based loops did not pan out. Higher levels of unrolling make the simd-based loops slower.

Having watched the CPU temperatures during the run (up to 88 °C), I strongly suspect that the normal simd loop already runs into thermal throttling, and that unrolling simply drives the CPU deeper into it, undoing any performance boost we might otherwise have gained.

This throttling may also be the reason why a number of scalar loops lose speed for larger vectors, i.e. the CPU is still throttled despite not using the vector units at that moment.

Checking this requires changing the order of the benchmarks, i.e. instead of having the vector length as the outer loop, make it the inner loop, thus keeping scalar and simd commands separate instead of interleaved.
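The reordering described above can be sketched as two ways of generating the benchmark schedule. This is a minimal illustration, assuming a hypothetical `bench_step` record and integer implementation ids (for example 0 for scalar, 1 for simd); it is not the actual benchmark driver.

```c
#include <stddef.h>

/* One benchmark run: which implementation, on how many values.
 * Implementation ids are hypothetical, e.g. 0 = scalar, 1 = simd. */
typedef struct {
    int    impl;
    size_t length;
} bench_step;

/* Original order: vector length is the outer loop, so scalar and simd
 * runs alternate (interleaved) for every length. */
size_t schedule_interleaved(bench_step *out, const size_t *lengths,
                            size_t nlen, int nimpl)
{
    size_t k = 0;
    for (size_t l = 0; l < nlen; l++) {
        for (int m = 0; m < nimpl; m++) {
            out[k].impl   = m;
            out[k].length = lengths[l];
            k++;
        }
    }
    return k;
}

/* Changed order: implementation is the outer loop, vector length the
 * inner loop, so all runs of one implementation happen back to back. */
size_t schedule_grouped(bench_step *out, const size_t *lengths,
                        size_t nlen, int nimpl)
{
    size_t k = 0;
    for (int m = 0; m < nimpl; m++) {
        for (size_t l = 0; l < nlen; l++) {
            out[k].impl   = m;
            out[k].length = lengths[l];
            k++;
        }
    }
    return k;
}
```

Both functions produce the same set of runs; only the order differs. With the grouped schedule, any throttling caused by one implementation can no longer carry over into each neighbouring run of the other.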

Addendum II (Dec 2, 2025)

Separating the scalar and simd-based loops from each other did not result in any material changes in the results.

While watching the run it was found that it was mostly the scalar loops which drove the temperature, not the simd-based ones. This did not matter, however, for the relative performance or the shapes of the graphs.

Final comments

Due to the decision, made early in the life of AKTIVE, to use double as the one and only type for pixel values, it looks like SIMD instructions are not a feasible way of boosting processing performance with the CPU I have.

While this may change with future CPUs and larger vector units, right now the best way of boosting performance appears to be finding and unrolling the critical core loops.

Plots