Exercises

Putting it all
into practice

Two exercises from the CS-E4580 course — showing how the techniques from Chapters 1–4 combine into real optimised solutions.

The problem

Given a matrix of ny rows × nx columns, compute the Pearson correlation between every pair of rows (i, j) where j ≤ i. Store results in a lower-triangular output matrix.

// Interface
void correlate(int ny, int nx, const float* data, float* result);
// result[i + j*ny] = Pearson correlation(row i, row j), j ≤ i

Optimisations applied

Normalise rows — subtract mean, divide by L2 norm. Reduces correlation to a simple dot product: result[i,j] = dot(norm[i], norm[j])

AVX-512 (float16_t) — 16 floats per ZMM register. 2× throughput vs AVX2.

6×16 kernel — compute 6 i-rows × 16 j-rows simultaneously. Each j-row chunk reused 6× across i, each i-row reused 16× across j. 96 independent fmadd accumulators.

hsum via intrinsics — 4-step tree reduction instead of 7-deep scalar chain.

Z-order tile traversal — both norm[i] and norm[j] rows stay in L2 cache across revisits.

Dynamic OpenMP — pre-enumerate all tile pairs, parallelise over all of them with schedule(dynamic,1).

Key inner kernel

// 6×16 micro-kernel: 96 independent fmadd chains
for (int k = 0; k < nx16; k += 16) {
    // Load 6 i-rows (reused 16× each)
    const __m512 a0=_mm512_loadu_ps(ri[0]+k), a1=_mm512_loadu_ps(ri[1]+k);
    // ... a2..a5 ...

    // Load 16 j-rows (reused 6× each)
    const __m512 b0=_mm512_loadu_ps(rj[0]+k);
    // ... b1..b15 ...

    // 96 fmadd: acc[p][q] += a[p] * b[q]
    acc[0][0] = _mm512_fmadd_ps(a0, b0, acc[0][0]);
    acc[0][1] = _mm512_fmadd_ps(a0, b1, acc[0][1]);
    // ... 94 more ...
}

The problem

Find the rectangle (y0, x0, y1, x1) in an image that maximises the score: inner_sum²/inner_pixels + outer_sum²/outer_pixels. This is O(ny² × nx²) naively — needs aggressive optimisation.

Key techniques used

2D prefix sums — precompute so any rectangle sum is O(1): sum(y0,x0,y1,x1) = S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]

AVX-512 inner loop — vectorise over x1 with 16-wide loads. Fixed (y0,y1,x0) outer loops, SIMD over x1.

Newton-Raphson reciprocal — rcp_nr replaces 1/x division with faster approximate reciprocal + one refinement step. ~8× faster than vdivps.

OpenMP parallel — outer y0 loop parallelised. Each thread maintains local best, merged via #pragma omp critical.

The reciprocal trick

// Division is slow (~11-14 cycles). Reciprocal + NR = 2 cycles.
static inline __m512 rcp_nr(const __m512 x) {
    const __m512 r = _mm512_rcp14_ps(x);  // ~14-bit approx
    // Newton-Raphson step: r = r * (2 - x*r)
    return _mm512_mul_ps(r,
        _mm512_fnmadd_ps(x, r, _mm512_set1_ps(2.0f)));
    // Result: ~28 bits accuracy (close to full float precision)
}

← Chapter 4 ↑ Home

Putting it allinto practice

Putting it all
into practice