Two exercises from the CS-E4580 course — showing how the techniques from Chapters 1–4 combine into real optimised solutions.
ny rows × nx columns, compute the Pearson correlation between every pair of rows (i, j) where j ≤ i. Store results in a lower-triangular output matrix.
// Interface void correlate(int ny, int nx, const float* data, float* result); // result[i + j*ny] = Pearson correlation(row i, row j), j ≤ i
result[i,j] = dot(norm[i], norm[j])schedule(dynamic,1).// 6×16 micro-kernel: 96 independent fmadd chains for (int k = 0; k < nx16; k += 16) { // Load 6 i-rows (reused 16× each) const __m512 a0=_mm512_loadu_ps(ri[0]+k), a1=_mm512_loadu_ps(ri[1]+k); // ... a2..a5 ... // Load 16 j-rows (reused 6× each) const __m512 b0=_mm512_loadu_ps(rj[0]+k); // ... b1..b15 ... // 96 fmadd: acc[p][q] += a[p] * b[q] acc[0][0] = _mm512_fmadd_ps(a0, b0, acc[0][0]); acc[0][1] = _mm512_fmadd_ps(a0, b1, acc[0][1]); // ... 94 more ... }
inner_sum²/inner_pixels + outer_sum²/outer_pixels. This is O(ny² × nx²) naively — needs aggressive optimisation.
sum(y0,x0,y1,x1) = S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]rcp_nr replaces 1/x division with faster approximate reciprocal + one refinement step. ~8× faster than vdivps.#pragma omp critical.// Division is slow (~11-14 cycles). Reciprocal + NR = 2 cycles. static inline __m512 rcp_nr(const __m512 x) { const __m512 r = _mm512_rcp14_ps(x); // ~14-bit approx // Newton-Raphson step: r = r * (2 - x*r) return _mm512_mul_ps(r, _mm512_fnmadd_ps(x, r, _mm512_set1_ps(2.0f))); // Result: ~28 bits accuracy (close to full float precision) }