Chapter 3

Multithreading &
OpenMP

OpenMP turns sequential loops into parallel work with a single pragma. But correctness is your responsibility — the compiler won't warn you about race conditions.

The basics
How OpenMP splits work across threads
#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    c(i);  // runs in parallel across all cores
}
With 4 cores, OpenMP automatically splits the 10 iterations:
Thread 1
c(0)
c(1)
c(2)
Thread 2
c(3)
c(4)
c(5)
Thread 3
c(6)
c(7)
Thread 4
c(8)
c(9)
OpenMP creates threads, assigns work, and waits for all to finish before continuing.
Scheduling strategies
schedule(static) — default
// Equal contiguous chunks
// Best cache locality
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) { ... }
schedule(dynamic) — unequal work
// Grab 1 unit at a time
// Best load balancing
#pragma omp parallel for schedule(dynamic,1)
for (int b = 0; b < blocks; ++b) { ... }
Use dynamic for triangular loops (like correlate's lower-triangular tiles) where diagonal blocks have ~half the work of off-diagonal blocks.
Safety rules
When is a loop safe to parallelise?
Two rules — both must hold for every pair of iterations X and Y:
No shared element read by X and written by Y
No shared element written by both X and Y
✗ UNSAFE — read-write conflict
x[i+1] = f(x[i]); // i reads what i-1 wrote
✗ UNSAFE — write-write conflict
y[0] = f(x[i]); // all write same y[0]
✓ SAFE — disjoint writes
y[i] = f(x[i]); // each i writes different y[i]
Reduction — safe accumulation
Race condition: #pragma omp parallel for + sum += arr[i] = data race. All threads write to the same sum simultaneously → wrong answer, silently.
// WRONG — race condition on sum
float sum = 0;
#pragma omp parallel for
for (int i = 0; i < n; ++i) sum += arr[i]; // ✗

// CORRECT — reduction clause
float sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) sum += arr[i]; // ✓
OpenMP gives each thread a private copy of sum, then combines them at the end.
Memory model
False sharing — the hidden performance killer
Two threads write to different variables that share the same 64-byte cache line. Every write by one invalidates the other's cached copy → cache bouncing → 10-50× slowdown with no logical error.
// BAD: partial_sum[0] and partial_sum[1] share a cache line
float partial_sum[nthreads]; // adjacent floats = same cache line
#pragma omp parallel for
for (int t = 0; t < nthreads; ++t)
    partial_sum[t] += work(t); // threads stomp each other's cache

// GOOD: pad to 64-byte boundary
struct alignas(64) Padded { float val; char pad[60]; };
Padded partial_sum[nthreads]; // each on its own cache line ✓
Common OpenMP patterns
// Pattern 1: parallel for (most common)
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) result[i] = compute(i);

// Pattern 2: parallel for + reduction
float total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < n; ++i) total += arr[i];

// Pattern 3: parallel region + critical section
#pragma omp parallel
{
    Result local = compute_local();
    #pragma omp critical
    { if (local.score > best.score) best = local; }
}

// Pattern 4: pre-enumerate work items for dynamic balance
std::vector<std::pair<int,int>> blocks; // all (i,j) tile pairs
#pragma omp parallel for schedule(dynamic,1)
for (int b = 0; b < blocks.size(); ++b)
    process_tile(blocks[b]);  // used in correlate exercise
← Chapter 2 Chapter 4: GPU →