Chapter 3

Multithreading &
OpenMP

OpenMP turns sequential loops into parallel work with a single pragma. But correctness is your responsibility — the compiler won't warn you about race conditions.

The basics

How OpenMP splits work across threads

#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    c(i);  // runs in parallel across all cores
}

With 4 cores, OpenMP automatically splits the 10 iterations:

Thread 1

c(0)

c(1)

c(2)

Thread 2

c(3)

c(4)

c(5)

Thread 3

c(6)

c(7)

Thread 4

c(8)

c(9)

OpenMP creates threads, assigns work, and waits for all to finish before continuing.

Scheduling strategies

schedule(static) — default

// Equal contiguous chunks
// Best cache locality
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) { ... }

schedule(dynamic) — unequal work

// Grab 1 unit at a time
// Best load balancing
#pragma omp parallel for schedule(dynamic,1)
for (int b = 0; b < blocks; ++b) { ... }

Use dynamic for triangular loops (like correlate's lower-triangular tiles) where diagonal blocks have ~half the work of off-diagonal blocks.

Safety rules

When is a loop safe to parallelise?

Two rules — both must hold for every pair of iterations X and Y:

No shared element read by X and written by Y

No shared element written by both X and Y

✗ UNSAFE — read-write conflict

x[i+1] = f(x[i]); // i reads what i-1 wrote

✗ UNSAFE — write-write conflict

y[0] = f(x[i]); // all write same y[0]

✓ SAFE — disjoint writes

y[i] = f(x[i]); // each i writes different y[i]

Reduction — safe accumulation

Race condition: #pragma omp parallel for + sum += arr[i] = data race. All threads write to the same sum simultaneously → wrong answer, silently.

// WRONG — race condition on sum
float sum = 0;
#pragma omp parallel for
for (int i = 0; i < n; ++i) sum += arr[i]; // ✗

// CORRECT — reduction clause
float sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) sum += arr[i]; // ✓

OpenMP gives each thread a private copy of sum, then combines them at the end.

Memory model

False sharing — the hidden performance killer

Two threads write to different variables that share the same 64-byte cache line. Every write by one invalidates the other's cached copy → cache bouncing → 10-50× slowdown with no logical error.

// BAD: partial_sum[0] and partial_sum[1] share a cache line
float partial_sum[nthreads]; // adjacent floats = same cache line
#pragma omp parallel for
for (int t = 0; t < nthreads; ++t)
    partial_sum[t] += work(t); // threads stomp each other's cache

// GOOD: pad to 64-byte boundary
struct alignas(64) Padded { float val; char pad[60]; };
Padded partial_sum[nthreads]; // each on its own cache line ✓

Common OpenMP patterns

// Pattern 1: parallel for (most common)
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) result[i] = compute(i);

// Pattern 2: parallel for + reduction
float total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < n; ++i) total += arr[i];

// Pattern 3: parallel region + critical section
#pragma omp parallel
{
    Result local = compute_local();
    #pragma omp critical
    { if (local.score > best.score) best = local; }
}

// Pattern 4: pre-enumerate work items for dynamic balance
std::vector<std::pair<int,int>> blocks; // all (i,j) tile pairs
#pragma omp parallel for schedule(dynamic,1)
for (int b = 0; b < blocks.size(); ++b)
    process_tile(blocks[b]);  // used in correlate exercise

← Chapter 2 Chapter 4: GPU →

Multithreading &OpenMP

Multithreading &
OpenMP