CS-E4580 · Aalto University · Interactive Platform

Programming
Parallel Computers

From a naive baseline using 0.6% of CPU capacity to 93% of theoretical maximum — an interactive deep dive into modern parallel computing.

~140× Total speedup

93% Theoretical max

99s → 0.7s runtime

4 Chapters covered


V0 → V7 Performance Journey (multi-core, ops/sec)

V0  baseline           5.1B
V1  transpose          6.5B
V2  ILP                21B
V3  SIMD               31B
V4  data reuse         118B
V5  permutations       175B
V6  prefetch           190B
V7  Z-order + slicing  197B

Key techniques

Everything you need to squeeze
maximum performance from hardware.

⚡

SIMD / AVX-512

Process 16 floats per instruction using 512-bit ZMM registers
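
As a rough illustration of 16-wide processing, the sketch below adds two float arrays 16 elements at a time. It assumes GCC or Clang and uses their generic vector extension rather than AVX-512 intrinsics, so it compiles on any target; on a machine built with AVX-512 enabled the compiler can map each vector to one ZMM register. The names `f32x16` and `add_arrays` are made up for this example.

```cpp
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: 16 floats packed into 64 bytes (512 bits).
typedef float f32x16 __attribute__((vector_size(16 * sizeof(float))));

// out[i] = a[i] + b[i], processing 16 floats per vector operation.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        f32x16 va, vb;
        // memcpy avoids alignment requirements on the input pointers.
        std::memcpy(&va, a + i, sizeof va);
        std::memcpy(&vb, b + i, sizeof vb);
        f32x16 vc = va + vb;  // one element-wise add over 16 lanes
        std::memcpy(out + i, &vc, sizeof vc);
    }
    // Scalar tail for the last n % 16 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Whether this actually becomes a single 512-bit instruction depends on the compiler flags and the CPU; without AVX-512 the compiler splits each vector into narrower operations.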

🔀

ILP

Multiple independent accumulators to saturate execution ports
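
A minimal sketch of the multiple-accumulator idea, using a plain array sum as a stand-in workload (the function name `sum_ilp` and the choice of four accumulators are assumptions for illustration). Each accumulator forms its own dependency chain, so the additions in one loop iteration do not have to wait for each other.

```cpp
#include <cassert>
#include <cstddef>

// Sum an array using four independent accumulators.
float sum_ilp(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // These four additions have no dependencies on each other.
        acc0 += data[i + 0];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }
    float total = (acc0 + acc1) + (acc2 + acc3);
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Note that this changes the order of the floating-point additions, so the result can differ slightly from a single-accumulator loop when the values are not exactly representable.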

🧵

OpenMP

Parallelise loops across CPU cores with a single pragma
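
The single-pragma claim can be sketched as follows, assuming a row-major matrix and a hypothetical helper `row_sums`. Each row is handled by one loop iteration, and the iterations write to different output elements, so they can safely run on different cores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum each row of a row-major rows x cols matrix.
std::vector<float> row_sums(const std::vector<float>& m,
                            std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<float> out(rows, 0.0f);
    const long n = static_cast<long>(rows);
    // Rows are distributed across threads; each thread writes only out[r].
    #pragma omp parallel for
    for (long r = 0; r < n; ++r) {
        const std::size_t base = static_cast<std::size_t>(r) * cols;
        float s = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            s += m[base + c];
        }
        out[static_cast<std::size_t>(r)] = s;
    }
    return out;
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`; without that flag it is ignored and the loop runs serially with the same result.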

🗄️

Cache blocking

Tile computations to keep working set in L1/L2 cache
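
One way to picture tiling is a blocked matrix multiplication. The sketch below is an assumed example workload, not taken from the course material; `matmul_tiled` and its `tile` parameter are illustrative names. The three outer loops pick a tile, and the three inner loops work entirely inside it, so the same small blocks of the inputs are reused while they are still in cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices, computed tile by tile.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n, std::size_t tile) {
    assert(tile > 0);
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                // Work inside one tile; std::min handles ragged edges.
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

A suitable tile size depends on the cache sizes of the target CPU and is best found by measurement.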

↔️

Transpose trick

Convert column access to row access — eliminate cache misses
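
A small sketch of the idea, again using matrix multiplication as an assumed example (the name `matmul_transposed` is hypothetical). In the naive version the inner loop walks down a column of B, touching a new cache line on almost every step for large matrices. Transposing B once up front turns that into a walk along a row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices, using a transposed copy of B.
std::vector<float> matmul_transposed(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    // bt[j][i] = b[i][j]: pay for the column access once, here.
    std::vector<float> bt(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            bt[j * n + i] = b[i * n + j];
        }
    }
    std::vector<float> c(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float s = 0.0f;
            // Both a and bt are now read along consecutive addresses.
            for (std::size_t k = 0; k < n; ++k) {
                s += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = s;
        }
    }
    return c;
}
```

The transpose costs O(n²) extra work and memory, which is small next to the O(n³) multiplication it speeds up.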

🔲

Compute NxM output blocks to maximise data reuse in registers

🌀

Z-order curve

Traverse tiles in Morton order to maximise cache reuse
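
Morton order can be computed by interleaving the bits of the tile coordinates. The sketch below assumes 16-bit coordinates and uses the illustrative names `spread_bits` and `morton_index`; sorting tiles by this index gives a Z-shaped traversal in which consecutive tiles tend to be close to each other in both dimensions.

```cpp
#include <cstdint>

// Spread the low 16 bits of v so that a zero bit separates each of them.
std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Interleave x (even bit positions) and y (odd bit positions).
std::uint32_t morton_index(std::uint32_t x, std::uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

For example, the four tiles of a 2×2 grid are visited in the order (0,0), (1,0), (0,1), (1,1), which correspond to indices 0 through 3.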

🚀

Prefetching

Software hints to hide 200-cycle memory latency
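
A hedged sketch of a software prefetch hint, assuming GCC or Clang, where `__builtin_prefetch` is available. The workload is an indexed gather, chosen here because its access pattern is hard for hardware prefetchers to predict; `gather_sum` and the `distance` parameter are illustrative names, and the code assumes every entry of `idx` is a valid index into `values`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum values[idx[i]] over all i, prefetching `distance` iterations ahead.
float gather_sum(const std::vector<float>& values,
                 const std::vector<std::size_t>& idx,
                 std::size_t distance) {
    float total = 0.0f;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // Ask the CPU to start loading an element needed later.
        if (i + distance < n) {
            __builtin_prefetch(&values[idx[i + distance]]);
        }
#endif
        assert(idx[i] < values.size());
        total += values[idx[i]];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result. The best distance depends on how long one iteration takes relative to memory latency, so it is usually tuned by measurement.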

🎮

CUDA

GPU threads, warps, shared memory, and coalesced access

Four chapters.
One complete picture.
