CS-E4580 · Aalto University · Interactive Platform

Programming
Parallel Computers

From a naive baseline using 0.6% of CPU capacity to 93% of theoretical maximum — an interactive deep dive into modern parallel computing.

Total speedup: 99 s → 0.7 s runtime
93% of theoretical max
4 chapters covered
V0 → V7 Performance Journey (multi-core, ops/sec)
V0 · baseline · 5.1B
V1 · transpose · 6.5B
V2 · ILP · 21B
V3 · SIMD · 31B
V4 · data reuse · 118B
V5 · permutations · 175B
V6 · prefetch · 190B
V7 · Z-order + slicing · 197B

Four chapters.
One complete picture.

CH.01
Role of Parallelism
Moore's Law, the clock speed wall, and why latency vs throughput changes everything about how we write code.
Moore's Law · Latency · Throughput
CH.02
Case Study: V0 → V7
One problem, eight versions. From 0.6% to 93% of theoretical CPU maximum using OpenMP, SIMD, ILP, and cache engineering.
SIMD · OpenMP · AVX-512
CH.03
Multithreading
OpenMP deep dive — memory model, false sharing, scheduling strategies, atomic operations, and race conditions.
OpenMP · Memory model · Atomics
CH.04
GPU Programming
CUDA V0→V4: coalesced memory access, shared memory tiling, vectorised loads, occupancy, and profiling with Nsight.
CUDA · Warps · Coalescing
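To make three of the Chapter 2 ideas concrete (the transpose trick, ILP via independent accumulators, and an OpenMP pragma), here is a minimal C++ sketch. It uses a plain dense matrix product as a stand-in workload, which is an assumption for illustration and not necessarily the problem the course optimises. The function name `matmul_sketch` and the choice of four accumulators are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: n x n product of two row-major matrices.
// A generic stand-in workload, not the course's case-study code.
std::vector<float> matmul_sketch(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);

    // Transpose trick: copy b so that each of its columns becomes a
    // contiguous row, turning strided reads into sequential ones.
    std::vector<float> bt(n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];

    std::vector<float> c(n * n, 0.0f);

    // OpenMP: rows of c are independent, so the outer loop can be split
    // across cores. Without -fopenmp the pragma is ignored and the loop
    // runs serially with the same result.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i) {
        const float* row_a = a.data() + static_cast<std::size_t>(i) * n;
        for (std::size_t j = 0; j < n; ++j) {
            const float* row_b = bt.data() + j * n;

            // ILP: four independent accumulators, so consecutive
            // additions do not have to wait for one another.
            float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
            std::size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                acc0 += row_a[k + 0] * row_b[k + 0];
                acc1 += row_a[k + 1] * row_b[k + 1];
                acc2 += row_a[k + 2] * row_b[k + 2];
                acc3 += row_a[k + 3] * row_b[k + 3];
            }
            float acc = (acc0 + acc1) + (acc2 + acc3);

            // Tail loop for n not divisible by 4.
            for (; k < n; ++k)
                acc += row_a[k] * row_b[k];

            c[static_cast<std::size_t>(i) * n + j] = acc;
        }
    }
    return c;
}
```

Calling `matmul_sketch(a, b, n)` returns the product as a new row-major vector. Splitting the sum across several accumulators changes the floating-point summation order, so results can differ from a single-accumulator loop in the last bits. Building with something like `g++ -O3 -fopenmp` enables the parallel loop.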

Everything you need to squeeze
maximum performance from hardware.

SIMD / AVX-512: process 16 floats per instruction using 512-bit ZMM registers
ILP: use multiple independent accumulators to saturate execution ports
OpenMP: parallelise loops across CPU cores with a single pragma
Cache blocking: tile computations to keep the working set in L1/L2 cache
Transpose trick: convert column access to row access to cut cache misses
Register tiling: compute N×M output blocks to maximise data reuse in registers
Z-order curve: traverse tiles in Morton order to maximise cache reuse
Prefetching: software hints to hide memory latency of roughly 200 cycles
CUDA: GPU threads, warps, shared memory, and coalesced access
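As an illustration of the Z-order idea listed above, the following C++ sketch computes a Morton key by interleaving the bits of a tile's row and column, then sorts tile coordinates by that key. The names `spread_bits`, `morton_key`, and `morton_tile_order` are hypothetical, and the course's actual traversal (including how it combines Z-order with slicing) may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Spread the low 16 bits of x so that a zero bit sits between each
// pair of original bits (bit i moves to bit 2*i).
std::uint32_t spread_bits(std::uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

// Morton key: row bits land in odd positions, column bits in even ones.
std::uint32_t morton_key(std::uint32_t row, std::uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}

// All (row, col) tile coordinates of a rows x cols grid, sorted so that
// tiles close together in 2D tend to be visited close together in time.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
morton_tile_order(std::uint32_t rows, std::uint32_t cols) {
    // Each coordinate must fit in 16 bits for the 32-bit key above.
    assert(rows <= 0x10000u && cols <= 0x10000u);

    std::vector<std::pair<std::uint32_t, std::uint32_t>> tiles;
    tiles.reserve(static_cast<std::size_t>(rows) * cols);
    for (std::uint32_t r = 0; r < rows; ++r)
        for (std::uint32_t c = 0; c < cols; ++c)
            tiles.emplace_back(r, c);

    std::sort(tiles.begin(), tiles.end(),
              [](const std::pair<std::uint32_t, std::uint32_t>& p,
                 const std::pair<std::uint32_t, std::uint32_t>& q) {
                  return morton_key(p.first, p.second) <
                         morton_key(q.first, q.second);
              });
    return tiles;
}
```

For a 4×4 grid, `morton_tile_order(4, 4)` visits the top-left 2×2 block of tiles first, then the top-right block, and so on. Consecutive tiles therefore tend to share rows or columns of input data that are still in cache.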