⚡
SIMD / AVX-512
Process 16 floats per instruction using 512-bit ZMM registers
🔀
ILP
Use multiple independent accumulators to break dependency chains and keep execution ports busy
🧵
OpenMP
Parallelise loops across CPU cores with a single pragma
🗄️
Cache blocking
Tile computations to keep the working set in L1/L2 cache
↔️
Transpose trick
Convert strided column access to sequential row access to cut cache misses
🔲
Register tiling
Compute N×M output blocks to maximise data reuse in registers
🌀
Z-order curve
Traverse tiles in Morton order to maximise cache reuse
🚀
Prefetching
Software prefetch hints to hide main-memory latency of roughly 200 cycles
🎮
CUDA
GPU threads, warps, shared memory, and coalesced access