In pursuit of peak computational performance

(Or: How fast should my model be running?)

Marshall Ward

NOAA-GFDL

March 1, 2022

Relative Scaling

image

Relative Comparisons

image

Goals

  • Can we talk about absolute model performance?
  • Can we estimate the idealized peak performance?
  • When do CPUs and GPUs provide that performance?

Hardware Peak

\[W_\text{FLOPS} = W_\text{core} \times N_\text{cores} \times f_\text{cycles}\]

Platform            Theory (GF/s)   Observed (GF/s)
x86 (Cascade Lake)  2150.4          2145.3
Nvidia (Volta)      7065.6          7007.3

But are these numbers achievable?

How we compute

Cascade Lake

24 core × 8 (AVX512) × 2 (FMA) × 2 port × 2.8 GHz

Volta

80 SM × 32 FP64 core × 2 (FMA) × 1.38 GHz
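
Multiplying these out reproduces the theoretical peaks above:

\[24 \times 8 \times 2 \times 2 \times 2.8\ \text{GHz} = 2150.4\ \text{GF/s}\]

\[80 \times 32 \times 2 \times 1.38\ \text{GHz} = 7065.6\ \text{GF/s}\]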

Cost of Business

Platform      GF/s (observed)   Year   Price   TDP (W)
Cascade Lake  2145.3            2019   $6432   240
Volta (GPU)   7007.3            2018   $8999   250

Models

Cascade Lake

Intel(R) Xeon(R) Platinum 8274 CPU @ 3.20GHz

Volta

Nvidia Volta GV100 (1.38GHz)

Peak Performance

image

Arithmetic Intensity

Operation                                               FLOPs/byte                                        AI
\(C_{ij} = \sum_{k} A_{ik} B_{kj}\)                     \(\frac{2n^3}{3n^2 \times \texttt{sizeof}(A)}\)   \(\frac{1}{12}n\)
\(\phi_{i} = \phi_{i} + \Delta t\, \mathcal{F}_{i}\)    \(\frac{2n}{3n \times \texttt{sizeof}(\phi)}\)    \(\frac{1}{12}\)

Array computation is proportional to bandwidth
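
With \(\texttt{sizeof} = 8\) bytes in double precision, the table's intensities reduce to:

\[I_\text{matmul} = \frac{2n^3}{3n^2 \times 8} = \frac{n}{12}, \qquad I_\text{update} = \frac{2n}{3n \times 8} = \frac{1}{12}\]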

Traditional CFD

Most solvers combine hyperbolic and elliptic parts, e.g.:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[\nabla^2 p = -\nabla \cdot \left( \mathbf{u} \cdot \nabla \mathbf{u} \right) + \nabla \cdot \mathcal{F}\]

The pressure (Montgomery potential, PV, ...) solve is a (sparse) matrix operation.

Ocean CFD

Hydrostatic balance eliminates the elliptic step:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[\frac{\partial p}{\partial z} = -g \rho\]

\[w \sim \frac{\partial \rho}{\partial t} = ...\]

Many (but not all) climate ocean models are hyperbolic

Performance Bounds

image

\(W_\text{FLOPS} \leq I \times B_\text{mem}\)

How these are produced

void kernel(int i, double a, double b, double *x, double *y);
void dummy(double a, double b, double *x, double *y);

int main() {
    /* ... set n, r_max; allocate and initialize a, b, x, y ... */
    start(timer);
    for (int r = 0; r < r_max; r++) {
        for (int i = 0; i < n; i++)
            kernel(i, a, b, x, y);

        // An impossible branch to block loop interchange
        if (y[0] < 0.) dummy(a, b, x, y);
    }
    stop(timer);
}

void kernel(int i, double a, double b, double *x, double *y) {
    y[i] = a * x[i] + y[i];
}

Apply an artificially high iteration count, but let the compiler construct the kernel.

The Roofline Model

\[W = \text{min}(W_\text{max}, I \times B_\text{mem})\]

  • \(W\): Performance (FLOP/s)
  • \(W_\text{max}\): Theoretical peak (FLOP/s)
  • \(I\): Arithmetic Intensity (FLOP/byte)
  • \(B_\text{mem}\): Memory bandwidth (byte/s) (cache, RAM, ...)

For our codes, often \(W = I \times B_\text{mem}\)
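
For example, taking the array-update intensity \(I = \frac{1}{12}\) and the x86 RAM bandwidth computed below (141 GB/s):

\[W = \min\left(2150.4,\ \tfrac{1}{12} \times 141\right)\ \text{GF/s} \approx 11.7\ \text{GF/s}\]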

Unaccounted Factors

  • Pipelining
  • Instruction latency
  • Instruction decoder
  • Variable clock frequency
  • Depletion of registers

Peak Bandwidth

image

Platform      BW (GB/s)
x86 (cache)   8600
x86 (RAM)     140
Volta (RAM)   900

Cache BW is across all cores!

Computing Bandwidth

Platform    Speed (GHz)   Width (B)   BW (GB/s)
x86 cache   2.8*          128*        8602
x86 RAM     2.933         8 × 6       141
Volta RAM   0.877         1024        898

* AVX512 clock speed; 128 B = L1 load + store width per core
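
Spelling out the rows (the cache figure sums over all 24 cores):

\[24 \times 128\ \text{B} \times 2.8\ \text{GHz} = 8602\ \text{GB/s}\]

\[6 \times 8\ \text{B} \times 2.933\ \text{GHz} = 141\ \text{GB/s}\]

\[1024\ \text{B} \times 0.877\ \text{GHz} = 898\ \text{GB/s}\]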

x86 Cache Bandwidth

image

Three 64 B ports with AGUs, but only two AGUs are usable per cycle

Euler Step

\[\phi^{n+1}_i = \phi^n_i + \Delta t \left( \frac{\partial \phi}{\partial t} \right)_i\]

y[i] = y[i] + a * x[i]

2 FLOPs per 3×8 bytes: \(I = \frac{2}{3 \times 8} = \frac{1}{12}\)

Peak Euler

Platform     Observed (GF/s)   Theory (GF/s)   %
x86 (Cache)  715.1             716.8           99.7
x86 (RAM)    9.5               11.7            81.2
GPU          67.3              74.8            90.0

RAM efficiencies of 80-90% are common

Finite Difference

\[\phi_{i+\tfrac{1}{2}} = \frac{1}{2} \left( \phi_i + \phi_{i+1} \right) \qquad I = \frac{2}{2\times 8} = \frac{1}{8}\]

y[i] = a * (x[i] + x[i+1])

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.3          781.4             72.7
x86 (RAM)    17.6            9.4               53.4
GPU          112.3           72.4              64.5

Why do observations fall so far short of the estimates?

Write-Allocate Effect

Unlike the Euler step, the target array does not appear on the RHS:

y[i] = a * x[i] + a * x[i+1]

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.3          781.4             72.7
x86 (RAM)    11.7            9.4               80.1
GPU          74.8            72.4              96.8

For RAM writes, one implicit read: \(I_\text{RAM} = \frac{1}{12}\)
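
That is, the store of y[i] first fills its cache line from RAM, so each point moves three words for two FLOPs:

\[I_\text{RAM} = \frac{2}{3 \times 8} = \frac{1}{12}\]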

W-A Roofline

image

Diffusion Solver

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

y[i] = a * x[i-1] + b * x[i] + a * x[i+1]

x[i] = y[i]

4 FLOPs (amortized) per 4×8 bytes: \(I = \frac{1}{8}\)

RAM-bound: \(I_\text{RAM} = \frac{1}{10}\)

How many FLOPs?

Is it four or five operations?

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

y[i] = x[i] + a * (x[i-1] - 2 * x[i] + x[i+1])

\[\phi^{n+1}_i = \left( 1 - 2 \frac{\Delta t}{\Delta x^2} \right) \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} + \phi^n_{i-1} \right)\]

y[i] = a * x[i] + b * (x[i-1] + x[i+1])

Do we follow parentheses? Ask the compiler!
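
A sketch of the two forms as the compiler sees them (hypothetical helper functions; in step5, a = Δt/Δx², while step4 precomputes a = 1 - 2Δt/Δx² and b = Δt/Δx²):

void step5(int n, double a, const double *x, double *y) {
    // 2 mul + 3 add per point if parentheses are honored
    for (int i = 1; i < n - 1; i++)
        y[i] = x[i] + a * (x[i-1] - 2. * x[i] + x[i+1]);
}

void step4(int n, double a, double b, const double *x, double *y) {
    // 2 mul + 2 add per point
    for (int i = 1; i < n - 1; i++)
        y[i] = a * x[i] + b * (x[i-1] + x[i+1]);
}

Whether these contract further into FMAs is again the compiler's choice.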

Aligned update

Platform     +1 (GF/s)   +8 (GF/s)   Theory (GF/s)
x86 (Cache)  499.9       690.7       1075.2
x86 (RAM)    10.8        10.8        17.6
GPU          81.1        81.1        112.3

Single-element displacement incurs a performance penalty (sketched below)
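
A hypothetical reconstruction of the two columns, reading "+1" and "+8" as the stencil displacement d below (d = 8 spans one full AVX512 vector of doubles, so vector loads stay aligned with the store):

void update(int n, int d, double a, double b, const double *x, double *y) {
    // d = 1: single-element displacement; d = 8: one-vector displacement
    for (int i = d; i < n - d; i++)
        y[i] = a * x[i] + b * (x[i-d] + x[i+d]);
}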

Diffusion Performance

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.2          690.7             64.2
x86 (RAM)    14.1            10.8              76.7
GPU          89.9            81.1              90.2

Significant drop in cache performance

Staging Penalty

Stage                                        Observed     Theory        %
S1: \(y \leftarrow x + \alpha \nabla^2 x\)   1085 GF/s    1434 GF/s     75.7
S2: \(x \leftarrow y\)                       9145 GB/s    9216* GB/s    99.2
S1 + S2                                      691 GF/s     1075 GF/s     64.2

Restarting on a new array incurs a performance penalty

NOTE: L1⇋RAM transfer obscures this cost!

Theoretical values

Why 9216 GB/s?

  • AVX512 copy was clocked at 3.0 GHz, slightly above AVX512 arithmetic at 2.8 GHz.
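
That is, rerunning the cache-bandwidth computation at the copy clock:

\[24 \times 128\ \text{B} \times 3.0\ \text{GHz} = 9216\ \text{GB/s}\]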

MOM6 Benchmark

image

  • 192 × 128 grid, 75 levels
    • 32 × 32 per MPI rank
    • ~1.8M points / field
  • 288 steps (3 days, \(\Delta t = 900\) s)
  • "Benchmark" configuration:
    • Split barotropic
    • Thermodynamic EOS
    • Parameterization suite
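
As a check on the bullet arithmetic: \(192 \times 128 \times 75 \approx 1.8\text{M}\) points per field, and \(288 \times 900\ \text{s} = 3\) days.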

MOM6 Performance

image

High FLOPs don't necessarily indicate efficiency

Observed AI

Can we measure the subroutine intensity?

  • FLOP count is reasonably accurate
  • We cannot count bytes, but can count transfers

Estimate the bytes moved from instruction counts and the vectorization ratio:

\[I \approx \frac{N_\text{FLOPs}} {8 \times \left( N_\text{iLoad} + N_\text{iStore} \right) \times \alpha_\text{vec}}\]

where \(\alpha_\text{vec} = N_\text{FLOPs} / N_\text{iFLOPs}\)
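
A minimal sketch of this estimate in code, assuming the four counter totals have already been gathered (e.g. from hardware performance counters; the function and argument names here are illustrative):

double arithmetic_intensity(double n_flops,   /* total FLOPs */
                            double n_iflops,  /* FP instructions retired */
                            double n_iload,   /* load instructions retired */
                            double n_istore)  /* store instructions retired */
{
    /* Mean FLOPs per FP instruction, i.e. the vectorization factor */
    double alpha_vec = n_flops / n_iflops;

    /* Assume each memory instruction moves 8 * alpha_vec bytes */
    double bytes = 8.0 * (n_iload + n_istore) * alpha_vec;

    return n_flops / bytes;
}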

MOM6 Roofline

image

MOM6: All subroutines

image

MOM6: 16x Domain

image

Recent Speedups

Subroutine   Runtime           Speedup
hor_visc     1.69 s → 1.01 s   1.67x
corAdCalc    1.11 s → 0.64 s   1.74x

Highest points correspond to recent optimization efforts

(NOTE: Timings are from a different config)

Procedure

  • Compute \(I\) for operation
  • If targeting CPUs:
    • Seek optimal length, distribute across cores
  • If targeting GPUs:
    • Push for very large domains
  • Evaluate performance, assess efficiency
    • Seek improvements ("move up") if possible

Summary

  • Arithmetic intensity is a strong indicator of bandwidth-limited performance
  • CPUs benefit from modest array sizes (~1M points)
  • GPUs favor very large arrays and implicit solvers
  • Roofline analysis enables performance prediction and targeted optimization

Further Work

  • Build up a "calculus" of \(W_\text{FLOPS}\) estimation
  • Develop confidence in estimating model bandwidth

Future Hardware

  • Nvidia Hopper/Ampere have even greater bandwidth (>2-3 TB/s?)
  • Fujitsu's A64FX: combine a cache-based CPU with HBM memory?