Stanford CS 149
Introduction
Overall
- Parallel thinking
- Decomposing work into pieces that can safely be performed in parallel
- Assigning work to processors
- Managing communication/synchronization between the processors so that it does not limit speedup
- Writing code
- Performance characteristics of implementation
- Design trade-offs: performance vs. convenience vs. cost
- Hardware
- Fast != efficient
- Fast on parallel computer does not mean that it is using the hardware efficiently
- Make use of provided machine capabilities (Programmer’s perspective) vs choosing the right capabilities to put in system (HW designer’s perspective)
- Why parallel: for roughly 15 years, processor performance improved mainly by exploiting instruction-level parallelism and by increasing CPU clock frequency; both approaches have since hit limits
Machine code
- Structure: fetch/decode $\rightarrow$ ALU (execution unit) $\rightarrow$ execution context (Registers)
- ALU: performs the operation
- Registers: maintain program state; store value of variables
Instruction level parallelism (ILP)
- Superscalar execution:
- Processor automatically find independent instructions in an instruction sequence and executes them in parallel on multiple execution units
- Superscalar processor: decode and execute multiple instructions per clock
- Out-of-order control logic $\rightarrow$ fetch/decode 1 // fetch/decode 2 $\rightarrow$ execution 1 // execution 2 $\rightarrow$ execution context
- Ex. Old Intel Pentium 4 CPU: instruction decoder has 2 simple instruction decoders, and 1 complex instruction decoder; 2 integer units, 1 floating-point unit, and 1 memory interface unit in execution unit block
- The amount of available ILP is limited by the structure of the instruction dependency graph: a deep chain of dependent instructions leaves little to execute in parallel
Hardware background info
- Diminishing returns of superscalar execution when issuing more than about 4 instructions per clock
- Moore’s Law: no. of transistors on microchips doubles every two years
- ILP tapped out after around 2001; processor clock rate stops increasing after around 2005
- Building faster processors by adding more execution units that run in parallel or units that are specialized for a specific task (graphic, audio/video playback, etc.)
- Ex. multi-core processors: Intel “Comet Lake” i9 10-core CPU, AMD Ryzen Threadripper 3990X 64-core CPU (8 8-core chiplets), NVIDIA AD102 GPU, GPU-accelerated supercomputers, Apple A15 Bionic 6-core CPU + multi-core GPU, Raspberry Pi 3 quad-core ARM A53 CPU
- Specialized processing is ubiquitous in mobile systems
- Apple-designed SoC: multi-core GPU + neural engine (NPU) for DNN acceleration + image/video encode/decode processors + motion (sensor) processor; TPU: tensor processing unit, Google’s specialized processor for ML computations
- Specialized hardware to accelerate DNN inference/training: Google TPU3, GraphCore IPU, Intel deep learning inference accelerator, Apple neural engine
- Software must be written to be parallel to see performance gains.
Power wall
- Power consumed by a transistor: dynamic power $\propto$ capacitive load $\times$ voltage$^2$ $\times$ frequency
- Static power: transistors burn power even when inactive due to leakage
- High power = high heat
Memory
Efficient processing almost always comes down to accessing data efficiently
- Memory is organized as an array of bytes; each byte is identified by its address in memory
- Terminology
- Memory access latency: the amount of time it takes the memory system to provide data to the processor
- Stalls: a processor stalls (can’t make progress) when it cannot run the next instruction in an instruction stream because future instructions depend on a previous instruction that is not yet complete. $\rightarrow$ accessing memory is a major source of stalls
- Memory access times: $\sim 100$s of cycles (a measure of latency)
- Caches
- A cache is a hardware implementation detail that does not impact the output of a program, only its performance
- Cache is on-chip storage that maintains a copy of a subset of the values in memory
- Loads/stores from cache are much faster than from DRAM (dynamic random access memory); on a Kaby Lake CPU, cache latencies range from about 4 to 38 cycles, while DRAM latency is around 248 cycles
- Caches operate at the granularity of “cache lines”; each line holds a fixed-size block of adjacent bytes (64 bytes on most modern CPUs)
- LRU (least recently used) replacement policy
- Spatial locality: loading data in a cache line “preloads” the data needed for subsequent accesses to nearby addresses in the same line, leading to cache hits; addresses near a recently accessed address are likely to be accessed soon
- Temporal locality: repeated accesses to the same address result in hits; a recently accessed address is likely to be accessed again soon
- Caches reduce length of stalls; caches provide high bandwidth data transfer
- Implementation of the linear memory address space abstraction on a modern computer
- Common organization: hierarchy of caches: level 1 (L1, 32 KB), level 2 (L2, 256 KB), level 3 (L3, 20 MB); DRAM (64 GB)
- The instruction “load the value stored at address X into register R0” might involve a complex sequence of operations by multiple data caches and access to DRAM
Summary
- Single-thread-of-control performance is improving very slowly
- Utilize multiple processing elements
- Specialized processing hardware
- Parallel programming
- Problem partitioning, communication, synchronization
- Knowledge of machine characteristics
- Understanding data movement
A Modern Multi-Core Processor
Parallelism
- simple processor $\rightarrow$ superscalar processor (2 instructions/clock) $\rightarrow$ pre multi-core era processor (larger cache, smarter out-of-order logic, smarter branch predictor, etc.) $\rightarrow$ multi-core era processor
- Multi-core era processor
- Idea #1: use increasing transistor count to add more cores to the processor
- Idea #2: amortize cost/complexity of managing an instruction stream across many ALUs
- SIMD processing: single instruction, multiple data (same instruction broadcast to all ALUs; parallel on all ALUs)
- Vector program: realize with AVX intrinsics datatypes and functions (C)
- Intrinsic functions operate on vectors of $8$ $32$-bit values (e.g. vector of $8$ floats)
- Compiled program: processes $8$ array elements simultaneously using vector instructions on $256$-bit vector registers
- Data-parallel expression (“forall” construct)
- Loop iterations are independent
- The same loop body will be executed on a large number of data elements
- All the iterations of the loop carry out the exact same sequence of instructions
- This abstraction can facilitate automatic generation of both multi-core parallel code and vector instructions to make use of SIMD processing capabilities within a core.
- Conditional execution
- Mask/discard output of ALU
- Not all the ALUs do useful work
- After branch, continue at full performance
- Worst case: $1/8$ peak performance
- Coherent execution
- Property of a program where the same instruction sequence applies to many data elements
- Coherent execution is NECESSARY for SIMD processing resources to be used efficiently
- Coherent execution is NOT NECESSARY for efficient parallelization across different cores
- A lack of instruction stream coherence in a program called “divergent” execution
- SIMD execution: modern CPU examples
- Instructions generated by the compiler; parallelism explicitly requested by the programmer using intrinsics
- Parallelism conveyed using parallel language semantics (“forall”)
- Parallelism inferred by dependency analysis of loops by “auto-vectorizing” compiler
- “Explicit SIMD”: SIMD parallelization is performed at compile time; can inspect program binary and see SIMD instructions
- SIMD execution: modern GPUs
- “Implicit SIMD”
- Compiler generates a binary with scalar instructions
- N instances of the program are always run together on the processor
- Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple program instances on different data on SIMD ALUs
- SIMD width of most modern GPUs ranges from 8 to 32
- Divergent execution can be a big issue (this means poorly written code might execute at $1/32$ the peak capability of the machine)
- Summary: 3 different forms of parallel execution
- Superscalar
- Exploit ILP within an instruction stream
- Process different instructions from the same instruction stream in parallel (within a core)
- Parallelism automatically discovered by the hardware during execution
- SIMD
- Multiple ALUs controlled by the same instruction (within a core)
- Efficient for data-parallel workload: amortize control costs over many ALUs
- Vectorization done by compiler (explicit SIMD) or at runtime by hardware (implicit SIMD)
- Multi-Core
- Use multiple processing cores
- Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each one
- Software creates threads to expose parallelism to hardware (e.g., via a threading API)
// Example: parallelism using C++ threads
typedef struct {
int N;
int terms;
float* x;
float* y;
} my_args;
void my_thread_func(my_args* args) {
sinx(args->N, args->terms, args->x, args->y);
}
void parallel_sinx(int N, int terms, float* x, float* y) {
std::thread my_thread;
my_args args;
args.N = N/2;
args.terms = terms;
args.x = x;
args.y = y;
// launch thread
my_thread = std::thread(my_thread_func, &args);
// on main thread
sinx(N - args.N, terms, x + args.N, y + args.N);
// wait for thread to complete
my_thread.join();
}
// Data-parallel expression
// Taylor's expansion of sin(x) function for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* y) {
//substitute for (int i = 0; i < N; ++i) { with forall function
// declares that loop iterations are independent
// A compiler might automatically generate threaded code for you
forall (int i from 0 to N) {
float value = x[i];
float numer = x[i] * x[i] * x[i];
int denom = 6;
int sign = -1;
for (int j = 1; j <= terms; j ++) {
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2 * j + 2) * (2 * j + 3);
sign *= -1;
}
y[i] = value;
}
}
// Vector program (using AVX intrinsics)
#include <immintrin.h>
void sinx(int N, int terms, float* x, float* y) {
float three_fact = 6;
for (int i = 0; i < N; i += 8) {
__m256 origx = _mm256_load_ps(&x[i]);
__m256 value = origx;
__m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
__m256 denom = _mm256_broadcast_ss(&three_fact);
int sign = -1;
for (int j = 1; j <= terms; ++j) {
// value += sign * numer / denom
__m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
value = _mm256_add_ps(value, tmp);
numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
float factor = (2 * j + 2) * (2 * j + 3);
denom = _mm256_mul_ps(denom, _mm256_broadcast_ss(&factor));
sign *= -1;
}
_mm256_store_ps(&y[i], value);
}
}
Accessing Memory
- Multi-threading reduces stalls (latency)
- Idea #3: interleave processing of multiple threads on the same core to hide stalls
- stall $\rightarrow$ runnable (ready to run, but the core is executing instructions from another thread) $\rightarrow$ running $\rightarrow$ done
- Throughput-oriented systems: potentially increase time to complete work by any one thread, in order to increase overall system throughput when running multiple threads
- On-chip storage for execution contexts is a finite resource (held on-chip, much like the L1 cache)
- with many small contexts (storage for small working set per thread): high latency hiding ability
- with large context (storage for large working set per thread): low latency hiding ability
- Summary
- A processor with multiple hardware threads has the ability to avoid stalls by performing instructions from other threads when one thread must wait for a long latency operation to complete
- The latency of a memory operation is not changed by multi-threading; it simply no longer causes reduced processor utilization
- A multi-threaded processor hides memory latency by performing arithmetic from other threads
- Programs that feature more arithmetic instructions per memory access need fewer threads to hide memory stalls
- Hardware-supported multi-threading
- Core manages execution contexts for multiple threads; the processor decides which thread to run each clock
- Interleaved multi-threading (e.g., temporal multi-threading)
- Simultaneous multi-threading (SMT, each clock, core chooses instructions from multiple threads to run on ALUs)
- Bandwidth
- Terminology
- Bandwidth: the rate at which the memory system can provide data to a processor (e.g., $20$ GB/s; analogy: number of lanes of traffic)
- Latency: the time to complete one operation (analogy: the time to drive from one city to another)
- Throughput: operations completed per unit time, e.g., instructions/cycle (analogy: the number of cars that arrive per unit time)
- High bandwidth memories: modern GPUs leverage high bandwidth memories located near processor
- In modern computing, bandwidth is the critical resource
- Performant parallel programs:
- Organize computation to fetch data from memory less
- Reuse data previously loaded by the same thread (temporal locality optimizations)
- Share data across threads (inter-thread cooperation)
- Favor performing additional arithmetic over storing/reloading values (math does not consume memory bandwidth)
Parallel Programming Abstractions
- Abstraction vs. implementation
- Semantics (abstraction): given a program and the semantics of the operations used, what the program computes
- Implementation (scheduling): how and when the program’s operations are actually executed, potentially in parallel: per thread, per execution unit, per lane of a vector instruction
- ISPC (Intel SPMD Program Compiler)
- SPMD (Single Program Multiple Data)
- Check this link (https://pharr.org/matt/blog/2018/04/30/ispc-all.html) and fill this part
- Function call flow in the `sinx()` example:
- `main()` calls `ispc_sinx()`: control transferred to the `ispc_sinx()` function (sequential execution, C code)
- Begin executing `programCount` instances of `ispc_sinx()` (ISPC code)
- SPMD programming abstraction:
- Call to ISPC function spawns “gang” of ISPC “program instances”
- All instances run ISPC code concurrently
- Each instance has its own copy of local variables
- Upon return, all instances have completed
- ISPC implements the gang abstraction using SIMD instructions
- No. of instances in a gang is the SIMD width of the hardware (or multiple of SIMD width)
- ISPC compiler generates a C++ function binary(.o) whose body contains SIMD instructions
- C++ linking against generated object files as usual
- Return to `main()`: control transferred back to the `main()` function (sequential execution, C code)
- ISPC keywords:
- `programCount`: no. of simultaneously executing instances in the gang (a uniform value); here `programCount` is $8$
- `programIndex`: id of the current instance in the gang (a non-uniform value)
- `uniform`: a type modifier; all instances have the same value for this variable; purely an optimization, not needed for correctness
- `foreach`: declares parallel loop iterations; the entire gang cooperatively performs all iterations (not each instance independently)
- Interleaved assignment (0,1,2,…,7) and blocked assignment (0, 8, 16, …, 56)
- ISPC cross-program-instance operations:
- `uniform int64 reduce_add(int32 x);`: sum of a variable’s value across all program instances in a gang
- `uniform int32 reduce_min(int32 a);`: min of all values in a gang
- `int32 broadcast(int32 value, uniform int index);`: broadcast a value from one instance to all instances in a gang
- `int32 rotate(int32 value, uniform int offset);`: for all `i`, pass value from instance `i` to instance `(i + offset) % programCount`
- ISPC: abstraction vs. implementation
- SPMD: single program, multiple data program model
- Running a gang spawns `programCount` logical instruction streams (each with a different value of `programIndex`)
- single thread of control $\rightarrow$ call SPMD function $\rightarrow$ SPMD execution (multiple instances of the function run in parallel) $\rightarrow$ SPMD function returns $\rightarrow$ resume single thread of control
- SIMD: single instruction, multiple data program model
- The ISPC compiler emits vector instructions that implement the logic of the whole ISPC gang
- ISPC compiler mapping of conditional control flow to vector instructions
- The ISPC gang abstraction is implemented by SIMD instructions within one thread running on one core of a CPU
- Common errors:
* Common error example: a varying `sum` cannot be returned where the caller expects a single `uniform float`

```cpp
export uniform float sum_incorrect_2(uniform int N, uniform float* x) {
    // sum is a varying float: one copy of the variable per program instance
    float sum = 0.0f;
    foreach (i = 0 ... N) {
        // x[i] has a different value in each program instance
        sum += x[i];
    }
    // sum holds many copies (one per instance), but the calling
    // C code expects one return value of type float
    return sum; // compile-time type error
}
```
- Another alternative: a `Collection` structure with no loops (no explicit indexing), similar to NumPy, PyTorch, etc.
// ISPC code
// Interleaved assignment
export void ispc_sinx (
uniform int N, // N = 1024 as input value
uniform int terms,
uniform float* x,
uniform float* result) {
// assume N % programCount = 0
// Interleave assignment
// "Gang" of ISPC program instances, Gang contains programCount = 8 instances
for (uniform int i = 0; i < N; i += programCount) {
int idx = i + programIndex; // Local variable
float value = x[idx]; // Local variable
float numer = x[idx] * x[idx] * x[idx]; // Local variable
uniform int denom = 6;
uniform int sign = -1;
for (uniform int j = 1; j <= terms; ++j) {
value += sign * numer / denom;
numer *= x[idx] * x[idx];
denom *= (2 * j + 2) * (2 * j + 3);
sign *= -1;
}
result[idx] = value;
}
}
// Blocked assignment of array elements to program instances
// Assign multiple elements to each instance
export void ispc_sinx_v2 (
uniform int N,
uniform int terms,
uniform float* x,
uniform float* result) {
// Assume N % programCount = 0;
// Blocked assignment
uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i = 0; i < count; ++i) {
int idx = start + i;
float value = x[idx];
float numer = x[idx] * x[idx] * x[idx];
uniform int denom = 6;
uniform int sign = -1;
for (uniform int j = 1; j <= terms; ++j) {
value += sign * numer / denom;
numer *= x[idx] * x[idx];
denom *= (2 * j + 2) * (2 * j + 3); // same series as v1; only idx changes
sign *= -1;
}
result[idx] = value;
}
}
Parallel Programming Basics
Case study on writing and optimizing a parallel program: data-parallel; shared address space
Creating a parallel program
- Thought process:
- Identify work that can be performed in parallel
- Partition work and the associated data
- Manage data access, communication, and synchronization
- Goal: maximize speedup for a fixed computation: Speedup(P processors) = Time(1 processor) / Time(P processors)
- Flowchart
- Problem to solve
- Decomposition into subproblems (“tasks”)
- Assignment into parallel threads (“workers”)
- Orchestration to parallel program (communicating threads)
- Mapping execution on parallel machine
Problem decomposition
- Create at least enough tasks to keep all execution units on a machine busy
- Key: identifying dependencies
- Amdahl’s law: dependencies limit maximum speedup due to parallelism
- Assume $s$ is the fraction of execution that is inherently sequential (dependencies prevent parallel execution)
- speedup $= \frac{1}{s + \frac{1-s}{P}}$, where $s$ is the proportion that cannot be parallelized and $1-s$ is the proportion that can be parallelized across $P$ processors
- Maximum speedup from parallel execution $\leq 1/s$
- A small serial region can limit speedup on a large parallel machine
- e.g., if $0.1\%$ of the application is serial, the maximum speedup is capped near $1000$ no matter how large $P$ is
An Example
- 2-step computation on an $N\times N$ image: step 1: multiply brightness of all pixels by 2; step 2: compute the average of all pixel values
- Sequential implementation: time cost $\approx 2N^{2}$
- 1st attempt at parallelism (P processors)
- On step 1 only: multiplication in parallel (time cost: $N^{2}/P$)
- Overall performance: speedup $\leq\frac{2N^{2}}{N^{2}/P + N^{2}}\leq 2$ (step 2 is still serial)
- On step 2 as well: compute partial sums in parallel, combine results serially (time cost: $N^{2}/P + P$)
- Overall performance: speedup $\leq\frac{2N^{2}}{2N^{2}/P + P}\rightarrow P$ when $N\gg P$
Decomposition
- Programmers are responsible for decomposing a program into independent tasks
- Automatic decomposition of sequential programs continues to be a challenging research problem
- Compiler must analyze the program and identify dependencies (but dependencies can be data-dependent and thus unknown at compile time)
- Had modest success with simple loop nests
- A “magic parallelizing compiler” for complex, general-purpose code has not yet been achieved
Assignment
- “Tasks”, “workers” (threads, program instances, vector lanes, etc.)
- Goals: good workload balance, reduce communication costs
- Statically or dynamically
- Many languages/runtimes take responsibility for assignment
- Examples in ISPC of work by loop iteration
- Programmer-managed assignment: static assignment; assign iterations to ISPC program instances in interleaved fashion
- System-managed assignment: the `foreach` construct exposes independent work (iterations) to the system; the abstraction leaves room for dynamic assignment, though the current ISPC implementation is static
- Programmer-managed assignment: static assignment; loop iterations mapped to threads in a blocked fashion (first half to a spawned thread, second half to the main thread)
void my_thread_start(int N, int terms, float* x, float* result) {
sinx(N, terms, x, result); // do work
}
void parallel_sinx(int N, int terms, float* x, float* result) {
int half = N / 2;
// launch thread to do work on first half of array
std::thread t1(my_thread_start, half, terms, x, result);
// do work on second half of array in main thread
sinx(N - half, terms, x + half, result + half);
t1.join();
}
- Dynamic assignment using ISPC tasks: the ISPC runtime (invisible to the programmer) assigns tasks to worker threads in a thread pool
void foo(uniform float* input, uniform float* output, uniform int N) {
// create a bunch of tasks
launch[100] my_ispc_task(input, output, N);
}
Orchestration
- Goals: reduce costs of communication/sync, preserve locality of data reference, reduce overhead, etc.
- Structuring communication
- Adding synchronization to preserve dependencies
- Organizing data structures in memory
- Scheduling tasks
- If synchronization is expensive, the programmer might use it more sparingly
- Mapping to hardware (mapping “threads” to hardware execution units)
- Mapping by the operating system, e.g., a thread to a HW execution context on a CPU core
- Mapping by the compiler, e.g., ISPC program instances to vector instruction lanes
- Mapping CUDA thread blocks to GPU cores
- Mapping decisions
- Place related threads (cooperating threads) on the same core (maximize locality, data sharing, minimize costs of comm/sync)
- Place unrelated threads on the same core (one might be bandwidth-limited and the other compute-limited) to use the machine more efficiently
Example:
- A 2D-grid based solver: solve partial differential equation on $(N+2)\times(N+2)$ grid
- Iterative algorithm: Gauss-Seidel sweeps over the grid until convergence;
A[i,j] = 0.2*(A[i,j] + A[i,j-1] + A[i-1, j] + A[i, j+1] + A[i+1,j]);
const int n;
float* A; // allocate space
void solve(float* A) {
float diff, prev;
bool done = false;
while (!done) {
diff = 0.f;
for (int i = 1; i <= n; ++i) { // iterate only over interior (non-border) cells
for (int j = 1; j <= n; ++j) {
prev = A[i, j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
diff += fabs(A[i,j] - prev); // compute amount of change
}
}
if (diff/(n*n) < TOLERANCE) {
done = true;
}
}
}
- Parallel programming step 1: identify dependencies (problem decomposition phase)
- Depends on previous row
- Depends on the left element
- Independent work along diagonals (each diagonal depends on previous diagonal, but every pixel on diagonal is independent)
- $\rightarrow$ Reorder grid cell update via red-black coloring
- Parallel programming step 2: assignment
- Block assignment or interleaved assignment depending on the system
- Red cell update in parallel $\rightarrow$ wait until all processors done with update $\rightarrow$ communicate updated red cells to other processors $\rightarrow$ black cell update in parallel $\rightarrow$ wait until processors done with update $\rightarrow$ communicate updated black cells to other processors $\rightarrow$ repeat
- Writing the program: data parallel thinking; SPMD/shared address space
- Data-parallel expression of grid solver
const int n;
float* A = allocate(n + 2, n + 2);
void solve(float* A) {
bool done = false;
float diff = 0.f;
while (!done) {
// Decomposition
for_all (red_cells(i,j)) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
// Orchestration: builtin communication primitive: reduceAdd
reduceAdd(diff, abs(A[i,j] - prev));
} // Orchestration: the for_all block implicitly waits for all workers before returning to sequential control
if (diff / (n * n) < TOLERANCE) {
done = true;
}
}
}
* Shared address space (with SPMD threads) expression of solver:
* Locks: only one thread in the critical region at a time
* Barriers: wait for threads to reach this point
// Global variables
int n; // grid size n x n
bool done = false;
float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;
float* A = allocate(n + 2, n + 2);
void solve(float* A) { // executed by all threads; SPMD-style
float myDiff;
int threadId = getThreadId(); // threadId is different for each SPMD instance
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS); // each thread computes the rows
while (!done) {
myDiff = 0.f; // reset private accumulator (declared above)
diff = 0.f;
barrier(myBarrier, NUM_PROCESSORS);
for (j = myMin to myMax) {
for (i = red_cells in this row) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
myDiff += abs(A[i,j] - prev);
}
}
lock(myLock);
diff += myDiff;
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS);
if (diff / (n * n) < TOLERANCE) {
done = true;
}
barrier(myBarrier, NUM_PROCESSORS);
}
}