Stanford CS 149
Introduction
Overall
- Parallel thinking
- Decomposing work into pieces that can safely be performed in parallel
- Assigning work to processors
- Managing communication/synchronization between the processors so that it does not limit speedup
- Writing code
- Performance characteristics of implementation
- Design trade-offs: performance vs. convenience vs. cost
- Hardware
- Fast != efficient
- Fast on parallel computer does not mean that it is using the hardware efficiently
- Make use of provided machine capabilities (Programmer’s perspective) vs choosing the right capabilities to put in system (HW designer’s perspective)
- Why parallel: for roughly 15 years, processor performance improved mainly by exploiting instruction-level parallelism and by increasing CPU clock frequency; both approaches have since hit limits
Machine code
- Structure: fetch/decode $\rightarrow$ ALU (execution unit) $\rightarrow$ execution context (Registers)
- ALU: performs the operation
- Registers: maintain program state; store value of variables
Instruction level parallelism (ILP)
- Superscalar execution:
- Processor automatically find independent instructions in an instruction sequence and executes them in parallel on multiple execution units
- Superscalar processor: decode and execute multiple instructions per clock
- Out-of-order control logic $\rightarrow$ fetch/decode 1 // fetch/decode 2 $\rightarrow$ execution 1 // execution 2 $\rightarrow$ execution context
- Ex. Old Intel Pentium 4 CPU: instruction decoder has 2 simple instruction decoders, and 1 complex instruction decoder; 2 integer units, 1 floating-point unit, and 1 memory interface unit in execution unit block
- The amount of available ILP is limited by the structure of the instruction dependency graph: a deep chain of dependent instructions leaves little to execute in parallel
Hardware background info
- Diminishing returns of superscalar execution when issuing more than about 4 instructions per clock
- Moore’s Law: no. of transistors on microchips doubles every two years
- ILP tapped out after around 2001; processor clock rate stops increasing after around 2005
- Building faster processors by adding more execution units that run in parallel or units that are specialized for a specific task (graphic, audio/video playback, etc.)
- Ex. multi-core processors: Intel “Comet Lake” i9 10-core CPU, AMD Ryzen Threadripper 3990X 64-core CPU (8 8-core chiplets), NVIDIA AD102 GPU, GPU-accelerated supercomputers, Apple A15 Bionic 6-core CPU + multi-core GPU, Raspberry Pi 3 quad-core ARM A53 CPU
- Specialized processing is ubiquitous in mobile systems
- Apple-designed SoC: multi-core GPU + neural engine (NPU) for DNN acceleration + image/video encode/decode processors + motion (sensor) processor; TPU: tensor processing unit, Google’s specialized processor for ML computations
- Specialized hardware to accelerate DNN inference/training: Google TPU3, GraphCore IPU, Intel deep learning inference accelerator, Apple neural engine
- Software must be written to be parallel to see performance gains.
Power wall
- Power consumed by a transistor: dynamic power $\propto$ capacitive load $\times$ voltage$^2$ $\times$ frequency
- Static power: transistors burn power even when inactive due to leakage
- High power = high heat
Memory
Efficient processing almost always comes down to accessing data efficiently
- Memory is organized as an array of bytes; each byte is identified by its address in memory
- Terminology
- Memory access latency: the amount of time it takes the memory system to provide data to the processor
- Stalls: a processor stalls (can’t make progress) when it cannot run the next instruction in an instruction stream because future instructions depend on a previous instruction that is not yet complete. $\rightarrow$ accessing memory is a major source of stalls
- Memory access times: $\sim 100$s of cycles (a measure of latency)
- Caches
- A cache is a hardware implementation detail that does not impact the output of a program, only its performance
- Cache is on-chip storage that maintains a copy of a subset of the values in memory
- Loads/stores from cache are much faster than from DRAM (dynamic random access memory); on a Kaby Lake CPU, cache latencies range from about 4 to 38 cycles, while DRAM latency is around 248 cycles
- Caches operate at the granularity of “cache lines”; each line holds a fixed-size block of adjacent bytes (64 bytes on most modern CPUs)
- LRU (least recently used) replacement policy
- Spatial locality: loading data in a cache line “preloads” the data needed for subsequent accesses to nearby addresses in the same line, leading to cache hits; addresses near a recently accessed address are likely to be accessed soon
- Temporal locality: repeated accesses to the same address result in hits; a recently accessed address is likely to be accessed again soon
- Caches reduce length of stalls; caches provide high bandwidth data transfer
- Implementation of the linear memory address space abstraction on a modern computer
- Common organization: hierarchy of caches: level 1 (L1, 32 KB), level 2 (L2, 256 KB), level 3 (L3, 20 MB); DRAM (64 GB)
- The instruction “load the value stored at address X into register R0” might involve a complex sequence of operations by multiple data caches and access to DRAM
Summary
- Single-thread-of-control performance is improving very slowly
- Utilize multiple processing elements
- Specialized processing hardware
- Parallel programming
- Problem partitioning, communication, synchronization
- Knowledge of machine characteristics
- Understanding data movement
A Modern Multi-Core Processor
Parallelism
- simple processor $\rightarrow$ superscalar processor (2 instructions/clock) $\rightarrow$ pre multi-core era processor (larger cache, smarter out-of-order logic, smarter branch predictor, etc.) $\rightarrow$ multi-core era processor
- Multi-core era processor
- Idea #1: use increasing transistor count to add more cores to the processor
- Idea #2: amortize cost/complexity of managing an instruction stream across many ALUs
- SIMD processing: single instruction, multiple data (same instruction broadcast to all ALUs; parallel on all ALUs)
- Vector program: realize with AVX intrinsics datatypes and functions (C)
- Intrinsic functions operate on vectors of $8$ $32$-bit values (e.g. vector of $8$ floats)
- Compiled program: processes $8$ array elements simultaneously using vector instructions on $256$-bit vector registers
- Data-parallel expression (“forall” construct)
- Loop iterations are independent
- The same loop body will be executed on a large number of data elements
- All the iterations of the loop carry out the exact same sequence of instructions
- This abstraction can facilitate automatic generation of both multi-core parallel code and vector instructions to make use of SIMD processing capabilities within a core.
- Conditional execution
- Mask/discard output of ALU
- Not all the ALUs do useful work
- After branch, continue at full performance
- Worst case: $1/8$ peak performance
- Coherent execution
- Property of a program where the same instruction sequence applies to many data elements
- Coherent execution is NECESSARY for SIMD processing resources to be used efficiently
- Coherent execution is NOT NECESSARY for efficient parallelization across different cores
- A lack of instruction stream coherence in a program called “divergent” execution
- SIMD execution: modern CPU examples
- Instructions generated by the compiler; parallelism explicitly requested by the programmer using intrinsics
- Parallelism conveyed using parallel language semantics (“forall”)
- Parallelism inferred by dependency analysis of loops by “auto-vectorizing” compiler
- “Explicit SIMD”: SIMD parallelization is performed at compile time; can inspect program binary and see SIMD instructions
- SIMD execution: modern GPUs
- “Implicit SIMD”
- Compiler generates a binary with scalar instructions
- N instances of the program are always run together on the processor
- Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple program instances on different data on SIMD ALUs
- SIMD width of most modern GPUs ranges from 8 to 32
- Divergent execution can be a big issue (this means poorly written code might execute at $1/32$ the peak capability of the machine)
- Summary: 3 different forms of parallel execution
- Superscalar
- Exploit ILP within an instruction stream
- Process different instructions from the same instruction stream in parallel (within a core)
- Parallelism automatically discovered by the hardware during execution
- SIMD
- Multiple ALUs controlled by the same instruction (within a core)
- Efficient for data-parallel workload: amortize control costs over many ALUs
- Vectorization done by compiler (explicit SIMD) or at runtime by hardware (implicit SIMD)
- Multi-Core
- Use multiple processing cores
- Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each one
- Software creates threads to expose parallelism to hardware (e.g., via a threading API)
// Example: parallelism using C++ threads
typedef struct {
int N;
int terms;
float* x;
float* y;
} my_args;
void my_thread_func(my_args* args) {
sinx(args->N, args->terms, args->x, args->y);
}
void parallel_sinx(int N, int terms, float* x, float* y) {
std::thread my_thread;
my_args args;
args.N = N/2;
args.terms = terms;
args.x = x;
args.y = y;
// launch thread
my_thread = std::thread(my_thread_func, &args);
// on main thread
sinx(N - args.N, terms, x + args.N, y + args.N);
// wait for thread to complete
my_thread.join();
}
// Data-parallel expression
// Taylor's expansion of sin(x) function for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* y) {
//substitute for (int i = 0; i < N; ++i) { with forall function
// declares that loop iterations are independent
// A compiler might automatically generate threaded code for you
forall (int i from 0 to N) {
float value = x[i];
float numer = x[i] * x[i] * x[i];
int denom = 6;
int sign = -1;
for (int j = 1; j <= terms; j ++) {
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2 * j + 2) * (2 * j + 3);
sign *= -1;
}
y[i] = value;
}
}
// Vector program (using AVX intrinsics)
#include <immintrin.h>
void sinx(int N, int terms, float* x, float* y) {
float three_fact = 6;
for (int i = 0; i < N; i += 8) {
__m256 origx = _mm256_load_ps(&x[i]);
__m256 value = origx;
__m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
__m256 denom = _mm256_broadcast_ss(&three_fact);
int sign = -1;
for (int j = 1; j <= terms; ++j) {
// value += sign * numer / denom
__m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
value = _mm256_add_ps(value, tmp);
numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
float factor = (2 * j + 2) * (2 * j + 3);
denom = _mm256_mul_ps(denom, _mm256_broadcast_ss(&factor));
sign *= -1;
}
_mm256_store_ps(&y[i], value);
}
}
Accessing Memory
- Multi-threading reduces stalls (latency)
- Idea #3: interleave processing of multiple threads on the same core to hide stalls
- stall $\rightarrow$ runnable (ready to run, but the core is executing instructions from another thread) $\rightarrow$ running $\rightarrow$ done
- Throughput-oriented systems: potentially increase time to complete work by any one thread, in order to increase overall system throughput when running multiple threads
- On-chip storage for execution contexts is a finite resource (held on-chip, much like the L1 cache)
- with many small contexts (storage for small working set per thread): high latency hiding ability
- with large context (storage for large working set per thread): low latency hiding ability
- Summary
- A processor with multiple hardware threads has the ability to avoid stalls by performing instructions from other threads when one thread must wait for a long latency operation to complete
- The latency of a memory operation is not changed by multi-threading; it simply no longer causes reduced processor utilization
- A multi-threaded processor hides memory latency by performing arithmetic from other threads
- Programs that feature more arithmetic instructions per memory access need fewer threads to hide memory stalls
- Hardware-supported multi-threading
- Core manages execution contexts for multiple threads; the processor decides which thread to run each clock
- Interleaved multi-threading (e.g., temporal multi-threading)
- Simultaneous multi-threading (SMT, each clock, core chooses instructions from multiple threads to run on ALUs)
- Bandwidth
- Terminology
- Bandwidth: the rate at which the memory system can provide data to a processor (e.g., $20$ GB/s; analogy: number of lanes of traffic)
- Latency: the time to complete one operation (analogy: the time to drive from one city to another)
- Throughput: operations completed per unit time, e.g., instructions/cycle (analogy: the number of cars that arrive per unit time)
- High bandwidth memories: modern GPUs leverage high bandwidth memories located near processor
- In modern computing, bandwidth is the critical resource
- Performant parallel programs:
- Organize computation to fetch data from memory less
- Reuse data previously loaded by the same thread (temporal locality optimizations)
- Share data across threads (inter-thread cooperation)
- Favor performing additional arithmetic over storing/reloading values (math does not consume memory bandwidth)
Parallel Programming Abstractions
- Abstraction vs. implementation
- Semantics (abstraction): given a program and the semantics of the operations used, what the program computes
- Implementation (scheduling): how and when the program’s operations are actually executed, potentially in parallel: per thread, per execution unit, per lane of a vector instruction
- ISPC (Intel SPMD Program Compiler)
- SPMD (Single Program Multiple Data)
- Check this link (https://pharr.org/matt/blog/2018/04/30/ispc-all.html) and fill this part
- Function call flow in the `sinx()` example:
- `main()` calls `ispc_sinx()`: control transferred to the `ispc_sinx()` function (sequential execution, C code)
- Begin executing `programCount` instances of `ispc_sinx()` (ISPC code)
- SPMD programming abstraction:
- Call to ISPC function spawns “gang” of ISPC “program instances”
- All instances run ISPC code concurrently
- Each instance has its own copy of local variables
- Upon return, all instances have completed
- ISPC implements the gang abstraction using SIMD instructions
- No. of instances in a gang is the SIMD width of the hardware (or multiple of SIMD width)
- ISPC compiler generates a C++ function binary(.o) whose body contains SIMD instructions
- C++ linking against generated object files as usual
- Return to `main()`: control transferred back to the `main()` function (sequential execution, C code)
- ISPC keywords:
- `programCount`: no. of simultaneously executing instances in the gang (a uniform value); here `programCount` is $8$
- `programIndex`: id of the current instance in the gang (a non-uniform value)
- `uniform`: a type modifier; all instances have the same value for this variable; purely an optimization, not needed for correctness
- `foreach`: declares parallel loop iterations; the entire gang cooperatively performs all iterations (not each instance independently)
- Interleaved assignment (0,1,2,…,7) and blocked assignment (0, 8, 16, …, 56)
- ISPC cross-program-instance operations:
- `uniform int64 reduce_add(int32 x);`: sum of a variable’s value across all program instances in a gang
- `uniform int32 reduce_min(int32 a);`: min of all values in a gang
- `int32 broadcast(int32 value, uniform int index);`: broadcast a value from one instance to all instances in a gang
- `int32 rotate(int32 value, uniform int offset);`: for all `i`, pass value from instance `i` to instance `(i + offset) % programCount`
- ISPC: abstraction vs. implementation
- SPMD: single program, multiple data program model
- Running a gang spawns `programCount` logical instruction streams (each with a different value of `programIndex`)
- single thread of control $\rightarrow$ call SPMD function $\rightarrow$ SPMD execution (multiple instances of the function run in parallel) $\rightarrow$ SPMD function returns $\rightarrow$ resume single thread of control
- SIMD: single instruction, multiple data program model
- The ISPC compiler emits vector instructions that implement the logic of the whole ISPC gang
- ISPC compiler mapping of conditional control flow to vector instructions
- The ISPC gang abstraction is implemented by SIMD instructions within one thread running on one core of a CPU
- Common errors:
* Common error example: a varying `sum` cannot be returned where the caller expects a single `uniform float`

```cpp
export uniform float sum_incorrect_2(uniform int N, uniform float* x) {
    // sum is a varying float: one copy of the variable per program instance
    float sum = 0.0f;
    foreach (i = 0 ... N) {
        // x[i] has a different value in each program instance
        sum += x[i];
    }
    // sum holds many copies (one per instance), but the calling
    // C code expects one return value of type float
    return sum; // compile-time type error
}
```
- Another alternative: a `Collection` structure with no loops (no explicit indexing), similar to NumPy, PyTorch, etc.
// ISPC code
// Interleaved assignment
export void ispc_sinx (
uniform int N, // N = 1024 as input value
uniform int terms,
uniform float* x,
uniform float* result) {
// assume N % programCount = 0
// Interleave assignment
// "Gang" of ISPC program instances, Gang contains programCount = 8 instances
for (uniform int i = 0; i < N; i += programCount) {
int idx = i + programIndex; // Local variable
float value = x[idx]; // Local variable
float numer = x[idx] * x[idx] * x[idx]; // Local variable
uniform int denom = 6;
uniform int sign = -1;
for (uniform int j = 1; j <= terms; ++j) {
value += sign * numer / denom;
numer *= x[idx] * x[idx];
denom *= (2 * j + 2) * (2 * j + 3);
sign *= -1;
}
result[idx] = value;
}
}
// Blocked assignment of array elements to program instances
// Assign multiple elements to each instance
export void ispc_sinx_v2 (
uniform int N,
uniform int terms,
uniform float* x,
uniform float* result) {
// Assume N % programCount = 0;
// Blocked assignment
uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i = 0; i < count; ++i) {
int idx = start + i;
float value = x[idx];
float numer = x[idx] * x[idx] * x[idx];
uniform int denom = 6;
uniform int sign = -1;
for (uniform int j = 1; j <= terms; ++j) {
value += sign * numer / denom;
numer *= x[idx] * x[idx];
denom *= (2 * j + 2) * (2 * j + 3); // same series as v1; only idx changes
sign *= -1;
}
result[idx] = value;
}
}
Parallel Programming Basics
Case study on writing and optimizing a parallel program: data-parallel; shared address space
Creating a parallel program
- Thought process:
- Identify work that can be performed in parallel
- Partition work and the associated data
- Manage data access, communication, and synchronization
- Goal: maximize speedup for a fixed computation: Speedup(P processors) = Time(1 processor) / Time(P processors)
- Flowchart
- Problem to solve
- Decomposition into subproblems (“tasks”)
- Assignment into parallel threads (“workers”)
- Orchestration to parallel program (communicating threads)
- Mapping execution on parallel machine
Problem decomposition
- Create at least enough tasks to keep all execution units on a machine busy
- Key: identifying dependencies
- Amdahl’s law: dependencies limit maximum speedup due to parallelism
- Assume $s$ is the fraction of execution that is inherently sequential (dependencies prevent parallel execution)
- speedup $= \frac{1}{s + \frac{1-s}{P}}$, where $s$ is the proportion that cannot be parallelized and $1-s$ is the proportion that can be parallelized across $P$ processors
- Maximum speedup from parallel execution $\leq 1/s$
- A small serial region can limit speedup on a large parallel machine
- e.g., if $0.1\%$ of the application is serial, the maximum speedup is capped near $1000$ no matter how large $P$ is
An Example
- 2-step computation on an $N\times N$ image: step 1: multiply brightness of all pixels by 2; step 2: compute the average of all pixel values
- Sequential implementation: time cost $\approx 2N^{2}$
- 1st attempt at parallelism (P processors)
- On step 1 only: multiplication in parallel (time cost: $N^{2}/P$)
- Overall performance: speedup $\leq\frac{2N^{2}}{N^{2}/P + N^{2}}\leq 2$ (step 2 is still serial)
- On step 2 as well: compute partial sums in parallel, combine results serially (time cost: $N^{2}/P + P$)
- Overall performance: speedup $\leq\frac{2N^{2}}{2N^{2}/P + P}\rightarrow P$ when $N\gg P$
Decomposition
- Programmers are responsible for decomposing a program into independent tasks
- Automatic decomposition of sequential programs continues to be a challenging research problem
- Compiler must analyze the program and identify dependencies (but dependencies can be data-dependent and thus unknown at compile time)
- Had modest success with simple loop nests
- A “magic parallelizing compiler” for complex, general-purpose code has not yet been achieved
Assignment
- “Tasks”, “workers” (threads, program instances, vector lanes, etc.)
- Goals: good workload balance, reduce communication costs
- Statically or dynamically
- Many languages/runtimes take responsibility for assignment
- Examples in ISPC of work by loop iteration
- Programmer-managed assignment: static assignment; assign iterations to ISPC program instances in interleaved fashion
- System-managed assignment: the `foreach` construct exposes independent work (iterations) to the system; the abstraction leaves room for dynamic assignment, though the current ISPC implementation is static
- Programmer-managed assignment: static assignment; loop iterations mapped to threads in a blocked fashion (first half to a spawned thread, second half to the main thread)
void my_thread_start(int N, int terms, float* x, float* result) {
sinx(N, terms, x, result); // do work
}
void parallel_sinx(int N, int terms, float* x, float* result) {
int half = N / 2;
// launch thread to do work on first half of array
std::thread t1(my_thread_start, half, terms, x, result);
// do work on second half of array in main thread
sinx(N - half, terms, x + half, result + half);
t1.join();
}
- Dynamic assignment using ISPC tasks: the ISPC runtime (invisible to the programmer) assigns tasks to worker threads in a thread pool
void foo(uniform float* input, uniform float* output, uniform int N) {
// create a bunch of tasks
launch[100] my_ispc_task(input, output, N);
}
Orchestration
- Goals: reduce costs of communication/sync, preserve locality of data reference, reduce overhead, etc.
- Structuring communication
- Adding synchronization to preserve dependencies
- Organizing data structures in memory
- Scheduling tasks
- If synchronization is expensive, the programmer might use it more sparingly
- Mapping to hardware (mapping “threads” to hardware execution units)
- Mapping by the operating system, e.g., a thread to a HW execution context on a CPU core
- Mapping by the compiler, e.g., ISPC program instances to vector instruction lanes
- Mapping CUDA thread blocks to GPU cores
- Mapping decisions
- Place related threads (cooperating threads) on the same core (maximize locality, data sharing, minimize costs of comm/sync)
- Place unrelated threads on the same core (one might be bandwidth-limited and the other compute-limited) to use the machine more efficiently
Example:
- A 2D-grid based solver: solve partial differential equation on $(N+2)\times(N+2)$ grid
- Iterative algorithm: Gauss-Seidel sweeps over the grid until convergence;
A[i,j] = 0.2*(A[i,j] + A[i,j-1] + A[i-1, j] + A[i, j+1] + A[i+1,j]);
const int n;
float* A; // allocate space
void solve(float* A) {
float diff, prev;
bool done = false;
while (!done) {
diff = 0.f;
for (int i = 1; i <= n; ++i) { // iterate only over interior (non-border) cells
for (int j = 1; j <= n; ++j) {
prev = A[i, j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
diff += fabs(A[i,j] - prev); // compute amount of change
}
}
if (diff/(n*n) < TOLERANCE) {
done = true;
}
}
}
- Parallel programming step 1: identify dependencies (problem decomposition phase)
- Depends on previous row
- Depends on the left element
- Independent work along diagonals (each diagonal depends on previous diagonal, but every pixel on diagonal is independent)
- $\rightarrow$ Reorder grid cell update via red-black coloring
- Parallel programming step 2: assignment
- Block assignment or interleaved assignment depending on the system
- Red cell update in parallel $\rightarrow$ wait until all processors done with update $\rightarrow$ communicate updated red cells to other processors $\rightarrow$ black cell update in parallel $\rightarrow$ wait until processors done with update $\rightarrow$ communicate updated black cells to other processors $\rightarrow$ repeat
- Writing the program: data parallel thinking; SPMD/shared address space
- Data-parallel expression of grid solver
const int n;
float* A = allocate(n + 2, n + 2);
void solve(float* A) {
bool done = false;
float diff = 0.f;
while (!done) {
// Decomposition
for_all (red_cells(i,j)) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
// Orchestration: builtin communication primitive: reduceAdd
reduceAdd(diff, abs(A[i,j] - prev));
} // Orchestration: the for_all block implicitly waits for all workers before returning to sequential control
if (diff / (n * n) < TOLERANCE) {
done = true;
}
}
}
* Shared address space (with SPMD threads) expression of solver:
* Locks: only one thread in the critical region at a time
* Barriers: wait for threads to reach this point
// Global variables
int n; // grid size n x n
bool done = false;
float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;
float* A = allocate(n + 2, n + 2);
void solve(float* A) { // executed by all threads; SPMD-style
float myDiff;
int threadId = getThreadId(); // threadId is different for each SPMD instance
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS); // each thread computes the rows
while (!done) {
myDiff = 0.f; // reset private accumulator (declared above)
diff = 0.f;
barrier(myBarrier, NUM_PROCESSORS);
for (j = myMin to myMax) {
for (i = red_cells in this row) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i,j] + A[i, j - 1] + A[i - 1, j] + A[i, j + 1] + A[i + 1, j]);
myDiff += abs(A[i,j] - prev);
}
}
lock(myLock);
diff += myDiff;
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS);
if (diff / (n * n) < TOLERANCE) {
done = true;
}
barrier(myBarrier, NUM_PROCESSORS);
}
}