Tuning MPI Programs for Peak Performance

William Gropp

Overview

Background and Models

Goals of the Tutorial

Background

What is message passing?

Quick review of MPI Message passing

Basic Send/Receive modes
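
As a point of reference for the modes discussed here, a minimal sketch of blocking standard-mode send and receive between two processes (buffer size and tag are arbitrary):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int        rank;
        double     buf[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Standard mode: MPI may buffer the message (eager) or wait
               for the matching receive (rendezvous); see the protocol
               discussion below. */
            MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        }
        else if (rank == 1) {
            MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }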

Nonblocking Modes

Completion
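
A sketch of the nonblocking start/complete pattern; the function name, neighbor ranks, and buffers are illustrative, and the caller is assumed to have initialized MPI:

    #include <mpi.h>

    /* Exchange n doubles with two neighbors, overlapping the transfers
       with local work.  The buffers must not be touched between the
       start calls and completion in MPI_Waitall. */
    void exchange(double *sbuf, double *rbuf, int n,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Irecv(rbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(sbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
        /* ... useful local computation can go here ... */
        MPI_Waitall(2, req, stat);  /* completion: buffers reusable now */
    }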

Persistent Communications
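
One possible shape of the persistent variant of the same exchange: MPI_Send_init/MPI_Recv_init pay the setup cost once, and MPI_Startall reuses the requests every iteration.

    #include <mpi.h>

    /* Repeated exchange with fixed buffers and partners: set up the
       requests once, then start and complete them each step. */
    void persistent_exchange(double *sbuf, double *rbuf, int n,
                             int left, int right, MPI_Comm comm,
                             int nsteps)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];
        int         step;

        MPI_Recv_init(rbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Send_init(sbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
        for (step = 0; step < nsteps; step++) {
            MPI_Startall(2, req);
            /* ... compute on data not involved in the exchange ... */
            MPI_Waitall(2, req, stat);  /* requests become inactive, not freed */
        }
        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
    }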

Testing for Messages
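
A sketch of probing for a pending message without receiving it, for example to size the buffer first; source and tag here are wildcards:

    #include <mpi.h>
    #include <stdlib.h>

    /* Check for any pending message; if one has arrived, size a buffer
       from the probed status and receive it. */
    void poll_for_message(MPI_Comm comm)
    {
        int        flag, count;
        double    *buf;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (flag) {
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            buf = (double *) malloc(count * sizeof(double));
            MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE,
                     status.MPI_TAG, comm, &status);
            /* ... process buf ... */
            free(buf);
        }
    }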

Buffered Communications
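
A sketch of buffered mode: the user supplies the buffering explicitly, so MPI_Bsend can complete independently of the receiver. The buffer size here is illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send(double *data, int n, int dest, MPI_Comm comm)
    {
        int   bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
        void *buffer  = malloc(bufsize);

        MPI_Buffer_attach(buffer, bufsize);
        /* Completes locally: the message is copied into the attached
           buffer if the receive is not yet posted. */
        MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, comm);
        /* Detach blocks until the buffered data is safely on its way. */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    }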

Abstract Model of MPI Implementation

The MPI Automaton

Message protocols

Special Protocols for DSM

Message Protocol Details

Eager Protocol

Eager Features

How Scalable is Eager Delivery?

Rendezvous Protocol

Rendezvous Features

Short Protocol

User and System Buffering

Packetization

Collective operations

Non-contiguous Datatypes

Why Bsend?

BREAK

Performance Model

Main components

Latency and Bandwidth

Interpreting Latency and Bandwidth

Including contention

Synchronization Delays

Polling Mode MPI

Interrupt Mode MPI

Example of the effect of Polling

More on Synchronization Delays

Load Balancing

Related effects

Contention

Effect of contention

Memory copies

Example: Performance Impact of Memory Copies

Example: Why MPI Datatypes

Performance of MPI Datatypes

Packet sizes/stepping

Example of Packetization

Measurement Techniques

Timing with MPI_Wtime/MPI_Wtick

Sample Timing Harness
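
A sketch of one such harness (the function name and arguments are illustrative): repeat the operation to amortize the clock resolution reported by MPI_Wtick, and keep the minimum over several trials.

    #include <mpi.h>

    /* Time op() by running it reps times per trial and keeping the best
       of several trials; barrier first so processes start together. */
    double time_op(void (*op)(void), int reps, int trials, MPI_Comm comm)
    {
        double best = 1.0e30, start, elapsed;
        int    t, r;

        for (t = 0; t < trials; t++) {
            MPI_Barrier(comm);
            start = MPI_Wtime();
            for (r = 0; r < reps; r++)
                op();
            elapsed = (MPI_Wtime() - start) / reps;
            if (elapsed < best) best = elapsed;
        }
        return best;  /* compare against MPI_Wtick() for resolution */
    }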

Pitfalls in timing

Using PMPI routines
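
The profiling interface lets you interpose on any MPI routine by defining it yourself and forwarding to the PMPI_ name; a minimal sketch that counts and times calls to MPI_Send:

    #include <mpi.h>

    static int    send_calls = 0;
    static double send_time  = 0.0;

    /* Our MPI_Send replaces the library's; the real work is done by
       PMPI_Send, which the profiling interface always provides. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        int    err   = PMPI_Send(buf, count, datatype, dest, tag, comm);

        send_time += MPI_Wtime() - start;
        send_calls++;
        return err;
    }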

Logging and visualization

Upshot and MPE

Sample Upshot Output

IBM VT

Pablo

ParaGraph

Paradyn

AIMS

Other visualization tools

Validating the logging

Deficiency analysis/filtering techniques

Correlating processes for synchronization delays

LUNCH

Tuning for Performance

Tuning for Performance (General Techniques)

Aggregation to reduce latency

Aggregation in Collective operations
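
A common instance, sketched here with illustrative names: combine several scalar reductions into one reduction on an array, paying one latency instead of three.

    #include <mpi.h>

    /* Instead of three MPI_Allreduce calls on scalars, pack the values
       into one array and reduce once. */
    void aggregated_allreduce(double a, double b, double c,
                              double *sums, MPI_Comm comm)
    {
        double in[3];

        in[0] = a; in[1] = b; in[2] = c;
        MPI_Allreduce(in, sums, 3, MPI_DOUBLE, MPI_SUM, comm);
    }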

Decomposition

Regular data structures

Issues in choosing a decomposition

More issues in decompositions

MPI Support for regular decompositions

Performance Issues of Decompositions

Example: Matrix factoring

Scaling of decompositions

Sharing Data in Decompositions

Irregular Data Structures

Changing the Algorithm

Issues in Changing the Algorithm

Trading Communication for Computation

Analysis of Communication Tradeoff

Multicoloring

Loop Unrolling

Aggregation in Loop Unrolling

Using Associativity

Load balancing

Identifying Load Imbalances

Master/slave models

Implementing Fairness

Tuning Decompositions for Load Balance

PP Presentation

MPI-Specific Tuning

Constant stride datatypes
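
For reference, a sketch of describing a strided slice (here, one column of a row-major matrix) with MPI_Type_vector so it can be sent without an explicit user copy; matrix layout and names are assumptions:

    #include <mpi.h>

    /* Send column col of an n x m row-major matrix: n blocks of one
       double, successive blocks m doubles apart. */
    void send_column(double *matrix, int n, int m, int col,
                     int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        MPI_Type_vector(n, 1, m, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&matrix[col], 1, column, dest, 0, comm);
        MPI_Type_free(&column);
    }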

Contiguous Structures

Improving structure performance

Tuning for MPI protocols

Aggressive Eager

Tuning for Aggressive Eager

Rendezvous with Sender Push

Rendezvous Blocking

Tuning for Rendezvous with Sender Push

Rendezvous with Receiver Pull

Tuning for Rendezvous with Receiver Pull

Sample Problems

Jacobi Iteration

Background to tests

Different send/recv modes

Some send/recv approaches

Scheduling Communications

Scheduling for contention

Some Example Results

Send and Recv

Better to start receives first

Ensure recvs posted before sends

Receives posted before sends
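
One way to guarantee this pattern, sketched with an illustrative pairwise exchange: post the receive, synchronize, then send, so every message arrives at a posted receive rather than as an unexpected message. (The barrier has its own cost, so this pays off only when it replaces more expensive unexpected-message handling.)

    #include <mpi.h>

    /* Each process posts its receive first; the barrier guarantees all
       receives are posted before any send starts. */
    void posted_exchange(double *sbuf, double *rbuf, int n,
                         int partner, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Irecv(rbuf, n, MPI_DOUBLE, partner, 0, comm, &req);
        MPI_Barrier(comm);  /* all receives now posted */
        MPI_Send(sbuf, n, MPI_DOUBLE, partner, 0, comm);
        MPI_Wait(&req, &status);
    }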

Ordered (no overlap)

Shift with MPI_Sendrecv
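
A sketch of the shift pattern with MPI_Sendrecv, which lets the implementation order the transfers safely (no deadlock from paired blocking sends); each rank shifts data one step around a ring:

    #include <mpi.h>

    /* Circular shift: send to the right neighbor, receive from the left. */
    void ring_shift(double *sbuf, double *rbuf, int n, MPI_Comm comm)
    {
        int        rank, size, left, right;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        right = (rank + 1) % size;
        left  = (rank + size - 1) % size;
        MPI_Sendrecv(sbuf, n, MPI_DOUBLE, right, 0,
                     rbuf, n, MPI_DOUBLE, left,  0,
                     comm, &status);
    }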

Use of Ssend versions

Nonblocking operations, overlap effective

Persistent operations

Manually advance automaton

Summary of Results

MPI Implementation parameters

IBM SP

MPICH

Cray T3D

LAM

Miscellaneous Tricks

MPI-2 techniques

Pitfalls I

Pitfalls II

Review of Techniques I

Review of Techniques II

Review of Techniques III