Tuning MPI Programs for Peak Performance

William Gropp

Overview

Background and Models

Goals of the Tutorial

Background

What is message passing?

Quick review of MPI Message passing

Basic Send/Receive modes
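
As a point of reference for the modes discussed here, a minimal sketch of blocking standard-mode send and receive between two processes (buffer size and tag are arbitrary):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int        rank;
        double     buf[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Standard mode: MPI may buffer the message (eager) or wait
               for the matching receive (rendezvous); see the protocol
               discussion below. */
            MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        }
        else if (rank == 1) {
            MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }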

Nonblocking Modes

Completion
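
A sketch of the nonblocking start/complete pattern; the function name, neighbor ranks, and buffers are illustrative, and the caller is assumed to have initialized MPI:

    #include <mpi.h>

    /* Exchange n doubles with two neighbors, overlapping the transfers
       with local work.  The buffers must not be touched between the
       start calls and completion in MPI_Waitall. */
    void exchange(double *sbuf, double *rbuf, int n,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Irecv(rbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(sbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
        /* ... useful local computation can go here ... */
        MPI_Waitall(2, req, stat);  /* completion: buffers reusable now */
    }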

Persistent Communications
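
One possible shape of the persistent variant of the same exchange: MPI_Send_init/MPI_Recv_init pay the setup cost once, and MPI_Startall reuses the requests every iteration.

    #include <mpi.h>

    /* Repeated exchange with fixed buffers and partners: set up the
       requests once, then start and complete them each step. */
    void persistent_exchange(double *sbuf, double *rbuf, int n,
                             int left, int right, MPI_Comm comm,
                             int nsteps)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];
        int         step;

        MPI_Recv_init(rbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Send_init(sbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
        for (step = 0; step < nsteps; step++) {
            MPI_Startall(2, req);
            /* ... compute on data not involved in the exchange ... */
            MPI_Waitall(2, req, stat);  /* requests become inactive, not freed */
        }
        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
    }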

Testing for Messages
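
A sketch of probing for a pending message without receiving it, for example to size the buffer first; source and tag here are wildcards:

    #include <mpi.h>
    #include <stdlib.h>

    /* Check for any pending message; if one has arrived, size a buffer
       from the probed status and receive it. */
    void poll_for_message(MPI_Comm comm)
    {
        int        flag, count;
        double    *buf;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (flag) {
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            buf = (double *) malloc(count * sizeof(double));
            MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE,
                     status.MPI_TAG, comm, &status);
            /* ... process buf ... */
            free(buf);
        }
    }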

Buffered Communications
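
A sketch of buffered mode: the user supplies the buffering explicitly, so MPI_Bsend can complete independently of the receiver. The buffer size here is illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send(double *data, int n, int dest, MPI_Comm comm)
    {
        int   bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
        void *buffer  = malloc(bufsize);

        MPI_Buffer_attach(buffer, bufsize);
        /* Completes locally: the message is copied into the attached
           buffer if the receive is not yet posted. */
        MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, comm);
        /* Detach blocks until the buffered data is safely on its way. */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    }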

Abstract Model of MPI Implementation

The MPI Automaton

Message protocols

Special Protocols for DSM

Message Protocol Details

Eager Protocol

Eager Features

How Scalable is Eager Delivery?

Rendezvous Protocol

Rendezvous Features

Short Protocol

User and System Buffering

Packetization

Collective operations

Non-contiguous Datatypes

Why Bsend?

BREAK

Performance Model

Main components

Latency and Bandwidth

Interpreting Latency and Bandwidth

Including contention

Synchronization Delays

Polling Mode MPI

Interrupt Mode MPI

Example of the effect of Polling

More on Synchronization Delays

Load Balancing

Related effects

Contention

Effect of contention

Memory copies

Example: Performance Impact of Memory Copies

Example: Why MPI Datatypes

Performance of MPI Datatypes

Packet sizes/stepping

Example of Packetization

Measurement Techniques

Timing with MPI_Wtime/MPI_Wtick

Sample Timing Harness
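
A sketch of one such harness (the function name and arguments are illustrative): repeat the operation to amortize the clock resolution reported by MPI_Wtick, and keep the minimum over several trials.

    #include <mpi.h>

    /* Time op() by running it reps times per trial and keeping the best
       of several trials; barrier first so processes start together. */
    double time_op(void (*op)(void), int reps, int trials, MPI_Comm comm)
    {
        double best = 1.0e30, start, elapsed;
        int    t, r;

        for (t = 0; t < trials; t++) {
            MPI_Barrier(comm);
            start = MPI_Wtime();
            for (r = 0; r < reps; r++)
                op();
            elapsed = (MPI_Wtime() - start) / reps;
            if (elapsed < best) best = elapsed;
        }
        return best;  /* compare against MPI_Wtick() for resolution */
    }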

Pitfalls in timing

Using PMPI routines
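
The profiling interface lets you interpose on any MPI routine by defining it yourself and forwarding to the PMPI_ name; a minimal sketch that counts and times calls to MPI_Send:

    #include <mpi.h>

    static int    send_calls = 0;
    static double send_time  = 0.0;

    /* Our MPI_Send replaces the library's; the real work is done by
       PMPI_Send, which the profiling interface always provides. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        int    err   = PMPI_Send(buf, count, datatype, dest, tag, comm);

        send_time += MPI_Wtime() - start;
        send_calls++;
        return err;
    }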

Logging and visualization

Upshot and MPE

Sample Upshot Output

IBM VT

Pablo

ParaGraph

Paradyn

AIMS

Other visualization tools

Validating the logging

Deficiency analysis/filtering techniques

Correlating processes for synchronization delays

LUNCH

Tuning for Performance

Tuning for Performance (General Techniques)

Aggregation to reduce latency

Aggregation in Collective operations
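
A common instance, sketched here with illustrative names: combine several scalar reductions into one reduction on an array, paying one latency instead of three.

    #include <mpi.h>

    /* Instead of three MPI_Allreduce calls on scalars, pack the values
       into one array and reduce once. */
    void aggregated_allreduce(double a, double b, double c,
                              double *sums, MPI_Comm comm)
    {
        double in[3];

        in[0] = a; in[1] = b; in[2] = c;
        MPI_Allreduce(in, sums, 3, MPI_DOUBLE, MPI_SUM, comm);
    }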

Decomposition

Regular data structures

Issues in choosing a decomposition

More issues in decompositions

MPI Support for regular decompositions

Performance Issues of Decompositions

Example: Matrix factoring

Scaling of decompositions

Sharing Data in Decompositions

Irregular Data Structures

Changing the Algorithm

Issues in Changing the Algorithm

Trading Communication for Computation

Analysis of Communication Tradeoff

Multicoloring

Loop Unrolling

Aggregation in Loop Unrolling

Using Associativity

Load balancing

Identifying Load Imbalances

Master/slave models

Implementing Fairness

Tuning Decompositions for Load Balance

PP Presentation

MPI-Specific Tuning

Constant stride datatypes
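
For reference, a sketch of describing a strided slice (here, one column of a row-major matrix) with MPI_Type_vector so it can be sent without an explicit user copy; matrix layout and names are assumptions:

    #include <mpi.h>

    /* Send column col of an n x m row-major matrix: n blocks of one
       double, successive blocks m doubles apart. */
    void send_column(double *matrix, int n, int m, int col,
                     int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        MPI_Type_vector(n, 1, m, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&matrix[col], 1, column, dest, 0, comm);
        MPI_Type_free(&column);
    }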

Contiguous Structures

Improving structure performance

Tuning for MPI protocols

Aggressive Eager

Tuning for Aggressive Eager

Rendezvous with Sender Push

Rendezvous Blocking

Tuning for Rendezvous with Sender Push

Rendezvous with Receiver Pull

Tuning for Rendezvous with Receiver Pull

Sample Problems

Jacobi Iteration

Background to tests

Different send/recv modes

Some send/recv approaches

Scheduling Communications

Scheduling for contention

Some Example Results

Send and Recv

Better to start receives first

Ensure recvs posted before sends

Receives posted before sends
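
One way to guarantee this pattern, sketched with an illustrative pairwise exchange: post the receive, synchronize, then send, so every message arrives at a posted receive rather than as an unexpected message. (The barrier has its own cost, so this pays off only when it replaces more expensive unexpected-message handling.)

    #include <mpi.h>

    /* Each process posts its receive first; the barrier guarantees all
       receives are posted before any send starts. */
    void posted_exchange(double *sbuf, double *rbuf, int n,
                         int partner, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Irecv(rbuf, n, MPI_DOUBLE, partner, 0, comm, &req);
        MPI_Barrier(comm);  /* all receives now posted */
        MPI_Send(sbuf, n, MPI_DOUBLE, partner, 0, comm);
        MPI_Wait(&req, &status);
    }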

Ordered (no overlap)

Shift with MPI_Sendrecv
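
A sketch of the shift pattern with MPI_Sendrecv, which lets the implementation order the transfers safely (no deadlock from paired blocking sends); each rank shifts data one step around a ring:

    #include <mpi.h>

    /* Circular shift: send to the right neighbor, receive from the left. */
    void ring_shift(double *sbuf, double *rbuf, int n, MPI_Comm comm)
    {
        int        rank, size, left, right;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        right = (rank + 1) % size;
        left  = (rank + size - 1) % size;
        MPI_Sendrecv(sbuf, n, MPI_DOUBLE, right, 0,
                     rbuf, n, MPI_DOUBLE, left,  0,
                     comm, &status);
    }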

Use of Ssend versions

Nonblocking operations, overlap effective

Persistent operations

Manually advance automaton

Summary of Results

MPI Implementation parameters

IBM SP

MPICH

Cray T3D

LAM

Miscellaneous Tricks

MPI-2 techniques

Pitfalls I

Pitfalls II

Review of Techniques I

Review of Techniques II

Review of Techniques III