Tuning MPI Programs for Peak Performance
William Gropp
Overview
Background and Models
Goals of the Tutorial
Background
What is message passing?
Quick review of MPI Message passing
Basic Send/Receive modes
Nonblocking Modes
Completion
Persistent Communications
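As an illustration of the idea (the buffer names, the neighbor rank partner, and the iteration count are placeholders, not taken from the slides), a persistent exchange sets the communication up once and restarts it each iteration:

    /* Illustrative sketch of persistent communication: the requests are
       created once and reused, avoiding repeated argument processing. */
    #include <mpi.h>

    void exchange_loop(double *sendbuf, double *recvbuf, int n,
                       int partner, int niters)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];

        /* Create the persistent requests once */
        MPI_Send_init(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Recv_init(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

        for (int i = 0; i < niters; i++) {
            MPI_Startall(2, req);        /* start both transfers            */
            /* ... computation that does not touch the buffers ...          */
            MPI_Waitall(2, req, stat);   /* complete them                   */
        }

        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
    }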
Testing for Messages
Buffered Communications
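A minimal sketch of a buffered send, assuming a single message of n doubles (the buffer sizing and message count are illustrative): the user supplies the buffer that MPI_Bsend copies into, so the send completes locally regardless of the receiver.

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send(double *data, int n, int dest)
    {
        int   bufsize = n * (int)sizeof(double) + MPI_BSEND_OVERHEAD;
        void *buf     = malloc(bufsize);

        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        /* Detach blocks until the buffered message has been delivered */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    }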
Abstract Model of MPI Implementation
The MPI Automaton
Message protocols
Special Protocols for DSM
Message Protocol Details
Eager Protocol
Eager Features
How Scalable is Eager Delivery?
Rendezvous Protocol
Rendezvous Features
Short Protocol
User and System Buffering
Packetization
Collective operations
Non-contiguous Datatypes
Why Bsend
BREAK
Performance Model
Main components
Latency and Bandwidth
Interpreting Latency and Bandwidth
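A common two-parameter model (the symbols here are the conventional ones, not necessarily those used on the slides) gives the time to send a message of n bytes as

    T(n) = s + r n

where s is the latency (startup cost) and r is the incremental time per byte, i.e. the reciprocal of the asymptotic bandwidth.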
Including contention
Synchronization Delays
Polling Mode MPI
Interrupt Mode MPI
Example of the effect of Polling
More on Synchronization Delays
Load Balancing
Related effects
Contention
Effect of contention
Memory copies
Example: Performance Impact of Memory Copies
Example: Why MPI Datatypes
Performance of MPI Datatypes
Packet sizes/stepping
Example of Packetization
Measurement Techniques
Timing with WTIME/WTICK
Sample Timing Harness
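As an illustration (the operation being timed and the repetition counts are placeholders), a minimal harness of this kind synchronizes the processes, times many repetitions with MPI_Wtime, and keeps the best trial:

    #include <mpi.h>

    double time_operation(int nreps, int ntrials)
    {
        double best = 1.0e30;
        for (int t = 0; t < ntrials; t++) {
            MPI_Barrier(MPI_COMM_WORLD);            /* start together       */
            double t0 = MPI_Wtime();
            for (int i = 0; i < nreps; i++) {
                /* ... operation being measured goes here ... */
            }
            double elapsed = (MPI_Wtime() - t0) / nreps;
            if (elapsed < best) best = elapsed;     /* keep the best trial  */
        }
        return best;  /* compare with MPI_Wtick() to check clock resolution */
    }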
Pitfalls in timing
Using PMPI routines
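The profiling (PMPI) interface lets a tool intercept any MPI routine without access to the application source; a minimal sketch, assuming nothing beyond the standard PMPI_ entry points, that counts calls to MPI_Send:

    /* Illustrative PMPI wrapper: the user-visible MPI_Send is replaced,
       and the real implementation is reached through its PMPI_ name. */
    #include <mpi.h>

    static long send_count = 0;

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        send_count++;                              /* record the call */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }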
Logging and visualization
Upshot and MPE
Sample Upshot Output
IBM VT
Pablo
ParaGraph
Paradyn
AIMS
Other visualization tools
Validating the logging
Deficiency analysis/filtering techniques
Correlating processes for synchronization delays
LUNCH
Tuning for Performance
Tuning for Performance (General Techniques)
Aggregation to reduce latency
Aggregation in Collective operations
Decomposition
Regular data structures
Issues in choosing a decomposition
More issues in decompositions
MPI Support for regular decompositions
Performance Issues of Decompositions
Example: Matrix factoring
Scaling of decompositions
Sharing Data in Decompositions
Irregular Data Structures
Changing the Algorithm
Issues in Changing the Algorithm
Trading Communication for Computation
Analysis of Communication Tradeoff
Multicoloring
Loop Unrolling
Aggregation in Loop Unrolling
Using Associativity
Load balancing
Identifying Load Imbalances
Master/slave models
Implementing Fairness
Tuning Decompositions for Load Balance
PP Presentation
MPI-Specific Tuning
Constant stride datatypes
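As an illustration (the row-major N x M array layout is an assumption), one column of such an array can be described with MPI_Type_vector rather than packed by hand:

    /* Illustrative strided datatype: one column of a row-major N x M
       array of doubles, sent without manual packing. */
    #include <mpi.h>

    void send_column(double a[], int N, int M, int col, int dest)
    {
        MPI_Datatype column;
        MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);  /* N blocks of 1, stride M */
        MPI_Type_commit(&column);

        MPI_Send(&a[col], 1, column, dest, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
    }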
Contiguous Structures
Improving structure performance
Tuning for MPI protocols
Aggressive Eager
Tuning for Aggressive Eager
Rendezvous with Sender Push
Rendezvous Blocking
Tuning for Rendezvous with Sender Push
Rendezvous with Receiver Pull
Tuning for Rendezvous with Receiver Pull
Sample Problems
Jacobi Iteration
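The communication kernel of a Jacobi test of this kind is typically a ghost-row (halo) exchange with the two neighboring processes; the sketch below assumes a 1-D row decomposition with nonblocking operations (the array layout and neighbor ranks are assumptions, and MPI_PROC_NULL can stand in for a missing neighbor):

    #include <mpi.h>

    /* u holds (nrows + 2) rows of nx points; rows 0 and nrows+1 are ghosts */
    void exchange_halo(double *u, int nx, int nrows, int up, int down,
                       MPI_Comm comm)
    {
        MPI_Request req[4];

        MPI_Irecv(&u[0],              nx, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(&u[(nrows + 1)*nx], nx, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Isend(&u[1*nx],           nx, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(&u[nrows*nx],       nx, MPI_DOUBLE, down, 0, comm, &req[3]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }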
Background to tests
Different send/recv modes
Some send/recv approaches
Scheduling Communications
Scheduling for contention
Some Example Results
Send and Recv
Better to start receives first
Ensure recvs posted before sends
Receives posted before sends
Ordered (no overlap)
Shift with MPI_Sendrecv
Use of Ssend versions
Nonblocking operations, overlap effective
Persistent operations
Manually advance automaton
Summary of Results
MPI Implementation parameters
IBM SP
MPICH
T3D
LAM
Miscellaneous Tricks
MPI-2 techniques
Pitfalls I
Pitfalls II
Review of Techniques I
Review of Techniques II
Review of Techniques III