Abstracts
H.G. Kaper and S. Tipei, "An Abstract Approach to Music," Preprint ANL/MCS-P748-0399, March 1999. The notion of formalized music implies that a musical composition can be described in mathematical terms. In this article we explore some formal aspects of music and propose a framework for an abstract approach.
H.G. Kaper, S. Tipei, and E. Wiebel, "Data Sonification and Sound Visualization," Computing in Science and Engineering 1(4) (July/Aug 1999), pp. 48-58. This article describes a collaborative project between researchers in the Mathematics and Computer Science Division at Argonne National Laboratory and the Computer Music Project of the University of Illinois at Urbana-Champaign. The project focuses on the use of sound for the exploration and analysis of complex data sets in scientific computing. The article addresses digital sound synthesis in the context of DIASS, a Digital Instrument for Additive Sound Synthesis, and sound visualization in a virtual-reality environment by means of M4CAVE. It describes the procedures and preliminary results of some experiments in scientific sonification and sound visualization.
G. W. Crabtree, D. O. Gunter, H.G. Kaper, A. E. Koshelev, G. K. Leaf, and V. M. Vinokur, "Numerical Simulations of Driven Vortex Systems," Physical Review B 61, no. 2 (J an. 2000), pp. 1446-1455. This article reports on several large-scale numerical simulations of vortex systems that are driven through superconducting media with defects. The simulations are based on the time-dependent Ginzburg-Landau equations. The simulations demonstrate regimes of plastic and elastic steady-state motion in the presence of a twin boundary, show the effect of regular and irregular arrays of point defects on vortex trajectories, and show a mechanism by which vortices move through an array of columnar defects. Also presented are the results of some transient simulations in two and three dimensions, which show that, in the transition from the Meissner state to the vortex state, vortices are formed by a process of deposition.
D. O. Gunter, H.G. Kaper, and G. K. Leaf, "Implicit Integration of the Time-Dep endent Ginzburg-Landau Equations of Superconductivity," SIAM J. Sci. Comput. 23, no. 6 (2002) 1944-1959. This article is concerned with the integration of the time-dependent Ginzburg-Landau (TDGL) equations of superconductivity. Four algorithms, ranging from fully explicit to fully implicit, are presented and evaluated for stability, accuracy, and compute time. The benchmark problem for he evaluation is the equilibration of a vortex configuration in a superconductor that is embedded in a thin insulator and subject to an applied magnetic field.
M. Grimsditch, L. Giovannini, F. Montoncello, F. Nizzoli, G. K. Leaf, and H.G. Kaper, "Normal Modes in Ferromagnetic nanoparticles: A Dynamical Matrix Approach," Preprint ANL/MCS-P1116-0104, January 2004. We present a method to compute the magnetic normal modees of a ferromagnetic particle. The method is a hybrid of micromagnetic simuations and a "dynamical matrix" approach similar to that used for vibrational studies. We use the method to calculate the normal modes of an Fe parallelepiped and compare the results with the modees recently extracted from a purely micromagnetic simulation. The results of the two approaches are in excellent agreement. We discus the pros and cons of both approaches. We also present information on standing waves with wavevector perpendicular to the applied field and on a family of modes localized at the particle ends.
W. McCune, R. Padmanabhan, M. A. Rose, and R. Veroff, :Automated Discovery of Single Axioms for Ortholattices," Albegra univers. 52 (2004) 541-549. We present single short axioms for ortholattices, orthomodular lattices, and modular ortholattices, all in terms of the Sheffer stroke. The ortholattice axiom is the shortest possible. We also give multiequation bases in terms of the Sheffer stroke and in terms of join, meet, and complementation. Proofs are omitted but are available in an associated technical report and on the Web. We used computers extensively to find candidates, reject candidates, and search for proofs that candidates are single axioms.
H.G. Kaper and H. Nordborg, "The Ginzburg-Landau Equations of Superconductivity in the Limit of Weak Coupling near the Upper Critical Field," Preprint ANL/MCS-P786-0100, January 2000. This article is concerned with the Ginzburg-Landau (GL) equations of superconductivity. The equations provide a mathematical model for the study of magnetic flux vortices in superconductors. The focus is on the asymptotic case when the charge of the superconducting charge carriers (Cooper pairs) is vanishingly small and the applied magnetic field approaches the upper critical field. It is shown that the GL model reduces in the limit to the frozen-field model, where the superconducting phenomena are affected by the electromagnetic phenomena, but not vice versa. The convergence is second order in the small parameter. The analytical results are confirmed in some numerical examples.
D. Bonachea, P. Dickens, and R. Thakur, "High-Performance File I/O in Java: Existing Approaches and Bulk I/O Extensions," Preprint ANL/MCS-P840-0800, Argonne National Laboratory, August 2000. There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance but also high-performance I/O. In this paper, we first examine several approaches that attempt tp provide high-performance I/O in Java--many of which are not obvious at first glance--and evaluate their performance on two parallel machines: the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically, The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.
J. No, R. Thakur, and A. Choudhary, "High-Performance Scientific Data Management System," Preprint ANL/MCS-P973-0502, April 2003. Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or datasets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions have been used for this task: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of file and larage and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that combines the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data (using various I/O optimizations available in MPI-IO) and a database to store application-related metadata. In order to support I/O in irregular applications, SDM makes extensive use of MPI-IO's noncontiguous collective I/O functions. Moreover, SDM uses the concept of a history file to optimize the cost of the index distribution using the metadata stored in the database. We describe the design and implementation of SDM and present performance results with two regular applications, ASTRO3D and an Euler solver, and with two irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor instability code.
W. Ligon and R. Ross, "Parallel I/O and the Parallel Virtual File System", in Beowulf Cluster Computing with Linux, 2nd ed., edited by W. Gropp, E. Lusk, and T. Sterling, MIT Press, 2003. This chapter discusses some of the most important issues in parallel I/O systems. These include parallel access patterns, parallel I/O system components and architectures, and consistency semantics. Knowing how parallel I/O systems operate and the issues involved can be useful when performance tunign an application for a particular system or choosing an I/O solution to mach expected workloads. Following this general discussion, the authors delve into PVFS, covering management and tuning and approaches for narrowing the source of problems that crop up. The chapter concludes with discussion of some critical issues for parallel file systems and how PVFS2 attempts to address these.
J. Lee, X. Ma, R. Ross, R. Thakur, and M. Winslett," "RFS: Efficient and Flexible Remote File Access for MPI-IO," Preprint ANL/MCS-P1176-0604, June 2004. In this paper we present RFS, a high-performance remote I/O facility for ROMIO, a well-known MPI-IO implementation. The simple, portable, and flexible design eliminates the shortcomings of previous remote I/O efforts. In particular, RFS improves the remote I/O performance by adopting active buffering with threads (ABT), which hides I/O cost by aggressively buffering the output data using available memory and performing background I/O using threads while computation is taking place. Our experimental results show that RFS with ABT can significantly reduce the remote I/O visible cost, achieving up to 92% of the theoretical peak throughput. The compoutation slowdown caused by concurrent I/O activities was 0.2-6.2%, which is dwarfed by the overall performance improvement in application turnaround time.
M. S. Min, T. W. Lee, P. F. Fischer, and S. K. Gray,
"Fourier Spectral Simulations
and Gegenbauer Reconstructions for Electromagnetic Waves in the Presence of
a Metal Nanoparticle," Preprint ANL/MCS-P1202-1004. We deescribe Fourier psuedospectral time-domain simulations, carried out in order to study light interacting with a metallic nanoscale object. The difficulty of using Fourier
methods to accurately predict the electromagnetic scattering in such dielectric configuration arises from the discontinuity in the dielectric
function along the surface of the metallic object.
Standard Fourier methods lead to oscillatory behavior in approximating solutions that are nonsmooth or that have steep gradients.
By applying the Gegenbauer reconstruction technique as a postprocessing method to the Fourier pseudospectral solution, we successfully reduce the oscillations after postprocessing.
W. Gropp, R. Ross, and N. Miller, "Providing Efficient I/O Redundancy in MPI Environments," Preprint ANL/MCS-P1178-0604, June 2004. We demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets. This approach exploits coarser-grained (but precise) semantics to reduce the overhead of constructing revovery data and uses parallel computation (proportional to the data size and independent of number of processors) to construct data. Experiments showing the efficiency of this approach on a cluster with independent disks are presented, and a technique for hiding the creation of redundant data within the MPI-IO implementation is described.
G. Almasi, C. Archer, J. G. Castanos, J. Gunnels, C. Erway, P. Heidelberger, X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, B. Steinmacher-burow, W. Gropp, and B. Toonen, "The Design and Implementation of Message Passing Services for the BlueGene/L Supercomputer," Preprint ANL/MCS-P1183-0604, June 2004. OThe BlueGene/L supercomputer, with 65,536 dual-processor compoute nodes, was designed to support effocoemt execution of massively prallel message-passing programs. Part of this support is an optimized implementation of MPI that leverages the hardware features of BlueGene/L. MPI for BlueGene/L is implemented on top of a more basic message-passing infrastructure called the message layer. This layer can be used to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two modes of oepration of BlueGene/L: coprocessor mode and virtual node mode. Performance measurements show that our message-passing services deliver peformance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth and that running two processes per compute node (virtual node mode) can have a positive effect on application performance.
J. No, R. Thakur, A. Choudhary, "Integrating Parallel File I/O and Database Support for High-Performance Scientific Data Management," Preprint ANL/MCS-P798-0300, March 2000. Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use within large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do ot perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that combines the good features of both file I/O and databases. SDM provides a thin layer of database-like functionality on top of a high-performance, parallel file-I/O interface (MPI-IO). As a result, users can access data with the convenience of databases and the performance of MPI-IO, without having to bother with the details of either. In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.
R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO, Preprint ANL/MCS-P723-0898, August 1998. The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
I. Foster, W. Gropp, C. Kesselman, "Message Passing and Threads," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, A. White, Morgan Kaufmann Publishers (to appear). Also Preprint ANL/MCS-P843-0700, July 2000. In this chapter, we examine two fundamental, although low-level, approaches to expressing parallelism in programs. Over the years, numerous different approaches to designing and implementing parallel prorams have been developed. However, over time, two dominant alternatives have emerged: message passing and multithreading.
W. D. Gropp, "The 2-D Poisson Problem," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, A. White, Morgan Kaufmann Publishers (to appear). Also Preprint ANL/MCS-P842-0700, July 2000. In this chapter we briefly describe how an approximate solution to a simple partial differential equation can be found when using parallel computing. This chapter will allow us to illustrate the issues of parallelizing an application and contrast the two major approaches.
J. Czyzyk, T. Wisniewski, and S. J. Wright, "Optimization Case Studies in the NEOS Guide," SIAM Review 41 (1999) 148-163. The point of applied mathematics is that the theoretical and algorithmic developments at the core of the subject are relevant to important applications in the real world. In studying the subject, we learn the usefulness of abstracting individual problem characteristics to a mathematical level. The connection to applications motivates us to tackle many of the conceptual difficulties that arise in our study of the mathematics.
Joseph Czyzyk, Michael P. Mesnier, and Jorge J. More', "The NEOS Server," IEEE Computational Science & Engineering 5 (1998) 68-75. The Network-Enabled Optimization System (NEOS) is an environment for solving optimization problems over the Internet. Users submit optimization problems to the NEOS Server via e-mail, the World Wide Web, or the NEOS Submission Tool. The NEOS Server locates the appropriate optimization solver, computes all additional information (for example, derivatives and sparsity patterns) required by the solver, links the optimization problem with the solver, and returns a solution. This article discusses the design and implementation of the NEOS Server.
William Gropp and Jorge J. More', "Optimization environments and the NEOS Server," Approximation Theory and Optimization, edited by M. D. Buhmann and A. Iserles, Cambridge University Press, 1997, pp. 167-182. IEEE Computational Science & Engineering 5 (1998) 68-75. The Network-Enabled Optimization System (NEOS) is an Internet-based service for optimization providing information, software, and problem-solving services for optimization. The main components of NEOS are the NEOS Guide and the NEOS Server. Additional information on the various services provided by NEOS can be obtained from the home page of the Optimization Technology Center.
J. J. More' and C.-J. Lin, "Incomplete Cholesky Factorizations with Limited Memory," SIAM Journal on Scientific Computing, 21 (1999) 24-45. We propose an incomplete Cholesky factorization for the solution of large-scale trust region subproblems and positive definite systems of linear equations. This factorization depends on a parameter p that specifies the amount of additional memory (in multiples of n, the dimension of the problem) that is available; there is no need to specify a drop tolerance. Our numerical results show that the number of conjugate gradient iterations and the computing time are reduced dramatically for small values of p. We also show that in contrast with drop tolerance strategies, the new approach is more stable in terms of number of iterations and memory requirements.
J. J. More' and C.-J. Lin, "Newton's Method for Large Bound-Constrained Optimization Problems," SIAM Journal on Optimization, 9 (1999) 1100-1127. We analyze a trust region version of Newton's method for bound-constrained problems. Our approach relies on the geometry of the feasible set, not on the particular representation in terms of constraints. The convergence theory holds for linearly constrained problems, and yields global and superlinear convergence without assuming neither strict complementarity nor linear independence of the active constraints. We also show that the convergence theory leads to an efficient implementation for large bound-constrained problems.
W. Anderson, W. Gropp, D. Kaushik, D. Keyes, and B. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Proceedings of SC '99, 1999. Many applications of economic and national security importance require the solution of nonlinear partial differential equations (PDEs). In many cases, PDEs possess a wide range of time scales---some (e.g., acoustic) faster than the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales, requiring an unstructured mesh to adequately resolve the problem without requiring an excessive number of mesh points and to accomplish mesh generation and adaptation (almost) automatically. The best algorithms for solving nonlinear implicit problems are often Newton Methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods. This nested hierarchy of tunable algorithms has proved effective in solving complex problems in areas such as aerodynamics, combustion, radiation transport, and global circulation. Typically, for steady-state solutions from a trivial initial guess, the number of "work units" (evaluations of the discrete residuals on the finest mesh on which the problem is represented) is around 10**3 (to achieve reductions in the norm of the residual of 10**-8 to 10**-12). Although these algorithms are efficient (in the sense of using relatively few floating-point operations to arrive at the final result), they do not necessarily achieve the absolute flops-per-second (flop/s) ratings that less efficient or less versatile algorithms may.
N. Desai, R. Bradshaw, A. Lusk, and E. Lusk, "MPI Cluster System Software," Preprint ANL/MCS-P1160-0504, May 2004. We describe the use of MPI for writing system software and tools, an area where it has not been previously applied. By "system software" we mean collections of tools used for system management and operations. We describe the common methodologies used for system software development, together with our experiences in implementing three items of system software with MPI. We demonstrate that MPI can bring significant performance and other benefits to system software.
W. Gropp, "Building Library Components That Can Use Any MPI Implementation," Preprint ANL/MCS-P956-0502," May 2002. The MPI standard for programming parallel computers is widely used for builiding both programs and libraries. Two of the strengths of MPI are its support for libraries and the existence of multiple implementations on many platforms. These two strengths are in conflict, however, when an application wants to use libraries built with different MPI implementations. This paper describes several solutions to this problem, based on minor changes to the API. These solutions also usggest design considerations for other standards, particularly those that expect to have multiple implementations and to be used in concert with other libraries.
W. Gropp and E. Lusk, "Reproducible Measurements of MPI Performance Characteristics," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 6th European PVM/MPI Users Group Meeting, Springer Lecture Notes in Computer Science #1697, Springer, 1999, pp. 11-18. In this paper we describe the difficulties inherent in making accurate, reproducible measurements of message-passing performance. We describe some of the mistakes often made in attempting such measurements and the consequences of such mistakes. We describe mpptest, a suite of performance measurement programs developed at Argonne National Laboratory, that attempts to avoid such mistakes and obtain reproducible measures of MPI performance that can be useful to both MPI implementors and MPI application writers. We include a number of illustrative examples of its use.
L. Freitag, M. Jones, and P. Plassmann, "A Parallel Algorithm for Mesh Smoothing," SIAM Journal for Scientific Computing, 20, no. 6, 1999 Maintaining good mesh quality during the generation and refinement of unstructured meshes in finite-element application is an important aspect in obtaining accurate discretizations and well-conditioned linear systems. In this article, we present a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm. We prove that the parallel algorithm has a provably fast runtime bound and executes correctly for a PRAM computational model. We extend the PRAM algorithm to distributed memory computers and report results for two- and three-dimensional simplicial meshes that demonstrate the efficiency and scalability of this approach for a number of different test cases. We also examine the effect of different architectures on the parallel algorithm and present results for the IBM SP supercomputer and an ATM-connected network of SPARC Ultras.
C.-J. Lin and J. J. More', "Incomplete Cholesky Factorizations with Limited Memory," SIAM J. on Scientific Computing 21(1) (1999), pp. 24-45. We propose an incomplete Cholesky factorization for the solution of large-scale trust region subproblems and positive definite systems of linear equations. This factorization depends on a parameter p that specifies the amount of additional memory (in multiples of n, the dimension of the problem) that is available; there is no need to specify a drop tolerance. Our numerical results show that the number of conjugate gradient iterations and the computing time are reduced dramatically for small values of p. We also show that in contrast with drop tolerance strategies, the new approach is more stable in terms of number of iterations and memory requirements.
F. Jarre and S. J. Wright, "The Role of Linear Objective Functions in Barrier Methods," Mathematical Programming, Series A, 84 (1999) 357-373. To simplify the analysis of interior-point methods, one commonly formulates the problem so that the objective function is linear, by introducing a single extra variable if necessary. Here we show that a linear objective function makes the Newton direction for a barrier function a useful search direction if the current iterate is sufficiently close to the central path. Hence, there are two advantages to using a linear objective and staying close to the central path. First, the Newton direction (which coincides with the affine scaling direction on the central path) gives a very accurate approximation to the direction to the minimum. Second, a long step along the Newton direction is possible without violating the inequality constraints."
S. J. Wright, "Modified Cholesky Factorizations in Interior-Point Algorithms for Linear Programming," SIAM Journal on Optimization 9 (1999) 1159-1191. We investigate a modified Cholesky algorithm similar to those used in current interior-point codes for linear programming. Cholesky-based interior-point codes are popular for three reasons: their implementation requires only minimal changes to standard sparse Cholesky codes (allowing us to take full advantage of software written by specialists in that area); they tend to be more efficient than competing approaches that use different factorizations; and they perform robustly on most practical problems, yielding good interior-point steps even when the coefficient matrix is ill conditioned. We explain the surprisingly good performance of the Cholesky-based approach by using analytical tools from matrix perturbation theory and error analysis, illustrating our results with computational experiments. Finally, we point out the limitations of this approach.
D. Levine, W. Gropp, K. Forsman, and L. Kettunen, "Parallel Computation of Three-dimensional Nonlinear Magnetostatic Problems," Concurrency Practice and Experience, 11(2):109--120, February 1999. We describe a general-purpose parallel electromagnetic code for computing accurate solutions to large computationally demanding, 3D, nonlinear magnetostatic problems. The code, CORAL, is based on a volume integral equation formulation. Using an IBM SP parallel computer and iterative solution methods, we successfully solved the dense linear systems inherent in such formulations. A key component of our work was the use of the PETSc library, which provides parallel portability and access to the latest linear algebra solution technology.
S. Benson, M. Krishnan, L. McInnes, J. Nieplocha, and J. Sarich, "Using the GA and TAO Toolkits for Solving Large-Scale Optimization Proble ms on Parallel Computers," Preprint ANL/MCS-P1084-0903, September 2003. Challenges in the scalable solution of large-scale optimization problems include the development of innovative algorithms as well as efficient tools for parallel data manipulation. This paper discusses the combined use of two complementary toolkits from the collection of Advanced CompuTational Software (ACTS), namely, Global Arrays (GA) for parallel data management and the Toolkit for Advanced Optimization (TAO). TAO uses abstractions for vectors and matrics, so that optimization algorithms can easily interface to the external linear algebra support provided by the GA library. The GA/TAO interfaces are available both in the traditional library mode and as components compliant with the Common Component Architecture (CCA). We highlight the design of each toolkit, describe the interfaces between them, and evaluate performance for model problems involving bound-constrained optimization.
S. J. Benson, L. McInnes, J. J. More', and J. Sarich, "Scalable Algorithms in Optimization: Computational Experiments," Preprint ANL/MCS-P1175-0604, June 2004. We survey techniques in the Toolkit for Advanced Optimization (TAO) for developing scalable algorithms for mesh-based optimization problems on distributed architectures. We discuss the distribution of the mesh, the computation of the gradient and the Hessian matrix, and the use of preconditioners. We show that these techniques, together with mesh sequencing, can produce results that scale with mesh size.
P. Hovland, K. Keahey, L. C. McInnes, B. Norris, L. Freitag, and P. Raghaven, "A Quality-of-Service Architecture for High-Performance Numerical Components," Preprint Preprint ANL/MCS-P1028-0203, February 2003. We propose a model for QoS-based composition of high-performance numerical components. We define an architecture that relies on five key capabilities and services, including component characterization, component proxy services, component replacement, a decision module, and archival run information processing. We describe quality metrics that are important for high-performance numerical simulations, includinmg computational cost, accuracy, and rates of convergence and failure. We discuss the use of the architecture and quality metrics in the context of a driven cavity flow simulation, which has been shown to benefit from adaptive solution techniques that could be derived from a QoS architecture.
S. Balay, W. D. Gropp, L. C. McInnes, B. F. Smith, "Software for the Scalable Solution of PDEs," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White, Morgan Kaufmann Publishers (to appear). The numerical approximation of the solution of partial differential equation (PDEs), which can be used to model physical, chemical, and biological phenomena, is an important application of parallel computers. Early efforts to build programs to solve PDE problems had to start from scratch, building code for each algorithm used in the solution process. This custom approach has two major drawback: it limits the use of parallel computers to a small number of groups that have the resources and expertise to develop these codes, and it hampers the ability to take advantage of developments in parallel algorithms. In conventional, serial programming, both of these drawbacks were partially solved by developing libraries of routines that contained he best numerical analysis and implementation techniques. The same route is being followed for parallel libraries, though parallelism introduces additional complications. Handling these complications has caused many groups to rethink the structure of numerical libraries, leading to better software even for uniprocessor applications. Handling these complications has caused many groups to rethink structure of numerical libraries, leading to better software even for uniprocessor applications. In this chapter, we will cover some of the issues and solutions in the context of the Portable, Extensible Toolkit for Scientific Computation (PETSc), a collection of tools for the numerical solution of PDEs and related problems.
N. Karonis, B de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS2000), to appear. The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced "optimal" trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area "computational grids," and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH mplementation of the MPI standard. Using information about topology discovered by Globus, we construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. We present results demonstrating the advantages of our multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward Scalable Performance Visualization with Jumpshot," Int'l J. of High-Performance Computing Applications (to appear). Also Preprint ANL/MCS-P763-0699, June 1999. Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool, but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format, and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java, and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user's attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing very many events.
C. H. Bischof, H. M. Bücker, and P. D. Hovland, "On Combining Computational Differentiation and Toolkits for Parallel Scientific Computing," Preprint ANL/MCS-P797-0200, Feb. 2000. Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a high-level programming language such as Fortran, C, or C++. The program is treated as a potentially very long sequence of elementary statements to which the chain rule of differential calculus is applied over and over again. Combining automatic differentiation and the organizational structure of toolkits for parallel scientific computing provides a mechanism for evaluating derivatives by exploiting mathematical insight on a higher level. In these toolkits, algorithmic structures such as BLAS-like operations, linear and nonlinear solvers, or integrators for ordinary differential equations can be identified by their standardized interfaces and recognized as high-level mathematical objects rather than as a sequence of elementary statements. In this note, the differentiation of a linear solver with respect to some parameter vector is taken as an example. Mathematical insight is used to reformulate this problem into the solution of multiple linear systems that share the same coefficient matrix but differ in their right-hand sides. The experiments reported here use ADIC, a tool for the automatic differentiation of C programs, and PETSc, an object-oriented toolkit for the parallel solution of scientific problems modeled by partial differential equations.
C. H. Bischof and H. M. Bucker, "Computing Derivatives of Computer Programs, Preprint ANL/MCS-P81 3-0400, April 2000. Automatic differentiation is introduced as a powerful technique to compute derivatives of functions given in the form of a computer program in a high-level programming language such as Fortran, C, or C++. In contrast to traditional approaches such as handcoding of analytic expressions, numerical approximation by divided differences, or manipulation of symbolic algebraic expressions by computer algebra systems, automatic differentiation offers the following substantial benefits: it is accurate up to machine precision, efficient in terms of computational cost, applicable to a 1-line formula as well las to a 100,000-line code, and can be produced with minimal human effort.
P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters,& quot; Preprint ANL/MCS-P804-0400, April 2000. As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters. In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a fast-ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700 Mbytes/sec with Myrinet and 225 Mbytes/sec with fast ethernet.
L. Freitag and R. M. Loy, "Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets," Preprint ANL/MCS-P803-0300, March 2000. We compare three remote visualization strategies used for interactive exploration of large data sets: image-based rendering, parallel visualization servers, and subsampling. We review each strategy and provide details for an adaptive multiresolution subsampling technique that we have developed. To determine the problem regimes for which each approach is most cost effective, we develop performance models to analyze the costs of computation and communication associated with the common visualization task of isosurface generation. Using these models, we investigate a number of hardware system configurations and task complexity scenarios when parameters such as problem size, visualization demands, and network bandwidth change. For one particular strategy, subsampling, we further investigate the tradeoffs between multiresolution and uniform grid methods in terms of performance and approximation errors.
J. Cownie and W. Gropp, "A Standard Interface for Debugger Access to Message Queue Information in MPI," Preprint ANL/MCS-P754-0699, June 1999. This paper discusses the design and implementation of an interface that allows a debugger to obtain the information necessary to display the contents of the MPI message queues. The design has been implemented in the TotalView debugger, and dynamic libraries that conform to the interface exist for MPICH, as well as the proprietary MPI implementations from Compaq, IBM, and SGI.
L. McInnes, B. Norris, S. Bhowmick, and P. Raghaven "Adaptive Sparse Linear Solvers for Implicit CFD Using Newton-Krylov Algorithms," Preprint ANL/MCS-P998-0902, September 2002. We consider the simulation of three-dimensional transonic Euler flow using pseudeo-transient Newton-Krylov methods. The main computation involves solving a large, sparse linear system at each Newton (nonlinear) iteration. We develop a technique for adaptively selecting the linear solver method to match better the numeric properties of the linear systems as they evolve during the course of the nonlinear iterations. We show how such adaptive methods can be implemented using advanced software environments, leading to significant improvements in simulation time.
W. D. Gropp, D. E. Keyes, L. C. McInnes, and M. D. Tidriri, "Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD," Int. J. High Perform. Comput. Appl. (to appear). Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz algorithmic framework is presented as a widely applicable answer. This article shows that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, NKS can simultaneously deliver (1) globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; (2) easonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-communication scaling in the Krylov linear solver; and (3) high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
L. G. de Pillis and A. Radunskaya, "A Mathematical Tumor Model with Immune Resistance and Drug Therapy: An Optimal Control Approach," Preprint ANL/MCS-P805-0400, April 2000. We present a competition model of cancer tumor growth that includes both the immune system response and drug therapy. This is a four-population model that includes tumor cells, host cells, immune cells, and drug interaction. We analyze the stability of the drug-free equilibria with respect to the immune response in order to look for target basins of attraction. One of our goals was to simulate qualitatively the asynchronous tumor-drug interaction known as "Jeff's phenomenon." The model we develop is successful in generating this asynchronous response behavior. Our other goal was to identify treatment protocols that could improve standard pulsed chemotherapy regimens. Using optimal control theory with constraints and numerical simulations, we obtain new therapy protocols that we then compare with traditional pulsed periodic treatment. The optimal control generated therapies produce larger oscillations in the tumor population over time. However, by th4e end of the treatment period, total tumor size is smaller than that achieved through raditional pulsed therapy, and the normal cell population suffers nearly no oscillations.
P. D. Cha and L. G. de Pillis, "Model Updating by Adding Known Masses," International Journal for Numerical Methods in Engineering (to appear). Also Preprint ANL/MCS-P841-0800, August 2000. New approaches are developed that use measured data to adjust the analytical mass and stiffness matrices of a system so that the agreement between the analytical modes of vibration and the modal survey is improved. By adding known masses to the structure of interest, measuring the modes of vibration of this mass-modified system, and finally using this set of new data in conjunction with the initial lmodal survey, the analytical mass matrix of the structure can be corrected, after which the analytical stiffness matrix can be readily updated. By manipulating the correction matrices into vector forms, the connectivity information can be enforced, thereby preserving the physical configuration of the system and reducing the sizes of the least squares problems that need to be solved. Solution techniques for updating the system matrices are introduced, and the numerical issues associated with solving overdetermined and underdetermined least squares problems are investigated. The effects of round-off errors are also studied, and heuristic criteria are given for determining the minimum number of modes that need to be measured in order to ensure accurate updated mass and stiffness matrices. Numerical experiments are presented to validate the proposed model-updating techniques, to illustrate the effects of the number of measured modes on the quality of the updated model, to show how the magnitudes and locations of the added masses influence the updated matrices, and to highlight the numerical issues discussed in this paper.
L. G. de Pillis and E. G. de Pillis, "A Mathematical Framework for Understanding Continuum Effects of Budget Fluctuations on a University," Preprint ANL/MCS-P806-0400, April 2000. As a large part of most state budgets, public universities are tempting targets of budget cuts. Using the theory of diffusion of innovation, we build a mathematical framework that allows us to simulate the continuum effects of such budget cuts. This technique of mathematical modeling can provide insights into the mechanisms that affect the beneficial impact of the university. Such insights are difficult to obtain from standard economic impact studies.
E. G. de Pillis and L. G. de Pillis, "The Long-Term Impact of University Budget Cuts: A Mathematical Model," Preprint ANL/MCS-P817-0500, May 2000. Policymakers acknowledge the regional benefits of the university, yet cut higher education budgets. Incorporating the theory of diffusion of innovation, we develop a mathematical model to explore the long-term effects of university budget cuts. Simulations indicate that the full impact of budget modifications may not be realized for several decades.
L. A. Freitag and R. M. Loy, "Using Desktop Graphics Workstations for Interactive Remote Exploration of Large Data Sets," Preprint ANL/MCS-P809-0400, April 2000. The interactive visualization and exploration of large scientific data sets is a challenging and difficult task; their size often far exceeds the performance and memory capacity of even the most powerful graphics workstations. To address this problem, we have created a technique that combines multiresolution data reduction methods with parallel computing to allow interacative exploration of large data sets while retaining full-resolution capability. We describe the creation of reduced data sets using several different criteria including user-specified error bounds or a preset performance criterion. We discuss the software architecture of the system with particular emphasis on the algorithms used to efficiently create a reduced data set and the software used to communicate between the remote data reduction server and the local graphics client. We present performance results for the visualization of Rayleigh-Taylor instability and hairpin vortex data sets.
Z. Ren, R. Sheng, and S. J. Wright, "Advanced Computational Techniques for Laue Diffraction," Preprint ANL/MCS-P740-0199, Feb. 1999. We describe LaueView, a code for processing the measured intensity data in Laue X-ray diffraction experiments to obtain corrected structure amplitudes for each reflection that take account of the various distortion effects. The resulting corrected intensity data can then be used to recover the molecular structure by isomorphous refinement or by solution of the phase problem. We describe the key numerical techniques used in LaueView and outline the improvements we made to obtain a new, more efficient, and parallel version of the code. We conclude with some computational results obtained on a real data set that illustrate our improvements. The basic principles of the Laue method are described in an appendix, where we outline the distortions in the measured intensity data due to effects such as blurring, overlap of the spots, the nonuniform distribution of intensities in the incident X-ray beam, and absorption effects of various types.
E. A. Yildirim and S. J. Wright, "Warm-start Strategies in Interior-Point Methods for Linear Programming," Preprint ANL/MCS-P799-0300, March 2000. We study the situation in which, having solved a linear program with an interior-point method, we are presented with a new problem instance whose data is slightly perturned from the original. We describe strategies for recovering a "warm-start" point for the perturbed problem instance from the iterates of the original problem instance. We obtain worst-case estimates of the number of iterations required to converge to a solution of the perturbation and on the conditioning and other properties of the problem instances.
S. J. Wright, "On Reduced Convex QP Formulations of Monotone LCP Problems," Preprint ANL/MCS-P808-0400, April 2000. Techniques for transforming convex quadratic programs (QPs) into monotone linear complementarity problems (LCPs) and vice versa are well known. We describe a class of LCPs for which a reduced QP formulation---one that has fewer constraints than the "standard" QP formulations---is available. We mention several instances of this class, including the known case in which the coefficient matrix in the LCP is symmetric.
S. J. Wright, "Recent Developments in Interior-Point Methods," Proceedings of the IFIP Conference on System Modelling and Optimization, M. J. D. Powell, ed., Kluwer, to appear. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have become quite sophisticated, while extensions to more general classes of problems, such as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in many other optimization contexts as well, including analytic center methods and column-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these other problem areas.
I. Lee, P. Raghaven, S. Schofield, and P. Fischer, ""An O(n log n) Solution Algorithm for Spectral Element Methods," Preprint ANL/MCS-P1041-0403, April 2003. To leverage significant software development effort, general-purpose unstructured codes are often used in structured or semi-structured applications. We show that O(n log n) computational complexities, competitive with classic Fourier methods, are achievable for some classes of semi-structured spectral element applications.
J. W. Lottes and P. F. Fischer, "Hybrid Multigrid/Schwarz Algorithms for the Spectral Element Method," Preprint ANL/MCS-P1052-0403, April 2003. We study the performance of the multigrid method applied to spectral element discretizations of the Poisson equation. Smoothers based on finite element (FE) discretizations, overlapping Schwarz methods, and point-jacobi are considered in conjunction with conjugate gradient and GMRES acceleration techniques. It is found that Schwarz methods based on restrictions of the originating SE matrices converge faster than FE-based methods and that weighting the Schwarz matrices by the inverse of the diagonal coutning matrix is essential to effective Schwarz smoothing. Several of the methods considered achieve convergence rates comparable to those attained by classic multigrid on regular grids.
R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. Kohn, L. McInnes, S. Parker, and B. Smolinski, "Toward a Common Component Architecture for High-Performance Scientific Computing," Preprint ANL/MCS-P759-0699, Proceedings of the 1999 High-Performance Distributed Computing Conference, to appear. This paper describes work in progress to develop a standard for interoperability among high-performance scientific components. This research stems from growing recognition that the scientific community needs to better manage the complexity of multidisciplinary simulations and better address scalable performance issues on parallel and distributed architectures. Driving forces are the need for fast connections among components that perform numerically intensive work and for parallel collective interactions among components that use multiple processes or threads. This paper focuses on the areas we believe are most crucial in this context, namely, an interface definition language that supports scientific abstractions for specifying component interfaces and a ports connections model for specifying component interactions.
H. M. Tufo and P. F. Fischer, "Terascale Spectral Element Algorithms and Implementations," Preprint ANL/MCS-P762-0699, June 1999. We describe the development and implementation of an efficient spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. We review basic and recently developed algorithmic underpinnings that have resulted in good parallel and vector performance on a broad range of architectures, including the terascale computing systems now coming on line at the DOE labs. Sustained performance of 219 GFLOPS has been recently achieved on 2048 nodes of the Intel ASCI-Red machine at Sandia.
H. M. Tufo, P. F. Fischer, M. E. Papka, and M. Szymanski, "Hairpin Vortex Formation, a Case Study for Unsteady Visualization," Preprint ANL/MCS-P774-0799, July 1999. To better understand the vortex dynamics of coherent structures in turbulent and transitional boundary layers, we consider direct numerical simulation of the interaction between a flat-plate-boundary-layer flow and an isolated hemispherical roughness element. Of principal interest is the evolution of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer. Using animations of unsteady three-dimensional representations of this flow, produced by the vtk toolkit and enhanced to operate in a CAVE virtual environment, we identify and study several key features in the evolution of this complex vortex topology not previously observed in other visualization formats.
W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Preprint ANL/MCS-P776-0899, Aug. 1999. Overview: Many applications of economic and national security importance require the solution of nonlinear partial differential equations (PDEs). In many cases, PDEs possess a wide range of time scales---some (e.g., acoustic) faster than the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales, requiring an unstructured mesh to adequately resolve the problem without requiring an excessive number of mesh points and to accomplish mesh generation and adaptation (almost) automatically. The best algorithms for solving nonlinear implicit problems are often Newton Methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods. This nested hierarchy of tunable algorithms has proved effective in solving complex problems in areas such as aerodynamics, combustion, radiation transport, and global circulation. Typically, for steady-state solutions from a trivial initial guess, the number of "work units" (evaluations of the discrete residuals on the finest mesh on which the problem is represented) is around 10**3 (to achieve reductions in the norm of the residual of 10**-8 to 10**-12). Although these algorithms are efficient (in the sense of using relatively few floating-point operations to arrive at the final result), they do not necessarily achieve the absolute flops-per-second (flop/s) ratings that less efficient or less versatile algorithms may.
P. F. Fischer and J. S. Mullen, "Filter-based Stabilization of Spectral Element Methods," C. R. Acad. Sci. Paris 332, series 1 (2001) 265-270. We present a simple filtering procedure for stabilizing the spectral element method (SEM) for the unsteady advection-diffusion and Navier-Stokes equations. A number of example applications are presented, along with basic analysis for the advection-diffusion case.
P. F. Fischer and H. M. Tufo, "High-Performance Spectral Element Algorithms and Implementations," Preprint ANL/MCS-P778-0899, Aug. 1999. We describe the development and implementation of a spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. Parallel performance is present on up to 2048 nodes of the Intel ASCI-Red machine at Sandia.
H. M. Tufo, P. F. Fischer, M. E. Papka, and K. Blom, "Numerical Simulation and Immersive Visualization of Hairpin Vortices," Preprint ANL/MCS-P779-0899, Aug. 1999. To better understand the vortex dynamics of coherent structures in turbulent and transitional boundary layers, we consider direct numerical simulation of the interaction between a flat-plate-boundary-layer flow and an isolated hemispherical roughness element. Of principal interest is the evolution of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer. Using animations of unsteady three-dimensional representations of this flow, produced by the vtk toolkit and enhanced to operate in a CAVE virtual environment, we identify and study several key features in the evolution of this complex vortex topology not previously observed in other visualization formats.
L. Freitag and T. Urness, "Analyzing Industrial Furnace Efficiency Using Comparative Visualization in a Virtual Reality Environment," Preprint ANL/MCS-P744-0299, Feb. 1999. We describe an interactive toolkit used to perform comparative analysis of two or more data sets arising from numerical simulations. Several techniques have been incorporated into this toolkit, including (1) successive visualization of individual data sets, (2) data comparison techniques such as computation and visualization of the differences between data sets, and (3) image comparison methods such as scalar field height profiles plotted in a common coordinate system. We describe each technique in detail and show example usage in an industrial application aimed at designing an efficient, low-NOx burner for industrial furnaces. Critical insights are obtained by interactively adjusted color maps, data culling, and data manipulation. New paradigms for scaling small values in the data comparison technique are described. The display device used for this application was the CAVE virtual reality theater, and we describe the user interface to the visualization toolkit and the benefits of immersive 3D visualization for comparative analysis.
L. A. Freitag and P. Plassmann, "Local Optimization-Based Simplicial Mesh Untangling and Improvement," Preprint ANL/MCS-P749-0399, May 1999. We develop an optimization-based approach for mesh untangling formulated by maximizing the minimum area or volume of the simplicial elements in a local submesh. The method that the function level sets are convex regardless of the position of the free vertex, and hence the local subproblem is guaranteed to converge. Maximizing the minimum area of mesh elements, although well-suited for mesh untangling, is not ideal for mesh improvement and its use often results in poor quality meshes. We therefore combine the mesh untangling technique with optimization-based mesh improvement techniques, and expand previous results to show that the level sets for a commonly used mesh quality criterion is convex in the feasible region. Typical results showing the effectiveness of the combined untangling and smoothing techniques are given for both two- and three-dimensional meshes.
L. A. Freitag, W. D. Gropp, P. D. Hovland, L. C. McInnes, and B. F. Smith, "Infrastructure and Interfaces for Large-Scale Numerical Software," Preprint ANL/MCS-P751-0599, May 1999. The complexity of large-scale scientific simulations often necessitates the combined use of multiple software packages developed by different groups in areas such as adaptive mesh manipulations, scalable algebraic solvers, and optimization. Historically, these packages have been combined by using custom code. This practice inhibits experimentation with and comparison of multiple tools that provide similar functionality through different implementations. The ALICE project, a collaborative effort among researchers at Argonne National Laboratory, is exploring the use of component-based software engineering to provide better interoperability among numerical toolkits. We discuss some initial experiences in developing an infrastructure and interfaces for high-performance numerical computing.
L. A. Freitag and R. M. Loy, "Adaptive, Multiresolution Visualization of Large Data Sets Using Parallel Octrees," Preprint ANL/MCS-P752-0599, May 1999. The interactive visualization and exploration of large scientific data sets is a challenging and difficult task; their size often far exceeds the performance and memory capacity of even the most powerful graphics workstations. To address this problem, we have created a technique that combines hierarchical data reduction methods with parallel computing to allow interactive exploration of large data sets while retaining full-resolution capability. The hierarchical representation is built in parallel by strategically inserting field data into an octree data structure. We provide functionality that allows the user to interactively adapt the resolution of the reduced data sets so that resolution is increased in regions of interest without sacrificing local graphics performance. We describe the creation of the reduced data sets using a parallel octree, the software architecture of the system, and the performance of this system on the data from a Rayleigh-Taylor instability simulation.
L. A. Freitag and P. M. Knupp, "Tetrahedral Element Shape Optimization via the Jacobian Determinant and Condition Number," Preprint ANL/MCS-P769-0799, July 1999. We present a new shape measure for tetrahedral elements that is optimal in the sense that it gives the distance of a tetrahedron from the set of inverted elements. This measure is constructed from the condition number of the linear transformation between a unit equilateral tetrahedron and any tetrahedron with positive volume. We use this shape measure to formulate two optimization objective functions, which are differentiated by their goal: the first seeks to improve the average quality of the tetrahedral mesh, the second aims to improve the worst quality element in the mesh. Because the element condition number is not defined for tetrahedra with negative volume, these objective functions can be used only when the initial mesh is valid. Therefore, we formulate a third objective function using the determinant of the element Jacobian that is suitable for mesh untangling. We review the optimization techniques used with each objective function, and present experimental results that demonstrate the effectiveness of the mesh improvement and untangling methods. We show that a combined optimization approach that uses both condition number objective functions obtains the best quality meshes.
L. A. Freitag and P. M. Knupp, "Tetrahedral Mesh Improvement via Optimization of the Element Condition Number," Preprint ANL/MCS-P810-0500, May 2000. We present a new shape measure for tetrahedral elements that is optimal in that it gives the distance of a tetrahedron from the set of inverted elements. This measure is constructed from the condition number of the linear transformation between a unit equilateral tetrahedron and any tetrahedron with positive volume. Using this shape measure, we formulate two optimization objective functions that are differentiated by their goal: the first seeks to improve the average quality of the tetrahedral mesh; the second aims to improve the worst-quality element in the mesh. We review the optimization techniques used with each objective function and present experimental results that demonstrate the effectiveness of the mesh improvement methods. We show that a combined optimization approach that uses both objective functions obtains the best-quality meshes for several complex geometries.
R. Butler, W. Gropp, and E. Lusk, "A Scalable Process-Management Environment for Parallel Programs," Preprint ANL/MCS-P812-0400, April 2000. We present a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising a thousand processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and better runtime management of MPICH jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. MPD is implemented and freely distributed with MPICH.
J. E. Flaherty, R. M. Loy, M. S. Shephard, and J. D. Teresco, "Software for the Parallel Adaptive Solution of Conservation Laws by Discontinuous Galerkin Methods," Preprint ANL/MCS-P773-0799, July 1999. We develop software tools for the solution of conservation laws using parallel adaptive discontinuous Galerkin methods. In particular, the Rensselaer Partition Model (RPM) provides parallel mesh structures within an adaptive framework to solve the Euler equations of compressible flow by a discontinuous Galerkin method (LOCO). Results are presented for a Rayleigh-Taylor flow instability for computations performed on 128 processors of an IBM SP computer. In addition to managing the distributed data and maintaining a load balance, RPM provides information about the parallel environment that can be used to tailor partition to a specific computational environment.
M. Anitescu, "On the Rate of Convergence of Sequential Quadratic Programming with Nondifferentiable Exact Penalty Function in the Presence of Constraint Degeneracy," Preprint ANL/MCS-P760-0699, June 1999. We analyze the convergence of the sequential quadratic programming (SQP) method for nonlinear programming for the case in which the Jacobian of the active constraints is rank deficient at the solution and/or strict complementarity does not hold for some or any feasible LaGrange multipliers. We use a nondifferentiable exact penalty function, and we prove that the sequence generated by the SQP is locally R-linearly convergent if the matrices of the quadratic program are uniformly positive definite and bounded, provided that the Mangasarian-Fromowitz constraint qualification and some second-order sufficiency conditions hold.
M. Anitescu, "Global Convergence of an Elastic Mode Approach for a Class of Mathematical Programs with Complementarity Constraints," Preprint ANL/MCS-P1143-0404, April 2004. We prove that any accumulation point of an elastic mode approach, applied to the optimization of a mixed P variational inequality that approximately solves the relaxed subproblems is a C-stationary point of the problem of optimizaing a parametric mixed P variational inequality. If, in addition, the accumulation point satisfies the MPCC-LICQ constraint qualitfication and if the solutions of the subproblem satisfy approximate second-order sufficient conditions, then the limiting point is an M-stationary point. Moreover, if the accumulation point satisfies the upper-level strict complementarity condition, the accumulation point will be a strongly stationary point. If we assume that the penalty function associated with the feasible set of the mathematical program with complementarity constraints has bounded level sets and if the objective function is bounded below, we show that the algorithm will produce bounded iterates and will therefore have at least one accumulation point. We prove that the obstable problem satisfies our assumptions for both a rigid and a deformable obstacle. The theoretical conclusions are validated by several numerical examples.
M. Anitescu and G. D. Hart, "A Fixed-Point Iteration Approach for Multibody Dynamics with Contact and Small Friction," Preprint ANL/MCS-P985-0802, February 2004. Acceleration-force setups for multi-rigid-body dynamics are known to be inconsistent for some configurations and sufficiently large friction coefficients (a Painleve paradox). This difficulty is circumvented by time-stepping methods using impulse-velocity approaches, which solve complementarity problems with possibly nonvconvex solution sets. We show that very simple configurations involving two bodies may have a nonconvex solution set for any nonzero value of the friction coefficient. We construct two fixed-point iteration algorithms that solve convex subproblems and that are guaranteed, for sufficiently small friction coefficients, to retrieve, at a linear convergence rate, the unique velocity solution of the nonconvex linear complementarity problem whenever the frictionless configuration can be disassembled. In addition, we show that one step of one of the interative algorithms provides an excellent approximation to the velocity solution of the original, possibly nonconvex, problem if for all contacts we have that either the friction coefficient is small or the slip velocity is small.
M. Anitescu, W. J. Layton, and F. Pahlevani, "Unconditional Stability for Numerical Scheme Combining Implicit Timestepping fo r Local Effects and Explicit Timestepping for Nonlocal Effects," Preprint ANL/MCS-P1093-0903, September 2003. A combination of implicit and explicit timestepping is analyzed for a system of ODEs motivated by ones arising from spatial discretizations of evolutionary partial differential equations. Loosely speaking, the method we consider is implicit in local and stabilizing terms in the underlying PDE and explicit in nonlocal and unstabilizing terms. Unconditional stability and convergence of the numerical scheme are proven by the energy method and by algebraic techniques. This stability result is surprising because usually when different methods are combined, the stability properties of the least stable method plays a determining role in the combination.
F. A. Potra, M. Anitescu, B. Gavrea, and J. Trinkle, "A Linearly Implicit Trapezoidal Method for Integrating Stiff Multibody Dynamics with Contact, Joints, and Friction," Preprint ANL/MCS-P1162-0504, May 2004. We present a hard constraint, linear complementarity-based method for the simulation of stiff multibody dynamics with contact, joints, and friction. The approach uses a linearization of the modified trapezoidal method, incorporates a Poisson restitution model at collision, and solves only one linear complementarity problem per time step when no collisions are encountered. We prove that the method has order two, a fact that is also demonstrated by our numerical simulations. When we use a special approximation of the Jacobian matrix for the case where the stiff forces originate in springs and dampers attached to two points in the system, we prove that the linear coplementarity problem can be solved for any value of the time step and that the method is stiffly stable. In particular, the numerical solution converges to the numerical solution of the contraint problem where the stiff springs and dampers have been replaced by rigid links, a fact that we also demonstrate by numerical simulations. The method was implemented in UMBBRA, an industrial-grade virtual prototyping software.
G. D. Hart and M. Anitescu, "A Hard-Constraint Time-Stepping Approach for Rigid Multibody Dynamics wit h Joints, Contact, and Friction," Preprint ANL/MCS-P1099-0903, October 2003. We present a method for simulating rigid multibody dynamics with joints, contact, and friction. In this work, the nonsmooth contact and frictional constraints are represented by hard constraints. The method requires the soluiton of only one linear complementarity problem per step an can progress at much larger time steps than explicit penalty methods, which are currently the method of choice for most of these simulations.
M. Anitescu, "Degenerate Nonlinear Programming with a Quadratic Growth Condition," Preprint ANL/MCS-P761-0699, June 1999. We show that the quadratic growth condition and the Mangasarian-Fromowitz constraint qualification imply that local minima of nonlinear programs are isolated stationary points. As a result, when started sufficiently close to such points, an L_\infty exact penalty sequential quadratic programming algorithm will induce at least R-linear convergence of the iterates to such a local minimum. We construct an example of a degenerate nonlinear program with a unique local minimum satisfying the quadratic growth and the Mangasarian-Fromowitz constraint qualification but for which no positive semidefinite augmented Lagrangian exists. We present numerical results obtained using several nonlinear programming packages on this example, and discuss its implications for some algorithms.
L. A. Freitag and P. Plassmann, "Local Optimization-Based Simplicial Mesh Untangling and Improvement," Preprint ANL/MCS-P749-0399, May 1999. We develop an optimization-based approach for mesh untangling formulated by maximizing the minimum area or volume of the simplicial elements in a local submesh. The method that the function level sets are convex regardless of the position of the free vertex, and hence the local subproblem is guaranteed to converge. Maximizing the minimum area of mesh elements, although well-suited for mesh untangling, is not ideal for mesh improvement and its use often results in poor quality meshes. We therefore combine the mesh untangling technique with optimization-based mesh improvement techniques, and expand previous results to show that the level sets for a commonly used mesh quality criterion is convex in the feasible region. Typical results showing the effectiveness of the combined untangling and smoothing techniques are given for both two- and three-dimensional meshes.
M. K. Mihçak, P. Moulin, M. Anitescu, and K. Ramchandran, "Rate-Distortion-Optimal Subband Coding without Perfect-Reconstruction Constraints," Preprint ANL/MCS-P766-0699, June 1999. We investigate the design of subband coders without the traditional perfect-reconstruction constraint on the filters. The coder uses scalar quantizers, and its filters and bit allocation are designed so as to optimize a rate-distortion criterion. Convexity properties play a central role in the analysis. Our results hold for a broad class of rate-distortion criteria. First, we show that optimality can be achieved using filter banks that are the cascade of a (paraunitary) principal component filter bank for the input spectral process and a set of pre- and post-filters surrounding each quantizer. An algorithm for computing the globally optimal filters and bit allocation is given. We then develop closed-form solutions for the special case of two-channel coders under an exponential rate-distortion model. Finally, we investigate a constrained-length version of the filter design problem, which is applicable to practical coding scenarios. While the optimal filter banks are nearly perfect-reconstruction at high rates, we demonstrate an apparently surprising advantage of optimal FIR filter banks: they significantly outperform optimal perfect-reconstruction FIR filter banks at all bit rates.
S. J. Wright, "Recent Developments in Interior-Point Methods," Proc. 19th IFIP TC7 Conference on System Modeling and Optimization, Kluwer Academic Publishers (to appear 2000). Also Preprint ANL/MCS-P783-0999, Sept. 1999. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have become quite sophisticated, while extensions to more general classes of problems, such as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in many other optimization contexts as well, including analytic center methods and column-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these other problem areas.
F. A. Potra and S. J. Wright, Journal of Computational and Applied Mathematics 124 (2000) 281-302. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have bec ome quite sophisticated, while extensions to more general classes of problems, s uch as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in man y other optimization contexts as well, including analytic center methods and col umn-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these o ther problem areas.
S. J. Wright and D. Orban, "Properties of the Log-Barrier Function on Degenerate Nonlinear Programs," Preprint ANL/MCS-P772-0799, July 1999. We examine the sequence of local minimizers of the log-barrier function for a nonlinear program near a solution at which second-order sufficient conditions and the Mangasarian-Fromowitz constraint qualifications are satisfied, but the active constraint gradients are not necessarily linearly independent. When a strict complementarity condition is satisfied, we show uniqueness of the local minimizer of the barrier function in the vicinity of the nonlinear program solution, and obtain a semi-explicit characterization of this point. When strict complementarity does not hold, we obtain several other interesting characterizations, in particular, an estimate of the distance between the minimizers of the barrier function and the nonlinear program in terms of the barrier parameter, and a result about the direction of approach of the sequence of minimizers of the barrier function to the nonlinear programming solution.
R. Thakur, W. Gropp, and E. Lusk, "Achieving High Performance with MPI-IO," Preprint ANL/MCS-P742-0299, Feb. 1999. The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. An application can be written in many different ways with MPI-IO. We classify the different ways of expressing an application's I/O access pattern in MPI-IO into four levels, called level 0 through level 3. We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves significantly if users write the application such that it makes level-3 MPI-IO requests (noncontiguous, collective) rather than level-0 requests (Unix style). We describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
J. Cownie and W. Gropp, "A Standard Interface for Debugger Access to Message Queue Information in MPI," Preprint ANL/MCS-P754-0699, June 1999. This paper discusses the design and implementation of an interface that allows a debugger to obtain the information necessary to display the contents of the MPI message queues. The design has been implemented in the TotalView debugger, and dynamic libraries that conform to the interface exist for MPICH, as well as the proprietary MPI implementations from Compaq, IBM, and SGI.
A. S. Bondarenko, D. M. Bortz, and J. J. More', "COPS: Large-Scale Nonlinearly Constrained Optimization Problems," Technical Memorandum ANL/MCS-TM-237, Sept. 1998, revised Oct. 1999. We have started the development of COPS, a collection of large-scale nonlinearly Constrained Optimization ProblemS. The primary purpose of this collection is to provide difficult test cases for optimization software. Problems in the current version of the collection come from fluid dynamics, population dynamics, optimal design, and optimal control. For each problem we provide a short description of the problem, notes on the formulation of the problem, and results of computational experiments with general optimization solvers. We currently have results for DONLP2, LANCELOT, MINOS, SNOPT, and LOQO.
Y. Zhou, "Eigenvalue Computation from the Optimization Perspective: On Jacobi-David son, IIGD, RQI, and Newton Updates," Preprint ANL/MCS-P1074-0803, August 2003. We discuss the close connection between eigenvalue computation using the Newton method and subspace methods. From the connection we derive a new class of Newton updates. The new update formulation is similar to the well-known Jacobian-Davidson method. This similarity leads to simplified versions of the Jacobi-Davidson method and the inverse iteration generalized Davidson (IIGD) method. We prove that the projection subspace augmented by the updating direction from each of these methods is able to include the Rayleigh quotient iteration (RQI) direction. Hence, the locally quadratic (cubic for normal matrices) convergence rat of the RQI method is retained and strengthened by the subspace methods. The theory is supported by extensive numerical results. Preconditioned formulations are also briefly discussed for large-scale eigenvalue problems.
L. Freitag, "Users Manual for Opt-MS: Local Methods for Simplicial Mesh Smoothing and Untangling," Technical Memorandum ANL/MCS-TM-239, April 1999. Creating meshes containing good-quality elements is a challenging, yet critical, problem facing computational scientists today. Several researches have shown that the size of the mesh, the shape of the elements within that mesh, and their relationship to the physical application of interest can profoundly affect the efficiency and accuracy of many numerical approximation techniques. If the application contains anisotropic physics, the mesh can be improved by considering both local characteristics of the approximate application solution and the geometry of the computational domain. If the application is isotropic, regularly shaped elements in the mesh reduce the discretization error, and the mesh can be improved a priori by considering geometric criteria only. The Opt-MS package provides several local node point smoothing techniques that improve elements in the mesh by adjusting grid point location using geometric criteria. The package is easy to use; only three subroutine calls are required for the user to begin using the software. The package is also flexible; the user may change the technique, function, or dimension of the problem at any time during the mesh smoothing process. Opt-MS is designed to interface with C and C++ codes, and examples for both two- and three-dimensional meshes are provided.
W. D. Gropp, "Runtime Checking of Datatype Signatures in MPI," Recent Advances in PVM and MPI, eds. J. Dongarra, P. Kacsuk, N. Podhorszki, 7th European PVM/MPI User's Group Meeting, 9/10-13/00 (to appear). Also Preprint ANL/MCS-P826-0500, May 2000. The MPI standard provides a way to send and receive complex combinations of datatypes (e.g., integers and doubles) with a single communication operation. The MPI standard specifies that the type signature, that is, the basic datatypes (language-defined types such as int or DOUBLE PRECISION), must match in communication operations such as send/receive or broadcast. Because datatypes may be defined by the user in MPI, there is a limitless collection of possible type signatures. Detecting the programmer error of mismatched datatypes is difficult in this case; detecting all errors essentially requires sending a complete description of the type signature with a message. This paper discusses an alternative: send the value of a function of the type signature so that (a) identical type signatures always give the same function value, (b) different type signatures often give difference values, and (c) common cases (e.g., predefined datatypes) are handled exactly. Thus, erronneous programs are often (but not always) detected; correct programs never are flagged as erroneous. The method described is relatively inexpensive to compute and uses a small (and fixed, independent of the complexity of the datatype) amount of space in the message envelope.
Y. Chen, B. F. Hobbs, S. Leyffer, and T. S. Munson, "Leader-Follower Equilibria for Electric Power and NOx Allowances Markets," Preprint ANL/MCS-P1191-0804, August 2004. This paper investigates the ability of the largest producer in an electricity market to manipulate both the electricity and emission allowances markets to its advantage. The Stackelberg game to analyze this situation is constructed in which the largest firm plays the role of leader while the medium-sized firms are treated as Cournot followers with price-taking fringes that behave competitively in both markets. Analysis of the computed solution for the Pennsylvania-New Jersey-Maryland electricity market shows that the leader can gain substantial profits by withholding allowances and driving up NOx allowance costs for rival producers. The allowances price is higher than the corresponding price in the Nash-Cournot case, although the electricity prices are essentially the same.
T. Munson, "Mesh Shape-Quality Optimization Using the Inverse Mean-Ratio Metric," Preprint ANL/MCS-1136-0304, April 2004. We present a nonlinear fractional program that relocates the vertices of a given mesh to optimize the average element shape quality as measured by the inverse mean-ratio metric. To solve the resulting large-scale optimization problems we apply an efficient implementation of an inexact Newton algorithm using the conjugate gradient method with a block Jacobi preconditioner to compute the direction. Numerical results on several test meshes are presented and used to quantify the impact on solution time and memory requirements of using a modeling language and general-purpose algorithm to solve these problems.
S. J. Benson and T. S. Munson, "Flexible Complementarity Solvers for Large-Scale Applications," Preprint ANL/MCS-P1055-0603, June 2003. Discretizations of infinite-dimensional variational inequalities lead to linear and nonlinear complementarity problems with many degrees of freedom. To solve these problems in a parallel environment, we propose two active-set methods that solve only one linear system of equations per iteration. The linear solver, preconditioner, and matrix structures can be chosen by the user for a particular application to achieve high parallel performance. The parallel scalability of these methods is demonstrated for some discretizations of infinite-dimensional variational inequalities.
J. J. More' and T. S. Munson, "Computing Mountain passes and Transition States," Preprint ANL/MCS-P957-0502, May 2002. We propose the elastic string algorithm for computing mountain passes in finite-dimensional problems. We analyze the convergence properties and numerical performance of this algorithm for benchmark problems in chemistry and discretizations of infinite-dimensional variational problems. We show that any limit point of the elastic strring algorithm is a path that crosses a critical point at which the Hessian matrix is not positive definite.
S. J. Benson and Y. Ye, "Approximating Maximum Stable Set and Minimum Graph Coloring Problems with the Positive Semidefinite Relaxation," Preprint ANL/MCS-P830-0600, June 2000. We compute approximate solutions to the maximum stable set problem and the minimum graph coloring problem using a positive semidefinite relaxation. The positive semidefinite programs are solved using an implementation of the dual scaling algorithm that takes advantage of the sparsity inherent in most graphs and th4e structure inherent in the problem formulation. From the solution to the relaxation, we apply a randomized algorithm to find approximate maximum stable sets and a modification of a popular heuristic to find graph colorings. We obtained high quality answers for graphs with over 1000 vertices and over 6000 edges.
J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 1999. Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization. It responds to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
Stephen J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1996. In the past decade, primal-dual algorithms have emerged as the most important and useful algorithms from the interior-point class. This book presents the major primal-dual algorithms for linear programming in straightforward terms. A thorough description of the theoretical properties of these methods is given, as are a discussion of practical and computational aspects and a summary of current software.
B. Smith, P. Bjorstad, and W. Gropp, Domain Decomposition, Cambridge University Press, 1996. The emergence of parallel computers and their potential for the numerical solution of Grand Challenge problems has led to a large amount of research in domain decomposition methods. This book presents an easy-to-read discussion of domain decomposition algorithms, their implementation and analysis. This book is ideal for graduate students about to embark on a career in computational science. It will also be a valuable resource for all those interested in parallel computing and numerical computational methods.
S. J. Wright, "Algorithms and Software for Linear and Nonlinear Programming," Foundations of Computer-Aided Process Design, AIChE Symposium Series, CACHE Publications, 1999. We investigate a modified Cholesky algorithm similar to those used in current interior-point codes for linear programming. Cholesky-based interior-point codes are popular for three reasons: their implementation requires only minimal changes to standard sparse Cholesky codes (allowing us to take full advantage of software written by specialists in that area); they tend to be more efficient than competing approaches that use different factorizations; and they perform robustly on most practical problems, yielding good interior-point steps even when the coefficient matrix is ill conditioned. We explain the surprisingly good performance of the Cholesky-based approach by using analytical tools from matrix perturbation theory and error analysis, illustrating our results with computational experiments. Finally, we point out the limitations of this approach.
Using MPI, 2nd edition, William Gropp, Ewing Lusk, and Anthony Skjellum, MIT Press, 1999. Using MPI is a completely up-to-date version of the authors' 1994 introduction to the core functions of MPI. It adds material on the new C++ and Fortran 90 bindings for MPI throughout the book. It contains greater discussion of datatype extents, the most frequently misunderstood feature of MPI-1, as well as material on the new extensions to basic MPI functionality added by the MPI-2 Forum in the area of MPI datatypes and collective operations.
Using MPI-2, William Gropp, Ewing Lusk, and Rajeev Thakur, MIT Press, 1999. Using MPI-2 covers the new extensions to basic MPI. These include parallel I/O, remote memory access operations, and dynamic process management. The volume also includes material on tuning MPI applications for high performance on modern MPI implementations.
MPI: The Complete Reference - 2nd Edition, Volume 2 - The MPI-2 Extensions, William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir, MIT Press, 1998. Since its release in summer 1994, the Message Passing Interface (MPI) specification has become a standard for message-passing libraries for parallel computations. There exist more than a dozen implementations on a variety of computing platforms, from the IBM SP-2 supercomputer to PCs running Windows NT. The MPI Forum, which has continued to work on MPI, has recently released MPI-2, a new definition that includes significant extensions, improvements, and clarifications. This volume presents a complete specification of the MPI-2 Standard. It is annotated with comments that clarify complicated issues, including why certain design choices were made, how users are intended to use the interface, and how they should construct their version of MPI. The volume also provides many detailed, illustrative programming examples.
Using MPI: Portable Parallel Programming with the Message Passing Interface, William Gropp, Ewing Lusk, and Anthony Skjellum, MIT Press, 1994. This book, now replaced by Using MPI, second ed., introduced the core functions of the Message Passing Interface.
W. Gropp, E. Lusk, and D. Swider, "Improving the Performance of MPI Derived Datatypes," Proceedings of the 3rd Annual MPI Developers and Users Conference, ed. A. Skjellum, P. V. Bangalore, and Y. S. Dandass, Starkville, Mis sissippi, March 10-12, 1999, pp. 25-30. Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool, but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format, and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java, and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user's attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing very many events.
R. Thakur, W. Gropp, and E. Lusk, "On Implementing MPI-IO Portably and with High Performance," Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, 1999, 23-32 (Preprint ANL/MCS-P732-1098). We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Performance Modeling and Tuning of an Unstructured Mesh CFD Application," Preprint ANL/MCS-P833-0700, July 2000.This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific machines, the SGI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on partial differential equations, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to traditional flop-orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation.
H.G. Kaper and H. Nordborg, "The Frozen-Field Approximation and the Ginzburg-Landau Equations of Superconductivity," J. Engineering Mathematics 39 (2001) 221-240. The Ginzburg-Landau (GL) equations of superconductivity provide a computational model for the study of magnetic flux vortices in type-II superconductors. In this article we show through numerical examples and rigorous mathematical analysis that the GL model reduces to the frozen-field model when the charge of the Cooper pairs (the superconducting charage ) goes to zero while the applied field stays near the upper critical field.
E. Dolan, P. Hovland, J. More', B. Norris, and B. Smith, "Remote Access to Mathematical Software," Preprint ANL/MCS-P889-0601, June 2001. The network-oriented application services paradigm is becoming increasingly common for scientific computing. The popularity of this approach can be attributed to the numerous advantages to both user and developer provided by network-enabled mathematical software. The burden of installing and maintaining complex systems is lifted from the user, while enabling developers to provide frequent updates without disrupting service. Access to software with similar functionality can be unified under the same interface. Remote servers can utilize potentially more powerful computing resources than may be available locally. We discuss some of the application services developed by the Mathematics and Computer Science Division at Argonne National Laboratory, including the Network Enabled Optimization System (NEOS) Server and the Automatic Differentiation of C (ADIC) Server, as well as preliminary work on Web access to the Portable Extensible Toolkit for Scientific Compuoting (PETSc). We also provide a brief survey of related work.
R. Thakur and W. Gropp, "Parallel I/O," in CRPC Handbook of Parallel Computing, J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White, Morgan Kaufmann Publishers, Inc. (to appear). Also Preprint ANL/MCS-P837-0700, July 2000. Many parallel applications need to access large amounts of data. In such applications, the I/O performance can play a significant role in the overall time to completion. Although I/O is always much slower than computation, it is still possible to achieve good I/O performance in parallel applications by using a combination of sufficient amount of high-speed I/O hardware, appropriate file-system software, appropriate API for I/O, a high-performance implementation of the API, and by using that API the right way. We explain in further detail. [from a preliminary draft of the CRPC Handbook of parallel Computing]
P. M. Dickens and R. Thakur, "On Implementing High-Performance Collective I/O," Preprint ANL/MCS-P852-0700, November 2000. In many parallel applications, the I/O requirements quite often present a significant obstacle in the way of achieving good performance. An important area of current research is collective I/O, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, thereby making a better use of the I/O subsystem. While many studies have been presented in the literature showing excellent results using collective I/O techniques, there has been little discussion of implementation techniques for collective I/O operations and the impact on performance of varioius implementation strategies. A closely related issue, which has not yet received sufficient attention, is whether the high cost of I/O can be further reduced by executing the collective I/O operation in the background, thus overlapping its execution with other computation occurring in the foreground. In this paper, we investigate both of these important issues. First, we explore the issues that arise in the implementation of a collective I/O library and show the impact on performance resulting from various implementation strategies. To make these results as general as possible, we investigate the performance of collective I/O implementations on four different parallel architectures: the IBM SP2, the Intel Paragon, the HP Exemplar, and the SGI Origin2000. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized collective I/O implementatin provides excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all computational platforms. Next, we explore the issues that arise in the implementation of thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve the performance of thread-based collective I/O, we develop an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this new technique can provide significant performance gains, offering up to 50% improvement when compared with implementations that do not attempt to overlap collective I/O and computation.
A. Roy, I. Foster, W. Gropp, N. Karonis, V. Sander, and B. Toonen, "MPICH-GQ: Quality-of-Service for Message Passing Programs," Preprint ANL/MCS-P838-0700, July 2000. Parallel programmers typically assume that all resources required for a program's execution are dedicated to that purpose. However, in local and wide area networks, contention for shared networks, CPUs, and I/O systems can result in significant variations in availability, with consequent adverse effects on overall performance. We describe a new message-passing architecture, MPICH-GQ, that uses quality of service (QoS) mechanisms to manage contention and hence improve performance of message passing interface (MPI) applications. MPICH-GQ combines new QoS specification, traffic shaping, QoS reservation, and QoS implementation techniques to deliver QoS capabilities to the high-bandwidth bursty flows, complex structures, and reliable protocols used in high-performance application---characteristics very different from the low-bandwidth, constant bit-rate media flows and unreliable protocols for which QoS mechanisms were designed. Results obtained on a differentiated services testbed demonstrate our ability to maintain application performance in the face of heavy network contention.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, B. F. Smith, "Understanding the Parallel Scalability of an Implicit Unstructured Mesh CFD Code," Preprint ANL/MCS-P845-0900, Sept. 2000. In this paper, we identify the scalability bottlenecks of an unstructured grid CFD code (PETSc-FUN3D) by studying the impact of several algorithmic and architectural parameters and by examining different programming models. We discuss the basic performance characteristics of this PDE code with the help of simple performance models developed in our earlier work, presenting primarily experimental results. In addition to achieving good per-processor performance (which has been addressed in our cited work and without which scalability claims are suspect) we strive to improve the implementation and convergence scalability of PETSc-FUN3D on thousands of processors.
R. Butler, W. Gropp, and E. Lusk, "A Scalable Process-Management Environment for Parallel Programs," Preprint ANL/MCS-P812-0400, April 2000, We present a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising a thousand processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and beter runtime management of MPICH jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. MPD is implemented and freely distributed with MPICH.
P. D. Hovland and L. C. McInnes, "Parallel Simulation of Compressible Flow Using Automatic Differentiation and PETSc," Preprint ANL/MCS-P796-0200, November 2000. Many aerospace applications require parallel implicit solution strategies and software. We consider the use of two computational tools, PETSc and ADIFOR, to implement a Newton-Krylov-Schwarz method with pseudo-transient continuation for a particular application, namely, a steady-state, fully implicit, three-dimensional compressible Euler model of flow over an M6 wing. We describe how automatic differentiation (AD) can be used within the PETSc framework to compute the required derivatives. We present performance data demonstrating the suitability of AD and PETSc for this problem. We conclude with a synopsis of our results and a description of opportunities for future work.
T. Iliescu, "Genuinely Nonlinear Models for Convection-dominated Problems," Preprint ANL/MCS-P857-1100, November 2000. This paper introduces a general, nonlinear subgrid-scale (SGS) model, having bounded artificial viscosity, for the numerical simulation of convection-dominated problems. We also present a numerical comparison (error analysis and numerical experiments) between this model and the most common SGS model of Smagorinsky, which uses a p-Laplacian regularization. The numerical experiments for the 2-D convection-dominated convection-diffusion test problem show a clear improvement in solution quality for the new SGS model. This improvement is consistent with the bounded amount of artificial viscosity introduced by the new SGS model in the sharp transition regions.
S. L. Lee and P. D. Hovland, "Sensitivity Analysis Using Parallel ODE Solvers an d Automatic Differentiation in C: SensPVODE and ADIC," Preprint P818-0500, May 2000. PVODE is a high-performance ordinary differential equation solver for the types of initial value problems (IVPs) that arise in large-scale computational simulations. Often, one wants to compute sensitivities with respect to certain parameters in the IVP. We discuss the use of automatic differentiation (AD) to compute these sensitivities in the context of PVODE. Results on a simple test problem indicate that the use of AD-generated derivative code can reduce the time to solution over finite difference approximations.
< name="seaice"> J. G. Kim and P. D. Hovland", "Sensitivity Analysis and Parameter Tuning of a Sea-Ice Model," Preprint ANL/MCS-P819-0500, May 2000. The values of many of the parameters in climate models are often not known with any great precision. We describe the use of automatic differentiation to examine the sensitivity of an uncoupled dynamic-thermodynamic sea-ice model to various parameters. We also illustrate the effectiveness of using these sensitivity derivatives with an optimization algorithm to tune the parameters to maximize the agreement between simulated results and observational data.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Performance Modeling and Tuning of an Unstructured Mesh CFD Application," Preprint ANL/MCS-P833-0700, July 2000. This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific, the SCI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on PDEs, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to traditional flop orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation. P. Dickens and R. Thakur, "On Implementing High-Performance Collective I/O," Preprint ANL/MCS-P849-1000, Argonne National Laboratory, September 2000. Java is quickly becoming the preferred language for writing distributed applications because of its inherent support for programming on distributed platforms. In particular, Java provides compile-time and run-time security, automatic garbage collection, inherent support for multithreading, support for persistent objects and object migration, and portability. Given these significant advantages of Java, there is a growing interest in using Java for high-performance computing applications. To be successful in the high-performance computing domain, however, Java must have the capability to efficiently handle the significant I/O requirements commonly found in high-performance computing applications. While there has been significant research in high-performance I/O using languages such as C, C++, and Fortran, there has been relatively little research into the I/O capabilities of Java. In this paper, we evaluate the I/O capabilities of Java for high-performance computing. We examine several approaches that attempt to provide high-performance I/O--many of which are not obvious at first glance--and investigate their performance in both parallel and multithreaded environments. We also provide suggestions for expanding the I/O capabilities of Java to better support the needs of high-performance computing applications.
H. M. Bucker, K. R. Buschelman, and P. D. Hovland, "A Matrix-Matrix Multiplication Approach to the Automatic Differentiation and Parallelization of Straight-Line Codes," Preprint ANL/MCS-P854-1000, Argonne National Laboratory, October 2000. A straight-line code, which consists of assignment, addition, and multiplication statements, is an abstraction of a serial computer program to compute a function with n inputs. Given a serial straight-line code with N statements, we derive an algorithm that automatically evaluates not only the function but also its first-order derivatives with respect to the n inputs on a parallel computer. The basic idea of the algorithm is to marry automatic computation of derivatives with automatic parallelization of serial programs.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD," Preprint ANL/MCS-P850-1000, October 2000. This paper presents some strategies that have proved effective in tolerating the latencies arising from the hierarchical memory system and networ. We compare the different programming models from a performance standpoint. The demonstration code, PETSc-FUN3D, solves the Euler and Navier-Stokes equations of fluid flow in incompressible and compressible forms with second-order flux-limited characteristics-based convection schemes and Galerkin-type diffusion on unstructured meshes. The solution algorithm is pseudo-transient Newton-Krylov-Schwartz with block-incomplete factorization on each subdomain of the Schwarz preconditioner and with varying degrees of overlap.
K. R. Buschelman, W. D. Gropp, L. C. McInnes, and B. F. Smith, "PETSc and Overture: Lessons Learned Developing an Interface between Components, " Preprint ANL/MCS-P858-1100, November 2000. We consider two software packages that interact with each other as components: Overture and PETSc. An interface between these two packages could be of tremendous value to application developers in that Overture provides a simple mechanism for generating the large, sparse systems of linear equations that correspond to discretizations of a PDE, and PETSc provides a powerful collection of methods for solving these systems. Two types of interfaces are discussed: the internal interface between components, and the external interface for the application developer. We compare three basic approaches to developing the internal interface between Overture and PETSc, the final one of which is a peer-to-peer model.
B. Norris and P. D. Hovland, "An Application Server for Automatic Differentiation," Preprint ANL/MCS-P856-1100, November 2000. The ADIC Application Server brings the accuracy and efficiency of automatic differentiation to the World Wide Web. Users of the ADIC Application Server can upload source code written in ANSI-C, manage remote files, differentiate selected functions, and download code augmented with derivative computations. Using a simple driver and linking to the appropriate libraries, the user can compile and run the differentiated code locally. We discuss the unique requirements for an automatic differentiation application server and describe the implementation of the ADIC Application Server.
P. D. Hovland, D. E. Keyes, L. C. McInnes, and W. Samyono, "Using Automatic Differentiation for Second-Order Matrix-Free Methods in PDE-constrained Optimization," Preprint ANL/MCS-P855-1100, November 2000. Classical methods of constrained optimization are often based on the assumption that projection onto the constraint manifold is routine but accessing second-derivative information is not. Both assumptions need revision for the application of optimization to systems constrained by PDEs, in the contemporary limit of millions of state variables and in the parallel setting. Large-scale PDE solvers are complex pieces of software that exploit detailed knowledge of architecture and application and cannot easily be modified to fit the interface requirements of a blackbox optimizer. Furthermore, in view of the expense of PDE analyses, optimization methods not using second derivatives may require too many iterations to be practical. For general problems, automatic differentiation is likely to be the most convenient means of exploiting second derivatives. We delineate a role for automatic differentiation in matrix-free optimization formulations involving Newton's method, in which little more storage is required than that for the analysis code alone.
J. J. More', "Automatic Differentiation Tools in Optimization Software," Preprint ANL/MCS-P859-1100, November 2000. We discuss the role of automatic differentiation tools in optimization software. We emphasize issues that are important to large-scale optimization and that have proved useful in the installation of nonlinear solvers in the NEOS Server. Our discussion centers on the computation of the gardient and Hessian matrix for partially separable functions and shows that the gradient and Hessian matrix can be computed with guaranteed bounds in time and memory requirements.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "High-Performance Parallel Implicit CFD," Preprint ANL/MCS-P863-1200, December 2000. Fluid dynamical simulations based on finite discreteizations on (quasi-)static grids scale well in parallel, but execute at a disappointing percentage of per-processor peak floating point operation rates withiout special attention to layout and access ordering of data. We document both claims from our experience with an unstructured grid CFD code that is typical of the state of the practice at NASA. These basic performance characteristics of PDE-based codes can be understood with surprisingly simple models, for which we quote earlier work, presenting primarily experimental results herein. These perormance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per node performance. This snapshot of ongoing work updates our 1999 Bell Prize-winning simulation on ASCI computers.
S. J. Wright, "Constraint Identification and Algorithm Stabilization for Degenerate Nonlinear Programs," Preprint ANL/MCS-P865-1200, Devember 2000. In the vicinity of a solution of a nonlinear programming problem at which both strict complementarity and linear independence of the active constraints may fail to hold, we describe a technique for distinguishing weakly active from strongly active constraints. We show that this information can be used to modify the sequential quadratic programming algorithim so that it exhibits superlinear convergence to the solution under assumptions weaker than those made in previous analyses.
S. J. Benson, L. C. McInnes, and J. J. More', "GPCG: A Case Study in the Performance and Scalability of Optimization Algorithms." Preprint ANL/MCS-P768-0799, September 2000. PCG is an algorithm within the Toolkit for Advanced Optimization (TAO) for solving bound constrained, convex quadratic problems. Originally developed by More' and Toraldo, this algorithm was designed for large-scale problems but had been implemented only for a single processor. The TAO implementation is available for a wide range of high-performance architecture, and has been tested on up to 64 processors to solve problems with over 2.5 million variables.
E. D. Dolan and J. J. More', "Benchmarking Optimization Software with COPS," Technical Memo ANL/MCS-TM-246, November 2000. We describe version 2.0 of the COPS set of nonlinearly constrained optimization problems. We have added new problems, as well as streamlined and improved most of the problems. We also provide a comparison of the LANCELOT, LOQO, MINOS, and SNOPT solvers on these problems.
M. C. Ferris and T. S. Munson, "Semismooth Support Vector Machines," Preprint ANL/MCS-P860-1100, November 2000. The linear support vector machine can be posed as a quadratic program in a variety of ways. In this paper, we look at a formulation using the two-norm for the misclassification error that leads to a positive definite quadratic program with a single equality constraint when the Wolfe dual is taken. The quadratic term is a small rank update to a positive definite matrix. We reformulate the optimality conditions as a semismooth system of equations using the Fischer-Burmeister function and apply a damped Newton method to solve the resulting problem. The algorithm is shown to converge from any starting point with a Q-quadratic rate of convergence. At each iteration, we use the Sherman-Morrison-Woodbury update formula to solve a single linear system of equations. Significant computational savings are realized as the inactive variables are identified and exploited during the solution process. Results for a 60 million variable problem are presented, demonstrating the effectiveness of the proposed method on a personal c |