How we solved nug30
The solution of nug30 has been made possible by a tight collaboration between experts in computational optimization and experts in massively distributed computing. In order to succeed we needed the following ingredients :
- A state-of-the-art sequential solver.
- A powerful computational pool.
- Tools to easily distribute and manage the computations.
- A careful parallel implementation
A state-of-the-art sequential solver
The quadratic assignment problem is among the most complex combinatorial optimization problems. It belongs
to the class of NP complete problems and therefore the time spent by any exact algorithm will grow
exponentially with the size of the problem.
The simplest (and worst) algorithm to solve quadratic assignment problems consists of enumerating all possible
assignments. This enumeration process requires n! evaluations and is not practical for even
small values of n. If we were able to evaluate a trillion assignments per second it would take around
140 times the age of the universe to solve nug30 to optimality !
Researchers have been developing more advanced techniques to cut down the combinatorial complexity of
the problem. The method of choice is the so-called branch-and-bound tree search technique. It consists
of dividing the initial search space in smaller pieces and bounding what could be the best possible
solution that one could find in each of this smaller regions. For the quadratic assignment problem,
there exists several ways of branching and bounding.
In order to
minimize the total amount of time required by the algorithm a trade-off have to be found between
the quality and the speed of these decisions. Mediocre decisions will perform very well
for small problems but will not cut the complexity for large problems. Excellent decisions
will also be too expensive for large problems.
The classic bounds for the quadratic assignment problem are (ranked by increasing complexity):
- Gilmore-Lawler bound
- Projected eigenvalue bound
- Dual LP bounds
- Semi-definite programming bounds
In our branch-and-bound implementation we are using the new quadratic programming bound developed
by Kurt Anstreicher and Nate Brixius. This bound has an excellent cost/quality ratio and was the
best candidate to tackle nug30. Details on this bound can be found in :
A New Bound for the Quadratic Assignment Problem Based on Convex Quadratic Programming K. M. Anstreicher, N. W. Brixius, 1999.
Advanced branching techniques have been developed such that the exponential growth of the computational
time is better than any other known algorithm. The sequential algorithm has succesfully solved problems
of size up to 24 in record breaking time.
The different branching rules and the sequential branch-and-bound are described in :
Solving Quadratic Assignment Problems Using Convex Quadratic Programming RelaxationsN. W. Brixius, K. M. Anstreicher, 2000.
A powerful computational pool
Our sequential solver was unfortunately not powerfull enough to solve nug30 in a reasonable amount of time. Estimates to solve this problem with the best of our desktop workstation were around 7 years (slighly less than the age of the universe though...). As other teams, we decided to work on a parallel implementation of the branch-and-bound algorithm to reduce this wall clock time. With 100 equivalent processors it would still have taken around a month to solve nug30 to optimality. Considering the lack of flexibility of big
iron supercomputer it sounded difficult for us to obtain the necessary computational power.
The dramatic improvements in the performance of networks and in the computational power of individual workstations/computers makes it possible to assemble geographically dispersed into powerful "metacomputers" or "computational grid". By using the untapped power of hundreds of workstations sitting idle we have been able to build
the powerful and flexible platform needed to solve nug30.
Tools to distribute and manage the computations
However metacomputing platforms are much more complex to use than classical big iron computers. Resources are potentially much larger but they may go away at any time without any notice, their number can vary considerably over time and the connectivity between resources may be very bad.
Therefore advanced tools are needed to draw the raw power provided by metacomputing platforms. The Condor system developed at the University of Wisconsin is used to detect idle workstations and match these available resources with Condor users requests. In order to parallelize algorithms on the
fault prone and dynamic platforms provided by Condor we have used the MW master-worker system described in details in the
following paper :
An Enabling Framework for Master-Worker Computing Applications on the Computational Grid. Jean-Pierre Goux, Sanjeev Kulkani, Jeff Linderoth and Michael Yoder, 2000.
MW is also able to handle supercomputing resources joining the computational pool during the course of the computations. This is made possible by the glide-in tool developed by the Condor and Globus teams.
The robust, scalable and easy-to-use MW framework helped us to quickly develop a distributed version of our sequential branch-and-bound solver.
A careful parallel implementation
The dynamic nature of metacomputing platforms makes it difficult to use resources efficiently. If the units of
work sent to the machines participating to the computation is too small, the master machine will be overhelmed
and the machines will wait for new units of work. If the units of work are too large, it is likely that the
machines will disappear. Therefore we develop careful efforts to tune our parallel strategy parameters. In the NUG30
run, we used machines with 93% of efficiency, this result is excellent considering that there were in
average 650 geographically distributed machines participating. Details on the computational
pool used for nug30 can be found in NUG30 run page.
metaneos@mcs.anl.gov
Last modified: Mon Jul 3 23:17:19 CDT 2000