On Wed, 16 Apr 2008, Ben Tay wrote:
Hi Satish, thank you very much for helping me run the ex2f.F code.
I think I have a clearer picture now. I believe I'm running on a
Dual-Core Intel Xeon 5160. The quad core is only on atlas3-01 to 04,
and there are only 4 of them. I guess that the lower peak is because
I'm using the Xeon 5160, while you are using the Xeon X5355.
I'm still a bit puzzled. I just ran the same binary on a machine with
2 dual-core Xeon 5130s [which should be similar to your 5160 machine]
and got the following:
[balay@n001 ~]$ grep MatMult log*
log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364
log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615
log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656
[balay@n001 ~]$
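One rough way to read these three lines is to turn them into speedups. The sketch below is only an illustration: it copies the MatMult counts and max times from the grep output above, and it assumes that dividing the time by the call count is a fair normalization, since the three runs made different numbers of MatMult calls. The variable and function names are made up for this example.

```python
# MatMult call count and max time (sec), copied from the log lines above.
# Key = number of processes (log.1, log.2, log.4).
matmult = {
    1: (1192, 1.0591e+01),
    2: (1217, 6.3982e+00),
    4: (969, 4.7780e+00),
}


def time_per_call(nprocs):
    """Average time of one MatMult call for the given run."""
    count, seconds = matmult[nprocs]
    return seconds / count


# Speedup relative to the 1-process run, and efficiency = speedup / nprocs.
base = time_per_call(1)
speedup = {n: base / time_per_call(n) for n in matmult}
efficiency = {n: speedup[n] / n for n in matmult}

for n in sorted(matmult):
    print(f"np={n}: speedup {speedup[n]:.2f}, efficiency {efficiency[n]:.2f}")
```

Normalized this way, the speedups come out close to the ratios of the Mflop/s figures in the last column (364, 615, 656), which is a useful sanity check on the arithmetic.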
You mentioned the speedups for MatMult and compared KSPSolve as well.
Are these the only things we have to look at? I see that some other
events, such as VecMAXPY, also take up a sizable % of the time. To get
an accurate speedup, do I just compare the time taken by KSPSolve
between different numbers of processors, or do I have to look at other
events such as MatMult as well?
Sometimes we look at individual components like MatMult() and VecMAXPY()
to understand what's happening in each stage - and at KSPSolve() to
look at the aggregate performance for the whole solve [which includes
MatMult, VecMAXPY, etc.]. Perhaps I should have also looked at
VecMDot() - at 48% of runtime, it's the biggest contributor to
KSPSolve() for your run.
It's easy to get lost in the details of log_summary. Looking for
anomalies is one thing. Plotting scalability charts for the solver is
something else.
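To compare events such as MatMult, VecMAXPY, VecMDot and KSPSolve across runs without reading every column by eye, the leading fields of each event line can be pulled out with a few lines of Python. This is a minimal sketch, assuming the usual log_summary column order (event name, count max, count ratio, time max, time ratio); the EventTimes and parse_event names are hypothetical. It reads only the first five fields because the later percentage columns can run together (as in "11100100" above).

```python
from typing import NamedTuple


class EventTimes(NamedTuple):
    name: str
    count: int          # max call count over processes
    count_ratio: float  # max/min call count
    max_time: float     # max time over processes, in seconds
    time_ratio: float   # max/min time (the load-balance indicator)


def parse_event(line):
    """Parse the first five fields of one log_summary event line."""
    # Drop a "log.N:" prefix that grep adds when searching several files.
    if ":" in line.split()[0]:
        line = line.split(":", 1)[1]
    fields = line.split()
    return EventTimes(
        fields[0],
        int(fields[1]),
        float(fields[2]),
        float(fields[3]),
        float(fields[4]),
    )


ev = parse_event(
    "log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 "
    "0.0e+00 14 11100100 0 14 11100100 0 656"
)
print(ev.name, ev.count, ev.max_time, ev.time_ratio)
```

Collecting the same event from log.1, log.2 and log.4 this way gives the raw numbers for a scalability chart, while the time_ratio field is the one to watch for anomalies.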
In summary, due to load imbalance, my speedup is quite bad. So maybe
I'll just send your results to my school's engineer and see if they
can do anything.
For my part, I guess I'll just have to wait?
Yes - load imbalance at the MatMult level is bad. On the 4 proc run you
have ratio = 3.6. This implies that one of the MPI tasks is 3.6 times
slower than another task [so all speedup is lost here].
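To make the effect of such a ratio concrete, here is a small sketch. The per-process times are hypothetical, chosen only so that max/min = 3.6 as in the 4 proc run; the real per-process times are not in the log. It assumes every process has to wait for the slowest one, so the useful fraction of the run is roughly mean time divided by max time.

```python
def imbalance_ratio(times):
    """Max/min of per-process times; 1.0 means perfectly balanced."""
    return max(times) / min(times)


def balance_efficiency(times):
    """Fraction of the wall time spent on useful work, assuming every
    process waits for the slowest one (mean time / max time)."""
    return (sum(times) / len(times)) / max(times)


# Hypothetical MatMult times (sec) on 4 processes: three fast, one slow.
times = [1.0, 1.0, 1.0, 3.6]

print(f"ratio      = {imbalance_ratio(times):.1f}")
print(f"efficiency = {balance_efficiency(times):.2f}")
```

With these made-up numbers less than half of the machine does useful work during MatMult, which is why a ratio far from 1.0 wipes out the speedup.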
You could try the latest mpich2 [1.0.7] - just for this SMP
experiment, and see if it makes a difference. I've built mpich2 with
[default gcc/gfortran and]:
./configure --with-device=ch3:nemesis:newtcp --with-pm=gforker
There could be something else going on with this machine that's messing
up load balance for a basic petsc example.
Satish