Re: Slow speed after changing from serial to parallel



Hi,

Here's the summary for 1 processor. It also seems to take a long time. Can someone tell me where my mistakes might lie? Thank you very much!

************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************


---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 00:39:22 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b


                        Max       Max/Min        Avg      Total
Time (sec):           1.088e+03      1.00000   1.088e+03
Objects:              4.300e+01      1.00000   4.300e+01
Flops:                2.658e+11      1.00000   2.658e+11  2.658e+11
Flops/sec:            2.444e+08      1.00000   2.444e+08  2.444e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       1.460e+04      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops


Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0877e+03 100.0%  2.6584e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.460e+04 100.0%


------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------



    ##########################################################
    #                                                        #
    #                          WARNING!!!                    #
    #                                                        #
    #   This code was run without the PreLoadBegin()         #
    #   macros. To get timing results we always recommend    #
    #   preloading. otherwise timing numbers may be          #
    #   meaningless.                                         #
    ##########################################################


Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------


--- Event Stage 0: Main Stage

MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11 0 0 0 12 11 0 0 0 216
MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107
MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 88
MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 7.2e+03 52 72 0 0 49 52 72 0 0 49 341
KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 1.5e+04 93100 0 0100 93100 0 0100 262
PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 44
PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107
VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 7.2e+03 25 36 0 0 49 25 36 0 0 49 359
VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 2 0 0 51 2 2 0 0 51 374
VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 345
VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 206
VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 29 38 0 0 0 29 38 0 0 0 324
VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 4 0 0 51 2 4 0 0 51 364
------------------------------------------------------------------------------------------------------------------------


Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

Matrix 2 2 65632332 0
Krylov Solver 1 1 17216 0
Preconditioner 1 1 168 0
Index Set 3 3 5185032 0
Vec 36 36 120987640 0
========================================================================================================================
Average time to get PetscTime(): 3.09944e-07
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
-----------------------------------------
Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
Using PETSc arch: atlas3-mpi
-----------------------------------------
Using C compiler: mpicc -fPIC -O
Using Fortran compiler: mpif90 -I. -fPIC -O
-----------------------------------------
Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
------------------------------------------
Using C linker: mpicc -fPIC -O
Using Fortran linker: mpif90 -I. -fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
-Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
------------------------------------------
639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (20major+172979minor)pagefaults 0swaps
Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary


TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 00:39:23
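
As a rough cross-check, the two summaries in this thread can be compared per iteration rather than by total time. The sketch below is plain Python, not part of the Fortran code. It uses only the KSPSolve times and MatMult counts printed in the one-processor summary above and the two-processor summary quoted further down, and it treats the MatMult count as a proxy for the GMRES iteration count, which is an assumption.

```python
# Figures copied from the two -log_summary outputs in this thread.
runs = {
    1: {"ksp_solve_s": 1.0144e+03, "matmult_count": 7412},
    2: {"ksp_solve_s": 9.9177e+02, "matmult_count": 8776},
}


def seconds_per_matmult(run):
    """KSPSolve wall time divided by the number of MatMult calls."""
    return run["ksp_solve_s"] / run["matmult_count"]


per_iter_1 = seconds_per_matmult(runs[1])
per_iter_2 = seconds_per_matmult(runs[2])

# Growth in the iteration count when going from 1 to 2 processes.
iteration_growth = runs[2]["matmult_count"] / runs[1]["matmult_count"]

# Speedup of a single iteration; 2.0 would be ideal on 2 processes.
per_iteration_speedup = per_iter_1 / per_iter_2

print(f"1 process:   {per_iter_1:.4f} s per MatMult")
print(f"2 processes: {per_iter_2:.4f} s per MatMult")
print(f"iteration count grows by a factor of {iteration_growth:.3f}")
print(f"per-iteration speedup: {per_iteration_speedup:.2f}")
```

Read this way, each iteration is only modestly faster on two processes while more iterations are needed, which would explain why the total solve time barely changes.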



Barry Smith wrote:

It is taking 8776 iterations of GMRES! How many does it take on one process? This is a huge
amount.


MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217
MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120


One process is spending 2.9 times as long in the embarrassingly parallel MatSolve as the other process;
this indicates a huge imbalance in the number of nonzeros on each process. As Matt noticed, the partitioning
between the two processes is terrible.
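
To put a number on that ratio: the Ratio column is the maximum time over all processes divided by the minimum, so the time on the faster process can be estimated by dividing the printed maximum by the ratio. The Python sketch below does this for a few rows of the two-process summary; the ratios are printed to one decimal only, so the results are approximate.

```python
def fastest_process_time(max_time_s, ratio):
    """Invert Ratio = max / min to estimate the time on the fastest process."""
    return max_time_s / ratio


# (max time in seconds, Ratio) copied from the two-process summary.
events = {
    "MatSolve": (2.8379e+02, 2.9),
    "VecMAXPY": (2.6692e+02, 2.7),
    "VecNorm": (1.8237e+02, 10.2),
}

for name, (max_time_s, ratio) in events.items():
    min_time_s = fastest_process_time(max_time_s, ratio)
    print(f"{name:9s} slowest {max_time_s:7.1f} s, fastest about {min_time_s:6.1f} s")
```

For events that contain a global reduction, such as VecNorm, a large gap usually reflects one process waiting for the other rather than extra arithmetic.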


  Barry

On Apr 15, 2008, at 10:56 AM, Ben Tay wrote:
Oh sorry here's the whole information. I'm using 2 processors currently:

************************************************************************************************************************

*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************



---------------------------------------------- PETSc Performance Summary: ----------------------------------------------


./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b


                       Max       Max/Min        Avg      Total
Time (sec):           1.114e+03      1.00054   1.114e+03
Objects:              5.400e+01      1.00000   5.400e+01
Flops:                1.574e+11      1.00000   1.574e+11  3.147e+11
Flops/sec:            1.414e+08      1.00054   1.413e+08  2.826e+08
MPI Messages:         8.777e+03      1.00000   8.777e+03  1.755e+04
MPI Message Lengths:  4.213e+07      1.00000   4.800e+03  8.425e+07
MPI Reductions:       8.644e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops


Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 100.0%  4.800e+03      100.0%  1.729e+04 100.0%


------------------------------------------------------------------------------------------------------------------------

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------




    ##########################################################
    #                                                        #
    #                          WARNING!!!                    #
    #                                                        #
    #   This code was run without the PreLoadBegin()         #
    #   macros. To get timing results we always recommend    #
    #   preloading. otherwise timing numbers may be          #
    #   meaningless.                                         #
    ##########################################################


Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------



--- Event Stage 0: Main Stage

MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217
MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317
PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42
VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636
VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0
VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62
------------------------------------------------------------------------------------------------------------------------



Memory usage is given in bytes:

Object Type Creations Destructions Memory Descendants' Mem.

--- Event Stage 0: Main Stage

Matrix 4 4 49227380 0
Krylov Solver 2 2 17216 0
Preconditioner 2 2 256 0
Index Set 5 5 2596120 0
Vec 40 40 62243224 0
Vec Scatter 1 1 0 0
========================================================================================================================


Average time to get PetscTime(): 4.05312e-07
Average time for MPI_Barrier(): 7.62939e-07
Average time for zero size MPI_Send(): 2.02656e-06
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
-----------------------------------------
Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
Using PETSc arch: atlas3-mpi
-----------------------------------------
Using C compiler: mpicc -fPIC -O
Using Fortran compiler: mpif90 -I. -fPIC -O
-----------------------------------------
Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
------------------------------------------
Using C linker: mpicc -fPIC -O
Using Fortran linker: mpif90 -I. -fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
-Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
------------------------------------------
1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (28major+153248minor)pagefaults 0swaps
387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (18major+158175minor)pagefaults 0swaps
Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10
00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10



I have a Cartesian grid of 600x720. Since there are 2 processors, it is partitioned into 600x360 per processor. I just use:


      call MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr)

      call MatSetFromOptions(A_mat,ierr)

      call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)

      call KSPCreate(MPI_COMM_WORLD,ksp,ierr)

      call VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr)

total_k is actually size_x*size_y. Since it's 2D, the maximum number of values per row is 5. When you say setting off-process values, do you mean I insert values from one processor into another? I thought I was inserting the values on the correct processor...
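
To make the ownership question concrete, here is a small Python sketch of how a row split of this kind works out for the 600x720 grid. It assumes the split used with PETSC_DECIDE gives every rank total_k / size rows, with the first (total_k mod size) ranks getting one extra; the helper names `split_ownership` and `owner_of` are made up for this illustration, and size_x = 600 as the fast (i) direction is also an assumption.

```python
def split_ownership(global_rows, num_procs):
    """Contiguous half-open row ranges [start, end), one per rank.

    Each rank gets global_rows // num_procs rows, and the first
    (global_rows % num_procs) ranks get one extra row.
    """
    ranges = []
    start = 0
    for rank in range(num_procs):
        local_rows = global_rows // num_procs
        if rank < global_rows % num_procs:
            local_rows += 1
        ranges.append((start, start + local_rows))
        start += local_rows
    return ranges


def owner_of(index, ranges):
    """Rank whose range contains the given 0-based row or column index."""
    for rank, (start, end) in enumerate(ranges):
        if start <= index < end:
            return rank
    raise ValueError(f"index {index} is outside the matrix")


# Assumption: size_x = 600 is the fast (i) direction and size_y = 720.
size_x, size_y = 600, 720
total_k = size_x * size_y
ranges = split_ownership(total_k, 2)

for rank, (ksta_p, kend_p) in enumerate(ranges):
    print(f"rank {rank}: rows {ksta_p}..{kend_p - 1}, "
          f"{(kend_p - ksta_p) // size_x} grid lines of {size_x} cells")

# Last row owned by rank 0 and its "north" neighbour column (row + size_x).
row = ranges[0][1] - 1
north = row + size_x
print(f"row {row} is owned by rank {owner_of(row, ranges)}, "
      f"its north neighbour column {north} by rank {owner_of(north, ranges)}")
```

In this picture, a column that belongs to another rank (the north neighbour above) is an ordinary entry of a locally owned row. Off-process values would be calls to MatSetValues for rows outside the local range, which have to be sent to their owner during assembly.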

Thank you very much!



Matthew Knepley wrote:
1) Please never cut out parts of the summary. All the information is valuable,
and most times, necessary.


2) You seem to have huge load imbalance (look at VecNorm). Do you partition
the system yourself? How many processes is this?


3) You seem to be setting a huge number of off-process values in the matrix
(see MatAssemblyBegin). Is this true? I would reorganize this part.


 Matt

On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:

Hi,

I have converted the poisson eqn part of the CFD code to parallel. The grid
size tested is 600x720. For the momentum eqn, I used another serial linear
solver (nspcg) to prevent mixing of results. Here's the output summary:


--- Event Stage 0: Main Stage

MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217
MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
*MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0*
MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317
PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
*VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42*
*VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636*
VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
*VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0*
*VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0*
*VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62*


------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 4 4 49227380 0
Krylov Solver 2 2 17216 0
Preconditioner 2 2 256 0
Index Set 5 5 2596120 0
Vec 40 40 62243224 0
Vec Scatter 1 1 0 0
========================================================================================================================


Average time to get PetscTime(): 4.05312e-07
Average time for MPI_Barrier(): 7.62939e-07
Average time for zero size MPI_Send(): 2.02656e-06
OptionTable: -log_summary



The PETSc manual states that the ratio should be close to 1. Quite a few
events *(in bold)* have ratios well above 1, and the ratio for
MatAssemblyBegin is very large. What could be the cause?


I wonder if it has to do with the way I insert the matrix. My steps are:
(Cartesian grid, i loops faster than j, Fortran)

For matrix A and rhs

Insert left extreme cells values belonging to myid

if (myid==0) then

  insert corner cells values

  insert south cells values

  insert internal cells values

else if (myid==num_procs-1) then

  insert corner cells values

  insert north cells values

  insert internal cells values

else

  insert internal cells values

end if

Insert right extreme cells values belonging to myid

All these values are entered into a big_A(size_x*size_y,5) matrix. int_A
stores the position of the values. I then do


  call MatZeroEntries(A_mat,ierr)

  do k=ksta_p+1,kend_p   !for cells belonging to myid

      do kk=1,5

          II=k-1

          JJ=int_A(k,kk)-1

          call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)

      end do

  end do

  call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)

  call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)


I wonder if the problem lies here. I used the big_A matrix because I was
migrating from an old linear solver. Lastly, I was told to widen my window
to 120 characters. May I know how to do that?
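
One way to check whether the insertion loop sets off-process rows is to replay its index arithmetic outside PETSc. The Python sketch below mimics `do k=ksta_p+1,kend_p` with `II=k-1` and counts rows that fall outside the owned range [ksta_p, kend_p), relying on MatGetOwnershipRange returning the first owned row and one past the last owned row. The ownership ranges are assumed values for an even split of the 600x720 grid over 2 processes, and the helper names are made up for this illustration.

```python
def rows_set_by_loop(ksta_p, kend_p):
    """0-based rows II touched by `do k=ksta_p+1,kend_p` with II = k-1."""
    return [k - 1 for k in range(ksta_p + 1, kend_p + 1)]


def count_off_process(rows, ksta_p, kend_p):
    """Number of rows that fall outside the locally owned range [ksta_p, kend_p)."""
    return sum(1 for row in rows if not (ksta_p <= row < kend_p))


# Assumed ownership ranges for a 600x720 grid split evenly over 2 processes.
ownership = {0: (0, 216000), 1: (216000, 432000)}

results = {}
for rank, (ksta_p, kend_p) in ownership.items():
    rows = rows_set_by_loop(ksta_p, kend_p)
    results[rank] = (len(rows), count_off_process(rows, ksta_p, kend_p))
    print(f"rank {rank}: {results[rank][0]} rows set, {results[rank][1]} off-process")
```

If the real ranges behave like this, the loop only touches local rows, and the large MatAssemblyBegin time would more likely come from one process reaching the call long before the other. The 0.0e+00 in the Mess column for MatAssemblyBegin points the same way, but this is an inference from the log and not something confirmed in the thread.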




Thank you very much.

Matthew Knepley wrote:


On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:



Hi Matthew,

I think you've misunderstood what I meant. What I'm trying to say is that
initially I had a serial code. I tried to convert it to a parallel one. Then
I tested it and it was pretty slow. Due to some work requirement, I needed to
go back and make some changes to my code. Since the parallel one was not
working well, I updated and changed the serial one.

Well, that was a while ago and now, due to the updates and changes, the
serial code is different from the old converted parallel code. Some files
were also deleted and I can't seem to get it working now. So I thought I
might as well convert the new serial code to parallel. But I'm not very sure
what I should do 1st.

Maybe I should rephrase my question: if I just convert my poisson
equation subroutine from a serial PETSc to a parallel PETSc version, will it
work? Should I expect a speedup? The rest of my code is still serial.



You should, of course, only expect speedup in the parallel parts.

Matt
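
The point about speedup being limited to the parallel parts can be quantified with Amdahl's law. The Python sketch below is only an illustration: it assumes the 93% KSPSolve share reported in the one-processor summary at the top of this message is the only part that is parallelized and that it scales perfectly, which the two-processor run shows is not the case.

```python
def amdahl_speedup(parallel_fraction, num_procs):
    """Ideal speedup when only `parallel_fraction` of the run time scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_procs)


# Assumption: the 93% KSPSolve share from the one-processor summary is the
# only part that is parallelized, and it scales perfectly.
parallel_fraction = 0.93

for num_procs in (1, 2, 4, 8):
    bound = amdahl_speedup(parallel_fraction, num_procs)
    print(f"{num_procs} processes: at most {bound:.2f}x")
```

These are upper bounds; load imbalance and a weaker parallel preconditioner push the real figure lower.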




Thank you very much.



Matthew Knepley wrote:




I am not sure why you would ever have two codes. I never do this. PETSc
is designed to write one code to run in serial and parallel. The PETSc
part should look identical. To test, run the code you have verified in
serial and output PETSc data structures (like Mat and Vec) using a binary
viewer. Then run in parallel with the same code, which will output the same
structures. Take the two files and write a small verification code that
loads both versions and calls MatEqual and VecEqual.

Matt

On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:





Thank you Matthew. Sorry to trouble you again.

I tried to run it with -log_summary output and I found that there are some
errors in the execution. Well, I was busy with other things and I just came
back to this problem. Some of my files on the server have also been deleted.
It has been a while and I remember that it worked before, only much slower.

Anyway, most of the serial code has been updated and maybe it's easier to
convert the new serial code instead of debugging the old parallel code
now. I believe I can still reuse part of the old parallel code. However, I
hope I can approach it better this time.

So suppose I need to start converting my new serial code to parallel.
There are 2 eqns to be solved using PETSc, the momentum and poisson. I also
need to parallelize other parts of my code. I wonder which route is the
best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF, and
modify other parts of my code to parallel, e.g. looping, updating of values
etc. Once the execution is fine and speedup is reasonable, then modify the
PETSc part - poisson eqn 1st followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn 1st
followed by the momentum eqn. Then do other parts of my code.

I'm not sure if the above 2 methods can work or if there will be conflicts.
Of course, an alternative will be:

3. Do the poisson, momentum eqns and other parts of the code separately.
That is, code a standalone parallel poisson eqn and use sample values to
test it. Same for the momentum and other parts of the code. When each of
them is working, combine them to form the full parallel code. However, this
will be much more troublesome.

I hope someone can give me some recommendations.

Thank you once again.



Matthew Knepley wrote:






1) There is no way to have any idea what is going on in your code
without -log_summary output

2) Looking at that output, look at the percentage taken by the solver
KSPSolve event. I suspect it is not the biggest component, because
it is very scalable.

Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:







Hi,

I've a serial 2D CFD code. As my grid size requirement increases, the
simulation takes longer. Also, memory requirement becomes a problem. Grid
size has reached 1200x1200. Going higher is not possible due to memory
problems.

I tried to convert my code to a parallel one, following the examples
given. I also need to restructure parts of my code to enable parallel
looping. I 1st changed the PETSc solver to be parallel enabled and then I
restructured parts of my code. I proceeded as long as the answer for a
simple test case was correct. I thought it was not really possible to do any
speed testing since the code is not fully parallelized yet. When I had
finished most of the conversion, I found that in the actual run it is much
slower, although the answer is correct.

So what is the remedy now? I wonder what I should do to check what's
wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I
believe it should be suitable for a parallel run on 4 processors? Is that
so?

Thank you.