Re: Slow speed after changing from serial to parallel
The convergence here is just horrendous. Have you tried using LU to check
your implementation? All the time is in the solve right now. I would first
try a direct method (at least on a small problem) and then try to understand
the convergence behavior. MUMPS can actually scale very well for big problems.
Matt
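
As a rough illustration of the "direct method on a small problem" check, here is a minimal Python sketch. It is not the PETSc code from this thread: the Dirichlet-style boundary rows, the 6x5 grid, and the helper names build_poisson, solve_direct and residual_norm are all assumptions made for illustration. It builds a small 5-point Poisson matrix, solves it by dense Gaussian elimination, and checks the residual against a known solution.

```python
# Sketch only: a tiny stand-in for "solve a small problem directly and check it".
# Boundary treatment (truncated stencil, Dirichlet-like) is an assumption.

def build_poisson(nx, ny):
    """Dense 5-point Laplacian on an nx-by-ny grid, i index running fastest."""
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    for j in range(ny):
        for i in range(nx):
            k = j * nx + i
            A[k][k] = 4.0
            if i > 0:
                A[k][k - 1] = -1.0
            if i < nx - 1:
                A[k][k + 1] = -1.0
            if j > 0:
                A[k][k - nx] = -1.0
            if j < ny - 1:
                A[k][k + nx] = -1.0
    return A

def solve_direct(A, b):
    """Gaussian elimination with partial pivoting on an augmented copy of A."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            if f != 0.0:
                for q in range(c, n + 1):
                    M[r][q] -= f * M[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][q] * x[q] for q in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def residual_norm(A, x, b):
    """Max-norm of A*x - b."""
    return max(abs(sum(a * xi for a, xi in zip(row, x)) - bi)
               for row, bi in zip(A, b))

A = build_poisson(6, 5)
x_true = [float(k % 7) for k in range(30)]           # made-up reference solution
b = [sum(a * xi for a, xi in zip(row, x_true)) for row in A]
x = solve_direct(A, b)
error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(residual_norm(A, x, b), error)
```

If the direct solve reproduces the reference solution on the small case, the matrix assembly is probably fine and the iterative solver settings are the next thing to look at. Within PETSc the usual run-time request for a direct solve is -ksp_type preonly -pc_type lu, which only takes effect if the code calls KSPSetFromOptions().
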
On Tue, Apr 15, 2008 at 11:44 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:
> Hi,
>
> Here's the summary for 1 processor. Seems like it's also using a long
> time... Can someone tell me where my mistakes possibly lie? Thank you very
> much!
>
>
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed
> Apr 16 00:39:22 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
> revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
> Max Max/Min Avg Total
> Time (sec): 1.088e+03 1.00000 1.088e+03
> Objects: 4.300e+01 1.00000 4.300e+01
> Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11
> Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 1.460e+04 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flops
> and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts %Total
> Avg %Total counts %Total
> 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0%
> 0.000e+00 0.0% 1.460e+04 100.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all
> processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this
> phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 12 11 0 0 0 12 11 0 0 0 216
> MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 25 11 0 0 0 25 11 0 0 0 107
> MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 88
> MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00
> 7.2e+03 52 72 0 0 49 52 72 0 0 49 341
> KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00
> 1.5e+04 93100 0 0100 93100 0 0100 262
> PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 44
> PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 25 11 0 0 0 25 11 0 0 0 107
> VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00
> 7.2e+03 25 36 0 0 49 25 36 0 0 49 359
> VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00
> 7.4e+03 2 2 0 0 51 2 2 0 0 51 374
> VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 1 0 0 0 1 1 0 0 0 345
> VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 206
> VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 38 0 0 0 29 38 0 0 0 324
> VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00
> 7.4e+03 2 4 0 0 51 2 4 0 0 51 364
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 2 2 65632332 0
> Krylov Solver 1 1 17216 0
> Preconditioner 1 1 168 0
> Index Set 3 3 5185032 0
> Vec 36 36 120987640 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 3.09944e-07
> OptionTable: -log_summary
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2
> --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4
> --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4
> --with-vendor-compilers=intel --with-x=0
> --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0
> --with-batch=1 --with-mpi-shared=0
> --with-mpi-include=/usr/local/topspin/mpi/mpich/include
> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a
> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun
> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t
> --with-shared=0 -----------------------------------------
> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12
> 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC
> -O -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8
> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi
> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include
> -I/home/enduser/g0306332/lib/hypre/include
> -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O Using libraries:
> -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi
> -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts
> -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib
> -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib
> -L/usr/local/topspin/mpi/mpich/lib -lmpich
> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t
> -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64
> -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport
> -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl
> -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs
> -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -L/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------
> 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (20major+172979minor)pagefaults 0swaps
> Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
>
> TID HOST_NAME COMMAND_LINE STATUS
> TERMINATION_TIME
> ===== ========== ================ =======================
> ===================
> 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008
> 00:39:23
>
>
> Barry Smith wrote:
>
> >
> > It is taking 8776 iterations of GMRES! How many does it take on one
> process? This is a huge
> > amount.
> >
> > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03
> 0.0e+00 10 11100100 0 10 11100100 0 217
> > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00
> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
> >
> > One process is spending 2.9 times as long in the embarrassingly parallel
> MatSolve than the other process;
> > this indicates a huge imbalance in the number of nonzeros on each process.
> As Matt noticed, the partitioning
> > between the two processes is terrible.
> >
> > Barry
> >
> > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote:
> >
> > > Oh sorry here's the whole information. I'm using 2 processors currently:
> > >
> > >
> ************************************************************************************************************************
> > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
> > >
> ************************************************************************************************************************
> > >
> > > ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
> > >
> > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332
> Tue Apr 15 23:03:09 2008
> > > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007
> HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
> > >
> > > Max Max/Min Avg Total
> > > Time (sec): 1.114e+03 1.00054 1.114e+03
> > > Objects: 5.400e+01 1.00000 5.400e+01
> > > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11
> > > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08
> > > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04
> > > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07
> > > MPI Reductions: 8.644e+03 1.00000
> > >
> > > Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> > > e.g., VecAXPY() for real vectors of length N
> --> 2N flops
> > > and VecAXPY() for complex vectors of length N
> --> 8N flops
> > >
> > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages
> --- -- Message Lengths -- -- Reductions --
> > > Avg %Total Avg %Total counts %Total
> Avg %Total counts %Total
> > > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04
> 100.0% 4.800e+03 100.0% 1.729e+04 100.0%
> > >
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > > See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> > > Phase summary info:
> > > Count: number of times phase was executed
> > > Time and Flops/sec: Max - maximum over all processors
> > > Ratio - ratio of maximum to minimum over all
> processors
> > > Mess: number of messages sent
> > > Avg. len: average message length
> > > Reduct: number of global reductions
> > > Global: entire computation
> > > Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> > > %T - percent time in this phase %F - percent flops in this
> phase
> > > %M - percent messages in this phase %L - percent message lengths
> in this phase
> > > %R - percent reductions in this phase
> > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > >
> > > ##########################################################
> > > # #
> > > # WARNING!!! #
> > > # #
> > > # This code was run without the PreLoadBegin() #
> > > # macros. To get timing results we always recommend #
> > > # preloading. otherwise timing numbers may be #
> > > # meaningless. #
> > > ##########################################################
> > >
> > >
> > > Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> > > Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03
> 0.0e+00 10 11100100 0 10 11100100 0 217
> > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00
> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
> > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
> > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0
> > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03
> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00
> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
> > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03
> 1.7e+04 89100100100100 89100100100100 317
> > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
> > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
> > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00
> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
> > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00
> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
> > > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00
> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42
> > > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00
> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636
> > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
> > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00
> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
> > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03
> 0.0e+00 0 0100100 0 0 0100100 0 0
> > > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> > > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00
> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > > Memory usage is given in bytes:
> > >
> > > Object Type Creations Destructions Memory Descendants'
> Mem.
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > Matrix 4 4 49227380 0
> > > Krylov Solver 2 2 17216 0
> > > Preconditioner 2 2 256 0
> > > Index Set 5 5 2596120 0
> > > Vec 40 40 62243224 0
> > > Vec Scatter 1 1 0 0
> > >
> ========================================================================================================================
> > > Average time to get PetscTime(): 4.05312e-07
> > > Average time for MPI_Barrier(): 7.62939e-07
> > > Average time for zero size MPI_Send(): 2.02656e-06
> > > OptionTable: -log_summary
> > > Compiled without FORTRAN kernels
> > > Compiled with full precision matrices (default)
> > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> > > Configure run at: Tue Jan 8 22:22:08 2008
> > > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8
> --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8
> --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4
> --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0
> --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0
> --with-batch=1 --with-mpi-shared=0
> --with-mpi-include=/usr/local/topspin/mpi/mpich/include
> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a
> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun
> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> > > -----------------------------------------
> > > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> > > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul
> 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> > > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> > > Using PETSc arch: atlas3-mpi
> > > -----------------------------------------
> > > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I.
> -fPIC -O -----------------------------------------
> > > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8
> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi
> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include
> > > -I/home/enduser/g0306332/lib/hypre/include
> -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> > > Using C linker: mpicc -fPIC -O
> > > Using Fortran linker: mpif90 -I. -fPIC -O Using libraries:
> -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi
> -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts
> -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib
> -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib
> -L/usr/local/topspin/mpi/mpich/lib -lmpich
> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t
> -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64
> -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport
> -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
> -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl
> -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs
> -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -L/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> > > ------------------------------------------
> > > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata
> 0maxresident)k
> > > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps
> > > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata
> 0maxresident)k
> > > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps
> > > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
> > > TID HOST_NAME COMMAND_LINE STATUS
> TERMINATION_TIME
> > > ===== ========== ================ =======================
> ===================
> > > 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008
> 23:03:10
> > > 00001 atlas3-c05 time ./a.out -lo Done 04/15/2008
> 23:03:10
> > >
> > >
> > > I have a Cartesian grid of 600x720. Since there are 2 processors, it is
> partitioned into two 600x360 blocks. I just use:
> > >
> > > call
> MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr)
> > >
> > > call MatSetFromOptions(A_mat,ierr)
> > >
> > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)
> > >
> > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr)
> > >
> > > call
> VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr)
> > >
> > > total_k is actually size_x*size_y. Since it's 2D, the maximum number of values per
> row is 5. When you say setting off-process values, do you mean I insert
> values from one processor into another? I thought I inserted the values into the
> correct processor...
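
A minimal Python sketch of how rows end up split when the local sizes are left to PETSC_DECIDE, assuming the usual even split in which the first N mod p ranks receive one extra row. The helper names split_ownership and ownership_ranges are hypothetical; the ranges play the role of what MatGetOwnershipRange returns (0-based, end exclusive). It also estimates the bytes one process needs from its neighbour per matrix-vector product, assuming a 5-point stencil and 8-byte scalars as reported by sizeof(PetscScalar) in the log.

```python
# Sketch only: row ownership for a 600x720 grid on 2 processes.

def split_ownership(n_global, n_procs):
    """Rows per rank: n // p, plus one extra row for the first n % p ranks."""
    base, extra = divmod(n_global, n_procs)
    return [base + (1 if rank < extra else 0) for rank in range(n_procs)]

def ownership_ranges(n_global, n_procs):
    """Half-open [start, end) global row ranges, 0-based."""
    ranges, start = [], 0
    for count in split_ownership(n_global, n_procs):
        ranges.append((start, start + count))
        start += count
    return ranges

size_x, size_y = 600, 720
total_k = size_x * size_y
ranges = ownership_ranges(total_k, 2)
grid_lines_per_rank = [(end - start) // size_x for start, end in ranges]

# With i running fastest, each rank needs one neighbouring grid line of
# size_x values for a 5-point stencil; 8 bytes per scalar is taken from the log.
ghost_bytes = size_x * 8

print(ranges)               # [(0, 216000), (216000, 432000)]
print(grid_lines_per_rank)  # [360, 360]
print(ghost_bytes)          # 4800
```

Under these assumptions each process owns 360 grid lines, and the 4800 bytes per exchange is consistent with the average message length of 4.8e+03 reported for MatMult in the summary above.
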
> > >
> > > Thank you very much!
> > >
> > >
> > >
> > > Matthew Knepley wrote:
> > >
> > > > 1) Please never cut out parts of the summary. All the information is
> valuable,
> > > > and most times, necessary
> > > >
> > > > 2) You seem to have huge load imbalance (look at VecNorm). Do you
> partition
> > > > the system yourself? How many processes is this?
> > > >
> > > > 3) You seem to be setting a huge number of off-process values in the
> matrix
> > > > (see MatAssemblyBegin). Is this true? I would reorganize this part.
> > > >
> > > > Matt
> > > >
> > > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:
> > > >
> > > >
> > > > > Hi,
> > > > >
> > > > > I have converted the poisson eqn part of the CFD code to parallel.
> The grid
> > > > > size tested is 600x720. For the momentum eqn, I used another serial
> linear
> > > > > solver (nspcg) to prevent mixing of results. Here's the output
> summary:
> > > > >
> > > > > --- Event Stage 0: Main Stage
> > > > >
> > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04
> 4.8e+03
> > > > > 0.0e+00 10 11100100 0 10 11100100 0 217
> > > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
> > > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
> > > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0
> 0.0e+00
> > > > > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0*
> > > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00
> 2.4e+03
> > > > > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00
> 0.0e+00
> > > > > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
> > > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04
> 4.8e+03
> > > > > 1.7e+04 89100100100100 89100100100100 317
> > > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00
> 0.0e+00
> > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
> > > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00
> 0.0e+00
> > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
> > > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
> > > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00
> 0.0e+00
> > > > > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
> > > > > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00
> 0.0e+00
> > > > > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42*
> > > > > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636*
> > > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> > > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
> > > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
> > > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04
> 4.8e+03
> > > > > 0.0e+00 0 0100100 0 0 0100100 0 0*
> > > > > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0*
> > > > > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00
> 0.0e+00
> > > > > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62*
> > > > >
> > > > >
> ------------------------------------------------------------------------------------------------------------------------
> > > > > Memory usage is given in bytes:
> > > > > Object Type Creations Destructions Memory
> Descendants' Mem.
> > > > > --- Event Stage 0: Main Stage
> > > > > Matrix 4 4 49227380 0
> > > > > Krylov Solver 2 2 17216 0
> > > > > Preconditioner 2 2 256 0
> > > > > Index Set 5 5 2596120 0
> > > > > Vec 40 40 62243224 0
> > > > > Vec Scatter 1 1 0 0
> > > > >
> ========================================================================================================================
> > > > > > Average time to get PetscTime(): 4.05312e-07
> > > > > > Average time for MPI_Barrier(): 7.62939e-07
> > > > > Average time for zero size MPI_Send(): 2.02656e-06
> > > > > OptionTable: -log_summary
> > > > >
> > > > >
> > > > > The PETSc manual states that ratio should be close to 1. There's
> quite a
> > > > > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very
> big. So
> > > > > what could be the cause?
> > > > >
> > > > > I wonder if it has to do with the way I insert values into the matrix.
> > > > > My steps are (Cartesian grid, i loops faster than j, Fortran):
> > > > >
> > > > > For matrix A and rhs
> > > > >
> > > > > Insert left extreme cells values belonging to myid
> > > > >
> > > > > if (myid==0) then
> > > > >    insert corner cells values
> > > > >    insert south cells values
> > > > >    insert internal cells values
> > > > > else if (myid==num_procs-1) then
> > > > >    insert corner cells values
> > > > >    insert north cells values
> > > > >    insert internal cells values
> > > > > else
> > > > >    insert internal cells values
> > > > > end if
> > > > >
> > > > > Insert right extreme cells values belonging to myid
> > > > >
> > > > > All these values are entered into a big_A(size_x*size_y,5) matrix.
> > > > > int_A stores the corresponding column positions of the values. I then do
> > > > >
> > > > > call MatZeroEntries(A_mat,ierr)
> > > > >
> > > > > do k=ksta_p+1,kend_p   !for cells belonging to myid
> > > > >    do kk=1,5
> > > > >       II=k-1              !0-based global row
> > > > >       JJ=int_A(k,kk)-1    !0-based global column
> > > > >       call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
> > > > >    end do
> > > > > end do
> > > > >
> > > > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
> > > > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)
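
The loop quoted here makes one MatSetValues call per nonzero. MatSetValues also accepts several column indices and values for a row in one call, so the five entries of a row could go in together. The index bookkeeping for that can be sketched in Python as below. This only illustrates the 1-based to 0-based conversion and the per-row grouping and is not PETSc code: `rows_for_insertion` is a hypothetical helper, and `big_A`/`int_A` are stand-in lists shaped like the arrays described in the message.

```python
def rows_for_insertion(big_A, int_A, ksta_p, kend_p):
    """Group the 5 stencil entries of each locally owned cell by row.

    big_A[k-1][kk] holds the value and int_A[k-1][kk] the 1-based global
    column for cell k, mirroring the Fortran arrays big_A(k,kk), int_A(k,kk).
    Returns (row, cols, vals) triples with 0-based indices, one per row.
    """
    batches = []
    for k in range(ksta_p + 1, kend_p + 1):   # same range as do k=ksta_p+1,kend_p
        row = k - 1                            # II = k-1
        cols = [c - 1 for c in int_A[k - 1]]   # JJ = int_A(k,kk)-1
        vals = list(big_A[k - 1])
        batches.append((row, cols, vals))
    return batches

# Tiny made-up example: 3 cells, 5 entries per cell.
int_A = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]
big_A = [[4.0, -1.0, -1.0, -1.0, -1.0]] * 3
for row, cols, vals in rows_for_insertion(big_A, int_A, ksta_p=0, kend_p=3):
    print(row, cols, vals)
```

Each triple would correspond to one call of the form MatSetValues(A_mat,1,row,5,cols,vals,ADD_VALUES,ierr). Whether the number of calls matters for the timings reported in this thread is not something the log excerpt settles.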
> > > > >
> > > > >
> > > > > I wonder if the problem lies here. I used the big_A matrix because I was
> > > > > migrating from an old linear solver. Lastly, I was told to widen my window
> > > > > to 120 characters. How do I do that?
> > > > >
> > > > >
> > > > >
> > > > > Thank you very much.
> > > > >
> > > > > Matthew Knepley wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Hi Matthew,
> > > > > > >
> > > > > > > I think you've misunderstood what I meant. What I'm trying to say
> > > > > > > is that initially I had a serial code. I tried to convert it to a
> > > > > > > parallel one. Then I tested it and it was pretty slow. Due to some
> > > > > > > work requirements, I needed to go back and make some changes to my
> > > > > > > code. Since the parallel code was not working well, I updated and
> > > > > > > changed the serial one.
> > > > > > >
> > > > > > > Well, that was a while ago and now, due to the updates and changes,
> > > > > > > the serial code is different from the old converted parallel code.
> > > > > > > Some files were also deleted and I can't seem to get it working
> > > > > > > now. So I thought I might as well convert the new serial code to
> > > > > > > parallel. But I'm not very sure what I should do first.
> > > > > > >
> > > > > > > Maybe I should rephrase my question: if I just convert my Poisson
> > > > > > > equation subroutine from a serial PETSc version to a parallel PETSc
> > > > > > > version, will it work? Should I expect a speedup? The rest of my
> > > > > > > code is still serial.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > You should, of course, only expect speedup in the parallel parts.
> > > > > >
> > > > > > Matt
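
The point that speedup is limited to the parallel parts is usually quantified with Amdahl's law. A small Python sketch, assuming a fraction of the run time is parallelized perfectly over the available processes (the numbers are illustrative, not measurements from this code):

```python
def amdahl_speedup(parallel_fraction, nprocs):
    """Upper bound on overall speedup when only part of the run is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nprocs)

# Example: if only the Poisson solve is parallel and it accounts for half of
# the serial run time, this bounds the overall speedup on 4 processes.
print(amdahl_speedup(0.5, 4))
```

The bound shrinks quickly as the serial fraction grows, which is why converting only one subroutine may show little overall gain even when that subroutine itself scales well.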
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Matthew Knepley wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > I am not sure why you would ever have two codes. I never do this.
> > > > > > > > PETSc is designed so that you write one code that runs in both
> > > > > > > > serial and parallel. The PETSc part should look identical. To
> > > > > > > > test, run the code you have verified in serial and output PETSc
> > > > > > > > data structures (like Mat and Vec) using a binary viewer. Then
> > > > > > > > run in parallel with the same code, which will output the same
> > > > > > > > structures. Take the two files and write a small verification
> > > > > > > > code that loads both versions and calls MatEqual and VecEqual.
> > > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > Thank you Matthew. Sorry to trouble you again.
> > > > > > > > >
> > > > > > > > > I tried to run it with -log_summary output and I found that
> > > > > > > > > there are some errors in the execution. Well, I was busy with
> > > > > > > > > other things and I just came back to this problem. Some of my
> > > > > > > > > files on the server have also been deleted. It has been a while
> > > > > > > > > and I remember that it worked before, only much slower.
> > > > > > > > >
> > > > > > > > > Anyway, most of the serial code has been updated and maybe it's
> > > > > > > > > easier to convert the new serial code instead of debugging the
> > > > > > > > > old parallel code now. I believe I can still reuse part of the
> > > > > > > > > old parallel code. However, I hope I can approach it better
> > > > > > > > > this time.
> > > > > > > > >
> > > > > > > > > So suppose I need to start converting my new serial code to
> > > > > > > > > parallel. There are 2 equations to be solved using PETSc, the
> > > > > > > > > momentum and Poisson equations. I also need to parallelize
> > > > > > > > > other parts of my code. I wonder which route is the best:
> > > > > > > > >
> > > > > > > > > 1. Don't change the PETSc part, i.e. continue using
> > > > > > > > > PETSC_COMM_SELF, and modify other parts of my code to parallel,
> > > > > > > > > e.g. looping, updating of values etc. Once the execution is
> > > > > > > > > fine and the speedup is reasonable, then modify the PETSc part:
> > > > > > > > > Poisson eqn first, followed by the momentum eqn.
> > > > > > > > >
> > > > > > > > > 2. Reverse the above order, i.e. modify the PETSc part (Poisson
> > > > > > > > > eqn first, followed by the momentum eqn). Then do the other
> > > > > > > > > parts of my code.
> > > > > > > > >
> > > > > > > > > I'm not sure if the above 2 methods can work or if there will
> > > > > > > > > be conflicts. Of course, an alternative would be:
> > > > > > > > >
> > > > > > > > > 3. Do the Poisson eqn, the momentum eqn and the other parts of
> > > > > > > > > the code separately. That is, code a standalone parallel
> > > > > > > > > Poisson eqn and use sample values to test it. Same for the
> > > > > > > > > momentum eqn and other parts of the code. When each of them is
> > > > > > > > > working, combine them to form the full parallel code. However,
> > > > > > > > > this will be much more troublesome.
> > > > > > > > >
> > > > > > > > > I hope someone can give me some recommendations.
> > > > > > > > >
> > > > > > > > > Thank you once again.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Matthew Knepley wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 1) There is no way to have any idea what is going on in your
> > > > > > > > > > code without -log_summary output
> > > > > > > > > >
> > > > > > > > > > 2) Looking at that output, look at the percentage taken by
> > > > > > > > > > the solver KSPSolve event. I suspect it is not the biggest
> > > > > > > > > > component, because it is very scalable.
> > > > > > > > > >
> > > > > > > > > > Matt
> > > > > > > > > >
> > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo@xxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement
> > > > > > > > > > > increases, the simulation takes longer. Also, the memory
> > > > > > > > > > > requirement becomes a problem. The grid size has reached
> > > > > > > > > > > 1200x1200. Going higher is not possible due to memory
> > > > > > > > > > > problems.
> > > > > > > > > > >
> > > > > > > > > > > I tried to convert my code to a parallel one, following the
> > > > > > > > > > > examples given. I also need to restructure parts of my code
> > > > > > > > > > > to enable parallel looping. I first changed the PETSc
> > > > > > > > > > > solver to be parallel enabled and then I restructured parts
> > > > > > > > > > > of my code. I proceeded as long as the answer for a simple
> > > > > > > > > > > test case was correct. I thought it was not really possible
> > > > > > > > > > > to do any speed testing since the code was not fully
> > > > > > > > > > > parallelized yet. When I had finished most of the
> > > > > > > > > > > conversion, I found that the actual run is much slower,
> > > > > > > > > > > although the answer is correct.
> > > > > > > > > > >
> > > > > > > > > > > So what is the remedy now? I wonder what I should do to
> > > > > > > > > > > check what's wrong. Must I restart everything again? Btw,
> > > > > > > > > > > my grid size is 1200x1200. I believe it should be suitable
> > > > > > > > > > > for a parallel run on 4 processors. Is that so?
> > > > > > > > > > >
> > > > > > > > > > > Thank you.
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener