[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: profiling PETSc code



Take out your stage push/pop for the moment, and the log_summary
call. Just run with -log_summary and send the output as a test.

 Thanks,

    Matt

On 8/1/06, Matt Funk <mafunk@xxxxxxxx> wrote:
Hi,

i need to profile my code. Specifically the KSPSolve(...) call. Basically i am
(just for testing) setting up the identity matrix and pass in the solution
and RHS vectors. I solve the system 4000 times or so (400 times steps that
is). Anyway, when i run on 1 processor it takes essentially no time (ca 7
secs). When i run on 4 procs it takes 96 secs. I use an external timer to
profile just the call to KSPSolve which is where the times come from.

So i read through chap 12. Unfortunately i cannot find ex21.c in
src/ksp/ksp/examples/tutorials so i couldn't look at the code.

So what i do is the following. At the end of my program, right before calling
PetscFinalize() i call PetscLogPrintSummary(...). I registered stage 0 and
then right before the KSPSolve() call i do :   m_ierr = PetscLogStagePush(0);
and right after i do: m_ierr = PetscLogStagePop(). There is also barriers
right before the push call and right after the pop call.

The output i get (for the single proc run) is:

...

---------------------------------------------- PETSc Performance Summary:
----------------------------------------------

/home2/users/mafunk/AMR/Chombo.2.0/example/node/maskExec/testNodePoisson2d.Linux.g++.g77.MPI32.ex
on a linux-gnu named .1 with 1 processor, by mafunk Tue Aug  1 14:47:43 2006
Using Petsc Release Version 2.3.1, Patch 12, Wed Apr  5 17:55:50 CDT 2006
BK revision: balay@xxxxxxxxxxxxxxxxxxx|ChangeSet|20060405225457|13540

                         Max       Max/Min        Avg      Total
Time (sec):           4.167e+01      1.00000   4.167e+01
Objects:              1.500e+01      1.00000   1.500e+01
Flops:                2.094e+08      1.00000   2.094e+08  2.094e+08
Flops/sec:            5.025e+06      1.00000   5.025e+06  5.025e+06
Memory:               4.700e+05      1.00000              4.700e+05
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.402e+04      1.00000

Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N -->
2N flops
                            and VecAXPY() for complex vectors of length N -->
8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---
-- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total
Avg         %Total   counts   %Total
 0:      Main Stage: 4.1647e+01  99.9%  2.0937e+08 100.0%  0.000e+00   0.0%
0.000e+00        0.0%  2.402e+04 100.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting
output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in
this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------


########################################################## # # # WARNING!!! # # # # This code was compiled with a debugging option, # # To get timing results run config/configure.py # # using --with-debugging=no, the performance will # # be generally two or three times faster. # # # ##########################################################




########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ##########################################################


Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

           Index Set     3              3      35976     0
                 Vec     8              8     169056     0
              Matrix     2              2      23304     0
       Krylov Solver     1              1      17216     0
      Preconditioner     1              1        168     0
========================================================================================================================
Average time to get PetscTime(): 2.14577e-07
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Tue Aug  1 11:28:46 2006
Configure options: --with-debugging=1 --with-blas-lapack-dir=/usr/local
--with-mpi-dir=/usr --with-log=1 --with-shared=0


....


So i want to find out why KSPSolve() takes so long in parallel. However, there i no summary for stage 0. Does someone know why this is? Am i using it in a wrong way? I compiled the libraries with -with_log=1 -with-debugging=1

thanks
mat




--
"Failure has a thousand explanations. Success doesn't need one" -- Sir
Alec Guiness