
Re: general question on speed using quad core Xeons



Okay, but if I'm stuck with a big 3D finite difference code, written in PETSc
using Distributed Arrays, with 3 dof per node, then you're saying there is
really nothing I can do, except using blocking, to improve things on quad
core cpus? They talk about blocking using the BAIJ format. Is this the
same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ
matrices in PETSc going to make a substantial difference in the speed?

I'm sorry if I'm being dense, I'm just trying to understand if there is some
simple way I can utilize those extra cores on each cpu easily, and since
I'm not a computer scientist, some of these concepts are difficult.

Thanks, Randy

Matthew Knepley wrote:
On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie <rlmackie862@xxxxxxxxx> wrote:
Then what's the point of having 4 and 8 cores per cpu for parallel
 computations? I mean, I think I've done all I can to make
 my code as efficient as possible.

I really advise reading the paper. It explicitly treats the case of blocking, and uses a simple model to demonstrate all the points I made.

With a single, scalar sparse matrix, there is definitely no point at all
in having multiple cores. However, multiple cores will speed up things
like finite element integration. So, for instance, making this integration
dominate your cost (like spectral element codes do) will show nice speedup.
Ulrich Ruede has a great talk about this on his website.

  Matt

 I'm not quite sure I understand your comment about using blocks
 or unassembled structures.


Randy




Matthew Knepley wrote:

On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie <rlmackie862@xxxxxxxxx>
wrote:
I'm running my PETSc code on a cluster of quad core Xeons connected
 by Infiniband. I hadn't much worried about the performance, because
 everything seemed to be working quite well, but today I was actually
 comparing performance (wall clock time) for the same problem, but on
 different combinations of CPUs.

 I find that my PETSc code is quite scalable until I start to use
 multiple cores/cpu.

 For example, the run time doesn't improve by going from 1 core/cpu
 to 4 cores/cpu, and I find this to be very strange, especially since
 looking at top or Ganglia, all 4 cpus on each node are running at 100%
 almost all of the time. I would have thought if the cpus were going
 all out, that I would still be getting much more scalable results.

Those are really coarse measures. There is absolutely no way that all cores
are going 100%. It's easy to show by hand. Take the peak flop rate and
this gives you the bandwidth needed to sustain that computation (if
everything is perfect, like axpy). You will find that the chip bandwidth
is far below this. A nice analysis is in

 http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf
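
The by-hand estimate described above can be sketched in a few lines of
Python. This is only a rough model: the hardware numbers (peak flop rate
per core, core count, memory bandwidth per node) are hypothetical
placeholders, not measurements of any particular Xeon, and the function
names are made up for illustration. The one fixed input is the axpy
traffic: y[i] = a*x[i] + y[i] loads two words and stores one for every
two flops. Cache effects are ignored.

```python
def axpy_bytes_per_flop(word_bytes=8):
    """Memory traffic per flop for y[i] = a*x[i] + y[i].

    Per element: load x[i], load y[i], store y[i] (3 words),
    and do one multiply plus one add (2 flops).
    """
    return 3 * word_bytes / 2


def required_bandwidth(peak_flops, bytes_per_flop):
    """Bytes/s the memory system must deliver to keep the cores at peak."""
    return peak_flops * bytes_per_flop


def sustainable_fraction(peak_flops, mem_bandwidth, bytes_per_flop):
    """Fraction of peak that the available bandwidth can actually feed."""
    needed = required_bandwidth(peak_flops, bytes_per_flop)
    return min(1.0, mem_bandwidth / needed)


# Hypothetical node: these three numbers are assumptions for illustration.
peak_per_core = 10.0e9    # flops/s per core (assumed)
cores_per_node = 4        # quad core (from the thread)
node_bandwidth = 10.0e9   # bytes/s shared by all cores (assumed)

bpf = axpy_bytes_per_flop()
for cores in (1, cores_per_node):
    peak = cores * peak_per_core
    frac = sustainable_fraction(peak, node_bandwidth, bpf)
    print(f"{cores} core(s): need {required_bandwidth(peak, bpf) / 1e9:.0f} GB/s, "
          f"bandwidth feeds {100 * frac:.1f}% of peak")
```

In this model the memory bandwidth is shared by the node, so adding cores
raises the bandwidth needed without raising the bandwidth available. That
is the effect described above for bandwidth-bound kernels.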


 We are using mvapich-0.9.9 with infiniband. So, I don't know if
 this is a cluster/Xeon issue, or something else.

This is actually mathematics! How satisfying. The only way to improve
this is to change the data structure (e.g. use blocks) or change the
algorithm (e.g. use spectral elements and unassembled structures)
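
To get a feel for why blocks help, here is a rough traffic model in
Python comparing a scalar compressed-row layout (one column index per
nonzero) with a blocked layout (one column index per bs x bs block).
It is a sketch under stated assumptions, not PETSc's actual accounting:
8-byte values and 4-byte indices are assumed, row pointers and vector
traffic are ignored, and the function names are hypothetical. The block
size 3 matches the 3 dof per node mentioned in this thread.

```python
def csr_bytes_per_flop(value_bytes=8, index_bytes=4):
    """Scalar layout: each nonzero carries one value and one column index,
    and contributes one multiply and one add to a matrix-vector product."""
    return (value_bytes + index_bytes) / 2


def bcsr_bytes_per_flop(bs, value_bytes=8, index_bytes=4):
    """Blocked layout: each bs x bs block carries bs*bs values but only
    one column index, and contributes 2*bs*bs flops."""
    values = bs * bs * value_bytes
    flops = 2 * bs * bs
    return (values + index_bytes) / flops


scalar = csr_bytes_per_flop()
blocked = bcsr_bytes_per_flop(3)  # 3 dof per node
print(f"scalar : {scalar:.2f} bytes/flop")
print(f"blocked: {blocked:.2f} bytes/flop")
print(f"matrix traffic reduced by a factor of {scalar / blocked:.2f}")
```

Under these assumptions blocking trims the matrix traffic per flop but
does not remove it, so the gain is bounded. The real benefit also depends
on reuse of vector entries within a block, which this model leaves out.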

 Matt


 Anybody with experience on this?

 Thanks, Randy M.