Okay, but if I'm stuck with a big 3D finite difference code, written in PETSc using Distributed Arrays, with 3 dof per node, then you're saying there is really nothing I can do, except use blocking, to improve things on quad core cpus? They talk about blocking using the BAIJ format, so is this the same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ matrices in PETSc going to make a substantial difference in the speed?
I'm sorry if I'm being dense, I'm just trying to understand if there is some simple way I can utilize those extra cores on each cpu easily, and since I'm not a computer scientist, some of these concepts are difficult.
Thanks, Randy
On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie <rlmackie862@xxxxxxxxx> wrote:
Then what's the point of having 4 and 8 cores per cpu for parallel computations? I mean, I think I've done all I can to make my code as efficient as possible.
I really advise reading the paper. It explicitly treats the case of blocking, and uses a simple model to demonstrate all the points I made.
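As a rough illustration of why blocking helps, here is a minimal sketch of a bandwidth model for a sparse matrix-vector product. It is not taken from the paper. It counts only matrix values and column indices, ignores vector traffic and row pointers, and assumes 8-byte values and 4-byte indices. The function names are made up for this example.

```python
# Memory traffic per flop for a sparse matrix-vector product, comparing
# scalar (AIJ-like) storage with blocked (BAIJ-like) storage.
# Assumptions: only matrix values and column indices are counted;
# vector traffic and row pointers are ignored.

BYTES_PER_VALUE = 8   # double precision
BYTES_PER_INDEX = 4   # 32-bit column index (assumed)


def bytes_per_flop(block_size):
    """Bytes of matrix data read per flop for a given block size."""
    # One stored column index covers a whole block_size x block_size block.
    values = block_size * block_size
    bytes_moved = values * BYTES_PER_VALUE + BYTES_PER_INDEX
    flops = 2 * values  # one multiply and one add per matrix entry
    return bytes_moved / flops


def blocking_gain(block_size):
    """Ratio of scalar to blocked traffic per flop under this model."""
    return bytes_per_flop(1) / bytes_per_flop(block_size)


if __name__ == "__main__":
    for bs in (1, 3):
        print(f"block size {bs}: {bytes_per_flop(bs):.2f} bytes/flop")
    print(f"model gain for 3 dof per node: {blocking_gain(3):.2f}x")
```

Under these assumptions, a block size of 3 cuts the traffic per flop from 6 bytes to about 4.2, a gain of roughly 1.4x. The model cannot exceed 1.5x for any block size, because the matrix values themselves still have to be streamed from memory. Real codes may behave differently, since blocking also allows better reuse of vector entries, which this sketch leaves out.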
With a single, scalar sparse matrix, there is definitely no point at all in having multiple cores. However, multiple cores will speed up things like finite element integration. So, for instance, making this integration dominate your cost (as spectral element codes do) will show nice speedup. Ulrich Ruede has a great talk about this on his website.
Matt
I'm not quite sure I understand your comment about using blocks or unassembled structures.
Randy
Matthew Knepley wrote:
On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie <rlmackie862@xxxxxxxxx> wrote:
I'm running my PETSc code on a cluster of quad core Xeons connected by Infiniband. I hadn't much worried about the performance, because everything seemed to be working quite well, but today I was actually comparing performance (wall clock time) for the same problem, but on different combinations of CPUs.
I find that my PETSc code is quite scalable until I start to use multiple cores/cpu.
For example, the run time doesn't improve by going from 1 core/cpu to 4 cores/cpu, and I find this to be very strange, especially since looking at top or Ganglia, all 4 cpus on each node are running at 100% almost all of the time. I would have thought if the cpus were going all out, that I would still be getting much more scalable results.
Those are really coarse measures. There is absolutely no way that all cores are going 100%. It's easy to show by hand. Take the peak flop rate and this gives you the bandwidth needed to sustain that computation (if everything is perfect, like axpy). You will find that the chip bandwidth is far below this. A nice analysis is in
http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf
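The by-hand estimate described above can be sketched as follows. The hardware numbers are placeholders chosen for illustration, not measured figures for any particular Xeon, and the function names are hypothetical. Only the axpy traffic count (two loads and one store per two flops) is fixed.

```python
# Rough estimate of the memory bandwidth needed to run axpy (y = a*x + y)
# at a processor's peak flop rate. All hardware numbers below are assumed.

BYTES_PER_DOUBLE = 8


def axpy_bytes_per_flop():
    """Bytes moved per flop for axpy in double precision."""
    # Per vector entry: load x[i], load y[i], store y[i] -> 3 doubles moved,
    # for one multiply and one add -> 2 flops.
    bytes_moved = 3 * BYTES_PER_DOUBLE
    flops = 2
    return bytes_moved / flops


def required_bandwidth_gbs(peak_gflops):
    """Bandwidth (GB/s) needed to sustain axpy at peak_gflops (Gflop/s)."""
    return peak_gflops * axpy_bytes_per_flop()


def sustainable_gflops(bandwidth_gbs):
    """Best-case axpy rate (Gflop/s) if memory bandwidth is the only limit."""
    return bandwidth_gbs / axpy_bytes_per_flop()


if __name__ == "__main__":
    peak_gflops_per_core = 10.0  # assumed, not a measured figure
    cores = 4
    shared_bandwidth_gbs = 10.0  # assumed, shared by all cores on the chip
    need = required_bandwidth_gbs(peak_gflops_per_core * cores)
    feed = sustainable_gflops(shared_bandwidth_gbs)
    print(f"bandwidth needed at peak on {cores} cores: {need:.0f} GB/s")
    print(f"axpy rate the assumed memory system can feed: {feed:.2f} Gflop/s")
```

With these placeholder numbers, the bandwidth needed at peak is far larger than what the memory system supplies. Adding cores that share the same memory bus does not raise the achievable rate for bandwidth-bound kernels, even though top reports every core as busy.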
We are using mvapich-0.9.9 with infiniband. So, I don't know if this is a cluster/Xeon issue, or something else.
This is actually mathematics! How satisfying. The only way to improve this is to change the data structure (e.g. use blocks) or change the algorithm (e.g. use spectral elements and unassembled structures).
Matt
Anybody with experience on this?
Thanks, Randy M.