Parallel Computing Question

Rogene Eichler eichler at i2.arc.umn.edu
Fri Apr 23 15:38:56 EST 1993


> PCN , developed between Caltech and Argonne Labs allows for the same
> program to run with NO modification on numerous uniprocessor platforms as
> well as a number of popular hypercubes. It is available for anonymous ftp
> from: info.mcs.anl.edu in pub/pcn

Great - I am glad to hear that tools like this are becoming available. I
would, however, like to see a tech report comparing the performance of code
converted by PCN against code written natively for a particular machine. I
would suspect a performance deficit for certain problems.

I would also note that hypercubes are but a subset of the parallel
architectures available. The CM5 uses a fat tree, Kendall Square Research
machines use nested rings, and many others use a mesh. My original comment
suggested that optimal performance takes the architecture into account, and
that users should inquire about the machine for which a code was written. If
one wishes to invest that much money in a platform, I think it would be
presumptuous to assume that parallel computing is at a point where one can
easily port code from one platform to another.

>    In general, if you have a problem that requires a significant amount
>    of interprocessor communication, a serial machine might give you better
>    performance.
>
> Hmm, I beg to disagree. I spent a summer 2 years ago parallelizing an
> fcc-lattice algorithm to predict tertiary protein structures. My parallel
> version (developed with PCN) ran worlds faster.
>
> Terry Brannon   tbrannon at lion.eecs.lehigh.edu
> medical biology via acupunctural bioelectrics
> primitive reigns supreme over nearly everyone

Then perhaps you have a problem that is structured well for parallel platforms.
Startup time for message passing on most platforms is on the order of
100 microsecs. When you consider that a clock cycle on a Cray-2 is 4.2 nsecs,
communication is quite costly.
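
To put that in perspective, here is a quick back-of-the-envelope check (a
Python sketch of my own; the 100 microsecond startup and 4.2 nsec cycle
figures are just the ones quoted above):

    # Cost of one message startup, measured in Cray-2 clock cycles.
    # Both figures are the rough values quoted above.
    startup = 100e-6     # ~100 microseconds to initiate one message
    cycle   = 4.2e-9     # ~4.2 nanosecond clock period
    print("cycles per message startup: %d" % round(startup / cycle))
    # roughly 24,000 cycles of arithmetic forgone for every message sent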

One theoretically measures speed-up as S = Ts/Tp, where Ts is the serial
execution time of the algorithm and Tp is the parallel execution time. Tp,
however, is equal to Tc + Tcomm, where Tc is the computation time and Tcomm
is the communication time.

Let's do an example: multiplying two NxN matrices. The serial algorithm
requires 2N^3 reads and 2N^3 operations; taking a read to cost as much as an
operation, and writing Tc for the time of a single operation, the serial time
is Ts = 4N^3*Tc. On a parallel machine with P = N^2 processing elements, I can
assign one element of each matrix (and one element of the product) to each
processor. To compute its element of the product, each processor must perform
2 local reads, 2(N-1) non-local reads, and 2N operations, so Tp =
((2N + 2)*Tc) + (2*(N-1)*Tcomm). From the hardware figures above I can take
Tcomm = 100*Tc. Then my speedup becomes:

                        S = 4*N^3/(202*N - 198)

When this is greater than 1, it can be said that there is a benefit from
parallel computing. Here S climbs above 1 only once N reaches about 7, and
even then the efficiency S/P = 4N/(202*N - 198) settles near 2 percent as N
grows: nearly all of the N^2 processors' time is spent waiting on non-local
reads rather than computing.

In the example, I am ignoring any possible contention in the sends, and I am
assuming P = N^2, one product element per processor. (The same arguments hold
when P < N^2 and each processor handles N^2/P elements, but the math gets a
bit messier.) I am also assuming that the per-operation time Tc is the same
for the serial and parallel machines. Finally, the counts are based on a
*read as you need* communication pattern, as opposed to a batch send.
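
For the curious, the model above fits in a few lines of Python (a sketch under
the same assumptions; the function name and the sample sizes are my own):

    # Speedup model for the NxN matrix multiply example above:
    # one product element per processor (P = N^2), read-as-you-need reads,
    # and a non-local read costing comm_ratio times a local operation.
    def speedup(N, comm_ratio=100):
        Ts = 4 * N**3                            # serial: 2N^3 reads + 2N^3 ops
        Tp = (2*N + 2) + 2*(N - 1) * comm_ratio  # per-processor time, in Tc units
        return Ts / Tp

    for N in (1, 4, 8, 16, 64):
        S = speedup(N)
        print(N, round(S, 2), round(S / N**2, 3))   # speedup and efficiency S/P
    # efficiency settles near 4/202, i.e. about 2 percent, as N grows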

I suspect that a lattice algorithm uses primarily nearest-neighbor
communications, which is why you did not run into this problem. Irregular
communication patterns, such as those encountered in mapping biological
neurons or biological neural networks, carry a large communication overhead.
We have reports at the AHPCRC where the Cray outperforms the CM5, and others
where the CM5 outperforms the Cray. It really depends on how the problem is
structured. The issues are: 1) the regularity of the communication grid,
2) the ratio of communication to computation, 3) the problem size, and
4) the choice of data partitioning/blocking.

Yes, it is quite trivial to write programs for parallel computers. However,
cost-effective performance requires a bit more work. The good news is that
once vendors reduce the communication startup time, we needn't worry about
these details.


                                                Cheers!

                                                        -Rogene


*******************************************************************************
                        Rogene M. Eichler West

 Neuroscience Program                  Minnesota Supercomputer Institute
University of Minnesota         Army High Performance Computing Research Center
AHPCRC: (612)626-8082                     1100 Washington Ave South
Home:   (612)292-0573                      Suite 101 MN Tech Center
Fax:    (612)626-1596                           MPLS, MN 55415

*******************************************************************************


