	    Parallel Gaussian Elimination Library

						April 25, 1996


This program is the first prize winner in the NEC Cenju-3 and Fujitsu
AP1000 sections of PSC95 (Parallel Software Contest '95), held with
JSPP95 (Joint Symposium on Parallel Processing).  This program runs
on the NEC Cenju-3, the Fujitsu AP+, and implementations of the MPI
(Message Passing Interface) standard.


***     Copyright notice

Copyright(C) 1996 Osamu Tatebe, all rights reserved, no warranty.

This software is free software.  All rights of this software belong
to Osamu Tatebe.  You may redistribute this whole package as it is,
provided that you do not modify it and that you inform me by E-mail.


*** 1.  Configuration

Edit 'include/conf.h' to set the number of processors, the size of a
computation tile, and the maximum problem size according to your
environment.  NUM_PROC_X and NUM_PROC_Y are the numbers of processors
in the X and Y directions, respectively, so the total number of
processors is NUM_PROC_X * NUM_PROC_Y.  Since this program aims at the
highest performance, NUM_PROC_X and NUM_PROC_Y should each be a power
of 2.  You can also change TILE_K, TILE_J, and MAX_PSIZE as needed.
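For example, on a machine with 8 processors arranged as a 2x4 grid,
the relevant part of 'include/conf.h' might look like the fragment
below.  The values shown are purely illustrative, not the shipped
defaults; check the header distributed with the library.

```c
/* include/conf.h -- illustrative values only */
#define NUM_PROC_X 2      /* processors in the X direction (power of 2) */
#define NUM_PROC_Y 4      /* processors in the Y direction (power of 2) */
                          /* total: NUM_PROC_X * NUM_PROC_Y = 8 processors */
#define TILE_K     8      /* computation tile size */
#define TILE_J     8
#define MAX_PSIZE  2048   /* maximum problem size */
```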

    Warning:: If you change the configuration, you must rebuild this
	      library.

    Caution:: Make sure your application is built with the same
	      configuration as the library.


Then, create a symbolic link 'conf/conf' pointing to a machine-
dependent configuration file.  For example,

% rm conf/conf
% (cd conf; ln -s conf.mpi conf)

You may change 'CC' and 'CFLAGS' in the file 'conf/conf'.

    Warning:: If you use this library on the AP1000 (not the AP+), you
	      must not define __BROAD_PUT__ in 'conf/conf.ap'.

    Caution:: If you compile this library for a single processor, you
	      must not define __BROAD_PUT__ in 'conf/conf.ap'.


*** 2.  Installation of the library and sample programs

Type

% make

to build the library 'libgauss.a', which is installed in the
'lib/$(ARCH)' directory.  This also builds a sample program.  'ARCH'
is defined in 'conf/conf'.


*** 3.  How to execute the sample programs

To run the sample program, change to the 'bin/$(ARCH)' directory and
type

% cjsh -host <host> -n <NUM_PROC> sample -s <size>

on the Cenju-3,

% ./host -WS

on the AP+ or

% mpirun <nodes> -c2c sample -- -s <size>

on an MPI implementation (LAM from OSC, for example).  <host> is a
host name, <NUM_PROC> is the number of processors you specified in
'conf.h', and <size> is the problem size.  Unfortunately you cannot
specify the <size> parameter in the AP+ version, but you can easily
change the size by editing 'sample.c'.

This sample program solves the problem of the LINPACK benchmark.


*** 4.  PSC95 problem

In this release, I include the PSC95 problems.  The specification and
the ranking of the PSC, written in Japanese (KANJI), are placed in the
'psc/doc' directory.  To try the PSC problems, type 'make psc' and
change to the 'psc' directory.  On the Cenju-3, type

% cjsh -host <host> -n 64 psc <key>

or on the AP+,

% ./psc95_h -R 0

<key> is the problem number, between 1 and 3.  On the AP+, the key is
requested later.  On the Cenju-3, you can try the HONSEN (final
competition) problem by changing the symbolic link
'psc/cenju-lib/problem.o' to point to 'problem-honsen.o'.  Enjoy.

    Caution:: Because of a restriction of the PSC, MAX_PSIZE must be
	      set to 2048 in the AP+ version.

    Caution:: When you execute this program on the AP1000 rather than
	      the AP+, the __PUT__ option in 'Makefile.ap' must not be
	      set.  This is because the PSC regulations do not allow
	      'active_init()' to be inserted before the coefficient
	      matrix is created, so we cannot use the PUT/GET
	      facilities, the active messages included in the Fujitsu
	      message-passing library.


*** 5.  Parallel Gaussian elimination library

In the default setting, the Gaussian elimination library 'libgauss.a'
is installed in the directory 'lib/$(ARCH)'.  By linking against this
library, you can use the 'solve_matrix()' routine, whose definition is

void
solve_matrix(int cid, int size, dot_mat pA, double* global_x)

The 'dot_mat' datatype, defined in 'conf.h', is equivalent to
double [][].  The matrix is distributed in a block-cyclic manner:

	!HPF$ DISTRIBUTE (CYCLIC(8), CYCLIC)

and the distributed matrix 'pA' on each processor includes the
corresponding elements of the right-hand-side vector.  'cid' is the
node number, 'size' is the problem size, and 'global_x' is the
solution vector, which is the output.  'global_x' is not a distributed
array; every processor gets the same 'global_x'.

The algorithm, implementation details, and an evaluation of this
library are briefly described in the PostScript file 'doc/psc.ps',
written in Japanese (KANJI).


*** 6.  Questions and comments

When problems occur, please feel free to contact me by E-mail:

	tatebe@is.s.u-tokyo.ac.jp

Your comments and questions are always welcome.
