Let <install-dir> be the directory on the machine where ADAPTOR is installed. Every user should set the environment variable PHOME to this directory in the following way:
    setenv PHOME <install-dir>
or
    export PHOME=<install-dir>
Furthermore, the bin directory of ADAPTOR should be added to the PATH variable and the man directory to the MANPATH variable.
    setenv PATH $PHOME/bin:$PATH
    setenv MANPATH $PHOME/man:$MANPATH
or
    export PATH=$PHOME/bin:$PATH
    export MANPATH=$PHOME/man:$MANPATH
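For sh-compatible shells, the settings above can be wrapped in a small function so that sourcing a startup file repeatedly does not add the same directories again. This is only a sketch: the function name adaptor_env is hypothetical and not part of ADAPTOR, and /opt/adaptor in the usage line merely stands in for <install-dir>.

```shell
# Hypothetical helper (not part of ADAPTOR): set PHOME, PATH and MANPATH
# for sh/bash/ksh without duplicating entries on repeated calls.
adaptor_env() {
    PHOME=$1
    case ":$PATH:" in
        *":$PHOME/bin:"*) ;;                      # already on PATH
        *) PATH=$PHOME/bin:$PATH ;;
    esac
    case ":${MANPATH:-}:" in
        *":$PHOME/man:"*) ;;                      # already on MANPATH
        *) MANPATH=$PHOME/man:${MANPATH:-} ;;
    esac
    export PHOME PATH MANPATH
}
```

A call such as `adaptor_env /opt/adaptor` in a shell startup file would then replace the individual export lines.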
The ADAPTOR commands should now be available; for example, the following commands should work:
    fadapt -help
    adaptor -help
    man fadapt
    man adaptor
Furthermore, make sure that the ADAPTOR runtime systems are available.
ls $PHOME/dalib/lib*.a
The following libraries must be available:
    $PHOME/dalib/libadp_hpf.a          ! HPF runtime for parallel machines
    $PHOME/dalib/libadp_hpf_null.a     ! HPF runtime for serial machines
    $PHOME/dalib/libadp_dm_null.a      ! dummy HPF distributed memory interface
    $PHOME/dalib/libadp_sm_null.a      ! dummy HPF shared memory interface
    $PHOME/dalib/libadp_pm_null.a      ! dummy performance monitoring interface
    $PHOME/dalib/libadp_trace_null.a   ! dummy trace interface
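Checking the six required libraries by eye is error-prone, so the check can be scripted. The following is a minimal sketch for sh-compatible shells; the function name check_adaptor_libs is hypothetical and not an ADAPTOR command, and only the library names listed above are assumed.

```shell
# Hypothetical helper (not part of ADAPTOR): report which of the required
# runtime libraries are missing from the directory given as first argument.
check_adaptor_libs() {
    dir=$1
    missing=0
    for lib in libadp_hpf.a libadp_hpf_null.a libadp_dm_null.a \
               libadp_sm_null.a libadp_pm_null.a libadp_trace_null.a
    do
        if [ ! -f "$dir/$lib" ]; then
            echo "missing: $dir/$lib"
            missing=1
        fi
    done
    return $missing      # 0 if all libraries were found, 1 otherwise
}
```

It would typically be called as `check_adaptor_libs "$PHOME/dalib"`; no output and exit status 0 mean that all required libraries were found.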
To use HPF parallelism on distributed memory machines via MPI and on shared memory machines via PThreads, the following libraries are needed:
    $PHOME/dalib/libadp_dm_mpi.a
    $PHOME/dalib/libadp_sm_pthreads.a
Assuming that you have PAPI or PCL available on your machine, you can use the ADAPTOR instrumentation to get performance data. The following interfaces are necessary:
    $PHOME/dalib/libadp_pm_papi.a
    $PHOME/dalib/libadp_pm_pcl.a
VampirTrace can be utilized to collect runtime data to be visualized by the VAMPIR tool. The following interfaces are needed:
    $PHOME/dalib/libadp_trace_vt.a         ! traces Adaptor + MPI
    $PHOME/dalib/libadp_trace_vtsample.a   ! traces Adaptor + MPI + performance
    $PHOME/dalib/libadp_trace_vtsp.a       ! traces Adaptor for single process
The main purpose of the ADAPTOR system is the compilation of HPF programs to parallel programs.
The following data parallel HPF program prime.hpf (assumed to be in the directory $PHOME/hpf_examples/) computes the number of primes in the range from 2 to n. The program uses dynamic arrays and array syntax. Timing functions are used to measure the run time of the program.
Note: This program is used to demonstrate the functionality of HPF and ADAPTOR and not the efficiency of HPF.
There is exactly one array in the program. This array will be block distributed among all processors where the number of processors will be fixed at runtime.
      program PRIME
      integer N, S, K
      logical, allocatable :: A(:)
!hpf$ distribute A(block)
      integer TICKSTART, TICKSTOP, TICKRATE
      print *, 'Input n for counting primes in range 2 to n : '
      read *, N
      call system_clock (TICKSTART, TICKRATE)
      allocate (A(1:N))
      A = .true.
      A(1) = .false.
      K = 2
      do while (K*K <= N)
         A(K*K:N:K) = .false.       ! sieve all multiples of k
         K = K + 1
         do while (.not. A(K))      ! find next prime k
            K = K + 1
         end do
      end do
      S = count (A)
      call system_clock (TICKSTOP)
      TICKSTOP = TICKSTOP - TICKSTART
      print *, 'There are ', S, ' primes until ', N
      print *, 'Time needed : ', float(TICKSTOP)/float(TICKRATE)
      deallocate (A)
      end
Here are some example input values and their results.
    Input value:   Result
    ============   ======
             100       25
            1000      168
           10000     1229
          100000     9592
         1000000    78498
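The expected results can be cross-checked without ADAPTOR or a Fortran compiler. The sketch below runs a plain sieve of Eratosthenes in awk from a shell function. The name count_primes is hypothetical, and the code is an independent serial check, not a translation of what ADAPTOR generates.

```shell
# Hypothetical helper: count the primes in the range 2..n with a serial sieve.
count_primes() {
    awk -v n="$1" 'BEGIN {
        for (i = 2; i <= n; i++) a[i] = 1          # assume every number is prime
        for (k = 2; k * k <= n; k++)
            if (a[k])                              # k is prime: sieve its multiples
                for (j = k * k; j <= n; j += k) a[j] = 0
        s = 0
        for (i = 2; i <= n; i++) s += a[i]         # count the survivors
        print s
    }'
}
```

For example, `count_primes 100` prints 25 and `count_primes 1000` prints 168, matching the first two rows of the table.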
The compile driver adaptor drives the whole compilation and will generate a corresponding executable. The compile driver is described in detail in section 5.
adaptor -hpf -dm -o <executable> <file>.hpf
This call will generate the executable in three steps:
The following commands will generate a parallel SPMD program and run it on 4 processors (an MPI installation must be available).
    adaptor -hpf -dm -o prime prime.hpf
    mpirun -np 4 prime
The basic idea of the HPF compilation for distributed memory machines is that the abstract HPF processors are identified with MPI processes, where usually one MPI process runs on one processor. The number of abstract HPF processors is given implicitly at runtime by the number of available MPI processes (MPI_Comm_size).
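Because the number of processes is only known at runtime, the part of the block-distributed array A owned by each process is also determined at runtime. The sketch below computes the index range per process following the HPF definition of BLOCK (block size = ceiling of n divided by the number of processors). The function name block_range is hypothetical, and it is an assumption here that the ADAPTOR runtime follows this definition exactly.

```shell
# Hypothetical helper: print the index range "lo hi" owned by processor p
# (1-based) when n elements are BLOCK-distributed over np processors.
# hi < lo means that the processor owns no elements.
block_range() {
    n=$1 np=$2 p=$3
    bs=$(( (n + np - 1) / np ))      # block size: ceiling of n / np
    lo=$(( (p - 1) * bs + 1 ))
    hi=$(( p * bs ))
    if [ "$hi" -gt "$n" ]; then      # last block may be shorter
        hi=$n
    fi
    echo "$lo $hi"
}
```

With n = 100 and four MPI processes, `block_range 100 4 2` prints "26 50", so the second process would own A(26:50).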
The following command translates the data parallel program to a serial Fortran program, and then compiles and links it.
adaptor -hpf_1 -o prime prime.hpf
The flag -hpf_1 identifies all abstract HPF processors with one single processor. The program runs only on a single node; it does not use any explicit process or thread parallelism.
./prime ! runs the executable prime
Note: Compiling for a single node results in slightly better performance than running the MPI program on a single node, because certain overhead is avoided when it is already known at compile time that only one processor will be used.
The following command translates the data parallel HPF program to a Fortran program using shared memory parallelism via PThreads.
adaptor -hpf -sm -o prime prime.hpf
The flag -sm tells the compiler to identify the abstract HPF processors with threads.
    setenv OMP_NUM_THREADS 4     ! export OMP_NUM_THREADS=4
    ./prime                      ! run it with 4 threads
This new execution model of HPF is based on the idea that one physical processor emulates a certain number of abstract processors. Due to the implicit blocking of data and load by the HPF mapping directives, the data of one abstract processor fits better in the cache and is better reused.
adaptor -hpf -cb -o prime prime.hpf
The number of abstract processors (number of blocks) is specified by a flag.
./prime -nb 100 ! emulation of 100 abstract processors
The following command translates the data parallel HPF program to a parallel program using process parallelism via MPI and thread parallelism via PThreads.
adaptor -hpf -dm -sm -o prime prime.hpf
The following commands will run the program on four nodes with two threads on each node.
    export OMP_NUM_THREADS=2
    mpirun -np 4 prime
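The compilation modes shown so far differ only in a few flags and in how the executable is started. As a compact summary, the following sketch prints (but does not execute) the compile and run commands for each mode. The function name adaptor_recipe and the mode names are hypothetical; the flags, thread counts and process counts are the ones from the examples above.

```shell
# Hypothetical helper: print the commands for one compilation mode.
# Nothing is executed; all flags are taken from the examples in this section.
adaptor_recipe() {
    mode=$1 src=$2 exe=$3
    case $mode in
        dm)     echo "adaptor -hpf -dm -o $exe $src"
                echo "mpirun -np 4 $exe" ;;
        serial) echo "adaptor -hpf_1 -o $exe $src"
                echo "./$exe" ;;
        sm)     echo "adaptor -hpf -sm -o $exe $src"
                echo "export OMP_NUM_THREADS=4"
                echo "./$exe" ;;
        cb)     echo "adaptor -hpf -cb -o $exe $src"
                echo "./$exe -nb 100" ;;
        hybrid) echo "adaptor -hpf -dm -sm -o $exe $src"
                echo "export OMP_NUM_THREADS=2"
                echo "mpirun -np 4 $exe" ;;
        *)      echo "unknown mode: $mode" >&2
                return 1 ;;
    esac
}
```

For instance, `adaptor_recipe hybrid prime.hpf prime` prints the three commands of the combined MPI and PThreads example.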
Besides HPF compilation, ADAPTOR can also be used to translate parallel OpenMP Fortran programs to programs using explicit thread parallelism.
      program PRIME
      integer n, s, k
      logical, allocatable :: a(:)
!hpf$ distribute a(block)
      integer tickstart, tickstop, tickrate
      integer nt, omp_get_max_threads
      print *, 'Input n for counting primes in range 2 to n : '
      read *, n
      call system_clock (tickstart, tickrate)
      allocate (a(1:n))
      a = .true.
      a(1) = .false.
      s = 0
      nt = omp_get_max_threads ()
!$omp parallel private (i,k), reduction (+:s)
      k = 2
      do while (k*k .le. n)
c        /* sieve all multiples of k */
!$omp do
         do i = k*k, n, k
            a(i) = .false.
         end do
         k = k + 1
c        /* find next prime k */
         do while (.not. a(k))
            k = k + 1
         end do
      end do
!$omp do
      do i = 1, n
         if (a(i)) s = s + 1
      end do
!$omp end parallel
      call system_clock (tickstop)
      tickstop = tickstop - tickstart
      print *, 'Program runs on ', nt, ' threads'
      print *, 'There are ',s,' primes until ', n
      print *, 'Time needed : ', float(tickstop)/float(tickrate)
      deallocate (a)
      end
The following command translates the parallel OpenMP program to a Fortran program using shared memory parallelism via PThreads.
adaptor -openmp -o prime prime.f
The flag -openmp tells the compiler to translate the OpenMP directives and to bind the correct runtime system.
    setenv OMP_NUM_THREADS 4     ! export OMP_NUM_THREADS=4
    ./prime                      ! run it with 4 threads
ADAPTOR provides support for profiling all kinds of Fortran programs (a detailed description can be found in the ADAPTOR profiling guide [Bra04b]).
Every OpenMP and HPF program that has been translated by ADAPTOR will be instrumented automatically.
Serial Fortran programs can be instrumented by the following command:
adaptor -instr prime.f
OpenMP Fortran programs:
adaptor -instr_openmp prime.f
MPI Fortran programs:
adaptor -instr_mpi prime.f
Every program that has been translated by the adaptor compiler driver and has run through the source-to-source translation fadapt is instrumented automatically. At a minimum, there is a runtime system call at every entry and exit of a subprogram, so the runtime system can react to user-specified commands during execution and provide useful information.
With the switch -trace, or by setting the environment variable TRACE, every entry and exit of an instrumented region (by default, all subprograms) is traced. By default, each trace event is written as one line to standard output.
    export TRACE=1
    prime
or
    prime -trace

    call of PRIME (file=prime.hpf,line=3)
    end call of PRIME (file=prime.hpf,line=31)
For programs compiled for parallel execution, every processor identifies its own regions.
    export TRACE=1
    mpirun -np 2 prime
or
    mpirun -np 2 prime -trace

    1/2: call of PRIME (file=prime.hpf,line=3)
    1/2: end call of PRIME (file=prime.hpf,line=31)
    2/2: call of PRIME (file=prime.hpf,line=3)
    2/2: end call of PRIME (file=prime.hpf,line=31)
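For larger programs the trace output grows quickly, and a short filter helps to summarize it. The sketch below counts the "call of" events per processor. It assumes that every trace line has exactly the form shown above (a prefix such as 1/2: followed by the event text). The function name calls_per_rank and the file name trace.log are hypothetical.

```shell
# Hypothetical helper: read trace lines on standard input and print, for each
# processor prefix such as "1/2", the number of "call of" events.
# Lines starting with "end call of" are not counted.
calls_per_rank() {
    awk '$1 ~ /^[0-9]+\/[0-9]+:$/ && $2 == "call" && $3 == "of" {
             sub(/:$/, "", $1)      # strip the trailing colon from the prefix
             n[$1]++
         }
         END { for (r in n) print r, n[r] }' | sort
}
```

If the trace output of a run has been saved to a file, `calls_per_rank < trace.log` prints one line per processor, for example "1/2 1" and "2/2 1" for the output shown above.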
With the switch -pm, or by setting the environment variable PM, the runtime system collects how much time (wall time) has been spent in every region.
    export PM=1
    prime
or
    prime -pm

    profile data in pm.out for tid=0, nid=0
    Counters for all regions
    ========================
    PRIME (SUBPROGRAM,file=prime.hpf,lines=3:31)
       calls         : 1
       walltime(B) s : 1.470
       walltime(N) s : 1.470