Special Considerations for Running on a Network of Workstations


To run on a network of workstations, you must specify the host names of the machines that you want to run on. This can be done in several ways, which are described in detail in the Users' Guide; a shorter version is given here.

The easiest way is to edit the file mpich/util/machines/machines.xxxx so that it contains the names of the machines of architecture xxxx, one per line. The xxxx matches the arch given when mpich was configured. Then, whenever mpirun is executed, the required number of hosts will be selected from this file for the run. (There is no fancy scheduling; the hosts are selected starting from the top.) To run all your MPI processes on a single workstation, just make all the lines in the file the same. A sample machines.sun4 file might look like:

    mercury
    venus
    earth
    mars
    earth
    mars

To run the test suite in examples/test, you need a machines file with at least five lines in it. This is for homogeneous networks. Heterogeneous networks are discussed in the Users' Guide.
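
The top-down selection described above can be mimicked with a short shell sketch. This is only an illustration, not how mpirun is implemented: the function name select_hosts and the temporary file are hypothetical, and the sketch assumes that no more hosts are requested than the file has lines.

```shell
# Hypothetical helper: print the first N host names from a machines file,
# mimicking the "start from the top" selection described above.
select_hosts() {
    # $1 = machines file, $2 = number of hosts wanted
    head -n "$2" "$1"
}

# Build a sample machines file with the same contents as the sample above.
machines_file=$(mktemp)
printf '%s\n' mercury venus earth mars earth mars > "$machines_file"

# Hosts that a 4-process run would use under this assumption.
select_hosts "$machines_file" 4
```

With the sample file, a four-process run would land on mercury, venus, earth, and mars, and a six-process run would place two processes each on earth and mars.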




Dealing with automounters


Automounters are programs that dynamically make file systems available when needed. While this is very convenient, many automounters are unable to recognize the file system names that the automounter itself generates. For example, if a user accesses a file /home/me, the automounter may discover that it needs to mount this file system, and does so in /tmp_mnt/home/me. Unfortunately, if the automounter on a different system is presented with /tmp_mnt/home/me instead of /home/me, it may not be able to find the file system. This would not be such a problem if commands like pwd returned /home/me instead of /tmp_mnt/home/me; unfortunately, it is all too easy to get a path that the automounter should, but does not, recognize.

To deal with this problem, configure lets you specify a filter program with the option -automountfix=PROGRAM, where PROGRAM is a filter that reads a file path from standard input, makes any changes necessary, and writes the result to standard output. mpirun uses this program to correct the file paths that it works with. By default, the value of PROGRAM is

sed -e s@/tmp_mnt/@/@g 
This uses the sed command to strip the string /tmp_mnt from the file name. Simple sed scripts like this may be used as long as they do not involve quotes (single or double) or use % (these will interfere with the shell commands in configure that do the replacements). If you need more complex processing, use a separate shell script or program.
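
To see what the default filter does, it can be run by hand on a sample path. The path below is simply the example used earlier in this section; nothing else is assumed.

```shell
# Apply the default automount filter to a sample path.
# /tmp_mnt/home/me is the example path from the text above.
fixed=$(echo /tmp_mnt/home/me | sed -e s@/tmp_mnt/@/@g)
echo "$fixed"    # prints /home/me
```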

As another example, some systems will generate paths like

/a/thishost/root/home/username/.... 
which are valid only on the machine thishost, but also have paths of the form
/u/home/username/.... 
that are valid everywhere. For this case, the configure option
-automountfix='sed -e s@/a/.\*/home@/u/home@g' 
will make sure that mpirun gets the proper filename.
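
When a single sed expression is not enough, the filter can be a small script. The sketch below is a hypothetical filter (the function name fix_path and the sample paths are assumptions) that combines both rewrites shown in this section. Like any filter given to -automountfix, it reads a path from standard input and writes the corrected path to standard output.

```shell
# Hypothetical filter combining both rewrites from this section.
# Reads paths on standard input, writes corrected paths to standard output.
fix_path() {
    sed -e 's@/tmp_mnt/@/@g' \
        -e 's@/a/.*/home@/u/home@g'
}

# Sample paths, made up for illustration.
echo /a/thishost/root/home/username/prog | fix_path   # prints /u/home/username/prog
echo /tmp_mnt/home/me | fix_path                      # prints /home/me
```

Because the script lives in its own file, it may use quotes freely; the restriction on quotes and % applies only to text passed directly on the configure command line.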




Faster job startup


When using the ch_p4 or ch_nexus devices, it is possible to speed up the process of starting jobs by using the secure server. The secure server is a program that runs on the machines listed in the machines.ARCH file and that allows programs to start faster. There are two ways to install this program: so that only one user may use it, or so that all users may use it. No special privileges are required to install the secure server for a single user's use.

To use the secure server, follow these steps:

    1. Choose a port. This is a number that you will use to identify the secure server (different port numbers may be used to allow multiple secure servers to operate). A good choice is a number over 1000. If you pick a number that is already being used, the server will exit, and you'll have to pick another number. On many systems, you can use the rpcinfo command to find out which ports are in use (or reserved). For example, to find the ports in use on host mysun, try
    rpcinfo -p mysun 
    

    2. If using the ch_p4 device, build the secure server. From the top level directory, do
    make serv_p4 
    
    At the end of this step, the executable for the secure server is in the same directory as the MPI libraries. The name of the server is serv_p4.

    Alternatively, do

    make server 
    
    from the top level directory. At the end of this step, the new secure server is in the same directory as the MPI libraries. The name of the server is server.


    3. Start the secure server. The script bin/chp4_servs can be used to start the secure servers:

    bin/chp4_servs -port=n -arch=$ARCH 
    
    For example, if you had chosen a port number of 2345 and were using sun4s, you would give the command
    bin/chp4_servs -port=2345 -arch=sun4 
    
    This script uses the remote shell command (rsh or remsh) to start the servers; if you cannot use the remote shell command, you will need to log into each system on which you want to start the secure server and start the server manually. The command to start an individual server using port 2345 is
    serv_p4 -o -p 2345 & 
    
    The server will keep a log of its activities in a file with the name P4Server.Log.xxxx in the current directory, where xxxx is the process id of the process that started the server (note that the server may be running as a child of that initial process).

    The newer server uses the file Secure_Server.Log.xxxx.


    4. To make use of the secure servers using the ch_p4 device, you must inform mpirun of the port number. You can do this in two ways. The first is to give the -p4ssport n option to mpirun. For example, if the port is 2345 and you want to run cpi on four processors, use

    mpirun -np 4 -p4ssport 2345 cpi 
    
    The other way to inform mpirun of the secure server is to use the environment variables MPI_USEP4SSPORT and MPI_P4SSPORT. In the C-shell, you can set these with
    setenv MPI_USEP4SSPORT yes 
    setenv MPI_P4SSPORT 2345 
    
    The value of MPI_P4SSPORT must be the port with which you started the secure servers. When these environment variables are set, no extra options are needed with mpirun.


    5. If using the ch_nexus device, find the Nexus secure server in the Nexus directory, for example, /usr/local/nexus/bin/sserver.


    6. Start the Nexus secure server on each machine. The command to start an individual server using port 2345 is

    ssserver -d -p 2345 & 
    

    7. The ch_nexus device requires that you record the port numbers in a resource database (.rdb) file. The format of the file is
    <host> ss_port=<port #> 
    
    The -nexusdb flag should be used to tell mpirun the name of the file:
    mpirun -nexusdb ports program 
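
If the same port is used on every host, the resource database file can be generated from a machines file. The following is a sketch under that assumption; the function name make_rdb and the temporary files are hypothetical, and only the <host> ss_port=<port #> line format comes from the text above.

```shell
# Hypothetical helper: write one "<host> ss_port=<port>" line per distinct
# host in a machines file, assuming every server listens on the same port.
make_rdb() {
    # $1 = machines file, $2 = port number
    awk -v port="$2" 'NF && !seen[$1]++ { print $1 " ss_port=" port }' "$1"
}

# Sample machines file; duplicate hosts are listed only once in the output.
machines_file=$(mktemp)
printf '%s\n' mercury venus earth mars earth mars > "$machines_file"

# Write the database to a temporary file and show it.
rdb_file=$(mktemp)
make_rdb "$machines_file" 2345 > "$rdb_file"
cat "$rdb_file"
```

The name of the resulting file is what would be passed to mpirun with the -nexusdb flag.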
    
Note that when MPICH is installed, the secure server and the startup commands are copied into the library directory so that users may start their own copies of the server. This is discussed in the Users' Guide.
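
Step 4 gives the C-shell form of the environment variables. For users of Bourne-style shells (sh, ksh, bash), an equivalent sketch using the same example port 2345 is:

```shell
# Bourne-shell equivalent of the setenv commands in step 4.
# 2345 is the example port; use the port your servers were started with.
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=2345
export MPI_USEP4SSPORT MPI_P4SSPORT

# Show the settings that a child process such as mpirun would inherit.
env | grep '^MPI_' | sort
```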




Stopping the servers


To stop the servers, their processes must be killed. This is easily done with the Scalable Unix Tools [4] with the command

pfps -all -tn serv_p4 -and -o $LOGNAME -kill INT 
Alternatively, you can log into each system and execute something like
ps auxww | egrep "$LOGNAME.*serv_p4" 
and then use the kill command on the resulting process number. The double quotes let the shell expand $LOGNAME before egrep sees the pattern. Users of System V-style ps commands will have to work out what their particular form of ps needs and adjust the egrep command accordingly.
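
The manual procedure can be condensed into a small filter for BSD-style ps auxww output, where the first column is the user name and the second is the process id. This is a sketch under that assumption; the function name server_pids is hypothetical, and the sample ps lines are made up for illustration.

```shell
# Hypothetical helper: read BSD-style "ps auxww" output on standard input
# and print the process ids of serv_p4 processes owned by the given user.
server_pids() {
    # $1 = user name; column 1 is USER and column 2 is PID in BSD-style ps
    awk -v user="$1" '$1 == user && /serv_p4/ { print $2 }'
}

# Made-up sample output, used here instead of a live ps call.
sample_ps='me     4321  0.0  0.1  1024  512 ?  S  09:00  0:00 serv_p4 -o -p 2345
other  4400  0.0  0.1  1024  512 ?  S  09:01  0:00 serv_p4 -o -p 2345
me     4500  0.0  0.1  2048  900 ?  S  09:02  0:00 vi notes.txt'

printf '%s\n' "$sample_ps" | server_pids me    # prints 4321
```

On a live system the input would come from ps auxww instead of the sample text. Check the printed ids before passing them to kill, since the listing can include the filter's own process.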

An alternative approach is discussed in the section "Managing the servers".




Managing the servers


An experimental perl5 program is provided to help you manage the p4 secure servers. This program is chkserv, and is in the util directory. You can use this program to check that your servers are running, start up new servers, or stop servers that are running.

Before using this script, you must edit it. It has sample values for the fields that it will use. In particular, you should set serv_p4, portnum, and machinelist appropriately; you may also need to set the first line to your version of perl5.

To check on the status of your servers, use

chkserv -port 2345  
To restart any servers that have stopped, use
chkserv -port 2345 -restart 
This does not restart servers that are already running; you can run it as a cron job every morning to make sure that your servers are running. Note that this uses rsh to start the process on the remote systems; if you can't use rsh, you'll need to restart the servers by hand. In that case, you can use the output from chkserv -port 2345 to see which servers need to be restarted.

To stop the servers, use

chkserv -port 2345 -kill  
This contacts all running servers and tells them to exit. It does not use rsh, and can be used on any system.

This software is experimental. If you have comments or suggestions, please send them to mpibugs@mcs.anl.gov.

