Instructions for the Machines

Executive summary

More information about the high-performance computers at NYU is available at the ITS HPC page and on the NYU HPC wiki. See also the quick-start guide from Joseph.

To access any of the HPC machines, you will first log in to hpc.es.its.nyu.edu using your NYU username and password. From there, you can use ssh to reach the login nodes for the Dell cluster (hpc1.its.nyu.edu or hpc2.its.nyu.edu). You can also log in to gibbs.its.nyu.edu; Gibbs is the front-end node for Gauss, an SGI Altix machine that we will use for some of our shared-memory assignments.

Logging into the Dell cluster

All access to the high-performance machines at NYU goes through hpc.es.its.nyu.edu (a.k.a. bastionhost). Your username and password on this machine are the same ones you use to log in to NYUHome.

You will need to log in using ssh. On my desktop Linux machine, for example, I log in with the command

[box207] ~$ ssh -Y dsb7@hpc.es.its.nyu.edu

If you are using another type of machine (e.g. a Windows machine), you can use your preferred ssh client. The first time you log in, the system will warn you that it does not recognize hpc.es.its.nyu.edu; just say yes to log in anyway and permanently accept this machine as a known host.

From hpc.es, the only thing you can do is use ssh to access other machines. To reach hpc1.its.nyu.edu (one of the front-end nodes for the Dell cluster), type ssh -Y hpc1.its.nyu.edu at the prompt, e.g.

[dsb7@hpc ~]$ ssh -Y hpc1.its.nyu.edu

The first time you log in, you will again get a warning from ssh saying that it doesn't know about the host hpc1.its.nyu.edu. Say yes to log in anyway. You will then be asked to enter your NYU password again.

After you give your password, you will be logged into hpc1. The first time you log in, you will be prompted to set up an ssh key, which the system uses to establish secure communication between the nodes in the cluster. Just hit enter when prompted for a passphrase. The interaction will look something like this:

It doesn't appear that you have set up your ssh key.
This process will make the files:
     /home/dsb7/.ssh/id_rsa.pub
     /home/dsb7/.ssh/id_rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/home/dsb7/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/dsb7/.ssh/id_rsa.
Your public key has been saved in /home/dsb7/.ssh/id_rsa.pub.
The key fingerprint is: ...

Now you are logged into the cluster!

Moving files to and from the Dell cluster

The wiki describes how to move data to and from the cluster. If you have a CIMS account, you can move files into your CIMS home directory and then retrieve them via access.cims.nyu.edu. For example, here is how I get a file from my CIMS home directory to the cluster:

[dsb7@login-0-0 ~]$ sftp dbindel@access.cims.nyu.edu
Connecting to access.cims.nyu.edu...
Password: 
sftp> get myfile.txt
Fetching /home/dbindel/myfile.txt to myfile.txt
/home/dbindel/myfile.txt ...
sftp> quit

You can do something similar to the above if you have an account on another UNIX machine that provides sftp access.

If you are using Windows, use WinSCP to copy files to hpc.es.its.nyu.edu, and then copy them to your directory on the Dell cluster with scp. WinSCP must be set to use scp rather than sftp; the first page of the connection dialog has a radio button to select between the two protocols.
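Once the file is on hpc.es, a command along these lines (the file name is just an example) should copy it into your home directory on the Dell cluster:

  scp myfile.txt hpc1.its.nyu.edu: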

Compiling serial code on the Dell cluster

The GNU compilers gcc and g77 are available on the Dell cluster, and these are the default system compilers. However, you will probably want to use the Intel compilers. To load the Intel C compiler (icc), first use the module load command to set up the correct paths and environment variables:

module load intel-c/cce/10.0.023

If you want to load the Intel Fortran compiler (ifort), run

module load intel-fortran/fce/10.0.023
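Once the appropriate module is loaded, you compile as you would on any Linux machine. A minimal, purely illustrative serial build might look like

  icc -O3 -o matmul.x matmul.c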

The HPC wiki has further instructions.

Running serial jobs on the Dell cluster

For playing around, you may want to run in interactive mode. The HPC wiki has instructions on running interactively; to run on just one node, type the following line at the prompt on hpc1 or hpc2:

qsub -q pclass -I -l nodes=1:ppn=1,walltime=04:00:00

You can also use the general interactive queue:

qsub -I -q interactive -l nodes=1:ppn=1,walltime=04:00:00

For timing runs, you will probably want to submit your job through the batch queueing system. To do this, write a PBS script and submit it with qsub. For example, to run the matrix multiplication test program, I use this PBS script; see the wiki instructions for details. An important note: the PBS system doesn't copy your login environment, so you may need to tell it where to find some libraries. See the matmul PBS script for an example.

Note that if you follow these directions, your output files will appear in the scratch filesystem. The system will send email to your NYU home account when the job is done.
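If you don't have the linked script handy, the following sketch shows the general shape of a serial PBS script (queue, resource requests, file names, and output handling are illustrative; the wiki's version stages the run in the scratch filesystem):

  #!/bin/bash
  #PBS -q pclass                           # submit to the class queue
  #PBS -l nodes=1:ppn=1,walltime=00:30:00  # one processor, 30 minutes
  #PBS -N matmul                           # job name (illustrative)
  #PBS -m ae                               # send mail when the job aborts or ends

  # PBS does not copy your login environment, so load the compiler
  # module (and set any library paths) here before running.
  module load intel-c/cce/10.0.023

  cd $PBS_O_WORKDIR                        # the directory you submitted from
  ./matmul.x > matmul.out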

Compiling and running parallel jobs on the Dell cluster

The Dell cluster has two versions of MPI installed: MPICH and OpenMPI. At this point, I strongly recommend using the MPICH version. To compile with it, load the intel-c and mvapich modules before you (or your Makefile) invoke mpicc.
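For example, to build the getname demo by hand, something along these lines should work (the source file name is my assumption):

  module add intel-c
  module add mvapich
  mpicc -O3 -o getname.x getname.c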

I recommend using the pclass queue to run parallel jobs. You can do this either interactively or with the batch system. For example, to request one node with eight processors for at most two hours of interactive use, I would type this at the login node:

qsub -I -q pclass -l nodes=1:ppn=8,walltime=02:00:00

You can use either /opt/mpiexec/bin/mpiexec or mpirun to start your program. For example, here is part of one of my interactive sessions:

[dsb7@login-0-0 getname]$ qsub -I -q pclass -l nodes=1:ppn=8,walltime=02:00:00
qsub: waiting for job 330630.hpc0.its.nyu.edu to start
qsub: job 330630.hpc0.its.nyu.edu ready

[dsb7@compute-0-107 ~]$ module add intel-c
[dsb7@compute-0-107 ~]$ module add mvapich
[dsb7@compute-0-107 ~]$ cd hpc-fa08/demo/getname/
[dsb7@compute-0-107 getname]$ /opt/mpiexec/bin/mpiexec -comm ib -np 8 ./getname.x 
... program output ...

[dsb7@compute-0-107 getname]$ mpirun -np 8 ./getname.x
... program output ...

You can also run jobs with the batch queueing system with a line like

  qsub myscript.pbs

Here is an updated, annotated PBS script that you can use as a model. It contains comments explaining what all those #PBS lines mean and how to make sure you have the appropriate modules loaded.
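If you can't get at the linked script, here is a minimal annotated sketch along the same lines (node counts, walltime, and file names are illustrative, and the process count should match nodes times ppn):

  #!/bin/bash
  #PBS -q pclass                           # class queue
  #PBS -l nodes=2:ppn=8,walltime=01:00:00  # 2 nodes x 8 processors, 1 hour
  #PBS -N getname                          # job name (illustrative)
  #PBS -m ae                               # send mail when the job aborts or ends

  # The batch environment does not inherit your login modules,
  # so load the same ones you used to compile.
  module add intel-c
  module add mvapich

  cd $PBS_O_WORKDIR                        # the directory you submitted from
  # Launch 16 MPI processes (2 nodes x 8 ppn) over InfiniBand.
  /opt/mpiexec/bin/mpiexec -comm ib -np 16 ./getname.x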

The HPC wiki has reasonable instructions for running parallel jobs on the Dell cluster. Watch this spot for tips and for directions about using the queue of machines reserved for the class.

Seeing which machines you're using

The MPI_Get_processor_name function returns a string describing the processor on which a particular MPI process is running. I've written an example program (getname.tar.gz) that you can use to see how this procedure works. On the Dell cluster, the name returned by MPI_Get_processor_name is the address of the node, so processes running on the same node have the same name string. This means you can tell whether two processes are actually on the same node by getting the processor names and doing a string comparison. This may be useful if you're trying to compare the communication behavior of the bus within a node to that of the inter-node network.

Checking for and cleaning up runaway processes

There is a bug in MVAPICH-0.9.9 that sometimes causes processes not to be terminated properly. These processes can clog up the system, slowing down everyone's programs and making the nodes unresponsive. It appears that the default mpirun associated with the mvapich module has this bug. The program /opt/mpiexec/bin/mpiexec does not have this problem, so please use that instead.

This script (checknodes.sh) can be used as part of your PBS script to check whether the nodes you're running on are also being used by someone else (in which case your timings may be unreliable). The script prints how many processes each user is running on your nodes and returns an error code if it finds processes that don't belong to you. If you put checknodes.sh in your home directory and run chmod +x checknodes.sh to make it executable, then the following fragment at the start of your PBS script will abort the run if someone else has processes on your nodes:

  if $HOME/checknodes.sh 
  then 
    echo "Okay, proceeding with run"
  else 
    echo "Error!  Someone else is using these nodes."
    exit 1;
  fi

If you have used the default mpirun and have runaway processes, this script (cleanup.sh) will log in to each node in the pclass queue and kill any of your processes running there (so if you're in the middle of a run, wait until it's done before you run this!). To run the script, upload it to your account and do the following on hpc1 or hpc2:

  chmod +x cleanup.sh
  ./cleanup.sh

The cleanup script will also print out the processes that are killed. Ordinarily, there should be four (the processes associated with the script itself); any additional processes are probably MPI jobs gone astray.