While running Quantum Espresso (QE) on the Open Science Grid (OSG), we ran into two issues:
  • OpenMPI needs an rsh binary.  Even when all processes run on one node and communicate over shared memory, so that rsh is never used, OpenMPI still looks for the binary and fails if it cannot find it.
  • Chroots (used on HCC machines for grid jobs) do not support ptys.  OpenMPI has a compile-time option to turn off pty support.
Once these issues were fixed, we were able to submit QE jobs to the OSG using Condor's partitionable slots on 8 cores.

Preparing Submission

Before submitting our first QE job, we had to compile OpenMPI and QE.  Since we are an HPC center, our existing OpenMPI build was compiled for our Infiniband, so it would always fail on the OSG, where there is no Infiniband (let alone our brand and drivers).
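
As a rough sketch of such a build, the function below configures an OpenMPI source tree with pty support disabled. --disable-pty-support is OpenMPI's configure switch for this. The function name, the source directory, and the install prefix are placeholders, and the exact options used for the original build (including how Infiniband support was left out) are not recorded here, so treat this as an assumption rather than the actual recipe.

```shell
#!/bin/bash
# Hypothetical helper: configure an unpacked OpenMPI source tree for grid use.
#   $1 = path to the OpenMPI source directory (placeholder)
#   $2 = install prefix (placeholder)
configure_openmpi() {
    src="$1"
    prefix="$2"
    # Run in a subshell so the caller's working directory is unchanged.
    # --disable-pty-support avoids pty use, which the chroots cannot provide.
    (cd "$src" && ./configure --prefix="$prefix" --disable-pty-support)
}

# Example (not run here):
#   configure_openmpi ./openmpi-src "$PWD/openmpi" && make -C ./openmpi-src install
```

Installing into a directory named openmpi keeps the layout consistent with the openmpi.tar.gz archive described below.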

After compiling, we created compressed files that contained the required files to run QE:
  • bin.tar.gz - Includes only the cp.x binary, which is specific to our run.  It could just as well have included the much more common pw.x.
  • lib.tar.gz - Includes the Intel math libraries and libgfortran.
  • openmpi.tar.gz - Includes the entire OpenMPI install directory (the result of make install).
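
The archives listed above could be produced along these lines. This is a minimal sketch, assuming a hypothetical staging directory that already contains bin/, lib/, and openmpi/ subdirectories; the function name and both paths are illustrative. Each archive unpacks into a top-level directory of the same name, which is the layout the wrapper script's PATH and LD_LIBRARY_PATH settings expect.

```shell
#!/bin/bash
# Hypothetical helper: build one .tar.gz per payload directory.
#   $1 = staging directory containing bin/, lib/, openmpi/ (assumed layout)
#   $2 = output directory for the archives
make_bundles() {
    staging="$1"
    out="$2"
    for name in bin lib openmpi; do
        # -C keeps the archive paths relative, e.g. bin/cp.x rather than
        # an absolute path from the build machine.
        tar czf "$out/$name.tar.gz" -C "$staging" "$name" || return 1
    done
}

# Example (not run here):
#   make_bundles "$HOME/qe-staging" "$PWD"
```
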
Additionally, we wrote a wrapper script, run_espresso_grid.sh, that unpacks the required files and sets the environment.

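#!/bin/bash
# Unpack the bundled binaries, libraries, pseudo files, and OpenMPI install.
tar xzf bin.tar.gz
tar xzf lib.tar.gz
tar xzf pseudo.tar.gz
tar xzf openmpi.tar.gz
mkdir -p tmp

# Point the environment at the unpacked copies in the job's working directory.
export PATH="$PWD/bin:$PWD/openmpi/bin:$PATH"
export LD_LIBRARY_PATH="$PWD/lib:$PWD/openmpi/lib:$LD_LIBRARY_PATH"
# OPAL_PREFIX tells the relocated OpenMPI install where it now lives.
export OPAL_PREFIX="$PWD/openmpi"

# rsh is never used on a single node, but OpenMPI requires that the agent exists.
mpirun --mca orte_rsh_agent "$PWD/rsh" -np 8 cp.x < h2o-64-grid.in > h2o-64-grid.out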
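
Because a missing rsh binary makes mpirun fail, a wrapper like this one might check for the agent up front and stop with a clear message. The function below is a hypothetical addition, not part of the original script; it only tests that the given path exists and is executable.

```shell
#!/bin/bash
# Hypothetical pre-flight check for the rsh agent passed to mpirun.
#   $1 = path to the rsh binary (e.g. "$PWD/rsh" on the worker node)
check_rsh_agent() {
    agent="$1"
    if [ ! -x "$agent" ]; then
        echo "rsh agent not found or not executable: $agent" >&2
        return 1
    fi
    return 0
}

# Example (not run here), placed just before the mpirun line:
#   check_rsh_agent "$PWD/rsh" || exit 1
```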

Submission

We used GlideinWMS to submit to the OSG.  Below is our HTCondor submit file.
universe = vanilla
output = condor.out.$(CLUSTER).$(PROCESS)
error = condor.err.$(CLUSTER).$(PROCESS)
log = condor.log
executable = run_espresso_grid.sh
request_cpus = 8
request_memory = 10*1024
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = bin.tar.gz, lib.tar.gz, pseudo.tar.gz, openmpi.tar.gz, h2o-64-grid.in, /usr/bin/rsh
transfer_output_files = h2o-64-grid.out
+RequiresWholeMachine=True
Requirements = CAN_RUN_WHOLE_MACHINE =?= TRUE
queue

Note that we pull rsh from the submission machine.  OpenMPI does not actually use rsh to start the processes on a shared-memory machine, but it does require that the rsh binary be available.
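
Since the job depends on files taken from the submission machine, including /usr/bin/rsh, it can help to confirm that every entry in transfer_input_files exists before submitting. The sketch below is an assumption about how one might do that; the function name and the submit file name espresso.submit are hypothetical.

```shell
#!/bin/bash
# Hypothetical submit-side check: report every listed input that is missing.
# Returns 0 if all paths exist, 1 otherwise.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here):
#   check_inputs bin.tar.gz lib.tar.gz pseudo.tar.gz openmpi.tar.gz \
#       h2o-64-grid.in /usr/bin/rsh && condor_submit espresso.submit
```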

Acknowledgments

This was done with the tremendous help of Jun Wang.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.