Parallel Runscript with Parallel Jobs

Hello All,

This probably should have been one of my first tutorial posts, but it is worth rehashing now. As many of you will be running on large, many-processor systems, tuning your runscript is hugely important for a couple of reasons:

  • Proper scaling of codes
  • Proper use of Computational Resources
  • Proper use of your time as a computational professional

As we have more and more computational resources at our fingertips, it is important to stay vigilant in how your codes scale with the number of processors. I always say this as a joke, but there is some truth in it:

People barely know how to program and no one knows how to program in parallel
Levi, 2016

Whenever you start a new project, you should always run a scaling test. This simply means running your system in parallel with several different numbers of processors and recording the wall time for each. By plotting wall time against processor count, you can determine the point beyond which adding processors no longer pays off. You do not want to waste computational resources, now do you? A frugal computational theorist keeps their adviser completely happy.
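A scaling test like this can be scripted in a few lines. The following is only a minimal sketch: the function names time_run and scaling_test are made up for illustration, the processor counts are an arbitrary choice, and the executable and input file in the example call are assumptions you would replace with your own.

```bash
#!/bin/bash
# Sketch of a scaling test: time one command at several processor counts.

# time_run N CMD [ARGS...]: run the command and print "N <wall seconds>".
time_run() {
    local n=$1
    shift
    SECONDS=0                  # bash builtin timer, counts whole seconds
    "$@" > /dev/null 2>&1      # discard the program's own output
    echo "$n $SECONDS"
}

# scaling_test LAUNCHER PROGRAM [ARGS...]: run the program on 1..32 processes.
scaling_test() {
    local launcher=$1
    shift
    local n
    for n in 1 2 4 8 16 32; do
        time_run "$n" "$launcher" -np "$n" "$@"
    done
}

# Example call (assumed executable and input file):
#   scaling_test mpirun pw.x -in scf.in > scaling.dat
```

Each output line holds a processor count and the wall time in seconds, which is a convenient two-column format for plotting. For very short runs, whole-second resolution is too coarse, so use a test case that runs for at least a few minutes.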

Why does this matter? Most supercomputers charge computational time as:

$latex CPUc = P*W_{t}*N_{cpus} $

where $latex CPUc$ is the total cost of the calculation, $latex P$ is a charge prefactor >= 1, $latex W_{t}$ is the wall time of the calculation, and $latex N_{cpus}$ is the number of CPUs used in the calculation. As you can see, the cost is directly proportional to both the wall time and the number of CPUs, so an efficient calculation finds the sweet spot between the calculation time and the number of CPUs used.
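To make the formula concrete, here is a small sketch that evaluates it. The function name cpu_cost is hypothetical, and the wall times below are placeholder numbers chosen for illustration, not measurements.

```bash
#!/bin/bash
# cpu_cost P WALL_HOURS NCPUS: charged core-hours, CPUc = P * W_t * N_cpus.
cpu_cost() {
    awk -v p="$1" -v w="$2" -v n="$3" 'BEGIN { print p * w * n }'
}

# Placeholder numbers, not measurements: the same job on 16 and on 32 CPUs.
cpu_cost 1 2.0 16   # prints 32
cpu_cost 1 1.5 32   # prints 48: finishes sooner, yet costs more
```

In this made-up case, doubling the CPUs shortens the wall time by only a quarter, so the bill goes up by half. That is the trade-off a scaling test lets you see before you spend the allocation.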

As an example, the following figure shows some scaling data that I compiled for Quantum Espresso on a simple silicon calculation. I know this scaling looks terrible; there was a small error in the compilation, which made it a great example for this blog post. As you can see, it doesn’t make sense to run a calculation with more than 16 processors.

A scaling example showing QE data for CPU and GPU calculations.

Say, hypothetically, you are running on a cluster with 32 cores/node where this version of QE is installed. What makes sense to run? You might say ‘Well, Levi, the wall time bottoms out at 16 cores, and it actually takes longer with 32, so I am going to use 16.’ That would be a fine choice, but almost all supercomputers charge you for the whole node at a time, i.e. even though you are only doing computations on 16 of the cores, you get charged as if you were running on all 32.

Fair, right?
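To put numbers on whole-node charging, here is a small sketch. It assumes the policy described above (requests are rounded up to whole nodes); the function name charged_cores is made up for illustration, and your cluster's accounting rules may differ.

```bash
#!/bin/bash
# charged_cores REQUESTED CORES_PER_NODE: cores billed when whole nodes are charged.
charged_cores() {
    local requested=$1 per_node=$2
    local nodes=$(( (requested + per_node - 1) / per_node ))   # round up to whole nodes
    echo $(( nodes * per_node ))
}

charged_cores 16 32   # prints 32: half of the billed cores sit idle
charged_cores 32 32   # prints 32: same bill, every core in use
```

On a 32-core node, a 16-core job and a 32-core job cost the same per hour, so the idle half of the node is paid for either way.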

The simple workaround is to run two processes in the same runscript. I know this may seem obvious, but I have had to reprimand lab-mates for not taking this into account. The key to being able to do this is forking and waiting. Forking means spawning a process in the background. It is really powerful because each forked process starts in its own subshell, allowing you to customize that process’ environment without affecting the others. Waiting is a shell built-in command that keeps the script from exiting until all sub-processes (the ones we forked) have completed. Here is an example for the above case:

#!/bin/bash
#SBATCH -n 32 # Number of cores
#SBATCH -p batch # Partition to submit to
#SBATCH -o hostname.out # File to which STDOUT will be written
#SBATCH -e hostname.err # File to which STDERR will be written
export PATH=/data/Software/bin/Espresso-5.3-bin:$PATH
export LD_LIBRARY_PATH=/data/Software/lib/Espresso-5.3-lib/:$LD_LIBRARY_PATH
export PHI_DGEMM_SPLIT=0.9
export PHI_ZGEMM_SPLIT=0.9
export OMP_NUM_THREADS=1

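# First half of the structures: GPU 0, 16 MPI processes per calculation
loop1() {
    export CUDA_VISIBLE_DEVICES=0   # only affects this loop's background subshell
    for i in {2..500}
    do
        # Skip indices that have no position file
        if [ -e ~/BaTiO3/MD_generation/Lammps/Positions/"$i".out ]
        then
            mkdir -p "$i"
            cd "$i" || continue
            cp ../scf.in .
            # Append the atomic positions to the input template
            cat ~/BaTiO3/MD_generation/Lammps/Positions/"$i".out >> scf.in
            mpirun -np 16 pw-gpu.x -ndiag 1 -np 4 -ni 4 -in scf.in > scf.out
            cd ../
        fi
    done
}

# Second half of the structures: GPU 1, 16 MPI processes per calculation
loop2() {
    export CUDA_VISIBLE_DEVICES=1   # only affects this loop's background subshell
    for i in {501..1000}
    do
        # Skip indices that have no position file
        if [ -e ~/BaTiO3/MD_generation/Lammps/Positions/"$i".out ]
        then
            mkdir -p "$i"
            cd "$i" || continue
            cp ../scf.in .
            # Append the atomic positions to the input template
            cat ~/BaTiO3/MD_generation/Lammps/Positions/"$i".out >> scf.in
            mpirun -np 16 pw-gpu.x -ndiag 1 -np 4 -ni 4 -in scf.in > scf.out
            cd ../
        fi
    done
}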

loop1 &
loop2 &
wait

The key thing we are doing here is creating two functions, one called loop1 and the other called loop2. In this case I was generating a lot of data, as I am currently working on neural network methods which require a lot of it. In each loop I set the GPU I want to run on with export CUDA_VISIBLE_DEVICES, which readily allows me to use both GPUs in one job, drastically speeding up the whole data set. I then call loop1 and loop2 and immediately fork them to the background (with the &). Finally I call a ‘wait’ command, which keeps this script running until both loop1 and loop2 complete. In this way, I can use all 32 processors while performing two calculations at once, effectively getting the most bang for my buck.
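Stripped of the Quantum Espresso details, the fork-and-wait pattern looks roughly like the sketch below. The names task, run_both, and TASK_ID are placeholders, and sleep stands in for a real calculation. It also shows that a variable exported inside a forked function never leaks back into the parent script.

```bash
#!/bin/bash
# Minimal fork-and-wait sketch; task(), run_both() and TASK_ID are placeholder names.
task() {
    export TASK_ID=$1     # set inside the background subshell only
    sleep 1               # stand-in for a real calculation
    echo "task $TASK_ID finished"
}

run_both() {
    task 0 &              # "&" runs the function in a background subshell
    task 1 &
    wait                  # block until every background job has exited
    echo "parent TASK_ID is ${TASK_ID:-unset}"
}

run_both
```

Both tasks run at the same time, so the whole script takes about one second rather than two, and the last line reports that TASK_ID is unset in the parent. This is the same isolation that lets each loop pick its own GPU through CUDA_VISIBLE_DEVICES.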

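You could do the same with any number of processors on the system. There is no limit to the number of processes you can fork, and in my experience openmpi does a great job keeping track of which cores are being used. It is still worth checking with mpirun's --report-bindings option that the separate jobs are not pinned to the same cores.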
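For more than two simultaneous jobs, writing one function per loop gets tedious. The sketch below splits an index range into equal chunks and forks one loop per chunk. All names here (chunk_bounds, run_chunk, launch_all) are hypothetical, and the loop body is a no-op placeholder where the per-structure commands from the script above would go.

```bash
#!/bin/bash
# chunk_bounds TOTAL NCHUNKS K: print first and last index (1-based) of chunk K (0-based).
chunk_bounds() {
    local total=$1 nchunks=$2 k=$3
    local size=$(( (total + nchunks - 1) / nchunks ))   # ceiling division
    local first=$(( k * size + 1 ))
    local last=$(( (k + 1) * size ))
    if (( last > total )); then last=$total; fi
    echo "$first $last"
}

# run_chunk FIRST LAST: placeholder for the per-structure work.
run_chunk() {
    local i
    for (( i = $1; i <= $2; i++ )); do
        :   # the mkdir/cp/cat/mpirun steps for structure $i would go here
    done
    echo "chunk $1-$2 done"
}

# launch_all TOTAL NCHUNKS: fork one background loop per chunk, then wait.
launch_all() {
    local k
    for (( k = 0; k < $2; k++ )); do
        # Unquoted on purpose: the two bounds become two separate arguments
        run_chunk $(chunk_bounds "$1" "$2" "$k") &
    done
    wait
}

launch_all 1000 4
```

With 4 chunks on a 32-core node you would give each mpirun 8 processes, so that the chunk count times the processes per job still matches the cores you are being charged for.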

I hope this helps!

Happy computing,

Levi

3 comments
  1. I’m so happy to see your QE-GPU here. I have a question about my QE-GPU build, which I compiled on an E5-2680 v3 with Intel Composer 2013, 128 GB of total memory, and mpich2-intel:
    tar -xvf espresso-5.0.2.tar.gz
    tar -xvf QE-GPU-14.03.0.tar.gz -C espresso-5.0.2
    cd espresso-5.0.2
    cp -avf QE-GPU-14.03.0/GPU/ .
    cd GPU
    ./configure --enable-parallel --enable-cuda --with-gpu-arch=sm_30 --with-cuda-dir=/usr/local/cuda-6.0 --with-scalapack=-L/opt/mathlib/scalapack/libscalapack.a --with-internal-lapack=-L/usr/lib64/liblapack.a --with-internal-blas=-L/usr/lib64/libblas.a CC=mpicc FC=mpif90 F77=mpif77 --with-phigemm --without-magma
    cd ..
    make -f Makefile.gpu pw-gpu
    In the end, pw-gpu.x appeared in the bin and GPU/PW/ directories.
    But when I ran a test, something went wrong:
    *** phiGEMM *** ERROR *** Missing initialization. Do CPU-only.
    How can I solve this problem, and which release did you use?
    I look forward to your reply!
    Thanks in advance!

    1. Hello Lee,

      What type of GPUs are you using on this system? I have not personally worked much on the actual programming of QE-GPU, but I have a lot of experience with those god-awful GPU compilations.

      First, make sure you have pw.x compiled normally, i.e. compile it without the GPU version first. Then try to compile the GPU-only version.
      Second, if that does not work, try manually unpacking phigemm. I have found that the makefile sometimes messes this step up.
      Third, if you have a failed compilation, please make sure you start from scratch. If you do not, the build sometimes picks up previous partial builds.

      Cheers,

      Levi
