Sample Containerized Workflows¶
Warning
This page contains a sample of containerized workflows that demonstrate various techniques built up in practice, often from resolving user issues. We do not necessarily endorse or support each use case; rather, these examples are provided in the hope that they may be useful in demonstrating (i) sample containerized workflows, and (ii) solutions to various problems you may encounter.
NVIDIA's NGC containers¶
NVIDIA's NGC is a catalog of software optimized for GPUs. NGC containers allow you to run data science projects "out of the box" without installing, configuring, or integrating the infrastructure.
NVIDIA's Modulus physics-ML framework¶
NVIDIA Modulus is an open source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods. NVIDIA provides a frequently updated Docker image with a containerized PyTorch installation that can be run under Apptainer, albeit with some effort. Because the container is designed for Docker, some additional steps are required as discussed below.
Running containerized NVIDIA-Modulus on a single Casper GPU
1.  Rather than pull the container and run it as-is, we will create a derived container that allows us to encapsulate our desired changes. The primary reason for this is that the Modulus container assumes it is writable and makes changes during execution. Since we will run under Apptainer using a compressed, read-only image, this fails. Therefore we will make our own derived image and make the requisite changes during the build process.

    This is accomplished first by creating a simple Apptainer definition file, `my_modulus.def`:

    ```
    Bootstrap: docker
    From: nvcr.io/nvidia/modulus/modulus:24.07

    %post
        # update pip
        python -m pip install --upgrade pip

        # use pip to install additional packages needed for examples later
        pip install warp-lang mlflow

        # Remove cuda compat layer (https://github.com/NVIDIA/nvidia-docker/issues/1256)
        # note that the source container attempts to do this at run-time, but that will
        # fail when launched read-only.  So we do that here instead.
        # (This issue will likely be resolved with newer versions of nvidia-modulus)
        rm -rf /usr/local/cuda/compat/lib
    ```

    The definition file begins by pulling a specified version of the Modulus container, then modifying it in our `%post` step. In `%post` we update the `pip` Python package installer, use `pip` to install some additional Python packages not in the base image but required for the examples run later, and finally remove a conflicting path from the source image.

    Using the `my_modulus.def` file we now create our derived container and store it as a SIF (a hedged sketch of the build command follows this list). Note in this step we have explicitly set `TMPDIR` to a local file system, as occasionally containers fail to build on the large parallel file systems usually used for `TMPDIR` within NCAR. (The failure symptoms are usually fatal error messages related to `xattrs`.)
2.  Fetch some examples so we can test our installation (see the sketch after this list).
3.  Run the container in an interactive session on a single Casper GPU. We will launch an interactive session, then run the container interactively with the `singularity shell` command:

    ```
    # Interactive PBS submission from a login node:
    qsub -I -A <ACCOUNT> -q casper -l select=1:ncpus=4:mpiprocs=4:ngpus=1 -l gpu_type=v100 -l walltime=1:00:00

    # Then on the GPU node:
    module load apptainer
    singularity shell \
        --nv --cleanenv \
        --bind /glade/work \
        --bind /glade/campaign \
        --bind /glade/derecho/scratch \
        ./my_modulus.sif
    ```

    Note the command line arguments to `singularity shell`:

    - `--nv`: enable NVIDIA GPU support.
    - `--cleanenv`: clean the environment before running the container, causing the container to be launched with no knowledge of environment variables set on the host. This is default behavior for Docker, and is required in this case to prevent conflicting `CUDA_*` and other environment variable settings from confusing the containerized PyTorch.
    - `--bind /glade/work` etc.: binds host file systems into the container, allowing us to read and write from GLADE.
4.  Now we are inside the container, as evidenced by the `Apptainer>` command line prompt in the final step of this example. We will run one of the sample problems checked out in step 2:

    ```
    Apptainer> cd modulus/examples/cfd/darcy_fno/
    Apptainer> python ./train_fno_darcy.py
    Warp 0.10.1 initialized:
       CUDA Toolkit: 11.5, Driver: 12.3
       Devices:
         "cpu"    | x86_64
         "cuda:0" | Tesla V100-SXM2-32GB (sm_70)
       Kernel cache: /glade/u/home/benkirk/.cache/warp/0.10.1
    [21:04:13 - mlflow - WARNING] Checking MLFlow logging location is working (if this hangs its not)
    [21:04:13 - mlflow - INFO] MLFlow logging location is working
    [21:04:13 - mlflow - INFO] No Darcy_FNO experiment found, creating...
    [21:04:13 - checkpoint - WARNING] Provided checkpoint directory ./checkpoints does not exist, skipping load
    [21:04:13 - darcy_fno - WARNING] Model FourierNeuralOperator does not support AMP on GPUs, turning off
    [21:04:13 - darcy_fno - WARNING] Model FourierNeuralOperator does not support AMP on GPUs, turning off
    [21:04:13 - darcy_fno - INFO] Training started...
    Module modulus.datapipes.benchmarks.kernels.initialization load on device 'cuda:0' took 205.84 ms
    Module modulus.datapipes.benchmarks.kernels.utils load on device 'cuda:0' took 212.94 ms
    Module modulus.datapipes.benchmarks.kernels.finite_difference load on device 'cuda:0' took 670.44 ms
    [21:04:46 - train - INFO] Epoch 1 Metrics: Learning Rate = 1.000e-03, loss = 6.553e-01
    [21:04:46 - train - INFO] Epoch Execution Time: 3.241e+01s, Time/Iter: 1.013e+03ms
    [21:05:14 - train - INFO] Epoch 2 Metrics: Learning Rate = 1.000e-03, loss = 4.255e-02
    [21:05:14 - train - INFO] Epoch Execution Time: 2.812e+01s, Time/Iter: 8.786e+02ms
    [...]
    ```
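A sketch of the build (step 1) and example-fetch (step 2) commands might look like the following, assuming a node-local `/var/tmp` scratch space for `TMPDIR` and the upstream NVIDIA `modulus` GitHub repository as the source of the examples; adjust both to suit your workflow:

```
# step 1 (continued): build the derived SIF, pointing TMPDIR at a local file
#                     system to avoid the xattrs-related failures noted above
module load apptainer
export TMPDIR=/var/tmp/${USER}
mkdir -p ${TMPDIR}
singularity build ./my_modulus.sif ./my_modulus.def

# step 2: fetch the Modulus examples exercised later in this demonstration
git clone https://github.com/NVIDIA/modulus.git
ls modulus/examples/cfd/darcy_fno/
```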
While this example demonstrated running the container interactively, steps 3 and 4 can alternatively be combined and run inside a PBS batch job.
Popular AI/ML tools¶
Optimized TensorFlow and PyTorch models are available directly from the NGC.
Running AI/ML tools from NGC containers
Building an image with Apptainer
Anticipating that we may want to make additions to the container, we will build our own derived Apptainer image using a Definition file.
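As a minimal sketch, a derived definition file - call it `my_image.def`, matching the `my_image.sif` used below - might base itself on an NGC PyTorch container (the image tag and extra `pip` packages here are illustrative only):

```
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.01-py3

%post
    # add any extra Python packages our workflow needs on top of the NGC image
    python -m pip install --upgrade pip
    pip install xarray netCDF4

%environment
    # example of a customization baked into the derived image
    export PYTHONNOUSERSITE=1
```

The SIF is then created with `singularity build ./my_image.sif ./my_image.def`, analogous to the Modulus example above.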
Run the image
module load apptainer
singularity shell \
--nv --cleanenv \
--bind /glade/work \
--bind /glade/campaign \
--bind /glade/derecho/scratch \
./my_image.sif
[...]
Apptainer>
Note the command line arguments to `singularity shell`: `--nv --cleanenv` enables NVIDIA GPU support with a clean environment, and `--bind /glade/work` etc. binds host file systems into the container, allowing us to read and write from GLADE.

Building and running containerized WRF under MPI¶
Warning
While we strongly encourage users to keep up with the latest WRF releases, we recognize some users may have customized older versions of WRF for particular purposes, and porting these changes can pose a significant burden.
In such cases containerization offers a viable (if unpalatable) option for running old code that may be difficult or impossible to compile unchanged on Derecho.
This example demonstrates building general purpose containers to facilitate compiling old versions of WRF.
Containerization approach¶
The container is built off-premises with docker
from three related images, each providing a foundation for the next. We begin with an
- OpenSUSE version 15 operating system (chosen to maximize Derecho interoperability) with a number of WRF dependencies installed,
- then add relevant compilers, MPI, and NetCDF,
- then compile various versions of WRF and WPS to demonstrate functionality.
The full set of `Dockerfile`s and associated resources can be found on GitHub.
The base layer¶
The OpenSUSE 15 base layer
The base layer is the common foundation of components independent of the compiler suite ultimately used for WRF/WPS. It includes the operating system image along with relevant packages installed from package repositories.
- The image begins with a minimal OpenSUSE v15 image, and adds a utility script `docker-clean` copied from the build host.
- The first `RUN` instruction updates the OS image, creates the `/container` workspace, and installs a minimal development environment (compilers, file utilities, etc.).
- The second `RUN` instruction installs specific packages required by the WRF & WPS build system and build dependencies. We elect here to install HDF5 from the OpenSUSE package repository as a matter of convenience - it is required later to build NetCDF from source, and the packaged version is entirely adequate for that purpose. (Should the user want advanced capabilities within HDF5 it may be necessary instead to compile HDF5 from source.)
- The remaining `RUN` instructions modify the image to expose our customizations through a "source-able" environment configuration file (`/container/config_env.sh`) and also to comply with the (occasionally overly restrictive) expectations of the WRF build system. For example, the base OS image supplies the PNG library as `-L/usr/lib64 -lpng16`, whereas WRF expects `-L/usr/lib64 -lpng`. We also install an old version of the Jasper library from source using the system `gcc` compiler. While a newer version of Jasper is available directly from the OpenSUSE package repository, that version is too new for certain older WPS releases.
- Finally, we set several environment variables used by the WRF/WPS build systems later on through `ENV` instructions.
Discussion
- Notice that each `RUN` step is finalized with a `docker-clean` command. This utility script removes temporary files and cached data to minimize the size of the resulting image layers. One consequence is that the first `zypper` package manager interaction in a `RUN` statement will re-cache these data. Since cached data are not relevant in the final image - especially when run much later on - we recommend removing it to reduce image bloat.
- We choose `/container` as the base path for all 3rd-party and compiled software so that when running containers from this image it is obvious what files come from the image vs. the host file system.
- We generally choose to add the search paths for compiled libraries to the "system" (container) default search paths rather than rely on `LD_LIBRARY_PATH` (see the sketch after this list). Since we are installing single versions of the necessary libraries this approach is viable, and makes the resulting development environment less fragile.
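As an illustration of the last point, registering library directories with the dynamic loader inside a Dockerfile `RUN` step might look like the following sketch (the exact directories registered in the real image may differ; the paths shown follow the `/container` convention above):

```
# make container-installed libraries visible to the loader without LD_LIBRARY_PATH
echo "/container/netcdf/lib"         >  /etc/ld.so.conf.d/container.conf
echo "/container/jasper/1.900.1/lib" >> /etc/ld.so.conf.d/container.conf
ldconfig
```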
The compiler + dependencies layer¶
Next we will extend the base layer to include 3rd-party compilers, a particular MPI (configured for maximum compatibility with Derecho), and NetCDF. We proceed on three parallel paths:
- Installing an old version of the Intel "classic" compilers compatible with WRF versions 3 and 4;
- Using the OpenSUSE-provided `gcc` version 7.5.0; and
- Installing a recent version of the `nvhpc` compilers, which provide the legacy Portland Group `pgf90` compiler supported by WRF/WPS.
Testing has shown the Intel variant of the following recipes to be the most performant, while the `gcc` version results in the smallest container image. Our intent in showing all three options is primarily educational, and they may provide solutions to issues encountered in related workflows.
Adding compilers, MPI, and dependencies
Note
Note that the Intel compiler is licensed software and usage is subject to terms of the End User License Agreements (EULA). As indicated in the Dockerfile
below, usage of this software is contingent on accepting the terms of the license agreement.
`NCAR/derecho/WRF/intel-build-environment` Dockerfile (see the full listing on GitHub)
Here we install the Intel C/C++ and Fortran compilers, and also clean up the installation tree by removing unnecessary components (lines 13-15). By combining these steps into a single `RUN` instruction we reduce the size of the "layer" produced, and thus the resulting image size. We can remove these components because they are not required to support WRF/WPS later. This step may need to be adapted if used to support other codes.

For demonstration purposes only

Testing has shown the `nvhpc` variant does not offer any performance benefits vs. the Intel or GCC variants, and therefore is not recommended for production runs. It is provided for demonstration purposes only, and in hopes it may prove useful for other projects that have a critical dependency on this particular compiler suite.
Note
Note that the NVHPC compiler is licensed software and usage is subject to terms of the End User License Agreements (EULA).
`NCAR/derecho/WRF/nvhpc-build-environment` Dockerfile (see the full listing on GitHub)
Here we install the NVHPC C/C++ and Fortran compilers, and also clean up the installation tree by removing unnecessary components (lines 20-21). By combining these steps into a single `RUN` instruction we reduce the size of the "layer" produced, and thus the resulting image size. We can remove these components because they are not required to support WRF/WPS later. This step may need to be adapted if used to support other codes.

Dockerfile Steps
- We begin by installing the desired compiler suite (Intel & NVHPC cases). For GCC, the compilers already exist from the base layer.
- We then install NetCDF using the chosen compilers. We need to provide the Fortran interface to NetCDF, which is why we install from source here using our chosen compiler rather than selecting available versions from the package repository (as was the case with HDF5).
    - The options `--disable-byterange`, `--disable-dap`, and `--disable-libxml2` are specified to prevent NetCDF from requiring additional dependencies (unnecessary for our WRF/WPS use case) we chose not to install in the base layer.
    - The option `--disable-dependency-tracking` is common to all GNU Automake packages and allows us to speed up one-time-only builds by skipping Automake's automated dependency generation.
- Finally, we install MPICH using the chosen compilers.
Discussion
Note that MPICH is necessary within the container in order to support building WRF. At runtime we will ultimately "replace" the container's MPICH with `cray-mpich` on Derecho. For this to work properly, it is important that the two implementations be ABI compatible, and that we are able to replace any container MPI shared libraries with their host counterparts. Since `cray-mpich` is derived from the open-source MPICH-3.x, the version choice and configuration options here are very deliberate; this is also why we do NOT add this MPI to the system library search path.
The WRF/WPS layer¶
The final step is to use the development environment built up in the previous two steps to compile WRF and WPS. As an exercise to assess the completeness of the environment, we choose to install the latest versions of WRF/WPS (4.x) as well as the most recent release of the 3.x series, including both default WRF and WRF-Chem compilations. Inside the container we build WRF, WRF-Chem, and WPS according to compiler-version-specific recipes (listed below).
Adding WRF & WPS
For demonstration purposes only
Testing has shown the nvhpc
variant does not offer any performance benefits vs. the Intel or GCC variants, and therefore is not recommended for production runs. It is provided for demonstration purposes only, and in hopes it may prove useful for other projects that have a critical dependency on this particular compiler suite.
Dockerfile Steps
- We begin by cloning specific versions of WRF/WPS. To reduce image size, we clone only the relevant branch and not the full `git` repository history (see the sketch after this list).
- Next we install the build recipes for the specific compiler suite.
- Finally, we compile WRF, then WRF-Chem, and WPS, version 3.x then 4.x - taking care at each step to clean intermediate files to limit the size of the container layers.
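A hedged sketch of the shallow-clone pattern used in the first step (the exact versions and branches built by the real recipes are defined in the Dockerfiles on GitHub; the tags below are illustrative):

```
# clone only the tagged releases we need, without the full repository history
git clone --branch v4.5.2 --depth 1 https://github.com/wrf-model/WRF.git wrf-4.5.2
git clone --branch v4.5   --depth 1 https://github.com/wrf-model/WPS.git wps-4.5
```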
The completed images are then pushed to DockerHub.
Deploying the container on NCAR's HPC resources with Apptainer¶
All the previous steps were performed off-premises with a Docker installation. We can now deploy the resulting container images with Apptainer. We construct a Singularity Image File (`.sif`) from the DockerHub image, along with a convenience shell script that allows us to launch the container for interactive use.
Constructing SIF images & using the container interactively as a development environment
The definition file below simply pulls the desired image from DockerHub and performs shell initialization for interactive use. We also print some information on the configuration that is displayed when the user launches the container via `singularity run`.
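A minimal sketch of such a definition file, assuming the Intel variant image pushed to DockerHub (the repository name and the details printed at run-time are illustrative):

```
Bootstrap: docker
From: benjaminkirk/ncar-derecho-wrf-intel:latest

%post
    # make the container build environment available to interactive shells
    echo "source /container/config_env.sh" >> /etc/bash.bashrc

%runscript
    # print some basic configuration information when launched via "singularity run"
    source /container/config_env.sh
    mpicc --version
    env | grep -E "NETCDF|JASPER"
```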
We build `ncar-derecho-wrf-intel.sif` from `ncar-derecho-wrf-intel.def` as described here. The simple utility script `wrf_intel_env` allows us to easily interact with the container.
derecho$ ./wrf_intel_env
Welcome to "ncar-derecho-wrf-intel"
#----------------------------------------------------------
# MPI Compilers & Version Details:
#----------------------------------------------------------
/container/mpich/bin/mpicc
icc (ICC) 2021.7.1 20221019
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
/container/mpich/bin/mpif90
ifort (IFORT) 2021.7.1 20221019
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
MPICH Version: 3.4.3
MPICH Release date: Thu Dec 16 11:20:57 CST 2021
MPICH Device: ch4:ofi
MPICH CC: /container/intel/compiler/2022.2.1/linux/bin/intel64/icc -O2
MPICH CXX: /container/intel/compiler/2022.2.1/linux/bin/intel64/icpc -O2
MPICH F77: /container/intel/compiler/2022.2.1/linux/bin/intel64/ifort -O2
MPICH FC: /container/intel/compiler/2022.2.1/linux/bin/intel64/ifort -O2
MPICH Custom Information:
#----------------------------------------------------------
# WRF/WPS-Centric Environment:
#----------------------------------------------------------
JASPERINC=/container/jasper/1.900.1/include
JASPERLIB=/container/jasper/1.900.1/lib
HDF5=
NETCDF=/container/netcdf
FLEX_LIB_DIR=/usr/lib64
YACC=/usr/bin/byacc -d
LIB_LOCAL=
#----------------------------------------------------------
#----------------------------------------------------------
# Pre-compiled executables:
#----------------------------------------------------------
/container/wps-3.9.1/avg_tsfc.exe
/container/wps-3.9.1/calc_ecmwf_p.exe
/container/wps-3.9.1/g1print.exe
/container/wps-3.9.1/g2print.exe
/container/wps-3.9.1/geogrid.exe
/container/wps-3.9.1/height_ukmo.exe
/container/wps-3.9.1/int2nc.exe
/container/wps-3.9.1/metgrid.exe
/container/wps-3.9.1/mod_levs.exe
/container/wps-3.9.1/rd_intermediate.exe
/container/wps-3.9.1/ungrib.exe
/container/wps-4.5/avg_tsfc.exe
/container/wps-4.5/calc_ecmwf_p.exe
/container/wps-4.5/g1print.exe
/container/wps-4.5/g2print.exe
/container/wps-4.5/geogrid.exe
/container/wps-4.5/height_ukmo.exe
/container/wps-4.5/int2nc.exe
/container/wps-4.5/metgrid.exe
/container/wps-4.5/mod_levs.exe
/container/wps-4.5/rd_intermediate.exe
/container/wps-4.5/ungrib.exe
/container/wrf-3.9.1.1/ndown.exe
/container/wrf-3.9.1.1/real.exe
/container/wrf-3.9.1.1/tc.exe
/container/wrf-3.9.1.1/wrf.exe
/container/wrf-4.5.2/ndown.exe
/container/wrf-4.5.2/real.exe
/container/wrf-4.5.2/tc.exe
/container/wrf-4.5.2/wrf.exe
/container/wrf-chem-3.9.1.1/ndown.exe
/container/wrf-chem-3.9.1.1/real.exe
/container/wrf-chem-3.9.1.1/tc.exe
/container/wrf-chem-3.9.1.1/wrf.exe
/container/wrf-chem-4.5.2/ndown.exe
/container/wrf-chem-4.5.2/real.exe
/container/wrf-chem-4.5.2/tc.exe
/container/wrf-chem-4.5.2/wrf.exe
#----------------------------------------------------------
WRF-intel-dev>
We build `ncar-derecho-wrf-gcc.sif` from `ncar-derecho-wrf-gcc.def` as described here. The simple utility script `wrf_gcc_env` allows us to easily interact with the container.
Welcome to "ncar-derecho-wrf-gcc"
#----------------------------------------------------------
# MPI Compilers & Version Details:
#----------------------------------------------------------
/container/mpich/bin/mpicc
gcc (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
/container/mpich/bin/mpif90
GNU Fortran (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
MPICH Version: 3.4.3
MPICH Release date: Thu Dec 16 11:20:57 CST 2021
MPICH Device: ch4:ofi
MPICH CC: /usr/bin/gcc -O2
MPICH CXX: /usr/bin/g++ -O2
MPICH F77: /usr/bin/gfortran -O2
MPICH FC: /usr/bin/gfortran -O2
MPICH Custom Information:
#----------------------------------------------------------
# WRF/WPS-Centric Environment:
#----------------------------------------------------------
JASPERINC=/container/jasper/1.900.1/include
JASPERLIB=/container/jasper/1.900.1/lib
HDF5=
NETCDF=/container/netcdf
FLEX_LIB_DIR=/usr/lib64
YACC=/usr/bin/byacc -d
LIB_LOCAL=
#----------------------------------------------------------
#----------------------------------------------------------
# Pre-compiled executables:
#----------------------------------------------------------
/container/wps-3.9.1/avg_tsfc.exe
/container/wps-3.9.1/calc_ecmwf_p.exe
/container/wps-3.9.1/g1print.exe
/container/wps-3.9.1/g2print.exe
/container/wps-3.9.1/geogrid.exe
/container/wps-3.9.1/height_ukmo.exe
/container/wps-3.9.1/int2nc.exe
/container/wps-3.9.1/metgrid.exe
/container/wps-3.9.1/mod_levs.exe
/container/wps-3.9.1/rd_intermediate.exe
/container/wps-3.9.1/ungrib.exe
/container/wps-4.5/avg_tsfc.exe
/container/wps-4.5/calc_ecmwf_p.exe
/container/wps-4.5/g1print.exe
/container/wps-4.5/g2print.exe
/container/wps-4.5/geogrid.exe
/container/wps-4.5/height_ukmo.exe
/container/wps-4.5/int2nc.exe
/container/wps-4.5/metgrid.exe
/container/wps-4.5/mod_levs.exe
/container/wps-4.5/rd_intermediate.exe
/container/wps-4.5/ungrib.exe
/container/wrf-3.9.1.1/ndown.exe
/container/wrf-3.9.1.1/real.exe
/container/wrf-3.9.1.1/tc.exe
/container/wrf-3.9.1.1/wrf.exe
/container/wrf-4.5.2/ndown.exe
/container/wrf-4.5.2/real.exe
/container/wrf-4.5.2/tc.exe
/container/wrf-4.5.2/wrf.exe
/container/wrf-chem-3.9.1.1/ndown.exe
/container/wrf-chem-3.9.1.1/real.exe
/container/wrf-chem-3.9.1.1/tc.exe
/container/wrf-chem-3.9.1.1/wrf.exe
/container/wrf-chem-4.5.2/ndown.exe
/container/wrf-chem-4.5.2/real.exe
/container/wrf-chem-4.5.2/tc.exe
/container/wrf-chem-4.5.2/wrf.exe
#----------------------------------------------------------
WRF-gcc-dev>
For demonstration purposes only
Testing has shown the nvhpc
variant does not offer any performance benefits vs. the Intel or GCC variants, and therefore is not recommended for production runs. It is provided for demonstration purposes only, and in hopes it may prove useful for other projects that have a critical dependency on this particular compiler suite.
We build `ncar-derecho-wrf-nvhpc.sif` from `ncar-derecho-wrf-nvhpc.def` as described here. The simple utility script `wrf_nvhpc_env` allows us to easily interact with the container.
Welcome to "ncar-derecho-wrf-nvhpc"
#----------------------------------------------------------
# MPI Compilers & Version Details:
#----------------------------------------------------------
/container/mpich/bin/mpicc
gcc (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
/container/mpich/bin/mpif90
nvfortran 23.9-0 64-bit target on x86-64 Linux -tp znver3
NVIDIA Compilers and Tools
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
MPICH Version: 3.4.3
MPICH Release date: Thu Dec 16 11:20:57 CST 2021
MPICH Device: ch4:ofi
MPICH CC: /usr/bin/gcc -fPIC -O2
MPICH CXX: /usr/bin/g++ -fPIC -O2
MPICH F77: /container/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/bin/nvfortran -fPIC
MPICH FC: /container/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/bin/nvfortran -fPIC
MPICH Custom Information:
#----------------------------------------------------------
# WRF/WPS-Centric Environment:
#----------------------------------------------------------
JASPERINC=/container/jasper/1.900.1/include
JASPERLIB=/container/jasper/1.900.1/lib
HDF5=
NETCDF=/container/netcdf
FLEX_LIB_DIR=/usr/lib64
YACC=/usr/bin/byacc -d
LIB_LOCAL=
#----------------------------------------------------------
#----------------------------------------------------------
# Pre-compiled executables:
#----------------------------------------------------------
/container/wps-3.9.1/avg_tsfc.exe
/container/wps-3.9.1/calc_ecmwf_p.exe
/container/wps-3.9.1/g1print.exe
/container/wps-3.9.1/g2print.exe
/container/wps-3.9.1/geogrid.exe
/container/wps-3.9.1/height_ukmo.exe
/container/wps-3.9.1/int2nc.exe
/container/wps-3.9.1/metgrid.exe
/container/wps-3.9.1/mod_levs.exe
/container/wps-3.9.1/rd_intermediate.exe
/container/wps-3.9.1/ungrib.exe
/container/wps-4.5/avg_tsfc.exe
/container/wps-4.5/calc_ecmwf_p.exe
/container/wps-4.5/g1print.exe
/container/wps-4.5/g2print.exe
/container/wps-4.5/geogrid.exe
/container/wps-4.5/height_ukmo.exe
/container/wps-4.5/int2nc.exe
/container/wps-4.5/metgrid.exe
/container/wps-4.5/mod_levs.exe
/container/wps-4.5/rd_intermediate.exe
/container/wps-4.5/ungrib.exe
/container/wrf-3.9.1.1/ndown.exe
/container/wrf-3.9.1.1/real.exe
/container/wrf-3.9.1.1/tc.exe
/container/wrf-3.9.1.1/wrf.exe
/container/wrf-4.5.2/ndown.exe
/container/wrf-4.5.2/real.exe
/container/wrf-4.5.2/tc.exe
/container/wrf-4.5.2/wrf.exe
/container/wrf-chem-3.9.1.1/ndown.exe
/container/wrf-chem-3.9.1.1/real.exe
/container/wrf-chem-3.9.1.1/tc.exe
/container/wrf-chem-3.9.1.1/wrf.exe
/container/wrf-chem-4.5.2/ndown.exe
/container/wrf-chem-4.5.2/real.exe
/container/wrf-chem-4.5.2/tc.exe
/container/wrf-chem-4.5.2/wrf.exe
#----------------------------------------------------------
WRF-nvhpc-dev>
Running the container on Derecho¶
The PBS job script listed below shows the steps required to "bind" the host MPI into the container and launch an executable in a batch environment.
Containerized WRF PBS Script
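A representative sketch of such a job script is shown below; it mirrors the containerized FastEddy PBS script later on this page, minus the GPU-specific pieces. The account, resource selection, and choice of `wrf.exe` are illustrative, and the line numbers referenced in the Discussion refer to the full production script rather than this sketch.

```
#!/bin/bash
#PBS -A <ACCOUNT>
#PBS -q main
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=128:mpiprocs=128

module load ncarenv apptainer || exit 1

container_image="./ncar-derecho-wrf-intel.sif"
nranks=$(cat ${PBS_NODEFILE} | wc -l)

# examine the shared library dependencies of the containerized executable
singularity exec ${container_image} ldd /container/wrf-4.5.2/wrf.exe

# launch the containerized wrf.exe with the host MPI injected
# (additional --env settings, e.g. LD_PRELOAD, may be required as described below)
mpiexec --np ${nranks} \
    singularity exec \
        --bind ${SCRATCH} --bind ${WORK} --pwd $(pwd) \
        --bind /run \
        --bind /opt/cray \
        --bind /usr/lib64:/host/lib64 \
        --env LD_LIBRARY_PATH=${CRAY_MPICH_DIR}/lib-abi-mpich:/opt/cray/pe/lib64:${LD_LIBRARY_PATH}:/host/lib64 \
        --env MPICH_SMP_SINGLE_COPY_MODE=NONE \
        ${container_image} \
        /container/wrf-4.5.2/wrf.exe
```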
Discussion
The PBS script examines the shared library dependencies of the executable using `ldd`, first within the container and then with the host MPI "injected" (as described here). This process is often tedious, iterative, and error-prone. As constructed, the PBS script can be executed directly (without `qsub`) to inspect these results before waiting for batch resources.

The `mpiexec` command is fairly standard. Note that we are using it to launch `singularity`, which in turn will start up the WRF executable specified on line 20. Note in this case the executable is built into the container; however, it could also be resident on GLADE, provided it was compiled from this same container development environment.
The `singularity exec` command lines are complex, so let's deconstruct them here:

- We make use of the `--bind` argument first to mount familiar GLADE file systems within the container,
- and again to "inject" the host MPI into the container. The `/run` directory necessity is not immediately obvious, but it is used by Cray-MPICH as part of the launching process.
- We also need to use `--env` to set the `LD_LIBRARY_PATH` inside the image so that the application can find the proper host libraries. Recall when we built the WRF executables in the containerized environment they had no knowledge of these host-specific paths. Similarly, we use `--env` to set the `LD_PRELOAD` environment variable inside the container. This will cause a particular Cray-MPICH library to be loaded prior to application initialization. This step is not required for "bare metal" execution.
- We set the `MPICH_SMP_SINGLE_COPY_MODE` environment variable to work around an MPI run-time error that would otherwise appear.
- Finally, a note on the `--bind /usr/lib64:/host/lib64` argument. Injecting the host MPI requires that some shared libraries from the host's `/usr/lib64` directory be visible inside the image. However, this path also exists inside the image and contains other libraries needed by the application. We cannot simply bind the host's directory into the same path, as doing so would mask these other libraries. So we bind the host's `/usr/lib64` into the container image at `/host/lib64`, and make sure this path is set in the `LD_LIBRARY_PATH` variable as well. Because we want these particular host libraries found as a last resort (not taking precedence over similar libraries in the container), we append `/host/lib64` to the `LD_LIBRARY_PATH` search path.
The arguments above were determined iteratively through trial and error. Such is the reality of containerized MPI applications and proprietary host MPI integration. Feel free to experiment with the PBS file, omitting some of the --bind
and --env
arguments and observing the resulting error message.
Building and running containerized FastEddy under MPI on GPUs¶
Warning
While the result of this demonstration is a functional application, we recommend against using this container for production FastEddy workflows!
It is much easier to simply build FastEddy "bare metal" when operating inside the NCAR HPC environment!
This example demonstrates building a containerized version of FastEddy from the open-source variant hosted on GitHub. It is provided because it illustrates several common issues encountered when running GPU-aware MPI applications inside containers across multiple nodes - particularly when binding the host MPI into the container - and because the source code is open for any interested user to follow along and adapt.
About FastEddy¶
FastEddy is a large-eddy simulation (LES) model developed by the Research Applications Laboratory (RAL) here at NCAR. The fundamental premise of FastEddy model development is to leverage the accelerated and more power-efficient computing capacity of graphics processing units (GPUs) to enable not only more widespread use of LES in research activities, but also to pursue the adoption of microscale and multiscale, turbulence-resolving, atmospheric boundary layer modeling into local-scale weather prediction or actionable science and engineering applications.
Containerization approach¶
The container is built off-premises with docker
from three related images, each providing a foundation for the next. We begin with a
- Rockylinux version 8 operating system with OpenHPC version 2 installed, then add
- a CUDA development environment and a CUDA-aware MPICH installation on top, and finally add
- the FastEddy source and compiled program.
A benefit of this layered approach is that the intermediate images created in steps 1 and 2 can be beneficial in their own right, providing base layers for other projects with similar needs. Additionally, by building the image externally with Docker we are able to switch user IDs within the process (discussed further below), which has some benefits when using containers to enable development workflows.
Building the image¶
Build framework
For complete details of the build process, see the Docker-based container build framework described here.
The image was built external to the HPC environment and then pushed to Docker Hub. (For users only interested in the details of running such a container, see instructions for running the container below.)
In this case a simple Mac laptop with git
, GNU make
, and docker
all installed locally was used and the process takes about an hour; any similarly configured system should suffice. No GPU devices are required to build the image.
The base layer¶
The Rockylinux 8 + OpenHPC base layer

For the base layer we deploy an OpenHPC v2 installation on top of a Rockylinux v8 base image. OpenHPC provides access to many pre-compiled scientific libraries and applications, and supports a matrix of compiler and MPI permutations, from which we will select one that works well with Derecho. Notably, at present OpenHPC does not natively support CUDA installations; however, we will address this limitation in the subsequent steps.
- The image begins with a minimal Rockylinux v8 image, and adds a utility script `docker-clean` copied from the build host.
- We parameterize several variables with build arguments using the `ARG` instructions. (Build arguments are available within the image build process as environment variables, but not when running the resulting container image; rather, `ENV` instructions can be used for those purposes - see the short sketch after this list. For a full discussion of `Dockerfiles` and supported instructions see here.)
- We then perform a number of `RUN` steps. When running `docker`, each `RUN` step creates a subsequent layer in the image. (We follow general Docker guidance and also strive to combine related commands inside a handful of `RUN` instructions.)
    - The first `RUN` instruction takes us from the very basic Rockylinux 8 source image to a full OpenHPC installation. We add a non-privileged user `plainuser` to leverage later, update the OS image with any available security patches, and then generally follow an OpenHPC installation recipe to add compilers, MPI, and other useful development tools.
    - The second `RUN` step works around an issue we would find later when attempting to run the image on Derecho. Specifically, the OpenHPC `mpich-ofi` package provides support for the long-deprecated MPI C++ interface. This is not present on Derecho with the `cray-mpich` implementation we will ultimately use to run the container. Since we do not need this support, here we hack the `mpicxx` wrapper so that it does not link in `-lmpicxx`, the problematic library.
    - The third and following `RUN` instructions create a directory space `/opt/local` we can use from our unprivileged `plainuser` account, copy in some more files, and then switch to `plainuser` to test the development environment by installing some common MPI benchmarks.
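A short, hypothetical illustration of the `ARG` vs. `ENV` distinction mentioned above (the names and values are not taken from the actual Dockerfile):

```
# ARG values exist only while the image is being built...
ARG OHPC_RELEASE=2
RUN echo "building against OpenHPC release ${OHPC_RELEASE}"

# ...whereas ENV values are also visible inside running containers
ENV CONTAINER_BUILD_ROOT=/opt/local
```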
Discussion
- OpenHPC v2 supports both OpenSUSE and Rockylinux 8 as its base OS. It would be natural to choose OpenSUSE for similarity to Casper and Derecho; however, by choosing Rockylinux instead we gain access to a different build environment, which has benefits for developers looking to improve portability. The process followed here can also be thought of as a "roadmap" for deploying the application at similarly configured external sites.
- OpenHPC supports the `openmpi` and `mpich` MPI implementations, with the latter in two forms: `mpich-ucx` and `mpich-ofi`. In this example we intentionally choose `mpich-ofi` with prior knowledge of the target execution environment. On Derecho the primary MPI implementation is `cray-mpich` (itself forked from `mpich`), which uses an HPE-proprietary `libfabric` interface to the Slingshot v11 high-speed communication fabric.
- Notice that each `RUN` step is finalized with a `docker-clean` command. This utility script removes temporary files and cached data to minimize the size of the resulting image layers. One consequence is that the first `dnf` package manager interaction in a `RUN` statement will re-cache these data. Since cached data are not relevant in the final image - especially when run much later on - we recommend removing it to reduce image bloat.
- In this example we are intentionally switching between `root` (the default user in the build process) and our unprivileged `plainuser` account. Particularly in development workflows, we want to be sure compilation and installation steps work properly as an unprivileged user, and tools such as the `lmod` module system and `mpiexec` often are intended not to be used as `root`.
- Since MPI container runtime integration can be a pain point at execution, we install OSU's and Intel's MPI benchmark suites to aid in deployment testing, independent of any user application.
Building the image
Adding CUDA & CUDA-aware MPICH¶
Adding CUDA + CUDA-aware MPICH
Next we add CUDA and a CUDA-aware MPI installation. We choose a specific version of the open-source MPICH library (both to closely match what is provided by OpenHPC and for Derecho compatibility) and configure it to use the pre-existing OpenHPC artifacts (`hwloc`, `libfabric`) as dependencies. For both `cuda` and the new `mpich` we also install "modulefiles" so the new additions are available in the typical module environment. Finally, we re-install one of the MPI benchmark applications, this time with CUDA support.
Dockerfile Steps
- We switch back to the `root` user so we can modify the operating system installation within the image.
- The first `RUN` instruction installs a full CUDA development environment and some additional development packages required to build MPI later.
- The next `RUN` instructions install modulefiles into the image so we can access the CUDA and (upcoming) MPICH installations, and clean up file permissions. The remaining steps are executed again as our unprivileged `plainuser`.
- The fourth `RUN` instruction downloads, configures, and installs MPICH. The version is chosen to closely match the baseline MPICH already installed in the image and uses some of its dependencies, and we also enable CUDA support (a configuration sketch follows this list).
- In the final `RUN` instruction we re-install one of the MPI benchmark applications, this time with CUDA support.
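A hedged sketch of the kind of MPICH configuration this step performs (the version, installation prefix, and dependency paths are illustrative; consult the actual Dockerfile for the precise options used):

```
# build a CUDA-aware MPICH closely matching the OpenHPC/Derecho MPICH family
wget https://www.mpich.org/static/downloads/3.4.3/mpich-3.4.3.tar.gz
tar xf mpich-3.4.3.tar.gz && cd mpich-3.4.3

./configure --prefix=/opt/local/mpich-3.4.3 \
            --with-device=ch4:ofi \
            --with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.13.0 \
            --with-hwloc=/opt/ohpc/pub/libs/hwloc \
            --with-cuda=/usr/local/cuda
make -j 8 install
```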
Discussion
- There are several ways to install CUDA; here we choose a "local repo" installation because it allows us to control versions, but we are careful also to remove the downloaded packages after installation, freeing up 3GB+ in the image.
- The CUDA development environment is very large and it is difficult to separate unnecessary components, so this step increases the size of the image from ~1.2GB to 8.8GB. We leave all components in the development image, including tools we will very likely not need inside a container such as `nsight-systems` and `nsight-compute`. For applications built on top of this image, a user could optionally remove these components later to decrease their final image size (demonstrated next, and sketched below).
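For example, a derived image might drop the profiling tools with something like the following (package names vary between CUDA releases, so treat this purely as an illustration):

```
# remove the Nsight profiling tools, which are not needed inside the container
dnf -y remove "nsight-compute-*" "nsight-systems-*"
docker-clean
```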
Building the image
Building FastEddy¶
Adding FastEddy
- Again we switch back to `root` to perform operating-system-level tasks, as our base image left us as `plainuser`.
- The first `RUN` instruction installs the development package for NetCDF - an additional application dependency not already satisfied. We also remove some particularly large CUDA components from the development image not required in the final application image.
- Then, again as `plainuser`, the next `RUN` instruction downloads the FastEddy open-source variant. We make some changes to the definition of a few hard-coded `make` variables so that we can specify installation paths during linking later.
- The final `RUN` instruction then builds FastEddy. We build up and use custom `INCLUDE` and `LIBS` variables, specifying some unique paths for the particular build environment.
Discussion
- When building the image locally with Docker, the space savings from step (2) are not immediately apparent. This is a result of the Docker "layer" approach: the content still exists in the base layer and is only "logically" removed by the commands listed above. The space savings are realized on the HPC system when we "pull" the image with `singularity`.
- If an even smaller container image is desired, even more components could be stripped: CUDA numerical libraries the application does not need, or even the containerized MPIs after we are done with them. As we will see next, we replace the container MPI with the host MPI at run-time, so technically no MPI is required inside the container when we are done using it for compilation.
Building the image
Pushing the image to Docker Hub
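The general tag/push pattern looks like the following, with `<dockerhub_username>` substituted appropriately (the local image tag is assumed to match the name used throughout this example):

```
# on the build host: tag the local image and push it to Docker Hub
docker tag rocky8-openhpc-fasteddy:latest <dockerhub_username>/rocky8-openhpc-fasteddy:latest
docker login
docker push <dockerhub_username>/rocky8-openhpc-fasteddy:latest
```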
Running the container on Derecho¶
With the container built from the steps above (or simply pulling the resulting image from Docker Hub), we are now ready to run a sample test case on Derecho. We choose Example02_CBL.in
from the FastEddy Tutorial and modify it to run on 24 GPUs (full steps listed here). The PBS job script listed below shows the steps required to "bind" the host MPI into the container.
Containerized FastEddy PBS Script
#!/bin/bash
#PBS -q main
#PBS -j oe
#PBS -o fasteddy_job.log
#PBS -l walltime=02:00:00
#PBS -l select=6:ncpus=64:mpiprocs=4:ngpus=4
module load ncarenv/23.09
module load apptainer gcc cuda || exit 1
module list
nnodes=$(cat ${PBS_NODEFILE} | sort | uniq | wc -l)
nranks=$(cat ${PBS_NODEFILE} | sort | wc -l)
nranks_per_node=$((${nranks} / ${nnodes}))
container_image="./rocky8-openhpc-fasteddy.sif"
singularity \
--quiet \
exec \
${container_image} \
ldd /opt/local/FastEddy-model/SRC/FEMAIN/FastEddy
singularity \
--quiet \
exec \
--bind ${SCRATCH} \
--bind ${WORK} \
--pwd $(pwd) \
--bind /run \
--bind /opt/cray \
--bind /usr/lib64:/host/lib64 \
--env LD_LIBRARY_PATH=${CRAY_MPICH_DIR}/lib-abi-mpich:/opt/cray/pe/lib64:${LD_LIBRARY_PATH}:/host/lib64 \
--env LD_PRELOAD=/opt/cray/pe/mpich/${CRAY_MPICH_VERSION}/gtl/lib/libmpi_gtl_cuda.so.0 \
${container_image} \
ldd /opt/local/FastEddy-model/SRC/FEMAIN/FastEddy
echo "# --> BEGIN execution"; tstart=$(date +%s)
mpiexec \
--np ${nranks} --ppn ${nranks_per_node} --no-transfer \
set_gpu_rank \
singularity \
--quiet \
exec \
--bind ${SCRATCH} \
--bind ${WORK} \
--pwd $(pwd) \
--bind /run \
--bind /opt/cray \
--bind /usr/lib64:/host/lib64 \
--env LD_LIBRARY_PATH=${CRAY_MPICH_DIR}/lib-abi-mpich:/opt/cray/pe/lib64:${LD_LIBRARY_PATH}:/host/lib64 \
--env LD_PRELOAD=/opt/cray/pe/mpich/${CRAY_MPICH_VERSION}/gtl/lib/libmpi_gtl_cuda.so.0 \
--env MPICH_GPU_SUPPORT_ENABLED=1 \
--env MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1 \
--env MPICH_SMP_SINGLE_COPY_MODE=NONE \
${container_image} \
/opt/local/FastEddy-model/SRC/FEMAIN/FastEddy \
./Example02_CBL.in
echo "# --> END execution"
echo $(($(date +%s)-${tstart})) " elapsed seconds; $(date)"
Discussion
The mpiexec
command is fairly standard. Note that we are using it to launch singularity
, which in turn will start up the containerized FastEddy
executable.
The singularity exec
command line is complex, so let's deconstruct it here:
- We make use of the `--bind` argument first to mount familiar GLADE file systems within the container,
- and again to "inject" the host MPI into the container (as described here). The `/run` directory necessity is not immediately obvious, but it is used by Cray-MPICH as part of the launching process.
- We also need to use `--env` to set the `LD_LIBRARY_PATH` inside the image so that the application can find the proper host libraries. Recall when we built the FastEddy executable in the containerized environment it had no knowledge of these host-specific paths. Similarly, we use `--env` to set the `LD_PRELOAD` environment variable inside the container. This will cause a particular Cray-MPICH library to be loaded prior to application initialization. This step is not required for "bare metal" execution.
- We set some important Cray-MPICH-specific `MPICH_*` environment variables as well to enable CUDA-awareness (`MPICH_GPU_*`) and work around an MPI run-time error (`MPICH_SMP_SINGLE_COPY_MODE`) that would otherwise appear.
- Finally, a note on the `--bind /usr/lib64:/host/lib64` argument. Injecting the host MPI requires that some shared libraries from the host's `/usr/lib64` directory be visible inside the image. However, this path also exists inside the image and contains other libraries needed by the application. We cannot simply bind the host's directory into the same path, as doing so would mask these other libraries. So we bind the host's `/usr/lib64` into the container image at `/host/lib64`, and make sure this path is set in the `LD_LIBRARY_PATH` variable as well. Because we want these particular host libraries found as a last resort (not taking precedence over similar libraries in the container), we append `/host/lib64` to the `LD_LIBRARY_PATH` search path.
The arguments above were determined iteratively through trial and error. Such is the reality of containerized MPI applications and proprietary host MPI integration. Feel free to experiment with the PBS file, omitting some of the `--bind` and `--env` arguments and observing the resulting error messages; however, do NOT modify the `MPICH_GPU_*` variables - doing so may trigger a very unfortunate kernel driver bug and render the GPU compute nodes unusable.
Pulling the image
We begin with pulling the image from Docke Hub and constructing a SIF. (If you want to test your own built/pushed image, replace benjaminkirk
with your own <dockerhub_username>
as specified in the tag/push operations listed above.)
derecho$ singularity pull rocky8-openhpc-fasteddy.sif docker://benjaminkirk/rocky8-openhpc-fasteddy:latest
[...]
derecho$ ls -lh rocky8-openhpc-fasteddy.sif
-rwxr-xr-x 1 someuser ncar 3.1G Dec 5 17:08 rocky8-openhpc-fasteddy.sif
Running the job
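Assuming the PBS script above is saved as `fasteddy_container.pbs` (a hypothetical file name) and the modified `Example02_CBL.in` is present in the working directory, submission and monitoring are then simply:

```
derecho$ qsub fasteddy_container.pbs
derecho$ tail -f fasteddy_job.log
```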
"Faking" a native installation of containerized applications¶
Occasionally it can be beneficial to "hide" the fact that a particular application is containerized, typically to simplify the user interface and usage experience. In this section we follow a clever approach deployed by the NIH Biowulf team and outlined here to enable users to interact transparently with containerized applications without needing to know any details of the run-time (singularity
, ch-run
, etc...).
The basic idea is to create a wrapper.sh
shell script that
- Infers the name of the containerized command to run,
- Invokes the chosen run-time transparently to the user, and
- Passes along any command-line arguments to the containerized application.
Consider the following directory tree structure, taken from a production deployment:
/glade/u/apps/opt/leap-container/15/
├── bin/
│ ├── eog -> ../libexec/wrap_singularity.sh
│ ├── evince -> ../libexec/wrap_singularity.sh
│ ├── gedit -> ../libexec/wrap_singularity.sh
│ ├── geeqie -> ../libexec/wrap_singularity.sh
│ ├── gimp -> ../libexec/wrap_singularity.sh
│ ├── gv -> ../libexec/wrap_singularity.sh
│ ├── smplayer -> ../libexec/wrap_singularity.sh
│ ├── vlc -> ../libexec/wrap_singularity.sh
│ └── xfig -> ../libexec/wrap_singularity.sh
└── libexec/
├── Makefile
├── ncar-casper-gui_tools.sif
└── wrap_singularity.sh
At the top level, we simply have two directories: ./bin/
(which likely will go into the user's PATH
) and ./libexec/
(where we will hide implementation details).
Constructing the `bin` directory

The `./bin/` directory contains symbolic links to the `wrap_singularity.sh` script, where the name of the symbolic link is the containerized application to run. In the example above, when a user runs `./bin/gv`, it will invoke `wrap_singularity.sh` "behind the scenes." In general there can be many application symbolic links in the `./bin/` directory, so long as the desired application exists within the wrapped container image.
The `wrap_singularity.sh` wrapper script

The `wrap_singularity.sh` script is written such that whatever symbolic links you create to it will run inside of the container, inferring the application name from that of the symbolic link.
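A minimal sketch of such a wrapper script is shown below (the line numbers cited in the list that follows refer to the production version of the script, which differs in detail from this sketch):

```
#!/bin/bash

# the requested command is simply the name this script was invoked as,
# i.e. the name of the symbolic link in ../bin
topdir="$(pwd)"
requested_command="$(basename ${0})"
selfdir="$(cd $(dirname ${0}) && pwd)"
container_image="${selfdir}/../libexec/ncar-casper-gui_tools.sif"

# make sure the module command is available, then load apptainer
# (the module init file location may differ on other systems)
type module >/dev/null 2>&1 || source /etc/profile.d/modules.sh
module load apptainer

# run the requested command inside the container, binding the usual GLADE
# file systems and passing any command-line arguments through unchanged
singularity exec \
    --bind /glade/u \
    --bind /glade/work \
    --bind /glade/campaign \
    --bind /glade/derecho/scratch \
    --pwd "${topdir}" \
    "${container_image}" \
    "${requested_command}" "${@}"
```

Creating an additional wrapped application is then just a matter of adding another symbolic link, e.g. `ln -s ../libexec/wrap_singularity.sh bin/gv` from the top-level directory.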
Specifically:
- The command to execute is inferred from the shell argument `${0}` - the name of the script being executed. Here is where the symbolic links from `./bin` are important: if the symbolic link `./bin/gv` is invoked, for example, the script above will execute with the name `gv`. This is accessible within the script as `${0}`, and is stored in the `requested_command` variable on line 7.
- Any command-line arguments passed to the executable are captured in the `${@}` shell parameter, and are passed directly through as command-line arguments to the containerized application (line 27).
- We bind-mount the usual GLADE file systems so that expected data are accessible (lines 22-25).
- In this example we execute all commands in the same base container `ncar-casper-gui_tools.sif` (specified on line 8). This is the simplest approach, however it is strictly not required. (A more complex treatment could "choose" different base containers for different commands using a `bash` `case` statement, for example, if desired.)
- The container is launched with the user's directory `topdir` as the working directory. This is required so that any relative paths specified are handled properly.
- In order to robustly access the required `apptainer` module, we first check to see if the `module` command is recognized and if not initialize the module environment (line 11), then load the `apptainer` module (line 12). This allows the script to function properly even when the user does not have the module system initialized in their environment - a rare but occasional issue.
While the example above wraps the Apptainer run-time, a similar approach works for Charliecloud and Podman as well if desired.