Production sites¶

NERSC¶

On NERSC you can use Conda/Mamba and Pixi, but you have to make sure that software is installed on a performant filesystem like $SCRATCH or in /global/common/software/....

Warning

Pixi cannot install software to /global/common/software, because this filesystem is read-only on compute nodes and Pixi needs to modify its environment at runtime. Use $SCRATCH for Pixi environments instead (see the detached-environments setting below).

In case you are running the production on CFS (slower filesystem), these environment variables are a good starting point (customize paths if needed):

export SWPREFIX="$SCRATCH/software"
export XDG_CACHE_HOME="$SCRATCH/cache"

# conda
export CONDA_ENVS_DIRS="$SWPREFIX/conda/envs"
export CONDA_PKGS_DIRS="$XDG_CACHE_HOME/conda/pkgs"

# uv
export UV_PYTHON_INSTALL_DIR="$SWPREFIX/share/uv/python"
export UV_TOOL_DIR="$SWPREFIX/share"
export UV_PYTHON_BIN_DIR="$SWPREFIX/bin"

# NOTE: pixi will still store cache in $XDG_CACHE_HOME
export PIXI_HOME="$SCRATCH/software/pixi"

# numba cache
export NUMBA_CACHE_DIR="$XDG_CACHE_HOME/numba"

Then, you can use Conda/Mamba to install Pixi:

> module load python
> mamba create -n my_env -- 'python<3.14' pixi
> conda config --add channels conda-forge
> conda config --set channel_priority strict

The last important step is to tell pixi to not store environments in the current directory (the simflow directory). At the time of writing, this is only possible by adding this to .config/pixi/config.toml:

detached-environments = "/pscratch/sd/l/.../software/pixi/envs"

Note

Environment variable expansion is not supported in the Pixi config at the time of writing these instructions, so you have to type in the value of $SCRATCH by hand.

Now you can proceed with setting up and running the production workflow, with e.g. on a compute node:

> pixi run prod --profile nersc-compute

Using the provided nersc-* profiles is recommended (have a look at them!).

I/O optimization¶

On NERSC, the Community File System (CFS) I/O performance can be a bottleneck for I/O-intensive parts of the workflow (for example, the stp tier production). NERSC provides two relevant mitigations. The first is the scratch filesystem, based on solid-state disks and offering very high performance. The second is a faster, read-only mount of /global/cfs at /dvs_ro/cfs.

For temporary productions, it is recommended to run the full Simflow entirely on scratch and, if needed, move the data to CFS at the end. Alternatively, if the simflow is hosted on CFS, it can automatically read input files from DVS and temporarily write the output of some I/O-intensive jobs to scratch, then move it to the expected CFS location at completion. These features can be enabled via the following block in simflow-config.yaml:

nersc:
  dvs_ro: true
  scratch: $SCRATCH/<SUBFOLDER>

Both features can be disabled by setting the corresponding fields to false.

Note

The NERSC profiles already set the Snakemake option local-storage-prefix to stage intermediate files on scratch. This is a Snakemake-native mechanism and is independent of the Simflow’s own nersc.scratch setting described above. If you pass --default-storage-provider fs to Snakemake (or add default-storage-provider: fs to a profile), Snakemake will use the local-storage-prefix path as the default storage location for all intermediate files, copying them to the final output path on completion. --default-storage-provider fs and nersc.scratch are mutually exclusive and should not be used together.

Warning

These features are implemented manually for each Snakemake rule, so it could be that some rules are unaffected by them.

Multi-node execution¶

Warning

Multi-node execution via snakemake-nersc / pixi run prodnodes is highly experimental. Use at your own risk and always verify results afterwards.

As of Snakemake v8.30, support for parallel execution across multiple compute nodes or interaction with job schedulers (such as Slurm) is not well supported.

Note

An experimental profile to interact with the NERSC batch job system is available in nersc-compute-slurm. Unfortunately, specifying rule resources (which is required for job submission) seems to slow down the DAG generation step by a lot.

Note

In principle, one could use the snakemake-executor-plugin-slurm-jobstep to prefix each rule command with a srun call, which would make it possible to parallelize the workflow over several nodes. In practice, NERSC discourages from starting many srun instances for performance reasons. As a result, the only reliable way to run Snakemake is with one instance on a single compute node.

The snakemake-nersc executable, exposed by legend-simflow offers a way to parallelize the workflow in some situations over several nodes. The recommended invocation is:

> pixi run prodnodes NUMBER_OF_NODES

which is equivalent to:

> snakemake-nersc --nodes NUMBER_OF_NODES

To submit the workflow as a batch Slurm job instead, use:

> pixi run prodsubmit --time HH:MM:SS [--nodes N]

where --time is the requested wall time (required) and --nodes defaults to 1.

The program determines the list of simulations (see the simlist in Production) that the user wants to process, partitions it in NUMBER_OF_NODES chunks, and spawns a dedicated Snakemake instance for each, prefixed by the appropriate srun call. This is equivalent to something like:

srun -N1 -n1 snakemake --workflow-profile workflow/profiles/nersc-compute --config simlist=LIST1 [SNAKEMAKE ARGS] &
srun -N1 -n1 snakemake --workflow-profile workflow/profiles/nersc-compute --config simlist=LIST2 [SNAKEMAKE ARGS] &
...

wait

Warning

The Snakemake instances spawned per chunk are fully independent and share no locking mechanism. It is the user’s responsibility to ensure that each instance operates on a disjoint subset of the DAG — otherwise multiple instances may try to build the same output concurrently, leading to race conditions and corrupted files.

In practice this means that any shared steps (e.g. par, which produces drift-time maps and other parameters consumed by all hit/opt jobs) must be completed before launching the multi-node run. The safest approach is to first run pixi run prod with only the shared steps in make_steps, wait for it to finish, and only then launch the multi-node run with those steps excluded.

This approach makes it unfortunately harder to manually interrupt the Simflow, e.g. hitting Ctrl+C will just make Slurm print some jobset status information. You should instead send signals (TERM to stop scheduling more jobs and just wait for running jobs and INT to kill all running jobs) directly to the snakemake instance.

Todo

Add commands to send signals.

Note

Since the actual jobs that need to be run will not be known a priori, the a-priori partitioning might be inefficient. To mitigate this, the simlist is randomly shuffled before partitioning.