Running on a cluster

PyBNF is designed to run on computing clusters that utilize a shared network filesystem. PyBNF comes with built-in support for clusters running Slurm. It may also be manually configured to run on clusters with other managers (Torque, PBS, etc.).

Installation of PyBNF on a cluster has the same requirements as installation on a workstation, namely Python 3 with the pip package manager. This is available on most clusters, but may require loading an environment module to access. On clusters that use an environment module system, you can view the available modules with the command module avail, and load the appropriate one with module load [modulename]. Once Python 3 and pip are loaded, the same installation instructions apply as for a standard installation. Assistance from the cluster administrators may be helpful if any cluster-specific issues arise during installation.
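
For example, on a cluster that uses environment modules, the process might look roughly like the following; the module name is cluster-specific, and the assumption that PyBNF is installed from PyPI with a --user install may not match your setup:

module avail                  # list the modules available on the cluster
module load python/3.6        # load a Python 3 module (name is cluster-specific)
pip3 install pybnf --user     # install PyBNF for the current user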

SLURM

The user may run PyBNF interactively or as a batch job, using the salloc or sbatch commands, respectively.

To tell PyBNF to use Slurm, pass “slurm” with the -t flag, i.e. pybnf -t slurm. It is also possible to instead specify the cluster_type key in the config file.
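
For instance, assuming PyBNF's usual key = value config file syntax, the equivalent config file entry would be:

cluster_type = slurm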

Interactive (quickstart)

1. Execute the command salloc -N x, where x is an integer denoting the number of nodes the user wishes to allocate.

2. Log in to one of the allocated nodes with the command slogin.

3. Load the appropriate Python environment.

4. Initiate a PyBNF fitting run, including the flag -t slurm.
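
As a rough sketch, an interactive session might look like the following; the node count, node name, module name, and config file name are all placeholders:

salloc -N 4                   # allocate 4 nodes
slogin <node-name>            # log in to one of the allocated nodes
module load python/3.6        # load the Python environment (cluster-specific)
pybnf -c fit.conf -t slurm    # start the fitting run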

Batch

Write a shell script specifying the desired nodes and their properties according to SLURM specifications. Be sure that your script loads the appropriate Python environment if this step is required on your cluster, and that your call to pybnf includes the flag -t slurm. For an example shell script, see examples/tcr/tcr_batch.sh.
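
A minimal sketch of such a script is shown below, assuming a cluster where Python must be loaded as a module and a config file named fit.conf; the node count, time limit, and module name are placeholders, and the example shipped with PyBNF is a better starting point:

#!/bin/bash
#SBATCH -N 4                  # number of nodes to allocate
#SBATCH -t 04:00:00           # wall-clock time limit

module load python/3.6        # load the Python environment if required on your cluster
pybnf -c fit.conf -t slurm    # run PyBNF with Slurm support enabled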

Submit the batch job to the queueing system using the command sbatch script.sh where script.sh is the name of the shell script.

Troubleshooting: SSH access to nodes

The above instructions assume that PyBNF can access all allocated nodes via SSH. For some clusters, additional configuration is necessary to enable SSH access: use ssh-keygen to set up SSH keys (documented in many places online, including guides specific to the SSH-based deployment used by PyBNF's Dask scheduler).

To confirm that SSH keys are set up correctly, make sure that you are able to SSH into all allocated nodes without needing to enter a password.
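
A typical key setup is sketched below with a placeholder node name; on many clusters the home directory is shared across nodes, so copying the key to each node may be unnecessary:

ssh-keygen -t rsa             # generate a key pair (an empty passphrase allows password-less login)
ssh-copy-id <node-name>       # or append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys
ssh <node-name>               # should now log in without prompting for a password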

If SSH access is not possible on your cluster, you will have to use Manual configuration with Dask.

TORQUE/PBS

Not yet implemented. Please refer to the manual configuration sections below.

Manual configuration with node names

It is possible to run PyBNF on any cluster regardless of resource manager by simply telling PyBNF the names of the nodes it should run on.

Use manager-specific commands to allocate some number of nodes for your job, and find the names of those nodes. For example, in Torque: qsub -I <options> followed by qstat -u <username>.
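
For illustration only; the resource request shown here is a placeholder and the exact qsub options depend on your cluster:

qsub -I -l nodes=4:ppn=32     # request an interactive job on 4 nodes
qstat -n -u <username>        # the -n flag lists the nodes assigned to each of your jobs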

Then set the keys scheduler_node and worker_nodes in your PyBNF config file. scheduler_node should be the name of one of the nodes allocated for your job, and worker_nodes should be the space-delimited names of all of your nodes (including the one set as scheduler_node).
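
For example, if the job was allocated nodes named node001 through node004 (placeholder names), the config file entries might read as follows, assuming PyBNF's usual key = value syntax:

scheduler_node = node001
worker_nodes = node001 node002 node003 node004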

PyBNF will then run this fitting job on the specified cluster nodes.

Manual configuration with Dask

PyBNF uses Dask.distributed to manage cluster computing. In most cases, it is not necessary for the user to interact directly with Dask. However, if PyBNF’s automatic Dask setup is unsatisfactory, then the instructions in this section may be helpful to set up Dask manually.

In the automatic PyBNF setup, the command dask-ssh is run on one of the available nodes (which becomes the scheduler node), with all available nodes as arguments (which become the worker nodes). dask-ssh is run with --nthreads 1 and --nprocs equal to the number of available cores per node. The default number of processes per node is the value returned by multiprocessing.cpu_count(); this default can be overridden by setting the parallel_count key equal to the total number of processes over all nodes. This entire automatic setup with dask-ssh can be overridden as described below. If overriding the automatic setup, it is recommended to keep nthreads equal to 1 for SBML models because the SBML simulator is not thread safe.
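
For instance, to cap a run at 64 total processes spread across all nodes, the config file would contain a line like the following (the number is a placeholder):

parallel_count = 64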

For manual configuration, you will need to run the series of commands described below. All of these commands must remain running during the entire PyBNF run. Utilities such as nohup or screen are helpful for keeping multiple commands running at once.

To begin, run the command dask-scheduler on the node you want to use as the scheduler. Pass the argument --scheduler-file to create a JSON-encoded text file containing connection information. For example:

dask-scheduler --scheduler-file cluster.json

On each node you want to use as a worker, run the command dask-worker. Pass the scheduler file, and also specify the number of processes and threads per process to use on that worker. For example:

dask-worker --scheduler-file cluster.json --nprocs 32 --nthreads 1

Finally, run PyBNF, and pass PyBNF the scheduler file using the -s command line argument or the scheduler_file configuration key:

pybnf -c fit.conf -s cluster.json

For additional dask-scheduler and dask-worker options, refer to the Dask.distributed documentation.

(Optional) Logging configuration for remote machines

By default, PyBNF logs to the file bnf_timestamp.log to maintain a record of important events in the application. When running PyBNF on a cluster, some log messages may be generated on a node other than the one running the main PyBNF process. If these logs are desired, the user must configure Dask's logging so that these messages are retrieved.

Upon installation of PyBNF, the dependencies dask and distributed should be installed. Installing them creates a .dask/ folder in the home directory containing a single file, config.yaml. Open this file to find a logging: block that controls how distributed outputs logs. Add the following line to that block, appropriately indented:

pybnf.algorithms.job: info

where info can be any string corresponding to a Python logging level (e.g. info, debug, warning).
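
After the edit, the logging: block in ~/.dask/config.yaml might look something like the following sketch; the pre-existing entries vary by distributed version, and only the pybnf.algorithms.job line is added:

logging:
  distributed: info
  distributed.client: warning
  pybnf.algorithms.job: info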