Submit a TensorFlow job on the cluster

The easiest way to run TensorFlow on the cluster is to follow the Jupyter OpenOnDemand instructions and select "TensorFlow" from the "Version of the Jupyter stack container" option. This approach requires no setup, but it offers limited control over the Python environment.

If you want to run TensorFlow without Jupyter (e.g. to submit jobs non-interactively), follow these steps to create a TensorFlow conda environment for your code.

Create TensorFlow Conda Environment

As of 2/18/2025, the L40S GPUs on the cluster support CUDA version 12.4. The most recent TensorFlow release that supports CUDA <= 12.4 is TF 2.17, so we will install TensorFlow 2.17 with Python 3.12 and CUDA 12.3.2. The A100 GPUs will also work with TensorFlow 2.17, but may support more recent versions as well.

Steps

  1. Initialize conda (see the using conda tutorial).

    source /shared/software/software-environments/miniconda3/bin/activate
    
  2. Create a conda environment. You will have to type "y" to confirm. This may take a few minutes.

    conda create --name tf_2.17 python=3.12
    
  3. Activate the conda environment.

    conda activate tf_2.17
    
  4. Install the required packages for TensorFlow, following the instructions from the TensorFlow website.

    pip install tensorflow[and-cuda]==2.17.1
    

    [!NOTE] CUDA version 12.3 should work with most of the cluster GPUs (A100, L40S). You can view this table for TensorFlow version compatibility.
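    To confirm the install picked up the GPU packages, you can print the TensorFlow version and the CUDA version it was built against (a quick sanity check using TensorFlow's public tf.sysconfig API; run it inside the activated environment):

```shell
# Print the installed TensorFlow version (should be 2.17.1)
python -c "import tensorflow as tf; print(tf.__version__)"

# Print the CUDA version TensorFlow was built against (should be 12.x)
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"
```

    Note that TensorFlow will only detect a GPU when you run on a node that has one, so the full GPU check belongs in the job script below.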

  5. [Optional] Register Python kernel so you can load the environment in Jupyter

    # ipykernel is needed for kernel registration; install it into the environment first
    pip install ipykernel
    python -m ipykernel install --user --name tf_2.17
    
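    To verify the kernel was registered, you can list the kernels Jupyter knows about (this assumes jupyter is available in your environment; the default user kernel directory is used):

```shell
# List registered kernels; "tf_2.17" should appear in the output
jupyter kernelspec list
```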
  6. [Optional] Install additional conda or pip modules as needed

    pip install numpy matplotlib [...]
    

    Alternatively, you can install from a requirements.txt file:

    pip install -r requirements.txt
    
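    For reproducibility, a minimal requirements.txt for this environment might pin the TensorFlow version from step 4; the other entries here are just placeholders for whatever your code needs:

```
tensorflow[and-cuda]==2.17.1
numpy
matplotlib
```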
  7. Write a Python file to run. As an example, we'll use this simple script, tf-test.py:

    # tf-test.py
    
    import tensorflow as tf
    
    print(tf.config.list_physical_devices('GPU'))
    print("Tensorflow test script done!")
    
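    If you want a slightly stronger check than listing devices, a variant of the test script (a sketch; run it on a GPU node via the job script below) can run a small matrix multiply and report which device executed it:

```python
# tf-test-matmul.py -- sketch of a fuller GPU check
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices('GPU'))

# A tiny computation; when a GPU is available, TensorFlow places it there by default
a = tf.random.uniform((64, 64))
b = tf.random.uniform((64, 64))
c = tf.matmul(a, b)
print("Result computed on:", c.device)
```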
  8. Write an sbatch script. This code will run when your job is executed from the queue. Take note of the partition, the resources requested (GPUs, CPU cores, RAM), and the time limit. See Getting Started for more information on these parameters. E.g. name the script: submit-myjob-tf

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --partition=l40s-xl
    #SBATCH --mem=16GB
    #SBATCH --job-name=run-tf
    #SBATCH --output=myjob_%j.out
    #SBATCH --error=myjob_%j.err
    #SBATCH --gres=gpu:1
    
    # activate conda and activate the specific environment
    source /shared/software/software-environments/miniconda3/bin/activate
    conda activate tf_2.17
    
    # run your python code
    python tf-test.py
    
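    Before submitting, you can ask Slurm to validate the script without actually running it; sbatch's --test-only flag checks the options and prints an estimated start time (the exact output format varies by Slurm version):

```shell
# Validate the batch script and estimate when it would start, without submitting
sbatch --test-only submit-myjob-tf
```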
  9. Submit the script using sbatch; this returns a JOB_ID:

    sbatch submit-myjob-tf
    
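    If you want the job ID on its own (e.g. for use in scripts), sbatch's --parsable flag prints just the ID instead of the usual "Submitted batch job ..." message:

```shell
# Capture the job ID in a shell variable for later use
JOB_ID=$(sbatch --parsable submit-myjob-tf)
echo "$JOB_ID"
```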
  10. The job is now queued. Even if the queue is empty, it may take at least 5-10 minutes to start, and it could take hours or days if there are long-running jobs ahead of it. You can check the status of the job with:

    squeue
    

    The status will likely be CF (configuring) or PD (pending) at this point, meaning the job is waiting in the queue or still setting up. When the status changes to R (running), the job is actively running.
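    squeue with no arguments lists every job on the cluster; to see only your own jobs, or one specific job, you can filter with standard squeue flags:

```shell
# Show only your jobs
squeue -u $USER

# Show a single job by ID (replace 5123 with your job's ID)
squeue -j 5123
```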

  11. When the job finishes, the output will be saved to the file myjob_<JOBID>.out, e.g. myjob_5123.out.
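    You can inspect the output and error files with standard tools; tail -f is handy for watching a job while it runs (replace 5123 with your own job ID from step 9):

```shell
# Print the full output once the job is done
cat myjob_5123.out

# Or follow the output live while the job runs
tail -f myjob_5123.out

# Errors, if any, go to the matching .err file
cat myjob_5123.err
```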