Submit a PyTorch job on the cluster

Steps:

  1. Initialize conda (see the using conda tutorial)
  2. Create a conda environment

    conda create --name pytorch_env python=3.12
    
  3. Activate the conda environment

    conda activate pytorch_env
    
  4. Install the packages required for PyTorch, following the instructions on the PyTorch website.

    pip install torch torchvision torchaudio ipykernel --index-url https://download.pytorch.org/whl/cu124
    

    [!NOTE] CUDA version 12.4 should work with most of the cluster GPUs (A100, L40S). Also: PyTorch no longer distributes its packages through conda, so use pip instead (see the command above).

  5. [Optional] Register Python kernel so you can load the environment in Jupyter

    python -m ipykernel install --user --name pytorch_env
    
  6. [Optional] Install additional conda or pip modules as needed

    pip install numpy matplotlib [...]
    

    Alternatively, you can install from a requirements.txt file:

    pip install -r requirements.txt
    
  7. Write a Python file to run. We're going to write this simple Python script as an example: pytorch-test.py.

    # pytorch-test.py
    
    import torch
    
    print("IS GPU AVAILABLE?", torch.cuda.is_available())
    print("ALL DONE!")
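
If you want more detail than a single True/False, the check above can be extended. The following is a minimal sketch, not part of the original tutorial: the file name gpu-report.py and the function gpu_report are hypothetical names chosen for illustration. It reports the installed torch version, the CUDA version the build targets, and the name of the first visible GPU, and it falls back to a plain message when torch or a GPU is missing.

```python
# gpu-report.py (hypothetical file name, used here only for illustration)
import importlib.util


def gpu_report() -> str:
    """Return a short, human-readable summary of what PyTorch can see."""
    # find_spec returns None when the package is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    lines = [
        f"torch version: {torch.__version__}",
        # torch.version.cuda is None for CPU-only builds.
        f"built for CUDA: {torch.version.cuda}",
    ]
    if torch.cuda.is_available():
        lines.append(f"visible GPUs: {torch.cuda.device_count()}")
        lines.append(f"GPU 0: {torch.cuda.get_device_name(0)}")
        # Run a tiny computation on the GPU to confirm it actually works.
        x = torch.ones(2, 2, device="cuda")
        lines.append(f"sum on GPU: {x.sum().item()}")
    else:
        lines.append("no GPU visible; running on CPU only")
    return "\n".join(lines)


if __name__ == "__main__":
    print(gpu_report())
```

On a login node without a GPU this will most likely report that no GPU is visible, even if the environment is set up correctly; the meaningful result is the one produced inside a job that requested a GPU.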
    
  8. Write an sbatch script. This code runs when your job is executed from the queue. Take note of the partition, the resources requested (GPUs, CPU cores, RAM), and the time requested. See Getting Started for more information on these parameters. For example, name the script submit-myjob:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --partition=l40s-xl
    #SBATCH --mem=16GB
    #SBATCH --job-name=run-torch
    #SBATCH --output=myjob_%j.out
    #SBATCH --error=myjob_%j.err
    #SBATCH --gres=gpu:1
    
    # activate conda and activate the specific environment
    source /shared/software/software-environments/miniconda3/bin/activate
    conda activate pytorch_env
    
    # run your python code
    python pytorch-test.py
    
  9. Submit the script using sbatch. This returns a job ID.

    sbatch submit-myjob
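
If you submit jobs from a Python script, you may want to capture the job ID for later use. sbatch normally prints a line of the form "Submitted batch job 5123". The sketch below assumes that message format; parse_job_id and submit are hypothetical helper names, and submit only works on a machine where the sbatch command is available.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script: str) -> int:
    """Run sbatch on the given script and return the job ID.

    Assumes sbatch is on the PATH; raises CalledProcessError if sbatch fails.
    """
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# Parsing a sample confirmation line (no cluster needed for this part).
print(parse_job_id("Submitted batch job 5123"))  # prints 5123
```

On the cluster, submit("submit-myjob") would return the ID as an integer. sbatch also has a --parsable flag that prints the job ID without the surrounding text, which may be simpler if you do not need the full message.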
    
  10. The job should now be submitted. It will take at least 5-10 minutes to start even if the job queue is empty, and it could take hours or days if there are long-running jobs in the queue. You can check the status of the job with:

    squeue
    

    The status will likely be CF (configuring) or PD (pending) at this point. This means the job is still setting up or waiting in the queue. When the status changes to R (running), the job is actively running.

  11. When the job finishes, the output will be saved to the file myjob_%JOBID%.out, e.g. myjob_5123.out (the %j in the --output setting is replaced by the job ID). Error messages are written to the matching .err file set by --error.
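
The status codes and the output file naming described above can be summarized in a few lines of Python. This is an illustrative sketch only: STATE_MEANINGS, describe_state, and expand_job_id are hypothetical names, the table covers only the three codes mentioned in this tutorial, and the helper handles only the %j placeholder, not the other filename patterns Slurm supports.

```python
# Short squeue state codes mentioned in this tutorial.
STATE_MEANINGS = {
    "PD": "pending: waiting in the queue for resources",
    "CF": "configuring: resources allocated, nodes still being set up",
    "R": "running: the job is executing",
}


def describe_state(code: str) -> str:
    """Translate a short squeue state code into a plain description."""
    key = code.strip().upper()
    fallback = f"{key}: not covered here; see the squeue documentation"
    return STATE_MEANINGS.get(key, fallback)


def expand_job_id(pattern: str, job_id: int) -> str:
    """Replace the %j placeholder with the job ID, as Slurm does for --output."""
    return pattern.replace("%j", str(job_id))


print(describe_state("PD"))
print(expand_job_id("myjob_%j.out", 5123))  # prints myjob_5123.out
```

Combined with the job ID returned at submission time, expand_job_id gives you the names of the .out and .err files to inspect once the job has finished.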