Run nvidia-smi within a Slurm script¶
Add the following block to your Slurm script.
STATS_INTERVAL=5
: Gathers profile data every 5 seconds.
STATS_FILE
: Make sure to include a valid path for the .csv file that will contain the profile data.
nvidia-smi --query-gpu=
: The most essential variables are utilization.gpu, utilization.memory, memory.used and memory.total. Feel free to remove the remainder, but note that the visualisation script in the next section expects the memory.total and clocks_throttle_reasons.sw_thermal_slowdown columns.
terminal
# GPU usage monitoring
STATS_INTERVAL=5
STATS_FILE="/path/to/save/the/output/file/${SLURM_JOB_ID}-gpu_stats.csv"
nvidia-smi --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw,clocks.current.sm \
--format=csv,nounits -l "$STATS_INTERVAL" -f "$STATS_FILE" &
sleep 20
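If you trim the field list often, it can help to assemble the command programmatically rather than edit the long line by hand. The following is a minimal Python sketch, not part of the original instructions: the function name `build_nvidia_smi_command` and the constant `ESSENTIAL_FIELDS` are hypothetical, and only the flags already shown in the block above are used.

```python
# Hypothetical helper: names are illustrative, flags are those used in the Slurm block above.
ESSENTIAL_FIELDS = [
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
]


def build_nvidia_smi_command(stats_file, interval=5, fields=ESSENTIAL_FIELDS):
    """Return the nvidia-smi argument list for periodic CSV logging."""
    if interval <= 0:
        raise ValueError("interval must be a positive number of seconds")
    # Always keep the timestamp first so each sample can be placed in time.
    query_fields = ["timestamp"] + [f for f in fields if f != "timestamp"]
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(query_fields),
        "--format=csv,nounits",
        "-l", str(interval),
        "-f", str(stats_file),
    ]
```

Joining the returned list with spaces gives a command line equivalent to the one above, restricted to the essential fields. To reuse the visualisation script further down unchanged, add clocks_throttle_reasons.sw_thermal_slowdown to `fields`.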
For example, a complete Slurm script could look like this:
terminal
#!/bin/bash -e
#SBATCH --account nesi12345
#SBATCH --job-name test-withsmi
#SBATCH --time 06:30:00
#SBATCH --cpus-per-task 3
#SBATCH --mem 12G
#SBATCH --gpus-per-node A100:1
#SBATCH --output %j-%x.out
module purge
module load some-module
STATS_INTERVAL=5
STATS_FILE="/path/to/save/the/output/file/${SLURM_JOB_ID}-gpu_stats.csv"
nvidia-smi --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw,clocks.current.sm \
--format=csv,nounits -l "$STATS_INTERVAL" -f "$STATS_FILE" &
sleep 20
some commands
Visualising Profile data¶
Once the job has completed, the following Python script can be used to visualise the profile data acquired via nvidia-smi.
- For demonstration purposes, we will call the file with profile data 18957616-gpu_stats.csv and the figure to be generated gpu_profile_figure.png.
Visualise nvidia-smi profile data
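# Replace 18957616 with the corresponding job ID
import pandas as pd

# nvidia-smi separates fields with ", "; a multi-character separator needs the python parsing engine
dset = pd.read_csv("18957616-gpu_stats.csv", parse_dates=["timestamp"], index_col="timestamp", sep=", ", engine="python")
# memory.total is constant for a given GPU, so drop it, along with any incomplete rows
dset = dset.drop(columns="memory.total [MiB]").dropna()
# Encode the throttle flag as 0 when the value contains "Not", otherwise 1
dset["sw_thermal_slowdown"] = dset["clocks_throttle_reasons.sw_thermal_slowdown"].apply(lambda x: 0 if "Not" in x else 1)
# Draw one subplot per numeric column and save the figure
axes = dset.plot(subplots=True, figsize=(15, 10), grid=True)
axes[0].figure.savefig("gpu_profile_figure.png")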
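Alongside the figure, a few summary numbers are often enough to judge whether the GPU was well used. The sketch below is an illustrative addition that uses only the Python standard library. The function name `summarise_gpu_stats` is hypothetical, and the column headers `utilization.gpu [%]` and `memory.used [MiB]` are assumptions that follow the same naming pattern as `memory.total [MiB]` in the script above; check them against the first line of your own .csv file.

```python
import csv


def summarise_gpu_stats(csv_path):
    """Summarise GPU utilisation and memory use from an nvidia-smi CSV log."""
    util, mem_used = [], []
    with open(csv_path, newline="") as handle:
        # skipinitialspace copes with the ", " separator written by nvidia-smi
        reader = csv.DictReader(handle, skipinitialspace=True)
        for row in reader:
            try:
                # Assumed header names; a missing column raises KeyError
                u = float(row["utilization.gpu [%]"])
                m = float(row["memory.used [MiB]"])
            except (TypeError, ValueError):
                continue  # skip incomplete or non-numeric rows
            util.append(u)
            mem_used.append(m)
    if not util:
        raise ValueError("no usable samples found in " + str(csv_path))
    return {
        "samples": len(util),
        "mean_gpu_util": sum(util) / len(util),
        "peak_gpu_util": max(util),
        "peak_mem_used_mib": max(mem_used),
    }
```

Calling `summarise_gpu_stats("18957616-gpu_stats.csv")` returns a dictionary with the sample count, the mean and peak GPU utilisation, and the peak memory used. A low mean utilisation over a long job may suggest that the GPU spends much of its time waiting on the CPU or on I/O.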