Slurm array: Run multiple BLAST queries in parallel with a single submission script
BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
For the demonstrator
- Working directory /nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast
- Sequences were delivered in a single file /nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast/parent_file/demo-hic.fa
- Sequence file was split into 150 separate queries, one per sequence, with faSplit and stored in input-queries
$ pwd
/nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast
$ faSplit sequence parent_file/demo-hic.fa 150 split-queries/demo-hic
$ ls split-queries/
demo-hic000.fa demo-hic019.fa demo-hic038.fa demo-hic057.fa demo-hic076.fa demo-hic095.fa demo-hic114.fa demo-hic133.fa
demo-hic001.fa demo..............................
- Slurm Submission script is /nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast/scripts/demo-array.slurm
- Query outputs will be saved to blast-out/ and the Slurm stdout to slurm-logs/, both inside the working directory
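Slurm does not create the directory named in --output, so a job whose log directory is missing fails without leaving a log. The sketch below creates both output directories before submission. It assumes it is run from the working directory listed above; WORKDIR is a hypothetical variable introduced only for this example.

```shell
# WORKDIR is a hypothetical variable; it defaults to the current directory.
WORKDIR="${WORKDIR:-$PWD}"

# -p makes this safe to rerun: existing directories are left untouched.
mkdir -p "${WORKDIR}/blast-out" "${WORKDIR}/slurm-logs"
```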
Slurm array script
#!/bin/bash -e
#SBATCH --account nesi99999
#SBATCH --job-name blast_fastaSplit
#SBATCH --cpus-per-task 1
#SBATCH --mem 2G
#SBATCH --time 24:00:00
#SBATCH --array 0-149
#SBATCH --output /nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast/slurm-logs/%A_%a.out
date;hostname;pwd
module load BLASTDB/2024-01
module load BLAST/2.13.0-GCC-11.3.0
export INPUT_DIR=/nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast/input-queries
export OUTPUT_DIR=/nesi/project/nesi99999/Dinindu/20230503-pfr-demo/blast/blast-out
RUN_ID=$(( $SLURM_ARRAY_TASK_ID + 1 ))
QUERY_FILE=$( ls ${INPUT_DIR} | sed -n ${RUN_ID}p )
QUERY_NAME="${QUERY_FILE%.*}"
QUERY="${INPUT_DIR}/${QUERY_FILE}"
OUTPUT="${OUTPUT_DIR}/${QUERY_NAME}.out"
echo -e "Command:\nblastn -query ${QUERY} -db nt -out ${OUTPUT} -outfmt 6 -max_target_seqs 1 -num_threads $SLURM_CPUS_PER_TASK"
blastn -query ${QUERY} -db nt -out ${OUTPUT} -outfmt 6 -max_target_seqs 1 -num_threads $SLURM_CPUS_PER_TASK
date
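The script maps each array task to one query file by listing the input directory and picking line SLURM_ARRAY_TASK_ID + 1, since array indices start at 0 here while sed line numbers start at 1. The dry run below reproduces that mapping on three throwaway files, so it can be checked without submitting anything. DEMO_DIR and the hard-coded task ID are assumptions made for illustration only.

```shell
#!/bin/bash
# Dry run of the task-to-file mapping, using throwaway files in a temp directory.
DEMO_DIR=$(mktemp -d)
touch "${DEMO_DIR}/demo-hic000.fa" "${DEMO_DIR}/demo-hic001.fa" "${DEMO_DIR}/demo-hic002.fa"

SLURM_ARRAY_TASK_ID=1                    # set by Slurm in a real array task
RUN_ID=$(( SLURM_ARRAY_TASK_ID + 1 ))    # shift to sed's 1-based line numbers
QUERY_FILE=$( ls "${DEMO_DIR}" | sed -n "${RUN_ID}p" )
QUERY_NAME="${QUERY_FILE%.*}"            # strip the .fa extension

echo "task ${SLURM_ARRAY_TASK_ID} -> ${QUERY_FILE} -> ${QUERY_NAME}.out"

rm -rf "${DEMO_DIR}"                     # clean up the throwaway files
```

Because the mapping depends on the sorted output of ls, the input directory should hold only query files, and its contents should not change while the array is running.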
Submit
- Submit the script with
sbatch scripts/demo-array.slurm
- If needed, use array throttling (keeping only a certain number of tasks RUNNING at a time). Let's say we want to run only 20 queries at a time (out of 150), then change the array directive in the submission script to
#SBATCH --array 0-149%20
or pass the same option to the sbatch command during submission, which overrides the directive in the script
sbatch --array 0-149%20 scripts/demo-array.slurm
- Review the status of the submission with
squeue -j jobid
where jobid is the job ID reported by sbatch at submission
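The array range 0-149 is tied to the number of query files, so it has to be updated whenever the input is split differently. One possible way to avoid hard-coding it is to build the --array value from a file count at submission time. array_spec below is a hypothetical helper written for this example, not part of Slurm.

```shell
# array_spec is a hypothetical helper: it prints an --array value for N tasks
# numbered from 0, with at most LIMIT of them running at once.
array_spec() {
    local n_tasks=$1 limit=$2
    echo "0-$(( n_tasks - 1 ))%${limit}"
}

array_spec 150 20    # prints 0-149%20

# Possible use at submission time (not run here):
# sbatch --array "$(array_spec "$(ls input-queries | wc -l)" 20)" scripts/demo-array.slurm
```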