I have a workstation that I am currently using to run the following code structure:
A MATLAB script manages everything and iteratively calls a second wrapper function. Within this wrapper, I submit multiple jobs (each one is a model simulation requiring one core) using the batch command, wait for them all to complete, then return some output to the main script. This works fine on my computer running 12 jobs in parallel, but each model simulation takes 2-3 hours and I am limited by the number of cores on my machine. Ideally I would need to run ~50+ jobs in parallel to get reasonable run times.
I would like to get this working on the university cluster, which uses the SLURM workload manager. My problem is that a single node on this cluster does not have enough cores to give much of a speedup, so I need to submit the job to run on multiple nodes to take full advantage of the resources available. Of course, I run into a problem because the main script only needs 1 core, so trying to split this single task over several nodes makes no sense to SLURM and throws an error.
I am very much a beginner with SLURM, so presumably this is a mistake in how I configure the job submission. The script I am using is as follows:
#!/bin/bash
#SBATCH -J my_script
#SBATCH --output=/scratch/%u/%x-%N-%j.out
#SBATCH --error=/scratch/%u/%x-%N-%j.err
#SBATCH -p 24hour
#SBATCH --cpus-per-task=40
#SBATCH --nodes=2
#SBATCH --tasks=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user sebastian.rosier@northumbria.ac.uk
#SBATCH --exclusive
module load MATLAB/R2018a
srun -N 2 -n 1 -c 40 matlab -nosplash -nodesktop -r "my_script; quit;"
The model wrapper that submits multiple batch jobs is something like this:
c = parcluster;
for ii = 1:N
    workerTable{ii} = batch(c,'my_model',1,{my_model_opts});
end
with additional lines to check job status and get results etc.
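For concreteness, a minimal sketch of what such status-checking and result-collection lines could look like is given here. It is an illustration and not the exact code from the wrapper: it assumes N and my_model_opts are already defined, that my_model returns a single output, and the variable name results is hypothetical.

```matlab
% Sketch only: N and my_model_opts are assumed to exist in the workspace,
% and my_model is the model function referred to in the post.
c = parcluster;                          % default cluster profile
workerTable = cell(1, N);
for ii = 1:N
    % One batch job per simulation, each requesting one output argument
    workerTable{ii} = batch(c, 'my_model', 1, {my_model_opts});
end

results = cell(1, N);                    % hypothetical container for the outputs
for ii = 1:N
    job = workerTable{ii};
    wait(job);                           % block until this job finishes or fails
    if strcmp(job.State, 'finished')
        out = fetchOutputs(job);         % cell array with one entry per requested output
        results{ii} = out{1};
    else
        warning('Job %d ended in state "%s".', ii, job.State);
    end
    delete(job);                         % remove the job's data from the cluster job storage
end
```

Waiting on the jobs one at a time in submission order is enough here, because the wrapper only returns once every job has completed.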
Perhaps what I am trying to do makes no sense and I need to come up with a completely different structure to my MATLAB script. Either way, any help would be much appreciated!
Sebastian
Best Answer