detect reserved nodes whose SLURM jobs have crashed
We have the following situation: a firework is launched in reservation mode. The SLURM job starts some time later but crashes before it reaches the rlaunch command in the SLURM script. As a result, the job is neither running nor pending in the SLURM queue, nor is it marked as "fizzled" in the database. There should be a tool, similar to check_lostjobs(), to detect such jobs and to show their error output. One hint is to use detect_unreserved() [1], but it is based only on the time elapsed since the reservation. It may therefore also catch jobs that are still pending in the SLURM queue and thus are not lost. This has to be taken into account.
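
A minimal sketch of the queue check described above, not a finished implementation. It assumes that each reserved launch is available as a dict that records the SLURM job ID under a hypothetical `job_id` key, and that `squeue` is on the PATH. The function names are illustrative and are not part of any existing API.

```python
import subprocess


def parse_squeue_output(text):
    """Extract job IDs from `squeue -h -o %i` output (one ID per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def query_active_job_ids():
    """Return the IDs of all jobs that SLURM still lists (pending or running)."""
    result = subprocess.run(
        ["squeue", "-h", "-o", "%i"],  # -h: no header, -o %i: job ID only
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue_output(result.stdout)


def find_lost_reservations(reserved, active_job_ids):
    """Return the reserved launches whose SLURM job is no longer in the queue.

    `reserved` is an iterable of dicts with a hypothetical `job_id` key.
    Launches whose job is still pending or running are not reported.
    """
    return [r for r in reserved if str(r["job_id"]) not in active_job_ids]
```

Calling `find_lost_reservations(reserved, query_active_job_ids())` would then report only reservations whose job has left the queue, so jobs that are still pending are not flagged. Showing the error output would additionally require knowing where the SLURM script writes its stderr file.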
Edited by Ivan Kondov