
Detect unreserved and lost launches and show more metadata in overview

Related to issue #420 (closed).

Maintaining state consistency

Variables that remain in RUNNING or RESERVED state although their Slurm jobs have been cancelled, have exceeded the job wall time, or have crashed should be updated after some period.

For RUNNING jobs, a launch counts as lost if, say, the last heartbeat is older than the default of 4 hours (the default heartbeat period is 1 hour). Use this API function. The problem with the function is that it sets the state to FIZZLED, which has two disadvantages: 1) we cannot distinguish jobs marked FIZZLED by the rocket from those marked by the launchpad; 2) we may want to rerun directly without going through FIZZLED.

With the underlying API function, inconsistent Fireworks can be detected and the relevant workflows can optionally be refreshed.
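The heartbeat check described above can be sketched in isolation. This is only an illustration: the `Launch` record and the `find_lost_runs` helper are hypothetical names, not part of the API, and the function only reports candidates instead of setting any state to FIZZLED.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RUN_EXPIRATION = timedelta(hours=4)  # default quoted in the text above


@dataclass
class Launch:
    """Hypothetical stand-in for a launch record, not the real API class."""
    launch_id: int
    state: str
    last_ping: datetime


def find_lost_runs(launches, now, expiration=RUN_EXPIRATION):
    """Return ids of RUNNING launches whose last heartbeat is too old.

    Only reports candidates; no state is changed.
    """
    return [
        launch.launch_id
        for launch in launches
        if launch.state == "RUNNING" and now - launch.last_ping > expiration
    ]


now = datetime(2025, 2, 22, 16, 0, 0)
launches = [
    Launch(1, "RUNNING", now - timedelta(minutes=30)),  # heartbeat is recent
    Launch(2, "RUNNING", now - timedelta(hours=5)),     # heartbeat expired
    Launch(3, "RESERVED", now - timedelta(days=1)),     # not RUNNING, ignored
]
print(find_lost_runs(launches, now))  # [2]
```

Returning ids rather than changing states keeps the decision (rerun directly, mark as FIZZLED, or only warn) with the caller.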

For RESERVED jobs, a check with Slurm is performed. Use this API function. The default time since the last update is 2 weeks, which is reasonable (it is unlikely that a job stays pending in the queue for more than two weeks).
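A possible shape of the RESERVED check, as a rough sketch: the record layout `(launch_id, slurm_job_id, reserved_at)` and the helper name `find_lost_reservations` are assumptions, and the set of job ids still known to Slurm is assumed to be obtained elsewhere. Treating a job that has left the queue as lost is also an assumption about how the Slurm check could work.

```python
from datetime import datetime, timedelta

RESERVATION_EXPIRATION = timedelta(weeks=2)  # default quoted in the text above


def find_lost_reservations(reservations, active_job_ids, now,
                           expiration=RESERVATION_EXPIRATION):
    """Return launch ids of RESERVED launches that look lost.

    reservations: iterable of (launch_id, slurm_job_id, reserved_at) tuples
                  (hypothetical record layout).
    active_job_ids: set of job ids that Slurm still reports.
    """
    lost = []
    for launch_id, job_id, reserved_at in reservations:
        gone_from_queue = job_id not in active_job_ids
        too_old = now - reserved_at > expiration
        if gone_from_queue or too_old:
            lost.append(launch_id)
    return lost


now = datetime(2025, 2, 22, 16, 0, 0)
reservations = [
    (10, 5001, now - timedelta(days=1)),   # still queued, recent
    (11, 5002, now - timedelta(days=2)),   # no longer known to Slurm
    (12, 5003, now - timedelta(days=15)),  # queued, but older than two weeks
]
print(find_lost_reservations(reservations, {5001, 5003}, now))  # [11, 12]
```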

As a first step, we should integrate some notification (UserWarning?) so that we can monitor which actions would be taken on lost launches in RESERVED and RUNNING state and on inconsistent Fireworks.
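Such a notification could look roughly like this, using Python's standard `warnings` module. The function name `warn_planned_actions` and the message wording are placeholders; the sketch only reports what was detected and changes nothing.

```python
import warnings


def warn_planned_actions(lost_running, lost_reserved, inconsistent):
    """Emit one UserWarning per non-empty category; change nothing."""
    categories = [
        ("lost RUNNING launches", lost_running),
        ("lost RESERVED launches", lost_reserved),
        ("inconsistent Fireworks", inconsistent),
    ]
    messages = []
    for label, ids in categories:
        if ids:
            message = (f"{len(ids)} {label} detected "
                       f"(no action taken): {sorted(ids)}")
            warnings.warn(message, UserWarning, stacklevel=2)
            messages.append(message)
    return messages


# Record the warnings instead of printing them, to inspect them afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    messages = warn_planned_actions([2], [11, 12], [])

for message in messages:
    print(message)
```

Because the warnings go through the standard mechanism, users can silence or escalate them with the usual warning filters.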

Explicitly changing states

Also, what happens to the submitted Slurm jobs when we %rerun a RESERVED or RUNNING variable? Again, there are functions in the VRE Middleware API to implement this safely.

Here is an overview of the functions that can be used:

| method | processed states | target state(s) |
|---|---|---|
| cancel_job | RESERVED, RUNNING | WAITING, DEFUSED |
| update_node | READY, WAITING, FIZZLED, DEFUSED, PAUSED | no change |
| rerun_node | COMPLETED, FIZZLED | WAITING |
| update_rerun_node | COMPLETED, WAITING, READY, FIZZLED | WAITING |
| pause_fw | WAITING, READY, RESERVED | PAUSED |
| resume_fw | PAUSED | WAITING |
| defuse_fw | DEFUSED, WAITING, READY, FIZZLED, PAUSED | DEFUSED |
| reignite_fw | DEFUSED | WAITING |

| original state | target state | applicable methods | magic / operator |
|---|---|---|---|
| ARCHIVED | - | - | - |
| FIZZLED | WAITING | rerun_fw, rerun_node, update_rerun | %rerun |
| FIZZLED | DEFUSED | defuse_fw | %cancel |
| FIZZLED | FIZZLED | update_node, update_spec | := |
| DEFUSED | WAITING | reignite_fw | %rerun |
| PAUSED | WAITING | resume_fw | %rerun |
| PAUSED | DEFUSED | defuse_fw | %cancel |
| WAITING | WAITING | update_spec, update_node, update_rerun | := |
| WAITING | DEFUSED | defuse_fw | - |
| WAITING | PAUSED | pause_fw | %cancel |
| READY | PAUSED | pause_fw | %cancel |
| READY | DEFUSED | defuse_fw | - |
| READY | READY | update_spec, update_node, update_rerun | := |
| RESERVED | WAITING | cancel_job, rerun_node, rerun_fw | %rerun |
| RESERVED | PAUSED | pause_fw | %cancel |
| RUNNING | WAITING | cancel_job, detect_lostruns, rerun_fw | %rerun |
| RUNNING | FIZZLED | cancel_job, detect_lostruns | - |
| RUNNING | DEFUSED | cancel_job | %cancel |
| COMPLETED | WAITING | rerun_node, rerun_fw, update_rerun | %rerun |
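For the two magics, the rows above can be expressed as a lookup table, which makes it easy to reject transitions that the table does not list. A minimal sketch; the names `TRANSITIONS` and `target_state` are made up for illustration, and the `:=` rows are left out because they do not change the state.

```python
# (original state, magic) -> target state, transcribed from the rows above.
TRANSITIONS = {
    ("FIZZLED", "%rerun"): "WAITING",
    ("FIZZLED", "%cancel"): "DEFUSED",
    ("DEFUSED", "%rerun"): "WAITING",
    ("PAUSED", "%rerun"): "WAITING",
    ("PAUSED", "%cancel"): "DEFUSED",
    ("WAITING", "%cancel"): "PAUSED",
    ("READY", "%cancel"): "PAUSED",
    ("RESERVED", "%rerun"): "WAITING",
    ("RESERVED", "%cancel"): "PAUSED",
    ("RUNNING", "%rerun"): "WAITING",
    ("RUNNING", "%cancel"): "DEFUSED",
    ("COMPLETED", "%rerun"): "WAITING",
}


def target_state(state, magic):
    """Return the target state, or None if no such transition is listed."""
    return TRANSITIONS.get((state, magic))


print(target_state("RUNNING", "%cancel"))    # DEFUSED
print(target_state("RESERVED", "%rerun"))    # WAITING
print(target_state("COMPLETED", "%cancel"))  # None
```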

Overviews (%find, %history)

The time stamps should be shorter: omit the time zone (+01:00) and document that all time stamps in the interactive session are local times.
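One way to shorten the stamps, sketched with the standard `datetime` module. The helper name `short_stamp` is hypothetical; with `tz=None` the stamp is converted to the local time zone of the running session, and the fixed offset below is only there to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone


def short_stamp(iso_stamp, tz=None):
    """Convert an ISO 8601 stamp with offset to a local stamp without offset.

    tz=None converts to the local time zone of the running session.
    Fractions of a second are dropped.
    """
    aware = datetime.fromisoformat(iso_stamp)
    local = aware.astimezone(tz)
    return local.replace(tzinfo=None).isoformat(timespec="seconds")


cet = timezone(timedelta(hours=1))  # fixed offset for a reproducible example
print(short_stamp("2025-02-22T15:42:11.123456+01:00", tz=cet))  # 2025-02-22T15:42:11
print(short_stamp("2025-02-22T14:42:11+00:00", tz=cet))         # 2025-02-22T15:42:11
```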

The overview of searches shows the states, the time stamp of the last change, and the UUID. It would be very helpful to add model tags. It would also be helpful to add an argument to %hist to show the states of the variables in another model. The find results are not sorted at all; sorting by time stamp or state might help.
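Sorting could be as simple as the following sketch. The row layout (dicts with `state`, `updated`, and `uuid` keys), the `sort_results` name, and the example ids are made up; comparing the stamps as strings assumes they all share the same ISO format and time zone.

```python
def sort_results(rows, by="time"):
    """Sort result rows, newest first by time or alphabetically by state.

    rows: list of dicts with keys 'state', 'updated' (ISO string), 'uuid'.
    """
    if by == "time":
        return sorted(rows, key=lambda row: row["updated"], reverse=True)
    if by == "state":
        return sorted(rows, key=lambda row: (row["state"], row["updated"]))
    raise ValueError(f"unknown sort key: {by}")


rows = [
    {"state": "COMPLETED", "updated": "2025-02-20T10:00:00", "uuid": "a"},
    {"state": "RUNNING", "updated": "2025-02-22T15:42:11", "uuid": "b"},
    {"state": "COMPLETED", "updated": "2025-02-21T09:30:00", "uuid": "c"},
]
print([row["uuid"] for row in sort_results(rows, by="time")])   # ['b', 'c', 'a']
print([row["uuid"] for row in sort_results(rows, by="state")])  # ['a', 'c', 'b']
```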

Also, the state of the workflow (model) is not very informative in the listing from %find. The overview would be better with columns such as these:

UUID                              update time          C  REA  W   RUN  ...
ccb6ce510946423cad5e8aa8ac2bfcb2  2025-02-22T15:42:11  1  4    10  5

The columns C, RUN, REA, RES, W, ... give the number of statements in each state (COMPLETED, RUNNING, READY, RESERVED, WAITING, ...).
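Such a row could be produced by counting the statement states per workflow. A minimal sketch: the abbreviations are taken from the text above, while their mapping to full state names, the column order, and the `summary_row` helper are assumptions.

```python
from collections import Counter

# Abbreviations come from the text; the mapping to state names is assumed.
COLUMNS = [("C", "COMPLETED"), ("REA", "READY"), ("W", "WAITING"),
           ("RUN", "RUNNING"), ("RES", "RESERVED")]


def summary_row(uuid, updated, states):
    """Build one overview row: UUID, update time, then one count per column."""
    counts = Counter(states)  # states that do not occur count as 0
    cells = [uuid, updated] + [str(counts[full]) for _, full in COLUMNS]
    return " ".join(cells)


states = (["COMPLETED"] + ["READY"] * 4 + ["WAITING"] * 10 + ["RUNNING"] * 5)
print(" ".join(["UUID", "update_time"] + [abbr for abbr, _ in COLUMNS]))
print(summary_row("ccb6ce510946423cad5e8aa8ac2bfcb2",
                  "2025-02-22T15:42:11", states))
```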

Edited by Ivan Kondov