Detect unreserved and lost launches and show more metadata in overview

Maintaining state consistency

Variables that remain in the RUNNING or RESERVED state although their Slurm jobs have been cancelled, have exceeded the job wall time, or have crashed should be updated after some period.

For RUNNING jobs, a launch can be considered lost if, say, the last heartbeat is older than the default of 4 hours (the default heartbeat period is 1 hour). Use this API function. The problem with the function is that it sets the state to FIZZLED, which has two disadvantages: 1) we cannot distinguish jobs marked FIZZLED by the rocket from those marked by the launchpad; 2) we may want to rerun directly without going through FIZZLED.

With the underlying API function, inconsistent Fireworks can be detected and the relevant workflows can optionally be refreshed.
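
The heartbeat check described above can be sketched in a few lines. This is only an illustration and not the middleware or FireWorks code: the record layout (`launch_id`, `state`, `last_ping`) and the function name `find_lost_runs` are assumptions made for the example. The sketch only reports candidates and does not change any state, so nothing is set to FIZZLED.

```python
from datetime import datetime, timedelta

# Default expiration quoted in this issue: 4 hours without a heartbeat.
RUN_EXPIRATION = timedelta(hours=4)


def find_lost_runs(launches, now, expiration=RUN_EXPIRATION):
    """Return ids of RUNNING launches whose last heartbeat is older than `expiration`.

    `launches` is a list of dicts with the hypothetical keys
    'launch_id', 'state' and 'last_ping' (a datetime).
    """
    return [
        launch["launch_id"]
        for launch in launches
        if launch["state"] == "RUNNING"
        and now - launch["last_ping"] > expiration
    ]


# Hypothetical sample data for illustration only.
now = datetime(2025, 2, 22, 16, 0, 0)
launches = [
    {"launch_id": 1, "state": "RUNNING", "last_ping": now - timedelta(minutes=30)},
    {"launch_id": 2, "state": "RUNNING", "last_ping": now - timedelta(hours=5)},
    {"launch_id": 3, "state": "COMPLETED", "last_ping": now - timedelta(days=1)},
]
lost = find_lost_runs(launches, now)
print(lost)
```

Keeping detection separate from the state change leaves the decision between rerunning directly and marking as FIZZLED to the caller.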

For RESERVED jobs, a check with Slurm is performed. Use this API function. The default time since the last update is 2 weeks, which seems reasonable (a job is unlikely to stay pending in the queue for more than two weeks).
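
A possible shape of the RESERVED check, as a sketch under assumptions: the set of job ids still known to Slurm is passed in as `active_job_ids` (how it is obtained is left open here), and the record keys `launch_id`, `job_id` and `reserved_on` are hypothetical. A reservation is reported either because Slurm no longer knows the job or because it is older than the two-week default.

```python
from datetime import datetime, timedelta

# Default quoted in this issue: reservations older than 2 weeks are suspicious.
RESERVATION_EXPIRATION = timedelta(weeks=2)


def find_lost_reservations(reservations, active_job_ids, now,
                           expiration=RESERVATION_EXPIRATION):
    """Split RESERVED launches into (gone, expired) lists of launch ids.

    gone:    the batch job is no longer known to the scheduler
    expired: the job is still queued but reserved longer than `expiration`
    """
    gone, expired = [], []
    for res in reservations:
        if res["job_id"] not in active_job_ids:
            gone.append(res["launch_id"])
        elif now - res["reserved_on"] > expiration:
            expired.append(res["launch_id"])
    return gone, expired


# Hypothetical sample data for illustration only.
now = datetime(2025, 2, 22, 16, 0, 0)
reservations = [
    {"launch_id": 10, "job_id": 5001, "reserved_on": now - timedelta(days=1)},
    {"launch_id": 11, "job_id": 5002, "reserved_on": now - timedelta(days=2)},
    {"launch_id": 12, "job_id": 5003, "reserved_on": now - timedelta(days=20)},
]
active_job_ids = {5001, 5003}  # assumed to come from a scheduler query
gone, expired = find_lost_reservations(reservations, active_job_ids, now)
print(gone, expired)
```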

As a first step, we should integrate some notification (UserWarning?) that lets us monitor which actions would be taken on lost launches in RESERVED and RUNNING state and on inconsistent Fireworks.
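
Such a notification could look roughly like the following sketch, which uses the standard `warnings` module and changes nothing in the database. The function name `report_planned_actions` and the message wording are assumptions for illustration, not an existing interface.

```python
import warnings


def report_planned_actions(lost_running, lost_reserved, inconsistent):
    """Emit one UserWarning per non-empty category and return the messages.

    This is a dry run: it only reports what was detected.
    """
    findings = {
        "lost RUNNING launches": lost_running,
        "lost RESERVED launches": lost_reserved,
        "inconsistent Fireworks": inconsistent,
    }
    messages = []
    for label, ids in findings.items():
        if ids:
            msg = f"{len(ids)} {label} detected (ids: {sorted(ids)}); no action taken"
            warnings.warn(msg, UserWarning, stacklevel=2)
            messages.append(msg)
    return messages


# Hypothetical ids for illustration only.
report_planned_actions(lost_running=[12, 7], lost_reserved=[], inconsistent=[3])
```

Because warnings can be filtered or turned into errors by the user, this keeps the monitoring phase non-intrusive.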

Explicitly changing states

Also, what happens to the submitted Slurm jobs when we %rerun a RESERVED or RUNNING variable? Again, there are functions in the VRE Middleware API to implement this safely.

Here is an overview of the functions that can be used:

method	processed states	target state(s)
cancel_job	RESERVED, RUNNING	WAITING, DEFUSED
update_node	READY, WAITING, FIZZLED, DEFUSED, PAUSED	no change
rerun_node	COMPLETED, FIZZLED	WAITING
update_rerun_node	COMPLETED, WAITING, READY, FIZZLED	WAITING
pause_fw	WAITING, READY, RESERVED	PAUSED
resume_fw	PAUSED	WAITING
defuse_fw	DEFUSED, WAITING, READY, FIZZLED, PAUSED	DEFUSED
reignite_fw	DEFUSED	WAITING

original state	target state	applicable methods	magic / operator
ARCHIVED	-	-	-
FIZZLED	WAITING	rerun_fw, rerun_node, update_rerun	%rerun
FIZZLED	DEFUSED	defuse_fw	%cancel
FIZZLED	FIZZLED	update_node, update_spec	:=
DEFUSED	WAITING	reignite_fw	%rerun
PAUSED	WAITING	resume_fw	%rerun
PAUSED	DEFUSED	defuse_fw	%cancel
WAITING	WAITING	update_spec, update_node, update_rerun	:=
WAITING	DEFUSED	defuse_fw	-
WAITING	PAUSED	pause_fw	%cancel
READY	PAUSED	pause_fw	%cancel
READY	DEFUSED	defuse_fw	-
READY	READY	update_spec, update_node, update_rerun	:=
RESERVED	WAITING	cancel_job, rerun_node, rerun_fw	%rerun
RESERVED	PAUSED	pause_fw	%cancel
RUNNING	WAITING	cancel_job, detect_lostruns, rerun_fw	%rerun
RUNNING	FIZZLED	cancel_job, detect_lostruns	-
RUNNING	DEFUSED	cancel_job	%cancel
COMPLETED	WAITING	rerun_node, rerun_fw, update_rerun	%rerun
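
The second table can be encoded as a lookup so that a magic or operator is resolved to a target state and candidate methods before anything is executed. This is a sketch: the dictionary copies only the rows that have a magic or operator assigned, and the helper name `plan` is an assumption.

```python
# (original state, magic/operator) -> (target state, applicable methods),
# copied from the rows of the table above that have a magic or operator.
TRANSITIONS = {
    ("FIZZLED", "%rerun"): ("WAITING", ("rerun_fw", "rerun_node", "update_rerun")),
    ("FIZZLED", "%cancel"): ("DEFUSED", ("defuse_fw",)),
    ("FIZZLED", ":="): ("FIZZLED", ("update_node", "update_spec")),
    ("DEFUSED", "%rerun"): ("WAITING", ("reignite_fw",)),
    ("PAUSED", "%rerun"): ("WAITING", ("resume_fw",)),
    ("PAUSED", "%cancel"): ("DEFUSED", ("defuse_fw",)),
    ("WAITING", ":="): ("WAITING", ("update_spec", "update_node", "update_rerun")),
    ("WAITING", "%cancel"): ("PAUSED", ("pause_fw",)),
    ("READY", "%cancel"): ("PAUSED", ("pause_fw",)),
    ("READY", ":="): ("READY", ("update_spec", "update_node", "update_rerun")),
    ("RESERVED", "%rerun"): ("WAITING", ("cancel_job", "rerun_node", "rerun_fw")),
    ("RESERVED", "%cancel"): ("PAUSED", ("pause_fw",)),
    ("RUNNING", "%rerun"): ("WAITING", ("cancel_job", "detect_lostruns", "rerun_fw")),
    ("RUNNING", "%cancel"): ("DEFUSED", ("cancel_job",)),
    ("COMPLETED", "%rerun"): ("WAITING", ("rerun_node", "rerun_fw", "update_rerun")),
}


def plan(state, operator):
    """Return (target state, methods) or raise ValueError if not allowed."""
    try:
        return TRANSITIONS[(state, operator)]
    except KeyError:
        raise ValueError(f"{operator} is not applicable in state {state}") from None


print(plan("RESERVED", "%rerun"))
```

A single table like this also makes it easy to reject unsupported combinations, for example any operation on an ARCHIVED variable, with a clear message.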

Overviews (%find, %history)

The time stamps should be shorter: omit the time zone (+01:00) and document that all time stamps in the interactive session are local times.
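
One way to shorten the stamps, sketched with the standard `datetime` module only. The helper name `short_local_stamp` is an assumption. With `tz=None` the stamp is converted to the local time zone of the machine; the explicit `tz` argument exists only to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone


def short_local_stamp(iso_stamp, tz=None):
    """Convert an ISO 8601 stamp to local time and drop offset and microseconds."""
    local = datetime.fromisoformat(iso_stamp).astimezone(tz)
    return local.replace(tzinfo=None).isoformat(timespec="seconds")


cet = timezone(timedelta(hours=1))
print(short_local_stamp("2025-02-22T15:42:11.123456+01:00", tz=cet))
print(short_local_stamp("2025-02-22T15:42:11+01:00", tz=timezone.utc))
```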

The overview of searches shows states, time stamp of last change and UUID. It would be very helpful to add model tags. It would also be helpful to add an argument to %hist to show the states of the variables in another model. The find results are not sorted at all; sorting by time stamp or state might be helpful.

Also, the state of the workflow (model) is not very informative in the listing from %find. The overview would be better with columns such as these:

UUID	update time	C	REA	W	RUN	...
ccb6ce510946423cad5e8aa8ac2bfcb2	2025-02-22T15:42:11	1	4	10	5

The columns C, RUN, REA, RES, W, ... give the number of statements in each state.
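
The proposed listing could be built as in the following sketch, which counts statement states per model and sorts the rows by update time, newest first. The mapping of column abbreviations to state names, the input layout (`uuid`, `updated_on`, `states`) and the second, all-zero UUID are assumptions for illustration.

```python
from collections import Counter

# Assumed meaning of the column abbreviations.
COLUMNS = {
    "COMPLETED": "C",
    "READY": "REA",
    "WAITING": "W",
    "RUNNING": "RUN",
    "RESERVED": "RES",
}


def summarize(models):
    """Return one row per model with per-state counts, newest update first."""
    rows = []
    # ISO stamps of identical format sort correctly as plain strings.
    for model in sorted(models, key=lambda m: m["updated_on"], reverse=True):
        counts = Counter(model["states"])
        row = {"UUID": model["uuid"], "update time": model["updated_on"]}
        for state, column in COLUMNS.items():
            row[column] = counts.get(state, 0)
        rows.append(row)
    return rows


models = [
    {   # hypothetical second model
        "uuid": "0" * 32,
        "updated_on": "2025-02-20T09:00:00",
        "states": ["COMPLETED", "COMPLETED"],
    },
    {   # the example row from the listing above
        "uuid": "ccb6ce510946423cad5e8aa8ac2bfcb2",
        "updated_on": "2025-02-22T15:42:11",
        "states": ["COMPLETED"] + ["READY"] * 4 + ["WAITING"] * 10 + ["RUNNING"] * 5,
    },
]
rows = summarize(models)
for row in rows:
    print("\t".join(str(value) for value in row.values()))
```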

Edited Feb 26, 2025 by Ivan Kondov