Detect unreserved and lost launches and show more metadata in overview
Related to issue #420 (closed).
Maintaining state consistency
Variables that remain in RUNNING or RESERVED state although their SLURM jobs have been cancelled, have exceeded the job wall time or have crashed should be updated after some period.
For RUNNING jobs, a launch counts as lost when, say, the last heartbeat is older than the default of 4 hours (the default heartbeat period is 1 hour). Use this API function. The problem with the function is that it sets the state to FIZZLED, which has two disadvantages: 1) we cannot distinguish jobs marked FIZZLED by the rocket from those marked by the launchpad; 2) we may want to rerun directly without going through FIZZLED.
With the underlying API function, inconsistent Fireworks can be detected and the relevant workflows can optionally be refreshed.
For RESERVED jobs a check with Slurm is performed. Use this API function. The default time since the last update is 2 weeks, which is quite fair (it is unlikely that a job stays pending in the queue for more than two weeks).
As a first step, we should integrate some notification (UserWarning?) and monitor what actions would be taken on lost launches in RESERVED and RUNNING state and on inconsistent Fireworks.
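The heartbeat criterion can be illustrated with a minimal sketch. It does not call the launchpad. The function name `find_lost_runs` and the input mapping of launch ids to last heartbeat times are assumptions made for illustration only; the 4-hour threshold is the default mentioned above.

```python
from datetime import datetime, timedelta

# Default expiration quoted in the text: 4 hours without a heartbeat.
RUN_EXPIRATION = timedelta(hours=4)


def find_lost_runs(last_heartbeats, now, expiration=RUN_EXPIRATION):
    """Return the sorted ids of launches whose last heartbeat is older than `expiration`.

    `last_heartbeats` is a hypothetical mapping {launch_id: datetime of last heartbeat}.
    """
    return sorted(
        launch_id
        for launch_id, beat in last_heartbeats.items()
        if now - beat > expiration
    )


now = datetime(2025, 2, 22, 16, 0)
beats = {
    1: datetime(2025, 2, 22, 15, 30),  # recent heartbeat, still alive
    2: datetime(2025, 2, 22, 11, 0),   # 5 hours old, considered lost
    3: datetime(2025, 2, 22, 12, 0),   # exactly 4 hours old, not yet lost
}
lost = find_lost_runs(beats, now)
```

Returning the ids instead of changing any state keeps the decision (mark FIZZLED, rerun directly, or only report) with the caller.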
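A check with Slurm could look roughly like the sketch below. It assumes that `squeue` is available on the host and that the Slurm job id of the reservation is known. The helper names and the set of states treated as "still active" are assumptions, not part of the existing code. The command runner is injectable so that the logic can be tried without a cluster.

```python
import subprocess

# Assumption: a reservation is still valid while Slurm reports one of these states.
ACTIVE_SLURM_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def slurm_job_state(job_id, run=subprocess.run):
    """Return the state squeue reports for `job_id`, or None if the job is unknown."""
    result = run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # e.g. squeue no longer knows this job id
    lines = result.stdout.split()
    return lines[0] if lines else None


def reservation_is_lost(job_id, run=subprocess.run):
    """True if the Slurm job behind a RESERVED launch is no longer queued or running."""
    return slurm_job_state(job_id, run=run) not in ACTIVE_SLURM_STATES
```

A job that has left the queue is reported as lost here regardless of the 2-week threshold; whether both criteria should be combined is left open.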
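One possible shape of such a notification is a dry run that only warns about what would be changed. The function name `report_planned_actions` and its three arguments (lists of ids) are hypothetical. Only `warnings.warn` and `UserWarning` come from the Python standard library.

```python
import warnings


def report_planned_actions(lost_running, lost_reserved, inconsistent):
    """Emit one UserWarning per non-empty category; no state is changed (dry run)."""
    categories = [
        ("lost launches in RUNNING state", lost_running),
        ("lost launches in RESERVED state", lost_reserved),
        ("inconsistent Fireworks", inconsistent),
    ]
    messages = []
    for label, ids in categories:
        if ids:
            messages.append(f"{len(ids)} {label} detected: {sorted(ids)}")
    for message in messages:
        # stacklevel=2 attributes the warning to the caller of this function
        warnings.warn(message, UserWarning, stacklevel=2)
    return messages
```

Returning the messages as well makes it possible to log them or show them in the overview later.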
Explicitly changing states
Also, what happens to the submitted Slurm jobs when we %rerun a RESERVED or RUNNING variable? Again, there are functions in the VRE Middleware API to implement this safely.
Here is an overview of functions that can be used:
method | processed states | target state(s) |
---|---|---|
cancel_job | RESERVED, RUNNING | WAITING, DEFUSED |
update_node | READY, WAITING, FIZZLED, DEFUSED, PAUSED | no change |
rerun_node | COMPLETED, FIZZLED | WAITING |
update_rerun_node | COMPLETED, WAITING, READY, FIZZLED | WAITING |
pause_fw | WAITING, READY, RESERVED | PAUSED |
resume_fw | PAUSED | WAITING |
defuse_fw | DEFUSED, WAITING, READY, FIZZLED, PAUSED | DEFUSED |
reignite_fw | DEFUSED | WAITING |

original state | target state | applicable methods | magic / operator |
---|---|---|---|
ARCHIVED | - | - | - |
FIZZLED | WAITING | rerun_fw, rerun_node, update_rerun | %rerun |
FIZZLED | DEFUSED | defuse_fw | %cancel |
FIZZLED | FIZZLED | update_node, update_spec | := |
DEFUSED | WAITING | reignite_fw | %rerun |
PAUSED | WAITING | resume_fw | %rerun |
PAUSED | DEFUSED | defuse_fw | %cancel |
WAITING | WAITING | update_spec, update_node, update_rerun | := |
WAITING | DEFUSED | defuse_fw | - |
WAITING | PAUSED | pause_fw | %cancel |
READY | PAUSED | pause_fw | %cancel |
READY | DEFUSED | defuse_fw | - |
READY | READY | update_spec, update_node, update_rerun | := |
RESERVED | WAITING | cancel_job, rerun_node, rerun_fw | %rerun |
RESERVED | PAUSED | pause_fw | %cancel |
RUNNING | WAITING | cancel_job, detect_lostruns, rerun_fw | %rerun |
RUNNING | FIZZLED | cancel_job, detect_lostruns | - |
RUNNING | DEFUSED | cancel_job | %cancel |
COMPLETED | WAITING | rerun_node, rerun_fw, update_rerun | %rerun |
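The rows of the second table that have a magic or operator can be encoded as a lookup, which makes it easy to reject a transition that the table does not list. This is only a sketch: the names `TRANSITIONS` and `target_state` are illustrative, and the entries are copied from the table above, not from the implementation.

```python
# (original state, magic / operator) -> target state, copied from the table above.
TRANSITIONS = {
    ("FIZZLED", "%rerun"): "WAITING",
    ("FIZZLED", "%cancel"): "DEFUSED",
    ("FIZZLED", ":="): "FIZZLED",
    ("DEFUSED", "%rerun"): "WAITING",
    ("PAUSED", "%rerun"): "WAITING",
    ("PAUSED", "%cancel"): "DEFUSED",
    ("WAITING", ":="): "WAITING",
    ("WAITING", "%cancel"): "PAUSED",
    ("READY", "%cancel"): "PAUSED",
    ("READY", ":="): "READY",
    ("RESERVED", "%rerun"): "WAITING",
    ("RESERVED", "%cancel"): "PAUSED",
    ("RUNNING", "%rerun"): "WAITING",
    ("RUNNING", "%cancel"): "DEFUSED",
    ("COMPLETED", "%rerun"): "WAITING",
}


def target_state(state, magic):
    """Return the target state for `magic` applied in `state`, or None if not listed."""
    return TRANSITIONS.get((state, magic))
```

For example, `target_state("RUNNING", "%cancel")` gives `"DEFUSED"`, while `target_state("ARCHIVED", "%rerun")` gives `None` because ARCHIVED has no transitions in the table.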
Overviews (%find, %history)
The time stamps should be shorter: omit the time zone (+01:00) and document that all stamps in the interactive session are local times.
The overview of searches shows states, the time stamp of the last change and the UUID. It would be very helpful to add model tags. It would also be helpful to add an argument to %hist to show the states of the variables in another model. The find results are not sorted at all. Maybe sorting by time stamp or state would be helpful.
Also, the state of the workflow (model) is not very informative in the listing from %find. The overview would be more useful with columns such as:
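A minimal sketch of the shortening, assuming the stamps are ISO 8601 strings with a UTC offset. The helper name `short_local_stamp` and the optional `tz` argument are illustrative; with `tz=None` the system's local time zone is used.

```python
from datetime import datetime, timedelta, timezone


def short_local_stamp(stamp, tz=None):
    """Convert an ISO 8601 stamp to the target zone and drop the offset and sub-seconds."""
    local = datetime.fromisoformat(stamp).astimezone(tz)
    return local.replace(tzinfo=None).isoformat(timespec="seconds")


# With an explicit zone the result is reproducible on any machine.
cet = timezone(timedelta(hours=1))
shortened = short_local_stamp("2025-02-22T15:42:11+01:00", tz=cet)
```

Here `shortened` is `"2025-02-22T15:42:11"`. The `tz` argument exists mainly to make the behaviour testable; in the interactive session the default would apply.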
UUID | update time | C | REA | W | RUN | ... |
---|---|---|---|---|---|---|
ccb6ce510946423cad5e8aa8ac2bfcb2 | 2025-02-22T15:42:11 | 1 | 4 | 10 | 5 | ... |
The columns C, RUN, REA, RES, W, ... give the number of statements in each state.
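Such a row could be produced by counting the statement states per model, for instance as in the following sketch. The mapping of state names to column abbreviations and the function name `summary_row` are assumptions based on the column names above.

```python
from collections import Counter

# Assumed mapping from state names to the column abbreviations used above.
ABBREVIATIONS = {
    "COMPLETED": "C",
    "READY": "REA",
    "WAITING": "W",
    "RUNNING": "RUN",
    "RESERVED": "RES",
}


def summary_row(uuid, update_time, states, columns=("C", "REA", "W", "RUN")):
    """Format one overview row with the number of statements per state column."""
    counts = Counter(ABBREVIATIONS.get(state, state) for state in states)
    cells = [uuid, update_time] + [str(counts[column]) for column in columns]
    return " | ".join(cells) + " |"


states = ["COMPLETED"] + ["READY"] * 4 + ["WAITING"] * 10 + ["RUNNING"] * 5
row = summary_row("ccb6ce510946423cad5e8aa8ac2bfcb2", "2025-02-22T15:42:11", states)
```

With these inputs `row` reproduces the example line of the table. States without an entry in `columns` are simply not shown.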