- If the error is external (i.e. not in the ``spec``) and has been fixed, just run ``lpad rerun_fws``.
- Fix in place with ``lpad update_fws`` and then ``lpad rerun_fws``. Advantage: all independent COMPLETED fireworks are preserved.
- Fix the error in the workflow template and resubmit the whole workflow.
The execution of a firework may fail for various reasons, in which case the
firework enters the *FIZZLED* state. Depending on the cause of the error, there
are different ways to handle it:
* If the error is external (i.e. not in the ``spec``) and has been fixed, the
  firework can be rerun with the command ``lpad rerun_fws``.
* If the error is in the *spec* of the firework, it can be fixed in place with
  the command ``lpad update_fws`` and then rerun with ``lpad rerun_fws``.
  Advantage: all independent *COMPLETED* fireworks are preserved.
* Alternatively, an error in the *spec* can be fixed in the workflow template
  and the whole workflow added to the launchpad again. This approach becomes
  impractical as the number of errors and updates in the same workflow grows.
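The in-place fix can be sketched as a pair of commands. The following minimal
Python sketch only builds and prints the two command lines; it does not run
them. The firework ID ``7`` and the spec key ``temperature`` are hypothetical,
and the ``-i`` (firework ID) and ``-u`` (update document) flags are assumptions
that should be checked against ``lpad update_fws --help`` and
``lpad rerun_fws --help``.

```python
import json
import shlex


def build_fix_commands(fw_id, spec_update):
    """Return the two lpad commands for an in-place fix of one firework."""
    update_doc = json.dumps(spec_update)
    return [
        # Assumed flags: -i selects the firework ID, -u passes the update document.
        ["lpad", "update_fws", "-i", str(fw_id), "-u", update_doc],
        # After the spec has been corrected, the firework is rerun.
        ["lpad", "rerun_fws", "-i", str(fw_id)],
    ]


# Hypothetical firework ID and spec key, for illustration only.
commands = build_fix_commands(7, {"temperature": 300})
for command in commands:
    # shlex.join quotes the JSON document so it can be pasted into a shell.
    print(shlex.join(command))
```

Keeping the update and the rerun together makes it harder to rerun a firework
whose *spec* has not been corrected yet.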
Detect lost runs
----------------
If a job is killed by the batch system, its *RUNNING* status is never updated.
To detect such fireworks, use the command ``lpad detect_lostruns``,
which returns the IDs of fireworks with lost runs. Optionally, these can be
rerun or set to *FIZZLED*.
* RUNNING forever
- Use the command ``lpad detect_lostruns`` to set to FIZZLED or to rerun
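The idea behind detecting lost runs can be illustrated with a small,
self-contained sketch. It assumes that each launch records the time of its last
update and that a run counts as lost once this timestamp is older than an
expiration threshold. The record layout (``fw_id``, ``state``, ``last_update``)
and the four-hour threshold are hypothetical and are not taken from the
FireWorks implementation.

```python
from datetime import datetime, timedelta


def find_lost_runs(launches, now, expiration=timedelta(hours=4)):
    """Return the IDs of RUNNING fireworks whose last update is too old."""
    lost = []
    for launch in launches:
        is_stale = now - launch["last_update"] > expiration
        # Only runs still marked RUNNING can be lost; finished ones are ignored.
        if launch["state"] == "RUNNING" and is_stale:
            lost.append(launch["fw_id"])
    return lost


# Hypothetical launch records, for illustration only.
now = datetime(2024, 1, 1, 12, 0)
launches = [
    {"fw_id": 1, "state": "RUNNING", "last_update": now - timedelta(hours=6)},
    {"fw_id": 2, "state": "RUNNING", "last_update": now - timedelta(minutes=10)},
    {"fw_id": 3, "state": "COMPLETED", "last_update": now - timedelta(hours=8)},
]
lost_ids = find_lost_runs(launches, now)
print(lost_ids)
```

Only firework 1 is reported here: firework 2 was updated recently and firework 3
is no longer running. On the command line, the options for the threshold and
for rerunning or fizzling the detected fireworks are listed by
``lpad detect_lostruns --help``.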
Detect duplicates
-----------------
Fireworks can reuse the data from the launches of identical Fireworks (duplicates).
To enable detection of duplicates, add the following key to the *spec*::

    _dupefinder:
      _fw_name: DupeFinderExact
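The same key can also be added when a *spec* is built programmatically. The
sketch below uses plain dictionaries instead of the FireWorks classes, so the
helper names ``with_dupefinder`` and ``is_exact_duplicate`` are hypothetical.
Treating two fireworks as duplicates exactly when their specs are equal is a
simplifying assumption about what "exact" matching means.

```python
import json


def with_dupefinder(spec):
    """Return a copy of spec with exact duplicate detection enabled."""
    new_spec = dict(spec)
    new_spec["_dupefinder"] = {"_fw_name": "DupeFinderExact"}
    return new_spec


def is_exact_duplicate(spec_a, spec_b):
    # Simplifying assumption: duplicates are fireworks with identical specs.
    return spec_a == spec_b


# Hypothetical task definitions, for illustration only.
task_hello = {"_fw_name": "ScriptTask", "script": "echo hello"}
task_bye = {"_fw_name": "ScriptTask", "script": "echo bye"}

spec_1 = with_dupefinder({"_tasks": [task_hello]})
spec_2 = with_dupefinder({"_tasks": [task_hello]})
spec_3 = with_dupefinder({"_tasks": [task_bye]})

print(json.dumps(spec_1, indent=2))
print(is_exact_duplicate(spec_1, spec_2))
print(is_exact_duplicate(spec_1, spec_3))
```

Under this assumption, any difference in the *spec*, however small, makes two
fireworks distinct, so both are executed.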
...
...
@@ -77,7 +98,7 @@ identical section including four fireworks (1-4)::
The second run of ``rlaunch`` detects four duplicate pairs and executes only the
last firework of the second added workflow. After this, both workflows are
in COMPLETED state which can be checked with::
lpad get_wflows -t -m 2 --rsort created_on
...
...
@@ -86,7 +107,8 @@ Let us now delete the first workflow for that all fireworks have been executed::
lpad delete_wflows -i 1
We see *Remove launches []* in the output, i.e. its launches have not been
removed. When a workflow that includes duplicated fireworks is deleted, the shared
launch is removed from the launchpad only if all duplicated fireworks are deleted.
The launches are now related only to the corresponding fireworks of the second
workflow. They will be removed if we remove the second workflow::
...
...
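The deletion behaviour for shared launches can be pictured as reference
counting. The following sketch is a simplified model and not the FireWorks
implementation: the mapping from firework IDs to launch IDs and the function
``delete_fireworks`` are hypothetical. A launch is removed only when no
remaining firework refers to it.

```python
def delete_fireworks(fw_launches, fw_ids):
    """Delete fireworks and return (remaining mapping, removed launch IDs)."""
    remaining = {fw: ids for fw, ids in fw_launches.items() if fw not in fw_ids}
    # Launches that are still referenced by a surviving firework must be kept.
    still_used = set().union(*remaining.values())
    candidates = set().union(*(fw_launches[fw] for fw in fw_ids))
    return remaining, sorted(candidates - still_used)


# Hypothetical IDs: firework 1 (first workflow) and its duplicate, firework 6
# (second workflow), share launch 101.
fw_launches = {1: {101}, 6: {101}}

after_first, removed_first = delete_fireworks(fw_launches, {1})
after_second, removed_second = delete_fireworks(after_first, {6})
print(removed_first)
print(removed_second)
```

Deleting the first firework removes no launch, which corresponds to the
*Remove launches []* message. The shared launch is removed only with the last
firework that refers to it.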
@@ -113,24 +135,15 @@ us test this with re-running a firework that is identical to another firework::
# the two fireworks are COMPLETED
Security best practices
-----------------------
Configure security (MongoDB authentication and authorization)
Query and analyse data from fireworks and workflows