Sub-models reuse
Motivation
Sometimes it is necessary to combine data from more than one model (workflow) in a new workflow. The data can be exported and then imported into the new workflow, but this is laborious and the data provenance is lost. It is much more natural to copy the relevant sub-workflows (i.e. the nodes providing the inputs and all their ancestors) from the source workflows to the target workflow. The copied sub-workflows are duplicates of their originals; they do not need to be run again but can share the launches of their completed originals. This feature is implemented as duplicate detection in FireWorks.
User interaction
It is possible to implement this behavior with a language extension, with a simple query language, or with a graphical user interface. The query should be based mostly on the source code of the models from which the sub-models will be reused. The query output is presented in tabular or graphical form.
find_one and find_all statements
These statements are not language extensions but can be viewed as a separate language.
They can be implemented graphically, similar to the execution engine WFEngine. The result of the search includes the workflow UUIDs of the matches. A selection can be loaded and explored in more detail by using the print() and view() statements.
Inter-model referencing method
Let a be a variable in a model instance with UUID f4dcv1. The UUID becomes a namespace for the variable a. This creates a scope provider for the variable in the target model (see scoping across multiple models in textX). A possible syntax is, for example, a@f4dcv1. Upon interpretation this reference is resolved and the corresponding data is copied. The provenance is preserved via the namespace with the UUID of the source model.
Advantages: 1) less redundancy of copied data; 2) no need for duplicate detection.
Disadvantages: 1) if the source model is deleted, the provenance of the data is lost; 2) changes in the source model do not trigger recomputation in the target model, which leads to inconsistencies; 3) the variable in the source model must be in the completed state, because only then does it have a value that can be copied.
In-lining method
The syntax can be, e.g., reuse a@f4dcv1. In contrast to the referencing method, this command is not part of the language. It copies the relevant sub-model to the target model and can be replaced by a graphical user interface. The variable a obtains local scope and is referenced as usual.
Advantages: 1) the target model is self-contained: if the source is deleted, the provenance is preserved even if the launches are shared; 2) no language extension is necessary and, correspondingly, no interpreter extension; 3) changes in the sub-model do not cause inconsistencies.
Disadvantages: 1) because of the many duplicates, duplicate detection must be used to avoid repeated computation; 2) lack of modularity, because the copied sub-workflow / sub-model cannot be identified in the target model; 3) possible name conflicts upon copying to the target model if part of it is already written. The latter issue can be solved by using a namespace, but this name change has to be taken into account in the special DupeFinder object so that the copy is detected as a duplicate of the original.
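The namespace issue can be illustrated with a minimal sketch of the comparison a custom duplicate finder would need. It is only an assumption-laden illustration: the rename convention name_<uuid>, the spec key "source" and both function names are hypothetical and are not taken from FireWorks or from the existing code.

```python
import re


def strip_namespace(code, uuid):
    """Undo the hypothetical rename name -> name_<uuid> in a code string."""
    return re.sub(rf"\b(\w+?)_{re.escape(uuid)}\b", r"\1", code)


def same_up_to_namespace(spec1, spec2, uuid, key="source"):
    """Compare two node specs while ignoring the namespace suffix.

    The key "source" is an assumed spec entry holding the statement text.
    """
    return strip_namespace(spec1[key], uuid) == strip_namespace(spec2[key], uuid)


original = {"source": "a = b + c"}
copied = {"source": "a = b_f4dcv1 + c"}
print(same_up_to_namespace(original, copied, "f4dcv1"))  # True
```

Without such a normalization, an exact comparison of the two specs would treat the renamed copy as a new node and recompute it.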
Implementation
Processing the reuse statements (in-lining method)
We do not have to replicate the sub-workflow itself; instead we copy the relevant source code and then regenerate the workflow section. The steps include:
0. Parse the reuse statement. This can be done with a regular expression.
1. Check the grammar compatibility of the source model.
2. Find all parent nodes of the requested node.
3. Collect the source code from these nodes.
4. Extend the target model. In case of name conflicts, references obtain a namespace as given in the reuse statement. If no name conflict is found, no namespace extension has to be performed.
5. Parse, check and interpret the target model.
Steps 0-4 must be implemented in the Session class, in a method invoked before get_model(). Step 5 is performed as usual by the model processors.
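The in-lining steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Session implementation: models are represented as plain dictionaries mapping a variable name to its source statement, the parent relation is given as a dictionary, the grammar compatibility check (step 1) is omitted, and conflicting names are renamed to name_<uuid>. All function names and the rename convention are hypothetical.

```python
import re

# Hypothetical pattern for a statement such as "reuse a@f4dcv1".
REUSE_RE = re.compile(r"^\s*reuse\s+(?P<name>\w+)@(?P<uuid>\w+)\s*$")


def parse_reuse(line):
    """Step 0: return (name, uuid) for a reuse statement, else None."""
    match = REUSE_RE.match(line)
    return (match["name"], match["uuid"]) if match else None


def ancestors(name, parents):
    """Step 2: all ancestors of a node, each listed after its own parents."""
    ordered, seen = [], set()

    def visit(node):
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                visit(parent)
                ordered.append(parent)

    visit(name)
    return ordered


def inline(statement, source, parents, target):
    """Steps 0 and 2-4: copy a variable and its ancestors into the target."""
    parsed = parse_reuse(statement)
    if parsed is None:
        raise ValueError(f"not a reuse statement: {statement!r}")
    name, uuid = parsed
    needed = ancestors(name, parents) + [name]
    # Only names already present in the target receive the namespace suffix.
    renames = {n: f"{n}_{uuid}" for n in needed if n in target}
    extended = dict(target)
    for n in needed:
        code = source[n]  # step 3: collect the source code of the node
        for old, new in renames.items():
            code = re.sub(rf"\b{re.escape(old)}\b", new, code)
        extended[renames.get(n, n)] = code
    return extended


source = {"a": "a = b + c", "b": "b = 1", "c": "c = 2"}
parents = {"a": ["b", "c"]}
target = {"b": "b = 10", "d": "d = b"}
result = inline("reuse a@f4dcv1", source, parents, target)
print(result["a"])  # a = b_f4dcv1 + c
```

In this example only b conflicts with the target model, so only b is renamed; the statement for a is rewritten to refer to the renamed copy, while the target's own b and d stay untouched.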
Processing the reuse statements (reference method)
In the Session class
The reuse statement is caught in the Session class before the get_model() method is called. All occurrences of varname@uuid in the target model are replaced by the corresponding value of the variable in the source model (NOTE: the variable must have been evaluated!). Then the target model input is passed to the model processor. As a consequence, the references to the source in the target model are lost.
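The textual substitution could look roughly like the sketch below. It assumes that a reference has the form varname@uuid with a UUID consisting of word characters only, and that values can be written back as Python literals; the function substitute_references and the lookup callable are hypothetical and stand in for whatever the Session class would use to fetch values from the source model.

```python
import re

# Assumed reference form: an identifier, "@", then a UUID of word characters.
REF_RE = re.compile(r"\b(?P<name>[A-Za-z_]\w*)@(?P<uuid>\w+)")


def substitute_references(text, lookup):
    """Replace every varname@uuid in text by the literal value from lookup.

    lookup is a callable (name, uuid) -> value that is expected to raise
    if the variable has not been evaluated in the source model.
    """
    def replace(match):
        return repr(lookup(match["name"], match["uuid"]))

    return REF_RE.sub(replace, text)


evaluated = {("a", "f4dcv1"): 3.5}


def lookup(name, uuid):
    try:
        return evaluated[(name, uuid)]
    except KeyError:
        raise LookupError(f"{name}@{uuid} has not been evaluated") from None


print(substitute_references("b = 2 * a@f4dcv1", lookup))  # b = 2 * 3.5
```

A plain regular expression would also match text of this form inside string literals, so a real implementation would probably need to be aware of the grammar.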
In the model processor
The advantage of this method is that the references are kept in the source code and therefore the provenance is preserved. The references varname@uuid are resolved using the UUID as a namespace to define the scope provider. This triggers the creation of the source model. From the source model, the values of varname@uuid are pulled and set. NOTE: the variable in the source model must have been evaluated!
Alternative referencing method
One way is to create a new variable in the local scope that references varname@uuid and shares a launch with varname@uuid in the source model (i.e. the two must have duplicate nodes), and then to use the new local variable. In this case the state of the node of the referenced variable is not important. The difficulty is to make the two nodes duplicates: the parents updating the spec must be the same, as must their parents, and so on.
Alternatives in syntax and implementation
reuse keyword
We can completely omit the keyword reuse: if no namespace is specified using the @ symbol, then the variable must be locally available in the model. If a namespace is specified and the UUID matches that of the model, the reference is interpreted as a local reference; otherwise it is interpreted as a reference to a variable from another model.
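The rule for interpreting a reference without the keyword can be written down in a few lines. The function classify is a hypothetical helper used only to illustrate the decision; it is not part of the existing code.

```python
def classify(reference, model_uuid):
    """Return (kind, name, uuid), where kind is 'local' or 'external'."""
    name, sep, uuid = reference.partition("@")
    if not sep or uuid == model_uuid:
        # No namespace, or the namespace is the model itself: local reference.
        return ("local", name, model_uuid)
    return ("external", name, uuid)


print(classify("a", "1234"))         # ('local', 'a', '1234')
print(classify("a@1234", "1234"))    # ('local', 'a', '1234')
print(classify("a@f4dcv1", "1234"))  # ('external', 'a', 'f4dcv1')
```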
Interpreter location
Steps 0-4 may alternatively be implemented in the processor of the target model. The problem is that the model cannot be parsed because of unresolved references to variables not contained in the model. In this case, we will have to use this or this recipe.
Storing the reuse statements
The reuse statements do not have to be stored in the meta-node because this information is not relevant at a later point in time.
Duplicates detection
After replication (re-generation), the source and the target models will contain duplicates. These have to share launches by using smart duplicates detection.