Sub-models reuse

Motivation

Sometimes it is necessary to combine data from more than one model (workflow) in a new workflow. The data can be exported and then imported into the new workflow, but this takes additional effort, makes the operations more complex, and loses the data provenance. It is much more natural to copy the relevant sub-workflows (i.e. the nodes providing the inputs and all their ancestors) from the source workflows to the target workflow. The copied sub-workflows are duplicates of their originals and do not need to be run again; instead, they can share the launches of their completed originals. This sharing relies on the duplicate detection feature of FireWorks.

User interaction

This behavior can be made available to the user through a language extension, a simple query language, or a graphical user interface. The query should be based mostly on the source code of the models from which the sub-models will be reused. The query output is presented in tabular or graphical form.

The `find_one` and `find_all` statements

These statements are not language extensions but can be viewed as a separate language. They can be implemented graphically, similarly to the execution engine WFEngine. The search result includes the workflow UUIDs of the matches. A selection can be loaded and explored in more detail using the print() and view() statements.

Inter-model referencing method

Let a be a variable in a model instance with UUID f4dcv1. The UUID becomes a namespace for the variable a. This creates a scope provider for the variable in the target model (see scoping across multiple models in textX). A possible syntax is, for example, a@f4dcv1. Upon interpretation, this reference is resolved and the corresponding data is copied. The provenance is preserved via the namespace, which carries the UUID of the source model.
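A minimal sketch of how a reference such as a@f4dcv1 could be split into a variable name and a namespace. The pattern, the `Reference` type and the `parse_reference` function are hypothetical names introduced for illustration. The UUID is assumed to be alphanumeric, as in the example f4dcv1. The actual grammar may differ.

```python
import re
from typing import NamedTuple, Optional

# Assumed syntax: an identifier, optionally followed by '@' and an
# alphanumeric model UUID (as in the example f4dcv1).
REFERENCE = re.compile(r"(?P<name>[A-Za-z_]\w*)(?:@(?P<uuid>[0-9A-Za-z]+))?")


class Reference(NamedTuple):
    name: str
    uuid: Optional[str]  # None means that no namespace was given


def parse_reference(text: str) -> Reference:
    """Split 'name@uuid' into its parts; a bare 'name' has no namespace."""
    match = REFERENCE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid reference: {text!r}")
    return Reference(match.group("name"), match.group("uuid"))
```

Keeping the UUID as a separate field, instead of dropping it after resolution, is what allows the provenance to be traced back to the source model.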

Advantages: 1) less redundancy of copied data; 2) no need for duplicate detection.

Disadvantages: 1) if the source model is deleted, the provenance of the data is lost; 2) changes in the source model do not trigger recomputation, which leads to inconsistencies; 3) the variable in the source model must be in completed state, because only then does it have a value that can be copied.

In-lining method

The syntax can be, e.g., reuse a@f4dcv1. In contrast to the referencing method, this statement is not part of the language. The command copies the relevant sub-model to the target model and can be replaced by a graphical user interface. The variable a obtains local scope and is referenced as usual.

Advantages: 1) the target model is self-contained: if the source is deleted, the provenance is preserved even if the launches are shared; 2) no language extension is necessary and, correspondingly, no interpreter extension is needed; 3) changes in the sub-model do not cause inconsistencies.

Disadvantages: 1) because of the many duplicates, duplicate detection must be used to avoid repeated computation; 2) lack of modularity, because the copied sub-workflow / sub-model cannot be identified in the target model; 3) possible name conflicts upon copying to the target model if part of it is already written. The latter issue can be solved by using a namespace, but this name change has to be taken into account in the special DupeFinder object so that the copy is still detected as a duplicate of the original.

Implementation

Processing the `reuse` statements (in-lining method)

We do not have to replicate the sub-workflow itself; instead, we copy the relevant source code and then regenerate the workflow section. The steps are:

0. Parse the reuse statement. This can be done with a regular expression.
1. Check the grammar compatibility of the source model.
2. Find all parent nodes (ancestors) of the requested node.
3. Collect the source code from these nodes.
4. Extend the target model. In case of name conflicts, the references obtain a namespace taken from the reuse statement. If no name conflict is found, no namespace extension is needed.
5. Parse, check and interpret the target model.

Steps 0-4 must be implemented in the Session class, in a method called before get_model(). Step 5 is performed as usual by the model processors.
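Steps 0, 2, 3 and 4 can be sketched as follows. This is only an illustration under several assumptions: the source models are given as a hypothetical dictionary mapping a UUID to nodes, each node holding its source code and the names of its parents; statements have the form name = expression; and on a name conflict all copied names receive the suffix _<uuid>. None of these names or conventions come from the actual Session class, and step 1 (the grammar compatibility check) is left out.

```python
import re

# Step 0: assumed syntax "reuse <name>@<uuid>" with an alphanumeric UUID.
REUSE = re.compile(r"\s*reuse\s+(?P<name>[A-Za-z_]\w*)@(?P<uuid>[0-9A-Za-z]+)\s*")


def parse_reuse(line):
    """Return (name, uuid) for a reuse statement, or None for any other line."""
    match = REUSE.fullmatch(line)
    return (match.group("name"), match.group("uuid")) if match else None


def ancestors(nodes, name):
    """Step 2: the requested node and all its ancestors, parents first."""
    ordered, seen = [], set()

    def visit(current):
        if current in seen:
            return
        seen.add(current)
        for parent in nodes[current]["parents"]:
            visit(parent)
        ordered.append(current)

    visit(name)
    return ordered


def inline_reuse(target_lines, models, local_names):
    """Steps 3 and 4: replace each reuse statement by the copied source code."""
    result = []
    for line in target_lines:
        parsed = parse_reuse(line)
        if parsed is None:
            result.append(line)  # ordinary statement, kept unchanged
            continue
        name, uuid = parsed
        nodes = models[uuid]
        names = ancestors(nodes, name)
        # A namespace is only added if a copied name clashes with a local one.
        conflict = any(n in local_names for n in names)
        rename = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
        for n in names:
            source = nodes[n]["source"]
            if conflict:
                source = rename.sub(lambda m: f"{m.group(1)}_{uuid}", source)
            result.append(source)
    return result
```

The returned lines form the extended target model, which would then be passed on to step 5 unchanged. Emitting parents before children keeps every copied variable defined before its first use.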

Processing the `reuse` statements (reference method)

In the `Session` class

The reuse statement is caught in the Session class before calling the get_model() method. All occurrences of varname@uuid in the target model are replaced by the corresponding value of the variable in the source model (NOTE: the variable must have been evaluated!). Then the target model input is passed to the model processor. The references to the source in the target model are lost.
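The substitution described above could look roughly like this. The function name and the `values` mapping (UUID to evaluated variables of the source model) are hypothetical; how the Session class actually obtains the values is not specified here. The UUID is assumed to be alphanumeric, as in f4dcv1.

```python
import re

# Assumed reference syntax inside expressions: <name>@<uuid>.
REF = re.compile(r"\b(?P<name>[A-Za-z_]\w*)@(?P<uuid>[0-9A-Za-z]+)")


def substitute_references(source, values):
    """Replace every name@uuid in the target source by its evaluated value.

    `values` is a hypothetical mapping: uuid -> {variable name: value}.
    A variable that has not been evaluated is simply missing from it.
    """
    def replace(match):
        name, uuid = match.group("name"), match.group("uuid")
        try:
            value = values[uuid][name]
        except KeyError:
            raise LookupError(f"{name}@{uuid} has not been evaluated") from None
        return repr(value)  # literal form of the value, e.g. 3 or 'text'

    return REF.sub(replace, source)
```

After this step the target source contains only literals in place of the references, which is exactly why the link to the source model is lost with this variant.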

In the model processor

The advantage of this method is that the references are kept in the source code and therefore the provenance is preserved. The references varname@uuid are resolved using the UUID as a namespace to define the scope provider. This triggers the creation of the source model. From the source model, the values of varname@uuid are pulled and set. NOTE: the variable in the source model must have been evaluated!

Alternative referencing method

One way is to create a new variable in the local scope that references varname@uuid and shares a launch with varname@uuid in the source model (i.e. both must have duplicate nodes), and then to use the new local variable. In this case, the state of the node of the referenced variable is not important. The difficulty is to make the two nodes duplicates: the parents updating the spec must be the same, and so must their parents, and so on.

Alternatives in syntax and implementation

`reuse` keyword

We can completely omit the keyword reuse: if no namespace is specified using the @ symbol, then the variable must be locally available in the model. If a namespace is specified and the UUID matches that of the model, the reference is interpreted as a local reference; otherwise, it is interpreted as a reference to a variable from another model.
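The resolution rule can be written down in a few lines. The function name and the returned labels are hypothetical and only illustrate the decision, not the interpreter.

```python
def classify_reference(reference: str, model_uuid: str) -> str:
    """Decide whether 'name' or 'name@uuid' refers to the current model."""
    name, separator, uuid = reference.partition("@")
    # No '@' at all, or a UUID equal to the current model: local variable.
    if not separator or uuid == model_uuid:
        return "local"
    # Any other UUID points to a variable in another model.
    return "external"
```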

Interpreter location

Steps 0-4 may alternatively be implemented in the processor of the target model. The problem is that the model cannot be parsed because of unresolved references to a variable that is not contained in the model. In this case, we will have to use this or this recipe.

Storing the reuse statements

The reuse statements do not have to be stored in the meta-node because this information is not relevant at a later point in time.

Duplicate detection

After replication (re-generation), the source and the target models will contain duplicates. These have to share launches by using smart duplicate detection.
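One conceivable criterion for such a detection is a recursive fingerprint: two nodes are duplicates if their source code and the fingerprints of all their ancestors agree, after removing a namespace suffix that may have been added during copying. The sketch below is not the FireWorks API and not the DupeFinder object mentioned above; the node dictionaries, the function name and the suffix convention _<uuid> are assumptions made for illustration.

```python
import hashlib


def fingerprint(nodes, name, strip=""):
    """Hash a node's source together with the fingerprints of its parents.

    `nodes` is a hypothetical mapping: name -> {"source": str, "parents": list}.
    `strip` is a namespace suffix removed from the source before hashing, so
    that a renamed copy yields the same fingerprint as its original.
    """
    node = nodes[name]
    digest = hashlib.sha256()
    digest.update(node["source"].replace(strip, "").encode())
    # Sort the parent fingerprints (not the names) so that the result does
    # not depend on parent order or on how the parents were renamed.
    for parent_print in sorted(fingerprint(nodes, p, strip) for p in node["parents"]):
        digest.update(parent_print.encode())
    return digest.hexdigest()
```

A copied node and its original would then be treated as duplicates exactly when their fingerprints are equal, which also covers the requirement that the parents, and their parents in turn, are the same.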

Edited Dec 10, 2023 by Ivan Kondov