High throughput processing
High-throughput processing (or, better, group processing or bulk processing) can be realized in workflow evaluation mode. Using this feature, a program can be run for a series of inputs.
Possible usage scenarios
- Run a program from scratch
- Run an existing program with further inputs
- Extend an existing program
From the user's point of view, the three scenarios do not differ much. The mode is activated as soon as the keyword vary occurs in the code.
Run a program from scratch
The keyword to activate this mode is vary. It indicates which variables will be initialized and with which values:
vary (a: 1, 2, 3), (b: false, true)
In this case a Cartesian product is constructed from the two series, i.e. in total 6 workflows (forming a group) will be generated, one for each combination of a and b values. If the series are placed in a table, then the program is executed once for each row (parameter tuple):
vary ((a: 1, 2, 3), (b: false, true, true))
In this case the program will be executed 3 times (3 workflows will be generated in a group): for (1, false), (2, true) and (3, true).
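To make the two expansion modes concrete, here is a minimal, standalone Python sketch (not project code; expand_product and expand_table are hypothetical helper names) that turns both vary forms into the parameter tuples described above:

```python
from itertools import product

def expand_product(series):
    """Cartesian product form, e.g. vary (a: 1, 2, 3), (b: false, true)."""
    names = [name for name, _ in series]
    return [dict(zip(names, combo))
            for combo in product(*(values for _, values in series))]

def expand_table(series):
    """Table form, e.g. vary ((a: 1, 2, 3), (b: false, true, true)) - one tuple per row."""
    names = [name for name, _ in series]
    return [dict(zip(names, row)) for row in zip(*(values for _, values in series))]

print(expand_product([("a", [1, 2, 3]), ("b", [False, True])]))       # 6 parameter tuples
print(expand_table([("a", [1, 2, 3]), ("b", [False, True, True])]))   # 3 parameter tuples
```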
Only variables can be initialized using the vary keyword. Values in places where only literals are allowed cannot be varied and will stay constant.
The behavior of the print statement is different in this mode. It prints a table with all varied initial values and the corresponding values to print:
print(e, f)
>>> ((e: 1.3, 1.4, 1.5), (f: 0.2, -0.1, 0.0), (a: 1, 2, 3), (b: false, true, true))
Run an existing program with further inputs
The program is loaded from the persistent storage by using a UUID. A new vary statement is issued. The needed workflows are created accordingly.
Extend an existing program
The program is loaded from the persistent storage by using a UUID. The newly added statements are interpreted to extend the workflows correspondingly.
Nice-to-have features
Optionally use a single workflow from a group ignoring the vary statements
This will end up with inconsistent workflows in the group. After running the same program in group mode, the possibly missing nodes in the other workflows of the group will be added. This option is helpful for developing an extension.
Collect all workflows from the persistent storage matching a program
These workflows do not belong to a group and vary their parameters only implicitly. The print and vary statements behave like in the usual high-throughput mode.
This option is helpful to add a common extension to the selected workflows, for example a data analysis, or to add a parameter variation.
Implementation
There are three approaches.
Use a group UUID in the metadata section
An advantage is the fast detection by a single query. Group extensions are possible by setting the group UUID for the new group members. The group UUID is created by default for all new program instances to make them extendable later on.
Use a linked tree based on the workflow UUID
Upstream workflows can be found by an attribute in the metadata section, but downstream workflows cannot be found. Furthermore, the effort is much higher than in the first method because one query per workflow in a group is needed, which limits scaling for large groups.
Code based collector
Given a workflow UUID extract the code and find all workflows with certain traits. This method can also be used to find a program without UUID but by providing the source code.
Traits: full program match statement-wise, variable names match, variable types match, function definitions match, imports match (all matched as sets), dependency graph match with the variables as vertices. The database is screened with these criteria by using database queries, applying the cheapest criteria first.
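As an illustration of the set-based trait comparison, here is a rough sketch only; the trait extraction from a parsed program (statements, variables, functions, imports as iterables of strings) is an assumption, not the existing model API:

```python
def traits(program):
    """Extract comparable traits from a parsed program (hypothetical attributes)."""
    return {
        "imports": frozenset(program.imports),
        "functions": frozenset(program.functions),
        "variables": frozenset(program.variables),
        "statements": frozenset(program.statements),
    }

def same_model(prog_a, prog_b, keys=("imports", "functions", "variables", "statements")):
    """Compare the cheapest traits first and stop at the first mismatch."""
    traits_a, traits_b = traits(prog_a), traits(prog_b)
    return all(traits_a[key] == traits_b[key] for key in keys)
```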
Advantages: possibility to identify matching workflows created without the vary keyword, integrate them into virtual groups and process them together. The vary keyword should also work for such virtual groups.
Disadvantages: high implementation effort, possibly (worst case) high cost for collecting the workflows especially for large databases, group association not persistent.
This feature is not critical for bulk/high-throughput processing. The result of such a search would be the group UUID. Implementing the group UUID and the processing of lists of models (groups) seems feasible as a start, probably in a separate issue.
Implementation details
Group UUID
All instances receive a group UUID to enable high throughput processing mode. The group UUID can be used also in place of the workflow UUID to select the model instance. The workflows with the same group UUID form a group.
Virtual group
This group is constructed by criteria other than the group UUID. It makes sense only if some workflows belonging to the same model (see criteria above) do not have group UUIDs or have different group UUIDs.
Activation of high-throughput processing mode
The mode is activated as soon as there is more than one workflow in the group. In this mode all workflows are treated as one instance of the model.
The mode is also activated if the vary statement is used.
At activation time a consistency check is performed and missing workflows or workflow nodes are added as needed.
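A minimal sketch of the consistency check at activation time (assuming each workflow can report its node labels; get_node_labels and add_node are hypothetical helpers, and the real node construction is project-specific):

```python
def make_group_consistent(workflows):
    """Ensure every workflow in the group contains the union of all node labels."""
    all_labels = set()
    for workflow in workflows:
        all_labels |= set(workflow.get_node_labels())
    for workflow in workflows:
        missing = all_labels - set(workflow.get_node_labels())
        for label in sorted(missing):
            workflow.add_node(label)  # placeholder for the real node creation
```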
vary statement
Grammar extension with the VarySeriesParameter rule:
VarySeriesParameter:
Number | Boolean | String | Table | Series | ... | Null
;
VarySeries:
'('
ref = [Variable:TrueID] ':'
elements += VarySeriesParameter[',']
')'
;
Vary:
'vary' '('? values += VarySeries[','] ')'?
;
Alternatively (simpler, but a problem might be that variable references, function calls, etc. would be allowed):
Vary:
'vary' ( ( series += Series ) | ( table = Table ) )
;
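The rule pair can be prototyped standalone with textX before integrating it into the full grammar. The following is a reduced, self-contained sketch (simplified rules and textX base types, not the project grammar):

```python
from textx import metamodel_from_str

GRAMMAR = r"""
Model: statements*=Vary;
Vary: 'vary' '('? values+=VarySeries[','] ')'?;
VarySeries: '(' name=ID ':' elements+=VarySeriesParameter[','] ')';
VarySeriesParameter: NUMBER | BOOL | STRING;
"""

mm = metamodel_from_str(GRAMMAR)
model = mm.model_from_str("vary ((a: 1, 2, 3), (b: true, false))")
for series in model.statements[0].values:
    print(series.name, series.elements)  # e.g. a [1, 2, 3] and b [True, False]
```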
Interpreter of vary statements
The vary statement should be processed by the Session class (or similar). Processing it within a model is not recommended because the scope of the model properties (methods) is limited to the same workflow. These can, in principle, be extended to reference objects in other models, but 1) a namespace has to be introduced and the processors must be re-implemented to make them aware of the other models in the group; 2) all textX models must be created in memory; 3) the references between the models cannot be mapped in FireWorks as references between workflows.
The vary statement does not have to be passed to the model processors. It can be stored in the meta node directly by the Session class. The model processors cannot interpret the vary statement because they are designed to work with one model at a time. The invariant part of the workflow has to be copied into every model in the group for each parameter tuple. The varied parts are created with the input from the statement. Thus, the Session class instantiates a list of textX models.
If a vary statement is updated by another vary statement, the Session class will create a Cartesian product of the values specified in the new vary and create models only for the new parameter combinations (tuples).
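A minimal sketch of that update step (plain Python with a hypothetical helper; the real bookkeeping lives in the Session class):

```python
from itertools import product

def new_tuples(existing, new_values):
    """Combine existing parameter tuples with newly varied variables and
    return only the combinations that do not exist yet.
    existing: list of dicts, e.g. [{'a': 1}, {'a': 2}]
    new_values: dict of new variable -> values, e.g. {'b': [False, True]}"""
    names = list(new_values)
    combos = []
    for base in existing:
        for values in product(*(new_values[name] for name in names)):
            candidate = {**base, **dict(zip(names, values))}
            if candidate not in existing:
                combos.append(candidate)
    return combos

print(new_tuples([{'a': 1}, {'a': 2}], {'b': [False, True]}))
# [{'a': 1, 'b': False}, {'a': 1, 'b': True}, {'a': 2, 'b': False}, {'a': 2, 'b': True}]
```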
Storage of the group UUID
Storing the group UUID is essential for the method. The group UUID can be written into the metadata section of the workflow. For this, it has to be passed from the Session class to the model processors. In this case, the group membership of a model is fixed at instantiation time (static). Alternatively, the group UUID can be stored in the meta node by the Session class. In this case, a model can be a member of more than one group and the group membership can be changed or removed.
Storage of the vary statement
Storing information from the vary statement is not strictly necessary and is redundant, because the varied parameters and their values can be derived from the models in the group defined by a group UUID. Nevertheless, this information can be provided as metadata and used for 1) faster processing (only one query for the meta node, the metadata section or some attribute, rather than complex queries traversing all variables and extracting the variations); 2) consistency checks. One consistency check is to verify that the sets of variables from the vary info saved in the meta nodes are identical for all workflows in the group. Further checks can look for matches between variable names and values.
There are several different options for storing this information.
- The parameter tuple can be saved, such as ('a', 'b', 'c'), under a key like _varied. A good place to save it is the metadata section, but there it cannot be updated in case the variation is extended by further variables; therefore a better place is the meta node.
- More information can be provided by specifying the values, e.g. {'a': 1.0, 'b': true, 'c': 0}, using the same place for storage as in 1.
- The original vary statement can be stored as a string in the meta node (similar to import and function definition statements), but it has to be re-parsed and it contains information not relevant to the current workflow.
- Finally, workflow nodes of varied variables can be tagged with a boolean attribute to denote that they belong to a variation.
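For illustration, a hypothetical meta-node payload combining the first two options could look like this (the key names, such as _varied, are only placeholders):

```python
meta_node_vary_info = {
    "group_uuid": "3f2b6c1e-0000-0000-0000-example",  # hypothetical group UUID
    "_varied": ("a", "b", "c"),                       # names of the varied variables
    "values": {"a": 1.0, "b": True, "c": 0},          # values used in this workflow
}
```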
Processing the vary statement
A prerequisite is that the get_model() method in the Session class works with lists of models rather than a single model. This is a trivial extension that has to be carried out as a first step. A method of the Session class catches the vary statement from the textual model before calling get_model(). The statement must be parsed and removed from the input string of the textual model. It can be parsed with a regular expression (faster and less code to implement, but less grammar flexibility). Alternatively, we can define a textX grammar rule and a processor (more flexibility, more structure, less speed). The processor (interpreter) of the vary statement then replicates the textual model as many times as necessary and adds variable statements corresponding to the parameter tuples, e.g. {'a': 1.0, 'b': true, 'c': 0} for the first model, {'a': -1.0, 'b': false, 'c': 1} for the second model, etc. The list of model inputs is then passed to the get_model() method and the information (metadata) from the vary statement is stored (see above).
The method should call one function to parse, one function to interpret, and a third function to store metadata.
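A rough sketch of the first two steps (parsing and interpreting) for the regular-expression variant; the function names are hypothetical, value parsing is deliberately simplified, and the metadata-storage step is omitted:

```python
import re
from itertools import product

VARY_RE = re.compile(r"^\s*vary\s*\((.*)\)\s*$", re.MULTILINE)
SERIES_RE = re.compile(r"\(\s*(\w+)\s*:\s*([^)]*)\)")

def parse_vary(source):
    """Return (source without the vary statement, {name: [values as strings]})."""
    match = VARY_RE.search(source)
    if match is None:
        return source, {}
    series = {name: [value.strip() for value in values.split(",")]
              for name, values in SERIES_RE.findall(match.group(1))}
    return VARY_RE.sub("", source, count=1), series

def interpret_vary(source, series, table_mode=True):
    """Replicate the invariant source and prepend one assignment per varied variable."""
    rows = zip(*series.values()) if table_mode else product(*series.values())
    models = []
    for row in rows:
        assignments = "\n".join(f"{name} = {value}" for name, value in zip(series, row))
        models.append(assignments + "\n" + source)
    return models

invariant, series = parse_vary("vary ((a: 1, 2, 3), (b: false, true, true))\nprint(a, b)")
for text in interpret_vary(invariant, series):
    print(text)  # three textual models, each starting with its a and b assignments
```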
Scenarios
Run a program from scratch
Example 1. Cartesian product:
vary ((a: 1, 2, 3))
vary ((b: false, true))
group ID 1

| model ID 1 | model ID 2 | model ID 3 | model ID 4 | model ID 5 | model ID 6 |
| --- | --- | --- | --- | --- | --- |
| a = 1 | a = 2 | a = 3 | a = 1 | a = 2 | a = 3 |
| b = false | b = true | b = false | b = true | b = false | b = true |
Example 2. Direct product:
vary ((a: 1, 2, 3), (b: false, true, true))
group ID 2

| model ID 1 | model ID 2 | model ID 3 |
| --- | --- | --- |
| a = 1 | a = 2 | a = 3 |
| b = false | b = true | b = true |
Add vary statement to an existing program
- With no vary statement:
  - load the model as usual
  - process the vary statement
  - update the metadata in the meta node
  - check if the parameter variables have been initialized and only create the missing parameter tuples
  - replicate the source code for each new parameter tuple and instantiate the models
  This scenario is basically the same as running a program from scratch.
- With vary statement:
  - load the model as usual
  - process the vary statement
  - update the metadata in the meta node
  - generate the newly added parameter tuples
  - replicate the code for each new parameter tuple and instantiate the additional models
Finally, cases 1 and 2 should be implemented as a single case. More important is which variables occur in the vary statement: new or old ones.
- For only new variables, the models' source codes have to be downloaded and the number of models is multiplied by the number of new value combinations. Example: 2 persistent models in the group and 2 new variables with 2 values each give 2 * 2 * 2 = 8 models, i.e. 6 new models. The two persistent models 1 and 2 are rolled out for use by models 3, 4 and models 5, 6, respectively. Then the vary update (consisting of 8 lines) is added to the respective models. The ordering of the tuples is important!
Maybe the optimal way is to create a dataframe with the old vary variables and the corresponding source UUIDs of the persistent models. Then, after the join, the updated vary dataframe will look like this (a pandas sketch of this join is given below, after the list):
uuid a b c
0 0 1 False 1
1 0 1 False 2
2 0 1 True 1
3 0 1 True 2
4 1 2 False 1
5 1 2 False 2
6 1 2 True 1
7 1 2 True 2
8 2 3 False 1
9 2 3 False 2
10 2 3 True 1
11 2 3 True 2
Here, a is the only old vary variable, b and c are the new vary variables. The self.models should just be resized from length 3 to length 12. Now, we iterate over the above dataframe's rows and update it like this for self.uuids and strns:
uuid a b c self.uuids strns varies
0 0 1 False 1 0 b = False; c = 1 df(a=1, b=False, c=1)
1 0 1 False 2 None source 0; b = False; c = 2 df(a=1, b=False, c=2)
2 0 1 True 1 None source 0; b = True; c = 1 df(a=1, b=True, c=1)
3 0 1 True 2 None source 0; b = True; c = 2 df(a=1, b=True, c=2)
4 1 2 False 1 1 b = False; c = 1 df(a=2, b=False, c=1)
5 1 2 False 2 None source 1; b = False; c = 2 df(a=2, b=False, c=2)
6 1 2 True 1 None source 1; b = True; c = 1 df(a=2, b=True, c=1)
7 1 2 True 2 None source 1; b = True; c = 2 df(a=2, b=True, c=2)
8 2 3 False 1 2 b = False; c = 1 df(a=3, b=False, c=1)
9 2 3 False 2 None source 2; b = False; c = 2 df(a=3, b=False, c=2)
10 2 3 True 1 None source 2; b = True; c = 1 df(a=3, b=True, c=1)
11 2 3 True 2 None source 2; b = True; c = 2 df(a=3, b=True, c=2)
For the varies update, just select each row for the columns a, b, c and append it as a dataframe to varies.
- For only old variables (just new values): one persistent model is downloaded as source and all vary variables are removed. Then the new models are created by adding the updated vary tuples.
Original vary:
uuid a
0 0 1
1 1 2
2 2 3
source x = source(uuid=0)\{a = 1}
Update:
a self.uuids strns varies
0 4 None source x; a = 4 df(a=4)
1 5 None source x; a = 5 df(a=5)
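The join for the only-new-variables case described above can be sketched with pandas (a minimal illustration; the column and attribute names mirror the tables above, and the string construction for strns is only indicative):

```python
from itertools import product
import pandas as pd

# Old vary table: one row per persistent model (uuid is the source model index here).
old = pd.DataFrame({"uuid": [0, 1, 2], "a": [1, 2, 3]})

# New vary variables and their values, one row per (b, c) combination.
new = pd.DataFrame(list(product([False, True], [1, 2])), columns=["b", "c"])

# Cross join: every old row is combined with every new (b, c) pair -> 12 rows.
updated = old.merge(new, how="cross").reset_index(drop=True)
print(updated)

# Per-row update strings for the replicated models, as in the strns column above:
# the first row of each uuid block reuses the persistent model, the rest reference it.
strns = [
    (f"b = {row.b}; c = {row.c}" if i % len(new) == 0
     else f"source {row.uuid}; b = {row.b}; c = {row.c}")
    for i, row in enumerate(updated.itertuples())
]
```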
Extend an existing program
Any processing of a program with a vary statement means that the source code is added to all instances with the same group UUID. This means that the session processes all instances as a list for a single input. In interactive session mode, the %uuid magic returns the list of UUIDs of all instances with the same group UUID. After selecting the UUID of an instance containing a vary statement, all instances with the same group UUID are processed as a list.
Behavior of the print statement
Suppose f is a variable depending on a and b.
In: print(f)
Out: ((a: 1, 2, 3), (b: false, true, true), (f: val1, val2, val3))
Empty vary statement
If the vary statement has no table, i.e. it only includes the vary keyword, then the current vary information will be printed. It is similar to the print statement but without a target variable. For the example above, this would look like:
In: vary
Out: ((a: 1, 2, 3), (b: false, true, true))
Bulk-mode (group-mode) operations
The print statement is only one example of a group operation. Such operations combine data from several models of the group, for example finding the sum of all a's in a group of models. Such processing can be realized within one single model by using the map function and defining a as a series instead of a scalar parameter. But then the size of the variation is fixed and cannot be extended later. By using a group of models, the set of parameters can be extended many times, at any later point.
Group operations, similar to vary and print, have to be implemented in the Session class, outside the scope of the model processors of the individual models. Because print is not a persistent statement and vary persists partitioned across the individual models, there is no issue with persistence and referencing. Other operations, like the sum in the example above, need persistence and references to their inputs.
One possible solution is to use sub-models (see issue #144 (closed)). This method involves replicating all referenced models from the group into one new workflow and adding the operation as a child node.
a_var = (a: a@0, a@1, a@2)
s = sum(a_var)
Here, the UUIDs of the individual models used are specified after the @ symbol. This can be wrapped in a shortcut, such as s = sum_var(a), which will be interpreted by the Session class. Another way is to provide a generic function, such as collect, e.g. a_var = collect(a); s = sum(a_var).
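A sketch of how the Session class could rewrite such a shortcut into the explicit form (the collect shortcut and the a@uuid syntax are taken from the examples above; the rewrite itself is only illustrative):

```python
import re

def expand_collect(statement, variable_uuids):
    """Rewrite 'x = collect(a)' into 'x = (a: a@<uuid0>, a@<uuid1>, ...)'.
    variable_uuids maps the collected variable to the model UUIDs in the group."""
    match = re.match(r"\s*(\w+)\s*=\s*collect\(\s*(\w+)\s*\)\s*$", statement)
    if match is None:
        return statement
    target, var = match.groups()
    refs = ", ".join(f"{var}@{uuid}" for uuid in variable_uuids[var])
    return f"{target} = ({var}: {refs})"

print(expand_collect("a_var = collect(a)", {"a": [0, 1, 2]}))
# a_var = (a: a@0, a@1, a@2)
```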
Duplicate detection
To avoid unnecessary computation and save storage space, a special DupeFinder class should be written. Duplicates normally occur in variations but can occasionally also occur in other cases.
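A minimal sketch, assuming the FireWorks duplicate-checking mechanism with a DupeFinderBase subclass (verify compares two Firework specs, query builds the database pre-filter); which spec fields identify a duplicate in this project is an open design question, and the fields used below are placeholders:

```python
from fireworks.features.dupefinder import DupeFinderBase

class DupeFinderModelNode(DupeFinderBase):
    """Treat two workflow nodes as duplicates if their statement and inputs match."""
    _fw_name = "DupeFinderModelNode"

    def verify(self, spec1, spec2):
        # Placeholder criterion: identical statement string and identical input values.
        keys = ("statement", "inputs")
        return all(spec1.get(key) == spec2.get(key) for key in keys)

    def query(self, spec):
        # Pre-filter candidate fireworks in the database by the statement string.
        return {"spec.statement": spec.get("statement")}
```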