High throughput processing
High-throughput processing (or, better, group processing or bulk processing) can be realized in workflow evaluation mode. Using this feature, a program can be run for a series of inputs.
Possible usage scenarios
- Run a program from scratch
- Run an existing program with further inputs
- Extend an existing program
From the user's point of view, the three scenarios do not differ much. The mode is activated as soon as the keyword vary occurs in the code.
Run a program from scratch
The keyword to activate this mode is vary. It indicates which variables will be initialized and with which values:
vary (a: 1, 2, 3), (b: false, true)
In this case a Cartesian product is constructed from the two series, i.e. in total 6 workflows (forming a group) will be generated, one for each combination of a and b values. If the series are placed in a table, then the program is executed once for each row (parameter tuple):
vary ((a: 1, 2, 3), (b: false, true, true))
In this case the program will be executed 3 times (3 workflows will be generated in a group): for (1, false), (2, true) and (3, true).
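To make the two expansion modes concrete, here is a minimal, standalone Python sketch (not project code; expand_product and expand_table are hypothetical helper names) that turns both vary forms into the parameter tuples described above:

```python
from itertools import product

def expand_product(series):
    """Cartesian product form, e.g. vary (a: 1, 2, 3), (b: false, true)."""
    names = [name for name, _ in series]
    return [dict(zip(names, combo))
            for combo in product(*(values for _, values in series))]

def expand_table(series):
    """Table form, e.g. vary ((a: 1, 2, 3), (b: false, true, true)) - one tuple per row."""
    names = [name for name, _ in series]
    return [dict(zip(names, row)) for row in zip(*(values for _, values in series))]

print(expand_product([("a", [1, 2, 3]), ("b", [False, True])]))       # 6 parameter tuples
print(expand_table([("a", [1, 2, 3]), ("b", [False, True, True])]))   # 3 parameter tuples
```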
Only variables can be initialized using the vary keyword. Values in places where only literals are allowed cannot be varied and will stay constant.
The behavior of the print statement is different in this mode. It prints a table with all varied initial values and the corresponding values to print:
print(e, f)
>>> ((e: 1.3, 1.4, 1.5), (f: 0.2, -0.1, 0.0), (a: 1, 2, 3), (b: false, true, true))
Run an existing program with further inputs
The program is loaded from the persistent storage by using a UUID. A new vary statement is issued. The needed workflows are created accordingly.
Extend an existing program
The program is loaded from the persistent storage by using a UUID. The newly added statements are interpreted to extend the workflows correspondingly.
Nice-to-have features
Optionally use a single workflow from a group ignoring the vary statements
This will end up with inconsistent workflows in the group. After running the same program in group mode, the possibly missing nodes in the other workflows of the group will be added. This option is helpful for developing an extension.
Collect all workflows from the persistent storage matching a program
These workflows do not belong to a group and vary their parameters only implicitly. The print and vary statements behave like in the usual high-throughput mode.
This option is helpful to add a common extension to the selected workflows, for example a data analysis, or to add a parameter variation.
Implementation
There are three approaches.
Use a group UUID in the metadata section
An advantage is the fast detection by a single query. Group extensions are possible by setting the group UUID for the new group members. The group UUID is created by default for all new program instances to make them extendable later on.
Use a linked tree based on the workflow UUID
Upstream workflows can be found by an attribute in the metadata section, but downstream workflows cannot be found. Furthermore, the effort is much higher than in the first method because one query per workflow in a group is needed, which limits scaling for large groups.
Code based collector
Given a workflow UUID extract the code and find all workflows with certain traits. This method can also be used to find a program without UUID but by providing the source code.
Traits: full program match statement-wise, variable names match, variable types match, function definitions match, imports match (all matched as sets), dependency graph match with the variables as vertices. The database is screened with these criteria by using database queries, applying the cheapest criteria first.
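As an illustration of the set-based trait comparison, here is a rough sketch only; the trait extraction from a parsed program (statements, variables, functions, imports as iterables of strings) is an assumption, not the existing model API:

```python
def traits(program):
    """Extract comparable traits from a parsed program (hypothetical attributes)."""
    return {
        "imports": frozenset(program.imports),
        "functions": frozenset(program.functions),
        "variables": frozenset(program.variables),
        "statements": frozenset(program.statements),
    }

def same_model(prog_a, prog_b, keys=("imports", "functions", "variables", "statements")):
    """Compare the cheapest traits first and stop at the first mismatch."""
    traits_a, traits_b = traits(prog_a), traits(prog_b)
    return all(traits_a[key] == traits_b[key] for key in keys)
```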
Advantages: possibility to identify matching workflows created without the vary keyword, integrate them into virtual groups and process them together. The vary keyword should also work for such virtual groups.
Disadvantages: high implementation effort, possibly (worst case) high cost for collecting the workflows especially for large databases, group association not persistent.
This feature is not critical for bulk/high-throughput processing. The result of such a search would be the group UUID. Implementing the group UUID and the processing of lists of models (groups) seems feasible as a start, probably in a separate issue.
Implementation details
Group UUID
All instances receive a group UUID to enable high throughput processing mode. The group UUID can be used also in place of the workflow UUID to select the model instance. The workflows with the same group UUID form a group.
Virtual group
This group is constructed by criteria other than the group UUID. It makes sense only if some workflows belonging to the same model (see criteria above) do not have group UUIDs or have different group UUIDs.
Activation of high-throughput processing mode
The mode is activated as soon as there is more than one workflow in the group. In this mode all workflows are treated as one instance of the model.
The mode is also activated if the vary statement is used.
At activation time a consistency check is performed and missing workflows or workflow nodes are added as needed.
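A minimal sketch of the consistency check at activation time (assuming each workflow can report its node labels; get_node_labels and add_node are hypothetical helpers, and the real node construction is project-specific):

```python
def make_group_consistent(workflows):
    """Ensure every workflow in the group contains the union of all node labels."""
    all_labels = set()
    for workflow in workflows:
        all_labels |= set(workflow.get_node_labels())
    for workflow in workflows:
        missing = all_labels - set(workflow.get_node_labels())
        for label in sorted(missing):
            workflow.add_node(label)  # placeholder for the real node creation
```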
vary statement
Grammar extension with the VarySeriesParameter rule:
VarySeriesParameter:
Number | Boolean | String | Table | Series | ... | Null
;
VarySeries:
'('
ref = [Variable:TrueID] ':'
elements += VarySeriesParameter[',']
')'
;
Vary:
'vary' '('? values += VarySeries[','] ')'?
;
Alternatively (simpler, but a problem might be that variable references, function calls, etc. would be allowed):
Vary:
'vary' ( ( series += Series ) | ( table = Table ) )
;
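The rule pair can be prototyped standalone with textX before integrating it into the full grammar. The following is a reduced, self-contained sketch (simplified rules and textX base types, not the project grammar):

```python
from textx import metamodel_from_str

GRAMMAR = r"""
Model: statements*=Vary;
Vary: 'vary' '('? values+=VarySeries[','] ')'?;
VarySeries: '(' name=ID ':' elements+=VarySeriesParameter[','] ')';
VarySeriesParameter: NUMBER | BOOL | STRING;
"""

mm = metamodel_from_str(GRAMMAR)
model = mm.model_from_str("vary ((a: 1, 2, 3), (b: true, false))")
for series in model.statements[0].values:
    print(series.name, series.elements)  # e.g. a [1, 2, 3] and b [True, False]
```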
Interpreter of vary statements
The vary statement should be processed by the Session class (or similar). Processing it within a model is not recommended because the scope of the model properties (methods) is limited to the same workflow. These can, in principle, be extended to reference objects in other models, but 1) a namespace has to be introduced and the processors must be re-implemented to make them aware of the other models in the group; 2) all textX models must be created in memory; 3) the references between the models cannot be mapped in FireWorks as references between workflows.
The vary statement does not have to be passed to the model processors. It can be stored in the meta node directly by the Session class. The model processors cannot interpret the vary statement because they are designed to work with one model at a time. The invariant part of the workflow has to be copied into every model in the group for each parameter tuple. The varied parts are created with the input from the statement. Thus, the Session class instantiates a list of textX models.
If a vary statement is updated by another vary statement, the Session class will create a Cartesian product of the values specified in the new vary and create models only for the new parameter combinations (tuples).
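A minimal sketch of that update step (plain Python with a hypothetical helper; the real bookkeeping lives in the Session class):

```python
from itertools import product

def new_tuples(existing, new_values):
    """Combine existing parameter tuples with newly varied variables and
    return only the combinations that do not exist yet.
    existing: list of dicts, e.g. [{'a': 1}, {'a': 2}]
    new_values: dict of new variable -> values, e.g. {'b': [False, True]}"""
    names = list(new_values)
    combos = []
    for base in existing:
        for values in product(*(new_values[name] for name in names)):
            candidate = {**base, **dict(zip(names, values))}
            if candidate not in existing:
                combos.append(candidate)
    return combos

print(new_tuples([{'a': 1}, {'a': 2}], {'b': [False, True]}))
# [{'a': 1, 'b': False}, {'a': 1, 'b': True}, {'a': 2, 'b': False}, {'a': 2, 'b': True}]
```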
Storage of the group UUID
Storing the group UUID is essential for the method. The group UUID can be written into the metadata section of the workflow. For this, it has to be passed from the Session class to the model processors. In this case, the group membership of a model is fixed at instantiation time (static). Alternatively, the group UUID can be stored in the meta node by the Session class. In this case, a model can be a member of more than one group and the group membership can be changed or removed.
Storage of the vary statement
Storing information from the vary statement is not strictly necessary and is redundant, because the varied parameters and their values can be derived from the models in the group defined by a group UUID. Nevertheless, this information can be provided as metadata and used for 1) faster processing (only one query for the meta node, the metadata section or some attribute, rather than complex queries traversing all variables and extracting the variations); 2) consistency checks. One consistency check is to verify that the sets of variables from the vary info saved in the meta nodes are identical for all workflows in the group. Further checks can look for matches between variable names and values.
There are several different options for storing this information.
- The parameter tuple can be saved, such as ('a', 'b', 'c'), under a key like _varied. A good place to save it is the metadata section, but there it cannot be updated in case the variation is extended by further variables; therefore a better place is the meta node.
- More information can be provided by specifying the values, e.g. {'a': 1.0, 'b': true, 'c': 0}, using the same place for storage as in 1.
- The original vary statement can be stored as a string in the meta node (similar to import and function definition statements), but it has to be re-parsed and it contains information not relevant to the current workflow.
- Finally, workflow nodes of varied variables can be tagged with a boolean attribute to denote that they belong to a variation.
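For illustration, a hypothetical meta-node payload combining the first two options could look like this (the key names, such as _varied, are only placeholders):

```python
meta_node_vary_info = {
    "group_uuid": "3f2b6c1e-0000-0000-0000-example",  # hypothetical group UUID
    "_varied": ("a", "b", "c"),                       # names of the varied variables
    "values": {"a": 1.0, "b": True, "c": 0},          # values used in this workflow
}
```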
Processing the vary statement
A prerequisite is that the get_model() method in the Session class works with lists of models rather than a single model. This is a trivial extension that has to be carried out as a first step. A method of the Session class catches the vary statement from the textual model before calling get_model(). The statement must be parsed and removed from the input string of the textual model. It can be parsed with a regular expression (faster and less code to implement, but less grammar flexibility). Alternatively, we can define a textX grammar rule and a processor (more flexibility, more structure, less speed). The processor (interpreter) of the vary statement then replicates the textual model as many times as necessary and adds variable statements corresponding to the parameter tuples, e.g. {'a': 1.0, 'b': true, 'c': 0} for the first model, {'a': -1.0, 'b': false, 'c': 1} for the second model, etc. The list of model inputs is then passed to the get_model() method and the information (metadata) from the vary statement is stored (see above).
The method should call one function to parse, one function to interpret, and a third function to store metadata.
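A rough sketch of the first two steps (parsing and interpreting) for the regular-expression variant; the function names are hypothetical, value parsing is deliberately simplified, and the metadata-storage step is omitted:

```python
import re
from itertools import product

VARY_RE = re.compile(r"^\s*vary\s*\((.*)\)\s*$", re.MULTILINE)
SERIES_RE = re.compile(r"\(\s*(\w+)\s*:\s*([^)]*)\)")

def parse_vary(source):
    """Return (source without the vary statement, {name: [values as strings]})."""
    match = VARY_RE.search(source)
    if match is None:
        return source, {}
    series = {name: [value.strip() for value in values.split(",")]
              for name, values in SERIES_RE.findall(match.group(1))}
    return VARY_RE.sub("", source, count=1), series

def interpret_vary(source, series, table_mode=True):
    """Replicate the invariant source and prepend one assignment per varied variable."""
    rows = zip(*series.values()) if table_mode else product(*series.values())
    models = []
    for row in rows:
        assignments = "\n".join(f"{name} = {value}" for name, value in zip(series, row))
        models.append(assignments + "\n" + source)
    return models

invariant, series = parse_vary("vary ((a: 1, 2, 3), (b: false, true, true))\nprint(a, b)")
for text in interpret_vary(invariant, series):
    print(text)  # three textual models, each starting with its a and b assignments
```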
Scenarios
Run a program from scratch
Example 1. Cartesian product:
vary ((a: 1, 2, 3))
vary ((b: false, true))
group ID 1

| model ID 1 | model ID 2 | model ID 3 | model ID 4 | model ID 5 | model ID 6 |
| --- | --- | --- | --- | --- | --- |
| a = 1 | a = 2 | a = 3 | a = 1 | a = 2 | a = 3 |
| b = false | b = true | b = false | b = true | b = false | b = true |
Example 2. Direct product:
vary ((a: 1, 2, 3), (b: false, true, true))
group ID 2

| model ID 1 | model ID 2 | model ID 3 |
| --- | --- | --- |
| a = 1 | a = 2 | a = 3 |
| b = false | b = true | b = true |
Add vary statement to an existing program
- With no vary statement:
  - load the model as usual
  - process the vary statement
  - update the metadata in the meta node
  - check if the parameter variables have been initialized and only create the missing parameter tuples
  - replicate the source code for each new parameter tuple and instantiate the models
  This scenario is basically the same as running a program from scratch.
- With vary statement:
  - load the model as usual
  - process the vary statement
  - update the metadata in the meta node
  - generate the newly added parameter tuples
  - replicate the code for each new parameter tuple and instantiate the additional models
Finally, cases 1 and 2 should be implemented as a single case. More important is which variables occur in the vary statement: new or old ones.
- For only new variables, the models' source codes have to be downloaded and the number of models is multiplied by the number of new value combinations. Example: 2 persistent models in the group and 2 new variables with 2 values each give 2 * 2 * 2 = 8 models, i.e. 6 new models. The two persistent models 1 and 2 are rolled out for use by models 3, 4 and models 5, 6, respectively. Then the vary update (consisting of 8 lines) is added to the respective models. The ordering of the tuples is important!
Maybe the optimal way is to create a dataframe with the old vary variables and the corresponding source UUIDs of the persistent models. Then, after the join, the updated vary dataframe will look like this (a pandas sketch of this join is given below, after the list):
uuid a b c
0 0 1 False 1
1 0 1 False 2
2 0 1 True 1
3 0 1 True 2
4 1 2 False 1
5 1 2 False 2
6 1 2 True 1
7 1 2 True 2
8 2 3 False 1
9 2 3 False 2
10 2 3 True 1
11 2 3 True 2
Here, a is the only old vary variable, b and c are the new vary variables. The self.models should just be resized from length 3 to length 12. Now, we iterate over the above dataframe's rows and update it like this for self.uuids and strns:
uuid a b c self.uuids strns varies
0 0 1 False 1 0 b = False; c = 1 df(a=1, b=False, c=1)
1 0 1 False 2 None source 0; b = False; c = 2 df(a=1, b=False, c=2)
2 0 1 True 1 None source 0; b = True; c = 1 df(a=1, b=True, c=1)
3 0 1 True 2 None source 0; b = True; c = 2 df(a=1, b=True, c=2)
4 1 2 False 1 1 b = False; c = 1 df(a=2, b=False, c=1)
5 1 2 False 2 None source 1; b = False; c = 2 df(a=2, b=False, c=2)
6 1 2 True 1 None source 1; b = True; c = 1 df(a=2, b=True, c=1)
7 1 2 True 2 None source 1; b = True; c = 2 df(a=2, b=True, c=2)
8 2 3 False 1 2 b = False; c = 1 df(a=3, b=False, c=1)
9 2 3 False 2 None source 2; b = False; c = 2 df(a=3, b=False, c=2)
10 2 3 True 1 None source 2; b = True; c = 1 df(a=3, b=True, c=1)
11 2 3 True 2 None source 2; b = True; c = 2 df(a=3, b=True, c=2)
For the varies update, just select each row for the columns a, b, c and append it as a dataframe to varies.
- For only old variables (just new values): one persistent model is downloaded as source and all vary variables are removed. Then the new models are created by adding the updated vary tuples.
Original vary:
uuid a
0 0 1
1 1 2
2 2 3
source x = source(uuid=0)\{a = 1}
Update:
a self.uuids strns varies
0 4 None source x; a = 4 df(a=4)
1 5 None source x; a = 5 df(a=5)
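The join for the only-new-variables case described above can be sketched with pandas (a minimal illustration; the column and attribute names mirror the tables above, and the string construction for strns is only indicative):

```python
from itertools import product
import pandas as pd

# Old vary table: one row per persistent model (uuid is the source model index here).
old = pd.DataFrame({"uuid": [0, 1, 2], "a": [1, 2, 3]})

# New vary variables and their values, one row per (b, c) combination.
new = pd.DataFrame(list(product([False, True], [1, 2])), columns=["b", "c"])

# Cross join: every old row is combined with every new (b, c) pair -> 12 rows.
updated = old.merge(new, how="cross").reset_index(drop=True)
print(updated)

# Per-row update strings for the replicated models, as in the strns column above:
# the first row of each uuid block reuses the persistent model, the rest reference it.
strns = [
    (f"b = {row.b}; c = {row.c}" if i % len(new) == 0
     else f"source {row.uuid}; b = {row.b}; c = {row.c}")
    for i, row in enumerate(updated.itertuples())
]
```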
Extend an existing program
Any processing of a program with a vary statement means that the source code is added to all instances with the same group UUID. This means that the session processes all instances as a list for a single input. In interactive session mode, the %uuid magic returns the list of UUIDs of all instances with the same group UUID. After selecting the UUID of an instance containing a vary statement, all instances with the same group UUID are processed as a list.
Behavior of the print statement
Suppose f is a variable depending on a and b.
In: print(f)
Out: ((a: 1, 2, 3), (b: false, true, true), (f: val1, val2, val3))
Empty vary statement
If the vary statement has no table, i.e. it only includes the vary keyword, then the current vary information will be printed. It is similar to the print statement but without a target variable. For the example above, this would look like:
In: vary
Out: ((a: 1, 2, 3), (b: false, true, true))
Bulk-mode (group-mode) operations
The print statement is only one example of a group operation. Such operations combine data from several models of the group, for example finding the sum of all a's in a group of models. Such processing can be realized within one single model by using the map function and defining a as a series instead of a scalar parameter. But then the size of the variation is fixed and cannot be extended later. By using a group of models, the set of parameters can be extended many times, at any later point.
Group operations, similar to vary and print, have to be implemented in the Session class, outside the scope of the model processors of the individual models. Because print is not a persistent statement and vary persists partitioned across the individual models, there is no issue with persistence and referencing. Other operations, like the sum in the example above, need persistence and references to their inputs.
One possible solution is to use sub-models (see issue #144 (closed)). This method involves replicating all referenced models from the group into one new workflow and adding the operation as a child node.
a_var = (a: a@0, a@1, a@2)
s = sum(a_var)
Here, the UUIDs of the individual models used are specified after the @ symbol. This can be wrapped in a shortcut, such as s = sum_var(a), which will be interpreted by the Session class. Another way is to provide a generic function, such as collect, e.g. a_var = collect(a); s = sum(a_var).
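A sketch of how the Session class could rewrite such a shortcut into the explicit form (the collect shortcut and the a@uuid syntax are taken from the examples above; the rewrite itself is only illustrative):

```python
import re

def expand_collect(statement, variable_uuids):
    """Rewrite 'x = collect(a)' into 'x = (a: a@<uuid0>, a@<uuid1>, ...)'.
    variable_uuids maps the collected variable to the model UUIDs in the group."""
    match = re.match(r"\s*(\w+)\s*=\s*collect\(\s*(\w+)\s*\)\s*$", statement)
    if match is None:
        return statement
    target, var = match.groups()
    refs = ", ".join(f"{var}@{uuid}" for uuid in variable_uuids[var])
    return f"{target} = ({var}: {refs})"

print(expand_collect("a_var = collect(a)", {"a": [0, 1, 2]}))
# a_var = (a: a@0, a@1, a@2)
```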
Duplicate detection
To avoid unnecessary computation and save storage space, a special DupeFinder class should be written. Duplicates normally occur in variations but can occasionally also occur in other cases.
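A minimal sketch, assuming the FireWorks duplicate-checking mechanism with a DupeFinderBase subclass (verify compares two Firework specs, query builds the database pre-filter); which spec fields identify a duplicate in this project is an open design question, and the fields used below are placeholders:

```python
from fireworks.features.dupefinder import DupeFinderBase

class DupeFinderModelNode(DupeFinderBase):
    """Treat two workflow nodes as duplicates if their statement and inputs match."""
    _fw_name = "DupeFinderModelNode"

    def verify(self, spec1, spec2):
        # Placeholder criterion: identical statement string and identical input values.
        keys = ("statement", "inputs")
        return all(spec1.get(key) == spec2.get(key) for key in keys)

    def query(self, spec):
        # Pre-filter candidate fireworks in the database by the statement string.
        return {"spec.statement": spec.get("statement")}
```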