Add computing resources and granularity
Concepts
Computing resources and granularity are optional in the grammar; specifying them has no impact on the computed results.
Computing resources:
- computing time
- number of cores
- main memory
- disk storage
- ...
Granularity:
- Data granularity
- Task-node granularity
Task-node and data granularity
Task-node granularity
This part has been moved to issue https://git.scc.kit.edu/virtmat-tools/vre-language/-/issues/95
Data granularity: chunk size
Syntax
<object> [in <int> chunks]
Example with a function
d = f2(c in 4 chunks, b) on 2 cores for 2 hours
is equivalent to
(c1, c2, c3, c4) = split(c, 4)
d1 = f2(c1, b) on 2 cores for 0.5 hours
d2 = f2(c2, b) on 2 cores for 0.5 hours
d3 = f2(c3, b) on 2 cores for 0.5 hours
d4 = f2(c4, b) on 2 cores for 0.5 hours
d = concat(d1, d2, d3, d4)
without using the chunks keyword.
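The rewriting above can be sketched in plain Python. Here split, concat and f2 are hypothetical stand-ins (a real implementation would schedule the per-chunk calls as parallel workflow nodes rather than call them sequentially):

```python
def split(seq, n):
    """Split seq into n contiguous chunks of nearly equal size."""
    k, r = divmod(len(seq), n)
    chunks, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        chunks.append(seq[start:start+size])
        start += size
    return chunks

def concat(*chunks):
    """Concatenate per-chunk results back into one list."""
    return [x for chunk in chunks for x in chunk]

def f2(c, b):
    # hypothetical elementwise function of c with parameter b
    return [x + b for x in c]

c, b = list(range(10)), 100
c1, c2, c3, c4 = split(c, 4)
d = concat(f2(c1, b), f2(c2, b), f2(c3, b), f2(c4, b))
assert d == f2(c, b)  # chunking does not change the result
```

Because f2 is applied elementwise, the chunked evaluation is exactly equivalent to the unchunked one.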
Example with an expression
Let us assume that energy is an iterable like pd.Series, np.array, ...
use exp from stdlib.functions
use kB from stdlib.constants
temperature = 300.0 K
energy_0 = 2.3 eV
energy = 0 eV to 4 eV step 0.2 eV # alternative to range(0 eV, 4 eV, 0.2 eV)
rate = exp(-(energy-energy_0)**2/(kB*temperature)) in 2 chunks
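The chunked evaluation of this expression can be sketched in plain Python with units dropped. The value of kB in eV/K, the inclusive range (21 points), and the helper name rate_expr are assumptions of this sketch:

```python
from math import exp

K_B = 8.617333262e-5     # Boltzmann constant in eV/K (assumed value)
temperature = 300.0      # K
energy_0 = 2.3           # eV
# energy = 0 eV to 4 eV step 0.2 eV, units dropped, endpoint included:
energy = [i * 0.2 for i in range(21)]

def rate_expr(chunk):
    """Evaluate the rate expression elementwise on one chunk."""
    return [exp(-(e - energy_0)**2 / (K_B * temperature)) for e in chunk]

# evaluate in 2 chunks and concatenate the partial results
mid = len(energy) // 2
rate = rate_expr(energy[:mid]) + rate_expr(energy[mid:])
assert rate == rate_expr(energy)  # same result as unchunked evaluation
```

As in the function example, elementwise expressions split cleanly over chunks.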
Implementation
For the first implementation we need a basic interpreter for delayed execution.
FireWorks
- use Firework objects to implement nodes
- use Firetask objects to implement tasks
- use PyTask and LambdaTask (both supporting chunk number) to implement data granularity
NOTE: ForeachTask is not necessary because the number of chunks is a known constant input.
NOTE: It can happen that we need specific subclasses of Firetask that better match the needs of the interpreter.
Example implementation of the expression example using FireWorks (pseudocode)
# Task 1: LambdaTask with ForeachTask
func: -(energy-energy_0)**2/(units.kB*temperature)
inputs: energy, energy_0, units.kB, temperature
split: energy
number of chunks: 2
outputs: exp_arg
# Task 2: PyTask with ForeachTask
func: numpy.exp(exp_arg)
inputs: exp_arg
outputs: rate
split: exp_arg
number of chunks: 2
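Because the chunk number is a known constant, each chunked task can be expanded statically into a fixed list of per-chunk tasks at model-processing time, which is why no ForeachTask-style runtime fan-out is needed. A minimal, FireWorks-free sketch of that expansion (the task dicts and naming scheme are stand-ins for real Firetask objects):

```python
def expand_chunked_task(func_name, split_input, n_chunks, other_inputs, output):
    """Statically expand one chunked task into n_chunks per-chunk task dicts."""
    tasks = []
    for i in range(n_chunks):
        tasks.append({
            'func': func_name,
            'inputs': [f'{split_input}_chunk{i}'] + list(other_inputs),
            'outputs': [f'{output}_chunk{i}'],
        })
    return tasks

# Task 2 from the example, expanded into 2 concrete per-chunk tasks
tasks = expand_chunked_task('numpy.exp', 'exp_arg', 2, [], 'rate')
assert len(tasks) == 2
assert tasks[0]['inputs'] == ['exp_arg_chunk0']
```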
Deferred execution of lambda functions and expressions
For named Python functions available via an API we can readily use the PyTask. For expressions and expressions with dummy identifiers (lambda functions), we need a way to transfer the code to the workflow system. There are two methods.
Via serialization
import dill
import base64
import json
from math import sqrt

def func(x):
    return 2*sqrt(x)

# a named function:
string = json.dumps(base64.b64encode(dill.dumps(func)).decode('utf-8'))
new_func = dill.loads(base64.b64decode(json.loads(string).encode()))  # reconstructed function in Python
print(new_func(4))  # 4.0

# a lambda function:
string = json.dumps(base64.b64encode(dill.dumps(lambda x: x**2)).decode('utf-8'))
new_func = dill.loads(base64.b64decode(json.loads(string).encode()))
print(new_func(4))  # 16
Expressions can be serialized, but note where they are evaluated: the expression is evaluated before serialization, so only its value is transferred, for example:
a = 4
string = json.dumps(base64.b64encode(dill.dumps(a**2)).decode('utf-8'))
expression = dill.loads(base64.b64decode(json.loads(string).encode()))
print(expression)  # 16
A drawback of this kind of serialization is that the serialized function may be non-portable (a different serialization package than dill, a different dill or Python version, 64/32-bit platforms, etc.).
Via Python source code
import ast
string = 'lambda x: x**2'
node = ast.parse(string, mode='eval')
assert isinstance(node.body, ast.Lambda)
func = eval(compile(node, '', 'eval'))
print(func(2)) # 4
Firetasks
It may happen that the existing "standard" Firetasks are not optimal for wrapping Python functions in the deferred executor. There are some issues:
- Lambda functions must be passed in a serialized form and not via a module.name string.
- The functions provided to PyTask must be aware of the serialized objects passed as arguments and returned.
- A mixture of args and inputs is generally not supported. We need a way to pass constant parameters via args intermixed with inputs, i.e. data from upstream nodes and previous Firetasks.
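One way to address the last issue is a thin binding layer that records, per positional slot, whether the value is a constant or the name of an upstream input. The function bind_call and the slot-template format below are hypothetical, not part of FireWorks:

```python
def bind_call(func, arg_spec, spec_data):
    """Call func with positional arguments taken either from constants
    ('const', value) or from upstream workflow data ('input', name)."""
    args = []
    for kind, value in arg_spec:
        if kind == 'const':
            args.append(value)
        elif kind == 'input':
            args.append(spec_data[value])
        else:
            raise ValueError(f'unknown slot kind: {kind}')
    return func(*args)

# a constant intermixed with data from upstream nodes
spec_data = {'x': 3, 'y': 4}
result = bind_call(lambda a, x, y: a * (x + y),
                   [('const', 10), ('input', 'x'), ('input', 'y')],
                   spec_data)
assert result == 70
```

A custom Firetask subclass could apply this binding in its run_task method, with spec_data coming from the Firework spec.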
A proposal: the construction of the Firetask (that must be sub-classed) can be carried out using a decorator function that is specific to every metamodel class that has a value property. This decorator will wrap and serialize every Python function correspondingly. A model processor can be used to construct the Firetasks per model object, add them to Fireworks, and then add the Fireworks to the database. The value property, which basically fetches the Firetask output, will be available only if the state property is COMPLETED.
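A very loose sketch of the wrap-and-serialize idea, using the source-code method from above. All names here (serializable, rebuild) are hypothetical; the real decorator would target metamodel classes with a value property and emit Firetask subclasses rather than plain functions:

```python
import ast

def serializable(source):
    """Hypothetical decorator factory: attach the function's source string so
    a model processor can ship it to the workflow system and rebuild it there."""
    def decorator(placeholder):
        placeholder.source = source
        return placeholder
    return decorator

def rebuild(source):
    """Worker side: reconstruct the callable from its source string."""
    node = ast.parse(source, mode='eval')
    assert isinstance(node.body, ast.Lambda)
    return eval(compile(node, '<model>', 'eval'))

@serializable('lambda x: x**2')
def square(x):  # placeholder body; the shipped code is the source string
    return x**2

worker_func = rebuild(square.source)
assert worker_func(5) == 25
```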