Transparent file storage for large objects / datasets
If an object becomes so large that it cannot fit into the database (for MongoDB this means exceeding the 16 MB document limit), there should be a mechanism to store the object in a file and to manage its reuse in the model through the file. The file storage must be fully transparent.
Typical use cases: trajectories from molecular dynamics and Monte Carlo, charge densities.
Implementation
- Set thresholds for file storage (configuration parameters):
- a maximum object size in memory (measured, for example, with the Pympler package, https://pympler.readthedocs.io)
- a maximum document size, i.e. the number of bytes of the JSON string after serialization:
len(json_str.encode("utf-8"))
or
len(bson.BSON.encode(serializable_obj))
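A minimal sketch of the two checks, assuming placeholder threshold values and hypothetical names (MAX_OBJECT_SIZE, MAX_DOCUMENT_SIZE, exceeds_thresholds):

```python
import json

from pympler import asizeof

MAX_OBJECT_SIZE = 16 * 1024 * 1024    # in-memory threshold in bytes (placeholder)
MAX_DOCUMENT_SIZE = 16 * 1024 * 1024  # serialized threshold in bytes (placeholder)

def exceeds_thresholds(obj):
    """Return True if obj should be stored in a file instead of inline."""
    # in-memory footprint, including all referenced objects
    if asizeof.asizeof(obj) > MAX_OBJECT_SIZE:
        return True
    # size of the serialized document
    json_str = json.dumps(obj)
    return len(json_str.encode("utf-8")) > MAX_DOCUMENT_SIZE
```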
- Catch the
pymongo.errors.DocumentTooLarge
exception and attach the same handler as in the case of exceeding the thresholds.
- Write a handler. The only option (without modifying FireWorks) is to modify the FWAction object returned by the Firetask.
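Since the insert that raises this exception happens inside FireWorks rather than in the Firetask, one sketch is to anticipate the same condition before returning the FWAction; bson is shipped with pymongo, and 16 MB is MongoDB's default per-document limit:

```python
import bson

BSON_LIMIT = 16 * 1024 * 1024  # MongoDB's default per-document limit (16 MB)

def fits_in_document(update_spec):
    """Pre-check the condition that would raise DocumentTooLarge: encode
    the candidate spec update and compare against the BSON limit."""
    return len(bson.BSON.encode(update_spec)) < BSON_LIMIT
```

If this returns False, the same file-storage handler is applied and the FWAction is modified accordingly (see the Firetask sketch below).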
- Define the file storage / workspace, something similar to launch directories in FireWorks.
- A metadata layer has to be implemented to enable passing objects by path instead of directly via the database. For example:
value: null
file path: /path/to/storage/<uuid>.json
url: file:///path/to/storage/<uuid>.json
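A sketch of a helper that writes an object into the workspace (the storage root stands in for the file storage / workspace defined above) and returns such a record; STORAGE_ROOT and store_in_file are hypothetical names:

```python
import json
import os
import uuid

STORAGE_ROOT = "/path/to/storage"  # the file storage / workspace defined above

def store_in_file(obj):
    """Serialize obj to <uuid>.json in the workspace and return the
    metadata record that is passed via the database instead."""
    path = os.path.join(STORAGE_ROOT, str(uuid.uuid4()) + ".json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(obj, fh)
    return {"value": None, "file path": path, "url": "file://" + path}
```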
- The code switching between inline and file storage must be in the Firetask that processes a variable: inputs with file storage must be de-referenced and read from files, and outputs destined for file storage must be referenced and written to files.
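A hedged sketch of this switch, reusing the exceeds_thresholds and store_in_file helpers sketched above; ProcessVariable, transform, is_file_record and load_from_file are illustrative names, while FiretaskBase, FWAction and explicit_serialize are the standard FireWorks API:

```python
import json

from fireworks import FiretaskBase, FWAction, explicit_serialize

def is_file_record(value):
    """Detect the metadata record produced by store_in_file."""
    return isinstance(value, dict) and value.get("value") is None and "file path" in value

def load_from_file(record):
    """De-reference a metadata record and read the object from its file."""
    with open(record["file path"], encoding="utf-8") as fh:
        return json.load(fh)

def transform(data):
    """Placeholder for the actual processing done by the task."""
    return data

@explicit_serialize
class ProcessVariable(FiretaskBase):
    def run_task(self, fw_spec):
        data = fw_spec["data"]
        if is_file_record(data):          # input side: de-reference and read
            data = load_from_file(data)
        result = transform(data)
        if exceeds_thresholds(result):    # output side: reference and write
            result = store_in_file(result)
        return FWAction(update_spec={"result": result})
```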
Because JSON serialization can be time- and memory-consuming, it would be beneficial to create a data-driven mapping between an object's size in memory and the size of its JSON string, so that the document-size threshold can be estimated without serializing the object.
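One possible way to build such a mapping, assuming representative sample objects are available (fit_size_model and predicted_json_size are hypothetical names; statistics.linear_regression requires Python 3.10+):

```python
import json
from statistics import linear_regression  # Python 3.10+

from pympler import asizeof

def fit_size_model(samples):
    """Fit JSON size as a linear function of in-memory size from samples."""
    mem = [asizeof.asizeof(obj) for obj in samples]
    ser = [len(json.dumps(obj).encode("utf-8")) for obj in samples]
    return linear_regression(mem, ser)  # returns (slope, intercept)

def predicted_json_size(obj, slope, intercept):
    """Estimate the serialized size without serializing the object."""
    return slope * asizeof.asizeof(obj) + intercept
```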