Problems with background I/O
The following script
calc = Calculator lj ((sigma: 3.405) [angstrom], (epsilon: 119.8) [boltzmann_constant * kelvin])
struct = Structure from file '/path/to/Ar108-fcc.xyz'
algo_verlet = Algorithm VelocityVerlet ((timestep: 1) [fs], (steps: 2500), (trajectory: true))
algo_350K = Algorithm VelocityDistribution ((temperature_K: 350) [K])
prop_350K = Property ((algorithm: algo_350K), (structure: struct), (calculator: calc))
algo_verlet_longer = Algorithm VelocityVerlet ((timestep: 1) [fs], (steps: 10000), (trajectory: true))
prop_350K_verlet = Property trajectory ((algorithm: algo_verlet), (structure: prop_350K.output_structure), (calculator: calc))
runs through and all statements reach the COMPLETED state. But when we want to use the prop_350K_verlet variable, e.g. like this:
kins = prop_350K_verlet.trajectory[0].properties.kinetic_energy
we get this error (regardless of whether auto-run is activated):
pymongo.errors.DocumentTooLarge: 'findAndModify' command document too large
A more detailed inspection of the launch shows that the launch action has been stored in gridfs:
{'_id': ObjectId('67dc12a73cbd02deeb0eb766'),
'fworker': {'name': '5b70450d5e084edc8096ab1f55fc4a67',
'category': 'interactive',
'query': '{}',
'env': {}},
'fw_id': 4432,
'launch_dir': '/path/to/fireworks-launches/launcher_2025-03-20-13-05-43-296933',
'host': 'uc2n995.localdomain',
'ip': '10.0.3.227',
'trackers': [],
'action': {'gridfs_id': '67dc1a3647cbf4d00b096ac4'},
'state': 'COMPLETED',
'state_history': [{'state': 'RUNNING',
'created_on': '2025-03-20T13:05:43.591437',
'updated_on': '2025-03-20T13:07:58.812583'},
{'state': 'COMPLETED', 'created_on': '2025-03-20T13:07:58.816497'}],
'launch_id': 4070,
'time_start': '2025-03-20T13:05:43.591437',
'time_end': '2025-03-20T13:07:58.816497',
'runtime_secs': 135.22506}
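For reference, such a launch document can also be retrieved directly with pymongo. This is a minimal sketch; the connection parameters and the database name are assumptions that depend on the actual LaunchPad configuration:

```python
from pymongo import MongoClient

# assumptions: local MongoDB and the default names used by the LaunchPad
client = MongoClient("localhost", 27017)
launches = client["fireworks"]["launches"]

# fetch the launch document of the firework shown above
doc = launches.find_one({"fw_id": 4432})
print(doc["action"])  # {'gridfs_id': '...'} means the action was offloaded to GridFS
```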
The launch action has been stored in GridFS by FireWorks itself, not by VRE Language. This fallback mechanism is activated in FireWorks (by an exception handler) when the action exceeds MongoDB's 16 MB per-document limit. The mechanism covers only the action, however, and not large data in a firework's spec: when the new firework for kins is appended to the workflow, the stored action (an update_spec action) is applied to the new child firework, whose spec then becomes too large. The mechanism can be deactivated by setting GRIDFS_FALLBACK_COLLECTION: null in ~/.fireworks/FW_config.yaml.
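To make the limit explicit: whether a document would be rejected can be estimated from its BSON-encoded size. This is a minimal sketch (the function and the constant are ours, not part of FireWorks, and the small command overhead added by pymongo is ignored):

```python
import bson

MAX_BSON_SIZE = 16 * 1024 * 1024  # MongoDB's per-document limit (16 MB)

def exceeds_bson_limit(document: dict) -> bool:
    """Return True if the BSON-encoded document would exceed MongoDB's limit."""
    return len(bson.encode(document)) > MAX_BSON_SIZE

# e.g. exceeds_bson_limit(action_dict) before the workflow update is attempted
```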
With the fallback deactivated, we get this unhandled FireWorks exception:
Traceback (most recent call last):
File "/virtmat-tools/vre-language/src/virtmat/language/utilities/errors.py", line 306, in wrapper
return func(*args, **kwargs)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session_manager.py", line 169, in get_model_value
return getattr(self.session.get_model(*args, uuid=self.uuid, **kwargs), 'value', '')
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 150, in get_model
self.process_models(model_str, model_path, active_uuid=uuid)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 244, in process_models
model, uuid = self._process_model(uuid, strn, path, active_uuid=active_uuid)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 587, in _process_model
model = tx_get_model(model_src, deferred_mode=True,
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/textx/metamodel.py", line 699, in model_from_str
p(model, self)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/workflow_executor.py", line 545, in workflow_model_processor
append_var_nodes(model)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/workflow_executor.py", line 475, in append_var_nodes
model.lpad.append_wf(Workflow([nodes.pop(ind)]), fw_ids=parents)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 456, in append_wf
wf = self.get_wf_by_fw_id(fw_ids[0])
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 530, in get_wf_by_fw_id
return Workflow(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/firework.py", line 791, in __init__
for fw in fireworks:
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 515, in get_fw_by_id
return Firework.from_dict(self.get_fw_dict_by_id(fw_id))
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 495, in get_fw_dict_by_id
launch["action"] = get_action_from_gridfs(launch.get("action"), self.gridfs_fallback)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 2202, in get_action_from_gridfs
action_data = fallback_fs.get(ObjectId(action_gridfs_id))
AttributeError: 'NoneType' object has no attribute 'get'
Starting the variable anew with the GRIDFS_FALLBACK_COLLECTION: null setting prevents completion of prop_350K_verlet, with this error message:
Traceback (most recent call last):
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/rocket.py", line 359, in run
lp.complete_launch(launch_id, m_action, final_state)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 1528, in complete_launch
raise err
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 1515, in complete_launch
self.launches.find_one_and_replace({"launch_id": m_launch.launch_id}, m_launch.to_db_dict(), upsert=True)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3328, in find_one_and_replace
return self.__find_and_modify(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3138, in __find_and_modify
return self.__database.client._retryable_write(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1523, in _retryable_write
return self._retry_with_session(retryable, func, s, bulk)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1421, in _retry_with_session
return self._retry_internal(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/_csot.py", line 107, in csot_wrapper
return func(self, *args, **kwargs)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1453, in _retry_internal
return _ClientConnectionRetryable(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 2315, in run
return self._read() if self._is_read else self._write()
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 2423, in _write
return self._func(self._session, conn, self._retryable) # type: ignore
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3124, in _find_and_modify
out = self._command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 308, in _command
return conn.command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/helpers.py", line 322, in inner
return func(*args, **kwargs)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/pool.py", line 996, in command
self._raise_connection_failure(error)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/pool.py", line 968, in command
return command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/network.py", line 164, in command
message._raise_document_too_large(name, size, max_bson_size + message._COMMAND_OVERHEAD)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/message.py", line 1182, in _raise_document_too_large
raise DocumentTooLarge(f"{operation!r} command document too large")
pymongo.errors.DocumentTooLarge: 'findAndModify' command document too large. Set GRIDFS_FALLBACK_COLLECTION in FW_config.yaml to a value different from None
Just before this exception occurs, the debug log shows:
2025-03-23 13:21:29,558 DEBUG Querying for duplicates, fw_id: 4444
2025-03-23 13:21:29,572 DEBUG FW with id: 4444 is unique!
2025-03-23 13:21:29,576 DEBUG Created/updated Launch with launch_id=4077
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
2025-03-23 13:21:30,085 DEBUG RUNNING FW with id: 4444
2025-03-23 13:21:30,085 INFO RUNNING fw_id: 4444 in directory: /pfs/data5/home/kit/scc/jk7683/fireworks-launches/launcher_2025-03-23-12-21-29-278644
2025-03-23 13:21:30,305 INFO Task started: {{virtmat.language.utilities.firetasks.FunctionTask}}.
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
2025-03-23 13:22:01,689 INFO Task completed: {{virtmat.language.utilities.firetasks.FunctionTask}}
2025-03-23 13:22:01,703 DEBUG virtmat.language.utilities.serializable: size in memory: 70632
The last line is the size of the object returned by the Firetask, in this case a Property object. The Trajectory object is in the results attribute of Property: results is a pandas DataFrame containing the Trajectory object and other computed properties. The reported size does not change if we make the trajectory shorter or longer: it is always about 70 kB, i.e. the size of the Property object and its attributes, so the trajectory data is evidently not included in the measured memory use (the binary ASE trajectory file is about 30 MB).
Currently we use pympler to compute the memory usage of the object before serialization and to estimate its storage footprint in the file/database. This is obviously not reliable.
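As a hypothetical illustration of why this is unreliable (the class and file name below are made up, not the actual Property/Trajectory implementation): an object that keeps only a file reference in memory, but pulls the full file contents into its serialized form, looks tiny to pympler:

```python
from pympler import asizeof

class LazyData:
    """Stand-in for an object holding a file reference in memory while its
    serialized form contains the full file contents."""

    def __init__(self, filename):
        self.filename = filename

    def to_dict(self):
        with open(self.filename, 'rb') as fh:
            return {'filename': self.filename, 'data': fh.read().hex()}

obj = LazyData('/path/to/trajectory.traj')   # e.g. a ~30 MB binary file
print(asizeof.asizeof(obj))                  # a few hundred bytes: only the reference
# len(str(obj.to_dict())) would be > 60 MB once the file is read and hex-encoded
```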
Alternatives:
- recursively apply pympler following the references used in serialization;
- serialize (for iterable types partially serialize) to see whether the JSON string exceeds the inline storage threshold (no change in the data schema); see the sketch after this list;
- "serialize" certain types in a different way: for example, a Property object containing trajectories, charge densities etc. could write them to a binary file and store only a reference (implies changing the data schema);
- independent of the size, choose file/GridFS storage whenever certain types are contained in the object.
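A minimal sketch of the second alternative, assuming the objects can be serialized with json.dumps (the real objects go through the project's own serialization); the function and the threshold value are illustrative only:

```python
import json

INLINE_THRESHOLD = 1 * 1024 * 1024  # hypothetical inline-storage threshold in bytes

def exceeds_inline_threshold(obj, threshold=INLINE_THRESHOLD):
    """Serialize the object (element-wise for iterables) and stop as soon as
    the accumulated JSON size crosses the threshold."""
    if isinstance(obj, (list, tuple)):
        total = 2  # the enclosing brackets
        for item in obj:
            total += len(json.dumps(item, default=str)) + 1  # +1 for the separator
            if total > threshold:
                return True
        return False
    return len(json.dumps(obj, default=str)) > threshold
```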