Problems with background I/O
The following script
calc = Calculator lj ((sigma: 3.405) [angstrom], (epsilon: 119.8) [boltzmann_constant * kelvin])
struct = Structure from file '/path/to/Ar108-fcc.xyz'
algo_verlet = Algorithm VelocityVerlet ((timestep: 1) [fs], (steps: 2500), (trajectory: true))
algo_350K = Algorithm VelocityDistribution ((temperature_K: 350) [K])
prop_350K = Property ((algorithm: algo_350K), (structure: struct), (calculator: calc))
algo_verlet_longer = Algorithm VelocityVerlet ((timestep: 1) [fs], (steps: 10000), (trajectory: true))
prop_350K_verlet = Property trajectory ((algorithm: algo_verlet), (structure: prop_350K.output_structure), (calculator: calc))
runs through and all statements reach the COMPLETED state. But when we want to use the prop_350K_verlet variable, e.g. like this:
kins = prop_350K_verlet.trajectory[0].properties.kinetic_energy
we get this error (regardless of whether auto-run is activated):
pymongo.errors.DocumentTooLarge: 'findAndModify' command document too large
A more detailed inspection of the launch shows that the launch action has been stored in gridfs:
{'_id': ObjectId('67dc12a73cbd02deeb0eb766'),
'fworker': {'name': '5b70450d5e084edc8096ab1f55fc4a67',
'category': 'interactive',
'query': '{}',
'env': {}},
'fw_id': 4432,
'launch_dir': '/path/to/fireworks-launches/launcher_2025-03-20-13-05-43-296933',
'host': 'uc2n995.localdomain',
'ip': '10.0.3.227',
'trackers': [],
'action': {'gridfs_id': '67dc1a3647cbf4d00b096ac4'},
'state': 'COMPLETED',
'state_history': [{'state': 'RUNNING',
'created_on': '2025-03-20T13:05:43.591437',
'updated_on': '2025-03-20T13:07:58.812583'},
{'state': 'COMPLETED', 'created_on': '2025-03-20T13:07:58.816497'}],
'launch_id': 4070,
'time_start': '2025-03-20T13:05:43.591437',
'time_end': '2025-03-20T13:07:58.816497',
'runtime_secs': 135.22506}
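For reference, such a launch document can also be retrieved directly with pymongo. This is a minimal sketch; the connection parameters and the database name are assumptions that depend on the actual LaunchPad configuration:

```python
from pymongo import MongoClient

# assumptions: local MongoDB and the default names used by the LaunchPad
client = MongoClient("localhost", 27017)
launches = client["fireworks"]["launches"]

# fetch the launch document of the firework shown above
doc = launches.find_one({"fw_id": 4432})
print(doc["action"])  # {'gridfs_id': '...'} means the action was offloaded to GridFS
```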
The launch action has been stored in GridFS by FireWorks itself, not by VRE Language. This fallback mechanism is activated in FireWorks (by an exception handler) when the action exceeds MongoDB's 16 MB per-document limit. The mechanism covers only the action, however, and not large data in a firework's spec: when the new firework for kins is appended to the workflow, the stored action (an update_spec action) is applied to the new child firework, whose spec then becomes too large. The mechanism can be deactivated by setting GRIDFS_FALLBACK_COLLECTION: null in ~/.fireworks/FW_config.yaml.
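To make the limit explicit: whether a document would be rejected can be estimated from its BSON-encoded size. This is a minimal sketch (the function and the constant are ours, not part of FireWorks, and the small command overhead added by pymongo is ignored):

```python
import bson

MAX_BSON_SIZE = 16 * 1024 * 1024  # MongoDB's per-document limit (16 MB)

def exceeds_bson_limit(document: dict) -> bool:
    """Return True if the BSON-encoded document would exceed MongoDB's limit."""
    return len(bson.encode(document)) > MAX_BSON_SIZE

# e.g. exceeds_bson_limit(action_dict) before the workflow update is attempted
```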
With the fallback deactivated, we get this unhandled FireWorks exception:
Traceback (most recent call last):
File "/virtmat-tools/vre-language/src/virtmat/language/utilities/errors.py", line 306, in wrapper
return func(*args, **kwargs)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session_manager.py", line 169, in get_model_value
return getattr(self.session.get_model(*args, uuid=self.uuid, **kwargs), 'value', '')
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 150, in get_model
self.process_models(model_str, model_path, active_uuid=uuid)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 244, in process_models
model, uuid = self._process_model(uuid, strn, path, active_uuid=active_uuid)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/session.py", line 587, in _process_model
model = tx_get_model(model_src, deferred_mode=True,
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/textx/metamodel.py", line 699, in model_from_str
p(model, self)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/workflow_executor.py", line 545, in workflow_model_processor
append_var_nodes(model)
File "/virtmat-tools/vre-language/src/virtmat/language/interpreter/workflow_executor.py", line 475, in append_var_nodes
model.lpad.append_wf(Workflow([nodes.pop(ind)]), fw_ids=parents)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 456, in append_wf
wf = self.get_wf_by_fw_id(fw_ids[0])
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 530, in get_wf_by_fw_id
return Workflow(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/firework.py", line 791, in __init__
for fw in fireworks:
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 515, in get_fw_by_id
return Firework.from_dict(self.get_fw_dict_by_id(fw_id))
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 495, in get_fw_dict_by_id
launch["action"] = get_action_from_gridfs(launch.get("action"), self.gridfs_fallback)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 2202, in get_action_from_gridfs
action_data = fallback_fs.get(ObjectId(action_gridfs_id))
AttributeError: 'NoneType' object has no attribute 'get'
Starting the variable anew with the GRIDFS_FALLBACK_COLLECTION: null setting prevents completion of prop_350K_verlet, with this error message:
Traceback (most recent call last):
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/rocket.py", line 359, in run
lp.complete_launch(launch_id, m_action, final_state)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 1528, in complete_launch
raise err
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/fireworks/core/launchpad.py", line 1515, in complete_launch
self.launches.find_one_and_replace({"launch_id": m_launch.launch_id}, m_launch.to_db_dict(), upsert=True)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3328, in find_one_and_replace
return self.__find_and_modify(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3138, in __find_and_modify
return self.__database.client._retryable_write(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1523, in _retryable_write
return self._retry_with_session(retryable, func, s, bulk)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1421, in _retry_with_session
return self._retry_internal(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/_csot.py", line 107, in csot_wrapper
return func(self, *args, **kwargs)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 1453, in _retry_internal
return _ClientConnectionRetryable(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 2315, in run
return self._read() if self._is_read else self._write()
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/mongo_client.py", line 2423, in _write
return self._func(self._session, conn, self._retryable) # type: ignore
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 3124, in _find_and_modify
out = self._command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/collection.py", line 308, in _command
return conn.command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/helpers.py", line 322, in inner
return func(*args, **kwargs)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/pool.py", line 996, in command
self._raise_connection_failure(error)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/pool.py", line 968, in command
return command(
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/network.py", line 164, in command
message._raise_document_too_large(name, size, max_bson_size + message._COMMAND_OVERHEAD)
File "/jupyter-tensorflow-2023-10-10/lib64/python3.9/site-packages/pymongo/message.py", line 1182, in _raise_document_too_large
raise DocumentTooLarge(f"{operation!r} command document too large")
pymongo.errors.DocumentTooLarge: 'findAndModify' command document too large. Set GRIDFS_FALLBACK_COLLECTION in FW_config.yaml to a value different from None
Just before this exception occurs, the debug log shows:
2025-03-23 13:21:29,558 DEBUG Querying for duplicates, fw_id: 4444
2025-03-23 13:21:29,572 DEBUG FW with id: 4444 is unique!
2025-03-23 13:21:29,576 DEBUG Created/updated Launch with launch_id=4077
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
2025-03-23 13:21:30,085 DEBUG RUNNING FW with id: 4444
2025-03-23 13:21:30,085 INFO RUNNING fw_id: 4444 in directory: /pfs/data5/home/kit/scc/jk7683/fireworks-launches/launcher_2025-03-23-12-21-29-278644
2025-03-23 13:21:30,305 INFO Task started: {{virtmat.language.utilities.firetasks.FunctionTask}}.
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1565: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
result[:] = values
2025-03-23 13:22:01,689 INFO Task completed: {{virtmat.language.utilities.firetasks.FunctionTask}}
2025-03-23 13:22:01,703 DEBUG virtmat.language.utilities.serializable: size in memory: 70632
The last line is the size of the object returned by the Firetask, in this case a Property object. The Trajectory object is in the results attribute of Property: results is a pandas DataFrame containing the Trajectory object and other computed properties. The reported size does not change if we make the trajectory shorter or longer: it is always about 70 kB, i.e. the size of the Property object and its attributes, so the trajectory data is evidently not included in the measured memory use (the binary ASE trajectory file is about 30 MB).
Currently we use pympler to compute the memory usage of the object before serialization and to estimate its storage footprint in the file/database. This is obviously not reliable.
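As a hypothetical illustration of why this is unreliable (the class and file name below are made up, not the actual Property/Trajectory implementation): an object that keeps only a file reference in memory, but pulls the full file contents into its serialized form, looks tiny to pympler:

```python
from pympler import asizeof

class LazyData:
    """Stand-in for an object holding a file reference in memory while its
    serialized form contains the full file contents."""

    def __init__(self, filename):
        self.filename = filename

    def to_dict(self):
        with open(self.filename, 'rb') as fh:
            return {'filename': self.filename, 'data': fh.read().hex()}

obj = LazyData('/path/to/trajectory.traj')   # e.g. a ~30 MB binary file
print(asizeof.asizeof(obj))                  # a few hundred bytes: only the reference
# len(str(obj.to_dict())) would be > 60 MB once the file is read and hex-encoded
```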
Alternatives:
- recursively apply pympler following the references used in serialization;
- serialize (for iterable types partially serialize) to see whether the JSON string exceeds the inline storage threshold (no change in the data schema); see the sketch after this list;
- "serialize" certain types in a different way: for example, a Property object containing trajectories, charge densities etc. could write them to a binary file and store only a reference (implies changing the data schema);
- independent of the size, choose file/GridFS storage whenever certain types are contained in the object.
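A minimal sketch of the second alternative, assuming the objects can be serialized with json.dumps (the real objects go through the project's own serialization); the function and the threshold value are illustrative only:

```python
import json

INLINE_THRESHOLD = 1 * 1024 * 1024  # hypothetical inline-storage threshold in bytes

def exceeds_inline_threshold(obj, threshold=INLINE_THRESHOLD):
    """Serialize the object (element-wise for iterables) and stop as soon as
    the accumulated JSON size crosses the threshold."""
    if isinstance(obj, (list, tuple)):
        total = 2  # the enclosing brackets
        for item in obj:
            total += len(json.dumps(item, default=str)) + 1  # +1 for the separator
            if total > threshold:
                return True
        return False
    return len(json.dumps(obj, default=str)) > threshold
```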