Optimize database queries
In several places, we originally used the FireWorks API directly to make database queries. This typically means several API calls, each downloading full documents (fireworks or workflows). This can become a performance issue for large and/or many documents.
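For illustration, a typical instance of this pattern (the query is a placeholder, but the LaunchPad calls are the actual FireWorks API):

```python
from fireworks import LaunchPad

lpad = LaunchPad.auto_load()

# One get_fw_ids() call plus one get_fw_by_id() call per matching firework;
# every get_fw_by_id() is a separate round trip that downloads the full
# document, even though only the state field is needed here.
query = {'state': 'COMPLETED'}  # placeholder query
states = [lpad.get_fw_by_id(fw_id).state for fw_id in lpad.get_fw_ids(query)]
```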
Here is an overview of the current usage of the FireWorks API in `src`:
```
./virtmat/language/interpreter/session_manager.py: fw_ids = self.lpad.get_fw_ids_in_wfs(wf_query, fw_query)
./virtmat/language/interpreter/session_manager.py: return [self.lpad.get_fw_by_id(i) for i in fw_ids]
./virtmat/language/interpreter/workflow_executor.py: firework = model.lpad.get_fw_by_id(model.lpad.get_fw_ids(fw_q)[0])
./virtmat/language/interpreter/workflow_executor.py: fizzled = model.lpad.get_fw_ids(fw_query)
./virtmat/language/interpreter/workflow_executor.py: fw_name = model.lpad.get_fw_by_id(fw_id).name
./virtmat/language/interpreter/workflow_executor.py: return self.lpad.get_fw_ids_in_wfs({'metadata.uuid': self.uuid})
./virtmat/language/interpreter/workflow_executor.py: fw_ids.extend(self.lpad.get_fw_ids({'name': var.__fw_name}))
./virtmat/language/interpreter/workflow_executor.py: var.__fw_name = model.lpad.get_fw_by_id(fw_ids[0]).name
./virtmat/language/interpreter/workflow_executor.py: meta_id = next(iter(model.lpad.get_fw_ids_in_wfs(wf_query, fw_query)))
./virtmat/language/interpreter/workflow_executor.py: fwk = model.lpad.get_fw_by_id(meta_id)
./virtmat/language/interpreter/workflow_executor.py: return model.lpad.get_fw_ids(fw_query)
./virtmat/language/utilities/fw_tools.py: return lpad.get_fw_ids(fw_query)
./virtmat/language/utilities/fw_tools.py: return lpad.get_fw_ids({'fw_id': {'$in': fw_ids}, 'name': '_fw_root_node'})
./virtmat/language/utilities/fw_tools.py: wfl_pl = lpad.get_wf_by_fw_id(fw_id).links.parent_links
./virtmat/language/utilities/fw_tools.py: state = lpad.get_fw_by_id(fw_id).state
```
In the meantime, we have "smarter" interfaces `get_one_node_info()` and `get_all_nodes_info()`. With MongoDB, these perform the query in a single call using the PyMongo API. Moreover, only the needed data is downloaded, not full documents (using a projection). When a local file storage is used as a surrogate for MongoDB (mongomock), the query again requires several API calls, but still benefits from the projection.
The calls above should be systematically replaced by `get_one_node_info()` and `get_all_nodes_info()`.
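As an example, one of the calls listed above could be rewritten roughly as follows; the `get_all_nodes_info()` signature used here is an assumption and must be adapted to the actual interface:

```python
# before: two round trips, the full firework document is downloaded
fw_name = model.lpad.get_fw_by_id(fw_id).name

# after: one projected query; the signature and list-of-dicts return value
# of get_all_nodes_info() are assumed here, check the actual interface
nodes = get_all_nodes_info(model.lpad, {'fw_id': fw_id}, {'name': True})
fw_name = nodes[0]['name']
```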
It would be nice to have a benchmark: a use case for which the new implementation makes a real difference in performance.
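A minimal benchmark sketch (assuming a LaunchPad populated with many fireworks; the query is a placeholder):

```python
import time
from fireworks import LaunchPad

lpad = LaunchPad.auto_load()
query = {'state': 'COMPLETED'}  # placeholder query

# old pattern: one round trip per document, full documents downloaded
start = time.perf_counter()
names_old = [lpad.get_fw_by_id(i).name for i in lpad.get_fw_ids(query)]
print(f'per-document fetch: {time.perf_counter() - start:.3f} s')

# new pattern: a single projected query on the underlying collection
start = time.perf_counter()
names_new = [d['name'] for d in lpad.fireworks.find(query, projection={'name': True})]
print(f'single projected query: {time.perf_counter() - start:.3f} s')
```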
In the end, `get_one_node_info()` and `get_all_nodes_info()` seem quite similar. Obviously, `get_all_nodes_info()` is more powerful and allows a later uniqueness test by the caller. Maybe `get_one_node_info()` should be eliminated and replaced with `get_all_nodes_info()`, which could then be renamed to `get_nodes_info()`.
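With a single `get_nodes_info()`, the uniqueness test would move to the caller, roughly like this (the name, signature, and list-of-dicts return value are assumptions):

```python
# get_nodes_info() is the hypothetically renamed get_all_nodes_info();
# its signature and return type are assumed here
nodes = get_nodes_info(lpad, {'name': fw_name}, {'fw_id': True})
if len(nodes) != 1:
    raise RuntimeError(f'expected exactly one matching node, found {len(nodes)}')
fw_id = nodes[0]['fw_id']
```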