Storage Module¶
Abstract Base Class¶
- class orchestrator.storage.storage_base.Storage(**kwargs)[source]¶
Bases: Recorder, ABC
Abstract base class for data storage
The Storage class handles all functionality associated with data storage inside Orchestrator, including database initialization and data additions, updates, and queries. The Orchestrator uses a list of ASE Atoms as the internal data representation. A given database (Storage instance) can include multiple datasets (collections of configurations and properties) and generally persists over time.
- Parameters:
storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail
- __init__(**kwargs)[source]¶
Set variables and initialize the recorder
- Parameters:
storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail
- generate_dataset_name(root, specifier, counter=None, check_uniqueness=True)[source]¶
generate a detailed (mostly) human-readable dataset name
The dataset name will be in the form: root_specifier[_counter] and if check_uniqueness is true, root_specifier[_counter]_unique_hash.
- Parameters:
root (str) – root of the dataset name, this should be consistent across similar runs (i.e. a campaign name)
specifier (str) – this argument gives more fine control of the dataset name, allowing differentiation within a given root
counter (int) – iteration number of the present root and specifier combination. This can be used for versioning of datasets.
Default: None
check_uniqueness (boolean) – attaches a random hash to the dataset name if true, and ensures that the resulting dataset name is unique within the storage module.
Default: True
- Returns:
the dataset name
- Return type:
str
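The naming scheme described above can be illustrated with a short standalone sketch. This is not the Orchestrator implementation: the function `sketch_dataset_name`, the `existing_names` argument, and the use of an 8-character `uuid4` fragment as the hash are all assumptions made for illustration.

```python
import uuid


def sketch_dataset_name(root, specifier, counter=None,
                        check_uniqueness=True, existing_names=()):
    """Build a name of the form root_specifier[_counter][_hash].

    Hypothetical stand-in for Storage.generate_dataset_name; the real
    method checks uniqueness against the database, not a Python set.
    """
    name = f"{root}_{specifier}"
    if counter is not None:
        name += f"_{counter}"
    if not check_uniqueness:
        return name
    taken = set(existing_names)
    while True:
        # Assumed hash format: first 8 hex characters of a random UUID.
        candidate = f"{name}_{uuid.uuid4().hex[:8]}"
        if candidate not in taken:
            return candidate


plain = sketch_dataset_name("campaign", "dft", counter=3,
                            check_uniqueness=False)
unique = sketch_dataset_name("campaign", "dft", counter=3)
```

Here `plain` keeps the bare root_specifier_counter form, while `unique` carries an extra random suffix, so repeated runs of the same campaign do not collide.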
- abstract check_if_dataset_name_unique(dataset_name)[source]¶
check if the provided dataset_name is unique in the database
- Parameters:
dataset_name (str) – name to check (human readable)
- Returns:
true if the dataset is not present in the database, false if it does exist
- Return type:
boolean
- abstract add_data(dataset_handle, data, dataset_metadata=None)[source]¶
Add new configurations (and associated properties) to the database
This method is used to add to an existing dataset with new configurations. The new configurations may or may not have other properties associated with them.
- Parameters:
dataset_handle (str or int) – name or ID of dataset
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole.
- Returns:
handle for the dataset which includes the new additions
- Return type:
str or int
- abstract new_dataset(dataset_handle, data, dataset_metadata=None)[source]¶
Create a new dataset with the provided data and metadata
The new dataset will have a human readable name specified by dataset_handle and will ingest the data and metadata provided.
- Parameters:
dataset_handle (str) – name of the dataset to be created
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole.
- Returns:
unique handle for the dataset, i.e. its ID
- Return type:
str or int
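The difference between new_dataset (create) and add_data (extend) can be sketched with a toy in-memory class. Everything here is an illustration rather than Orchestrator code: `ToyStorage` is hypothetical, plain dictionaries stand in for ASE Atoms objects, the value of `METADATA_KEY` is a placeholder, and raising on a duplicate name is an assumed behavior.

```python
METADATA_KEY = "metadata"  # placeholder; the real constant is defined by Orchestrator


class ToyStorage:
    """In-memory illustration of the new_dataset / add_data contract."""

    def __init__(self):
        self._datasets = {}

    def check_if_dataset_name_unique(self, dataset_name):
        return dataset_name not in self._datasets

    def new_dataset(self, dataset_handle, data, dataset_metadata=None):
        # Assumed behavior: refuse to overwrite an existing dataset.
        if not self.check_if_dataset_name_unique(dataset_handle):
            raise ValueError(f"dataset {dataset_handle!r} already exists")
        self._datasets[dataset_handle] = {
            "configs": list(data),
            "metadata": dict(dataset_metadata or {}),
        }
        return dataset_handle

    def add_data(self, dataset_handle, data, dataset_metadata=None):
        dataset = self._datasets[dataset_handle]  # KeyError if missing
        dataset["configs"].extend(data)
        dataset["metadata"].update(dataset_metadata or {})
        return dataset_handle

    def get_data(self, dataset_handle, query_options=None):
        return list(self._datasets[dataset_handle]["configs"])


# Each dict stands in for an ase.Atoms object; "info" mimics atoms.info.
config_a = {"symbols": "H2", "info": {METADATA_KEY: {"source": "run_1"}}}
config_b = {"symbols": "O2", "info": {METADATA_KEY: {"source": "run_2"}}}

storage = ToyStorage()
handle = storage.new_dataset("demo_set", [config_a], {"description": "toy"})
handle = storage.add_data(handle, [config_b])
```

Configuration-specific metadata travels with each configuration under the metadata key, while dataset_metadata is stored once for the dataset as a whole.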
- abstract update_data(dataset_handle, data, metadata=None)[source]¶
Update an existing dataset - overwriting or adding new properties
This method operates on existing configurations and/or properties. Data are provided as a KliFF dataset of properties that should be added to either the configuration as a new property or overwriting existing properties within the database.
- Parameters:
dataset_handle (str or int) – name or ID of dataset
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole.
- Returns:
unique handle for the dataset
- Return type:
str
- abstract get_data(dataset_handle, query_options=None)[source]¶
Extract data from Storage
Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.
- Parameters:
dataset_handle (str or int) – name or ID of dataset
query_options (dict) – dict of options for data extraction and return
Default: None
- Returns:
requested data as a list of ASE Atoms
- Return type:
list
- abstract delete_dataset(dataset_handle)[source]¶
Remove the dataset specified by dataset_handle from the database
- Parameters:
dataset_handle (str) – name or ID of dataset
- abstract list_data(dataset_handle=None)[source]¶
Utility function to query the database
Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents.
- Parameters:
dataset_handle (str or int) – name or ID of dataset
Default: None
Concrete Implementations¶
Local (demo/testing)¶
- class orchestrator.storage.local.LocalStorage(database_path='./local_storage_database', database_name=None, **kwargs)[source]¶
Bases: Storage
Class to store data on the local disk
The Storage class handles all functionality associated with data storage inside Orchestrator, including database initialization and data additions, updates, and queries. The Orchestrator uses ASE Atoms as the internal data representation. A given database (Storage instance) can include multiple datasets (collections of configurations and properties) and generally persists over time.
- Parameters:
storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. database_path defaults to ‘./local_storage_database’ while database_name defaults to the last string component from the database_path. LocalStorage does not require any additional arguments
- __init__(database_path='./local_storage_database', database_name=None, **kwargs)[source]¶
Set variables and initialize the recorder
- Parameters:
database_path (str) – Path to the local storage database
Default: './local_storage_database'
database_name (Optional[str]) – Name of the local database
- check_if_dataset_name_unique(dataset_name)[source]¶
check if the provided dataset_name is unique in the database
- Parameters:
dataset_name (str) – name to check (human readable)
- Returns:
true if the dataset is not present in the database, false if it does exist
- Return type:
boolean
- add_data(dataset_handle, data, dataset_metadata=None)[source]¶
Add new configurations (and associated properties) to the database
This method is used to add to an existing dataset with new configurations. The new configurations may or may not have other properties associated with them.
- Parameters:
dataset_handle (str) – name of dataset
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole.
- Returns:
handle for the dataset which includes the new additions
- Return type:
str
- new_dataset(dataset_handle, data, dataset_metadata=None)[source]¶
Create a new dataset with the provided data and metadata
The new dataset will have a human readable name specified by dataset_handle and will ingest the data and metadata provided.
- Parameters:
dataset_handle (str) – name of the dataset to be created
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole.
- Returns:
name of the dataset
- Return type:
str
- update_data(dataset_handle, data, new_data_key)[source]¶
Update an existing dataset - overwriting or adding new properties
This method operates on existing configurations and/or properties. data are provided as a KliFF dataset of properties that should be added to either the configuration as a new property or overwriting existing properties within the database.
- Parameters:
dataset_handle (str or int) – name or ID of dataset
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata – A dictionary of metadata specific to the dataset as a whole.
- Returns:
unique handle for the dataset
- Return type:
str
- get_data(dataset_handle, query_options=None)[source]¶
Extract data from storage
Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.
- Parameters:
dataset_handle (str) – name of the dataset to extract
query_options (dict) – dict of options for data extraction and return
Default: None
- Returns:
requested data as a list of ASE Atoms
- Return type:
list
- delete_dataset(dataset_handle)[source]¶
Remove the dataset specified by dataset_handle from the database
- Parameters:
dataset_handle (str) – name or ID of dataset
- list_data(dataset_handle=None)[source]¶
Utility function to query the database
Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents.
- Parameters:
dataset_handle (str) – name of dataset
Default: None
Colabfit (production)¶
- class orchestrator.storage.colabfit.ColabfitStorage(credential_file=None, database_path=None, database_name=None, database_port=None, database_user=None, database_password=None, external_file=None, **kwargs)[source]¶
Bases: Storage
Manage data using Colabfit
Colabfit documentation can be found at: https://colabfit.github.io/colabfit-tools/html/index.html.
- Parameters:
storage_args (dict) – dictionary with initialization parameters, including database_name, database_path, external_file, and credential_file. database_path is the uri to the mongodb data server (required). database_name is the name of the mongodb database client (required). external_file is the explicit path to an lmdb file to handle configurations larger than 20,000 atoms. This file will be generated by Colabfit if it does not yet exist (optional). credential_file is a path to a json file which contains the database_path, database_name, and optionally external_file path. If a credential_file is provided, its contents override any other arguments. None of these parameters have default values.
- __init__(credential_file=None, database_path=None, database_name=None, database_port=None, database_user=None, database_password=None, external_file=None, **kwargs)[source]¶
- Parameters:
credential_file (str) – Path to a JSON file with the path, name, port, user, password, and external_file keys. This is the preferred method for initializing a storage module. No other keys are needed if credential_file is set
database_path (str) – URI to the PostgreSQL data server
database_name (str) – Name of the PostgreSQL database client
database_port (str) – Port for the PostgreSQL server
database_user (str) – Username for the PostgreSQL server
database_password (str) – Password for the PostgreSQL server
external_file (str) – Path to an LMDB file for large configurations
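A credential file of the kind described above could be prepared and read as in the following sketch. The helper `resolve_connection_args` is hypothetical and all values are made up. The JSON keys follow the list given for credential_file, and the precedence of file contents over explicit arguments follows the class description. Using the same short key names for the explicit arguments is a simplification made here.

```python
import json
import os
import tempfile


def resolve_connection_args(credential_file=None, **explicit):
    """Merge explicit arguments with the contents of a credential file.

    Hypothetical helper: values read from the JSON file take precedence
    over explicitly passed values, as the class description states.
    """
    settings = dict(explicit)
    if credential_file is not None:
        with open(credential_file) as handle:
            settings.update(json.load(handle))
    return settings


# Example credential contents; every value below is invented.
example = {
    "path": "localhost",
    "name": "orchestrator_db",
    "port": "5432",
    "user": "demo_user",
    "password": "not-a-real-password",
    "external_file": None,
}

with tempfile.TemporaryDirectory() as tmp:
    cred_path = os.path.join(tmp, "credentials.json")
    with open(cred_path, "w") as handle:
        json.dump(example, handle)
    # The explicit port is overridden by the value stored in the file.
    resolved = resolve_connection_args(cred_path, port="9999")
```

Keeping the connection details in one file avoids passing passwords through scripts or input decks, which is presumably why the credential file is described as the preferred method.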
- check_if_dataset_name_unique(dataset_name)[source]¶
check if the provided dataset_name is unique in the database
- Parameters:
dataset_name (str) – name to check (human readable)
- Returns:
true if the dataset is not present in the database, false if it does exist
- Return type:
boolean
- add_data(dataset_handle, data, dataset_metadata=None, updated_description=None, updated_authors=None)[source]¶
Add new configurations (and associated properties) to the db.
This method is used to add new configurations to an existing dataset. update_data can serve the same role (along with others) but requires all data (new and existing) to be passed in as an argument. Assumes the property format (property_map) is the same as the original dataset.
- Parameters:
dataset_handle (str) – name or ID of dataset
data (list[Atoms]) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole. This function needs to have ‘parameters’ provided, which consists of ‘universal’ and ‘code’ nested dictionaries.
updated_description (Optional[str]) – If not None, will also update the dataset description
updated_authors (Optional[list[str]]) – If not None, will also update the dataset authors
- Returns:
handle for the dataset which includes the new additions
- Return type:
str
- new_dataset(dataset_name, data, dataset_metadata=None, strict=True)[source]¶
Create a new dataset with the provided data and metadata
The new dataset will have a human readable name specified by dataset_name and will ingest the data and metadata provided.
- Parameters:
dataset_name (str) – name of the dataset to be created
data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.
dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole. Current options are authors (str), description (str), and parameters (dict) which consists of two nested dictionaries named ‘universal’ and ‘code’ for the universal input parameter names and the code specific dictionaries.
strict (bool) – If strict, ingested data must all contain the properties specified in the property map.
Default: True
- Returns:
unique handle for the dataset
- Return type:
str
- get_data(dataset_handle, query_options=None, inspect=False, rename_properties=False, return_dataset_info=False)[source]¶
Extract data from storage
Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.
- Parameters:
dataset_handle (str) – ID of dataset
query_options (dict) – dict of options for data extraction and return
Default: None
inspect (bool) – whether to inspect data and print summary
rename_properties (Optional[bool]) – whether to rename properties based upon previous dataset’s property map. Useful to keep consistent naming when adding data to dataset
return_dataset_info (Optional[bool]) – whether to return dataset info such as name, authors, etc in addition to data
- Returns:
requested data as a list of ASE.Atoms objects and dataset info if return_dataset_info is True
- Return type:
list or list and dict
- update_data(dataset_handle, data, parameters=None, property_map=None, use_orig_property_map=True, new_properties=None, strict=True, updated_description=None, updated_authors=None)[source]¶
Update an existing dataset - adding new properties to configurations
This method operates on existing configurations and/or properties. Data is a list of ASE Atoms objects. NOTE: This should include all data that is to be associated with datasets. Call get_data if you want old data and potentially new data to be in dataset. The property map is automatically pulled from the original dataset. If this isn’t wanted set use_orig_property_map=False and specify property_map which should include mappings for all data to add. dataset_handle specifies the dataset where these data should be updated and should be the dataset ID, (DS_XXXXXX).
- Parameters:
dataset_handle (str) – ID of dataset
data (list[Atoms]) – list of ase.Atoms which include the new data to add
parameters (Optional[dict]) – The ‘universal’ and ‘code’ specific parameters from the simulations. These should be the same as the parameters in the database.
use_orig_property_map (bool) – whether or not to use the dataset’s original property map. Useful when get_data(rename_properties=True) has been used. If False self.property_map is used instead.
new_properties (Optional[dict]) – These properties will be added to the property_map via add_property_mapping
strict (bool) – If strict, ingested data must all contain the properties specified in the property map.
Default: True
updated_description (Optional[str]) – If not None, will also update the dataset description
updated_authors (Optional[list[str]]) – If not None, will also update the dataset authors
- Returns:
updated handle for the dataset
- Return type:
str
- list_data(dataset_handle=None, text=None, properties=None, elements=None, elements_exact=False)[source]¶
Utility function to query the database
Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents. Currently only dataset_handles which reference the dataset name (not the colabfit ID) will work for showing the selective query result.
- Parameters:
dataset_handle (str) – name of the dataset
Default: None
text (Optional[str]) – text to search for within the dataset. This can be authors, descriptions, uploader.
Default: None
properties (str) – name of properties to search for. Multiple should be included as “energy atomic-forces”
Default: None
elements (str) – elements to search for. Multiple should be included as “C H”. Will return datasets containing these plus other elements. See elements_exact
Default: None
elements_exact (bool) – whether to restrict element search to return datasets containing only specified elements
Default: False
- delete_dataset(dataset_handle, delete_children=True)[source]¶
Remove the dataset specified by dataset_handle from the database
- Parameters:
dataset_handle (str) – ID of dataset
delete_children (bool) – if true, will also delete all POs and COs (not associated with another DS)
- delete_items(item_ids_list)[source]¶
Remove the COs and/or POs specified by item_ids_list from the database
- dataset_intersection_and_differences(dataset1, dataset2, mode)[source]¶
returns the intersection or differences between two datasets
Behavior is controlled by the mode variable, which can be set to ‘intersection’ or ‘difference’, and the corresponding result is returned. If ‘difference’ is chosen, the returned ASE Atoms list contains all configurations IN dataset1 but NOT IN dataset2.
- Parameters:
dataset1 (str) – name of the first dataset to compare
dataset2 (str) – name of the second dataset to compare
mode (str) – switch for if the intersection or difference is returned
- Returns:
a list of ASE Atoms containing the shared configurations (‘intersection’) or the configurations found only in dataset1 (‘difference’)
- Return type:
list
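The two modes can be pictured with a small set-based sketch. This is only an illustration of the semantics described above: `compare_configs` is a hypothetical function, dictionaries with an `"id"` key stand in for ASE Atoms, and matching configurations by such an ID is an assumption about how identity is decided.

```python
def compare_configs(configs1, configs2, mode):
    """Return the configurations shared by both lists, or only in the first.

    Hypothetical stand-in: each configuration is a dict whose "id" key
    plays the role of a unique configuration identifier.
    """
    if mode not in ("intersection", "difference"):
        raise ValueError("mode must be 'intersection' or 'difference'")
    ids_in_second = {config["id"] for config in configs2}
    if mode == "intersection":
        return [c for c in configs1 if c["id"] in ids_in_second]
    # 'difference': in the first dataset but not in the second.
    return [c for c in configs1 if c["id"] not in ids_in_second]


dataset1 = [{"id": "CO_1"}, {"id": "CO_2"}, {"id": "CO_3"}]
dataset2 = [{"id": "CO_2"}, {"id": "CO_3"}, {"id": "CO_4"}]

shared = compare_configs(dataset1, dataset2, "intersection")
only_first = compare_configs(dataset1, dataset2, "difference")
```

Note that the difference is not symmetric: swapping the two arguments returns the configurations unique to the other dataset.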
- define_new_properties(property_list)[source]¶
Define new properties to add to the database
New properties only need to be defined once for the database.
- Parameters:
property_list (list[dict]) – List of dictionaries containing properties to be stored in a client
- set_property_map(keys=None, file_example=None)[source]¶
Set the mapping between input properties and colabfit representation
Definition of a set of basic properties to be stored in a Colabfit database. This will be used to map input data to the articulated properties which are stored in the Colabfit database. The property_map is used when inserting data into the database. A default property map is defined, but can be overwritten by setting self.property_map to the output of this function with specified keys/examples.
- Parameters:
keys (dict) – dictionary defining the mapping between ingested properties and their internal database representation. Keys can include ‘energy_field’, ‘force_field’, and ‘stress_field’, with the values corresponding to how that property is demarcated in the input. Additional keys can be included but must include their full mapping.
Default: None
file_example (str) – path to a file with a header representing the property tags, from which possible energy, force, and stress mappings (defined by the options in this method) are extracted
Default: None
- Returns:
dictionary with all properties used in a dataset
- Return type:
dict
- set_default_property_map()[source]¶
Set the default mapping between input properties and colabfit representation. Includes energy, atomic-forces, and cauchy-stress.
- Return type:
dict
- add_property_mapping(new_property_name, new_map, overwrite=False)[source]¶
add a new property entry to the internal property map
Example usage:
    storage.add_property_mapping(
        'new_property_name',
        {
            'key_1': {'field': 'key_1_for_ASE', 'units': None},
            'key_2': {'field': 'key_2_for_ASE', 'units': None},
        }
    )
- Parameters:
new_property_name (str) – name of property mapping being added
new_map (dict or list) – the colabfit-style property mapping. A dictionary specifying the 'field' which will be used to load the data off of an ASE atoms object (from the .info or .arrays dictionaries), and the units. Note that colabfit expects new_map to actually be a list; this function will wrap new_map in a list if it is not already one.
overwrite (bool) – True allows existing maps with the same name to be overwritten. Default is False.
- Returns:
updated property_map
- Return type:
dict
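The list-wrapping and overwrite behavior described for this method might look roughly like the sketch below. It operates on a plain dictionary rather than a ColabfitStorage instance. The function name `sketch_add_mapping` is hypothetical, and raising a `KeyError` when a mapping already exists and overwrite is False is an assumption, since the documentation does not say what happens in that case.

```python
def sketch_add_mapping(property_map, new_property_name, new_map,
                       overwrite=False):
    """Insert new_map into property_map under new_property_name.

    Hypothetical stand-in for add_property_mapping.
    """
    if new_property_name in property_map and not overwrite:
        # Assumed behavior; the real method may warn or skip instead.
        raise KeyError(f"{new_property_name!r} is already mapped")
    # Wrap a bare dictionary in a list, as the documentation describes.
    if not isinstance(new_map, list):
        new_map = [new_map]
    property_map[new_property_name] = new_map
    return property_map


property_map = {}
mapping = {
    "key_1": {"field": "key_1_for_ASE", "units": None},
    "key_2": {"field": "key_2_for_ASE", "units": None},
}
property_map = sketch_add_mapping(property_map, "new_property_name", mapping)

# A second call with the same name only succeeds with overwrite=True.
property_map = sketch_add_mapping(
    property_map, "new_property_name", [mapping], overwrite=True
)
```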
- get_dataset_property_map(dataset_id)[source]¶
Given a dataset_id will return the property_map that was used to ingest that dataset.
- Parameters:
dataset_id (str) – ID of dataset
- Returns:
dictionary with all properties used in a dataset
- Return type:
dict
- get_dataset_name_from_id(dataset_id)[source]¶
Given a dataset_id will return the dataset’s name
- Parameters:
dataset_id (str) – ID of dataset
- Returns:
name of the dataset
- Return type:
str
- update_property_definition(prop_def, new_keys)[source]¶
Updates an existing property definition with new keys
Only keys that are not currently part of the definition should be added in new_keys. Existing entries are populated with the provided default value. The form of new_keys should be similar to:
    {'energy': {
        'type': 'float',
        'has-unit': True,
        'extent': [],
        'required': True,
        'description': 'The potential energy of the system.',
        'default-value': None
    }}
The default default-value is NULL.
- Parameters:
prop_def (dict) – name of definition to update
new_keys (dict) – dict containing new keys to add with default values to populate existing entries
- setup_tables()[source]¶
Builds all necessary PostgreSQL tables. For use with newly created databases. Won’t affect existing databases if called. Also adds the energy, forces, and stress properties.
- Return type:
None
- static sort_configurations(configs)[source]¶
Given a list of Atoms will return a sorted version based upon what the CO-id would be. Useful for sorting configs to be in the same order as returned configurations from get_data.
- Parameters:
configs (list[Atoms]) – list of configurations
- Returns:
sorted configs
- Return type:
list(Atoms)
- get_dataset_input_parameters(dataset_id)[source]¶
Collect the input parameters associated with a dataset ID. This requires parsing two different tables and joining the results into a single output. If no input parameters are found, an error is returned. To enforce conformity within the dataset, only one set of input parameter values is generated and allowed per dataset family.
- Parameters:
dataset_id (str) – The dataset identification within the database.
- Returns universal:
Dictionary containing the universal input parameters.
- Returns code:
Dictionary containing code specific input parameters.
- Return type:
tuple[dict,dict]
Storage Builder¶
- orchestrator.storage.factory.storage_factory = <orchestrator.utils.module_factory.ModuleFactory object>¶
default factory for storage modules
- class orchestrator.storage.factory.StorageBuilder(factory=<orchestrator.utils.module_factory.ModuleFactory object>)[source]¶
Bases: ModuleBuilder
Constructor for storage modules added in the factory
set the factory to be used for the builder. The default is to use the storage_factory generated at the end of this module. A user defined StorageFactory can optionally be supplied instead.
- Parameters:
factory (ModuleFactory) – a storage factory
Default: storage_factory
- __init__(factory=<orchestrator.utils.module_factory.ModuleFactory object>)[source]¶
constructor for the StorageBuilder, sets the factory to build from
- Parameters:
factory (ModuleFactory) – a storage factory
Default: storage_factory
- build(storage_type, storage_args=None)[source]¶
Return an instance of the specified data storage
The build method takes the specifier and input arguments to construct a concrete storage instance.
- Parameters:
storage_type (str) – token of a storage which has been added to the factory
storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail
- Returns:
instantiated concrete Storage
- Return type:
Storage
- orchestrator.storage.factory.storage_builder = <orchestrator.storage.factory.StorageBuilder object>¶
storage builder object which can be imported for use in other modules
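The factory and builder arrangement in this section follows a common registry pattern, which the following self-contained sketch illustrates. None of these classes are the Orchestrator ones: `ToyModuleFactory`, `ToyStorageBuilder`, `ToyLocalStorage`, the `register` and `lookup` method names, and the `"LOCAL"` token are all invented for the example. Only the shape of `build(storage_type, storage_args)` and the LocalStorage default for database_name follow the documentation above.

```python
class ToyModuleFactory:
    """Registry mapping string tokens to classes (hypothetical)."""

    def __init__(self):
        self._registry = {}

    def register(self, token, module_class):
        self._registry[token] = module_class

    def lookup(self, token):
        try:
            return self._registry[token]
        except KeyError:
            raise ValueError(f"unknown module token: {token!r}") from None


class ToyLocalStorage:
    """Stand-in storage class that only records its arguments."""

    def __init__(self, database_path="./local_storage_database",
                 database_name=None, **kwargs):
        self.database_path = database_path
        # Documented default: the last component of database_path.
        self.database_name = (
            database_name or database_path.rstrip("/").split("/")[-1]
        )


class ToyStorageBuilder:
    """Looks up a token in the factory and instantiates the class."""

    def __init__(self, factory):
        self.factory = factory

    def build(self, storage_type, storage_args=None):
        storage_class = self.factory.lookup(storage_type)
        return storage_class(**(storage_args or {}))


toy_factory = ToyModuleFactory()
toy_factory.register("LOCAL", ToyLocalStorage)  # token name is made up
toy_builder = ToyStorageBuilder(toy_factory)
storage = toy_builder.build("LOCAL", {"database_path": "./scratch/demo_db"})
```

Separating the registry from the builder lets user code select a storage backend with a string token, for example from an input file, without importing the concrete class.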