Storage Module

Abstract Base Class

class orchestrator.storage.storage_base.Storage(**kwargs)[source]

Bases: Recorder, ABC

Abstract base class for data storage

The Storage class handles all functionality associated with data storage inside Orchestrator. Its functions include the initialization of the database, and data additions, updates, and queries. The Orchestrator uses a list of ASE Atoms as the internal data representation. A given database (Storage instance) can include multiple datasets (collections of configurations and properties) and generally persists over time.

Parameters:

storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail

__init__(**kwargs)[source]

Set variables and initialize the recorder

Parameters:

storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail

generate_dataset_name(root, specifier, counter=None, check_uniqueness=True)[source]

generate a detailed (mostly) human-readable dataset name

The dataset name has the form root_specifier[_counter]. If check_uniqueness is true, a unique hash is appended, giving root_specifier[_counter]_unique_hash.

Parameters:
  • root (str) – root of the dataset name, this should be consistent across similar runs (e.g. a campaign name)

  • specifier (str) – this argument gives more fine control of the dataset name, allowing differentiation within a given root

  • counter (int) – iteration number of the present root and specifier combination. This can be used for versioning of datasets.

    Default: None

  • check_uniqueness (boolean) – attaches a random hash to the dataset name if true, and ensures that the resulting dataset name is unique within the storage module.

    Default: True

Returns:

the dataset name

Return type:

str
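
The naming scheme described above can be sketched in plain Python. This is only an illustration of the documented format, not the Orchestrator implementation: the function name sketch_dataset_name, the existing_names argument, and the 8-character hexadecimal suffix are assumptions made for the example.

```python
import uuid


def sketch_dataset_name(root, specifier, counter=None,
                        check_uniqueness=True, existing_names=()):
    """Hypothetical stand-in illustrating the documented name format."""
    parts = [root, specifier]
    if counter is not None:
        parts.append(str(counter))
    base = "_".join(parts)
    if not check_uniqueness:
        return base
    # Assumption: an 8-character random hex suffix, retried on collision.
    while True:
        candidate = f"{base}_{uuid.uuid4().hex[:8]}"
        if candidate not in existing_names:
            return candidate


plain = sketch_dataset_name("campaign", "dft", counter=3,
                            check_uniqueness=False)
unique = sketch_dataset_name("campaign", "dft", counter=3)
print(plain)   # campaign_dft_3
print(unique)  # campaign_dft_3_ followed by 8 random hex characters
```

In the real module the uniqueness check is delegated to check_if_dataset_name_unique rather than to a list of names passed in by the caller.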

abstract check_if_dataset_name_unique(dataset_name)[source]

check if the provided dataset_name is unique in the database

Parameters:

dataset_name (str) – name to check (human readable)

Returns:

true if the dataset is not present in the database, false if it does exist

Return type:

boolean

abstract add_data(dataset_handle, data, dataset_metadata=None)[source]

Add new configurations (and associated properties) to the database

This method is used to add to an existing dataset with new configurations. The new configurations may or may not have other properties associated with them.

Parameters:
  • dataset_handle (str or int) – name or ID of dataset

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole.

Returns:

handle for the dataset which includes the new additions

Return type:

str or int

abstract new_dataset(dataset_handle, data, dataset_metadata=None)[source]

Create a new dataset with the provided data and metadata

The new dataset will have a human readable name specified by dataset_handle and will ingest the data and metadata provided.

Parameters:
  • dataset_handle (str) – name of the dataset to be created

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole.

Returns:

unique handle for the dataset, i.e. its ID

Return type:

str or int

abstract update_data(dataset_handle, data, metadata=None)[source]

Update an existing dataset - overwriting or adding new properties

This method operates on existing configurations and/or properties. Data are provided as a KliFF dataset of properties that should be added to either the configuration as a new property or overwriting existing properties within the database.

Parameters:
  • dataset_handle (str or int) – name or ID of dataset

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • metadata (dict) – A dictionary of metadata specific to the dataset as a whole.

Returns:

unique handle for the dataset

Return type:

str

abstract get_data(dataset_handle, query_options=None)[source]

Extract data from Storage

Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.

Parameters:
  • dataset_handle (str or int) – name or ID of dataset

  • query_options (dict) – dict of options for data extraction and return

    Default: None

Returns:

requested data as a list of ASE Atoms

Return type:

list

abstract delete_dataset(dataset_handle)[source]

Remove the dataset specified by dataset_handle from the database

Parameters:

dataset_handle (str) – name or ID of dataset

abstract list_data(dataset_handle=None)[source]

Utility function to query the database

Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents.

Parameters:

dataset_handle (str or int) – name or ID of dataset

Default: None
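
To make the life cycle implied by the abstract interface concrete (create a dataset, append to it, read it back, delete it), here is a toy in-memory stand-in. It does not import or subclass the real Storage class. The class name InMemoryStorageSketch is hypothetical, plain dictionaries stand in for ASE Atoms objects, and raising ValueError on a duplicate name is an assumption.

```python
class InMemoryStorageSketch:
    """Toy stand-in mirroring the documented method names (not the real ABC)."""

    def __init__(self):
        # dataset name -> {"data": [...], "metadata": {...}}
        self._datasets = {}

    def check_if_dataset_name_unique(self, dataset_name):
        return dataset_name not in self._datasets

    def new_dataset(self, dataset_handle, data, dataset_metadata=None):
        # Assumption: creating a dataset under an existing name is an error.
        if not self.check_if_dataset_name_unique(dataset_handle):
            raise ValueError(f"dataset {dataset_handle!r} already exists")
        self._datasets[dataset_handle] = {
            "data": list(data),
            "metadata": dict(dataset_metadata or {}),
        }
        return dataset_handle

    def add_data(self, dataset_handle, data, dataset_metadata=None):
        entry = self._datasets[dataset_handle]
        entry["data"].extend(data)
        entry["metadata"].update(dataset_metadata or {})
        return dataset_handle

    def get_data(self, dataset_handle, query_options=None):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._datasets[dataset_handle]["data"])

    def delete_dataset(self, dataset_handle):
        del self._datasets[dataset_handle]


store = InMemoryStorageSketch()
handle = store.new_dataset("campaign_dft_0", [{"symbols": "Cu4"}])
handle = store.add_data(handle, [{"symbols": "Cu8"}])
n_configs = len(store.get_data(handle))
store.delete_dataset(handle)
```

Each mutating call returns a handle, which matches the documented pattern of passing the returned handle on to the next call instead of assuming it is unchanged.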

Concrete Implementations

Local (demo/testing)

class orchestrator.storage.local.LocalStorage(database_path='./local_storage_database', database_name=None, **kwargs)[source]

Bases: Storage

Class to store data in the local disk

The Storage class handles all functionality associated with data storage inside Orchestrator. Its functions include the initialization of the database, and data additions, updates, and queries. The Orchestrator uses ASE Atoms as the internal data representation. A given database (Storage instance) can include multiple datasets (collections of configurations and properties) and generally persists over time.

Parameters:

storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. database_path defaults to ‘./local_storage_database’ while database_name defaults to the last string component from the database_path. LocalStorage does not require any additional arguments

__init__(database_path='./local_storage_database', database_name=None, **kwargs)[source]

Set variables and initialize the recorder

Parameters:
  • database_path (str) – Path to the local storage database

    Default: './local_storage_database'

  • database_name (Optional[str]) – Name of the local database

check_if_dataset_name_unique(dataset_name)[source]

check if the provided dataset_name is unique in the database

Parameters:

dataset_name (str) – name to check (human readable)

Returns:

true if the dataset is not present in the database, false if it does exist

Return type:

boolean

add_data(dataset_handle, data, dataset_metadata=None)[source]

Add new configurations (and associated properties) to the database

This method is used to add to an existing dataset with new configurations. The new configurations may or may not have other properties associated with them.

Parameters:
  • dataset_handle (str) – name of dataset

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole.

Returns:

handle for the dataset which includes the new additions

Return type:

str

new_dataset(dataset_handle, data, dataset_metadata=None)[source]

Create a new dataset with the provided data and metadata

The new dataset will have a human readable name specified by dataset_handle and will ingest the data and metadata provided.

Parameters:
  • dataset_handle (str) – name of the dataset to be created

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole.

Returns:

name of the dataset

Return type:

str

update_data(dataset_handle, data, new_data_key)[source]

Update an existing dataset - overwriting or adding new properties

This method operates on existing configurations and/or properties. Data are provided as a KliFF dataset of properties that should be added to either the configuration as a new property or overwriting existing properties within the database.

Parameters:
  • dataset_handle (str or int) – name or ID of dataset

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata – A dictionary of metadata specific to the dataset as a whole.

Returns:

unique handle for the dataset

Return type:

str

get_data(dataset_handle, query_options=None)[source]

Extract data from storage

Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.

Parameters:
  • dataset_handle (str) – name of the dataset to extract

  • query_options (dict) – dict of options for data extraction and return

    Default: None

Returns:

requested data as a list of ASE Atoms

Return type:

list

delete_dataset(dataset_handle)[source]

Remove the dataset specified by dataset_handle from the database

Parameters:

dataset_handle (str) – name or ID of dataset

list_data(dataset_handle=None)[source]

Utility function to query the database

Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents.

Parameters:

dataset_handle (str) – name of dataset

Default: None

Colabfit (production)

class orchestrator.storage.colabfit.ColabfitStorage(credential_file=None, database_path=None, database_name=None, database_port=None, database_user=None, database_password=None, external_file=None, **kwargs)[source]

Bases: Storage

Manage data using Colabfit

Colabfit documentation can be found at: https://colabfit.github.io/colabfit-tools/html/index.html.

Parameters:

storage_args (dict) – dictionary with initialization parameters, including database_name, database_path, external_file, and credential_file. database_path is the URI of the PostgreSQL data server (required). database_name is the name of the PostgreSQL database client (required). external_file is the explicit path to an LMDB file used to handle configurations larger than 20,000 atoms; Colabfit generates this file if it does not yet exist (optional). credential_file is a path to a JSON file which contains the database_path, database_name, and optionally the external_file path. If a credential_file is provided, its contents override any other arguments. None of these parameters have default values.

__init__(credential_file=None, database_path=None, database_name=None, database_port=None, database_user=None, database_password=None, external_file=None, **kwargs)[source]
Parameters:
  • credential_file (str) – Path to a JSON file with the path, name, port, user, password, and external_file keys. This is the preferred method for initializing a storage module. No other keys are needed if credential_file is set

  • database_path (str) – URI to the PostgreSQL data server

  • database_name (str) – Name of the PostgreSQL database client

  • database_port (str) – Port for the PostgreSQL server

  • database_user (str) – Username for the PostgreSQL server

  • database_password (str) – Password for the PostgreSQL server

  • external_file (str) – Path to an LMDB file for large configurations
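
Since a credential file is described as the preferred way to initialize the module, a minimal sketch of writing one may help. The key names below are an assumption: they mirror the constructor arguments, whereas the parameter description lists them in shortened form (path, name, port, ...), so the exact spelling should be checked against the implementation. All values are placeholders.

```python
import json
import os
import tempfile

# Assumption: key names mirror the constructor arguments.
# Every value here is a placeholder, not a working credential.
credentials = {
    "database_path": "localhost",
    "database_name": "example_db",
    "database_port": "5432",  # 5432 is the default PostgreSQL port
    "database_user": "example_user",
    "database_password": "example_password",
    "external_file": None,
}

credential_file = os.path.join(tempfile.mkdtemp(), "credentials.json")
with open(credential_file, "w") as handle:
    json.dump(credentials, handle, indent=2)

# Read the file back to confirm it is valid JSON with the expected keys.
with open(credential_file) as handle:
    loaded = json.load(handle)
```

The resulting path would then be passed as credential_file. Keeping the password in a file with restricted permissions avoids placing it in scripts or input decks.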

check_if_dataset_name_unique(dataset_name)[source]

check if the provided dataset_name is unique in the database

Parameters:

dataset_name (str) – name to check (human readable)

Returns:

true if the database is not present in the database, false if it does exist

Return type:

boolean

add_data(dataset_handle, data, dataset_metadata=None, updated_description=None, updated_authors=None)[source]

Add new configurations (and associated properties) to the db.

This method is used to add new configurations to an existing dataset. update_data can serve the same role (along with others) but requires all data (new and existing) to be passed in as an argument. Assumes the property format (property_map) is the same as the original dataset.

Parameters:
  • dataset_handle (str) – name or ID of dataset

  • data (list[Atoms]) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (Optional[dict]) – A dictionary of metadata specific to the dataset as a whole. This function needs to have ‘parameters’ provided, which consists of ‘universal’ and ‘code’ nested dictionaries.

  • updated_description (Optional[str]) – If not None, will also update the dataset description

  • updated_authors (Optional[list[str]]) – If not None, will also update the dataset authors

Return type:

str

Returns:

handle for the dataset which includes the new additions

new_dataset(dataset_name, data, dataset_metadata=None, strict=True)[source]

Create a new dataset with the provided data and metadata

The new dataset will have a human readable name specified by dataset_name and will ingest the data and metadata provided.

Parameters:
  • dataset_name (str) – name of the dataset to be created

  • data (list) – list of ASE.Atoms objects containing the configurations and associated properties to add to the database. Note that configuration-specific metadata should be stored under the atoms.info[METADATA_KEY] field.

  • dataset_metadata (dict) – A dictionary of metadata specific to the dataset as a whole. Current options are authors (str), description (str), and parameters (dict) which consists of two nested dictionaries named ‘universal’ and ‘code’ for the universal input parameter names and the code specific dictionaries.

  • strict (bool) – If strict, ingested data must all contain the properties specified in the property map.

    Default: True

Returns:

unique handle for the dataset

Return type:

str
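
The dataset_metadata layout described above (authors, description, and a parameters dictionary with nested 'universal' and 'code' entries) can be illustrated with a small check. The helper check_dataset_metadata is hypothetical and not part of ColabfitStorage, and the parameter names inside 'universal' and 'code' are placeholders chosen for the example.

```python
def check_dataset_metadata(metadata):
    """Hypothetical validator for the documented dataset_metadata layout."""
    allowed = {"authors", "description", "parameters"}
    unknown = set(metadata) - allowed
    if unknown:
        raise KeyError(f"unexpected metadata keys: {sorted(unknown)}")
    parameters = metadata.get("parameters")
    if parameters is not None:
        missing = {"universal", "code"} - set(parameters)
        if missing:
            raise KeyError(f"parameters is missing: {sorted(missing)}")
    return True


dataset_metadata = {
    "authors": "A. Researcher",
    "description": "Example dataset of bulk configurations",
    "parameters": {
        # Parameter names below are illustrative placeholders only.
        "universal": {"kpoint_spacing": 0.2},
        "code": {"ecutwfc": 60},
    },
}
metadata_ok = check_dataset_metadata(dataset_metadata)
```

A dictionary of this shape would be passed as dataset_metadata. Whether the real method rejects unknown keys is not stated in the documentation.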

get_data(dataset_handle, query_options=None, inspect=False, rename_properties=False, return_dataset_info=False)[source]

Extract data from storage

Return the dataset specified by dataset_handle as a list of ASE Atoms. Further options for parameterizing the extraction can be provided by the query_options dictionary.

Parameters:
  • dataset_handle (str) – ID of dataset

  • query_options (dict) – dict of options for data extraction and return

    Default: None

  • inspect (bool) – whether to inspect data and print summary

  • rename_properties (Optional[bool]) – whether to rename properties based upon previous dataset’s property map. Useful to keep consistent naming when adding data to dataset

  • return_dataset_info (Optional[bool]) – whether to return dataset info such as name, authors, etc in addition to data

Returns:

requested data as a list of ASE.Atoms objects and dataset info if return_dataset_info is True

Return type:

list or list and dict

update_data(dataset_handle, data, parameters=None, property_map=None, use_orig_property_map=True, new_properties=None, strict=True, updated_description=None, updated_authors=None)[source]

Update an existing dataset - adding new properties to configurations

This method operates on existing configurations and/or properties. data is a list of ASE Atoms objects. NOTE: data should include all data that is to be associated with the dataset; call get_data first if both old and new data should be in the dataset. The property map is automatically pulled from the original dataset. If this is not wanted, set use_orig_property_map=False and specify property_map, which should include mappings for all data to add. dataset_handle specifies the dataset where these data should be updated and should be the dataset ID (DS_XXXXXX).

Parameters:
  • dataset_handle (str) – ID of dataset

  • data (list[Atoms]) – list of ase.Atoms which include the new data to add

  • parameters (Optional[dict]) – The ‘universal’ and ‘code’ specific parameters from the simulations. These should be the same as the parameters in the database.

  • use_orig_property_map (bool) – whether or not to use the dataset’s original property map. Useful when get_data(rename_properties=True) has been used. If False self.property_map is used instead.

  • new_properties (Optional[dict]) – These properties will be added to the property_map via add_property_mapping

  • strict (bool) – If strict, ingested data must all contain the properties specified in the property map.

    Default: True

  • updated_description (Optional[str]) – If not None, will also update the dataset description

  • updated_authors (Optional[list[str]]) – If not None, will also update the dataset authors

Return type:

str

Returns:

updated handle for the dataset

list_data(dataset_handle=None, text=None, properties=None, elements=None, elements_exact=False)[source]

Utility function to query the database

Prints an overview of the database contents if no dataset_handle is provided, otherwise provides information about the specific dataset contents. Currently only dataset_handles which reference the dataset name (not the colabfit ID) will work for showing the selective query result.

Parameters:
  • dataset_handle (str) – name of the dataset

    Default: None

  • text (Optional[str]) – text to search for within the dataset. This can be authors, descriptions, uploader.

    Default: None

  • properties (str) – name of properties to search for. Multiple should be included as “energy atomic-forces”

    Default: None

  • elements (str) – elements to search for. Multiple should be included as “C H”. Will return datasets containing these plus other elements. See elements_exact

    Default: None

  • elements_exact (bool) – whether to restrict element search to return datasets containing only specified elements

    Default: False
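
The difference between the default element search and elements_exact=True can be sketched for a single dataset as follows. The helper matches_elements is hypothetical, and it assumes that "only specified elements" means the dataset's element set equals the queried set. The actual query runs inside the database.

```python
def matches_elements(dataset_elements, elements, elements_exact=False):
    """Sketch of the documented element filter applied to one dataset."""
    wanted = set(elements.split())  # "C H" -> {"C", "H"}
    present = set(dataset_elements)
    if elements_exact:
        # Assumption: exact means the two element sets are identical.
        return present == wanted
    # Default: the dataset may contain additional elements.
    return wanted <= present


hydrocarbons = ["C", "H"]
alcohols = ["C", "H", "O"]
query = " ".join(["C", "H"])  # space-separated string, as documented

loose = [matches_elements(d, query) for d in (hydrocarbons, alcohols)]
exact = [matches_elements(d, query, elements_exact=True)
         for d in (hydrocarbons, alcohols)]
print(loose)  # [True, True]
print(exact)  # [True, False]
```

The properties argument uses the same space-separated convention, for example "energy atomic-forces".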

delete_dataset(dataset_handle, delete_children=True)[source]

Remove the dataset specified by dataset_handle from the database

Parameters:
  • dataset_handle (str) – ID of dataset

  • delete_children (bool) – if true, will also delete all POs and COs (not associated with another DS)

delete_items(item_ids_list)[source]

Remove the COs and/or POs specified by item_ids_list from the database

dataset_intersection_and_differences(dataset1, dataset2, mode)[source]

returns the intersection or differences between two datasets

Behavior is controlled by the mode variable, which can be set to ‘intersection’ or ‘difference’. The corresponding results will be returned. If ‘difference’ is chosen, the returned ASE Atoms list contains all configurations IN dataset1 but NOT IN dataset2.

Parameters:
  • dataset1 (str) – name of the first dataset to compare

  • dataset2 (str) – name of the second dataset to compare

  • mode (str) – switch for if the intersection or difference is returned

Returns:

a list of ASE Atoms containing the intersection or the difference of the two datasets, depending on mode

Return type:

list
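
The two modes, and in particular the asymmetry of 'difference', can be illustrated with configuration IDs standing in for ASE Atoms objects. The function compare_sketch and the ID strings are hypothetical. The real method compares configurations stored in the database.

```python
def compare_sketch(ids1, ids2, mode):
    """Sketch of the two modes, using configuration IDs as stand-ins."""
    second = set(ids2)
    if mode == "intersection":
        return [i for i in ids1 if i in second]
    if mode == "difference":
        # Configurations IN the first dataset but NOT IN the second.
        return [i for i in ids1 if i not in second]
    raise ValueError("mode must be 'intersection' or 'difference'")


dataset1 = ["CO_a", "CO_b", "CO_c"]
dataset2 = ["CO_b", "CO_c", "CO_d"]

shared = compare_sketch(dataset1, dataset2, "intersection")
only_first = compare_sketch(dataset1, dataset2, "difference")
only_second = compare_sketch(dataset2, dataset1, "difference")
print(shared)       # ['CO_b', 'CO_c']
print(only_first)   # ['CO_a']
print(only_second)  # ['CO_d']
```

Swapping the two arguments changes the result of 'difference' but not of 'intersection'.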

define_new_properties(property_list)[source]

Define new properties to add to the database

New properties only need to be defined once for the database.

Parameters:

property_list (list[dict]) – List of dictionaries containing properties to be stored in a client

set_property_map(keys=None, file_example=None)[source]

Set the mapping between input properties and colabfit representation

Definition of a set of basic properties to be stored in a Colabfit database. This will be used to map input data to the articulated properties which are stored in the Colabfit database. The property_map is used when inserting data into the database. A default property map is defined, but can be overwritten by setting self.property_map to the output of this function with specified keys/examples.

Parameters:
  • keys (dict) – dictionary defining the mapping between ingested properties and their internal database representation. Keys can include ‘energy_field’, ‘force_field’, and ‘stress_field’, with the values corresponding to how that property is demarcated in the input. Additional keys can be included but must include their full mapping.

    Default: None

  • file_example (str) – path to a file with a header representing the property tags, from which possible energy, force, and stress mappings (defined by the options in this method) are extracted

    Default: None

Return type:

dict

Returns:

dictionary with all properties used in a dataset

set_default_property_map()[source]

Set the default mapping between input properties and colabfit representation. Includes energy, atomic-forces, and cauchy-stress.

Return type:

dict

check_example_config(example_config)[source]

add_property_mapping(new_property_name, new_map, overwrite=False)[source]

add a new property entry to the internal property map

Example usage:

storage.add_property_mapping(
    'new_property_name',
    {
        'key_1': {'field': 'key_1_for_ASE', 'units': None},
        'key_2': {'field': 'key_2_for_ASE', 'units': None},
    }
)

Parameters:
  • new_property_name (str) – name of property mapping being added

  • new_map (dict or list) – the colabfit-style property mapping. A dictionary specifying the 'field' which will be used to load the data off of an ASE atoms object (from the .info or .arrays dictionaries), and the units. Note that colabfit expects new_map to actually be a list; this function will wrap new_map in a list if it is not already one.

  • overwrite (bool) – True allows existing maps with the same name to be overwritten. Default is False.

Returns:

updated property_map

Return type:

dict
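
The list-wrapping and overwrite behavior described for new_map and overwrite can be sketched on a plain dictionary. The function add_mapping_sketch is hypothetical, and raising ValueError when a name already exists and overwrite is False is an assumption. The documentation does not say how the real method reports that case.

```python
def add_mapping_sketch(property_map, new_property_name, new_map,
                       overwrite=False):
    """Sketch of adding one entry to a colabfit-style property map."""
    if new_property_name in property_map and not overwrite:
        # Assumption: refusing to overwrite is signalled with an exception.
        raise ValueError(f"{new_property_name!r} is already mapped")
    # colabfit expects a list of mappings, so wrap a bare dict.
    if not isinstance(new_map, list):
        new_map = [new_map]
    property_map[new_property_name] = new_map
    return property_map


property_map = {}
mapping = {
    "key_1": {"field": "key_1_for_ASE", "units": None},
    "key_2": {"field": "key_2_for_ASE", "units": None},
}
property_map = add_mapping_sketch(property_map, "new_property_name", mapping)
print(type(property_map["new_property_name"]).__name__)  # list
```

Passing a list leaves it unchanged, so callers can supply either form.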

get_dataset_property_map(dataset_id)[source]

Given a dataset_id will return the property_map that was used to ingest that dataset.

Parameters:

dataset_id (str) – ID of dataset

Return type:

dict

Returns:

dictionary with all properties used in a dataset

get_dataset_name_from_id(dataset_id)[source]

Given a dataset_id will return the dataset’s name

Parameters:

dataset_id (str) – ID of dataset

Return type:

str

Returns:

name of the dataset

get_property_definitions()[source]
Returns:

all properties currently in database

Return type:

list

update_property_definition(prop_def, new_keys)[source]

Updates an existing property definition with new keys

Only keys that are not currently a part of the definition should be added in new_keys. Existing entries are populated with the provided default value. The form of new_keys should be similar to:

{'energy': {
    'type': 'float',
    'has-unit': True,
    'extent': [],
    'required': True,
    'description': 'The potential energy of the system.',
    'default-value': None
}}

If no default-value is given, it defaults to NULL.

Parameters:
  • prop_def (dict) – name of definition to update

  • new_keys (dict) – dict containing new keys to add with default values to populate existing entries
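
The backfilling idea (new keys are added to a definition and existing entries receive the default value) can be sketched loosely with in-memory data. Everything here is an assumption made for illustration: the function extend_definition_sketch, the representation of stored entries as a list of dictionaries, and the key name 'per-atom'. The real method operates on database tables.

```python
def extend_definition_sketch(definition, rows, new_keys):
    """Add new keys to a definition and backfill existing rows."""
    overlap = set(new_keys) & set(definition)
    if overlap:
        # Only keys not already in the definition may be added.
        raise ValueError(f"keys already defined: {sorted(overlap)}")
    for key, spec in new_keys.items():
        definition[key] = dict(spec)
        default = spec.get("default-value")  # None plays the role of NULL
        for row in rows:
            row.setdefault(key, default)
    return definition, rows


definition = {"energy": {"type": "float", "has-unit": True, "extent": [],
                         "required": True,
                         "description": "The potential energy of the system."}}
rows = [{"energy": -1.0}, {"energy": -2.5}]
new_keys = {"per-atom": {"type": "bool", "has-unit": False, "extent": [],
                         "required": False,
                         "description": "Whether the energy is per atom.",
                         "default-value": False}}

definition, rows = extend_definition_sketch(definition, rows, new_keys)
print(rows)  # [{'energy': -1.0, 'per-atom': False}, {'energy': -2.5, 'per-atom': False}]
```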

setup_tables()[source]

Builds all necessary PostgreSQL tables. For use with newly created databases; calling it on an existing database has no effect. Also adds the energy, forces, and stress property definitions.

Return type:

None

static sort_configurations(configs)[source]

Given a list of Atoms will return a sorted version based upon what the CO-id would be. Useful for sorting configs to be in the same order as returned configurations from get_data.

Parameters:

configs (list[Atoms]) – list of configurations

Returns:

sorted configs

Return type:

list(Atoms)
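
Sorting by a content-derived ID gives an order that does not depend on how the input list was arranged, which is what makes it useful for lining up local configurations with the results of get_data. The sketch below assumes a SHA-512 hash of a JSON dump as a stand-in for the CO-id. The real CO-id is computed by Colabfit and will differ, and plain dictionaries stand in for ASE Atoms objects.

```python
import hashlib
import json


def content_id_sketch(config):
    """Hypothetical stand-in for a CO-id: a hash of the configuration."""
    payload = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha512(payload).hexdigest()


def sort_configurations_sketch(configs):
    # Sorting by the content hash ignores the original list order.
    return sorted(configs, key=content_id_sketch)


configs = [
    {"symbols": ["Cu", "Cu"], "positions": [[0, 0, 0], [1.8, 1.8, 0]]},
    {"symbols": ["Al"], "positions": [[0, 0, 0]]},
    {"symbols": ["Ni", "Al"], "positions": [[0, 0, 0], [1.4, 1.4, 1.4]]},
]
order_a = sort_configurations_sketch(configs)
order_b = sort_configurations_sketch(list(reversed(configs)))
print(order_a == order_b)  # True
```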

get_dataset_input_parameters(dataset_id)[source]

Collect the input parameters associated with a dataset ID. This requires parsing two different tables and joining the results into a single output. If no input parameters are found, an error is returned. To enforce conformity within a dataset, only one set of input parameter values is generated and allowed per dataset family.

Parameters:

dataset_id (str) – The dataset identification within the database.

Returns:

  • universal – Dictionary containing the universal input parameters.

  • code – Dictionary containing code specific input parameters.

Return type:

tuple[dict, dict]

Storage Builder

orchestrator.storage.factory.storage_factory = <orchestrator.utils.module_factory.ModuleFactory object>

default factory for storage modules

class orchestrator.storage.factory.StorageBuilder(factory=<orchestrator.utils.module_factory.ModuleFactory object>)[source]

Bases: ModuleBuilder

Constructor for storage modules added in the factory

set the factory to be used for the builder. The default is to use the storage_factory generated at the end of this module. A user defined StorageFactory can optionally be supplied instead.

Parameters:

factory (ModuleFactory) – a storage factory

Default: storage_factory

__init__(factory=<orchestrator.utils.module_factory.ModuleFactory object>)[source]

constructor for the StorageBuilder, sets the factory to build from

Parameters:

factory (ModuleFactory) – a storage factory

Default: storage_factory

build(storage_type, storage_args=None)[source]

Return an instance of the specified data storage

The build method takes the specifier and input arguments to construct a concrete storage instance.

Parameters:
  • storage_type (str) – token of a storage which has been added to the factory

  • storage_args (dict) – dictionary with initialization parameters, including database_name and database_path. See module documentation for greater detail

Returns:

instantiated concrete Storage

Return type:

Storage
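
The token-based construction performed by build can be sketched with a generic factory and builder. None of the names below come from Orchestrator: FactorySketch, BuilderSketch, DummyStorage, the register and select methods, and the "DUMMY" token are all hypothetical stand-ins for ModuleFactory, StorageBuilder, and a concrete Storage class.

```python
import os


class FactorySketch:
    """Hypothetical registry mapping tokens to storage classes."""

    def __init__(self):
        self._modules = {}

    def register(self, token, cls):
        self._modules[token] = cls

    def select(self, token):
        try:
            return self._modules[token]
        except KeyError:
            raise ValueError(f"no module registered as {token!r}") from None


class BuilderSketch:
    """Hypothetical builder: look up the class, then instantiate it."""

    def __init__(self, factory):
        self.factory = factory

    def build(self, storage_type, storage_args=None):
        cls = self.factory.select(storage_type)
        return cls(**(storage_args or {}))


class DummyStorage:
    def __init__(self, database_path="./db", database_name=None, **kwargs):
        self.database_path = database_path
        # Mirrors the documented LocalStorage default: last path component.
        self.database_name = database_name or os.path.basename(
            os.path.normpath(database_path))


factory = FactorySketch()
factory.register("DUMMY", DummyStorage)
builder = BuilderSketch(factory)
storage = builder.build("DUMMY", {"database_path": "./runs/my_db"})
print(storage.database_name)  # my_db
```

Separating the registry from the builder lets user-defined storage classes be added under new tokens without changing the code that calls build.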

orchestrator.storage.factory.storage_builder = <orchestrator.storage.factory.StorageBuilder object>

storage builder object which can be imported for use in other modules