Score Module

Abstract Base Class

class orchestrator.computer.score.score_base.ScoreQuantity(value)[source]

Bases: Enum

An enumeration.

IMPORTANCE = 0
UNCERTAINTY = 1
EFFICIENCY = 2
DIVERSITY = 3
DELTA_ENTROPY = 4
SENSITIVITY = 5
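
The members listed above behave like any standard-library Enum. As a minimal sketch, the local stand-in class below (defined here for illustration, not imported from orchestrator) shows lookup by value and by name:

```python
from enum import Enum


class ScoreQuantity(Enum):
    """Local stand-in that mirrors the members listed above."""
    IMPORTANCE = 0
    UNCERTAINTY = 1
    EFFICIENCY = 2
    DIVERSITY = 3
    DELTA_ENTROPY = 4
    SENSITIVITY = 5


# Look up a member by its integer value or by its name.
by_value = ScoreQuantity(1)
by_name = ScoreQuantity["SENSITIVITY"]
print(by_value.name, by_name.value)
```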
class orchestrator.computer.score.score_base.ScoreBase(**kwargs)[source]

Bases: Computer

Abstract base class for an object which returns a “score” of its inputs. For example, an “importance” score or an “uncertainty” score. Details of the inputs or outputs will vary across implementations.

OUTPUT_KEY = 'score'
suppported_score_quantities = []
data_file_name = 'score_results.xyz'
output_file_name = 'score_results.xyz'
init_args_file_name = 'score_init_args.json'
init_args_subdir = 'score_init_args_temp_files'
compute_args_file_name = 'score_compute_args.json'
compute_args_subdir = 'score_compute_args_temp_files'
script_file_name = 'score_compute_script.py'
abstract compute(atoms, score_quantity, **kwargs)[source]

Runs the calculation for a single atomic configuration. This method is intended for serial (non-distributed) use, outside of a full orchestrator workflow.

Parameters:
  • atoms (Atoms) – the ASE Atoms object

  • score_quantity (int) – the type of score value to compute

Returns:

the score, where the first dimension should be the number of atoms.

Return type:

np.ndarray

abstract compute_batch(list_of_atoms, score_quantity, **kwargs)[source]

Runs the calculation for a batch of atomic configurations. This method is intended for serial (non-distributed) use, outside of a full orchestrator workflow.

Parameters:
  • list_of_atoms (list[Atoms]) – a list of ASE Atoms objects

  • score_quantity (int) – the type of score value to compute

Returns:

a list of scores for each atomic configuration

Return type:

list
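
The relationship between the two abstract methods can be sketched with a toy subclass. ToyScoreBase and AtomCountScore are hypothetical stand-ins written for this example; the real ScoreBase carries additional workflow machinery that is not reproduced here. Plain lists of element symbols stand in for ASE Atoms objects.

```python
from abc import ABC, abstractmethod


class ToyScoreBase(ABC):
    """Hypothetical stand-in for ScoreBase with only the two abstract methods."""

    @abstractmethod
    def compute(self, atoms, score_quantity, **kwargs):
        ...

    @abstractmethod
    def compute_batch(self, list_of_atoms, score_quantity, **kwargs):
        ...


class AtomCountScore(ToyScoreBase):
    """Toy scorer: every atom gets a score of 1 / number_of_atoms."""

    def compute(self, atoms, score_quantity, **kwargs):
        # Delegate to the batched path with a single-item list.
        return self.compute_batch([atoms], score_quantity, **kwargs)[0]

    def compute_batch(self, list_of_atoms, score_quantity, **kwargs):
        # One list of per-atom scores per configuration.
        return [[1.0 / len(a)] * len(a) for a in list_of_atoms]


scorer = AtomCountScore()
single = scorer.compute(["H", "H", "O"], score_quantity=0)
batch = scorer.compute_batch([["H", "H"], ["C", "O", "O", "O"]], score_quantity=0)
```

The first dimension of each result equals the number of atoms in the corresponding configuration, as the Returns description for compute() requires.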

get_run_command(**kwargs)[source]

Return the command to run calculations within a workflow. This allows for distributed execution of compute().

Returns:

string for execution via command line

Return type:

str

get_batched_run_command(**kwargs)[source]

Similar to get_run_command(), this function is meant to support executing compute_batch() within a workflow.

Returns:

string for execution via command line

Return type:

str

run(path_type, configs, compute_args, workflow=None, job_details=None, batch_size=1, verbose=False)[source]

Main function to compute the score for a collection of atomic configurations.

The run method includes half of the main functionality of the score module, taking atomic configurations as input and handling the submission of calculations to obtain the computed results. configs is a dataset of 1 or more structures. run() will create independent jobs for each structure using the supplied workflow, with job_details parameterizing the job submission.

Parameters:
  • path_type (str) – specifier for the workflow path, to differentiate calculation types

  • compute_args (dict) – input arguments to fill out the input file

  • configs (list) – list of configurations or data samples to be used for score calculation. Each item can be an ASE Atoms object or any other format supported by the score module.

  • workflow (Workflow) – the workflow for managing job submission. If none is supplied, the default workflow defined in this class is used.

    Default: None

  • job_details (dict) – dict that includes any additional parameters for running the job (passed to submit_job())

    Default: {}

  • batch_size (int) – number of configurations to pass to compute() at once. Default of 1 does not do any batching.

  • verbose (bool) – if True, show progress

Returns:

a list of calculation IDs from the workflow.

Return type:

list
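
How run() groups configurations internally is not documented here. As a sketch of what the batch_size argument implies, the hypothetical helper make_batches below splits a list into consecutive chunks, so that the default of 1 yields one job per configuration:

```python
def make_batches(configs, batch_size=1):
    """Split configs into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [configs[i:i + batch_size] for i in range(0, len(configs), batch_size)]


configs = ["cfg0", "cfg1", "cfg2", "cfg3", "cfg4"]

# Default: no batching, one job per configuration.
per_config_jobs = make_batches(configs)

# batch_size=2: the last chunk holds the single leftover configuration.
batched_jobs = make_batches(configs, batch_size=2)
```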

write_input(run_path, compute_args, configs)[source]

Generate input files for running the calculation.

This method writes the requisite input files in the run_path. Specific implementations may leverage additional helper functions to construct the input. Notably, any arguments that are passed as in-memory arrays will be written out to temporary files, which will be removed later by .cleanup().

Parameters:
  • run_path (str) – directory path where the file is written

  • compute_args (dict) – arguments for the computer

  • configs (list) – list of configurations or data samples to be used for score calculation. Each item can be an ASE Atoms object or any other format supported by the score module.

Returns:

name of written input file

Return type:

str

cleanup(run_path=None)[source]

Removes any temporary files that were created for job execution.

Parameters:

run_path (str) – the parent directory containing the temp file subdir. If None, the method is not being called by a batch job, so it deletes the init_args temporary files instead.

read_data(read_path, **kwargs)[source]

Read the configurations or data from a file.

Return type:

list[Atoms]

write_data(save_path, configs, **kwargs)[source]

Write the configurations or data to a file.

class orchestrator.computer.score.score_base.AtomCenteredScore(**kwargs)[source]

Bases: ScoreBase

save_results(compute_results, save_dir='.', list_of_configs=None, **kwargs)[source]

Save calculation output to a file.

Since these results are per-atom scores, they will be saved in the .arrays dictionary of an Atoms object. Note that this code assumes that the ASE file used to compute the results already exists in save_dir.

Parameters:
  • compute_results (np.ndarray or list[np.ndarray]) – the output of .compute() or .compute_batch()

  • save_dir (str) – folder in which to save the results

  • list_of_configs (list or Atoms) – the atomic configurations for which the results were computed. Must be provided so that results can be attached and saved on the correct Atoms objects.

parse_for_storage(run_path, cleanup=True)[source]

Process calculation output to extract data in a consistent format, then run cleanup() to remove any unnecessary temporary files.

Parameters:
  • run_path (str) – directory where the output file resides

  • cleanup (bool) – a flag indicating whether to delete the temporary files.

    Default: True

Returns:

Atoms of the configurations with attached properties and metadata

Return type:

list of Atoms

class orchestrator.computer.score.score_base.ConfigurationScore(**kwargs)[source]

Bases: ScoreBase

save_results(compute_results, save_dir='.', list_of_configs=None, **kwargs)[source]

Save calculation output to a file.

Since these results are per-configuration scores, they will be saved in the .info dictionary of an Atoms object. Note that this code assumes that the ASE file used to compute the results already exists in save_dir.

Parameters:
  • compute_results (np.ndarray or list[np.ndarray]) – the output of .compute() or .compute_batch()

  • save_dir (str) – folder in which to save the results

  • list_of_configs (list or Atoms) – the atomic configurations for which the results were computed. Must be provided so that results can be attached and saved on the correct Atoms objects.

parse_for_storage(run_path, cleanup=True)[source]

Process calculation output to extract data in a consistent format, then run cleanup() to remove any unnecessary temporary files.

Parameters:
  • run_path (str) – directory where the output file resides

  • cleanup (bool) – a flag indicating whether to delete the temporary files.

    Default: True

Returns:

Atoms of the configurations with attached properties and metadata

Return type:

list of Atoms

class orchestrator.computer.score.score_base.DatasetScore(**kwargs)[source]

Bases: ScoreBase

output_file_name = 'score_results.json'
compute_batch(list_of_atoms, score_quantity, **kwargs)[source]

Runs the calculation for a batch of atomic configurations. This method is intended for serial (non-distributed) use, outside of a full orchestrator workflow.

Parameters:
  • list_of_atoms (list[Atoms]) – a list of ASE Atoms objects

  • score_quantity (int) – the type of score value to compute

Returns:

a list of scores for each atomic configuration

Return type:

list

save_results(compute_results, save_dir='.', **kwargs)[source]

Save calculation output to a file. Implementation dependent.

Note that this function should also store any metadata associated with the calculation.

Parameters:
  • compute_results (np.ndarray or list[np.ndarray]) – the output of .compute() or .compute_batch()

  • save_dir (str) – folder in which to save the results

parse_for_storage(run_path, cleanup=True)[source]

Process calculation output to extract data in a consistent format, then run cleanup() to remove any unnecessary temporary files.

Parameters:
  • run_path (str) – directory where the output file resides

  • cleanup (bool) – a flag indicating whether to delete the temporary files.

    Default: True

Returns:

Atoms of the configurations with attached properties and metadata

Return type:

list of np.ndarray

run(path_type, configs, compute_args, workflow=None, job_details=None, batch_size=1, verbose=False)[source]

Custom .run() for handling dataset inputs instead of configurations.

Sets the batch size to the entire dataset size, and raises an error if batch_size is not 1 or the full dataset size.

Return type:

list[int]
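
The batch-size rule described above can be sketched as a small check. The helper name resolve_dataset_batch_size and the choice of ValueError are assumptions made for this example; the text above only states that an error is raised.

```python
def resolve_dataset_batch_size(n_configs, batch_size=1):
    """Return the batch size used for a dataset-level score.

    Only 1 (the default) or the full dataset size is accepted; in both
    cases the whole dataset is processed as a single batch.
    """
    if batch_size not in (1, n_configs):
        raise ValueError(
            f"batch_size must be 1 or {n_configs} for dataset scores, "
            f"got {batch_size}"
        )
    return n_configs


default_case = resolve_dataset_batch_size(10)
explicit_case = resolve_dataset_batch_size(10, batch_size=10)
```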

class orchestrator.computer.score.score_base.ModelScore(**kwargs)[source]

Bases: ScoreBase

A class for storing score results related to a single model, rather than for a single atom or atomic configuration.

output_file_name = 'score_results.json'
compute(data, score_quantity, **kwargs)[source]

Unlike the other score modules, which always act on the atomic configurations, model score doesn’t have this restriction. For example, it can act on some target property, etc., although it can still act on the atomic configurations.

Parameters:
  • data (Any) – general data for the score calculation, a generalization of the atoms argument in the other score modules

  • score_quantity (int) – the type of score value to compute

Returns:

the score, where the first dimension should be the number of atoms.

Return type:

np.ndarray

compute_batch(list_of_data, score_quantity, **kwargs)[source]

Unlike the other score modules, which always act on the atomic configurations, model score doesn’t have this restriction and can act on a more general object or quantity. For example, it can act on some target property, etc., although it can still act on the atomic configurations.

Parameters:
  • list_of_data (list[Any]) – list of general data for the score calculation, a generalization of the list_of_atoms argument in the other score modules

  • score_quantity (int) – the type of score value to compute

Returns:

a list of scores for each atomic configuration

Return type:

list

save_results(compute_results, save_dir='.', **kwargs)[source]

Save calculation output to a file.

Save the results as a JSON file, so that we can also store the metadata. The order of list elements stored under the self.OUTPUT_KEY_score key matches the order of elements in the compute_results input argument, which is typically the same as the output order of the compute_batch() method.

Parameters:
  • compute_results (np.ndarray or list[np.ndarray]) – the output of .compute() or .compute_batch()

  • save_dir (str) – folder in which to save the results
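
As an illustration of order-preserving JSON storage, the sketch below writes a list of scores together with metadata and reads it back. The function name, the "metadata" key, and the use of the key "score" are assumptions for this example; the exact layout of the file written by save_results() may differ.

```python
import json
import os
import tempfile


def save_scores_as_json(compute_results, save_dir=".", output_key="score",
                        metadata=None, file_name="score_results.json"):
    """Write scores and metadata to one JSON file, preserving input order."""
    payload = {
        # JSON arrays are ordered, so element k stays at position k.
        output_key: [list(map(float, r)) for r in compute_results],
        "metadata": metadata or {},  # hypothetical key name
    }
    path = os.path.join(save_dir, file_name)
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp:
    results = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]
    path = save_scores_as_json(results, tmp, metadata={"model": "toy"})
    with open(path) as f:
        loaded = json.load(f)
```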

parse_for_storage(run_path, cleanup=True)[source]

Process calculation output to extract data in a consistent format, then run cleanup() to remove any unnecessary temporary files.

Parameters:
  • run_path (str) – directory where the output file resides

  • cleanup (bool) – a flag indicating whether to delete the temporary files.

    Default: True

Returns:

A dictionary with the score value(s) and metadata

Return type:

dict

read_data(read_path, **kwargs)[source]

Read the data from a file.

This method should handle whatever data format is used by the score module.

Return type:

list[Any]

write_data(save_path, data, **kwargs)[source]

Write the data to a file.

This method should handle whatever data format is used by the score module.

Concrete Implementations

LTAU

class orchestrator.computer.score.ltau.LTAUForcesUQScore(train_descriptors, error_pdfs, bins=None, from_error_logs=False, error_logs_norm=False, nbins=None, range_limits=None, bin_spacing='log', index_type='IndexFlatL2', index_args=None, **kwargs)[source]

Bases: AtomCenteredScore

An ensemble-based UQ method for force predictions.

This module uses the distributions of force error magnitudes for each atom sampled over the course of training to estimate that atom’s force uncertainty. For test atoms, a nearest-neighbor search is performed (in the descriptor space) using the FAISS library to estimate the test atom’s uncertainty using its nearest neighbors from the training set.

OUTPUT_KEY = 'ltau_forces_uq'
supported_score_quantities = [ScoreQuantity.UNCERTAINTY]
__init__(train_descriptors, error_pdfs, bins=None, from_error_logs=False, error_logs_norm=False, nbins=None, range_limits=None, bin_spacing='log', index_type='IndexFlatL2', index_args=None, **kwargs)[source]
Parameters:
  • train_descriptors (np.ndarray or str) – (N, D) array of atomic descriptors for the entire training set. Can also be a string path to a NumPy-readable file.

  • error_pdfs (np.ndarray or list or str) – an array of size (N, N_b) containing the PDFs for all N training points defined over N_b bins. If from_error_logs is True, then error_pdfs should be a list of M arrays, where each array is a size (T_m, N) array of errors logged for each training point for an ensemble of M models. T_m is the number of training epochs for the m-th model in the ensemble. Can also be a string to a numpy-readable file.

  • from_error_logs (bool) – if True, builds the training PDFs from the logged training errors.

  • error_logs_norm (bool) – if True, takes the norm of the error logs along axis=-1

  • bins (np.ndarray or None) – array of bin edges. If from_error_logs is True and bins is not provided, must provide nbins, range_limits, and bin_spacing instead.

  • nbins (int) – number of bins for PDFs. Required if bins is None

  • range_limits (tuple) – the upper/lower limits of the bins. If not provided, uses the min/max error from error_pdfs. Only required if from_error_logs is True and bins is None.

  • bin_spacing (str) – one of “log” or “linear”. Default is “log”. Only required if from_error_logs is True and bins is None.

  • index_type (str) – one of [‘IndexFlatL2’, ‘IndexHNSWFlat’, ‘HNSW+IVFPQ’] specifying the index type for FAISS. Default is ‘IndexFlatL2’. Required if load_index is False.

  • index_args (dict) – additional arguments to be passed directly to the FAISS index constructor. Required if load_index is False.
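
The from_error_logs path described in the parameters above can be sketched in plain Python: build bin edges from nbins, range_limits, and bin_spacing, then histogram the logged errors of each training point across all models and epochs. The helper names are hypothetical, nested lists stand in for NumPy arrays, and each row is normalized to sum to 1, which is an assumption (a density normalization would also divide by the bin widths).

```python
import math


def make_bin_edges(nbins, range_limits, bin_spacing="log"):
    """Return nbins + 1 edges; log spacing requires positive limits."""
    lo, hi = range_limits
    if bin_spacing == "log":
        log_lo, log_hi = math.log10(lo), math.log10(hi)
        return [10 ** (log_lo + (log_hi - log_lo) * k / nbins)
                for k in range(nbins + 1)]
    return [lo + (hi - lo) * k / nbins for k in range(nbins + 1)]


def pdfs_from_error_logs(error_logs, edges):
    """error_logs: list of M logs, each a list of T_m rows with N errors."""
    n_points = len(error_logs[0][0])
    nbins = len(edges) - 1
    counts = [[0] * nbins for _ in range(n_points)]
    for log in error_logs:
        for epoch_errors in log:
            for i, err in enumerate(epoch_errors):
                for b in range(nbins):
                    last = b == nbins - 1
                    if edges[b] <= err < edges[b + 1] or (last and err == edges[-1]):
                        counts[i][b] += 1
                        break  # errors outside the range are ignored
    pdfs = []
    for row in counts:
        total = sum(row)
        pdfs.append([c / total if total else 0.0 for c in row])
    return pdfs


# Two models (M = 2) with 2 and 1 logged epochs, N = 2 training points.
error_logs = [
    [[0.005, 0.5], [0.05, 0.5]],
    [[0.005, 0.05]],
]
edges = make_bin_edges(nbins=3, range_limits=(1e-3, 1.0), bin_spacing="log")
pdfs = pdfs_from_error_logs(error_logs, edges)
```

The result has shape (N, N_b), matching the layout expected for error_pdfs when from_error_logs is False.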

compute(atoms, score_quantity=ScoreQuantity.UNCERTAINTY, descriptors_key='descriptors', num_nearest_neighbors=1, **kwargs)[source]

Calls compute_batch with a single-configuration list.

Parameters:
  • atoms (Union[list[Atoms], Atoms]) – the ASE Atoms objects.

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • score_quantity (int) – the type of score value to compute

  • num_nearest_neighbors (int) – the number of neighbors to search for when performing UQ on test point

Return type:

ndarray

Returns:

(N, b) array of PDFs of errors for each atom.

compute_batch(list_of_atoms, score_quantity=ScoreQuantity.UNCERTAINTY, descriptors_key='descriptors', num_nearest_neighbors=1, **kwargs)[source]
Parameters:
  • list_of_atoms (list[Atoms]) – a list of ASE Atoms objects. If a list of ASE Atoms is provided, then ‘descriptors_key’ should be provided in args to allow extraction of descriptors.

  • score_quantity (int) – the type of score value to compute

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • num_nearest_neighbors (int) – the number of neighbors to search for when performing UQ on test points

Return type:

list[ndarray]

Returns:

(N, b) array of PDFs of errors for each atom.
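
The nearest-neighbor step can be sketched without FAISS. The exhaustive search below returns the same neighbors as an exact flat L2 index would. Averaging the neighbors' PDFs is an assumption made for this example; the text above only says that the nearest neighbors are used to estimate the uncertainty. The function name knn_pdf is hypothetical.

```python
def knn_pdf(test_descriptor, train_descriptors, train_pdfs, num_nearest_neighbors=1):
    """Estimate a test atom's error PDF from its nearest training atoms."""
    # Squared L2 distance from the test descriptor to every training descriptor.
    dists = [
        sum((a - b) ** 2 for a, b in zip(test_descriptor, d))
        for d in train_descriptors
    ]
    # Indices of the k closest training atoms.
    order = sorted(range(len(dists)), key=dists.__getitem__)
    neighbors = order[:num_nearest_neighbors]
    # Bin-wise mean of the neighbors' PDFs (assumed aggregation).
    nbins = len(train_pdfs[0])
    return [
        sum(train_pdfs[i][b] for i in neighbors) / len(neighbors)
        for b in range(nbins)
    ]


train_descriptors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_pdfs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

one_neighbor = knn_pdf([0.1, 0.0], train_descriptors, train_pdfs)
two_neighbors = knn_pdf([0.1, 0.0], train_descriptors, train_pdfs,
                        num_nearest_neighbors=2)
```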

get_colabfit_property_definition(score_quantity=None)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

get_colabfit_property_map(score_quantity=None)[source]

Returns a default property map that can be used to extract a ColabFit property from an ASE.Atoms object. This assumes that the values being extracted are stored in their default locations based on the specific Computer module (usually within the compute() or compute_batch() functions).

A ‘property map’ is similar to a ‘property definition’, but instead tells ColabFit how to extract the keys specified in the property definition from an ASE.Atoms object. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property map

Return type:

dict

QUESTS

class orchestrator.computer.score.quests.QUESTSEfficiencyScore(**kwargs)[source]

Bases: DatasetScore

An information-based method of quantifying dataset efficiency.

This module wraps the quests package, which performs a kernel density estimate of the distribution of points in a descriptor space to obtain a non-parametric estimate of the information entropy of a dataset. This estimate can then be used to identify the “efficiency” of the dataset (the lack of redundancy).

OUTPUT_KEY = 'quests_efficiency'
supported_score_quantities = [ScoreQuantity.EFFICIENCY]
__init__(**kwargs)[source]

Initialize the Recorder mixin class.

Sets up logging configuration and creates a logger instance named after the class of the object using it.

Parameters:
  • args – Positional arguments passed to other supers in the MRO.

  • kwargs – Keyword arguments passed to other supers in the MRO.

compute(dataset, score_quantity, apply_mask=False, descriptors_key='descriptors', bandwidth=quests.entropy.DEFAULT_BANDWIDTH, batch_size=quests.entropy.DEFAULT_BATCH, **kwargs)[source]

Computes the efficiency of the dataset.

The efficiency is a measure of how little oversampling the dataset has. If the efficiency is near 1, the dataset has very little redundancy.

Parameters:
  • dataset (list) – a list of ASE Atoms objects.

  • score_quantity (int) – the type of score value to compute

  • apply_mask (bool) – if True, apply the environment selection mask; can only be used if mask already exists for all configurations.

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • bandwidth (float) – the bandwidth used by the Gaussian kernel for KDE.

  • batch_size (int) – the maximum batch size to consider when performing a distance calculation.

Returns:

the efficiency of the dataset

Return type:

float
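
To make the idea concrete, the sketch below estimates the information entropy with a Gaussian kernel density estimate, H = -(1/n) Σ_i log[(1/n) Σ_j K(x_i, x_j)], and normalizes it by log(n). Both the kernel normalization and the ratio H / log(n) are assumptions chosen so that a dataset with no redundancy scores 1; the quests package and this module may define the quantity differently. The function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Unnormalized Gaussian kernel, equal to 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kde_entropy(points, bandwidth):
    """Entropy estimate H = -(1/n) * sum_i log((1/n) * sum_j K(x_i, x_j))."""
    n = len(points)
    total = 0.0
    for x in points:
        density = sum(gaussian_kernel(x, y, bandwidth) for y in points) / n
        total += math.log(density)
    return -total / n


def efficiency(points, bandwidth):
    """Assumed normalization: entropy relative to its maximum, log(n)."""
    return kde_entropy(points, bandwidth) / math.log(len(points))


# Well-separated descriptors: no redundancy.
spread_out = efficiency([[0.0], [100.0], [200.0]], bandwidth=0.1)

# Three copies of the same descriptor: fully redundant.
redundant = efficiency([[1.0], [1.0], [1.0]], bandwidth=0.1)
```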

get_colabfit_property_definition(score_quantity)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

class orchestrator.computer.score.quests.QUESTSDiversityScore(**kwargs)[source]

Bases: DatasetScore

An information-based method of quantifying dataset diversity.

This module wraps the quests package, which performs a kernel density estimate of the distribution of points in a descriptor space to obtain a non-parametric estimate of the information entropy of a dataset. This estimate can then be used to identify the “diversity” of a dataset.

OUTPUT_KEY = 'quests_diversity'
supported_score_quantities = [ScoreQuantity.DIVERSITY]
__init__(**kwargs)[source]

Initialize the Recorder mixin class.

Sets up logging configuration and creates a logger instance named after the class of the object using it.

Parameters:
  • args – Positional arguments passed to other supers in the MRO.

  • kwargs – Keyword arguments passed to other supers in the MRO.

compute(dataset, score_quantity, apply_mask=False, descriptors_key='descriptors', bandwidth=quests.entropy.DEFAULT_BANDWIDTH, batch_size=quests.entropy.DEFAULT_BATCH, **kwargs)[source]

Computes the diversity of the dataset.

The diversity is a measure of how well the dataset covers all regions of the configuration space that it spans.

Parameters:
  • dataset (list) – a list of ASE Atoms objects.

  • score_quantity (int) – the type of score value to compute

  • apply_mask (bool) – if True, apply the environment selection mask; can only be used if mask already exists for all configurations.

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • bandwidth (float) – the bandwidth used by the Gaussian kernel for KDE.

  • batch_size (int) – the maximum batch size to consider when performing a distance calculation.

Returns:

the diversity of the dataset

Return type:

float
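
One way to express coverage with the same kernel machinery is D = log Σ_i [1 / Σ_j K(x_i, x_j)], which grows with the number of distinct regions that the dataset occupies. This formula and the Gaussian kernel form are assumptions for this sketch; consult the quests package for the definition that this module actually wraps.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Unnormalized Gaussian kernel, equal to 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kde_diversity(points, bandwidth):
    """Assumed definition: D = log(sum_i 1 / sum_j K(x_i, x_j))."""
    weights = []
    for x in points:
        # Each point contributes less when many others sit on top of it.
        local_count = sum(gaussian_kernel(x, y, bandwidth) for y in points)
        weights.append(1.0 / local_count)
    return math.log(sum(weights))


# Three distinct environments count as three regions: D = log(3).
distinct = kde_diversity([[0.0], [100.0], [200.0]], bandwidth=0.1)

# Three copies of one environment count as one region: D = 0.
duplicated = kde_diversity([[1.0], [1.0], [1.0]], bandwidth=0.1)
```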

get_colabfit_property_definition(score_quantity)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

class orchestrator.computer.score.quests.QUESTSDeltaEntropyScore(**kwargs)[source]

Bases: AtomCenteredScore

OUTPUT_KEY = 'quests_delta_entropy'
supported_score_quantities = [ScoreQuantity.DELTA_ENTROPY]
compute(atoms, score_quantity, reference_set, descriptors_key='descriptors', approx=False, bandwidth=quests.entropy.DEFAULT_BANDWIDTH, num_nearest_neighbors=quests.entropy.DEFAULT_UQ_NBRS, graph_neighbors=quests.entropy.DEFAULT_GRAPH_NBRS, **kwargs)[source]

Calls compute_batch with a single-configuration list.

Parameters:
  • atoms (Atoms) – a single ASE Atoms object.

  • score_quantity (int) – the type of score value to compute

  • reference_set (np.ndarray) – an (N, D) matrix with the descriptors of the reference.

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • approx (bool) – if True, uses an approximate nearest neighbor search to compute the delta entropy values. Recommended for large data sizes.

  • bandwidth (float) – the bandwidth used by the Gaussian kernel for KDE.

  • num_nearest_neighbors (int) – number of nearest-neighbors to take into account when computing the approximate dH.

  • graph_neighbors (int) – a parameter used by pynndescent for performing the approximate nearest neighbor search.

Returns:

the delta_entropy

Return type:

float or np.ndarray

compute_batch(list_of_atoms, score_quantity, reference_set, descriptors_key='descriptors', approx=False, bandwidth=quests.entropy.DEFAULT_BANDWIDTH, batch_size=quests.entropy.DEFAULT_BATCH, num_nearest_neighbors=quests.entropy.DEFAULT_UQ_NBRS, graph_neighbors=quests.entropy.DEFAULT_GRAPH_NBRS, **kwargs)[source]

Computes the delta entropy scores for a batch of atomic configurations relative to the reference set.

Parameters:
  • list_of_atoms (list) – a list of ASE Atoms objects. If a list of ASE Atoms is provided, then ‘descriptors_key’ should be provided in args to allow extraction of descriptors.

  • score_quantity (int) – the type of score value to compute

  • descriptors_key (str) – the key to use for extracting the descriptors from an ASE.Atoms object

  • approx (bool) – if True, uses an approximate nearest neighbor search to compute the delta entropy values. Recommended for large data sizes.

  • bandwidth (float) – the bandwidth used by the Gaussian kernel for KDE.

  • batch_size (int) – the maximum batch size to consider when performing a distance calculation.

  • num_nearest_neighbors (int) – number of nearest-neighbors to take into account when computing the approximate dH.

  • graph_neighbors (int) – a parameter used by pynndescent for performing the approximate nearest neighbor search.

  • reference_set (np.ndarray) – an (N, D) matrix with the descriptors of the reference.

Returns:

the delta_entropy scores for each atom

Return type:

float or np.ndarray
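
A per-atom delta entropy can be sketched as dH(y) = -log Σ_i K(y, x_i), where the x_i are the rows of reference_set. Under this assumed definition, values at or below zero indicate an environment that the reference set already covers, and large positive values indicate a novel one. The kernel form and the function names are assumptions for this example, and the exact search shown here corresponds to approx=False.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Unnormalized Gaussian kernel, equal to 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def delta_entropy(descriptor, reference_set, bandwidth):
    """Assumed definition: dH = -log(sum_i K(descriptor, x_i))."""
    overlap = sum(gaussian_kernel(descriptor, x, bandwidth) for x in reference_set)
    if overlap == 0.0:
        return math.inf  # no overlap with the reference set at all
    return -math.log(overlap)


reference_set = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]

# Per-atom descriptors of one configuration (stand-in for an Atoms array).
atom_descriptors = [[10.0, 10.0], [0.0, 0.0], [500.0, 500.0]]
scores = [delta_entropy(d, reference_set, bandwidth=0.1) for d in atom_descriptors]
```

The three atoms respectively match one reference point, match two reference points, and match none.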

get_colabfit_property_definition(score_quantity)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

get_colabfit_property_map(score_quantity)[source]

Returns a default property map that can be used to extract a ColabFit property from an ASE.Atoms object. This assumes that the values being extracted are stored in their default locations based on the specific Computer module (usually within the compute() or compute_batch() functions).

A ‘property map’ is similar to a ‘property definition’, but instead tells ColabFit how to extract the keys specified in the property definition from an ASE.Atoms object. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property map

Return type:

dict

FIMTrain

class orchestrator.computer.score.fim.fim_training_set.FIMTrainingSetScore(**kwargs)[source]

Bases: ConfigurationScore

A module to compute the Fisher information matrix (FIM) of the training dataset with respect to configuration energy, atomic forces, or stress.

The FIM is calculated by first computing the Jacobian, i.e., the derivative of the predictions with respect to the potential parameters, and then taking the dot product of the Jacobian with itself. The derivative is calculated numerically using the numdifftools package.

The element of the FIM matrix at row \(i\), column \(j\), approximates the second derivative of the potential predictions with respect to the potential parameters at indices \(i\) and \(j\). The mapping between parameter indices and their corresponding potential parameters is stored in the attribute self.fim_index_to_parameter.

Note

When passing the evaluate_kwargs argument, each configuration should evaluate only one quantity at a time — either energy, forces, or stress. Since these quantities have different physical units, the FIM should be computed separately for each.

If multiple quantities need to be evaluated for a configuration, consider duplicating the configuration and assigning only one quantity to each duplicate.

An exception will be raised if more than one quantity is requested for evaluation.

Note

Currently, this module only works with KIM portable models.
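
The Jacobian-then-product recipe described above can be sketched on a toy model. A hand-written central difference replaces numdifftools here, and a two-parameter linear function replaces the KIM potential; both substitutions, and all function names, are assumptions made to keep the example self-contained.

```python
def central_jacobian(model, params, step=1e-6):
    """Central-difference Jacobian: rows are predictions, columns are parameters."""
    n_out, n_par = len(model(params)), len(params)
    jac = [[0.0] * n_par for _ in range(n_out)]
    for j in range(n_par):
        plus, minus = list(params), list(params)
        plus[j] += step
        minus[j] -= step
        f_plus, f_minus = model(plus), model(minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * step)
    return jac


def fim_from_jacobian(jac):
    """FIM = J^T J, a (p, p) matrix for p parameters."""
    n_par = len(jac[0])
    return [
        [sum(row[a] * row[b] for row in jac) for b in range(n_par)]
        for a in range(n_par)
    ]


def toy_model(params):
    """Stand-in for potential predictions: a * x + b at three fixed points."""
    a, b = params
    return [a * x + b for x in (0.0, 1.0, 2.0)]


jacobian = central_jacobian(toy_model, [0.5, -1.0])
fim = fim_from_jacobian(jacobian)
```

For this model the Jacobian rows are [x, 1], so the FIM is [[5, 3], [3, 3]] up to finite-difference rounding.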

OUTPUT_KEY = 'fim_training_set'
supported_score_quantities = [ScoreQuantity.SENSITIVITY]
supported_potential_type = ['KIM']
default_evaluate_kwargs = {'compute_energy': False, 'compute_forces': True, 'compute_stress': False}
default_derivative_kwargs = {'method': 'central'}
__init__(**kwargs)[source]

Initialize the Recorder mixin class.

Sets up logging configuration and creates a logger instance named after the class of the object using it.

Parameters:
  • args – Positional arguments passed to other supers in the MRO.

  • kwargs – Keyword arguments passed to other supers in the MRO.

compute(atoms, score_quantity, potential, parameters_optimize=None, transform=None, evaluate_kwargs=None, derivative_kwargs=None, mask=None, **kwargs)[source]

Runs the FIM calculation for a single atomic configuration. This method is intended for serial (non-distributed) use, outside of a full orchestrator workflow.

Note

When passing the evaluate_kwargs argument, each configuration should evaluate only one quantity at a time — either energy, forces, or stress. Since these quantities have different physical units, the FIM should be computed separately for each.

If multiple quantities need to be evaluated for a configuration, consider duplicating the configuration and assigning only one quantity to each duplicate.

A UserWarning will be issued if more than one quantity is requested for evaluation.

Parameters:
  • atoms (ase.Atoms) – the ASE Atoms object

  • score_quantity (str ("SENSITIVITY")) – The type of score value to compute. For this module, the accepted argument is “SENSITIVITY”.

  • potential (dict (preferred) or orchestrator.potential.Potential) – input dictionary to instantiate the potential (using init_potential) or the potential instance itself.

  • parameters_optimize (dict) – Potential parameters to differentiate and their values.

  • transform (Union[dict, TransformBase, None]) – A dictionary containing information to instantiate parameter transformation class. Required keys are “transform_type” and “transform_args”.

  • evaluate_kwargs (dict) – specify to compute energy, forces, and/or stress, with key compute_<quantity> set to boolean value. The default is to compute forces only.

  • derivative_kwargs (dict) – keyword arguments for the Jacobian calculation via the numdifftools Python package. See the numdifftools documentation for the list of available keywords.

  • mask (str or np.ndarray) – a binary masking array that can be used to exclude rows of the Jacobian matrix, for example to compute the FIM of the atomic forces on selected atoms only. Because the mask is applied to the rows of the Jacobian, including all force components on atom i (zero-based index) in a configuration with N atoms requires an array of zeros with length 3N in which elements 3*i : 3*(i+1) are set to 1.

Returns:

(p, p) array of the FIM, where p is the number of potential parameters

Return type:

np.ndarray
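
The mask layout described for the mask parameter can be built as follows. The helper force_mask is hypothetical, and a plain list stands in for the NumPy array; the row filtering at the end only illustrates the stated effect of excluding Jacobian rows.

```python
def force_mask(n_atoms, selected_atoms):
    """Length-3N binary mask keeping all force components of selected atoms."""
    mask = [0] * (3 * n_atoms)
    for i in selected_atoms:
        if not 0 <= i < n_atoms:
            raise IndexError(f"atom index {i} out of range for {n_atoms} atoms")
        # Atom i owns rows 3*i, 3*i + 1, 3*i + 2 (x, y, z components).
        mask[3 * i:3 * (i + 1)] = [1, 1, 1]
    return mask


# Keep only the forces on atom 1 of a 4-atom configuration.
mask = force_mask(n_atoms=4, selected_atoms=[1])

# Rows of a force Jacobian are labeled here instead of holding numbers.
row_labels = [f"atom{a}_{c}" for a in range(4) for c in "xyz"]
kept_rows = [label for label, keep in zip(row_labels, mask) if keep]
```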

compute_batch(list_of_atoms, score_quantity, potential, parameters_optimize, transform=None, evaluate_kwargs=None, derivative_kwargs=None, list_of_mask=None, nprocs=1, **kwargs)[source]

Runs the FIM calculation for a batch of atomic configurations. This method is intended for serial (non-distributed) use, outside of a full orchestrator workflow.

Note

When passing the evaluate_kwargs argument, each configuration should evaluate only one quantity at a time — either energy, forces, or stress. Since these quantities have different physical units, the FIM should be computed separately for each.

If multiple quantities need to be evaluated for a configuration, consider duplicating the configuration and assigning only one quantity to each duplicate.

A UserWarning will be issued if more than one quantity is requested for evaluation.

Parameters:
  • list_of_atoms (list of ase.Atoms) – a list of ASE Atoms objects

  • score_quantity (str ("SENSITIVITY")) – The type of score value to compute. For this module, the accepted argument is “SENSITIVITY”.

  • potential (dict (preferred) or orchestrator.potential.Potential) – input dictionary to instantiate the potential (using init_potential) or the potential instance itself.

  • parameters_optimize (dict) – Potential parameters to differentiate and their values.

  • transform (Union[dict, TransformBase, None]) – A dictionary containing information to instantiate parameter transformation class. Required keys are “transform_type” and “transform_args”.

  • evaluate_kwargs (dict) – specifies whether to compute energy, forces, and/or stress, with each key compute_<quantity> set to a boolean value. The default is to compute forces only. For each key, give either a single boolean or a list of booleans with length 1 or len(list_of_atoms).

  • derivative_kwargs (dict) –

    keyword arguments for the Jacobian calculation via the numdifftools Python package. See the numdifftools documentation for the list of available keywords.

  • list_of_mask (list of str or list of np.ndarray) – a list of binary masking arrays, one per configuration, used to exclude rows of the Jacobian matrix, for example to compute the FIM of the atomic forces on selected atoms only. Because each mask is applied to the rows of the Jacobian, including all force components on atom i (zero-based index) in a configuration with N atoms requires an array of zeros of length 3N with elements 3*i : 3*(i+1) set to 1.

  • nprocs (int) – number of parallel processes to use

Returns:

list of M arrays of shape (P, P), one FIM for each of the M atomic configurations, where P is the number of potential parameters.

Return type:

list
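
The duplication strategy from the note above can be sketched as follows. This is only an illustration: the helper split_quantities is hypothetical, strings stand in for ASE Atoms objects, and the key names compute_energy, compute_forces, and compute_stress are assumed from the compute_<quantity> pattern.

```python
QUANTITIES = ("energy", "forces", "stress")


def split_quantities(configs, wanted=("energy", "forces")):
    """Duplicate each configuration once per wanted quantity.

    Returns the expanded list and per-configuration boolean flags so that
    every entry evaluates exactly one quantity. Hypothetical helper.
    """
    list_of_atoms = []
    evaluate_kwargs = {f"compute_{q}": [] for q in QUANTITIES}
    for config in configs:
        for quantity in wanted:
            list_of_atoms.append(config)
            for q in QUANTITIES:
                # Only the flag matching this duplicate's quantity is True
                evaluate_kwargs[f"compute_{q}"].append(q == quantity)
    return list_of_atoms, evaluate_kwargs


# Strings stand in for ase.Atoms objects in this sketch
list_of_atoms, evaluate_kwargs = split_quantities(["config_A", "config_B"])
```

Each flag list then has length len(list_of_atoms), and no entry requests more than one quantity, so the UserWarning described above should not be triggered.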

get_colabfit_property_definition(score_quantity=None)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

FIMProperty

class orchestrator.computer.score.fim.fim_property.FIMPropertyScore(**kwargs)[source]

Bases: ModelScore

A module to compute the FIM of the target property with respect to the potential parameters.

The FIM is calculated by first computing the Jacobian, i.e., the derivative of the target property with respect to the potential parameters, and then taking the dot product of the Jacobian with itself. The derivative is calculated numerically using a finite difference approach, implemented in the information_matching package.

The element of the FIM matrix at row \(i\), column \(j\), approximates the second derivative of the potential predictions with respect to the potential parameters at indices \(i\) and \(j\). The mapping between parameter indices and their corresponding potential parameters is stored in the attribute self.fim_index_to_parameter.

The user is also required to provide the target covariance matrix of the target property. If multiple target properties are given, e.g., when calling the compute_batch method, one covariance matrix needs to be given for each target property. The method then returns the FIM for each target property.

Note

Currently, this module only works with KIM portable models that support writing parameters.
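
The two steps described above (numerical Jacobian, then product of the Jacobian with itself) can be sketched with NumPy on a toy model. This is not the information_matching implementation: the function names are hypothetical, a plain forward difference is used, and the target covariance weighting is omitted for brevity.

```python
import numpy as np


def forward_difference_jacobian(model, params, h=1e-6):
    """Approximate d(model)/d(params) with a forward difference of step h."""
    params = np.asarray(params, dtype=float)
    base = np.asarray(model(params), dtype=float)
    jac = np.empty((base.size, params.size))
    for j in range(params.size):
        shifted = params.copy()
        shifted[j] += h
        # Column j holds the derivative of every prediction w.r.t. parameter j
        jac[:, j] = (np.asarray(model(shifted), dtype=float) - base) / h
    return jac


def toy_model(params):
    """Stand-in for a target property: predictions a*x + b at three points."""
    a, b = params
    x = np.array([0.0, 1.0, 2.0])
    return a * x + b


jac = forward_difference_jacobian(toy_model, [2.0, 1.0])  # shape (q, p) = (3, 2)
fim = jac.T @ jac                                          # shape (p, p) = (2, 2)
```

For this linear toy model the Jacobian columns are x and a column of ones, so the FIM is approximately [[5, 3], [3, 3]].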

OUTPUT_KEY = 'fim_property'
data_file_name = 'score_results.json'
supported_score_quantities = [ScoreQuantity.SENSITIVITY]
supported_potential_type = ['KIM']
property_output_dir = 'fim_property_output_files'
__init__(**kwargs)[source]

Initialize the Recorder mixin class.

Sets up logging configuration and creates a logger instance named after the class of the object using it.

Parameters:
  • args – Positional arguments passed to other supers in the MRO.

  • kwargs – Keyword arguments passed to other supers in the MRO.

compute(target_property, score_quantity, cov, potential, parameters_optimize, transform=None, derivative_kwargs=None, return_jacobian=False, nprocs=1, **kwargs)[source]

Run the FIM calculation for a single target property. This is intended to be able to be used in a serial (non-distributed) manner, outside of a proper orchestrator workflow.

Note

If KIMRun is used, then the argument flatten=True in calculate_property method is enforced.

Parameters:
  • target_property (dict) –

    a dictionary about the target property, which should contain the following keys:

    • `init_args`: Used to instantiate the target property class

    • `calculate_property_args`: Contains arguments for TargetProperty.calculate_property method, excluding the potential and iter_num, which are automatically inserted during this calculation.

  • score_quantity (str ("SENSITIVITY")) – The type of score value to compute. For this module, the accepted argument is “SENSITIVITY”.

  • cov (Path-like (preferred) or np.ndarray) – target covariance matrix of the target property to achieve.

  • potential (dict (preferred) or orchestrator.potential.Potential) – input dictionary to instantiate the potential (using init_potential) or the potential instance itself.

  • parameters_optimize (dict) – Potential parameters to differentiate and their values.

  • transform (dict (preferred) or TransformBase) – A dictionary containing information to instantiate parameter transformation class. Required keys are “transform_type” and “transform_args”.

  • derivative_kwargs (dict) –

    Additional arguments for instantiating information_matching.fim.finitediff.FiniteDifference, which are interpreted as the finite difference settings. Available keywords include:

    • `h` (float or np.ndarray): Specifies the finite difference step size. If an array is given, each element gives the step size for each parameter.

    • `method` (str): Specifies the finite difference method to use. Available methods are “FD”, “FD2”, “FD3”, “FD4”, “CD”, and “CD4”.

  • return_jacobian (bool) – If True, the method returns both the FIM and the Jacobian, in that order.

  • nprocs (int) – number of parallel processes to use when computing the columns of Jacobian.

Returns:

(P, P) array of the FIM, where P is the number of potential parameters

Return type:

np.ndarray
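
The dictionary arguments described above might be assembled as in the sketch below. Only the key names and the list of finite difference methods come from this page; every value is a placeholder, and the helper check_method is hypothetical.

```python
# Methods listed in the derivative_kwargs description above
ALLOWED_METHODS = {"FD", "FD2", "FD3", "FD4", "CD", "CD4"}

target_property = {
    # Placeholder: arguments used to instantiate the target property class
    "init_args": {},
    # Placeholder: arguments for calculate_property, without potential and
    # iter_num, which are inserted automatically during the calculation
    "calculate_property_args": {},
}

transform = {
    "transform_type": "PLACEHOLDER_TRANSFORM",  # hypothetical name
    "transform_args": {},                        # placeholder
}

# Scalar step size for all parameters; an array would give one step per parameter
derivative_kwargs = {"h": 1e-4, "method": "CD"}


def check_method(kwargs):
    """Raise early if the requested finite difference method is not listed."""
    method = kwargs.get("method")
    if method is not None and method not in ALLOWED_METHODS:
        raise ValueError(f"Unknown finite difference method: {method!r}")
    return kwargs


check_method(derivative_kwargs)
```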

compute_batch(list_of_target_property, score_quantity, cov, potential, parameters_optimize, transform=None, derivative_kwargs=None, return_jacobian=False, nprocs=1, **kwargs)[source]

Runs the FIM calculation for a batch of target properties. This is intended to be able to be used in a serial (non-distributed) manner, outside of a proper orchestrator workflow.

This method returns a list of FIMs, where the order of the list matches the order of the list_of_target_property input argument. Specifically, the first element corresponds to the FIM of the first target property, the second element to the second target property, and so on.

Note

If KIMRun is used, then the argument flatten=True in calculate_property method is enforced.

Parameters:
  • list_of_target_property (List[dict]) –

    a list of dictionaries, where each dictionary contains information about the target property, and should have the following keys:

    • `init_args`: Used to instantiate the target property class

    • `calculate_property_args`: Contains arguments for TargetProperty.calculate_property method, excluding the potential and iter_num, which are automatically inserted during this calculation.

  • score_quantity (str ("SENSITIVITY")) – The type of score value to compute. For this module, the accepted argument is “SENSITIVITY”.

  • cov (Path-like str (preferred) or np.ndarray, or a list of Path-like or np.ndarray) – target covariance matrix of the target property to achieve. If a single matrix is given, it is treated as the target covariance for the combined target properties. If a list is given, each element gives the target covariance for the corresponding target property. Note that the latter option should not be used if the target properties are correlated, as it cannot encode covariance across target properties.

  • potential (dict (preferred) or orchestrator.potential.Potential) – input dictionary to instantiate the potential (using init_potential) or the potential instance itself.

  • parameters_optimize (dict) – Potential parameters to differentiate and their values.

  • transform (dict (preferred) or TransformBase) – A dictionary containing information to instantiate parameter transformation class. Required keys are “transform_type” and “transform_args”.

  • derivative_kwargs (dict) –

    Additional arguments for instantiating information_matching.fim.finitediff.FiniteDifference, which are interpreted as the finite difference settings. Available keywords include:

    • `h` (float or np.ndarray): Specifies the finite difference step size. If an array is given, each element gives the step size for each parameter.

    • `method` (str): Specifies the finite difference method to use. Available methods are “FD”, “FD2”, “FD3”, “FD4”, “CD”, and “CD4”.

  • return_jacobian (bool) – If True, the method returns both the FIM and the Jacobian, in that order.

  • nprocs (int) – number of parallel processes to use when computing the columns of the Jacobian. The default is 1.

Returns:

A (P, P) array of the FIM, where P is the number of potential parameters. The FIMs from the different target properties are summed to give the total FIM.

Return type:

list

compute_fim(list_of_jac, cov)[source]

Compute the combined FIM of the target property given the lists of Jacobian and covariance matrices.

Parameters:
  • list_of_jac (np.ndarray) – Jacobian matrix of the target property.

  • cov (str or np.ndarray or list) – Target covariance matrix or matrices.

Returns:

total FIM for the target properties

Return type:

np.ndarray
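
One way to picture the combination is the sketch below. It assumes the covariance enters in the standard form J^T Σ^{-1} J, which this page does not state explicitly, and the function name combined_fim is hypothetical. It also shows why a list of covariances cannot encode correlations between target properties: summing the per-property FIMs is equivalent to using a block-diagonal covariance.

```python
import numpy as np


def combined_fim(list_of_jac, list_of_cov):
    """Sum J^T Sigma^{-1} J over target properties (assumed form)."""
    n_params = list_of_jac[0].shape[1]
    total = np.zeros((n_params, n_params))
    for jac, cov in zip(list_of_jac, list_of_cov):
        # solve(cov, jac) computes Sigma^{-1} J without forming the inverse
        total += jac.T @ np.linalg.solve(cov, jac)
    return total


jac_a = np.array([[1.0, 0.0], [0.0, 2.0]])  # two predictions, two parameters
jac_b = np.array([[1.0, 1.0]])              # one prediction, two parameters
cov_a = np.diag([0.25, 1.0])
cov_b = np.array([[0.5]])

fim_total = combined_fim([jac_a, jac_b], [cov_a, cov_b])

# Same result from stacked Jacobians with a block-diagonal covariance
stacked_jac = np.vstack([jac_a, jac_b])
block_cov = np.zeros((3, 3))
block_cov[:2, :2] = cov_a
block_cov[2:, 2:] = cov_b
fim_block = combined_fim([stacked_jac], [block_cov])
```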

compute_jacobian(target_property_args, calculate_property_args={}, nprocs=1)[source]

Compute the Jacobian matrix for 1 target property.

Parameters:
  • target_property (dict) – a dictionary for building the target property instance

  • calculate_property_args (dict) – any additional arguments to be passed into target_property.calculate_property() method.

  • nprocs (int) – number of parallel processes to use when computing the Jacobian. The default is 1.

Returns:

(q, p) np.ndarray of the Jacobian, where q and p are numbers of predictions and parameters, respectively.

Return type:

np.ndarray

read_data(read_path, **kwargs)[source]

Read the data from a file.

Return type:

list[dict]

write_data(save_path, data, **kwargs)[source]

Write the data to a file.

get_colabfit_property_definition(score_quantity=None)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict

FIMMatching

class orchestrator.computer.score.fim.fim_matching.FIMMatchingScore(**kwargs)[source]

Bases: ConfigurationScore

A method to quantify the importance of a dataset by computing the optimal configuration weights via the information-matching method.

This module wraps the information_matching Python package, which uses the FIMs of the candidate configurations and of the target property and performs matrix matching to obtain the optimal weight for each candidate configuration. This calculation ensures that the uncertainty of the target property is smaller than the predefined target uncertainty. The weights also measure the “importance” of the configurations, and their reciprocal square roots give the target precision to achieve when generating the ground truth data.

Note

Please use compute_batch() method. The optimization problem solved in the information-matching method is likely to be infeasible when only a single candidate atomic configuration is provided.
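
The relation between weights and target precision stated above can be sketched as follows. The weight values are made up for illustration; configurations with zero weight are treated as not selected.

```python
import numpy as np

# Hypothetical optimal weights for three candidate configurations
weights = np.array([0.0, 4.0, 0.25])

# Configurations with non-zero weight are the ones that matter
selected = np.flatnonzero(weights > 0)

# Reciprocal square root of the weight gives the target precision
target_precision = 1.0 / np.sqrt(weights[selected])

for index, precision in zip(selected, target_precision):
    print(f"configuration {index}: target precision {precision}")
```

A larger weight therefore means a more important configuration whose ground truth data must be generated more precisely.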

OUTPUT_KEY = 'fim_matching_weight'
supported_score_quantities = [ScoreQuantity.IMPORTANCE]
default_solver_kwargs = {'solver': 'SDPA'}
default_fim_preconditioning_kwargs = {'scale_type': 'max_frobenius'}
__init__(**kwargs)[source]

Initialize the Recorder mixin class.

Sets up logging configuration and creates a logger instance named after the class of the object using it.

Parameters:
  • args – Positional arguments passed to other supers in the MRO.

  • kwargs – Keyword arguments passed to other supers in the MRO.

compute(atoms, score_quantity, **kwargs)[source]

Users should call the compute_batch() method to do FIM-matching.

Running the information-matching calculation with only a single candidate atomic configuration will most likely fail, because the optimization problem is likely to be infeasible.
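
Why a single candidate tends to be infeasible can be illustrated with a small check. The sketch assumes the matching condition is that the weighted sum of candidate FIMs minus the target FIM is positive semidefinite; the function name and the toy matrices are hypothetical.

```python
import numpy as np


def satisfies_matching(weights, fim_candidates, fim_target, tol=1e-10):
    """True if sum_m w_m F_m - F_target is positive semidefinite (assumed condition)."""
    weighted = np.tensordot(weights, fim_candidates, axes=1)
    return bool(np.linalg.eigvalsh(weighted - fim_target).min() >= -tol)


fim_target = np.eye(2)
# Each candidate only carries information about one parameter direction
fim_x = np.array([[1.0, 0.0], [0.0, 0.0]])
fim_y = np.array([[0.0, 0.0], [0.0, 1.0]])

# No weight on a single rank-deficient candidate can cover the full-rank target
single_ok = any(
    satisfies_matching([w], [fim_x], fim_target) for w in (1.0, 10.0, 1e6)
)

# Two complementary candidates can
pair_ok = satisfies_matching([1.0, 1.0], [fim_x, fim_y], fim_target)
```

Here single_ok is False for every weight tried, while pair_ok is True.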

compute_batch(list_of_atoms, fim_property, score_quantity, convexopt_init_kwargs=None, fim_preconditioning_kwargs=None, solver_kwargs=None, weight_tolerance=None, **kwargs)[source]

Runs the FIM-matching calculation for a batch of atomic configurations. This is intended to be able to be used in a serial (non-distributed) manner, outside of a proper orchestrator workflow.

Notes: In other ConfigurationScore modules, the argument score_quantity comes as the second positional argument. For FIM-matching, however, list_of_atoms and fim_property are kept adjacent because the two arguments are closely related, so score_quantity is the third positional argument instead.

Note

If there are multiple properties to target, the FIMs of all those target properties should be combined (summed) to give a single FIM. Additionally, the target covariance to achieve should already be included in this FIM.

Parameters:
  • list_of_atoms (list of ase.Atoms) – a list of ASE Atoms objects

  • fim_property (Path-like or np.ndarray) – the FIM of the target properties.

  • score_quantity (str ("IMPORTANCE")) – The type of score value to compute. For this module, the accepted argument is “IMPORTANCE”.

  • convexopt_init_kwargs (dict) –

    Keyword arguments to instantiate the information_matching.ConvexOpt object. Available keys are:

    • weight_upper_bound (float or np.ndarray): Sets an upper bound for the optimal weights.

    • l1norm_obj (bool): Uses a stricter (but slower) objective function.

  • fim_preconditioning_kwargs (dict) –

    Keyword arguments for preconditioning the FIMs to help with solving the convex optimization problem in information-matching. The preconditioning is done by multiplying each FIM by a scaling factor. If None, the default preconditioning is used (“max_frobenius” with 0 padding). If an empty dictionary is given, no preconditioning is done. Available keywords are:

    • scale_type (str): A string that specifies how the scaling factors are calculated. Currently, the user can choose between “frobenius” and “max_frobenius”.

    • pad (float): A small padding factor so that the inverse of the scaling factor doesn’t diverge.

  • solver_kwargs (dict) – Keyword arguments containing the convex optimization solver settings. An extensive list of available keywords is provided in the CVXPY documentation.

  • weight_tolerance (dict) –

    Keyword arguments that specify the tolerance for extracting non-zero weights. Available keywords are:

    • zero_tol (float): Sets the tolerance for the (primal) value of the weights.

    • zero_tol_dual (float): Sets the tolerance for the dual value of the weights.

    The optimal weight is set to zero when its (primal) value is less than zero_tol, and its dual value is greater than zero_tol_dual.

Returns:

Optimal weights for the training configurations

Return type:

list
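
The preconditioning options can be pictured with the sketch below. The exact formula used by the module is not given on this page, so this is an assumption: each FIM is divided by its own Frobenius norm (“frobenius”) or by the largest Frobenius norm among the candidates (“max_frobenius”), with pad added to the norm. The function name frobenius_scales is hypothetical.

```python
import numpy as np


def frobenius_scales(fim_candidates, scale_type="max_frobenius", pad=0.0):
    """Multiplicative scaling factor for each candidate FIM (assumed formula)."""
    fims = np.asarray(fim_candidates, dtype=float)
    # Frobenius norm of each (P, P) matrix in the (M, P, P) stack
    norms = np.linalg.norm(fims, axis=(1, 2))
    if scale_type == "max_frobenius":
        norms = np.full_like(norms, norms.max())
    elif scale_type != "frobenius":
        raise ValueError(f"Unknown scale_type: {scale_type!r}")
    # pad keeps the factor finite when a norm is (close to) zero
    return 1.0 / (norms + pad)


fims = np.array([np.diag([3.0, 4.0]), np.diag([0.3, 0.4])])
scales_each = frobenius_scales(fims, scale_type="frobenius")
scales_max = frobenius_scales(fims, scale_type="max_frobenius")
scaled_fims = scales_each[:, None, None] * fims
```

With “frobenius” every scaled FIM has unit Frobenius norm, whereas “max_frobenius” preserves the relative magnitudes between candidates.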

fim_match(fim_property, fim_candidates, convexopt_init_kwargs=None, fim_preconditioning_kwargs=None, solver_kwargs=None, weight_tolerance=None)[source]

An alternative method to run information-matching.

This method is the main machinery for doing the FIM-matching using the information-matching package. This step includes instantiating the solver, solving the convex optimization problem, and extracting the optimal weights.

Note

If targeting multiple target properties, the FIMs of the target properties need to be summed before being passed to this method.

Parameters:
  • fim_property (np.ndarray or dict) – Target FIM. If a dictionary is given, it should follow the convention in the information-matching repo.

  • fim_candidates (3D np.ndarray or dict) –

    FIMs of the candidate configurations. If a dictionary is given, it should follow the convention in the information-matching repo.

  • convexopt_init_kwargs (dict) –

    Keyword arguments to instantiate the information_matching.ConvexOpt object. Available keys are:

    • weight_upper_bound (float or np.ndarray): Sets an upper bound for the optimal weights.

    • l1norm_obj (bool): Uses a stricter (but slower) objective function.

  • fim_preconditioning_kwargs (dict) –

    Keyword arguments for preconditioning the FIMs to help with solving the convex optimization problem in information-matching. The preconditioning is done by multiplying each FIM by a scaling factor. If None, the default preconditioning is used (“max_frobenius” with 0 padding). If an empty dictionary is given, no preconditioning is done. Available keywords are:

    • scale_type (str): A string that specifies how the scaling factors are calculated. Currently, the user can choose between “frobenius” and “max_frobenius”.

    • pad (float): A small padding factor so that the inverse of the scaling factor doesn’t diverge.

  • solver_kwargs (dict) – Keyword arguments containing the convex optimization solver settings. An extensive list of available keywords is provided in the CVXPY documentation.

  • weight_tolerance (dict) –

    Keyword arguments that specify the tolerance for extracting non-zero weights. Available keywords are:

    • zero_tol (float): Sets the tolerance for the (primal) value of the weights.

    • zero_tol_dual (float): Sets the tolerance for the dual value of the weights.

    The optimal weight is set to zero when its (primal) value is less than zero_tol, and its dual value is greater than zero_tol_dual.

Returns:

Optimal weights for the training configurations

Return type:

list
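
The weight_tolerance rule described above can be written out directly. The rule itself follows the parameter description; the function name, the default tolerances, and the example values are hypothetical.

```python
import numpy as np


def threshold_weights(weights, dual_values, zero_tol=1e-4, zero_tol_dual=1e-4):
    """Zero out weights whose primal value is small and dual value is large."""
    weights = np.asarray(weights, dtype=float).copy()
    dual_values = np.asarray(dual_values, dtype=float)
    # Both conditions must hold for a weight to be treated as exactly zero
    is_zero = (weights < zero_tol) & (dual_values > zero_tol_dual)
    weights[is_zero] = 0.0
    return weights


raw_weights = [1e-8, 1e-8, 2.0]
dual_values = [0.5, 0.0, 0.0]
clean_weights = threshold_weights(raw_weights, dual_values)
```

Only the first weight is set to zero here: the second is small but its dual value does not exceed zero_tol_dual, so it is kept as is.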

setup_problem(fim_property, fim_candidates, convexopt_init_kwargs, fim_preconditioning_kwargs)[source]

Set up the FIM-matching problem.

This step includes reading and preparing the FIMs and instantiating the solver.

Return type:

list

get_colabfit_property_definition(score_quantity=None)[source]

A ‘property definition’ is a dictionary used by the ColabFit storage module for exactly specifying the details (data type, shape, description, etc.) of each field required for uniquely defining a given property. This function must be implemented in order to support storage of the computed results in the ColabFit module.

Parameters:

name (str) – the name of the property. Only needs to be provided if the Computer can return multiple properties.

Returns:

the property definition

Return type:

dict