Augmentor Module

Abstract Base Class

class orchestrator.augmentor.augmentor_base.Augmentor(default_iteration_limit=50, checkpoint_file='./orchestrator_checkpoint.json', checkpoint_name='augmentor', **kwargs)

Bases: Recorder

Augmentor class containing methods for dataset augmentation operations

Parameters:
  • default_iteration_limit (int) – iteration limit for iterative FPS algorithms used within the Augmentor; can be overridden by method arguments at runtime. Optional.

    Default: 50

  • checkpoint_file (str) – name of the checkpoint file to write restart information to

    Default: ‘./orchestrator_checkpoint.json’

  • checkpoint_name (str) – name of the restart block for this module in the checkpoint file

    Default: ‘augmentor’

__init__(default_iteration_limit=50, checkpoint_file='./orchestrator_checkpoint.json', checkpoint_name='augmentor', **kwargs)

set variables and initialize the recorder and default workflow

Parameters:
  • default_iteration_limit (int) – iteration limit for iterative FPS algorithms used within the Augmentor; can be overridden by method arguments at runtime. Optional.

    Default: 50

  • checkpoint_file (str) – name of the checkpoint file to write restart information to

    Default: ‘./orchestrator_checkpoint.json’

  • checkpoint_name (str) – name of the restart block for this module in the checkpoint file

    Default: ‘augmentor’

checkpoint_augmentor()

checkpoint the augmentor module into the checkpoint file

save necessary internal variables into a dict with key checkpoint_name and write to the (json) checkpoint file for restart capabilities

restart_augmentor()

restart the augmentor module from the checkpoint file

check if the checkpoint_file has an entry matching the checkpoint_name and set internal variables accordingly if so
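The checkpoint and restart behavior described above can be illustrated with a minimal sketch, assuming the checkpoint file is a single JSON object holding one block per module. The helper names write_checkpoint and read_checkpoint and the saved state are hypothetical; they are not part of the orchestrator API.

```python
import json
import os
import tempfile


def write_checkpoint(checkpoint_file, checkpoint_name, state):
    """Store `state` under `checkpoint_name`, keeping any other blocks in the file."""
    data = {}
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            data = json.load(f)
    data[checkpoint_name] = state
    with open(checkpoint_file, "w") as f:
        json.dump(data, f, indent=2)


def read_checkpoint(checkpoint_file, checkpoint_name):
    """Return the saved block for `checkpoint_name`, or None if there is none."""
    if not os.path.exists(checkpoint_file):
        return None
    with open(checkpoint_file) as f:
        return json.load(f).get(checkpoint_name)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "orchestrator_checkpoint.json")
    # Hypothetical internal state; the real module decides what it saves.
    write_checkpoint(path, "augmentor", {"default_iteration_limit": 50})
    restored = read_checkpoint(path, "augmentor")
    missing = read_checkpoint(path, "other_module")
```

Keying each block by checkpoint_name lets several modules share one checkpoint file without overwriting one another.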

identify_novel_environments(configs_to_evaluate, reference_dataset, score_module, score_compute_args, workflow, job_details=None, batch_size=1, test_criteria=0.0)

Use a novelty score metric to find environments outside a dataset

Evaluate the atomic environments of an input list of configurations with respect to an existing dataset, denoting the environments which are novel to the dataset based on the evaluated score metric.

Parameters:
  • configs_to_evaluate (list[Atoms]) – ASE Atoms list (or single Atoms object) of configurations to analyze compared to a reference dataset. Descriptors must be precomputed on the Atoms.

  • reference_dataset (list[Atoms]) – ASE Atoms which encapsulate the reference set to compare to. Descriptors must be precomputed on the Atoms.

  • score_module (ScoreBase) – instantiated Score module which provides compute functions for obtaining a score value for a given dataset. Currently this method only supports QUESTSDeltaEntropyScore.

  • score_compute_args (dict) – arguments that define the computation of the score module. Must include the ‘descriptors_key’ key for the precomputed descriptors.

  • workflow (Workflow) – Workflow module to use for computing scores

  • job_details (dict) – dict that includes any additional parameters for running the job (passed to submit_job())

    Default: None

  • batch_size (int) – number of configurations to pass to compute_batch() at once.

    Default: 1

  • test_criteria (float) – value of the score which indicates an atomic environment should be considered “novel”

    Default: 0.0

Returns:

tuple containing the list of configs with attached scores; a list, with one entry per config, of boolean masks matching the size of each evaluated config, where True indicates that the environment should be considered novel compared to the reference dataset; and the max index in the combined array, used to seed an FPS if not all points will be used

Return type:

tuple(list[np.ndarray], int)
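
As a rough illustration of the masks and seed index described above, the sketch below thresholds per-atom scores and locates the highest score in the concatenated array. It assumes that scores strictly greater than test_criteria count as novel and that the "max index" is the position of the largest score; the function name novelty_masks and the score values are hypothetical.

```python
def novelty_masks(per_config_scores, test_criteria=0.0):
    """Return one boolean mask per config and the index of the top score overall."""
    # Assumption: an environment is novel when its score exceeds the threshold.
    masks = [[s > test_criteria for s in scores] for scores in per_config_scores]
    # Concatenate all per-atom scores to find the single most novel environment.
    combined = [s for scores in per_config_scores for s in scores]
    max_index = max(range(len(combined)), key=combined.__getitem__)
    return masks, max_index


# Two configurations with three and two atoms respectively (made-up scores).
scores = [[-0.2, 0.5, 0.0], [1.3, -0.1]]
masks, max_index = novelty_masks(scores)
```

Here the second atom of the first configuration and the first atom of the second configuration are flagged, and max_index points at the latter within the combined array.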

extract_and_tag_subcells(configs, selection_masks, extract_rc, extract_box_size, min_dist_delete=0.7, keys_to_transfer=None)

use the extract_env function to isolate and relax subcells from a cell

This method will generally be used as part of an active learning loop in conjunction with identify_novel_environments(). If the desired extracted cell is larger than the initial configuration, the initial configuration is returned instead.

Parameters:
  • configs (list[Atoms]) – ASE Atoms list (or single Atoms object) containing the configurations from which subcells will be extracted. Should be the same length as the selection_masks list

  • selection_masks (np.ndarray or str) – list of masks of the same shape as configs which note the atomic environments to extract or the string ‘attached’ in which case masks will be taken from the arrays with the key SELECTION_MASK_KEY from the configs

  • extract_rc (float or str) – if a float, the cutoff radius in Angstroms used to extract and constrain positions. Otherwise, a string of the form ‘shell-X’, where X is the desired NN shell (min 1) to denote as valid

  • extract_box_size (float) – side length of the box to embed the extracted environment into

  • min_dist_delete (float) – distance in Angstroms below which atoms, excluding those in the fixed center core, are considered colliding and are deleted. Set to 0 for no deletions. This removes unphysically close contacts resulting from the new periodic boundaries.

    Default: 0.7

  • keys_to_transfer (list of str) – any keys of data attached to the configs which should be preserved through the extraction process.

    Default: None

Returns:

a list of length sum(selection_masks) containing ASE Atoms, each with a central_atom_index key in its metadata dictionary

Return type:

list[Atoms]
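
To make the relationship between the masks and the number of returned subcells concrete, the sketch below lists which (configuration, atom) pairs a set of masks selects. The function name selected_environments and the ordering of the output are assumptions for illustration; the index of the central atom inside each extracted subcell may differ from its index in the source configuration.

```python
def selected_environments(selection_masks):
    """List (config_index, atom_index) for every True entry in the masks."""
    return [
        (config_index, atom_index)
        for config_index, mask in enumerate(selection_masks)
        for atom_index, keep in enumerate(mask)
        if keep
    ]


# One mask per configuration, one boolean per atom (made-up values).
masks = [[False, True, False], [True, True]]
targets = selected_environments(masks)
num_subcells = sum(sum(mask) for mask in masks)
```

Each selected pair corresponds to one extracted subcell, so the output list has as many entries as there are True values across all masks.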

score_and_extract_subcells(configs, reference_dataset, score_module, score_compute_args, workflow, extract_rc, extract_box_size, min_dist_delete=0.7, job_details=None, batch_size=1, max_num_to_extract=None, extraction_pruning_score=None, extraction_pruning_args=None)

Integrated method for identifying and extracting novel atomic envs

This method uses identify_novel_environments() and extract_and_tag_subcells() to obtain a list of subcells with central atoms which are considered novel with respect to the provided reference set by the provided score module.

Parameters:
  • configs (list[Atoms]) – ASE Atoms list (or single Atoms object) of configurations to analyze compared to a reference dataset. Descriptors must be precomputed on the Atoms.

  • reference_dataset (list[Atoms]) – ASE Atoms which encapsulate the reference set to compare to. Descriptors must be precomputed on the Atoms.

  • score_module (ScoreBase) – instantiated Score module which provides compute functions for obtaining a score value for a given dataset. Currently this method only supports QUESTSDeltaEntropyScore.

  • score_compute_args (dict) – arguments that define the computation of the score module. Must include the ‘descriptors_key’ key for the precomputed descriptors.

  • workflow (Workflow) – Workflow module to use for computing scores

  • extract_rc (float) – cutoff radius to extract and constrain positions in Angstroms

  • extract_box_size (float) – side length of the box to embed the extracted environment into

  • min_dist_delete (float) – distance in Angstroms below which atoms, excluding those in the fixed center core, are considered colliding and are deleted. Set to 0 for no deletions. This removes unphysically close contacts resulting from the new boundaries.

    Default: 0.7

  • job_details (dict) – dict that includes any additional parameters for running the initial novel environment job

    Default: None

  • batch_size (int) – number of configurations to batch for the novel environments job

    Default: 1

  • max_num_to_extract (int) – limit of number of subcells to extract. If this parameter is provided, FPS will be performed directly on the subset of novel environments to return the desired number of envs. The first point selected for sampling will be the most novel determined by the score module.

    Default: None

  • extraction_pruning_score (Score) – Score module to use for down-selecting which of the novel environments to extract. If max_num_to_extract is provided, this argument will be ignored. Chunked iterative pruning will be applied using this score module to remove redundancy from the novel environments prior to subcell extraction. Compute arguments can be provided via the extraction_pruning_args argument.

    Default: None

  • extraction_pruning_args (dict) – Arguments to specify the score calculation employed in iterative pruning if extraction_pruning_score is provided. If extraction_pruning_score is present, this argument remains optional, with the score using default values for the compute args in this circumstance.

    Default: None

Returns:

a list of length min(max_num_to_extract, number of novel environments) containing ASE Atoms of subcells extracted from the provided configurations, each with a central_atom_index key in its metadata dictionary to mark which atom in the subcell was calculated to be novel

Return type:

list[Atoms]
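
The down-selection described for max_num_to_extract can be sketched as a greedy farthest point sampling (FPS) pass seeded at the most novel environment. This is a generic FPS written for illustration, not the library's implementation; the function name, the 2D descriptors, and the scores are all made up.

```python
import math


def farthest_point_sampling(points, num_samples, first_index=0):
    """Greedy FPS: repeatedly pick the point farthest from everything chosen so far."""
    num_samples = min(num_samples, len(points))
    if num_samples <= 0:
        return []
    selected = [first_index]
    # Distance from each point to its nearest already-selected point.
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(selected) < num_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


# Descriptors and novelty scores of the novel environments (hypothetical values).
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 4.9)]
scores = [0.3, 0.2, 0.9, 0.5, 0.4]
max_num_to_extract = 3

# Seed the sampling at the most novel environment, then spread out from there.
seed = max(range(len(scores)), key=scores.__getitem__)
selected = farthest_point_sampling(descriptors, max_num_to_extract, first_index=seed)
```

Seeding at the highest score guarantees that the most novel environment is always among those extracted, while the remaining picks favor environments that differ from it and from each other.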

simple_prune_dataset(dataset, prune_method, prune_value, prune_large_value, score_args, score_module, workflow, storage=None)

prune a dataset based on analysis from one or more UQ methods

Take an existing dataset and reduce its size. This can be done based on a percentage or a value cutoff (prune_method) of the UQ metric output. The dataset can be an explicit list of ASE atoms or a storage handle. If the latter, then the storage module must also be provided. For more intelligent pruning, consider using the chunked_iterative_fps_prune() method instead.

Parameters:
  • dataset (list[Atoms] or str) – ASE list or dataset handle

  • prune_method (str) – option of either ‘percentage’ or ‘cutoff’

  • prune_value (float) – value to apply with prune_method. If prune_method = ‘percentage’, the value must be between 0 and 1.0 and represents the fraction of the dataset that will be pruned. If prune_method = ‘cutoff’, the value is the absolute score above or below which (see prune_large_value) data is pruned.

  • prune_large_value (bool) – a sorting variable to indicate if larger values (True) or smaller values (False) should be pruned.

  • score_args (dict) – arguments that define the computation of the score module. These may be set by an input file or from the caller.

  • score_module (ScoreBase) – instantiated Score module which provides compute functions for obtaining a score value for a given dataset.

  • workflow (Workflow) – Workflow module to use for computing scores

  • storage (Storage) – Storage module where the dataset is stored if the dataset argument is a dataset_handle. Otherwise this argument is not necessary.

Returns:

list of ASE atoms with a pruning mask applied as 0 or 1 weights

Return type:

list[Atoms]
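
The two pruning modes can be sketched on a plain list of scores. The sketch assumes one score per dataset entry, strict comparisons for the cutoff, and rounding down when converting the fraction to a count; the function name prune_weights is hypothetical.

```python
def prune_weights(scores, prune_method, prune_value, prune_large_value):
    """Return 0/1 weights, where 0 marks an entry that has been pruned."""
    n = len(scores)
    if prune_method == "percentage":
        # Prune the requested fraction, starting from the largest or smallest scores.
        num_pruned = int(prune_value * n)
        order = sorted(range(n), key=scores.__getitem__, reverse=prune_large_value)
        pruned = set(order[:num_pruned])
    elif prune_method == "cutoff":
        if prune_large_value:
            pruned = {i for i, s in enumerate(scores) if s > prune_value}
        else:
            pruned = {i for i, s in enumerate(scores) if s < prune_value}
    else:
        raise ValueError("prune_method must be 'percentage' or 'cutoff'")
    return [0 if i in pruned else 1 for i in range(n)]


scores = [0.1, 0.9, 0.4, 0.7]  # made-up UQ scores
by_fraction = prune_weights(scores, "percentage", 0.5, True)
by_cutoff = prune_weights(scores, "cutoff", 0.3, False)
```

With these inputs, the fraction mode drops the two largest scores and the cutoff mode drops only the entry scoring below 0.3.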

estimate_pruning_ratio(dataset, score_module, score_compute_args)

Use an appropriate score metric to estimate the ideal degree of pruning

Apply a score metric that estimates the redundancy of the provided dataset. From this quantity, provide the estimated pruning ratio as a value between 0.0 and 1.0

Parameters:
  • dataset (list[Atoms]) – dataset to evaluate for pruning

  • score_module (QUESTSEfficiencyScore) – score module to obtain the (redundancy) metric from

  • score_compute_args (dict) – compute arguments to drive the score calculation

Returns:

the pruning ratio as a value between 0 and 1, representing the fraction to be pruned from a dataset

Return type:

float

iterative_fps_prune(dataset, descriptors_key, prune_approach, num_chunks=1, prune_ratio_args=None, pruning_convergence=0.01, iteration_limit=None, fps_approach='full', first_index=0, _print_pid=False)

Iteratively apply an FPS algorithm to select an optimally diverse set

Use an iterative estimation of the dataset’s information content to set a pruning ratio, removing that fraction of the dataset until the pruning ratio is below the convergence threshold. Different sampling approaches are supported for executing the pruning of the dataset, including the ability to split pruning over multiple processors. Pruning is performed on precomputed descriptors which are provided with the dataset. For large or production level runs, consider using the chunked_iterative_fps_prune() method instead, as it will scale much more favorably.

Parameters:
  • dataset (list[Atoms]) – list of Atoms which includes precomputed descriptors. Can optionally include existing SELECTION_MASK_KEY data.

  • descriptors_key (str) – string key for the computed descriptors over which we will compute distance for sampling

  • prune_approach (Score) – a score module that can be used to provide a pruning quantity based on the information of the dataset. Currently only QUESTSEfficiencyScore is supported

  • num_chunks (int) – degree of parallelism to apply to sampling, if desired.

    Default: 1

  • prune_ratio_args (dict) – arguments needed to run the score computation

    Default: None

  • pruning_convergence (float) – value below which pruning will stop

    Default: 0.01

  • iteration_limit (int) – limit on the number of iterations that can be performed in the pruning loop. If None, uses the value set at Augmentor initialization (default of 50)

    Default: None

  • fps_approach (str) – selector for the FPS method to use. Supported options are ‘full’, ‘multi’, and ‘approximate’. The ‘full’ approach carries out serial FPS pruning on the full dataset. This option is the most accurate and the most expensive. The ‘multi’ option divides the dataset into num_chunks chunks and extracts the full number of samples from each chunk, then combines these pruned divisions and prunes a second time on the combined set. Because the full number of samples is extracted from each chunk, the minimum pruning ratio is 50% (which supports 2 chunks). See _multiprocess_prune() for more details. The ‘approximate’ option also divides the full dataset into num_chunks chunks, but extracts only an equal proportion of the total number of samples from each chunk. This method is the fastest but also the least accurate, as there are no guarantees for the distribution of points within or between different chunks. See _approximate_multiprocess_prune() for more details.

    Default: ‘full’

  • first_index (int) – first index to use in the FPS algorithm. If not supplied, 0 will always be used. The index is updated through the pruning iterations to track a given sample if non-zero. This assumes that the index corresponds to the atom index in the full dataset.

    Default: 0

  • _print_pid (bool) – internal variable to print the associated PID in logger data. Used when this method is called from within a multiprocessing call

    Default: False

Returns:

a list of atoms matching the input dataset with a new (or updated) array with key SELECTION_MASK_KEY, which is True for atomic environments that should be considered/included in subsequent operations.

Return type:

list[Atoms]
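
The iterative loop described above (estimate a pruning ratio, remove that fraction with FPS, repeat until the ratio falls below the convergence threshold or the iteration limit is reached) can be sketched as follows. The toy_redundancy function is only a stand-in for the score module, and all names and data here are hypothetical; this is not the library's implementation.

```python
import math


def fps(points, num_samples, first_index=0):
    """Greedy farthest point sampling; returns the selected indices."""
    selected = [first_index]
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(selected) < min(num_samples, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


def toy_redundancy(points, radius=0.5):
    """Stand-in score: fraction of points lying within `radius` of an earlier point."""
    if len(points) < 2:
        return 0.0
    redundant = sum(
        any(math.dist(p, q) < radius for q in points[:i])
        for i, p in enumerate(points)
    )
    return redundant / len(points)


def iterative_prune_sketch(points, pruning_convergence=0.01, iteration_limit=50):
    """Return a boolean selection mask over `points`."""
    kept = list(range(len(points)))
    for _ in range(iteration_limit):
        subset = [points[i] for i in kept]
        ratio = toy_redundancy(subset)
        if ratio < pruning_convergence:
            break  # nothing meaningful left to prune
        num_keep = max(1, round(len(kept) * (1.0 - ratio)))
        # Map indices chosen within the subset back to indices in the full set.
        kept = sorted(kept[i] for i in fps(subset, num_keep))
    keep_set = set(kept)
    return [i in keep_set for i in range(len(points))]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 0.0), (5.1, 0.0), (0.0, 5.0)]
mask = iterative_prune_sketch(points)
```

In this toy case the first pass estimates that half of the points are redundant, FPS keeps one representative from each cluster, and the second pass finds nothing left to prune.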

chunked_iterative_fps_prune(dataset, descriptors_key, prune_approach, num_chunks=1, prune_ratio_args=None, pruning_convergence=0.01, iteration_limit=None, first_index=0, hierarchical_parallelism=False)

Iteratively apply an FPS algorithm to select an optimally diverse set

Use an iterative estimation of the dataset’s information content to set a pruning ratio, removing that fraction of the dataset until the pruning ratio is below the convergence threshold. This function performs complete pruning on each subset of the data independently prior to recombining the full dataset for pruning, allowing for better scaling of the pruning to large datasets and a more nuanced treatment of the pruning of the subsets compared to the iterative_fps_prune() method. If the hierarchical_parallelism option is turned on, multiple rounds of increasingly coarse-grained chunking can be employed to refine the dataset. Pruning is performed on precomputed descriptors which are provided with the dataset.

Parameters:
  • dataset (list[Atoms]) – list of Atoms which includes precomputed descriptors. Can optionally include existing SELECTION_MASK_KEY data.

  • descriptors_key (str) – string key for the computed descriptors over which we will compute distance for sampling

  • prune_approach (Score) – a score module that can be used to provide a pruning quantity based on the information of the dataset. Currently only QUESTSEfficiencyScore is supported

  • num_chunks (int) – degree of parallelism to apply to sampling, if desired.

    Default: 1

  • prune_ratio_args (dict) – arguments needed to run the score computation

    Default: None

  • pruning_convergence (float) – value below which pruning will stop

    Default: 0.01

  • iteration_limit (int) – limit on the number of iterations that can be performed in the pruning loop. If None, uses the value set at Augmentor initialization (default of 50)

    Default: None

  • first_index (int) – first index to use in the FPS algorithm. If not supplied, 0 will always be used. The index is updated through the pruning iterations to track a given sample if non-zero. This assumes that the index corresponds to the atom index in the full dataset.

    Default: 0

  • hierarchical_parallelism (bool) – flag for running the chunking iteratively with number of chunks halved at each iteration. If False, just do a single step of chunking

    Default: False

Returns:

a list of atoms matching the input dataset with a new (or updated) array with key SELECTION_MASK_KEY, which is True for atomic environments that should be considered/included in subsequent operations.

Return type:

list[Atoms]
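
The chunking behavior can be illustrated with a small sketch of how the number of chunks might evolve across rounds and how indices could be divided among chunks. The function names, the contiguous splitting, and the choice to stop halving once two chunks remain (leaving the pass over the recombined dataset as a separate final step) are assumptions for illustration only.

```python
def chunk_schedule(num_chunks, hierarchical_parallelism=False):
    """Number of chunks used in each chunked round, halving when hierarchical."""
    rounds = [num_chunks]
    if hierarchical_parallelism:
        while rounds[-1] // 2 > 1:
            rounds.append(rounds[-1] // 2)
    return rounds


def split_into_chunks(indices, num_chunks):
    """Split indices into contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(indices), num_chunks)
    chunks, start = [], 0
    for c in range(num_chunks):
        stop = start + size + (1 if c < extra else 0)
        chunks.append(indices[start:stop])
        start = stop
    return chunks


single_round = chunk_schedule(8)
hierarchical_rounds = chunk_schedule(8, hierarchical_parallelism=True)
chunks = split_into_chunks(list(range(10)), 4)
```

Each chunk would be pruned independently within a round, with the surviving environments recombined before the next, coarser round.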