Restart
=======

Considering that usage of the Orchestrator may include long-walltime jobs or multistep workflows, it is useful to have the ability to restart the Orchestrator in some specific state beyond basic initialization. To this end, we have created restart functionality that is flexible and extensible, defined on a per-module basis. Generally speaking, one simply needs to import the
:class:`~orchestrator.utils.restart.restarter` instance to leverage the
:meth:`~orchestrator.utils.restart.Restart.write_checkpoint_file` and
:meth:`~orchestrator.utils.restart.Restart.read_checkpoint_file` methods which
are central to the restart methodology.

The primary purpose of the Restart module is to save the current state of the
Orchestrator to enable discontinuous operation. As with the overall design of
the Orchestrator itself, the restart functionality is intended to be modular
and decentralized - each module is responsible to defining and handling its own
checkpointing and restart behavior. Some modules may not require such
capabilities at all. By convention, a module which includes restart
capabilities needs to define two methods: ``checkpoint_[module name]()`` and
``restart_[module name]()``.

See the full API for the module at :ref:`restart_module`.

.. _restart_structure:

Checkpoint File Structure
-------------------------

While modules are free to define their own auxiliary files to assist in the
restart process (see Workflows
:meth:`~orchestrator.workflow.workflow_base.Workflow.save_job_dict` as an
example), the bulk of the information will be saved in a shared json file. This
file is organized in a hierarchical fashion - at the highest level it is split
up into sections which correspond to each module instance. Each of these
sections can also be organized, but this level of oragnization is imposed by
the individual modules. Generically, this leads to a file structured as:

.. code-block:: none

    "module_name_1":{
        "item1": data,
        "item2": data,
        ...
    },
    "module_name_2":{
        "item1": data,
        "hierarchical item2":{
            "item1": data,
            "item2": data
        },
        ...
    },
    ...

While the file should generally be left alone (since Orchestrator will automatically read and write from it as necessary), it is still human readable. This allows for one to inspect the current state of the Orchestrator, and advanced users may find it convenient to modify this file directly to induce a desired state upon restart.

Naming Conventions
------------------

By default, modules using restart functionality define a file name where the
checkpointed information is written: ``./orchestrator_checkpoint.json``. This
file will be shared by all modules and its structure is discussed in greater
detail in :ref:`restart_structure`. This filename can be overridden by adding
the ``checkpoint_file`` field in the input arguments for constructing the
module::

    module = module_builder.build(
        'TOKEN',
        {
            'module_arg1': value1,
            'checkpoint_file': 'custom_name.json',
        },
    )

In addition to the checkpoint file name itself, modules also define the name
they will use to demarcate their section of the checkpoint file, with the
default value as the module type. For instance any :class:`~.Potential` module
uses a default of "potential". This value can be changed by specifying a
``checkpoint_name`` in the input arguments for that module::

    module = module_builder.build(
        'TOKEN',
        {
            'module_arg1': value1,
            'checkpoint_name': 'custom_section_name',
        },
    )

.. warning::
    If your application uses multiple instances of a given module type, you
    should change at least one of their ``checkpoint_name``\ s, otherwise they
    will overwrite each other's checkpoint information. Due to the common usage
    of multiple different :class:`~.Workflow` modules for complex Orchestrator
    operations, this module sets the ``checkpoint_name`` to the specific class
    name be default to avoid common collisions. If using multiple instances of
    the same class, the ``checkpoint_name`` should be manually overridden for
    at least one of the instances.

Use Cases
---------

Read
^^^^

Generally, the "restart" machinary should only be used at instantiation/start up: :meth:`~orchestrator.utils.restart.Restart.read_checkpoint_file` should be called as the last step of the module's ``__init__()`` function. In this way, if any information is available in the ``checkpoint_file`` under the proper ``checkpoint_name``, it can properly initialize or update variables based on the last checkpoint.

Write
^^^^^

More discretion can be used regarding when a module should write to the
``checkpoint_file``. In the case of the
:class:`~orchestrator.workflow.workflow_base.Workflow` classes, the checkpoint
file is updated any time a JobStatus is updated. On the other hand,
the :class:`~orchestrator.potential.potential_base.Potential` module never
calls its own
:meth:`~orchestrator.potential.potential_base.Potential.checkpoint_potential`
method, but relies on other functions which handle logic around the potential
to inform when the checkpoint should be written.

For some modules, checkpointing acts as a simple way to save the "memory" of
the module, while for others, checkpointing enables the discontinuous execution
of more complex or time-intensive operations. In these cases, logic must be
integrated into methods which change their behavior based on the state of flags
designed to track the progress. See :meth:`~.MeltingPoint.calculate_property`
for an example.

Inheritance Graph
-----------------

.. inheritance-diagram::
   orchestrator.utils.restart
   :parts: 3