Control Flow of Optimization Problems¶
See also
- Running Your Optimization Problem: a much shorter overview focused on minimal examples.
This page describes the order in which the functions of the various interfaces are expected to be called. This is sometimes also called the lifecycle of an object.
The contract described here binds both parties: host applications are expected not to call functions out of the expected order, and plugins are expected to be prepared to handle calls that are unusual but within these guidelines.
Control Flow for SingleOptimizable¶
The SingleOptimizable interface provides two methods that a host application can interact with: get_initial_params() and compute_single_objective().
The Execution Loop¶
A host application must receive an initial point by calling get_initial_params() before any call to compute_single_objective(). The initial point usually seeds the optimization algorithm. It is often the current state of the system, a fixed reasonable guess, or a random point in the phase space.
Once the initial point has been received, host applications may call compute_single_objective() as many times as desired. Arguments to the function must lie within the bounds of the optimization_space. Optimization algorithms are strongly encouraged to use the initial point as the argument to their first call to compute_single_objective().
Host applications should not assume that the last evaluation of an optimization algorithm is also the optimal one. After a successful optimization, they should call compute_single_objective() once more with the optimal argument. Because SingleOptimizable objects are stateful, this is expected to set them to their optimal state.
The Initial Point¶
The initial point is strongly encouraged to lie within the bounds of the optimization_space. Host applications may assume that it’s safe to call compute_single_objective() with the point returned by get_initial_params(). This often happens when an optimization has failed or been cancelled and a user wishes to return the system to its initial state.
This implies that compute_single_objective() should not clip its argument into bounds. Instead, host applications are strongly encouraged to clip arguments before calling compute_single_objective(), and to never clip the result of get_initial_params().
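These clipping rules can be sketched as follows. `DummyProblem` is a made-up stand-in for a real optimization problem; its bounds and return values are invented purely for illustration:

```python
import numpy as np


class DummyProblem:
    """Hypothetical stand-in for a real SingleOptimizable problem."""

    low = np.array([-1.0, -1.0])
    high = np.array([1.0, 1.0])

    def get_initial_params(self):
        # The initial point may lie (slightly) outside the bounds.
        return np.array([1.5, 0.0])

    def compute_single_objective(self, params):
        # No clipping here: the problem trusts its caller.
        return float(np.sum(params ** 2))


problem = DummyProblem()

# Never clip the point returned by get_initial_params() ...
initial = problem.get_initial_params()
problem.compute_single_objective(initial)

# ... but do clip every point suggested by the optimizer.
suggestion = np.array([2.0, -3.0])
clipped = np.clip(suggestion, problem.low, problem.high)
loss = problem.compute_single_objective(clipped)  # clipped == [1.0, -1.0]
```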
Cancellation and Repetition¶
Optimization runs may be cancelled at any point. A SingleOptimizable cannot expect to run to completion every time. In particular, a user may want to interrupt a call to compute_single_objective() if it takes considerable time. Optimization problems are encouraged to use Cancellation to honor such requests.
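The cancellation mechanism itself is documented elsewhere; as a rough illustration of the pattern, the sketch below uses a plain `threading.Event` as a stand-in for a real cancellation token:

```python
import threading


class SlowProblem:
    """Toy problem that polls a cancellation flag during a slow evaluation."""

    def __init__(self, cancel_flag: threading.Event) -> None:
        self.cancel_flag = cancel_flag

    def compute_single_objective(self, params):
        for _ in range(10_000):
            if self.cancel_flag.is_set():
                # Honor the request as soon as possible.
                raise RuntimeError("evaluation cancelled")
            # ... one step of a slow measurement would go here ...
        return 0.0


flag = threading.Event()
problem = SlowProblem(flag)
flag.set()  # Simulate the user clicking "Cancel".
try:
    problem.compute_single_objective(None)
except RuntimeError as exc:
    print(exc)  # evaluation cancelled
```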
A host application may call get_initial_params() more than once. Each call to the function is expected to start a new optimization, so SingleOptimizable is allowed to clear internal buffers and restart any rendering from scratch.
Rendering¶
Host applications may call render() at any point between other calls, including before the first call to get_initial_params(). Rendering may be requested multiple times between two calls to compute_single_objective(), so it should not modify the state of the problem.
Calls to get_initial_params() and compute_single_objective() should not automatically call render() except when:
- the render mode is "human";
- the render mode is list-based, e.g. "rgb_array_list" or "ansi_list".
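A minimal sketch of this rule, using a toy class (not the real interface) that auto-renders only in the listed modes:

```python
class RenderingProblem:
    """Toy problem illustrating the automatic-rendering rule (not the real API)."""

    # Modes for which objective evaluation calls render() itself.
    AUTO_RENDER_MODES = ("human", "rgb_array_list", "ansi_list")

    def __init__(self, render_mode=None):
        self.render_mode = render_mode
        self.render_calls = 0

    def render(self):
        self.render_calls += 1

    def compute_single_objective(self, params):
        loss = sum(p * p for p in params)
        if self.render_mode in self.AUTO_RENDER_MODES:
            self.render()
        return loss


human = RenderingProblem(render_mode="human")
human.compute_single_objective([1.0, 2.0])

silent = RenderingProblem(render_mode="rgb_array")
silent.compute_single_objective([1.0, 2.0])
print(human.render_calls, silent.render_calls)  # 1 0
```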
SingleOptimizable Example¶
A typical execution loop could look like this:
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

problem = coi.make("MySingleOptimizableProblem-v0")
assert isinstance(problem, coi.SingleOptimizable)
with problem:
    # Fetch initial state.
    optimizer = get_optimizer()
    space = problem.optimization_space
    assert isinstance(space, Box)
    initial = params = problem.get_initial_params()
    best = (float("inf"), initial)

    while not optimizer.is_done():
        # Update optimum.
        loss = problem.compute_single_objective(params)
        if float(loss) < best[0]:
            best = (float(loss), params)

        # Fetch next set of parameters.
        params = optimizer.step(loss)
        params = clip(params, space.low, space.high)

    if optimizer.has_failed():
        # Restore initial state.
        problem.compute_single_objective(initial)
    else:
        # Restore best state.
        problem.compute_single_objective(best[1])
Control Flow for FunctionOptimizable¶
Though a FunctionOptimizable is similar to a sequence of single-objective optimization problems (one per skeleton point), much greater care must be taken to reset it correctly in case of failure.
Skeleton Points¶
The precise meaning of the time parameter is a little vague, since it will typically depend on the institution and context where it is used.
In the CERN accelerator complex, the injectors such as PS and SPS run in cycles where each cycle is one full sequence of particle injection, acceleration, and extraction (with several optional stages in between). Each cycle is typically associated with a different user, who may request the beam to go down a particular path of the complex (e.g. towards the LHC or towards the North Experimental Area).
In this context, the skeleton points are points in time along one cycle given in milliseconds. They’re always measured from the start of the cycle (rather than e.g. from the start of injection).
Warning
Other laboratories are strongly encouraged to adopt a similarly strong notion about the interpretation of skeleton points. To facilitate cooperation and to avoid catastrophic human error, the notion of skeleton points should be as homogeneous across a laboratory as possible.
Selecting Skeleton Points¶
A host application must query skeleton points from the optimization problem via override_skeleton_points(). If it returns a list, that list of points must be used in the following optimization. If (and only if) it returns None, the user may be prompted to input a list of their choosing. Whether override_skeleton_points() returns a list or None may depend on its configuration.
Sequencing Optimizations¶
Optimizations of individual skeleton points are always fully sequenced with respect to each other. Only once a skeleton point has been fully optimized may the next optimization be started. Optimization problems are allowed to allocate resources based on whether the skeleton point parameter has changed.
Skeleton points are always optimized in order, from lowest to highest. Optimization problems may rely on this fact and e.g. use the fact that get_initial_params() has been called with a lower skeleton point than before as a signal to clear their rendering data.
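A toy illustration of this signal, using a hypothetical problem class (not the real FunctionOptimizable base):

```python
class CycleProblem:
    """Toy stand-in (not the real interface) relying on the sequencing rule.

    Because skeleton points are optimized from lowest to highest, a call
    with a *lower* point than the previous one signals a new run.
    """

    def __init__(self):
        self.last_point = float("-inf")
        self.render_data = []

    def get_initial_params(self, cycle_time):
        if cycle_time < self.last_point:
            # New optimization run: discard data from the previous one.
            self.render_data.clear()
        self.last_point = cycle_time
        return [0.0]


problem = CycleProblem()
problem.get_initial_params(100.0)
problem.render_data.append("plot for t=100")
problem.get_initial_params(300.0)  # Same run: data is kept.
problem.get_initial_params(100.0)  # Lower point: new run, data cleared.
print(problem.render_data)  # []
```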
This sequencing rule includes get_optimization_space() and get_initial_params(): these methods may only be called with a skeleton point once the optimization for that point starts. It is forbidden to e.g. fetch the spaces or the initial parameters for all skeleton points at once and then start optimization for each of them.
Resetting¶
Within the optimization of a single skeleton point, the same rules as for SingleOptimizable apply. One exception concerns cancellation of an optimization due to an error or user request. When a FunctionOptimizable is reset, the reset must begin with the lowest skeleton point and then proceed to the highest that the host application has interacted with. Skeleton points higher than the one whose optimization was interrupted must not be reset. This means that host applications must usually keep track of which skeleton points have been optimized and which haven’t.
FunctionOptimizable Example¶
A typical execution loop over multiple skeleton points could look like this:
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

problem = coi.make("MyFunctionOptimizableProblem-v0")
assert isinstance(problem, coi.FunctionOptimizable)
with problem:
    # Select skeleton points.
    skeleton_points = problem.override_skeleton_points()
    if skeleton_points is None:
        skeleton_points = request_skeleton_points()

    # Keep track of which points we have modified and which not.
    restore_on_failure = []

    try:
        for time in skeleton_points:
            # Fetch initial state.
            optimizer = get_optimizer()
            space = problem.get_optimization_space(time)
            assert isinstance(space, Box)
            initial = params = problem.get_initial_params(time)
            best = (float("inf"), initial)
            restore_on_failure.append((time, initial))

            while not optimizer.is_done():
                # Update optimum.
                loss = problem.compute_function_objective(time, params)
                if float(loss) < best[0]:
                    best = (float(loss), params)

                # Fetch next set of parameters.
                params = optimizer.step(loss)
                params = clip(params, space.low, space.high)

            if optimizer.has_failed():
                raise OptFailed(f"optimizer failed at t={time}")
            else:
                # Restore best state.
                problem.compute_function_objective(time, best[1])
    except BaseException:
        # If anything fails, restore initial state not only for the
        # current skeleton point, but all previous ones as well.
        while restore_on_failure:
            time, params = restore_on_failure.pop()
            problem.compute_function_objective(time, params)
        raise
Control Flow for Env¶
The Env interface provides three methods that a host application can interact with: reset(), step() and close(). In contrast to SingleOptimizable, the Env interface is typically called many times in episodes, especially during training. Each episode follows the same protocol.
Episode Start¶
The reset() method must be called at the start of an episode. It may clear any buffers from the previous episode and set the system to an initial state. That state may be constant, but is typically random and known to be bad. The function then returns an initial observation that is used to seed the RL agent. It also returns an info dict, which may contain additional debugging information or other metadata.
Note
The AutoResetWrapper calls reset() automatically, even if a host application doesn’t do so.
Episode Steps¶
The initial observation given by reset() is passed to the RL agent, which calculates a recommended action based on its policy. This action is passed to step(), which must return a quintuple (obs, reward, terminated, truncated, info), where:
- obs is the next observation and must be used to determine the next action;
- reward is the reward for the previous action (a reinforcement learner’s goal is to maximize the expected cumulative reward over an episode);
- terminated is a boolean flag indicating whether the agent has reached a terminal state of the environment (e.g. game won/lost);
- truncated is a boolean flag indicating whether the episode has been ended for a reason external to the environment (e.g. a training time limit expired);
- info is an info dict, which may contain additional debugging information or other metadata.
In short: given the initial observation, agent and environment act in a loop, with observations going into the agent and actions into the environment, until the end of the episode.
An episode ends when either terminated or truncated (or both) is True. When the episode is over, the host application must not make any further calls to step(). Instead, it must call reset() to start the next episode.
The host application is free to end an episode prematurely, i.e. to call reset() before the end of the episode. There is no guarantee that any episode is ever driven to completion.
The Info Dict¶
While the info dict may contain any additional information imaginable, there are a few keys that have an established meaning:
- info["success"]: bool¶
is a bool indicating whether the episode has ended by reaching a “good” terminal state. Rendering wrappers may use this key to highlight the episode in a particular manner. If the step hasn’t actually ended the episode, this key has no meaning. If the episode has ended and the key is absent, this must be interpreted as an indeterminate terminal state, and not necessarily as a bad one.
- info["final_observation"] and info["final_info"]¶
are defined by AutoResetWrapper. They are added whenever an episode ends and reset() is called automatically. They contain the observation and info from the last step of the previous episode, since in the return value of step(), these values have been supplanted by those from reset().
- info["episode"]: dict[str, Any]¶
is defined by RecordEpisodeStatistics. It is a dict with the cumulative reward, the episode length in steps, and the length in time.
- info["reward"]: float¶
is defined by SeparableEnv and SeparableGoalEnv. It contains the reward of the current step and is set by their default implementations of step().
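As a sketch of how a host application might interpret these conventions at the end of a step (the helper function is made up for illustration):

```python
def describe_episode_end(terminated, truncated, info):
    """Hypothetical helper interpreting the end-of-episode conventions."""
    if not (terminated or truncated):
        return "episode still running"
    # An absent "success" key means an *indeterminate* terminal state,
    # not necessarily a bad one.
    success = info.get("success")
    if success is None:
        return "episode ended (outcome unknown)"
    return "episode succeeded" if success else "episode failed"


print(describe_episode_end(True, False, {"success": True}))  # episode succeeded
print(describe_episode_end(True, False, {}))                 # episode ended (outcome unknown)
print(describe_episode_end(False, False, {}))                # episode still running
```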
Closing¶
The close() method is called at the end of the lifetime of an environment. This may happen after one full optimization run or after several. No further calls to reset() or step() will be made afterwards. This method should release any resources that the environment has acquired in its __init__() method.
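A minimal sketch of this pairing, with `FakeDevice` as a made-up stand-in for whatever resource the environment acquires:

```python
class FakeDevice:
    """Stand-in for an external resource (hardware handle, connection, ...)."""

    def __init__(self):
        self.open = True

    def release(self):
        self.open = False


class HardwareEnv:
    """Toy environment pairing acquisition in __init__() with close()."""

    def __init__(self):
        self.device = FakeDevice()

    def close(self):
        # Release everything acquired in __init__(); safe to call twice.
        if self.device is not None:
            self.device.release()
            self.device = None


env = HardwareEnv()
env.close()
env.close()  # Idempotent: a second close() does nothing.
```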
Env Rendering¶
The same rules for Rendering apply as for the other classes. Automatic calls to render() are usually handled by wrappers like HumanRendering or RenderCollection, and not by the environment itself.
Env Example¶
A typical execution loop for environments might look like this:
from gymnasium import Env
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

policy = get_policy()
num_episodes = get_num_episodes()

# Limit steps per episode to prevent infinite loops.
env = coi.make("MyEnv-v0", max_episode_steps=10)
assert isinstance(env, Env)
with env:
    ac_space = env.action_space
    assert isinstance(ac_space, Box)

    for _ in range(num_episodes):
        terminated = truncated = False
        obs, info = env.reset()
        while not (terminated or truncated):
            action = policy.predict(obs)
            action = clip(action, ac_space.low, ac_space.high)
            obs, reward, terminated, truncated, info = env.step(action)