Control Flow of Optimization Problems¶
See also
- Running Your Optimization Problem: a much shorter overview focused on minimal examples.
This page describes the order in which the functions of the various interfaces are expected to be called. This is sometimes also called the lifecycle of an object.
The contract described here binds both parties: host applications are expected not to call functions out of the expected order, and plugins are expected to be prepared to handle calls that are unusual but within these guidelines.
Control Flow for SingleOptimizable¶
The SingleOptimizable interface provides two methods that a host application can interact with: get_initial_params() and compute_single_objective().
The Execution Loop¶
A host application must receive an initial point by calling get_initial_params() before any call to compute_single_objective(). The initial point usually seeds the optimization algorithm. It is often the current state of the system, a fixed reasonable guess, or a random point in the phase space.
Once the initial point has been received, host applications may call compute_single_objective() as many times as desired. Arguments to the function must lie within the bounds of the optimization_space. Optimization algorithms are strongly encouraged to use the initial point as the argument to their first call to compute_single_objective().
Host applications should not assume that the last evaluation of an optimization algorithm is also the optimal one. After a successful optimization, they should call compute_single_objective() once more with the optimal argument. Because SingleOptimizable objects are stateful, this is expected to set them to their optimal state.
The Initial Point¶
The initial point is strongly encouraged to lie within the bounds of the optimization_space. Host applications may assume that it’s safe to call compute_single_objective() with the point returned by get_initial_params(). This often happens when an optimization has failed or been cancelled and a user wishes to return the system to its initial state.
This implies that compute_single_objective() should not clip its argument into bounds. Instead, host applications are strongly encouraged to clip arguments before calling compute_single_objective(), and to never clip the result of get_initial_params().
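These clipping rules can be sketched as follows. `DummyProblem` is a made-up stand-in for a real optimization problem; its bounds and return values are invented purely for illustration:

```python
import numpy as np


class DummyProblem:
    """Hypothetical stand-in for a real SingleOptimizable problem."""

    low = np.array([-1.0, -1.0])
    high = np.array([1.0, 1.0])

    def get_initial_params(self):
        # The initial point may lie (slightly) outside the bounds.
        return np.array([1.5, 0.0])

    def compute_single_objective(self, params):
        # No clipping here: the problem trusts its caller.
        return float(np.sum(params ** 2))


problem = DummyProblem()

# Never clip the point returned by get_initial_params() ...
initial = problem.get_initial_params()
problem.compute_single_objective(initial)

# ... but do clip every point suggested by the optimizer.
suggestion = np.array([2.0, -3.0])
clipped = np.clip(suggestion, problem.low, problem.high)
loss = problem.compute_single_objective(clipped)  # clipped == [1.0, -1.0]
```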
Cancellation and Repetition¶
Optimization runs may be cancelled at any point. A SingleOptimizable cannot expect to run to completion every time. In particular, a user may want to interrupt a call to compute_single_objective() if it takes considerable time. Optimization problems are encouraged to use Cancellation to honor such requests.
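The cancellation mechanism itself is documented elsewhere; as a rough illustration of the pattern, the sketch below uses a plain `threading.Event` as a stand-in for a real cancellation token:

```python
import threading


class SlowProblem:
    """Toy problem that polls a cancellation flag during a slow evaluation."""

    def __init__(self, cancel_flag: threading.Event) -> None:
        self.cancel_flag = cancel_flag

    def compute_single_objective(self, params):
        for _ in range(10_000):
            if self.cancel_flag.is_set():
                # Honor the request as soon as possible.
                raise RuntimeError("evaluation cancelled")
            # ... one step of a slow measurement would go here ...
        return 0.0


flag = threading.Event()
problem = SlowProblem(flag)
flag.set()  # Simulate the user clicking "Cancel".
try:
    problem.compute_single_objective(None)
except RuntimeError as exc:
    print(exc)  # evaluation cancelled
```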
A host application may call get_initial_params() more than once. Each call to the function is expected to start a new optimization, so SingleOptimizable is allowed to clear internal buffers and restart any rendering from scratch.
Rendering¶
Host applications may call render() at any point between other calls, including before the first call to get_initial_params(). Rendering may be requested multiple times between two calls to compute_single_objective(), so it should not modify the state of the problem.
Calls to get_initial_params() and compute_single_objective() should not automatically call render() except when:
- the render mode is "human";
- the render mode is list-based, e.g. "rgb_array_list" or "ansi_list".
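A minimal sketch of this rule, using a toy class (not the real interface) that auto-renders only in the listed modes:

```python
class RenderingProblem:
    """Toy problem illustrating the automatic-rendering rule (not the real API)."""

    # Modes for which objective evaluation calls render() itself.
    AUTO_RENDER_MODES = ("human", "rgb_array_list", "ansi_list")

    def __init__(self, render_mode=None):
        self.render_mode = render_mode
        self.render_calls = 0

    def render(self):
        self.render_calls += 1

    def compute_single_objective(self, params):
        loss = sum(p * p for p in params)
        if self.render_mode in self.AUTO_RENDER_MODES:
            self.render()
        return loss


human = RenderingProblem(render_mode="human")
human.compute_single_objective([1.0, 2.0])

silent = RenderingProblem(render_mode="rgb_array")
silent.compute_single_objective([1.0, 2.0])
print(human.render_calls, silent.render_calls)  # 1 0
```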
SingleOptimizable Example¶
A typical execution loop could look like this:
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

problem = coi.make("MySingleOptimizableProblem-v0")
assert isinstance(problem, coi.SingleOptimizable)
with problem:
    # Fetch initial state.
    optimizer = get_optimizer()
    space = problem.optimization_space
    assert isinstance(space, Box)
    initial = params = problem.get_initial_params()
    best = (float("inf"), initial)

    while not optimizer.is_done():
        # Update optimum.
        loss = problem.compute_single_objective(params)
        if float(loss) < best[0]:
            best = (float(loss), params)

        # Fetch next set of parameters.
        params = optimizer.step(loss)
        params = clip(params, space.low, space.high)

    if optimizer.has_failed():
        # Restore initial state.
        problem.compute_single_objective(initial)
    else:
        # Restore best state.
        problem.compute_single_objective(best[1])
Control Flow for FunctionOptimizable¶
Though a FunctionOptimizable is similar to a sequence of single-objective optimization problems (one per skeleton point), much greater care must be taken to reset it correctly in case of failure.
Skeleton Points¶
The precise meaning of the time parameter is a little vague, since it will typically depend on the institution and context where it is used.
In the CERN accelerator complex, the injectors such as PS and SPS run in cycles where each cycle is one full sequence of particle injection, acceleration, and extraction (with several optional stages in between). Each cycle is typically associated with a different user, who may request the beam to go down a particular path of the complex (e.g. towards the LHC or towards the North Experimental Area).
In this context, the skeleton points are points in time along one cycle given in milliseconds. They’re always measured from the start of the cycle (rather than e.g. from the start of injection).
Warning
Other laboratories are strongly encouraged to adopt a similarly strong notion about the interpretation of skeleton points. To facilitate cooperation and to avoid catastrophic human error, the notion of skeleton points should be as homogeneous across a laboratory as possible.
Selecting Skeleton Points¶
A host application must query skeleton points from the optimization problem via override_skeleton_points(). If it returns a list, that list of points must be used in the following optimization. If (and only if) it returns None, the user may be prompted to input a list of their choosing. Whether override_skeleton_points() returns a list or None may depend on its configuration.
Sequencing Optimizations¶
Optimizations of individual skeleton points are always fully sequenced with respect to each other. Only once a skeleton point has been fully optimized may the next optimization be started. Optimization problems are allowed to allocate resources based on whether the skeleton point parameter has changed.
Skeleton points are always optimized in order, from lowest to highest. Optimization problems may rely on this fact and e.g. use the fact that get_initial_params() has been called with a lower skeleton point than before as a signal to clear their rendering data.
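A toy illustration of this signal, using a hypothetical problem class (not the real FunctionOptimizable base):

```python
class CycleProblem:
    """Toy stand-in (not the real interface) relying on the sequencing rule.

    Because skeleton points are optimized from lowest to highest, a call
    with a *lower* point than the previous one signals a new run.
    """

    def __init__(self):
        self.last_point = float("-inf")
        self.render_data = []

    def get_initial_params(self, cycle_time):
        if cycle_time < self.last_point:
            # New optimization run: discard data from the previous one.
            self.render_data.clear()
        self.last_point = cycle_time
        return [0.0]


problem = CycleProblem()
problem.get_initial_params(100.0)
problem.render_data.append("plot for t=100")
problem.get_initial_params(300.0)  # Same run: data is kept.
problem.get_initial_params(100.0)  # Lower point: new run, data cleared.
print(problem.render_data)  # []
```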
This sequencing rule includes get_optimization_space() and get_initial_params(): these methods may only be called with a skeleton point once the optimization for that point starts. It is forbidden to e.g. fetch the spaces or the initial parameters for all skeleton points at once and then start optimization for each of them.
Resetting¶
Within the optimization of a single skeleton point, the same rules as for SingleOptimizable apply. One exception concerns cancellation of an optimization due to an error or user request. When a FunctionOptimizable is reset, the reset must begin with the lowest skeleton point and then proceed to the highest that the host application has interacted with. Skeleton points higher than the one whose optimization was interrupted must not be reset. This means that host applications must usually keep track of which skeleton points have been optimized and which haven’t.
FunctionOptimizable Example¶
A typical execution loop over multiple skeleton points could look like this:
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

problem = coi.make("MyFunctionOptimizableProblem-v0")
assert isinstance(problem, coi.FunctionOptimizable)
with problem:
    # Select skeleton points.
    skeleton_points = problem.override_skeleton_points()
    if skeleton_points is None:
        skeleton_points = request_skeleton_points()

    # Keep track of which points we have modified and which not.
    restore_on_failure = []

    try:
        for time in skeleton_points:
            # Fetch initial state.
            optimizer = get_optimizer()
            space = problem.get_optimization_space(time)
            assert isinstance(space, Box)
            initial = params = problem.get_initial_params(time)
            best = (float("inf"), initial)
            restore_on_failure.append((time, initial))

            while not optimizer.is_done():
                # Update optimum.
                loss = problem.compute_function_objective(time, params)
                if float(loss) < best[0]:
                    best = (float(loss), params)

                # Fetch next set of parameters.
                params = optimizer.step(loss)
                params = clip(params, space.low, space.high)

            if optimizer.has_failed():
                raise OptFailed(f"optimizer failed at t={time}")
            else:
                # Restore best state.
                problem.compute_function_objective(time, best[1])
    except BaseException:
        # If anything fails, restore initial state not only for the
        # current skeleton point, but all previous ones as well.
        while restore_on_failure:
            time, params = restore_on_failure.pop()
            problem.compute_function_objective(time, params)
        raise
Control Flow for Env¶
The Env interface provides three methods that a host application can interact with: reset(), step() and close(). In contrast to SingleOptimizable, the Env interface is typically called many times in episodes, especially during training. Each episode follows the same protocol.
Episode Start¶
The reset() method must be called at the start of an episode. It may clear any buffers from the previous episode and set the system to an initial state. That state may be constant, but is typically random and known to be bad. The function then returns an initial observation that is used to seed the RL agent. It also returns an info dict, which may contain additional debugging information or other metadata.
Note
The AutoResetWrapper calls reset() automatically, even if a host application doesn’t do so.
Episode Steps¶
The initial observation given by reset() is passed to the RL agent, which calculates a recommended action based on its policy. This action is passed to step(), which must return a quintuple (obs, reward, terminated, truncated, info), where:
- obs is the next observation and must be used to determine the next action;
- reward is the reward for the previous action (a reinforcement learner’s goal is to maximize the expected cumulative reward over an episode);
- terminated is a boolean flag indicating whether the agent has reached a terminal state of the environment (e.g. game won/lost);
- truncated is a boolean flag indicating whether the episode has been ended for a reason external to the environment (e.g. a training time limit expired);
- info is an info dict, which may contain additional debugging information or other metadata.
In short: given the initial observation, agent and environment act in a loop, with observations going into the agent and actions into the environment, until the end of the episode.
An episode ends when either terminated or truncated (or both) is True. When the episode is over, the host application must not make any further calls to step(). Instead, it must call reset() to start the next episode.
The host application is free to end an episode prematurely, i.e. to call reset() before the end of the episode. There is no guarantee that any episode is ever driven to completion.
The Info Dict¶
While the info dict may contain any additional information imaginable, there are a few keys that have an established meaning:
- info["success"]: bool¶
is a bool indicating whether the episode has ended by reaching a “good” terminal state. Rendering wrappers may use this key to highlight the episode in a particular manner. If the step hasn’t actually ended the episode, this key has no meaning. If the episode has ended and the key is absent, this must be interpreted as an indeterminate terminal state, and not necessarily as a bad one.
- info["final_observation"] and info["final_info"]¶
are defined by AutoResetWrapper. They are added whenever an episode ends and reset() is called automatically. They contain the observation and info from the last step of the previous episode, since in the return value of step(), these values have been supplanted by those from reset().
- info["episode"]: dict[str, Any]¶
is defined by RecordEpisodeStatistics. It is a dict with the cumulative reward, the episode length in steps, and the length in time.
- info["reward"]: float¶
is defined by SeparableEnv and SeparableGoalEnv. It contains the reward of the current step and is set by their default implementations of step().
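As a sketch of how a host application might interpret these conventions at the end of a step (the helper function is made up for illustration):

```python
def describe_episode_end(terminated, truncated, info):
    """Hypothetical helper interpreting the end-of-episode conventions."""
    if not (terminated or truncated):
        return "episode still running"
    # An absent "success" key means an *indeterminate* terminal state,
    # not necessarily a bad one.
    success = info.get("success")
    if success is None:
        return "episode ended (outcome unknown)"
    return "episode succeeded" if success else "episode failed"


print(describe_episode_end(True, False, {"success": True}))  # episode succeeded
print(describe_episode_end(True, False, {}))                 # episode ended (outcome unknown)
print(describe_episode_end(False, False, {}))                # episode still running
```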
Closing¶
The close() method is called at the end of the lifetime of an environment. This may happen after one full optimization run or after several. No further calls to reset() or step() will be made afterwards. This method should release any resources that the environment has acquired in its __init__() method.
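A minimal sketch of this pairing, with `FakeDevice` as a made-up stand-in for whatever resource the environment acquires:

```python
class FakeDevice:
    """Stand-in for an external resource (hardware handle, connection, ...)."""

    def __init__(self):
        self.open = True

    def release(self):
        self.open = False


class HardwareEnv:
    """Toy environment pairing acquisition in __init__() with close()."""

    def __init__(self):
        self.device = FakeDevice()

    def close(self):
        # Release everything acquired in __init__(); safe to call twice.
        if self.device is not None:
            self.device.release()
            self.device = None


env = HardwareEnv()
env.close()
env.close()  # Idempotent: a second close() does nothing.
```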
Env Rendering¶
The same rules for Rendering apply as for the other classes. Automatic calls to render() are usually handled by wrappers like HumanRendering or RenderCollection, and not by the environment itself.
Env Example¶
A typical execution loop for environments might look like this:
from gymnasium import Env
from gymnasium.spaces import Box
from numpy import clip

from cernml import coi

policy = get_policy()
num_episodes = get_num_episodes()

# Limit steps per episode to prevent infinite loops.
env = coi.make("MyEnv-v0", max_episode_steps=10)
assert isinstance(env, Env)
with env:
    ac_space = env.action_space
    assert isinstance(ac_space, Box)

    for _ in range(num_episodes):
        terminated = truncated = False
        obs, info = env.reset()
        while not (terminated or truncated):
            action = policy.predict(obs)
            action = clip(action, ac_space.low, ac_space.high)
            obs, reward, terminated, truncated, info = env.step(action)