Optimal Control Theory
& Dynamic Programming
- Types of solutions for a control system
- Closed form: y=f(x)
- Numerical: f(x)= \int_{-\infty}^x g(t) dt
- Simulation: the system state satisfies a
set of differential equations; we need to run (simulate) the system to
obtain the trajectory of the state evolution.
- A (sequential) optimal control problem is
defined as
- find a sequence of controls {u_t} such
that the objective J = \psi(x_T) + \int_0^T l(x_t, u_t, w_t) dt (a terminal
cost plus a running cost) is optimal (maximized/minimized), for the system
characterized by dx_t/dt = f(x_t, u_t, w_t), where x_t is the state vector
and w_t is the system noise (disturbance) vector.
- (Separation theorem for linear systems and
quadratic cost) The optimal controller with imperfect state information can
be decomposed into two parts:
- An estimator
- find a sequence of estimates { \hat{x}_t }
such that the expected error norm, E[ \int_0^T \| \hat{x}_t - x_t \|^2 dt ],
is minimized, where the expectation is taken over the joint
distribution of x_t and w_t.
- An actuator, u_t^* = L_t \hat{x}_t, where L_t is the gain of the
corresponding perfect-information LQR problem.
- Solutions to the deterministic optimal
control problem
- Variational approach (use calculus of
variations): solve a system of differential equations with
boundary conditions (a two-point boundary-value problem).
- Pontryagin maximum principle:
- The solution: introduce adjoint (costate)
vectors; derive the adjoint equation (an ODE); maximize the Hamiltonian
over the control.
- Nonlinear programming: steepest descent,
conjugate gradient, Newton's method
- Deterministic dynamic programming:
- continuous time: solve Hamilton-Jacobi-Bellman
(partial differential) equation (HJB equation)
- Bayesian approach assumes the
distributions of the system noises, measurement noises, and the initial
state x_0 are all known.
- For quadratic cost, the optimal estimate
of x_n is the conditional expectation E(x_n|z_{[0,n]}) given all the
observations z_{[0,n]} from slot 0 to n.
- Further, for jointly Gaussian random variables, the
conditional expectation is a linear function of the observations (a linear
estimator).
- Further, if the noise is white (i.e., the noises
are independent r.v.'s), we have a recursive structure like the Kalman
filter. The basic ideas are using sufficient statistics and
compressing the data to get innovations. Specifically, first use
the previous estimate \hat{x}_{t-1|t-1} to predict the current state
\hat{x}_{t|t-1} (simulate the original system, \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1});
then use the predicted current state \hat{x}_{t|t-1} to predict the observation
\hat{y}_t; the innovation (prediction error) is e_t = y_t - \hat{y}_t;
scale the innovation by the gain k_t and use the resulting signal to
correct the prediction of the current state \hat{x}_{t|t-1}; we then
get the estimate of the current state, \hat{x}_{t|t} =
\hat{x}_{t|t-1} + k_t e_t. (A minimal numerical sketch follows below.)
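A minimal sketch of the predict/correct recursion just described, for a linear-Gaussian model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t with known noise covariances Q and R; the matrices and variable names are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def kalman_step(x_est, P, y, A, C, Q, R):
    """One predict/correct step of the Kalman filter.

    x_est : previous filtered state estimate (at time t-1)
    P     : previous error covariance
    y     : new observation y_t
    """
    # Predict: simulate the original system one step ahead.
    x_pred = A @ x_est                      # \hat{x}_{t|t-1}
    P_pred = A @ P @ A.T + Q                # predicted error covariance

    # Innovation: prediction error of the observation.
    e = y - C @ x_pred                      # e_t = y_t - \hat{y}_t
    S = C @ P_pred @ C.T + R                # innovation covariance

    # Gain and correction of the predicted state.
    K = P_pred @ C.T @ np.linalg.inv(S)     # k_t
    x_new = x_pred + K @ e                  # \hat{x}_{t|t}
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new
```

The covariance update in the last line is the Riccati recursion mentioned later under "Kalman filter".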
- The minimax approach vs. the H_{infinity}
- The common feature: both assume that
the distributions of the system noises, the measurement noises, and the
initial state x_0 are unknown.
- The difference between them is that the
cost function for the minimax approach is instantaneous, while the cost
function for H_{infinity} is integrated over time from the initial time to
the current time. So they are optimal with respect to different criteria.
- Risk-sensitive stochastic control
problem:
- the system is defined by an Ito
stochastic differential equation.
- (Minimizing entropy) The objective
function (performance index) is a log-moment-generating function of a
certain cost function (the sum of a terminal cost and an integral cost),
with a parameter \epsilon representing the sensitivity requirement.
Why use a log-moment-generating function rather than a simple expectation
over the joint pdf? Because the log-moment-generating function can
incorporate the sensitivity index.
- When \epsilon = 0, the risk-sensitive stochastic controller
reduces to the H_{infinity} controller.
- Robust control
- Goal: design controllers that yield
acceptable performance not just for a single plant under known inputs, but
rather for a family of plants under various types of inputs and
disturbances.
- Approaches:
- the sensitivity approach:
allowing small perturbations around an adopted nominal model.
- the linear-quadratic-Gaussian (LQG)
design: ascribing some statistical description
(specifically, Gaussian statistics) to the disturbances or the
unknown inputs.
- dynamic (differential) game theory or
minimax controller design: design a controller that
minimizes a given performance index under worst possible
disturbances or parameter variations.
- H-infinity optimization:
minimize the maximum norm (i.e., H-infinity norm) of an input-output
operator, where the maximum is taken over the
unknowns.
- (Frequency-domain) Operator
theory, approximation theory, spectral factorization, and (Youla)
parameterization
- (Time-domain) generalized Riccati
equation
Dynamic programming
- Dynamic programming (DP):
- action vs. strategy: an action (decision)
refers to a specific control (value); a strategy refers to a decision
rule, which maps the state set to the set of actions.
Dynamic programming provides a sequence of strategies (functions)
instead of a sequence of actions.
- Principle of optimality
- Rough idea (truncation preserves
optimality): for the shortest path between two points A and B, any truncated
path of the shortest path, taking A or B as one of its end nodes, is itself
the shortest path between its two end nodes.
- DP rests on this principle.
- Sufficient & necessary condition for using
DP:
- The cost-to-go function is decomposable; that
is, the cost-to-go function over N variables (a sequence of control/unknown
variables) can be represented as the sum of the costs of the individual variables,
so that the cost over N variables can be computed recursively by adding the cost
of one variable at a time;
- If the cost-to-go function is not decomposable
(i.e., it does not have an additive structure w.r.t. each variable), the principle of
optimality (on which DP rests) does not hold.
- and the number of possible states is finite.
- DP is a computational algorithm for solving the
optimization problem (e.g., minimizing total cost), so the number of possible
states has to be finite for the problem to be computable.
- Forward vs. backward DP algorithm
- Forward DP algorithm iteratively computes
cost-to-arrive function.
- Backward DP algorithm iteratively computes
cost-to-go function.
- Dynamic programming vs. Q learning in
machine learning
- Dynamic programming assumes both the cost
function and the system (difference/differential) equation are known.
- Q learning is a learning algorithm used
in an unknown environment (both the cost function and the system
equation are unknown).
- It estimates the (limiting)
discounted cost, which is time-invariant. Due to
the time invariance of the limiting discounted cost, there is no
need to know the system evolution equation.
- The basic idea
- use Bellman's equation for
infinite horizon to compute the Q function.
- (a probabilistic approach to
estimation) use the Monte Carlo method (many, in principle infinitely
many, samples) to obtain an estimate of the Q function; see the
sketch after this block.
- Limitation of Q learning:
- Q learning is only suitable for
the problems where the discounted cost is time-invariant
(theoretically, only infinite horizon problems satisfy
this). If the discounted cost is time varying,
knowledge of system equation and cost function is required and
dynamic programming comes into play. In this case, Q
learning is not applicable since one measurement at slot k is
far from enough to estimate the discounted cost of all states
and all actions at slot k.
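A minimal tabular Q-learning sketch for an infinite-horizon discounted cost. It only needs a black-box simulator of the system; the simulator interface step(s, a), the state/action counts, and the step-size and exploration parameters are illustrative assumptions.

```python
import numpy as np

def q_learning(step, n_states, n_actions, gamma=0.95, alpha=0.1,
               epsilon=0.1, n_iters=100_000, seed=0):
    """Estimate the optimal Q function without knowing the system equation
    or the cost function: step(s, a) -> (next_state, cost) is a black box."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_iters):
        # epsilon-greedy action selection (we are minimizing cost)
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmin(Q[s]))
        s_next, cost = step(s, a)
        # One-sample version of Bellman's equation for the discounted cost
        target = cost + gamma * np.min(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```

The time-invariance of the infinite-horizon discounted cost is what lets a single table Q, updated one sample at a time, converge to the optimal Q function.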
- Deterministic vs. stochastic DP
- Certainty Equivalence principle: use the
expectations to replace the random variables in the stochastic
problem. Then the stochastic DP reduces to deterministic
DP.
- Differences between deterministic and stochastic
DP
- In backward stochastic DP, for a given
current state x_k, each action can result in several state transitions
from x_k to x_{k+1} with certain probabilities; we need to compute the
average over all the state transitions. In contrast, in
backward deterministic DP, each action only results in one state
transition.
- The condition for the Certainty Equivalence
principle: it holds, for example, for linear systems with quadratic cost
(the LQG setting of the separation theorem above).
- Deterministic DP is for deterministic
optimal control.
- Stochastic DP is for stochastic optimal
control, also known as a Markov decision process or Markov decision chain.
- This is because the current state and control
(x_k, u_k) determine the distribution of the next state, so the state
sequence {x_k} evolves as a controlled Markov chain.
- Complexity of DP
- Time complexity of stochastic DP is
O( m* n^2 * N) for m controls, n states and N
stages.
- The reason is as follows.
There are n states and m controls. So there are O(m*n)
operations for each state-time pair. Since there are n*N
state-time pairs, the total number of operations is O( m* n^2 * N).
- Time complexity of deterministic DP
is O( n^2 * N).
- The reason is as follows.
There are n states and each state is connected to n
previous states. So there are O(n^2) operations at each
stage. Since there are N stages, the total number of
operations is O( n^2 * N).
- For the Viterbi algorithm (VA), the time
complexity is O( n^2 * N). In addition, the
storage complexity is O( n * N). This is because, at each of
the N stages, one value (and the surviving predecessor) is kept for each
of the n states.
- Storage complexity of deterministic DP is
O( n * N). This is because each of the n states has
an optimal child at each stage; these parent-child relations have to be
remembered; there are N stages, resulting in
O( n * N) parent-child relations.
- For the Viterbi algorithm, the storage
complexity is also O( n * N).
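A backward-DP sketch in the Viterbi style over an n-state, N-stage trellis, illustrating the counts above: O(n^2) work per stage (hence O(n^2 * N) time) and an O(n * N) table of optimal successors. The transition-cost and terminal-cost arrays are illustrative inputs.

```python
import numpy as np

def backward_dp(stage_cost, terminal_cost):
    """Backward DP / Viterbi-style recursion.

    stage_cost    : shape (N, n, n); stage_cost[k, i, j] is the cost of
                    moving from state i at stage k to state j at stage k+1.
    terminal_cost : shape (n,).
    Returns the cost-to-go table J (shape (N+1, n)) and the optimal
    successor table succ (shape (N, n)) -- the O(n*N) storage.
    """
    N, n, _ = stage_cost.shape
    J = np.empty((N + 1, n))
    succ = np.empty((N, n), dtype=int)
    J[N] = terminal_cost
    for k in range(N - 1, -1, -1):           # N stages ...
        q = stage_cost[k] + J[k + 1]         # ... each stage costs O(n^2) operations
        succ[k] = np.argmin(q, axis=1)       # best next state from each state
        J[k] = q[np.arange(n), succ[k]]      # optimal cost-to-go
    return J, succ

def optimal_path(succ, x0):
    """Trace the optimal state trajectory from the initial state x0."""
    path = [x0]
    for k in range(succ.shape[0]):
        path.append(int(succ[k, path[-1]]))
    return path
```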
Classification of optimal control problems:
- Deterministic optimal control
  - Discrete state: discrete time; continuous time
  - Continuous state: discrete time; continuous time
- Stochastic optimal control
  - Discrete state
    - Discrete time: perfect state information / imperfect state information
    - Continuous time: perfect state information / imperfect state information
  - Continuous state
    - Discrete time: perfect state information / imperfect state information
    - Continuous time: perfect state information / imperfect state information
In a deterministic optimal control problem, a
state equation of the dynamical system is given as x_{k+1}=f_k(x_k,u_k) for a
discrete time system where x_k is the state variable and u_k is the control
variable, and dx(t)/dt=f(x_t,u_t) for a continuous time system. There is
no observation/measurement equation.
In a stochastic optimal control problem, a state
equation of the dynamical system is given as x_{k+1}=f_k(x_k,u_k, w_k) for a
discrete time system where w_k is random system disturbance/noise, and dx(t)/dt=f(x_t,u_t,w_t)
for a continuous time system. The observation/measurement equation is
given as y_k=h_k(x_k) if there is no measurement noise; otherwise, y_k=h_k(x_k,
v_k), where v_k is random measurement error/noise.
In a stochastic optimal control problem, perfect
state information means that states are perfectly observable, e.g., y_k=h_k(x_k),
where h_k is (deterministic) one-to-one correspondence; imperfect state
information means that states are partially observable, e.g., y_k=h_k(x_k, v_k),
where v_k is random (like a measurement error).
For a deterministic optimal control problem,
feedback control does not help (there is no need to use feedback control); i.e.,
optimal feed-forward control and optimal feedback control yield the same
solution. The optimal control trajectory and the optimal state trajectory can
be computed a priori.
For a stochastic optimal control problem, there
are two optimization criteria: expected discounted cost and expected average
cost. The expectation is taken over the random disturbances (and hence over the
resulting state trajectories).
Either criterion can be paired with either horizon:
- Finite horizon: expected discounted cost; expected average cost
- Infinite horizon: expected discounted cost; expected average cost
For a stochastic optimal control problem, feedback
control is used, since feedback control can achieve better performance than
feed-forward control. The optimal control law is a sequence of functions
{\mu_k(x_k)}, where {\mu_k} are functions indexed by time k, and \mu_k
takes the state at time k, x_k, as its argument. We do not know a
priori what control action will be taken at time k, since typically we assume a
causal system (we do not know/estimate x_k until time k). Only when
x_k is known or estimated can a control action be taken.
Solutions to the above optimal control problem:
12 cases:
- Deterministic optimal control
- Continuous state
- Variational methods: usually more
efficient than DP; the variational methods are based on the Pontryagin
maximum/minimum principle (depending on whether the cost is to be
maximized or minimized).
- Continuous time:
- solve a set of equations, including
an adjoint equation, and maximize the Hamiltonian function over the control
- Discrete time:
- solve a set of equations, including
an adjoint equation, and set the Hamiltonian's derivative with respect to
the control to zero.
- Steepest descent, conjugate gradient,
Newton's method (nonlinear programming): usually more efficient than DP
- Dynamic programming: more general than
variational methods and iterative optimal control algorithms using
nonlinear programming
- The principle of optimality: from any
point on an optimal trajectory, the remaining trajectory is optimal for
the corresponding problem initiated at that point.
- Continuous time: solve Hamilton-Jacobi-Bellman
(HJB) equation to obtain the optimal control.
- Discrete time: iteratively compute
optimal cost-to-go function (return function) and associated optimal
control
- Discrete state: only DP is applicable;
variational methods and nonlinear programming require states to be
continuous so that derivatives can be taken.
- Discrete time: shortest path problem;
forward and backward DP algorithms solve it (see the sketch below).
- Continuous time: no such problem?
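A forward-DP counterpart to the backward sketch given earlier, computing the cost-to-arrive function for the discrete-time, discrete-state (shortest path) case; the trellis convention and the arc-cost array are illustrative assumptions.

```python
import numpy as np

def forward_dp(stage_cost, initial_cost):
    """Forward DP: D[k, j] is the cheapest cost of arriving at state j
    at stage k, starting from stage 0.

    stage_cost   : shape (N, n, n); stage_cost[k, i, j] = arc cost i -> j.
    initial_cost : shape (n,); cost of starting in each state.
    """
    N, n, _ = stage_cost.shape
    D = np.empty((N + 1, n))
    D[0] = initial_cost
    for k in range(N):
        # Minimize over the predecessor state i for every state j at stage k+1.
        D[k + 1] = np.min(D[k][:, None] + stage_cost[k], axis=0)
    return D
```

The forward recursion computes the cost-to-arrive function and the backward recursion the cost-to-go function; for a deterministic shortest path problem both yield the same optimal cost.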
- Stochastic optimal control
- Stochastic shortest path problem:
- Each stage k is assigned a cost function
g_k(x_k,u_k, w_k), where w_k is a random disturbance in the state equation.
Hence, the cost g_k(x_k,u_k, w_k) is also random.
- This is different from the deterministic
shortest path problem, where each stage has a deterministic cost function
g_k(x_k,u_k).
- The criterion of expected total cost for an
N-stage finite horizon is to optimize E[ \sum_k g_k(x_k, u_k, w_k) ], where the
expectation is over {w_k}, k = 1, ..., N.
- Use the DP algorithm to solve it (a sketch follows after this block).
- DP algorithms are all offline. To
achieve on-line computation, we need to use sub-optimal algorithms, which can
adapt to new problem data.
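A finite-horizon stochastic DP sketch for the case where the disturbance takes finitely many values, so that each control induces a known transition probability matrix and the backward recursion averages over the possible transitions. The transition tensor, costs, and sizes are illustrative assumptions.

```python
import numpy as np

def stochastic_dp(P, g, g_terminal):
    """Backward stochastic DP (finite horizon, finite state/control sets).

    P          : shape (N, m, n, n); P[k, u, i, j] = Prob(x_{k+1}=j | x_k=i, u_k=u).
    g          : shape (N, m, n);    g[k, u, i]    = expected stage cost at (k, i, u).
    g_terminal : shape (n,) terminal cost.
    Returns the cost-to-go table J (N+1, n) and the optimal policy mu (N, n).
    """
    N, m, n, _ = P.shape
    J = np.empty((N + 1, n))
    mu = np.empty((N, n), dtype=int)
    J[N] = g_terminal
    for k in range(N - 1, -1, -1):
        # Stage cost plus the cost-to-go averaged over all state transitions.
        Q = g[k] + P[k] @ J[k + 1]        # shape (m, n)
        mu[k] = np.argmin(Q, axis=0)      # best control for each state
        J[k] = Q[mu[k], np.arange(n)]
    return J, mu
```

The only difference from the deterministic backward recursion is the averaging P[k] @ J[k+1] over the random state transitions, which is what makes the work per stage O(m * n^2) and the total O(m * n^2 * N).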
Dynamic programming, Markov Decision Process (MDP),
Partially Observable MDP (POMDP), Markov Control Process (MCP), Controlled
Markov Process (CMP), Markov Control Model (MCM)
Deterministic dynamic programming
Stochastic dynamic programming
Risk-sensitive (entropy-minimizing) stochastic
control: the cost is of an exponential form, using a sensitivity parameter to
characterize sensitivity; risk-neutral stochastic control: the cost is of
an additive form.
MDP, semi-Markov (SMDP), POMDP,
LQR problem: DARE (discrete-time algebraic Riccati equation), CARE (continuous-time algebraic Riccati equation).
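A hedged sketch of solving the DARE numerically for a discrete-time LQR gain; the system matrices are illustrative, and scipy.linalg.solve_discrete_are is assumed to be available.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative double-integrator system and quadratic weights.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state cost weight
R = np.array([[1.0]])    # control cost weight

# Solve the discrete-time algebraic Riccati equation (DARE).
P = solve_discrete_are(A, B, Q, R)

# Steady-state LQR feedback gain: u_k = -K x_k.
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```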
Kalman filter: there is no control in the state
equation. x_{k+1}=A*x_k+ B*w_k and y_k=C*x_k+v_k.
Use the Riccati recursion to update the covariance matrix of the state-estimation
error.
LQR with imperfect state information
Other optimization criteria:
- H^{\infty}: sup-norm, min-max, dynamic game,
time domain/frequency domain solution
- H^2: L^2 norm, LQR
To find a lower bound for a constrained
maximization problem, tighten the constraints (work on sufficient conditions);
to find an upper bound, loosen the constraints (work on the necessary
conditions). For a minimization problem, the roles are reversed.
Model Predictive Control (MPC):
Model Predictive Control (MPC) has been widely adopted in industry as an effective means to deal with large multivariable
constrained control problems. In MPC the control action is chosen by solving an
optimal control problem on line. The optimization aims at minimizing a performance criterion over a future
(small) horizon, possibly subject to constraints on the manipulated inputs and
outputs.
MPC is different from conventional optimal control in the following
aspects:
- It has constraints that reflect
practical situations (we cannot have arbitrarily large inputs and outputs in
practice), whereas conventional optimal control problems do not have
constraints, i.e., the sets of controls and of outputs are the whole
Euclidean spaces.
- The complexity of dynamic programming is very
high if the horizon is large, which makes it impractical as
an on-line algorithm. In MPC, a small horizon is used for the
sake of real-time implementation.
Although MPC has long been recognized as the preferred alternative for constrained systems, its applicability has been limited to slow
systems such as chemical processes, where large sampling times make it possible to solve large optimization problems each time new
measurements are collected from the plant.
Alternatively, the optimization problem can be solved off line for all the expected measurement values through multiparametric solvers.
The resulting feedback controller inherits all the stability and performance properties of MPC, and turns out to be piecewise linear. The
on-line computation is reduced to a simple linear function evaluation. Therefore, the new technique is expected to enlarge the scope of
applicability of MPC to applications with fast dynamics and sampling rates.
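A minimal receding-horizon MPC sketch for a linear system with input constraints, solving the finite-horizon problem as a convex program at each sample. CVXPY is used here purely as an illustrative solver interface; the model, horizon, weights, and bounds are all assumptions.

```python
import numpy as np
import cvxpy as cp

def mpc_control(A, B, Q, R, x0, horizon=10, u_max=1.0):
    """Solve one receding-horizon problem and return the first input."""
    n, m = B.shape
    x = cp.Variable((n, horizon + 1))
    u = cp.Variable((m, horizon))
    cost = 0
    constraints = [x[:, 0] == x0]
    for k in range(horizon):
        cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
        constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],  # model
                        cp.abs(u[:, k]) <= u_max]                  # input constraint
    cost += cp.quad_form(x[:, horizon], Q)                         # terminal cost
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[:, 0]

# In closed loop, only the first control is applied; at the next sample the
# horizon is shifted forward and the problem is re-solved with the new state.
```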