Optimal Control Theory
& Dynamic Programming
- Types of solutions for a control system
- Closed form: y=f(x)
- Numerical: f(x)= \int_{-\infty}^x g(t) dt
- Simulation: the system state satisfies a
set of differential equations; we need to run (simulate) the system to
obtain the trajectory of the state evolution.
- A (sequential) optimal control problem is
defined as
- find a sequence of controls {u_t} such
that the objective J = \psi(x_T) + \int_0^T l(x_t, u_t, w_t) dt (a terminal
cost plus a running cost) is optimal (maximized/minimized), for the system
characterized by dx_t/dt = f(x_t, u_t, w_t), where x_t is the state vector
and w_t is the system noise (disturbance) vector.
- (Separation theorem for linear systems and
quadratic cost) The optimal controller with imperfect state information can
be decomposed into two parts:
- An estimator
- find a sequence of estimates { \hat{x}_t }
such that the expected error norm, E[ \int_0^T \| \hat{x}_t - x_t \|^2 dt ],
is minimized, where the expectation is taken over the joint
distribution of x_t and w_t.
- An actuator, u_t^* = L_t \hat{x}_t, where L_t is the gain of the
corresponding perfect-information LQR problem.
- Solutions to the deterministic optimal
control problem
- Variational approach (use calculus of
variations): solve a system of differential equations with
boundary conditions (a two-point boundary-value problem).
- Pontryagin maximum principle:
- The solution: introduce adjoint (costate)
vectors; derive the adjoint equation (an ODE); maximize the Hamiltonian
over the control.
- Nonlinear programming: steepest descent,
conjugate gradient, Newton's method
- Deterministic dynamic programming:
- continuous time: solve Hamilton-Jacobi-Bellman
(partial differential) equation (HJB equation)
- Bayesian approach assumes the
distributions of the system noises, measurement noises, and the initial
state x_0 are all known.
- For quadratic cost, the optimal estimate
of x_n is the conditional expectation E(x_n|z_{[0,n]}) given all the
observations z_{[0,n]} from slot 0 to n.
- Further, for jointly Gaussian random variables, the
conditional expectation is a linear function of the observations (a linear
estimator).
- Further, if the noise is white (i.e., the noises
are independent r.v.'s), we have a recursive structure like the Kalman
filter. The basic ideas are using sufficient statistics and
compressing the data to get innovations. Specifically, first use
the previous estimate \hat{x}_{t-1|t-1} to predict the current state
\hat{x}_{t|t-1} (simulate the original system, \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1});
then use the predicted current state \hat{x}_{t|t-1} to predict the observation
\hat{y}_t; the innovation (prediction error) is e_t = y_t - \hat{y}_t;
scale the innovation by the gain k_t and use the resulting signal to
correct the prediction of the current state \hat{x}_{t|t-1}; we then
get the estimate of the current state, \hat{x}_{t|t} =
\hat{x}_{t|t-1} + k_t e_t. (A minimal numerical sketch follows below.)
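A minimal sketch of the predict/correct recursion just described, for a linear-Gaussian model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t with known noise covariances Q and R; the matrices and variable names are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def kalman_step(x_est, P, y, A, C, Q, R):
    """One predict/correct step of the Kalman filter.

    x_est : previous filtered state estimate (at time t-1)
    P     : previous error covariance
    y     : new observation y_t
    """
    # Predict: simulate the original system one step ahead.
    x_pred = A @ x_est                      # \hat{x}_{t|t-1}
    P_pred = A @ P @ A.T + Q                # predicted error covariance

    # Innovation: prediction error of the observation.
    e = y - C @ x_pred                      # e_t = y_t - \hat{y}_t
    S = C @ P_pred @ C.T + R                # innovation covariance

    # Gain and correction of the predicted state.
    K = P_pred @ C.T @ np.linalg.inv(S)     # k_t
    x_new = x_pred + K @ e                  # \hat{x}_{t|t}
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new
```

The covariance update in the last line is the Riccati recursion mentioned later under "Kalman filter".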
- The minimax approach vs. the H_{infinity}
- The common feature: both assume that
the distributions of the system noises, the measurement noises, and the
initial state x_0 are unknown.
- The difference between them is that the
cost function for the minimax approach is instantaneous, while the cost
function for H_{infinity} is integrated over time from the initial time to
the current time. So they are optimal with respect to different criteria.
- Risk-sensitive stochastic control
problem:
- the system is defined by an Ito
stochastic differential equation.
- (Minimizing entropy) The objective
function (performance index) is a log-moment-generating function of a
certain cost function (the sum of a terminal cost and an integral cost),
with a parameter \epsilon representing the sensitivity requirement.
Why use a log-moment-generating function rather than a simple expectation
over the joint pdf? Because the log-moment-generating function can
incorporate the sensitivity index.
- When \epsilon = 0, the risk-sensitive stochastic controller
reduces to the H_{infinity} controller.
- Robust control
- Goal: design controllers that yield
acceptable performance not just for a single plant under known inputs, but
rather for a family of plants under various types of inputs and
disturbances.
- Approaches:
- the sensitivity approach:
allowing small perturbations around an adopted nominal model.
- the linear-quadratic-Gaussian (LQG)
design: ascribing some statistical description
(specifically, Gaussian statistics) to the disturbances or the
unknown inputs.
- dynamic (differential) game theory or
minimax controller design: design a controller that
minimizes a given performance index under worst possible
disturbances or parameter variations.
- H-infinity optimization:
minimize the maximum norm (i.e., H-infinity norm) of an input-output
operator, where the maximum is taken over the
unknowns.
- (Frequency-domain) Operator
theory, approximation theory, spectral factorization, and (Youla)
parameterization
- (Time-domain) generalized Riccati
equation
Dynamic programming
- Dynamic programming (DP):
- action vs. strategy: an action (decision)
refers to a specific control (value); a strategy refers to a decision
rule, which maps the state set to the set of actions.
Dynamic programming provides a sequence of strategies (functions)
instead of a sequence of actions.
- Principle of optimality
- Rough idea (truncation preserves
optimality): for the shortest path between two points A and B, any truncated
path of the shortest path, taking A or B as one of its end nodes, is itself
the shortest path between its two end nodes.
- DP rests on this principle.
- Sufficient & necessary condition for using
DP:
- The cost-to-go function is decomposable; that
is, the cost-to-go function over N variables (a sequence of control/unknown
variables) can be represented as the sum of the costs of the individual variables,
so that the cost over N variables can be computed recursively by adding the cost
of one variable at a time;
- If the cost-to-go function is not decomposable
(i.e., it does not have an additive structure w.r.t. each variable), the principle of
optimality (on which DP rests) does not hold.
- and the number of possible states is finite.
- DP is a computational algorithm for solving the
optimization problem (e.g., minimizing total cost), so the number of possible
states has to be finite for the problem to be computable.
- Forward vs. backward DP algorithm
- Forward DP algorithm iteratively computes
cost-to-arrive function.
- Backward DP algorithm iteratively computes
cost-to-go function.
- Dynamic programming vs. Q learning in
machine learning
- Dynamic programming assumes both the cost
function and the system (difference/differential) equation are known.
- Q learning is a learning algorithm used
in an unknown environment (both the cost function and the system
equation are unknown).
- It estimates the (limiting)
discounted cost, which is time-invariant. Due to
the time invariance of the limiting discounted cost, there is no
need to know the system evolution equation.
- The basic idea
- use Bellman's equation for
infinite horizon to compute the Q function.
- (a probabilistic approach to
estimation) use the Monte Carlo method (many, in principle infinitely
many, samples) to obtain an estimate of the Q function; see the
sketch after this block.
- Limitation of Q learning:
- Q learning is only suitable for
the problems where the discounted cost is time-invariant
(theoretically, only infinite horizon problems satisfy
this). If the discounted cost is time varying,
knowledge of system equation and cost function is required and
dynamic programming comes into play. In this case, Q
learning is not applicable since one measurement at slot k is
far from enough to estimate the discounted cost of all states
and all actions at slot k.
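A minimal tabular Q-learning sketch for an infinite-horizon discounted cost. It only needs a black-box simulator of the system; the simulator interface step(s, a), the state/action counts, and the step-size and exploration parameters are illustrative assumptions.

```python
import numpy as np

def q_learning(step, n_states, n_actions, gamma=0.95, alpha=0.1,
               epsilon=0.1, n_iters=100_000, seed=0):
    """Estimate the optimal Q function without knowing the system equation
    or the cost function: step(s, a) -> (next_state, cost) is a black box."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_iters):
        # epsilon-greedy action selection (we are minimizing cost)
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmin(Q[s]))
        s_next, cost = step(s, a)
        # One-sample version of Bellman's equation for the discounted cost
        target = cost + gamma * np.min(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```

The time-invariance of the infinite-horizon discounted cost is what lets a single table Q, updated one sample at a time, converge to the optimal Q function.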
- Deterministic vs. stochastic DP
- Certainty Equivalence principle: use the
expectations to replace the random variables in the stochastic
problem. Then the stochastic DP reduces to deterministic
DP.
- Differences between deterministic and stochastic
DP
- In backward stochastic DP, for a given
current state x_k, each action can result in several state transitions
from x_k to x_{k+1} with certain probabilities; we need to compute the
average over all the state transitions. In contrast, in
backward deterministic DP, each action only results in one state
transition.
- The condition for the Certainty Equivalence
principle: it holds, for example, for linear systems with quadratic cost
(the LQG setting of the separation theorem above).
- Deterministic DP is for deterministic
optimal control.
- Stochastic DP is for stochastic optimal
control, also known as a Markov decision process or Markov decision chain.
- This is because the current state and control
(x_k, u_k) determine the distribution of the next state, so the state
sequence {x_k} evolves as a controlled Markov chain.
- Complexity of DP
- Time complexity of stochastic DP is
O( m* n^2 * N) for m controls, n states and N
stages.
- The reason is as follows.
There are n states and m controls. So there are O(m*n)
operations for each state-time pair. Since there are n*N
state-time pairs, the total number of operations is O( m* n^2 * N).
- Time complexity of deterministic DP
is O( n^2 * N).
- The reason is as follows.
There are n states and each state is connected to n
previous states. So there are O(n^2) operations at each
stage. Since there are N stages, the total number of
operations is O( n^2 * N).
- For the Viterbi algorithm (VA), the time
complexity is O( n^2 * N). In addition, the
storage complexity is O( n * N). This is because, at each of
the N stages, one value (and the surviving predecessor) is kept for each
of the n states.
- Storage complexity of deterministic DP is
O( n * N). This is because each of the n states has
an optimal child at each stage; these parent-child relations have to be
remembered; there are N stages, resulting in
O( n * N) parent-child relations.
- For the Viterbi algorithm, the storage
complexity is also O( n * N).
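A backward-DP sketch in the Viterbi style over an n-state, N-stage trellis, illustrating the counts above: O(n^2) work per stage (hence O(n^2 * N) time) and an O(n * N) table of optimal successors. The transition-cost and terminal-cost arrays are illustrative inputs.

```python
import numpy as np

def backward_dp(stage_cost, terminal_cost):
    """Backward DP / Viterbi-style recursion.

    stage_cost    : shape (N, n, n); stage_cost[k, i, j] is the cost of
                    moving from state i at stage k to state j at stage k+1.
    terminal_cost : shape (n,).
    Returns the cost-to-go table J (shape (N+1, n)) and the optimal
    successor table succ (shape (N, n)) -- the O(n*N) storage.
    """
    N, n, _ = stage_cost.shape
    J = np.empty((N + 1, n))
    succ = np.empty((N, n), dtype=int)
    J[N] = terminal_cost
    for k in range(N - 1, -1, -1):           # N stages ...
        q = stage_cost[k] + J[k + 1]         # ... each stage costs O(n^2) operations
        succ[k] = np.argmin(q, axis=1)       # best next state from each state
        J[k] = q[np.arange(n), succ[k]]      # optimal cost-to-go
    return J, succ

def optimal_path(succ, x0):
    """Trace the optimal state trajectory from the initial state x0."""
    path = [x0]
    for k in range(succ.shape[0]):
        path.append(int(succ[k, path[-1]]))
    return path
```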
Classification of optimal control problems:
- Deterministic optimal control
  - Discrete state: discrete time; continuous time
  - Continuous state: discrete time; continuous time
- Stochastic optimal control
  - Discrete state
    - Discrete time: perfect state information / imperfect state information
    - Continuous time: perfect state information / imperfect state information
  - Continuous state
    - Discrete time: perfect state information / imperfect state information
    - Continuous time: perfect state information / imperfect state information
In a deterministic optimal control problem, a
state equation of the dynamical system is given as x_{k+1}=f_k(x_k,u_k) for a
discrete time system where x_k is the state variable and u_k is the control
variable, and dx(t)/dt=f(x_t,u_t) for a continuous time system. There is
no observation/measurement equation.
In a stochastic optimal control problem, a state
equation of the dynamical system is given as x_{k+1}=f_k(x_k,u_k, w_k) for a
discrete time system where w_k is random system disturbance/noise, and dx(t)/dt=f(x_t,u_t,w_t)
for a continuous time system. The observation/measurement equation is
given as y_k=h_k(x_k) if there is no measurement noise; otherwise, y_k=h_k(x_k,
v_k), where v_k is random measurement error/noise.
In a stochastic optimal control problem, perfect
state information means that states are perfectly observable, e.g., y_k=h_k(x_k),
where h_k is (deterministic) one-to-one correspondence; imperfect state
information means that states are partially observable, e.g., y_k=h_k(x_k, v_k),
where v_k is random (like a measurement error).
For a deterministic optimal control problem,
feedback control does not help (there is no need to use feedback control); i.e.,
optimal feed-forward control and optimal feedback control yield the same
solution. The optimal control trajectory and the optimal state trajectory can
be computed a priori.
For a stochastic optimal control problem, there
are two optimization criteria: expected discounted cost and expected average
cost. The expectation is taken over the random disturbances (and hence over the
resulting state trajectories).
Either criterion can be paired with either horizon:
- Finite horizon: expected discounted cost; expected average cost
- Infinite horizon: expected discounted cost; expected average cost
For a stochastic optimal control problem, feedback
control is used, since feedback control can achieve better performance than
feed-forward control. The optimal control law is a sequence of functions
{\mu_k(x_k)}, where {\mu_k} are functions indexed by time k, and \mu_k
takes the state at time k, x_k, as its argument. We do not know a
priori what control action will be taken at time k, since typically we assume a
causal system (we do not know/estimate x_k until time k). Only when
x_k is known or estimated can a control action be taken.
Solutions to the above optimal control problem:
12 cases:
- Deterministic optimal control
- Continuous state
- Variational methods: usually more
efficient than DP; the variational methods are based on the Pontryagin
maximum/minimum principle (depending on whether the cost is to be
maximized or minimized).
- Continuous time:
- solve a set of equations, including
an adjoint equation, and maximize the Hamiltonian function over the control
- Discrete time:
- solve a set of equations, including
an adjoint equation, and set the Hamiltonian's derivative with respect to
the control to zero.
- Steepest descent, conjugate gradient,
Newton's method (nonlinear programming): usually more efficient than DP
- Dynamic programming: more general than
variational methods and iterative optimal control algorithms using
nonlinear programming
- The principle of optimality: from any
point on an optimal trajectory, the remaining trajectory is optimal for
the corresponding problem initiated at that point.
- Continuous time: solve Hamilton-Jacobi-Bellman
(HJB) equation to obtain the optimal control.
- Discrete time: iteratively compute
optimal cost-to-go function (return function) and associated optimal
control
- Discrete state: only DP is applicable;
variational methods and nonlinear programming require states to be
continuous so that derivatives can be taken.
- Discrete time: shortest path problem;
forward and backward DP algorithms solve it (see the sketch below).
- Continuous time: no such problem?
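A forward-DP counterpart to the backward sketch given earlier, computing the cost-to-arrive function for the discrete-time, discrete-state (shortest path) case; the trellis convention and the arc-cost array are illustrative assumptions.

```python
import numpy as np

def forward_dp(stage_cost, initial_cost):
    """Forward DP: D[k, j] is the cheapest cost of arriving at state j
    at stage k, starting from stage 0.

    stage_cost   : shape (N, n, n); stage_cost[k, i, j] = arc cost i -> j.
    initial_cost : shape (n,); cost of starting in each state.
    """
    N, n, _ = stage_cost.shape
    D = np.empty((N + 1, n))
    D[0] = initial_cost
    for k in range(N):
        # Minimize over the predecessor state i for every state j at stage k+1.
        D[k + 1] = np.min(D[k][:, None] + stage_cost[k], axis=0)
    return D
```

The forward recursion computes the cost-to-arrive function and the backward recursion the cost-to-go function; for a deterministic shortest path problem both yield the same optimal cost.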
- Stochastic optimal control
- Stochastic shortest path problem:
- Each stage k is assigned a cost function
g_k(x_k,u_k, w_k), where w_k is a random disturbance in the state equation.
Hence, the cost g_k(x_k,u_k, w_k) is also random.
- This is different from the deterministic
shortest path problem, where each stage has a deterministic cost function
g_k(x_k,u_k).
- The criterion of expected total cost for an
N-stage finite horizon is to optimize E[ \sum_k g_k(x_k, u_k, w_k) ], where the
expectation is over {w_k}, k = 1, ..., N.
- Use the DP algorithm to solve it (a sketch follows after this block).
- DP algorithms are all offline. To
achieve on-line computation, we need to use sub-optimal algorithms, which can
adapt to new problem data.
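A finite-horizon stochastic DP sketch for the case where the disturbance takes finitely many values, so that each control induces a known transition probability matrix and the backward recursion averages over the possible transitions. The transition tensor, costs, and sizes are illustrative assumptions.

```python
import numpy as np

def stochastic_dp(P, g, g_terminal):
    """Backward stochastic DP (finite horizon, finite state/control sets).

    P          : shape (N, m, n, n); P[k, u, i, j] = Prob(x_{k+1}=j | x_k=i, u_k=u).
    g          : shape (N, m, n);    g[k, u, i]    = expected stage cost at (k, i, u).
    g_terminal : shape (n,) terminal cost.
    Returns the cost-to-go table J (N+1, n) and the optimal policy mu (N, n).
    """
    N, m, n, _ = P.shape
    J = np.empty((N + 1, n))
    mu = np.empty((N, n), dtype=int)
    J[N] = g_terminal
    for k in range(N - 1, -1, -1):
        # Stage cost plus the cost-to-go averaged over all state transitions.
        Q = g[k] + P[k] @ J[k + 1]        # shape (m, n)
        mu[k] = np.argmin(Q, axis=0)      # best control for each state
        J[k] = Q[mu[k], np.arange(n)]
    return J, mu
```

The only difference from the deterministic backward recursion is the averaging P[k] @ J[k+1] over the random state transitions, which is what makes the work per stage O(m * n^2) and the total O(m * n^2 * N).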
Dynamic programming, Markov Decision Process (MDP),
Partially Observable MDP (POMDP), Markov Control Process (MCP), Controlled
Markov Process (CMP), Markov Control Model (MCM)
Deterministic dynamic programming
Stochastic dynamic programming
Risk-sensitive (entropy-minimizing) stochastic
control: the cost is of an exponential form, using a sensitivity parameter to
characterize sensitivity; risk-neutral stochastic control: the cost is of
an additive form.
MDP, semi-Markov (SMDP), POMDP,
LQR problem: DARE (discrete-time algebraic Riccati equation), CARE (continuous-time algebraic Riccati equation).
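A hedged sketch of solving the DARE numerically for a discrete-time LQR gain; the system matrices are illustrative, and scipy.linalg.solve_discrete_are is assumed to be available.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative double-integrator system and quadratic weights.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state cost weight
R = np.array([[1.0]])    # control cost weight

# Solve the discrete-time algebraic Riccati equation (DARE).
P = solve_discrete_are(A, B, Q, R)

# Steady-state LQR feedback gain: u_k = -K x_k.
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```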
Kalman filter: there is no control in the state
equation. x_{k+1}=A*x_k+ B*w_k and y_k=C*x_k+v_k.
Use the Riccati recursion to update the covariance matrix of the state-estimation
error.
LQR with imperfect state information
Other optimization criteria:
- H^{\infty}: sup-norm, min-max, dynamic game,
time domain/frequency domain solution
- H^2: L^2 norm, LQR
To find a lower bound for a constrained
maximization problem, tighten the constraints (work on sufficient conditions);
to find an upper bound, loosen the constraints (work on the necessary
conditions). For a minimization problem, the roles are reversed.
Model Predictive Control (MPC):
Model Predictive Control (MPC) has been widely adopted in industry as an effective means to deal with large multivariable
constrained control problems. In MPC the control action is chosen by solving an
optimal control problem on line. The optimization aims at minimizing a performance criterion over a future
(small) horizon, possibly subject to constraints on the manipulated inputs and
outputs.
MPC is different from conventional optimal control in the following
aspects:
- It has constraints that reflect
practical situations (we cannot have arbitrarily large inputs and outputs in
practice), whereas conventional optimal control problems do not have
constraints, i.e., the sets of controls and of outputs are the whole
Euclidean spaces.
- The complexity of dynamic programming is very
high if the horizon is large, which makes it impractical as
an on-line algorithm. In MPC, a small horizon is used for the
sake of real-time implementation.
Although MPC has long been recognized as the preferred alternative for constrained systems, its applicability has been limited to slow
systems such as chemical processes, where large sampling times make it possible to solve large optimization problems each time new
measurements are collected from the plant.
Alternatively, the optimization problem can be solved off line for all the expected measurement values through multiparametric solvers.
The resulting feedback controller inherits all the stability and performance properties of MPC, and turns out to be piecewise linear. The
on-line computation is reduced to a simple linear function evaluation. Therefore, the new technique is expected to enlarge the scope of
applicability of MPC to applications with fast dynamics and sampling rates.
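A minimal receding-horizon MPC sketch for a linear system with input constraints, solving the finite-horizon problem as a convex program at each sample. CVXPY is used here purely as an illustrative solver interface; the model, horizon, weights, and bounds are all assumptions.

```python
import numpy as np
import cvxpy as cp

def mpc_control(A, B, Q, R, x0, horizon=10, u_max=1.0):
    """Solve one receding-horizon problem and return the first input."""
    n, m = B.shape
    x = cp.Variable((n, horizon + 1))
    u = cp.Variable((m, horizon))
    cost = 0
    constraints = [x[:, 0] == x0]
    for k in range(horizon):
        cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
        constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],  # model
                        cp.abs(u[:, k]) <= u_max]                  # input constraint
    cost += cp.quad_form(x[:, horizon], Q)                         # terminal cost
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[:, 0]

# In closed loop, only the first control is applied; at the next sample the
# horizon is shifted forward and the problem is re-solved with the new state.
```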