Statistical Signal Processing (Detection and Estimation)
The goal of hypothesis testing (detection) is to decide which
hypothesis (among n hypotheses, where n can be any integer greater than
1) is true, given some observation. A hypothesis is a statement about
a possible phenomenon and need not be random; the observation, however,
is a random function of the hypothesis.
For example, the null hypothesis H0 is that a person has lung cancer
while the alternative hypothesis H1 is that the person does not have
lung cancer. These two hypotheses are not random; we simply do not
know which hypothesis is true. But under each hypothesis, the
observation (e.g., the number of white blood cells) is random.
The goal of estimation is to extract a quantity
of interest from noisy data/signals. Estimation problems can be classified into the
following categories:
- Parameter/point estimation: a parameter is a point in an
n-dimensional vector space (e.g., Euclidean space, Hilbert space, or Krein
space), where n can be any positive integer. A parameter could be
a deterministic continuous quantity, a discrete random variable, or a continuous
random variable, but not a deterministic discrete quantity. If
the parameter is a discrete
random variable, the estimation problem becomes a hypothesis test. Note that
hypothesis testing is not a subset of parameter estimation, since a hypothesis need not be a random
variable.
- (Time-varying) signal estimation: the signal
could be discrete-time or continuous-time, and it could also be deterministic or
stochastic. A basic method for signal estimation is the Kalman filter. For
continuous-time signal estimation, Itô stochastic calculus is needed,
which is fairly advanced.
- Density
estimation: there are two approaches, i.e., parametric and
non-parametric. Parametric methods are the same as those in parameter
estimation. Non-parametric methods use various kernel functions (like
weighted windowing used in FIR filter design, spectral estimation,
non-linear image processing, etc.)
- Confidence interval estimation
Key techniques:
all detection/estimation problems are optimization problems with respect to a
certain criterion, as listed below. So
whenever someone talks about an optimal detector/estimator, don't forget to ask
what optimality criterion is used.
- Maximum likelihood criterion: (used for both
detection and estimation)
- LRT (Likelihood Ratio Test) for simple
hypothesis tests: the threshold is 1, that is, decide in favor of the
hypothesis with the larger likelihood. It can also be used for multiple hypothesis tests.
- GLRT (Generalized LRT) for composite
hypothesis tests: take the max/sup-norm of the likelihood function over
the parameter set as the generalized likelihood function.
- ML estimator for estimation: used when
the prior distribution is unknown or when the parameter to be
estimated is deterministic. An ML estimate may be biased, but it is
consistent (asymptotically unbiased) and asymptotically efficient
(asymptotically achieving the Cramer-Rao bound).
- Neyman-Pearson criterion: (used for detection
only)
- Maximize the detection probability
(called power) subject to a constraint on the false alarm probability
(called size). Generally, this is a constrained optimization
problem and can be solved by the Lagrange multiplier method.
- More importantly, the Neyman-Pearson
criterion converts a multi-objective optimization problem into a
single-objective (constrained) optimization problem. By changing the
value of the tolerable false alarm probability, one can trace out the receiver
operating characteristic (ROC) curve, which is the Pareto frontier (each
point on it is Pareto optimal).
- A typical solution to the Neyman-Pearson
problem (the above constrained optimization problem) is the LRT/GLRT:
determine the threshold from the tolerable false alarm probability, then
use this threshold in the LRT/GLRT; see the first sketch after this list.
- Insight: the Neyman-Pearson criterion
achieves a tradeoff between two contradictory requirements: detection
probability and false alarm probability. The larger the tolerable false
alarm probability, the larger the achievable detection probability.
- Fundamental implication: whenever there
are two contradictory design requirements, Neyman-Pearson criterion can
be used to formulate a constrained optimization problem, and the optimal solution
results in a best tradeoff between the two contradictory requirements.
- Why is there no Neyman-Pearson detector
for multiple hypothesis testing?
- Because testing among N
hypotheses incurs N*(N-1) types of errors, while the Neyman-Pearson detector
can only address two types of errors (false alarm and miss).
- Minimum variance unbiased (MVUB) estimation
criterion:
- Every efficient estimator is MVUB, but
the converse is not true. That is, an estimator may be MVUB
but not achieve the Cramer-Rao
bound.
- Fisher information matrix: the expectation of the
second derivative of the Shannon codeword length log(1/p(X|\theta)).
- I(\theta) = E[s(\theta,X)*s^T(\theta,X)] =
E[d^2 log(1/p(X|\theta))/d\theta^2], where the score s(\theta,X) = d log(p(X|\theta))/d\theta
and s^T is the transpose of s.
- Cramer-Rao bound: the estimation error
covariance matrix
C = E[(\hat{\theta}-\theta)*(\hat{\theta}-\theta)^T] >= I^{-1}(\theta), where
I^{-1}(\theta) is the inverse of I(\theta); see the second sketch after this list.
- Bayesian criterion: (used for both detection
and estimation)
- Used for estimating a random parameter
(not for estimating a deterministic parameter).
- Assume that the prior distribution is
known and the cost/risk function is given.
- Minimize the average
cost/risk. The cost itself is generally a random
variable (due to the randomness of the noise) and cannot be meaningfully
optimized directly, so we optimize its average instead. If the
cost is a discrete random variable, we have another option: the minimax
criterion.
- More importantly, the Bayesian criterion
converts a multi-objective optimization problem into a single-objective
(unconstrained) optimization problem using the weighted-sum approach. By
changing the values of the weights/costs, one can trace out the Pareto
frontier.
- If the cost function is quadratic, the
Bayes estimator, called the Minimum Mean Square Error (MMSE) estimator, is
the mean of the posterior PDF (i.e., the conditional mean of the
parameter given the observations).
- The nice thing about MMSE is that the
MMSE estimator is linear if the signal and observation are jointly
Gaussian. This has a similar flavor to quadratic
programming (in optimization theory), which also reduces to solving
linear equations, since the derivative/gradient of a quadratic cost
is linear.
- For non-Gaussian signal/noise, the MMSE
estimator is generally nonlinear. Due to the simplicity of linear
estimation, one often uses the linear MMSE (LMMSE) estimator instead, which is
sub-optimal compared to the MMSE estimator.
- If the cost function is uniform (hit-or-miss), the
Bayes estimator reduces to the MAP
(Maximum A Posteriori probability) estimator. A point at
which a density achieves its maximum value is termed a mode of the PDF,
so the MAP estimate is the mode of the posterior PDF. Note that the
reduction from Bayes to MAP requires the error threshold of the uniform cost
function to be very small.
- If the cost function is the absolute error, the
Bayes estimator, termed the Minimum Mean Absolute Error (MMAE) estimator, is
the median of the posterior PDF.
- In summary, the MMSE, MMAE, and MAP
estimates are the mean, median, and mode of the posterior PDF,
respectively. In general, Bayes estimates are features of
the posterior PDF. This is one reason why Bayesian
estimators are widely used in pattern recognition.
- For Gaussian posterior PDF, MMSE=MMAE=MAP.
- Minimax criterion: (used for both detection
and estimation)
- Game-theoretical approach: the minimax
detector/estimator is based on a zero-sum game between
two players, an experimentalist and Mother Nature. Mother
Nature chooses the least favorable prior PMF (probability mass
function); the minimax detector/estimator then reduces to the
Bayes detector/estimator with respect to this least favorable
prior. So the minimax detector/estimator can be regarded as a
special case of the Bayes detector/estimator.
- The minimax detector/estimator is applicable to
both discrete and continuous random variable detection/estimation. For a
continuous r.v., play the strategies with an arbitrary PDF. (In other
words, discretize the range of the r.v. to get a PMF for the resulting
discrete r.v.; the limit of this discrete r.v. is the continuous
r.v., so we can play with the PMF of the discrete r.v. in
place of the PDF of the continuous r.v.)
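A minimal Monte Carlo sketch (Python/NumPy, with made-up parameters) of the LRT and the Neyman-Pearson criterion for the simple Gaussian mean-shift test H0: x ~ N(0,1) vs. H1: x ~ N(1,1): thresholding the log-likelihood ratio at log(1) = 0 gives the ML rule, sweeping the threshold traces the ROC curve (the Pareto frontier of detection vs. false alarm probability), and choosing the threshold from a tolerable false alarm probability gives the Neyman-Pearson detector.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simple vs. simple test: H0: x ~ N(0, 1), H1: x ~ N(mu1, 1), n i.i.d. samples.
    mu1, n, trials = 1.0, 10, 20_000
    x0 = rng.normal(0.0, 1.0, size=(trials, n))   # data generated under H0
    x1 = rng.normal(mu1, 1.0, size=(trials, n))   # data generated under H1

    # Log-likelihood ratio log[p(x|H1)/p(x|H0)]; for a Gaussian mean shift it is
    # monotone in the sample mean, so thresholding the LLR = thresholding the mean.
    def llr(x):
        return np.sum(mu1 * x - 0.5 * mu1 ** 2, axis=1)

    l0, l1 = llr(x0), llr(x1)

    # Sweep the threshold to trace the ROC curve (P_D vs. P_FA).
    thresholds = np.linspace(l0.min(), l1.max(), 200)
    p_fa = np.array([(l0 > t).mean() for t in thresholds])   # false alarm (size)
    p_d = np.array([(l1 > t).mean() for t in thresholds])    # detection (power)
    print("a few ROC points (P_FA, P_D):", list(zip(np.round(p_fa[::50], 3), np.round(p_d[::50], 3))))

    # ML rule: threshold the LLR at log(1) = 0.
    print("ML rule    : P_FA=%.3f  P_D=%.3f" % ((l0 > 0).mean(), (l1 > 0).mean()))

    # Neyman-Pearson: pick the threshold that meets a tolerable false alarm rate.
    alpha = 0.05
    t_np = np.quantile(l0, 1 - alpha)   # empirical threshold for P_FA ~ alpha
    print("NP (5%% FA) : P_FA=%.3f  P_D=%.3f" % ((l0 > t_np).mean(), (l1 > t_np).mean()))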
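A second minimal sketch (Python/NumPy, made-up numbers) of the Fisher information and the Cramer-Rao bound: for x_i ~ N(\theta, \sigma^2) with known \sigma, the score is (x-\theta)/\sigma^2, the per-sample Fisher information is 1/\sigma^2, so the CRB for n samples is \sigma^2/n; the ML estimate (the sample mean) is efficient, so its empirical variance should match the bound.

    import numpy as np

    rng = np.random.default_rng(1)

    # x_i ~ N(theta, sigma^2), sigma known: score = (x - theta)/sigma^2,
    # per-sample Fisher information = 1/sigma^2, CRB for n samples = sigma^2/n.
    theta, sigma, n, trials = 2.0, 3.0, 50, 100_000

    x = rng.normal(theta, sigma, size=(trials, n))
    theta_ml = x.mean(axis=1)                     # ML estimate = sample mean

    print("empirical variance of ML estimate:", theta_ml.var())
    print("Cramer-Rao bound sigma^2/n       :", sigma ** 2 / n)   # the two should nearly agree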
Types of estimators
- Maximum likelihood estimator
- Bayesian estimator
- Moment estimator:
- set the sample moments equal to the ensemble
moments and solve the resulting equations, yielding estimates of the parameters of
interest; see the sketch after this list.
- Least square estimator:
- does not require prior distribution
- minimize the quadratic cost, for given
observations Y, over the parameter of interest \theta.
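A minimal sketch (Python/NumPy, made-up parameter values) of the moment estimator for a Gamma(k, theta) sample: the ensemble moments are mean = k*theta and variance = k*theta^2, so equating them to the sample moments and solving gives the estimates.

    import numpy as np

    rng = np.random.default_rng(2)

    # Method of moments for Gamma(k, theta): mean = k*theta, var = k*theta^2,
    # so theta_hat = var/mean and k_hat = mean^2/var.
    k_true, theta_true = 3.0, 2.0
    x = rng.gamma(shape=k_true, scale=theta_true, size=100_000)

    m1, v = x.mean(), x.var()          # sample mean and sample variance
    theta_hat = v / m1
    k_hat = m1 ** 2 / v
    print("k_hat=%.3f  theta_hat=%.3f" % (k_hat, theta_hat))   # should be close to (3, 2)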
Important concepts
- A statistic of an unknown fixed
parameter
- a real-valued function of the
measurements/samples (whose distribution depends on the parameter).
- a quantity (such as the median,
quartile deviation, etc.) calculated from the observed data.
- A sufficient statistic of an unknown
fixed parameter contains all the statistical information required to estimate
the parameter.
- A minimal sufficient statistic is a sufficient statistic that is a function
of every other sufficient statistic (i.e., it achieves the maximum possible data reduction).
Minimum Description Length (MDL) model
selection
If someone asks you whether a Markov chain is a reasonable model for
time series analysis, you should ask: compared to what? Without a
baseline solution, anything could be a reasonable starting point.
So the first step in modeling is to select a model; the second step
is to estimate the parameters of the model.
MDL can be used as a criterion for model
selection. It measures the universal code length of the data, given a
model. One can compare two different models based on their MDLs for the
data available and favor the one with the shorter code length. However, it is not
clear how to optimize over a model set (a functional space) with infinitely many models
based on MDL. Note that MDL is used for comparing given models in a
parameterized family, rather than for searching for the best model over the whole
functional space.
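A rough sketch (Python/NumPy, made-up data) of MDL-style model-order selection using the common two-part-code approximation MDL(p) ~= (N/2) log(residual variance) + (p/2) log N for AR(p) models fitted by least squares; the order with the shortest code length is favored.

    import numpy as np

    rng = np.random.default_rng(3)

    # Two-part-code approximation of MDL for an AR(p) model fitted by least squares:
    #   MDL(p) ~= (N/2) * log(residual variance) + (p/2) * log(N)
    # (Gaussian -log-likelihood up to constants, plus the parameter code length).
    N = 2000
    x = np.zeros(N)
    for t in range(2, N):                  # simulate an AR(2) process as test data
        x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

    def mdl_ar(x, p):
        X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
        y = x[p:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)   # LS fit of the AR coefficients
        sigma2 = np.mean((y - X @ a) ** 2)          # residual (innovation) variance
        m = len(y)
        return 0.5 * m * np.log(sigma2) + 0.5 * p * np.log(m)

    for p in (1, 2, 3, 4, 5):
        print("AR(%d): MDL = %.1f" % (p, mdl_ar(x, p)))   # expect the minimum near p = 2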
The Kalman filter requires 1) a linear system
equation and 2) Gaussian noise. The linearity of the system equation
preserves the Gaussianity of the state process (a Gauss-Markov process; in fact, an AR
process). Since the state process is Gauss-Markov, the Kalman
filter only has to propagate the mean and covariance. The Kalman filter can also be
used to estimate the state of a time-varying linear system; just
change the system matrices from constant ones to time-varying ones.
"Graphical models are a marriage between
probability theory and graph theory. They provide a natural tool for dealing
with two problems that occur throughout applied mathematics and engineering --
uncertainty and complexity -- and in particular they are playing an increasingly
important role in the design and analysis of machine learning algorithms.
Fundamental to the idea of a graphical model is the notion of modularity -- a
complex system is built by combining simpler parts. Probability theory provides
the glue whereby the parts are combined, ensuring that the system as a whole is
consistent, and providing ways to interface models to data. The graph theoretic
side of graphical models provides both an intuitively appealing interface by
which humans can model highly-interacting sets of variables as well as a data
structure that lends itself naturally to the design of efficient general-purpose
algorithms.
Many of the classical multivariate probabilistic
systems studied in fields such as statistics, systems engineering, information
theory, pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture models, factor
analysis, hidden Markov models, Kalman filters and Ising models. The graphical
model framework provides a way to view all of these systems as instances of a
common underlying formalism. This view has many advantages -- in particular,
specialized techniques that have been developed in one field can be transferred
between research communities and exploited more widely. Moreover, the graphical
model formalism provides a natural framework for the design of new systems."
--- Michael Jordan, 1998.
Modeling of a deterministic/stochastic dynamic
system
- State equations: characterize a
linear/nonlinear deterministic/stochastic system; can have both system state
equations and output equations (can generate ARMA sequence as output)
- Graphical models: the link in a
(directed) graph represents the dependence relation between two nodes (r.v.'s);
by utilizing Markov property/causality/generative property in a graph, one can
design an efficient algorithm to reduce computational complexity.
- Trellis graph: can be used to characterize a
finite state machine (e.g., a convolutional encoder); can be used to
characterize the dependence relationships among symbols in a symbol sequence;
visualizes MLSE sequential decoding (Viterbi algorithm, a shortest-path
algorithm); see the sketch after this list.
- State transition diagram: can be used to
characterize finite state machine
- Graph with multiple stages: FFT butterfly
model, factor graph, trellis graph
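A minimal sketch (Python/NumPy, arbitrary made-up branch metrics) of the Viterbi algorithm as a shortest-path search over a trellis: at each stage, every state keeps its best accumulated path metric and surviving predecessor (add-compare-select), and the best state sequence is recovered by tracing back.

    import numpy as np

    rng = np.random.default_rng(4)

    # Viterbi as shortest path on a trellis; branch[t, i, j] is the (made-up)
    # cost of the transition from state i to state j at stage t.
    n_states, n_stages = 3, 6
    branch = rng.uniform(0, 1, size=(n_stages, n_states, n_states))

    metric = np.zeros(n_states)                     # accumulated path metrics
    survivor = np.zeros((n_stages, n_states), dtype=int)

    for t in range(n_stages):
        cand = metric[:, None] + branch[t]          # cand[i, j] = metric(i) + cost(i -> j)
        survivor[t] = np.argmin(cand, axis=0)       # best predecessor of each state j
        metric = np.min(cand, axis=0)               # add-compare-select

    # Trace back the minimum-metric (ML) state sequence.
    path = [int(np.argmin(metric))]
    for t in range(n_stages - 1, -1, -1):
        path.append(int(survivor[t, path[-1]]))
    path.reverse()
    print("best path metric: %.3f  state sequence: %s" % (metric.min(), path))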
Modeling of a deterministic/stochastic signal
(or symbol) sequence
- Time series (e.g., AR, ARMA, ARIMA, ARCH
models): model a random signal sequence
For linear systems, modeling approaches include
- State equations: modeling a linear
deterministic/stochastic system based on state space representation
(characterizing how state variables evolve over time)
- Transfer function: modeling a linear system
in the transform/frequency domain
- Impulse response: modeling a linear system in
the time domain
EM algorithm vs. BCJR
EM is a sub-optimal algorithm for computing the maximum
likelihood or MAP estimate. What EM obtains is a fixed point (typically a
local optimum), rather than the desired global optimum. Each EM iteration is
guaranteed not to decrease the likelihood, so the iterations converge to a fixed point.
BCJR is an algorithm that computes the a posteriori
probabilities exactly.
EM is an iterative algorithm while BCJR is a
sequential algorithm.
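A minimal sketch (Python/NumPy, made-up data) of the EM fixed-point iteration for a two-component 1-D Gaussian mixture with unknown weights and means and a known common unit variance; each E-step/M-step pair does not decrease the likelihood.

    import numpy as np

    rng = np.random.default_rng(5)

    # Two-component 1-D Gaussian mixture, unknown weights and means, unit variance.
    x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

    w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # initial guesses
    for _ in range(50):
        # E-step: posterior responsibility of each component for each sample.
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate the weights and means from the responsibilities.
        w = r.mean(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

    print("weights:", np.round(w, 3), " means:", np.round(mu, 3))   # roughly (0.5, 0.5) and (-2, 3)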
Kalman Filter (KF)
1) MMSE state estimate of linear Gaussian
dynamical systems
- Refer to: Scharf's book, Statistical Signal
Processing
- Structure of KF
- Simulation of the dynamical system and the
measurement process, self reconstruction (using feedback), refer to page 311
for block diagram of KF, i.e., Figure 7.17
- Innovation
- Filter gain
- Orthogonality between innovation and
prediction
2) Dynamic estimation as a recursive static
estimation
- Refer to: Bar-Shalom, et al., Estimation with
Applications to Tracking and Navigation
- Estimate of the state (filtered value),
predicted value of the state, smoothed value of the state (page 201)
- Derivation: start from a static estimation
(Gauss Markov theorem), replace the means by prior conditional means
(predictions, given previous measurement), and replace the covariance matrices
by prior conditional covariance matrices; derive posterior covariance of the
state. These result in the celebrated six equations of KF.
- Six equations of KF
- Gauss Markov Theorem: \hat{x}=\bar{x} +
C_{xz}*C_{zz}^{-1}* (z - \bar{z})
- Update the mean of the state at n+1
- Update the mean of the measurement at n+1
- Update the state prediction covariance
matrix
- Update the measurement prediction
covariance matrix
- Update the posterior state covariance
matrix
- The covariance matrices at each time instant
can be obtained offline since the covariance equations are independent of the
measurements.
- Solve DARE (Discrete-time Algebraic Riccati
Equation) to obtain steady-state one-step state prediction covariance matrix.
3) The recursive algorithm reduces computational complexity.
- Refer to: Kailath, et al., Linear Estimation
- The complexity is reduced from O(N^3 * n^3)
to O(N * n^3), where N is the number of measurements (length of time horizon),
and n is the dimension of each measurement (page 318)
4) Since linear systems preserve Gaussianity,
one only needs to propagate means and covariance matrices of the predicted state
and the predicted measurement.
- In contrast, a Particle Filter propagates
particles (PDF), instead of the first two moments.
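A minimal scalar sketch (Python/NumPy, made-up model parameters) of the Kalman recursion: predict the state and its covariance, form the innovation and the gain (the scalar analogue of C_{xz}*C_{zz}^{-1}), and update the filtered mean and posterior covariance; only the first two moments are propagated.

    import numpy as np

    rng = np.random.default_rng(6)

    # Scalar linear Gaussian system: x[n+1] = a*x[n] + w[n],  z[n] = x[n] + v[n],
    # with w ~ N(0, q) and v ~ N(0, r).
    a, q, r, N = 0.95, 0.1, 1.0, 200

    x = np.zeros(N); z = np.zeros(N)
    for n in range(1, N):
        x[n] = a * x[n - 1] + rng.normal(0, np.sqrt(q))
        z[n] = x[n] + rng.normal(0, np.sqrt(r))

    x_hat, p = 0.0, 1.0                        # filtered mean and covariance
    x_filt = np.zeros(N)
    for n in range(N):
        # Time update (prediction): propagate mean and covariance through the model.
        x_pred = a * x_hat
        p_pred = a * p * a + q
        # Measurement update (correction): the gain weighs the innovation.
        k = p_pred / (p_pred + r)              # gain = C_xz * C_zz^{-1} in scalar form
        x_hat = x_pred + k * (z[n] - x_pred)   # innovation = z[n] - predicted measurement
        p = (1 - k) * p_pred                   # posterior state covariance
        x_filt[n] = x_hat

    print("RMS error of raw measurements :", np.sqrt(np.mean((z - x) ** 2)))
    print("RMS error of Kalman estimates :", np.sqrt(np.mean((x_filt - x) ** 2)))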
Least square (LS) estimator vs. MMSE estimator
- The least squares estimator can be regarded as a
practical realization of the MMSE estimator. The MMSE estimator requires the
PDFs of the data, p(X|\theta) and p(\theta); but in practical situations,
we typically do not know the PDF of the data.
- To use least squares estimation, we assume the
underlying data-generation model is Y=f(X), where the parameters of the function f
are collected in a vector \theta. Then, given the measurements {X_i, Y_i} (i=1,
...,N), we can estimate \theta by least squares as
\hat{\theta} = arg min_{\theta} \sum_{i=1}^N [Y_i - f(X_i)]^2
- That is, least squares estimation amounts to solving
the minimization problem
min Q, where Q = \sum_{i=1}^N [Y_i - f(X_i)]^2
- Taking the partial derivative of Q with respect to each of
the parameters and setting it to zero, we obtain a system of equations.
Solving these equations gives the LS estimates of the parameters.
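A minimal sketch (Python/NumPy, made-up data) for the linear case f(x) = \theta_0 + \theta_1*x: setting the partial derivatives of Q to zero gives the normal equations (A^T A)\theta = A^T Y, solved here with a standard least squares routine.

    import numpy as np

    rng = np.random.default_rng(7)

    # Linear model f(x) = theta0 + theta1 * x with additive noise.
    theta_true = np.array([1.0, 2.5])
    X = rng.uniform(0, 10, 200)
    Y = theta_true[0] + theta_true[1] * X + rng.normal(0, 1, 200)

    # Normal equations (A^T A) theta = A^T Y, solved by a library LS routine.
    A = np.column_stack([np.ones_like(X), X])      # design matrix for [theta0, theta1]
    theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
    print("LS estimate:", theta_hat)               # should be close to [1.0, 2.5]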
Principle of SAR (synthetic aperture radar)
- Basic idea: based on the well-known principle
of echo ranging.
- Transmit a waveform pulse, which is known
(no randomness). The carrier is microwave.
- The pulse is reflected by objects.
- The reflected signal carries information
about the objects. Different objects may have different reflectivity,
which attenuates the transmitted signal differently.
- Demodulate the reflected signal.
- From the demodulated signal, we can
reconstruct the image of the terrain (i.e., an estimate of the complex
reflectivity as a function of range). The resolution of the
image depends on the waveform pulse used.
- Pulse selection:
- Continuous waveform (CW) pulse: e.g., a
raised cosine pulse modulating a sinusoid (AM)
- Dispersed waveform pulse: linear FM chirp
pulse
- Can provide very large time-bandwidth
product
- So it can provide large energy per
pulse (to increase standoff range) due to use of long pulse duration,
while at the same time it can also provide high resolution due to large
bandwidth used.
- The waveform pulse is dispersed in time,
not in bandwidth. That is, for equal bandwidths, an FM chirp pulse is
dispersed in time by a factor equal to its time-bandwidth product,
compared to a CW burst pulse; see the sketch after this list.
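A small sketch (Python/NumPy, made-up sample rate, pulse length, and bandwidth; rectangular pulse envelopes for simplicity) illustrating why a large time-bandwidth product helps: after matched filtering (pulse compression), the CW burst's mainlobe width is on the order of the pulse length T, while an LFM chirp of the same length compresses to roughly 1/B.

    import numpy as np

    fs, T, B = 1000.0, 1.0, 100.0        # sample rate (Hz), pulse length (s), chirp bandwidth (Hz)
    t = np.arange(0, T, 1 / fs)

    cw = np.exp(2j * np.pi * 10.0 * t)                 # CW burst at 10 Hz
    chirp = np.exp(1j * np.pi * (B / T) * t ** 2)      # linear FM chirp sweeping B Hz

    def mainlobe_width(p):
        # Matched filter = correlation with the conjugated, time-reversed pulse.
        out = np.abs(np.convolve(p, np.conj(p[::-1])))
        return np.sum(out > out.max() / np.sqrt(2)) / fs   # -3 dB width in seconds

    print("CW burst  mainlobe width ~ %.3f s (order of T)" % mainlobe_width(cw))
    print("LFM chirp mainlobe width ~ %.4f s (order of 1/B = %.3f s)" % (mainlobe_width(chirp), 1 / B))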
Spectral Estimation
- Nonparametric methods
- Periodogram
- Burg's maximum entropy
- Cepstrum
- Parametric methods
- Model selection
- Parameter estimation for AR(p) process
- Least square method
- Yule-Walker equations (see the sketch after this list)
- Burg's method
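A minimal sketch (Python/NumPy, made-up AR(2) coefficients) of parametric spectral estimation via the Yule-Walker equations: estimate the autocorrelation sequence, solve for the AR coefficients and the innovation variance, and form the parametric PSD from them.

    import numpy as np

    rng = np.random.default_rng(8)

    # Simulate an AR(2) process x[t] = a1*x[t-1] + a2*x[t-2] + e[t].
    a_true = np.array([1.3, -0.7])
    N = 5000
    x = np.zeros(N)
    for t in range(2, N):
        x[t] = a_true[0] * x[t - 1] + a_true[1] * x[t - 2] + rng.normal()

    def acov(x, k):                               # biased sample autocovariance at lag k
        return np.dot(x[:len(x) - k], x[k:]) / len(x)

    r = np.array([acov(x, k) for k in range(3)])
    R = np.array([[r[0], r[1]], [r[1], r[0]]])    # Toeplitz autocorrelation matrix
    a_hat = np.linalg.solve(R, r[1:3])            # Yule-Walker equations: R a = [r1, r2]^T
    sigma2 = r[0] - a_hat @ r[1:3]                # innovation variance estimate

    # Parametric PSD estimate S(f) = sigma2 / |1 - a1 e^{-j2pi f} - a2 e^{-j4pi f}|^2.
    f = np.linspace(0, 0.5, 512)
    A = 1 - a_hat[0] * np.exp(-2j * np.pi * f) - a_hat[1] * np.exp(-4j * np.pi * f)
    S = sigma2 / np.abs(A) ** 2
    print("AR coefficients:", np.round(a_hat, 3))            # should be close to [1.3, -0.7]
    print("spectral peak at f = %.3f cycles/sample" % f[np.argmax(S)])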
How to model non-stationary process
- y(t)=x(t)+u(t)+v(t)
- where x(t) is a stationary process with
zero mean, u(t) is
a periodic deterministic signal, and v(t) is an aperiodic deterministic
signal.
- First, estimate v(t) using a polynomial model, e.g., v(t)=a0+a1*t+a2*t^2+a3*t^3. The parameters of the model
can be estimated by the least squares method.
- Second, estimate u(t) using periodogram.
The peaks in the periodogram give the frequency components of u(t).
- Third, remove u(t) and v(t) from y(t),
i.e., x(t) = y(t) - u(t) - v(t). Model x(t) by stochastic models such
as AR(p), MA(q), or ARMA(p,q); see the sketch at the end of this section.
- y(t)=x(t)+u(t)+v(t)
- where x(t) is a non-stationary process with
zero mean, u(t) is
a periodic deterministic signal, and v(t) is an aperiodic deterministic
signal.
- x(t) may be modeled as the output of linear
time-variant (LTV) system, with a zero-mean stationary process s(t) as the
input, and k_t(\tau) as time varying impulse response of the LTV system.
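A compact sketch (Python/NumPy, made-up signal consisting of a linear trend, one sinusoid, and an AR(1) part) of the three-step recipe in the first bullet above: fit the polynomial trend v(t) by least squares, locate the periodic component u(t) from the periodogram, remove both, and fit a stochastic model to the residual x(t).

    import numpy as np

    rng = np.random.default_rng(9)

    # Made-up non-stationary signal y(t) = x(t) + u(t) + v(t).
    N = 2048
    t = np.arange(N, dtype=float)
    v = 0.5 + 0.002 * t                             # aperiodic deterministic trend
    u = 2.0 * np.cos(2 * np.pi * 0.05 * t)          # periodic deterministic part
    x = np.zeros(N)
    for n in range(1, N):
        x[n] = 0.8 * x[n - 1] + rng.normal()        # stationary AR(1) part
    y = x + u + v

    # 1) Estimate v(t) with a low-order polynomial fitted by least squares.
    v_hat = np.polyval(np.polyfit(t, y, deg=1), t)

    # 2) Estimate the frequency of u(t) from the periodogram of the detrended data.
    d = y - v_hat
    P = np.abs(np.fft.rfft(d)) ** 2 / N
    f = np.fft.rfftfreq(N)
    f0 = f[np.argmax(P[1:]) + 1]                    # skip the DC bin
    print("dominant frequency ~ %.4f (true 0.05)" % f0)

    # 3) Remove u(t) (LS fit of a sinusoid at f0) and model the residual as AR(1).
    A = np.column_stack([np.cos(2 * np.pi * f0 * t), np.sin(2 * np.pi * f0 * t)])
    u_hat = A @ np.linalg.lstsq(A, d, rcond=None)[0]
    res = d - u_hat
    a1 = np.dot(res[1:], res[:-1]) / np.dot(res[:-1], res[:-1])
    print("AR(1) coefficient ~ %.3f (true 0.8)" % a1)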