Statistical Signal Processing (Detection and Estimation)
The goal of hypothesis testing (detection) is to decide which
hypothesis (among n hypotheses, where n can be any integer greater than
1) is true, given some observation. A hypothesis is a statement about
a possible phenomenon and need not be random; the observation, however,
is a random function of the hypothesis.
For example, the null hypothesis H0 is that a person has lung cancer
while the alternative hypothesis H1 is that the person does not have
lung cancer. These two hypotheses are not random; we simply do not
know which hypothesis is true. But under each hypothesis, the
observation (e.g., the number of white blood cells) is random.
The goal of estimation is to extract a quantity
of interest from noisy data/signals. Estimation problems can be classified into the
following categories:
- Parameter/point estimation: a parameter is a point in an
n-dimensional vector space (e.g., Euclidean space, Hilbert space, or Krein
space), where n can be any positive integer. A parameter could be
a deterministic continuous quantity, a discrete random variable, or a continuous
random variable, but not a deterministic discrete quantity. If
the parameter is a discrete
random variable, the estimation problem becomes a hypothesis test. Note that
hypothesis testing is not a subset of parameter estimation, since a hypothesis need not be a random
variable.
- (Time-varying) signal estimation: the signal
could be discrete-time or continuous-time, and it could also be deterministic or
stochastic. A basic method for signal estimation is the Kalman filter. For
continuous-time signal estimation, Itô stochastic calculus is needed,
which is fairly advanced.
- Density
estimation: there are two approaches, i.e., parametric and
non-parametric. Parametric methods are the same as those in parameter
estimation. Non-parametric methods use various kernel functions (like
weighted windowing used in FIR filter design, spectral estimation,
non-linear image processing, etc.)
- Confidence interval estimation
Key techniques:
all detection/estimation problems are optimization problems with respect to a
certain criterion, as listed below. So
whenever someone talks about an optimal detector/estimator, don't forget to ask
what optimality criterion is used.
- Maximum likelihood criterion: (used for both
detection and estimation)
- LRT (Likelihood Ratio Test) for simple
hypothesis tests: the threshold is 1, that is, decide in favor of the
hypothesis with the larger likelihood. It can also be used for multiple hypothesis tests.
- GLRT (Generalized LRT) for composite
hypothesis tests: take the max/sup-norm of the likelihood function over
the parameter set as the generalized likelihood function.
- ML estimator for estimation: used when
the prior distribution is unknown or when the parameter to be
estimated is deterministic. An ML estimate may be biased, but it is
consistent (asymptotically unbiased) and asymptotically efficient
(asymptotically achieving the Cramer-Rao bound).
- Neyman-Pearson criterion: (used for detection
only)
- Maximize the detection probability
(called power) subject to a constraint on the false alarm probability
(called size). Generally, this is a constrained optimization
problem and can be solved by the Lagrange multiplier method.
- More importantly, the Neyman-Pearson
criterion converts a multi-objective optimization problem into a
single-objective (constrained) optimization problem. By changing the
value of the tolerable false alarm probability, one can trace out the receiver
operating characteristic (ROC) curve, which is the Pareto frontier (each
point on it is Pareto optimal).
- A typical solution to the Neyman-Pearson
problem (the above constrained optimization problem) is the LRT/GLRT:
determine the threshold from the tolerable false alarm probability, then
use this threshold in the LRT/GLRT; see the first sketch after this list.
- Insight: the Neyman-Pearson criterion
achieves a tradeoff between two contradictory requirements: detection
probability and false alarm probability. The larger the tolerable false
alarm probability, the larger the achievable detection probability.
- Fundamental implication: whenever there
are two contradictory design requirements, Neyman-Pearson criterion can
be used to formulate a constrained optimization problem, and the optimal solution
results in a best tradeoff between the two contradictory requirements.
- Why is there no Neyman-Pearson detector
for multiple hypothesis testing?
- Because testing among N
hypotheses incurs N*(N-1) types of errors, while the Neyman-Pearson detector
can only address two types of errors (false alarm and miss).
- Minimum variance unbiased (MVUB) estimation
criterion:
- Every efficient estimator is MVUB, but
the converse is not true. That is, an estimator may be MVUB
but not achieve the Cramer-Rao
bound.
- Fisher information matrix: the expectation of the
second derivative of the Shannon codeword length log(1/p(X|\theta)).
- I(\theta) = E[s(\theta,X)*s^T(\theta,X)] =
E[d^2 log(1/p(X|\theta))/d\theta^2], where the score s(\theta,X) = d log(p(X|\theta))/d\theta
and s^T is the transpose of s.
- Cramer-Rao bound: the estimation error
covariance matrix
C = E[(\hat{\theta}-\theta)*(\hat{\theta}-\theta)^T] >= I^{-1}(\theta), where
I^{-1}(\theta) is the inverse of I(\theta); see the second sketch after this list.
- Bayesian criterion: (used for both detection
and estimation)
- Used for estimating a random parameter
(not for estimating a deterministic parameter).
- Assume that the prior distribution is
known and the cost/risk function is given.
- Minimize the average
cost/risk. The cost itself is generally a random
variable (due to the randomness of the noise) and cannot be meaningfully
optimized directly, so we optimize its average instead. If the
cost is a discrete random variable, we have another option: the minimax
criterion.
- More importantly, the Bayesian criterion
converts a multi-objective optimization problem into a single-objective
(unconstrained) optimization problem using the weighted-sum approach. By
changing the values of the weights/costs, one can trace out the Pareto
frontier.
- If the cost function is quadratic, the
Bayes estimator, called the Minimum Mean Square Error (MMSE) estimator, is
the mean of the posterior PDF (i.e., the conditional mean of the
parameter given the observations).
- The nice thing about MMSE is that the
MMSE estimator is linear if the signal and observation are jointly
Gaussian. This has a similar flavor to quadratic
programming (in optimization theory), which also reduces to solving
linear equations, since the derivative/gradient of a quadratic cost
is linear.
- For non-Gaussian signal/noise, the MMSE
estimator is generally nonlinear. Due to the simplicity of linear
estimation, one often uses the linear MMSE (LMMSE) estimator instead, which is
sub-optimal compared to the MMSE estimator.
- If the cost function is uniform (hit-or-miss), the
Bayes estimator reduces to the MAP
(Maximum A Posteriori probability) estimator. A point at
which a density achieves its maximum value is termed a mode of the PDF,
so the MAP estimate is the mode of the posterior PDF. Note that the
reduction from Bayes to MAP requires the error threshold of the uniform cost
function to be very small.
- If the cost function is the absolute error, the
Bayes estimator, termed the Minimum Mean Absolute Error (MMAE) estimator, is
the median of the posterior PDF.
- In summary, the MMSE, MMAE, and MAP
estimates are the mean, median, and mode of the posterior PDF,
respectively. In general, Bayes estimates are features of
the posterior PDF. This is one reason why Bayesian
estimators are widely used in pattern recognition.
- For Gaussian posterior PDF, MMSE=MMAE=MAP.
- Minimax criterion: (used for both detection
and estimation)
- Game-theoretical approach: the minimax
detector/estimator is based on a zero-sum game between
two players, an experimentalist and Mother Nature. Mother
Nature chooses the least favorable prior PMF (probability mass
function); the minimax detector/estimator then reduces to the
Bayes detector/estimator with respect to this least favorable
prior. So the minimax detector/estimator can be regarded as a
special case of the Bayes detector/estimator.
- The minimax detector/estimator is applicable to
both discrete and continuous random variable detection/estimation. For a
continuous r.v., play the strategies with an arbitrary PDF. (In other
words, discretize the range of the r.v. to get a PMF for the resulting
discrete r.v.; the limit of this discrete r.v. is the continuous
r.v., so we can play with the PMF of the discrete r.v. in
place of the PDF of the continuous r.v.)
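A minimal Monte Carlo sketch (Python/NumPy, with made-up parameters) of the LRT and the Neyman-Pearson criterion for the simple Gaussian mean-shift test H0: x ~ N(0,1) vs. H1: x ~ N(1,1): thresholding the log-likelihood ratio at log(1) = 0 gives the ML rule, sweeping the threshold traces the ROC curve (the Pareto frontier of detection vs. false alarm probability), and choosing the threshold from a tolerable false alarm probability gives the Neyman-Pearson detector.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simple vs. simple test: H0: x ~ N(0, 1), H1: x ~ N(mu1, 1), n i.i.d. samples.
    mu1, n, trials = 1.0, 10, 20_000
    x0 = rng.normal(0.0, 1.0, size=(trials, n))   # data generated under H0
    x1 = rng.normal(mu1, 1.0, size=(trials, n))   # data generated under H1

    # Log-likelihood ratio log[p(x|H1)/p(x|H0)]; for a Gaussian mean shift it is
    # monotone in the sample mean, so thresholding the LLR = thresholding the mean.
    def llr(x):
        return np.sum(mu1 * x - 0.5 * mu1 ** 2, axis=1)

    l0, l1 = llr(x0), llr(x1)

    # Sweep the threshold to trace the ROC curve (P_D vs. P_FA).
    thresholds = np.linspace(l0.min(), l1.max(), 200)
    p_fa = np.array([(l0 > t).mean() for t in thresholds])   # false alarm (size)
    p_d = np.array([(l1 > t).mean() for t in thresholds])    # detection (power)
    print("a few ROC points (P_FA, P_D):", list(zip(np.round(p_fa[::50], 3), np.round(p_d[::50], 3))))

    # ML rule: threshold the LLR at log(1) = 0.
    print("ML rule    : P_FA=%.3f  P_D=%.3f" % ((l0 > 0).mean(), (l1 > 0).mean()))

    # Neyman-Pearson: pick the threshold that meets a tolerable false alarm rate.
    alpha = 0.05
    t_np = np.quantile(l0, 1 - alpha)   # empirical threshold for P_FA ~ alpha
    print("NP (5%% FA) : P_FA=%.3f  P_D=%.3f" % ((l0 > t_np).mean(), (l1 > t_np).mean()))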
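A second minimal sketch (Python/NumPy, made-up numbers) of the Fisher information and the Cramer-Rao bound: for x_i ~ N(\theta, \sigma^2) with known \sigma, the score is (x-\theta)/\sigma^2, the per-sample Fisher information is 1/\sigma^2, so the CRB for n samples is \sigma^2/n; the ML estimate (the sample mean) is efficient, so its empirical variance should match the bound.

    import numpy as np

    rng = np.random.default_rng(1)

    # x_i ~ N(theta, sigma^2), sigma known: score = (x - theta)/sigma^2,
    # per-sample Fisher information = 1/sigma^2, CRB for n samples = sigma^2/n.
    theta, sigma, n, trials = 2.0, 3.0, 50, 100_000

    x = rng.normal(theta, sigma, size=(trials, n))
    theta_ml = x.mean(axis=1)                     # ML estimate = sample mean

    print("empirical variance of ML estimate:", theta_ml.var())
    print("Cramer-Rao bound sigma^2/n       :", sigma ** 2 / n)   # the two should nearly agree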
Types of estimators
- Maximum likelihood estimator
- Bayesian estimator
- Moment estimator:
- set the sample moments equal to the ensemble
moments and solve the resulting equations, yielding estimates of the parameters of
interest; see the sketch after this list.
- Least square estimator:
- does not require prior distribution
- minimize the quadratic cost, for given
observations Y, over the parameter of interest \theta.
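A minimal sketch (Python/NumPy, made-up parameter values) of the moment estimator for a Gamma(k, theta) sample: the ensemble moments are mean = k*theta and variance = k*theta^2, so equating them to the sample moments and solving gives the estimates.

    import numpy as np

    rng = np.random.default_rng(2)

    # Method of moments for Gamma(k, theta): mean = k*theta, var = k*theta^2,
    # so theta_hat = var/mean and k_hat = mean^2/var.
    k_true, theta_true = 3.0, 2.0
    x = rng.gamma(shape=k_true, scale=theta_true, size=100_000)

    m1, v = x.mean(), x.var()          # sample mean and sample variance
    theta_hat = v / m1
    k_hat = m1 ** 2 / v
    print("k_hat=%.3f  theta_hat=%.3f" % (k_hat, theta_hat))   # should be close to (3, 2)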
Important concepts
- A statistic of an unknown fixed
parameter
- a real-valued function of the
measurements/samples (whose distribution depends on the parameter).
- a quantity (such as the median,
quartile deviation, etc.) calculated from the observed data.
- A sufficient statistic of an unknown
fixed parameter contains all the statistical information required to estimate
the parameter.
- A minimal sufficient statistic is a sufficient statistic that is a function
of every other sufficient statistic (i.e., it achieves the maximum possible data reduction).
Minimum Description Length (MDL) model
selection
If someone asks you whether a Markov chain is a reasonable model for
time series analysis, you should ask: compared to what? Without a
baseline solution, anything could be a reasonable starting point.
So the first step in modeling is to select a model; the second step
is to estimate the parameters of the model.
MDL can be used as a criterion for model
selection. It measures the universal code length of the data, given a
model. One can compare two different models based on their MDLs for the
data available and favor the one with the shorter code length. However, it is not
clear how to optimize over a model set (a functional space) with infinitely many models
based on MDL. Note that MDL is used for comparing given models in a
parameterized family, rather than for searching for the best model over the whole
functional space.
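A rough sketch (Python/NumPy, made-up data) of MDL-style model-order selection using the common two-part-code approximation MDL(p) ~= (N/2) log(residual variance) + (p/2) log N for AR(p) models fitted by least squares; the order with the shortest code length is favored.

    import numpy as np

    rng = np.random.default_rng(3)

    # Two-part-code approximation of MDL for an AR(p) model fitted by least squares:
    #   MDL(p) ~= (N/2) * log(residual variance) + (p/2) * log(N)
    # (Gaussian -log-likelihood up to constants, plus the parameter code length).
    N = 2000
    x = np.zeros(N)
    for t in range(2, N):                  # simulate an AR(2) process as test data
        x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

    def mdl_ar(x, p):
        X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
        y = x[p:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)   # LS fit of the AR coefficients
        sigma2 = np.mean((y - X @ a) ** 2)          # residual (innovation) variance
        m = len(y)
        return 0.5 * m * np.log(sigma2) + 0.5 * p * np.log(m)

    for p in (1, 2, 3, 4, 5):
        print("AR(%d): MDL = %.1f" % (p, mdl_ar(x, p)))   # expect the minimum near p = 2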
The Kalman filter requires 1) a linear system
equation and 2) Gaussian noise. The linearity of the system equation
preserves the Gaussianity of the state process (a Gauss-Markov process; in fact, an AR
process). Since the state process is Gauss-Markov, the Kalman
filter only has to propagate the mean and covariance. The Kalman filter can also be
used to estimate the state of a time-varying linear system; just
change the system matrices from constant ones to time-varying ones.
"Graphical models are a marriage between
probability theory and graph theory. They provide a natural tool for dealing
with two problems that occur throughout applied mathematics and engineering --
uncertainty and complexity -- and in particular they are playing an increasingly
important role in the design and analysis of machine learning algorithms.
Fundamental to the idea of a graphical model is the notion of modularity -- a
complex system is built by combining simpler parts. Probability theory provides
the glue whereby the parts are combined, ensuring that the system as a whole is
consistent, and providing ways to interface models to data. The graph theoretic
side of graphical models provides both an intuitively appealing interface by
which humans can model highly-interacting sets of variables as well as a data
structure that lends itself naturally to the design of efficient general-purpose
algorithms.
Many of the classical multivariate probabilistic
systems studied in fields such as statistics, systems engineering, information
theory, pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture models, factor
analysis, hidden Markov models, Kalman filters and Ising models. The graphical
model framework provides a way to view all of these systems as instances of a
common underlying formalism. This view has many advantages -- in particular,
specialized techniques that have been developed in one field can be transferred
between research communities and exploited more widely. Moreover, the graphical
model formalism provides a natural framework for the design of new systems."
--- Michael Jordan, 1998.
Modeling of a deterministic/stochastic dynamic
system
- State equations: characterize a
linear/nonlinear deterministic/stochastic system; can have both system state
equations and output equations (can generate ARMA sequence as output)
- Graphical models: the link in a
(directed) graph represents the dependence relation between two nodes (r.v.'s);
by utilizing Markov property/causality/generative property in a graph, one can
design an efficient algorithm to reduce computational complexity.
- Trellis graph: can be used to characterize a
finite state machine (e.g., a convolutional encoder); can be used to
characterize the dependence relationships among symbols in a symbol sequence;
visualizes MLSE sequential decoding (Viterbi algorithm, a shortest-path
algorithm); see the sketch after this list.
- State transition diagram: can be used to
characterize finite state machine
- Graph with multiple stages: FFT butterfly
model, factor graph, trellis graph
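A minimal sketch (Python/NumPy, arbitrary made-up branch metrics) of the Viterbi algorithm as a shortest-path search over a trellis: at each stage, every state keeps its best accumulated path metric and surviving predecessor (add-compare-select), and the best state sequence is recovered by tracing back.

    import numpy as np

    rng = np.random.default_rng(4)

    # Viterbi as shortest path on a trellis; branch[t, i, j] is the (made-up)
    # cost of the transition from state i to state j at stage t.
    n_states, n_stages = 3, 6
    branch = rng.uniform(0, 1, size=(n_stages, n_states, n_states))

    metric = np.zeros(n_states)                     # accumulated path metrics
    survivor = np.zeros((n_stages, n_states), dtype=int)

    for t in range(n_stages):
        cand = metric[:, None] + branch[t]          # cand[i, j] = metric(i) + cost(i -> j)
        survivor[t] = np.argmin(cand, axis=0)       # best predecessor of each state j
        metric = np.min(cand, axis=0)               # add-compare-select

    # Trace back the minimum-metric (ML) state sequence.
    path = [int(np.argmin(metric))]
    for t in range(n_stages - 1, -1, -1):
        path.append(int(survivor[t, path[-1]]))
    path.reverse()
    print("best path metric: %.3f  state sequence: %s" % (metric.min(), path))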
Modeling of a deterministic/stochastic signal
(or symbol) sequence
- Time series (e.g., AR, ARMA, ARIMA, ARCH
models): model a random signal sequence
For linear systems, modeling approaches include
- State equations: modeling a linear
deterministic/stochastic system based on state space representation
(characterizing how state variables evolve over time)
- Transfer function: modeling a linear system
in the transform/frequency domain
- Impulse response: modeling a linear system in
the time domain
EM algorithm vs. BCJR
EM is a sub-optimal algorithm for computing the maximum
likelihood or MAP estimate. What EM obtains is a fixed point (typically a
local optimum), rather than the desired global optimum. Each EM iteration is
guaranteed not to decrease the likelihood, so the iterations converge to a fixed point.
BCJR is an algorithm that computes the a posteriori
probabilities exactly.
EM is an iterative algorithm while BCJR is a
sequential algorithm.
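A minimal sketch (Python/NumPy, made-up data) of the EM fixed-point iteration for a two-component 1-D Gaussian mixture with unknown weights and means and a known common unit variance; each E-step/M-step pair does not decrease the likelihood.

    import numpy as np

    rng = np.random.default_rng(5)

    # Two-component 1-D Gaussian mixture, unknown weights and means, unit variance.
    x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

    w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # initial guesses
    for _ in range(50):
        # E-step: posterior responsibility of each component for each sample.
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate the weights and means from the responsibilities.
        w = r.mean(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

    print("weights:", np.round(w, 3), " means:", np.round(mu, 3))   # roughly (0.5, 0.5) and (-2, 3)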
Kalman Filter (KF)
1) MMSE state estimate of linear Gaussian
dynamical systems
- Refer to: Scharf's book, Statistical Signal
Processing
- Structure of KF
- Simulation of the dynamical system and the
measurement process, self reconstruction (using feedback), refer to page 311
for block diagram of KF, i.e., Figure 7.17
- Innovation
- Filter gain
- Orthogonality between innovation and
prediction
2) Dynamic estimation as a recursive static
estimation
- Refer to: Bar-Shalom, et al., Estimation with
Applications to Tracking and Navigation
- Estimate of the state (filtered value),
predicted value of the state, smoothed value of the state (page 201)
- Derivation: start from a static estimation
(Gauss Markov theorem), replace the means by prior conditional means
(predictions, given previous measurement), and replace the covariance matrices
by prior conditional covariance matrices; derive posterior covariance of the
state. These result in the celebrated six equations of KF.
- Six equations of KF
- Gauss Markov Theorem: \hat{x}=\bar{x} +
C_{xz}*C_{zz}^{-1}* (z - \bar{z})
- Update the mean of the state at n+1
- Update the mean of the measurement at n+1
- Update the state prediction covariance
matrix
- Update the measurement prediction
covariance matrix
- Update the posterior state covariance
matrix
- The covariance matrices at each time instant
can be obtained offline since the covariance equations are independent of the
measurements.
- Solve DARE (Discrete-time Algebraic Riccati
Equation) to obtain steady-state one-step state prediction covariance matrix.
3) The recursive algorithm reduces computational complexity.
- Refer to: Kailath, et al., Linear Estimation
- The complexity is reduced from O(N^3 * n^3)
to O(N * n^3), where N is the number of measurements (length of time horizon),
and n is the dimension of each measurement (page 318)
4) Since linear systems preserve Gaussianity,
one only needs to propagate means and covariance matrices of the predicted state
and the predicted measurement.
- In contrast, a Particle Filter propagates
particles (PDF), instead of the first two moments.
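A minimal scalar sketch (Python/NumPy, made-up model parameters) of the Kalman recursion: predict the state and its covariance, form the innovation and the gain (the scalar analogue of C_{xz}*C_{zz}^{-1}), and update the filtered mean and posterior covariance; only the first two moments are propagated.

    import numpy as np

    rng = np.random.default_rng(6)

    # Scalar linear Gaussian system: x[n+1] = a*x[n] + w[n],  z[n] = x[n] + v[n],
    # with w ~ N(0, q) and v ~ N(0, r).
    a, q, r, N = 0.95, 0.1, 1.0, 200

    x = np.zeros(N); z = np.zeros(N)
    for n in range(1, N):
        x[n] = a * x[n - 1] + rng.normal(0, np.sqrt(q))
        z[n] = x[n] + rng.normal(0, np.sqrt(r))

    x_hat, p = 0.0, 1.0                        # filtered mean and covariance
    x_filt = np.zeros(N)
    for n in range(N):
        # Time update (prediction): propagate mean and covariance through the model.
        x_pred = a * x_hat
        p_pred = a * p * a + q
        # Measurement update (correction): the gain weighs the innovation.
        k = p_pred / (p_pred + r)              # gain = C_xz * C_zz^{-1} in scalar form
        x_hat = x_pred + k * (z[n] - x_pred)   # innovation = z[n] - predicted measurement
        p = (1 - k) * p_pred                   # posterior state covariance
        x_filt[n] = x_hat

    print("RMS error of raw measurements :", np.sqrt(np.mean((z - x) ** 2)))
    print("RMS error of Kalman estimates :", np.sqrt(np.mean((x_filt - x) ** 2)))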
Least square (LS) estimator vs. MMSE estimator
- The least squares estimator can be regarded as a
practical realization of the MMSE estimator. The MMSE estimator requires the
PDFs of the data, p(X|\theta) and p(\theta); but in practical situations,
we typically do not know the PDF of the data.
- To use least squares estimation, we assume the
underlying data-generation model is Y=f(X), where the parameters of the function f
are collected in a vector \theta. Then, given the measurements {X_i, Y_i} (i=1,
...,N), we can estimate \theta by least squares as
\hat{\theta} = arg min_{\theta} \sum_{i=1}^N [Y_i - f(X_i)]^2
- That is, least squares estimation amounts to solving
the minimization problem
min Q, where Q = \sum_{i=1}^N [Y_i - f(X_i)]^2
- Taking the partial derivative of Q with respect to each of
the parameters and setting it to zero, we obtain a system of equations.
Solving these equations gives the LS estimates of the parameters.
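A minimal sketch (Python/NumPy, made-up data) for the linear case f(x) = \theta_0 + \theta_1*x: setting the partial derivatives of Q to zero gives the normal equations (A^T A)\theta = A^T Y, solved here with a standard least squares routine.

    import numpy as np

    rng = np.random.default_rng(7)

    # Linear model f(x) = theta0 + theta1 * x with additive noise.
    theta_true = np.array([1.0, 2.5])
    X = rng.uniform(0, 10, 200)
    Y = theta_true[0] + theta_true[1] * X + rng.normal(0, 1, 200)

    # Normal equations (A^T A) theta = A^T Y, solved by a library LS routine.
    A = np.column_stack([np.ones_like(X), X])      # design matrix for [theta0, theta1]
    theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
    print("LS estimate:", theta_hat)               # should be close to [1.0, 2.5]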
Principle of SAR (synthetic aperture radar)
- Basic idea: based on the well-known principle
of echo ranging.
- Transmit a waveform pulse, which is known
(no randomness). The carrier is microwave.
- The pulse is reflected by objects.
- The reflected signal carries information
about the objects. Different objects may have different reflectivity,
which attenuates the transmitted signal differently.
- Demodulate the reflected signal.
- From the demodulated signal, we can
reconstruct the image of the terrain (i.e., an estimate of the complex
reflectivity as a function of range). The resolution of the
image depends on the waveform pulse used.
- Pulse selection:
- Continuous waveform (CW) pulse: e.g., a
raised cosine pulse modulating a sinusoid (AM)
- Dispersed waveform pulse: linear FM chirp
pulse
- Can provide very large time-bandwidth
product
- So it can provide large energy per
pulse (to increase standoff range) due to use of long pulse duration,
while at the same time it can also provide high resolution due to large
bandwidth used.
- The waveform pulse is dispersed in time,
not in bandwidth. That is, for equal bandwidths, an FM chirp pulse is
dispersed in time by a factor equal to its time-bandwidth product,
compared to a CW burst pulse; see the sketch after this list.
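A small sketch (Python/NumPy, made-up sample rate, pulse length, and bandwidth; rectangular pulse envelopes for simplicity) illustrating why a large time-bandwidth product helps: after matched filtering (pulse compression), the CW burst's mainlobe width is on the order of the pulse length T, while an LFM chirp of the same length compresses to roughly 1/B.

    import numpy as np

    fs, T, B = 1000.0, 1.0, 100.0        # sample rate (Hz), pulse length (s), chirp bandwidth (Hz)
    t = np.arange(0, T, 1 / fs)

    cw = np.exp(2j * np.pi * 10.0 * t)                 # CW burst at 10 Hz
    chirp = np.exp(1j * np.pi * (B / T) * t ** 2)      # linear FM chirp sweeping B Hz

    def mainlobe_width(p):
        # Matched filter = correlation with the conjugated, time-reversed pulse.
        out = np.abs(np.convolve(p, np.conj(p[::-1])))
        return np.sum(out > out.max() / np.sqrt(2)) / fs   # -3 dB width in seconds

    print("CW burst  mainlobe width ~ %.3f s (order of T)" % mainlobe_width(cw))
    print("LFM chirp mainlobe width ~ %.4f s (order of 1/B = %.3f s)" % (mainlobe_width(chirp), 1 / B))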
Spectral Estimation
- Nonparametric methods
- Periodogram
- Burg's maximum entropy
- Cepstrum
- Parametric methods
- Model selection
- Parameter estimation for AR(p) process
- Least square method
- Yule-Walker equations (see the sketch after this list)
- Burg's method
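A minimal sketch (Python/NumPy, made-up AR(2) coefficients) of parametric spectral estimation via the Yule-Walker equations: estimate the autocorrelation sequence, solve for the AR coefficients and the innovation variance, and form the parametric PSD from them.

    import numpy as np

    rng = np.random.default_rng(8)

    # Simulate an AR(2) process x[t] = a1*x[t-1] + a2*x[t-2] + e[t].
    a_true = np.array([1.3, -0.7])
    N = 5000
    x = np.zeros(N)
    for t in range(2, N):
        x[t] = a_true[0] * x[t - 1] + a_true[1] * x[t - 2] + rng.normal()

    def acov(x, k):                               # biased sample autocovariance at lag k
        return np.dot(x[:len(x) - k], x[k:]) / len(x)

    r = np.array([acov(x, k) for k in range(3)])
    R = np.array([[r[0], r[1]], [r[1], r[0]]])    # Toeplitz autocorrelation matrix
    a_hat = np.linalg.solve(R, r[1:3])            # Yule-Walker equations: R a = [r1, r2]^T
    sigma2 = r[0] - a_hat @ r[1:3]                # innovation variance estimate

    # Parametric PSD estimate S(f) = sigma2 / |1 - a1 e^{-j2pi f} - a2 e^{-j4pi f}|^2.
    f = np.linspace(0, 0.5, 512)
    A = 1 - a_hat[0] * np.exp(-2j * np.pi * f) - a_hat[1] * np.exp(-4j * np.pi * f)
    S = sigma2 / np.abs(A) ** 2
    print("AR coefficients:", np.round(a_hat, 3))            # should be close to [1.3, -0.7]
    print("spectral peak at f = %.3f cycles/sample" % f[np.argmax(S)])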
How to model non-stationary process
- y(t)=x(t)+u(t)+v(t)
- where x(t) is a stationary process with
zero mean, u(t) is
a periodic deterministic signal, and v(t) is an aperiodic deterministic
signal.
- First, estimate v(t) using a polynomial model, e.g., v(t)=a0+a1*t+a2*t^2+a3*t^3. The parameters of the model
can be estimated by the least squares method.
- Second, estimate u(t) using periodogram.
The peaks in the periodogram give the frequency components of u(t).
- Third, remove u(t) and v(t) from y(t),
i.e., x(t) = y(t) - u(t) - v(t). Model x(t) by stochastic models such
as AR(p), MA(q), or ARMA(p,q); see the sketch at the end of this section.
- y(t)=x(t)+u(t)+v(t)
- where x(t) is a non-stationary process with
zero mean, u(t) is
a periodic deterministic signal, and v(t) is an aperiodic deterministic
signal.
- x(t) may be modeled as the output of linear
time-variant (LTV) system, with a zero-mean stationary process s(t) as the
input, and k_t(\tau) as time varying impulse response of the LTV system.
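A compact sketch (Python/NumPy, made-up signal consisting of a linear trend, one sinusoid, and an AR(1) part) of the three-step recipe in the first bullet above: fit the polynomial trend v(t) by least squares, locate the periodic component u(t) from the periodogram, remove both, and fit a stochastic model to the residual x(t).

    import numpy as np

    rng = np.random.default_rng(9)

    # Made-up non-stationary signal y(t) = x(t) + u(t) + v(t).
    N = 2048
    t = np.arange(N, dtype=float)
    v = 0.5 + 0.002 * t                             # aperiodic deterministic trend
    u = 2.0 * np.cos(2 * np.pi * 0.05 * t)          # periodic deterministic part
    x = np.zeros(N)
    for n in range(1, N):
        x[n] = 0.8 * x[n - 1] + rng.normal()        # stationary AR(1) part
    y = x + u + v

    # 1) Estimate v(t) with a low-order polynomial fitted by least squares.
    v_hat = np.polyval(np.polyfit(t, y, deg=1), t)

    # 2) Estimate the frequency of u(t) from the periodogram of the detrended data.
    d = y - v_hat
    P = np.abs(np.fft.rfft(d)) ** 2 / N
    f = np.fft.rfftfreq(N)
    f0 = f[np.argmax(P[1:]) + 1]                    # skip the DC bin
    print("dominant frequency ~ %.4f (true 0.05)" % f0)

    # 3) Remove u(t) (LS fit of a sinusoid at f0) and model the residual as AR(1).
    A = np.column_stack([np.cos(2 * np.pi * f0 * t), np.sin(2 * np.pi * f0 * t)])
    u_hat = A @ np.linalg.lstsq(A, d, rcond=None)[0]
    res = d - u_hat
    a1 = np.dot(res[1:], res[:-1]) / np.dot(res[:-1], res[:-1])
    print("AR(1) coefficient ~ %.3f (true 0.8)" % a1)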