Communication theory
Noise performance of communication systems
- Uncoded modulation such as BPSK, QPSK
- What is the symbol error probability (SEP) vs. Eb/N0 performance for an AWGN channel?
- What about an ISI + AWGN channel?
- What about a fading + AWGN channel?
- What about a fading + ISI + AWGN channel?
- Coded modulation
- Uncoded DS-CDMA + modulation
- Convolutionally coded DS-CDMA + modulation
- Multiuser detection + convolutionally coded DS-CDMA + modulation
- Under what conditions does the performance for an MAI + ISI + AWGN channel reduce to single-user performance?
- MAI + fading + AWGN channel
- MAI + fading + ISI + AWGN channel
- OFDM+ modulation
- Why coin the term "keying" in ASK, FSK,
PSK?
- Keying = switching
- Since your message to be sent consists of letters from a discrete alphabet (like Morse code), you could imagine you are typing on a typewriter: you type a letter and then shift to another key and type. So in digital modulation (ASK, FSK, PSK), the amplitude/frequency/phase is shifted by keying/typing.
Modulation:
- Analog modulation
- Amplitude modulation (AM)
- Phase modulation (PM)
- Frequency modulation (FM)
- Digital modulation (a special case of analog
modulation)
- Baseband
- Pulse amplitude modulation (PAM)
- Pulse position modulation (PPM)
- Pulse duration modulation (PDM) or Pulse
width modulation (PWM)
- Pulse frequency modulation (PFM)
- Passband
- Amplitude shift keying (ASK)
- Phase shift keying (PSK)
- Frequency shift keying (FSK)
Digital Communications
- Basically, a receiver consists of two
components:
- Signal demodulator
- Correlation demodulator (correlator)
- Matched filter demodulator
- Channel equalizer: may be combined
with a detector, e.g., Decision Feedback Equalizer (DFE)
- Detector
- When the signal has no memory, use
symbol-by-symbol detector, such as ML, MAP.
- When the signal has memory (trellis
coded modulation, convolutional codes), use Maximum Likelihood
Sequence Detector (MLSD), or use symbol-by-symbol MAP detector,
where each symbol decision is based on an observation of a sequence
of received signal vectors.
- An essential objective of communication
system design is to combat the impairments caused by the
channel. So it is important to study the channel behavior.
- Channel models:
- Additive noise: modeling the thermal
noise generated in the receiver.
- Multiplicative noise: the channel gain
process is a stochastic process (i.e., fading channel)
- ISI:
- deterministic (Linear Time Invariant
channel): delay spread (or response spread) of an input impulse due to
finite bandwidth of the channel.
- Ideal Nyquist channels (having a constant spectrum) or channels with a raised cosine spectrum do not cause ISI, i.e., the values of the overall impulse response at the sampling instants t = kT are zero except at k = 0.
- Non-ideal band-limited channels cause ISI, i.e., the values of the impulse response at those sampling instants are non-zero.
- stochastic (multi-path fading channel, each path is a
stochastic process): delay spread due to multiple paths.
- If the maximum delay spread is
less than the symbol duration, there is no ISI. This
is the reason why OFDM (reducing the symbol rate of each
sub-channel) performs well under multi-path environments.
- Linear channel models: channel gain model +
noise model
- Memoryless channels:
- Discrete memoryless channels (DMC):
- described by P(y_i|x_i), where x_i
and y_i are both in discrete time; x_i is discrete-valued, and y_i
could be continuous-valued or discrete-valued.
- At MODEM layer, y_i is
continuous-valued (taking values from continuous ranges), y_i=x_i+n_i,
where n_i is continuous-valued (e.g., Gaussian noise).
- These models are typically used in
detection theory.
- At CODEC layer, y_i is
discrete-valued (y_i is generated by a hard decision). Example
of such channels: Binary Symmetric Channel (BSC), Binary Erasure Channel.
- These models are typically used in
information theory.
- Continuous memoryless channels:
- Noise model:
- Additive White Gaussian Noise
(AWGN): the marginal distribution of AWGN process is
Gaussian with zero mean; the auto-correlation is a delta
function; the power spectral density is a constant over the
range of frequency from 0 to infinity (i.e., white
property). Since it is a Gaussian process, the first
order and second order statistics are enough to completely
characterize the AWGN process.
- Additive colored Gaussian
noise: different from AWGN, the power spectral density is
not flat over the range of frequency from 0 to infinity; the
auto-correlation is not a delta function.
- impulse noise (e.g., hyper-Gaussian)
- Channel gain model: fading
channel with independent channel gains (only marginal
distribution is enough to characterize the memoryless fading
channel).
- Channels with memory:
- Noise model: channels with additive
colored Gaussian noise (power spectrum density function is not a
constant; auto-correlation is not a Dirac delta function). For
such channels, water filling/pouring can be used to achieve the
channel capacity under the constraint on the signal power.
- Channel gain model (linear
distortion): ISI channel (deterministic and Linear Time Invariant
system)
- Channel gain model: Fading channel
(channel gain is a stochastic process)
- Slow or fast fading: Doppler
spectrum
- Flat or frequency-selective fading (multipath): Power Delay Profile (PDP), ISI
- Frequency-selective slow
fading channel: tapped-delay-line channel model
- Model: h(t,\tau) = \sum_{k=1}^L
c_k(t) \delta(\tau - k/W)
where h(t,\tau) is the impulse
response of the channel at time t; L is the number of resolvable
paths; c_k(t) is the complex voltage gain of path k at time t
(assuming it is constant during a pulse/chip duration due to slow
fading); \delta(\tau) is Dirac delta function; W is the bandwidth
of the transmitted passband signal (W/2 is the bandwidth of the
transmitted baseband signal).
- This model considers non-zero
propagation delay so that the first resolvable path has a delay of
1/W. W is equal to the chip rate in DS-CDMA, 1/pulse_duration
in ultra-wideband systems (e.g., 3 GHz), or 1/symbol_duration if
pulse duration = symbol duration.
- If c_k(t) is complex Gaussian, then the fading is Rayleigh. c_k(t) = \alpha_k(t) * exp(j*\theta_k(t)) = \alpha_k(t)*cos(\theta_k(t)) + j*\alpha_k(t)*sin(\theta_k(t)) = g_k(t) + j*q_k(t), where g_k(t) and q_k(t) are Gaussian due to the central limit theorem, and they are independent due to their uncorrelatedness (uncorrelatedness implies independence for jointly Gaussian r.v.'s). Since g_k(t) and q_k(t) are independent Gaussian, the amplitude \alpha_k(t) is Rayleigh distributed and the phase \theta_k(t) is uniformly distributed over [0, 2*\pi). (A numerical sketch appears after this fading-channel block.)
- For an ultra-wideband (UWB)
system, since the signal bandwidth and the sampling rate are high,
there are only a small number of paths that contribute to a
resolvable tap/bin. So central limit theorem does not
hold and we do not see uniformly distributed phase. It
is observed in actual measurements that the phase only takes
values of 0 or \pi.
- Why does the delay of a resolvable path have to be a multiple of 1/W? The proof is on pages 795-797 of Proakis's Digital Communications, 3rd ed. The reasons are 1) the transmitted signal is required to be band-limited (passband requirement); 2) the Nyquist criterion for ISI-free transmission.
- Since some resolvable paths may
have c_k(t)=0, the model can be reduced to h(t,\tau) = \sum_{k=1}^L
c_k(t) \delta(\tau - \tau_k), where \tau_k is the delay for the
k_th resolvable path with non-zero c_k(t).
- Marginal distributions: Rayleigh,
Rician (Nakagami-n), Nakagami-m, Nakagami-q (Hoyt).
- Second-order statistics: Doppler
spectrum (power spectral density)
- High-order statistics: bispectrum,
trispectrum.
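As a quick numerical check of the Rayleigh-fading statement above (a complex Gaussian tap gain implies a Rayleigh amplitude and a uniform phase), here is a minimal Python/NumPy sketch; the sample size and per-component variance are illustrative assumptions, not values from these notes.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000                 # number of tap realizations (illustrative)
    sigma = 1.0                 # per-component standard deviation (assumed)

    # Complex Gaussian gain c = g + j*q, with g, q i.i.d. zero-mean Gaussian
    g = rng.normal(0.0, sigma, N)
    q = rng.normal(0.0, sigma, N)
    c = g + 1j * q

    alpha = np.abs(c)                  # amplitude: Rayleigh distributed
    theta = np.angle(c) % (2 * np.pi)  # phase: uniform over [0, 2*pi)

    # Compare empirical moments with the Rayleigh/uniform predictions
    print("mean amplitude:", round(alpha.mean(), 3), "(theory:", round(sigma * np.sqrt(np.pi / 2), 3), ")")
    print("mean phase    :", round(theta.mean(), 3), "(theory:", round(np.pi, 3), ")")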
- Matched filter is an optimal receiver for
pulse shaper + additive white noise channel (non-Gaussian or Gaussian does not
matter), under the criterion of maximizing the peak SNR (at the sampling
instant).
- Mathematically, it is the optimal solution to the problem of maximizing the SNR over the functional space L^2 (the set of all possible impulse responses of an LTI system). The solution is obtained via the Schwarz inequality, whose equality condition gives a necessary and sufficient condition for optimality.
- The intuition of matched filter being
optimal, is that the transmitted signal has the highest correlation with
itself or its scaled version; in other words, the transmitted signal resonates
with itself.
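To make the "maximizing the peak SNR" statement concrete, here is a minimal sketch comparing the output SNR at the sampling instant for a matched filter versus an arbitrary mismatched filter; the pulse shape, noise level, and trial count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 64
    t = np.arange(T)
    pulse = np.sin(np.pi * t / T)          # example transmit pulse shape (assumed)
    pulse /= np.linalg.norm(pulse)         # unit energy

    matched = pulse[::-1]                  # matched filter: time-reversed pulse
    mismatched = np.ones(T) / np.sqrt(T)   # an arbitrary unit-energy filter for comparison

    def output_snr(filt, trials=2000, sigma=0.5):
        # Output SNR at the sampling instant (end of the pulse) for a given receive filter
        signal = np.dot(filt[::-1], pulse)   # noiseless filter output at t = T-1
        noise = np.array([np.dot(filt[::-1], sigma * rng.normal(size=T)) for _ in range(trials)])
        return signal ** 2 / noise.var()

    print("matched filter SNR   :", round(output_snr(matched), 2))     # ~ Ep/sigma^2 = 4.0
    print("mismatched filter SNR:", round(output_snr(mismatched), 2))  # strictly smaller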
- Minimum mean squared error (MMSE) receiver is
an optimal receiver for pulse shaper+ LTI channel + additive white noise
channel (non-Gaussian or Gaussian does not matter), under the criterion of
minimizing the mean squared error (at the sampling instant).
- An MMSE filter consists of a matched filter
and a tapped-delay-line equalizer.
- Mathematically, it is the optimal solution to
the problem of minimizing the mean squared error over a functional space L^2
(set of all possible impulse responses for an LTI system). The solution
is obtained by setting the derivative of mean squared error (MSE) to zero and
solving it for the transfer function of the MMSE filter. It is a
necessary condition. But if the MSE is convex, it is a sufficient and
necessary condition.
- The intuition of the MMSE filter being
optimal, is that it uses a matched filter to maximize SNR while using a
tapped-delay-line equalizer to invert the joint channel (LTI + AWGN).
- Zero-forcing equalizer will enhance noise
when it inverts the LTI channel. Hence, it has poor SNR performance.
- Nonlinear channel models
- Trellis coded modulation:
- Set partitioning: partition an M-ary
constellation of interest into 2, 4, 8, ... subsets with size M/2,
M/4, M/8, ..., which progressively increase minimum Euclidean distance
between their respective signal points.
- Ungerboeck codes (it is used for TCM; there
are many TCM designs):
- (Set partitioning) To send n bits/symbol,
use a two-dimensional constellation of 2^{n+1} signal points (typically,
8-PSK or 16-QAM). Partition the constellation into 4 or 8 subsets.
- Operations consist of two parts:
- (Subset selection) Convolutional
encoding: One or two incoming bits per symbol enter a
rate-1/2 or rate-2/3 binary convolutional encoder, respectively; the
resulting two or three coded bits determine the selection of a
particular subset.
- The bits for subset selection are
most significant since one bit in error will cause a big difference in
received signal; hence these bits need heavy protection and so we use
channel coding to protect them. (Redundancy is introduced to
protect these bits.)
- Signal mapping: The remaining uncoded
data bits determine which particular point from the selected subset is
to be signaled.
- Since the minimum Euclidean distance
within a subset is increased, the detection error probability is
reduced. This is achieved without introducing redundancy.
- Maximum likelihood sequence estimation (MLSE)
at the receiver, using Viterbi algorithm
- (Determine branch metrics:) For
each subset of the constellation, determine the signal point that is
closest to the received signal point in the Euclidean sense. The
Euclidean distance will be used as a branch metric in Viterbi algorithm.
We have 4 or 8 such signal points and associated metrics.
- (MLSE:) Use Viterbi algorithm to find the
information sequence with minimum path metric; each branch in the trellis
of the Ungerboeck code corresponds to a subset rather than an individual
signal point.
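The distance-growth property of set partitioning can be checked numerically. The sketch below is a hedged illustration rather than the Ungerboeck construction verbatim: it partitions a unit-energy 8-PSK constellation into the usual every-other-point subsets and prints the minimum intra-subset Euclidean distance at each level.

    import numpy as np

    # Unit-energy 8-PSK constellation
    pts = np.exp(1j * 2 * np.pi * np.arange(8) / 8)

    def min_dist(points):
        # Minimum pairwise Euclidean distance within a set of signal points
        return min(abs(a - b) for i, a in enumerate(points) for b in points[i + 1:])

    # Level 0: full constellation; level 1: a size-4 subset (every other point);
    # level 2: a size-2 subset (every fourth point)
    for name, subset in [("8-PSK", pts), ("size-4 subset", pts[::2]), ("size-2 subset", pts[::4])]:
        print(name, "min distance =", round(min_dist(subset), 3))
    # Expected: 0.765 -> 1.414 -> 2.0, i.e., the minimum distance grows at each partition level.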
- Source models: "discrete" here
means both discrete-time and discrete alphabet (sample space).
- Discrete memoryless source: generate (in
discrete time) a sequence of i.i.d. discrete-valued r.v.'s,
described by a probability mass function.
- Discrete Markov source: generates a (stationary) discrete-time finite-state Markov chain, described by an initial state distribution and a transition probability matrix.
- It models a convolutional code; characterizes
the stochastic nature of a convolutional encoder.
- x_{n+1} = A*x_n + B*v_n, where v_n is an i.i.d. r.v. (input sequence) and x_n is the state of the encoder; y_n = C*x_n + D*v_n, where y_n is the output of the encoder and C is the generating matrix. From the codeword {y_n}, we can uniquely determine the information sequence {v_n} and hence the state sequence {x_n}, so the state is perfectly observable, i.e., {y_n} is in one-to-one correspondence with {x_n}; then it can be proved that the encoder output y_n is a Markov chain.
Proof: y_n = C*x_n + D*v_n = C*x_n + D*g(x_n, x_{n+1}) (v_n is a deterministic function of x_n and x_{n+1}; denote this function by g)
= h(x_n, x_{n+1}) (so y_n is a deterministic function of [x_n, x_{n+1}]^T, denoted by h; since {[x_n, x_{n+1}]^T} is a Markov process, {y_n} is also a Markov process. This completes the proof.)
- Discrete hidden Markov source:
- It models a trellis coded modulator (TCM);
describes the stochastic structure of a trellis coded modulator.
- A trellis coded modulator consists of two
parts:
- Convolutional encoder: Its
output sequence is a Markov chain. The coded bits
determine the selection of a particular subset.
- Signal mapper: The remaining i.i.d.
uncoded data bits determine which particular point from the selected subset
is to be signaled. The output of a TCM at time n, is a random
function of the state of a Markov chain (output of convolutional encoder),
and the randomness is independent across time n (due to the i.i.d property
of the remaining uncoded bits); hence, the output sequence of a TCM is an
HMM.
- Counter-measures for ISI: channel
equalization. Note that after channel equalization, we need to use a
detector.
- If the channel impulse response is known,
use ZFE or MMSE linear equalizer (LE).
- Zero-forcing equalizer (ZFE)
- Force ISI to zero at the sampling
instants t = kT, except k=0.
- Suitable for the case when the
channel noise is zero or not significant.
- Why are IIR filters not used in a ZFE? Theoretically, the ZFE is simply the inverse of the channel (its transfer function is H(z) = 1/G(z), where G(z) is the transfer function of the channel), and H(z) is typically an IIR. However, if G(z) is not a minimum-phase transfer function, 1/G(z) is not stable. In contrast, an FIR filter, which is always stable, avoids this problem.
- The ZFE is therefore realized as an FIR filter of finite/infinite length using the power-series (long-division) expansion of 1/G(z); a numerical sketch appears after this equalization list.
- MMSE equalizer (MMSE-LE)
- In the design of MMSE-LE, the
data sequence is assumed to be i.i.d. with zero
mean. What if the data sequence is not i.i.d. (not
perfectly compressed)?
- Use a scrambler whose introduced delay is not greater than a symbol duration. The idea of scrambling is to modulate the data sequence with a pseudo-random sequence (with spreading gain = 1) so as to break/reduce the correlation between data symbols.
- Use interleaving in channel coding; interleaving (possibly pseudo-random) can break/reduce the correlation between data symbols.
- For an unknown channel, use adaptive filters, which can be linear or nonlinear equalizers (LE/NLE).
- training-based/supervised adaptive equalizer
- blind/unsupervised adaptive equalizer
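Here is the numerical sketch referenced in the zero-forcing item above: it computes the FIR taps of 1/G(z) by long division for an assumed two-tap minimum-phase channel G(z) = 1 + 0.5 z^{-1} (the channel and the truncation length are illustrative) and checks that channel plus equalizer is approximately an identity.

    import numpy as np

    g = np.array([1.0, 0.5])     # channel G(z) = 1 + 0.5 z^{-1} (minimum phase, assumed)
    L = 16                       # number of equalizer taps (truncation length, assumed)

    # Long division: impulse response of 1/G(z), computed recursively
    h_inv = np.zeros(L)
    for n in range(L):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, len(g)):
            if n - k >= 0:
                acc -= g[k] * h_inv[n - k]
        h_inv[n] = acc / g[0]

    # Check: the channel followed by the FIR ZFE should be close to a pure identity
    combined = np.convolve(g, h_inv)[:L]
    print("equalizer taps   :", np.round(h_inv[:6], 4))    # 1, -0.5, 0.25, -0.125, ...
    print("combined response:", np.round(combined[:6], 4)) # ~ [1, 0, 0, ...]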
- Classification of receivers based on channel
models
- Noisy channel (e.g., AWGN channel):
matched filter
- A RAKE receiver is a matched filter;
actually, it consists of multiple matched filters.
Specifically, it consists of tapped delay lines, correlators, a maximal
ratio combiner, and an integrator.
- A RAKE receiver is not an equalizer (at
least not in the sense of ISI cancellation) since it only collects energy
from different replicas of the same symbol but does not intend to remove
the inter-symbol interference.
- ISI channel: transversal filter
(tapped-delay-line filter) / zero-forcing equalizer
- Noisy channel with ISI (noisy and
dispersive channel):
- MMSE equalizer. It consists of
a matched filter (matching the channel response, collecting signal
energy, maximizing SNR) and a synchronous transversal equalizer (canceling
ISI with the consideration of noise). Merit:
it is suitable for the case when noise is significant; it is optimal
in the sense of MMSE; it always performs as well as, and often
better than, ZFE. Limitation: complexity is
higher than ZFE.
- Approximation (implementation) of MMSE
equalizer:
- Adaptive equalizer, which adjusts
its equalizer coefficients to deal with time-varying
channels. The tap coefficients of the transversal
equalizer can be adjusted by an LMS algorithm. Basically, there
are two classes of adaptive equalizers: 1) linear equalizer
(transversal filter); 2) Decision Feedback Equalizer (DFE),
which consists of a feedforward section, a feedback section, and
a decision device. Both the feedforward and feedback
sections are transversal filters and can be jointly adjusted by
an LMS algorithm. For frequency-selective fading
channels, DFE offers a significant improvement in performance
over a linear equalizer for an equal number of taps.
Limitation: DFE suffers from error propagation.
- Fractionally Spaced Equalizer (FSE),
where the spacing between adjacent taps is less than the symbol
period. The finer granularity of the FSE makes it more effective at compensating delay distortion than a synchronous equalizer.
- Zero-forcing equalizer (followed by a
decision-making device). Merit: ZFE is simple and is suitable
for the case when noise is negligible.
Drawback: ZFE ignores the effect of noise and has severe performance
degradation when noise is significant.
- Transversal filter, tapped-delay-line filter,
FIR (nonrecursive) filter, and Moving Average (MA) filter are the same. Transversal
filter and tapped-delay-line filter are typically used in communication
theory. FIR filter is typically used in digital signal
processing. MA filter is typically used in statistical signal
processing (time-series theory).
- IIR (recursive) filter and Auto-regressive Moving Average
(ARMA) filter are the same. Auto-regressive (AR) filter is a special
case of ARMA filter.
- Mathematically, correlation demodulator (correlator)
and matched filter are the same. The differences are
- They have different structures. A correlator consists of a multiplier (coherent demodulator) and an integrator, while a matched filter is just an FIR/IIR filter whose impulse response is determined by the transmitted signal.
- They have different requirements on
implementation. A correlator requires synchronization due to
coherent demodulation, while a matched filter does not.
- Counter-measures for fading
- The relationship among estimators (estimation
theory), observer (control theory), equalizers (communication theory)
- Estimators and observers are equivalent.
- Hard-decision vs. soft-decision
decoding
- Hard-decision decoding has separate symbol
detector and channel decoder. The symbol detector converts the
continuous-valued (demodulated) received signal to discrete-valued symbols;
and then the channel decoder decodes the discrete-valued symbols based on the
minimum Hamming distance criterion (find a codeword that minimizes the Hamming
distance between the symbol sequence and the codeword). The
information sequence corresponding to the estimated codeword (one-to-one
correspondence) is the output of the decoder.
- Soft-decision decoding has a joint symbol
detector and channel decoder (joint detection/decoding). The channel
decoder decodes the continuous-valued received signal based on the minimum
Euclidean distance criterion (find a codeword that minimizes the Euclidean
distance between the received signal and the modulated signal corresponding to
the codeword). The information sequence corresponding to the estimated
codeword is the output of the decoder.
- Hard-output vs. soft output decoding
- Hard output decoding outputs discrete-valued
information sequence, given continuous-valued received signal.
- Soft output decoding outputs
continuous-valued log-likelihood ratio (or log-APP ratio), actually a test
statistic, given
continuous-valued received signal. APP: a posteriori probability.
The test statistic can be used to make a hard decision.
- Optimal detector, adaptive filter, blind
estimator
- Optimal detector: If the
auto-correlations of the channel input process x(n) and the channel
output process y(n), and the cross-correlation of x(n) and y(n) are
known, the optimum detector under MMSE criterion is Wiener filter (a
linear filter).
- Adaptive filter: If the
auto-correlations and cross-correlations are unknown, we need to
estimate the correlations. Such estimation algorithms
include LMS and RLS, which are sub-optimal implementation/approximation
of the MMSE detector. Adaptive filters need training (offline
estimation).
- Blind estimator: If training is not used, adaptive filters are called blind estimators (online estimation).
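As a hedged sketch of the adaptive-filter item above (LMS as a training-based approximation to the Wiener/MMSE solution), the following Python/NumPy code trains an LMS equalizer on a known symbol sequence over an assumed short ISI channel; the channel taps, step size, filter length, and decision delay are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    N, taps, mu = 5000, 11, 0.01           # training length, equalizer taps, LMS step size (assumed)
    channel = np.array([0.4, 1.0, 0.4])    # illustrative ISI channel

    x = rng.choice([-1.0, 1.0], N)                                   # known training symbols
    y = np.convolve(x, channel)[:N] + 0.05 * rng.normal(size=N)      # received signal

    w = np.zeros(taps)
    delay = (taps + len(channel)) // 2     # decision delay (assumed)
    errs = []
    for n in range(taps, N):
        u = y[n - taps:n][::-1]            # regressor: most recent samples first
        e = x[n - delay] - w @ u           # error against the (delayed) training symbol
        w += mu * e * u                    # LMS update toward the Wiener/MMSE solution
        errs.append(e * e)

    print("MSE over first 200 updates:", round(float(np.mean(errs[:200])), 3))
    print("MSE over last 200 updates :", round(float(np.mean(errs[-200:])), 3))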
- CDMA receiver
- Single user detection: treat Multiple
Access Interference (MAI) as background noise.
- Multiuser detection: detect the signals
from interfering users; then remove the interference.
- Synchronous CDMA: the signal from
each interfering user has the same delay as the signal of interest.
- Asynchronous CDMA: the signals from interfering users have different delays from the signal of interest.
- List all kinds of linear receivers.
- Matched filter
- MMSE (for a Gaussian source and AWGN channel)
- For CDMA systems, why is the MMSE detector
the best among all linear receivers, in the sense of maximum
Signal-to-Interference Ratio (SIR)?
- List typical nonlinear receivers
- Quadratic detector (energy detector)
- Chi-square detector
- Detector with Student's t statistic
- Detector with F statistic
- High Order Statistics (HOS):
- Cumulant:
- generalization of auto-correlation of
a stochastic process
- cumulants could be time variant.
- Polyspectrum:
- generalization of power spectrum of a
stochastic process
- polyspectrum could be time variant.
- HOS provides a basis for the
identification of a non-minimum-phase system (such as multi-path fading
channel) since HOS preserves phase information in the observed signal.
- Nonlinear channels
- Microwave systems with nonlinear amplifier
- AM/AM conversion
- AM/PM conversion
- High density digital magnetic recording
channel
- saturation
- partial erasure
- Fiber optic data transmission
- vectorial nonlinear Schroedinger equation
- self phase modulation (chirp)
- chromatic dispersion
- polarization mode dispersion (PMD)
- photo diode direct detection
- Satellite channels are usually (but not
always) nonlinear.
- Continuous phase modulation
- Properties
- Phase continuity
- Orthogonality (orthogonal decomposition of
the signal)
- QPSK: acronym for quaternary phase-shift
keying or quadrature phase-shift keying
- quaternary phase-shift keying: QPSK has 4
signal points in the constellation.
- quadrature phase-shift keying: a QPSK signal
consists of both inphase and quadrature signal components at the same time.
- Power amplifier vs. modulation scheme
- Class A
- used for linear modulation (AM/ASK, PM/PSK)
- maximum power efficiency: 50%
- Class AB
- Class B
- maximum power efficiency: 78.5%
- Class C
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- Class D
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- Class E
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- Class S
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- M-ary PSK, M-ary QAM, and M-ary FSK are
commonly used in coherent systems.
- M-ary DPSK and M-ary FSK are commonly used in
noncoherent systems. M-ary ASK can also be used in noncoherent
systems but is not common.
Key Techniques:
- Linear independence, orthogonality,
statistical independence
- Any signal can be decomposed into a
linear combination of linearly independent waveforms.
- Any signal in a linear vector (functional) space can be represented by an expansion of orthonormal basis functions of the vector space. Why is an orthonormal basis a better representation of signals than linearly independent waveforms/signals? No scaling is needed: the expansion coefficients are obtained directly as inner products (projections) onto the basis functions.
Sampling and Interpolation
- Why can non-uniform/nonlinear sampling at the Nyquist rate not achieve distortionless reconstruction of the original signal?
At channel encoding/decoding layer, typically we
use E_b/N_0, where E_b is energy per information bit and N_0 is single-sided
noise power spectrum density. The performance of a CODEC is
characterized by the relation between BER and E_b/N_0.
At modulation/demodulation layer, typically we
use E_s/N_0, where E_s is energy per symbol and N_0 is single-sided noise power
spectrum density. The performance of a MODEM is characterized by
the relation between BER and E_s/N_0. E_s/N_0=E_b/N_0 * bits/symbol * k/n,
where k/n is the rate of the channel code.
The received SNR = E_s/N_0 * symbol_rate/signal_bandwidth
= E_b/N_0 * bit_rate/signal_bandwidth. If the transmitter uses a
square-root raised cosine pulse shape and the receiver uses a matched filter,
symbol_rate = signal_bandwidth.
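A small worked example of the relations above, assuming QPSK (2 coded bits per symbol) with a rate-1/2 code; the operating point is illustrative.

    import numpy as np

    EbN0_dB = 4.0              # illustrative operating point
    bits_per_symbol = 2        # QPSK carries 2 coded bits per symbol (assumed)
    code_rate = 0.5            # rate-1/2 channel code (assumed)

    EbN0 = 10 ** (EbN0_dB / 10)
    EsN0 = EbN0 * bits_per_symbol * code_rate    # E_s/N_0 = E_b/N_0 * bits/symbol * k/n
    print("Es/N0 =", round(10 * np.log10(EsN0), 2), "dB")

    # With square-root raised cosine shaping and a matched filter, symbol_rate = signal_bandwidth,
    # so the received SNR equals Es/N0.
    print("received SNR =", round(10 * np.log10(EsN0), 2), "dB")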
How to convert a digital signal into a band-limited signal, so that it can be recovered without distortion after being transmitted over a band-limited channel (rectangular spectrum)?
A method is to use a square-root raised cosine
pulse shape at the transmitter and at the receiver (matched filter). The
benefits of this method are 1) satisfying Nyquist's criterion for zero ISI, and
2) maximizing SNR.
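The zero-ISI property of this method can be checked by sampling the overall raised-cosine response (square-root raised cosine transmit pulse followed by its matched filter) at the symbol instants. The sketch below assumes an illustrative rolloff factor and symbol period.

    import numpy as np

    beta, T = 0.35, 1.0        # rolloff factor and symbol period (illustrative)

    def raised_cosine(t, beta=beta, T=T):
        # Time-domain raised-cosine pulse: the combined SRRC transmit + matched receive response
        t = np.asarray(t, dtype=float)
        out = np.sinc(t / T) * np.cos(np.pi * beta * t / T) / (1 - (2 * beta * t / T) ** 2)
        # Handle the removable singularity at t = +-T/(2*beta)
        sing = np.isclose(np.abs(t), T / (2 * beta))
        out[sing] = (np.pi / 4) * np.sinc(1 / (2 * beta))
        return out

    k = np.arange(-5, 6)                       # symbol-spaced sampling instants t = kT
    print(np.round(raised_cosine(k * T), 6))   # ~ [0, ..., 0, 1, 0, ..., 0]: zero ISI except at k = 0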
Viterbi Algorithm (VA) vs. Kalman filter (KF)
Both of them use causality to reduce computational complexity. The existence of causality leads to the design of a recursive algorithm so that the quantity (cost-to-arrive for VA or state estimate for KF) computed at the last stage can be used to obtain the quantity at the current stage; there is no need to re-compute the total cost when new data come in.
Both of them can be used for state estimation of
hidden Markov model (HMM).
Kalman filter has the following assumptions (linearity, Gaussianity, MMSE):
1) system state equation is known and linear; the observation equation is known and linear. System states must be continuous since the disturbance is Gaussian, so the state cannot be estimated by dynamic programming (DP); DP only applies to estimating the state of a finite-state Markov chain (FSMC).
2) disturbances in the state equation are i.i.d. Gaussian; measurement errors in
the observation equation are i.i.d. Gaussian.
3) the distribution of initial state is known and Gaussian.
4) the (optimal) criterion for estimation is MMSE.
VA has the following assumptions:
1) system states must be discrete (not continuous); the process of the system state forms a finite-state Markov chain (FSMC).
2) the system state is partially observable (the observation is a random function of the state; the ranges of the observations corresponding to different states overlap; otherwise, the state could be uniquely determined from the observation).
3) We know the distribution of measurement errors, i.e., the probability of the
observation, given the state transition. Note the observation is related to
the state transition (a function of two variables); the observation related to
the single state is a special case.
4) The criterion of state sequence estimation could be ML or MAP.
For MAP, we assume that the distribution of initial state is known but does not
have to be Gaussian. It can be any probability distribution.
We also need to know the transition probability of the Markov chain, i.e., the
likelihood function or the disturbance's distribution, which doesn't need to be
Gaussian.
In digital communication, we typically assume transmitted symbols are equally
probable and statistically independent. Then MAP is reduced to ML. In addition,
typically we assume the channel noise is AWGN. So ML is reduced to
minimum Euclidean distance between the transmitted symbol and the received
signal. It's also called nearest neighbor criterion.
What about channels having memory (ISI
channels)? A channel with memory is characterized by FSMC, where the
number of states is determined by the channel memory length. So the VA can be used for ISI channels, not just memoryless AWGN channels. The assumption in using the VA is that the transmitted symbols are i.i.d. and the channel noise is white Gaussian (see the sketch below).
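The sketch below is a minimal Viterbi MLSE for BPSK over a known two-tap ISI channel with white Gaussian noise, matching the assumptions stated above; the channel taps, block length, and noise level are illustrative. The state is the previous symbol, and the branch metric is the squared Euclidean distance.

    import numpy as np

    rng = np.random.default_rng(3)
    h = np.array([1.0, 0.6])                  # known 2-tap ISI channel (assumed)
    N = 200
    x = rng.choice([-1.0, 1.0], N)            # i.i.d. BPSK symbols
    y = np.convolve(x, h)[:N] + 0.3 * rng.normal(size=N)

    symbols = [-1.0, 1.0]                     # state = previous symbol (2 states)
    cost = {s: 0.0 for s in symbols}          # cost-to-arrive for each state
    back = []                                 # backpointers, one dict per stage

    for n in range(N):
        new_cost, ptr = {}, {}
        for cur in symbols:
            best, best_prev = float("inf"), None
            for prev in symbols:
                expected = h[0] * cur + (h[1] * prev if n > 0 else 0.0)
                c = cost[prev] + (y[n] - expected) ** 2   # branch metric: squared Euclidean distance
                if c < best:
                    best, best_prev = c, prev
            new_cost[cur], ptr[cur] = best, best_prev
        cost, back = new_cost, back + [ptr]

    # Traceback from the best final state
    s = min(cost, key=cost.get)
    est = [s]
    for ptr in reversed(back[1:]):
        s = ptr[s]
        est.append(s)
    est = np.array(est[::-1])

    print("symbol error rate:", np.mean(est != x))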
What about transmitted symbols having memory (convolutional
codes)? Use VA to decode the codeword.
What if the noise is colored? 1) Whiten the noise using R^{-1/2}, where R is the noise covariance matrix. 2) Time-domain approach: model the noise U_n as an AR(p) process, so that U_n - \sum_{i=1}^p a_i*U_{n-i} is i.i.d. Gaussian; then we can use the VA w.r.t. the whitened noise U_n - \sum_{i=1}^p a_i*U_{n-i} (see the sketch below).
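A minimal sketch of the time-domain whitening step above, assuming AR(1) colored noise with a known coefficient a (the coefficient and sample size are illustrative): the residual U_n - a*U_{n-1} should be approximately white.

    import numpy as np

    rng = np.random.default_rng(4)
    a, N = 0.8, 50_000                  # AR(1) coefficient and sample size (illustrative)

    # Colored noise: U_n = a * U_{n-1} + W_n, with W_n i.i.d. Gaussian
    w = rng.normal(size=N)
    u = np.zeros(N)
    for n in range(1, N):
        u[n] = a * u[n - 1] + w[n]

    # Whitened residual: e_n = U_n - a * U_{n-1}
    e = u[1:] - a * u[:-1]

    def lag1_corr(v):
        v = v - v.mean()
        return np.dot(v[1:], v[:-1]) / np.dot(v, v)

    print("lag-1 correlation, colored :", round(lag1_corr(u), 3))   # ~ a
    print("lag-1 correlation, whitened:", round(lag1_corr(e), 3))   # ~ 0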
Sequential Bayesian estimation methods:
- Sequential Bayesian estimate of a random
variable
- Kalman filter: sequential Bayesian estimate
of the state of linear system (estimate an AR process)
- Particle filter: Sequential Importance
Sampling, estimate the state of a nonlinear system or linear system with
non-Gaussian disturbances.
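To illustrate the Kalman-filter item above (sequential Bayesian estimation of the state of a linear system, i.e., tracking an AR process), here is a minimal scalar Kalman filter; the AR coefficient, noise variances, and initial conditions are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    A, Q, R, N = 0.95, 0.1, 1.0, 500    # state coefficient, process/measurement variances (assumed)

    # Linear Gaussian state-space model: x_{n+1} = A*x_n + v_n,  y_n = x_n + u_n
    x = np.zeros(N)
    y = np.zeros(N)
    for n in range(1, N):
        x[n] = A * x[n - 1] + np.sqrt(Q) * rng.normal()
        y[n] = x[n] + np.sqrt(R) * rng.normal()

    xhat, P = 0.0, 1.0                  # initial state estimate and variance (assumed)
    est = np.zeros(N)
    for n in range(N):
        # Predict
        xpred, Ppred = A * xhat, A * A * P + Q
        # Update with observation y_n (MMSE update under the Gaussian assumptions)
        K = Ppred / (Ppred + R)         # Kalman gain
        xhat = xpred + K * (y[n] - xpred)
        P = (1 - K) * Ppred
        est[n] = xhat

    print("raw observation MSE :", round(float(np.mean((y - x) ** 2)), 3))
    print("Kalman estimate MSE :", round(float(np.mean((est - x) ** 2)), 3))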
BCJR vs. Viterbi Algorithm
The Bahl, Cocke, Jelinek, and Raviv (BCJR)
algorithm is used for a posteriori probability (APP) decoding of convolutional
codes in AWGN. Although the BCJR and Viterbi algorithms share many
similarities, they differ in a fundamental way because they compute very
different quantities. The Viterbi algorithm computes hard decisions by
performing maximum likelihood decoding. The BCJR algorithm computes soft
information about the message in the form of a posteriori probabilities for each
of the message bits. Of course, this soft information could be converted to hard
decisions by comparing the a posteriori probabilities to a threshold of one
half; the resulting decisions would be individually optimal in the sense that
they have minimal probability of being in error. Thus, whereas the Viterbi
algorithm produces the most likely message sequence, which minimizes the
word-error rate, quantizing the BCJR probabilities produces the sequence of most
likely message bits, which minimizes bit-error rate. The distinction is subtle
and is not by itself enough to warrant much interest in BCJR. Indeed, the
hard-decision performance of BCJR is only marginally better than Viterbi. The
real value of BCJR is that the a posteriori probabilities are valuable in
their own right; they may be used to estimate the reliability of a decision, and
are especially useful when interacting with other modules in a receiver. Another
important feature of BCJR is that it naturally and easily
exploits any a priori information that the receiver may have about the message
bits.
BCJR (or MAP algorithm) is an optimal decoding
algorithm for Turbo codes. Sub-optimal algorithms for decoding Turbo codes
include log-MAP and SOVA. Converting (multiplication, summation) to (summation, max) in the log domain (the max-log approximation) can further reduce complexity.
Equivalence between the BCJR algorithm and
Generalized VA (GVA)
BCJR algorithm is not a special EM
algorithm. BCJR computes the exact APP, so that we can obtain MAP
estimate. EM computes ML estimate.
Turbo principle: refine the processing
recursively
- EM algorithm: Given the observation X, compute the ML estimate of \theta.
- Complete data description
- p(X,Y|\theta), where X is an observation
vector and Y is a latent vector (not observable)
- Incomplete data description: p(X|\theta), obtained by marginalizing p(X,Y|\theta) over the latent Y.
Algorithm:
- (E step) given a value \theta_j, compute
Q(\theta, \theta_j)=E[log f(X,Y|\theta)], where the expectation is over
f(Y|X,\theta_j).
- (M step) compute \theta_{j+1} = arg max_{\theta} Q(\theta, \theta_j); repeat steps 1 and 2 till the algorithm converges.
Used in iterative decoding of turbo codes,
pattern recognition.
Note that it may not be easy to compute the expectation or the maximization. You may have to use Monte Carlo methods to compute the integral in the expectation, and numerical optimization methods for the maximization. In addition, you may not have the distribution f(Y|X,\theta_j), which is typical in pattern recognition. Hence, in practice, there could be many different algorithms (in different forms) that implement the EM.
Use Baum-Welch, also called
forward-backward, algorithm (a special EM) to find the ML estimate of the
parameters of an HMM. Described later.
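As a concrete, hedged illustration of the E and M steps (in a pattern-recognition setting mentioned above, not a decoding problem): EM for a two-component one-dimensional Gaussian mixture, where the latent Y is the component label. The data, initial guesses, and the unit-variance assumption are all illustrative.

    import numpy as np

    rng = np.random.default_rng(6)
    # Synthetic data from a two-component Gaussian mixture (true means -2 and +2)
    data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

    # Initial parameter guesses theta_0 = (mixing weight, means); unit variances assumed
    pi1, mu1, mu2 = 0.5, -1.0, 1.0

    for _ in range(50):
        # E step: responsibilities f(Y=1 | X, theta_j) for each sample
        p1 = pi1 * np.exp(-0.5 * (data - mu1) ** 2)
        p2 = (1 - pi1) * np.exp(-0.5 * (data - mu2) ** 2)
        r1 = p1 / (p1 + p2)
        # M step: maximize Q(theta, theta_j); closed-form updates for this model
        pi1 = r1.mean()
        mu1 = np.sum(r1 * data) / np.sum(r1)
        mu2 = np.sum((1 - r1) * data) / np.sum(1 - r1)

    print("weight:", round(pi1, 3), "means:", round(mu1, 3), round(mu2, 3))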
- Sum-product algorithm (message-passing
algorithm): operates in a
factor graph (a bipartite graph),
consisting of upper-layer nodes and lower-layer nodes, and links connecting
the nodes at two layers.
- for each upper-layer node, compute the
sum-product (product of all the incident links, resulting in a global
function, then sum over redundant variables, resulting in a marginal
function), assuming that the incoming information is all correct.
- for each lower-layer node, compute the
sum-product (product of all the incident links, then sum over redundant
variables), assuming that the incoming information is all correct.
Repeat steps 1 and 2 till the algorithm converges.
Computation is done in the logarithmic domain. If x1 and x2 are very different, then log(x1+x2) is approximately max{log(x1), log(x2)} (see the snippet below). The sum-product algorithm is used in LDPC decoding.
Belief propagation is a special
sum-product algorithm.
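A tiny check of the log-domain approximation mentioned above (the values are illustrative):

    import numpy as np

    x1, x2 = 1e3, 1.0                      # two terms of very different magnitude
    exact = np.log(x1 + x2)
    approx = max(np.log(x1), np.log(x2))   # max-log approximation
    print(round(exact, 4), round(approx, 4))   # nearly identical when x1 >> x2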
- Lloyd algorithm for obtaining the optimal codebook for vector quantization (same as the k-means algorithm, where k denotes the number of centroids):
- given a codebook (a set of centroids),
optimize the decoder using the nearest neighbor rule, i.e., decode the
given points and hence obtain an optimal partition of the given points
(optimal w.r.t. the codebook);
- for the partition of the given points,
optimize the encoder using the centroid rule, and hence obtain a new
codebook which is optimal for that partition; repeat steps 1 and 2 till the
algorithm converges.
Used in cluster analysis (for pattern
recognition) and vector quantization (for signal compression)
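A minimal sketch of the two-step Lloyd/k-means iteration above, alternating the nearest-neighbor rule and the centroid rule on scalar data; the training data, codebook size, and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(7)
    # Scalar training data drawn from two clusters (illustrative)
    data = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
    codebook = np.array([-1.0, 1.0])          # initial codebook with 2 centroids (assumed)

    for _ in range(20):
        # Step 1 (nearest-neighbor rule): assign each point to the closest centroid
        assign = np.argmin(np.abs(data[:, None] - codebook[None, :]), axis=1)
        # Step 2 (centroid rule): recompute each centroid as the mean of its partition
        codebook = np.array([data[assign == k].mean() for k in range(len(codebook))])

    distortion = np.mean((data - codebook[assign]) ** 2)
    print("codebook:", np.round(codebook, 3), "mean squared distortion:", round(float(distortion), 4))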
Markov process vs. hidden Markov model (HMM):
conditional independence property
- {X_n} is a Markov process if P{X_n|X_0, X_1, ...,
X_{n-1}}=P{X_n|X_{n-1}}, where P is probability if {X_n} are discrete r.v.'s;
otherwise, P is a probability density function.
- {Y_n} is a hidden Markov model if
(definition is sufficient and necessary condition)
- P{Y_n|X_0, X_1, ..., X_n, Y_0, Y_1, ..., Y_{n-1}}=P{Y_n|X_n} and {X_n} is a Markov process;
- or more
compactly, P{Y_n, X_{n+1}|X_0, X_1, ..., X_n, Y_0, Y_1, ...,
Y_{n-1}}=P{Y_n,X_{n+1}|X_n}. Note that marginalizing it over
Y_n and X_{n+1} respectively, results in the first definition.
For finite horizon T, {Y_n} is a hidden Markov model if
(two conditional independence assumptions)
- P{Y_n|X_0, X_1, ..., X_n, X_{n+1},..., X_T, Y_0, Y_1, ..., Y_{n-1},
Y_{n+1}, ..., Y_T}=P{Y_n|X_n}
- P{X_n|X_0, X_1, ..., X_{n-1}}=P{X_n|X_{n-1}}
An example of HMM (using an observation equation): {Y_n} is a hidden Markov model if Y_n=X_n+U_n, where {U_n} is
a sequence of independent r.v. (not necessarily identically distributed) and {X_n} is
a Markov process. Note that the noise {U_n} has to be white; otherwise,
conditional independence (HMM property) cannot be satisfied.
A hidden Markov chain can be characterized by a triplet \lambda = {A, B, \Pi}, where A = {a_{ij}} is the state transition probability matrix, with a_{ij} the probability that the state transitions from i to j; B = {b_j(y_n)} is the observation probability matrix, where b_j(y_n) = P{Y_n=y_n | X_n=j}; and \Pi is the initial state probability distribution.
- The nice thing about HMM is that efficient
algorithms can be derived for the following three problems, due to utilizing
the two conditional independence assumptions:
- (Computation of likelihood) Find P{Y_1^T|\lambda} for some Y_1^T = {Y_1, Y_2, ..., Y_T}. Use the forward (or backward) procedure since it is much more efficient than direct evaluation; the complexity is reduced from O(K^T) to O(T*K^2), where K is the number of states. This is because for direct evaluation, we need to marginalize the joint probability P{Y_1^T, X_1^T|\lambda} over the state sequence X_1^T, which has K^T different sequences. In contrast, the forward procedure utilizes the Markov property to reduce computation; the Markov property leads to a recursion, which turns a joint computation problem into separate sub-problems; the joint problem needs computing over an exponential number of possible state sequences, while each separate sub-problem only needs computing over a linear number of states. Similarly, the Viterbi algorithm reduces computational complexity due to recursive computation, or the principle of optimality, or additive cost (cost can be added up at each stage t). The additive cost property is like the Markov property, i.e., given the current cost, the future cost and past cost are independent.
- (State estimation) Given some Y_1^T and
some \lambda, find the MAP estimate of state sequence X_1^T (since we know
the initial state distribution, ML is replaced by MAP). Use
Viterbi algorithm to compute the MAP estimate; compared to direct computation, the complexity is reduced from O(K^T) to O(T*K^2), where K is the number of states; this is because there are K^T possible state sequences to be computed for direct computation, while there are T stages in the Viterbi algorithm and each stage involves K^2 possible branches.
- (Parameter estimation) Find the ML estimate
of \lambda, given some Y_1^T. Use Baum-Welch, also called
forward-backward, algorithm (a special EM).
- Use the Baum-Welch, also called forward-backward, algorithm (a special EM) to find the ML estimate of the parameters of an HMM. Both the forward and backward procedures compute the likelihood of \lambda.
- Forward procedure: Define \alpha_i(t)=P{Y_1^t=y_1^t,
X_t=i| \lambda}, which is the likelihood of \lambda, given the
observation-sequence-to-arrive Y_1^t and current state X_t=i.
- Initial condition: \alpha_i(1)=\pi_i *b_i(y_1).
- Recursive computation: \alpha_j(t+1)=[\sum_i \alpha_i(t) *a_{ij}]*
b_j(y_{t+1})
- Derivation:
\alpha_j(t+1) = P{Y_1^{t+1}=y_1^{t+1}, X_{t+1}=j | \lambda}
= \sum_i P{Y_1^{t+1}=y_1^{t+1}, X_t=i, X_{t+1}=j | \lambda} (marginalization of the joint probability)
= \sum_i P{Y_1^t=y_1^t, X_t=i | \lambda} * P{X_{t+1}=j, Y_{t+1}=y_{t+1} | X_t=i} (chain rule)
= \sum_i P{Y_1^t=y_1^t, X_t=i | \lambda} * P{X_{t+1}=j | X_t=i} * P{Y_{t+1}=y_{t+1} | X_{t+1}=j} (chain rule) done!
- Likelihood computation: P{Y_1^T|\lambda}=\sum_i
\alpha_i(T)
- Backward procedure: Define \beta_i(t)=P{Y_{t+1}^T=y_{t+1}^T|
X_t=i, \lambda}, which is the likelihood of \lambda, given the
observation-sequence-to-go Y_{t+1}^T and current state X_t=i.
- Initial condition: \beta_i(T)=1.
- Recursive computation: \beta_i(t)=\sum_j a_{ij}*
b_j(y_{t+1})*\beta_j(t+1)
- Derivation:
\beta_i(t) = P{Y_{t+1}^T=y_{t+1}^T | X_t=i, \lambda}
= \sum_j P{Y_{t+1}^T=y_{t+1}^T, X_{t+1}=j | X_t=i, \lambda} (marginalization of the joint probability)
= \sum_j P{X_{t+1}=j | X_t=i} * P{Y_{t+1}^T=y_{t+1}^T | X_{t+1}=j, \lambda} (chain rule)
= \sum_j P{X_{t+1}=j | X_t=i} * P{Y_{t+1}=y_{t+1} | X_{t+1}=j, \lambda} * P{Y_{t+2}^T=y_{t+2}^T | X_{t+1}=j, \lambda} (chain rule) done!
- Likelihood computation:
P{Y_1^T|\lambda}=\sum_i \pi_i* b_i(y_1)*\beta_i(1)
- Define \gamma_i(t) = P{X_t=i | Y_1^T, \lambda}, which is the posterior probability of X_t, given the observations Y_1^T.
- \gamma_i(t) = \alpha_i(t)*\beta_i(t) / \sum_j [\alpha_j(t)*\beta_j(t)] (using Bayes' rule to compute the posterior probability from the likelihoods)
- Define \xi_{ij}(t) = P{X_t=i, X_{t+1}=j | Y_1^T, \lambda}, which is the joint posterior probability of the consecutive states (not a transition probability), given the observations.
- \xi_{ij}(t) = \gamma_i(t)*a_{ij}*b_j(y_{t+1})*\beta_j(t+1) / \beta_i(t)
- Parameter estimation using relative frequencies (law of large numbers)
- Estimate of \pi_i: \tilde{\pi}_i = \gamma_i(1)
- Estimate of a_{ij}: \tilde{a}_{ij} = \sum_{t=1}^{T-1}\xi_{ij}(t)/
\sum_{t=1}^{T-1} \gamma_i(t)
- Estimate of b_i(k): \tilde{b}_i(k) = \sum_{t=1}^T \delta_{y_t,v_k}
\gamma_i(t) / \sum_{t=1}^T \gamma_i(t), where \delta_{y_t,v_k}=
1 if y_t=v_k; otherwise, \delta_{y_t,v_k}= 0. {v_k} is the
set of values that an observation can take.
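The forward and backward recursions above can be checked numerically for a small discrete HMM; both must give the same likelihood P{Y_1^T|\lambda}. The chain, observation model, and observation sequence below are illustrative assumptions.

    import numpy as np

    # Illustrative 2-state HMM with 2 observation symbols
    A  = np.array([[0.9, 0.1],
                   [0.2, 0.8]])            # a_{ij} = P{X_{t+1}=j | X_t=i}
    B  = np.array([[0.7, 0.3],
                   [0.1, 0.9]])            # b_j(k) = P{Y_t=k | X_t=j}, rows indexed by state
    pi = np.array([0.6, 0.4])              # initial state distribution
    y  = [0, 1, 1, 0, 1]                   # observation sequence (illustrative)
    T, K = len(y), len(pi)

    # Forward: alpha_j(t) = P{Y_1^t, X_t=j | lambda}
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]

    # Backward: beta_i(t) = P{Y_{t+1}^T | X_t=i, lambda}
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

    print("likelihood (forward) :", alpha[-1].sum())
    print("likelihood (backward):", (pi * B[:, y[0]] * beta[0]).sum())

    # Posterior state probabilities gamma_i(t) = P{X_t=i | Y_1^T, lambda}
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    print("gamma:")
    print(np.round(gamma, 3))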
- A sufficient condition for an HMM {Y_n} being a Markov process is that Y_n = f(X_n), where f is a deterministic one-to-one correspondence and {X_n} is a Markov process.
Proof:
P{Y_n=y_n|Y_0=y_0, Y_1=y_1, ..., Y_{n-1}=y_{n-1}}
=P{X_n=x_n|X_0=x_0, X_1=x_1, ..., X_{n-1}=x_{n-1}}
=P{X_n=x_n|X_{n-1}=x_{n-1}}
=P{Y_n=y_n|Y_{n-1}=y_{n-1}}
where x_i=f^{-1}(y_i) and f^{-1} is an inverse function of f.
The proof holds for both discrete and continuous r.v.'s. If {X_n} and {Y_n}
are discrete r.v.'s, then P is probability; otherwise, P is a probability
density function.
- Consider a case where Y_n=X_n+U_n, {X_n} is a Markov
chain, {U_n} are independent, and X_n can be perfectly determined from the
continuous observation Y_n; that is, P{X_n|Y_n} is either 1 or 0. The ranges
of Y_n, w.r.t. different state of X_n, are disjoint. For example, U_n
\in (-1,+1) and X_n= -1 or +1. Then if X_n=-1, Y_n \in (-2,0); if X_n=1,
Y_n \in (0,2). Is {Y_n} a Markov chain? Yes.
Why?
Proof: Use density function p here.
p{Y_n=y_n|Y_0=y_0, Y_1=y_1, ..., Y_{n-1}=y_{n-1}}
=p{U_n=y_n-X_n|U_0=y_0-X_0, U_1=y_1-X_1, ..., U_{n-1}=y_{n-1}-X_{n-1}} ( note
that given y_i, X_i is deterministic.)
=p{U_n=y_n-X_n|U_{n-1}=y_{n-1}-X_{n-1}} (this is because {U_n} are
independent; keep U_{n-1} to prove Markov property)
=p{Y_n=y_n|Y_{n-1}=y_{n-1}}
- The necessary (I believe) and sufficient
condition for an HMM {Y_n} being a Markov
process is that {X_n} is a Markov process and P{X_n|Y_n} is either 1 or 0;
that is, X_n is perfectly observable. That is why HMM is also called
partially observable Markov process.
Linear systems with
perfectly observable state
Given a state equation and
observation equation of a system,
X_{n+1} = A* X_n + V_n
Y_n = f(X_n)
where A is an invertible square matrix, f is an invertible function, {V_n} are i.i.d., and V_n is independent of X_n, then both the states {X_n} and the observations {Y_n} are Markov chains;
{Y_n} preserves the transition matrix of {X_n}, although the
state space is changed.
Linear systems with
imperfectly observable state
If Y_n = f(X_n) + U_n, where {U_n} are i.i.d. and U_n is independent of X_n, then the observations {Y_n} form a hidden Markov model (HMM). Note that Y_n is a random function of X_n. Note {U_n} must be i.i.d.; otherwise, the conditional independence required by an HMM cannot be satisfied: Pr{Y_n | X_0, X_1, ..., X_n} = Pr{Y_n | X_n}.
Using Markov property (or Hidden Markov
property) to reduce computation complexity
BCJR (more general, EM algorithm), Viterbi
algorithm (more general, Dynamic programming), Kalman Filter, Particle filter,
Markov Chain Monte Carlo (MCMC) algorithm, sum-product algorithm, all can
utilize Markov property to reduce computation complexity.
If the EM algorithm is applied to an HMM, it can also utilize the Markov property to reduce computational complexity.
But the EM algorithm does not necessarily have to be applied to HMM and it can
also be used in other situations. The key properties about EM are 1)
iterative computation, 2) convergence, 3) rate of convergence, and 4)
local/global optimality; not Markov property.
If dynamic programming is applied to an HMM, it can also utilize the Markov property to reduce computational complexity.
But Dynamic programming does not necessarily have to be applied to HMM and it
can also be used in other situations. The key properties about DP
are 1) additive cost and 2) the principle of optimality; not Markov property.
Kalman filter reduces the computation complexity
from O((N*n)^3) to O(N*n^3), where N is the number of samples and n is the
dimension of the observation vector.
Quote: "Viterbi algorithm gives a global optimum
while EM can only give a local optimum (actually a fixed point)." This is
not a fair comparison. Viterbi algorithm is only used for estimating
discrete r.v.'s rather than continuous r.v. So the cost function
(optimal criterion) cannot have a local optimum (the argument of the function is
discrete). Only continuous
cost functions can have local optima. EM can be used for
estimating both discrete and continuous parameters. However,
for estimating discrete parameters \theta, it cannot guarantee the convergence
of {\theta_j} to a
stationary point of likelihood function p(X|\theta), although {\theta_j}
converging to a stationary point of Q(\theta,\theta_j) is guaranteed.
A sufficient condition for a limit point of an EM sequence {\theta_j} being a
stationary point of the likelihood function p(X|\theta), is that Q(\theta,\theta_j)
is continuous in both \theta and \theta_j. Note that a stationary point
could be either a local optimum or a saddle point.
Discussion about the EM algorithm
In some situations (e.g., situations with
missing data), the EM algorithm may be much simpler than direct maximum
likelihood estimation (MLE). Direct MLE may require exponential
complexity while the EM algorithm may only incur linear complexity.
Given the complete data description p(X,Y|\theta),
where X is known and Y is unknown, a direct MLE of the unknown parameter \theta
could be:
- marginalize p(X,Y|\theta), i.e., integrate
p(X,Y|\theta) w.r.t. Y, resulting in p(X|\theta);
- maximize p(X|\theta) over \theta, resulting
in the MLE of \theta.
In the following paper,
C.N. Georghiades and J.C. Han, Sequence
Estimation in the Presence of Random Parameters Via the EM Algorithm, IEEE
Transactions on Communications, vol. 45, pp. 300-308, March 1997.
Georghiades and Han gave an example, in which
direct computation of MLE requires exponential complexity in the sequence length
while the EM algorithm only requires linear complexity in the sequence length.
This motivates people to use the EM algorithm.
Complex Gaussian vector channel for MIMO
systems
I. Emre Telatar, "Capacity
of Multi-antenna Gaussian Channels," European Transactions on
Telecommunications, Vol. 10, No. 6, pp. 585-595, Nov/Dec 1999.
Media-dependent noise
Alek Kavcic and J. M. F. Moura. "The Viterbi
Algorithm and Markovian Noise Memory." IEEE Transactions on
Information Theory, vol. 46, no. 1, pp. 291-301, Jan. 2000.
E. Eleftheriou and W. Hirt, "Noise-predictive maximum-likelihood (NPML) detection for the magnetic recording channel"
Shirish A. Altekar and Jack Wolf, "Improvements in detectors based upon colored noise"
Space-time coding
Space-time
coding (STC) has emerged over the past few years as a new paradigm for
optimally combining modulation, coding, and diversity gains over wireless
links. It is ideal for improving the downlink performance (which is the
bottleneck in asymmetric applications such as Internet browsing and
downloading) while keeping user terminals lightweight and cost effective.
STC was originally developed for frequency-flat quasi-static fading
channels. Extensive recent research has focused on extending it to frequency
selective and time-selective fading channels.
Detection of space-time-coded signals presents a significant signal
processing challenge as multiple coded, distorted, and faded copies of the
desired signal are superimposed at each receive antenna and corrupted by
noise, multiuser interference, and synchronization errors.
Signal processing techniques for space-time-coded transmission topics:
• Channel
Estimation (training-based, semi-blind, and blind)
• Equalization and Efficient Decoding
• Interference Suppression and Beamforming for STC
• STC for Time-Varying Channels (channel tracking)
• Synchronization Techniques
• STC and Multiple Access (OFDM, CDMA, TDMA)
• Turbo Techniques
• Capacity and Performance Limits
• High-Rate Layered Space-Time Methods (such as BLAST)
• Differential Space-Time Methods
• Prototyping Efforts