Communication theory
Noise performance of communication systems
- Uncoded modulation such as BPSK, QPSK
- What is the symbol error probability (SEP) vs. Eb/N0 performance for an AWGN channel?
- What about an ISI + AWGN channel?
- What about a fading + AWGN channel?
- What about a fading + ISI + AWGN channel?
- Coded modulation
- Uncoded DS-CDMA + modulation
- Convolutionally coded DS-CDMA + modulation
- Multiuser detection + convolutionally coded DS-CDMA + modulation
- Under what conditions does the performance for an MAI + ISI + AWGN channel reduce to single-user performance?
- MAI + fading + AWGN channel
- MAI + fading + ISI + AWGN channel
- OFDM+ modulation
- Why coin the term "keying" in ASK, FSK,
PSK?
- Keying = switching
- Since your message to be sent consists of letters from a discrete alphabet (like Morse code), you could imagine you are typing on a typewriter: you type a letter and then shift to another key and type. So in digital modulation (ASK, FSK, PSK), the amplitude/frequency/phase is shifted by keying/typing.
Modulation:
- Analog modulation
- Amplitude modulation (AM)
- Phase modulation (PM)
- Frequency modulation (FM)
- Digital modulation (a special case of analog
modulation)
- Baseband
- Pulse amplitude modulation (PAM)
- Pulse position modulation (PPM)
- Pulse duration modulation (PDM) or Pulse
width modulation (PWM)
- Pulse frequency modulation (PFM)
- Passband
- Amplitude shift keying (ASK)
- Phase shift keying (PSK)
- Frequency shift keying (FSK)
Digital Communications
- Basically, a receiver consists of two
components:
- Signal demodulator
- Correlation demodulator (correlator)
- Matched filter demodulator
- Channel equalizer: may be combined
with a detector, e.g., Decision Feedback Equalizer (DFE)
- Detector
- When the signal has no memory, use
symbol-by-symbol detector, such as ML, MAP.
- When the signal has memory (trellis
coded modulation, convolutional codes), use Maximum Likelihood
Sequence Detector (MLSD), or use symbol-by-symbol MAP detector,
where each symbol decision is based on an observation of a sequence
of received signal vectors.
- An essential objective of communication
system design is to combat the impairments caused by the
channel. So it is important to study the channel behavior.
- Channel models:
- Additive noise: modeling the thermal
noise generated in the receiver.
- Multiplicative noise: the channel gain
process is a stochastic process (i.e., fading channel)
- ISI:
- deterministic (Linear Time Invariant
channel): delay spread (or response spread) of an input impulse due to
finite bandwidth of the channel.
- Ideal Nyquist channels (having a constant spectrum) or channels with a raised cosine spectrum do not cause ISI, i.e., the values of the overall impulse response at the sampling instants t = kT are zero except at k = 0.
- Non-ideal band-limited channels cause ISI, i.e., the values of the impulse response at those sampling instants are non-zero.
- stochastic (multi-path fading channel, each path is a
stochastic process): delay spread due to multiple paths.
- If the maximum delay spread is
less than the symbol duration, there is no ISI. This
is the reason why OFDM (reducing the symbol rate of each
sub-channel) performs well under multi-path environments.
- Linear channel models: channel gain model +
noise model
- Memoryless channels:
- Discrete memoryless channels (DMC):
- described by P(y_i|x_i), where x_i
and y_i are both in discrete time; x_i is discrete-valued, and y_i
could be continuous-valued or discrete-valued.
- At MODEM layer, y_i is
continuous-valued (taking values from continuous ranges), y_i=x_i+n_i,
where n_i is continuous-valued (e.g., Gaussian noise).
- These models are typically used in
detection theory.
- At CODEC layer, y_i is
discrete-valued (y_i is generated by a hard decision). Example
of such channels: Binary Symmetric Channel (BSC), Binary Erasure Channel.
- These models are typically used in
information theory.
- Continuous memoryless channels:
- Noise model:
- Additive White Gaussian Noise
(AWGN): the marginal distribution of AWGN process is
Gaussian with zero mean; the auto-correlation is a delta
function; the power spectral density is a constant over the
range of frequency from 0 to infinity (i.e., white
property). Since it is a Gaussian process, the first
order and second order statistics are enough to completely
characterize the AWGN process.
- Additive colored Gaussian
noise: different from AWGN, the power spectral density is
not flat over the range of frequency from 0 to infinity; the
auto-correlation is not a delta function.
- impulse noise (e.g., hyper-Gaussian)
- Channel gain model: fading
channel with independent channel gains (only marginal
distribution is enough to characterize the memoryless fading
channel).
- Channels with memory:
- Noise model: channels with additive
colored Gaussian noise (power spectrum density function is not a
constant; auto-correlation is not a Dirac delta function). For
such channels, water filling/pouring can be used to achieve the
channel capacity under the constraint on the signal power.
- Channel gain model (linear
distortion): ISI channel (deterministic and Linear Time Invariant
system)
- Channel gain model: Fading channel
(channel gain is a stochastic process)
- Slow or fast fading: Doppler
spectrum
- Flat or frequency-selective fading (multipath): Power Delay Profile (PDP), ISI
- Frequency-selective slow
fading channel: tapped-delay-line channel model
- Model: h(t,\tau) = \sum_{k=1}^L
c_k(t) \delta(\tau - k/W)
where h(t,\tau) is the impulse
response of the channel at time t; L is the number of resolvable
paths; c_k(t) is the complex voltage gain of path k at time t
(assuming it is constant during a pulse/chip duration due to slow
fading); \delta(\tau) is Dirac delta function; W is the bandwidth
of the transmitted passband signal (W/2 is the bandwidth of the
transmitted baseband signal).
- This model considers non-zero
propagation delay so that the first resolvable path has a delay of
1/W. W is equal to the chip rate in DS-CDMA, 1/pulse_duration
in ultra-wideband systems (e.g., 3 GHz), or 1/symbol_duration if
pulse duration = symbol duration.
- If c_k(t) is complex Gaussian, then the fading is Rayleigh. c_k(t) = \alpha_k(t) * exp(j*\theta_k(t)) = \alpha_k(t)*cos(\theta_k(t)) + j*\alpha_k(t)*sin(\theta_k(t)) = g_k(t) + j*q_k(t), where g_k(t) and q_k(t) are Gaussian due to the central limit theorem, and they are independent due to their uncorrelatedness (uncorrelatedness implies independence for jointly Gaussian r.v.'s). Since g_k(t) and q_k(t) are independent Gaussian, the amplitude \alpha_k(t) is Rayleigh distributed and the phase \theta_k(t) is uniformly distributed over [0, 2*\pi). (A numerical sketch appears after this fading-channel block.)
- For an ultra-wideband (UWB)
system, since the signal bandwidth and the sampling rate are high,
there are only a small number of paths that contribute to a
resolvable tap/bin. So central limit theorem does not
hold and we do not see uniformly distributed phase. It
is observed in actual measurements that the phase only takes
values of 0 or \pi.
- Why does the delay of a resolvable path have to be a multiple of 1/W? The proof is on pages 795-797 of Proakis's Digital Communications, 3rd ed. The reasons are 1) the transmitted signal is required to be band-limited (passband requirement); 2) the Nyquist criterion for ISI-free transmission.
- Since some resolvable paths may
have c_k(t)=0, the model can be reduced to h(t,\tau) = \sum_{k=1}^L
c_k(t) \delta(\tau - \tau_k), where \tau_k is the delay for the
k_th resolvable path with non-zero c_k(t).
- Marginal distributions: Rayleigh,
Rician (Nakagami-n), Nakagami-m, Nakagami-q (Hoyt).
- Second-order statistics: Doppler
spectrum (power spectral density)
- High-order statistics: bispectrum,
trispectrum.
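As a quick numerical check of the Rayleigh-fading statement above (a complex Gaussian tap gain implies a Rayleigh amplitude and a uniform phase), here is a minimal Python/NumPy sketch; the sample size and per-component variance are illustrative assumptions, not values from these notes.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000                 # number of tap realizations (illustrative)
    sigma = 1.0                 # per-component standard deviation (assumed)

    # Complex Gaussian gain c = g + j*q, with g, q i.i.d. zero-mean Gaussian
    g = rng.normal(0.0, sigma, N)
    q = rng.normal(0.0, sigma, N)
    c = g + 1j * q

    alpha = np.abs(c)                  # amplitude: Rayleigh distributed
    theta = np.angle(c) % (2 * np.pi)  # phase: uniform over [0, 2*pi)

    # Compare empirical moments with the Rayleigh/uniform predictions
    print("mean amplitude:", round(alpha.mean(), 3), "(theory:", round(sigma * np.sqrt(np.pi / 2), 3), ")")
    print("mean phase    :", round(theta.mean(), 3), "(theory:", round(np.pi, 3), ")")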
- Matched filter is an optimal receiver for
pulse shaper + additive white noise channel (non-Gaussian or Gaussian does not
matter), under the criterion of maximizing the peak SNR (at the sampling
instant).
- Mathematically, it is the optimal solution to the problem of maximizing the SNR over the functional space L^2 (the set of all possible impulse responses of an LTI system). The solution is obtained via the Schwarz inequality, whose equality condition gives a necessary and sufficient condition for optimality.
- The intuition of matched filter being
optimal, is that the transmitted signal has the highest correlation with
itself or its scaled version; in other words, the transmitted signal resonates
with itself.
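To make the "maximizing the peak SNR" statement concrete, here is a minimal sketch comparing the output SNR at the sampling instant for a matched filter versus an arbitrary mismatched filter; the pulse shape, noise level, and trial count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 64
    t = np.arange(T)
    pulse = np.sin(np.pi * t / T)          # example transmit pulse shape (assumed)
    pulse /= np.linalg.norm(pulse)         # unit energy

    matched = pulse[::-1]                  # matched filter: time-reversed pulse
    mismatched = np.ones(T) / np.sqrt(T)   # an arbitrary unit-energy filter for comparison

    def output_snr(filt, trials=2000, sigma=0.5):
        # Output SNR at the sampling instant (end of the pulse) for a given receive filter
        signal = np.dot(filt[::-1], pulse)   # noiseless filter output at t = T-1
        noise = np.array([np.dot(filt[::-1], sigma * rng.normal(size=T)) for _ in range(trials)])
        return signal ** 2 / noise.var()

    print("matched filter SNR   :", round(output_snr(matched), 2))     # ~ Ep/sigma^2 = 4.0
    print("mismatched filter SNR:", round(output_snr(mismatched), 2))  # strictly smaller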
- Minimum mean squared error (MMSE) receiver is
an optimal receiver for pulse shaper+ LTI channel + additive white noise
channel (non-Gaussian or Gaussian does not matter), under the criterion of
minimizing the mean squared error (at the sampling instant).
- An MMSE filter consists of a matched filter
and a tapped-delay-line equalizer.
- Mathematically, it is the optimal solution to
the problem of minimizing the mean squared error over a functional space L^2
(set of all possible impulse responses for an LTI system). The solution
is obtained by setting the derivative of mean squared error (MSE) to zero and
solving it for the transfer function of the MMSE filter. It is a
necessary condition. But if the MSE is convex, it is a sufficient and
necessary condition.
- The intuition of the MMSE filter being
optimal, is that it uses a matched filter to maximize SNR while using a
tapped-delay-line equalizer to invert the joint channel (LTI + AWGN).
- Zero-forcing equalizer will enhance noise
when it inverts the LTI channel. Hence, it has poor SNR performance.
- Nonlinear channel models
- Trellis coded modulation:
- Set partitioning: partition an M-ary
constellation of interest into 2, 4, 8, ... subsets with size M/2,
M/4, M/8, ..., which progressively increase minimum Euclidean distance
between their respective signal points.
- Ungerboeck codes (it is used for TCM; there
are many TCM designs):
- (Set partitioning) To send n bits/symbol,
use a two-dimensional constellation of 2^{n+1} signal points (typically,
8-PSK or 16-QAM). Partition the constellation into 4 or 8 subsets.
- Operations consist of two parts:
- (Subset selection) Convolutional
encoding: One or two incoming bits per symbol enter a
rate-1/2 or rate-2/3 binary convolutional encoder, respectively; the
resulting two or three coded bits determine the selection of a
particular subset.
- The bits for subset selection are
most significant since one bit in error will cause a big difference in
received signal; hence these bits need heavy protection and so we use
channel coding to protect them. (Redundancy is introduced to
protect these bits.)
- Signal mapping: The remaining uncoded
data bits determine which particular point from the selected subset is
to be signaled.
- Since the minimum Euclidean distance
within a subset is increased, the detection error probability is
reduced. This is achieved without introducing redundancy.
- Maximum likelihood sequence estimation (MLSE)
at the receiver, using Viterbi algorithm
- (Determine branch metrics:) For
each subset of the constellation, determine the signal point that is
closest to the received signal point in the Euclidean sense. The
Euclidean distance will be used as a branch metric in Viterbi algorithm.
We have 4 or 8 such signal points and associated metrics.
- (MLSE:) Use Viterbi algorithm to find the
information sequence with minimum path metric; each branch in the trellis
of the Ungerboeck code corresponds to a subset rather than an individual
signal point.
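The distance-growth property of set partitioning can be checked numerically. The sketch below is a hedged illustration rather than the Ungerboeck construction verbatim: it partitions a unit-energy 8-PSK constellation into the usual every-other-point subsets and prints the minimum intra-subset Euclidean distance at each level.

    import numpy as np

    # Unit-energy 8-PSK constellation
    pts = np.exp(1j * 2 * np.pi * np.arange(8) / 8)

    def min_dist(points):
        # Minimum pairwise Euclidean distance within a set of signal points
        return min(abs(a - b) for i, a in enumerate(points) for b in points[i + 1:])

    # Level 0: full constellation; level 1: a size-4 subset (every other point);
    # level 2: a size-2 subset (every fourth point)
    for name, subset in [("8-PSK", pts), ("size-4 subset", pts[::2]), ("size-2 subset", pts[::4])]:
        print(name, "min distance =", round(min_dist(subset), 3))
    # Expected: 0.765 -> 1.414 -> 2.0, i.e., the minimum distance grows at each partition level.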
- Source models: "discrete" here
means both discrete-time and discrete alphabet (sample space).
- Discrete memoryless source: generate (in
discrete time) a sequence of i.i.d. discrete-valued r.v.'s,
described by a probability mass function.
- Discrete Markov source: generates a (stationary) discrete-time finite-state Markov chain, described by an initial state distribution and a transition probability matrix.
- It models a convolutional code; characterizes
the stochastic nature of a convolutional encoder.
- x_{n+1} = A*x_n + B*v_n, where v_n is an i.i.d. r.v. (input sequence) and x_n is the state of the encoder; y_n = C*x_n + D*v_n, where y_n is the output of the encoder and C is the generating matrix. From the codeword {y_n}, we can uniquely determine the information sequence {v_n} and hence the state sequence {x_n}, so the state is perfectly observable, i.e., {y_n} is in one-to-one correspondence with {x_n}; then it can be proved that the encoder output y_n is a Markov chain.
Proof: y_n = C*x_n + D*v_n = C*x_n + D*g(x_n, x_{n+1}) (v_n is a deterministic function of x_n and x_{n+1}; denote this function by g)
= h(x_n, x_{n+1}) (so y_n is a deterministic function of [x_n, x_{n+1}]^T, denoted by h; since {[x_n, x_{n+1}]^T} is a Markov process, {y_n} is also a Markov process. This completes the proof.)
- Discrete hidden Markov source:
- It models a trellis coded modulator (TCM);
describes the stochastic structure of a trellis coded modulator.
- A trellis coded modulator consists of two
parts:
- Convolutional encoder: Its
output sequence is a Markov chain. The coded bits
determine the selection of a particular subset.
- Signal mapper: The remaining i.i.d.
uncoded data bits determine which particular point from the selected subset
is to be signaled. The output of a TCM at time n, is a random
function of the state of a Markov chain (output of convolutional encoder),
and the randomness is independent across time n (due to the i.i.d property
of the remaining uncoded bits); hence, the output sequence of a TCM is an
HMM.
- Counter-measures for ISI: channel
equalization. Note that after channel equalization, we need to use a
detector.
- If the channel impulse response is known,
use ZFE or MMSE linear equalizer (LE).
- Zero-forcing equalizer (ZFE)
- Force ISI to zero at the sampling
instants t = kT, except k=0.
- Suitable for the case when the
channel noise is zero or not significant.
- Why are IIR filters not used in a ZFE? Theoretically, the ZFE is simply the inverse of the channel (its transfer function is H(z) = 1/G(z), where G(z) is the transfer function of the channel), and H(z) is typically an IIR. However, if G(z) is not a minimum-phase transfer function, 1/G(z) is not stable. In contrast, an FIR filter, which is always stable, avoids this problem.
- The ZFE is therefore realized as an FIR filter of finite/infinite length using the power-series (long-division) expansion of 1/G(z); a numerical sketch appears after this equalization list.
- MMSE equalizer (MMSE-LE)
- In the design of MMSE-LE, the
data sequence is assumed to be i.i.d. with zero
mean. What if the data sequence is not i.i.d. (not
perfectly compressed)?
- Use a scrambler whose introduced delay is not greater than a symbol duration. The idea of scrambling is to modulate the data sequence with a pseudo-random sequence (with spreading gain = 1) so as to break/reduce the correlation between data symbols.
- Use interleaving in channel coding; interleaving (possibly pseudo-random) can break/reduce the correlation between data symbols.
- For an unknown channel, use adaptive filters, which can be linear or nonlinear equalizers (LE/NLE).
- training-based/supervised adaptive equalizer
- blind/unsupervised adaptive equalizer
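Here is the numerical sketch referenced in the zero-forcing item above: it computes the FIR taps of 1/G(z) by long division for an assumed two-tap minimum-phase channel G(z) = 1 + 0.5 z^{-1} (the channel and the truncation length are illustrative) and checks that channel plus equalizer is approximately an identity.

    import numpy as np

    g = np.array([1.0, 0.5])     # channel G(z) = 1 + 0.5 z^{-1} (minimum phase, assumed)
    L = 16                       # number of equalizer taps (truncation length, assumed)

    # Long division: impulse response of 1/G(z), computed recursively
    h_inv = np.zeros(L)
    for n in range(L):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, len(g)):
            if n - k >= 0:
                acc -= g[k] * h_inv[n - k]
        h_inv[n] = acc / g[0]

    # Check: the channel followed by the FIR ZFE should be close to a pure identity
    combined = np.convolve(g, h_inv)[:L]
    print("equalizer taps   :", np.round(h_inv[:6], 4))    # 1, -0.5, 0.25, -0.125, ...
    print("combined response:", np.round(combined[:6], 4)) # ~ [1, 0, 0, ...]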
- Classification of receivers based on channel
models
- Noisy channel (e.g., AWGN channel):
matched filter
- A RAKE receiver is a matched filter;
actually, it consists of multiple matched filters.
Specifically, it consists of tapped delay lines, correlators, a maximal
ratio combiner, and an integrator.
- A RAKE receiver is not an equalizer (at
least not in the sense of ISI cancellation) since it only collects energy
from different replicas of the same symbol but does not intend to remove
the inter-symbol interference.
- ISI channel: transversal filter
(tapped-delay-line filter) / zero-forcing equalizer
- Noisy channel with ISI (noisy and
dispersive channel):
- MMSE equalizer. It consists of
a matched filter (matching the channel response, collecting signal
energy, maximizing SNR) and a synchronous transversal equalizer (canceling
ISI with the consideration of noise). Merit:
it is suitable for the case when noise is significant; it is optimal
in the sense of MMSE; it always performs as well as, and often
better than, ZFE. Limitation: complexity is
higher than ZFE.
- Approximation (implementation) of MMSE
equalizer:
- Adaptive equalizer, which adjusts
its equalizer coefficients to deal with time-varying
channels. The tap coefficients of the transversal
equalizer can be adjusted by an LMS algorithm. Basically, there
are two classes of adaptive equalizers: 1) linear equalizer
(transversal filter); 2) Decision Feedback Equalizer (DFE),
which consists of a feedforward section, a feedback section, and
a decision device. Both the feedforward and feedback
sections are transversal filters and can be jointly adjusted by
an LMS algorithm. For frequency-selective fading
channels, DFE offers a significant improvement in performance
over a linear equalizer for an equal number of taps.
Limitation: DFE suffers from error propagation.
- Fractionally Spaced Equalizer (FSE),
where the spacing between adjacent taps is less than the symbol
period. The finer granularity of the FSE makes it more effective at compensating delay distortion than a synchronous equalizer.
- Zero-forcing equalizer (followed by a
decision-making device). Merit: ZFE is simple and is suitable
for the case when noise is negligible.
Drawback: ZFE ignores the effect of noise and has severe performance
degradation when noise is significant.
- Transversal filter, tapped-delay-line filter,
FIR (nonrecursive) filter, and Moving Average (MA) filter are the same. Transversal
filter and tapped-delay-line filter are typically used in communication
theory. FIR filter is typically used in digital signal
processing. MA filter is typically used in statistical signal
processing (time-series theory).
- IIR (recursive) filter and Auto-regressive Moving Average
(ARMA) filter are the same. Auto-regressive (AR) filter is a special
case of ARMA filter.
- Mathematically, correlation demodulator (correlator)
and matched filter are the same. The differences are
- They have different structures. A correlator consists of a multiplier (coherent demodulator) and an integrator, while a matched filter is just an FIR/IIR filter whose impulse response is determined by the transmitted signal.
- They have different requirements on
implementation. A correlator requires synchronization due to
coherent demodulation, while a matched filter does not.
- Counter-measures for fading
- The relationship among estimators (estimation
theory), observer (control theory), equalizers (communication theory)
- Estimators and observers are equivalent.
- Hard-decision vs. soft-decision
decoding
- Hard-decision decoding has separate symbol
detector and channel decoder. The symbol detector converts the
continuous-valued (demodulated) received signal to discrete-valued symbols;
and then the channel decoder decodes the discrete-valued symbols based on the
minimum Hamming distance criterion (find a codeword that minimizes the Hamming
distance between the symbol sequence and the codeword). The
information sequence corresponding to the estimated codeword (one-to-one
correspondence) is the output of the decoder.
- Soft-decision decoding has a joint symbol
detector and channel decoder (joint detection/decoding). The channel
decoder decodes the continuous-valued received signal based on the minimum
Euclidean distance criterion (find a codeword that minimizes the Euclidean
distance between the received signal and the modulated signal corresponding to
the codeword). The information sequence corresponding to the estimated
codeword is the output of the decoder.
- Hard-output vs. soft output decoding
- Hard output decoding outputs discrete-valued
information sequence, given continuous-valued received signal.
- Soft output decoding outputs
continuous-valued log-likelihood ratio (or log-APP ratio), actually a test
statistic, given
continuous-valued received signal. APP: a posteriori probability.
The test statistic can be used to make a hard decision.
- Optimal detector, adaptive filter, blind
estimator
- Optimal detector: If the
auto-correlations of the channel input process x(n) and the channel
output process y(n), and the cross-correlation of x(n) and y(n) are
known, the optimum detector under MMSE criterion is Wiener filter (a
linear filter).
- Adaptive filter: If the
auto-correlations and cross-correlations are unknown, we need to
estimate the correlations. Such estimation algorithms
include LMS and RLS, which are sub-optimal implementation/approximation
of the MMSE detector. Adaptive filters need training (offline
estimation).
- Blind estimator: If training is not used, adaptive filters are called blind estimators (online estimation).
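As a hedged sketch of the adaptive-filter item above (LMS as a training-based approximation to the Wiener/MMSE solution), the following Python/NumPy code trains an LMS equalizer on a known symbol sequence over an assumed short ISI channel; the channel taps, step size, filter length, and decision delay are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    N, taps, mu = 5000, 11, 0.01           # training length, equalizer taps, LMS step size (assumed)
    channel = np.array([0.4, 1.0, 0.4])    # illustrative ISI channel

    x = rng.choice([-1.0, 1.0], N)                                   # known training symbols
    y = np.convolve(x, channel)[:N] + 0.05 * rng.normal(size=N)      # received signal

    w = np.zeros(taps)
    delay = (taps + len(channel)) // 2     # decision delay (assumed)
    errs = []
    for n in range(taps, N):
        u = y[n - taps:n][::-1]            # regressor: most recent samples first
        e = x[n - delay] - w @ u           # error against the (delayed) training symbol
        w += mu * e * u                    # LMS update toward the Wiener/MMSE solution
        errs.append(e * e)

    print("MSE over first 200 updates:", round(float(np.mean(errs[:200])), 3))
    print("MSE over last 200 updates :", round(float(np.mean(errs[-200:])), 3))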
- CDMA receiver
- Single user detection: treat Multiple
Access Interference (MAI) as background noise.
- Multiuser detection: detect the signals
from interfering users; then remove the interference.
- Synchronous CDMA: the signal from
each interfering user has the same delay as the signal of interest.
- Asynchronous CDMA: the signals from interfering users have different delays from the signal of interest.
- List all kinds of linear receivers.
- Matched filter
- MMSE (for a Gaussian source and AWGN channel)
- For CDMA systems, why is the MMSE detector
the best among all linear receivers, in the sense of maximum
Signal-to-Interference Ratio (SIR)?
- List typical nonlinear receivers
- Quadratic detector (energy detector)
- Chi-square detector
- Detector with Student's t statistic
- Detector with F statistic
- High Order Statistics (HOS):
- Cumulant:
- generalization of auto-correlation of
a stochastic process
- cumulants could be time variant.
- Polyspectrum:
- generalization of power spectrum of a
stochastic process
- polyspectrum could be time variant.
- HOS provides a basis for the
identification of a non-minimum-phase system (such as multi-path fading
channel) since HOS preserves phase information in the observed signal.
- Nonlinear channels
- Microwave systems with nonlinear amplifier
- AM/AM conversion
- AM/PM conversion
- High density digital magnetic recording
channel
- saturation
- partial erasure
- Fiber optic data transmission
- vectorial nonlinear Schroedinger equation
- self phase modulation (chirp)
- chromatic dispersion
- polarization mode dispersion (PMD)
- photo diode direct detection
- Satellite channels are usually (but not
always) nonlinear.
- Continuous phase modulation
- Properties
- Phase continuity
- Orthogonality (orthogonal decomposition of
the signal)
- QPSK: acronym for quaternary phase-shift
keying or quadrature phase-shift keying
- quaternary phase-shift keying: QPSK has 4
signal points in the constellation.
- quadrature phase-shift keying: a QPSK signal
consists of both inphase and quadrature signal components at the same time.
- Power amplifier vs. modulation scheme
- Class A
- used for linear modulation (AM/ASK, PM/PSK)
- maximum power efficiency: 50%
- Class AB
- Class B
- maximum power efficiency: 78.5%
- Class C
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- Class D
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- Class E
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- Class S
- used for nonlinear modulation (FM/FSK)
- maximum power efficiency: almost 100%
- It works like a switch; if there is no input
signal, the output signal is also zero (assuming the transistor has zero
saturation voltage).
- M-ary PSK, M-ary QAM, and M-ary FSK are
commonly used in coherent systems.
- M-ary DPSK and M-ary FSK are commonly used in
noncoherent systems. M-ary ASK can also be used in noncoherent
systems but is not common.
Key Techniques:
- Linear independence, orthogonality,
statistical independence
- Any signal can be decomposed into a
linear combination of linearly independent waveforms.
- Any signal in a linear vector (functional) space can be represented by an expansion of orthonormal basis functions of the vector space. Why is an orthonormal basis a better representation of signals than linearly independent waveforms/signals? No scaling is needed: the expansion coefficients are obtained directly as inner products (projections) onto the basis functions.
Sampling and Interpolation
- Why can non-uniform/nonlinear sampling at the Nyquist rate not achieve distortionless reconstruction of the original signal?
At channel encoding/decoding layer, typically we
use E_b/N_0, where E_b is energy per information bit and N_0 is single-sided
noise power spectrum density. The performance of a CODEC is
characterized by the relation between BER and E_b/N_0.
At modulation/demodulation layer, typically we
use E_s/N_0, where E_s is energy per symbol and N_0 is single-sided noise power
spectrum density. The performance of a MODEM is characterized by
the relation between BER and E_s/N_0. E_s/N_0=E_b/N_0 * bits/symbol * k/n,
where k/n is the rate of the channel code.
The received SNR = E_s/N_0 * symbol_rate/signal_bandwidth
= E_b/N_0 * bit_rate/signal_bandwidth. If the transmitter uses a
square-root raised cosine pulse shape and the receiver uses a matched filter,
symbol_rate = signal_bandwidth.
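A small worked example of the relations above, assuming QPSK (2 coded bits per symbol) with a rate-1/2 code; the operating point is illustrative.

    import numpy as np

    EbN0_dB = 4.0              # illustrative operating point
    bits_per_symbol = 2        # QPSK carries 2 coded bits per symbol (assumed)
    code_rate = 0.5            # rate-1/2 channel code (assumed)

    EbN0 = 10 ** (EbN0_dB / 10)
    EsN0 = EbN0 * bits_per_symbol * code_rate    # E_s/N_0 = E_b/N_0 * bits/symbol * k/n
    print("Es/N0 =", round(10 * np.log10(EsN0), 2), "dB")

    # With square-root raised cosine shaping and a matched filter, symbol_rate = signal_bandwidth,
    # so the received SNR equals Es/N0.
    print("received SNR =", round(10 * np.log10(EsN0), 2), "dB")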
How to convert a digital signal into a band-limited signal, so that it can be recovered without distortion after being transmitted over a band-limited channel (rectangular spectrum)?
A method is to use a square-root raised cosine
pulse shape at the transmitter and at the receiver (matched filter). The
benefits of this method are 1) satisfying Nyquist's criterion for zero ISI, and
2) maximizing SNR.
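The zero-ISI property of this method can be checked by sampling the overall raised-cosine response (square-root raised cosine transmit pulse followed by its matched filter) at the symbol instants. The sketch below assumes an illustrative rolloff factor and symbol period.

    import numpy as np

    beta, T = 0.35, 1.0        # rolloff factor and symbol period (illustrative)

    def raised_cosine(t, beta=beta, T=T):
        # Time-domain raised-cosine pulse: the combined SRRC transmit + matched receive response
        t = np.asarray(t, dtype=float)
        out = np.sinc(t / T) * np.cos(np.pi * beta * t / T) / (1 - (2 * beta * t / T) ** 2)
        # Handle the removable singularity at t = +-T/(2*beta)
        sing = np.isclose(np.abs(t), T / (2 * beta))
        out[sing] = (np.pi / 4) * np.sinc(1 / (2 * beta))
        return out

    k = np.arange(-5, 6)                       # symbol-spaced sampling instants t = kT
    print(np.round(raised_cosine(k * T), 6))   # ~ [0, ..., 0, 1, 0, ..., 0]: zero ISI except at k = 0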
Viterbi Algorithm (VA) vs. Kalman filter (KF)
Both of them use causality to reduce computational complexity. The existence of causality leads to the design of a recursive algorithm so that the quantity (cost-to-arrive for VA or state estimate for KF) computed at the last stage can be used to obtain the quantity at the current stage; there is no need to re-compute the total cost when new data come in.
Both of them can be used for state estimation of
hidden Markov model (HMM).
Kalman filter has the following assumptions (linearity, Gaussianity, MMSE):
1) system state equation is known and linear; the observation equation is known and linear. System states must be continuous since the disturbance is Gaussian, so the state cannot be estimated by dynamic programming (DP); DP only applies to estimating the state of a finite-state Markov chain (FSMC).
2) disturbances in the state equation are i.i.d. Gaussian; measurement errors in
the observation equation are i.i.d. Gaussian.
3) the distribution of initial state is known and Gaussian.
4) the (optimal) criterion for estimation is MMSE.
VA has the following assumptions:
1) system states must be discrete (not continuous); the process of the system state forms a finite-state Markov chain (FSMC).
2) the system state is partially observable (the observation is a random function of the state; the ranges of the observations corresponding to different states overlap; otherwise, the state could be uniquely determined from the observation).
3) We know the distribution of measurement errors, i.e., the probability of the
observation, given the state transition. Note the observation is related to
the state transition (a function of two variables); the observation related to
the single state is a special case.
4) The criterion of state sequence estimation could be ML or MAP.
For MAP, we assume that the distribution of initial state is known but does not
have to be Gaussian. It can be any probability distribution.
We also need to know the transition probability of the Markov chain, i.e., the
likelihood function or the disturbance's distribution, which doesn't need to be
Gaussian.
In digital communication, we typically assume transmitted symbols are equally
probable and statistically independent. Then MAP is reduced to ML. In addition,
typically we assume the channel noise is AWGN. So ML is reduced to
minimum Euclidean distance between the transmitted symbol and the received
signal. It's also called nearest neighbor criterion.
What about channels having memory (ISI
channels)? A channel with memory is characterized by FSMC, where the
number of states is determined by the channel memory length. So the VA can be used for ISI channels, not just memoryless AWGN channels. The assumption in using the VA is that the transmitted symbols are i.i.d. and the channel noise is white Gaussian (see the sketch below).
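The sketch below is a minimal Viterbi MLSE for BPSK over a known two-tap ISI channel with white Gaussian noise, matching the assumptions stated above; the channel taps, block length, and noise level are illustrative. The state is the previous symbol, and the branch metric is the squared Euclidean distance.

    import numpy as np

    rng = np.random.default_rng(3)
    h = np.array([1.0, 0.6])                  # known 2-tap ISI channel (assumed)
    N = 200
    x = rng.choice([-1.0, 1.0], N)            # i.i.d. BPSK symbols
    y = np.convolve(x, h)[:N] + 0.3 * rng.normal(size=N)

    symbols = [-1.0, 1.0]                     # state = previous symbol (2 states)
    cost = {s: 0.0 for s in symbols}          # cost-to-arrive for each state
    back = []                                 # backpointers, one dict per stage

    for n in range(N):
        new_cost, ptr = {}, {}
        for cur in symbols:
            best, best_prev = float("inf"), None
            for prev in symbols:
                expected = h[0] * cur + (h[1] * prev if n > 0 else 0.0)
                c = cost[prev] + (y[n] - expected) ** 2   # branch metric: squared Euclidean distance
                if c < best:
                    best, best_prev = c, prev
            new_cost[cur], ptr[cur] = best, best_prev
        cost, back = new_cost, back + [ptr]

    # Traceback from the best final state
    s = min(cost, key=cost.get)
    est = [s]
    for ptr in reversed(back[1:]):
        s = ptr[s]
        est.append(s)
    est = np.array(est[::-1])

    print("symbol error rate:", np.mean(est != x))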
What about transmitted symbols having memory (convolutional
codes)? Use VA to decode the codeword.
What if the noise is colored? 1) Whiten the noise using R^{-1/2}, where R is the noise covariance matrix. 2) Time-domain approach: model the noise U_n as an AR(p) process, so that U_n - \sum_{i=1}^p a_i*U_{n-i} is i.i.d. Gaussian; then we can use the VA w.r.t. the whitened noise U_n - \sum_{i=1}^p a_i*U_{n-i} (see the sketch below).
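A minimal sketch of the time-domain whitening step above, assuming AR(1) colored noise with a known coefficient a (the coefficient and sample size are illustrative): the residual U_n - a*U_{n-1} should be approximately white.

    import numpy as np

    rng = np.random.default_rng(4)
    a, N = 0.8, 50_000                  # AR(1) coefficient and sample size (illustrative)

    # Colored noise: U_n = a * U_{n-1} + W_n, with W_n i.i.d. Gaussian
    w = rng.normal(size=N)
    u = np.zeros(N)
    for n in range(1, N):
        u[n] = a * u[n - 1] + w[n]

    # Whitened residual: e_n = U_n - a * U_{n-1}
    e = u[1:] - a * u[:-1]

    def lag1_corr(v):
        v = v - v.mean()
        return np.dot(v[1:], v[:-1]) / np.dot(v, v)

    print("lag-1 correlation, colored :", round(lag1_corr(u), 3))   # ~ a
    print("lag-1 correlation, whitened:", round(lag1_corr(e), 3))   # ~ 0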
Sequential Bayesian estimation methods:
- Sequential Bayesian estimate of a random
variable
- Kalman filter: sequential Bayesian estimate
of the state of linear system (estimate an AR process)
- Particle filter: Sequential Importance
Sampling, estimate the state of a nonlinear system or linear system with
non-Gaussian disturbances.
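To illustrate the Kalman-filter item above (sequential Bayesian estimation of the state of a linear system, i.e., tracking an AR process), here is a minimal scalar Kalman filter; the AR coefficient, noise variances, and initial conditions are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    A, Q, R, N = 0.95, 0.1, 1.0, 500    # state coefficient, process/measurement variances (assumed)

    # Linear Gaussian state-space model: x_{n+1} = A*x_n + v_n,  y_n = x_n + u_n
    x = np.zeros(N)
    y = np.zeros(N)
    for n in range(1, N):
        x[n] = A * x[n - 1] + np.sqrt(Q) * rng.normal()
        y[n] = x[n] + np.sqrt(R) * rng.normal()

    xhat, P = 0.0, 1.0                  # initial state estimate and variance (assumed)
    est = np.zeros(N)
    for n in range(N):
        # Predict
        xpred, Ppred = A * xhat, A * A * P + Q
        # Update with observation y_n (MMSE update under the Gaussian assumptions)
        K = Ppred / (Ppred + R)         # Kalman gain
        xhat = xpred + K * (y[n] - xpred)
        P = (1 - K) * Ppred
        est[n] = xhat

    print("raw observation MSE :", round(float(np.mean((y - x) ** 2)), 3))
    print("Kalman estimate MSE :", round(float(np.mean((est - x) ** 2)), 3))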
BCJR vs. Viterbi Algorithm
The Bahl, Cocke, Jelinek, and Raviv (BCJR)
algorithm is used for a posteriori probability (APP) decoding of convolutional
codes in AWGN. Although the BCJR and Viterbi algorithms share many
similarities, they differ in a fundamental way because they compute very
different quantities. The Viterbi algorithm computes hard decisions by
performing maximum likelihood decoding. The BCJR algorithm computes soft
information about the message in the form of a posteriori probabilities for each
of the message bits. Of course, this soft information could be converted to hard
decisions by comparing the a posteriori probabilities to a threshold of one
half; the resulting decisions would be individually optimal in the sense that
they have minimal probability of being in error. Thus, whereas the Viterbi
algorithm produces the most likely message sequence, which minimizes the
word-error rate, quantizing the BCJR probabilities produces the sequence of most
likely message bits, which minimizes bit-error rate. The distinction is subtle
and is not by itself enough to warrant much interest in BCJR. Indeed, the
hard-decision performance of BCJR is only marginally better than Viterbi. The
real value of BCJR is that the a posteriori probabilities are valuable in
their own right; they may be used to estimate the reliability of a decision, and
are especially useful when interacting with other modules in a receiver. Another
important feature of BCJR is that it naturally and easily
exploits any a priori information that the receiver may have about the message
bits.
BCJR (or MAP algorithm) is an optimal decoding
algorithm for Turbo codes. Sub-optimal algorithms for decoding Turbo codes
include log-MAP and SOVA. Converting (multiplication, summation) to (summation, max) in the log domain (the max-log approximation) can further reduce complexity.
Equivalence between the BCJR algorithm and
Generalized VA (GVA)
BCJR algorithm is not a special EM
algorithm. BCJR computes the exact APP, so that we can obtain MAP
estimate. EM computes ML estimate.
Turbo principle: refine the processing
recursively
- EM algorithm: Given the observation X, compute the ML estimate of \theta.
- Complete data description
- p(X,Y|\theta), where X is an observation
vector and Y is a latent vector (not observable)
- Incomplete data description: p(X|\theta), obtained by marginalizing p(X,Y|\theta) over the latent Y.
Algorithm:
- (E step) given a value \theta_j, compute
Q(\theta, \theta_j)=E[log f(X,Y|\theta)], where the expectation is over
f(Y|X,\theta_j).
- (M step) compute \theta_{j+1} = arg max_{\theta} Q(\theta, \theta_j); repeat steps 1 and 2 till the algorithm converges.
Used in iterative decoding of turbo codes,
pattern recognition.
Note that it may not be easy to compute the expectation or the maximization. You may have to use Monte Carlo methods to compute the integral in the expectation, and numerical optimization methods for the maximization. In addition, you may not have the distribution f(Y|X,\theta_j), which is typical in pattern recognition. Hence, in practice, there could be many different algorithms (in different forms) that implement the EM.
Use Baum-Welch, also called
forward-backward, algorithm (a special EM) to find the ML estimate of the
parameters of an HMM. Described later.
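As a concrete, hedged illustration of the E and M steps (in a pattern-recognition setting mentioned above, not a decoding problem): EM for a two-component one-dimensional Gaussian mixture, where the latent Y is the component label. The data, initial guesses, and the unit-variance assumption are all illustrative.

    import numpy as np

    rng = np.random.default_rng(6)
    # Synthetic data from a two-component Gaussian mixture (true means -2 and +2)
    data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

    # Initial parameter guesses theta_0 = (mixing weight, means); unit variances assumed
    pi1, mu1, mu2 = 0.5, -1.0, 1.0

    for _ in range(50):
        # E step: responsibilities f(Y=1 | X, theta_j) for each sample
        p1 = pi1 * np.exp(-0.5 * (data - mu1) ** 2)
        p2 = (1 - pi1) * np.exp(-0.5 * (data - mu2) ** 2)
        r1 = p1 / (p1 + p2)
        # M step: maximize Q(theta, theta_j); closed-form updates for this model
        pi1 = r1.mean()
        mu1 = np.sum(r1 * data) / np.sum(r1)
        mu2 = np.sum((1 - r1) * data) / np.sum(1 - r1)

    print("weight:", round(pi1, 3), "means:", round(mu1, 3), round(mu2, 3))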
- Sum-product algorithm (message-passing
algorithm): operates in a
factor graph (a bipartite graph),
consisting of upper-layer nodes and lower-layer nodes, and links connecting
the nodes at two layers.
- for each upper-layer node, compute the
sum-product (product of all the incident links, resulting in a global
function, then sum over redundant variables, resulting in a marginal
function), assuming that the incoming information is all correct.
- for each lower-layer node, compute the
sum-product (product of all the incident links, then sum over redundant
variables), assuming that the incoming information is all correct.
Repeat steps 1 and 2 till the algorithm converges.
Computation is done in the logarithmic domain. If x1 and x2 are very different, then log(x1+x2) is approximately max{log(x1), log(x2)} (see the snippet below). The sum-product algorithm is used in LDPC decoding.
Belief propagation is a special
sum-product algorithm.
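A tiny check of the log-domain approximation mentioned above (the values are illustrative):

    import numpy as np

    x1, x2 = 1e3, 1.0                      # two terms of very different magnitude
    exact = np.log(x1 + x2)
    approx = max(np.log(x1), np.log(x2))   # max-log approximation
    print(round(exact, 4), round(approx, 4))   # nearly identical when x1 >> x2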
- Lloyd algorithm for obtaining the optimal codebook for vector quantization (same as the k-means algorithm, where k denotes the number of centroids):
- given a codebook (a set of centroids),
optimize the decoder using the nearest neighbor rule, i.e., decode the
given points and hence obtain an optimal partition of the given points
(optimal w.r.t. the codebook);
- for the partition of the given points,
optimize the encoder using the centroid rule, and hence obtain a new
codebook which is optimal for that partition; repeat steps 1 and 2 till the
algorithm converges.
Used in cluster analysis (for pattern
recognition) and vector quantization (for signal compression)
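A minimal sketch of the two-step Lloyd/k-means iteration above, alternating the nearest-neighbor rule and the centroid rule on scalar data; the training data, codebook size, and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(7)
    # Scalar training data drawn from two clusters (illustrative)
    data = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
    codebook = np.array([-1.0, 1.0])          # initial codebook with 2 centroids (assumed)

    for _ in range(20):
        # Step 1 (nearest-neighbor rule): assign each point to the closest centroid
        assign = np.argmin(np.abs(data[:, None] - codebook[None, :]), axis=1)
        # Step 2 (centroid rule): recompute each centroid as the mean of its partition
        codebook = np.array([data[assign == k].mean() for k in range(len(codebook))])

    distortion = np.mean((data - codebook[assign]) ** 2)
    print("codebook:", np.round(codebook, 3), "mean squared distortion:", round(float(distortion), 4))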
Markov process vs. hidden Markov model (HMM):
conditional independence property
- {X_n} is a Markov process if P{X_n|X_0, X_1, ...,
X_{n-1}}=P{X_n|X_{n-1}}, where P is probability if {X_n} are discrete r.v.'s;
otherwise, P is a probability density function.
- {Y_n} is a hidden Markov model if
(definition is sufficient and necessary condition)
- P{Y_n|X_0, X_1, ..., X_n, Y_0, Y_1, ..., Y_{n-1}}=P{Y_n|X_n} and {X_n} is a Markov process;
- or more
compactly, P{Y_n, X_{n+1}|X_0, X_1, ..., X_n, Y_0, Y_1, ...,
Y_{n-1}}=P{Y_n,X_{n+1}|X_n}. Note that marginalizing it over
Y_n and X_{n+1} respectively, results in the first definition.
For finite horizon T, {Y_n} is a hidden Markov model if
(two conditional independence assumptions)
- P{Y_n|X_0, X_1, ..., X_n, X_{n+1},..., X_T, Y_0, Y_1, ..., Y_{n-1},
Y_{n+1}, ..., Y_T}=P{Y_n|X_n}
- P{X_n|X_0, X_1, ..., X_{n-1}}=P{X_n|X_{n-1}}
An example of HMM (using an observation equation): {Y_n} is a hidden Markov model if Y_n=X_n+U_n, where {U_n} is
a sequence of independent r.v. (not necessarily identically distributed) and {X_n} is
a Markov process. Note that the noise {U_n} has to be white; otherwise,
conditional independence (HMM property) cannot be satisfied.
A hidden Markov chain can be characterized by a triplet \lambda = {A, B, \Pi}, where A = {a_{ij}} is the state transition probability matrix, with a_{ij} the probability that the state transitions from i to j; B = {b_j(y_n)} is the observation probability matrix, where b_j(y_n) = P{Y_n=y_n | X_n=j}; and \Pi is the initial state probability distribution.
- The nice thing about HMM is that efficient
algorithms can be derived for the following three problems, due to utilizing
the two conditional independence assumptions:
- (Computation of likelihood) Find P{Y_1^T|\lambda} for some Y_1^T = {Y_1, Y_2, ..., Y_T}. Use the forward (or backward) procedure since it is much more efficient than direct evaluation; the complexity is reduced from O(K^T) to O(T*K^2), where K is the number of states. This is because for direct evaluation, we need to marginalize the joint probability P{Y_1^T, X_1^T|\lambda} over the state sequence X_1^T, which has K^T different sequences. In contrast, the forward procedure utilizes the Markov property to reduce computation; the Markov property leads to a recursion, which turns a joint computation problem into separate sub-problems; the joint problem needs computing over an exponential number of possible state sequences, while each separate sub-problem only needs computing over a linear number of states. Similarly, the Viterbi algorithm reduces computational complexity due to recursive computation, or the principle of optimality, or additive cost (cost can be added up at each stage t). The additive cost property is like the Markov property, i.e., given the current cost, the future cost and past cost are independent.
- (State estimation) Given some Y_1^T and
some \lambda, find the MAP estimate of state sequence X_1^T (since we know
the initial state distribution, ML is replaced by MAP). Use
Viterbi algorithm to compute the MAP estimate; compared to direct computation, the complexity is reduced from O(K^T) to O(T*K^2), where K is the number of states; this is because there are K^T possible state sequences to be computed for direct computation, while there are T stages in the Viterbi algorithm and each stage involves K^2 possible branches.
- (Parameter estimation) Find the ML estimate
of \lambda, given some Y_1^T. Use Baum-Welch, also called
forward-backward, algorithm (a special EM).
- Use the Baum-Welch, also called forward-backward, algorithm (a special EM) to find the ML estimate of the parameters of an HMM. Both the forward and backward procedures compute the likelihood of \lambda.
- Forward procedure: Define \alpha_i(t)=P{Y_1^t=y_1^t,
X_t=i| \lambda}, which is the likelihood of \lambda, given the
observation-sequence-to-arrive Y_1^t and current state X_t=i.
- Initial condition: \alpha_i(1)=\pi_i *b_i(y_1).
- Recursive computation: \alpha_j(t+1)=[\sum_i \alpha_i(t) *a_{ij}]*
b_j(y_{t+1})
- Derivation:
\alpha_j(t+1) = P{Y_1^{t+1}=y_1^{t+1}, X_{t+1}=j | \lambda}
= \sum_i P{Y_1^{t+1}=y_1^{t+1}, X_t=i, X_{t+1}=j | \lambda} (marginalization of the joint probability)
= \sum_i P{Y_1^t=y_1^t, X_t=i | \lambda} * P{X_{t+1}=j, Y_{t+1}=y_{t+1} | X_t=i} (chain rule)
= \sum_i P{Y_1^t=y_1^t, X_t=i | \lambda} * P{X_{t+1}=j | X_t=i} * P{Y_{t+1}=y_{t+1} | X_{t+1}=j} (chain rule) done!
- Likelihood computation: P{Y_1^T|\lambda}=\sum_i
\alpha_i(T)
- Backward procedure: Define \beta_i(t)=P{Y_{t+1}^T=y_{t+1}^T|
X_t=i, \lambda}, which is the likelihood of \lambda, given the
observation-sequence-to-go Y_{t+1}^T and current state X_t=i.
- Initial condition: \beta_i(T)=1.
- Recursive computation: \beta_i(t)=\sum_j a_{ij}*
b_j(y_{t+1})*\beta_j(t+1)
- Derivation:
\beta_i(t) = P{Y_{t+1}^T=y_{t+1}^T | X_t=i, \lambda}
= \sum_j P{Y_{t+1}^T=y_{t+1}^T, X_{t+1}=j | X_t=i, \lambda} (marginalization of the joint probability)
= \sum_j P{X_{t+1}=j | X_t=i} * P{Y_{t+1}^T=y_{t+1}^T | X_{t+1}=j, \lambda} (chain rule)
= \sum_j P{X_{t+1}=j | X_t=i} * P{Y_{t+1}=y_{t+1} | X_{t+1}=j, \lambda} * P{Y_{t+2}^T=y_{t+2}^T | X_{t+1}=j, \lambda} (chain rule) done!
- Likelihood computation:
P{Y_1^T|\lambda}=\sum_i \pi_i* b_i(y_1)*\beta_i(1)
- Define \gamma_i(t) = P{X_t=i | Y_1^T, \lambda}, which is the posterior probability of X_t, given the observations Y_1^T.
- \gamma_i(t) = \alpha_i(t)*\beta_i(t) / \sum_j [\alpha_j(t)*\beta_j(t)] (using Bayes' rule to compute the posterior probability from the likelihoods)
- Define \xi_{ij}(t) = P{X_t=i, X_{t+1}=j | Y_1^T, \lambda}, which is the joint posterior probability of the consecutive states (not a transition probability), given the observations.
- \xi_{ij}(t) = \gamma_i(t)*a_{ij}*b_j(y_{t+1})*\beta_j(t+1) / \beta_i(t)
- Parameter estimation using relative frequencies (law of large numbers)
- Estimate of \pi_i: \tilde{\pi}_i = \gamma_i(1)
- Estimate of a_{ij}: \tilde{a}_{ij} = \sum_{t=1}^{T-1}\xi_{ij}(t)/
\sum_{t=1}^{T-1} \gamma_i(t)
- Estimate of b_i(k): \tilde{b}_i(k) = \sum_{t=1}^T \delta_{y_t,v_k}
\gamma_i(t) / \sum_{t=1}^T \gamma_i(t), where \delta_{y_t,v_k}=
1 if y_t=v_k; otherwise, \delta_{y_t,v_k}= 0. {v_k} is the
set of values that an observation can take.
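The forward and backward recursions above can be checked numerically for a small discrete HMM; both must give the same likelihood P{Y_1^T|\lambda}. The chain, observation model, and observation sequence below are illustrative assumptions.

    import numpy as np

    # Illustrative 2-state HMM with 2 observation symbols
    A  = np.array([[0.9, 0.1],
                   [0.2, 0.8]])            # a_{ij} = P{X_{t+1}=j | X_t=i}
    B  = np.array([[0.7, 0.3],
                   [0.1, 0.9]])            # b_j(k) = P{Y_t=k | X_t=j}, rows indexed by state
    pi = np.array([0.6, 0.4])              # initial state distribution
    y  = [0, 1, 1, 0, 1]                   # observation sequence (illustrative)
    T, K = len(y), len(pi)

    # Forward: alpha_j(t) = P{Y_1^t, X_t=j | lambda}
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]

    # Backward: beta_i(t) = P{Y_{t+1}^T | X_t=i, lambda}
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

    print("likelihood (forward) :", alpha[-1].sum())
    print("likelihood (backward):", (pi * B[:, y[0]] * beta[0]).sum())

    # Posterior state probabilities gamma_i(t) = P{X_t=i | Y_1^T, lambda}
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    print("gamma:")
    print(np.round(gamma, 3))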
- A sufficient condition for an HMM {Y_n} being a Markov process is that Y_n = f(X_n), where f is a deterministic one-to-one correspondence and {X_n} is a Markov process.
Proof:
P{Y_n=y_n|Y_0=y_0, Y_1=y_1, ..., Y_{n-1}=y_{n-1}}
=P{X_n=x_n|X_0=x_0, X_1=x_1, ..., X_{n-1}=x_{n-1}}
=P{X_n=x_n|X_{n-1}=x_{n-1}}
=P{Y_n=y_n|Y_{n-1}=y_{n-1}}
where x_i=f^{-1}(y_i) and f^{-1} is an inverse function of f.
The proof holds for both discrete and continuous r.v.'s. If {X_n} and {Y_n}
are discrete r.v.'s, then P is probability; otherwise, P is a probability
density function.
- Consider a case where Y_n=X_n+U_n, {X_n} is a Markov
chain, {U_n} are independent, and X_n can be perfectly determined from the
continuous observation Y_n; that is, P{X_n|Y_n} is either 1 or 0. The ranges
of Y_n, w.r.t. different state of X_n, are disjoint. For example, U_n
\in (-1,+1) and X_n= -1 or +1. Then if X_n=-1, Y_n \in (-2,0); if X_n=1,
Y_n \in (0,2). Is {Y_n} a Markov chain? Yes.
Why?
Proof: Use density function p here.
p{Y_n=y_n|Y_0=y_0, Y_1=y_1, ..., Y_{n-1}=y_{n-1}}
=p{U_n=y_n-X_n|U_0=y_0-X_0, U_1=y_1-X_1, ..., U_{n-1}=y_{n-1}-X_{n-1}} ( note
that given y_i, X_i is deterministic.)
=p{U_n=y_n-X_n|U_{n-1}=y_{n-1}-X_{n-1}} (this is because {U_n} are
independent; keep U_{n-1} to prove Markov property)
=p{Y_n=y_n|Y_{n-1}=y_{n-1}}
- The necessary (I believe) and sufficient
condition for an HMM {Y_n} being a Markov
process is that {X_n} is a Markov process and P{X_n|Y_n} is either 1 or 0;
that is, X_n is perfectly observable. That is why HMM is also called
partially observable Markov process.
Linear systems with
perfectly observable state
Given a state equation and
observation equation of a system,
X_{n+1} = A* X_n + V_n
Y_n = f(X_n)
where A is an invertible square matrix, f is an invertible function, {V_n} are i.i.d., and V_n is independent of X_n, then both the states {X_n} and the observations {Y_n} are Markov chains;
{Y_n} preserves the transition matrix of {X_n}, although the
state space is changed.
Linear systems with
imperfectly observable state
If Y_n = f(X_n) + U_n, where {U_n} are i.i.d. and U_n is independent of X_n, then the observations {Y_n} form a hidden Markov model (HMM). Note that Y_n is a random function of X_n. Note {U_n} must be i.i.d.; otherwise, the conditional independence required by an HMM cannot be satisfied: Pr{Y_n | X_0, X_1, ..., X_n} = Pr{Y_n | X_n}.
Using Markov property (or Hidden Markov
property) to reduce computation complexity
BCJR (more general, EM algorithm), Viterbi
algorithm (more general, Dynamic programming), Kalman Filter, Particle filter,
Markov Chain Monte Carlo (MCMC) algorithm, sum-product algorithm, all can
utilize Markov property to reduce computation complexity.
If the EM algorithm is applied to an HMM, it can also utilize the Markov property to reduce computational complexity.
But the EM algorithm does not necessarily have to be applied to HMM and it can
also be used in other situations. The key properties about EM are 1)
iterative computation, 2) convergence, 3) rate of convergence, and 4)
local/global optimality; not Markov property.
If dynamic programming is applied to an HMM, it can also utilize the Markov property to reduce computational complexity.
But Dynamic programming does not necessarily have to be applied to HMM and it
can also be used in other situations. The key properties about DP
are 1) additive cost and 2) the principle of optimality; not Markov property.
Kalman filter reduces the computation complexity
from O((N*n)^3) to O(N*n^3), where N is the number of samples and n is the
dimension of the observation vector.
Quote: "Viterbi algorithm gives a global optimum
while EM can only give a local optimum (actually a fixed point)." This is
not a fair comparison. Viterbi algorithm is only used for estimating
discrete r.v.'s rather than continuous r.v. So the cost function
(optimal criterion) cannot have a local optimum (the argument of the function is
discrete). Only continuous
cost functions can have local optima. EM can be used for
estimating both discrete and continuous parameters. However,
for estimating discrete parameters \theta, it cannot guarantee the convergence
of {\theta_j} to a
stationary point of likelihood function p(X|\theta), although {\theta_j}
converging to a stationary point of Q(\theta,\theta_j) is guaranteed.
A sufficient condition for a limit point of an EM sequence {\theta_j} being a
stationary point of the likelihood function p(X|\theta), is that Q(\theta,\theta_j)
is continuous in both \theta and \theta_j. Note that a stationary point
could be either a local optimum or a saddle point.
Discussion about the EM algorithm
In some situations (e.g., situations with
missing data), the EM algorithm may be much simpler than direct maximum
likelihood estimation (MLE). Direct MLE may require exponential
complexity while the EM algorithm may only incur linear complexity.
Given the complete data description p(X,Y|\theta),
where X is known and Y is unknown, a direct MLE of the unknown parameter \theta
could be:
- marginalize p(X,Y|\theta), i.e., integrate
p(X,Y|\theta) w.r.t. Y, resulting in p(X|\theta);
- maximize p(X|\theta) over \theta, resulting
in the MLE of \theta.
In the following paper,
C.N. Georghiades and J.C. Han, Sequence
Estimation in the Presence of Random Parameters Via the EM Algorithm, IEEE
Transactions on Communications, vol. 45, pp. 300-308, March 1997.
Georghiades and Han gave an example, in which
direct computation of MLE requires exponential complexity in the sequence length
while the EM algorithm only requires linear complexity in the sequence length.
This motivates people to use the EM algorithm.
Complex Gaussian vector channel for MIMO
systems
I. Emre Telatar, "Capacity
of Multi-antenna Gaussian Channels," European Transactions on
Telecommunications, Vol. 10, No. 6, pp. 585-595, Nov/Dec 1999.
Media-dependent noise
Alek Kavcic and J. M. F. Moura. "The Viterbi
Algorithm and Markovian Noise Memory." IEEE Transactions on
Information Theory, vol. 46, no. 1, pp. 291-301, Jan. 2000.
E. Eleftheriou and W. Hirt, "Noise-predictive maximum-likelihood (NPML) detection for the magnetic recording channel"
Shirish A. Altekar and Jack Wolf, "Improvements in detectors based upon colored noise"
Space-time coding
Space-time
coding (STC) has emerged over the past few years as a new paradigm for
optimally combining modulation, coding, and diversity gains over wireless
links. It is ideal for improving the downlink performance (which is the
bottleneck in asymmetric applications such as Internet browsing and
downloading) while keeping user terminals lightweight and cost effective.
STC was originally developed for frequency-flat quasi-static fading
channels. Extensive recent research has focused on extending it to frequency
selective and time-selective fading channels.
Detection of space-time-coded signals presents a significant signal
processing challenge as multiple coded, distorted, and faded copies of the
desired signal are superimposed at each receive antenna and corrupted by
noise, multiuser interference, and synchronization errors.
Signal processing techniques for space-time-coded transmission topics:
• Channel
Estimation (training-based, semi-blind, and blind)
• Equalization and Efficient Decoding
• Interference Suppression and Beamforming for STC
• STC for Time-Varying Channels (channel tracking)
• Synchronization Techniques
• STC and Multiple Access (OFDM, CDMA, TDMA)
• Turbo Techniques
• Capacity and Performance Limits
• High-Rate Layered Space-Time Methods (such as BLAST)
• Differential Space-Time Methods
• Prototyping Efforts