Probability Theory
- Probability theory can be classified into two categories (interpretations):
a rigorous mathematical interpretation (probability as a measure) and an
intuitive physical interpretation (probability as long-run relative frequency),
which is simple but somewhat inaccurate.
- Some frequentists, like Neyman, want to add some qualifications to such
statements. Mathematical probability is a property of some formal, abstract
mathematical objects. It so happens, as an empirical fact, that these objects
can be used, with a fair degree of accuracy, to represent various bits and
pieces of the real world. The long-run limits invoked are to be understood as a
way of speaking about what would happen if we repeated experiments
indefinitely, about (as Neyman puts it) ``imaginary random experiments.''
Exploring this topic in the depth it deserves would lead to discussing, on the
one hand, the axioms of probability, measure theory, sigma-algebras and
Kolmogorov complexity, and on the other hand, the connection between
mathematics and physical reality.
Mean, median, mode characterize the central
tendency of a r.v. Variance and standard deviation characterize dispersion
of a r.v.
- A sample space is the set of all possible
outcomes of an experiment. An event is a subset of the
sample space. Note that a sample space is not just an arbitrary set:
its elements are the possible outcomes of a random experiment, and its
subsets are the random events. The randomness is inherent in the
experiment, not in the set itself.
- An elementary event consists of a
single element in a sample space.
- Geometric distribution:
- Geometric mean vs. arithmetic mean:
- The geometric mean is so called because it
is the n-th root of the product of n numbers (say, a_1, ..., a_n),
which has a direct geometric interpretation: a hypercube whose side
length equals the geometric mean has the same volume as an orthotope
(box) with side lengths a_1, ..., a_n.
- Geometric series
- b_n = c^n is the volume of an n-dimensional
hypercube with side length c. The volume grows
exponentially with the index n, so this series also has a
geometric interpretation.
- Geometric distribution
- P_n = pq^{n-1}, n = 1, 2, ..., where q = 1 - p.
- Each term is a constant p times a term of a
geometric sequence in q, hence the name (a quick numerical check
appears after this list).
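A minimal numerical sketch of the two points above, assuming numpy is available: the geometric mean as the n-th root of the product matches the exp-of-mean-of-logs form, and the empirical mean of samples from the geometric pmf P_n = pq^{n-1} is close to 1/p (a standard fact). The numbers used are illustrative.

```python
# Sketch (assumes numpy): check the geometric-mean identity and the mean of the
# geometric distribution P_n = p * q^(n-1), q = 1 - p.
import numpy as np

a = np.array([2.0, 8.0, 4.0, 1.0])           # arbitrary positive numbers a_1..a_n
geo_mean = np.prod(a) ** (1.0 / len(a))      # n-th root of the product
print(geo_mean, np.exp(np.mean(np.log(a))))  # same value computed via logs

rng = np.random.default_rng(0)
p = 0.3
samples = rng.geometric(p, size=200_000)     # support {1, 2, ...}, pmf p * (1-p)^(n-1)
print(samples.mean(), 1.0 / p)               # empirical mean vs. theoretical 1/p
```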
- E[X] exists and is finite if and only if E[|X|]<infinity; that is, the
expectation requires absolute integrability/summability. Absolute
integrability/summability implies conditional integrability/summability, but
not conversely.
- Probability mass function (PMF), probability
density function (PDF)
- The names come from physics: mass is the integral
of density over an area/volume. For a discrete r.v., the PDF is a weighted
train of Dirac delta functions; integrating the PDF across each delta gives a
finite value, namely the corresponding PMF value.
- Statistical mean = expectation = ensemble
average; empirical mean = sample average/mean = time average.
- If Y = f(X), where f is a one-to-one
correspondence, then P{X=x, Y=f(x)} = P{X=x} = P{Y=f(x)}.
Types of Convergence
- Convergence everywhere
- We say that a random sequence x_n
converges everywhere if the sequence of numbers x_n(\zeta) converges
for every $\zeta$ in the sample space. That is, every
realization of the process x_n (i.e., every sample path, the sequence
of numbers obtained for a fixed outcome $\zeta$) converges. The limit
of each realization of the process x_n is a number that depends,
in general, on $\zeta$. That is, the limit of x_n is
a random variable x.
- One may be confused by
$\zeta$. Here, $\zeta$ is just a symbol representing an
outcome of an experiment; $\zeta$ is unknown before the
experiment and known once the experiment is done. Then
x_n(\zeta) simply maps the outcome $\zeta$ to a
real number, so each realization of the random sequence is a
sequence of deterministic numbers.
- An example: Denote by x an
arbitrary r.v. and let x_n = x + 1/n. Then every
realization of the process x_n converges to the corresponding realization
of x, so x_n converges everywhere to x (a numerical sketch appears at
the end of this list).
- Convergence almost everywhere (a.e.) or
almost sure (a.s.), also called convergence with probability 1 or strong
convergence
- Convergence in probability or weak
convergence
- P(lim inf A_n) <= lim inf P(A_n) <= lim
sup P(A_n) <= P(lim sup A_n)
- Sketch of the last inequality: P(\cup_{k>=n} A_k) >= sup_{k>=n} P(A_k) for
every n; letting n go to infinity and using continuity of probability from
above, P(lim sup A_n) = lim_n P(\cup_{k>=n} A_k) >= lim sup_n P(A_n).
For a rigorous proof, see Billingsley's book "Probability and
Measure".
Transforms of PDF/PMF
- Characteristic function: E[exp(jwX)]
- Similar to Fourier transform of PDF:
E[exp(-jwX)]
- So, when you compute characteristic
function of a PDF using existing Fourier transform pairs, don't forget to
put a negative sign before jw in the Fourier transform of the PDF.
- Moment generating function:
- For continuous r.v.: E[exp(sX)], where s is
real-valued. Since s takes real values, the MGF may not exist for
some r.v.'s, whereas the characteristic function always exists because
|exp(jwX)| = 1.
- Similar to Laplace transform of PDF:
E[exp(-sX)], where s is complex-valued.
- So, when you compute moment generating
function of a PDF using existing Laplace transform pairs, don't forget to
put a negative sign before s in the Laplace transform of the PDF.
- For discrete r.v.: E[z^(X)]
- Similar to Z-transform of PMF: E[z^(-X)]
- So, when you compute moment generating
function of a PMF using existing Z-transform pairs, don't forget to put a
negative sign before X in the Z-transform of the PMF.
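A hedged numerical illustration of the characteristic function, assuming numpy is available: for a standard normal, E[exp(jwX)] = exp(-w^2/2), and the sketch below estimates it by a Monte Carlo average and compares.

```python
# Sketch (assumes numpy): estimate the characteristic function E[exp(j*w*X)] of a
# standard normal by a sample average and compare with the known value exp(-w^2/2).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=500_000)
for w in (0.5, 1.0, 2.0):
    cf_est = np.mean(np.exp(1j * w * X))     # sample average of exp(jwX)
    print(w, cf_est.real, np.exp(-w**2 / 2)) # imaginary part is ~0 by symmetry
```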
Important distributions:
- Continuous distributions
- Uniform: simplest distribution for support =
bounded region (a,b); for a scalar r.v., need two parameters, i.e., the
two end points of the boundary, (a,b).
- Exponential: simplest distribution for
support = [0,\infty); need one parameter, i.e., mean.
- Normal: simplest distribution for support =
(-\infty,\infty); need two parameters, i.e., mean and variance.
- Log-normal:
- Two r.v.'s X and Y have the relation Y=exp(X)
or X=log(Y); if X is a normal r.v., then Y is a log-normal r.v.
- Note that the name log-normal comes from the fact that
the logarithm of the r.v. (here, log(Y)) is normally distributed. The
logarithm of a normal r.v. is not log-normal (its pdf can be
easily obtained by the change-of-variable formula).
- The log-normal distribution is used to model the
shadowing effect (large-scale fading) in wireless channels (a quick numerical
check of the definition follows below).
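A small check of the definition, assuming numpy and scipy are available: exponentiating normal samples gives samples consistent with scipy's lognorm parameterization (s = sigma, scale = exp(mu)). The mu and sigma values are illustrative.

```python
# Sketch (assumes numpy/scipy): if X ~ Normal(mu, sigma^2), then Y = exp(X) is
# log-normal; compare samples of Y against scipy's lognorm with s=sigma, scale=exp(mu).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma = 0.5, 0.8
Y = np.exp(rng.normal(mu, sigma, size=100_000))              # Y = exp(X)
ks = stats.kstest(Y, stats.lognorm(s=sigma, scale=np.exp(mu)).cdf)
print(ks.pvalue)                                             # large p-value: consistent fit
```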
- Chi-square distribution
- When df=2, the chi-square distribution is the
exponential distribution with mean 2.
- Chi distribution (closely related to the Nakagami-m distribution):
- It is the half-normal distribution (the absolute value of a standard normal) if df=1
- It is the Rayleigh distribution if df=2
- Used to characterize fading channels.
- Student t:
- Z = X/(Y/\sqrt{n}), where X is standard normal and Y is
chi distributed with n degrees of freedom, independent of X; then Z is
Student t distributed with n degrees of freedom. For n = 1 (Y is the
absolute value of a standard normal, i.e., chi with one degree of freedom),
Z is Cauchy distributed.
- Used as a test statistic in CFAR problems
(matched filter + unknown noise variance, for a linear statistical model).
- Weibull distribution:
- Discrete distributions
- Binomial distribution: pmf of the counts when n
independent trials are split between two categories (success/failure).
- Multinomial distribution: pmf of the counts when n
independent trials are split among m categories.
- Mixture distribution: normalized weighted
sum/integral of distributions. For example, given 10 Gaussian pdfs f_i(x)
and weights a_i >= 0 with \sum_{i=1}^{10} a_i = 1, a Gaussian mixture is
\sum_{i=1}^{10} a_i*f_i(x) (a sampling sketch follows this list).
Another example of a mixture distribution is \int g(\theta)*f(x|\theta)
d\theta, where g(\theta) is a pdf of \theta and f(x|\theta) is a pdf of x
parameterized by \theta.
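A minimal sampling sketch for the first (finite) mixture example, assuming numpy is available: draw a component index i with probability a_i, then draw from the corresponding Gaussian f_i. The weights, means, and standard deviations below are illustrative (3 components rather than 10).

```python
# Sketch (assumes numpy): sample from a Gaussian mixture sum_i a_i * f_i(x) by first
# picking component i with probability a_i, then sampling from that Gaussian.
import numpy as np

rng = np.random.default_rng(4)
a = np.array([0.2, 0.5, 0.3])       # mixture weights, nonnegative and summing to 1
means = np.array([-2.0, 0.0, 3.0])  # component means (illustrative)
stds = np.array([0.5, 1.0, 0.7])    # component standard deviations (illustrative)

i = rng.choice(len(a), size=100_000, p=a)   # component index drawn with probabilities a_i
samples = rng.normal(means[i], stds[i])     # then draw from the chosen Gaussian
print(samples.mean(), np.dot(a, means))     # mixture mean equals sum_i a_i * mu_i
```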
Stirling's formula: n! \approx \sqrt{2\pi n} (n/e)^n; dropping the
\sqrt{2\pi n} factor gives the cruder form n! \approx (n/e)^n =
e^{n \log n - n}, which is accurate for \log n! (the relative error of the
logarithm vanishes) as n goes to infinity.
What is the inverse chi-square function (as found in statistical tables)?
It is the inverse of the chi-square CDF, i.e., the quantile (percentile)
function; see the sketch below. (This should not be confused with the
inverse-chi-squared distribution, which is the distribution of 1/X when X is
chi-square distributed.)
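In scipy (an assumption about available tooling), this quantile function is chi2.ppf, the inverse of chi2.cdf, as sketched below.

```python
# Sketch (assumes scipy): the chi-square quantile (percentile) function is the
# inverse of the chi-square CDF; scipy calls it ppf.
from scipy.stats import chi2

df = 5
q = chi2.ppf(0.95, df)     # 95th percentile of chi-square with 5 degrees of freedom
print(q, chi2.cdf(q, df))  # cdf(ppf(0.95)) recovers 0.95
```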
The Goodness of Fit Test
Suppose that we have a random experiment with a
random variable X of interest. Assume additionally
that X is discrete with density function f
on a finite set S. We repeat the experiment n times to
generate a random sample of size n from the distribution of
X: X_1, X_2, ..., X_n.
Recall that these are independent variables,
each with the distribution of X.
In this section, we assume that the distribution
of X is unknown. For a given density function f_0,
we will test the hypotheses
H_0: f = f_0 versus H_1: f \neq f_0.
The test that we will construct is known as the
goodness of fit test for the conjectured density f_0.
As usual, our challenge in developing the test is to find a good test
statistic--one that gives us information about the hypotheses and whose
distribution, under the null hypothesis, is known, at least approximately.
Suppose that S = {x_1, x_2, ..., x_k}.
To simplify the notation, let
p_j = f_0(x_j) for j = 1, 2, ..., k.
Now let N_j = #{i in {1, 2, ..., n}: X_i = x_j} for j = 1, 2, ..., k.
1. Show that under the null hypothesis,
- N = (N_1, N_2, ..., N_k) has the multinomial
distribution with parameters n and p_1, p_2, ..., p_k.
- E(N_j) = n p_j.
- var(N_j) = n p_j (1 - p_j).
Exercise 1 indicates how we might begin to
construct our test: for each j we can compare the observed
frequency of x_j (namely N_j)
with the expected frequency of x_j (namely
n p_j), under the null hypothesis. Specifically, our test
statistic will be
V = (N_1 - n p_1)^2 / (n p_1) + (N_2 - n p_2)^2 / (n p_2) + ... + (N_k - n p_k)^2 / (n p_k).
Note that the test statistic is based on the
squared errors (the squares of the differences between the expected
frequencies and the observed frequencies). The reason that the squared errors
are scaled as they are is the following crucial fact, which we will accept
without proof: Under the null hypothesis, as n increases to infinity,
the distribution of V converges to the chi-square distribution with
k − 1 degrees of freedom.
As usual, for m > 0 and r
in (0, 1), we will let v_{m, r} denote the quantile of order
r for the chi-square distribution with m degrees of
freedom. For selected values of m and r, v_{m, r}
can be obtained from the table of the chi-square distribution.
2. Show that the following test has approximate significance level α:
Reject H_0: f = f_0 versus H_1: f \neq f_0 if and only if V > v_{k-1, 1-α}.
Again, the test is an approximate one that works
best when n is large. Just how large n needs to be depends
on the p_j; the rule of thumb is that the test will work
well if the expected frequencies n p_j are all at least 1 and at
least 80% of them are at least 5.
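A hedged implementation sketch of the test above, assuming numpy and scipy are available: compute V from observed counts N_j and hypothesized probabilities p_j, then compare with the chi-square(k-1) quantile v_{k-1, 1-α}. The observed counts below are made-up data.

```python
# Sketch (assumes numpy/scipy): chi-square goodness-of-fit test with statistic
# V = sum_j (N_j - n*p_j)^2 / (n*p_j); reject H0 when V > chi2 quantile with k-1 df.
import numpy as np
from scipy.stats import chi2

N = np.array([18, 25, 30, 27])           # observed frequencies N_j (illustrative data)
p0 = np.array([0.25, 0.25, 0.25, 0.25])  # hypothesized probabilities p_j under H0
n, k = N.sum(), len(N)

V = np.sum((N - n * p0) ** 2 / (n * p0))
alpha = 0.05
critical = chi2.ppf(1 - alpha, k - 1)    # v_{k-1, 1-alpha}
print(V, critical, V > critical)         # reject H0 iff V exceeds the critical value
```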
- Chain rule:
- P(X,Y)=P(Y)*P(X|Y)
- P(X,Y|Z)=P(Y|Z)*P(X|Y,Z)
- P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_1,X_2)*...*P(X_n|X_1,...,X_{n-1})
- For a Markov process, P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_2)*...*P(X_n|X_{n-1})
= P(X_1)* \prod_{i=2}^n P(X_i|X_{i-1}) (a numeric sketch follows below)
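A small numeric sketch of the Markov factorization, with a made-up initial distribution and transition matrix (the numbers are purely illustrative):

```python
# Sketch (assumes numpy, illustrative numbers): the joint probability of a Markov
# path x_1 -> x_2 -> ... factors as P(x_1) * prod_i P(x_i | x_{i-1}).
import numpy as np

P0 = np.array([0.6, 0.4])            # initial distribution P(X_1)
T = np.array([[0.7, 0.3],            # T[i, j] = P(X_next = j | X_current = i)
              [0.2, 0.8]])
path = [0, 1, 1, 0]                  # a particular realization x_1 .. x_4

prob = P0[path[0]]
for prev, cur in zip(path[:-1], path[1:]):
    prob *= T[prev, cur]             # multiply in P(X_i = cur | X_{i-1} = prev)
print(prob)                          # 0.6 * 0.3 * 0.8 * 0.2 = 0.0288
```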
- Law of total probability:
- Total probability can be computed by
weighted sum of conditional probabilities, conditioned on partitioned
causes: P(X)= \sum_{i=1}^N P(X|H_i)*P(H_i).
- The law of total probability consists of
two computation steps:
- Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
- Marginalization (marginalizing the joint
probability): P(X)= \sum_{i=1}^N P(X,H_i)= \sum_{i=1}^N P(X|H_i)*P(H_i).
- Bayes formula (Bayes rule) computes the posterior
probability P(H_i|X) from the likelihood function P(X|H_i) and the prior probability P(H_i):
- P(H_i|X)=P(X|H_i)*P(H_i)/(\sum_{j=1}^N
P(X|H_j)*P(H_j)), where \sum_{j=1}^N P(X|H_j)*P(H_j)=P(X), i.e., the total
probability is used (note the summation index j, distinct from the fixed index i).
- Bayes rule consists of two steps (a numeric sketch follows below):
- Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
- Use total probability: P(H_i|X)=P(X,H_i)/P(X)
=P(X|H_i)*P(H_i)/(\sum_{j=1}^N
P(X|H_j)*P(H_j))
- See more at
http://en.wikipedia.org/wiki/Bayes'_theorem
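A minimal numeric sketch of the two steps, using made-up priors and likelihoods (three hypotheses, illustrative values only):

```python
# Sketch (illustrative numbers): Bayes rule via the chain rule + total probability.
priors = [0.5, 0.3, 0.2]        # P(H_i)
likelihoods = [0.9, 0.5, 0.1]   # P(X | H_i)

joints = [p * l for p, l in zip(priors, likelihoods)]  # chain rule: P(X, H_i)
total = sum(joints)                                    # total probability: P(X)
posteriors = [j / total for j in joints]               # posterior: P(H_i | X)
print(total, posteriors)                               # posteriors sum to 1
```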
- It is possible
to have two r.v.'s that are each individually Gaussian, but are not jointly
Gaussian.
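A standard construction illustrating this, as a sketch assuming numpy is available: let X be standard normal and Y = S*X with S = +/-1 chosen independently with probability 1/2. Each of X and Y is standard normal, but (X, Y) is not jointly Gaussian; for instance, X + Y equals exactly 0 half the time, which is impossible for a nondegenerate jointly Gaussian pair.

```python
# Sketch (assumes numpy): X ~ N(0,1), Y = S*X with S = +/-1 independent of X.
# X and Y are each standard normal, but (X, Y) is not jointly Gaussian:
# X + Y would then have to be normal, yet it equals 0 with probability 1/2.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=200_000)
S = rng.choice([-1.0, 1.0], size=X.size)
Y = S * X

print(Y.mean(), Y.std())      # close to 0 and 1: Y is (approximately) standard normal
print(np.mean(X + Y == 0.0))  # close to 0.5: rules out a jointly Gaussian pair
```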