
Friday, January 30, 2015

Conditional Expectation

Definition.  λ is absolutely continuous (AC) with respect to μ, written λ ≪ μ, if μ(A) = 0 implies λ(A) = 0.

Theorem 2 (Radon-Nikodym Theorem)  Let (Ω,B,P) be a probability space.  Suppose ν is a positive bounded measure and ν ≪ P.  Then there exists an integrable random variable X ∈ B, such that
ν(E) = ∫E X dP for all E ∈ B.  X is a.s. unique (P) and is written
X = dν/dP.  We also write dν = X dP.

Definition of Conditional Expectation

Suppose X ∈ L1(Ω,B,P) and let G ⊂ B be a sub-σ-field.  Then there exists a random variable E(X|G), called the conditional expectation of X with respect to G, such that

  1. E(X|G) is G-measurable and integrable.
  2. For all G ∈ G we have ∫G X dP = ∫G E(X|G) dP.
Notes.
  1. Definition of conditional probability: Given (Ω,B,P), a probability space, with G a sub-σ-field of B, define P(A|G) = E(1A|G), A ∈ B.  Thus P(A|G) is a random variable such that
    1. P(A|G) is G-measurable and integrable.
    2. P(A|G) satisfies ∫G P(A|G) dP = P(A ∩ G), G ∈ G.
  2. Conditioning on random variables: Suppose {Xt, t ∈ T} is a family of random variables defined on (Ω,B) and indexed by some index set T.  Define G := σ(Xt, t ∈ T), the σ-field generated by the process {Xt, t ∈ T}.  Then define E(X|Xt, t ∈ T) = E(X|G).
Note (1) continues the duality of probability and expectation but seems to place expectation in a somewhat more basic position, since conditional probability is defined in terms of conditional expectation.

Note (2) saves us from having to make separate definitions for E(X|X1), E(X|X1,X2), etc.

Countable partitions  Let {Λn, n ≥ 1} be a partition of Ω, so that Λi ∩ Λj = ∅ for i ≠ j, and ∪n Λn = Ω.  Define
G = σ(Λn, n ≥ 1), so that
G = {∪i∈J Λi : J ⊂ {1,2,…}}.  For X ∈ L1(P), define
EΛn(X) = ∫ X P(dω|Λn) = ∫Λn X dP / P(Λn) if P(Λn) > 0, and EΛn(X) = 18 (an arbitrary value, since Λn is P-null) if P(Λn) = 0.  We claim

  1. E(X|G) = ∑n≥1 EΛn(X) 1Λn a.s., and for any A ∈ B
  2. P(A|G) = ∑n≥1 P(A|Λn) 1Λn a.s.
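The partition formula in claim (1) can be checked directly on a finite sample space.  Below is a minimal sketch; the uniform six-point space, the even/odd partition, and X(ω) = ω² are assumptions chosen purely for illustration:

```python
from fractions import Fraction

# Finite sample space with uniform probability (illustrative choice).
omega_space = range(6)
P = {w: Fraction(1, 6) for w in omega_space}
X = lambda w: w * w  # an example random variable

# Partition: Lambda_1 = evens, Lambda_2 = odds; G = sigma(Lambda_1, Lambda_2).
partition = [[w for w in omega_space if w % 2 == 0],
             [w for w in omega_space if w % 2 == 1]]

def cond_exp(w):
    # E(X|G)(w) = E_{Lambda_n}(X) = int_{Lambda_n} X dP / P(Lambda_n),
    # computed on the block Lambda_n that contains w.
    block = next(b for b in partition if w in b)
    p_block = sum(P[v] for v in block)
    return sum(X(v) * P[v] for v in block) / p_block

# Defining property (2): int_G X dP == int_G E(X|G) dP for each generator G.
for block in partition:
    assert sum(X(v) * P[v] for v in block) == sum(cond_exp(v) * P[v] for v in block)
```

Since cond_exp is constant on each block and measurable with respect to the partition, both defining properties of E(X|G) hold.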



Product Spaces, Transition Kernels and Fubini's Theorem

Lemma 1 (Sectioning sets).  Sections of measurable sets are measurable: if A ∈ B1×B2, then for all ω1 ∈ Ω1,
Aω1 := {ω2 : (ω1, ω2) ∈ A} ∈ B2.
Corollary 1.  Sections of measurable functions are measurable.  That is, if
X : (Ω1×Ω2, B1×B2) → (S, S), then the section Xω1(ω2) := X(ω1, ω2) is B2/S measurable.

A function
K(ω1, A2) : Ω1 × B2 → [0,1]
is a transition (probability) kernel if it satisfies the following:
  1. for each ω1, K(ω1, ·) is a probability measure on B2, and
  2. for each A2 ∈ B2, K(·, A2) is B1/B([0,1]) measurable.
Transition kernels are used to define discrete time Markov processes where K(ω1,A2) represents the conditional probability that, starting from ω1, the next movement of the system results in a state in A2.

Theorem 1.  Let P1 be a probability measure on B1, and suppose
K : Ω1 × B2 → [0,1] is a transition kernel.  Then K and P1 uniquely determine a probability measure on B1×B2 via the formula
P(A1×A2) = ∫A1 K(ω1, A2) P1(dω1), for all A1×A2 in the class of measurable rectangles.
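On a finite state space a transition kernel is just a row-stochastic matrix, and the integral in Theorem 1 becomes a finite sum.  A sketch, where the matrix K and the initial measure P1 are made-up illustrations:

```python
# A transition kernel on the finite space Omega_1 = Omega_2 = {0, 1, 2}:
# K[i][j] = K(i, {j}); each row is a probability measure (property 1),
# and measurability (property 2) is automatic on a finite space.
K = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
P1 = [0.2, 0.3, 0.5]  # an assumed initial probability measure on B_1

def kernel(w1, A2):
    # K(w1, A2) for a set A2, given as a collection of states.
    return sum(K[w1][j] for j in A2)

def P(A1, A2):
    # P(A1 x A2) = int_{A1} K(w1, A2) P1(dw1), here a finite sum.
    return sum(kernel(w1, A2) * P1[w1] for w1 in A1)

for row in K:                          # property 1: rows are probability measures
    assert abs(sum(row) - 1.0) < 1e-12
assert abs(P([0, 1, 2], [0, 1, 2]) - 1.0) < 1e-12  # total mass is 1
```

Read as a Markov chain, P([i], [j]) is the probability of starting in state i (under P1) and moving to state j in one step.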

Theorem 2 (Marginalization)  Let P1 be a probability measure on (Ω1,B1) and suppose K : Ω1 × B2 → [0,1] is a transition kernel.  Define P on (Ω1×Ω2, B1×B2) by
P(A1×A2) = ∫A1 K(ω1, A2) P1(dω1).  Assume
X : (Ω1×Ω2, B1×B2) → (R, B(R)) and furthermore suppose X is integrable.  Then
Y(ω1) = ∫Ω2 K(ω1, dω2) Xω1(ω2) has the properties

  1. Y is well defined.
  2. Y ∈ B1
  3. Y ∈ L1(P1), and furthermore
∫Ω1×Ω2 X dP = ∫Ω1 Y(ω1) P1(dω1) = ∫Ω1 [∫Ω2 K(ω1, dω2) Xω1(ω2)] P1(dω1).
Theorem 3 (Fubini Theorem)  Let P = P1×P2 be a product measure.  If X is B1×B2 measurable and is either non-negative or integrable with respect to P, then
∫Ω1×Ω2 X dP = ∫Ω1 [∫Ω2 Xω1(ω2) P2(dω2)] P1(dω1) = ∫Ω2 [∫Ω1 Xω2(ω1) P1(dω1)] P2(dω2).
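For a finite product measure, Fubini's theorem reduces to swapping the order of two finite sums, which can be verified directly.  A small sketch; the weights and the function X are made up for illustration:

```python
# Fubini for a finite product measure: the two iterated sums agree.
P1 = [0.25, 0.75]      # probability on Omega_1 = {0, 1}
P2 = [0.1, 0.2, 0.7]   # probability on Omega_2 = {0, 1, 2}
X = lambda w1, w2: (w1 + 1) * (w2 - 1) ** 2  # an integrable (finite) function

# int over Omega_2 first, then Omega_1:
lhs = sum(P1[w1] * sum(X(w1, w2) * P2[w2] for w2 in range(3)) for w1 in range(2))
# int over Omega_1 first, then Omega_2:
rhs = sum(P2[w2] * sum(X(w1, w2) * P1[w1] for w1 in range(2)) for w2 in range(3))

assert abs(lhs - rhs) < 1e-12
```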

Thursday, January 29, 2015

Clarification of Expectation

Let X be a random variable on the probability space (Ω,B,P).  Recall that the distribution of X is the measure
F := P∘X−1 on (R, B(R)) defined by
F(A) = P∘X−1(A) = P[X ∈ A].
The distribution function of X is
F(x) := F((−∞, x]) = P[X ≤ x].  Note that the letter "F" is overloaded in two ways.

An application of the Transformation Theorem allows us to compute the abstract integral
E(X) = ∫Ω X dP as
E(X) = ∫R x F(dx), which is an integral on R.

More precisely,
E(X) = ∫Ω X(ω) P(dω) = ∫R x F(dx).
Also, given a measurable function g, the expectation of g(X) is
E(g(X)) = ∫Ω g(X(ω)) P(dω) = ∫R g(x) F(dx).
Instead of computing expectations on the abstract space Ω, one can always compute them on R using F, the distribution of X.
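The identity E(g(X)) = ∫Ω g(X(ω)) P(dω) = ∫R g(x) F(dx) is easy to verify on a discrete space.  A toy sketch, where two coin tosses, X = number of heads, and g(x) = x² are all illustrative assumptions:

```python
from fractions import Fraction
from collections import Counter

# E(g(X)) computed on the abstract space Omega vs. on R via the distribution F.
omega_space = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega_space}
X = lambda w: w.count("H")   # number of heads
g = lambda x: x * x

# On Omega: E(g(X)) = int_Omega g(X(w)) P(dw)
e_abstract = sum(g(X(w)) * P[w] for w in omega_space)

# On R: push P forward to F = P o X^{-1}, then E(g(X)) = int_R g(x) F(dx)
F = Counter()
for w in omega_space:
    F[X(w)] += P[w]
e_distribution = sum(g(x) * F[x] for x in F)

assert e_abstract == e_distribution
```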

Random variables and Inverse maps

A random variable is a real valued function with domain Ω which has an extra property called measurability that allows us to make probability statements about the random variable.

Suppose Ω and Ω′ are two sets.  Often Ω′ = R.  Suppose
X : Ω → Ω′.  Then X determines an inverse map (a set valued function)
X−1 : P(Ω′) → P(Ω) defined by
X−1(A) = {ω ∈ Ω : X(ω) ∈ A} for A ⊂ Ω′.

X−1 preserves complementation, unions and intersections.
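These preservation properties are easy to verify mechanically on finite sets; a small sketch, where the sets and the map are arbitrary illustrations:

```python
# Check that the inverse map X^{-1} preserves set operations, on finite sets.
Omega = {0, 1, 2, 3}
Omega_prime = {"a", "b"}
X = {0: "a", 1: "a", 2: "b", 3: "b"}  # an arbitrary map Omega -> Omega'

def preimage(A):
    # X^{-1}(A) = {omega in Omega : X(omega) in A}
    return {w for w in Omega if X[w] in A}

A, B = {"a"}, {"b"}
assert preimage(Omega_prime - A) == Omega - preimage(A)  # complements
assert preimage(A | B) == preimage(A) | preimage(B)      # unions
assert preimage(A & B) == preimage(A) & preimage(B)      # intersections
```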

Wednesday, January 28, 2015

Convergence Concepts

Implications of convergence

Almost sure convergence and Lp convergence each imply convergence in probability; the converse implications fail, and a.s. convergence and Lp convergence do not imply each other, as the examples below show.

1. Almost Sure Convergence

Examples of statements that hold almost surely (a.s.)

  • Let X, X′ be two random variables.  Then X = X′ a.s. means P[X = X′] = 1; that is, there exists an event N ∈ B, such that P(N) = 0 and if ω ∈ Nc, then X(ω) = X′(ω).
  • If {Xn} is a sequence of random variables, then limn→∞ Xn exists a.s. means there exists an event N ∈ B, such that P(N) = 0 and if ω ∈ Nc then limn→∞ Xn(ω) exists.  It also means that for a.a. ω, lim supn→∞ Xn(ω) = lim infn→∞ Xn(ω).  We will write limn→∞ Xn = X a.s. or Xn →a.s. X.
  • If {Xn} is a sequence of random variables, then ∑n Xn converges a.s. means there exists an event N ∈ B, such that P(N) = 0, and ω ∈ Nc implies ∑n Xn(ω) converges.

2. Convergence in Probability

Suppose Xn, n ≥ 1 and X are random variables.  Then Xn converges in probability (i.p.) to X, written Xn →P X, if for any ϵ > 0, limn→∞ P[|Xn − X| > ϵ] = 0.
Almost sure convergence of {Xn} demands that for a.e. ω, Xn(ω) − X(ω) gets small and stays small.  Convergence i.p. is weaker and merely requires that the probability of the difference Xn(ω) − X(ω) being non-trivial becomes small.

It is possible for a sequence to converge in probability but not almost surely.
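A standard example lives on ([0,1], B([0,1]), λ): indicator functions of intervals of width 1/k that sweep across [0,1] as k grows.  The widths tend to 0, giving convergence i.p. to 0, yet every ω is covered infinitely often, so Xn(ω) converges for no ω.  A sketch of this construction, where the particular enumeration scheme is an assumption:

```python
def block(n):
    # Enumerate the intervals [j/k, (j+1)/k) for k = 1, 2, ... and j = 0, ..., k-1;
    # X_n is the indicator of the n-th interval (n = 0, 1, 2, ...).
    k = 1
    while n >= k:
        n -= k
        k += 1
    return (n / k, (n + 1) / k), 1.0 / k  # (interval, its length = P[X_n = 1])

# P[|X_n| > eps] equals the interval length 1/k -> 0: convergence i.p. to 0.
assert block(999)[1] < 0.05

# But any fixed omega is covered exactly once per sweep, hence infinitely often:
omega = 0.3
hits = [n for n in range(1000) if block(n)[0][0] <= omega < block(n)[0][1]]
assert len(hits) > 40  # X_n(omega) = 1 infinitely often: no a.s. convergence
```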

Theorem 1 (Convergence a.s. implies convergence i.p.)  Suppose that Xn, n ≥ 1 and X are random variables on a probability space (Ω,B,P).  If Xn → X a.s., then Xn →P X.
Proof.  If Xn → X a.s., then for any ϵ > 0,
0 = P([|Xn − X| > ϵ] i.o.) = P(lim supn→∞ [|Xn − X| > ϵ]) = limN→∞ P(∪n≥N [|Xn − X| > ϵ]) ≥ lim supn→∞ P[|Xn − X| > ϵ].

3. Lp Convergence

Recall the notation X ∈ Lp, which means E(|X|^p) < ∞.  For random variables X, Y ∈ Lp, we define the Lp metric for p ≥ 1 by
d(X,Y) = (E|X − Y|^p)^{1/p}.  This metric is norm induced because
‖X‖p := (E|X|^p)^{1/p} is a norm on the space Lp.

A sequence {Xn} of random variables converges in Lp to X, written
Xn →Lp X, if
E(|Xn − X|^p) → 0 as n → ∞.

Facts about Lp convergence.
  1. Lp convergence implies convergence in probability: For p > 0, if Xn →Lp X then Xn →P X.  This follows readily from Chebychev's inequality: P[|Xn − X| ≥ ϵ] ≤ E(|Xn − X|^p)/ϵ^p → 0.
  2. Convergence in probability does not imply Lp convergence.  What can go wrong is that the nth function in the sequence can be huge on a very small set.
    Example.  Let the probability space be ([0,1], B([0,1]), λ), where λ is Lebesgue measure, and define
    Xn = 2^n 1(0,1/n).  Then
    P[|Xn| > ϵ] = P((0,1/n)) = 1/n → 0, but
    E(|Xn|^p) = 2^{np} · (1/n) → ∞.
  3. Lp convergence does not imply almost sure convergence.
    Example.  Consider the functions {Xn} defined on ([0,1], B([0,1]), λ), where λ is Lebesgue measure:
    X1 = 1[0,1/2], X2 = 1[1/2,1], X3 = 1[0,1/3], X4 = 1[1/3,2/3], X5 = 1[2/3,1], X6 = 1[0,1/4], and so on.  Note that for any p > 0,
    E(|X1|^p) = E(|X2|^p) = 1/2, E(|X3|^p) = E(|X4|^p) = E(|X5|^p) = 1/3, E(|X6|^p) = 1/4, …, so E(|Xn|^p) → 0 and
    Xn →Lp 0.
    Observe, however, that {Xn} does not converge almost surely to 0: for every ω ∈ [0,1], Xn(ω) = 1 for infinitely many n.
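The spike example in fact (2) can also be checked by Monte Carlo, with uniform samples on [0,1] standing in for Lebesgue measure; the sample size and seed are arbitrary choices:

```python
import random

random.seed(0)

def simulate(n, num_samples=200_000):
    # X_n = 2^n on (0, 1/n) and 0 elsewhere, under Lebesgue measure on [0,1].
    us = [random.random() for _ in range(num_samples)]
    xs = [2.0 ** n if 0 < u < 1.0 / n else 0.0 for u in us]
    prob_big = sum(1 for x in xs if x > 0.5) / num_samples  # estimates P[|X_n| > 1/2]
    mean = sum(xs) / num_samples                            # estimates E(|X_n|)
    return prob_big, mean

p10, m10 = simulate(10)
assert abs(p10 - 0.1) < 0.01   # P[|X_10| > 1/2] = 1/10: a small probability...
assert m10 > 50                # ...but E(|X_10|) = 2^10 / 10 = 102.4: a huge moment
```

The spike is rare, so the sequence converges i.p., but its height 2^n grows so fast that the L1 norm explodes.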

Limits and Integrals

Under certain circumstances we are allowed to interchange expectation with limits.

Theorem 1 (Monotone Convergence Theorem (MCT))  If
0 ≤ Xn ↑ X, then
0 ≤ E(Xn) ↑ E(X).
Corollary 1 (Series Version of MCT)  If Xn, n ≥ 1, are non-negative random variables, then
E(∑n≥1 Xn) = ∑n≥1 E(Xn),
so that the expectation and infinite sum can be interchanged.

Theorem 2 (Fatou Lemma)  If Xn ≥ 0, then
E(lim infn→∞ Xn) ≤ lim infn→∞ E(Xn).
More generally, if there exists Z ∈ L1 and Xn ≥ Z, then
E(lim infn→∞ Xn) ≤ lim infn→∞ E(Xn).
Corollary 2 (More Fatou)  If Xn ≤ Z where Z ∈ L1, then
E(lim supn→∞ Xn) ≥ lim supn→∞ E(Xn).
Theorem 3 (Dominated Convergence Theorem (DCT))  If
Xn → X and there exists a dominating random variable Z ∈ L1 such that
|Xn| ≤ Z, then
E(Xn) → E(X) and E|Xn − X| → 0.
An example of what can happen when interchanging limits and integrals without the domination condition: something very nasty happens on a small set, and the degree of nastiness overpowers the degree of smallness.

Let
(Ω, B, P) = ([0,1], B([0,1]), λ), with λ the Lebesgue measure.  Define
Xn = n^2 1(0,1/n).  For any ω ∈ [0,1],
1(0,1/n)(ω) → 0, so
Xn → 0.  However,
E(Xn) = n^2 · (1/n) = n → ∞, so
E(lim infn→∞ Xn) = 0 ≤ lim infn→∞ E(Xn) = ∞, while
E(lim supn→∞ Xn) = 0 < ∞ = lim supn→∞ E(Xn), so the conclusion of Corollary 2 fails without the domination condition.

Tuesday, January 27, 2015

Zero-One Laws

There are several common zero-one laws which identify the possible range of certain random variables as trivial (almost surely constant).  There are also several zero-one laws which provide the basis for proofs of almost sure convergence.

Proposition 1. Borel-Cantelli Lemma Let \{A_n\} be any events (not necessarily independent).

If \sum_n{P(A_n)}<\infty, then
P([A_n \; i.o.])=P(\limsup_{n\rightarrow \infty} A_n) = 0.
Proposition 2. Borel Zero-One Law If \{A_n\} is a sequence of independent events, then

 \begin{equation*} P([A_n \; i.o.])= \begin{cases} 0, \quad & \text{iff } \sum_n P(A_n) < \infty \\ 1, \quad & \text{iff } \sum_n P(A_n) = \infty \end{cases} \end{equation*}
Definition.  An almost trivial \sigma-field is a \sigma-field all of whose events have probability 0 or 1.

Theorem 3. Kolmogorov Zero-One Law If \{X_n\} are independent random variables with tail \sigma-field \mathcal{T}, then \Lambda\in \mathcal{T} implies P(\Lambda)=0 or 1 so that the tail \sigma-field is almost trivial.

Lemma 4. Almost trivial \sigma-fields  Let \mathcal{G} be an almost trivial \sigma-field and let X be a random variable measurable with respect to \mathcal{G}.  Then there exists c such that P[X=c] = 1.

Corollary 5.  Let \{X_n\} be independent random variables.  Then the following are true.

(a) The event
[\sum_n X_n \;converges] has probability 0 or 1.

(b) The random variables \text{lim sup}_{n\rightarrow \infty}X_n and \text{lim inf}_{n\rightarrow \infty}X_n are constant with probability 1.

(c) The event
\{\omega: S_n(\omega)/n \rightarrow 0 \}, where S_n = X_1 + \cdots + X_n, has probability 0 or 1.

Monday, January 26, 2015

Inequalities

Another valuable book for anyone in computer science who ever wants to bound any quantity (so, everyone!) is: The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities by Michael Steele.
An encyclopedic book on the topic is A Dictionary of Inequalities. While this is not a book for reading cover-to-cover, it is good to have it at your disposal. See also the supplement of the book.
Moreover, Wikipedia has an excellent list of inequalities.
For specific topics, you may consult:

Friday, January 23, 2015

Dynkin's theorem

First we define a structure named \lambda-system.

A class of subsets \mathcal{L} of \Omega is called a \lambda-system if it satisfies the following postulates

1.  \Omega\in \mathcal{L}
2. A\in\mathcal{L} \Rightarrow A^c \in \mathcal{L}
3. n\neq m, A_nA_m = \emptyset, A_n \in \mathcal{L} \Rightarrow \cup_n A_n \in \mathcal{L}

It is clear that a \sigma-field is always a \lambda-system.

Next a \pi-system is a class of sets closed under finite intersections.

Dynkin's theorem

a) If \mathcal{P} is a \pi-system and \mathcal{L} is a \lambda-system such that \mathcal{P}\subset\mathcal{L}, then \sigma(\mathcal{P})\subset \mathcal{L}.

b) If \mathcal{P} is a \pi-system,

\sigma(\mathcal{P})=\mathcal{L}(\mathcal{P})

that is, the minimal \sigma-field over \mathcal{P} equals the minimal \lambda-system over \mathcal{P}.

Wednesday, January 21, 2015

An example of Stein's paradox

Charles Stein showed in 1956 that a nonlinear, biased estimator of a multivariate mean can have a lower MSE than the ML estimator.

Suppose we are given a sample of N measurements of X\sim\mathcal{N}(\mu,\sigma^2 I_p) with unknown parameter vector \mu of length p.

The James-Stein estimator is given by

\begin{equation*} \hat{\mu}_{JS}=\left (1-\frac{(p-2)\frac{\sigma^2}{N}}{\|\bar{x} \|^2}\right ) \bar{x} \end{equation*}
where \bar{x} is the sample mean.

This estimator dominates the MLE everywhere in terms of MSE.  For all \mu\in\mathbb{R}^p,

\begin{equation*} \mathbb{E}_\mu \| \hat{\mu}_{JS}-\mu\|^2 < \mathbb{E}_\mu \| \hat{\mu}_{MLE}-\mu\|^2 \end{equation*}
This makes the MLE inadmissible for p\ge3!
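A quick Monte Carlo experiment makes the dominance visible.  This sketch assumes N = 1, \sigma^2 = 1, p = 10, and a particular \mu (the dominance holds uniformly, so the choice does not matter); it also uses the positive-part variant of the shrinkage factor, a common refinement that never increases the risk:

```python
import random

random.seed(1)
p = 10
mu = [1.0] * p   # an arbitrary true mean vector
trials = 5000

def sq_norm(v):
    return sum(x * x for x in v)

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = [random.gauss(m, 1.0) for m in mu]           # one observation: N = 1, sigma^2 = 1
    shrink = max(0.0, 1.0 - (p - 2) / sq_norm(x))    # positive-part James-Stein factor
    js = [shrink * xi for xi in x]
    mse_mle += sq_norm([a - b for a, b in zip(x, mu)]) / trials
    mse_js += sq_norm([a - b for a, b in zip(js, mu)]) / trials

assert mse_js < mse_mle   # James-Stein beats the MLE in average squared error
```

With these settings the MLE's risk is p = 10, while shrinking toward the origin cuts the average squared error substantially.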

Wednesday, January 14, 2015

The need for measure theory

The problem with measure arises when one needs to decompose a body into a (possibly uncountable) number of components and reassemble it after some action on those components.  Even when one restricts attention to just finite partitions, one still runs into trouble.  The most striking example is the Banach-Tarski paradox, which shows that a unit ball B in three dimensions can be disassembled into a finite number of pieces and reassembled to form two disjoint copies of the ball B.

Such pathological sets almost never come up in practical applications of mathematics.  Because of this, the standard solution to the problem of measure has been to abandon the goal of measuring every subset E of \mathbb{R}^d and instead to settle for only measuring a certain subclass of
non-pathological subsets of \mathbb{R}^d, referred to as the measurable sets.

The most fundamental properties of a measure are
  1. finite or countable additivity
  2. translation invariance
  3. rotation invariance
The concept of Jordan measure (closely related to that of the Riemann and Darboux integrals) is sufficient for undergraduate-level analysis.

However, the types of sets that arise in analysis, and in particular those sets that arise as limits of other sets, require an extended concept of measurability (Lebesgue measurability).

Lebesgue theory is viewed as a completion of the Jordan-Darboux-Riemann theory.  It keeps almost all of the desirable properties of Jordan measure, but with the crucial additional property that many features of the Lebesgue theory are preserved under limits.

Probability space concepts

A field is a non-empty class of subsets of \Omega closed under finite union and complements.  A synonym for field is algebra.

From de Morgan's laws a field is also closed under finite intersection.

A \sigma-field \mathcal{B} is a non-empty class of subsets of \Omega closed under countable union, countable intersection and complements.  A synonym for \sigma-field is \sigma-algebra.

In probability theory, the event space is a \sigma-field.  This allows us enough flexibility in constructing new events from old ones (closure), but not so much flexibility that we have trouble assigning probabilities to the elements of the \sigma-field.

For the reals, we start with sets to which we know how to assign probabilities.

Suppose \Omega=\mathbb{R} and let

\mathcal{C}=\{(a,b] : -\infty \leq a \leq b < \infty \}

The Borel \sigma-field is defined as

\mathcal{B}(\mathbb{R}) \equiv \sigma(\mathcal{C})

Also one can show that

\mathcal{B}(\mathbb{R}) = \sigma(\text{open sets in } \mathbb{R})

Wednesday, January 07, 2015

Russell's Paradox

Shortly after the turn of the 20th century, Bertrand Russell demonstrated a hole in the mathematical logic of the set theory of the time.  Can a set be a member of itself?  Consider the set

R = \{S \mid S \notin S\}

The set R contains all sets that do not have themselves as members.

However, is R a member of itself?

If R \in R, then by the definition of R it is a set that does not have itself as a member, so R \notin R.

But if R \notin R, then R satisfies the defining condition of R, so R \in R.  Either way we arrive at a contradiction.

At that point in time, it created a huge stir in the mathematical community, since much of mathematics was built upon the foundation of sets.