
Tuesday, December 22, 2015

Notes on transform domain adaptation (LMS)

Transform domain adaptive filter (e.g. DFT-LMS, DCT-LMS)
  • Performance of LMS is sensitive to the eigenvalue spread of the input covariance matrix.
  • The smallest eigenvalue contributes to slower convergence and the largest eigenvalue limits the range of allowed step-sizes, thus limiting the learning abilities of the filter.
A DFT/DCT-type transformation allows one to whiten the input sequence in the transform domain without needing to know the correlation matrix of the sequence (as the KLT would require), which may not even be stationary.

[Hodgkiss, 1979] has shown the conditions under which time domain and frequency domain processing is equivalent.  
[Beaufays, 1995] showed analytically that the eigenvalue distribution of a Markov-1 sequence with correlation coefficient \rho has asymptotic eigenvalue spread of  [interesting matrix theory]
\left(\frac{1+\rho}{1-\rho}\right)^2, before transformation 
\frac{1+\rho}{1-\rho}, after DFT and power normalization 
1+\rho, after DCT and power normalization
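These three spreads can be checked numerically (a sketch with numpy; the Markov-1 covariance, the unitary DFT, and an orthonormal DCT-II matrix are constructed explicitly, and "power normalization" means scaling the transform-domain matrix to unit diagonal):

```python
import numpy as np

def eig_spread(R):
    """Ratio of largest to smallest eigenvalue of a Hermitian matrix."""
    w = np.linalg.eigvalsh((R + R.conj().T) / 2)
    return w.max() / w.min()

def power_normalize(R):
    """Scale a covariance matrix to unit diagonal (power normalization)."""
    d = np.sqrt(np.real(np.diag(R)))
    return R / np.outer(d, d)

rho, M = 0.9, 256
n = np.arange(M)
R = rho ** np.abs(np.subtract.outer(n, n))       # Markov-1 covariance: R[i,j] = rho^|i-j|

F = np.exp(-2j * np.pi * np.outer(n, n) / M) / np.sqrt(M)   # unitary DFT matrix
T = np.sqrt(2.0 / M) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / M)
T[0, :] = np.sqrt(1.0 / M)                       # orthonormal DCT-II matrix

s_time = eig_spread(R)                                    # asymptotically ((1+rho)/(1-rho))^2
s_dft = eig_spread(power_normalize(F @ R @ F.conj().T))   # asymptotically (1+rho)/(1-rho)
s_dct = eig_spread(power_normalize(T @ R @ T.T))          # asymptotically 1+rho
```

For \rho = 0.9 the three asymptotic values are 361, 19 and 1.9, so the DCT-domain filter sees a far better conditioned input.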

Monday, December 21, 2015

Random sequence whitening


  • Orthogonal decomposition (e.g. EVD) and triangular decomposition (e.g. LDU) decorrelate a signal sequence.
  • LDU decomposition allows us to whiten a signal causally (since B = L^{-1} is a causal linear transform).
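A small numerical sketch of causal whitening (assuming a known covariance R): the Cholesky factor L of R = LL^T is lower triangular, so w = L^{-1}x depends only on past and present samples, and the whitened covariance is the identity.

```python
import numpy as np

M, rho = 8, 0.8
# Markov-1 covariance as a concrete example
R = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))

L = np.linalg.cholesky(R)      # R = L L^T with L lower triangular (causal)
B = np.linalg.inv(L)           # whitening transform B = L^{-1}, also lower triangular

# Covariance of w = B x is B R B^T = I
W = B @ R @ B.T
assert np.allclose(W, np.eye(M), atol=1e-10)
```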

Notes on Discrete Karhunen-Loeve Transform (DKLT) and DFT


  • In general, the DKLT is obtained from eigenvalue decomposition.
  • The correlation matrix of a stationary process is Toeplitz.
  • If the autocorrelation sequence of a random process is periodic with fundamental period M, its correlation matrix becomes circulant.
  • The DFT provides the DKLT of periodic random sequences.
    • This can be easily seen because DFT defines the complete set of eigenvectors for all circulant matrices.
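A quick numerical confirmation (sketch): a circulant correlation matrix built from a periodic autocorrelation is diagonalized by the unitary DFT matrix.

```python
import numpy as np

M = 8
# periodic autocorrelation (symmetric: r[k] = r[M-k])
r = np.array([3.0, 1.5, 0.5, 0.2, 0.1, 0.2, 0.5, 1.5])
# circulant correlation matrix: C[i, j] = r[(i - j) mod M]
C = r[np.subtract.outer(np.arange(M), np.arange(M)) % M]

F = np.fft.fft(np.eye(M)) / np.sqrt(M)  # unitary DFT matrix
D = F @ C @ F.conj().T                  # diagonal: DFT vectors are the eigenvectors

off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))         # numerically zero
```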

Thursday, December 10, 2015

Notes on Globecom 2015


Monday 12/7/2015 

Morning session 
  • Suppression of Analog Self-Interference Canceller Nonlinearities in MIMO Full Duplex 
    • Use of nonlinear modeling to cancel nonlinear behavior of the RF chain
  • Receive Spatial Modulation (Marco Di Renzo) 
    • New ideas for MIMO Broadcast Channel 
  • A 60GHz LOS MIMO Backhaul Design Combining Spatial Multiplexing and Beamforming for a 100Gbps Throughput (Gerhard Fettweis' group)
    • Deterministic Spatial Multiplexing (Antenna spacing determined by the distance between TX and RX)
Noon session (mmWave)
  • Adaptive One-bit Compressive Sensing with Application to Low-Precision Receivers at mmWave (joint work University of Vigo, Spain and Robert Heath)
    • Compressive sensing for estimating channel of 1-bit receiver
Afternoon session (Massive MIMO)
  • Exploiting the Tolerance of Massive MIMO to Incomplete CSI for Low-Complexity Transmission (UCL)
    • Exploit antenna correlation by training a subset of antennas and interpolate missing information.  (Perhaps CS based schemes can be used?)
  • Downlink Performance and User Scheduling of HetNet with Large-Scale Antenna Arrays (University of Victoria)
    • Lower bound analysis on Hetnet.  (Useful technique)
  • Large System Analysis of Base Station Cooperation in the Downlink (Luca Sanguinetti, Couillet and Debbah)
    • New RMT results for DL BS cooperation with imperfect CSI 
Tuesday 12/8/2015 

Morning session (Massive MIMO)
  • Polynomial-Expansion Multi-Cell Aware Detector for Uplink Massive MIMO Systems with Imperfect CSI (Germany)
    • replacing matrix inverse with polynomial expansion
  • Joint Use of H-inf Criterion in Channel Estimation and Precoding to Mitigate Pilot Contamination in Massive MIMO Systems (Northeastern University, China)
    • H-inf Criterion used to design CE and Precoding (minimax like criterion, minimize worst case penalty)
  • Location-Aided Pilot Contamination Elimination for Massive MIMO Systems (Chalmers, Sweden)
    • estimate AoA and AS to compute Covariance matrix.
  • Data-Assisted Massive MIMO Uplink Transmission with Large Backhaul Cooperation Delay: Scheme Design and System-Level Analysis (Very good analysis)
    • Successively use data to improve channel estimation.   
    • Develop network model with fixed BS location using stochastic geometry
    • Shifted Zadoff Chu sequence used (do not reuse pilots)
Noon session 
  • Maximal Ratio Transmission in Wireless Poisson Networks under Spatially Correlated Fading Channels
    • Stochastic geometry.  Correlated channel.  Spatial correlation is SINR dependent.
  • Performance Analysis of Single-Carrier Modulation with Correlated Large-Scale Antennas (Geoffrey Li, Georgia Tech) 
    • Propose SC over OFDM in massive MIMO.  No equalizer is needed under the condition of an i.i.d. channel.
Afternoon session
  • Multi-switch for antenna selection in massive MIMO
    • binary switching in antenna selection (have small loss in capacity)
  • A Multi-cell MMSE Precoder for Massive MIMO Systems and New Large System Analysis (Emil Bjornson)  Good analysis
    • Multicell MMSE precoder for TDD massive MIMO using idea of Pilot reuse.  Exploit UL/DL duality to come up with precoder design that spans B directions only.  
  • Harmonized Cellular and Distributed Massive MIMO: Load Balancing and Scheduling (DOCOMO)
    • Distributed massive MIMO with Hetnet, scheduling using blanking etc.  
    • the research is informed by industry knowledge 

Thursday, December 03, 2015

Multi-cluster massive MIMO

Ideas 
  • [12/3/2015] It might be worthwhile to look into the design of the transmit unitary matrix.
Challenges
  • When interfering channel is present, any use of the interfering channel will add to the noise temperature of the environment.

Key Concepts
  • Successive decoding achieves the capacity region of degraded broadcast channel.  
    • Scalar broadcast channel is a degraded broadcast channel.
  • MISO channel: With only covariance feedback, water filling along the eigenvectors of the channel covariance achieves capacity.  Specifically, independent complex circular Gaussian inputs along the eigenvectors of Σ are optimal.
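A minimal water-filling sketch for that last point (an assumed helper, not from the original notes; lam holds the eigenvalues of Σ, P is the power budget, and the water level is found by bisection):

```python
import numpy as np

def waterfill(lam, P, tol=1e-12):
    """Allocate power P over parallel channels with gains lam:
    p_i = max(0, mu - 1/lam_i) with water level mu chosen so sum(p) = P."""
    lo, hi = 0.0, P + 1.0 / lam.min()   # hi guarantees sum(p) >= P
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - 1.0 / lam).sum() > P:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - 1.0 / lam)

lam = np.array([2.0, 1.0, 0.25])        # example eigenvalues of the covariance
p = waterfill(lam, P=1.0)               # strongest eigenmode gets the most power
rate = np.sum(np.log2(1.0 + lam * p))
```

For these numbers only the two strongest eigenmodes are active: the water level works out to 1.25, giving p = (0.75, 0.25, 0).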

Tuesday, November 24, 2015

Quasiconvex functions

A function f is quasiconvex iff \text{dom}\, f is convex and for any x, y \in \text{dom}\, f and 0 \le \theta \le 1
f(\theta x + (1-\theta)y) \le \max\{f(x), f(y)\}  A continuous function f: \mathbb{R} \rightarrow \mathbb{R} is quasiconvex iff at least one of the following conditions holds

  • f is nondecreasing
  • f is nonincreasing
  • there is a point c \in \text{dom}\, f such that for t \le c, f is nonincreasing, and for t \ge c, f is nondecreasing.

First order conditions

A continuously differentiable function f: \mathbb{R}^n \rightarrow \mathbb{R} is quasiconvex iff \text{dom}\, f is convex and for all x, y \in \text{dom}\, f 
f(y) \le f(x) \;\Rightarrow\; \nabla f(x)^T (y-x) \le 0
Operations that preserve quasiconvexity
  • nonnegative weighted maximum
    The function f(x) = \sup_{y\in C} \left( w_y g_y(x) \right) is quasiconvex, where w_y \ge 0 and g_y(x) is a parameterized family of quasiconvex functions.
  • composition
    • if g is quasiconvex and h is nondecreasing, then f = h \circ g is quasiconvex
    • composition of a quasiconvex function with an affine or linear-fractional transformation yields a quasiconvex function.
  • minimization
    if f(x,y) is quasiconvex jointly in x and y and C is a convex set, then the function
    g(x) = \inf_{y\in C} f(x,y) is quasiconvex.
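The defining inequality is easy to spot-check numerically for a function that is quasiconvex but not convex, e.g. f(x) = \sqrt{|x|} (a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sqrt(np.abs(x))   # quasiconvex (interval sublevel sets), not convex

x = rng.uniform(-10, 10, 1000)
y = rng.uniform(-10, 10, 1000)
theta = rng.uniform(0, 1, 1000)

lhs = f(theta * x + (1 - theta) * y)
rhs = np.maximum(f(x), f(y))
assert np.all(lhs <= rhs + 1e-12)  # f(theta x + (1-theta) y) <= max{f(x), f(y)}
```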

Wednesday, November 18, 2015

Gaussian interference channel


The sum capacity is known for two special cases:

Strong interference: Capacity is achieved by decoding and canceling the interference before decoding the desired signal.  The condition is defined as I_2 \ge S_1 and I_1 \ge S_2
R_1 \le C(S_1), \quad R_2 \le C(S_2), \quad R_1 + R_2 \le \min\{C(S_1+I_1), C(S_2+I_2)\}

Weak interference:  Capacity is achieved by treating interference as Gaussian noise
C_{sum} = C\left(\frac{S_1}{1+I_1}\right) + C\left(\frac{S_2}{1+I_2}\right)


Gaussian broadcast channel

Two users case
Scalar Gaussian broadcast channel belongs to the class of degraded broadcast channel.  (The users can be absolutely ranked by their channel strength)  

When the transmitter has more than one antenna, the Gaussian broadcast channel is non-degraded.

A sender of power P transmits to two receivers, one with Gaussian noise power N_1 and the other with N_2.  Assume N_1 < N_2.   The capacity region is
R_1 < C\left(\frac{\alpha P}{N_1}\right), \quad R_2 < C\left(\frac{(1-\alpha)P}{\alpha P + N_2}\right) where 0 \le \alpha \le 1

The weaker receiver Y2 decodes its own message treating the other user's message as interference.  The stronger receiver Y1 decodes Y2's message and cancels it before decoding his own.
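A short sketch of the superposition-coding boundary (C(x) = (1/2)log2(1+x); the power and noise values are illustrative assumptions):

```python
import numpy as np

C = lambda snr: 0.5 * np.log2(1.0 + snr)
P, N1, N2 = 10.0, 1.0, 4.0            # N1 < N2: receiver 1 is the stronger user

alpha = np.linspace(0.0, 1.0, 101)    # fraction of power given to the strong user
R1 = C(alpha * P / N1)
R2 = C((1.0 - alpha) * P / (alpha * P + N2))

# Sweeping alpha traces the boundary: R1 grows while R2 shrinks
assert np.all(np.diff(R1) > 0) and np.all(np.diff(R2) < 0)
```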

Note:  The role of transmitter side information reduces with the growth in the number of TX antennas.  With either CSIT or CDIT and under the ZMSW model, the asymptotic growth of capacity is linear in \min(M,K).


Monday, November 02, 2015

weighted arithmetic-geometric mean inequality

Generalized form of the AM-GM inequality
\sum_i \alpha_i v_i \ge \prod_i v_i^{\alpha_i} where v \succeq 0 and \alpha \succeq 0, \; \mathbf{1}^T\alpha = 1.
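A numeric spot check of the inequality (random nonnegative v and weights drawn from a Dirichlet distribution so they sum to one):

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.uniform(0.0, 10.0, size=(1000, 5))    # v >= 0
a = rng.dirichlet(np.ones(5), size=1000)      # a >= 0, rows sum to 1

am = np.sum(a * v, axis=1)        # weighted arithmetic mean
gm = np.prod(v ** a, axis=1)      # weighted geometric mean
assert np.all(am >= gm - 1e-9)    # AM >= GM in every trial
```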

Tuesday, October 27, 2015

Karush-Kuhn-Tucker conditions for non-convex problems

Consider a problem (ICP) with equality and inequality constraints:
minimize f(x) subject to h_i(x)=0, \; i=1,\dots,m, \quad g_j(x) \le 0, \; j=1,\dots,r where f, h_i, g_j are continuously differentiable functions from \mathbb{R}^n to \mathbb{R}.

With Lagrange function
L(x,\lambda,\mu) = f(x) + \sum_{i=1}^m \lambda_i h_i(x) + \sum_{j=1}^r \mu_j g_j(x)
Define the set of active inequality constraints:
A(x) = \{ j \,|\, g_j(x) = 0 \} A feasible vector x is regular if the equality constraint gradients \nabla h_i(x), \; i=1,\dots,m, and the active inequality constraint gradients \nabla g_j(x), \; j \in A(x) are linearly independent.

KKT optimality conditions

KKT necessary conditions for regular x

Let x^* be a local minimum of the problem
minimize f(x) subject to h_i(x)=0, \; i=1,\dots,m, \quad g_j(x) \le 0, \; j=1,\dots,r where f, h_i, g_j are continuously differentiable functions from \mathbb{R}^n to \mathbb{R} and x^* is regular. There exist unique Lagrange multiplier vectors \lambda^* = (\lambda_1^*,\dots,\lambda_m^*), \mu^* = (\mu_1^*,\dots,\mu_r^*) such that
\nabla_x L(x^*,\lambda^*,\mu^*) = 0, \quad \mu_j^* \ge 0, \; j=1,\dots,r, \quad \mu_j^* = 0, \; \forall j \notin A(x^*)  Note: The condition can be written as
\mu_j^* g_j(x^*) = 0, \; j=1,\dots,r and is aptly referred to as the complementary slackness condition.

If in addition f, h, and g are twice continuously differentiable, there holds
y^T \nabla^2_{xx} L(x^*,\lambda^*,\mu^*) y \ge 0 for all y \in \mathbb{R}^n such that
\nabla h_i(x^*)^T y = 0, \; i=1,\dots,m, \quad \nabla g_j(x^*)^T y = 0, \; j \in A(x^*)
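A tiny worked example (assumed here for illustration): minimize f(x) = x_1^2 + x_2^2 subject to h(x) = x_1 + x_2 - 1 = 0 and g(x) = -x_1 \le 0.  The minimum is x^* = (1/2, 1/2), the inequality is inactive so \mu^* = 0, and stationarity gives \lambda^* = -1.

```python
import numpy as np

x = np.array([0.5, 0.5])          # candidate minimum
lam, mu = -1.0, 0.0               # multipliers for h and g

grad_f = 2.0 * x                  # gradient of x1^2 + x2^2
grad_h = np.array([1.0, 1.0])     # gradient of x1 + x2 - 1
grad_g = np.array([-1.0, 0.0])    # gradient of -x1

# stationarity, feasibility, sign and complementary slackness
stat = grad_f + lam * grad_h + mu * grad_g
assert np.allclose(stat, 0.0)
assert abs(x.sum() - 1.0) < 1e-12 and -x[0] <= 0.0
assert mu >= 0.0 and mu * (-x[0]) == 0.0
```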


Wednesday, October 14, 2015

Braess' paradox

An interesting anomaly in non-cooperative behavior.  (An example where Nash equilibrium is not optimal) See link

Monday, September 28, 2015

Weak and strong alternatives

Weak alternatives
  • Two systems of inequalities are called weak alternatives if at most one of the two is feasible.
  • This is true whether or not the inequalities of system 1 are convex (i.e. fi convex, hi affine)
  • System 2 is always convex (g is concave and the constraint \lambda \succeq 0 is convex)
Lagrange duality theory can be applied to the problem of determining feasibility of a system of inequalities and equalities
f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x) = 0, \; i=1,\dots,p
The above can be posed as the standard problem with objective f0=0:
minimize 0 subject to f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
This problem has optimal value
p^* = \begin{cases} 0 & \text{if the system is feasible} \\ \infty & \text{if the system is infeasible} \end{cases}
We associate with the inequality system the dual function
g(\lambda,\nu) = \inf_{x\in D} \left( \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right)
Note that the dual function is positive homogeneous, i.e. for \alpha > 0, g(\alpha\lambda, \alpha\nu) = \alpha g(\lambda,\nu).   The dual problem of maximizing g(\lambda,\nu) subject to \lambda \succeq 0 has the optimal value
d^* = \begin{cases} \infty & \lambda \succeq 0, \; g(\lambda,\nu) > 0 \text{ is feasible} \\ 0 & \lambda \succeq 0, \; g(\lambda,\nu) > 0 \text{ is infeasible} \end{cases}
From weak duality d^* \le p^* and the above facts, we can conclude that if the inequality system
\lambda \succeq 0, \; g(\lambda,\nu) > 0 is feasible, then d^* = \infty, and the original inequality system is infeasible (since p^* = \infty)

A solution to the dual function is a certificate of infeasibility of the system.  To summarize:
  • feasibility of system 2 implies infeasibility of system 1
  • feasibility of system 1 implies infeasibility of system 2

Strong alternatives

When the original inequality system is convex, and some type of constraint qualification holds, then the pairs of weak alternatives becomes strong alternatives, which means exactly one of the two alternatives holds.

If system 1 is convex, it can be expressed as
f_i(x) \le 0, \; i=1,\dots,m, \quad Ax = b, \; A \in \mathbb{R}^{p\times n}
This system and its alternative
\lambda \succeq 0, \quad g(\lambda,\nu) > 0
are strong alternatives provided there exists an x \in \text{relint}\, D with Ax=b and the optimal value p^* is attained.  [More on that later...]

See Farkas' lemma

Friday, September 25, 2015

Karush-Kuhn-Tucker (KKT) conditions

If strong duality (with zero duality gap) holds, x^*, \lambda^*, \nu^* (possibly more than one such tuple) are optimal (see previous post), and the problem has differentiable f_i, h_i, then they must satisfy the KKT conditions: [necessary conditions]
  1. primal constraints: f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
  2. dual constraints: \lambda \succeq 0
  3. complementary slackness: \lambda_i f_i(x) = 0, \; i=1,\dots,m
  4. gradient of Lagrangian with respect to x vanishes:
    \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + \sum_{i=1}^p \nu_i \nabla h_i(x) = 0
For convex problems

Key: When the primal problem is convex, the KKT conditions are also sufficient for points to be primal and dual optimal.

If \tilde{x}, \tilde{\lambda}, \tilde{\nu} satisfy KKT for a convex problem (i.e. f_i convex, h_i affine), then they are optimal, with zero duality gap:
  • from complementary slackness: f_0(\tilde{x}) = L(\tilde{x},\tilde{\lambda},\tilde{\nu})
  • from the 4th condition (and convexity): g(\tilde{\lambda},\tilde{\nu}) = L(\tilde{x},\tilde{\lambda},\tilde{\nu})
hence, f_0(\tilde{x}) = g(\tilde{\lambda},\tilde{\nu})


If Slater's condition is satisfied:

x is optimal iff there exist \lambda, \nu that satisfy the KKT conditions
  • Slater implies strong duality, therefore the dual optimum is attained
  • generalizes the optimality condition \nabla f_0(x) = 0 for unconstrained problems
Key: Many algorithms for convex optimization are conceived as methods for solving the KKT conditions.

Slater's condition

If the primal problem is convex i.e. of the form
minimize f_0(x) subject to f_i(x) \le 0, \; i=1,\dots,m, \quad Ax=b with f_0,\dots,f_m convex, and if there exists a point that is strictly feasible, or more precisely an x \in \text{relint}\, D such that
f_i(x) < 0, \; i=1,\dots,m, \quad Ax = b.
Strong duality holds if the above condition is met.
Also, the dual optimum is attained (d^* > -\infty)

Thursday, September 24, 2015

Notes on duality

Given a standard form problem (not necessarily convex)
minimize f_0(x) subject to f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
variable x \in \mathbb{R}^n, domain D, optimal value p^*
with Lagrangian:
L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x)
and the (Lagrange) dual function:
g(\lambda,\nu) = \inf_{x\in D} L(x,\lambda,\nu)
Properties:

  • g is concave in (\lambda,\nu) since it is the infimum of a family of affine functions.
  • Lower bound property: if \lambda \succeq 0 then g(\lambda,\nu) \le p^*

since, if \tilde{x} is feasible and \lambda \succeq 0, then
g(\lambda,\nu) = \inf_{x\in D} L(x,\lambda,\nu) \le L(\tilde{x},\lambda,\nu) \le f_0(\tilde{x}) minimizing over all feasible \tilde{x} gives g(\lambda,\nu) \le p^*

Lagrange dual and conjugate function

Given an optimization problem with linear inequality and equality constraints
minimize f_0(x) subject to Ax \preceq b, \quad Cx = d.
Using the definition f^*(y) = \sup_{x \in \text{dom}\, f} (y^T x - f(x)), we can write the dual function as:
g(\lambda,\nu) = \inf_x \left( f_0(x) + \lambda^T(Ax-b) + \nu^T(Cx-d) \right) = -b^T\lambda - d^T\nu + \inf_x \left( f_0(x) + (A^T\lambda + C^T\nu)^T x \right) = -b^T\lambda - d^T\nu - f_0^*(-A^T\lambda - C^T\nu) with domain
\text{dom}\, g = \{ (\lambda,\nu) \,|\, -A^T\lambda - C^T\nu \in \text{dom}\, f_0^* \}

  • simplifies derivation of dual if conjugate of f0 is known

The dual problem (finding the greatest lower bound for the primal problem)
maximize g(\lambda,\nu) subject to \lambda \succeq 0
  • a convex optimization problem; optimal value denoted d^*
  • \lambda, \nu are dual feasible if \lambda \succeq 0, (\lambda,\nu) \in \text{dom}\, g

Weak duality: d^* \le p^*

  • always holds (for convex and nonconvex problems)
  • can be used to find lower bounds for difficult problems

Strong duality: d^* = p^* (zero duality gap)

  • does not hold in general
  • (usually) holds for convex problems
  • constraint qualifications assert conditions that guarantee strong duality in convex problems (e.g. Slater's constraint qualification)

Notes on dual norm

Dual norm is defined as
\|z\|_* = \sup\{ z^T x \,|\, \|x\| \le 1 \} The dual norm can be interpreted as the operator norm of z^T, interpreted as a 1 \times n matrix, with the norm \|\cdot\| on \mathbb{R}^n.

From the definition of dual norm we obtain the inequality
z^T x \le \|x\| \|z\|_*
The conjugate of any norm f(x) = \|x\| is the indicator function of the dual norm unit ball
f^*(y) = \begin{cases} 0 & \|y\|_* \le 1 \\ \infty & \text{otherwise} \end{cases}
Proof. If \|y\|_* > 1, then by definition of the dual norm, there is a z \in \mathbb{R}^n with \|z\| \le 1 and y^T z > 1.  Taking x = tz and letting t \rightarrow \infty, we have
y^T x - \|x\| = t(y^T z - \|z\|) \rightarrow \infty which shows that f^*(y) = \infty.  Conversely, if \|y\|_* \le 1, then we have y^T x \le \|x\| \|y\|_* \le \|x\| for all x, which implies y^T x - \|x\| \le 0 for all x.  Therefore x=0 is the value that maximizes y^T x - \|x\|, with maximum value 0.
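As a concrete example (assumed here, not in the original note): the dual of the \ell_1 norm is the \ell_\infty norm, and the supremum over the \ell_1 unit ball is attained at a signed coordinate vector. A numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=5)

# random points on the l1 unit sphere never beat ||z||_inf ...
xs = rng.normal(size=(100000, 5))
xs /= np.abs(xs).sum(axis=1, keepdims=True)
approx = np.max(xs @ z)
assert approx <= np.max(np.abs(z)) + 1e-12

# ... and the sup is attained at x = sign(z_k) e_k for k = argmax |z_k|
k = np.argmax(np.abs(z))
e = np.zeros(5)
e[k] = np.sign(z[k])
assert np.isclose(e @ z, np.max(np.abs(z)))
```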

Wednesday, September 23, 2015

Schur complement: characterizations of positive definiteness for block matrices

Let a block matrix X \in S^n be partitioned as

X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} where A \in S^k.  If \det A \neq 0, the Schur complement is
S = C - B^T A^{-1} B The following characterizations of positive definiteness can be derived
  • X \succ 0 iff A \succ 0 and S \succ 0
  • If A \succ 0, then X \succeq 0 iff S \succeq 0
Example:
A^T A \preceq t^2 I, \; t \ge 0 \;\Leftrightarrow\; tI - t^{-1} A^T A \succeq 0, \; t > 0 \;\Leftrightarrow\; \begin{bmatrix} tI & A \\ A^T & tI \end{bmatrix} \succeq 0
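The example is the spectral-norm characterization \|A\| \le t \iff the block matrix [tI, A; A^T, tI] is PSD; a quick numerical check (sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4))
sigma_max = np.linalg.norm(A, 2)          # largest singular value of A

def block_psd(A, t):
    """Is [[tI, A], [A^T, tI]] positive semidefinite?"""
    m, n = A.shape
    X = np.block([[t * np.eye(m), A], [A.T, t * np.eye(n)]])
    return np.linalg.eigvalsh(X).min() >= -1e-10

assert block_psd(A, sigma_max + 0.1)      # t above the norm: PSD
assert not block_psd(A, sigma_max - 0.1)  # t below the norm: not PSD
```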

Tuesday, September 22, 2015

Notes on optimization problem

Optimality criterion for differentiable f0
Given a convex optimization problem with objective f_0, x is optimal iff it is feasible and
\nabla f_0(x)^T (y-x) \ge 0 \quad \text{for all feasible } y

if \nabla f_0(x) \neq 0, it means that -\nabla f_0(x) defines a supporting hyperplane to the feasible set at x.

Unconstrained problem: x is optimal iff
x \in \text{dom}\, f_0, \quad \nabla f_0(x) = 0
Equality constrained problem:
minimize f0(x) subject to Ax=b
x is optimal iff there exists a ν such that
x \in \text{dom}\, f_0, \quad Ax = b, \quad \nabla f_0(x) + A^T\nu = 0
Note: Lagrange multiplier optimality condition.

Minimization over nonnegative orthant
minimize f0(x) subject to x0
x is optimal iff
x \in \text{dom}\, f_0, \quad x \succeq 0, \quad \begin{cases} \nabla f_0(x)_i \ge 0 & x_i = 0 \\ \nabla f_0(x)_i = 0 & x_i > 0 \end{cases}
Note: The last condition is a complementarity condition.


Monday, September 21, 2015

Notes on conjugate function (Fenchel conjugate)

The conjugate function (Fenchel conjugate)
  • Counterpart to Fourier transform in Convex Analysis
The conjugate of f (f not necessarily convex):
f^*(y) = \sup_{x \in \text{dom}\, f} (y^T x - f(x))

The domain of f^* consists of y \in \mathbb{R}^n for which the supremum is finite, i.e. for which the difference y^T x - f(x) is bounded above on \text{dom}\, f.

f^* is convex, since it is the pointwise supremum of a family of affine functions of y.

Examples:
  • Affine function. f(x) = ax + b.  f^*(y) = \sup_x (yx - ax - b) = \sup_x ((y-a)x - b).  

f^*(y) = \begin{cases} -b & y = a \\ \infty & \text{otherwise} \end{cases} with \text{dom}\, f^* = \{a\}.

  • Negative logarithm. f(x) = -\log x, with \text{dom}\, f = \mathbb{R}_{++}.  

The function xy + \log x is unbounded above if y \ge 0 and reaches its maximum at x = -1/y otherwise.
f^*(y) = \begin{cases} -1 - \log(-y) & y < 0 \\ \infty & \text{otherwise} \end{cases} with \text{dom}\, f^* = \{ y \,|\, y < 0 \}

Fenchel's inequality

Fenchel's inequality can be obtained from the definition of conjugate function
f(x) + f^*(y) \ge x^T y, \quad \forall x, y
since for each y,
f^*(y) \ge x^T y - f(x) \;\Rightarrow\; f(x) + f^*(y) \ge x^T y

Notes on convex functions

Basic Properties

[TODO]

Examples
  • Affine functions are both convex and concave
  • All norms are convex (e.g. Frobenius norm, Spectral norm)
  • Max function f(x) = \max_i x_i is convex on \mathbb{R}^n
  • Quadratic-over-linear function f(x,y) = x^2/y is convex on \text{dom}\, f = \mathbb{R} \times \mathbb{R}_{++}
  • Log-sum-exp f(x) = \log \sum_{i=1}^n e^{x_i} is convex on \mathbb{R}^n
  • Geometric mean f(x) = \left( \prod_{i=1}^n x_i \right)^{1/n} is concave on \text{dom}\, f = \mathbb{R}^n_{++}
  • Log-det is concave on \text{dom}\, f = S^n_{++}
Restriction of a convex function to a line

f: \mathbb{R}^n \rightarrow \mathbb{R} is convex iff the function g: \mathbb{R} \rightarrow \mathbb{R}
g(t) = f(x + tv), \quad \text{dom}\, g = \{ t \,|\, x + tv \in \text{dom}\, f \}
is convex in t for any x \in \text{dom}\, f, v \in \mathbb{R}^n

Upshot: can check convexity of f by checking convexity of functions of one variable.
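A numerical sketch of this check for log-sum-exp: the discrete second differences of g(t) = f(x + tv) along any random line should be nonnegative.

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.log(np.sum(np.exp(x)))   # log-sum-exp, convex on R^n

x, v = rng.normal(size=4), rng.normal(size=4)
t = np.linspace(-2.0, 2.0, 401)
g = np.array([f(x + ti * v) for ti in t])

# discrete second differences of a convex function are nonnegative
d2 = np.diff(g, 2)
assert np.all(d2 >= -1e-10)
```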

Epigraph and sublevel set 

\alpha-sublevel set of f: \mathbb{R}^n \rightarrow \mathbb{R} is defined as all x for which the value of f(x) is at most \alpha
C_\alpha = \{ x \in \text{dom}\, f \,|\, f(x) \le \alpha \}

Property: sublevel sets of convex functions are convex (converse is false)

epigraph of f:RnR:
\text{epi}\, f = \{ (x,t) \in \mathbb{R}^{n+1} \,|\, x \in \text{dom}\, f, \; f(x) \le t \}

Property: f is convex iff  epi f is a convex set

Operations that preserve convexity

methods for establishing convexity of a function
  1. verify definition (often simplified by restricting to a line)
  2. for twice differentiable functions, show \nabla^2 f(x) \succeq 0
  3. show that f is obtained from simple convex functions by operations that preserve convexity
    • nonnegative weighted sum
    • composition with affine function
    • pointwise maximum and supremum
    • composition
    • minimization
    • perspective

Notes on convex sets

Cones

Definition of a cone is quite general:

A set C is a cone if for every x \in C and \theta \ge 0, we have \theta x \in C.  (Note that a subspace is a cone, since all nonnegative scalings of points in the subspace belong to the subspace)

A set C is a convex cone if it is convex and a cone.

A proper cone on the other hand is more strict.

A cone KRn is a proper cone if it satisfies the following

  • K is convex
  • K is closed
  • K is solid, which means it has nonempty interior.
  • K is pointed, which means that it contains no line.

Dual cones and generalized inequalities

Dual cones of a cone K:
K^* = \{ y \,|\, y^T x \ge 0 \text{ for all } x \in K \}
Dual cones of proper cones are proper, so a dual cone defines generalized inequalities in the dual domain
y \succeq_{K^*} 0 \iff y^T x \ge 0 \text{ for all } x \succeq_K 0

Characterization of minimum and minimal elements via dual inequalities

minimum element w.r.t. K

Definition: x is the minimum element of S iff for all \lambda \succ_{K^*} 0, x is the unique minimizer of \lambda^T z over S

minimal element w.r.t. K

Definitions:

  • if x minimizes \lambda^T z over S for some \lambda \succ_{K^*} 0, then x is minimal
  • if x is a minimal element of a convex set S, then there exists a nonzero \lambda \succeq_{K^*} 0 such that x minimizes \lambda^T z over S

Thursday, July 30, 2015

Useful Matrix properties

Theorem

Let A be an m \times n matrix, B and C n \times n matrices, with B symmetric, and x an n \times 1 vector
  1. Ax = 0 for all x if and only if A = 0,
  2. x^T B x = 0 for all x if and only if B = 0,
  3. x^T C x = 0 for all x if and only if C^T = -C (skew symmetric).
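Items 2 and 3 are easy to check numerically (a sketch): only the symmetric part of a matrix contributes to the quadratic form, so x^T C x vanishes identically exactly when C is skew-symmetric.

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
C = M - M.T                       # skew-symmetric: C^T = -C

xs = rng.normal(size=(1000, 4))
q = np.einsum('ni,ij,nj->n', xs, C, xs)
assert np.allclose(q, 0.0)        # x^T C x = 0 for every x

B = M + M.T                       # symmetric and nonzero
qb = np.einsum('ni,ij,nj->n', xs, B, xs)
assert not np.allclose(qb, 0.0)   # so x^T B x = 0 for all x forces B = 0
```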

Wednesday, July 22, 2015

Concentration of measure

To characterize the deviation of a random quantity (typically a general function of weakly correlated variables) from its mean.  

One can look at
  • the tail end probability (moment method and Chernoff's bound)
  • the stability about its mean (i.e. \mathbb{P}\{ \left| \frac{Z}{\mathbb{E}Z} -1 \right| > \epsilon\})
  • the bounded difference (\mathbb{P}[ | Z - \mathbb{E}Z | > t ])
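As a small illustration (a sketch, not from the original note): Z = \|g\|^2/n for g \sim N(0, I_n) has \mathbb{E}Z = 1, and the deviation probability shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(7)

def dev_prob(n, eps=0.1, trials=20000):
    """Empirical P{ |Z/E[Z] - 1| > eps } for Z = ||g||^2 / n, g ~ N(0, I_n)."""
    z = rng.chisquare(n, size=trials) / n   # E[Z] = 1, Var(Z) = 2/n
    return np.mean(np.abs(z - 1.0) > eps)

p_small, p_large = dev_prob(10), dev_prob(1000)
assert p_large < p_small    # much tighter concentration in higher dimension
```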

Monday, July 13, 2015

Circularly symmetric complex Gaussian

Special case: scalar variable Z \sim \mathcal{CN}(0,1)
f_Z(z) = \frac{1}{\pi} e^{-|z|^2} \qquad f_{|Z|,\Theta}(|z|,\theta) = \frac{1}{\pi} |z| e^{-|z|^2}
Note that the distribution is a function of the magnitude of Z, and therefore constant over all angles \theta.  To get the distribution of the magnitude of Z, we integrate over all angles -\pi \le \theta \le \pi.
f_{|Z|}(|z|) = \int f_{|Z|,\Theta}(|z|,\theta) \, d\theta = \int_{-\pi}^{\pi} d\theta \, \frac{1}{\pi} |z| e^{-|z|^2} = 2|z| e^{-|z|^2}, \quad |z| > 0
Define Y = |Z| then,
f_Y(y) = 2y e^{-y^2}, \quad y > 0
For the magnitude-squared distribution, we can perform a change of variables with X = Y^2 = g(Y); since P(\{y : y \le 0\}) = 0, we only have one region (the set \{y : y > 0\}) to worry about.
f_X(x) = f_Y(g^{-1}(x)) \left| \frac{\partial g^{-1}(x)}{\partial x} \right| = f_Y(\sqrt{x}) \left| \frac{\partial \sqrt{x}}{\partial x} \right| = \left( 2 x^{1/2} e^{-x} \right) \frac{1}{2} x^{-1/2} = e^{-x}, \quad x > 0
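A Monte Carlo sanity check of the final result (sketch): the magnitude squared of a \mathcal{CN}(0,1) sample should be Exp(1).

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200000
# CN(0,1): independent real and imaginary parts, each N(0, 1/2)
z = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2.0)

x = np.abs(z) ** 2                  # should be Exp(1): f_X(x) = e^{-x}
assert abs(x.mean() - 1.0) < 0.02                       # E[X] = 1
assert abs(np.mean(x > 1.0) - np.exp(-1.0)) < 0.01      # P(X > 1) = e^{-1}
```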
Note:
Here we assume the probability spaces (X, \mathcal{X}, P_X), (Y, \mathcal{Y}, P_Y) and that a mapping T exists that maps A \in \mathcal{X} to B \in \mathcal{Y}.
Start from the joint cdf in the X-space, define A \equiv \{ \sqrt{X^2+Y^2} \le |z|, \; \tan^{-1}(Y/X) \le \theta_0 \}
F_{|Z|,\Theta}(|z|,\theta_0) = P(A) = \underset{(x,y)\in A}{\iint} f(x,y) \, dx \, dy
With a change of variable, define B\equiv\{|Z|\leq |z|, \Theta \leq \theta_0\}
\begin{aligned} F_{|Z|,\Theta}(|z|,\theta_0) = P(B) &=  \underset{(r,\theta)\in B}{\iint} f(x(r,\theta), y(r,\theta)) r \; dr \; d\theta \\ &= \int_0^{|z|} \int_0^{\theta_0} f(x(r,\theta), y(r,\theta)) r \; dr \; d\theta \end{aligned}
Differentiate wrt |z| and \theta to get the pdf.

Theorem.  Given any function g: \mathbb{R}^2 \rightarrow \mathbb{R} and T that maps A \in \mathcal{X} to B \in \mathcal{Y}, then
\iint_A g(x_1,x_2) dx_1 \; dx_2 = \iint_B g(x_1(y_1,y_2), x_2(y_1,y_2))  |J(y_1, y_2)| dy_1\; dy_2 where J(y_1,y_2) is the Jacobian.


Monday, July 06, 2015

Delay gratification

Newyorker article

The cult of Genius

Discover magazine blog
"... the brain is a muscle. Giving it a harder workout makes you smarter"

Friday, May 15, 2015

searching for files in Windows command prompt

  • If you know the name of the file
    • dir secret.* /s /p


Friday, April 24, 2015

Relationship between vec operator, Schur, Kronecker and Khatri-Rao product

\DeclareMathOperator{\diag}{diag}\DeclareMathOperator{\vec}{vec} Define \vec(A) as the operation of stacking the columns of matrix A into a vector, A\otimes B the Kronecker product, A\circ B the Schur (Hadamard) product and finally A\diamond B the Khatri-Rao product, defined as the column-wise Kronecker product.

Here are some useful properties:
(A\otimes B)^T = A^T \otimes B^T
A \diag(x \circ y) B^T = A \diag(x) \diag(y) B^T
(C\otimes D) ( A \diamond B) = CA \diamond DB
\vec(AXB^T) = (B \otimes A) \vec(X)
\vec(A \diag(x) B^T) = (B \diamond A) x
(A\diamond B \diamond x^T) y = (A \diamond B) (x \circ y)

Where A,B,C,D,X are matrices and x,y are vectors of compatible dimensions.
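These identities can be verified numerically (a sketch; with vec as column stacking and ⋄ as the column-wise Kronecker product, the vec-diag identity is checked here in the order (B ⋄ A)x):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(2, 4))      # same column count as A for Khatri-Rao
C = rng.normal(size=(5, 3))
D = rng.normal(size=(6, 2))
X = rng.normal(size=(4, 2))
x, y = rng.normal(size=4), rng.normal(size=4)

def kr(A, B):
    """Khatri-Rao (column-wise Kronecker) product."""
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], -1)

vec = lambda M: M.reshape(-1, order='F')   # stack columns

KAB = kr(A, B)
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
assert np.allclose(A @ np.diag(x * y) @ B.T, A @ np.diag(x) @ np.diag(y) @ B.T)
assert np.allclose(np.kron(C, D) @ KAB, kr(C @ A, D @ B))
assert np.allclose(vec(A @ X @ D.T), np.kron(D, A) @ vec(X))
assert np.allclose(vec(A @ np.diag(x) @ B.T), kr(B, A) @ x)
assert np.allclose(kr(KAB, x.reshape(1, -1)) @ y, KAB @ (x * y))
```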

Tuesday, March 03, 2015

Linear separability

(Cover, 1965)  Suppose we have N data points distributed at random in \mathbb{R}^d with an unspecified distribution.  Assume that there is no subset of d or fewer points which are linearly dependent.  We then assign each of the points to one of the two classes \mathcal{C}_1 and \mathcal{C}_2 with equal probability.

The fraction F(N,d) of realizations that is linearly separable is given by the expression
F(N,d) = \left\{     \begin{matrix}     1 \quad &\mathrm{when}\; N \le d+1 \\     \frac{1}{2^{N-1}}\sum\limits_{i=0}^d \left( \begin{matrix} N-1 \\ i \end{matrix} \right) \quad & \mathrm{when}\; N \ge d + 1     \end{matrix}\right.   Intuitively, the probability of separability increases with increasing dimension d.
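The counting expression is easy to evaluate directly (a sketch); for instance F(N,d) = 1 when N \le d+1, and F drops to exactly 1/2 at N = 2(d+1).

```python
from math import comb

def F(N, d):
    """Cover's fraction of linearly separable dichotomies of N points in R^d."""
    if N <= d + 1:
        return 1.0
    return sum(comb(N - 1, i) for i in range(d + 1)) / 2 ** (N - 1)

assert F(3, 2) == 1.0      # N <= d+1: always separable
assert F(6, 2) == 0.5      # N = 2(d+1): separable with probability exactly 1/2
assert F(20, 2) < 0.01     # far beyond capacity: almost never separable
```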

[TODO] include plot...

http://www-isl.stanford.edu/~cover/papers/paper76.pdf

Monday, March 02, 2015

Simply connectedness

Informally, a thick object in our space is simply-connected if it consists of one piece and does not have any "holes" that pass all the way through it.

A sphere (or, equivalently, a rubber ball with a hollow center) is simply connected, because any loop on the surface of a sphere can contract to a point, even though it has a "hole" in the hollow center.

The stronger condition, that the object has no holes of any dimension, is called contractibility.

Another characterization of simply-connectedness is the following:

X is simply-connected if and only if 
  • it is path-connected, and 
  • whenever p: [0,1] \rightarrow X and q: [0,1] \rightarrow X are two paths (i.e. continuous maps) with the same start and endpoint (p(0)=q(0) and p(1)=q(1)), then p and q are homotopic relative to \{0,1\}.
Intuitively, this means that p can be "continuously deformed" to get q while keeping the endpoints fixed. Hence the term simply connected: for any two given points in X, there is one and "essentially" only one path connecting them.

Examples:

  • All convex sets in \mathbb{R}^n are simply connected.
  • A sphere is simply connected.


Friday, February 27, 2015

Interesting courses Spring 2015

  • ECE 287 Spec Topics/Comm Theory & Syst
    • TuTh 5:00p-6:20p CENTR223 Franceschetti, Massimo
  • MATH 287D Statistical Learning
    • TuTh 5:00p-6:20p APM 5402 Bradic, Jelena
  • MATH 281C Mathematical Statistics
    • TuTh 2:00p-3:20p APM 5402 Arias-Castro, Ery 
  • MATH 280C Probability Theory
    • MW 5:00p-6:20p APM 5402 Williams, Ruth J
  • MATH 245C Convex Analysis and Optimization III
    • MWF 4:00p-4:50p APM 5402 Nie, Jiawang 
  • MATH 140C Foundations of Real Analysis III
    • MWF 1:00p-1:50p HSS 1128A Saab, Rayan
  • CSE 255 Data Mining and Predictive Analytics
    • TuTh 3:30p-4:50p WLH 2207 Freund, Yoav 
  • CSE 291 Neural Networks
    • TuTh 2:00p-3:20p WLH 2113 Cottrell, Garrison W 
  • CSE 272 Advanced Image Synthesis
    • TuTh 11:00a-12:20p EBU3B 4140 Jensen, Henrik

Tuesday, February 24, 2015

Epigraph

$ \newcommand{\epi}{\mathop{\mathrm{epi}}} $
The epigraph of a function f: \mathbb{R}^n \rightarrow \mathbb{R} is the set of points lying on or above its graph:
\text{epi} f = \{ (x,\mu) : x\in \mathbb{R}^n, \mu \in \mathbb{R}, \mu \ge f(x) \} \subset \mathbb{R}^{n+1}
Properties:

A function is convex if and only if its epigraph is a convex set.

A function is lower semicontinuous if and only if its epigraph is closed.

Saturday, February 14, 2015

Entropy maximization, part II

Let \mathcal{X} \in \{\alpha_1, \cdots, \alpha_n\} be a random variable with finite alphabet, what is the distribution that will achieve maximum entropy given the constraint that E[f(\mathcal{X})]=\beta ?

We define the expectation
\sum_i \alpha_i p_i = \beta The optimization problem is given as
\max_\mathbf{p} H(\mathbf{p}) \; s.t., \; \sum_i p_i = 1,\; \sum_i \alpha_i p_i = \beta Optimizing using Lagrange multipliers \lambda and \mu, we have
p_i = e^{-(1-\lambda)} e^{\mu \alpha_i} which turns out to be the familiar expression
p_i = \frac{1}{Z} e^{\mu \alpha_i} with Z=\sum_i e^{\mu \alpha_i} being the partition function.

The above form is called a Gibbs distribution.

Now consider the problem with \mathcal{X} \in \mathbb{R}^p where we are given the moments E[f_j(\mathcal{X})]=\beta_j, j=1,\dots,p.  The optimization problem becomes
\max_\mathbf{p} H(\mathbf{p}) \; s.t., \; \sum_i p_i = 1,\; \sum_i \alpha_{ij} p_i = \beta_j The solution has the form
p_i = \frac{1}{Z} e^{\sum_{j=1}^p \mu_j \alpha_{ij}} = \frac{1}{Z} e^{\mathbf{\mu}^T \mathbf{\alpha_{i}}} Since the distribution in general has infinite support, the moment constraints are what allow one to reach a unique solution.
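For the single-moment finite-alphabet case above (taking f(X) = X so the \alpha_i are the alphabet values), \mu can be found numerically: the Gibbs mean is monotone in \mu, so bisection works (a sketch with assumed numbers).

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])   # alphabet values
beta = 3.2                               # target mean, inside (min, max) of alphabet

def gibbs(mu):
    s = mu * alpha
    w = np.exp(s - s.max())              # shift for numerical stability
    return w / w.sum()

lo, hi = -50.0, 50.0                     # bisection: the mean increases with mu
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if gibbs(mid) @ alpha < beta:
        lo = mid
    else:
        hi = mid

p = gibbs(0.5 * (lo + hi))               # maximum-entropy (Gibbs) distribution
```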

Friday, February 13, 2015

Entropy maximization

One interesting and well known result is that out of all finite alphabet distributions, the uniform distribution achieves maximum entropy.

The Laplace's principle of insufficient reasoning, calls for assuming uniformity unless there is additional information.

Entropy maximization is the equivalent of minimizing the KL-Divergence between the distribution p and the uniform distribution.

More precisely, let \mathcal{X} \in \{\alpha_1, \cdots, \alpha_n\} be a random variable with finite alphabet, given a family of distribution \mathcal{P} and the uniform distribution u,
\underset{p\in\mathcal{P}}{\text{arg min}} \;D_{KL}(p\| u) = \underset{p\in\mathcal{P}}{\text{arg max}} \;H(p) where H(p) is the entropy.

Proof:
D(p\|u) = \sum_i p_i \log p_i +  (\sum_i p_i) \log (n) = \log (n) - H(p)
Note
This is true in general for random variables with finite support.  For RVs with infinite support,  additional constraint is required.  See Gibbs measure...

Thursday, February 12, 2015

A Side Path to Statistical Mechanics

Consider a physical system with many degrees of freedom, that can reside in any one of a large number of possible states.  Let p_i denote the probability of occurrence of state i, for example, with the following properties:
p_i\ge 0 \quad \text{for all}\; iand 
\sum_i p_i = 1 Let E_i denote the energy of the system when it is in state i.  A fundamental result from statistical mechanics tells us that when the system is in thermal equilibrium with its surrounding environment, state i occurs with a probability defined by 
p_i = \frac{1}{Z}\exp(-\frac{E_i}{k_BT}) where T is the absolute temperature in kelvins, k_B is the Boltzmann's constant, and Z is a constant that is independent of all states.  The partition function Z is the normalizing constant with 
Z= \sum_i \exp(-\frac{E_i}{k_BT}).  The probability distribution is called the Gibbs distribution.

Two interesting properties of the Gibbs distribution are:
  1. States of low energy have a higher probability of occurrence than states of high energy.
  2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
In the context of neural networks, the parameter T may be viewed as a pseudo-temperature that controls thermal fluctuations representing the effect of "synaptic noise" in a neuron.  Its precise scale is irrelevant.  We can redefine the probability p_i and partition function Z as
p_i = \frac{1}{Z}\exp(-\frac{E_i}{T}) and
Z = \sum_i \exp(-\frac{E_i}{T}) where T is referred to simply as the temperature of the system. 

Note that -\log p_i may be viewed as a form of "energy" measured at unit temperature.

Free Energy and Entropy

The Helmholtz free energy of a physical system, denoted by F, is defined in terms of the partition function Z as follows
F = -T \log Z.  The average energy of the system is defined by
\langle E \rangle = \sum_i p_i E_i The difference between the average energy and the free energy is
\langle E \rangle - F = -T\sum_i p_i \log p_i which we can rewrite in terms of the entropy H
\langle E \rangle - F = T H or, equivalently,
F = \langle E \rangle - TH.  The entropy of any system tends to increase until it reaches equilibrium, and therefore the free energy of the system will reach a minimum.

This is an important principle called the principle of minimal free energy.
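These identities are straightforward to check numerically for a small system (a sketch; the energies and temperatures are arbitrary assumptions):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 5.0])        # state energies (arbitrary units)
T = 0.7

p = np.exp(-E / T)
Z = p.sum()
p /= Z                                    # Gibbs distribution
F = -T * np.log(Z)                        # Helmholtz free energy
E_avg = p @ E                             # average energy <E>
H = -np.sum(p * np.log(p))                # entropy (nats)
assert np.isclose(F, E_avg - T * H)       # F = <E> - TH

# lower temperature concentrates probability on the low-energy state
p_cold = np.exp(-E / 0.1)
p_cold /= p_cold.sum()
assert p_cold[0] > p[0]
```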

Cross Entropy

The cross entropy of distributions p and q is
H(p,q)=E_p[-\log q] It can be viewed as
H(p,q)=H(p)+D_{KL}(p\|q) where H(p) is the entropy of p and D_{KL}(p\|q) is the non-negative Kullback–Leibler divergence.

From a source coding perspective, it is the total number of bits required to encode information if the estimated distribution q diverges from the true distribution p, where H(p) is the minimum.

This quantity is very useful in machine learning.  Viewed from a vector quantization point of view, logistic regression is a way of finding an optimal boundary to classifying samples in a (possibly high dimensional) space of interest.   This expression quantifies the loss of estimating distribution q instead of the true distribution p.  Since H(p) is fixed because it is a property of the underlying true distribution, minimizing cross-entropy is equivalent to minimizing KL divergence in this setting.
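A quick numeric check of the decomposition (a sketch in natural log, with assumed distributions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))      # entropy of p
kl = np.sum(p * np.log(p / q))    # KL divergence D(p||q) >= 0
H_pq = -np.sum(p * np.log(q))     # cross entropy E_p[-log q]

assert np.isclose(H_pq, H_p + kl)          # H(p,q) = H(p) + D(p||q)
assert kl >= 0.0 and H_pq >= H_p           # H(p) is the minimum over q
```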

Friday, January 30, 2015

Conditional Expectation

Definition.  \lambda is absolutely continuous (AC) with respect to \mu, written \lambda \ll \mu, if \mu(A)=0 implies \lambda(A) = 0.

Theorem 2 Radon-Nikodym Theorem  Let (\Omega,\mathcal{B},P) be a probability space.  Suppose v is a positive bounded measure and v \ll P.  Then there exists an integrable random variable X\in \mathcal{B}, such that
v(E) = \int_E XdP, \quad \forall E \in \mathcal{B}.  X is a.s. unique (P) and is written
X=\frac{dv}{dP}.  We also write dv=XdP.

Definition of Conditional Expectation

Suppose X\in L_1(\Omega,\mathcal{B},P) and let \mathcal{G}\subset \mathcal{B} be a sub-\sigma-field.  Then there exists a random variable E(X|\mathcal{G}), called the conditional expectation of X with respect to \mathcal{G}, such that

  1. E(X|\mathcal{G}) is \mathcal{G}-measurable and integrable.
  2. For all G\in\mathcal{G} we have \int_G XdP = \int_G E(X|\mathcal{G})dP
Notes.
  1. Definition of conditional probability: Given (\Omega,\mathcal{B},P), a probability space, with \mathcal{G} a sub-\sigma-field of \mathcal{B}, define P(A|\mathcal{G})=E(1_A|\mathcal{G}), \quad A\in \mathcal{B}.  Thus P(A|\mathcal{G}) is a random variable such that 
    1. P(A|\mathcal{G}) is \mathcal{G}-measurable and integrable.
    2. P(A|\mathcal{G})  satisfies \int_G P(A|\mathcal{G})dP = P(A\cap G), \quad \forall G \in \mathcal{G}.
  2. Conditioning on random variables: Suppose \{X_t, t\in T \} is a family of random variables defined on (\Omega,\mathcal{B}) and indexed by some index set T.   Define \mathcal{G}:=\sigma(X_t,t\in T) to be the \sigma-field generated by the process \{X_t, t\in T \}.  Then define E(X|X_t, t\in T)= E(X|\mathcal{G}).
Note (1) continues the duality of probability and expectation but seems to place expectation in a somewhat more basic position, since conditional probability is defined in terms of conditional expectation.

Note (2) saves us from having to make separate definitions for E(X|X_1), E(X|X_1,X_2), etc.

Countable partitions Let \{\Lambda_n, n\ge 1 \} be a partition of \Omega so that \Lambda_i \cap \Lambda_j = \emptyset, i\neq j, and \sum_n \Lambda_n=\Omega.  Define
\mathcal{G}=\sigma(\Lambda_n, n\ge 1) so that
\mathcal{G}=\left\{ \sum_{i\in J}\Lambda_i: J\subset\{1,2,\dots \} \right\}. For X\in L_1(P), define
E_{\Lambda_n}(X)=\int XP(d\omega|\Lambda_n)=\int_{\Lambda_n}XdP/P\Lambda_n , if P(\Lambda_n)>0 and E_{\Lambda_n}(X) = 18 (an arbitrary constant) if P(\Lambda_n)=0.  We claim

  1. E(X|\mathcal{G})\overset{a.s.}{=} \sum_{n=1}^\infty E_{\Lambda_n}(X) 1_{\Lambda_n}   and for any A\in \mathcal{B}
  2. P(A|\mathcal{G})\overset{a.s.}{=} \sum_{n=1}^\infty P(A|\Lambda_n)1_{\Lambda_n}
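Claim 1 can be illustrated on a finite toy probability space; the space, the random variable X and the partition below are assumptions chosen for illustration:

```python
from fractions import Fraction

# Finite toy probability space: Omega = {0,...,5} with uniform P (hypothetical setup).
omega = list(range(6))
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}

# Partition Lambda_1 = {0,1,2}, Lambda_2 = {3,4,5} generating the sub-sigma-field G.
partition = [{0, 1, 2}, {3, 4, 5}]

def cond_exp(w):
    """E(X | G)(w): the block average int_{Lambda_n} X dP / P(Lambda_n)."""
    block = next(b for b in partition if w in b)
    p_block = sum(P[v] for v in block)
    return sum(X[v] * P[v] for v in block) / p_block

# E(X|G) is constant on each block and integrates like X over every G in G.
for block in partition:
    lhs = sum(X[w] * P[w] for w in block)         # int_G X dP
    rhs = sum(cond_exp(w) * P[w] for w in block)  # int_G E(X|G) dP
    assert lhs == rhs
```

Using exact rational arithmetic (`Fraction`) makes the defining property \int_G XdP = \int_G E(X|\mathcal{G})dP hold with equality rather than up to rounding.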



Product Spaces, Transition Kernel and Fubini's Theorem

Lemma 1. Sectioning sets Sections of measurable sets are measurable.  If A\in \mathcal{B}_1 \times \mathcal{B}_2, then for all \omega_1 \in \Omega_1,
A_{\omega_1}\in \mathcal{B}_2
Corollary 1. Sections of measurable functions are measurable.  That is if
X: (\Omega_1\times \Omega_2, \mathcal{B}_1 \times \mathcal{B}_2) \mapsto (S,\mathcal{S}) then X_{\omega_1} \in \mathcal{B}_2.  We say X_{\omega_1} is \mathcal{B}_2/\mathcal{S} measurable.

Define the transition (probability) kernel
K(\omega_1,A_2):\Omega_1 \times \mathcal{B}_2 \mapsto [0,1]
if it satisfies the following
  1. for each \omega_1, K(\omega_1,\cdot) is a probability measure on \mathcal{B}_2, and
  2. for each A_2\in \mathcal{B}_2, K(\cdot, A_2) is \mathcal{B}_1/\mathcal{B}([0,1]) measurable.
Transition kernels are used to define discrete time Markov processes where K(\omega_1,A_2) represents the conditional probability that, starting from \omega_1, the next movement of the system results in a state in A_2.

Theorem 1.  Let P_1 be a probability measure on \mathcal{B}_1, and suppose
K:\Omega_1 \times \mathcal{B}_2 \mapsto [0,1] is a transition kernel.  Then K and P_1 uniquely determine a probability on \mathcal{B}_1 \times \mathcal{B}_2 via the formula
P(A_1\times A_2)= \int_{A_1}K(\omega_1,A_2)P_1(d\omega_1), for all A_1\times A_2 in the class of measurable rectangles.

Theorem 2. Marginalization  Let P_1 be a probability measure on (\Omega_1,\mathcal{B}_1) and suppose K: \Omega_1\times \mathcal{B_2} \mapsto [0,1] is a transition kernel.  Define P on (\Omega_1 \times \Omega_2, \mathcal{B_1}\times \mathcal{B_2}) by
P(A_1\times A_2)= \int_{A_1} K(\omega_1,A_2)P_1(d\omega_1).  Assume
X:(\Omega_1\times \Omega_2, \mathcal{B}_1\times \mathcal{B}_2) \mapsto (\mathbb{R},\mathcal{B}(\mathbb{R})) and furthermore suppose X is integrable.  Then
Y(\omega_1)=\int_{\Omega_2} K(\omega_1,d\omega_2)X_{\omega_1}(\omega_2) has the properties

  1. Y is well defined.
  2. Y \in \mathcal{B}_1
  3. Y \in L_1(P_1) and furthermore
\int_{\Omega_1\times \Omega_2}XdP = \int_{\Omega_1} Y(\omega_1)P_1(d\omega_1) = \int_{\Omega_1} [ \int_{\Omega_2} K(\omega_1,d\omega_2) X_{\omega_1}(\omega_2)]P_1(d\omega_1).
Theorem 3.  Fubini Theorem  Let P=P_1\times P_2 be a product measure.  If X is \mathcal{B}_1 \times \mathcal{B}_2 measurable and is either non-negative or integrable with respect to P then
\begin{aligned} \int_{\Omega_1\times \Omega_2}XdP &= \int_{\Omega_1}[ \int_{\Omega_2}X_{\omega_1}(\omega_2) P_2(d\omega_2) ] P_1(d\omega_1) \\                                                                 &= \int_{\Omega_2}[ \int_{\Omega_1}X_{\omega_2}(\omega_1) P_1(d\omega_1) ] P_2(d\omega_2) \end{aligned}
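For a finite product space the interchange of iterated integrals can be checked directly; a minimal sketch with made-up marginals P_1, P_2 and an arbitrary (automatically integrable) X:

```python
# Discrete Fubini check: product measure P = P1 x P2 on a finite grid (toy example).
P1 = {0: 0.25, 1: 0.75}
P2 = {0: 0.4, 1: 0.1, 2: 0.5}

def X(w1, w2):
    # Arbitrary function on the product space; any bounded X is integrable here.
    return (w1 + 1) * (w2 - 1) ** 2

# Iterated integrals in both orders.
inner_then_outer = sum(P1[w1] * sum(X(w1, w2) * P2[w2] for w2 in P2) for w1 in P1)
outer_then_inner = sum(P2[w2] * sum(X(w1, w2) * P1[w1] for w1 in P1) for w2 in P2)

# Fubini: both orders of integration give the same value.
assert abs(inner_then_outer - outer_then_inner) < 1e-12
```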

Thursday, January 29, 2015

Clarification of Expectation

Let X be a random variable on the probability space (\Omega, \mathcal{B}, P).  Recall that the distribution of X is the measure
F := P \circ X^{-1} on (\mathbb{R},\mathcal{B}(\mathbb{R})) defined by
F(A)=P\circ X^{-1}(A) = P[X\in A].
The distribution function of X is
F(x):= F((-\infty,x])=P[X\leq x].  Note that the letter "F" is overloaded in two ways.

An application of the Transformation Theorem allows us to compute the abstract integral
E(X) = \int_\Omega XdP as
E(X) = \int_\mathbb{R} xF(dx), which is an integral on \mathbb{R}.

More precisely,
E(X) = \int_\Omega X(\omega)P(d\omega)=\int_\mathbb{R} x F(dx).
Also, given a measurable function g, the expectation of g(X) is
E(g(X)) = \int_\Omega g(X(\omega))P(d\omega)=\int_\mathbb{R} g(x) F(dx).
Instead of computing expectations on the abstract space \Omega, one can always compute them on \mathbb{R} using F, the distribution of X.
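A sketch of this equivalence for a concrete X: take X exponential with rate 1, realized on \Omega = [0,1] by the inverse transform X(\omega) = -\log(1-\omega) (this realization and g(x) = x^2 are choices made for illustration).  The Monte Carlo integral over \Omega and the Riemann integral against F(dx) both approximate E(X^2) = 2:

```python
import math
import random

# X ~ Exp(1) realized on Omega = [0,1] (uniform P) via X(omega) = -log(1 - omega).
random.seed(0)
g = lambda x: x * x   # E[g(X)] = E[X^2] = 2 for Exp(1)

# Monte Carlo estimate of the abstract integral int_Omega g(X(omega)) P(d omega).
n = 200_000
mc = sum(g(-math.log(1.0 - random.random())) for _ in range(n)) / n

# Riemann estimate of int_R g(x) F(dx) = int_0^inf g(x) e^{-x} dx.
dx = 0.001
riemann = sum(g(i * dx) * math.exp(-i * dx) * dx for i in range(1, 20000))

# Both routes approximate the same expectation.
assert abs(mc - 2.0) < 0.1
assert abs(riemann - 2.0) < 0.01
```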

Random variables and Inverse maps

A random variable is a real valued function with domain \Omega which has an extra property called measurability that allows us to make probability statements about the random variable.

Suppose \Omega and \Omega' are two sets.  Often \Omega' = \mathbb{R}.  Suppose
X:\Omega \mapsto \Omega'.  Then X determines an inverse map (a set-valued function)
X^{-1}: \mathcal{P}(\Omega')\mapsto \mathcal{P}(\Omega) defined by
X^{-1}(A') = \{\omega \in \Omega : X(\omega) \in A'\} for A' \subset \Omega'.

X^{-1} preserves complementation, union and intersections.
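These preservation properties are easy to check on a finite toy map (the sets and the map below are assumptions for illustration):

```python
# Check that preimages preserve complement, union and intersection on a toy map.
Omega = {1, 2, 3, 4, 5, 6}
Omega_p = {"a", "b", "c"}
X = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}  # X: Omega -> Omega'

def preimage(Ap):
    # X^{-1}(A') = {omega in Omega : X(omega) in A'}
    return {w for w in Omega if X[w] in Ap}

A, B = {"a"}, {"a", "b"}
assert preimage(Omega_p - A) == Omega - preimage(A)   # complementation
assert preimage(A | B) == preimage(A) | preimage(B)   # union
assert preimage(A & B) == preimage(A) & preimage(B)   # intersection
```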

Wednesday, January 28, 2015

Convergence Concepts

Implications of convergence



1. Almost Sure Convergence

Examples of statements that hold almost surely (a.s.)

  • Let X,X' be two random variables.  Then X=X' a.s. means P[X=X']=1; that is, there exists an event N\in \mathcal{B}, such that P(N)=0 and if \omega\in N^c, then X(\omega)=X'(\omega).
  • If \{X_n\} is a sequence of random variables, then \lim_{n\rightarrow \infty}X_n exists a.s. means there exists an event N\in \mathcal{B}, such that P(N)=0 and if \omega\in N^c then \lim_{n\rightarrow \infty}X_n(\omega) exists. It also means that for a.a. \omega, \underset{n\rightarrow \infty}{\text{lim sup}}X_n(\omega)=\underset{n\rightarrow \infty}{\text{lim inf}}X_n(\omega). We will write \lim_{n\rightarrow \infty}X_n = X or X_n \overset{a.s.}{\rightarrow}X.
  • If \{X_n\} is a sequence of random variables, then \sum_n X_n converges a.s. means there exists an event N\in \mathcal{B}, such that P(N)=0, and \omega \in N^c implies \sum_n X_n(\omega) converges.

2. Convergence in Probability

Suppose X_n, n\ge 1 and X are random variables.  Then \{X_n\} converges in probability (i.p.) to X, written X_n \overset{P}{\rightarrow}X, if for any \epsilon > 0, \lim_{n\rightarrow \infty} P[|X_n-X|>\epsilon]=0.
Almost sure convergence of \{X_n\} demands that for a.e. \omega, X_n(\omega)-X(\omega) gets small and stays small.  Convergence i.p. is weaker: it merely requires that the probability of the difference X_n-X being non-trivial becomes small.

It is possible for a sequence to converge in probability but not almost surely.

Theorem 1. Convergence a.s. implies convergence i.p. Suppose that X_n, n\ge 1 and X are random variables on a probability space (\Omega,\mathcal{B},P).  If X_n \rightarrow X \; a.s., then X_n \overset{P}{\rightarrow}X.
Proof.  If X_n \rightarrow X a.s., then for any \epsilon > 0,
\begin{aligned} 0\;&=P([|X_n-X|>\epsilon]i.o.) \\   &=P(\underset{n\rightarrow \infty}{\text{lim sup}}[|X_n-X|>\epsilon]) \\   &=\lim_{N\rightarrow \infty}P(\bigcup_{n\ge N}[|X_n-X|>\epsilon] ) \\   &\ge \lim_{n\rightarrow \infty}P[|X_n-X|>\epsilon] \end{aligned}

3. L_p Convergence

Recall the notation X\in L_p which means E(|X|^p)<\infty . For random variables X,Y\in L_p, we define the L_p metric for p\ge 1 by
d(X,Y)=(E|X-Y|^p)^{1/p}.  This metric is norm induced because
\|X\|_p := (E|X|^p)^{1/p} is a norm on the space L_p.

A sequence \{X_n\} of random variables converges in L_p to X, written
X_n \overset{L_p}{\rightarrow}X , if
E(|X_n-X|^p) \rightarrow 0 as n\rightarrow \infty.

Facts about L_p convergence.
  1. L_p convergence implies convergence in probability: For p>0, if X_n\overset{L_p}{\rightarrow} X then X_n \overset{P}{\rightarrow}X .  This follows readily from Chebyshev's inequality, P[|X_n-X|\ge \epsilon] \leq  \frac{E(|X_n-X|^p)}{\epsilon^p} \rightarrow 0.
  2. Convergence in probability does not imply L_p convergence.  What can go wrong is that the nth function in the sequence can be huge on a very small set.
    Example.  Let the probability space be ([0,1],\mathcal{B}([0,1]),\lambda) , where \lambda is Lebesgue measure and define
    X_n = 2^n 1_{(0,\frac{1}{n}) } then
    P[|X_n| > \epsilon ] = P \left( (0,\frac{1}{n}) \right)  = \frac{1}{n} \rightarrow 0 but
    E(|X_n|^p) = 2^{np} \frac{1}{n} \rightarrow \infty
  3. L_p convergence does not imply almost sure convergence.
    Example.  Consider the functions \{X_n\} defined on  ([0,1],\mathcal{B}([0,1]),\lambda) , where \lambda is Lebesgue measure.
    \begin{align*}     X_1 &= 1_{[0,\frac{1}{2}]},  \quad  X_2 = 1_{[\frac{1}{2},1]} \\     X_3 &= 1_{[0,\frac{1}{3}]},  \quad  X_4 = 1_{[\frac{1}{3},\frac{2}{3}]} \\     X_5 &= 1_{[\frac{2}{3},1]},  \quad  X_6 = 1_{[0,\frac{1}{4}]}, \cdots \\ \end{align*} and so on.  Note that for any p>0,
    E(|X_1|^p)=E(|X_2|^p)=\frac{1}{2},\\     E(|X_3|^p)=E(|X_4|^p)=E(|X_5|^p)=\frac{1}{3}, \\     E(|X_6|^p)=\frac{1}{4}, \cdots so  E(|X_n|^p) \rightarrow 0 and
    X_n \xrightarrow[]{L_p} 0.
    Observe that \{X_n\} does not converge almost surely to 0.
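The counterexample in fact 2 above (convergence in probability without L_p convergence) can be explored by Monte Carlo; a minimal sketch, sampling \omega uniformly from [0,1]:

```python
import random

# Monte Carlo look at X_n = 2^n * 1_{(0,1/n)} on ([0,1], Lebesgue):
# P[|X_n| > eps] = 1/n -> 0 (convergence i.p. to 0),
# while E|X_n| = 2^n / n -> infinity (no L_1 convergence).
random.seed(1)
samples = [random.random() for _ in range(100_000)]

def stats(n):
    hits = sum(1 for w in samples if 0.0 < w < 1.0 / n)
    prob = hits / len(samples)       # estimates P[|X_n| > eps] = 1/n
    moment = (2.0 ** n) * prob       # estimates E|X_n| = 2^n / n
    return prob, moment

p5, m5 = stats(5)
p15, m15 = stats(15)
assert p15 < p5    # the tail probability shrinks with n...
assert m15 > m5    # ...while the first moment blows up
```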

Limits and Integrals

Under certain circumstances we are allowed to interchange expectation with limits.

Theorem 1. Monotone Convergence Theorem (MCT). If
0\leq X_n \uparrow X, then
0\leq E(X_n) \uparrow E(X)
Corollary 1.  Series Version of MCT.  If X_n \ge 0 are non-negative random variables for n\ge1, then
E(\sum_{n=1}^\infty X_n)= \sum_{n=1}^\infty E(X_n)
so that the expectation and infinite sum can be interchanged

Theorem 2. Fatou Lemma. If X_n \ge 0, then
E(\underset{n\rightarrow \infty}{\text{lim inf}} X_n ) \leq \underset{n\rightarrow \infty}{\text{lim inf}}E(X_n)
More generally, if there exists Z\in L_1 and X_n\ge Z, then
E(\underset{n\rightarrow \infty}{\text{lim inf}} X_n ) \leq \underset{n\rightarrow \infty}{\text{lim inf}}E(X_n)
Corollary 2. More Fatou.  If 0 \leq X_n \leq Z where Z\in L_1, then
E(\underset{n\rightarrow \infty}{\text{lim sup}} X_n ) \ge \underset{n\rightarrow \infty}{\text{lim sup}}E(X_n)
Theorem 3. Dominated Convergence Theorem (DCT).  If
X_n \rightarrow X and there exists a dominating random variable Z\in L_1 such that
|X_n| \leq Z, then
E(X_n)\rightarrow E(X) \; \text{and} \; E|X_n-X|\rightarrow 0.
\newcommand{\scriptB}{\mathcal{B}} \newcommand{\scriptP}{\mathcal{P}} \newcommand{\vecX}{\mathbf{X}} \newcommand{\vecx}{\mathbf{x}} \newcommand{\reals}{\mathbb{R}} \newcommand{\cplxs}{\mathbb{C}} \newcommand{\rationals}{\mathbb{Q}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\integers}{\mathbb{Z}} \newcommand{\ntoinf}{n\rightarrow\infty} \newcommand{\mtoinf}{m\rightarrow\infty} \newcommand{\tendsto}{\rightarrow}
Example of how interchanging limits and integrals fails without the dominating condition.  (Something very nasty happens on a small set, and the degree of nastiness overpowers the degree of smallness.)

Let
(\Omega, \scriptB, P) = ([0,1], \scriptB([0,1]), \lambda) \lambda the Lebesgue measure. Define
X_n = n^2 1_{(0,1/n)}. For any \omega \in [0,1],
 1_{(0,1/n)}(\omega) \tendsto 0, so
X_n \tendsto 0. However
 E(X_n) = n^2 \cdot \frac{1}{n} = n \tendsto \infty, so
E(\liminf_{\ntoinf} X_n) = 0 \le \liminf_{\ntoinf} (EX_n) = \infty and
E(\limsup_{\ntoinf} X_n) = 0 \not\ge \limsup_{\ntoinf} (EX_n) = \infty.
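A minimal numerical illustration of this counterexample (the choice \omega = 0.37 is arbitrary; any fixed point of (0,1] works):

```python
# X_n = n^2 * 1_{(0,1/n)} on ([0,1], Lebesgue): X_n(omega) -> 0 for every omega,
# yet E(X_n) = n^2 * (1/n) = n -> infinity, so limit and integral cannot be swapped.
def E(n):
    # Exact Lebesgue integral: n^2 times the measure of (0, 1/n).
    return (n ** 2) * (1.0 / n)

# Pointwise limit is 0 at any fixed omega: once 1/n < omega the indicator vanishes.
omega = 0.37
pointwise = [n * n if 0.0 < omega < 1.0 / n else 0 for n in range(1, 50)]
assert all(v == 0 for v in pointwise[5:])

# ...but the expectations diverge linearly in n.
assert abs(E(10) - 10) < 1e-9
assert abs(E(100) - 100) < 1e-9
```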

Tuesday, January 27, 2015

Zero-One Laws

There are several common zero-one laws which identify situations in which the range of a random variable is trivial (i.e. the variable is almost surely constant).  There are also several zero-one laws which provide the basis for all proofs of almost sure convergence.

Proposition 1. Borel-Cantelli Lemma Let \{A_n\} be any events (not necessarily independent).

If \sum_n{P(A_n)}<\infty, then
P([A_n \; i.o.])=P(\underset{n\rightarrow \infty}{\text{lim sup}} A_n) = 0.
Proposition 2. Borel Zero-One Law If \{A_n\} is a sequence of independent events, then

 \begin{equation*} P([A_n \; i.o.])= \begin{cases} 0, \quad &  \text{iff } \sum_n P(A_n) < \infty \\ 1, \quad &  \text{iff } \sum_n P(A_n) = \infty \end{cases} \end{equation*}
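A simulation sketch of the Borel zero-one law (the horizon N, the number of sample paths, and the two probability sequences are choices made for illustration): with P(A_n) = 1/n^2 the sum converges and only finitely many A_n occur, while with P(A_n) = 1/n the sum diverges and occurrences keep accumulating like log N.

```python
import random

# Independent events A_n with P(A_n) = prob(n); count how many occur up to N,
# averaged over several sample paths to stabilize the estimate.
random.seed(42)
N, PATHS = 20_000, 30

def mean_occurrences(prob):
    total = sum(
        sum(1 for n in range(1, N + 1) if random.random() < prob(n))
        for _ in range(PATHS)
    )
    return total / PATHS

conv = mean_occurrences(lambda n: 1.0 / n ** 2)  # ~ pi^2/6, bounded as N grows
div = mean_occurrences(lambda n: 1.0 / n)        # ~ log N, grows without bound

assert conv < 5.0            # summable case: a handful of occurrences, then none
assert div > conv + 5.0      # divergent case: occurrences keep piling up
```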
Definition.  An almost trivial \sigma-field is a \sigma-field all of whose events have probability 0 or 1.

Theorem 3. Kolmogorov Zero-One Law If \{X_n\} are independent random variables with tail \sigma-field \mathcal{T}, then \Lambda\in \mathcal{T} implies P(\Lambda)=0 or 1 so that the tail \sigma-field is almost trivial.

Lemma 4. Almost trivial \sigma-fields  Let \mathcal{G} be an almost trivial \sigma-field and let X be a random variable measurable with respect to \mathcal{G}.  Then there exists c such that P[X=c] = 1.

Corollary 5.  Let \{X_n\} be independent random variables.  Then the following are true.

(a) The event
[\sum_n X_n \;converges] has probability 0 or 1.

(b) The random variables \text{lim sup}_{n\rightarrow \infty}X_n and \text{lim inf}_{n\rightarrow \infty}X_n are constant with probability 1.

(c) The event
\{\omega: S_n(\omega)/n \rightarrow 0 \}, where S_n=X_1+\cdots+X_n, has probability 0 or 1.

Monday, January 26, 2015

Inequalities

Another valuable book for anyone in computer science who ever wants to bound any quantity (so, everyone!) is: The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities by Michael Steele.
An encyclopedic book on the topic is A Dictionary of Inequalities. While this is not a book for reading cover-to-cover, it is good to have it at your disposal. See also the supplement of the book.
Moreover, Wikipedia has an excellent list of inequalities.
For specific topics, you may consult:

Friday, January 23, 2015

Dynkin's theorem

First we define a structure named \lambda-system.

A class of subsets \mathcal{L} of \Omega is called a \lambda-system if it satisfies the following postulates

1.  \Omega\in \mathcal{L}
2. A\in\mathcal{L} \Rightarrow A^c \in \mathcal{L}
3. n\neq m, A_nA_m = \emptyset, A_n \in \mathcal{L} \Rightarrow \cup_n A_n \in \mathcal{L}

It is clear that a \sigma-field is always a \lambda-system.

Next a \pi-system is a class of sets closed under finite intersections.

Dynkin's theorem

a) If \mathcal{P} is a \pi-system and \mathcal{L} is a \lambda-system such that \mathcal{P}\subset\mathcal{L}, then \sigma(\mathcal{P})\subset \mathcal{L}.

b) If \mathcal{P} is a \pi-system,

\sigma(\mathcal{P})=\mathcal{L}(\mathcal{P})

that is, the minimal \sigma-field over \mathcal{P} equals the minimal \lambda-system over \mathcal{P}.

Wednesday, January 21, 2015

An example of Stein's paradox

Charles Stein showed in 1956 that a nonlinear, biased estimator of a multivariate normal mean can have lower MSE than the ML estimator.

Given a sample of N measurements of X\sim\mathcal{N}(\mu,\sigma^2 I_p) with unknown mean vector \mu of dimension p.

The James-Stein estimator is given by

\begin{equation*} \hat{\mu}_{JS}=\left (1-\frac{(p-2)\frac{\sigma^2}{N}}{\|\bar{x} \|^2}\right ) \bar{x} \end{equation*}
where \bar{x} is the sample mean.

This estimator dominates the MLE everywhere in terms of MSE.  For all \mu\in\mathbb{R}^p,

\begin{equation*} \mathbb{E}_\mu \| \hat{\mu}_{JS}-\mu\|^2 < \mathbb{E}_\mu \| \hat{\mu}_{MLE}-\mu\|^2 \end{equation*}
This makes the MLE inadmissible for p\ge3!
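A Monte Carlo sketch comparing the two estimators; the values of \mu, \sigma, p and N below are assumptions chosen for illustration:

```python
import random

# Compare MSE of the MLE (sample mean) vs the James-Stein estimator for the
# mean of N(mu, sigma^2 I_p). All parameter values here are illustrative.
random.seed(0)
p, N, sigma = 10, 5, 1.0
mu = [0.5] * p
trials = 2000

def sample_vec():
    return [random.gauss(m, sigma) for m in mu]

mse_mle = mse_js = 0.0
for _ in range(trials):
    # Sample mean of N draws (the MLE).
    xbar = [sum(xs) / N for xs in zip(*(sample_vec() for _ in range(N)))]
    # James-Stein shrinkage toward the origin.
    norm2 = sum(v * v for v in xbar)
    shrink = 1.0 - (p - 2) * (sigma ** 2 / N) / norm2
    js = [shrink * v for v in xbar]
    mse_mle += sum((v - m) ** 2 for v, m in zip(xbar, mu))
    mse_js += sum((v - m) ** 2 for v, m in zip(js, mu))

mse_mle /= trials
mse_js /= trials
assert mse_js < mse_mle   # James-Stein beats the MLE in MSE for p >= 3
```

Note this is the plain (non-positive-part) James-Stein estimator; the positive-part variant, which clips the shrinkage factor at 0, does even better.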

Wednesday, January 14, 2015

The need for measure theory

The problem with measure arises when one needs to decompose a body into a (possibly uncountable) number of components and reassemble it after some action on those components.  Even when one restricts attention to finite partitions, one still runs into trouble.  The most striking example is the Banach-Tarski paradox, which shows that a unit ball B in three dimensions can be disassembled into a finite number of pieces and reassembled to form two disjoint copies of B.

Such pathological sets almost never come up in practical applications of mathematics.  Because of this, the standard solution to the problem of measure has been to abandon the goal of measuring every subset E of \mathbb{R}^d and instead to settle for only measuring a certain subclass of
non-pathological subsets of \mathbb{R}^d, referred to as the measurable sets.

The most fundamental properties of a measure are
  1. finite or countable additivity
  2. translation invariance
  3. rotation invariance
The concept of Jordan measure (closely related to that of Riemann and Darboux integral) is sufficient for undergraduate level analysis.

However, the types of sets that arise in analysis, in particular those that arise as limits of other sets, require an extended concept of measurability (Lebesgue measurability).

Lebesgue theory is viewed as a completion of the Jordan-Darboux-Riemann theory.  It keeps almost all of the desirable properties of Jordan measure, but with the crucial additional property that many features of the Lebesgue theory are preserved under limits.

Probability space concepts

A field is a non-empty class of subsets of \Omega closed under finite union and complements.  A synonym for field is algebra.

From de Morgan's laws a field is also closed under finite intersection.

A \sigma-field \mathcal{B} is a non-empty class of subsets of \Omega closed under countable union, countable intersection and complements.  A synonym for \sigma-field is \sigma-algebra.

In probability theory, the event space is a \sigma-field.  This allows us enough flexibility in constructing new events from old ones (closure), but not so much flexibility that we have trouble assigning probabilities to the elements of the \sigma-field.

For the reals, we start with sets to which we know how to assign probabilities.

Suppose \Omega=\mathbb{R} and let

\mathcal{C}=\{(a,b],-\infty \leq a \leq b < \infty \}

The Borel \sigma-field is defined as

\mathcal{B}(\mathbb{R}) \equiv \sigma(\mathcal{C})

Also one can show that

\mathcal{B}(\mathbb{R}) = \sigma(\text{open sets in } \mathbb{R})

Wednesday, January 07, 2015

Russell's Paradox

Shortly after the turn of the 20th century, Bertrand Russell demonstrated a hole in the mathematical logic of the set theory of the time, in which a set can be a member of itself.  Consider the set R defined by

R = \{S \mid S \notin S\}

The set R contains all sets that do not have themselves as members.

However, is R a member of itself?

If it is, then by definition R is a set that does not have itself as a member, a contradiction.

But if R does not have itself as a member, then by definition it must be a member of R, again a contradiction.

At the time, it created a huge stir in the mathematical community, since most of mathematics was built on the foundation of set theory.