
Tuesday, December 22, 2015

Notes on transform domain adaptation (LMS)

Transform domain adaptive filter (e.g. DFT-LMS, DCT-LMS)
  • Performance of LMS is sensitive to the eigenvalue spread of the input covariance matrix.
  • The smallest eigenvalue contributes to slower convergence and the largest eigenvalue limits the range of allowed step-sizes, thus limiting the learning abilities of the filter.
A DFT/DCT-type transformation allows one to whiten the input sequence in the transform domain without needing to know the correlation matrix of the sequence (as the KLT would require), which may not even be stationary.

[Hodgkiss, 1979] has shown the conditions under which time domain and frequency domain processing is equivalent.  
[Beaufays, 1995] showed analytically that the eigenvalue distribution of a Markov-1 sequence with correlation coefficient \rho has asymptotic eigenvalue spread of  [interesting matrix theory]
\left(\frac{1+\rho}{1-\rho}\right)^2, before transformation 
\frac{1+\rho}{1-\rho}, after DFT and power normalization 
1+\rho, after DCT and power normalization
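These three spreads can be checked numerically (a sketch with numpy; the Markov-1 covariance, the unitary DFT, and an orthonormal DCT-II matrix are constructed explicitly, and "power normalization" means scaling the transform-domain matrix to unit diagonal):

```python
import numpy as np

def eig_spread(R):
    """Ratio of largest to smallest eigenvalue of a Hermitian matrix."""
    w = np.linalg.eigvalsh((R + R.conj().T) / 2)
    return w.max() / w.min()

def power_normalize(R):
    """Scale a covariance matrix to unit diagonal (power normalization)."""
    d = np.sqrt(np.real(np.diag(R)))
    return R / np.outer(d, d)

rho, M = 0.9, 256
n = np.arange(M)
R = rho ** np.abs(np.subtract.outer(n, n))       # Markov-1 covariance: R[i,j] = rho^|i-j|

F = np.exp(-2j * np.pi * np.outer(n, n) / M) / np.sqrt(M)   # unitary DFT matrix
T = np.sqrt(2.0 / M) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / M)
T[0, :] = np.sqrt(1.0 / M)                       # orthonormal DCT-II matrix

s_time = eig_spread(R)                                    # asymptotically ((1+rho)/(1-rho))^2
s_dft = eig_spread(power_normalize(F @ R @ F.conj().T))   # asymptotically (1+rho)/(1-rho)
s_dct = eig_spread(power_normalize(T @ R @ T.T))          # asymptotically 1+rho
```

For \rho = 0.9 the three asymptotic values are 361, 19 and 1.9, so the DCT-domain filter sees a far better conditioned input.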

Monday, December 21, 2015

Random sequence whitening


  • Orthogonal decomposition (e.g. EVD) and triangular decomposition (e.g. LDU) decorrelate a signal sequence.
  • LDU decomposition allows us to whiten a signal causally (since B = L^{-1} is a causal linear transform).
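A small numerical sketch of causal whitening (assuming a known covariance R): the Cholesky factor L of R = LL^T is lower triangular, so w = L^{-1}x depends only on past and present samples, and the whitened covariance is the identity.

```python
import numpy as np

M, rho = 8, 0.8
# Markov-1 covariance as a concrete example
R = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))

L = np.linalg.cholesky(R)      # R = L L^T with L lower triangular (causal)
B = np.linalg.inv(L)           # whitening transform B = L^{-1}, also lower triangular

# Covariance of w = B x is B R B^T = I
W = B @ R @ B.T
assert np.allclose(W, np.eye(M), atol=1e-10)
```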

Notes on Discrete Karhunen-Loeve Transform (DKLT) and DFT


  • In general, the DKLT is obtained from eigenvalue decomposition.
  • The correlation matrix of a stationary process is Toeplitz.
  • If the autocorrelation sequence of a random process is periodic with fundamental period M, its correlation matrix becomes circulant.
  • The DFT provides the DKLT of periodic random sequences.
    • This can be easily seen because DFT defines the complete set of eigenvectors for all circulant matrices.
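A quick numerical confirmation (sketch): a circulant correlation matrix built from a periodic autocorrelation is diagonalized by the unitary DFT matrix.

```python
import numpy as np

M = 8
# periodic autocorrelation (symmetric: r[k] = r[M-k])
r = np.array([3.0, 1.5, 0.5, 0.2, 0.1, 0.2, 0.5, 1.5])
# circulant correlation matrix: C[i, j] = r[(i - j) mod M]
C = r[np.subtract.outer(np.arange(M), np.arange(M)) % M]

F = np.fft.fft(np.eye(M)) / np.sqrt(M)  # unitary DFT matrix
D = F @ C @ F.conj().T                  # diagonal: DFT vectors are the eigenvectors

off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))         # numerically zero
```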

Thursday, December 10, 2015

Notes on Globecom 2015


Monday 12/7/2015 

Morning session 
  • Suppression of Analog Self-Interference Canceller Nonlinearities in MIMO Full Duplex 
    • Use of nonlinear modeling to cancel nonlinear behavior of the RF chain
  • Receive Spatial Modulation (Marco Di Renzo) 
    • New ideas for MIMO Broadcast Channel 
  • A 60GHz LOS MIMO Backhaul Design Combining Spatial Multiplexing and Beamforming for a 100Gbps Throughput (Gerhard Fettweis' group)
    • Deterministic Spatial Multiplexing (Antenna spacing determined by the distance between TX and RX)
Noon session (mmWave)
  • Adaptive One-bit Compressive Sensing with Application to Low-Precision Receivers at mmWave (joint work University of Vigo, Spain and Robert Heath)
    • Compressive sensing for estimating channel of 1-bit receiver
Afternoon session (Massive MIMO)
  • Exploiting the Tolerance of Massive MIMO to Incomplete CSI for Low-Complexity Transmission (UCL)
    • Exploit antenna correlation by training a subset of antennas and interpolate missing information.  (Perhaps CS based schemes can be used?)
  • Downlink Performance and User Scheduling of HetNet with Large-Scale Antenna Arrays (University of Victoria)
    • Lower bound analysis on Hetnet.  (Useful technique)
  • Large System Analysis of Base Station Cooperation in the Downlink (Luca Sanguinetti, Couillet and Debbah)
    • New RMT results for DL BS cooperation with imperfect CSI 
Tuesday 12/8/2015 

Morning session (Massive MIMO)
  • Polynomial-Expansion Multi-Cell Aware Detector for Uplink Massive MIMO Systems with Imperfect CSI (Germany)
    • replacing matrix inverse with polynomial expansion
  • Joint Use of H-inf Criterion in Channel Estimation and Precoding to Mitigate Pilot Contamination in Massive MIMO Systems (Northeastern University, China)
    • H-inf Criterion used to design CE and Precoding (minimax like criterion, minimize worst case penalty)
  • Location-Aided Pilot Contamination Elimination for Massive MIMO Systems (Chalmers, Sweden)
    • estimate AoA and AS to compute Covariance matrix.
  • Data-Assisted Massive MIMO Uplink Transmission with Large Backhaul Cooperation Delay: Scheme Design and System-Level Analysis (Very good analysis)
    • Successively use data to improve channel estimation.   
    • Develop network model with fixed BS location using stochastic geometry
    • Shifted Zadoff Chu sequence used (do not reuse pilots)
Noon session 
  • Maximal Ratio Transmission in Wireless Poisson Networks under Spatially Correlated Fading Channels
    • Stochastic geometry.  Correlated channel.  Spatial correlation is SINR dependent.
  • Performance Analysis of Single-Carrier Modulation with Correlated Large-Scale Antennas (Geoffrey Li, Georgia Tech) 
    • Propose SC over OFDM in massive MIMO.  No equalizer is needed under the condition of an i.i.d. channel.
Afternoon session
  • Multi-switch for antenna selection in massive MIMO
    • binary switching in antenna selection (have small loss in capacity)
  • A Multi-cell MMSE Precoder for Massive MIMO Systems and New Large System Analysis (Emil Bjornson)  Good analysis
    • Multicell MMSE precoder for TDD massive MIMO using idea of Pilot reuse.  Exploit UL/DL duality to come up with precoder design that spans B directions only.  
  • Harmonized Cellular and Distributed Massive MIMO: Load Balancing and Scheduling (DOCOMO)
    • Distributed massive MIMO with Hetnet, scheduling using blanking etc.  
    • the research is informed by industry knowledge 

Thursday, December 03, 2015

Multi-cluster massive MIMO

Ideas 
  • [12/3/2015] It might be worthwhile to look into the design of the transmit unitary matrix.
Challenges
  • When interfering channel is present, any use of the interfering channel will add to the noise temperature of the environment.

Key Concepts
  • Successive decoding achieves the capacity region of degraded broadcast channel.  
    • Scalar broadcast channel is a degraded broadcast channel.
  • MISO channel: With only covariance feedback, water filling along the eigenvectors of the channel covariance achieves capacity.  Specifically, independent complex circular Gaussian inputs along the eigenvectors of Σ are optimal.
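A minimal water-filling sketch for that last point (an assumed helper, not from the original notes; lam holds the eigenvalues of Σ, P is the power budget, and the water level is found by bisection):

```python
import numpy as np

def waterfill(lam, P, tol=1e-12):
    """Allocate power P over parallel channels with gains lam:
    p_i = max(0, mu - 1/lam_i) with water level mu chosen so sum(p) = P."""
    lo, hi = 0.0, P + 1.0 / lam.min()   # hi guarantees sum(p) >= P
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - 1.0 / lam).sum() > P:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - 1.0 / lam)

lam = np.array([2.0, 1.0, 0.25])        # example eigenvalues of the covariance
p = waterfill(lam, P=1.0)               # strongest eigenmode gets the most power
rate = np.sum(np.log2(1.0 + lam * p))
```

For these numbers only the two strongest eigenmodes are active: the water level works out to 1.25, giving p = (0.75, 0.25, 0).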

Tuesday, November 24, 2015

Quasiconvex functions

A function f is quasiconvex iff \text{dom}\, f is convex and for any x, y \in \text{dom}\, f and 0 \le \theta \le 1
f(\theta x + (1-\theta)y) \le \max\{f(x), f(y)\}  A continuous function f: \mathbb{R} \rightarrow \mathbb{R} is quasiconvex iff at least one of the following conditions holds

  • f is nondecreasing
  • f is nonincreasing
  • there is a point c \in \text{dom}\, f such that for t \le c, f is nonincreasing, and for t \ge c, f is nondecreasing.

First order conditions

A continuously differentiable function f: \mathbb{R}^n \rightarrow \mathbb{R} is quasiconvex iff \text{dom}\, f is convex and for all x, y \in \text{dom}\, f 
f(y) \le f(x) \;\Rightarrow\; \nabla f(x)^T (y-x) \le 0
Operations that preserve quasiconvexity
  • nonnegative weighted maximum
    The function f(x) = \sup_{y\in C} \left( w_y g_y(x) \right) is quasiconvex, where w_y \ge 0 and g_y(x) is a parameterized family of quasiconvex functions.
  • composition
    • if g is quasiconvex and h is nondecreasing, then f = h \circ g is quasiconvex
    • composition of a quasiconvex function with an affine or linear-fractional transformation yields a quasiconvex function.
  • minimization
    if f(x,y) is quasiconvex jointly in x and y and C is a convex set, then the function
    g(x) = \inf_{y\in C} f(x,y) is quasiconvex.
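The defining inequality is easy to spot-check numerically for a function that is quasiconvex but not convex, e.g. f(x) = \sqrt{|x|} (a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sqrt(np.abs(x))   # quasiconvex (interval sublevel sets), not convex

x = rng.uniform(-10, 10, 1000)
y = rng.uniform(-10, 10, 1000)
theta = rng.uniform(0, 1, 1000)

lhs = f(theta * x + (1 - theta) * y)
rhs = np.maximum(f(x), f(y))
assert np.all(lhs <= rhs + 1e-12)  # f(theta x + (1-theta) y) <= max{f(x), f(y)}
```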

Wednesday, November 18, 2015

Gaussian interference channel


The sum capacity is known for two special cases:

Strong interference: Capacity is achieved by decoding and canceling the interference before decoding the desired signal.  The condition is defined as I_2 \ge S_1 and I_1 \ge S_2
R_1 \le C(S_1), \quad R_2 \le C(S_2), \quad R_1 + R_2 \le \min\{C(S_1+I_1), C(S_2+I_2)\}

Weak interference:  Capacity is achieved by treating interference as Gaussian noise
C_{sum} = C\left(\frac{S_1}{1+I_1}\right) + C\left(\frac{S_2}{1+I_2}\right)


Gaussian broadcast channel

Two users case
Scalar Gaussian broadcast channel belongs to the class of degraded broadcast channel.  (The users can be absolutely ranked by their channel strength)  

When the transmitter has more than one antenna, the Gaussian broadcast channel is non-degraded.

A sender of power P transmits to two receivers, one with Gaussian noise power N_1 and the other with N_2.  Assume N_1 < N_2.   The capacity region is
R_1 < C\left(\frac{\alpha P}{N_1}\right), \quad R_2 < C\left(\frac{(1-\alpha)P}{\alpha P + N_2}\right) where 0 \le \alpha \le 1

The weaker receiver Y2 decodes its own message treating the other user's message as interference.  The stronger receiver Y1 decodes Y2's message and cancels it before decoding his own.
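A short sketch of the superposition-coding boundary (C(x) = (1/2)log2(1+x); the power and noise values are illustrative assumptions):

```python
import numpy as np

C = lambda snr: 0.5 * np.log2(1.0 + snr)
P, N1, N2 = 10.0, 1.0, 4.0            # N1 < N2: receiver 1 is the stronger user

alpha = np.linspace(0.0, 1.0, 101)    # fraction of power given to the strong user
R1 = C(alpha * P / N1)
R2 = C((1.0 - alpha) * P / (alpha * P + N2))

# Sweeping alpha traces the boundary: R1 grows while R2 shrinks
assert np.all(np.diff(R1) > 0) and np.all(np.diff(R2) < 0)
```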

Note:  The role of transmitter side information reduces with the growth in the number of TX antennas.  With either CSIT or CDIT and under the ZMSW model, the asymptotic growth of capacity is linear in \min(M,K).


Monday, November 02, 2015

weighted arithmetic-geometric mean inequality

Generalized form of the AM-GM inequality
\sum_i \alpha_i v_i \ge \prod_i v_i^{\alpha_i} where v \succeq 0 and \alpha \succeq 0, \; \mathbf{1}^T\alpha = 1.
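A numeric spot check of the inequality (random nonnegative v and weights drawn from a Dirichlet distribution so they sum to one):

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.uniform(0.0, 10.0, size=(1000, 5))    # v >= 0
a = rng.dirichlet(np.ones(5), size=1000)      # a >= 0, rows sum to 1

am = np.sum(a * v, axis=1)        # weighted arithmetic mean
gm = np.prod(v ** a, axis=1)      # weighted geometric mean
assert np.all(am >= gm - 1e-9)    # AM >= GM in every trial
```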

Tuesday, October 27, 2015

Karush-Kuhn-Tucker conditions for non-convex problems

Consider a problem (ICP) with equality and inequality constraints:
minimize f(x) subject to h_i(x)=0, \; i=1,\dots,m, \quad g_j(x) \le 0, \; j=1,\dots,r where f, h_i, g_j are continuously differentiable functions from \mathbb{R}^n to \mathbb{R}.

With Lagrange function
L(x,\lambda,\mu) = f(x) + \sum_{i=1}^m \lambda_i h_i(x) + \sum_{j=1}^r \mu_j g_j(x)
Define the set of active inequality constraints:
A(x) = \{ j \,|\, g_j(x) = 0 \} A feasible vector x is regular if the equality constraint gradients \nabla h_i(x), \; i=1,\dots,m, and the active inequality constraint gradients \nabla g_j(x), \; j \in A(x) are linearly independent.

KKT optimality conditions

KKT necessary conditions for regular x

Let x^* be a local minimum of the problem
minimize f(x) subject to h_i(x)=0, \; i=1,\dots,m, \quad g_j(x) \le 0, \; j=1,\dots,r where f, h_i, g_j are continuously differentiable functions from \mathbb{R}^n to \mathbb{R} and x^* is regular. There exist unique Lagrange multiplier vectors \lambda^* = (\lambda_1^*,\dots,\lambda_m^*), \mu^* = (\mu_1^*,\dots,\mu_r^*) such that
\nabla_x L(x^*,\lambda^*,\mu^*) = 0, \quad \mu_j^* \ge 0, \; j=1,\dots,r, \quad \mu_j^* = 0, \; \forall j \notin A(x^*)  Note: The condition can be written as
\mu_j^* g_j(x^*) = 0, \; j=1,\dots,r and is aptly referred to as the complementary slackness condition.

If in addition f, h, and g are twice continuously differentiable, there holds
y^T \nabla^2_{xx} L(x^*,\lambda^*,\mu^*) y \ge 0 for all y \in \mathbb{R}^n such that
\nabla h_i(x^*)^T y = 0, \; i=1,\dots,m, \quad \nabla g_j(x^*)^T y = 0, \; j \in A(x^*)
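A tiny worked example (assumed here for illustration): minimize f(x) = x_1^2 + x_2^2 subject to h(x) = x_1 + x_2 - 1 = 0 and g(x) = -x_1 \le 0.  The minimum is x^* = (1/2, 1/2), the inequality is inactive so \mu^* = 0, and stationarity gives \lambda^* = -1.

```python
import numpy as np

x = np.array([0.5, 0.5])          # candidate minimum
lam, mu = -1.0, 0.0               # multipliers for h and g

grad_f = 2.0 * x                  # gradient of x1^2 + x2^2
grad_h = np.array([1.0, 1.0])     # gradient of x1 + x2 - 1
grad_g = np.array([-1.0, 0.0])    # gradient of -x1

# stationarity, feasibility, sign and complementary slackness
stat = grad_f + lam * grad_h + mu * grad_g
assert np.allclose(stat, 0.0)
assert abs(x.sum() - 1.0) < 1e-12 and -x[0] <= 0.0
assert mu >= 0.0 and mu * (-x[0]) == 0.0
```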


Wednesday, October 14, 2015

Braess' paradox

An interesting anomaly in non-cooperative behavior.  (An example where Nash equilibrium is not optimal) See link

Monday, September 28, 2015

Weak and strong alternatives

Weak alternatives
  • Two systems of inequalities are called weak alternatives if at most one of the two is feasible.
  • This is true whether or not the inequalities of system 1 are convex (i.e. fi convex, hi affine)
  • System 2 is always convex (g is concave and the constraint \lambda \succeq 0 is convex)
Lagrange duality theory can be applied to the problem of determining feasibility of a system of inequalities and equalities
f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x) = 0, \; i=1,\dots,p
The above can be posed as the standard problem with objective f0=0:
minimize 0 subject to f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
This problem has optimal value
p^* = \begin{cases} 0 & \text{if the system is feasible} \\ \infty & \text{if the system is infeasible} \end{cases}
We associate with the inequality system the dual function
g(\lambda,\nu) = \inf_{x\in D} \left( \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right)
Note that the dual function is positive homogeneous, i.e. for \alpha > 0, g(\alpha\lambda, \alpha\nu) = \alpha g(\lambda,\nu).   The dual problem of maximizing g(\lambda,\nu) subject to \lambda \succeq 0 has the optimal value
d^* = \begin{cases} \infty & \lambda \succeq 0, \; g(\lambda,\nu) > 0 \text{ is feasible} \\ 0 & \lambda \succeq 0, \; g(\lambda,\nu) > 0 \text{ is infeasible} \end{cases}
From weak duality d^* \le p^* and the above facts, we can conclude that if the inequality system
\lambda \succeq 0, \; g(\lambda,\nu) > 0 is feasible, then d^* = \infty, and the original inequality system is infeasible (since p^* = \infty)

A solution to the dual function is a certificate of infeasibility of the system.  To summarize:
  • feasibility of system 2 implies infeasibility of system 1
  • feasibility of system 1 implies infeasibility of system 2

Strong alternatives

When the original inequality system is convex, and some type of constraint qualification holds, then the pairs of weak alternatives becomes strong alternatives, which means exactly one of the two alternatives holds.

If system 1 is convex, it can be expressed as
f_i(x) \le 0, \; i=1,\dots,m, \quad Ax = b, \; A \in \mathbb{R}^{p\times n}
This system and its alternative
\lambda \succeq 0, \quad g(\lambda,\nu) > 0
are strong alternatives provided there exists an x \in \text{relint}\, D with Ax=b and the optimal value p^* is attained.  [More on that later...]

See Farkas' lemma

Friday, September 25, 2015

Karush-Kuhn-Tucker (KKT) conditions

If strong duality (with zero duality gap) holds, x^*, \lambda^*, \nu^* (possibly more than one such tuple) are optimal (see previous post), and the problem has differentiable f_i, h_i, then they must satisfy the KKT conditions: [necessary conditions]
  1. primal constraints: f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
  2. dual constraints: \lambda \succeq 0
  3. complementary slackness: \lambda_i f_i(x) = 0, \; i=1,\dots,m
  4. gradient of Lagrangian with respect to x vanishes:
    \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + \sum_{i=1}^p \nu_i \nabla h_i(x) = 0
For convex problems

Key: When the primal problem is convex, the KKT conditions are also sufficient for points to be primal and dual optimal.

If \tilde{x}, \tilde{\lambda}, \tilde{\nu} satisfy KKT for a convex problem (i.e. f_i convex, h_i affine), then they are optimal, with zero duality gap:
  • from complementary slackness: f_0(\tilde{x}) = L(\tilde{x},\tilde{\lambda},\tilde{\nu})
  • from the 4th condition (and convexity): g(\tilde{\lambda},\tilde{\nu}) = L(\tilde{x},\tilde{\lambda},\tilde{\nu})
hence, f_0(\tilde{x}) = g(\tilde{\lambda},\tilde{\nu})


If Slater's condition is satisfied:

x is optimal iff there exist \lambda, \nu that satisfy the KKT conditions
  • Slater implies strong duality, therefore the dual optimum is attained
  • generalizes the optimality condition \nabla f_0(x) = 0 for unconstrained problems
Key: Many algorithms for convex optimization are conceived as methods for solving the KKT conditions.

Slater's condition

If the primal problem is convex i.e. of the form
minimize f_0(x) subject to f_i(x) \le 0, \; i=1,\dots,m, \quad Ax=b with f_0,\dots,f_m convex, and if there exists a point that is strictly feasible, or more precisely an x \in \text{relint}\, D such that
f_i(x) < 0, \; i=1,\dots,m, \quad Ax = b.
Strong duality holds if the above condition is met.
Also, the dual optimum is attained (d^* > -\infty)

Thursday, September 24, 2015

Notes on duality

Given a standard form problem (not necessarily convex)
minimize f_0(x) subject to f_i(x) \le 0, \; i=1,\dots,m, \quad h_i(x)=0, \; i=1,\dots,p
variable x \in \mathbb{R}^n, domain D, optimal value p^*
with Lagrangian:
L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x)
and the (Lagrange) dual function:
g(\lambda,\nu) = \inf_{x\in D} L(x,\lambda,\nu)
Properties:

  • g is concave in (\lambda,\nu) since it is the infimum of a family of affine functions.
  • Lower bound property: if \lambda \succeq 0 then g(\lambda,\nu) \le p^*

since, if \tilde{x} is feasible and \lambda \succeq 0, then
g(\lambda,\nu) = \inf_{x\in D} L(x,\lambda,\nu) \le L(\tilde{x},\lambda,\nu) \le f_0(\tilde{x}) minimizing over all feasible \tilde{x} gives g(\lambda,\nu) \le p^*

Lagrange dual and conjugate function

Given an optimization problem with linear inequality and equality constraints
minimize f_0(x) subject to Ax \preceq b, \quad Cx = d.
Using the definition f^*(y) = \sup_{x \in \text{dom}\, f} (y^T x - f(x)), we can write the dual function as:
g(\lambda,\nu) = \inf_x \left( f_0(x) + \lambda^T(Ax-b) + \nu^T(Cx-d) \right) = -b^T\lambda - d^T\nu + \inf_x \left( f_0(x) + (A^T\lambda + C^T\nu)^T x \right) = -b^T\lambda - d^T\nu - f_0^*(-A^T\lambda - C^T\nu) with domain
\text{dom}\, g = \{ (\lambda,\nu) \,|\, -A^T\lambda - C^T\nu \in \text{dom}\, f_0^* \}

  • simplifies derivation of dual if conjugate of f0 is known

The dual problem (finding the greatest lower bound for the primal problem)
maximize g(\lambda,\nu) subject to \lambda \succeq 0
  • a convex optimization problem; optimal value denoted d^*
  • \lambda, \nu are dual feasible if \lambda \succeq 0, (\lambda,\nu) \in \text{dom}\, g

Weak duality: d^* \le p^*

  • always holds (for convex and nonconvex problems)
  • can be used to find lower bounds for difficult problems

Strong duality: d^* = p^* (zero duality gap)

  • does not hold in general
  • (usually) holds for convex problems
  • constraint qualifications assert conditions that guarantee strong duality in convex problems (e.g. Slater's constraint qualification)

Notes on dual norm

Dual norm is defined as
\|z\|_* = \sup\{ z^T x \,|\, \|x\| \le 1 \} The dual norm can be interpreted as the operator norm of z^T, interpreted as a 1 \times n matrix, with the norm \|\cdot\| on \mathbb{R}^n.

From the definition of dual norm we obtain the inequality
z^T x \le \|x\| \|z\|_*
The conjugate of any norm f(x) = \|x\| is the indicator function of the dual norm unit ball
f^*(y) = \begin{cases} 0 & \|y\|_* \le 1 \\ \infty & \text{otherwise} \end{cases}
Proof. If \|y\|_* > 1, then by definition of the dual norm, there is a z \in \mathbb{R}^n with \|z\| \le 1 and y^T z > 1.  Taking x = tz and letting t \rightarrow \infty, we have
y^T x - \|x\| = t(y^T z - \|z\|) \rightarrow \infty which shows that f^*(y) = \infty.  Conversely, if \|y\|_* \le 1, then we have y^T x \le \|x\| \|y\|_* \le \|x\| for all x, which implies y^T x - \|x\| \le 0 for all x.  Therefore x=0 is the value that maximizes y^T x - \|x\|, with maximum value 0.
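As a concrete example (assumed here, not in the original note): the dual of the \ell_1 norm is the \ell_\infty norm, and the supremum over the \ell_1 unit ball is attained at a signed coordinate vector. A numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=5)

# random points on the l1 unit sphere never beat ||z||_inf ...
xs = rng.normal(size=(100000, 5))
xs /= np.abs(xs).sum(axis=1, keepdims=True)
approx = np.max(xs @ z)
assert approx <= np.max(np.abs(z)) + 1e-12

# ... and the sup is attained at x = sign(z_k) e_k for k = argmax |z_k|
k = np.argmax(np.abs(z))
e = np.zeros(5)
e[k] = np.sign(z[k])
assert np.isclose(e @ z, np.max(np.abs(z)))
```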

Wednesday, September 23, 2015

Schur complement: characterizations of positive definiteness for block matrices

Let a block matrix X \in S^n be partitioned as

X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} where A \in S^k.  If \det A \neq 0, the Schur complement is
S = C - B^T A^{-1} B The following characterizations of positive definiteness can be derived
  • X \succ 0 iff A \succ 0 and S \succ 0
  • If A \succ 0, then X \succeq 0 iff S \succeq 0
Example:
A^T A \preceq t^2 I, \; t \ge 0 \;\Leftrightarrow\; tI - t^{-1} A^T A \succeq 0, \; t > 0 \;\Leftrightarrow\; \begin{bmatrix} tI & A \\ A^T & tI \end{bmatrix} \succeq 0
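The example is the spectral-norm characterization \|A\| \le t \iff the block matrix [tI, A; A^T, tI] is PSD; a quick numerical check (sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4))
sigma_max = np.linalg.norm(A, 2)          # largest singular value of A

def block_psd(A, t):
    """Is [[tI, A], [A^T, tI]] positive semidefinite?"""
    m, n = A.shape
    X = np.block([[t * np.eye(m), A], [A.T, t * np.eye(n)]])
    return np.linalg.eigvalsh(X).min() >= -1e-10

assert block_psd(A, sigma_max + 0.1)      # t above the norm: PSD
assert not block_psd(A, sigma_max - 0.1)  # t below the norm: not PSD
```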

Tuesday, September 22, 2015

Notes on optimization problem

Optimality criterion for differentiable f0
Given a convex optimization problem with objective f_0, x is optimal iff it is feasible and
\nabla f_0(x)^T (y-x) \ge 0 \quad \text{for all feasible } y

if \nabla f_0(x) \neq 0, it means that -\nabla f_0(x) defines a supporting hyperplane to the feasible set at x.

Unconstrained problem: x is optimal iff
x \in \text{dom}\, f_0, \quad \nabla f_0(x) = 0
Equality constrained problem:
minimize f0(x) subject to Ax=b
x is optimal iff there exists a ν such that
x \in \text{dom}\, f_0, \quad Ax = b, \quad \nabla f_0(x) + A^T\nu = 0
Note: Lagrange multiplier optimality condition.

Minimization over nonnegative orthant
minimize f0(x) subject to x0
x is optimal iff
x \in \text{dom}\, f_0, \quad x \succeq 0, \quad \begin{cases} \nabla f_0(x)_i \ge 0 & x_i = 0 \\ \nabla f_0(x)_i = 0 & x_i > 0 \end{cases}
Note: The last condition is a complementarity condition.


Monday, September 21, 2015

Notes on conjugate function (Fenchel conjugate)

The conjugate function (Fenchel conjugate)
  • Counterpart to Fourier transform in Convex Analysis
The conjugate of f (f not necessarily convex):
f^*(y) = \sup_{x \in \text{dom}\, f} (y^T x - f(x))

The domain of f^* consists of y \in \mathbb{R}^n for which the supremum is finite, i.e. for which the difference y^T x - f(x) is bounded above on \text{dom}\, f.

f^* is convex, since it is the pointwise supremum of a family of affine functions of y.

Examples:
  • Affine function. f(x) = ax + b.  f^*(y) = \sup_x (yx - ax - b) = \sup_x ((y-a)x - b).  

f^*(y) = \begin{cases} -b & y = a \\ \infty & \text{otherwise} \end{cases} with \text{dom}\, f^* = \{a\}.

  • Negative logarithm. f(x) = -\log x, with \text{dom}\, f = \mathbb{R}_{++}.  

The function xy + \log x is unbounded above if y \ge 0 and reaches its maximum at x = -1/y otherwise.
f^*(y) = \begin{cases} -1 - \log(-y) & y < 0 \\ \infty & \text{otherwise} \end{cases} with \text{dom}\, f^* = \{ y \,|\, y < 0 \}

Fenchel's inequality

Fenchel's inequality can be obtained from the definition of conjugate function
f(x) + f^*(y) \ge x^T y, \quad \forall x, y
since for each y,
f^*(y) \ge x^T y - f(x) \;\Rightarrow\; f(x) + f^*(y) \ge x^T y

Notes on convex functions

Basic Properties

[TODO]

Examples
  • Affine functions are both convex and concave
  • All norms are convex (e.g. Frobenius norm, Spectral norm)
  • Max function f(x) = \max_i x_i is convex on \mathbb{R}^n
  • Quadratic-over-linear function f(x,y) = x^2/y is convex on \text{dom}\, f = \mathbb{R} \times \mathbb{R}_{++}
  • Log-sum-exp f(x) = \log \sum_{i=1}^n e^{x_i} is convex on \mathbb{R}^n
  • Geometric mean f(x) = \left( \prod_{i=1}^n x_i \right)^{1/n} is concave on \text{dom}\, f = \mathbb{R}^n_{++}
  • Log-det is concave on \text{dom}\, f = S^n_{++}
Restriction of a convex function to a line

f: \mathbb{R}^n \rightarrow \mathbb{R} is convex iff the function g: \mathbb{R} \rightarrow \mathbb{R}
g(t) = f(x + tv), \quad \text{dom}\, g = \{ t \,|\, x + tv \in \text{dom}\, f \}
is convex in t for any x \in \text{dom}\, f, v \in \mathbb{R}^n

Upshot: can check convexity of f by checking convexity of functions of one variable.
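A numerical sketch of this check for log-sum-exp: the discrete second differences of g(t) = f(x + tv) along any random line should be nonnegative.

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.log(np.sum(np.exp(x)))   # log-sum-exp, convex on R^n

x, v = rng.normal(size=4), rng.normal(size=4)
t = np.linspace(-2.0, 2.0, 401)
g = np.array([f(x + ti * v) for ti in t])

# discrete second differences of a convex function are nonnegative
d2 = np.diff(g, 2)
assert np.all(d2 >= -1e-10)
```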

Epigraph and sublevel set 

\alpha-sublevel set of f: \mathbb{R}^n \rightarrow \mathbb{R} is defined as all x for which the value of f(x) is at most \alpha
C_\alpha = \{ x \in \text{dom}\, f \,|\, f(x) \le \alpha \}

Property: sublevel sets of convex functions are convex (converse is false)

epigraph of f:RnR:
\text{epi}\, f = \{ (x,t) \in \mathbb{R}^{n+1} \,|\, x \in \text{dom}\, f, \; f(x) \le t \}

Property: f is convex iff  epi f is a convex set

Operations that preserve convexity

methods for establishing convexity of a function
  1. verify definition (often simplified by restricting to a line)
  2. for twice differentiable functions, show \nabla^2 f(x) \succeq 0
  3. show that f is obtained from simple convex functions by operations that preserve convexity
    • nonnegative weighted sum
    • composition with affine function
    • pointwise maximum and supremum
    • composition
    • minimization
    • perspective

Notes on convex sets

Cones

Definition of a cone is quite general:

A set C is a cone if for every x \in C and \theta \ge 0, we have \theta x \in C.  (Note that a subspace is a cone, since all nonnegative scalings of points in the subspace belong to the subspace)

A set C is a convex cone if it is convex and a cone.

A proper cone on the other hand is more strict.

A cone KRn is a proper cone if it satisfies the following

  • K is convex
  • K is closed
  • K is solid, which means it has nonempty interior.
  • K is pointed, which means that it contains no line.

Dual cones and generalized inequalities

Dual cones of a cone K:
K^* = \{ y \,|\, y^T x \ge 0 \text{ for all } x \in K \}
Dual cones of proper cones are proper, so a dual cone defines generalized inequalities in the dual domain
y \succeq_{K^*} 0 \iff y^T x \ge 0 \text{ for all } x \succeq_K 0

Characterization of minimum and minimal elements via dual inequalities

minimum element w.r.t. K

Definition: x is the minimum element of S iff for all \lambda \succ_{K^*} 0, x is the unique minimizer of \lambda^T z over S

minimal element w.r.t. K

Definitions:

  • if x minimizes \lambda^T z over S for some \lambda \succ_{K^*} 0, then x is minimal
  • if x is a minimal element of a convex set S, then there exists a nonzero \lambda \succeq_{K^*} 0 such that x minimizes \lambda^T z over S

Thursday, July 30, 2015

Useful Matrix properties

Theorem

Let A be an m \times n matrix, B and C n \times n matrices, with B symmetric, and x an n \times 1 vector
  1. Ax = 0 for all x if and only if A = 0,
  2. x^T B x = 0 for all x if and only if B = 0,
  3. x^T C x = 0 for all x if and only if C^T = -C (skew symmetric).
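Items 2 and 3 are easy to check numerically (a sketch): only the symmetric part of a matrix contributes to the quadratic form, so x^T C x vanishes identically exactly when C is skew-symmetric.

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
C = M - M.T                       # skew-symmetric: C^T = -C

xs = rng.normal(size=(1000, 4))
q = np.einsum('ni,ij,nj->n', xs, C, xs)
assert np.allclose(q, 0.0)        # x^T C x = 0 for every x

B = M + M.T                       # symmetric and nonzero
qb = np.einsum('ni,ij,nj->n', xs, B, xs)
assert not np.allclose(qb, 0.0)   # so x^T B x = 0 for all x forces B = 0
```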

Wednesday, July 22, 2015

Concentration of measure

To characterize the deviation of a random quantity (typically a general function of weakly correlated variables) from its mean.  

One can look at
  • the tail end probability (moment method and Chernoff's bound)
  • the stability about its mean (i.e. \mathbb{P}\{ \left| \frac{Z}{\mathbb{E}Z} -1 \right| > \epsilon\})
  • the bounded difference (\mathbb{P}[ | Z - \mathbb{E}Z | > t ])
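As a small illustration (a sketch, not from the original note): Z = \|g\|^2/n for g \sim N(0, I_n) has \mathbb{E}Z = 1, and the deviation probability shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(7)

def dev_prob(n, eps=0.1, trials=20000):
    """Empirical P{ |Z/E[Z] - 1| > eps } for Z = ||g||^2 / n, g ~ N(0, I_n)."""
    z = rng.chisquare(n, size=trials) / n   # E[Z] = 1, Var(Z) = 2/n
    return np.mean(np.abs(z - 1.0) > eps)

p_small, p_large = dev_prob(10), dev_prob(1000)
assert p_large < p_small    # much tighter concentration in higher dimension
```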

Monday, July 13, 2015

Circularly symmetric complex Gaussian

Special case: scalar variable Z \sim \mathcal{CN}(0,1)
f_Z(z) = \frac{1}{\pi} e^{-|z|^2} \qquad f_{|Z|,\Theta}(|z|,\theta) = \frac{1}{\pi} |z| e^{-|z|^2}
Note that the distribution is a function of the magnitude of Z, and therefore constant over all angles \theta.  To get the distribution of the magnitude of Z, we integrate over all angles -\pi \le \theta \le \pi.
f_{|Z|}(|z|) = \int f_{|Z|,\Theta}(|z|,\theta) \, d\theta = \int_{-\pi}^{\pi} d\theta \, \frac{1}{\pi} |z| e^{-|z|^2} = 2|z| e^{-|z|^2}, \quad |z| > 0
Define Y = |Z| then,
f_Y(y) = 2y e^{-y^2}, \quad y > 0
For the magnitude-squared distribution, we can perform a change of variables with X = Y^2 = g(Y); since P(\{y : y \le 0\}) = 0, we only have one region (the set \{y : y > 0\}) to worry about.
f_X(x) = f_Y(g^{-1}(x)) \left| \frac{\partial g^{-1}(x)}{\partial x} \right| = f_Y(\sqrt{x}) \left| \frac{\partial \sqrt{x}}{\partial x} \right| = \left( 2 x^{1/2} e^{-x} \right) \frac{1}{2} x^{-1/2} = e^{-x}, \quad x > 0
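A Monte Carlo sanity check of the final result (sketch): the magnitude squared of a \mathcal{CN}(0,1) sample should be Exp(1).

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200000
# CN(0,1): independent real and imaginary parts, each N(0, 1/2)
z = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2.0)

x = np.abs(z) ** 2                  # should be Exp(1): f_X(x) = e^{-x}
assert abs(x.mean() - 1.0) < 0.02                       # E[X] = 1
assert abs(np.mean(x > 1.0) - np.exp(-1.0)) < 0.01      # P(X > 1) = e^{-1}
```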
Note:
Here we assume the probability spaces (X, \mathcal{X}, P_X), (Y, \mathcal{Y}, P_Y) and that a mapping T exists that maps A \in \mathcal{X} to B \in \mathcal{Y}.
Start from the joint cdf in the X-space, define A \equiv \{ \sqrt{X^2+Y^2} \le |z|, \; \tan^{-1}(Y/X) \le \theta_0 \}
F_{|Z|,\Theta}(|z|,\theta_0) = P(A) = \underset{(x,y)\in A}{\iint} f(x,y) \, dx \, dy
With a change of variable, define B\equiv\{|Z|\leq |z|, \Theta \leq \theta_0\}
\begin{aligned} F_{|Z|,\Theta}(|z|,\theta_0) = P(B) &=  \underset{(r,\theta)\in B}{\iint} f(x(r,\theta), y(r,\theta)) r \; dr \; d\theta \\ &= \int_0^{|z|} \int_0^{\theta_0} f(x(r,\theta), y(r,\theta)) r \; dr \; d\theta \end{aligned}
Differentiate wrt |z| and \theta to get the pdf.

Theorem.  Given any function g: \mathbb{R}^2 \rightarrow \mathbb{R} and T that maps A \in \mathcal{X} to B \in \mathcal{Y}, then
\iint_A g(x_1,x_2) dx_1 \; dx_2 = \iint_B g(x_1(y_1,y_2), x_2(y_1,y_2))  |J(y_1, y_2)| dy_1\; dy_2 where J(y_1,y_2) is the Jacobian.


Monday, July 06, 2015

Delay gratification

Newyorker article

The cult of Genius

Discover magazine blog
"... the brain is a muscle. Giving it a harder workout makes you smarter"

Friday, May 15, 2015

searching for files in Windows command prompt

  • If you know the name of the file
    • dir secret.* /s /p


Friday, April 24, 2015

Relationship between vec operator, Schur, Kronecker and Khatri-Rao product

\DeclareMathOperator{\diag}{diag}\DeclareMathOperator{\vec}{vec} Define \vec(A) as the operation of stacking the columns of matrix A into a vector, A\otimes B the Kronecker product, A\circ B the Schur (Hadamard) product and finally A\diamond B the Khatri-Rao product, defined as the column-wise Kronecker product.

Here are some useful properties:
(A\otimes B)^T = A^T \otimes B^T
A \diag(x \circ y) B^T = A \diag(x) \diag(y) B^T
(C\otimes D) ( A \diamond B) = CA \diamond DB
\vec(AXB^T) = (B \otimes A) \vec(X)
\vec(A \diag(x) B^T) = (B \diamond A) x
(A\diamond B \diamond x^T) y = (A \diamond B) (x \circ y)

Where A,B,C,D,X are matrices and x,y are vectors of compatible dimensions.
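These identities can be verified numerically (a sketch; with vec as column stacking and ⋄ as the column-wise Kronecker product, the vec-diag identity is checked here in the order (B ⋄ A)x):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(2, 4))      # same column count as A for Khatri-Rao
C = rng.normal(size=(5, 3))
D = rng.normal(size=(6, 2))
X = rng.normal(size=(4, 2))
x, y = rng.normal(size=4), rng.normal(size=4)

def kr(A, B):
    """Khatri-Rao (column-wise Kronecker) product."""
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], -1)

vec = lambda M: M.reshape(-1, order='F')   # stack columns

KAB = kr(A, B)
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
assert np.allclose(A @ np.diag(x * y) @ B.T, A @ np.diag(x) @ np.diag(y) @ B.T)
assert np.allclose(np.kron(C, D) @ KAB, kr(C @ A, D @ B))
assert np.allclose(vec(A @ X @ D.T), np.kron(D, A) @ vec(X))
assert np.allclose(vec(A @ np.diag(x) @ B.T), kr(B, A) @ x)
assert np.allclose(kr(KAB, x.reshape(1, -1)) @ y, KAB @ (x * y))
```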

Tuesday, March 03, 2015

Linear separability

(Cover, 1965)  Suppose we have N data points distributed at random in \mathbb{R}^d with an unspecified distribution.  Assume that there is no subset of d or fewer points which are linearly dependent.  We then assign each of the points to one of the two classes \mathcal{C}_1 and \mathcal{C}_2 with equal probability.

The fraction F(N,d) of realizations that is linearly separable is given by the expression
F(N,d) = \left\{     \begin{matrix}     1 \quad &\mathrm{when}\; N \le d+1 \\     \frac{1}{2^{N-1}}\sum\limits_{i=0}^d \left( \begin{matrix} N-1 \\ i \end{matrix} \right) \quad & \mathrm{when}\; N \ge d + 1     \end{matrix}\right.   Intuitively, the probability of separability increases with increasing dimension d.
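The counting expression is easy to evaluate directly (a sketch); for instance F(N,d) = 1 when N \le d+1, and F drops to exactly 1/2 at N = 2(d+1).

```python
from math import comb

def F(N, d):
    """Cover's fraction of linearly separable dichotomies of N points in R^d."""
    if N <= d + 1:
        return 1.0
    return sum(comb(N - 1, i) for i in range(d + 1)) / 2 ** (N - 1)

assert F(3, 2) == 1.0      # N <= d+1: always separable
assert F(6, 2) == 0.5      # N = 2(d+1): separable with probability exactly 1/2
assert F(20, 2) < 0.01     # far beyond capacity: almost never separable
```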

[TODO] include plot...

http://www-isl.stanford.edu/~cover/papers/paper76.pdf

Monday, March 02, 2015

Simply connectedness

Informally, a thick object in our space is simply-connected if it consists of one piece and does not have any "holes" that pass all the way through it.

A sphere (or, equivalently, a rubber ball with a hollow center) is simply connected, because any loop on the surface of a sphere can contract to a point, even though it has a "hole" in the hollow center.

The stronger condition, that the object has no holes of any dimension, is called contractibility.

Another characterization of simply-connectedness is the following:

X is simply-connected if and only if 
  • it is path-connected, and 
  • whenever p: [0,1] \rightarrow X and q: [0,1] \rightarrow X are two paths (i.e. continuous maps) with the same start and endpoint (p(0)=q(0) and p(1)=q(1)), then p and q are homotopic relative to \{0,1\}.
Intuitively, this means that p can be "continuously deformed" to get q while keeping the endpoints fixed. Hence the term simply connected: for any two given points in X, there is one and "essentially" only one path connecting them.

Examples:

  • All convex sets in \mathbb{R}^n are simply connected.
  • A sphere is simply connected.


Friday, February 27, 2015

Interesting courses Spring 2015

  • ECE 287 Spec Topics/Comm Theory & Syst
    • TuTh 5:00p-6:20p CENTR223 Franceschetti, Massimo
  • MATH 287D Statistical Learning
    • TuTh 5:00p-6:20p APM 5402 Bradic, Jelena
  • MATH 281C Mathematical Statistics
    • TuTh 2:00p-3:20p APM 5402 Arias-Castro, Ery 
  • MATH 280C Probability Theory
    • MW 5:00p-6:20p APM 5402 Williams, Ruth J
  • MATH 245C Convex Analysis and Optimization III
    • MWF 4:00p-4:50p APM 5402 Nie, Jiawang 
  • MATH 140C Foundations of Real Analysis III
    • MWF 1:00p-1:50p HSS 1128A Saab, Rayan
  • CSE 255 Data Mining and Predictive Analytics
    • TuTh 3:30p-4:50p WLH 2207 Freund, Yoav 
  • CSE 291 Neural Networks
    • TuTh 2:00p-3:20p WLH 2113 Cottrell, Garrison W 
  • CSE 272 Advanced Image Synthesis
    • TuTh 11:00a-12:20p EBU3B 4140 Jensen, Henrik

Tuesday, February 24, 2015

Epigraph

$ \newcommand{\epi}{\mathop{\mathrm{epi}}} $
The epigraph of a function f: \mathbb{R}^n \rightarrow \mathbb{R} is the set of points lying on or above its graph:
\text{epi} f = \{ (x,\mu) : x\in \mathbb{R}^n, \mu \in \mathbb{R}, \mu \ge f(x) \} \subset \mathbb{R}^{n+1}
Properties:

A function is convex if and only if its epigraph is a convex set.

A function is lower semicontinuous if and only if its epigraph is closed.

Saturday, February 14, 2015

Entropy maximization, part II

Let \mathcal{X} \in \{\alpha_1, \cdots, \alpha_n\} be a random variable with finite alphabet, what is the distribution that will achieve maximum entropy given the constraint that E[f(\mathcal{X})]=\beta ?

We define the expectation
\sum_i \alpha_i p_i = \beta The optimization problem is given as
\max_\mathbf{p} H(\mathbf{p}) \; s.t., \; \sum_i p_i = 1,\; \sum_i \alpha_i p_i = \beta Optimizing using Lagrange multipliers \lambda and \mu, we have
p_i = e^{-(1-\lambda)} e^{\mu \alpha_i} which turns out to be the familiar expression
p_i = \frac{1}{Z} e^{\mu \alpha_i} with Z=\sum_i e^{\mu \alpha_i} being the partition function.

The above form is called a Gibbs distribution.

Now consider the problem with \mathcal{X} \in \mathbb{R}^p where we are given the moments E[f_j(\mathcal{X})]=\beta_j, j=1,\dots,p.  The optimization problem becomes
\max_\mathbf{p} H(\mathbf{p}) \; s.t., \; \sum_i p_i = 1,\; \sum_i \alpha_{ij} p_i = \beta_j The solution has the form
p_i = \frac{1}{Z} e^{\sum_{j=1}^p \mu_j \alpha_{ij}} = \frac{1}{Z} e^{\mathbf{\mu}^T \mathbf{\alpha_{i}}} Since the distribution in general has infinite support, the moment constraints are what allow one to reach a unique solution.
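For the single-moment finite-alphabet case above (taking f(X) = X so the \alpha_i are the alphabet values), \mu can be found numerically: the Gibbs mean is monotone in \mu, so bisection works (a sketch with assumed numbers).

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])   # alphabet values
beta = 3.2                               # target mean, inside (min, max) of alphabet

def gibbs(mu):
    s = mu * alpha
    w = np.exp(s - s.max())              # shift for numerical stability
    return w / w.sum()

lo, hi = -50.0, 50.0                     # bisection: the mean increases with mu
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if gibbs(mid) @ alpha < beta:
        lo = mid
    else:
        hi = mid

p = gibbs(0.5 * (lo + hi))               # maximum-entropy (Gibbs) distribution
```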

Friday, February 13, 2015

Entropy maximization

One interesting and well known result is that out of all finite alphabet distributions, the uniform distribution achieves maximum entropy.

The Laplace's principle of insufficient reasoning, calls for assuming uniformity unless there is additional information.

Entropy maximization is the equivalent of minimizing the KL-Divergence between the distribution p and the uniform distribution.

More precisely, let \mathcal{X} \in \{\alpha_1, \cdots, \alpha_n\} be a random variable with finite alphabet, given a family of distribution \mathcal{P} and the uniform distribution u,
\underset{p\in\mathcal{P}}{\text{arg min}} \;D_{KL}(p\| u) = \underset{p\in\mathcal{P}}{\text{arg max}} \;H(p) where H(p) is the entropy.

Proof:
D(p\|u) = \sum_i p_i \log p_i +  (\sum_i p_i) \log (n) = \log (n) - H(p)
Note
This is true in general for random variables with finite support.  For RVs with infinite support,  additional constraint is required.  See Gibbs measure...

Thursday, February 12, 2015

A Side Path to Statistical Mechanics

Consider a physical system with many degrees of freedom, that can reside in any one of a large number of possible states.  Let p_i denote the probability of occurrence of state i, for example, with the following properties:
p_i\ge 0 \quad \text{for all}\; iand 
\sum_i p_i = 1 Let E_i denote the energy of the system when it is in state i.  A fundamental result from statistical mechanics tells us that when the system is in thermal equilibrium with its surrounding environment, state i occurs with a probability defined by 
p_i = \frac{1}{Z}\exp(-\frac{E_i}{k_BT}) where T is the absolute temperature in kelvins, k_B is the Boltzmann's constant, and Z is a constant that is independent of all states.  The partition function Z is the normalizing constant with 
Z= \sum_i \exp(-\frac{E_i}{k_BT}).  The probability distribution is called the Gibbs distribution.

Two interesting properties of the Gibbs distribution are:
  1. States of low energy have a higher probability of occurrence than states of high energy.
  2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
In the context of neural networks, the parameter T may be viewed as a pseudo-temperature that controls thermal fluctuations representing the effect of "synaptic noise" in a neuron.  Its precise scale is irrelevant.  We can redefine the probability p_i and partition function Z as
p_i = \frac{1}{Z}\exp(-\frac{E_i}{T}) and
Z = \sum_i \exp(-\frac{E_i}{T}) where T is referred to simply as the temperature of the system. 

Note that -\log p_i may be viewed as a form of "energy" measured at unit temperature.

Free Energy and Entropy

The Helmholtz free energy of a physical system, denoted by F, is defined in terms of the partition function Z as follows
F = -T \log Z.  The average energy of the system is defined by
\langle E \rangle = \sum_i p_i E_i The difference between the average energy and the free energy is
\langle E \rangle - F = -T\sum_i p_i \log p_i which we can rewrite in terms of the entropy H
\langle E \rangle - F = T H or, equivalently,
F = \langle E \rangle - TH.  The entropy of any system tends to increase until it reaches equilibrium, and therefore the free energy of the system will reach a minimum.

This is an important principle called the principle of minimal free energy.
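These identities are straightforward to check numerically for a small system (a sketch; the energies and temperatures are arbitrary assumptions):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 5.0])        # state energies (arbitrary units)
T = 0.7

p = np.exp(-E / T)
Z = p.sum()
p /= Z                                    # Gibbs distribution
F = -T * np.log(Z)                        # Helmholtz free energy
E_avg = p @ E                             # average energy <E>
H = -np.sum(p * np.log(p))                # entropy (nats)
assert np.isclose(F, E_avg - T * H)       # F = <E> - TH

# lower temperature concentrates probability on the low-energy state
p_cold = np.exp(-E / 0.1)
p_cold /= p_cold.sum()
assert p_cold[0] > p[0]
```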

Cross Entropy

The cross entropy of distributions p and q is
H(p,q)=E_p[-\log q] It can be viewed as
H(p,q)=H(p)+D_{KL}(p\|q) where H(p) is the entropy of p and D_{KL}(p\|q) is the non-negative Kullback–Leibler divergence.

From a source coding perspective, it is the total number of bits required to encode information if the estimated distribution q diverges from the true distribution p, where H(p) is the minimum.

This quantity is very useful in machine learning.  Viewed from a vector quantization point of view, logistic regression is a way of finding an optimal boundary to classifying samples in a (possibly high dimensional) space of interest.   This expression quantifies the loss of estimating distribution q instead of the true distribution p.  Since H(p) is fixed because it is a property of the underlying true distribution, minimizing cross-entropy is equivalent to minimizing KL divergence in this setting.
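A quick numeric check of the decomposition (a sketch in natural log, with assumed distributions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))      # entropy of p
kl = np.sum(p * np.log(p / q))    # KL divergence D(p||q) >= 0
H_pq = -np.sum(p * np.log(q))     # cross entropy E_p[-log q]

assert np.isclose(H_pq, H_p + kl)          # H(p,q) = H(p) + D(p||q)
assert kl >= 0.0 and H_pq >= H_p           # H(p) is the minimum over q
```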

Friday, January 30, 2015

Conditional Expectation

Definition.  \lambda is absolutely continuous (AC) with respect to \mu, written \lambda \ll \mu, if \mu(A)=0 implies \lambda(A) = 0.

Theorem 2 Radon-Nikodym Theorem  Let (\Omega,\mathcal{B},P) be a probability space.  Suppose v is a positive bounded measure and v \ll P.  Then there exists an integrable random variable X\in \mathcal{B}, such that
v(E) = \int_E XdP, \quad \forall E \in \mathcal{B}.  X is a.s. unique (P) and is written
X=\frac{dv}{dP}.  We also write dv=XdP.

Definition of Conditional Expectation

Suppose X\in L_1(\Omega,\mathcal{B},P) and let \mathcal{G}\subset \mathcal{B} be a sub-\sigma-field.  Then there exists a random variable E(X|\mathcal{G}), called the conditional expectation of X with respect to \mathcal{G}, such that

  1. E(X|\mathcal{G}) is \mathcal{G}-measurable and integrable.
  2. For all G\in\mathcal{G} we have \int_G XdP = \int_G E(X|\mathcal{G})dP
Notes.
  1. Definition of conditional probability: Given (\Omega,\mathcal{B},P), a probability space, with \mathcal{G} a sub-\sigma-field of \mathcal{B}, define P(A|\mathcal{G})=E(1_A|\mathcal{G}), \quad A\in \mathcal{B}.  Thus P(A|\mathcal{G}) is a random variable such that 
    1. P(A|\mathcal{G}) is \mathcal{G}-measurable and integrable.
    2. P(A|\mathcal{G})  satisfies \int_G P(A|\mathcal{G})dP = P(A\cap G), \quad \forall G \in \mathcal{G}.
  2. Conditioning on random variables: Suppose \{X_t, t\in T \} is a family of random variables defined on (\Omega,\mathcal{B}) and indexed by some index set T.   Define \mathcal{G}:=\sigma(X_t,t\in T) to be the \sigma-field generated by the process \{X_t, t\in T \}.  Then define E(X|X_t, t\in T)= E(X|\mathcal{G}).
Note (1) continues the duality of probability and expectation but seems to place expectation in a somewhat more basic position, since conditional probability is defined in terms of conditional expectation.

Note (2) saves us from having to make separate definitions for E(X|X_1), E(X|X_1,X_2), etc.

Countable partitions Let \{\Lambda_n, n\ge 1 \} be a partition of \Omega so that \Lambda_i \cap \Lambda_j = \emptyset, i\neq j, and \sum_n \Lambda_n=\Omega.  Define
\mathcal{G}=\sigma(\Lambda_n, n\ge 1) so that
\mathcal{G}=\left\{ \sum_{i\in J}\Lambda_i: J\subset\{1,2,\dots \} \right\}. For X\in L_1(P), define
E_{\Lambda_n}(X)=\int XP(d\omega|\Lambda_n)=\int_{\Lambda_n}XdP/P\Lambda_n , if P(\Lambda_n)>0 and E_{\Lambda_n}(X) = 18 (an arbitrary constant) if P(\Lambda_n)=0.  We claim

  1. E(X|\mathcal{G})\overset{a.s.}{=} \sum_{n=1}^\infty E_{\Lambda_n}(X) 1_{\Lambda_n}   and for any A\in \mathcal{B}
  2. P(A|\mathcal{G})\overset{a.s.}{=} \sum_{n=1}^\infty P(A|\Lambda_n)1_{\Lambda_n}
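Claim 1 can be illustrated on a finite toy probability space; the space, the random variable X and the partition below are assumptions chosen for illustration:

```python
from fractions import Fraction

# Finite toy probability space: Omega = {0,...,5} with uniform P (hypothetical setup).
omega = list(range(6))
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}

# Partition Lambda_1 = {0,1,2}, Lambda_2 = {3,4,5} generating the sub-sigma-field G.
partition = [{0, 1, 2}, {3, 4, 5}]

def cond_exp(w):
    """E(X | G)(w): the block average int_{Lambda_n} X dP / P(Lambda_n)."""
    block = next(b for b in partition if w in b)
    p_block = sum(P[v] for v in block)
    return sum(X[v] * P[v] for v in block) / p_block

# E(X|G) is constant on each block and integrates like X over every G in G.
for block in partition:
    lhs = sum(X[w] * P[w] for w in block)         # int_G X dP
    rhs = sum(cond_exp(w) * P[w] for w in block)  # int_G E(X|G) dP
    assert lhs == rhs
```

Using exact rational arithmetic (`Fraction`) makes the defining property \int_G XdP = \int_G E(X|\mathcal{G})dP hold with equality rather than up to rounding.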



Product Spaces, Transition Kernel and Fubini's Theorem

Lemma 1. Sectioning sets Sections of measurable sets are measurable.  If A\in \mathcal{B}_1 \times \mathcal{B}_2, then for all \omega_1 \in \Omega_1,
A_{\omega_1}\in \mathcal{B}_2
Corollary 1. Sections of measurable functions are measurable.  That is if
X: (\Omega_1\times \Omega_2, \mathcal{B}_1 \times \mathcal{B}_2) \mapsto (S,\mathcal{S}) then X_{\omega_1} \in \mathcal{B}_2.  We say X_{\omega_1} is \mathcal{B}_2/\mathcal{S} measurable.

Define the transition (probability) kernel
K(\omega_1,A_2):\Omega_1 \times \mathcal{B}_2 \mapsto [0,1]
if it satisfies the following
  1. for each \omega_1, K(\omega_1,\cdot) is a probability measure on \mathcal{B}_2, and
  2. for each A_2\in \mathcal{B}_2, K(\cdot, A_2) is \mathcal{B}_1/\mathcal{B}([0,1]) measurable.
Transition kernels are used to define discrete time Markov processes where K(\omega_1,A_2) represents the conditional probability that, starting from \omega_1, the next movement of the system results in a state in A_2.

Theorem 1.  Let P_1 be a probability measure on \mathcal{B}_1, and suppose
K:\Omega_1 \times \mathcal{B}_2 \mapsto [0,1] is a transition kernel.  Then K and P_1 uniquely determine a probability on \mathcal{B}_1 \times \mathcal{B}_2 via the formula
P(A_1\times A_2)= \int_{A_1}K(\omega_1,A_2)P_1(d\omega_1), for all A_1\times A_2 in the class of measurable rectangles.

Theorem 2. Marginalization  Let P_1 be a probability measure on (\Omega_1,\mathcal{B}_1) and suppose K: \Omega_1\times \mathcal{B_2} \mapsto [0,1] is a transition kernel.  Define P on (\Omega_1 \times \Omega_2, \mathcal{B_1}\times \mathcal{B_2}) by
P(A_1\times A_2)= \int_{A_1} K(\omega_1,A_2)P_1(d\omega_1).  Assume
X:(\Omega_1\times \Omega_2, \mathcal{B}_1\times \mathcal{B}_2) \mapsto (\mathbb{R},\mathcal{B}(\mathbb{R})) and furthermore suppose X is integrable.  Then
Y(\omega_1)=\int_{\Omega_2} K(\omega_1,d\omega_2)X_{\omega_1}(\omega_2) has the properties

  1. Y is well defined.
  2. Y \in \mathcal{B}_1
  3. Y \in L_1(P_1) and furthermore
\int_{\Omega_1\times \Omega_2}XdP = \int_{\Omega_1} Y(\omega_1)P_1(d\omega_1) = \int_{\Omega_1} [ \int_{\Omega_2} K(\omega_1,d\omega_2) X_{\omega_1}(\omega_2)]P_1(d\omega_1).
Theorem 3.  Fubini Theorem  Let P=P_1\times P_2 be a product measure.  If X is \mathcal{B}_1 \times \mathcal{B}_2 measurable and is either non-negative or integrable with respect to P then
\begin{aligned} \int_{\Omega_1\times \Omega_2}XdP &= \int_{\Omega_1}[ \int_{\Omega_2}X_{\omega_1}(\omega_2) P_2(d\omega_2) ] P_1(d\omega_1) \\                                                                 &= \int_{\Omega_2}[ \int_{\Omega_1}X_{\omega_2}(\omega_1) P_1(d\omega_1) ] P_2(d\omega_2) \end{aligned}
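For a finite product space the interchange of iterated integrals can be checked directly; a minimal sketch with made-up marginals P_1, P_2 and an arbitrary (automatically integrable) X:

```python
# Discrete Fubini check: product measure P = P1 x P2 on a finite grid (toy example).
P1 = {0: 0.25, 1: 0.75}
P2 = {0: 0.4, 1: 0.1, 2: 0.5}

def X(w1, w2):
    # Arbitrary function on the product space; any bounded X is integrable here.
    return (w1 + 1) * (w2 - 1) ** 2

# Iterated integrals in both orders.
inner_then_outer = sum(P1[w1] * sum(X(w1, w2) * P2[w2] for w2 in P2) for w1 in P1)
outer_then_inner = sum(P2[w2] * sum(X(w1, w2) * P1[w1] for w1 in P1) for w2 in P2)

# Fubini: both orders of integration give the same value.
assert abs(inner_then_outer - outer_then_inner) < 1e-12
```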

Thursday, January 29, 2015

Clarification of Expectation

Let X be a random variable on the probability space (\Omega, \mathcal{B}, P).  Recall that the distribution of X is the measure
F := P \circ X^{-1} on (\mathbb{R},\mathcal{B}(\mathbb{R})) defined by
F(A)=P\circ X^{-1}(A) = P[X\in A].
The distribution function of X is
F(x):= F((-\infty,x])=P[X\leq x].  Note that the letter "F" is overloaded in two ways.

An application of the Transformation Theorem allows us to compute the abstract integral
E(X) = \int_\Omega XdP as
E(X) = \int_\mathbb{R} xF(dx), which is an integral on \mathbb{R}.

More precisely,
E(X) = \int_\Omega X(\omega)P(d\omega)=\int_\mathbb{R} x F(dx).
Also, given a measurable function g, the expectation of g(X) is
E(g(X)) = \int_\Omega g(X(\omega))P(d\omega)=\int_\mathbb{R} g(x) F(dx).
Instead of computing expectations on the abstract space \Omega, one can always compute them on \mathbb{R} using F, the distribution of X.
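A sketch of this equivalence for a concrete X: take X exponential with rate 1, realized on \Omega = [0,1] by the inverse transform X(\omega) = -\log(1-\omega) (this realization and g(x) = x^2 are choices made for illustration).  The Monte Carlo integral over \Omega and the Riemann integral against F(dx) both approximate E(X^2) = 2:

```python
import math
import random

# X ~ Exp(1) realized on Omega = [0,1] (uniform P) via X(omega) = -log(1 - omega).
random.seed(0)
g = lambda x: x * x   # E[g(X)] = E[X^2] = 2 for Exp(1)

# Monte Carlo estimate of the abstract integral int_Omega g(X(omega)) P(d omega).
n = 200_000
mc = sum(g(-math.log(1.0 - random.random())) for _ in range(n)) / n

# Riemann estimate of int_R g(x) F(dx) = int_0^inf g(x) e^{-x} dx.
dx = 0.001
riemann = sum(g(i * dx) * math.exp(-i * dx) * dx for i in range(1, 20000))

# Both routes approximate the same expectation.
assert abs(mc - 2.0) < 0.1
assert abs(riemann - 2.0) < 0.01
```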

Random variables and Inverse maps

A random variable is a real valued function with domain \Omega which has an extra property called measurability that allows us to make probability statements about the random variable.

Suppose \Omega and \Omega' are two sets.  Often \Omega' = \mathbb{R}.  Suppose
X:\Omega \mapsto \Omega'.  Then X determines an inverse map (a set-valued function)
X^{-1}: \mathcal{P}(\Omega')\mapsto \mathcal{P}(\Omega) defined by
X^{-1}(A') = \{\omega \in \Omega : X(\omega) \in A'\} for A' \subset \Omega'.

X^{-1} preserves complementation, union and intersections.
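These preservation properties are easy to check on a finite toy map (the sets and the map below are assumptions for illustration):

```python
# Check that preimages preserve complement, union and intersection on a toy map.
Omega = {1, 2, 3, 4, 5, 6}
Omega_p = {"a", "b", "c"}
X = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}  # X: Omega -> Omega'

def preimage(Ap):
    # X^{-1}(A') = {omega in Omega : X(omega) in A'}
    return {w for w in Omega if X[w] in Ap}

A, B = {"a"}, {"a", "b"}
assert preimage(Omega_p - A) == Omega - preimage(A)   # complementation
assert preimage(A | B) == preimage(A) | preimage(B)   # union
assert preimage(A & B) == preimage(A) & preimage(B)   # intersection
```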

Wednesday, January 28, 2015

Convergence Concepts

Implications of convergence



1. Almost Sure Convergence

Examples of statements that hold almost surely (a.s.)

  • Let X,X' be two random variables.  Then X=X' a.s. means P[X=X']=1; that is, there exists an event N\in \mathcal{B}, such that P(N)=0 and if \omega\in N^c, then X(\omega)=X'(\omega).
  • If \{X_n\} is a sequence of random variables, then \lim_{n\rightarrow \infty}X_n exists a.s. means there exists an event N\in \mathcal{B}, such that P(N)=0 and if \omega\in N^c then \lim_{n\rightarrow \infty}X_n(\omega) exists. It also means that for a.a. \omega, \underset{n\rightarrow \infty}{\text{lim sup}}X_n(\omega)=\underset{n\rightarrow \infty}{\text{lim inf}}X_n(\omega). We will write \lim_{n\rightarrow \infty}X_n = X or X_n \overset{a.s.}{\rightarrow}X.
  • If \{X_n\} is a sequence of random variables, then \sum_n X_n converges a.s. means there exists an event N\in \mathcal{B}, such that P(N)=0, and \omega \in N^c implies \sum_n X_n(\omega) converges.

2. Convergence in Probability

Suppose X_n, n\ge 1 and X are random variables.  Then \{X_n\} converges in probability (i.p.) to X, written X_n \overset{P}{\rightarrow}X, if for any \epsilon > 0, \lim_{n\rightarrow \infty} P[|X_n-X|>\epsilon]=0.
Almost sure convergence of \{X_n\} demands that for a.e. \omega, X_n(\omega)-X(\omega) gets small and stays small.  Convergence i.p. is weaker: it merely requires that the probability of the difference X_n-X being non-trivial becomes small.

It is possible for a sequence to converge in probability but not almost surely.

Theorem 1. Convergence a.s. implies convergence i.p. Suppose that X_n, n\ge 1 and X are random variables on a probability space (\Omega,\mathcal{B},P).  If X_n \rightarrow X \; a.s., then X_n \overset{P}{\rightarrow}X.
Proof.  If X_n \rightarrow X a.s., then for any \epsilon > 0,
\begin{aligned} 0\;&=P([|X_n-X|>\epsilon]i.o.) \\   &=P(\underset{n\rightarrow \infty}{\text{lim sup}}[|X_n-X|>\epsilon]) \\   &=\lim_{N\rightarrow \infty}P(\bigcup_{n\ge N}[|X_n-X|>\epsilon] ) \\   &\ge \lim_{n\rightarrow \infty}P[|X_n-X|>\epsilon] \end{aligned}

3. L_p Convergence

Recall the notation X\in L_p which means E(|X|^p)<\infty . For random variables X,Y\in L_p, we define the L_p metric for p\ge 1 by
d(X,Y)=(E|X-Y|^p)^{1/p}.  This metric is norm induced because
\|X\|_p := (E|X|^p)^{1/p} is a norm on the space L_p.

A sequence \{X_n\} of random variables converges in L_p to X, written
X_n \overset{L_p}{\rightarrow}X , if
E(|X_n-X|^p) \rightarrow 0 as n\rightarrow \infty.

Facts about L_p convergence.
  1. L_p convergence implies convergence in probability: For p>0, if X_n\overset{L_p}{\rightarrow} X then X_n \overset{P}{\rightarrow}X .  This follows readily from Chebyshev's inequality, P[|X_n-X|\ge \epsilon] \leq  \frac{E(|X_n-X|^p)}{\epsilon^p} \rightarrow 0.
  2. Convergence in probability does not imply L_p convergence.  What can go wrong is that the nth function in the sequence can be huge on a very small set.
    Example.  Let the probability space be ([0,1],\mathcal{B}([0,1]),\lambda) , where \lambda is Lebesgue measure and define
    X_n = 2^n 1_{(0,\frac{1}{n}) } then
    P[|X_n| > \epsilon ] = P \left( (0,\frac{1}{n}) \right)  = \frac{1}{n} \rightarrow 0 but
    E(|X_n|^p) = 2^{np} \frac{1}{n} \rightarrow \infty
  3. L_p convergence does not imply almost sure convergence.
    Example.  Consider the functions \{X_n\} defined on  ([0,1],\mathcal{B}([0,1]),\lambda) , where \lambda is Lebesgue measure.
    \begin{align*}     X_1 &= 1_{[0,\frac{1}{2}]},  \quad  X_2 = 1_{[\frac{1}{2},1]} \\     X_3 &= 1_{[0,\frac{1}{3}]},  \quad  X_4 = 1_{[\frac{1}{3},\frac{2}{3}]} \\     X_5 &= 1_{[\frac{2}{3},1]},  \quad  X_6 = 1_{[0,\frac{1}{4}]}, \cdots \\ \end{align*} and so on.  Note that for any p>0,
    E(|X_1|^p)=E(|X_2|^p)=\frac{1}{2},\\     E(|X_3|^p)=E(|X_4|^p)=E(|X_5|^p)=\frac{1}{3}, \\     E(|X_6|^p)=\frac{1}{4}, \cdots so  E(|X_n|^p) \rightarrow 0 and
    X_n \xrightarrow[]{L_p} 0.
    Observe that \{X_n\} does not converge almost surely to 0.
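The counterexample in fact 2 above (convergence in probability without L_p convergence) can be explored by Monte Carlo; a minimal sketch, sampling \omega uniformly from [0,1]:

```python
import random

# Monte Carlo look at X_n = 2^n * 1_{(0,1/n)} on ([0,1], Lebesgue):
# P[|X_n| > eps] = 1/n -> 0 (convergence i.p. to 0),
# while E|X_n| = 2^n / n -> infinity (no L_1 convergence).
random.seed(1)
samples = [random.random() for _ in range(100_000)]

def stats(n):
    hits = sum(1 for w in samples if 0.0 < w < 1.0 / n)
    prob = hits / len(samples)       # estimates P[|X_n| > eps] = 1/n
    moment = (2.0 ** n) * prob       # estimates E|X_n| = 2^n / n
    return prob, moment

p5, m5 = stats(5)
p15, m15 = stats(15)
assert p15 < p5    # the tail probability shrinks with n...
assert m15 > m5    # ...while the first moment blows up
```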

Limits and Integrals

Under certain circumstances we are allowed to interchange expectation with limits.

Theorem 1. Monotone Convergence Theorem (MCT). If
0\leq X_n \uparrow X, then
0\leq E(X_n) \uparrow E(X)
Corollary 1.  Series Version of MCT.  If X_n \ge 0 are non-negative random variables for n\ge1, then
E(\sum_{n=1}^\infty X_n)= \sum_{n=1}^\infty E(X_n)
so that the expectation and infinite sum can be interchanged

Theorem 2. Fatou Lemma. If X_n \ge 0, then
E(\underset{n\rightarrow \infty}{\text{lim inf}} X_n ) \leq \underset{n\rightarrow \infty}{\text{lim inf}}E(X_n)
More generally, if there exists Z\in L_1 and X_n\ge Z, then
E(\underset{n\rightarrow \infty}{\text{lim inf}} X_n ) \leq \underset{n\rightarrow \infty}{\text{lim inf}}E(X_n)
Corollary 2. More Fatou.  If 0 \leq X_n \leq Z where Z\in L_1, then
E(\underset{n\rightarrow \infty}{\text{lim sup}} X_n ) \ge \underset{n\rightarrow \infty}{\text{lim sup}}E(X_n)
Theorem 3. Dominated Convergence Theorem (DCT).  If
X_n \rightarrow X and there exists a dominating random variable Z\in L_1 such that
|X_n| \leq Z, then
E(X_n)\rightarrow E(X) \; \text{and} \; E|X_n-X|\rightarrow 0.
\newcommand{\scriptB}{\mathcal{B}} \newcommand{\scriptP}{\mathcal{P}} \newcommand{\vecX}{\mathbf{X}} \newcommand{\vecx}{\mathbf{x}} \newcommand{\reals}{\mathbb{R}} \newcommand{\cplxs}{\mathbb{C}} \newcommand{\rationals}{\mathbb{Q}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\integers}{\mathbb{Z}} \newcommand{\ntoinf}{n\rightarrow\infty} \newcommand{\mtoinf}{m\rightarrow\infty} \newcommand{\tendsto}{\rightarrow}
Example of how interchanging limits and integrals fails without the dominating condition.  (Something very nasty happens on a small set, and the degree of nastiness overpowers the degree of smallness.)

Let
(\Omega, \scriptB, P) = ([0,1], \scriptB([0,1]), \lambda) \lambda the Lebesgue measure. Define
X_n = n^2 1_{(0,1/n)}. For any \omega \in [0,1],
 1_{(0,1/n)}(\omega) \tendsto 0, so
X_n \tendsto 0. However
 E(X_n) = n^2 \cdot \frac{1}{n} = n \tendsto \infty, so
E(\liminf_{\ntoinf} X_n) = 0 \le \liminf_{\ntoinf} (EX_n) = \infty and
E(\limsup_{\ntoinf} X_n) = 0 \not\ge \limsup_{\ntoinf} (EX_n) = \infty.
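A minimal numerical illustration of this counterexample (the choice \omega = 0.37 is arbitrary; any fixed point of (0,1] works):

```python
# X_n = n^2 * 1_{(0,1/n)} on ([0,1], Lebesgue): X_n(omega) -> 0 for every omega,
# yet E(X_n) = n^2 * (1/n) = n -> infinity, so limit and integral cannot be swapped.
def E(n):
    # Exact Lebesgue integral: n^2 times the measure of (0, 1/n).
    return (n ** 2) * (1.0 / n)

# Pointwise limit is 0 at any fixed omega: once 1/n < omega the indicator vanishes.
omega = 0.37
pointwise = [n * n if 0.0 < omega < 1.0 / n else 0 for n in range(1, 50)]
assert all(v == 0 for v in pointwise[5:])

# ...but the expectations diverge linearly in n.
assert abs(E(10) - 10) < 1e-9
assert abs(E(100) - 100) < 1e-9
```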

Tuesday, January 27, 2015

Zero-One Laws

There are several common zero-one laws which identify situations in which the range of a random variable is trivial (i.e. the variable is almost surely constant).  There are also several zero-one laws which provide the basis for all proofs of almost sure convergence.

Proposition 1. Borel-Cantelli Lemma Let \{A_n\} be any events (not necessarily independent).

If \sum_n{P(A_n)}<\infty, then
P([A_n \; i.o.])=P(\underset{n\rightarrow \infty}{\text{lim sup}} A_n) = 0.
Proposition 2. Borel Zero-One Law If \{A_n\} is a sequence of independent events, then

 \begin{equation*} P([A_n \; i.o.])= \begin{cases} 0, \quad &  \text{iff } \sum_n P(A_n) < \infty \\ 1, \quad &  \text{iff } \sum_n P(A_n) = \infty \end{cases} \end{equation*}
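A simulation sketch of the Borel zero-one law (the horizon N, the number of sample paths, and the two probability sequences are choices made for illustration): with P(A_n) = 1/n^2 the sum converges and only finitely many A_n occur, while with P(A_n) = 1/n the sum diverges and occurrences keep accumulating like log N.

```python
import random

# Independent events A_n with P(A_n) = prob(n); count how many occur up to N,
# averaged over several sample paths to stabilize the estimate.
random.seed(42)
N, PATHS = 20_000, 30

def mean_occurrences(prob):
    total = sum(
        sum(1 for n in range(1, N + 1) if random.random() < prob(n))
        for _ in range(PATHS)
    )
    return total / PATHS

conv = mean_occurrences(lambda n: 1.0 / n ** 2)  # ~ pi^2/6, bounded as N grows
div = mean_occurrences(lambda n: 1.0 / n)        # ~ log N, grows without bound

assert conv < 5.0            # summable case: a handful of occurrences, then none
assert div > conv + 5.0      # divergent case: occurrences keep piling up
```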
Definition.  An almost trivial \sigma-field is a \sigma-field all of whose events have probability 0 or 1.

Theorem 3. Kolmogorov Zero-One Law If \{X_n\} are independent random variables with tail \sigma-field \mathcal{T}, then \Lambda\in \mathcal{T} implies P(\Lambda)=0 or 1 so that the tail \sigma-field is almost trivial.

Lemma 4. Almost trivial \sigma-fields  Let \mathcal{G} be an almost trivial \sigma-field and let X be a random variable measurable with respect to \mathcal{G}.  Then there exists c such that P[X=c] = 1.

Corollary 5.  Let \{X_n\} be independent random variables.  Then the following are true.

(a) The event
[\sum_n X_n \;converges] has probability 0 or 1.

(b) The random variables \text{lim sup}_{n\rightarrow \infty}X_n and \text{lim inf}_{n\rightarrow \infty}X_n are constant with probability 1.

(c) The event
\{\omega: S_n(\omega)/n \rightarrow 0 \}, where S_n=X_1+\cdots+X_n, has probability 0 or 1.

Monday, January 26, 2015

Inequalities

Another valuable book for anyone in computer science who ever wants to bound any quantity (so, everyone!) is: The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities by Michael Steele.
An encyclopedic book on the topic is A Dictionary of Inequalities. While this is not a book for reading cover-to-cover, it is good to have it at your disposal. See also the supplement of the book.
Moreover, Wikipedia has an excellent list of inequalities.
For specific topics, you may consult:

Friday, January 23, 2015

Dynkin's theorem

First we define a structure named \lambda-system.

A class of subsets \mathcal{L} of \Omega is called a \lambda-system if it satisfies the following postulates

1.  \Omega\in \mathcal{L}
2. A\in\mathcal{L} \Rightarrow A^c \in \mathcal{L}
3. n\neq m, A_nA_m = \emptyset, A_n \in \mathcal{L} \Rightarrow \cup_n A_n \in \mathcal{L}

It is clear that a \sigma-field is always a \lambda-system.

Next a \pi-system is a class of sets closed under finite intersections.

Dynkin's theorem

a) If \mathcal{P} is a \pi-system and \mathcal{L} is a \lambda-system such that \mathcal{P}\subset\mathcal{L}, then \sigma(\mathcal{P})\subset \mathcal{L}.

b) If \mathcal{P} is a \pi-system,

\sigma(\mathcal{P})=\mathcal{L}(\mathcal{P})

that is, the minimal \sigma-field over \mathcal{P} equals the minimal \lambda-system over \mathcal{P}.

Wednesday, January 21, 2015

An example of Stein's paradox

Charles Stein showed in 1956 that a nonlinear, biased estimator of a multivariate normal mean can have lower MSE than the ML estimator.

Given a sample of N measurements of X\sim\mathcal{N}(\mu,\sigma^2 I_p) with unknown mean vector \mu of dimension p.

The James-Stein estimator is given by

\begin{equation*} \hat{\mu}_{JS}=\left (1-\frac{(p-2)\frac{\sigma^2}{N}}{\|\bar{x} \|^2}\right ) \bar{x} \end{equation*}
where \bar{x} is the sample mean.

This estimator dominates the MLE everywhere in terms of MSE.  For all \mu\in\mathbb{R}^p,

\begin{equation*} \mathbb{E}_\mu \| \hat{\mu}_{JS}-\mu\|^2 < \mathbb{E}_\mu \| \hat{\mu}_{MLE}-\mu\|^2 \end{equation*}
This makes the MLE inadmissible for p\ge3!
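A Monte Carlo sketch comparing the two estimators; the values of \mu, \sigma, p and N below are assumptions chosen for illustration:

```python
import random

# Compare MSE of the MLE (sample mean) vs the James-Stein estimator for the
# mean of N(mu, sigma^2 I_p). All parameter values here are illustrative.
random.seed(0)
p, N, sigma = 10, 5, 1.0
mu = [0.5] * p
trials = 2000

def sample_vec():
    return [random.gauss(m, sigma) for m in mu]

mse_mle = mse_js = 0.0
for _ in range(trials):
    # Sample mean of N draws (the MLE).
    xbar = [sum(xs) / N for xs in zip(*(sample_vec() for _ in range(N)))]
    # James-Stein shrinkage toward the origin.
    norm2 = sum(v * v for v in xbar)
    shrink = 1.0 - (p - 2) * (sigma ** 2 / N) / norm2
    js = [shrink * v for v in xbar]
    mse_mle += sum((v - m) ** 2 for v, m in zip(xbar, mu))
    mse_js += sum((v - m) ** 2 for v, m in zip(js, mu))

mse_mle /= trials
mse_js /= trials
assert mse_js < mse_mle   # James-Stein beats the MLE in MSE for p >= 3
```

Note this is the plain (non-positive-part) James-Stein estimator; the positive-part variant, which clips the shrinkage factor at 0, does even better.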

Wednesday, January 14, 2015

The need for measure theory

The problem with measure arises when one needs to decompose a body into a (possibly uncountable) number of components and reassemble it after some action on those components.  Even when one restricts attention to finite partitions, one still runs into trouble.  The most striking example is the Banach-Tarski paradox, which shows that a unit ball B in three dimensions can be disassembled into a finite number of pieces and reassembled to form two disjoint copies of B.

Such pathological sets almost never come up in practical applications of mathematics.  Because of this, the standard solution to the problem of measure has been to abandon the goal of measuring every subset E of \mathbb{R}^d and instead to settle for only measuring a certain subclass of
non-pathological subsets of \mathbb{R}^d, referred to as the measurable sets.

The most fundamental properties of a measure are
  1. finite or countable additivity
  2. translation invariance
  3. rotation invariance
The concept of Jordan measure (closely related to that of Riemann and Darboux integral) is sufficient for undergraduate level analysis.

However, the types of sets that arise in analysis, in particular those that arise as limits of other sets, require an extended concept of measurability (Lebesgue measurability).

Lebesgue theory is viewed as a completion of the Jordan-Darboux-Riemann theory.  It keeps almost all of the desirable properties of Jordan measure, but with the crucial additional property that many features of the Lebesgue theory are preserved under limits.

Probability space concepts

A field is a non-empty class of subsets of \Omega closed under finite union and complements.  A synonym for field is algebra.

From de Morgan's laws a field is also closed under finite intersection.

A \sigma-field \mathcal{B} is a non-empty class of subsets of \Omega closed under countable union, countable intersection and complements.  A synonym for \sigma-field is \sigma-algebra.

In probability theory, the event space is a \sigma-field.  This allows us enough flexibility in constructing new events from old ones (closure), but not so much flexibility that we have trouble assigning probabilities to the elements of the \sigma-field.

For the reals, we start with sets to which we know how to assign probabilities.

Suppose \Omega=\mathbb{R} and let

\mathcal{C}=\{(a,b],-\infty \leq a \leq b < \infty \}

The Borel \sigma-field is defined as

\mathcal{B}(\mathbb{R}) \equiv \sigma(\mathcal{C})

Also one can show that

\mathcal{B}(\mathbb{R}) = \sigma(\text{open sets in } \mathbb{R})

Wednesday, January 07, 2015

Russell's Paradox

Shortly after the turn of the 20th century, Bertrand Russell demonstrated a hole in the mathematical logic of the set theory of the time, in which a set can be a member of itself.  Consider the set R defined by

R = \{S \mid S \notin S\}

The set R contains all sets that do not have themselves as members.

However, is R a member of itself?

If it is, then by definition R is a set that does not have itself as a member, a contradiction.

But if R does not have itself as a member, then by definition it must be a member of R, again a contradiction.

At the time, it created a huge stir in the mathematical community, since most of mathematics was built on the foundation of set theory.