

Friday, February 27, 2015

Interesting courses Spring 2015

  • ECE 287 Spec Topics/Comm Theory & Syst
    • TuTh 5:00p-6:20p CENTR223 Franceschetti, Massimo
  • MATH 287D Statistical Learning
    • TuTh 5:00p-6:20p APM 5402 Bradic, Jelena
  • MATH 281C Mathematical Statistics
    • TuTh 2:00p-3:20p APM 5402 Arias-Castro, Ery 
  • MATH 280C Probability Theory
    • MW 5:00p-6:20p APM 5402 Williams, Ruth J
  • MATH 245C Convex Analysis and Optimization III
    • MWF 4:00p-4:50p APM 5402 Nie, Jiawang 
  • MATH 140C Foundations of Real Analysis III
    • MWF 1:00p-1:50p HSS 1128A Saab, Rayan
  • CSE 255 Data Mining and Predictive Analytics
    • TuTh 3:30p-4:50p WLH 2207 Freund, Yoav 
  • CSE 291 Neural Networks
    • TuTh 2:00p-3:20p WLH 2113 Cottrell, Garrison W 
  • CSE 272 Advanced Image Synthesis
    • TuTh 11:00a-12:20p EBU3B 4140 Jensen, Henrik

Tuesday, February 24, 2015

Epigraph

$ \newcommand{\epi}{\mathop{\mathrm{epi}}} $
The epigraph of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the set of points lying on or above its graph:
$$\epi f = \{(x, \mu) : x \in \mathbb{R}^n,\ \mu \in \mathbb{R},\ \mu \ge f(x)\} \subseteq \mathbb{R}^{n+1}$$
Properties:

A function is convex if and only if its epigraph is a convex set.

A function is lower semicontinuous if and only if its epigraph is closed.
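
As a quick worked example of the first property (the function here is chosen arbitrarily for illustration), take $f(x) = x^2$ on $\mathbb{R}$, so that
$$\epi f = \{(x,\mu) \in \mathbb{R}^2 : \mu \ge x^2\}.$$
For $(x_1,\mu_1), (x_2,\mu_2) \in \epi f$ and $t \in [0,1]$, convexity of $x^2$ gives
$$\bigl(t x_1 + (1-t)x_2\bigr)^2 \le t x_1^2 + (1-t)x_2^2 \le t\mu_1 + (1-t)\mu_2,$$
so the convex combination of the two points stays in $\epi f$; the epigraph is indeed a convex set.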

Saturday, February 14, 2015

Entropy maximization, part II

Let $X \in \{\alpha_1, \dots, \alpha_n\}$ be a random variable with a finite alphabet. What is the distribution that achieves maximum entropy given the constraint $E[f(X)] = \beta$?

Taking $f$ to be the identity for concreteness, the expectation constraint is
$$\sum_i \alpha_i p_i = \beta.$$
The optimization problem is given as
$$\max_p H(p) \quad \text{s.t.} \quad \sum_i p_i = 1, \quad \sum_i \alpha_i p_i = \beta$$
Optimizing using Lagrange multipliers $\lambda$ and $\mu$, we have
$$p_i = \exp(-1-\lambda)\exp(-\mu \alpha_i),$$
which turns out to be the familiar expression
$$p_i = \frac{1}{Z}\exp(-\mu \alpha_i)$$
with $Z = \sum_i \exp(-\mu \alpha_i)$ being the partition function.

The above form is called a Gibbs distribution.

Now consider the problem where $X \in \mathbb{R}^p$ and we are given the moments $E[f_j(X)] = \beta_j$, $j = 1, \dots, p$. The optimization problem becomes (writing $\alpha_{ij} = f_j(\alpha_i)$)
$$\max_p H(p) \quad \text{s.t.} \quad \sum_i p_i = 1, \quad \sum_i \alpha_{ij} p_i = \beta_j.$$
The solution has the form
$$p_i = \frac{1}{Z}\exp\Bigl(-\sum_{j=1}^{p} \mu_j \alpha_{ij}\Bigr) = \frac{1}{Z}\exp(-\mu^T \alpha_i).$$
Since the distribution in general has infinite support, it is the moment constraints that allow one to reach a unique solution.
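
As a small numerical sketch of the single-constraint case (the alphabet, the target mean, and the root-finding approach below are my own illustrative choices, not from the post), one can solve for the Lagrange multiplier $\mu$ directly:

```python
import numpy as np
from scipy.optimize import brentq

# Maximum-entropy distribution on a finite alphabet subject to a mean constraint:
# p_i = exp(-mu * alpha_i) / Z, with mu chosen so that sum_i alpha_i p_i = beta.
alpha = np.array([1.0, 2.0, 3.0, 4.0])   # finite alphabet (arbitrary example)
beta = 1.8                               # target mean; must lie strictly between min and max of alpha

def gibbs(mu):
    w = np.exp(-mu * (alpha - alpha.min()))  # shift exponent for numerical stability; cancels after normalizing
    return w / w.sum()

def mean_gap(mu):
    return gibbs(mu) @ alpha - beta          # zero exactly when the mean constraint holds

mu_star = brentq(mean_gap, -50.0, 50.0)      # root-find the Lagrange multiplier
p = gibbs(mu_star)
print("mu* =", mu_star)
print("p   =", p, " mean =", p @ alpha)      # mean should be ~beta
print("H(p) =", -(p * np.log(p)).sum())
```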

Friday, February 13, 2015

Entropy maximization

One interesting and well-known result is that, out of all distributions on a finite alphabet, the uniform distribution achieves maximum entropy.

Laplace's principle of insufficient reason calls for assuming uniformity unless there is additional information.

Entropy maximization is equivalent to minimizing the KL divergence between the distribution $p$ and the uniform distribution.

More precisely, let $X \in \{\alpha_1, \dots, \alpha_n\}$ be a random variable with a finite alphabet. Given a family of distributions $\mathcal{P}$ and the uniform distribution $u$,
$$\arg\min_{p \in \mathcal{P}} D_{KL}(p \| u) = \arg\max_{p \in \mathcal{P}} H(p)$$
where $H(p)$ is the entropy.

Proof:
$$D_{KL}(p \| u) = \sum_i p_i \log p_i + \Bigl(\sum_i p_i\Bigr)\log(n) = \log(n) - H(p)$$
Note
This is true in general for random variables with finite support. For RVs with infinite support, additional constraints are required. See Gibbs measure...
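
A short numerical check of the identity above (the distribution is just a random example of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6)
p /= p.sum()                      # an arbitrary distribution on an alphabet of size n = 6
n = p.size

H = -(p * np.log(p)).sum()        # entropy (in nats)
D = (p * np.log(p * n)).sum()     # D_KL(p || u) with u_i = 1/n
print(D, np.log(n) - H)           # the two values agree
```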

Thursday, February 12, 2015

A Side Path to Statistical Mechanics

Consider a physical system with many degrees of freedom that can reside in any one of a large number of possible states. Let $p_i$ denote the probability of occurrence of state $i$, with the properties
$$p_i \ge 0 \quad \text{for all } i$$
and
$$\sum_i p_i = 1.$$
Let $E_i$ denote the energy of the system when it is in state $i$. A fundamental result from statistical mechanics tells us that when the system is in thermal equilibrium with its surrounding environment, state $i$ occurs with a probability defined by
$$p_i = \frac{1}{Z}\exp\Bigl(-\frac{E_i}{k_B T}\Bigr)$$
where $T$ is the absolute temperature in kelvins, $k_B$ is Boltzmann's constant, and $Z$ is a constant that is independent of all states. The partition function $Z$ is the normalizing constant,
$$Z = \sum_i \exp\Bigl(-\frac{E_i}{k_B T}\Bigr).$$
This probability distribution is called the Gibbs distribution.

Two interesting properties of the Gibbs distribution are:
  1. States of low energy have a higher probability of occurrence than states of high energy.
  2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
In the context of neural networks, the parameter $T$ may be viewed as a pseudo-temperature that controls thermal fluctuations representing the effect of "synaptic noise" in a neuron. Its precise scale is irrelevant, so we can redefine the probability $p_i$ and partition function $Z$ as
$$p_i = \frac{1}{Z}\exp\Bigl(-\frac{E_i}{T}\Bigr)$$
and
$$Z = \sum_i \exp\Bigl(-\frac{E_i}{T}\Bigr)$$
where $T$ is referred to simply as the temperature of the system.

Note that $-\log p_i$ may be viewed as a form of "energy" measured at unit temperature.
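
To illustrate the second property, here is a tiny sketch (the energy values are arbitrary) comparing the Gibbs distribution at several temperatures:

```python
import numpy as np

E = np.array([1.0, 2.0, 3.0, 5.0])          # arbitrary state energies

def gibbs(E, T):
    w = np.exp(-(E - E.min()) / T)           # subtract the minimum energy for numerical stability
    return w / w.sum()

for T in (10.0, 1.0, 0.1):
    p = gibbs(E, T)
    print(f"T = {T:5.1f}  p = {np.round(p, 4)}")
# As T decreases, the probability mass concentrates on the low-energy states.
```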

Free Energy and Entropy

The Helmholtz free energy of a physical system, denoted by $F$, is defined in terms of the partition function $Z$ as
$$F = -T\log Z.$$
The average energy of the system is defined by
$$\langle E \rangle = \sum_i p_i E_i.$$
The difference between the average energy and the free energy is
$$\langle E \rangle - F = -T\sum_i p_i \log p_i,$$
which we can rewrite in terms of the entropy $H$ as
$$\langle E \rangle - F = TH$$
or, equivalently,
$$F = \langle E \rangle - TH.$$
The entropy of a system tends to increase until it reaches equilibrium, and therefore the free energy of the system reaches a minimum.

This is an important principle called the principle of minimal free energy.
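
For completeness, here is the short derivation of the identity $\langle E\rangle - F = TH$ (not spelled out above), using $p_i = \frac{1}{Z}\exp(-E_i/T)$:
$$\log p_i = -\frac{E_i}{T} - \log Z \;\Rightarrow\; E_i = -T\log p_i - T\log Z,$$
so, summing against $p_i$ and using $\sum_i p_i = 1$,
$$\langle E\rangle = \sum_i p_i E_i = -T\sum_i p_i \log p_i - T\log Z = TH + F.$$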

Cross Entropy

The cross entropy of distributions $p$ and $q$ is
$$H(p,q) = -E_p[\log q].$$
It can be viewed as
$$H(p,q) = H(p) + D_{KL}(p \| q)$$
where $H(p)$ is the entropy of $p$ and $D_{KL}(p \| q)$ is the non-negative Kullback–Leibler divergence.

From a source coding perspective, it is the expected number of bits required to encode data from the true distribution $p$ using a code optimized for the estimated distribution $q$, with $H(p)$ being the minimum, achieved when $q = p$.

This quantity is very useful in machine learning. Viewed from a vector quantization point of view, logistic regression is a way of finding an optimal boundary for classifying samples in a (possibly high-dimensional) space of interest. The cross entropy quantifies the loss of estimating the distribution $q$ instead of the true distribution $p$. Since $H(p)$ is fixed, being a property of the underlying true distribution, minimizing the cross entropy is equivalent to minimizing the KL divergence in this setting.
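
A short numerical check of the decomposition $H(p,q) = H(p) + D_{KL}(p \| q)$ (the two distributions below are arbitrary examples of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()          # "true" distribution (arbitrary)
q = rng.random(5); q /= q.sum()          # "estimated" distribution (arbitrary)

H_p  = -(p * np.log(p)).sum()            # entropy of p
D_pq = (p * np.log(p / q)).sum()         # KL divergence D_KL(p || q)
H_pq = -(p * np.log(q)).sum()            # cross entropy H(p, q)

print(H_pq, H_p + D_pq)                  # the two numbers agree
```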