

Friday, February 27, 2015

Interesting courses Spring 2015

  • ECE 287 Spec Topics/Comm Theory & Syst
    • TuTh 5:00p-6:20p CENTR223 Franceschetti, Massimo
  • MATH 287D Statistical Learning
    • TuTh 5:00p-6:20p APM 5402 Bradic, Jelena
  • MATH 281C Mathematical Statistics
    • TuTh 2:00p-3:20p APM 5402 Arias-Castro, Ery 
  • MATH 280C Probability Theory
    • MW 5:00p-6:20p APM 5402 Williams, Ruth J
  • MATH 245C Convex Analysis and Optimization III
    • MWF 4:00p-4:50p APM 5402 Nie, Jiawang 
  • MATH 140C Foundations of Real Analysis III
    • MWF 1:00p-1:50p HSS 1128A Saab, Rayan
  • CSE 255 Data Mining and Predictive Analytics
    • TuTh 3:30p-4:50p WLH 2207 Freund, Yoav 
  • CSE 291 Neural Networks
    • TuTh 2:00p-3:20p WLH 2113 Cottrell, Garrison W 
  • CSE 272 Advanced Image Synthesis
    • TuTh 11:00a-12:20p EBU3B 4140 Jensen, Henrik

Tuesday, February 24, 2015

Epigraph

$ \newcommand{\epi}{\mathop{\mathrm{epi}}} $
The epigraph of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the set of points lying on or above its graph:
$$\epi f = \{(x, \mu) : x \in \mathbb{R}^n,\ \mu \in \mathbb{R},\ \mu \ge f(x)\} \subseteq \mathbb{R}^{n+1}$$
Properties:

A function is convex if and only if its epigraph is a convex set.

A function is lower semicontinuous if and only if its epigraph is closed.
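
As a quick worked example of the first property (the function here is chosen arbitrarily for illustration), take $f(x) = x^2$ on $\mathbb{R}$, so that
$$\epi f = \{(x,\mu) \in \mathbb{R}^2 : \mu \ge x^2\}.$$
For $(x_1,\mu_1), (x_2,\mu_2) \in \epi f$ and $t \in [0,1]$, convexity of $x^2$ gives
$$\bigl(t x_1 + (1-t)x_2\bigr)^2 \le t x_1^2 + (1-t)x_2^2 \le t\mu_1 + (1-t)\mu_2,$$
so the convex combination of the two points stays in $\epi f$; the epigraph is indeed a convex set.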

Saturday, February 14, 2015

Entropy maximization, part II

Let $X \in \{\alpha_1, \dots, \alpha_n\}$ be a random variable with a finite alphabet. What is the distribution that achieves maximum entropy given the constraint $E[f(X)] = \beta$?

Taking $f$ to be the identity for concreteness, the expectation constraint is
$$\sum_i \alpha_i p_i = \beta.$$
The optimization problem is given as
$$\max_p H(p) \quad \text{s.t.} \quad \sum_i p_i = 1, \quad \sum_i \alpha_i p_i = \beta$$
Optimizing using Lagrange multipliers $\lambda$ and $\mu$, we have
$$p_i = \exp(-1-\lambda)\exp(-\mu \alpha_i),$$
which turns out to be the familiar expression
$$p_i = \frac{1}{Z}\exp(-\mu \alpha_i)$$
with $Z = \sum_i \exp(-\mu \alpha_i)$ being the partition function.

The above form is called a Gibbs distribution.

Now consider the problem where $X \in \mathbb{R}^p$ and we are given the moments $E[f_j(X)] = \beta_j$, $j = 1, \dots, p$. The optimization problem becomes (writing $\alpha_{ij} = f_j(\alpha_i)$)
$$\max_p H(p) \quad \text{s.t.} \quad \sum_i p_i = 1, \quad \sum_i \alpha_{ij} p_i = \beta_j.$$
The solution has the form
$$p_i = \frac{1}{Z}\exp\Bigl(-\sum_{j=1}^{p} \mu_j \alpha_{ij}\Bigr) = \frac{1}{Z}\exp(-\mu^T \alpha_i).$$
Since the distribution in general has infinite support, it is the moment constraints that allow one to reach a unique solution.
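
As a small numerical sketch of the single-constraint case (the alphabet, the target mean, and the root-finding approach below are my own illustrative choices, not from the post), one can solve for the Lagrange multiplier $\mu$ directly:

```python
import numpy as np
from scipy.optimize import brentq

# Maximum-entropy distribution on a finite alphabet subject to a mean constraint:
# p_i = exp(-mu * alpha_i) / Z, with mu chosen so that sum_i alpha_i p_i = beta.
alpha = np.array([1.0, 2.0, 3.0, 4.0])   # finite alphabet (arbitrary example)
beta = 1.8                               # target mean; must lie strictly between min and max of alpha

def gibbs(mu):
    w = np.exp(-mu * (alpha - alpha.min()))  # shift exponent for numerical stability; cancels after normalizing
    return w / w.sum()

def mean_gap(mu):
    return gibbs(mu) @ alpha - beta          # zero exactly when the mean constraint holds

mu_star = brentq(mean_gap, -50.0, 50.0)      # root-find the Lagrange multiplier
p = gibbs(mu_star)
print("mu* =", mu_star)
print("p   =", p, " mean =", p @ alpha)      # mean should be ~beta
print("H(p) =", -(p * np.log(p)).sum())
```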

Friday, February 13, 2015

Entropy maximization

One interesting and well-known result is that, out of all distributions on a finite alphabet, the uniform distribution achieves maximum entropy.

Laplace's principle of insufficient reason calls for assuming uniformity unless there is additional information.

Entropy maximization is equivalent to minimizing the KL divergence between the distribution $p$ and the uniform distribution.

More precisely, let $X \in \{\alpha_1, \dots, \alpha_n\}$ be a random variable with a finite alphabet. Given a family of distributions $\mathcal{P}$ and the uniform distribution $u$,
$$\arg\min_{p \in \mathcal{P}} D_{KL}(p \| u) = \arg\max_{p \in \mathcal{P}} H(p)$$
where $H(p)$ is the entropy.

Proof:
$$D_{KL}(p \| u) = \sum_i p_i \log p_i + \Bigl(\sum_i p_i\Bigr)\log(n) = \log(n) - H(p)$$
Note
This is true in general for random variables with finite support. For RVs with infinite support, additional constraints are required. See Gibbs measure...
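
A short numerical check of the identity above (the distribution is just a random example of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6)
p /= p.sum()                      # an arbitrary distribution on an alphabet of size n = 6
n = p.size

H = -(p * np.log(p)).sum()        # entropy (in nats)
D = (p * np.log(p * n)).sum()     # D_KL(p || u) with u_i = 1/n
print(D, np.log(n) - H)           # the two values agree
```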

Thursday, February 12, 2015

A Side Path to Statistical Mechanics

Consider a physical system with many degrees of freedom that can reside in any one of a large number of possible states. Let $p_i$ denote the probability of occurrence of state $i$, with the properties
$$p_i \ge 0 \quad \text{for all } i$$
and
$$\sum_i p_i = 1.$$
Let $E_i$ denote the energy of the system when it is in state $i$. A fundamental result from statistical mechanics tells us that when the system is in thermal equilibrium with its surrounding environment, state $i$ occurs with a probability defined by
$$p_i = \frac{1}{Z}\exp\Bigl(-\frac{E_i}{k_B T}\Bigr)$$
where $T$ is the absolute temperature in kelvins, $k_B$ is Boltzmann's constant, and $Z$ is a constant that is independent of all states. The partition function $Z$ is the normalizing constant,
$$Z = \sum_i \exp\Bigl(-\frac{E_i}{k_B T}\Bigr).$$
This probability distribution is called the Gibbs distribution.

Two interesting properties of the Gibbs distribution are:
  1. States of low energy have a higher probability of occurrence than states of high energy.
  2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
In the context of neural networks, the parameter $T$ may be viewed as a pseudo-temperature that controls thermal fluctuations representing the effect of "synaptic noise" in a neuron. Its precise scale is irrelevant, so we can redefine the probability $p_i$ and partition function $Z$ as
$$p_i = \frac{1}{Z}\exp\Bigl(-\frac{E_i}{T}\Bigr)$$
and
$$Z = \sum_i \exp\Bigl(-\frac{E_i}{T}\Bigr)$$
where $T$ is referred to simply as the temperature of the system.

Note that $-\log p_i$ may be viewed as a form of "energy" measured at unit temperature.
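
To illustrate the second property, here is a tiny sketch (the energy values are arbitrary) comparing the Gibbs distribution at several temperatures:

```python
import numpy as np

E = np.array([1.0, 2.0, 3.0, 5.0])          # arbitrary state energies

def gibbs(E, T):
    w = np.exp(-(E - E.min()) / T)           # subtract the minimum energy for numerical stability
    return w / w.sum()

for T in (10.0, 1.0, 0.1):
    p = gibbs(E, T)
    print(f"T = {T:5.1f}  p = {np.round(p, 4)}")
# As T decreases, the probability mass concentrates on the low-energy states.
```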

Free Energy and Entropy

The Helmholtz free energy of a physical system, denoted by $F$, is defined in terms of the partition function $Z$ as
$$F = -T\log Z.$$
The average energy of the system is defined by
$$\langle E \rangle = \sum_i p_i E_i.$$
The difference between the average energy and the free energy is
$$\langle E \rangle - F = -T\sum_i p_i \log p_i,$$
which we can rewrite in terms of the entropy $H$ as
$$\langle E \rangle - F = TH$$
or, equivalently,
$$F = \langle E \rangle - TH.$$
The entropy of a system tends to increase until it reaches equilibrium, and therefore the free energy of the system reaches a minimum.

This is an important principle called the principle of minimal free energy.
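
For completeness, here is the short derivation of the identity $\langle E\rangle - F = TH$ (not spelled out above), using $p_i = \frac{1}{Z}\exp(-E_i/T)$:
$$\log p_i = -\frac{E_i}{T} - \log Z \;\Rightarrow\; E_i = -T\log p_i - T\log Z,$$
so, summing against $p_i$ and using $\sum_i p_i = 1$,
$$\langle E\rangle = \sum_i p_i E_i = -T\sum_i p_i \log p_i - T\log Z = TH + F.$$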

Cross Entropy

The cross entropy of distributions $p$ and $q$ is
$$H(p,q) = -E_p[\log q].$$
It can be viewed as
$$H(p,q) = H(p) + D_{KL}(p \| q)$$
where $H(p)$ is the entropy of $p$ and $D_{KL}(p \| q)$ is the non-negative Kullback–Leibler divergence.

From a source coding perspective, it is the expected number of bits required to encode data from the true distribution $p$ using a code optimized for the estimated distribution $q$, with $H(p)$ being the minimum, achieved when $q = p$.

This quantity is very useful in machine learning. Viewed from a vector quantization point of view, logistic regression is a way of finding an optimal boundary for classifying samples in a (possibly high-dimensional) space of interest. The cross entropy quantifies the loss of estimating the distribution $q$ instead of the true distribution $p$. Since $H(p)$ is fixed, being a property of the underlying true distribution, minimizing the cross entropy is equivalent to minimizing the KL divergence in this setting.
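
A short numerical check of the decomposition $H(p,q) = H(p) + D_{KL}(p \| q)$ (the two distributions below are arbitrary examples of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()          # "true" distribution (arbitrary)
q = rng.random(5); q /= q.sum()          # "estimated" distribution (arbitrary)

H_p  = -(p * np.log(p)).sum()            # entropy of p
D_pq = (p * np.log(p / q)).sum()         # KL divergence D_KL(p || q)
H_pq = -(p * np.log(q)).sum()            # cross entropy H(p, q)

print(H_pq, H_p + D_pq)                  # the two numbers agree
```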