
Friday, September 16, 2016

Enabling copy/paste in VMware 12 Player

1) sudo apt-get autoremove open-vm-tools
2) Install VMware Tools by following the usual method (Virtual Machine --> Reinstall VMware Tools)
3) Reboot the VM
4) sudo apt-get install open-vm-tools-desktop
5) Reboot the VM, after the reboot copy/paste and drag/drop will work!

Monday, September 12, 2016

Properties of MMSE and MAP estimators (Bayesian)

The MMSE estimator is the mean E(x|y) of the posterior pdf of x given the observation y.
  1. The estimator is unbiased.
  2. The covariance is reduced compared to the a priori information.
  3. Commutes over affine transformations.
  4. Additivity property for independent data sets.
  5. Linear in the Gaussian case.
  6. The estimator error is orthogonal to the space spanned by all Y-measurable functions (affine functions being a subset).
The MAP estimator \arg\max_{\theta} p(\theta|x) given observation x
  1. Jointly Gaussian case, MAP = MMSE (posterior is Gaussian, hence the pdf is unimodal and symmetric, mean = mode = median); see the numerical sketch below
  2. Does not commute over nonlinear transformations (the invariance property does not hold, unlike ML)
  3. Commutes over linear transformations.
MAP tends to ML when
  • Prior is uninformative
  • Large amount of information in data compared to prior
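As a quick numerical sketch of the points above (a scalar Gaussian example with made-up values, added here for illustration): with a Gaussian prior and Gaussian likelihood the posterior is Gaussian, so the MMSE and MAP estimates coincide with the posterior mean, and as the prior variance grows both tend to the ML estimate \bar{x}.

import numpy as np

# Scalar Gaussian example (all values are made up for illustration):
# x_i = theta + w_i, theta ~ N(mu0, prior_var), w_i ~ N(0, s^2) i.i.d.
# The posterior is Gaussian, so MMSE (posterior mean) = MAP (posterior mode);
# with an uninformative prior (prior_var -> infinity) both tend to x_bar.
rng = np.random.default_rng(0)

mu0 = 0.0                # prior mean
s = 2.0                  # noise standard deviation
theta_true = 1.5         # "true" parameter used only to simulate data
x = theta_true + rng.normal(0.0, s, size=50)

def posterior_mean(prior_var):
    post_var = 1.0 / (1.0 / prior_var + x.size / s**2)
    return post_var * (mu0 / prior_var + x.sum() / s**2)

print("ML estimate (x_bar)       :", x.mean())
print("MMSE = MAP, prior var 1   :", posterior_mean(1.0))
print("MMSE = MAP, prior var 1e6 :", posterior_mean(1e6))  # ~ ML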

Gaussian linear model

Let the observed samples follow the model
x = H\theta + w with prior \theta \sim \mathcal{N}(\mu_\theta, C_\theta) and noise vector w \sim \mathcal{N}(0, C_w) independent of \theta, then the posterior is Gaussian with mean
E(\theta|x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H\mu_\theta) and covariance C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta.  Contrary to the classical Gaussian linear model, H does not need to be full rank.
In alternative form, 
E(\theta|x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta) and C_{\theta|x} = (C_\theta^{-1} + H^T C_w^{-1} H)^{-1}
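A small numerical check of the two equivalent forms (the matrices below are arbitrary illustrative choices, and H is deliberately short and fat so it cannot be full column rank):

import numpy as np

# Bayesian Gaussian linear model x = H*theta + w, theta ~ N(mu_theta, C_theta),
# w ~ N(0, C_w).  Verify that the "measurement space" and "parameter space"
# forms of the posterior mean and covariance agree (matrix inversion lemma).
rng = np.random.default_rng(1)

n, p = 2, 3                          # fewer observations than parameters,
H = rng.standard_normal((n, p))      # so H is not full column rank
mu_theta = np.zeros(p)
C_theta = np.eye(p)                  # prior covariance
C_w = 0.5 * np.eye(n)                # noise covariance

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(n), C_w)

# Form 1: invert in measurement space
G = C_theta @ H.T @ np.linalg.inv(H @ C_theta @ H.T + C_w)
mean1 = mu_theta + G @ (x - H @ mu_theta)
cov1 = C_theta - G @ H @ C_theta

# Form 2: invert in parameter space (information form)
P = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.inv(C_w) @ H)
mean2 = mu_theta + P @ H.T @ np.linalg.inv(C_w) @ (x - H @ mu_theta)
cov2 = P

print(np.allclose(mean1, mean2), np.allclose(cov1, cov2))   # True True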

LMMSE estimator \hat{E}[X|Y]
  1. A function of first and second order statistics only:  \hat{E}[X|Y] = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y) (the inverse can be replaced with a pseudo-inverse if necessary)
  2. Jointly Gaussian case, \hat{E}[X|Y] = E[X|Y] (see the sketch after this list)
  3. Error orthogonal to the subspace spanned by Y
  4. Additivity property \hat{E}[X|Y_1, \dotsc, Y_k] = \sum_{j=1}^{k} \hat{E}[X|Y_j] - (k-1)\mu_x
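A minimal sketch of properties 1 and 2 (the joint mean and covariance below are arbitrary illustrative values): the LMMSE estimate is built from first and second order statistics only, and in the jointly Gaussian case it is exactly the conditional mean.

import numpy as np

# LMMSE of X from a single observation Y = y using only first and second
# order statistics; for a jointly Gaussian (X, Y) this equals E[X|Y=y].
mu = np.array([1.0, -2.0])            # [mu_x, mu_y]
Sigma = np.array([[2.0, 0.8],         # joint covariance of (X, Y)
                  [0.8, 1.0]])

y = 0.5                               # observed value of Y
Sigma_xy, Sigma_yy = Sigma[0, 1], Sigma[1, 1]

x_hat = mu[0] + Sigma_xy / Sigma_yy * (y - mu[1])
print(x_hat)                          # = E[X | Y = y] in the Gaussian case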

Properties of the exponential family of distributions

From Dasgupta (see link)

One parameter Exponential family

Given the family of distributions \{P_\theta, \theta \in \Theta \subseteq \mathbb{R}\}, the pdf of which has the form
f(x|\theta) = h(x) e^{\eta(\theta) T(x) - \psi(\theta)}
If \eta(\theta) is a 1-1 function of \theta we can drop \theta from the discussion.  Thus the family of distributions \{P_\eta, \eta \in \Xi \subseteq \mathbb{R}\} is in canonical form,
f(x|\eta) = h(x) e^{\eta T(x) - \psi(\eta)} and define the set
\mathcal{T} = \{\eta : e^{\psi(\eta)} < \infty\}
\eta is the natural parameter, and \mathcal{T} the natural parameter space.
The family is called the canonical one parameter Exponential family.
[Brown] The family is called full if \Xi = \mathcal{T}, regular if \mathcal{T} is open.
[Brown] Let K be the convex support of the measure \nu.
The family is minimal if \dim \Xi = \dim K = k.
It is nonsingular if \mathrm{Var}_\eta(T(X)) > 0 for all \eta \in \mathcal{T}^{\circ}, the interior of \mathcal{T}.

Theorem 1. \psi(\eta) is a convex function on \mathcal{T}.
Theorem 2. \psi(\eta) is a cumulant generating function for any \eta \in \mathcal{T}.
Note: the 1st cumulant is the expectation, the 2nd and 3rd cumulants are central moments (the 2nd being the variance), and the 4th and higher order cumulants are neither moments nor central moments.
There are more properties...
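As a concrete example (the Poisson case, added here for illustration): for X \sim \mathrm{Poisson}(\lambda),
f(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\, e^{x \log\lambda - \lambda}, \quad x = 0, 1, 2, \dotsc
so h(x) = 1/x!, T(x) = x, the natural parameter is \eta = \log\lambda, \psi(\eta) = e^\eta, and the natural parameter space is \mathcal{T} = \mathbb{R}.  Consistent with Theorems 1 and 2, \psi is convex and its derivatives generate the cumulants of T(X): \psi'(\eta) = e^\eta = E_\eta[T(X)] and \psi''(\eta) = e^\eta = \mathrm{Var}_\eta(T(X)), i.e. the mean and variance of the Poisson are both \lambda.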

Multi-parameter Exponential family

Given the family of distributions \{P_\theta, \theta \in \Theta \subseteq \mathbb{R}^k\}, the pdf of which has the form
f(x|\theta) = h(x) e^{\sum_{i=1}^{k} \eta_i(\theta) T_i(x) - \psi(\theta)}, which is the k-parameter Exponential family.
If we reparametrize using \eta_i = \eta_i(\theta), we have the k-parameter canonical family.
The assumption here is that the dimension of \Theta and the dimension of the image of \Theta under the map \theta \mapsto (\eta_1(\theta), \dotsc, \eta_k(\theta)) are both equal to k.
The canonical form is
f(x|\eta) = h(x) e^{\sum_{i=1}^{k} \eta_i T_i(x) - \psi(\eta)}

Theorem 7.  Given a sample having a distribution P_\eta, \eta \in \mathcal{T}, in the canonical k-parameter Exponential family, with \mathcal{T} = \{\eta \in \mathbb{R}^k : e^{\psi(\eta)} < \infty\},
the partial derivatives of \psi(\eta) of any order exist for any \eta \in \mathcal{T}^{\circ}, the interior of \mathcal{T}.

Definition.  The family is full rank if at every \eta \in \mathcal{T} the covariance matrix I(\eta) = \left[ \frac{\partial^2}{\partial \eta_i \partial \eta_j} \psi(\eta) \right] \succeq 0 is nonsingular.
Definition/Theorem.  If the family is nonsingular, then the matrix I(\eta) is called the Fisher information matrix at \eta (for the natural parameter).
Proof.  For the canonical exponential family we have L(x;\eta) = \log p_\eta(x) \doteq \langle \eta, T(x) \rangle - \psi(\eta), so \frac{\partial}{\partial \eta} L(x;\eta) = T(x) - \frac{\partial}{\partial \eta} \psi(\eta), and \frac{\partial^2}{\partial \eta \partial \eta^T} L(x;\eta) = -\frac{\partial^2}{\partial \eta \partial \eta^T} \psi(\eta) does not depend on x for fixed \eta, so
I(\eta) = -E_\eta \left[ \frac{\partial^2}{\partial \eta \partial \eta^T} L(X;\eta) \right] = \frac{\partial^2}{\partial \eta \partial \eta^T} \psi(\eta)
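A quick numerical sanity check of this identity (a Bernoulli example with an arbitrary \eta, added here for illustration; for the canonical Bernoulli family \psi(\eta) = \log(1 + e^\eta) and T(x) = x):

import numpy as np

# For the canonical Bernoulli family, psi(eta) = log(1 + e^eta) and T(x) = x,
# so the Fisher information I(eta) = psi''(eta) should equal
# Var_eta(T(X)) = p(1 - p) with p = e^eta / (1 + e^eta).
eta = 0.7                                     # arbitrary natural parameter
p = np.exp(eta) / (1.0 + np.exp(eta))         # mean parameter

psi = lambda e: np.log1p(np.exp(e))
h = 1e-4
psi_dd = (psi(eta + h) - 2 * psi(eta) + psi(eta - h)) / h**2   # numerical psi''

print(psi_dd, p * (1 - p))                    # both ~ 0.2217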

Sufficiency and Completeness

Theorem 8.  Suppose a family of distributions \mathcal{F} = \{ P_\theta, \theta \in \Theta\} belongs to a k-parameter Exponential family and that the "true" parameter space \Theta has a nonempty interior; then the family \mathcal{F} is complete.

Theorem 9. (Basu's Theorem for the Exponential Family) In any k-parameter Exponential family \mathcal{F}, with a parameter space \Theta that has a nonempty interior, the natural sufficient statistic of the family T(X) and any ancillary statistic S(X) are independently distributed under each \theta \in \Theta.

MLE of exponential family

Recall L(x;\theta) = \log p_\theta(x) \doteq \langle \theta, T(x) \rangle - \psi(\theta).  The MLE satisfies
S(\theta) = \left. \frac{\partial}{\partial \theta} L(x;\theta) \right\vert_{\theta = \theta_{ML}}= 0 \; \Longleftrightarrow  \; T(x) = E_{\theta_{ML}} [ T(X) ] where  \frac{\partial}{\partial \theta} \psi(\theta) =  E_\theta [ T(X) ]  

The second derivative gives us 
\frac{\partial^2}{ \partial \theta \partial \theta^T} L(x;\theta) = - I(\theta) = - Cov_\theta [ T(X) ].   The right hand side is negative definite for a full rank family.  Therefore the log likelihood function is strictly concave in \theta.
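A short sketch of these two facts for the Poisson family in its natural parameterization (my own example; \eta = \log\lambda, \psi(\eta) = e^\eta): the log likelihood is strictly concave in \eta, Newton's method on the score converges quickly, and the solution satisfies \bar{x} = E_{\eta_{ML}}[T(X)] = e^{\eta_{ML}}.

import numpy as np

# MLE in the canonical Poisson family: L(eta) = sum_i [x_i*eta - e^eta] + const,
# dL/d(eta) = sum(x) - n*e^eta, d2L/d(eta)^2 = -n*e^eta < 0 (strictly concave).
# The MLE solves the moment-matching condition x_bar = e^eta.
rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=1000)

eta = 0.0                                       # starting point
for _ in range(20):                             # Newton's method on the score
    score = x.sum() - x.size * np.exp(eta)
    hess = -x.size * np.exp(eta)
    eta -= score / hess

print(np.exp(eta), x.mean())                    # both equal x_bar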

Existence of conjugate prior

For likelihood functions within the exponential family, a conjugate prior can be found within the exponential family.  The marginalization to p(x) = \int p(x|\theta) p(\theta) d\theta is also tractable.

From Casella-Berger.  

Note that the parameter space is the "natural" parameter space.
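One standard construction (sketched here from the canonical form above, not taken verbatim from Casella-Berger): for f(x|\eta) = h(x) e^{\langle \eta, T(x) \rangle - \psi(\eta)}, a conjugate prior is
p(\eta \mid \tau, n_0) \propto e^{\langle \eta, \tau \rangle - n_0 \psi(\eta)},
and after observing x the posterior has the same form with the hyperparameters updated as \tau \to \tau + T(x) and n_0 \to n_0 + 1 (for n i.i.d. observations, \tau \to \tau + \sum_{i=1}^n T(x_i) and n_0 \to n_0 + n).  The marginal p(x) then reduces to a ratio of the prior and posterior normalizing constants, which is why it stays tractable.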



Tuesday, September 06, 2016

Local convergence for exponential mixture family

From Redner, Walker 1984

Theorem 5.2.  Suppose that the Fisher information matrix I(\Phi) is positive definite at the true parameter \Phi^* and that \Phi^* = (\alpha_1^*, \dotsc, \alpha_m^*, \phi_1^*, \dotsc, \phi_m^*) is such that \alpha_i^* > 0 \text{ for } i = 1,\dotsc,m.  For \Phi^{(0)} \in \Omega, denote by \{\Phi^{(j)}\}_{j=0,1,2,\dotsc} the sequence in \Omega generated by the EM iteration.  Then with probability 1, whenever N is sufficiently large, the unique strongly consistent solution \Phi^N = (\alpha_1^N, \dotsc, \alpha_m^N, \phi_1^N, \dotsc, \phi_m^N) of the likelihood equations is well defined and there is a certain norm on \Omega in which  \{\Phi^{(j)}\}_{j=0,1,2,\dotsc} converges linearly to \Phi^N whenever \Phi^{(0)} is sufficiently near \Phi^N, i.e. there is a constant 0 \leq \lambda < 1, for which
\lVert \Phi^{(j+1)} - \Phi^N \rVert \leq \lambda \lVert \Phi^{(j)} - \Phi^N \rVert, \quad j = 0,1,2,\dotsc  whenever \Phi^{(0)} is sufficiently near \Phi^{N}.
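A small simulation in the spirit of this theorem (a 1-D two-component Gaussian mixture with known unit variances; the mixture, sample size and starting point are my own choices): the per-iteration error ratios of EM settle to a constant \lambda < 1, i.e. linear convergence toward the fixed point.

import numpy as np

# EM for a two-component 1-D Gaussian mixture with known unit variances,
# Phi = (alpha, mu1, mu2).  Track ||Phi^(j+1) - Phi^N|| / ||Phi^(j) - Phi^N||,
# which should approach a constant lambda < 1 (linear convergence).
rng = np.random.default_rng(3)

alpha, mu1, mu2 = 0.4, -1.0, 1.0                # true mixture parameters
n = 5000
z = rng.random(n) < alpha
x = np.where(z, rng.normal(mu1, 1.0, n), rng.normal(mu2, 1.0, n))

def em_step(a, m1, m2):
    # E-step: responsibilities of component 1
    p1 = a * np.exp(-0.5 * (x - m1) ** 2)
    p2 = (1.0 - a) * np.exp(-0.5 * (x - m2) ** 2)
    r = p1 / (p1 + p2)
    # M-step
    return r.mean(), np.sum(r * x) / np.sum(r), np.sum((1 - r) * x) / np.sum(1 - r)

start = (0.5, -0.5, 0.5)

# Run EM long enough to reach the fixed point Phi^N (the local MLE)
phi = start
for _ in range(2000):
    phi = em_step(*phi)
phi_fix = np.array(phi)

# Restart from the same point and record the error ratio per iteration
phi = start
prev_err = np.linalg.norm(np.array(phi) - phi_fix)
for j in range(10):
    phi = em_step(*phi)
    err = np.linalg.norm(np.array(phi) - phi_fix)
    print(j, err / prev_err)                    # ratios approach lambda < 1
    prev_err = err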

Differentiability of jump functions

Let
j_n(x) = \begin{cases} 0 & \text{if } x < x_n, \\ \theta_n & \text{if } x = x_n, \\ 1 & \text{if } x > x_n, \end{cases} for some 0 \leq \theta_n \leq 1; then the jump function is defined as
J(x) = \sum_{n=1}^\infty \alpha_n j_n(x), with \sum_{n=1}^\infty \alpha_n < \infty.
Theorem.  If J is the jump function above, then J'(x) exists and vanishes almost everywhere (non-zero only on a set of measure zero: E = \{x : J'(x)\neq 0,\; x\in \mathcal{B} \}, m(E) = 0).

Typically, a probability distribution F is defined as a nondecreasing, right continuous function with F(-\infty) = 0,\; F(\infty)=1.