memo: Gaussian Mixture and EM Algorithm

Introduction

On this page, I give a brief explanation of the Gaussian mixture and the EM algorithm. (Here, I show how to use the implementation of the EM algorithm that OpenCV provides.)

Gaussian Mixture

Suppose we have a probability density $p(\vec{x})$ of the observed data. We can represent this density as a superposition of K Gaussian distributions of the form
$p(\vec{x})=\sum\limits_{k=1}^{K}\;\pi_{k}\;\cal N(\vec{x}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})$ ,
where $\cal N(\vec{x}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})$ is a Gaussian density with mean vector $\vec{\mu}_{k}$ and covariance matrix $\bf{\Sigma}_{k}$ , and $\pi_{k}$ is the weight of the k-th Gaussian density. This superposition is called a Gaussian mixture or a mixture of Gaussians.
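As a minimal sketch of how this mixture density can be evaluated, the following code restricts itself to one-dimensional data, so that each covariance matrix reduces to a variance. This restriction is an assumption made only to avoid matrix code. The function names and the two-component parameters are hypothetical and chosen purely for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical two-component mixture; the weights sum to 1.
weights = [0.3, 0.7]
means = [-2.0, 1.0]
variances = [0.5, 1.5]

print(mixture_pdf(0.0, weights, means, variances))
```

Because the weights sum to 1 and each component is normalized, the mixture itself integrates to 1, which can be checked numerically with a simple Riemann sum.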

When we have a data set consisting of N observations $\{\vec{x}_{1},\;\vec{x}_{2},\;\cdots,\;\vec{x}_{N}\}$ that are drawn independently from $p(\vec{x})$ , we can write the probability of the data set in the form
$p(\vec{x}_{1},\;\vec{x}_{2},\;\cdots,\;\vec{x}_{N})=\prod\limits_{n=1}^{N}\;p(\vec{x}_{n})$ .
By substituting the Gaussian mixture representation into the equation, we obtain
$p(\vec{x}_{1},\;\vec{x}_{2},\;\cdots,\;\vec{x}_{N})=\prod\limits_{n=1}^{N}\;\sum\limits_{k=1}^{K}\;\pi_{k}\;\cal N(\vec{x}_{n}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})$ .
An elegant and powerful method for determining the unknown parameters on the right-hand side, $(\pi_{k},\vec{\mu}_{k},\bf{\Sigma}_{k}), \;\;k=1,\cdots,K$ , is the EM algorithm.

EM Algorithm

To simplify the notation, we introduce the following variables:
$\bf{X}=(\vec{x}_{1},\cdots,\vec{x}_{N})$
$\bf{\pi}=(\pi_{1},\cdots,\pi_{K})$
$\bf{\mu}=(\vec{\mu}_{1},\cdots,\vec{\mu}_{K})$
$\bf{\Sigma}=(\bf{\Sigma}_{1},\cdots,\bf{\Sigma}_{K})$
Then, $p(\vec{x}_{1},\;\vec{x}_{2},\;\cdots,\;\vec{x}_{N})$ is rewritten as $p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})$ . This expresses how probable the observed data $\bf{X}$ is, given the parameters $(\bf{\pi},\bf{\mu},\bf{\Sigma})$ . This quantity is called the likelihood function. The method of maximum likelihood selects the parameter values that maximize the likelihood function, so that our Gaussian mixture model approximates the true distribution.
All we have to do is maximize the following quantity:
$p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})=\prod\limits_{n=1}^{N}\;\sum\limits_{k=1}^{K}\;\pi_{k}\;\cal N(\vec{x}_{n}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})$ .
For simplicity, we maximize the log of the likelihood function given by
$\ln{p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})}=\sum\limits_{n=1}^{N}\;\ln{\left\{\sum\limits_{k=1}^{K}\;\pi_{k}\;\cal N(\vec{x}_{n}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})\right\}}$ .
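The log-likelihood above can be sketched in code as follows. As an assumption made for brevity, the data are one-dimensional, so each covariance matrix becomes a variance, and all weights are taken to be strictly positive. The inner sum is evaluated with the log-sum-exp trick, a numerical choice of this sketch rather than something the formula requires, so that points far from every component do not underflow to $\ln 0$ . The function names are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """ln N(x | mean, var) for a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_likelihood(data, weights, means, variances):
    """sum_n ln( sum_k pi_k N(x_n | mu_k, sigma_k^2) ), with pi_k > 0."""
    total = 0.0
    for x in data:
        # ln(pi_k) + ln N(x | mu_k, sigma_k^2) for every component k
        terms = [math.log(w) + log_gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # log-sum-exp: factor out the largest term before exponentiating
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# Hypothetical data and parameters for illustration only.
data = [-2.1, -1.9, 0.8, 1.3]
print(log_likelihood(data, [0.3, 0.7], [-2.0, 1.0], [0.5, 1.5]))
```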
Before going any further, we introduce a constraint on $\pi_{k}$ .
The probability $p(\vec{x})$ is given by
$p(\vec{x})=\sum\limits_{k=1}^{K}\;\pi_{k}\;\cal N(\vec{x}|\vec{\mu}_{k},\;\bf{\Sigma}_{k})$ .
If we integrate both sides of the equation with respect to $\vec{x}$ , the left-hand side becomes 1 and, because each Gaussian component is normalized, the right-hand side becomes the sum of the weights. We therefore obtain
$\sum\limits_{k=1}^{K}\;\pi_{k}=1$ .
Using a Lagrange multiplier $\lambda$ , the quantity to maximize under this constraint is written as
$J=\ln{p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})}+\lambda\;\left(\sum\limits_{k=1}^{K}\;\pi_{k}-1\right)$ .
What remains is to set the derivatives of $J$ with respect to $\vec{\mu}_{k}$ , $\pi_{k}$ , and $\bf{\Sigma}_{k}$ to zero. We show only the results.

The derivative with respect to $\vec{\mu}_{k}$ yields the following equation:
$\vec{\mu}_{k}=\frac{1}{N_{k}}\;\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\vec{x}_{n}$
where
$\gamma_{n,k}\equiv\frac{\pi_{k}\;\cal N(\vec{x}_{n}|\vec{\mu}_{k},\bf{\Sigma}_{k})}{\sum\limits_{j=1}^{K}\pi_{j}\;\cal N(\vec{x}_{n}|\vec{\mu}_{j},\bf{\Sigma}_{j})}$
$N_{k}\equiv\sum\limits_{n=1}^{N}\;\gamma_{n,k}$ .
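The quantities $\gamma_{n,k}$ and $N_{k}$ defined above can be computed as in the following sketch. It assumes one-dimensional data, so each covariance matrix is a single variance, and it assumes that every data point has a non-zero density under at least one component. The function names and the example parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """Return gamma[n][k] and the effective counts N_k = sum_n gamma[n][k]."""
    gamma = []
    for x in data:
        # Numerators pi_k * N(x_n | mu_k, sigma_k^2) for every component k
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # the denominator, equal to p(x_n)
        gamma.append([j / total for j in joint])
    n_k = [sum(row[k] for row in gamma) for k in range(len(weights))]
    return gamma, n_k


# Hypothetical data and parameters for illustration only.
gamma, n_k = responsibilities([-2.0, 0.5, 3.0], [0.3, 0.7], [-2.0, 3.0], [1.0, 2.0])
print(gamma)
print(n_k)
```

Each row of `gamma` sums to 1, so the values in `n_k` sum to the number of data points.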
The derivative with respect to $\bf{\Sigma}_{k}$ yields the following equation:
$\bf{\Sigma}_{k}=\frac{1}{N_{k}}\;\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{\;}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{T}$
where $T$ denotes transposition.
The derivative with respect to $\pi_{k}$ yields the following equation:
$\pi_{k}=\frac{N_{k}}{N}$ .
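
As a sketch of the steps behind two of these results (the intermediate expressions are not part of the original memo; they follow the standard derivation and assume that $\bf{\Sigma}_{k}$ is invertible), the derivative of $J$ with respect to $\vec{\mu}_{k}$ is
$\frac{\partial J}{\partial\vec{\mu}_{k}}=\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\bf{\Sigma}_{k}^{-1}\left(\vec{x}_{n}-\vec{\mu}_{k}\right)=0$ .
Multiplying by $\bf{\Sigma}_{k}$ from the left gives $\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)=0$ , and solving for $\vec{\mu}_{k}$ yields the weighted mean above.
The derivative with respect to $\pi_{k}$ is
$\frac{\partial J}{\partial\pi_{k}}=\sum\limits_{n=1}^{N}\;\frac{\cal N(\vec{x}_{n}|\vec{\mu}_{k},\bf{\Sigma}_{k})}{\sum\limits_{j=1}^{K}\pi_{j}\;\cal N(\vec{x}_{n}|\vec{\mu}_{j},\bf{\Sigma}_{j})}+\lambda=0$ .
Multiplying by $\pi_{k}$ gives $N_{k}+\lambda\;\pi_{k}=0$ . Summing over $k$ and using $\sum\limits_{k=1}^{K}\;\pi_{k}=1$ together with $\sum\limits_{k=1}^{K}\;N_{k}=N$ gives $\lambda=-N$ , and therefore $\pi_{k}=\frac{N_{k}}{N}$ .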

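We now have three equations that determine the parameters. They are coupled: $\gamma_{n,k}$ depends on the parameters, and the parameters depend on $\gamma_{n,k}$ , so the equations do not give a closed-form solution. To satisfy all of them, we can use a simple iterative scheme called the EM algorithm. The procedure is as follows: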

1. Set initial values for $\pi_{k},\vec{\mu}_{k},\bf{\Sigma}_{k}$ .
2. Calculate $\gamma_{n,k}$ using the current parameter values.
3. Using $\gamma_{n,k}$ , calculate the three parameters (the covariance uses the newly calculated mean)
$\vec{\mu}_{k}=\frac{1}{N_{k}}\;\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\vec{x}_{n}$
$\bf{\Sigma}_{k}=\frac{1}{N_{k}}\;\sum\limits_{n=1}^{N}\;\gamma_{n,k}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{\;}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{T}$
$\pi_{k}=\frac{N_{k}}{N}$ .
4. Calculate $\ln{p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})}$ .
5. Repeat steps 2 to 4 until $\pi_{k},\vec{\mu}_{k},\bf{\Sigma}_{k}$ or $\ln{p(\bf{X}|\bf{\pi},\bf{\mu},\bf{\Sigma})}$ converges.
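
The steps above can be sketched as the following loop. This is a minimal illustration and not a production implementation: it assumes one-dimensional data, so each covariance matrix is a single variance, it takes the initial values as arguments, and it stops when the log-likelihood improves by less than a tolerance. The function names, the tolerance, and the toy data are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(data, weights, means, variances):
    """Step 2: responsibilities gamma[n][k]; also returns the log-likelihood (step 4)."""
    gamma, log_lik = [], 0.0
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # equals p(x_n) under the current parameters
        gamma.append([j / total for j in joint])
        log_lik += math.log(total)
    return gamma, log_lik


def m_step(data, gamma, num_components):
    """Step 3: re-estimate weights, means and variances from gamma."""
    n = len(data)
    weights, means, variances = [], [], []
    for k in range(num_components):
        n_k = sum(row[k] for row in gamma)
        mean = sum(row[k] * x for row, x in zip(gamma, data)) / n_k
        # The variance uses the newly computed mean.
        var = sum(row[k] * (x - mean) ** 2 for row, x in zip(gamma, data)) / n_k
        weights.append(n_k / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit_gmm_1d(data, weights, means, variances, tol=1e-8, max_iter=200):
    """Repeat steps 2 to 4 until the log-likelihood stops improving."""
    gamma, log_lik = e_step(data, weights, means, variances)
    for _ in range(max_iter):
        weights, means, variances = m_step(data, gamma, len(weights))
        gamma, new_log_lik = e_step(data, weights, means, variances)
        converged = new_log_lik - log_lik < tol
        log_lik = new_log_lik
        if converged:
            break
    return weights, means, variances, log_lik


# Hypothetical toy data: two tight groups around -2 and 3.
data = [-2.2, -2.1, -2.0, -1.9, -1.8, 2.8, 2.9, 3.0, 3.1, 3.2]
weights, means, variances, log_lik = fit_gmm_1d(
    data, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
print(weights, means, variances, log_lik)
```

In this sketch a single pass over the data yields both $\gamma_{n,k}$ and the log-likelihood, because the denominator of $\gamma_{n,k}$ is exactly $p(\vec{x}_{n})$ . It has no safeguard against a variance collapsing toward zero, which a practical implementation would need.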

Steps 2 and 3 are called the expectation (E) step and the maximization (M) step, respectively. In step 1, K-means clustering is often used to initialize the parameters. We take the center of the k-th cluster as the mean $\vec{\mu}_{k}$ . The covariance matrix can be calculated from the data points assigned to that cluster as
$\bf{\Sigma}_{k}=\frac{1}{N_{k}}\;\sum\limits_{n\in C_{k}}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{\;}\;\left(\vec{x}_{n}-\vec{\mu}_{k}\right)^{T}$ .
The weight is given by
$\pi_{k}=\frac{N_{k}}{N}$
where $C_{k}$ is the set of indices of the data points assigned to the k-th cluster and $N_{k}$ is the number of those points.
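
This initialization can be sketched as follows, assuming that the cluster labels have already been produced by K-means, that the data are one-dimensional (so each covariance matrix is a single variance), and that no cluster is empty. The function name and the example labels are hypothetical.

```python
def init_from_clusters(data, labels, num_clusters):
    """Initial (weights, means, variances) from hard cluster assignments."""
    n = len(data)
    weights, means, variances = [], [], []
    for k in range(num_clusters):
        # Data points assigned to the k-th cluster (assumed non-empty)
        members = [x for x, label in zip(data, labels) if label == k]
        n_k = len(members)
        mean = sum(members) / n_k  # cluster center
        var = sum((x - mean) ** 2 for x in members) / n_k
        weights.append(n_k / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


# Hypothetical data with labels as K-means might assign them.
data = [0.0, 2.0, 10.0, 12.0, 14.0]
labels = [0, 0, 1, 1, 1]
weights, means, variances = init_from_clusters(data, labels, 2)
print(weights, means, variances)
```

A cluster with a single member gives a variance of zero, so a practical implementation would need to guard against that case before starting the EM iterations.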

References

Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer

Saturday, March 16, 2013
