memo: July 2014

Monday, July 28, 2014

Saving gnuplot figures

A memo.

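Save the following as decompose_curve.plt and run it from the command line; a PNG image like this is then saved in the specified directory.

Likewise, save the following as decompose_surface.plt and run it; an image like this is saved.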

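To draw a surface as a grid, create a file in a format like this. That is,

Split the data into blocks using blank lines.
Within each block, either the X coordinate or the Y coordinate is the same.

Given the files bspline_surface, bezier_surface_0_0, and bezier_surface_0_1 in this format, running the following plt file gives this result.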
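As a minimal sketch of this block format, the following Python writes a small surface file. The function name `write_grid`, the sample surface z = x * y, and the file name are assumptions for illustration; they are not the files used in this memo.

```python
import os
import tempfile


def write_grid(path, xs, ys, f):
    """Write points of z = f(x, y) as blank-line-separated blocks.

    Each block holds the points that share one X coordinate.
    """
    with open(path, "w") as out:
        for i, x in enumerate(xs):
            if i > 0:
                out.write("\n")  # a blank line starts a new block
            for y in ys:
                out.write(f"{x} {y} {f(x, y)}\n")


# Hypothetical sample: a 3 x 2 grid of the surface z = x * y.
grid_path = os.path.join(tempfile.gettempdir(), "sample_surface.dat")
write_grid(grid_path, [0, 1, 2], [0, 1], lambda x, y: x * y)

# Read the file back and split it into its blocks.
with open(grid_path) as src:
    blocks = src.read().strip().split("\n\n")
```

A file in this form can be drawn as a mesh with gnuplot's `splot` command using `with lines`.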

Sunday, July 27, 2014

1. Introduction

　In recent years, a considerable number of studies have been made on deep learning. The study by A. Coates et al. [1] focused on why deep learning achieves high performance. I have implemented their algorithm and applied it to a dataset I have at hand. On this page, I explain their algorithm in detail and show my results.

2. Algorithm

　To carry out Coates's method, three datasets are needed:

$S_{\rm unsupervised}$ : a dataset to construct the function which maps an image patch to a feature vector
$S_{\rm training}$ : a dataset to train the Support Vector Machine (SVM)
$S_{\rm test}$ : a dataset to test the SVM

In my study, $S_{\rm unsupervised}$ and $S_{\rm training}$ are the same. First, I apply the following procedures to the dataset $S_{\rm unsupervised}$.

〜 Random Extraction of Patches 〜

Randomly extract $n$ patches of $w \times w$ pixels from an image.
Arrange the pixels of a patch in a row, from the top-left pixel to the bottom-right one, to make an $N$-dimensional vector $\vec{x}$, where $N=d\times w\times w$ and $d$ is the number of channels: $d=3$ for a color patch and $d=1$ for a gray one. Since color images are converted to gray in my study, $d$ is always 1.
If there are $m$ images, $M(=n\times m)$ $N$-dimensional vectors are obtained: \begin{equation} \vec{x}^{\;\mu}, \mu=1,\cdots,M \end{equation}
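The extraction step above can be sketched as follows. This is a minimal illustration, not my C++ implementation: the function name `extract_random_patches` and the toy image are assumptions, and the image is a plain list of rows with $d=1$.

```python
import random


def extract_random_patches(image, n, w, rng):
    """Return n flattened w x w patches taken at random positions.

    `image` is a gray image given as a list of rows (d = 1).
    """
    height, width = len(image), len(image[0])
    patches = []
    for _ in range(n):
        top = rng.randrange(height - w + 1)
        left = rng.randrange(width - w + 1)
        # Read pixels row by row, from top-left to bottom-right.
        patch = [image[top + r][left + c] for r in range(w) for c in range(w)]
        patches.append(patch)
    return patches


# Hypothetical 10 x 12 gray image whose pixel value encodes its position.
image = [[row * 12 + col for col in range(12)] for row in range(10)]
patches = extract_random_patches(image, n=50, w=6, rng=random.Random(0))
```

Each element of `patches` is one vector $\vec{x}^{\;\mu}$ with $N=36$ components.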

〜 Normalization and Whitening 〜

For each vector $\vec{x}^{\;\mu}$, the mean and the variance of its elements are calculated as \begin{eqnarray} \bar{x}^{\mu}&=&\frac{1}{N}\sum_{i=1}^{N}x_{i}^{\mu} \nonumber \\ \sigma_{\mu}^2 &=& \frac{1}{N}\sum_{i=1}^{N}\left(x_{i}^{\mu} - \bar{x}^{\mu}\right)^{2}. \nonumber \end{eqnarray} Using them, $\vec{x}^{\;\mu}$ is redefined by \begin{equation} x_{i}^{\mu} = \frac{x_{i}^{\mu} - \bar{x}^{\mu} }{ \sqrt{ \sigma_{\mu}^{2}+\epsilon_{\rm n} } }, \end{equation} where $\epsilon_{\rm n}$ is a small value that prevents division by zero. This procedure corresponds to brightness and contrast normalization of the local area.
After the $M$ vectors are normalized, they are whitened. A detailed explanation of the whitening is given here.
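The normalization formula can be written directly in a few lines. This is a minimal sketch; the function name `normalize` and the two sample patches are assumptions for illustration.

```python
import math


def normalize(x, eps_n):
    """Subtract the mean of x and divide by sqrt(variance + eps_n)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    scale = math.sqrt(var + eps_n)
    return [(v - mean) / scale for v in x]


# A flat patch has zero variance; eps_n keeps the division finite.
flat = normalize([128.0] * 36, eps_n=10.0)
# A patch with contrast: mean 100, variance 400.
contrast = normalize([80.0, 120.0] * 18, eps_n=10.0)
```

The flat patch shows why $\epsilon_{\rm n}$ is needed: without it, the denominator would be zero.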

〜 Clustering 〜

　$K$-means clustering is applied to the $M$ $N$-dimensional vectors, which yields $K$ centroids. I have explained the clustering in detail on this page.
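For readers who want the idea in code, here is a minimal sketch of the plain Lloyd iteration for $K$-means. It is not the OpenCV routine I used; the function names, the fixed iteration count, and the toy data are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(vectors, k, iterations, rng):
    """Plain Lloyd iteration: assign to nearest centroid, then re-average."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy data: two tight groups of 2-dimensional vectors.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids = sorted(kmeans(data, k=2, iterations=10, rng=random.Random(1)))
```

On this toy data the two centroids settle on the means of the two groups.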

　The image below shows 961 centroids obtained with $K=1000$ and $w=6$. To draw it, the elements of each centroid vector are scaled to the range [0,255], and the vector is then converted to a 6$\times$6 image.
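The scaling used for drawing can be sketched as below, mapping the smallest element to 0 and the largest to 255. The function name `to_image` and the sample centroid are assumptions for illustration.

```python
def to_image(vector, w):
    """Scale a vector linearly to [0, 255] and reshape it to w x w rows."""
    lo, hi = min(vector), max(vector)
    span = (hi - lo) or 1.0  # a constant vector maps to all zeros
    pixels = [round(255 * (v - lo) / span) for v in vector]
    return [pixels[r * w:(r + 1) * w] for r in range(w)]


# Hypothetical centroid with 36 increasing components.
centroid = [-1.75 + 0.1 * i for i in range(36)]
tile = to_image(centroid, w=6)
```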

〜 Mapping to Feature Vector 〜

　At this point, the following quantities are obtained:

The whitening matrix, $M_{\rm W}$
$K$ centroids, $\vec{c}_{\;k}, k=1,\cdots,K$.

Using them, the function which maps a patch to a feature vector is constructed. The procedures are as follows:

Arrange the pixels of a patch in a row, from the top-left pixel to the bottom-right one, to make an $N$-dimensional vector $\vec{x}$.
Normalize the vector $\vec{x}$.
Whiten the normalized vector as $\vec{x}_{\rm W} = M_{\rm W}\;\vec{x}$
Define the function $f_{k}(\vec{x}_{\rm W})$ as \begin{equation} f_{k}(\vec{x}_{\rm W})=\max{\left(0, \mu(z)-z_k\right)}, \nonumber \end{equation} where \begin{eqnarray} z_k&=&\|\vec{x}_{\rm W}-\vec{c}_{k} \| \nonumber \\ \mu(z)&=&\frac{1}{K}\sum_{k=1}^{K}z_k. \nonumber \end{eqnarray} $\mu(z)$ is the average distance between the vector $\vec{x}_{\rm W}$ and the $K$ centroids. The function $f_{k}$ outputs 0 if the distance $z_{k}$ between $\vec{x}_{\rm W}$ and the $k$-th centroid $\vec{c}_{k}$ is greater than or equal to the average $\mu(z)$, and outputs the positive value $\mu(z) - z_{k}$ if $z_{k}$ is less than $\mu(z)$. In terms of the centroid images shown above, $f_{k}$ represents the degree to which the $k$-th centroid contributes to a patch. Taken over $k=1,\cdots,K$, the functions $f_{k}$ map an $N$-dimensional vector to a $K$-dimensional vector. This is the feature vector.
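The mapping $f_{k}$ can be sketched as follows for an already whitened vector. The function name `feature_vector` and the toy centroids are assumptions for illustration.

```python
import math


def feature_vector(x_w, centroids):
    """Return [f_1, ..., f_K] with f_k = max(0, mu(z) - z_k)."""
    # z_k = ||x_W - c_k||, the Euclidean distance to each centroid
    z = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x_w, c))) for c in centroids]
    mu = sum(z) / len(z)  # mean distance mu(z)
    return [max(0.0, mu - z_k) for z_k in z]


# Toy example with K = 3 centroids in 2 dimensions.
centroids = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
features = feature_vector([0.0, 0.0], centroids)
```

Here the distances are 0, 5, and 10, so the mean is 5 and only the first centroid, the one closer than average, gets a nonzero value.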

Next, the following procedures are applied to the dataset $S_{\rm training}$.

〜 Extraction of Feature Vectors From Training Images 〜

Suppose that one image is given. The patches of $w\times w$ pixels are sequentially extracted from the image with the step size $s$ (stride).
Convert a patch to a $K$-dimensional vector.
The entire region covered by the extracted patches is split into four equal-sized quadrants, and the average of the feature vectors in each quadrant is computed. This procedure corresponds to pooling in deep learning. It not only decreases the dependence on location but also reduces the dimensionality of the feature vector.
The four $K$-dimensional vectors are concatenated into a $4K$-dimensional vector. This is the feature vector made from one image. It is used for the classification by the SVM.
If there are $D$ images, $D$ $4K$-dimensional feature vectors are obtained.
Train the SVM by using them.
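The dense scan and the quadrant pooling can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the per-patch features are arranged on a grid indexed by patch row and column, and for an odd number of rows or columns the quadrants are only approximately equal.

```python
def patch_origins(size, w, s):
    """Top-left coordinates of w-wide patches taken with stride s."""
    return list(range(0, size - w + 1, s))


def pool_quadrants(grid):
    """Average the K-dimensional vectors in each quadrant and concatenate.

    `grid[r][c]` is the feature vector of the patch in row r, column c.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r0, r1 in ((0, rows // 2), (rows // 2, rows)):
        for c0, c1 in ((0, cols // 2), (cols // 2, cols)):
            cell = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.extend(sum(col) / len(cell) for col in zip(*cell))
    return pooled  # length 4K


# Toy case: a 9 x 9 image with w = 6 and s = 1 gives a 4 x 4 grid of patches.
origins = patch_origins(9, w=6, s=1)
side = len(origins)
# Hypothetical K = 2 features: (row index, column index) of each patch.
grid = [[[float(r), float(c)] for c in range(side)] for r in range(side)]
image_feature = pool_quadrants(grid)
```

The result has $4K=8$ components, ordered top-left, top-right, bottom-left, bottom-right.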

　In Coates's method, both the patch extraction and the pooling are executed only once. Moreover, the mapping function $f_{k}$ is computed using an unsupervised learning algorithm ($K$-means clustering). This is why their paper is titled "An Analysis of Single-Layer Networks in Unsupervised Feature Learning."

Finally, the following procedures are applied to the dataset $S_{\rm test}$.

〜 Classification of Test Dataset 〜

Convert a given image to a $4K$-dimensional feature vector.
Classify it by the SVM.

3. Implementation

I have implemented the algorithm in C++ and used the following libraries:

opencv2: for image processing and $K$-means clustering.
boost::numeric::ublas: for the whitening computation.
liblinear: for training the SVM and classifying images.

I have also used Python as glue to bind the procedures together.

4. Image Dataset

　I have used the dataset called LSP15. It can be downloaded by clicking the title "scene category dataset" on this site. It contains indoor and outdoor images classified into 15 categories. Each category has about 200 to 300 images, and the image size is approximately 300$\times$300 pixels.
　I have selected two categories and constructed four datasets $S_{\rm training}^{A/B}$ and $S_{\rm test}^{A/B}$.

category	training	test	label
A	1-150 ($S_{\rm training}^{A}$)	151-200 ($S_{\rm test}^{A}$)	-1
B	1-150 ($S_{\rm training}^{B}$)	151-200 ($S_{\rm test}^{B}$)	+1

Each image file name contains an integer, as in "image_0001.jpg". The ranges in the training and test columns of the table above refer to these integers. Each category provides 150 training images and 50 test images. The dataset used for making the feature mapping function $f_{k}$ is $S_{\rm unsupervised}=S_{\rm training}^{A} + S_{\rm training}^{B}$, which contains 300 images. Since the dataset includes 3-channel (color) images, all images are converted to gray.
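The split in the table can be sketched in a few lines. The helper name `image_names` is an assumption for illustration; only the file name pattern and the index ranges come from the text above.

```python
def image_names(first, last):
    """File names image_XXXX.jpg for the inclusive index range."""
    return [f"image_{i:04d}.jpg" for i in range(first, last + 1)]


training = image_names(1, 150)   # S_training (also used for S_unsupervised)
test = image_names(151, 200)     # S_test
labels = {"A": -1, "B": +1}      # SVM labels of the two categories
```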

5. Parameters

　I have chosen the following parameters:

$n$	$w$	$s$	$K$	$\epsilon_{\rm n}$	$\epsilon_{\rm w}$
50	6	1	1000	10	0.1

$\epsilon_{\rm w}$ is a small value that prevents division by zero in the whitening matrix $M_{\rm W}$ defined by \begin{equation} M_{\rm W}=R\;\left(\Lambda+\epsilon_{\rm w}I\right)^{-1/2}\;R^{T}, \label{eq7} \end{equation} where $I$ is the identity matrix. I have described the whitening in detail here.
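As a short check of why this matrix whitens the data, suppose (an assumption on my part, since it is not stated above) that $R$ and $\Lambda$ are the eigenvector and eigenvalue matrices of the covariance matrix $C$ of the normalized vectors, so that $C=R\Lambda R^{T}$ and $R^{T}R=I$. Since $M_{\rm W}$ is symmetric, the covariance of the whitened vectors is \begin{eqnarray} M_{\rm W}\,C\,M_{\rm W}^{T} &=& R\left(\Lambda+\epsilon_{\rm w}I\right)^{-1/2}R^{T}\;R\Lambda R^{T}\;R\left(\Lambda+\epsilon_{\rm w}I\right)^{-1/2}R^{T} \nonumber \\ &=& R\;\Lambda\left(\Lambda+\epsilon_{\rm w}I\right)^{-1}R^{T}. \nonumber \end{eqnarray} Its eigenvalues are $\lambda_{i}/(\lambda_{i}+\epsilon_{\rm w})$, where $\lambda_{i}$ are the diagonal elements of $\Lambda$. They are close to 1 when $\lambda_{i}\gg\epsilon_{\rm w}$, and they go to 0 instead of diverging when $\lambda_{i}\to 0$.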

6. Accuracy

　The results are shown below:

category A	category B	accuracy (linear SVM)	accuracy (Hellinger kernel)
bedroom	CALsuburb	99%(99/100)	97%(97/100)
livingroom	industrial	94%(94/100)	97%(97/100)
MITforest	MITmountain	81%(81/100)	86%(86/100)
industrial	CALsuburb	88%(88/100)	97%(97/100)

　The Hellinger kernel can be realized by taking the square root of each element of the feature vector and then applying the linear SVM (homogeneous kernel map).
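The following sketch shows why the square root is enough: the plain dot product of the mapped vectors equals the Hellinger kernel $\sum_{i}\sqrt{u_{i}v_{i}}$ of the original ones. The function names and the two sample vectors are assumptions for illustration, and the components are assumed non-negative, which holds for the features above because $f_{k}\ge 0$. Any normalization of the vectors is left out.

```python
import math


def hellinger_map(feature):
    """Element-wise square root; assumes non-negative components."""
    return [math.sqrt(v) for v in feature]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


u, v = [4.0, 0.0, 9.0], [1.0, 16.0, 4.0]
# Linear kernel on the mapped vectors ...
linear_on_mapped = dot(hellinger_map(u), hellinger_map(v))
# ... compared with the Hellinger kernel on the original vectors.
hellinger_kernel = sum(math.sqrt(p * q) for p, q in zip(u, v))
```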

7. Discussion

The accuracies are high even though a linear SVM is used.
The Hellinger kernel is effective except for the pair of bedroom and CALsuburb.
The accuracy for the pair of MITforest and MITmountain is low compared with the other pairs. The sample images shown below suggest why: the images of the two categories look similar.
The high performance is achieved without the commonly used local feature algorithms, e.g. SIFT, SURF, BRISK, and so on. Will research on them fade away?

　Two test images from each category are shown below:

〜 bedroom 〜

〜 livingroom 〜

〜 MITforest 〜

〜 industrial 〜

〜 MITmountain 〜

〜 CALsuburb 〜

8. References

[1] A. Coates, H. Lee, and A. Y. Ng, "An Analysis of Single-Layer Networks in Unsupervised Feature Learning," 2010.

Wednesday, July 23, 2014

Coates's Method

in English

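1. Introduction

　In recent years, deep learning has been studied actively. Some of that work tries to explain why it achieves such high accuracy. One such study is "An Analysis of Single-Layer Networks in Unsupervised Feature Learning" by A. Coates et al. I implemented it and applied it to images I have at hand, so I summarize the results here.

2. Algorithm

　Verifying this algorithm requires three image sets.

An image set for building the function that generates feature vectors ($S_{\rm unsupervised}$)
A training image set ($S_{\rm training}$)
A test image set ($S_{\rm test}$)

Here, the first two sets are the same image set. The details of the algorithm are given below in order.

First, the following is done using $S_{\rm unsupervised}$.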

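〜 Extracting Patches 〜

Randomly cut out $n$ patches of size $w \times w$ from one image.
Arrange the pixels of one patch in a row, from the top left to the bottom right, to make an $N$-dimensional vector $\vec{x}$. Here $N=d\times w \times w$, and $d$ is the number of channels of the image: 3 for a color image and 1 for a gray image.
Repeat the same for all images. If there are $m$ images, $M(=n\times m)$ $N$-dimensional vectors are made ( $\vec{x}^{\;\mu}, \mu=1,\cdots,M$ ).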

〜 Normalization and Whitening 〜

There are now $M$ $N$-dimensional vectors. For each vector, the mean and the variance of its elements are computed, and the vector is normalized. \begin{eqnarray} \bar{x}^{\mu}&=&\frac{1}{N}\sum_{i=1}^{N}x_{i}^{\mu} \nonumber \\ \sigma_{\mu}^2 &=& \frac{1}{N}\sum_{i=1}^{N}\left(x_{i}^{\mu} - \bar{x}^{\mu}\right)^{2} \nonumber \end{eqnarray} Using these, $\vec{x}^{\;\mu}$ is redefined. \begin{equation} x_{i}^{\mu} = \frac{x_{i}^{\mu} - \bar{x}^{\mu} }{ \sqrt{ \sigma_{\mu}^{2}+\epsilon_{\rm n} } } \end{equation} Here $\epsilon_{\rm n}$ is a small value that prevents division by zero. This operation corresponds to normalizing the brightness and contrast within a local area (patch).
The $M$ normalized vectors are then whitened. The whitening procedure is summarized here.

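〜 Clustering 〜

　$K$-means clustering is applied to the $M$ $N$-dimensional vectors. The clustering procedure is summarized here. $K$ centroids are obtained.

　The image below visualizes 961 centroid vectors ($N=36$-dimensional vectors) obtained with $K=1000$ (the minimum element was mapped to 0 and the maximum to 255, and each vector was converted to a 6$\times$6 image).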

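〜 Building the Feature Vector 〜

　The steps so far have produced the following.

The whitening matrix $M_{\rm W}$
The coordinates of the $K$ centroids $\vec{c}_{\;k}, k=1,\cdots,K$

Using them, a function is built that generates one feature vector from one patch taken from an image. The procedure is as follows.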

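Arrange the pixels of the patch in a row, from the top left to the bottom right, to make an $N$-dimensional vector $\vec{x}$.
Normalize $\vec{x}$.
Whiten the normalized $\vec{x}$ by multiplying it by the whitening matrix $M_{\rm W}$: $\vec{x}_{\rm W} = M_{\rm W}\;\vec{x}$
Define the function $f_{k}(\vec{x}_{\rm W})$ by \begin{equation} f_{k}(\vec{x}_{\rm W})=\max{\left(0, \mu(z)-z_k\right)} \nonumber \end{equation} where \begin{eqnarray} z_k&=&\|\vec{x}_{\rm W}-\vec{c}_{\;k} \| \nonumber \\ \mu(z)&=&\frac{1}{K}\sum_{k=1}^{K}z_k \nonumber \end{eqnarray} That is, for a given point $\vec{x}_{\rm W}$, the average distance between this point and the $K$ centroids is computed. A positive value is then assigned to the index of each centroid closer than the average distance, and 0 to the index of each centroid farther away. In terms of the centroid images above, the function expresses how much each centroid image contributes to a given patch. The function $f_{k}$ converts an $N$-dimensional vector into a $K$-dimensional vector. This is the feature vector.

Next, the following is done for $S_{\rm training}$.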

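〜 Extracting Feature Vectors From Training Images 〜

Consider one training image. Patches of size $w\times w$ are cut out, with a distance (stride) $s$ to the next patch. The patches are taken in regular order from the top left to the bottom right.
Compute a $K$-dimensional feature vector from each patch.
Divide the entire region formed by the extracted patches into four equal parts, and compute the mean of the feature vectors belonging to each rectangle. This operation corresponds to pooling in deep learning. It also reduces the dependence on the position of the object to be classified.
The $4K$-dimensional vector obtained by concatenating the four $K$-dimensional vectors is the feature vector for one image.
Repeat the same for all training images. If there are $D$ training images, $D$ $4K$-dimensional vectors are made.
Train the SVM using them.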

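　As is clear from the figure above, in this method the patch scan and the pooling are each performed only once. In addition, the function $f_k$ that converts a patch into a feature vector is obtained by unsupervised learning ($K$-means clustering). Hence the name: Single-Layer Networks in Unsupervised Feature Learning.

Finally, the following is done for $S_{\rm test}$.

〜 Evaluating Test Images 〜

Convert one image into a $4K$-dimensional feature vector by exactly the same procedure as above.
Classify it with the SVM.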

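3. Implementation

The implementation language is C++, and the following libraries were used.

opencv2: image handling, $K$-means clustering, and so on.
boost::numeric::ublas: the whitening computation and so on.
liblinear: training and testing with a linear SVM.

Python was also used to connect the individual steps of the algorithm.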

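4. Image Dataset

　The image set used is the one called LSP15. It can be downloaded by clicking the item "scene category dataset" here. It consists of indoor and outdoor images classified into 15 categories. Each category contains roughly 200 to 300 images of about 200 to 300 pixels on a side.
　For this test, two of the 15 categories were chosen arbitrarily, and the image sets $S_{\rm training}^{A/B}$ and $S_{\rm test}^{A/B}$ were built.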

category	training	test	label
A	1-150 ($S_{\rm training}^{A}$)	151-200 ($S_{\rm test}^{A}$)	-1
B	1-150 ($S_{\rm training}^{B}$)	151-200 ($S_{\rm test}^{B}$)	+1

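The downloaded images have names that contain a number, such as image_0001.jpg. The numbers in the table above refer to these. From each category, 150 images were selected for training and 50 for testing. The image set $S_{\rm unsupervised}$ for building the function $f_{k}$ was $S_{\rm training}^{A} + S_{\rm training}^{B}$, 300 images in total. The image set also contains some color images, but all images were converted to gray for the test.

5. Parameters

　The values of the parameters are as follows.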

$n$	$w$	$s$	$K$	$\epsilon_{\rm n}$	$\epsilon_{\rm w}$
50	6	1	1000	10	0.1

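Here $\epsilon_{\rm w}$ is a small value that prevents division by zero in the whitening matrix $M_{\rm W}$. \begin{equation} M_{\rm W}=R\;\left(\Lambda+\epsilon_{\rm w}I\right)^{-1/2}\;R^{T} \label{eq7} \end{equation} where $I$ is the identity matrix. The details of the whitening matrix are summarized here.

6. Accuracy

　The results are shown below. The column accuracy is the classification rate with the ordinary linear SVM, and accuracy(Hellinger Kernel) is the rate when the Hellinger kernel is used.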
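To make the sizes implied by these parameters concrete, here is a small worked calculation. The 300-pixel image side is a hypothetical round number, since the actual image sizes vary; the other values come from the parameter table and the dataset description.

```python
n, w, s, K = 50, 6, 1, 1000  # values from the parameter table
d = 1                        # gray images
m = 300                      # images in S_unsupervised

N = d * w * w        # dimension of one patch vector
M = n * m            # patch vectors used for whitening and clustering
feature_dim = 4 * K  # dimension of the per-image feature vector

# Hypothetical 300 x 300 image: number of patches scanned with stride s.
side = 300
patches_per_image = ((side - w) // s + 1) ** 2
```

With these values, $N=36$, $M=15000$, and each image is described by a 4000-dimensional vector pooled from 87025 patches.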

category A	category B	accuracy	accuracy(Hellinger Kernel)
bedroom	CALsuburb	99%(99/100)	97%(97/100)
livingroom	industrial	94%(94/100)	97%(97/100)
MITforest	MITmountain	81%(81/100)	86%(86/100)
industrial	CALsuburb	88%(88/100)	97%(97/100)

　The Hellinger kernel can be realized by taking the square root of each element of the feature vector and then applying the linear SVM (homogeneous kernel map).