memo: September 2015

Thursday, September 10, 2015

Feature Extraction by CNN and Classification by SVM

Introduction

　On the previous page, I performed scene recognition with a Convolutional Neural Network (CNN) using the library Caffe. On this page, the same CNN is applied to the same dataset to extract feature vectors, and the classification is done with a linear Support Vector Machine (SVM) from the library LIBLINEAR. The recognition accuracy reaches about 95%.

Feature Extraction

　A Python script extracts feature vectors from the CNN and converts them into the input format that LIBLINEAR requires. The format of the training and test data files for LIBLINEAR is the same as that for LIBSVM: each line contains a label followed by a feature vector. I extracted the feature vectors from the layer "fc7"; their dimension is 4096. The script reads the file "total_list_15_in_local_machine.txt", in which each line consists of a file path, a label, and a phase ("train", "valid", or "test"). In this work, the two phases "train" and "valid" are merged into a single phase "train". Running the script produces two files, "libsvm_train_inputs.txt" and "libsvm_test_inputs.txt", which are the input files for LIBLINEAR. There are 7560 training images and 1220 test images.
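
The conversion into the LIBLINEAR input format can be sketched as follows. This is a minimal illustration, not the original script: it assumes the "fc7" feature vector has already been obtained as a plain sequence of floats, that the fields of the list file are separated by whitespace, and the function names are hypothetical. Feature indices start at 1, as the LIBSVM format requires, and zero-valued features are omitted, which the sparse format allows.

```python
def merged_phase(phase):
    """Merge the phase "valid" into "train"; other phases are kept as they are."""
    return "train" if phase == "valid" else phase


def parse_list_line(line):
    """Split one list-file line into (path, label, phase).

    Assumes whitespace-separated fields and no spaces inside the path.
    """
    path, label, phase = line.split()
    return path, int(label), merged_phase(phase)


def to_libsvm_line(label, features):
    """Format one sample as "<label> <index>:<value> ..." with 1-based indices."""
    pairs = ["%d:%.6g" % (index, value)
             for index, value in enumerate(features, start=1)
             if value != 0]
    return " ".join([str(label)] + pairs)
```

For example, to_libsvm_line(0, [0.0, 1.5, 0.0, 2.25]) returns "0 2:1.5 4:2.25". A real "fc7" vector would have 4096 entries instead of 4.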

Execution of SVM

　An SVM is trained with the "train" command of LIBLINEAR, and the categories of the test images are then predicted with its "predict" command. The recognition accuracy is about the same as that of the CNN.
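
As a hedged sketch of this step, the snippet below builds the two LIBLINEAR command lines and computes an accuracy from predicted labels. It is not the original command: the default solver options, the model file name "scene.model", and the output file name "predictions.txt" are assumptions. It relies only on the usage documented by LIBLINEAR, "train [options] training_set_file [model_file]" and "predict [options] test_file model_file output_file".

```python
import subprocess

TRAIN_FILE = "libsvm_train_inputs.txt"
TEST_FILE = "libsvm_test_inputs.txt"
MODEL_FILE = "scene.model"        # hypothetical file name
OUTPUT_FILE = "predictions.txt"   # hypothetical file name


def liblinear_commands():
    """Return the train and predict command lines (default options assumed)."""
    train_cmd = ["train", TRAIN_FILE, MODEL_FILE]
    predict_cmd = ["predict", TEST_FILE, MODEL_FILE, OUTPUT_FILE]
    return train_cmd, predict_cmd


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


def run_all():
    """Run both commands; requires the LIBLINEAR binaries on the PATH."""
    for command in liblinear_commands():
        subprocess.check_call(command)
```

The "predict" command writes one predicted label per line to its output file and also prints the accuracy when the test file contains labels, so the function "accuracy" is only needed for a separate check.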

Sunday, September 6, 2015

Scene Recognition by Caffe

Introduction

　On this page, I perform scene recognition with the library Caffe. Starting from the pre-trained model that Caffe provides and fine-tuning it on scene images, the recognition accuracy reaches about 95%.

Computation Environment

　I used a g2.2xlarge instance on Amazon EC2, which is equipped with a GPU.

Dataset

　I trained the CNN using the dataset LSP15 from this page. The dataset consists of the following 15 directories:

MITcoast
MITforest
MIThighway
MITinsidecity
MITmountain
MITopencountry
MITstreet
MITtallbuilding
bedroom
CALsuburb
industrial
kitchen
livingroom
PARoffice
store

The name of each directory represents the scene category. Each directory contains about 200 to 300 images belonging to that category.

Data Augmentation

　To augment the dataset, I added mirrored copies of the images to it. The images were then split into two groups, "train" and "test". Each image is 256 $\times$ 256 pixels and has 3 channels. The number of images in each category is as follows:

label	name	number of train images	number of test images
0	MITcoast	610	100
1	MIThighway	440	70
2	MITmountain	630	100
3	MITstreet	490	80
4	MITforest	550	90
5	MITinsidecity	520	80
6	MITopencountry	690	110
7	MITtallbuilding	600	100
8	bedroom	360	60
9	CALsuburb	400	60
10	industrial	520	80
11	kitchen	360	60
12	livingroom	490	80
13	PARoffice	360	60
14	store	540	90
	total	7560	1220
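
The mirroring step can be illustrated with a small sketch. It is only an assumption about how the augmentation was done: the image is represented as a nested list of pixel rows so that the example needs no image library, and the "_flipped" suffix is inferred from the file names that appear in "test.txt" and "train.txt". Both function names are hypothetical.

```python
import os


def mirror(image):
    """Return the left-right mirror of an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def flipped_name(filename):
    """Insert "_flipped" before the file extension."""
    root, ext = os.path.splitext(filename)
    return root + "_flipped" + ext
```

For example, flipped_name("MITcoast_image_0126.jpg") returns "MITcoast_image_0126_flipped.jpg", and mirroring an image twice gives back the original.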

Dataset for Caffe

　Caffe requires the following directories and files:

a directory which contains training images
a directory which contains test images
a text file that lists the file names and labels of the training images
a text file that lists the file names and labels of the test images

In my environment, they are put in the following paths:

/home/ubuntu/data/caffe_256_15/train/
/home/ubuntu/data/caffe_256_15/test/
/home/ubuntu/data/caffe_256_15/train.txt
/home/ubuntu/data/caffe_256_15/test.txt

The contents of the file "test.txt" are as follows:

MITstreet_image_0179_flipped.jpg 3
MITtallbuilding_image_0173_flipped.jpg 7
MITcoast_image_0126.jpg 0
store_image_0158_flipped.jpg 14
MITinsidecity_image_0102_flipped.jpg 5
MITforest_image_0200_flipped.jpg 4
industrial_image_0189_flipped.jpg 10
MITcoast_image_0142.jpg 0
kitchen_image_0019_flipped.jpg 11
bedroom_image_0210_flipped.jpg 8
bedroom_image_0116_flipped.jpg 8
livingroom_image_0008_flipped.jpg 12
kitchen_image_0051_flipped.jpg 11
MITstreet_image_0167_flipped.jpg 3
MITcoast_image_0315.jpg 0
....

The contents of the file "train.txt" are as follows:

industrial_image_0190.jpg 10
CALsuburb_image_0103_flipped.jpg 9
bedroom_image_0022_flipped.jpg 8
MITopencountry_image_0222.jpg 6
MITstreet_image_0040.jpg 3
MIThighway_image_0053_flipped.jpg 1
livingroom_image_0063_flipped.jpg 12
store_image_0106_flipped.jpg 14
industrial_image_0144.jpg 10
kitchen_image_0085_flipped.jpg 11
bedroom_image_0040.jpg 8
MIThighway_image_0088_flipped.jpg 1
industrial_image_0264.jpg 10
bedroom_image_0117_flipped.jpg 8
MITcoast_image_0021_flipped.jpg 0
...
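
Lines of this form can be generated from the file names alone. The sketch below is an illustration rather than the script actually used: it assumes that the category is the part of the file name before "_image_", and it takes the label numbers from the table in the section Data Augmentation. The function name "list_line" is hypothetical.

```python
# Label numbers as listed in the table of the section Data Augmentation.
LABELS = {
    "MITcoast": 0, "MIThighway": 1, "MITmountain": 2, "MITstreet": 3,
    "MITforest": 4, "MITinsidecity": 5, "MITopencountry": 6,
    "MITtallbuilding": 7, "bedroom": 8, "CALsuburb": 9, "industrial": 10,
    "kitchen": 11, "livingroom": 12, "PARoffice": 13, "store": 14,
}


def list_line(filename):
    """Return "<file name> <label>" for one image file."""
    category = filename.split("_image_")[0]
    return "%s %d" % (filename, LABELS[category])
```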

After storing the images listed in the files "test.txt" and "train.txt" in the directories "test" and "train" respectively, a script is run to create the dataset for Caffe. It outputs "test_leveldb" and "train_leveldb", which are the inputs for Caffe.
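
A conversion of this kind is usually done with the tool "convert_imageset" that ships with Caffe. The sketch below only builds the command lines and is not the original script: the location of the tool inside the Caffe tree and the choice of flags are assumptions. The usage "convert_imageset [FLAGS] ROOTFOLDER/ LISTFILE DB_NAME" is the one documented by the tool itself.

```python
import subprocess

TOOL = "build/tools/convert_imageset"   # assumed location inside the Caffe tree
DATA = "/home/ubuntu/data/caffe_256_15"


def convert_command(phase, gray=False):
    """Build the command that converts one phase ("train" or "test") to LevelDB."""
    command = [TOOL, "--backend=leveldb"]
    if gray:
        command.append("--gray")   # only for single-channel images
    command += [
        "%s/%s/" % (DATA, phase),      # root folder; must end with "/"
        "%s/%s.txt" % (DATA, phase),   # list of file names and labels
        "%s_leveldb" % phase,          # output database
    ]
    return command


def convert_all():
    """Run the conversion for both phases; requires a built Caffe."""
    for phase in ("train", "test"):
        subprocess.check_call(convert_command(phase))
```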

Definition of CNN

　I defined the structure of the CNN in the file "model/scene_recognition/train_val.prototxt". The file is based on "/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet/train_val.prototxt" and differs from the original in the layers "data" and "fc8": I replaced the layer "fc8" with a new layer "scene_fc8". Moreover, following the explanation in this page, I modified the parameters of the layer "scene_fc8".
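
To illustrate what such a replaced layer typically looks like, the sketch below renders a prototxt fragment for "scene_fc8" from Python. The concrete values are assumptions and not taken from the original file: the number of outputs is set to the 15 categories of this dataset, and the "lr_mult" values of 10 and 20 as well as the fillers follow the Flickr Style fine-tuning example distributed with Caffe.

```python
LAYER_TEMPLATE = """layer {{
  name: "scene_fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "scene_fc8"
  # Assumed learning-rate multipliers for the weights and the biases.
  param {{ lr_mult: {weight_lr} decay_mult: 1 }}
  param {{ lr_mult: {bias_lr} decay_mult: 0 }}
  inner_product_param {{
    num_output: {num_output}
    weight_filler {{ type: "gaussian" std: 0.01 }}
    bias_filler {{ type: "constant" value: 0 }}
  }}
}}"""


def scene_fc8(num_output=15, weight_lr=10, bias_lr=20):
    """Render the layer definition as prototxt text."""
    return LAYER_TEMPLATE.format(
        num_output=num_output, weight_lr=weight_lr, bias_lr=bias_lr)
```

Because Caffe copies pre-trained weights by layer name, a layer with a new name such as "scene_fc8" starts from its fillers, which is why it is commonly given a larger learning-rate multiplier than the layers that are already trained.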

Definition of Solver

　The solver file used for training the CNN is "model/scene_recognition/solver.prototxt". It is based on the file "models/bvlc_reference_caffenet/solver.prototxt".

Training

　A script is run to train the CNN. The pre-trained model that Caffe provides, "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel", is passed as the argument of the command option "-weights". The script fine-tunes the pre-trained model on the current dataset.
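
A fine-tuning run of this kind is typically started with the "caffe train" command. The sketch below builds that command line from the paths given in the text. The location of the "caffe" binary is an assumption, and the original script may have passed further options.

```python
import subprocess

CAFFE = "build/tools/caffe"   # assumed location inside the Caffe tree
SOLVER = "model/scene_recognition/solver.prototxt"
WEIGHTS = "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel"


def train_command(solver=SOLVER, weights=WEIGHTS):
    """Build the command that fine-tunes the pre-trained model."""
    return [CAFFE, "train", "-solver", solver, "-weights", weights]


def train():
    """Start the training; requires a built Caffe and the files above."""
    subprocess.check_call(train_command())
```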

Result

The result is plotted with the iteration number on the x-axis and the recognition accuracy on the y-axis. Because the total number of iterations is 80,000 and the solver is configured to output the accuracy once every 500 iterations, the maximum value on the x-axis is 160 (= 80,000/500). The recognition accuracy reaches about 95%.

Construction of Classifier

　After the training, the file "scene_train_iter_80000.caffemodel" is created. It contains the parameters of the fine-tuned CNN. To construct a classifier from this model file, a file named "deploy.prototxt" is needed. It is made from the file "model/scene_recognition/train_val.prototxt" by the following procedure:

Remove the layer "data" and add the four "input_dim" lines described below.
Remove the layers "loss" and "accuracy" and add a new layer in their place.

The four lines that replace the layer "data" have the following meaning:

input_dim: 20 --- batch size
input_dim: 3 --- channel number
input_dim: 227 --- height of an image
input_dim: 227 --- width of an image

The code that classifies an image is saved as "classifier.py". With it, I can now classify images.
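
A classifier of this kind can be sketched with the Python interface of Caffe. The code below is not the original "classifier.py": the scaling and channel settings are assumptions borrowed from the standard classification example of Caffe, mean subtraction is left out for brevity, and the function names are hypothetical. The module "caffe" is imported inside the function so that the helper "top_category" can be used without Caffe installed.

```python
# Category names in label order, as in the table of the section Data Augmentation.
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]


def top_category(scores):
    """Return the category name with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best]


def classify(image_path,
             model_def="deploy.prototxt",
             weights="scene_train_iter_80000.caffemodel"):
    """Classify one image file; requires pycaffe and the two model files."""
    import caffe  # imported here so the helper above works without Caffe

    net = caffe.Classifier(model_def, weights,
                           image_dims=(256, 256),
                           raw_scale=255,             # assumed: pixels in [0, 255]
                           channel_swap=(2, 1, 0))    # assumed: RGB -> BGR
    image = caffe.io.load_image(image_path)
    scores = net.predict([image])[0]
    return top_category(list(scores))
```

If the network was trained with a mean image, the same mean would have to be passed to "caffe.Classifier" through its "mean" argument to reproduce the training-time preprocessing.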

Wednesday, September 2, 2015

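Scene Recognition by Caffe (8-Class Classification Problem) 2

Introduction

　On the previous page, I attempted scene recognition (an 8-class classification problem) with Caffe. This time, I consider the same problem using the pre-trained model that Caffe provides.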

Computation Environment

　I used Amazon EC2. The instance type is g2.2xlarge, a machine equipped with a GPU.

Dataset

　The dataset is the same as last time. However, to match the pre-trained model, I changed the image size to 256$\times$256 and the number of channels to 3.

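Creating Data for Caffe

　This step is also the same as last time. The input data for Caffe is created with a command. Last time I handled grayscale images and therefore passed the command option -gray; this time the images have 3 channels, so -gray is omitted.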

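Network Design

　The directory that holds the Caffe source code,

/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet

contains the file train_val.prototxt, which I used as the basis for defining the network structure (model/scene_recognition/train_val.prototxt). Two parts were changed, the data layer and the fc8 layer (the fc8 layer was replaced with a scene_fc8 layer). In addition, following the explanation on this page, I changed the parameter lr_mult of the scene_fc8 layer.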

Creating the Solver

　I wrote the text file used for training, model/scene_recognition/solver.prototxt, based on the sample file that Caffe provides (models/bvlc_reference_caffenet/solver.prototxt). There are currently 730 test images and I set the batch size to 10, so 73 (= test_iter) iterations scan all the test images once. The test is run once every 500 (= test_interval) training iterations.

Training

　Training is started with a command. The argument given to the command option -weights, models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel, is the pre-trained model that Caffe provides. The command fine-tunes the pre-trained model, which was built beforehand on large-scale data, with my data.

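Results

　The results are shown below.
Test images:

Train images:

The vertical axis is the accuracy and the horizontal axis is the iteration count. In this case, the total number of iterations is 44238. Because the test images are evaluated once every 500 iterations, the horizontal axis of the test result extends to 89 (≈ 44238/500). For the training images, on the other hand, a result is output once every 73 (= display) iterations, so 607 (≈ 44238/73) results are shown. The accuracy on the test images was 88% last time; this time it is 97%, a substantial improvement.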
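
The counts 89 and 607 are consistent with one result being output at iteration 0 and one more after every full interval. That reading is an assumption and is not stated in the text; the sketch below only checks the arithmetic, and the function name is hypothetical.

```python
def logged_points(total_iterations, interval):
    """Results output at iteration 0 and then once every `interval` iterations."""
    return total_iterations // interval + 1
```

For 44238 iterations, an interval of 500 gives 89 results and an interval of 73 gives 607 results.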

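Creating the Classifier

　The computation so far has produced the file scene_train_iter_40000.caffemodel, which holds the network whose parameters were determined by fine-tuning. To build a classifier from this file, a deploy.prototxt file is needed. It is obtained from the file model/scene_recognition/train_val.prototxt defined earlier by removing the data layer and adding input_dim lines in its place, and by removing the loss and accuracy layers at the end and adding a new layer in their place. The input_dim lines inserted in place of the data layer have the following meaning: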

input_dim: 10 --- batch size
input_dim: 3 --- number of channels of the input image
input_dim: 227 --- height of the input image
input_dim: 227 --- width of the input image

I wrote code that classifies files with the classifier. When run, it classified them perfectly.