## Thursday, September 10, 2015

### Feature Extraction by CNN and Classification by SVM

#### Introduction

In the previous post, I performed scene recognition using a Convolutional Neural Network (CNN) built with the library Caffe. In this post, the same CNN is applied to the same dataset to extract feature vectors, and the classification is done by the Support Vector Machine (SVM) in the library LIBLINEAR. The recognition accuracy reaches about 95%.

#### Feature Extraction

A Python script extracts the feature vectors from the CNN and converts them into the input format that LIBLINEAR requires. The format of the training and testing data files for LIBLINEAR is the same as that for LIBSVM: each line contains a label and a feature vector. I extracted the feature vectors from the layer "fc7." Their dimension is 4096.

The script reads the file "total_list_15_in_local_machine.txt", in which each line consists of a file path, a label, and a phase ("train"/"valid"/"test"). In this work, the two phases "train" and "valid" are merged into the single phase "train."

Running the script produced two files, "libsvm_train_inputs.txt" and "libsvm_test_inputs.txt", which are the input files for LIBLINEAR. The number of training images is 7560 and the number of testing images is 1220.
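
As a minimal sketch of the conversion step only (not the original script), the functions below format one sample as a LIBSVM-style line and merge the phases. The function names, the whitespace-separated layout of the list file, and the "%g" number formatting are assumptions made for illustration.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LIBSVM/LIBLINEAR input line.

    Feature indices are 1-based and ascending; zero-valued features
    are omitted, which the sparse format allows.
    Note: "%g" keeps about 6 significant digits (an assumption here).
    """
    pairs = ["%d:%g" % (i, v) for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def merge_phase(phase):
    """Map the phases "train" and "valid" to "train"; keep "test" as is."""
    return "train" if phase in ("train", "valid") else phase


def parse_list_line(line):
    """Split a "path label phase" line (assumed whitespace-separated)."""
    path, label, phase = line.split()
    return path, int(label), merge_phase(phase)
```

For a 4096-dimensional "fc7" vector, `to_libsvm_line(label, vector)` would yield one line of the training or testing file, depending on the merged phase.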

#### Execution of SVM

One LIBLINEAR command is run to train the SVM, and a second command is then run to predict the categories of the test images. The recognition accuracy is about the same as that of the CNN.
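
The accuracy can also be recomputed by hand from the prediction output. The sketch below assumes that the prediction file holds one predicted label per line and that the true labels are the leading field of each line in the test input file; the function names are hypothetical.

```python
def read_labels(lines):
    """Read the leading label of each non-empty line."""
    return [int(float(line.split()[0])) for line in lines if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of samples whose predicted label is correct."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    if not true_labels:
        raise ValueError("no labels given")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)
```

Passing the lines of "libsvm_test_inputs.txt" and of the prediction file through `read_labels` gives the two lists that `accuracy` compares.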

## Sunday, September 6, 2015

### Scene Recognition by Caffe

#### Introduction

In this post, I perform scene recognition by means of the library Caffe. Fine-tuning the pre-trained model that Caffe provides on scene images brings the recognition accuracy to about 95%.

#### Computation Environment

I used a g2.2xlarge instance on Amazon EC2, which is equipped with a GPU.

#### Dataset

I trained the CNN using the dataset LSP15 from this page. The dataset consists of the following 15 directories:
1. MITcoast
2. MITforest
3. MIThighway
4. MITinsidecity
5. MITmountain
6. MITopencountry
7. MITstreet
8. MITtallbuilding
9. bedroom
10. CALsuburb
11. industrial
12. kitchen
13. livingroom
14. PARoffice
15. store
The name of each directory is the category of the scene. Each directory contains about 200 to 300 images that belong to its category.

#### Data Augmentation

In order to augment the dataset, I added the mirror image of every image to it. The images are then split into two groups, "train" and "test." The size of each image is 256 $\times$ 256, and the number of channels is 3. The number of images in each category is as follows:

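| label | name | number of train | number of test |
|---|---|---|---|
| 0 | MITcoast | 610 | 100 |
| 1 | MIThighway | 440 | 70 |
| 2 | MITmountain | 630 | 100 |
| 3 | MITstreet | 490 | 80 |
| 4 | MITforest | 550 | 90 |
| 5 | MITinsidecity | 520 | 80 |
| 6 | MITopencountry | 690 | 110 |
| 7 | MITtallbuilding | 600 | 100 |
| 8 | bedroom | 360 | 60 |
| 9 | CALsuburb | 400 | 60 |
| 10 | industrial | 520 | 80 |
| 11 | kitchen | 360 | 60 |
| 12 | livingroom | 490 | 80 |
| 13 | PARoffice | 360 | 60 |
| 14 | store | 540 | 90 |
| | total | 7560 | 1220 |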
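
The mirroring step described above can be sketched as follows. This is an illustration rather than the code actually used: an image is modelled as a plain list of pixel rows, and the "_flipped" suffix is inferred from the file names in the list files.

```python
import os


def mirror(image):
    """Return the horizontally mirrored image (each pixel row reversed).

    `image` is a list of rows; each row is a list of pixels.
    """
    return [row[::-1] for row in image]


def flipped_name(filename):
    """Insert "_flipped" before the file extension."""
    root, ext = os.path.splitext(filename)
    return root + "_flipped" + ext
```

Adding one mirrored copy per image doubles the dataset, which is why every count in the table is even.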

#### Dataset for Caffe

Caffe requires the following directories and files:
1. a directory which contains training images
2. a directory which contains test images
3. a text file that lists the names and labels of the training images
4. a text file that lists the names and labels of the test images
In my environment, they are put in the following paths:
1. /home/ubuntu/data/caffe_256_15/train/
2. /home/ubuntu/data/caffe_256_15/test/
3. /home/ubuntu/data/caffe_256_15/train.txt
4. /home/ubuntu/data/caffe_256_15/test.txt
The contents of the file "test.txt" are as follows:
MITstreet_image_0179_flipped.jpg 3
MITtallbuilding_image_0173_flipped.jpg 7
MITcoast_image_0126.jpg 0
store_image_0158_flipped.jpg 14
MITinsidecity_image_0102_flipped.jpg 5
MITforest_image_0200_flipped.jpg 4
industrial_image_0189_flipped.jpg 10
MITcoast_image_0142.jpg 0
kitchen_image_0019_flipped.jpg 11
bedroom_image_0210_flipped.jpg 8
bedroom_image_0116_flipped.jpg 8
livingroom_image_0008_flipped.jpg 12
kitchen_image_0051_flipped.jpg 11
MITstreet_image_0167_flipped.jpg 3
MITcoast_image_0315.jpg 0
....
The contents of the file "train.txt" are as follows:
industrial_image_0190.jpg 10
CALsuburb_image_0103_flipped.jpg 9
bedroom_image_0022_flipped.jpg 8
MITopencountry_image_0222.jpg 6
MITstreet_image_0040.jpg 3
MIThighway_image_0053_flipped.jpg 1
livingroom_image_0063_flipped.jpg 12
store_image_0106_flipped.jpg 14
industrial_image_0144.jpg 10
kitchen_image_0085_flipped.jpg 11
bedroom_image_0040.jpg 8
MIThighway_image_0088_flipped.jpg 1
industrial_image_0264.jpg 10
bedroom_image_0117_flipped.jpg 8
MITcoast_image_0021_flipped.jpg 0
...
After the images listed in the files "test.txt" and "train.txt" are stored in the directories "test" and "train" respectively, a script is run to create the dataset for Caffe. It outputs "test_leveldb" and "train_leveldb", which are the inputs for Caffe.
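
The lines of "train.txt" and "test.txt" pair a file name with the label from the table above. A minimal sketch of how such a line could be generated is given below; it assumes that the category is the part of the file name before "_image_", and the names `CATEGORIES`, `LABEL_OF`, and `list_line` are hypothetical.

```python
# Category names in label order (0 to 14), taken from the table above.
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]
LABEL_OF = {name: label for label, name in enumerate(CATEGORIES)}


def list_line(filename):
    """Build a "filename label" line for a Caffe image list file.

    Assumes the category is the text before "_image_" in the file name.
    """
    category = filename.split("_image_")[0]
    return "%s %d" % (filename, LABEL_OF[category])
```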

#### Definition of CNN

I defined the structure of the CNN in the file named "model/scene_recognition/train_val.prototxt." The file is based on the file "/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet/train_val.prototxt." My file differs from the original in the layers "data" and "fc8": I replaced the layer "fc8" with the new layer "scene_fc8." Moreover, in accordance with the explanation in this page, I modified the parameters in the layer "scene_fc8."

#### Definition of Solver

The text file used for training the CNN is "model/scene_recognition/solver.prototxt." It is based on the file "models/bvlc_reference_caffenet/solver.prototxt."

#### Training

A script is run to train the CNN. The pre-trained model that Caffe provides is "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel", which is passed as the argument of the command option "-weights." The script fine-tunes the pre-trained model on the current dataset.

#### Result

In the accuracy plot, the x-axis indicates the iteration number in units of 500 iterations and the y-axis the recognition accuracy. Because the total number of iterations is 80,000 and the solver outputs the accuracy once every 500 iterations, the maximum value of the x-axis is 160 (= 80,000 / 500). The recognition accuracy reaches about 95%.

#### Construction of Classifier

After the training, the file "scene_train_iter_80000.caffemodel" is created. The file contains the information of the fine-tuned CNN. In order to construct the classifier from the model file, a file named "deploy.prototxt" is needed. It is made from the file "model/scene_recognition/train_val.prototxt" by the following procedure.
1. Remove the layer "data," and add the four lines as shown below.
2. Remove the layers "loss" and "accuracy", and add this layer.
The four lines that replace the layer "data" mean the following (Caffe orders the input dimensions as batch size, channels, height, width):
1. input_dim: 20 --- batch size
2. input_dim: 3 --- channel number
3. input_dim: 227 --- height of an image
4. input_dim: 227 --- width of an image
The code that classifies an image is implemented in a file named "classifier.py." With it, I can now classify the images.
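
As a minimal sketch of the last step of such a classifier, assuming the network returns one score per category (the actual output handling in "classifier.py" may differ), the best-ranked categories can be picked as follows. The function `top_k` and its arguments are hypothetical.

```python
def top_k(scores, names, k=3):
    """Return the k (name, score) pairs with the highest scores.

    `scores` holds one value per category; `names` holds the category
    names in label order.
    """
    if len(scores) != len(names):
        raise ValueError("scores and names differ in length")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(names[i], scores[i]) for i in ranked[:k]]
```

With `k=1` the function returns only the predicted category, which is what the recognition accuracy is measured on.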

## Wednesday, September 2, 2015

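### Scene Recognition by Caffe (8-Class Problem) 2

#### Introduction

In the previous page, I tried scene recognition (an 8-class problem) using caffe. This time, I study the same problem using the pre-trained model that caffe provides.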

#### Computation Environment

I used Amazon EC2. The instance type is g2.2xlarge, a machine equipped with a GPU.

#### Dataset

The dataset is the same as last time. However, to match the pre-trained model, I changed the image size to 256 $\times$ 256 and the number of channels to 3.

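#### Creating the Data for caffe

This step is also the same as last time. The input data for caffe are created with a single command. Last time I handled grayscale images and therefore added the command option -gray; this time the images have 3 channels, so -gray is omitted.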

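#### Network Design

Based on train_val.prototxt in the directory
/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet
of the caffe source tree, I defined the network structure in model/scene_recognition/train_val.prototxt. Two layers were changed: the data layer and the fc8 layer (the fc8 layer was replaced with a scene_fc8 layer). In addition, following the explanation in this page, I changed the parameter lr_mult of the scene_fc8 layer.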

#### Creating the Solver

I wrote the text file used for training, model/scene_recognition/solver.prototxt, based on the sample file that caffe provides (models/bvlc_reference_caffenet/solver.prototxt). There are currently 730 test images and the batch size is 10, so 73 (= test_iter) iterations scan all the test images once. In addition, the test is run once every 500 (= test_interval) training iterations.

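#### Training

The training is started with a single command. The file models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel, given as the argument of the command option -weights, is the pre-trained model that caffe provides. The command fine-tunes this model, which was built in advance on large-scale data, with my own data.

The results are shown below.
Test images:
Train images: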

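#### Construction of Classifier

The computation so far has produced the file scene_train_iter_40000.caffemodel, which stores the network whose parameters were fixed by fine-tuning. To build a classifier from this file, a deploy.prototxt file is needed. It is derived from model/scene_recognition/train_val.prototxt, defined earlier, by removing the data layer and adding input_dim entries in its place, and by removing the loss and accuracy layers at the end and adding a layer in their place. The input_dim entries inserted in place of the data layer mean the following (caffe orders the input dimensions as batch size, channels, height, width):
1. input_dim: 10 --- batch size
2. input_dim: 3 --- number of channels of the input image
3. input_dim: 227 --- height of the input image
4. input_dim: 227 --- width of the input image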