## Thursday, June 28, 2012

### Introduction

On this page, I describe my attempt to implement scene recognition. The reference site listed at the end explains the approach in enough detail to accomplish my purpose. As described there, the steps to realize scene recognition are as follows:
1. extraction of local features
2. detection of visual words
3. construction of histograms by visual words
4. training of Support Vector Machine(SVM)
5. prediction by SVM
As the SVM is a supervised learning method, I needed a set of labeled images to train it. I selected the dataset LSP15, which contains indoor and outdoor images classified into 15 categories. Each category holds about 200 to 300 images of approximately 300x300 pixels. I regrouped the 15 categories into two (indoor and outdoor) and built a two-class classifier. My guideline was to implement everything with open-source libraries such as OpenCV2.

### Extraction of Local Features

OpenCV2 provides the class cv::DescriptorExtractor, which has a static factory method create. Passing it a string indicating the type of local feature, it returns one of the following extractors:
1. SIFT
2. SURF
3. ORB
4. BRIEF
To extract local features densely, I computed them on a regular grid over the image, where src is the input image, keypoints is a vector of cv::KeyPoint objects holding the grid coordinates, and descriptors_ is the output (i.e., the descriptors of the local features) of type cv::Mat.

### Detection of Visual Words

From the obtained local features I sampled a subset and ran k-means clustering over it. OpenCV2 provides the class cv::BOWKMeansTrainer for this. After adding the sampled local features with the method add, the method cluster runs the clustering, where trainer_ is a BOWKMeansTrainer object, descriptors is the set of sampled local features, and vocabulary is the output (i.e., the centroids of the clusters).

### Construction of Histograms

For this purpose, I used the class cv::BOWImgDescriptorExtractor provided by OpenCV2. After setting the vocabulary calculated earlier on a BOWImgDescriptorExtractor object bow_extractor_, I computed a histogram, where keypoints is the same vector of cv::KeyPoint objects as used in Extraction of Local Features, and histogram is the output. One histogram is obtained from one image. Notice that the histogram calculated this way is already normalized.

### Training of SVM

The kernels known to be effective for scene recognition are as follows:
1. χ2 kernel
2. histogram-intersection kernel
In this study, I used the library shogun, which supports these kernels and provides a unified interface to several existing SVM implementations. My calculation procedure is as follows: the two arguments positive and negative passed to the method SvmTrainer::execute are the histograms calculated from the indoor and outdoor images. They are first unified into one feature object, features. Next, labels are created to distinguish the positive (indoor) from the negative (outdoor) data. The kernel is then constructed and initialized, and finally the svm is trained. The backend library is libsvm. After training, the svm object is saved with the method save_serializable, which serializes the object. At this point, the classifier is complete.

### Prediction by SVM

To restore the svm object, I used the method load_serializable. After the svm object is restored, the prediction is executed. The result is returned as a shogun::CLabels object and then saved.

### Parameter Setting

#### extraction of local features

I selected SIFT as the descriptor algorithm; its dimension is 128. The grid interval is 5 pixels. The member variables size and angle of the class cv::KeyPoint are set to (16, 24, 32) pixels and (0, 120, 240) degrees, respectively.

#### detection of visual words

The number of clusters is 10000. The number of descriptors passed as input is 648659; these descriptors were calculated from about one sixtieth of all images.

#### construction of histograms

I selected BruteForce as the descriptor-matching algorithm.

#### others

As mentioned at the start of this page, the 15 categories were regrouped into an indoor group and an outdoor one. The number of images in the indoor group is about half that of the outdoor group, so I added mirror-reversed images to the indoor group to balance the two.

### Results

I carried out K-fold cross-validation with K=3: I partitioned all images into 3 groups, retained one group as validation data for testing the svm predictor, and used the remaining groups as training data. This process was repeated 3 times, and the 3 results were averaged to produce a single estimate.
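The splitting and averaging above can be sketched as follows (a minimal illustration; the index-modulo assignment of images to folds is my own choice, not necessarily the partition I actually used):

```cpp
#include <vector>
#include <cstddef>

// K-fold partition by index: image i goes to fold (i % K); in round k,
// fold k is the validation set and the remaining folds are training data.
void kfoldSplit(std::size_t numImages, std::size_t K, std::size_t k,
                std::vector<std::size_t>& trainIdx,
                std::vector<std::size_t>& validIdx)
{
    trainIdx.clear();
    validIdx.clear();
    for (std::size_t i = 0; i < numImages; ++i) {
        if (i % K == k) validIdx.push_back(i);
        else            trainIdx.push_back(i);
    }
}

// The K per-fold recognition rates are averaged into a single estimate.
double averageRate(const std::vector<double>& rates)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < rates.size(); ++i)
        sum += rates[i];
    return rates.empty() ? 0.0 : sum / rates.size();
}
```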

#### χ2 kernel

width=0.5, c=10

|                  | indoor | outdoor |
|------------------|--------|---------|
| recognition rate | 0.820  | 0.819   |

#### histogram-intersection kernel

β=1, c=10

|                  | indoor | outdoor |
|------------------|--------|---------|
| recognition rate | 0.781  | 0.827   |

#### Gaussian kernel

width=0.5, c=100

|                  | indoor | outdoor |
|------------------|--------|---------|
| recognition rate | 0.813  | 0.805   |

### Todo List

1. I will publish my source code.
2. My purpose here was to construct a framework for scene recognition, so I used the standard algorithms. I will consider the Gaussian Mixture Model, which has a continuous distribution in contrast to the histogram, and I would also like to study the Fisher Vector, which is under intense investigation.
3. I will extend the two-class classifier to a 15-class one.
4. I will consider parallelization with Hadoop for training the svm and predicting with it.
5. I will implement a web application that receives an image from a user and sends back a prediction.

### Reference Sites

1. n_hidekey's diary: This site gives a detailed description of the procedure for scene recognition.