Master's Thesis in Engineering

SRAM: Saliency aided Recurrent Attention
Model for Image Classification
(A Recurrent Attention Model Utilizing Image Saliency for Image Classification)

August 2018

Graduate School, Seoul National University
Department of Industrial Engineering
Minki Chung
SRAM: Saliency aided Recurrent Attention
Model for Image Classification
(A Recurrent Attention Model Utilizing Image Saliency for Image Classification)

Advisor: Professor Sungzoon Cho

This thesis is submitted as a Master's thesis in Engineering

August 2018

Graduate School, Seoul National University
Department of Industrial Engineering
Minki Chung

Approving the Master's thesis of Minki Chung

June 2018

Chair: Jonghun Park (Seal)
Vice Chair: Sungzoon Cho (Seal)
Member: Jaewook Lee (Seal)
Abstract
SRAM: Saliency aided Recurrent Attention Model for Image Classification
Minki Chung
Department of Industrial Engineering
The Graduate School
Seoul National University

To overcome the poor scalability of convolutional neural networks, the recurrent attention model (RAM) selectively chooses what and where to look in an image. By directing the recurrent attention model how to look at the image, RAM can be even more successful, since the given clue narrows down the scope of the possible focus zone. From this perspective, this work proposes a novel model, the saliency aided recurrent attention model (SRAM), which adopts visual saliency as additional information for image classification. First, a saliency detector with an intuitive structure for extracting image saliency information was constructed. By combining the saliency map as an additional image channel, visual saliency serves as a milestone for viewing the image. SRAM follows an encoder-decoder framework: the encoder utilizes a recurrent attention model with spatial transformer networks, and the decoder performs classification. To validate the performance, we evaluate SRAM on the cifar10 dataset, showing up to a 2.209% decrease in error rate compared to the RAM model.
Keywords: Visual saliency, Image classification, Saliency based classification, Recurrent attention model
Student Number: 2016-26799
Contents

Abstract i
Contents iv
List of Tables v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Recurrent Attention Model (Mnih et al. 2014) 4
2.2 Spatial Transformer Network (Jaderberg et al. 2015) 6
2.3 Recurrent Attentional Networks for Saliency Detection (Kuen et al. 2016) 7
2.4 Learning Deconvolution Network for Semantic Segmentation (Noh et al. 2015) 8
2.5 Visual Saliency based Classification 9
Chapter 3 Proposed Method 11
3.1 Saliency Detector 11
3.2 SRAM 13
3.2.1 Encoder 14
3.2.2 Decoder 17
3.2.3 Training 18
Chapter 4 Implementation Details 20
4.1 Saliency Detector 20
4.2 SRAM 20
Chapter 5 Experiments and Results 22
5.1 Saliency Detector 22
5.2 Performance on Cifar10 dataset 24
5.3 Computation Cost 26
Chapter 6 Discussion 27
Chapter 7 Conclusion 29
Bibliography 31
국문초록 (Abstract in Korean) 36
List of Tables

Table 5.1 Performance comparison on Cifar10 dataset 25
Table 5.2 Computation cost of SRAM vs RAM vs VGG 26
List of Figures

Figure 1.1 An example of image saliency 3
Figure 2.1 Architecture of Recurrent Attention Model (RAM) (Mnih et al. 2014) 5
Figure 2.2 Architecture of Spatial Transformer Network (STN) (Jaderberg et al. 2015) 6
Figure 2.3 Process of spatial transformer (Jaderberg et al. 2015) 6
Figure 2.4 Architecture of Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) (Kuen et al. 2016) 7
Figure 2.5 Architecture of convolution, transpose-convolution network for segmentation (Noh et al. 2015) 8
Figure 2.6 Representation of image with saliency maps (Sharma et al. 2012) 9
Figure 2.7 Architecture of the SalClassNet – saliency detection guided by a visual classification task (Murabito et al. 2017) 10
Figure 3.1 Architecture of saliency detector 11
Figure 3.2 Overall architecture of SRAM 13
Figure 3.3 Architecture of SRAM encoder 14
Figure 3.4 Architecture of STN 16
Figure 5.1 Visual saliency of cifar10 dataset obtained with saliency detector 23
Figure 5.2 Visual saliency of Imagenet dataset obtained with saliency detector 23
Figure 5.3 Examples of Cifar10 dataset 24
Figure 6.1 Failure cases of saliency detection of Imagenet dataset 28
Chapter 1
Introduction
Convolutional neural networks (CNN) [19] have brought huge success in various computer vision tasks, from simple tasks such as image classification [16] or object detection [29] to complex ones such as motion tracking [34] or image inpainting [27]. CNNs, however, are not perfect. One limitation of CNNs is their poor scalability with increasing input image size in terms of computational efficiency. With limited time and resources, it is necessary to select where, what, and how to look at the image. Facing a bird-specific fine-grained classification task, for example, it would be meaningless to pay attention to non-bird image parts such as trees or sky. Rather, one should focus on regions which play decisive roles in classification, such as the beak or wings. If a machine can learn how to pay attention to those regions, it will achieve better performance with lower energy usage.
In this context, the Recurrent Attention Model (RAM) [22] introduced a visual attention method that can be applied to numerous computer vision tasks. By sequentially choosing where and what to look at, RAM achieved better performance with less computation and lower memory usage. Moreover, its attention mechanism addressed the poor interpretability of deep learning models by proposing glimpse patches, which show the regions RAM attends to. Still, there is room for RAM to improve. In addition to where and what to look at, if one can give clues on how to look, i.e., a task-specific hint, learning could be more intuitive and efficient. From this perspective, we propose a novel architecture, the Saliency aided Recurrent Attention Model (SRAM), which adopts image saliency as a problem-oriented clue for RAM. By applying visual saliency extracted with an existing saliency detector, the RAM structure can achieve faster convergence and better performance.
Visual saliency is the distinct object or region in an image which stands out from its neighbors and grabs attention. As shown in Figure 1.1, the bird becomes the salient part of the image. Due to its broad applications, visual saliency has received increasing focus in recent years; simple examples of its applications are object recognition [30], image retrieval [3], and visual tracking [2]. Since the salient region of an image is the target of interest in most cases, we expect the classification task to benefit from this additional information, which advises where to attend preferentially. This shares the concept of visual attention with [12, 22, 35, 13, 7, 8]. Considering how the human visual perception system works, this approach makes sense: visual information is gated before being processed by the brain's cognitive system, depending on the distinct visual characteristics of the images. Even though image saliency is implicit information which the image itself already contains, we believe there is a difference in turning implicit information into explicit information.

As preliminary work for SRAM, we built a saliency detector based on the architecture of [26]. By training on the benchmark saliency dataset MSRA10K [4], we were able to extract image saliencies which are sufficiently descriptive clues for image classification. We then evaluated our model on the cifar10 dataset [15], showing better performance with only a fractional increase in computation and memory usage compared to RAM.

Figure 1.1: An example of image saliency
In summary, this paper makes the following contributions:

(a) Constructed a saliency detector network which can extract sufficiently explanatory visual saliency helpful for image classification.

(b) Proposed a novel model, the saliency aided recurrent attention model (SRAM), which incorporates visual saliency into RAM for intensive problem solving.

(c) Evaluated on the cifar10 dataset, showing a successful extension of RAM.

The rest of the paper is structured as follows. In Chapter 2, we review the literature related to the problem. In Chapter 3, we propose the visual saliency detector network and the saliency aided recurrent attention model (SRAM). In Chapter 4, we explain the details of the above-mentioned structures. In Chapter 5, we show experiments and results with the proposed method. In Chapter 6, we discuss the experimental results. Finally, in Chapter 7, we give concluding remarks and possible future research directions.
Chapter 2
Related Work
2.1 Recurrent Attention Model (Mnih et al. 2014)
RAM [22] first proposed a recurrent neural network (RNN) [21] based attention model. Subsequent studies have been published steadily to the present day, given its novel attention mechanism. The idea of RAM is inspired by the human visual system: when encountering an image too large to be processed at a glance, a human processes the image part by part, sequentially, depending on one's interest. By selectively choosing what and where to look, RAM showed higher performance while reducing computation and memory usage.

The overall architecture of RAM is shown in Figure 2.1. RAM includes two components: the glimpse sensor (A in Figure 2.1), which extracts a retina-like representation containing patches at multiple resolutions, and the glimpse network (B in Figure 2.1), which uses the glimpse sensor to produce a glimpse representation. RAM is an RNN-based model which takes the glimpse representation as input, combines it with the internal representation from the previous time step, and produces a new internal state. The location network and action network (bottom parts of C in Figure 2.1) use the new internal state to decide where to look at the next time step.
However, since RAM attends to image regions using a sampling method, it has the fatal weakness of relying on REINFORCE [33] for optimization, which makes it difficult to train compared to models optimized by back-propagation. Following RAM, the Deep Recurrent Attention Model (DRAM) [1] presented an advanced architecture for multiple object recognition, and the Deep Recurrent Attentive Writer (DRAW) [9] introduced a sequential image generation method without REINFORCE. The Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) [17] replaces the locating module with a spatial transformer network [12] and exploits recurrent attention models for visual saliency refinement (more details in Section 2.3).
Figure 2.1: Architecture of Recurrent Attention Model (RAM) (Mnih et al. 2014)
2.2 Spatial Transformer Network (Jaderberg et al. 2015)
Spatial transformer network (STN) [12] first proposed a parametric spatial attention module for image classification and object detection. It tackles another limitation of CNN structures: invariance to rotations and translations is only manifested in deep layers. As shown in Figure 2.2, the spatial transformer consists of three components: a localisation net, a grid generator, and a sampler. The localization network takes a feature map as input and outputs θ, the parameters for the transformation. With θ, the grid generator produces a new coordinate system for the transformed image. Finally, the sampler uses the new coordinates to transform the input image into a favorable formation. Figure 2.3 shows the overall process of the spatial transformer.
Figure 2.2: Architecture of Spatial Transformer Network (STN) (Jaderberg et al. 2015)
Figure 2.3: Process of spatial transformer (Jaderberg et al. 2015)
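To make the three-component pipeline concrete, the sketch below implements the grid generator and the bilinear sampler in plain NumPy for the isotropic scale-and-translation case; the localisation net, which would regress s, tx, ty from the feature map, is omitted, and the function names (`affine_grid`, `bilinear_sample`) are ours, not from [12]:

```python
import numpy as np

def affine_grid(s, tx, ty, H, W):
    """Grid generator: map each normalized output coordinate through
    tau = [[s, 0, tx], [0, s, ty]] to get its source-image coordinate."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    return s * xs + tx, s * ys + ty  # source coordinates, still in [-1, 1]

def bilinear_sample(img, src_x, src_y):
    """Sampler: bilinearly interpolate a single-channel image at the source grid."""
    H, W = img.shape
    fx = (src_x + 1) * (W - 1) / 2   # normalized -> pixel coordinates
    fy = (src_y + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(fx).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(fy).astype(int), 0, H - 2)
    wx, wy = fx - x0, fy - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy + img[y0 + 1, x0 + 1] * wx * wy)
```

With s = 1 and zero translation the grid is the identity and the sampler reproduces the input; with s < 1 it produces a zoomed-in crop, which is exactly how SRAM later extracts glimpse patches.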
2.3 Recurrent Attentional Networks for Saliency Detection (Kuen et al. 2016)
The Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) [17] combines the strengths of RAM and STN for the saliency detection task. While following a RAM-based structure, RACDNN replaces the locating module of RAM with an STN, enabling back-propagation. RACDNN sequentially selects where to attend on the image and refines the image saliency of the selected region by applying transpose-convolutional layers. Starting from a basic saliency map generated by convolutional layers, RACDNN reinforces the image saliency through RNN steps. The overall process of RACDNN is shown in Figure 2.4.
Figure 2.4: Architecture of Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) (Kuen et al. 2016)
2.4 Learning Deconvolution Network for Semantic Segmentation (Noh et al. 2015)
By using convolutional and transpose-convolutional networks in the form of an encoder-decoder, [26] successfully extracts object segmentations from an image. On top of a CNN based on VGG16 [32], multiple transpose-convolution layers generate an accurate segmentation map of an input proposal. As shown in Figure 2.5, given a feature representation encoded by the CNN, a dense pixelwise class prediction map is constructed through multiple series of unpooling, transpose-convolution, and rectification operations [24]. Since the proposed architecture is intuitive yet powerful for segmentation, we constructed the saliency detector for SRAM based on this study. By replacing the last layer, changing the class predictor into a binary saliency detector, along with other small modifications, we could obtain reasonable saliency that plays the role of an advisor for classification. More details on the saliency detector for SRAM will be discussed in Section 3.1.
Figure 2.5: Architecture of convolution, transpose-convolution network for segmentation (Noh et al. 2015)
2.5 Visual Saliency based Classification
There have been similar attempts in the past to use visual saliency for image classification. [31] used saliency maps to weight the corresponding visual features by partitioning the image; the image representation of that paper is shown in Figure 2.6. Without an appropriate model for acquiring visual saliency, however, this work depends on relatively low-quality saliency maps. Another obvious limitation is that images and saliency maps are fragmented before multiplication, inevitably resulting in loss of information at the junctions of segments.
Figure 2.6: Representation of image with saliency maps (Sharma et al. 2012)
Unlike approaches that use saliency maps for classification, [23] detects salient objects guided by visual classification. They showed improved classification performance by exploiting classification-task-based saliency maps. The proposed method, SalClassNet, works as follows: first, a CNN saliency detector obtains a top-down saliency map through multiple convolutional layers. This saliency map is combined as an additional image channel, constructing an RGBS (RGB+Saliency) image. The RGBS image then passes through a CNN classifier to predict the target class. The overall architecture is shown in Figure 2.7.

The whole process is trained end-to-end, which means the training process requires ground truth visual saliency. In practice, it is difficult to expect visual saliency ground truth to exist, especially when solving a classification problem. Besides, the task-based saliency maps from this work are far from meaningful visual saliency maps. Instead, SRAM exploits existing visual saliency detection models to extract image saliency that can be applied to image classification. One methodology we adopt from this work is how they explicitly combine visual saliency into the input image: SRAM also incorporates image saliency as an additional channel of the input image, reshaping it into RGBS (RGB+Saliency) format.
Figure 2.7: Architecture of the SalClassNet – saliency detection guided by a visual classification task (Murabito et al. 2017)
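The RGBS idea borrowed here amounts to simple channel concatenation. A minimal sketch (the H x W x C array layout and function name are our own assumptions):

```python
import numpy as np

def to_rgbs(rgb, saliency):
    """Stack a saliency map (H x W, values in [0, 1]) onto an RGB image
    (H x W x 3) as a fourth channel, yielding an RGBS (H x W x 4) array."""
    assert rgb.shape[:2] == saliency.shape, "saliency must match image size"
    return np.concatenate([rgb, saliency[..., None]], axis=-1)
```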
Chapter 3
Proposed Method
3.1 Saliency Detector
Inspired by the success of [26], we constructed a saliency detector composed of an encoder using CNNs and a decoder using transpose-CNNs (T-CNNs). The CNNs downsize feature maps over multiple convolutional layers, constructing spatially compact image representations, while the T-CNNs upsize feature maps and learn increasingly localized representations. With such a CNN/T-CNN framework, we can obtain a proper image representation for image saliency detection. Unlike the work of [26], we stop applying CNN layers before the feature map shrinks to a spatial size of 1x1; through experiments we found that extreme embedding with CNNs results in a loss of spatial information for saliency detection. The architecture of the CNN/T-CNN based saliency detector is shown in Figure 3.1.
Figure 3.1: Architecture of saliency detector
The T-CNNs in the decoder part are almost identical to the conventional CNNs in the encoder part, except for a few minor differences. In T-CNNs, convolution operations are carried out in such a way that the resulting feature maps have larger spatial sizes than those of the input feature maps; this is done by applying fractional strides with appropriate zero padding. In the processing pipeline of the CNN/T-CNN for saliency detection, the CNN first transforms the input image x to a spatially compact hidden representation z, as z = CNN(x). Then z is transformed to a visual saliency map S through the T-CNN, as S = T-CNN(z). We apply a sigmoid activation function to ensure the saliency map takes values in the range [0, 1]. The loss function of the CNN/T-CNN for saliency detection is the L1 loss between the ground-truth saliency map G and S. The resulting network is a saliency detector that can be trained in an end-to-end fashion. The saliency map obtained from the saliency detector is combined with the input image as an additional channel.
In equations, Ls, the L1 loss (Mean Absolute Error, MAE) for saliency detection, is as follows:

$$L_s = \mathrm{mean}\Big(\sum_{w}\sum_{h} |G_{wh} - S_{wh}|\Big) \quad (3.1)$$

where G_wh is the ground-truth binary saliency value at pixel (w, h) and S_wh is the predicted saliency value at pixel (w, h), in the range [0, 1].
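Equation (3.1), together with the sigmoid that constrains predictions to [0, 1], can be sketched as follows (`logits` standing in for the raw output of the last T-CNN layer; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency_l1_loss(logits, G):
    """L1 (MAE) loss of Eq. (3.1): mean over pixels of |G_wh - S_wh|,
    where S = sigmoid(logits) is the predicted saliency map in [0, 1]."""
    S = sigmoid(logits)
    return float(np.mean(np.abs(G - S)))
```

A detector that outputs large positive logits on truly salient pixels drives this loss toward zero, which is the end-to-end training signal described above.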
3.2 SRAM
The visual saliency obtained in Section 3.1 is combined as a saliency channel, reshaping the input image to RGBS (RGB+Saliency) format. The architecture of SRAM is based on an encoder-decoder structure. The encoder is similar to that of RACDNN [17]: a recurrent attention model with a spatial transformer at each step for locating the next glimpse. The encoded value, the output of the first-layer RNN cell at the last step, passes through the decoder network. The decoder consists of multiple fully connected layers (MLP), with the last layer responsible for classification. Figure 3.2 shows the overall architecture of SRAM.
Figure 3.2: Overall architecture of SRAM
3.2.1 Encoder
Encoder is composed of 4 subnetworks: context network, spatial transformer net-
work, glimpse network and core recurrent neural network. The overall architecture
of encoder is shown in Figure 3.3. Considering the flow of information, we will go
into details of each network one by one.
Figure 3.3: Architecture of SRAM encoder
Context Network: The context network is the first part of the encoder. It receives the image with its saliency map (RGBS) as input and outputs the initial state tuple of r^(2)_0, the first input of the second layer of the core recurrent neural network, as shown in Figure 3.3. Using the downsampled RGBS image (img(RGBS)_coarse), the context network provides a reasonable starting point for choosing the image region to concentrate on. Downsampled images are processed with a CNN followed by an MLP.
$$c_0 = MLP_c(CNN_{context}(img(RGBS)_{coarse})) \quad (3.2)$$
$$h_0 = MLP_h(CNN_{context}(img(RGBS)_{coarse})) \quad (3.3)$$

where (c_0, h_0) is the first state tuple of r^(2)_0.
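A shape-level sketch of Equations (3.2)-(3.3): the coarse image comes from average-pool downsampling, a flattened view stands in for CNN_context, and two linear heads play the roles of MLP_c and MLP_h. All layer sizes here are illustrative, not the thesis configuration:

```python
import numpy as np

def downsample(img, factor=4):
    """Average-pool downsampling to produce img(RGBS)_coarse."""
    H, W, C = img.shape
    return img.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

def context_init(img_rgbs, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    coarse = downsample(img_rgbs)                # shared coarse view of the RGBS input
    feat = coarse.reshape(-1)                    # stand-in for CNN_context features
    W_c = rng.normal(0, 0.01, (dim, feat.size))  # MLP_c head -> c0
    W_h = rng.normal(0, 0.01, (dim, feat.size))  # MLP_h head -> h0
    return W_c @ feat, W_h @ feat                # initial state tuple (c0, h0)
```

The key structural point the sketch preserves is that both halves of the state tuple are computed from the same coarse representation.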
Spatial Transformer Network: The spatial transformer network (STN) selects the region to attend to given the RGBS image [12]. Different from the existing STN, SRAM uses a modified STN which receives the RGBS image and the output of the second layer of the core RNN as inputs and outputs a glimpse patch. From now on, "glimpse patch" indicates the attended image region which is cropped and zoomed in by the STN. Here, the STN is composed of two parts: the localization part, which calculates the transformation matrix τ with a CNN and an MLP, and the transformer part, which zooms in on the image using the obtained transformation matrix τ and produces the glimpse patch. The affine transformation matrix τ with isotropic scaling and translation is given in Equation 3.4.
$$\tau = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \\ 0 & 0 & 1 \end{bmatrix} \quad (3.4)$$

where s, t_x, and t_y are the scaling, horizontal translation, and vertical translation parameters, respectively.
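In code, Equation (3.4) and its action on a normalized homogeneous coordinate are simply:

```python
import numpy as np

def make_tau(s, tx, ty):
    """Affine matrix of Eq. (3.4): isotropic scaling plus translation."""
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty],
                     [0.0, 0.0, 1.0]])

# map the output-grid corner (1, 1) back to its source coordinate
corner = make_tau(0.5, 0.2, -0.1) @ np.array([1.0, 1.0, 1.0])
```

With s < 1 the transformed grid covers a sub-region of the image, which is what makes the glimpse patch a zoomed-in crop.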
The overall process of the STN is shown in Figure 3.4.

Figure 3.4: Architecture of STN

In equations, the process of the STN is as follows:

$$glimpse\_patch_n = STN(img(RGBS), \tau_n) \quad (3.5)$$

where n in glimpse_patch_n is the step of the core RNN, ranging from 1 to the total number of glimpses. The process for extracting τ is as follows:

$$\tau_n = MLP_{loc}(CNN_i(img(RGBS)) \oplus MLP_r(r^{(2)}_n)) \quad (3.6)$$

where ⊕ is the concatenation operation.
Glimpse Network: The glimpse network is a non-linear function which receives the current glimpse patch, glimpse_patch_n (or gp_n), and the attended-region information τ as inputs, and outputs the current step's glimpse vector. The glimpse vector is later used as the input of the first core RNN layer. glimpse_vector_n (or gv_n) is obtained by a multiplicative interaction between the extracted features of glimpse_patch_n and τ, a method of interaction first proposed by [18]. As in the previously mentioned networks, a CNN and an MLP are used for feature representation.

$$gv_n = MLP_{what}(CNN_{what}(gp_n)) \odot MLP_{where}(\tau_n) \quad (3.7)$$

where ⊙ is an element-wise vector multiplication operation.
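A shape-level sketch of Equation (3.7), with single-layer tanh stand-ins for the MLP_what/CNN_what and MLP_where pathways (weights and sizes are illustrative):

```python
import numpy as np

def glimpse_vector(gp_feat, tau_params, W_what, W_where):
    """gv_n = MLP_what(CNN_what(gp_n)) (.) MLP_where(tau_n), element-wise."""
    what = np.tanh(W_what @ gp_feat)       # 'what' pathway: glimpse content
    where = np.tanh(W_where @ tau_params)  # 'where' pathway: location parameters
    return what * where                    # multiplicative interaction [18]
```

The element-wise product forces the two pathways into a common vector length, which is why the thesis routes the glimpse patch through an extra 1-layer MLP to match the τ vector's dimension.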
Core Recurrent Neural Network: The recurrent neural network is the core architecture of SRAM; it aggregates the information extracted from the stepwise glimpses and calculates the encoded vector z. Iterating for a set number of RNN steps (the total number of glimpses), the core RNN receives glimpse_vector_n at its first layer. The output of the second layer, r^(2)_n, is in turn used by the spatial transformer network's localization module.

$$r^{(1)}_n = R^1_{recur}(glimpse\_vector_n, r^{(1)}_{n-1}) \quad (3.8)$$
$$r^{(2)}_n = R^2_{recur}(r^{(1)}_n, r^{(2)}_{n-1}) \quad (3.9)$$
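The two-layer recursion of Equations (3.8)-(3.9) can be sketched with simple tanh cells standing in for the LSTM units used in Chapter 4 (cell form and sizes are illustrative):

```python
import numpy as np

def tanh_cell(W_in, W_rec):
    """Minimal recurrent cell: new state = tanh(W_in @ input + W_rec @ state)."""
    return lambda x, h: np.tanh(W_in @ x + W_rec @ h)

def encode(glimpse_vectors, cell1, cell2, r1, r2):
    """Run Eqs. (3.8)-(3.9) over all glimpse steps.
    r2 also feeds the STN localizer at each step; r1 ends as the encoding z."""
    for gv in glimpse_vectors:
        r1 = cell1(gv, r1)  # Eq. (3.8): first layer consumes the glimpse vector
        r2 = cell2(r1, r2)  # Eq. (3.9): second layer consumes the first layer
    return r1, r2
```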
3.2.2 Decoder
Classification Network: As in the general image classification approach, the encoded z is passed through an MLP, and the last layer outputs the probability of each class.
3.2.3 Training
The loss function of SRAM can be divided into two parts: a saliency-related loss L_sal and a classification-related loss L_cls. L_sal constrains the glimpse patches to be consistent with the image saliency obtained from the previously trained saliency detector. For the classification task, it is favorable if the glimpse patches cover the salient parts as much as possible. The equations of L_sal and L_cls are as follows:

$$L_{sal} = L_{sal}(S, STN, \tau) = -\mathrm{mean}\Big(\sum_n STN(S, \tau_n)\Big) \quad (3.10)$$

where S is the visual saliency obtained when the input image passes through the saliency detector, STN is the trained spatial transformer network in Equation 3.5, and τ_n is obtained from Equation 3.4 at each step of the core RNN.
L_cls is a cross-entropy loss, as in the general classification approach.

$$L_{cls} = L_{cls}(Y, Y^*) = \mathrm{cross\_entropy}(Y, Y^*) \quad (3.11)$$
$$= -\mathrm{mean}\Big(\sum_c y_c \ln \sigma(y_c^*)\Big) \quad (3.12)$$

where σ(·) is the sigmoid function, c indexes the classes, and Y and Y* are the predicted class label vector and the ground-truth class label vector, respectively.
Then total loss Ltot of SRAM for image classification becomes:
Ltot = Lcls + α ∗ Lsal (3.13)
= Lcls(Y, Y∗) + α ∗ Lsal(S, ST, τs) (3.14)
18
where α is balancing parameter between Lcls and Lsal.
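Putting Equations (3.10)-(3.14) together numerically: in the sketch below, `glimpse_saliency` is a stand-in for the per-step salient mass STN(S, τ_n) covered by the glimpses (the actual STN crops are not reproduced here), and the loss names mirror the equations above:

```python
import numpy as np

def total_loss(y_true, y_score, glimpse_saliency, alpha=0.1):
    """L_tot = L_cls + alpha * L_sal (Eqs. 3.13-3.14)."""
    sigma = 1.0 / (1.0 + np.exp(-y_score))
    l_cls = -np.mean(y_true * np.log(sigma))  # Eq. (3.12), sigmoid cross-entropy
    l_sal = -np.mean(glimpse_saliency)        # Eq. (3.10): reward covered saliency
    return l_cls + alpha * l_sal
```

Because L_sal enters with a negative sign, minimizing L_tot pushes the glimpses toward salient regions while the cross-entropy term drives classification.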
Chapter 4
Implementation Details
4.1 Saliency Detector
The encoder of the saliency detector is composed of a 6-layer CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 2 x 2 stride repeatedly, same zero padding), while the decoder part is composed of a 6-layer T-CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 1/2 x 1/2 stride repeatedly, same zero padding). The output of the T-CNN then passes through two more transpose-convolutional layers in order to match the shape of the output to grayscale format.
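Under "same" zero padding, each stride-2 layer halves the spatial size (out = ceil(in / stride)), so the encoder's 3 pairs of (stride 1, stride 2) layers shrink a 32 x 32 input as follows:

```python
import math

def feature_sizes(in_size, strides):
    """Spatial size after each 'same'-padded conv layer: out = ceil(in / stride)."""
    sizes = [in_size]
    for s in strides:
        sizes.append(math.ceil(sizes[-1] / s))
    return sizes

# encoder: 3 pairs of (stride 1, stride 2) 3x3 convolutions
print(feature_sizes(32, [1, 2, 1, 2, 1, 2]))  # [32, 32, 16, 16, 8, 8, 4]
```

The decoder's fractional (1/2) strides walk the same ladder in reverse, from 4 back up to 32, which is why the feature map never collapses to 1 x 1 as noted in Section 3.1.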
All CNN and T-CNN layers in the saliency detector, except the last layers of the encoder and decoder, include batch normalization [11] and ELU activation [5]. We used the Adam optimizer [14] for training, with a learning rate starting at 1e-3 and decaying exponentially with respect to training epochs. We trained the saliency detector for 200 epochs.
4.2 SRAM
As mentioned earlier, the encoder consists of 4 subnetworks: the context network, spatial transformer network, glimpse network, and core RNN. The image and saliency map are downsampled 4 times and used as the input of the context network. Each passes through a 6-layer CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 2 x 2 stride repeatedly, same zero padding), and outputs representation vectors. These vectors are concatenated, pass through 2 fully connected layers, and output the initial state for the second layer of the core RNN. The localization part of the spatial transformer network is constructed as a CNN followed by an MLP for regression. For the RGBS image input, a 3-layer CNN (5 x 5 kernel size, 2 x 2 stride, same zero padding) is applied; a 2-layer MLP is applied to the second core RNN output. The output vectors of the CNN and MLP are concatenated and pass through a 2-layer MLP to produce s, t_x, t_y, the parameters of τ. The glimpse network receives the glimpse patch and the τ obtained above as inputs. A 1-layer MLP is applied to τ, while the glimpse patch passes through a 3-layer CNN followed by a 1-layer MLP to match the vector length with the τ vector. The glimpse vector is obtained by element-wise multiplication of the two output vectors. The core RNN is composed of 2 layers of Long Short-Term Memory (LSTM) units [10], chosen for their ability to learn long-range dependencies and their stable learning dynamics. Finally, the classification network, the decoder part, is made up of a 3-layer MLP.
As with the saliency detector, all CNNs and MLPs in the SRAM subnetworks, except the last layers of the encoder and decoder, include batch normalization and ELU activation. We used the Adam optimizer with a learning rate starting at 1e-4 and decaying exponentially with respect to training epochs. During training, gradients are hard-clipped to the range [-5, 5] to mitigate the gradient explosion problem that commonly occurs when training RNNs. We set the hyper-parameter α to 0.1, the balancing parameter between L_cls and L_sal. We trained SRAM for 200 epochs.
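The two training safeguards mentioned above, hard gradient clipping to [-5, 5] and exponential learning-rate decay, look like this in a framework-agnostic sketch (the decay rate 0.96 is our illustrative choice; the thesis does not state one):

```python
import numpy as np

def clip_gradients(grads, lo=-5.0, hi=5.0):
    """Hard-clip every gradient element to [lo, hi] to tame RNN gradient explosion."""
    return [np.clip(g, lo, hi) for g in grads]

def lr_at_epoch(lr0=1e-4, epoch=0, decay=0.96):
    """Exponential decay of the Adam learning rate with respect to training epochs."""
    return lr0 * decay ** epoch
```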
Chapter 5
Experiments and Results
5.1 Saliency Detector
We trained the above-described saliency detector on the MSRA10K [4] dataset, which is by far the largest publicly available dataset for saliency detection; it contains 10,000 annotated saliency images. The dataset is divided into a 70% training set and a 30% validation set. Quantitatively, we achieved MAE values of 0.0311 and 0.0119 on the validation set for image sizes 32 and 224, respectively.

Besides the quantitative analysis, we also carried out qualitative evaluations of the saliency detector on separate datasets: cifar10 and Imagenet [6]. Saliency maps extracted on the cifar10 dataset are shown in Figure 5.1; the detector quite reasonably separates the cat, airplane, deer, and bird regions from their backgrounds. We also found that the saliency detector extracts explanatory salient regions on more complex images. Figure 5.2 shows the obtained image saliency on the Imagenet dataset.
Figure 5.1: Visual saliency of cifar10 dataset obtained with saliency detector
Figure 5.2: Visual saliency of Imagenet dataset obtained with saliency detector
5.2 Performance on Cifar10 dataset
In order to ensure the performance of SRAM, we evaluated it on the cifar10 dataset. The cifar10 dataset consists of 60,000 32 x 32 RGB images in 10 classes, with 6,000 images per class. The dataset is divided into 50,000 training images and 10,000 test images. Example images of the cifar10 dataset are shown in Figure 5.3.
Figure 5.3: Examples of Cifar10 dataset
Since SRAM mainly exploits multiple 3 x 3 CNNs, we selected the VGG19 [32] network as the baseline. For a fair comparison, we reimplemented VGG19 in a fully convolutional way [20], removing max pooling and replacing it with strided convolutions. We compared the error rate on the cifar10 test set while changing the number of glimpses in RAM and SRAM. The comparison results are arranged in Table 5.1. While the RAM structure with 10 or more glimpses achieves a lower error rate than the VGG19 network, SRAM outperforms RAM, reducing the error rate by 0.607% up to 2.209%. We also found that SRAM using 9 glimpses performs better than RAM with 12 glimpses, showing that it is a powerful extension of RAM. Moreover, we note that performance improves as the glimpse number increases up to 12, with little effect beyond 12. All models, VGG19 (reimplemented), RAM, and SRAM, were trained on a GTX 1080ti.

Table 5.1: Performance comparison on Cifar10 dataset

Model                  | # of glimpses | scale of glimpses | Error (%)
VGG19 (reimplemented)  |      -        |        -          |  23.783
RAM                    |      9        |      8 x 8        |  24.562
RAM                    |     10        |      8 x 8        |  23.828
RAM                    |     11        |      8 x 8        |  23.445
RAM                    |     12        |      8 x 8        |  23.274
SRAM                   |      9        |      8 x 8        |  23.221
SRAM                   |     10        |      8 x 8        |  23.235
SRAM                   |     11        |      8 x 8        |  22.794
SRAM                   |     12        |      8 x 8        |  21.065
5.3 Computation Cost
In Table 5.2, we compare the computation cost of our proposed saliency aided recurrent attention model (SRAM) with that of RAM and VGG19 in terms of floating-point operations and number of parameters. RAM (12 glimpses, 8 x 8 glimpse size) is almost 5 times more efficient in computation and almost 18 times more efficient in memory usage than VGG19. By increasing the operation and parameter counts by only a small proportion (3.05 million floating-point operations and 0.23 million parameters), SRAM achieves better performance than RAM. Note that the values inside parentheses are those of the saliency detector, which accounts for only 2.48 million floating-point operations and 0.22 million parameters, a small fraction compared to VGG19.
Table 5.2: Computation cost of SRAM vs RAM vs VGG
Model                  | floating-point ops (millions) | params (millions)
VGG19 (reimplemented)  |            168.23             |      33.66
RAM, 12 glimpses, 8x8  |             36.26             |       1.91
SRAM, 12 glimpses, 8x8 |         39.31 (2.48)          |   2.14 (0.22)
Chapter 6
Discussion
In our experiments, the proposed saliency aided recurrent attention model (SRAM) outperforms the recurrent attention model (RAM) with the help of saliency information. As expected, we found that the saliency map acts as an advisor for the classification task by narrowing down the scope of examination.

There are clearly improved parts, but also some unfortunate ones. First of all, we expected SRAM to converge faster than RAM with the help of the saliency advisor; however, there was no meaningful difference. By examining the saliency maps obtained from the saliency detector, we could find possible causes. The trained saliency detector was not powerful enough to cope with all given input images; sometimes it was unable to play the role of advisor and instead confused the RAM structure during classification. Figure 6.1 shows some failure cases of saliency detection with the above saliency detector.
In the first image of Figure 6.1 (the bridge), only the right part of the bridge
is detected as visually salient. From the saliency information alone, it seems
easier to classify the object as a fish instead, a confusing case caused by the
incomplete advisor. In the second picture of Figure 6.1, the saliency detector
fails to capture the lizard due to its camouflage.

Figure 6.1: Failure cases of saliency detection on the Imagenet dataset

In the fourth image of Figure 6.1, a snake-skin pattern, the saliency advisor
is not that helpful, since the image requires judgement from an overall view.
It would likely be difficult for SRAM to decide with limited image information
biased by the given image saliency. The saliency detector is also vulnerable when
the image contains multiple instances of the same object, failing to extract a
proper saliency map, as shown in Figure 6.1. Even though such mistakes were not
frequent, these cases hinder better performance and faster convergence of SRAM.
Since SRAM is independent of the saliency detector, exploiting a more powerful
saliency detector, or training the saliency detector on a task-specific dataset,
would mitigate these problems.
Chapter 7
Conclusion
In this paper we proposed a novel computer vision architecture for classification
that exploits a recurrent attention model with a saliency map. First, we built an
intuitive saliency detector based on convolutional and transposed convolutional
neural networks. The saliency map extracted from the saliency detector is then
combined as an additional channel, reshaping the input image format to RGBS
(RGB + Saliency). The saliency aided recurrent attention model (SRAM), our
proposed model, receives the RGBS image as input and processes it while deciding
what and where to see, for efficient problem solving with limited resources.
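The RGBS construction described above amounts to appending the saliency map as a fourth image channel. A minimal sketch with NumPy, assuming HWC-layout arrays at CIFAR-10 size (32x32); the random data here merely stands in for a real image and a real detector output:

```python
import numpy as np

# Illustrative inputs: a 32x32 RGB image and a single-channel saliency
# map produced by the saliency detector (values in [0, 1]).
rgb = np.random.rand(32, 32, 3).astype(np.float32)
saliency = np.random.rand(32, 32, 1).astype(np.float32)

# Append the saliency map as a fourth channel: RGB -> RGBS.
rgbs = np.concatenate([rgb, saliency], axis=-1)
print(rgbs.shape)  # (32, 32, 4)
```

The classifier then simply treats the input as a 4-channel image, so only the first convolutional layer of the downstream network needs its input-channel count changed.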
A key contribution of this work is that, by incorporating visual saliency
information, the recurrent attention model can now choose what, where, and
moreover 'how' to process the image. By constraining the attention model to focus
on the given visual saliency, the obtained image saliency works as an advisor for
the classification task. The model outperforms the baseline recurrent attention
model on the cifar10 dataset while using fewer parameters and less computation,
demonstrating a successful extension of the RAM model. We expect that by adopting
a more complex architecture for saliency detection, or by training the saliency
detector with a task-specific dataset, even higher performance can be achieved.
The idea of SRAM is applicable to other computer vision tasks where a saliency
map can play the role of an advisor, such as object recognition, visual tracking,
and segmentation. In future work, we will try to extend SRAM to exploit other side
information instead of, or in addition to, visual saliency. We expect that by
using one or more kinds of side information, such as depth information or region
proposal information, in the format of additional channels, we could solve more
challenging computer vision tasks.
Abstract (Korean)

To overcome the scalability problem, a limitation of convolutional neural
networks, the Recurrent Attention Model (RAM) sequentially decides what and where
to look in an image. If the RAM structure can be given appropriate guidance on
'how' to look at the image, more efficient problem solving becomes possible. From
this perspective, this study proposes the Saliency aided Recurrent Attention
Model (SRAM), which uses image saliency information to guide how to view an image
for solving the image classification problem. First, we constructed a saliency
detector with an intuitive structure that can extract image saliency information,
and by adding the image saliency as a channel of the input image, we let this
information serve as a milestone for how to view the image. SRAM consists of an
encoder that decides how to sequentially process the image and a decoder for
classification. In a performance evaluation using the Cifar10 dataset, we
confirmed that, compared to the existing RAM, SRAM reduces the error rate by up
to 2.209% at only a very small cost in computation and memory usage.

Keywords: image saliency, image classification, Recurrent attention model
Student ID: 2016-26799