Master's Thesis in Engineering

SRAM: Saliency aided Recurrent Attention
Model for Image Classification
(A Recurrent Attention Model Utilizing Image Saliency for Image Classification)

August 2018

Graduate School, Seoul National University
Department of Industrial Engineering
Minki Chung
SRAM: Saliency aided Recurrent Attention
Model for Image Classification
(A Recurrent Attention Model Utilizing Image Saliency for Image Classification)

Advisor: Professor Sungzoon Cho

This thesis is submitted as a Master's thesis in Engineering

August 2018

Graduate School, Seoul National University
Department of Industrial Engineering
Minki Chung

Approving the Master's thesis of Minki Chung

June 2018

Chair: Jonghun Park (Seal)
Vice Chair: Sungzoon Cho (Seal)
Member: Jaewook Lee (Seal)
Abstract
SRAM: Saliency aided Recurrent Attention Model for Image Classification
Minki Chung
Department of Industrial Engineering
The Graduate School
Seoul National University

To overcome the poor scalability of convolutional neural networks, the recurrent attention model (RAM) selectively chooses what and where to look in an image. By directing the recurrent attention model how to look at the image, RAM can be even more successful, since the given clue narrows down the scope of the possible focus zone. From this perspective, this work proposes a novel model, the saliency aided recurrent attention model (SRAM), which adopts visual saliency as additional information for image classification. First, a saliency detector with an intuitive structure for extracting image saliency information was constructed. By combining the saliency map as an additional image channel, visual saliency serves as a milestone for viewing the image. SRAM follows an encoder-decoder framework: the encoder utilizes a recurrent attention model with spatial transformer networks, and the decoder performs classification. To validate the performance, we evaluate SRAM on the cifar10 dataset, showing up to a 2.209% decrease in error rate compared to the RAM model.
Keywords: Visual saliency, Image classification, Saliency based classification, Recurrent attention model
Student Number: 2016-26799
Contents

Abstract i
Contents iv
List of Tables v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Recurrent Attention Model (Mnih et al. 2014) 4
2.2 Spatial Transformer Network (Jaderberg et al. 2015) 6
2.3 Recurrent Attentional Networks for Saliency Detection (Kuen et al. 2016) 7
2.4 Learning Deconvolution Network for Semantic Segmentation (Noh et al. 2015) 8
2.5 Visual Saliency based Classification 9
Chapter 3 Proposed Method 11
3.1 Saliency Detector 11
3.2 SRAM 13
3.2.1 Encoder 14
3.2.2 Decoder 17
3.2.3 Training 18
Chapter 4 Implementation Details 20
4.1 Saliency Detector 20
4.2 SRAM 20
Chapter 5 Experiments and Results 22
5.1 Saliency Detector 22
5.2 Performance on Cifar10 dataset 24
5.3 Computation Cost 26
Chapter 6 Discussion 27
Chapter 7 Conclusion 29
Bibliography 31
국문초록 (Abstract in Korean) 36
List of Tables

Table 5.1 Performance comparison on Cifar10 dataset 25
Table 5.2 Computation cost of SRAM vs RAM vs VGG 26
List of Figures

Figure 1.1 An example of image saliency 3
Figure 2.1 Architecture of Recurrent Attention Model (RAM) (Mnih et al. 2014) 5
Figure 2.2 Architecture of Spatial Transformer Network (STN) (Jaderberg et al. 2015) 6
Figure 2.3 Process of spatial transformer (Jaderberg et al. 2015) 6
Figure 2.4 Architecture of Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) (Kuen et al. 2016) 7
Figure 2.5 Architecture of convolution, transpose-convolution network for segmentation (Noh et al. 2015) 8
Figure 2.6 Representation of image with saliency maps (Sharma et al. 2012) 9
Figure 2.7 Architecture of the SalClassNet – saliency detection guided by a visual classification task (Murabito et al. 2017) 10
Figure 3.1 Architecture of saliency detector 11
Figure 3.2 Overall architecture of SRAM 13
Figure 3.3 Architecture of SRAM encoder 14
Figure 3.4 Architecture of STN 16
Figure 5.1 Visual saliency of cifar10 dataset obtained with saliency detector 23
Figure 5.2 Visual saliency of Imagenet dataset obtained with saliency detector 23
Figure 5.3 Examples of Cifar10 dataset 24
Figure 6.1 Failure cases of saliency detection of Imagenet dataset 28
Chapter 1
Introduction
Convolutional neural networks (CNN) [19] have brought huge success in various computer vision tasks, from simple tasks such as image classification [16] or object detection [29] to complex ones such as motion tracking [34] or image inpainting [27]. CNNs, however, are not perfect. One limitation of CNNs is their poor scalability with increasing input image size in terms of computational efficiency. With limited time and resources, it is necessary to select where, what, and how to look at the image. Facing a bird-specific fine-grained classification task, for example, it would be meaningless to pay attention to non-bird image parts such as trees or sky. Rather, one should focus on regions which play decisive roles in classification, such as the beak or wings. If a machine can learn how to pay attention to those regions, it will achieve better performance with lower energy usage.
In this context, the Recurrent Attention Model (RAM) [22] introduced a visual attention method that can be applied to numerous computer vision tasks. By sequentially choosing where and what to look at, RAM achieved better performance with less computation and lower memory usage. Moreover, its attention mechanism addressed the poor interpretability of deep learning models by proposing glimpse patches, which show the regions RAM attends to. Still, there is room for RAM to improve. In addition to where and what to look at, if one can give clues on how to look, i.e., a task-specific hint, learning could be more intuitive and efficient. From this perspective, we propose a novel architecture, the Saliency aided Recurrent Attention Model (SRAM), which adopts image saliency as a problem-oriented clue for RAM. By applying visual saliency extracted with an existing saliency detector, the RAM structure can achieve faster convergence and better performance.
Visual saliency is the distinct object or region in an image which stands out from its neighbors and grabs attention. As shown in Figure 1.1, the bird becomes the salient part of the image. Due to its broad applications, visual saliency has received increasing focus in recent years; simple examples of its applications are object recognition [30], image retrieval [3], and visual tracking [2]. Since the salient region of an image is the target of interest in most cases, we expect the classification task to benefit from this additional information, which advises where to attend preferentially. This shares the concept of visual attention with [12, 22, 35, 13, 7, 8]. Considering how the human visual perception system works, this approach makes sense: visual information is gated before being processed by the brain's cognitive system, depending on the distinct visual characteristics of the images. Even though image saliency is implicit information which the image itself already contains, we believe there is a difference in turning implicit information into explicit information.

As preliminary work for SRAM, we built a saliency detector based on the architecture of [26]. By training on the benchmark saliency dataset MSRA10K [4], we were able to extract image saliencies which are sufficiently descriptive clues for image classification. We then evaluated our model on the cifar10 dataset [15], showing better performance with only a fractional increase in computation and memory usage compared to RAM.

Figure 1.1: An example of image saliency
In summary, this paper makes the following contributions:

(a) Constructed a saliency detector network which can extract sufficiently explanatory visual saliency helpful for image classification.

(b) Proposed a novel model, the saliency aided recurrent attention model (SRAM), which incorporates visual saliency into RAM for intensive problem solving.

(c) Evaluated on the cifar10 dataset, showing a successful extension of RAM.

The rest of the paper is structured as follows. In Chapter 2, we review the literature related to the problem. In Chapter 3, we propose the visual saliency detector network and the saliency aided recurrent attention model (SRAM). In Chapter 4, we explain the details of the above-mentioned structures. In Chapter 5, we show experiments and results with the proposed method. In Chapter 6, we discuss the experimental results. Finally, in Chapter 7, we give concluding remarks and possible future research directions.
Chapter 2
Related Work
2.1 Recurrent Attention Model (Mnih et al. 2014)
RAM [22] first proposed a recurrent neural network (RNN) [21] based attention model. Subsequent studies have been published steadily to the present day, given its novel attention mechanism. The idea of RAM is inspired by the human visual system: when encountering an image too large to be processed at a glance, a human processes the image part by part, sequentially, depending on one's interest. By selectively choosing what and where to look, RAM showed higher performance while reducing computation and memory usage.

The overall architecture of RAM is shown in Figure 2.1. RAM includes two components: the glimpse sensor (A in Figure 2.1), which extracts a retina-like representation containing patches at multiple resolutions, and the glimpse network (B in Figure 2.1), which uses the glimpse sensor to produce a glimpse representation. RAM is an RNN-based model which takes the glimpse representation as input, combines it with the internal representation from the previous time step, and produces a new internal state. The location network and action network (bottom parts of C in Figure 2.1) use the new internal state to decide where to look at the next time step.
However, since RAM attends to image regions using a sampling method, it has the fatal weakness of relying on REINFORCE [33] for optimization, which makes it difficult to train compared to models optimized by back-propagation. Following RAM, the Deep Recurrent Attention Model (DRAM) [1] presented an advanced architecture for multiple object recognition, and the Deep Recurrent Attentive Writer (DRAW) [9] introduced a sequential image generation method without REINFORCE. The Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) [17] replaces the locating module with a spatial transformer network [12] and exploits recurrent attention models for visual saliency refinement (more details in Section 2.3).
Figure 2.1: Architecture of Recurrent Attention Model (RAM) (Mnih et al. 2014)
2.2 Spatial Transformer Network (Jaderberg et al. 2015)
Spatial transformer network (STN) [12] first proposed a parametric spatial attention module for image classification and object detection. It tackles another limitation of CNN structures: invariance to rotations and translations is only manifested in deep layers. As shown in Figure 2.2, the spatial transformer consists of three components: a localisation net, a grid generator, and a sampler. The localization network takes a feature map as input and outputs θ, the parameters for the transformation. With θ, the grid generator produces a new coordinate system for the transformed image. Finally, the sampler uses the new coordinates to transform the input image into a favorable formation. Figure 2.3 shows the overall process of the spatial transformer.
Figure 2.2: Architecture of Spatial Transformer Network (STN) (Jaderberg et al. 2015)
Figure 2.3: Process of spatial transformer (Jaderberg et al. 2015)
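To make the three-component pipeline concrete, the sketch below implements the grid generator and the bilinear sampler in plain NumPy for the isotropic scale-and-translation case; the localisation net, which would regress s, tx, ty from the feature map, is omitted, and the function names (`affine_grid`, `bilinear_sample`) are ours, not from [12]:

```python
import numpy as np

def affine_grid(s, tx, ty, H, W):
    """Grid generator: map each normalized output coordinate through
    tau = [[s, 0, tx], [0, s, ty]] to get its source-image coordinate."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    return s * xs + tx, s * ys + ty  # source coordinates, still in [-1, 1]

def bilinear_sample(img, src_x, src_y):
    """Sampler: bilinearly interpolate a single-channel image at the source grid."""
    H, W = img.shape
    fx = (src_x + 1) * (W - 1) / 2   # normalized -> pixel coordinates
    fy = (src_y + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(fx).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(fy).astype(int), 0, H - 2)
    wx, wy = fx - x0, fy - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy + img[y0 + 1, x0 + 1] * wx * wy)
```

With s = 1 and zero translation the grid is the identity and the sampler reproduces the input; with s < 1 it produces a zoomed-in crop, which is exactly how SRAM later extracts glimpse patches.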
2.3 Recurrent Attentional Networks for Saliency Detection (Kuen et al. 2016)
The Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) [17] combines the strengths of RAM and STN for the saliency detection task. While following a RAM-based structure, RACDNN replaces the locating module of RAM with an STN, enabling back-propagation. RACDNN sequentially selects where to attend on the image and refines the image saliency of the selected region by applying transpose-convolutional layers. Starting from a basic saliency map generated by convolutional layers, RACDNN reinforces the image saliency through RNN steps. The overall process of RACDNN is shown in Figure 2.4.
Figure 2.4: Architecture of Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN) (Kuen et al. 2016)
2.4 Learning Deconvolution Network for Semantic Segmentation (Noh et al. 2015)
By using convolutional and transpose-convolutional networks in the form of an encoder-decoder, [26] successfully extracts object segmentations from an image. On top of a CNN based on VGG16 [32], multiple transpose-convolution layers generate an accurate segmentation map of an input proposal. As shown in Figure 2.5, given a feature representation encoded by the CNN, a dense pixelwise class prediction map is constructed through multiple series of unpooling, transpose-convolution, and rectification operations [24]. Since the proposed architecture is intuitive yet powerful for segmentation, we constructed the saliency detector for SRAM based on this study. By replacing the last layer, changing the class predictor into a binary saliency detector, along with other small modifications, we could obtain reasonable saliency that plays the role of an advisor for classification. More details on the saliency detector for SRAM will be discussed in Section 3.1.
Figure 2.5: Architecture of convolution, transpose-convolution network for segmentation (Noh et al. 2015)
2.5 Visual Saliency based Classification
There have been similar attempts in the past to use visual saliency for image classification. [31] used saliency maps to weight the corresponding visual features by partitioning the image; the image representation of that paper is shown in Figure 2.6. Without an appropriate model for acquiring visual saliency, however, this work depends on relatively low-quality saliency maps. Another obvious limitation is that images and saliency maps are fragmented before multiplication, inevitably resulting in loss of information at the junctions of segments.
Figure 2.6: Representation of image with saliency maps (Sharma et al. 2012)
Unlike approaches that use saliency maps for classification, [23] detects salient objects guided by visual classification. They showed improved classification performance by exploiting classification-task-based saliency maps. The proposed method, SalClassNet, works as follows: first, a CNN saliency detector obtains a top-down saliency map through multiple convolutional layers. This saliency map is combined as an additional image channel, constructing an RGBS (RGB+Saliency) image. The RGBS image then passes through a CNN classifier to predict the target class. The overall architecture is shown in Figure 2.7.

The whole process is trained end-to-end, which means the training process requires ground truth visual saliency. In practice, it is difficult to expect visual saliency ground truth to exist, especially when solving a classification problem. Besides, the task-based saliency maps from this work are far from meaningful visual saliency maps. Instead, SRAM exploits existing visual saliency detection models to extract image saliency that can be applied to image classification. One methodology we adopt from this work is how they explicitly combine visual saliency into the input image: SRAM also incorporates image saliency as an additional channel of the input image, reshaping it into RGBS (RGB+Saliency) format.
Figure 2.7: Architecture of the SalClassNet – saliency detection guided by a visual classification task (Murabito et al. 2017)
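The RGBS idea borrowed here amounts to simple channel concatenation. A minimal sketch (the H x W x C array layout and function name are our own assumptions):

```python
import numpy as np

def to_rgbs(rgb, saliency):
    """Stack a saliency map (H x W, values in [0, 1]) onto an RGB image
    (H x W x 3) as a fourth channel, yielding an RGBS (H x W x 4) array."""
    assert rgb.shape[:2] == saliency.shape, "saliency must match image size"
    return np.concatenate([rgb, saliency[..., None]], axis=-1)
```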
Chapter 3
Proposed Method
3.1 Saliency Detector
Inspired by the success of [26], we constructed a saliency detector composed of an encoder using CNNs and a decoder using transpose-CNNs (T-CNNs). The CNNs downsize feature maps over multiple convolutional layers, constructing spatially compact image representations, while the T-CNNs upsize feature maps and learn increasingly localized representations. With such a CNN/T-CNN framework, we can obtain a proper image representation for image saliency detection. Unlike the work of [26], we stop applying CNN layers before the feature map shrinks to a spatial size of 1x1; through experiments we found that extreme embedding with CNNs results in a loss of spatial information for saliency detection. The architecture of the CNN/T-CNN based saliency detector is shown in Figure 3.1.
Figure 3.1: Architecture of saliency detector
The T-CNNs in the decoder part are almost identical to the conventional CNNs in the encoder part, except for a few minor differences. In T-CNNs, convolution operations are carried out in such a way that the resulting feature maps have larger spatial sizes than those of the input feature maps; this is done by applying fractional strides with appropriate zero padding. In the processing pipeline of the CNN/T-CNN for saliency detection, the CNN first transforms the input image x to a spatially compact hidden representation z, as z = CNN(x). Then z is transformed to a visual saliency map S through the T-CNN, as S = T-CNN(z). We apply a sigmoid activation function to ensure the saliency map takes values in the range [0, 1]. The loss function of the CNN/T-CNN for saliency detection is the L1 loss between the ground-truth saliency map G and S. The resulting network is a saliency detector that can be trained in an end-to-end fashion. The saliency map obtained from the saliency detector is combined with the input image as an additional channel.
In equations, Ls, the L1 loss (Mean Absolute Error, MAE) for saliency detection, is as follows:

$$L_s = \mathrm{mean}\Big(\sum_{w}\sum_{h} |G_{wh} - S_{wh}|\Big) \quad (3.1)$$

where G_wh is the ground-truth binary saliency value at pixel (w, h) and S_wh is the predicted saliency value at pixel (w, h), in the range [0, 1].
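Equation (3.1), together with the sigmoid that constrains predictions to [0, 1], can be sketched as follows (`logits` standing in for the raw output of the last T-CNN layer; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency_l1_loss(logits, G):
    """L1 (MAE) loss of Eq. (3.1): mean over pixels of |G_wh - S_wh|,
    where S = sigmoid(logits) is the predicted saliency map in [0, 1]."""
    S = sigmoid(logits)
    return float(np.mean(np.abs(G - S)))
```

A detector that outputs large positive logits on truly salient pixels drives this loss toward zero, which is the end-to-end training signal described above.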
3.2 SRAM
The visual saliency obtained in Section 3.1 is combined as a saliency channel, reshaping the input image to RGBS (RGB+Saliency) format. The architecture of SRAM is based on an encoder-decoder structure. The encoder is similar to that of RACDNN [17]: a recurrent attention model with a spatial transformer at each step for locating the next glimpse. The encoded value, the output of the first-layer RNN cell at the last step, passes through the decoder network. The decoder consists of multiple fully connected layers (MLP), with the last layer responsible for classification. Figure 3.2 shows the overall architecture of SRAM.
Figure 3.2: Overall architecture of SRAM
3.2.1 Encoder
Encoder is composed of 4 subnetworks: context network, spatial transformer net-
work, glimpse network and core recurrent neural network. The overall architecture
of encoder is shown in Figure 3.3. Considering the flow of information, we will go
into details of each network one by one.
Figure 3.3: Architecture of SRAM encoder
Context Network: The context network is the first part of the encoder. It receives the image with its saliency map (RGBS) as input and outputs the initial state tuple of r^(2)_0, the first input of the second layer of the core recurrent neural network, as shown in Figure 3.3. Using the downsampled RGBS image (img(RGBS)_coarse), the context network provides a reasonable starting point for choosing the image region to concentrate on. Downsampled images are processed with a CNN followed by an MLP.
$$c_0 = MLP_c(CNN_{context}(img(RGBS)_{coarse})) \quad (3.2)$$
$$h_0 = MLP_h(CNN_{context}(img(RGBS)_{coarse})) \quad (3.3)$$

where (c_0, h_0) is the first state tuple of r^(2)_0.
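A shape-level sketch of Equations (3.2)-(3.3): the coarse image comes from average-pool downsampling, a flattened view stands in for CNN_context, and two linear heads play the roles of MLP_c and MLP_h. All layer sizes here are illustrative, not the thesis configuration:

```python
import numpy as np

def downsample(img, factor=4):
    """Average-pool downsampling to produce img(RGBS)_coarse."""
    H, W, C = img.shape
    return img.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

def context_init(img_rgbs, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    coarse = downsample(img_rgbs)                # shared coarse view of the RGBS input
    feat = coarse.reshape(-1)                    # stand-in for CNN_context features
    W_c = rng.normal(0, 0.01, (dim, feat.size))  # MLP_c head -> c0
    W_h = rng.normal(0, 0.01, (dim, feat.size))  # MLP_h head -> h0
    return W_c @ feat, W_h @ feat                # initial state tuple (c0, h0)
```

The key structural point the sketch preserves is that both halves of the state tuple are computed from the same coarse representation.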
Spatial Transformer Network: The spatial transformer network (STN) selects the region to attend to given the RGBS image [12]. Different from the existing STN, SRAM uses a modified STN which receives the RGBS image and the output of the second layer of the core RNN as inputs and outputs a glimpse patch. From now on, "glimpse patch" indicates the attended image region which is cropped and zoomed in by the STN. Here, the STN is composed of two parts: the localization part, which calculates the transformation matrix τ with a CNN and an MLP, and the transformer part, which zooms in on the image using the obtained transformation matrix τ and produces the glimpse patch. The affine transformation matrix τ with isotropic scaling and translation is given in Equation 3.4.
$$\tau = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \\ 0 & 0 & 1 \end{bmatrix} \quad (3.4)$$

where s, t_x, and t_y are the scaling, horizontal translation, and vertical translation parameters, respectively.
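In code, Equation (3.4) and its action on a normalized homogeneous coordinate are simply:

```python
import numpy as np

def make_tau(s, tx, ty):
    """Affine matrix of Eq. (3.4): isotropic scaling plus translation."""
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty],
                     [0.0, 0.0, 1.0]])

# map the output-grid corner (1, 1) back to its source coordinate
corner = make_tau(0.5, 0.2, -0.1) @ np.array([1.0, 1.0, 1.0])
```

With s < 1 the transformed grid covers a sub-region of the image, which is what makes the glimpse patch a zoomed-in crop.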
The overall process of the STN is shown in Figure 3.4.

Figure 3.4: Architecture of STN

In equations, the process of the STN is as follows:

$$glimpse\_patch_n = STN(img(RGBS), \tau_n) \quad (3.5)$$

where n in glimpse_patch_n is the step of the core RNN, ranging from 1 to the total number of glimpses. The process for extracting τ is as follows:

$$\tau_n = MLP_{loc}(CNN_i(img(RGBS)) \oplus MLP_r(r^{(2)}_n)) \quad (3.6)$$

where ⊕ is the concatenation operation.
Glimpse Network: The glimpse network is a non-linear function which receives the current glimpse patch, glimpse_patch_n (or gp_n), and the attended-region information τ as inputs, and outputs the current step's glimpse vector. The glimpse vector is later used as the input of the first core RNN layer. glimpse_vector_n (or gv_n) is obtained by a multiplicative interaction between the extracted features of glimpse_patch_n and τ, a method of interaction first proposed by [18]. As in the previously mentioned networks, a CNN and an MLP are used for feature representation.

$$gv_n = MLP_{what}(CNN_{what}(gp_n)) \odot MLP_{where}(\tau_n) \quad (3.7)$$

where ⊙ is an element-wise vector multiplication operation.
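A shape-level sketch of Equation (3.7), with single-layer tanh stand-ins for the MLP_what/CNN_what and MLP_where pathways (weights and sizes are illustrative):

```python
import numpy as np

def glimpse_vector(gp_feat, tau_params, W_what, W_where):
    """gv_n = MLP_what(CNN_what(gp_n)) (.) MLP_where(tau_n), element-wise."""
    what = np.tanh(W_what @ gp_feat)       # 'what' pathway: glimpse content
    where = np.tanh(W_where @ tau_params)  # 'where' pathway: location parameters
    return what * where                    # multiplicative interaction [18]
```

The element-wise product forces the two pathways into a common vector length, which is why the thesis routes the glimpse patch through an extra 1-layer MLP to match the τ vector's dimension.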
Core Recurrent Neural Network: The recurrent neural network is the core architecture of SRAM; it aggregates the information extracted from the stepwise glimpses and calculates the encoded vector z. Iterating for a set number of RNN steps (the total number of glimpses), the core RNN receives glimpse_vector_n at its first layer. The output of the second layer, r^(2)_n, is in turn used by the spatial transformer network's localization module.

$$r^{(1)}_n = R^1_{recur}(glimpse\_vector_n, r^{(1)}_{n-1}) \quad (3.8)$$
$$r^{(2)}_n = R^2_{recur}(r^{(1)}_n, r^{(2)}_{n-1}) \quad (3.9)$$
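The two-layer recursion of Equations (3.8)-(3.9) can be sketched with simple tanh cells standing in for the LSTM units used in Chapter 4 (cell form and sizes are illustrative):

```python
import numpy as np

def tanh_cell(W_in, W_rec):
    """Minimal recurrent cell: new state = tanh(W_in @ input + W_rec @ state)."""
    return lambda x, h: np.tanh(W_in @ x + W_rec @ h)

def encode(glimpse_vectors, cell1, cell2, r1, r2):
    """Run Eqs. (3.8)-(3.9) over all glimpse steps.
    r2 also feeds the STN localizer at each step; r1 ends as the encoding z."""
    for gv in glimpse_vectors:
        r1 = cell1(gv, r1)  # Eq. (3.8): first layer consumes the glimpse vector
        r2 = cell2(r1, r2)  # Eq. (3.9): second layer consumes the first layer
    return r1, r2
```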
3.2.2 Decoder
Classification Network: As in the general image classification approach, the encoded z is passed through an MLP, and the last layer outputs the probability of each class.
3.2.3 Training
The loss function of SRAM can be divided into two parts: a saliency-related loss L_sal and a classification-related loss L_cls. L_sal constrains the glimpse patches to be consistent with the image saliency obtained from the previously trained saliency detector. For the classification task, it is favorable if the glimpse patches cover the salient parts as much as possible. The equations of L_sal and L_cls are as follows:

$$L_{sal} = L_{sal}(S, STN, \tau) = -\mathrm{mean}\Big(\sum_n STN(S, \tau_n)\Big) \quad (3.10)$$

where S is the visual saliency obtained when the input image passes through the saliency detector, STN is the trained spatial transformer network in Equation 3.5, and τ_n is obtained from Equation 3.4 at each step of the core RNN.
L_cls is a cross-entropy loss, as in the general classification approach.

$$L_{cls} = L_{cls}(Y, Y^*) = \mathrm{cross\_entropy}(Y, Y^*) \quad (3.11)$$
$$= -\mathrm{mean}\Big(\sum_c y_c \ln \sigma(y_c^*)\Big) \quad (3.12)$$

where σ(·) is the sigmoid function, c indexes the classes, and Y and Y* are the predicted class label vector and the ground-truth class label vector, respectively.
Then total loss Ltot of SRAM for image classification becomes:
Ltot = Lcls + α ∗ Lsal (3.13)
= Lcls(Y, Y∗) + α ∗ Lsal(S, ST, τs) (3.14)
18
where α is balancing parameter between Lcls and Lsal.
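Putting Equations (3.10)-(3.14) together numerically: in the sketch below, `glimpse_saliency` is a stand-in for the per-step salient mass STN(S, τ_n) covered by the glimpses (the actual STN crops are not reproduced here), and the loss names mirror the equations above:

```python
import numpy as np

def total_loss(y_true, y_score, glimpse_saliency, alpha=0.1):
    """L_tot = L_cls + alpha * L_sal (Eqs. 3.13-3.14)."""
    sigma = 1.0 / (1.0 + np.exp(-y_score))
    l_cls = -np.mean(y_true * np.log(sigma))  # Eq. (3.12), sigmoid cross-entropy
    l_sal = -np.mean(glimpse_saliency)        # Eq. (3.10): reward covered saliency
    return l_cls + alpha * l_sal
```

Because L_sal enters with a negative sign, minimizing L_tot pushes the glimpses toward salient regions while the cross-entropy term drives classification.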
Chapter 4
Implementation Details
4.1 Saliency Detector
The encoder of the saliency detector is composed of a 6-layer CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 2 x 2 stride repeatedly, same zero padding), while the decoder part is composed of a 6-layer T-CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 1/2 x 1/2 stride repeatedly, same zero padding). The output of the T-CNN then passes through two more transpose-convolutional layers in order to match the shape of the output to grayscale format.
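Under "same" zero padding, each stride-2 layer halves the spatial size (out = ceil(in / stride)), so the encoder's 3 pairs of (stride 1, stride 2) layers shrink a 32 x 32 input as follows:

```python
import math

def feature_sizes(in_size, strides):
    """Spatial size after each 'same'-padded conv layer: out = ceil(in / stride)."""
    sizes = [in_size]
    for s in strides:
        sizes.append(math.ceil(sizes[-1] / s))
    return sizes

# encoder: 3 pairs of (stride 1, stride 2) 3x3 convolutions
print(feature_sizes(32, [1, 2, 1, 2, 1, 2]))  # [32, 32, 16, 16, 8, 8, 4]
```

The decoder's fractional (1/2) strides walk the same ladder in reverse, from 4 back up to 32, which is why the feature map never collapses to 1 x 1 as noted in Section 3.1.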
All CNN and T-CNN layers in the saliency detector, except the last layers of the encoder and decoder, include batch normalization [11] and ELU activation [5]. We used the Adam optimizer [14] for training, with a learning rate starting at 1e-3 and decaying exponentially with respect to training epochs. We trained the saliency detector for 200 epochs.
4.2 SRAM
As mentioned earlier, the encoder consists of 4 subnetworks: the context network, spatial transformer network, glimpse network, and core RNN. The image and saliency map are downsampled 4 times and used as the input of the context network. Each passes through a 6-layer CNN, 3 pairs of 2 layers (3 x 3 kernel size, 1 x 1 stride first then 2 x 2 stride repeatedly, same zero padding), and outputs representation vectors. These vectors are concatenated, pass through 2 fully connected layers, and output the initial state for the second layer of the core RNN. The localization part of the spatial transformer network is constructed as a CNN followed by an MLP for regression. For the RGBS image input, a 3-layer CNN (5 x 5 kernel size, 2 x 2 stride, same zero padding) is applied; a 2-layer MLP is applied to the second core RNN output. The output vectors of the CNN and MLP are concatenated and pass through a 2-layer MLP to produce s, t_x, t_y, the parameters of τ. The glimpse network receives the glimpse patch and the τ obtained above as inputs. A 1-layer MLP is applied to τ, while the glimpse patch passes through a 3-layer CNN followed by a 1-layer MLP to match the vector length with the τ vector. The glimpse vector is obtained by element-wise multiplication of the two output vectors. The core RNN is composed of 2 layers of Long Short-Term Memory (LSTM) units [10], chosen for their ability to learn long-range dependencies and their stable learning dynamics. Finally, the classification network, the decoder part, is made up of a 3-layer MLP.
As with the saliency detector, all CNNs and MLPs in the SRAM subnetworks, except the last layers of the encoder and decoder, include batch normalization and ELU activation. We used the Adam optimizer with a learning rate starting at 1e-4 and decaying exponentially with respect to training epochs. During training, gradients are hard-clipped to the range [-5, 5] to mitigate the gradient explosion problem that commonly occurs when training RNNs. We set the hyper-parameter α to 0.1, the balancing parameter between L_cls and L_sal. We trained SRAM for 200 epochs.
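The two training safeguards mentioned above, hard gradient clipping to [-5, 5] and exponential learning-rate decay, look like this in a framework-agnostic sketch (the decay rate 0.96 is our illustrative choice; the thesis does not state one):

```python
import numpy as np

def clip_gradients(grads, lo=-5.0, hi=5.0):
    """Hard-clip every gradient element to [lo, hi] to tame RNN gradient explosion."""
    return [np.clip(g, lo, hi) for g in grads]

def lr_at_epoch(lr0=1e-4, epoch=0, decay=0.96):
    """Exponential decay of the Adam learning rate with respect to training epochs."""
    return lr0 * decay ** epoch
```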
Chapter 5
Experiments and Results
5.1 Saliency Detector
We trained the above-described saliency detector on the MSRA10K [4] dataset, which is by far the largest publicly available dataset for saliency detection; it contains 10,000 annotated saliency images. The dataset is divided into a 70% training set and a 30% validation set. Quantitatively, we achieved MAE values of 0.0311 and 0.0119 on the validation set for image sizes 32 and 224, respectively.

Besides the quantitative analysis, we also carried out qualitative evaluations of the saliency detector on separate datasets: cifar10 and Imagenet [6]. Saliency maps extracted on the cifar10 dataset are shown in Figure 5.1; the detector quite reasonably separates the cat, airplane, deer, and bird regions from their backgrounds. We also found that the saliency detector extracts explanatory salient regions on more complex images. Figure 5.2 shows the obtained image saliency on the Imagenet dataset.
Figure 5.1: Visual saliency of cifar10 dataset obtained with saliency detector
Figure 5.2: Visual saliency of Imagenet dataset obtained with saliency detector
5.2 Performance on Cifar10 dataset
In order to ensure the performance of SRAM, we evaluated it on the cifar10 dataset. The cifar10 dataset consists of 60,000 32 x 32 RGB images in 10 classes, with 6,000 images per class. The dataset is divided into 50,000 training images and 10,000 test images. Example images of the cifar10 dataset are shown in Figure 5.3.
Figure 5.3: Examples of Cifar10 dataset
Since SRAM mainly exploits multiple 3 x 3 CNNs, we selected the VGG19 [32] network as the baseline. For a fair comparison, we reimplemented VGG19 in a fully convolutional way [20], removing max pooling and replacing it with strided convolutions. We compared the error rate on the cifar10 test set while changing the number of glimpses in RAM and SRAM. The comparison results are arranged in Table 5.1. While the RAM structure with 10 or more glimpses achieves a lower error rate than the VGG19 network, SRAM outperforms RAM, reducing the error rate by 0.607% up to 2.209%. We also found that SRAM using 9 glimpses performs better than RAM with 12 glimpses, showing that it is a powerful extension of RAM. Moreover, we note that performance improves as the glimpse number increases up to 12, with little effect beyond 12. All models, VGG19 (reimplemented), RAM, and SRAM, were trained on a GTX 1080ti.

Table 5.1: Performance comparison on Cifar10 dataset

Model                  | # of glimpses | scale of glimpses | Error (%)
VGG19 (reimplemented)  |      -        |        -          |  23.783
RAM                    |      9        |      8 x 8        |  24.562
RAM                    |     10        |      8 x 8        |  23.828
RAM                    |     11        |      8 x 8        |  23.445
RAM                    |     12        |      8 x 8        |  23.274
SRAM                   |      9        |      8 x 8        |  23.221
SRAM                   |     10        |      8 x 8        |  23.235
SRAM                   |     11        |      8 x 8        |  22.794
SRAM                   |     12        |      8 x 8        |  21.065
5.3 Computation Cost
In Table 5.2, we compare the computation cost of our proposed saliency aided recurrent attention model (SRAM) with that of RAM and VGG19 in terms of floating-point operations and number of parameters. RAM (12 glimpses, 8 x 8 glimpse size) is almost 5 times more efficient in computation and almost 18 times more efficient in memory usage than VGG19. By increasing the operation and parameter counts by only a small proportion (3.05 million floating-point operations and 0.23 million parameters), SRAM achieves better performance than RAM. Note that the values inside parentheses are those of the saliency detector, which accounts for only 2.48 million floating-point operations and 0.22 million parameters, a small fraction compared to VGG19.
Table 5.2: Computation cost of SRAM vs RAM vs VGG
Model                  | floating-point ops (millions) | params (millions)
VGG19 (reimplemented)  |            168.23             |      33.66
RAM, 12 glimpses, 8x8  |             36.26             |       1.91
SRAM, 12 glimpses, 8x8 |         39.31 (2.48)          |   2.14 (0.22)
Chapter 6
Discussion
In our experiments, the proposed saliency aided recurrent attention model (SRAM) outperforms the recurrent attention model (RAM) with the help of saliency information. As expected, we found that the saliency map acts as an advisor for the classification task by narrowing down the scope of examination.

There are clearly improved parts, but also some unfortunate ones. First of all, we expected SRAM to converge faster than RAM with the help of the saliency advisor; however, there was no meaningful difference. By examining the saliency maps obtained from the saliency detector, we could find possible causes. The trained saliency detector was not powerful enough to cope with all given input images; sometimes it was unable to play the role of advisor and instead confused the RAM structure during classification. Figure 6.1 shows some failure cases of saliency detection with the above saliency detector.
In the first image of Figure 6.1 (the bridge), only the right part of the bridge
is detected as visually salient. From the saliency information alone, it seems
easier to classify the object as a fish instead, a confusing case caused by the
incomplete advisor. In the second picture of Figure 6.1, the saliency detector
fails to capture the lizard due to its camouflage.

Figure 6.1: Failure cases of saliency detection on the Imagenet dataset

In the fourth image of Figure 6.1, a snake-skin pattern, the saliency advisor
is not that helpful, since the image requires judgement from an overall view.
It would likely be difficult for SRAM to decide with limited image information
biased by the given image saliency. The saliency detector is also vulnerable when
the image contains multiple instances of the same object, failing to extract a
proper saliency map, as shown in Figure 6.1. Even though such mistakes were not
frequent, these cases hinder better performance and faster convergence of SRAM.
Since SRAM is independent of the saliency detector, exploiting a more powerful
saliency detector, or training the saliency detector on a task-specific dataset,
would mitigate these problems.
Chapter 7
Conclusion
In this paper we proposed a novel computer vision architecture for classification
that exploits a recurrent attention model with a saliency map. First, we built an
intuitive saliency detector based on convolutional and transposed convolutional
neural networks. The saliency map extracted from the saliency detector is then
combined as an additional channel, reshaping the input image format to RGBS
(RGB + Saliency). The saliency aided recurrent attention model (SRAM), our
proposed model, receives the RGBS image as input and processes it while deciding
what and where to see, for efficient problem solving with limited resources.
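The RGBS construction described above amounts to appending the saliency map as a fourth image channel. A minimal sketch with NumPy, assuming HWC-layout arrays at CIFAR-10 size (32x32); the random data here merely stands in for a real image and a real detector output:

```python
import numpy as np

# Illustrative inputs: a 32x32 RGB image and a single-channel saliency
# map produced by the saliency detector (values in [0, 1]).
rgb = np.random.rand(32, 32, 3).astype(np.float32)
saliency = np.random.rand(32, 32, 1).astype(np.float32)

# Append the saliency map as a fourth channel: RGB -> RGBS.
rgbs = np.concatenate([rgb, saliency], axis=-1)
print(rgbs.shape)  # (32, 32, 4)
```

The classifier then simply treats the input as a 4-channel image, so only the first convolutional layer of the downstream network needs its input-channel count changed.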
A key contribution of this work is that, by incorporating visual saliency
information, the recurrent attention model can now choose what, where, and
moreover 'how' to process the image. By constraining the attention model to focus
on the given visual saliency, the obtained image saliency works as an advisor for
the classification task. The model outperforms the baseline recurrent attention
model on the cifar10 dataset while using fewer parameters and less computation,
demonstrating a successful extension of the RAM model. We expect that by adopting
a more complex architecture for saliency detection, or by training the saliency
detector with a task-specific dataset, even higher performance can be achieved.
The idea of SRAM is applicable to other computer vision tasks where a saliency
map can play the role of an advisor, such as object recognition, visual tracking,
and segmentation. In future work, we will try to extend SRAM to exploit other side
information instead of, or in addition to, visual saliency. We expect that by
using one or more kinds of side information, such as depth information or region
proposal information, in the format of additional channels, we could solve more
challenging computer vision tasks.
Abstract (Korean)

To overcome the scalability problem, a limitation of convolutional neural
networks, the Recurrent Attention Model (RAM) sequentially decides what and where
to look in an image. If the RAM structure can be given appropriate guidance on
'how' to look at the image, more efficient problem solving becomes possible. From
this perspective, this study proposes the Saliency aided Recurrent Attention
Model (SRAM), which uses image saliency information to guide how to view an image
for solving the image classification problem. First, we constructed a saliency
detector with an intuitive structure that can extract image saliency information,
and by adding the image saliency as a channel of the input image, we let this
information serve as a milestone for how to view the image. SRAM consists of an
encoder that decides how to sequentially process the image and a decoder for
classification. In a performance evaluation using the Cifar10 dataset, we
confirmed that, compared to the existing RAM, SRAM reduces the error rate by up
to 2.209% at only a very small cost in computation and memory usage.

Keywords: image saliency, image classification, Recurrent attention model
Student ID: 2016-26799