wavelet convolutional neural networks - arxivthe university of tokyo, digital frontier inc....

10
Wavelet Convolutional Neural Networks Shin Fujieda The University of Tokyo, Digital Frontier Inc. [email protected] Kohei Takayama Digital Frontier Inc. [email protected] Toshiya Hachisuka The University of Tokyo [email protected] Abstract Spatial and spectral approaches are two major ap- proaches for image processing tasks such as image clas- sification and object recognition. Among many such algo- rithms, convolutional neural networks (CNNs) have recently achieved significant performance improvement in many chal- lenging tasks. Since CNNs process images directly in the spa- tial domain, they are essentially spatial approaches. Given that spatial and spectral approaches are known to have dif- ferent characteristics, it will be interesting to incorporate a spectral approach into CNNs. We propose a novel CNN architecture, wavelet CNNs, which combines a multiresolu- tion analysis and CNNs into one model. Our insight is that a CNN can be viewed as a limited form of a multiresolu- tion analysis. Based on this insight, we supplement missing parts of the multiresolution analysis via wavelet transform and integrate them as additional components in the entire architecture. Wavelet CNNs allow us to utilize spectral in- formation which is mostly lost in conventional CNNs but useful in most image processing tasks. We evaluate the prac- tical performance of wavelet CNNs on texture classification and image annotation. The experiments show that wavelet CNNs can achieve better accuracy in both tasks than exist- ing models while having significantly fewer parameters than conventional CNNs. 1. Introduction Convolutional neural networks (CNNs) [27, 26] are known to be good at capturing spatial features, while spec- tral analyses [38, 28] are good at capturing scale-invariant features based on the spectral information. It is thus prefer- able to consider both the spatial and spectral information within a single model, so that it captures both types of fea- tures simultaneously. While the connection between CNNs and spectral approaches have been considered unclear so far, we found that a CNN can be seen as a limited form of a multiresolution analysis. This observation points out that conventional CNNs are missing a large part of spectral information available via a multiresolution analysis. We thus propose to supplement those missing parts of a multiresolution analysis as novel additional components in a CNN architecture. Figure 1 shows the overview of our model; wavelet convolutional neural networks (wavelet CNNs). Besides its theoretical formulation, we demon- strate the practical benefit of wavelet CNNs in two chal- lenging tasks: texture classification and image annotation. We demonstrate that wavelet CNNs achieve better or com- petitive accuracies with a significantly smaller number of trainable parameters than conventional CNNs. Our model is thus easier to train, less prone to over-fitting, and consumes less memory than conventional CNNs. To summarize, our contributions are: Combination of CNNs and a multiresolution analysis as one model. Reformulation of CNNs as a limited form of a multireso- lution analysis. Accurate and efficient texture classification and image annotation using our model. 2. Related Work Convolutional Neural Networks: CNNs essentially re- placed conventional hand-crafted descriptors such as the Bag of Visual Words (BoVW) [8] due to the superior per- formance in various tasks [26]. The original network archi- tecture has been extended to deeper architectures since then. One such architecture is Residual Networks (ResNets) [16] which make deeper networks easier to train by introducing shortcut connections; connections which skip a few lay- ers and perform identity mappings. Inspired by ResNets, various networks with shortcut connections have been pro- posed [40, 41, 19]. 1 arXiv:1805.08620v1 [cs.CV] 20 May 2018

Upload: others

Post on 05-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

Wavelet Convolutional Neural Networks

Shin FujiedaThe University of Tokyo, Digital Frontier Inc.

[email protected]

Kohei TakayamaDigital Frontier [email protected]

Toshiya HachisukaThe University of Tokyo

[email protected]

Abstract

Spatial and spectral approaches are two major ap-proaches for image processing tasks such as image clas-sification and object recognition. Among many such algo-rithms, convolutional neural networks (CNNs) have recentlyachieved significant performance improvement in many chal-lenging tasks. Since CNNs process images directly in the spa-tial domain, they are essentially spatial approaches. Giventhat spatial and spectral approaches are known to have dif-ferent characteristics, it will be interesting to incorporatea spectral approach into CNNs. We propose a novel CNNarchitecture, wavelet CNNs, which combines a multiresolu-tion analysis and CNNs into one model. Our insight is thata CNN can be viewed as a limited form of a multiresolu-tion analysis. Based on this insight, we supplement missingparts of the multiresolution analysis via wavelet transformand integrate them as additional components in the entirearchitecture. Wavelet CNNs allow us to utilize spectral in-formation which is mostly lost in conventional CNNs butuseful in most image processing tasks. We evaluate the prac-tical performance of wavelet CNNs on texture classificationand image annotation. The experiments show that waveletCNNs can achieve better accuracy in both tasks than exist-ing models while having significantly fewer parameters thanconventional CNNs.

1. IntroductionConvolutional neural networks (CNNs) [27, 26] are

known to be good at capturing spatial features, while spec-tral analyses [38, 28] are good at capturing scale-invariantfeatures based on the spectral information. It is thus prefer-able to consider both the spatial and spectral informationwithin a single model, so that it captures both types of fea-tures simultaneously. While the connection between CNNsand spectral approaches have been considered unclear so

far, we found that a CNN can be seen as a limited formof a multiresolution analysis. This observation points outthat conventional CNNs are missing a large part of spectralinformation available via a multiresolution analysis.

We thus propose to supplement those missing parts ofa multiresolution analysis as novel additional componentsin a CNN architecture. Figure 1 shows the overview ofour model; wavelet convolutional neural networks (waveletCNNs). Besides its theoretical formulation, we demon-strate the practical benefit of wavelet CNNs in two chal-lenging tasks: texture classification and image annotation.We demonstrate that wavelet CNNs achieve better or com-petitive accuracies with a significantly smaller number oftrainable parameters than conventional CNNs. Our model isthus easier to train, less prone to over-fitting, and consumesless memory than conventional CNNs. To summarize, ourcontributions are:

• Combination of CNNs and a multiresolution analysis asone model.

• Reformulation of CNNs as a limited form of a multireso-lution analysis.

• Accurate and efficient texture classification and imageannotation using our model.

2. Related Work

Convolutional Neural Networks: CNNs essentially re-placed conventional hand-crafted descriptors such as theBag of Visual Words (BoVW) [8] due to the superior per-formance in various tasks [26]. The original network archi-tecture has been extended to deeper architectures since then.One such architecture is Residual Networks (ResNets) [16]which make deeper networks easier to train by introducingshortcut connections; connections which skip a few lay-ers and perform identity mappings. Inspired by ResNets,various networks with shortcut connections have been pro-posed [40, 41, 19].

1

arX

iv:1

805.

0862

0v1

[cs

.CV

] 2

0 M

ay 2

018

Page 2: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

InputImage kh,1

kl,1

Conv, 64

Conv, 64, /2

+

Conv, 128

Conv, 128, /2

Conv, 256

Conv, 256, /2

+C

onv, 512, /2

Conv, 512

Ave pool

fc + +

kh,2

kl,2 kh,3

kl,3

+

+

kh,4

kl,4

+

/2 /2 /2 /2

: Projection shortcut+ : Channel-wise concatkl : Low-pass filter

: High-pass filterkh

k1 k2 k3 k4

+

Conv, 64

Conv, 64

Conv, 128 Conv, 256

Conv, 64

Conv, 128

Figure 1. Overview of wavelet CNN with 4-level decomposition of the input image. Wavelet CNN processes the input image throughconvolution layers with 3 × 3 kernels and 1 × 1 padding. The number after Conv denotes the number of channels of the output. 3 × 3convolutional kernels with the stride of 2 and 1× 1 padding are used to reduce the size of feature maps. Additionally, the input image isdecomposed through multiresolution analysis and the decomposed images are concatenated channel-wise. The projection shortcuts are doneby 1× 1 convolutions. The output of convolution layers is vectorized by global average pooling followed by a fully connected layer (fc).The size of the output is equal to the number of classes included in the input dataset.

Even with shortcut connections, however, deeper net-works still have a problem that information about the inputand gradient can quickly vanish as they propagate throughthe networks. To address this problem, Dense ConvolutionalNetwork (DenseNet) [18] further adds shortcut connectionswhich connect each layer with all its previous layers. Whilethese networks have achieved impressive results on manycomputer vision tasks, they require significant computationalresources because they have a large number of parameters.We designed our network after DenseNet, but the combina-tion with a multiresolution analysis allows us to significantlyreduce the number of parameters compared to conventionalnetworks.

Several recent works use CNNs as a feature extractor anda BoVW approach as pooling and encoding instead of thefully connected layers. Cimpoi et al. [6] demonstrated thata CNN in combination with Fisher Vectors (FV-CNN) canachieve much better accuracy than using a CNN alone. Theirmodel uses a pre-trained CNN to extract image features andthis CNN part is not trained with existing datasets. Lin etal. [30] achieved a remarkable improvement in fine-grainedvisual recognition by replacing the fully connected layerswith bilinear pooling. The dimension of the encoded featuresin the bilinear model is typically higher than 250,000 and it isdifficult to train. To address this difficulty, Gao et al. [9] pro-posed compact bilinear pooling which reduces the number ofparameters of bilinear pooling by 90% while maintaining its

performance. Despite this significant reduction in the num-ber of parameters, inherited from conventional CNNs, thesemodels still have a large number of trainable parametersthat makes them difficult to train in practice. We show thatour model achieves competitive results to compact bilinearpooling while further reducing the number of parameters ina texture classification task.

Spectral Approaches: Spectral approaches transform im-ages into the frequency domain using a set of spatial filters.The statistics of the spectral information at different scalesand orientations define image features. This approach hasbeen well studied in image processing and achieved practi-cal results [38, 2, 23]. Feature extraction in the frequencydomain has an advantage. A spatial filter can be easily madeselective by enhancing certain frequencies while suppressingthe others. This explicit selection of certain frequencies isdifficult to control in CNNs. While CNNs are know to beuniversal approximators, in practice, it is unclear whetherCNNs can learn to perform spectral analyses with availabledatasets. Rather than relying CNNs to learn performingspectral analysis, we propose to directly integrate spectralapproaches into CNNs, particularly based on a multiresolu-tion analysis using wavelet transform [33]. Our experimentsshow that a CNN with more parameters cannot be trained to

Page 3: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

become equivalent to our model with available datasets inpractice.

3. Wavelet Convolutional Neural NetworksOverview: We propose to formulate convolution and pool-ing in CNNs as filtering and downsampling. This formu-lation allows us to connect CNNs with a multiresolutionanalysis. In the following explanations, we use a single-channel 1D data for the sake of brevity. Applications to 2Dimages with multiple channels are trivially possible as wasdone by CNNs.

3.1. Convolutional Neural Networks

In addition to the use of an activation function and afully connected layer, CNNs introduce convolution/poolinglayers. Figure 2 illustrates the configuration we explain inthe following.

Convolution Layers: Given an input vector with n com-ponents x = (x0, x1, . . . , xn−1) ∈ Rn, a convolutionlayer outputs a vector of the same number of componentsy = (y0, y1, . . . , yn−1) ∈ Rn:

yi =∑j∈Ni

wjxj , (1)

where Ni is a set of indices of neighbors at xi and wj is aweight. Following the notational convention in CNNs, weconsider that wj includes the bias by having a constant inputof 1. The equation thus says that each output yi is a weightedsum of neighbors

∑j∈Ni

wjxj plus constant.Each layer defines the weights wj as constants over i. By

sharing parameters, CNNs reduce the number of parametersand achieve translation invariance in the image space. Thedefinition of yi in Equation 1 is equivalent to convolutionof xi via a filtering kernel wj , thus this layer is called aconvolution layer. We can thus rewrite yi in Equation 1using the convolution operator ∗ as

y = x ∗w, (2)

where w = (w0, w1, . . . , wo−1) ∈ Ro.

Pooling Layers: Pooling layers are typically used imme-diately after convolution layers to simplify the information.We focus on average pooling which allows us to see theconnection with a multiresolution analysis. Given an in-put x ∈ Rn, average pooling outputs a vector of a fewercomponents y ∈ Rm as

yj =1

p

p−1∑k=0

xpj+k, (3)

x0 x1 x2 xn-3 xn-2 xn-1

y0 y1 y2 yn-3 yn-2 yn-1

x3

y3

xn-4

yn-4

x0 x1 x2 xn-3 xn-2 xn-1

y0 y1 ym-2 ym-1

x3 xn-4

p p p

w w w w

N0 N1 Nn-1Nn-2

(a)

(b)

p

p = 2

Figure 2. Concepts of convolution and average pooling layers. (a)Convolution layers compute a weighted sum of neighbor. (b) Pool-ing layers take an average and perform downsampling.

where p defines the support of pooling and m = np . For

example, p = 2 means that we reduce the number of outputsto a half of the inputs by taking pair-wise averages. Us-ing the standard downsampling operator ↓, we can rewriteEquation 3 as

y = (x ∗ p) ↓ p, (4)

where p = (1/p, . . . , 1/p) ∈ Rp represents the averagingfilter. Average pooling mathematically involves convolutionvia p followed by downsampling with the stride of p.

3.2. Generalized Convolution and Pooling

Equation 2 and Equation 4 can be combined into a gener-alized form of convolution and downsampling as

y = (x ∗ k) ↓ p. (5)

The generalized weight k is defined as• k = w with p = 1 (convolution in Equation 2)• k = p with p > 1 (pooling in Equation 4)• k = w∗p with p > 1 (convolution followed by pooling).

Our insight is that Equation 5 is equivalent to a part ofa multiresolution analysis. To see this connection, let usconsider convolution followed by pooling with p = 2 and apair of convolution kernels kl and kh:

xl = (x ∗ kl) ↓ 2xh = (x ∗ kh) ↓ 2.

(6)

At this point, it is nothing but convolution and pooling withtwo different kernels kl and kh. The key idea is that a mul-tiresolution analysis [7] decomposes xl further as follows.

By defining xl,0 = x, a multiresolution analysis performsa hierarchical decomposition of xl,t into xl,t+1 and xh,t+1

Page 4: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

by repeatedly applying Equation 6 with different kl,t andkh,t at each t:

xl,t+1 = (xl,t ∗ kl,t) ↓ 2xh,t+1 = (xl,t ∗ kh,t) ↓ 2.

(7)

The number of applications t is called a level in a multiresolu-tion analysis. Based on our reformulation, CNNs essentiallydiscard xh,t entirely and use only one set of kernels kt:

xl,t+1 = (xl,t ∗ kt) ↓ 2. (8)

Therefore, CNNs can seen as a limited form of a multireso-lution analysis.

Figure 1 illustrates how CNNs and our wavelet CNNs dif-fer under this formulation. We call kl,t and kh,t as low-passand high-pass filter to follow the convention of multiresolu-tion analyses. Note, however, that they are not necessarilylow-pass and high-pass filters in the spectral domain. Con-ventional CNNs can be seen as a limited form of a multires-olution analysis that uses only kt without a characteristichierarchical decomposition (Equation 8). Our model supple-ments the missing part due to kh,t by introducing anotherset of kl,t to form a multiresolution analysis via wavelettransform inside a neural network architecture. While thisidea might look simple after the fact, our model is powerfulenough to outperform the existing more complex models aswe will show in the results.

Note that we cannot use an arbitrary pair of filters (kl,t

and kh,t) to perform multiresolution analysis. For wavelettransform, kh,t is known as the wavelet function and kl,t isknown as the scaling function. We used Haar wavelets [12]for our experiments, but our model is not restricted to Haar.This constraint also suggests why it is difficult to train con-ventional CNNs to perform the same computation as waveletCNNs do: weights kt in CNNs are ignorant of this importantconstraint and just try to learn it from datasets.

Rippel et al. [34] proposed a related approach of replacingconvolution and pooling by discrete Fourier transform andtruncation of the coefficients. This approach, called spectralpooling, is equivalent to Equation 8, thus it is not essen-tially different from conventional CNNs. Our model is alsodifferent from merely applying multiresolution analysis oninput data and using CNNs afterward, since multiresolutionanalysis is built inside the network with skip connections.

3.3. Implementation

Network Structure: Figure 1 illustrates our network struc-ture. We designed our main network structure after a VGGnetwork [37]. We use 3×3 convolutional kernels exclusivelyand 1 × 1 padding to ensure the output is the same size asthe input.

Instead of using the pooling layers to reduce the size ofthe feature maps, we exploit convolution layers with the

increased stride. If 1 × 1 padding is added to the layerwith a stride of two, the output becomes half the size ofthe input layer. This approach can be used to replace maxpooling without loss in accuracy [21]. In addition, sinceboth the VGG-like architecture and image decomposition inmultiresolution analysis have the same characteristic that thesize of images is reduced to a half successively, we combineeach level of decomposed images with feature maps of thespecific layer that are the same size as those images.

Furthermore, in order to use information of decomposedimages more efficiently, we use dense connections [18] andprojection shortcuts [16]. Dense connections allow eachlevel of decomposed images to be directly connected with allsubsequent layers through channel-wise concatenation. Withthis connectivity, our network can flow all the informationeffectively into the end of the network. Projection shortcutscan be used to increase dimensions of the input with 1× 1convolutional kernels. In our model, since dimensions offeature maps are different before and after the shortcut path,we use projection shortcuts in every shortcuts. We also useglobal average pooling [29] instead of fully connected layersto prevent overfitting.

Learning: Wavelet CNNs exploit global average poolingwith the same size as the input of the layer, so the size ofinput images is required to be the fixed size. We thus trainour proposed model exclusively with images of the size224 × 224. These images are achieved by first scaling thetraining images to 256 × 256 pixels and then conductingrandom crops to 224× 224 pixels and flipping. This randomvariation helps the model to prevent overfitting. For furtherrobustness, we use batch normalization [20] throughout ournetwork before activation layers during training. For theoptimizer, we exploit the Adam optimizer [24] instead ofSGD. We use the Rectified Linear Unit (ReLU) [10] as theactivation function in all the experiments.

4. Experiments

We provide details of two applications of wavelet CNNs.We applied wavelet CNNs to texture classification to confirmthat wavelet CNNs can capture small features of images. Wealso investigated how wavelet CNNs perform on naturalimages in an image annotation task.

4.1. Texture Classification

Texture classification is a challenging problem since tex-tures often vary a lot within the same class, due to changes inviewpoints, scales, lighting configurations, etc. In addition,textures usually do not contain enough information regardingthe shape of objects which are informative to distinguish dif-ferent objects in image classification tasks. Due to such dif-ficulties, even the latest approaches based on convolutional

Page 5: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

2-level 3-level 4-level 5-levelkth-tips2-b 59.1±2.5 61.4±1.7 63.5±1.3 63.7±2.3DTD 34.0±1.6 35.2±1.2 35.6±1.1 35.6±0.7

Table 1. Classification results for different levels within waveletCNNs trained from scratch indicated as accuracy (%).

AlexNet T-CNN WaveletCNN

kth-tips2-b 48.3±1.4 49.6±0.6 63.7±2.3DTD 22.7±1.3 27.8±1.2 35.6±0.7

Table 2. Classification results for networks trained from scratchindicated as accuracy (%).

neural networks achieved a limited success, when comparedto other tasks such as image classification [13]. Andrearczyket al. [1] proposed texture CNN (T-CNN) which is a CNNspecialized for texture classification. T-CNN uses a novelenergy layer in which each feature map is simply pooled bycalculating the average of its activated output. This resultsin a single value for each feature map, similar to an energyresponse to a filter bank. This approach does not improveclassification accuracy, but its simple architecture reducesthe number of parameters.

Datasets: For our experiments, we used two publicly avail-able texture datasets: kth-tips2-b [14] and DTD [5]. Thekth-tips2-b dataset contains 11 classes of 432 texture images.Each class consists of four samples and each sample has108 images. Each sample is used for training once whilethe remaining three samples are used for testing. The re-sults for kth-tips2-b are shown as the mean and the standarddeviation over the four splits. The DTD dataset contains47 classes of 120 images ”in the wild” which means thatimages are collected in uncontrolled conditions. This datasetincludes 10 available annotated splits with 40 training im-ages, 40 validation images, and 40 testing images for eachclass. The results for DTD are averaged over the 10 splits.We processed the images in each dataset by global contrastnormalization. We calculated the accuracy as percentage ofimages that are correctly labeled which is a common metricin texture classification.

Training from scratch: Table 1 shows the results of ourmodel with different levels of a multiresolution analysis. Forinitialization of the parameters, we used a robust methodfor ReLU [15]. For both datasets, the network with 5-leveldecomposition performed the best, though the model with4-level decomposition achieved almost the same accuracyas 5-level. Figure 3 and Table 2 compare our model withAlexNet [26] and T-CNN [1] using texture datasets to traineach model from scratch. Since the model with 5-leveldecomposition achieved the best accuracy in the previous

(a) kth-tips2-b (b) DTD

0

10

20

30

40

50

60

70

AlexNet T-CNN WaveletCNN

Acc

urac

y [%

]

0

10

20

30

40

50

60

70

AlexNet T-CNN WaveletCNN

Acc

urac

y [%

]

Figure 3. Classification results of (a) kth-tips2-b and (b) DTD fornetworks trained from scratch. We compared our models withAlexNet and T-CNN.

Shearlet VGG-M T-CNN TS+VGG-M

WaveletCNN

kth-tips2-b 62.3±0.8 70.7±1.7 72.4±2.1 71.8±1.6 74.0±1.2DTD 21.6±0.9 55.2±1.2 55.8±0.8 59.3±1.1 59.8±0.9

Table 3. Classification results for networks pre-trained with Ima-geNet indicated as accuracy (%).

experiment, we used this network in this and following ex-periments as well. Since VGG networks tend to performpoorly due to over-fitting if trained from scratch, we usedAlexNet as an example of conventional CNNs for this ex-periment. For both datasets, our model performs better thanAlexNet and T-CNN by a large margin.

Training with fine-tuning: Figure 4 and Table 3 showthe classification rates using the networks pre-trained withthe ImageNet 2012 dataset [35]. We compared our modelwith a spectral approach using shearlet transform [25], VGG-M [4], T-CNN [1], and VGG-M using compact bilinearpooling [9]. For compact bilinear pooling, we comparedour model only with Tensor Sketch (TS) since it workedthe best in practice. Our model again achieved the bestperformance for both datasets. While the improvement forthe DTD dataset might be marginal (less than 1%), as weshow later, this performance is achieved with a significantlyfewer parameters than other methods.

Visual comparisons of classified images: Figure 5 showssome extracted images for several classes in our experiments.The images in the top row are from kth-tips2-b dataset, whilethe images in the bottom row of Figure 5 are from DTDdataset. A red square indicates a incorrectly classified tex-ture. We can visually confirm that a spectral approach (shear-let) is insensitive to the scale variation and extract detailedfeatures, whereas a spatial approach (VGG-M) is insensitiveto distortion. For example, in Aluminium foil, a shearlettransform can correctly ignore the scale of wrinkles, butVGG-M failed to classify such an image into the same class.In Banded, VGG-M classifies distorted lines into the correct

Page 6: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

(a) kth-tips2-b (b) DTD

0

10

20

30

40

50

60

70

80

Shearlet VGG-M T-CNN TS+VGG-M

WaveletCNN

Acc

urac

y [%

]

0

10

20

30

40

50

60

70

80

Shearlet VGG-M T-CNN TS+VGG-M

WaveletCNN

Acc

urac

y [%

]

Figure 4. Classification results of (a) kth-tips2-b and (b) DTD fornetworks pre-trained with ImageNet 2012 dataset. We comparedour model (wavelet CNN with 5-level decomposition) with shearlettransform, VGG-M, T-CNN, and TS+VGG-M.

class, but a shearlet transform could not recognize this line-like structure well. Since our model is the combination ofboth approaches, it can assign texture images to the correctlabel in every variation above.

4.2. Image Annotation

The purpose of an image annotation task is to associatemultiple labels with an image regarding to its content. Thistask is more natural than single-label image classificationbecause a natural image actually includes various objects.The convolutional neural network - recurrent neural net-work (CNN-RNN) encoder-decoder model is a popular ap-proach [39, 22, 31] for this task. In this model, a CNNencodes the image into a fixed length vector, and then itis fed into an RNN that decodes it into a list of tags. Theexisting models share this concept and differ slightly in howthe CNN and RNN relate to each other.

Recurrent image annotator (RIA) [22] exploits imagefeatures output from the CNN as the RNN hidden states.They focus on the order of a list of input tags and show thatthe rare-first order, which put rarer tags first based on theirfrequency, improves the performance. Liu et al. proposedsemantically regularized CNN-RNN (S-CNN-RNN) [31]where the CNN model is regularized by semantic conceptswhich serve as strong deep supervision to guide the learningof the CNN layers. The prediction layer of the CNN in thismodel is also used as the RNN initial states. Both models useVGG-16 [37] as the CNN and the long short-term memory(LSTM) [17] as RNN. In our experiment, we compared ourmodel with RIA.

Datasets: We used two benchmark image annotationdatasets: IAPR-TC12 [11] and Microsoft COCO [29]. TheIAPR-TC12 dataset contains 20,000 images of natural sceneswith text captions in several languages. To use this dataset forimage annotation, it can be arranged by extracting commonnouns in accord with the previous work [32]. This processresults in a vocabulary size of 291. Training used 17,665

C-P C-R C-F1 O-P O-R O-F1VGG-16 [37] 22.97 27.39 24.99 33.87 34.93 34.40Wavelet CNN 29.01 30.62 29.79 37.43 37.66 37.54

Table 4. Annotation results of RIAs for IAPR-TC12.

C-P C-R C-F1 O-P O-R O-F1VGG-16 [37] 51.55 45.60 48.49 57.94 51.92 54.77Wavelet CNN 53.17 46.69 49.72 58.68 52.04 55.16

Table 5. Annotation results of RIAs for Microsoft COCO.

images while the remaining are used for testing. The Mi-crosoft COCO (MS-COCO) dataset contains 82,783 trainingimages and 40,504 testing images. Following the previousworks [39, 31], we employed 80 object annotations as labels.

Training Details: We replaced VGG-16 in RIA by awavelet CNN with 5-level decomposition. For LSTM, thedimension of both hidden states and the input is set to 1024and the number of hidden layer is 1. When training, thehidden state and the cell state are initialized by image fea-tures from CNN and zero respectively. Additionally, sinceoriginal RIA uses 4096 dimensional output from the lastfully-connected layer of VGG-16 as image features, we adda 2048 dimensional fully connected layer to our model justafter an average pooling layer and exploit the output fromthis layer as image features. Even though this additionallayer increases the number of trainable parameters in ourmodel, our model still has only 18.3 millions parameterswhile VGG-16 has 138.4 millions parameters. For the or-der of input tags, we use the rare-first order following theoriginal paper [22].

Results: We used per-class and overall metrics includingprecision (C-P and O-P), recall (C-R and O-R) and F1 score(C-F1 and O-F1) as evaluation metrics. Per-class metricstake the average over all classes while overall metrics takethe average over all test images. As RIA can produce an-notations in arbitrary length, we used the arbitrary-lengthresults to compare. Table 4 and Table 5 show the resultsusing RIA models with VGG-16 and the wavelet CNN forIAPR-TC12 and MS-COCO. For IAPR-TC12, RIA with ourmodel obtained much better results than original RIA. Theimprovement for MS-COCO was marginal in comparison.However, all these results are obtained with a significantlyfewer number of parameters; the number of parameters ofwavelet CNN is more than seven times smaller than that ofVGG-16.

Figure 7 shows some results in image annotation. Theimages in the top row are from IAPR-TC12 while the imagesin the bottom row are from MS-COCO. GT indicates theground-truth annotations, and they are organized in the rare-first order. VGG and ours show the results of RIA withVGG-16 and our model, where the order of the predictionsis preserved as RNN output.

Page 7: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

Alu

min

ium

foil

Ban

ded

(a) Shearlet Transform (c) Our Model(b) VGG-M (d) ReferencesFigure 5. Some results classified by (a) shearlet transform, (b) VGG-M, (c) our model and (d) references. The images on the top row areextracted from kth-tips2-b and the rests are extracted from DTD. The images in red squares are wrongly classified images.

4.3. Number of parameters

To assess the complexity of each model, we compared thenumber of trainable parameters such as weights and biasesfor classification to 1000 classes (Figure 6). ConventionalCNNs such as VGG-M and AlexNet have a large numberof parameters while their depth is a little shallower than ourproposed model. Even compared to T-CNN, which aims atreducing the model complexity, the number of parameters inour model with 5-level decomposition is about the half. Wealso remind that our model achieved higher accuracy thanT-CNN does in texture classification.

This result confirms that our model achieves better re-sults with a significantly reduced number of parameters thanexisting models. The memory consumption of each Caffemodel is: 392 MB (VGG-M), 232 MB (AlexNet), 89.1 MB(T-CNN), and 53.9 MB (Ours). The small number of param-eters generally suppresses over-fitting of the model for smalldatasets.

5. Discussion

Application to more general tasks: We applied ourmodel to two challenging tasks; texture classification and im-age annotation. However, since we do not assume anythingregarding the input, our model is not necessarily restrictedto these tasks. For example, we experimented training awavelet CNN with 5-level decomposition and AlexNet withthe ImageNet 2012 dataset from scratch to perform imageclassification. Our model obtained the accuracy of 59.4%

0

10

20

30

40

50

60

70

80

90

100

110

VGG-M AlexNet T-CNN 2-level 3-level 4-level 5-level

Num

ber o

f par

amet

ers

in m

illio

ns

102.9

60.9

7.79

23.4

8.53 11.5 14.1

Figure 6. The number of trainable parameters in millions. Ourmodel, even with 5-level of multiresolution analysis, has a fewerparameters than any other competing models we tested.

whereas AlexNet resulted in 57.1%. We should remind thatthe number of parameters of our model is about four timessmaller than that of AlexNet (Figure 6). Our model is thussuitable also for image classification with smaller memoryfootprint. Other applications such as image recognition andobject detection with our model should be similarly possible.

Lp pooling: An interesting generalization of max and av-erage pooling is Lp pooling [3, 36]. The idea of Lp poolingis that max pooling can be thought as computing L∞ norm,while average pooling can be considered as computing L1

norm. In this case, Equation 4 cannot be written as lin-ear convolution anymore due to non-linear transformationin norm calculation. Our overall formulation, however, is

Page 8: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

GT: peak, range, snow, lake, mountainVGG: peak, valley, range, snow, city, view, slope,mountainOurs: peak, range, snow, landscape, mountain

GT: tennis, court, player, tee-shirt, short, woman, manVGG: tennis, court, player, tee-shirt, shortOurs: tennis, court, seat, player, stadium, tee-shirt, short, woman, man

GT: light, church, night, city, square, view, frontVGG: night, lamp, middle, road, cloudOurs: light, church, night, city, square, view, front

GT: teddy bear, cat, backpackVGG: teddy bearOurs: teddy bear, cat,

GT: motorcycle, backpack, personVGG: bus, traffic light, car, personOurs: motorcycle, person

GT: airplane, umbrella, handbag, personVGG: suitcase, train, backpack, handbag, personOurs: umbrella, handbag, person

Figure 7. Some results of image annotaion. The images on the top row are extracted from IAPR-TC12 and the rests are extracted fromMicrosoft COCO. GT, VGG and ours show the ground-truth annotaions, the predictions of RIA with VGG-16 and the predictions of RIAwith wavelet CNN, respectively.

not necessarily limited to a multiresolution analysis either;we can just replace downsampling part by correspondingnorm computation to support Lp pooling. This modificationhowever will not retain all the frequency information of theinput as it is no longer a multiresolution analysis. We fo-cused on average pooling as it has a clear connection to amultiresolution analysis.

Limitations: We designed wavelet CNNs to put each highfrequency part between layers of the CNN. Since our net-work has four layers to reduce the size of feature maps, themaximum decomposition level is restricted to five. Thisdesign is likely to be less ideal since we cannot tweak thedecomposition level independently from the depth (therebythe number of trainable parameters) of the network. A dif-ferent network design might make this separation of hyper-parameters possible.

Wavelet CNNs achieved the best accuracy for both train-ing from scratch and with fine-tuning for texture classifica-tion. For the performance with fine-tuning, however, ourmodel outperforms other methods by a slight margin espe-cially for DTD, albeit with a significantly smaller numberof parameters. We speculated that it is partially becausepre-training with the ImageNet 2012 dataset is simply not

appropriate for texture classification. An exact reasoning offailure cases for texture classification, however, is generallydifficult for any neural network models, and our model isnot an exception.

6. Conclusion

We presented a novel CNN architecture which incorpo-rates a spectral analysis into CNNs. We showed how toreformulate convolution and pooling layers in CNNs intoa generalized form of filtering and downsampling. Thisreformulation shows how conventional CNNs perform alimited version of multiresolution analysis, which then al-lows us to integrate multiresolution analysis into CNNs as asingle model called wavelet CNNs. We demonstrated thatour model achieves better accuracy for texture classifica-tion and image annotation with smaller number of trainableparameters than existing models. In particular, our modeloutperformed all the existing models with significantly moretrainable parameters by a large margin when we trained eachmodel from scratch. A wavelet CNN is a general learningmodel and applications to other problems are interestingfuture works. Finally, for wavelet transform in our model,we use fixed-weight kernels as Haar wavelet. It is interesting

Page 9: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

to explore how wavelet kernels themselves can be trained inthe end-to-end learning framework.

References[1] V. Andrearczyk and P. F. Whelan. Using filter banks in con-

volutional neural networks for texture classification. PatternRecognition Letters, 84:63 – 69, 2016. 5

[2] S. Arivazhagan, T. G. Subash Kumar, and L. Ganesan. Textureclassification using curvelet transform. International Jour-nal of Wavelets, Multiresolution and Information Processing,05(03):451–464, 2007. 2

[3] Y. Boureau, J. Ponce, and Y. Lecun. A theoretical analysisof feature pooling in visual recognition. In InternationalConference on Machine Learning (ICML-10), pages 111–118.Omnipress, 2010. 7

[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman.Return of the devil in the details: Delving deep into convo-lutional nets. In British Machine Vision Conference, 2014.5

[5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, andA. Vedaldi. Describing textures in the wild. In IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR),2014. 5

[6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks fortexture recognition and segmentation. In IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), 2015.2

[7] J. L. Crowley. A representation for visual information. Tech-nical report, 1981. 3

[8] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray.Visual categorization with bags of keypoints. In Workshop onStatistical Learning in Computer Vision, ECCV, pages 1–22,2004. 1

[9] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compactbilinear pooling. In IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 317–326, 2016. 2, 5

[10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifierneural networks. In Proceedings of the Fourteenth Inter-national Conference on Artificial Intelligence and Statistics(AISTATS-11), volume 15, pages 315–323, 2011. 4

[11] M. Grubinger. Analysis and evaluation of visual informa-tion systems performance, 2007. Thesis (Ph. D.)–VictoriaUniversity (Melbourne, Vic.), 2007. 6

[12] A. Haar. Zur theorie der orthogonalen funktionensysteme.Mathematische Annalen, 69(3):331–371, 1910. 4

[13] L. G. Hafemann, L. S. Oliveira, and P. Cavalin. Forest speciesrecognition using deep convolutional neural networks. In In-ternational Conference on Pattern Recognition (ICPR), pages1103–1107, 2014. 5

[14] E. Hayman, B. Caputo, M. Fritz, and J.-O. Eklundh. On theSignificance of Real-World Conditions for Material Classifi-cation, volume 4, pages 253–266. 2004. 5

[15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep intorectifiers: Surpassing human-level performance on imagenetclassification. In IEEE International Conference on ComputerVision (ICCV), pages 1026–1034, 2015. 5

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learningfor image recognition. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), pages 770–778, June2016. 1, 4

[17] S. Hochreiter and J. Schmidhuber. Long short-term memory.Neural Comput., 9(8):1735–1780, 1997. 6

[18] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten.Densely connected convolutional networks. In IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR),2017. 2, 4

[19] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Wein-berger. Deep Networks with Stochastic Depth, pages 646–661.Springer International Publishing, 2016. 1

[20] S. Ioffe and C. Szegedy. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift.CoRR, abs/1502.03167, 2015. 4

[21] T. B. J. T. Springenberg, A. Dosovitskiy and M. Riedmiller.Striving for simplicity: The all convolutional net. In Inter-national Conference on Learning Representations WorkshopTrack, 2015. 4

[22] J. Jin and H. Nakayama. Annotation order matters: Recurrentimage annotator for arbitrary length image tagging. In Inter-national Conference on Pattern Recognition (ICPR), pages2452–2457, 2016. 6

[23] M. Kanchana and P. Varalakshmi. Texture classification usingdiscrete shearlet transform. International Journal of ScientificResearch, 5, 2013. 2

[24] D. P. Kingma and J. Ba. Adam: A method for stochasticoptimization. CoRR, abs/1412.6980, 2014. 4

[25] K. G. Krishnan, P. T. Vanathi, and R. Abinaya. Performanceanalysis of texture classification techniques using shearlettransform. In International Conference on Wireless Com-munications, Signal Processing and Networking (WiSPNET),pages 1408–1412, 2016. 5

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems 25, pages1106–1114. 2012. 1, 5

[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.Howard, W. Hubbard, and L. D. Jackel. Backpropagationapplied to handwritten zip code recognition. Neural Compu-tation, 1(4):541–551, Dec 1989. 1

[28] W. Q. Lim. The discrete shearlet transform: A new directionaltransform and compactly supported shearlet frames. IEEETransactions on Image Processing, 19(5):1166–1180, May2010. 1

[29] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR,abs/1312.4400, 2014. 4, 6

[30] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnns forfine-grained visual recognition. In Transactions of PatternAnalysis and Machine Intelligence (PAMI), 2017. 2

[31] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun.Semantic Regularisation for Recurrent Image Annotation.ArXiv e-prints, 2016. 6

[32] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline forimage annotation. In European Conference on ComputerVision (ECCV), pages 316–329, 2008. 6

Page 10: Wavelet Convolutional Neural Networks - arXivThe University of Tokyo, Digital Frontier Inc. sfujieda@graphics.ci.i.u-tokyo.ac.jp Kohei Takayama Digital Frontier Inc. ktakayama@dfx.co.jp

[33] S. G. Mallat. A theory for multiresolution signal decompo-sition: the wavelet representation. IEEE Transactions onPattern Analysis and Machine Intelligence, 11(7):674–693,Jul 1989. 2

[34] O. Rippel, J. Snoek, and R. P. Adams. Spectral representationsfor convolutional neural networks. In Neural InformationProcessing Systems (NIPS), pages 2449–2457, 2015. 4

[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,and L. Fei-Fei. Imagenet large scale visual recognition chal-lenge. International Journal of Computer Vision, 115(3):211–252, 2015. 5

[36] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neuralnetworks applied to house numbers digit classification. In In-ternational Conference on Pattern Recognition (ICPR), pages3288–3291, 2012. 7

[37] K. Simonyan and A. Zisserman. Very deep convolu-tional networks for large-scale image recognition. CoRR,abs/1409.1556, 2014. 4, 6

[38] M. Unser. Texture classification and segmentation usingwavelet frames. IEEE Transactions on Image Processing,4(11):1549–1560, 1995. 1, 2

[39] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu.Cnn-rnn: A unified framework for multi-label image classifi-cation. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 2285–2294, 2016. 6

[40] S. Zagoruyko and N. Komodakis. Wide residual networks.CoRR, abs/1605.07146, 2016. 1

[41] K. Zhang, M. Sun, X. Han, X. Yuan, L. Guo, and T. Liu.Residual networks of residual networks: Multilevel residualnetworks. IEEE Transactions on Circuits and Systems forVideo Technology, PP(99):1–1, 2017. 1