20140530.journal club

Journal Club paper introduction: Modeling Natural Images using Gated MRF, Ranzato, M. et al., 2014/05/30, [email protected]

Upload: shouno-hayaru

Post on 06-May-2015


Category: Engineering



DESCRIPTION

My interpretation of the paper "Modeling Natural Images using Gated MRF" by M. Ranzato et al. (probably IEEE PAMI, 2013)

TRANSCRIPT

Page 1: 20140530.journal club

Journal Club paper introduction: Modeling Natural Images using Gated MRF

Ranzato, M. et al.

2014/05/30  [email protected]

Page 2: 20140530.journal club

Goals

•  Use the same mechanism as much as possible for:  – learning image representations  – building image classifiers  – denoising  – inferring occluded scenes, etc.

•  Tools:  – a probabilistic machine-learning model: an MRF-style extension (gated RBM) + Deep Belief Net (DBN)

Page 3: 20140530.journal club

Review of image representations, using a two-pixel image

Samples

Representation under the model

Page 4: 20140530.journal club

Toy Problem: reconstructing image patches — making each patch zero-mean vs. making the whole image zero-mean

Page 5: 20140530.journal club

Required latent (intervening) variables

•  Mean  – not a heavy-tailed structure

•  Covariance (its inverse: the precision matrix)  – heavy-tailed structure (strong correlations between nearby pixels)

•  Latent variables that can represent both the mean and the covariance → Gated MRF (mcRBM, mPoT)
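As a quick sanity check of the heavy-tail claim, here is a minimal Python sketch (my own illustration, not from the slides or the paper) comparing the kurtosis of neighbouring-pixel differences with that of a matched Gaussian; `image.png` is a hypothetical grayscale natural image.

```python
import numpy as np
from scipy import stats
from PIL import Image

# Hypothetical grayscale natural image, made zero-mean as in the toy problem above.
img = np.asarray(Image.open("image.png").convert("L"), dtype=np.float64)
img -= img.mean()

diff = img[:, 1:] - img[:, :-1]                 # horizontal neighbouring-pixel differences
gauss = np.random.normal(0.0, diff.std(), size=diff.size)

# Excess kurtosis: ~0 for a Gaussian, clearly positive for heavy-tailed data.
print("kurtosis of pixel differences:", stats.kurtosis(diff.ravel()))
print("kurtosis of matched Gaussian: ", stats.kurtosis(gauss))
```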

Page 6: 20140530.journal club

Gated MRF (Markov Random Field)  •  Mean Units: encode the pixel values  •  Precision Units: encode the correlations between pixels

where σ is the logistic function σ(v) = 1/(1 + exp(−v)).

Given the states of the hidden units, the input units form an MRF in which the effective pairwise interaction weight between x_i and x_j is (1/2) Σ_f Σ_k P_{fk} h^p_k C_{if} C_{jf}. Therefore, the conditional distribution over the input is:

p(x | h^p) = N(0, Σ), with Σ^{-1} = C diag(P h^p) C^T   (5)

Notice that the covariance matrix is not fixed, but is a function of the states of the precision latent variables h^p. In order to guarantee positive definiteness of the covariance matrix we need to constrain P to be non-negative and add a small quadratic regularization term to the energy function^3, here ignored for clarity of presentation.

As described in sec. 2.2, we want the conditional distribution over the pixels to be a Gaussian with not only its covariance but also its mean depending on the states of the latent variables. Since the product of a full covariance Gaussian (like the one in eq. 5) with a spherical non-zero mean Gaussian is a non-zero mean full covariance Gaussian, we simply add the energy function of cRBM in eq. 3 to the energy function of a GRBM [29], yielding:

E(x, h^m, h^p) = (1/2) x^T C diag(P h^p) C^T x − b^{pT} h^p + (1/2) x^T x − h^{mT} W^T x − b^{mT} h^m − b^{xT} x   (6)

where h^m ∈ {0,1}^M are called "mean" latent variables because they contribute to control the mean of the conditional distribution over the input:

p(x | h^m, h^p) = N(Σ (W h^m + b^x), Σ),   with Σ^{-1} = C diag(P h^p) C^T + I   (7)

where I is the identity matrix, W ∈ R^{D×M} is a matrix of trainable parameters and b^x ∈ R^D is a vector of trainable biases for the input variables. The posterior distribution over the mean latent variables is^4:

p(h^m_k = 1 | x) = σ( Σ_{i=1}^D W_{ik} x_i + b^m_k )   (8)

The overall model, whose joint probability density function is proportional to exp(−E(x, h^m, h^p)), is called a mean covariance RBM (mcRBM) [9] and is represented in fig. 3.

The demonstration in fig. 4 is designed to illustrate how the mean and precision latent variables cooperate to represent the input. Through the precision latent variables the model knows about pair-wise correlations in the image.

3. In practice, this term is not needed when the dimensionality of h^p is larger than x.
4. Notice how the mean latent variables compute a non-linear projection of a linear filter bank, akin to the most simplified "simple-cell" model of area V1 of the visual cortex, while the precision units perform an operation similar to the "complex-cell" model because rectified (squared) filter outputs are non-linearly pooled to produce their response. In this model, simple and complex cells perform their operations in parallel (not sequentially). If, however, we equate the factors used by the precision units to simple cells, we recover the standard model in which simple cells send their squared outputs to complex cells.

Fig. 3. Graphical model representation (with only three input variables): there are two sets of latent variables (the mean and the precision units) that are conditionally independent given the input pixels, and a set of deterministic factor nodes that connect triplets of variables (pairs of input variables and one precision unit).

Fig. 4. A) Input image patch. B) Reconstruction performed using only mean hiddens (i.e. W h^m + b^x) (top) and both mean and precision hiddens (bottom), that is, multiplying the patch on the top by the image-specific covariance Σ = (C diag(P h^p) C^T + I)^{-1}; see the mean of the Gaussian in eq. 7. C) Reconstructions produced by combining the correct image-specific covariance as above with the incorrect, hand-specified pixel intensities shown in the top row. Knowledge about pair-wise dependencies allows a blob of high or low intensity to be spread out over the appropriate region. D) Reconstructions produced like in C) showing that precision hiddens do not account for polarity (nor for the exact intensity values of regions) but only for correlations.

For instance, it knows that the pixels in the lower part of the image in fig. 4-A are strongly correlated; these pixels are likely to take the same value, but the precision latent variables do not carry any information about which value this is. Then, very noisy information about the values of the individual pixels in the lower part of the image, as those provided by the mean latent variables, would be sufficient to reconstruct the whole region quite well, since the model knows which values can be smoothed. Mathematically, the interaction between mean and precision latent variables is expressed by the product between Σ (which depends only on h^p) and W h^m + b^x in the mean of the Gaussian distribution of eq. 7. We can repeat the same argument for the pixels in the top right corner and for those in the middle part of the image as well. Fig. 4-C illustrates this concept, while fig. 4-D shows that flipping the sign of the reconstruction of the mean latent variables flips the sign of the overall reconstruction, as expected. Information about intensity is propagated over each region thanks to the pair-wise dependencies captured by the precision hidden units. Fig. 4-B shows the same using the actual mean intensity produced by the mean hidden units (top). The reconstruction produced by the model using the whole set of hidden units is very close to the input image, as can be seen in the bottom part of fig. 4-B.

In a mcRBM the posterior distribution over the latent variables p(h|x) is a product of Bernoullis, as shown in eqs. 8 and 4. This distribution is particularly easy to use in a standard DBN [20] where each layer is trained using a binary-binary RBM. Binary latent variables, however, are not very good at representing different real-values of …
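To make eqs. (5)-(8) concrete, here is a minimal numpy sketch (my own illustration, not the authors' code): for made-up toy dimensions and random parameters C, P, W, b^x, b^m, it computes the image-specific covariance and conditional mean of eq. (7) and the mean-unit posterior of eq. (8).

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, K, M = 4, 6, 5, 3                 # pixels, factors, precision units, mean units (toy sizes)

C  = rng.normal(size=(D, F))            # factor filters (columns of C)
P  = np.abs(rng.normal(size=(F, K)))    # P constrained to be non-negative
W  = rng.normal(size=(D, M))            # mean-unit filters
bx = rng.normal(size=D)                 # input biases
bm = rng.normal(size=M)                 # mean-unit biases

hp = rng.integers(0, 2, size=K)         # a binary state of the precision units
hm = rng.integers(0, 2, size=M)         # a binary state of the mean units

# Eq. (7): Sigma^{-1} = C diag(P h^p) C^T + I, conditional mean = Sigma (W h^m + b^x)
Sigma_inv = C @ np.diag(P @ hp) @ C.T + np.eye(D)
Sigma = np.linalg.inv(Sigma_inv)
mean = Sigma @ (W @ hm + bx)

# Eq. (8): the posterior over each mean latent variable is an independent sigmoid
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
x = rng.normal(size=D)                  # an arbitrary input patch
p_hm = sigmoid(W.T @ x + bm)

print("conditional mean:", mean)
print("p(h^m = 1 | x): ", p_hm)
```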



Page 7: 20140530.journal club

Demonstration of the Gated MRF

•  Fig. 4-A: original image (with strong correlation in the lower part)  •  Fig. 4-B to D: top row, reconstruction by mRBM;

bottom row, reconstruction by mcRBM  •  Fig. 4-C: given the observed information, mcRBM reproduces the correlations  •  Fig. 4-D: even with the polarity flipped, the interpretation still looks (probably) plausible


"Strong correlation" (annotation on the embedded figure)

Page 8: 20140530.journal club

Bases learned from natural images  •  Precision (matrix C) filters: a Topographic-Map-like representation

Fig. 8. Top: Precision filters (matrix C) of size 16x16 pixels learned on color image patches of the same size. Matrix P was initialized with a local connectivity inducing a two-dimensional topographic map. This makes nearby filters in the map learn similar features. Bottom left: Random subset of mean filters (matrix W). Bottom right (from top to bottom): independent samples drawn from the training data, mPoT, GRBM and PoT.

TABLE 1
Log-probability estimates of test natural images x, and exact log-probability ratios between the same test images x and random images n.

Model        log p(x)   log p(x)/p(n)
MoG            -85           68
GRBM FPCD     -152           -3
PoT FPCD      -101           65
mPoT CD       -109           82
mPoT PCD       -92           98
mPoT FPCD      -94          102

…mean filter resembling a similar Gabor specifies the initial intensity values before the interpolation.

The most intuitive way to test a generative model is to draw samples from it [49]. After training, we then run HMC for a very long time starting from a random image. The bottom right part of Fig. 8 shows that mPoT is actually able to generate samples that look much more like natural images than samples generated by a PoT model, which only models the pair-wise dependencies between pixels, or a GRBM model, which only models the mean intensities. The mPoT model succeeds in generating noisy texture patches, smooth patches, and patches with long elongated structures that cover the whole image.

More quantitatively, we have compared the different models in terms of their log-probability on a hold-out dataset of test image patches using a technique called annealed importance sampling [50]. Tab. 1 shows great improvements of mPoT over both GRBM and PoT. However, the estimated log-probability is lower than a mixture of Gaussians (MoG) with diagonal covariances and the same number of parameters as the other models. This is consistent with previous work by Theis et al. [51]. We interpret this result with caution since the estimates for GRBM, PoT and mPoT can have a large variance due to the noise introduced by the samplers, while the log-probability of the MoG is calculated exactly. This conjecture is confirmed by the results reported in the rightmost column showing the exact log-probability ratio between the same test images and Gaussian noise with the same covariance^12. This ratio is indeed larger for mPoT than MoG, suggesting inaccuracies in the former estimation. The table also reports results of mPoT models trained by using cheaper but less accurate approximations to the maximum likelihood gradient, namely Contrastive Divergence [52] and Persistent Contrastive Divergence [53]. The model trained with FPCD yields a higher likelihood ratio as expected [44].

We repeated the same data generation experiment using the extension of mPoT to high-resolution images, by training on large patches (of size 238x238 pixels) picked at random locations from a dataset of 16,000 gray-scale natural images that were taken from ImageNet [54]^13. The model was trained using 8x8 filters divided into four sets with different spatial phases. Each set tiles the image with a diagonal offset of two pixels from the previous set. Each set consists of 64 covariance filters and 16 mean filters. Parameters were learned using the algorithm described in sec. 3, but setting P to identity.

The first two rows of fig. 9 compare samples drawn from PoT with samples drawn from mPoT and show that the latter ones exhibit strong structure with smooth regions separated by sharp edges, while the former ones lack any sort of long range structure. Yet, mPoT samples still look rather artificial because the structure is fairly primitive and repetitive. We then made the model deeper by adding two layers on the top of mPoT.

All layers are trained by using FPCD but, as training proceeds, the number of Markov chain steps between weight updates is increased from 1 to 100 at the topmost layer in order to obtain a better approximation to the maximum likelihood gradient. The second hidden layer has filters of size 3x3 that also tile the image with a diagonal offset of one. There are 512 filters in each set. Finally, the third layer has filters of size 2x2 and it uses a diagonal offset of one; there are 2048 filters in each set. Every layer performs spatial subsampling by a factor equal to the size of the filters used. This is compensated by an increase in the number of channels which take contributions from the …

12. The log probability ratio is exact since the intractable partition function cancels out. This quantity is equal to the difference of energies.
13. Categories are: tree, dog, cat, vessel, office furniture, floor lamp, desk, room, building, tower, bridge, fabric, shore, beach and crater.
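The exactness claimed in footnote 12 is easy to spell out; writing each model as p(x) = exp(−E(x))/Z with an intractable partition function Z, the ratio only involves energies (a short derivation, my own, not from the paper):

```latex
% Why the log-probability ratio in Table 1 is exact:
\log \frac{p(\mathbf{x})}{p(\mathbf{n})}
  = \log \frac{\exp(-E(\mathbf{x}))/Z}{\exp(-E(\mathbf{n}))/Z}
  = -E(\mathbf{x}) + E(\mathbf{n})
% The partition function Z cancels, leaving a difference of energies,
% which can be evaluated exactly for the test image x and the noise image n.
```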


Page 9: 20140530.journal club

Evaluating reconstruction by sampling

•  Original images  •  mPoT (the proposed model)  •  GRBM  •  PoT


Page 10: 20140530.journal club

Scaling up to high resolution

•  As it stands, the model is just a processor for image patches  •  Extending the graphical model to the whole image is

computationally too expensive: Full Connection (A)  •  Instead, split into blocks and handle it with something Convolution-like (D): Tiled Convolution (C, E)

Figure 2: Illustration of different choices of weight-sharing scheme for an RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away from each other. Best viewed in color.

where h^m is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is:

E^m(x, h^m) = (1/2) x^T x − Σ_j h^m_j W_j^T x   (4)

yielding the following conditional distribution over the input pixels:

p(x | h^c, h^m) = N(Σ (W h^m), Σ),   Σ = (Σ_c^{-1} + I)^{-1}   (5)

with Σ_c defined in eq. 2. As desired, the conditional distribution has non-zero mean^2.

Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different mean. Unfortunately, replicating potentials over the image ignoring variations of mean intensity has been the leading strategy up to date [6]^3. This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM.

3 Weight-Sharing Schemes

By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF.
Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. This is not practical because 1) the number of parameters grows about quadratically with the size of the image, making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size.
Conv: The most popular way to handle big images is then to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space …

2. The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the "precision matrix" in a specific direction and the mean along that same direction.
3. The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables.
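A minimal sketch (my own illustration, not the authors' code) of the tiled weight-sharing idea described above: each filter set is applied with a stride equal to the filter diameter so it tiles the image without overlapping itself, and different sets are shifted by a diagonal offset. The sizes, the random filter bank, and the offsets are purely illustrative.

```python
import numpy as np

def tiled_responses(image, filters, offsets):
    """Apply each filter set with stride == filter size (tiling),
    shifting each set by its own diagonal offset."""
    k = filters.shape[-1]                      # filter size (k x k), stride = k
    outputs = []
    for f_set, off in zip(filters, offsets):
        shifted = image[off:, off:]            # diagonal offset for this filter set
        h = (shifted.shape[0] // k) * k
        w = (shifted.shape[1] // k) * k
        tiles = shifted[:h, :w].reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
        # one response per (tile position, filter in the set)
        outputs.append(np.einsum('ijkl,fkl->ijf', tiles, f_set))
    return outputs                             # list over filter sets

# toy example: 2 sets of 4 random 8x8 filters, with diagonal offsets 0 and 2 pixels
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
filters = rng.normal(size=(2, 4, 8, 8))
responses = tiled_responses(image, filters, offsets=[0, 2])
print([r.shape for r in responses])
```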

Visually, it looks like this


Fig. 7. A toy illustration of how units are combined across layers. Squares are filters, gray planes are channels and circles are latent variables. Left: illustration of how input channels are combined into a single output channel. Input variables at the same spatial location across different channels contribute to determine the state of the same latent variable. Input units falling into different tiles (without overlap) determine the state of nearby units in the hidden layer (here, we have only two spatial locations). Right: illustration of how filters that overlap with an offset contribute to hidden units that are at the same output spatial location but in different hidden channels. In practice, the deep model combines these two methods at each layer to map the input channels into the output channels.

…in the image, the representation will be about k times overcomplete^8.

Like other recent papers [47], [48], we propose an intermediate solution that yields a better trade-off between compactness of the parameterization and compactness of the latent representation: Tiling [13], [22]. Each local filter is replicated so that it tiles the image without overlaps with itself (i.e. it uses a stride that is equal to the diameter of the filter). This reduces the spatial redundancy of the latent variables and allows the input images to have arbitrary size without increasing the number of parameters. To reduce tiling artifacts, different filters typically have different spatial phases so the borders of the different local filters do not all align. In our experiments, we divide the filters into sets with different phases. Filters in the same set are applied to the same image locations (to tile the image), while filters in different sets have a fixed diagonal offset, as shown in fig. 6. In our experiments we only experimented with diagonal offsets but presumably better schemes could be employed, where the whole image is evenly tiled but at the same time the number of overlaps between filter borders is minimized.

At any given layer, the model produces a three-dimensional tensor of values of its latent variables. Two dimensions of this tensor are the dimensions of the 2-D image and the third dimension, which we call a "channel", corresponds to filters with different parameters. The left panel of fig. 7 shows how each unit is connected to all patches centered at its location, across different input channels. The right panel of fig. 7 shows instead that output channels are generated by stacking filter outputs produced by different filters at nearby locations.

When computing the first layer hidden units we stack the precision and the mean hidden units in such a way that units taking input at the same image locations are placed at the same output location but in different channels. Therefore, each second layer RBM hidden unit models cross-correlations among mean and precision units at the first hidden layer.

8. In the statistics literature, a convolutional weight-sharing scheme is called a homogeneous field while a locally connected one that does not tie the parameters of filters at every location is called an inhomogeneous field.

6 EXPERIMENTS

In this section we first evaluate the mPoT gated MRF as a generative model by: a) interpreting the filters learned after training on an unconstrained distribution of natural images, b) drawing samples from the model and c) using the model on a denoising task^9. Second, we show that the generative model can be used as a way of learning features by interpreting the expected values of its latent variables as features representing the input. In both cases, generation and feature extraction, the performance of the model is significantly improved by adding layers of binary latent variables on the top of mPoT latent variables. Finally, we show that the generative ability of the model can be used to fill in occluded pixels in images before recognition. This is a difficult task which cannot be handled well without a good generative model of the input data.

6.1 Modeling Natural Images

We generated a dataset of 500,000 color images by picking, at random locations, patches of size 16x16 pixels from images of the Berkeley segmentation dataset^10. Images were preprocessed by PCA whitening retaining 99% of variance, for a total of 105 projections^11. We trained a model with 1024 factors, 1024 precision hiddens and 256 mean hiddens using filters of the same size as the input. P was initialized with a sparse connectivity inducing a two-dimensional topography when filters in C are laid out on a grid, as shown in fig. 8. Each hidden unit takes as input a neighborhood of filter outputs that learn to extract similar features. Nearby hidden units in the grid use neighborhoods that overlap. Therefore, each precision hidden unit is not only invariant to the sign of the input, because filter outputs are squared, but also it is invariant to a whole subspace of local distortions (that are learned, since both C and P are learned). Roughly 90% of the filters learn to be balanced in the R, G, and B channels even though they were randomly initialized. They could be described by localized, orientation-tuned Gabor functions. The colored filters generally have lower spatial frequency and they naturally cluster together in large blobs.

The figure also shows some of the filters representing the mean intensities (columns of matrix W). These features are more complex, with color Gabor filters and on-center off-surround patterns. These features have different effects than the precision filters (see also fig. 4). For instance, a precision filter that resembles an even Gabor is responsible for encouraging interpolation of pixel values along the edge and discouraging interpolation across the edge, while a …

9. Code available at: www.cs.toronto.edu/~ranzato/publications/mPoT/mPoT.html
10. Available at: www.cs.berkeley.edu/projects/vision/grouping/segbench/
11. The linear transform is: S^{-1/2} U, where S is a diagonal matrix with eigenvalues on the diagonal entries and U is a matrix whose rows are the leading eigenvectors.

Ranzato  et  al,  NIPS  2010
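As an aside, footnote 11 describes the whitening transform S^{-1/2} U; here is a minimal sketch of that standard recipe (my assumption of the details, not the authors' code), applied to a hypothetical matrix of image patches.

```python
import numpy as np

def pca_whiten(X, retain=0.99):
    """X: (n_samples, n_dims) patches. Returns whitened projections,
    keeping enough leading eigenvectors to retain `retain` of the variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)           # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # descending order
    k = np.searchsorted(np.cumsum(eigval) / eigval.sum(), retain) + 1
    U = eigvec[:, :k].T                             # rows = leading eigenvectors
    S_inv_sqrt = np.diag(1.0 / np.sqrt(eigval[:k]))
    return Xc @ (S_inv_sqrt @ U).T                  # whitened data, (n_samples, k)

patches = np.random.default_rng(0).normal(size=(1000, 768))  # e.g. 16x16x3 patches
print(pca_whiten(patches).shape)
```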

Page 11: 20140530.journal club

Going deep

•  The latent variables can be written as binary values, so why not go deep?

DBNs for Classification

•  After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.

•  Clearly unsupervised learning helps generalization. It ensures that most of the information in the weights comes from modeling the input data.
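A rough sketch (my own illustration, not the tutorial's code) of the greedy layer-wise idea: treat the posteriors of the already-trained first layer as data and train a binary-binary RBM on top of them with one-step contrastive divergence (CD-1). Here the first-layer posteriors H1 are faked with random binary data just to show the interface; in the paper they would be the mcRBM/mPoT latent posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def train_binary_rbm(H1, n_hidden=64, lr=0.01, epochs=5):
    """CD-1 training of a binary-binary RBM on first-layer posteriors H1
    (rows = examples, columns = first-layer units). Returns (W, b_vis, b_hid)."""
    n_vis = H1.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hidden))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in H1:
            ph0 = sigmoid(v0 @ W + b_hid)                  # up pass
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b_vis)                # down pass (reconstruction)
            ph1 = sigmoid(pv1 @ W + b_hid)                 # up pass again
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            b_vis += lr * (v0 - pv1)
            b_hid += lr * (ph0 - ph1)
    return W, b_vis, b_hid

# H1 would come from the gated MRF, e.g. concatenated E[h^m|x] and E[h^p|x];
# here it is random binary data purely for illustration.
H1 = (rng.random((200, 100)) < 0.3).astype(float)
W2, bv2, bh2 = train_binary_rbm(H1)
print(W2.shape)
```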

(Figure from the tutorial: a DBN for MNIST classification, showing the Pretraining, Unrolling and Fine-tuning stages; stacked RBMs with 500, 500 and 2000 hidden units, a 10-way softmax output, and weights W_i adjusted to W_i + ε during fine-tuning.)

We are here:

mcRBM/mPoT

Salakhutdinov  CVPR  2012  Tutorial

Page 12: 20140530.journal club

Reconstruction with mPoT + DBN

•  PoT  •  mPoT  •  DBN (2 layers) + mPoT  •  Time course of DBN + mPoT

Fig. 9. Top row: Three representative samples generated by PoT after training on an unconstrained distribution of high-resolution gray-scale natural images. Second row: Representative samples generated by mPoT. Third row: Samples generated by a DBN with three hidden layers, whose first hidden layer is the gated MRF used for the second row. Bottom row: as in the third row, but samples (scanned from left to right) are taken at intervals of 50,000 iterations from the same Markov chain. All samples have approximate resolution of 300x300 pixels.

…filters that are applied with a different offset, see fig. 7. This implies that at the top hidden layer each latent variable receives input from a very large patch in the input image. With this choice of filter sizes and strides, each unit at the topmost layer affects a patch of size 70x70 pixels. Nearby units in the top layer represent very large (overlapping) spatial neighborhoods in the input image.

After training, we generate from the model by performing 100,000 steps of blocked Gibbs sampling in the topmost RBM (using eq. 15 and 16) and then projecting the samples down to image space as described in sec. 4. Representative samples are shown in the third row of fig. 9. The extra hidden layers do indeed make the samples look more "natural": not only are there smooth regions and fairly sharp boundaries, but also there is a generally increased variety of structures that can cover a large extent of the image. These are among the most realistic samples drawn from models trained on an unconstrained distribution of high-resolution natural images. Finally, the last row of fig. 9 shows the evolution of the sampler. Typically, Markov chains slowly and smoothly evolve, morphing one structure into another.

TABLE 2
Denoising performance using σ = 20 (PSNR = 22.1 dB).

Method        Barb.  Boats  Fgpt.  House  Lena   Peprs.
mPoT          28.0   30.0   27.6   32.2   31.9   30.7
mPoT+A        29.2   30.2   28.4   32.4   32.0   30.7
mPoT+A+NLM    30.7   30.4   28.6   32.9   32.4   31.0
FoE [11]      28.3   29.8    -     32.3   31.9   30.6
NLM [55]      30.5   29.8   27.8   32.4   32.0   30.3
GSM [56]      30.3   30.4   28.6   32.4   32.7   30.3
BM3D [57]     31.8   30.9    -     33.8   33.1   31.3
LSSC [58]     31.6   30.9   28.8   34.2   32.9   31.4

6.2 Denoising

The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. We restore images by maximum a-posteriori (MAP) estimation. In the log domain, this amounts to solving the following optimization problem: argmin_x λ||y − x||^2 + F(x; θ), where y is the observed noisy image, F(x; θ) is the mPoT energy function (see eq. 12), λ is an hyper-parameter which is inversely proportional to the noise variance and x is an estimate of the clean image. In our experiments, the optimization is performed by gradient descent.

For images with repetitive texture, generic prior models usually offer only modest denoising performance compared to non-parametric models [45], such as non-local means (NLM) [55] and BM3D [57], which exploit "image self-similarity" in order to denoise. The key idea is to compute a weighted average of all the patches within the test image that are similar to the patch that surrounds the pixel whose value is being estimated, and the current state-of-the-art method for image denoising [58] adapts sparse coding to take that weighted average into account. Here, we adapt the generic prior learned by mPoT in two simple ways: 1) first we adapt the parameters to the denoised test image (mPoT+A) and 2) we add to the denoising loss an extra quadratic term pulling the estimate close to the denoising result of the non-local means algorithm [55] (mPoT+A+NLM). The first approach is inspired by a large body of literature on sparse coding [59], [58] and consists of a) using the parameters learned on a generic dataset of natural images to denoise the test image and b) further adapting the parameters of the model using only the test image denoised at the previous step. The second approach simply consists of adding an additional quadratic term to the function subject to minimization, which becomes λ||y − x||^2 + F(x; θ) + γ||x_NLM − x||^2, where x_NLM is the solution of the NLM algorithm.

Table 2 summarizes the results of these methods, comparing them to the current state-of-the-art methods on widely used benchmark images at an intermediate level of noise. At lower noise levels the difference between these methods becomes negligible, while at higher noise levels parametric methods start outperforming non-parametric ones.
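A minimal sketch of the MAP denoising procedure described above (my own illustration): gradient descent on λ||y − x||^2 + F(x; θ), with a hand-written quadratic smoothness prior standing in for the learned mPoT energy F(x; θ).

```python
import numpy as np

def denoise_map(y, grad_prior, lam=0.05, lr=0.1, steps=200):
    """MAP denoising by gradient descent on lam*||y - x||^2 + F(x),
    where grad_prior(x) returns dF/dx. Starts from the noisy image y."""
    x = y.copy()
    for _ in range(steps):
        grad = 2.0 * lam * (x - y) + grad_prior(x)
        x -= lr * grad
    return x

# Stand-in prior gradient: quadratic smoothness on horizontal/vertical differences.
# (The paper uses the learned mPoT energy here instead.)
def grad_smoothness(x, weight=0.1):
    g = np.zeros_like(x)
    dh = x[:, 1:] - x[:, :-1]
    dv = x[1:, :] - x[:-1, :]
    g[:, 1:] += weight * dh; g[:, :-1] -= weight * dh
    g[1:, :] += weight * dv; g[:-1, :] -= weight * dv
    return g

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))         # a toy "image"
noisy = clean + rng.normal(0.0, 20 / 255.0, clean.shape)     # sigma = 20 on a 0..255 scale
restored = denoise_map(noisy, grad_smoothness)
print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("restored MSE:", np.mean((restored - clean) ** 2))
```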

First, we observe that adapting the parameters and taking into account image self-similarity improves performance

Page 13: 20140530.journal club

Application 1: Denoising

•  Restore noisy images with mPoT, mPoT + NLM, and so on

Fig. 10. Denoising of "Barbara" (detail). From left to right: original image, noisy image (σ = 20, PSNR = 22.1 dB), denoising by mPoT (28.0 dB), denoising by mPoT adapted to this image (29.2 dB) and denoising by the adapted mPoT combined with non-local means (30.7 dB).

(over both baselines, mPoT+A and NLM). Second, on this task a gated MRF with mean latent variables is no better than a model that lacks them, like FoE [11] (which is a convolutional version of PoT)^14. Mean hiddens are crucial when generating images in an unconstrained manner, but are not needed for denoising since the observation y already provides (noisy) mean intensities to the model (a similar argument applies to inpainting). Finally, the performance of the adapted model is still slightly worse than the best denoising method.

6.3 Scene Classification

In this experiment, we use a classification task to compare SIFT features with the features learned by adding a second layer of Bernoulli latent variables that model the distribution of latent variables of an mPoT generative model. The task is to classify the natural scenes in the 15 scene dataset [2] into one of 15 categories. The method of reference on this dataset was proposed by Lazebnik et al. [2] and it can be summarized as follows: 1) densely compute SIFT descriptors every 8 pixels on a regular grid, 2) perform K-Means clustering on the SIFT descriptors, 3) compute histograms of cluster ids over regions of the image at different locations and spatial scales, and 4) use an SVM with an intersection kernel for classification.

We use a DBN with an mPoT front-end to mimic this pipeline. We treat the expected value of the latent variables as features that describe the input image. We extract first and second layer features (using the model that produced the generations in the bottom of fig. 9) from a regular grid with a stride equal to 8 pixels. We apply K-Means to learn a dictionary with 1024 prototypes and then assign each feature to its closest prototype. We compute a spatial pyramid with 2 levels for the first layer features ({h^m, h^p}) and a spatial pyramid with 3 levels for the second layer features (h^2). Finally, we concatenate the resulting representations and train an SVM with an intersection kernel for classification. Lazebnik et al. [2] reported an accuracy of 81.4% using SIFT while we obtained an accuracy of 81.2%, which is not significantly different.
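A schematic sketch (my reconstruction of the pipeline described above, not the authors' code) of the bag-of-features steps: densely extracted descriptors, a K-means dictionary, hard assignment, and spatial-pyramid histograms. The descriptors here are random stand-ins for SIFT or the DBN latent posteriors, and the final intersection-kernel SVM is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_pyramid_histogram(assignments, grid_hw, n_words, levels=2):
    """assignments: cluster id per grid cell, reshaped to grid_hw = (H, W).
    Concatenates word histograms over 1x1, 2x2, ... spatial subdivisions."""
    ids = assignments.reshape(grid_hw)
    feats = []
    for level in range(levels):
        cells = 2 ** level
        for bi in np.array_split(np.arange(grid_hw[0]), cells):
            for bj in np.array_split(np.arange(grid_hw[1]), cells):
                block = ids[np.ix_(bi, bj)].ravel()
                feats.append(np.bincount(block, minlength=n_words))
    return np.concatenate(feats).astype(float)

# toy stand-in: descriptors on an 8x8 grid, 64-dim each, for 20 "images"
rng = np.random.default_rng(0)
descriptors = [rng.normal(size=(64, 64)) for _ in range(20)]   # per image: (grid cells, dim)
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(np.vstack(descriptors))
X = np.stack([
    spatial_pyramid_histogram(kmeans.predict(d), grid_hw=(8, 8), n_words=16)
    for d in descriptors
])
print(X.shape)   # these histograms would then feed an intersection-kernel SVM
```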

6.4 Object Recognition on CIFAR 10The CIFAR 10 dataset [60] is a hand-labeled subset of amuch larger dataset of 80 million tiny images [61], see

14. The difference of performance between Tiled PoT and convolutionalPoT, also called Field of Experts, is not statistically significant on this task.

Fig. 11. Example of images in the CIFAR 10 dataset. Each columnshows samples belonging to the same category.

TABLE 3Test and training (in parenthesis) recognition accuracy on the

CIFAR 10 dataset. The numbers in italics are the featuredimensionality at each stage.

Method Accuracy %

1) mean (GRBM): 11025 59.7 (72.2)2) cRBM (225 factors): 11025 63.6 (83.9)3) cRBM (900 factors): 11025 64.7 (80.2)4) mcRBM: 11025 68.2 (83.1)5) mcRBM-DBN (11025-8192) 70.7 (85.4)6) mcRBM-DBN (11025-8192-8192) 71.0 (83.6)7) mcRBM-DBN (11025-8192-4096-1024-384) 59.8 (62.0)

fig. 11. These images were downloaded from the web anddown-sampled to a very low resolution, just 32x32 pixels.The CIFAR 10 subset has ten object categories, namely air-plane, car, bird, cat, deer, dog, frog, horse, ship, and truck.The training set has 5000 samples per class, the test set has1000 samples per class. The low resolution and extremevariability make recognition very difficult and a traditionalmethod based on features extracted at interest-points isunlikely to work well. Moreover, extracting features fromsuch images using carefully engineered descriptors likeSIFT [3] or GIST [62] is also likely to be suboptimal sincethese descriptors were designed to work well on higherresolution images.

We use the following protocol. We train a gated MRF on 8x8 color image patches sampled at random locations, and then we apply the algorithm to extract features convolutionally over the whole 32x32 image by extracting features on a 7x7 regularly spaced grid (stepping every 4 pixels). Then, we use a multinomial logistic regression classifier

"Barbara" image + AWGN (σ=20, PSNR=22.1 [dB])

mPoT (PSNR=28.0 [dB])

mPoT+A (PSNR=29.2 [dB])

mPoT+A+NLM (so NLM ends up in there after all...) (PSNR=30.7 [dB])


Fig. 9. Top row: Three representative samples generated by PoT after training on an unconstrained distribution of high-resolution gray-scale natural images. Second row: Representative samples generated by mPoT. Third row: Samples generated by a DBN with three hidden layers, whose first hidden layer is the gated MRF used for the second row. Bottom row: as in the third row, but samples (scanned from left to right) are taken at intervals of 50,000 iterations from the same Markov chain. All samples have approximate resolution of 300x300 pixels.

filters that are applied with a different offset, see fig. 7. This implies that at the top hidden layer each latent variable receives input from a very large patch in the input image. With this choice of filter sizes and strides, each unit at the topmost layer affects a patch of size 70x70 pixels. Nearby units in the top layer represent very large (overlapping) spatial neighborhoods in the input image.

After training, we generate from the model by performing 100,000 steps of blocked Gibbs sampling in the topmost RBM (using eq. 15 and 16) and then projecting the samples down to image space as described in sec. 4. Representative samples are shown in the third row of fig. 9. The extra hidden layers do indeed make the samples look more "natural": not only are there smooth regions and fairly sharp boundaries, but also there is a generally increased variety of structures that can cover a large extent of the image. These are among the most realistic samples drawn from models trained on an unconstrained distribution of high-resolution natural images. Finally, the last row of fig. 9 shows the evolution of the sampler. Typically, Markov chains slowly and smoothly evolve, morphing one structure into another.
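For intuition, blocked Gibbs sampling in the topmost (binary-binary) RBM alternates between sampling all hiddens given the visibles and vice versa. The following is a minimal sketch of that loop; it is a generic binary-RBM stand-in for the paper's eq. 15-16, and the parameters W, b_vis, b_hid are assumed to come from prior training.

# Minimal sketch of blocked Gibbs sampling in a binary RBM (assumed stand-in
# for the top-level RBM of the DBN; W, b_vis, b_hid are trained parameters).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(W, b_vis, b_hid, n_steps=100_000):
    # Alternate h ~ p(h|v) and v ~ p(v|h), starting from a random visible state.
    v = rng.integers(0, 2, size=b_vis.shape).astype(np.float64)
    for _ in range(n_steps):
        p_h = sigmoid(W @ v + b_hid)                       # p(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(np.float64)
        p_v = sigmoid(W.T @ h + b_vis)                     # p(v_i = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(np.float64)
    return v   # the final top-level state is then projected down to image space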

TABLE 2
Denoising performance using σ = 20 (PSNR = 22.1 dB).

Method         Barb.  Boats  Fgpt.  House  Lena   Peprs.
mPoT           28.0   30.0   27.6   32.2   31.9   30.7
mPoT+A         29.2   30.2   28.4   32.4   32.0   30.7
mPoT+A+NLM     30.7   30.4   28.6   32.9   32.4   31.0
FoE [11]       28.3   29.8   -      32.3   31.9   30.6
NLM [55]       30.5   29.8   27.8   32.4   32.0   30.3
GSM [56]       30.3   30.4   28.6   32.4   32.7   30.3
BM3D [57]      31.8   30.9   -      33.8   33.1   31.3
LSSC [58]      31.6   30.9   28.8   34.2   32.9   31.4

6.2 Denoising

The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. We restore images by maximum a-posteriori (MAP) estimation. In the log domain, this amounts to solving the following optimization problem: argmin_x λ||y − x||^2 + F(x; θ), where y is the observed noisy image, F(x; θ) is the mPoT energy function (see eq. 12), λ is a hyper-parameter which is inversely proportional to the noise variance, and x is an estimate of the clean image. In our experiments, the optimization is performed by gradient descent.
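A minimal sketch of this MAP denoising by gradient descent is given below. It is hedged: the learned mPoT energy F(x; θ) is replaced by a simple PoT-style prior with two fixed derivative filters, and the filter choice, step size, λ and γ values are illustrative assumptions, not the paper's.

# Toy MAP denoising: minimize  lam * ||y - x||^2 + F(x)  by gradient descent,
# where F(x) = sum_k sum_i gamma * log(1 + (c_k * x)_i^2) is a simplified
# PoT-style energy with fixed derivative filters standing in for learned C.
import numpy as np
from scipy.ndimage import convolve

def pot_energy_grad(x, filters, gamma=1.0):
    # Gradient of sum_k sum_i gamma*log(1 + (c_k * x)_i^2) w.r.t. the image x.
    grad = np.zeros_like(x)
    for c in filters:
        r = convolve(x, c, mode="wrap")                  # filter response c_k * x
        w = 2.0 * gamma * r / (1.0 + r ** 2)             # derivative of gamma*log(1+r^2)
        grad += convolve(w, np.flip(c), mode="wrap")     # adjoint: map back to image space
    return grad

def map_denoise(y, lam=0.05, n_steps=300, lr=0.5, gamma=1.0):
    # Gradient descent on lam*||y-x||^2 + F(x), starting from the noisy image.
    filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]   # assumed filters
    x = y.copy()
    for _ in range(n_steps):
        g = 2.0 * lam * (x - y) + pot_energy_grad(x, filters, gamma)
        x -= lr * g
    return x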

For images with repetitive texture, generic prior models usually offer only modest denoising performance compared to non-parametric models [45], such as non-local means (NLM) [55] and BM3D [57], which exploit "image self-similarity" in order to denoise. The key idea is to compute a weighted average of all the patches within the test image that are similar to the patch that surrounds the pixel whose value is being estimated, and the current state-of-the-art method for image denoising [58] adapts sparse coding to take that weighted average into account. Here, we adapt the generic prior learned by mPoT in two simple ways: 1) first we adapt the parameters to the denoised test image (mPoT+A) and 2) we add to the denoising loss an extra quadratic term pulling the estimate close to the denoising result of the non-local means algorithm [55] (mPoT+A+NLM). The first approach is inspired by a large body of literature on sparse coding [59], [58] and consists of a) using the parameters learned on a generic dataset of natural images to denoise the test image and b) further adapting the parameters of the model using only the test image denoised at the previous step. The second approach simply consists of adding an additional quadratic term to the function subject to minimization, which becomes λ||y − x||^2 + F(x; θ) + γ||x_NLM − x||^2, where x_NLM is the solution of the NLM algorithm.
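The mPoT+A+NLM variant only changes the gradient of the objective. Continuing the toy sketch above (and reusing numpy and pot_energy_grad from it), x_nlm is assumed to be a precomputed NLM result; γ is an illustrative value.

# Same toy setup as the earlier denoising sketch, with the extra quadratic term
# gamma * ||x_nlm - x||^2 pulling the estimate towards a precomputed NLM result.
def map_denoise_with_nlm(y, x_nlm, lam=0.05, gamma=0.05, n_steps=300, lr=0.5):
    filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]   # assumed filters
    x = y.copy()
    for _ in range(n_steps):
        g = (2.0 * lam * (x - y)
             + pot_energy_grad(x, filters)        # defined in the earlier sketch
             + 2.0 * gamma * (x - x_nlm))         # gradient of gamma*||x_nlm - x||^2
        x -= lr * g
    return x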

Table 2 summarizes the results of these methods, comparing them to the current state-of-the-art methods on widely used benchmark images at an intermediate level of noise. At lower noise levels the difference between these methods becomes negligible, while at higher noise levels parametric methods start outperforming non-parametric ones.

First, we observe that adapting the parameters and taking into account image self-similarity improves performance

Page 14: 20140530.journal club

Application 2: Scene Classification

• Scene database of Lazebnik et al. (2006)
• mPoT + DBN (2 layers) + K-means → 81.2% accuracy
• SIFT + SVM → 81.4% accuracy (Lazebnik '06)

• I'll skip anything beyond this (not much more data is reported in the paper — that's my excuse)

Page 15: 20140530.journal club

Application 3: Object Recognition

• Classification on the CIFAR 10 dataset (Torralba et al. '08)
  – Train: 5K/cls, Test: 1K/cls

Fig. 10. Denoising of "Barbara" (detail). From left to right: original image, noisy image (σ = 20, PSNR = 22.1 dB), denoising of mPoT (28.0 dB), denoising of mPoT adapted to this image (29.2 dB), and denoising of adapted mPoT combined with non-local means (30.7 dB).



TABLE 4
Test recognition accuracy on the CIFAR 10 dataset produced by different methods. Features are fed to a multinomial logistic regression classifier for recognition.

Method                                        Accuracy %
384 dimens. GIST                              54.7
10,000 linear random projections              36.0
10K GRBM(*), 1 layer, ZCA'd images            59.6
10K GRBM(*), 1 layer                          63.8
10K GRBM(*), 1 layer with fine-tuning         64.8
10K GRBM-DBN(*), 2 layers                     56.6
11025 mcRBM, 1 layer, PCA'd images            68.2
8192 mcRBM-DBN, 3 layers, PCA'd images        71.0
384 mcRBM-DBN, 5 layers, PCA'd images         59.8

to recognize the object category in the image. Since our model is unsupervised, we train it on a set of two million images from the TINY dataset that does not overlap with the labeled CIFAR 10 subset, in order to further improve generalization [20], [63], [64]. In the default set-up we learn all parameters of the model: we use 81 filters in W to encode the mean, 576 filters in C to encode covariance constraints, and we pool these filters into 144 hidden units through the matrix P. P is initialized with a two-dimensional topography that takes 3x3 neighborhoods of filters with a stride equal to 2. In total, at each location we extract 144+81=225 features. Therefore, we represent a 32x32 image with a 225x7x7=11025 dimensional descriptor.
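As a sanity check on the descriptor size quoted above, a tiny (assumed) computation reproduces the numbers.

# Descriptor-size sanity check for the CIFAR 10 setup described above.
patch, stride, image = 8, 4, 32
grid = (image - patch) // stride + 1            # (32 - 8) / 4 + 1 = 7 positions per axis
mean_units, pooled_cov_units = 81, 144          # 576 covariance filters pooled 4-to-1 by P
per_location = mean_units + pooled_cov_units    # 225 features at each grid point
descriptor_dim = per_location * grid * grid     # 225 * 7 * 7 = 11025
print(grid, per_location, descriptor_dim)       # -> 7 225 11025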

Table 3 shows some comparisons. First, we assess whether it is best to model just the mean intensity, just the covariance, or both, in 1), 2) and 4). In order to make a fair comparison we used the same feature dimensionality. The covariance part of the model produces features that are more discriminative, but modelling both mean and covariance further improves generalization. In 2) and 3) we show that increasing the number of filters in C while keeping the same feature dimensionality (by pooling more with matrix P) also improves the performance. We can allow for a large number of features as long as we pool later into a more compact and invariant representation. Entries 4), 5) and 6) show that adding an extra layer on the top (by training a binary RBM on the 11025 dimensional feature) improves generalization. Using three stages and 8192 features we achieved the best performance of 71.0%.

We also compared to the more compact 384 dimensional representation produced by GIST and found that our features are more discriminative, as shown in table 4. Previous results using GRBMs [60] reported an accuracy of 59.6% using whitened data, while we achieve 68.2%. Their result improved to 64.8% by using unprocessed data and by fine-tuning, but it did not improve by using a deeper model. Our performance improves by adding other layers, showing that these features are more suitable for use in a DBN. The current state-of-the-art result on this dataset is 80.5% [65], and it employs a much bigger network and perturbation of the input to improve generalization.

Fig. 12. Top: Samples generated by a five-layer deep model trained on faces. The top layer has 128 binary latent variables and images have size 48x48 pixels. Bottom: comparison between six samples from the model (top row) and the Euclidean distance nearest neighbor images in the training set (bottom row).

TABLE 5
TFD: Facial expression classification accuracy using features trained without supervision.

Method         layer 1  layer 2  layer 3  layer 4
raw pixels     71.5     -        -        -
Gaussian SVM   76.2     -        -        -
Sparse Coding  74.6     -        -        -
Gabor PCA      80.2     -        -        -
GRBM           80.0     81.5     80.3     79.5
PoT            79.4     79.3     80.6     80.2
mPoT           81.6     82.1     82.5     82.4

6.5 Recognition of Facial Expressions

In these experiments we study the recognition of facial expressions under occlusion in the Toronto Face Database (TFD) [66]. This is the largest publicly available dataset of faces to date, created by merging together 30 pre-existing datasets [66]. It has about 100,000 images that are unlabeled and more than 4,000 images that are labeled with seven facial expressions, namely: anger, disgust, fear, happiness, sadness, surprise and neutral. Faces were preprocessed by detection and alignment using the Machine Perception Toolbox [67], followed by down-sampling to a common resolution of 48x48 pixels. We choose to predict facial expressions under occlusion because this is a particularly difficult problem: the expression is a subtle property of faces that requires good representations of detailed local features, which are easily disrupted by occlusion.

Since the input images have fairly low resolution and the statistics across the images are strongly non-stationary (because the faces have been aligned), we trained a deep model without weight-sharing. The first layer uses filters of size 16x16 centered at grid-points that are four pixels apart, with 32 covariance filters and 16 mean filters at each grid-point. At the second layer we learn a fully-connected RBM with 4096 latent variables, each of which

Page 16: 20140530.journal club

Application 4: Face Image Generation and Recognition

• Toronto Face Dataset (TFD, Susskind et al. '10)
  – 48x48, 100K unlabeled, 4K labeled

• mPoT + DBN (up to 4 layers) + multi-class logistic regression (softmax?)


4-layer DBN (+mPoT) sampling

Comparison of generated images with the original (training) images


Page 17: 20140530.journal club

Application 4: Face Image Generation and Recognition (cont. 1)

• Generation with facial parts occluded
  – top row: input, bottom row: generated result

(Fig. 13 row labels: ORIGINAL, TYPE 1, TYPE 2, TYPE 3)

Fig. 13. Example of conditional generation performed by a four-layer deep model trained on faces. Each column is a different example (not used in the unsupervised training phase). The topmost row shows some example images from the TFD dataset. The other rows show the same images occluded by a synthetic mask (on the top) and their restoration performed by the deep generative model (on the bottom).

Fig. 14. An example of restoration of unseen images performed by propagating the input up to the first, second, third, and fourth layer, then back down through the four layers, and re-circulating the input through the same model ten times.

is connected to all of the first layer features. Similarly, at the third and fourth layer we learn fully connected

Fig. 15. Expression recognition accuracy on the TFD dataset when both training and test labeled images are subject to 7 types of occlusion.

RBMs with 1024 latent variables, and at the fifth layer we have an RBM with 128 hidden units. The deep model was trained entirely generatively on the unlabeled images, without using any labeled instances [64]. The discriminative training consisted of training a linear multi-class logistic regression classifier on the top level representation, without using back-propagation to jointly optimize the parameters across all layers.

Fig. 12 shows samples randomly drawn from the generative model. Most samples resemble plausible faces of different individuals with a nice variety of facial expressions, poses, and lighting conditions. Nearest neighbor analysis, using Euclidean distance, reveals that these samples are not just copies of images in the training set: each sample exhibits regularities derived from many different training cases.

In the first experiment, we train a linear classifier on the features produced by each layer and we predict the facial expression of the images in the labeled set. Each input image is processed by subtracting from each pixel the mean intensity in that image and then dividing by the standard deviation of the pixels in that image. The features at successive hidden layers give accuracies of 81.6%, 82.1%, 82.5%, and 82.4%. Each higher layer is an RBM with 4096 hidden units. These accuracies should be compared to: 71.5% achieved by a linear classifier on the raw pixels, 76.2% achieved by a Gaussian SVM on the raw pixels, 74.6% achieved by a sparse coding method [68], and 80.2% achieved by the method proposed by Dailey et al. [69], which employs a large Gabor filter bank followed by PCA and a linear classifier, using cross-validation to select the number of principal components. The latter method and its variants [70] are considered a strong baseline in the literature. Table 5 also reports a direct comparison to a DBN with a GRBM at the first layer and to a DBN with a PoT at the first layer using the same number of parameters and weight sharing scheme. mPoT outperforms its competitors on this task. The accuracies reported in the table are an average over 5 random training/test splits of the data, with 80% of the images used for training and the rest for test. Facial identities of subjects in training and test sets are disjoint.
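The per-image normalization and linear readout described above are easy to sketch. This is a hedged illustration only: the feature arrays below are random placeholders standing in for the hidden-layer activations, and the sklearn classifier is my choice of linear multi-class logistic regression, not the paper's implementation.

# Sketch of per-image normalization plus a linear readout on top-layer features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def normalize_image(img):
    # Subtract the image's own mean intensity and divide by its own std.
    img = img.astype(np.float64)
    return (img - img.mean()) / (img.std() + 1e-8)

# Placeholder hidden-layer features and expression labels (7 classes).
features_train = np.random.randn(100, 4096)
labels_train = np.random.randint(0, 7, 100)
clf = LogisticRegression(max_iter=1000).fit(features_train, labels_train)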

In the next experiment, we apply synthetic occlusions


Page 18: 20140530.journal club

Application 4: Face Image Generation and Recognition (cont. 2)

• Generation with facial parts occluded
  – contribution of each layer
  – feeding the generated image back in as input (see the sketch below)
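A minimal sketch of the "propagate up through the layers, then back down, and re-circulate" idea behind Fig. 14 is given below. It is an assumption-laden simplification: layer sizes and parameters are placeholders, deterministic mean-field passes replace the paper's sampling, and the sigmoid visible layer stands in for the actual Gaussian/mPoT bottom layer.

# Sketch: restore an occluded image by repeatedly propagating it up and down a
# stack of trained RBM-like layers (Ws, bs_up, bs_down are assumed pre-trained).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def up_down_pass(x, Ws, bs_up, bs_down):
    # One bottom-up then top-down sweep through all layers (mean-field).
    h = x
    for W, b in zip(Ws, bs_up):                           # upward: h = sigmoid(W h + b)
        h = sigmoid(W @ h + b)
    for W, b in zip(reversed(Ws), reversed(bs_down)):     # downward reconstruction
        h = sigmoid(W.T @ h + b)
    return h

def restore(x_occluded, observed, Ws, bs_up, bs_down, n_cycles=10):
    # Re-circulate the image through the model, keeping observed pixels clamped.
    # `observed` is a boolean array, True where the pixel value is known.
    x = x_occluded.copy()
    for _ in range(n_cycles):
        x_rec = up_down_pass(x, Ws, bs_up, bs_down)
        x = np.where(observed, x_occluded, x_rec)         # fill only the occluded pixels
    return x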

JOURNAL OF PAMI, VOL. ?, NO. ?, JANUARY 20?? 14

Fig. 13. Example of conditional generation performed by a four-layer deep model trained on faces. Each column is a differentexample (not used in the unsupervised training phase). The topmostrow shows some example images from the TFD dataset. The otherrows show the same images occluded by a synthetic mask (on thetop) and their restoration performed by the deep generative model(on the bottom).

Fig. 14. An example of restoration of unseen images performedby propagating the input up to first, second, third, fourth layer, andagain through the four layers and re-circulating the input through thesame model for ten times.

is connected to all of the first layer features. Similarly,at the third and fourth layer we learn fully connected

Fig. 15. Expression recognition accuracy on the TFD datasetwhen both training and test labeled images are subject to 7 typesof occlusion.

RBMs with 1024 latent variables, and at the fifth layerwe have an RBM with 128 hidden units. The deep modelwas trained entirely generatively on the unlabeled imageswithout using any labeled instances [64]. The discriminativetraining consisted of training a linear multi-class logisticregression classifier on the top level representation withoutusing back-propagation to jointly optimize the parametersacross all layers.

Fig. 12 shows samples randomly drawn from the gener-ative model. Most samples resemble plausible faces of dif-ferent individuals with a nice variety of facial expressions,poses, and lighting conditions. Nearest neighbor analysis,using Euclidean distance, reveals that these samples arenot just copies of images in the training set: Each sampleexhibits regularities derived from many different trainingcases.

In the first experiment, we train a linear classifier onthe features produced by each layer and we predict thefacial expression of the images in the labeled set. Eachinput image is processed by subtracting from each pixel themean intensity in that image then dividing by the standarddeviation of the pixels in that image. The features atsuccessive hidden layers give accuracies of 81.6%, 82.1%,82.5%, and 82.4%. Each higher layer is a RBM with 4096hidden units. These accuracies should be compared to:71.5% achieved by a linear classifier on the raw pixels,76.2% achieved by a Gaussian SVM on the raw pixels,74.6% achieved by a sparse coding method [68], and 80.2%achieved by the method proposed by Dailey et al. [69]which employs a large Gabor filter bank followed by PCAand a linear classifier, using cross-validation to select thenumber of principal components. The latter method andits variants [70] are considered a strong baseline in theliterature. Table 5 reports also a direct comparison to a DBNwith a GRBM at the first layer and to a DBN with a PoTat the first layer using the same number of parameters andweight sharing scheme. mPoT outperforms its competitorson this task. The accuracies reported on the table are anaverage over 5 random training/test splits of the data with80% of the images used for training and the rest for test.Facial identities of subjects in training and test sets aredisjoint.

In the next experiment, we apply synthetic occlusions

Page 19: 20140530.journal club

Summary

• Using a Gated MRF makes handling the latent variables easy
  – 2-body interactions → "mean"
  – 3-body interactions → "precision"
• The Gated MRF tries to cover the input space with hyper-ellipses → looks usable with fairly good accuracy (GRBM or PPCA would probably work too, but they use hyper-spheres, which could be a problem when correlations are strong)
• As a system:
  – build a low-level model (the patch-processing part?) and a high-level model (the high-resolution part?)
  – and hook the high-level part up to a DBN.
• The combined system was then applied to
  – object and scene recognition
  – face generation and recognition, etc., with results approaching the state of the art

Page 20: 20140530.journal club

Future work

• Isn't the inability to do exact maximum-likelihood estimation a drawback?
  – MCMC is also computationally expensive, so...

• Beyond that, it seems the model will just have to be adjusted per task and compute budget
• Textures do emerge, but they are still simple ones...
• For the Gated MRF
  – there is room to improve the form of the energy function
  – and the higher-level structure can be examined further

  (e.g., turning all the RBMs into gated MRFs)
  – or wouldn't mixing parametric and non-parametric approaches be a good idea?
(Did the paper actually say this?)
• What about applications to video and the like?

(こんなこと言ってたっけ?)  •  Video  とかへの応用はどうだろう?