Deep Learning for Image Denoising and Super-resolution

Deep Learning for Image Denoising and Super-resolution Yu Huang Sunnyvale, California [email protected]


DESCRIPTION

deep learning, MLP, Convolutional Network, Deep Belief Nets, Deep Boltzmann Machine, Stacked Denoising Auto-Encoder, Image Denoising, Image Superresolution

TRANSCRIPT

Page 1: Deep learning for image denoising and superresolution

Deep Learning for Image Denoising and Super-resolution

Yu Huang

Sunnyvale, California

[email protected]

Page 2: Deep learning for image denoising and superresolution

Outline
• Deep learning
• Why deep learning?
• State-of-the-art deep learning
• Parallel deep learning at Google
• Sparse coding
• Dictionary learning
• Multi-layer NN (MLP)
• Convolutional Neural Network
• Stacked Denoising Auto-Encoder
• Deep Belief Nets (DBN)
• Deep Boltzmann Machines (DBM)
• Generative model: MRF
• Deep Gated MRF
• Image denoising
• Image denoising by BM3D
• Image denoising by K-SVD
• Image denoising by CNN
• Image denoising by MLPs
• Image denoising by DBMs
• Image denoising by deep GMRF
• Image restoration by CNN
• Image super-resolution
• Example-based SR
• Sparse coding for SR
• Frame alignment-based SR
• Image super-resolution by DBMs
• Image super-resolution by DBNs
• Image SR by cascaded SAE
• Image SR by deep CNN
• References
• Appendix

Page 3: Deep learning for image denoising and superresolution

Appendix
• PCA, AP & spectral clustering
• NMF & pLSA
• ISOMAP
• LLE
• Laplacian Eigenmaps
• Gaussian mixture & EM
• Hidden Markov Model (HMM)
• Discriminative model: CRF
• Product of Experts
• Back propagation
• Stochastic gradient descent
• MCMC sampling for optimization approx.
• Mean field for optimization approx.
• Contrastive divergence for RBMs
• “Wake-sleep” algorithm for DBNs
• Two-stage pre-training for DBMs
• Greedy layer-wise unsupervised pre-training

Page 4: Deep learning for image denoising and superresolution

Gartner Emerging Tech Hype Cycle 2012

Page 5: Deep learning for image denoising and superresolution

Deep Learning
• Representation learning attempts to automatically learn good features or representations;
• Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high-level features);
• Becomes effective via unsupervised pre-training + supervised fine-tuning;
• Deep networks trained with back propagation alone (without unsupervised pre-training) perform worse than shallow networks;
• Deals with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised learning, regularizers);
• Semi-supervised setting: manifold-structure assumption; labeled data is scarce and unlabeled data is abundant.

Page 6: Deep learning for image denoising and superresolution

Deep Net Architectures

• Feed-Forward: multilayer neural nets, convolutional neural nets

• Feed-Back: stacked sparse coding, deconvolutional nets

• Bi-Directional: deep Boltzmann machines, stacked auto-encoders

Page 7: Deep learning for image denoising and superresolution

Why Deep Learning?
• Supervised training of deep models (e.g. many-layered nets) is too hard (an optimization problem);
• Learn a prior from unlabeled data;
• Shallow models are not suited to learning high-level abstractions;
• Ensembles or forests do not learn features first;
• Graphical models could be deep nets, but mostly are not.
• Unsupervised learning can be “local learning”;
• Resembles boosting, with each layer acting like a weak learner.
• Learning is weak in directed graphical models with many hidden variables;
• Sparsity and regularizers help.
• Traditional unsupervised learning methods do not easily learn multiple levels of representation;
• Layer-wise unsupervised learning is the solution.
• Multi-task learning (transfer learning and self-taught learning);
• Other issues: scalability & parallelism under the burden of big data.

Page 8: Deep learning for image denoising and superresolution

The Mammalian Visual Cortex is Hierarchical

Page 9: Deep learning for image denoising and superresolution

State-of-the-Art Deep Learning R&D
• Deep learning is the hottest topic in speech recognition;
• Performance records broken with deep learning methods;
• Microsoft, Google: DL-based speech recognition products.
• Deep learning is the hottest topic in computer vision;
• The record holders on ImageNet are convolutional nets.
• Deep learning is becoming hot in NLP;
• Deep learning / feature learning in applied mathematics: sparse coding, non-convex optimization, stochastic gradient algorithms.
• Transfer learning: inductive transfer, storing knowledge gained while solving one problem and applying it to a different but related problem; transfer the classification knowledge, adapt the model, or reduce the annotation effort.
• Self-taught learning: generic unlabeled data improves performance on a supervised learning task; relaxes the assumptions about the unlabeled data; uses unlabeled data to learn the best representation (dictionary) with sparse coding.

Page 10: Deep learning for image denoising and superresolution

Convolutional Neural Network’s Progress

• Driven by more data and GPUs; the networks have also become deeper and more non-linear.

Convolutional Neural Net 2012

Convolutional Neural Net 1998

Convolutional Neural Net 1988

Page 11: Deep learning for image denoising and superresolution

Convolutional Neural Network’s Progress

• Fukushima 1980: designed network with same basic structure but did not train by back propagation.

• LeCun from late 80s: figured out back propagation for CNN, popularized and deployed CNN for OCR applications and others.

• Poggio from 1999: same basic structure but learning is restricted to top layer (k-means at second stage)

• LeCun from 2006: unsupervised feature learning

• DiCarlo from 2008: large scale experiments, normalization layer

• LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised.

• Mallat from 2011: provides a theory behind the architecture

• Hinton 2012: use bigger nets, GPUs, more data

Page 12: Deep learning for image denoising and superresolution

DL Winner in Object Recognition• Won the 2012 ImageNet LSVRC. 60 Million parameters, 832M MAC ops;

• Convolutional Nets [Krizhevsky et al., 2012]

Page 13: Deep learning for image denoising and superresolution

Parallel Deep Learning at Google• More features always improve performance unless data is scarce;

• Deep learning methods have higher capacity and have the potential to model data better;

• However, big data needs deep learning to be scalable: lots of training samples (>10M), classes (>10K) and input dimensions (>10K).

• Distributed deep nets (easily distributed).

Model parallelism Model parallelism + data parallelism

Page 14: Deep learning for image denoising and superresolution

Scaling Across Multiple GPUs
• Two variations: 1) simulate the synchronous execution of SGD on one core; 2) an approximation of SGD, not a perfect simulation but working better;

• Two parallelisms: 1) model parallelism: Across the model dimension, where different workers train different parts of the model (amount of computation per neuron activity is high); 2) data parallelism: Across the data dimension, where different workers train on different data examples (amount of computation per weight is high);

• Observations: data parallelism for the convolutional layers and model parallelism for the fully connected layers;

• Convolutional layers cumulatively contain ~90-95% computation, ~5% of parameters;

• Fully-connected layers contain ~5-10% of the computation, ~95% of the parameters;

• Forward pass:

• Each of the K workers is given a different data batch of (let’s say) 128 examples;

• Each of the K workers computes all of the convolutional layer activities on its batch;

• To compute the fully-connected layer activities, the workers switch to model parallelism;

• Parallelism: three schemes of parallelism.

Page 15: Deep learning for image denoising and superresolution

Scaling Across Multiple GPUs
• Scheme I: each worker sends its last-stage convolutional layer activities to every other worker; the workers then assemble a big batch of activities for 128K examples and compute the fully-connected activities on this batch as usual;
• Scheme II: one of the workers sends its last-stage convolutional layer activities to all other workers; the workers then compute the fully-connected activities on this batch of 128 examples and begin to back-propagate the gradients for these 128 examples; in parallel with this computation, the next worker sends its last-stage convolutional layer activities to all other workers; the workers then compute the fully-connected activities on this second batch of 128 examples, and so on;
• Scheme III: all of the workers send 128/K of their last-stage convolutional layer activities to all other workers; the workers then proceed as in Scheme II;
• The backward pass is similar: the workers compute the gradients in the fully-connected layers in the usual way; the next step depends on the scheme used in the forward pass;
• Weight synchronization in the convolutional layers after the backward pass;
• Variable batch size (128K in the convolutional layers and 128 in the fully-connected layers).

Page 16: Deep learning for image denoising and superresolution
Page 17: Deep learning for image denoising and superresolution

Model parallelism: partition the model across machines. Data parallelism: asynchronous distributed stochastic gradient descent.

Page 18: Deep learning for image denoising and superresolution

Sparse Coding

• Sparse coding (Olshausen & Field, 1996).

• Originally developed to explain early visual processing in the brain (edge detection).

• Objective: Given a set of input data vectors learn a dictionary of bases such that:

• Each data vector is represented as a sparse linear combination of bases.

Sparse: mostly zeros
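The slide's objective is shown only as a figure; for reference, a standard written form of the sparse coding objective (with the L1 relaxation of sparsity) is:

$$\min_{D,\{\alpha_i\}} \; \sum_{i=1}^{N} \big\| x_i - D\alpha_i \big\|_2^2 \; + \; \lambda \sum_{i=1}^{N} \|\alpha_i\|_1 \quad \text{s.t. } \|d_k\|_2 \le 1,$$

where the $x_i$ are the input data vectors, $D=[d_1,\dots,d_K]$ is the learned dictionary of bases, and the $\alpha_i$ are the sparse codes (mostly zeros).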

Page 19: Deep learning for image denoising and superresolution

Predictive Sparse Coding• Recall the objective function for sparse coding:

• Modify by adding a penalty for prediction error:

• Approximate the sparse code with an encoder

• PSD for hierarchical feature training

• Phase 1: train the first layer;

• Phase 2: use encoder + absolute value as 1st feature extractor

• Phase 3: train the second layer;

• Phase 4: use encoder + absolute value as 2nd feature extractor

• Phase 5: train a supervised classifier on top layer;

• Phase 6: optionally train the whole network with supervised BP.

Page 20: Deep learning for image denoising and superresolution

Methods of Solving Sparse Coding• Greedy methods: projecting the residual on some atom;

• Matching pursuit, orthogonal matching pursuit;

• L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);

• The residual is updated iteratively in the direction of the atom;

• Gradient-based finding new search directions

• Projected Gradient Descent

• Coordinate Descent

• Homotopy: a set of solutions indexed by a parameter (regularization)

• LARS (Least Angle Regression)

• First order/proximal methods: Generalized gradient descent

• solving efficiently the proximal operator

• soft-thresholding for L1-norm

• Accelerated by the Nesterov optimal first-order method

• Iterative reweighting schemes

• L2-norm: Chartrand and Yin (2008)

• L1-norm: Candès et al. (2008)
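As a concrete illustration of the proximal / soft-thresholding approach listed above, here is a minimal ISTA sketch in NumPy; the dictionary D, the toy signal and the step size are hypothetical placeholders, not values from the slides.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm: element-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(D, x, lam=0.1, n_iter=200):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 by generalized gradient descent (ISTA)."""
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part's gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of 0.5*||x - D a||^2
        a = soft_threshold(a - grad / L, lam / L)
    return a

# toy example with a random overcomplete dictionary and a sparse ground truth
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
x = D @ (rng.standard_normal(256) * (rng.random(256) < 0.05))
code = ista(D, x)
print("nonzeros:", np.count_nonzero(code))
```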

Page 21: Deep learning for image denoising and superresolution

Strategy of Dictionary Selection• What D to use?• A fixed overcomplete set of basis: no adaptivity.

• Steerable wavelet;• Bandlet, curvelet, contourlet;• DCT Basis;• Gabor function;• ….

• Data-adaptive dictionary – learn from data;
• K-SVD: a generalized K-means clustering process for Vector Quantization (VQ);
• An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.

• Other methods of dictionary learning:• non-negative matrix decompositions.• sparse PCA (sparse dictionaries).• fused-lasso regularizations (piecewise constant dictionaries)

• Extending the models: Sparsity + Self-similarity=Group Sparsity

Page 22: Deep learning for image denoising and superresolution

Multi-Layer Neural Network
• A neural network = running several logistic regressions at the same time;

• Neuron=logistic regression or…

• Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule)

• Online learning: stochastic/incremental gradient descent

• Batch learning: conjugate gradient descent

Page 23: Deep learning for image denoising and superresolution

Problems in MLPs
• Multi-Layer Perceptrons (MLPs), a feed-forward neural network, were popularly used for decades.
• The gradient progressively gets more scattered;

• Below the top few layers, the correction signal is minimal

• Gets stuck in local minima • Especially start out far from ‘good’ regions (i.e., random initialization)

• In usual settings, use only labeled data • Almost all data is unlabeled! • Instead the human brain can learn from unlabeled data.

Page 24: Deep learning for image denoising and superresolution

Convolutional Neural Networks
• A CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input;
• Local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling;

• Related to generative MRF/discriminative CRF: • CNN=Field of Experts MRF=ML inference in CRF;

• Generate ‘patterns of patterns’ for pattern recognition.

• Each layer combines (merge, smooth) patches from previous layers• Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.

• Convolution filters: (translation invariance) unsupervised;

• Local contrast normalization: increase sparsity, improve optimization/invariance.

C layers convolutions, S layers pool/sample

Page 25: Deep learning for image denoising and superresolution

ConvNets
• Convolutional Networks are trainable multistage architectures composed of multiple stages;

• Input and output of each stage are sets of arrays called feature maps;

• At output, each feature map represents a particular feature extracted at all locations on input;

• Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;

• A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;

• A fully connected layer: softmax transfer function for posterior distribution.

• Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;

• Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

• In the rectified function, gi is a trainable gain parameter, which may be followed by a contrast normalization N;

• Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;

• Supervised training is performed using a form of SGD to minimize the prediction error;

• Gradients are computed with the back-propagation method.

• Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.

* is discrete convolution operator

Page 26: Deep learning for image denoising and superresolution
Page 27: Deep learning for image denoising and superresolution

LeNet (LeNet-5)

• A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits;

• Local receptive fields (5x5) with local connections;

• Output via an RBF function, one for each class, with 84 inputs each;

• Learning by Graph Transformer Networks (GTN);

Page 28: Deep learning for image denoising and superresolution

AlexNet
• A layered model composed of convolution and subsampling operations, followed by a holistic representation and all-in-all a landmark classifier;
• Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax;

• Fully-connected “FULL” layers: linear classifiers/matrix multiplications;

• ReLU are rectified-linear nonlinearities on layer output, can be trained several times faster;

• Local normalization scheme aids generalization;

• Overlapping pooling slightly less prone to overfitting;

• Data augmentation: artificially enlarge the dataset using label-preserving transformations;

• Dropout: setting to zero output of each hidden neuron with prob. 0.5;

• Trained by SGD with batch # 128, momentum 0.9, weight decay 0.0005.
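A minimal PyTorch sketch of an AlexNet-style network matching the description above (5 convolutional layers, ReLU, overlapping max-pooling, dropout 0.5, 3 fully-connected layers feeding a 1000-way softmax). Layer sizes follow the commonly cited configuration and are an illustrative approximation, not the authors' exact code (normalization layers and the two-GPU split are omitted).

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),               # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # softmax applied inside CrossEntropyLoss
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = AlexNetSketch()
# SGD with momentum 0.9 and weight decay 0.0005, as on the slide
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
out = model(torch.randn(2, 3, 227, 227))   # 227x227 input -> 6x6 final feature maps
print(out.shape)                           # torch.Size([2, 1000])
```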

Page 29: Deep learning for image denoising and superresolution

The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is 253,440–186,624–64,896–64,896–43,264–4,096–4,096–1,000.

Page 30: Deep learning for image denoising and superresolution

Generative Model: MRF
• Random Field: F={F1,F2,…,FM} is a family of random variables on a set S, in which each Fi takes a value fi in a label set L.

• Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it satisfies Markov property.

• Generative model for joint probability p(x)

• allows no direct probabilistic interpretation

• define potential functions Ψ on maximal cliques A

• map joint assignment to non-negative real number

• requires normalization

• An MRF is an undirected graphical model.

Page 31: Deep learning for image denoising and superresolution

Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.

• Can observe some of the variables and solve two problems:

• inference: Infer the states of the unobserved variables.

• learning: Adjust the interactions between variables to more likely generate the observed data.

stochastic hidden cause

visible effect

Use nets composed of layers of stochastic variables with weighted connections.

Page 32: Deep learning for image denoising and superresolution

Boltzmann Machines
• Energy-based models associate an energy to each configuration of the stochastic variables of interest (for example, MRF, nearest neighbor);
• Learning means adjusting the shape of the energy function so that desirable configurations have low energy;
• A Boltzmann machine is a stochastic recurrent model with hidden variables;
• Markov Chain Monte Carlo, i.e. MCMC sampling (appendix);

• Restricted Boltzmann machine is a special case: • Only one layer of hidden units;

• factorization of each layer’s neurons/units (no connections in the same layer);

• Contrastive divergence: approximation of gradient (appendix).

probability

Energy Function

Learning rule

Page 33: Deep learning for image denoising and superresolution

Deep Belief Networks
• A hybrid model: can be trained as a generative or a discriminative model;
• Deep architecture: multiple layers (learn features layer by layer);
• Multi-layer learning is difficult in sigmoid belief networks.
• The top two layers form an undirected RBM;
• The lower layers receive top-down directed connections from the layers above;
• Unsupervised or self-taught pre-training provides a good initialization;
• Greedy layer-wise unsupervised training of the RBMs;
• Supervised fine-tuning:
• Generative: wake-sleep algorithm (up-down);
• Discriminative: back propagation (bottom-up).

Page 34: Deep learning for image denoising and superresolution

Deep Boltzmann Machine• Learning internal representations that become increasingly complex;

• High-level representations built from a large supply of unlabeled inputs;

• Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph);

• Generative fine-tuning: different from DBN

• Positive and negative phase (appendix)

• Discriminative fine-tuning: the same as DBN

• Back propagation.

Page 35: Deep learning for image denoising and superresolution

Deep Gated MRF• Conditional Distribution Over Input:

• P(x∣h)=N (mean(h) ,D);

• examples: PPCA, Factor Analysis, ICA, Gaussian RBM;

• model does not represent well dependencies, only mean intensity;

• P(x∣h)=N (0,Covariance(h));

• examples: PoT (product of student’s t), covariance RBM;

• model does not represent well mean intensity, only dependencies;

• P(x∣h)=N (mean(h) , Covariance(h));

• mean cRBM, mean PoT;

• two sets of latent variables to modulate mean and covariance of the conditional distribution over the input;

• Deep gated MRF: RBM layers + MRF with adaptive affinities (to gate the effective interactions and to decide mean intensities);

• Learning: Gibbs sampling/HMC sampling, Fast persistent CD.

Page 36: Deep learning for image denoising and superresolution

Deep Gated MRF

Page 37: Deep learning for image denoising and superresolution

Denoising Auto-Encoder• Multilayer NNs with target output=input;• Reconstruction=decoder(encoder(input));

• Perturbs the input x to a corrupted version;

• Randomly sets some of the coordinates of input to zeros.

• Recover x from encoded perturbed data.

• Learns a vector field towards higher probability regions;• Pre-trained with DBN or regularizer with perturbed training data; • Minimizes variational lower bound on a generative model;

• corresponds to regularized score matching on an RBM;

• PCA=linear manifold=linear Auto Encoder;• Auto-encoder learns the salient variation like a nonlinear PCA.
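A minimal denoising auto-encoder sketch in PyTorch, following the description above: the input is corrupted by randomly zeroing coordinates (masking noise), and the network is trained to recover the uncorrupted input. The layer sizes, activation and corruption rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim=784, hidden=256, corruption=0.3):
        super().__init__()
        self.corruption = corruption
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):
        # randomly set a fraction of the input coordinates to zero
        mask = (torch.rand_like(x) > self.corruption).float()
        h = self.encoder(x * mask)
        return self.decoder(h)              # reconstruction of the *clean* x

# one dummy training step: the target is the uncorrupted input
model = DenoisingAE()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(64, 784)                     # mini-batch of inputs in [0, 1]
loss = nn.MSELoss()(model(x), x)
loss.backward()
opt.step()
```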

Page 38: Deep learning for image denoising and superresolution

Stacked Denoising Auto-Encoder
• Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning;

• Drop the decode layer each time

• Performs better than stacking RBMs;

• Supervised training on the last layer using final features;

• (Option) Supervised training of the entire network to fine-tune all weights of the neural net;

• Empirically not quite as accurate as DBNs.

Page 39: Deep learning for image denoising and superresolution

Image Denoising
• Noise reduction: various assumptions about the internal structure of the content;

• Learning-based

• Field of experts (MRF), CRF, NN (MLP, CNN);

• Sparse coding: K-SVD, LSSC,….

• Self-similarity

• Gaussian, Median;

• Bilateral filter, anisotropic diffusion;

• Non-local means.

• Sparsity prior• Wavelet shrinkage;

• Use of both Redundancy and Sparsity• BM3D (block matching 3-d filter)-benchmark;

• Can ‘Deep Learning’ compete with BM3D?

Page 40: Deep learning for image denoising and superresolution

Block Matching 3-D for Denoising
• For each patch, find similar patches;
• Group the similar patches into a 3-d stack;
• Perform a 3-D transform (2-d + 1-d) and coefficient thresholding (sparsity);
• Apply the inverse 3-D transform (1-d + 2-d);
• Combine multiple patch estimates in a collaborative way (aggregation);
• Two stages: hard thresholding -> Wiener (soft) filtering.

Page 41: Deep learning for image denoising and superresolution

BM3D Outline

Page 42: Deep learning for image denoising and superresolution

Apply Sparse Coding for Denoising

• Observation model: Y = Z + n (Z clean image, n noise);
• Solve a cost function combining a data term and a prior term:
  Ẑ = argmin over Z, {α_ij} of  λ‖Y − Z‖²  +  Σ_ij μ_ij ‖α_ij‖₀  +  Σ_ij ‖D α_ij − R_ij Z‖²
  (global proximity + sparsity of the representations + proximity of each selected patch);
• The prior term breaks the problem into smaller problems: minimization at the patch level.

Page 43: Deep learning for image denoising and superresolution

Image Data in K-SVD Denoising• Extract overlapping patches from a single image;

• clean or corrupted, even reference (multiple frames)?• for example, 100k of size 8x8 block patches;

• Apply K-SVD to train a dictionary;
• Size 64x256 (patch dimension n=64, dictionary size k=256);
• Lagrange multiplier lambda = 30/sigma (noise level);
• The coefficients come from OMP;
• the maximal number of iterations is 180 and the noise gain is C=1.15;
• the number of nonzero elements is L=6 (for sigma=5).

• Denoising by normalized weighted averaging:
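A rough sketch of this patch-based pipeline (including the final normalized weighted averaging of overlapping patches) using scikit-learn. MiniBatchDictionaryLearning stands in for K-SVD here, and the image, noise level and iteration settings are illustrative, not the exact parameters above.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

rng = np.random.default_rng(0)
clean = rng.random((128, 128))                        # stand-in for a clean image
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

# 1) extract overlapping 8x8 patches and remove their means
train = extract_patches_2d(noisy, (8, 8), max_patches=10000, random_state=0)
X = train.reshape(len(train), -1)
X -= X.mean(axis=1, keepdims=True)

# 2) learn a 64x256 dictionary (mini-batch dictionary learning in place of K-SVD)
dico = MiniBatchDictionaryLearning(n_components=256, batch_size=256,
                                   transform_algorithm='omp',
                                   transform_n_nonzero_coefs=6, random_state=0)
dico.fit(X)

# 3) sparse-code every patch of the noisy image with OMP and reconstruct it
patches = extract_patches_2d(noisy, (8, 8))
Y = patches.reshape(len(patches), -1)
m = Y.mean(axis=1, keepdims=True)
den = dico.transform(Y - m) @ dico.components_ + m

# 4) put the patches back, averaging the overlaps (normalized weighted averaging)
denoised = reconstruct_from_patches_2d(den.reshape(-1, 8, 8), noisy.shape)
```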

Page 44: Deep learning for image denoising and superresolution

Image Denoising by Conv. Nets
• Image denoising treated as a learning problem for training a Conv. Net;

• Parameter estimation to minimize the reconstruction error.

• Online learning (rather than batch learning): stochastic gradient• Gradient update from 6x6 patches sampled from 6 different training images

• Run like greedy layer-wise training for each layer.

Page 45: Deep learning for image denoising and superresolution

Image Denoising by MLP• Denoising as learning: map noisy patches to noise-free ones;

• Patch size 17x17;

• Training with different noise types and levels:• Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact;

• Feed-forward NN: MLP;• input layer 289-d, four hidden layers (2047-d), output layer 289-d.• input layer 169-d, four hidden layers (511-d), output layer 169-d.

• 40 million training images from LabelMe and Berkeley segmentation!• 1000 testing images: Mcgill, Pascal VOC 2007;• GPU: slower than BM3D, much faster than KSVD.• Deep learning can help: unsupervised learning from unlabelled data.
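A minimal PyTorch sketch of the patch-to-patch MLP denoiser described above (17x17 = 289-d input/output, four 2047-unit hidden layers). The activation, optimizer settings and training data here are placeholders, not the settings used in the paper.

```python
import torch
import torch.nn as nn

# 289-d noisy patch in, 289-d clean patch out, four hidden layers of 2047 units
sizes = [289, 2047, 2047, 2047, 2047, 289]
layers = []
for i in range(len(sizes) - 1):
    layers.append(nn.Linear(sizes[i], sizes[i + 1]))
    if i < len(sizes) - 2:
        layers.append(nn.Tanh())                 # nonlinearity between hidden layers
mlp = nn.Sequential(*layers)

opt = torch.optim.SGD(mlp.parameters(), lr=0.01)

# one dummy training step: map noisy 17x17 patches to their clean versions
clean = torch.rand(128, 289)
noisy = clean + (25.0 / 255.0) * torch.randn_like(clean)   # sigma = 25 Gaussian noise
loss = nn.MSELoss()(mlp(noisy), clean)
loss.backward()
opt.step()
```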

Page 46: Deep learning for image denoising and superresolution

Image Denoising with Deep Nets• Combine sparse coding and deep network pre-trained by DAE;

• Reconstruct clean image from noisy image by training DAE;

• image denoising by choosing appropriate η in different situations.

• Deep network: stacked sparse DAE (denoising auto-encoder).

• Pre-training

• Fine-tuning by back propagation

• Patch-based.

Hidden layer

KL divergence with sparsity

Page 47: Deep learning for image denoising and superresolution

Image Denoising by DBMs
• Combine Boltzmann machines and Denoising Auto-Encoders;
• 100,000 image patches of sizes 4×4, 8×8 and 16×16 from the CIFAR-10 dataset yield 50,000 training samples;
• Three sets of testing images from USC: textures, aerials and miscellaneous;
• Gaussian BMs + DAEs: one, two and four hidden layers;
• Deep network training:

• A two-stage pre-training and PCD training for Gaussian DBMs;• Stochastic BP for DAE training;

• Noise: Gaussian, salt-and-pepper;• Patch-based as well;• Comparison: when noise is heavy, DBM beats DAE; otherwise, vice versa.

Page 48: Deep learning for image denoising and superresolution

Image Denoising by Deep Gated MRF• Works as solving the following optimization problem

where F(x;θ) is the mPoT energy function

• Adapt the generic prior learned by mPoT:

• 1. Adapt the parameters to the denoised test image (mPoT+A), such as sparse coding;

• 2. Add to the denoising loss an extra quadratic term pulling the estimate close to the denoising result of the non-local means algorithm (mPoT+A+NLM), such as adding the term as

Original noisy (22.1dB) mPoT(28.0dB) mPoT+A(29.2dB) mPoT+A+NLM(30.7dB)

Page 49: Deep learning for image denoising and superresolution

Image Restoration by CNN

• Collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network.

• Given a noisy image x, predict a clean image y close to the clean image y*

• the input kernels p1 = 16, the output kernel pL = 8. • 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1. • W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512.

• This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of noise in natural images;
• Train the weights Wl and biases bl by minimizing the mean squared error;
• Minimize with SGD;
• Can be regarded as: first patchifying the input, applying a fully-connected neural network to each patch, and averaging the resulting output patches.

Page 50: Deep learning for image denoising and superresolution

Image Restoration by CNN• Comparison.

Page 51: Deep learning for image denoising and superresolution

Image Deconvolution with Deep CNN
• Establish the connection between traditional optimization-based schemes and a CNN architecture;
• A separable structure is used as a reliable support for robust deconvolution against artifacts;

• The deconvolution task can be approximated by a convolutional network by nature, based on the kernel separability theorem;

• Kernel separability is achieved via SVD;

• An inverse kernel with length 100 is enough for plausible deconv. results;

• Image deconvolution convolutional neural network (DCNN);

• Two hidden layers: h1 has 38 large-scale 1-d kernels of size 121×1; h2 applies 38 kernels of size 1×121 to each map in h1; the output is a 1×1×38 kernel;

• Random-weight initialization or from the separable kernel inversion;

• Concatenation of deconvolution CNN module with denoising CNN;

• called “Outlier-rejection Deconvolution CNN (ODCNN)”;

• 2 million sharp patches together with their blurred versions in training.
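The kernel separability used above comes directly from the SVD of the 2-D kernel. A small NumPy illustration (with a hypothetical Gaussian blur kernel, which happens to be exactly rank-1) of how a 2-D convolution splits into 1-D column/row convolutions:

```python
import numpy as np
from scipy.signal import convolve2d

# a hypothetical 2-D blur kernel (outer product of Gaussians, so rank-1 here)
g = np.exp(-0.5 * (np.arange(-7, 8) / 2.0) ** 2)
K = np.outer(g, g)
kernel = K / K.sum()

# SVD-based separation: kernel ~= sum_i s_i * u_i v_i^T; keep the leading term(s)
U, s, Vt = np.linalg.svd(kernel)
col = U[:, 0] * np.sqrt(s[0])      # 1-D column kernel
row = Vt[0, :] * np.sqrt(s[0])     # 1-D row kernel

img = np.random.rand(64, 64)
full_2d = convolve2d(img, kernel, mode='same')
separable = convolve2d(convolve2d(img, col[:, None], mode='same'),
                       row[None, :], mode='same')
print(np.max(np.abs(full_2d - separable)))   # ~1e-16: the separable form matches
```

For a general (non-separable) blur kernel, several leading SVD terms would be kept, which is exactly what motivates the bank of 1-d kernels in the network above.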

Page 52: Deep learning for image denoising and superresolution

Image Deconvolution with Deep CNN

Page 53: Deep learning for image denoising and superresolution

Image Super-resolution• Super-resolution (SR): how to find missing details/HF comp?

• Interpolation-based:

• Edge-directed;

• B-spline;

• Sub-pixel alignment;

• Reconstruction-based:

• Gradient prior;

• TV (Total Variation);

• MRF (Markov Random Field).

• Learning-based (hallucination).

• Example-based: texture synthesis, LR-HR mapping;

• Self learning: sparse coding, self similarity-based;

• ‘Deep Learning’ competes with shallow learning in image SR.

Page 54: Deep learning for image denoising and superresolution

• Estimate missing HR detail that isn’t present in the original LR image, and which we can’t make visible by simple sharpening;

• Image database with HR/LR image pairs;

• Algorithm uses a training set to learn the fine details of LR;

• It then uses learned relationships (MRF) to predict fine details.

What is Example Based SR?

Page 55: Deep learning for image denoising and superresolution

SR from a Single Image• Multi-frame-based SR (alignment);

• Example-based SR.

Page 56: Deep learning for image denoising and superresolution

SR from a Single Image

• Combination of Example-based and Multi-frame-based.

same scale

different scales

FindNN Parent Copy

Page 57: Deep learning for image denoising and superresolution

Example-based Edge Statistics Single Frame

Page 58: Deep learning for image denoising and superresolution

Sparse Coding for SR [Yang et al. 08]
• HR patches have a sparse representation w.r.t. an over-complete dictionary of patches randomly sampled from similar images.
• Sample 3x3 LR overlapping patches y on a regular grid.
• Output HR patch: x = Dh α for some sparse α, where Dh is the HR dictionary;
• The input LR patch satisfies y = L x = L Dh α = Dl α, i.e. linear measurements of the sparse coefficient vector α, where L is the downsampling/blurring operator and Dl is the dictionary of low-resolution patches;
• If we can recover the sparse solution α to the underdetermined system of linear equations y = Dl α, we can reconstruct x = Dh α;
• Convex relaxation: minimize ‖α‖₁ subject to the reconstruction constraints.
• T, T': select the overlap between patches; F: 1st and 2nd derivatives from the LR bicubic interpolation.

Page 59: Deep learning for image denoising and superresolution

Sparse Coding for SR [Yang et al.08]Two training sets:

Flower images – smooth area, sharp edge

Animal images -- HF textures

Randomly sample 100,000 HR-LR patch pairs from each set of training images.

Page 60: Deep learning for image denoising and superresolution

Sparse coding

MRF / BP[Freeman IJCV ‘00]

Bicubic

Original

Page 61: Deep learning for image denoising and superresolution

Joint Dictionary Learning for SR• Local sparse prior for detail recovery;

• Global constraints for artifact avoiding (L=SH);

• Joint dictionary learning:

Extract the overlap region; previously reconstructed values on the overlap.

Controls the tradeoff between matching the LR input and finding a neighbor-compatible HR patch.

Solved by back-projection: a gradient descent method

Page 62: Deep learning for image denoising and superresolution

Bicubic Sparse coding

MRF / BP [Freeman IJCV ‘00]Input LR

Page 63: Deep learning for image denoising and superresolution

Image SR by DBMs• Sparsity prior pre-learned into the dictionary [Yang’08];

• Learn the dictionary (size=1024), encoded in the RBM;

• Trained by contrastive divergence.

• Use interpolation to initialize HR from LR, to accelerate inference;

• Training images: 10,000 HR/LR image patch (8x8) pairs.

The image patches are elements of the dictionaries to be learned and collected from the normalized weights in RBM.

Page 64: Deep learning for image denoising and superresolution

Results of images magnified by a factor of 2

Page 65: Deep learning for image denoising and superresolution

Super-resolution by DBNs
• SR is treated as an image completion problem with missing data (HF components);
• Training: the HR image is divided into PxP patches transformed to the DCT domain, and the DBNs are trained by SGD, layer by layer;

• Restoring: LR image is interpolated first and divided into PxP patches as well, transformed to DCT domain, fed into DBNs to infer missing HF, then reversed.

• Iteratively.

• Experiment setting: P=16, scaling =2, learning rate 0.01, hidden units 400 (1st layer) + 200 (2nd layer).

Page 66: Deep learning for image denoising and superresolution

Super-resolution by DBNs

Connections among LF and HF

Restoration of HF after training

(Two hidden layers as example)

Page 67: Deep learning for image denoising and superresolution

Super-resolution by DBNs
Comparison of super-resolution methods using PSNR and SSIM (table).

Page 68: Deep learning for image denoising and superresolution

Image Super-resolution by Learning Deep CNN

• Learns an end-to-end mapping between low- and high-resolution images, realized as a deep CNN that takes the LR image as input and outputs the HR image;
• Traditional sparse-coding-based SR can also be viewed as a deep convolutional network, but it handles each component separately, whereas this method jointly optimizes all layers.

Page 69: Deep learning for image denoising and superresolution

Image Super-resolution by Learning Deep CNN

• LR image upscaled to the desired size using bicubic interpolation as Y; Then recover from Y an image F(Y) similar to ground truth HR image X.

• Learn a mapping F, consists of three operations:

• 1. Patch extraction and representation;

• 2. Non-linear mapping;

• 3. Reconstruction.

• The traditional sparse coding method can be shown as a special case of this convolutional pipeline (figure).
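A minimal PyTorch sketch of the three-operation mapping F above (patch extraction/representation, non-linear mapping, reconstruction). The 9-1-5 kernel sizes and 64/32 filter counts follow the commonly used SRCNN configuration and are assumptions here, not values taken from the slide.

```python
import torch
import torch.nn as nn

class SRCNNSketch(nn.Module):
    """Takes the bicubically upscaled LR image Y and predicts F(Y) close to the HR image X."""
    def __init__(self, channels=1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)  # 1. patch extraction/representation
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)                   # 2. non-linear mapping
        self.recon   = nn.Conv2d(32, channels, kernel_size=5, padding=2)  # 3. reconstruction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        return self.recon(self.relu(self.mapping(self.relu(self.extract(y)))))

model = SRCNNSketch()
y = torch.rand(1, 1, 64, 64)                     # bicubic-upscaled LR input Y
x = torch.rand(1, 1, 64, 64)                     # HR ground truth X
loss = nn.MSELoss()(model(y), x)                 # trained end-to-end with MSE
```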

Page 70: Deep learning for image denoising and superresolution

Image Super-resolution by Learning Deep CNN

Results comparison.

Page 71: Deep learning for image denoising and superresolution

Image Superresolution by a Cascade of Stacked CLA

• In each layer of the cascade, non-local self-similarity search is first performed to enhance high-frequency texture details of the partitioned patches in the input image;

• The enhanced image patches are then input into a collaborative local auto-encoder (CLA) to suppress noise and enforce the compatibility of the overlapping patches;
• By closing the loop on non-local self-similarity search and CLA within a cascade layer, the refined super-resolved image is fed into the next layer until the required image scale is reached.

Page 72: Deep learning for image denoising and superresolution

Image Superresolution by a Cascade of Stacked CLA

• Experimental results compared with others.

Kim's sparse regression | exemplar-based | Yang's sparse coding | cascade of stacked CLAs

Page 73: Deep learning for image denoising and superresolution

References
• Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in ML, 2(1), pp. 1-127, 2009.
• R. Fergus, H. Lee, M. Ranzato, R. Salakhutdinov, G. Taylor, K. Yu, Deep Learning Methods for Vision, CVPR 2012 Tutorial.
• Hinton, G., Osindero, S. and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18, 2006.
• Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. Int. Conf. on AI and Statistics, 5(2), 2009.
• Vincent, P., et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. of Machine Learning Research 11 (2010): 3371-3408.
• Le, Ranzato, Monga, Devin, Corrado, Chen, Dean, Ng. Building High-Level Features Using Large Scale Unsupervised Learning. ICML 2012.
• K. Cho, T. Raiko, A. Ilin and J. Karhunen, A Two-stage Pretraining Algorithm for Deep Boltzmann Machines, NIPS workshop, 2012.
• V. Jain and H. S. Seung. Natural image denoising with convolutional networks. Advances in Neural Information Processing Systems, 21:769-776, 2008.

Page 74: Deep learning for image denoising and superresolution

References
• Burger, Schuler, Harmeling, Image Denoising: Can Plain Neural Networks Compete with BM3D?, CVPR, 2012.
• Xie, J., Xu, L., Chen, E. Image denoising and inpainting with deep neural networks. Advances in Neural Information Processing Systems 25, 2012.
• K. Cho, Simple Sparsification Improves Sparse Denoising Autoencoders in Image Denoising, ICML, 2013.
• J. Gao, Y. Guo, M. Yin, Restricted Boltzmann Machine Approach to Couple Dictionary Training for Image Super-resolution, IEEE ICIP, 2013.
• T. Nakashika, T. Takiguchi, Y. Ariki, HF Restoration Using Deep Belief Nets for SR, 2013.
• D. Eigen, D. Krishnan, R. Fergus, Restoring An Image Taken Through a Window Covered with Dirt or Rain. ICCV 2013.
• M. Ranzato, V. Mnih, J. M. Susskind, G. E. Hinton, Modeling Natural Images Using Gated MRFs, IEEE T-PAMI, 2013.
• Cui, Chang, Shan, Zhong, Chen, Deep Network Cascade for Image Super-resolution, ECCV 2014.
• Dong, Loy, He, Tang, Learning a Deep Convolutional Network for Image Super-Resolution, ECCV 2014.

Page 75: Deep learning for image denoising and superresolution

Appendix

Page 76: Deep learning for image denoising and superresolution

Graphical Models

• Graphical Models: Powerful framework for representing dependency structure between random variables.

• The joint probability distribution over a set of random variables.• The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables.

• The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables.• Two type of graphical models:

• Directed (Bayesian networks)
• Undirected (Markov random fields, Boltzmann machines)
• Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks and Hierarchical-Deep Models.

Page 77: Deep learning for image denoising and superresolution

PCA, AP & Spectral Clustering• Principal Component Analysis (PCA) uses orthogonal transformation to

convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

• This transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components.

• PCA is sensitive to the relative scaling of the original variables.
• Also related to the Karhunen–Loève transform (KLT), the Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.
• Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;

• Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. • The similarity matrix consists of a quantitative assessment of the relative

similarity of each pair of points in the dataset.

Page 78: Deep learning for image denoising and superresolution

PCA, AP & Spectral Clustering

Page 79: Deep learning for image denoising and superresolution

NMF & pLSA

• Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, that all three matrices have no negative elements.

• The different types arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices;

• squared error, Kullback-Leibler divergence or total variation (TV);

• NMF is an instance of a more general probabilistic model called "multinomial PCA", such as pLSA (probabilistic latent semantic analysis);

• pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions;

• Their parameters are learned using EM algorithm;

• pLSA is based on a mixture decomposition derived from a latent class model, unlike LSA, which downsizes the occurrence tables by SVD.

• Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document topic distribution.

Page 80: Deep learning for image denoising and superresolution

NMF & pLSA

Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)

Page 81: Deep learning for image denoising and superresolution

ISOMAP• General idea:

• Approximate the geodesic distances by shortest graph distance.

• MDS (multi-dimensional scaling) using geodesic distances

• Algorithm:• Construct a neighborhood graph

• Construct a distance matrix

• Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j.

• Apply MDS to matrix to find coordinates

Page 82: Deep learning for image denoising and superresolution

LLE (Locally Linear Embedding)
• General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients in the embedding to preserve the neighborhood relations in the low-dimensional space;

• Compute the coefficient w for each data by solving a constraint LS problem;

• Algorithm: • 1. Find weight matrix W of linear coefficients

• 2. Find low dimensional embedding Y that minimizes the reconstruction error

• 3. Solution: Eigen-decomposition of M=(I-W)’(I-W)

Φ(Y) = Σ_i ‖ Y_i − Σ_j W_ij Y_j ‖²

Page 83: Deep learning for image denoising and superresolution

Laplacian Eigenmaps
• General idea: minimize the norm of the Laplace-Beltrami operator on the manifold;

• measures how far apart maps nearby points.

• Avoid the trivial solution of f = const.

• The Laplacian-Beltrami operator can be approximated by Laplacian of the neighborhood graph with appropriate weights.

• Construct the Laplacian matrix L=D-W.

• can be approximated by its discrete equivalent

• Algorithm: • Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).

• Construct an adjacency matrix with the following weights

• Minimize

• The generalized eigen-decomposition of the graph Laplacian is

• Spectral embedding of the Laplacian manifold:

• The first eigenvector is trivial (the all-ones vector).
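A compact NumPy/SciPy sketch of the algorithm above (k-NN graph, graph Laplacian L = D − W, generalized eigen-decomposition Lf = λDf, and dropping the trivial first eigenvector). The heat-kernel weights and the choice of k are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, n_components=2, k=10, sigma=1.0):
    # adjacency matrix W from a k-nearest-neighbor graph with heat-kernel weights
    dist = kneighbors_graph(X, k, mode='distance').toarray()
    W = np.where(dist > 0, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
    W = np.maximum(W, W.T)                      # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    vals, vecs = eigh(L, D)                     # generalized eigenproblem L f = lambda D f
    return vecs[:, 1:n_components + 1]          # skip the trivial all-ones eigenvector

X = np.random.rand(200, 5)
Y = laplacian_eigenmaps(X)
print(Y.shape)                                  # (200, 2) spectral embedding
```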

Page 84: Deep learning for image denoising and superresolution

Gaussian Mixture Model & EM
• A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population;

• “Mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population;

• A Gaussian mixture model can be Bayesian or non-Bayesian;

• A variety of approaches focus on maximum likelihood estimate (MLE) as expectation maximization (EM) or maximum a posteriori (MAP);

• EM is used to determine the parameters of a mixture with an a priori given number of components (a variation version can adapt it in the iteration);• Expectation step: "partial membership" of each data point in each constituent

distribution is computed by calculating expectation values for the membership variables of each data point;

• Maximization step: plug-in estimates, mixing coefficients and component model parameters, are re-computed for the distribution parameters;

• Each successive EM iteration will not decrease the likelihood.

• Alternatives of EM for mixture models:• mixture model parameters can be deduced using posterior sampling as indicated

by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC);

• Spectral methods based on SVD;

• Graphical model: MRF or CRF.

Page 85: Deep learning for image denoising and superresolution

Gaussian Mixture Model & EM

Page 86: Deep learning for image denoising and superresolution

Hidden Markov Model
• A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;
• In an HMM, the state is not visible, but the output, which depends on the state, is visible.

• Each state has a probability distribution over the possible output tokens;• Sequence of tokens generated by an HMM gives some information about

the sequence of states.

• Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;

• A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;

• Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);

• Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).

Page 87: Deep learning for image denoising and superresolution

• A flow network G(V, E) defined as a fully connected directed graph where each edge (u,v) in E has a positive capacity c(u,v) >= 0;

• The max-flow problem is to find the flow of maximum value on a flow network G;

• A s-t cut or simply cut of a flow network G is a partition of V into S and T = V-S, such that s in S and t in T;

• A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network;

• Methods of max flow or mini-cut:

• Ford Fulkerson method;

• "Push-Relabel" method.

Page 88: Deep learning for image denoising and superresolution

• Mostly labeling is solved as an energy minimization problem;

• Two common energy models:

• Potts Interaction Energy Model;

• Linear Interaction Energy Model.

• Graph G contain two kinds of vertices: p-vertices and i-vertices;

• all the edges in the neighborhood N, called n-links;

• edges between the p-vertices and the i-vertices called t-links.

• In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;

• The minimum cost multi-way cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled vertices;

• The approximation algorithms to find this multi-way cut:

• "alpha-expansion" algorithm;

• "alpha-beta swap" algorithm.

Page 89: Deep learning for image denoising and superresolution

A simplified Bayes Net: it propagates info. throughout a graphical model via a series of messages between neighboring nodes iteratively; likely to converge to a consensus that determines the marginal prob. of all the variables;

messages estimate the cost (or energy) of a configuration of a clique given all other cliques; then the messages are combined to compute a belief (marginal or maximum probability);

• Two types of BP methods:

• max-product;

• sum-product.

• BP provides exact solution when there are no loops in graph!

• Equivalent to dynamic programming/Viterbi in these cases;

• Loopy Belief Propagation: still provides approximate (but often good) solution;

Page 90: Deep learning for image denoising and superresolution

• Generalized BP for pairwise MRFs

• Hidden variables xi and xj are connected through a compatibility function;

• Hidden variables xi are connected to observable variables yi by the local “evidence” function;

• The joint probability of {x} is given by

• To improve inference by taking into account higher-order interactions among the variables;

• An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;

• This is the intuition in Generalized Belief Propagation (GBP).

Page 91: Deep learning for image denoising and superresolution

Discriminative Model: CRF• Conditional , not joint, probabilistic sequential models p(y|x)

• Allow arbitrary, non-independent features on the observation seq X

• Specify the probability of possible label seq given an observation seq

• The probability of a transition between labels may depend on past and future observations

• Relax strong independence assumptions, no p(x) required

• CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X are observables

• Linear chain CRF: transition score depends on current observation

• Inference by DP like HMM, learning by forward-backward as HMM

• Optimization for learning CRF: discriminative model

• Conjugate gradient, stochastic gradient,…

Page 92: Deep learning for image denoising and superresolution

Product of Experts (PoE)

• Model a probability distribution by combining the output from several simpler distributions.

• Combine several probability distributions ("experts") by multiplying their density functions, similar to an "AND" operation.
• This allows each expert to make decisions on the basis of a few dimensions without having to cover the full dimensionality.

• Related to (but quite different from) a mixture model, combining several probability distributions via “OR" operation.

• Learning by CD: run N samplers in parallel, one for each data-case in the (mini-)batch;

• Boosting: focusing on training data with high reconstruct. errors;

• Easy for inference, no suffer from “Explaining Away”.

Page 93: Deep learning for image denoising and superresolution

Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of sums are called M-estimators;
• Where are the stationary points of the likelihood function (or the zeroes of its derivative, the score function)?

• Online gradient descent samples a subset of summand functions at every step;• The true gradient of is approximated by a gradient at a single example;

• Shuffling of training set at each pass.

• There is a compromise between two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples.

• SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.

Page 94: Deep learning for image denoising and superresolution

Back Propagation
• Back propagation is a multi-layer network training method.
• Error propagation:
• Forward propagation of a training pattern's input through the multilayer network to generate the output activations;
• Backward propagation of the output activations (logistic or soft-max) through the multilayer network, using the training pattern target to generate the deltas of all output and hidden units (the chain rule).
• Weight update:
• Multiply the output delta and the input activation to get the weight gradient;
• Subtract a ratio (i.e. the learning rate) of the gradient from the weight.

Page 95: Deep learning for image denoising and superresolution

Loss Function
• Euclidean loss is used for regressing to real-valued labels in [-inf, inf];
• Sigmoid cross-entropy loss is used for predicting K independent probability values in [0, 1];
• Softmax (normalized exponential) loss is used for predicting a single class out of K mutually exclusive classes;
• A generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to a K-dimensional vector of real values σ(z) in the range (0, 1).

• The predicted probability for the j'th class given a sample vector x is

• Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset.
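A small NumPy sketch of the softmax described above; the max-subtraction is a numerical-stability detail added here, not mentioned on the slide.

```python
import numpy as np

def softmax(z):
    """Squash a K-dimensional score vector z into probabilities in (0, 1) summing to 1."""
    z = z - np.max(z)              # stabilization; does not change the result
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)                # predicted probability for each of the K classes
print(p, p.sum())                  # approx. [0.659 0.242 0.099], sums to 1.0
```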

Page 96: Deep learning for image denoising and superresolution

Variable Learning Rate• Too large learning rate

• cause oscillation in searching for the minimal point

• Too small a learning rate• too slow convergence to the minimal point

• Adaptive learning rate• At the beginning, the learning rate can be large when the current

point is far from the optimal point;

• Gradually, the learning rate will decay as time goes by.

• Should not be too large or too small: • annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)

• 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a constant.
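A tiny sketch of the annealing schedule α(t) = α(0)/(1 + t/T) from the slide, applied to gradient descent on a toy quadratic; the objective and the constants are placeholders.

```python
alpha0, T = 0.1, 100.0             # initial rate and annealing time constant
theta = 5.0                        # minimize f(theta) = theta^2, optimum at 0

for t in range(1000):
    alpha = alpha0 / (1.0 + t / T) # nearly constant early on, decays toward zero later
    grad = 2.0 * theta             # gradient of theta^2
    theta -= alpha * grad

print(theta)                       # close to 0
```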

Page 97: Deep learning for image denoising and superresolution

Variable Momentum
• Classical Momentum (CM) is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations: given the objective function f(θ),

Vt+1 = µVt - ε𝛻f(θt),   θt+1 = θt + Vt+1,

with ε > 0 the learning rate, µ ∈ [0,1] the momentum coefficient and 𝛻f(θt) the gradient at θt;

• Nesterov’s Accelerated Gradient (NAG) is also a 1st-order optimization method, with a better convergence-rate guarantee than gradient descent:

Vt+1 = µVt - ε𝛻f(θt + µVt),   θt+1 = θt + Vt+1;

• For convex objectives, momentum-based methods outperform SGD in the early or transient stages of optimization, but are equally effective in the final stage;
• Hessian-free (HF) methods and truncated Newton methods work by optimizing a local quadratic model of the objective via linear conjugate gradient (CG) algorithms;
• If CG is terminated after just one step, HF becomes equivalent to NAG.
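A short NumPy comparison of the two update rules above (classical momentum vs. NAG) on a toy quadratic objective; the objective and the hyper-parameters are illustrative.

```python
def grad(theta):                 # gradient of the toy objective f(theta) = 0.5 * theta^2
    return theta

eps, mu = 0.1, 0.9               # learning rate and momentum coefficient
theta_cm = theta_nag = 10.0
v_cm = v_nag = 0.0

for _ in range(100):
    # Classical Momentum: v <- mu*v - eps*grad(theta); theta <- theta + v
    v_cm = mu * v_cm - eps * grad(theta_cm)
    theta_cm += v_cm
    # NAG: the gradient is evaluated at the look-ahead point theta + mu*v
    v_nag = mu * v_nag - eps * grad(theta_nag + mu * v_nag)
    theta_nag += v_nag

print(theta_cm, theta_nag)       # both approach the minimum at 0; NAG is typically less oscillatory
```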

Page 98: Deep learning for image denoising and superresolution

Data Augmentation for Overfitting
• The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations;
• Perturb an image I by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class;

• Two distinct forms of data augmentation:

• image translation

• horizontal reflections

• changing RGB intensities

Page 99: Deep learning for image denoising and superresolution

Dropout and Maxout for Overfitting

• Dropout: set the output of each hidden neuron to zero w.p. 0.5.
• Motivation: combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging.

• The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation.

• So every time an input is presented, the NN samples a different architecture, but all these architectures share weights.

• This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units.

• It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units.

• Without dropout, the network exhibits substantial overfitting.

• Dropout roughly doubles the number of iterations required to converge.

• Maxout takes the maximum across multiple feature maps;

Page 100: Deep learning for image denoising and superresolution

MCMC Sampling for Optimization
• Markov chain: a stochastic process in which future states are independent of past states given the present state.

• Markov chain will typically converge to a stable distribution.

• Monte Carlo Markov Chain: sampling using ‘local’ information

• Devise a Markov chain whose stationary distribution is the target.• Ergodic MC must be aperiodic, irreducible, and positive recurrent.

• Monte Carlo Integration to get quantities of interest.

• Metropolis-Hastings method: sampling from a target distribution

• Create a Markov chain whose transition matrix does not depend on the normalization term.

• Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).

• After sufficient number of iterations, the chain will converge the stationary distribution.

• Gibbs sampling is a special case of M-H sampling.
• The Hammersley-Clifford theorem: obtain the joint distribution from the complete conditional distributions.
• Hybrid Monte Carlo: a gradient sub-step within each Markov chain step.
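A minimal Metropolis-Hastings sketch in NumPy for the method above, sampling a 1-D standard normal target with a Gaussian random-walk proposal; the target, proposal width and chain length are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x ** 2              # unnormalized log-density of N(0, 1)

x, samples = 0.0, []
for _ in range(10000):
    prop = x + rng.normal(scale=1.0)  # symmetric random-walk proposal
    # accept with probability min(1, p(prop)/p(x)); the normalization term cancels
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x)                 # keep the current state either way

samples = np.array(samples[1000:])    # discard burn-in before the chain converges
print(samples.mean(), samples.std())  # approximately 0 and 1
```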

Page 101: Deep learning for image denoising and superresolution

Mean Field for Optimization• Variational approximation modifies the optimization problem to

be tractable, at the price of approximate solution;

• Mean field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is a disconnected graph);
• The density becomes a factorized product distribution in this sub-family.

• Objective: K-L divergence.

• Mean field is a structured variation approximation approach:• Coordinate ascent (deterministic);

• Compared with stochastic approximation (sampling):• Faster, but maybe not exact.

Page 102: Deep learning for image denoising and superresolution

Contrastive Divergence for RBMs

• Contrastive divergence (CD) is proposed for training PoE first, also being a quicker way to learn RBMs;

• Contrastive divergence as the new objective;

• Taking gradients and ignoring a term which is usually very small.

• Steps:
• Start with a training vector on the visible units.
• Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
• Can be applied using any MCMC algorithm to simulate the model (not limited to Gibbs sampling);
• CD learning is biased: it does not behave exactly as gradient descent.
• Improvement: Persistent CD explores more modes of the distribution.
• Rather than restarting from data samples, begin sampling from the model samples obtained at the last gradient update.
• Still suffers from divergence of the likelihood due to missing modes.
• Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
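A compact NumPy sketch of one CD-1 update for a binary RBM, following the steps above (start at a data vector, one hidden/visible Gibbs alternation, then the usual data-minus-reconstruction learning rule); the sizes, learning rate and omission of biases are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 64, 32, 0.05
W = 0.01 * rng.standard_normal((n_vis, n_hid))    # weights; biases omitted for brevity

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W):
    # positive phase: hidden probabilities/states given the training vectors
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one Gibbs alternation: reconstruct the visibles, then the hidden probabilities again
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)
    # contrastive divergence approximation of the gradient: <v h>_data - <v h>_recon
    return W + lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)

batch = (rng.random((16, n_vis)) < 0.5).astype(float)   # dummy binary training batch
W = cd1_update(batch, W)
```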

Page 103: Deep learning for image denoising and superresolution

“Wake-Sleep” Algorithm for DBN• Pre-trained DBN is a generative model;

• Do a stochastic bottom-up pass (wake phase)
• Get samples from the factorial distribution (visible first, then generate hidden);

• Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.

• Do a few iterations of sampling in the top level RBM• Adjust the weights in the top-level RBM.

• Do a stochastic top-down pass (sleep phase)
• Get visible and hidden samples generated by the generative model, using data coming from nowhere!

• Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

• Any guarantee for improvement? No!

• The “Wake-Sleep” algorithm tries to describe the representation economically (Shannon’s coding theory).

Page 104: Deep learning for image denoising and superresolution

Greedy Layer-Wise Training• Deep networks tend to have more local minima problems than

shallow networks during supervised training

• Train first layer using unlabeled data• Supervised or semi-supervised: use more unlabeled data.

• Freeze the first layer parameters and train the second layer

• Repeat this for as many layers as desire• Build more robust features

• Use the outputs of the final layer to train the last supervised layer (leave the early weights frozen)

• Fine tune the full network with a supervised approach;

• Avoid problems to train a deep net in a supervised fashion.• Each layer gets full learning

• Help with ineffective early layer learning

• Help with deep network local minima

Page 105: Deep learning for image denoising and superresolution

Why Greedy Layer-Wise Training Works?

• Take advantage of the unlabeled data;

• Regularization Hypothesis

• Pre-training is “constraining” parameters in a region relevant to unsupervised dataset;

• Better generalization (representations that better describe unlabeled data are more discriminative for labeled data) ;

• Optimization Hypothesis

• Unsupervised training initializes lower level parameters near localities of better minima than random initialization can.

• Only need fine tuning in the supervised learning stage.

Page 106: Deep learning for image denoising and superresolution

Two-Stage Pre-training in DBMs • Pre-training in one stage

• Positive phase: clamp the observed units, sample the hidden units, using a variational approximation (mean-field)

• Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC)

• Pre-training in two stages
• Approximate a posterior distribution over the states of the hidden units (with a simpler directed deep model such as DBNs or a stacked DAE);
• Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of the hidden units.

• Options (CAST, contrastive divergence, stochastic approximation…).

Page 107: Deep learning for image denoising and superresolution

A. Reference
• Dempster, A., Laird, N., Rubin, D. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". J. of the Royal Statistical Society B, 39(1): 1-38.
• L. R. Rabiner (Feb 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proc. of the IEEE, 77(2): 257-286.
• Mitchell, T. (1997). Machine Learning, McGraw Hill.
• Jensen, Finn (1996). An Introduction to Bayesian Networks. Berlin: Springer.
• Frey, Brendan (1998). Graphical Models for Machine Learning and Digital Communication. MIT Press.
• Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall: London.
• M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul (1999). An Introduction to Variational Methods for Graphical Models, Machine Learning, 37(2), pp. 183-233.
• S. Roweis & L. Saul (Dec. 2000). "Nonlinear dimensionality reduction by locally linear embedding". Science, 290, pp. 2323-2326.
• Stan Z. Li (2001). Markov Random Field Modeling in Image Analysis. Springer-Verlag.
• J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning.
• Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y. (2006). "Efficient sparse coding algorithms". Advances in Neural Information Processing Systems.
• U. von Luxburg (2007). "A tutorial on spectral clustering", Statistics and Computing, Vol. 17, Issue 4, pp. 395-416.

Page 108: Deep learning for image denoising and superresolution