Machine learning based contour boundary detection from images

Yu Huang Sunnyvale, California [email protected]


DESCRIPTION

machine learning, scene understanding, static segmentation, Gestalt cues, superpixels, logistic regression, MRF, CRF, manifold learning, ensemble learning, k-means, SVM, Naive Bayes, sparse coding, K-SVD, orthogonal matching pursuit, deep learning, RBM, DBM, DBN, SAE.

TRANSCRIPT

Page 1: Machine learning based contour boundary detection from images

Yu Huang, Sunnyvale, California

[email protected]

Page 2: Machine learning based contour boundary detection from images

Edges, Contours and Boundaries
Finding Meaningful Contours
Static Segmentation (Regions)
Classical Gestalt Cues
Berkeley Segmentation Data Set
Learning for Scene Segmentation
Learn a Local Boundary Model
Image Figure/Ground Assignment
Learning Edges and Boundaries
Sparse Models for Edge Detection
Boundary Detection and Grouping
Sparse Coding for Contour Detection
Sketch Tokens for Contour Detection
Deep Learning Shape Prior for Segmentation
Deep Neural Prediction Network for Visual Boundary
References
Appendix

Page 3: Machine learning based contour boundary detection from images

Edges: significant local changes in an image; they occur on the boundary between two different regions in the image.

Contour: a representation of linked edges for a region boundary.

◦ Closed: corresponds to a region boundary; a filling algorithm determines the pixels in the region.

◦ Open: part of a region boundary; gaps form due to a high edge-detection threshold or weak contrast. Open contours also occur when line fragments are linked together, as in drawing or handwriting.

Contour Representation:
◦ Ordered list of edges (chain codes)
◦ Curve model for a contour (piecewise line segments or cubic splines)

Page 4: Machine learning based contour boundary detection from images

Local edge detection
◦ Problems: false targets, misses

One solution: use other cues (image segmentation)
◦ Texture: sharp changes in orientation or scale of textures

◦ Motion: ≥ 2 frames

◦ Disparity: stereo

[Figure: stereo pair (left eye / right eye) and motion pair (frame 1 / frame 2)]

Page 5: Machine learning based contour boundary detection from images

Regional Approaches (split-merge, watershed, mean shift, ...)

◦ Use regional info; optimize the labelling of regional tokens, e.g. by clustering

◦ Depend on uniformity within the object region

Active Contour Models (snakes)

◦ Use regional (external) & boundary (internal) info; optimize the deformation of the model

◦ Sensitive to initialization; results tend to be too smooth

Level Set (implicit active contours)

◦ Handle topological changes naturally

◦ Not robust to boundary gaps

Contour Grouping

◦ Use boundary info (& regional info); optimize the grouping of contour fragments

Learning-based: boundary detection.

Page 6: Machine learning based contour boundary detection from images

How is grouping done in human vision?

Proximity

Similarity
◦ Brightness

◦ Contrast

Good continuation
◦ Parallelism

◦ Co-circularity

Page 7: Machine learning based contour boundary detection from images
Page 8: Machine learning based contour boundary detection from images

Two-class classification model
Over-segmentation as preprocessing
Use classical Gestalt cues
◦ Contour, texture, brightness and continuation

A linear classifier (logistic regression) is used for training

[Figure: superpixel map (K = 200); reconstruction of human segmentation from superpixels. Superpixels are local, coherent, and preserve structure; cues: contour, texture]

Page 9: Machine learning based contour boundary detection from images

[Figure: image → boundary cues (brightness, color, texture) → model → cue combination]

Challenges: texture cue, cue combination

Goal: learn the posterior probability of a boundary, Pb(x, y, θ), from local information only
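As a toy illustration (not from the paper), logistic regression can combine local cue responses into a boundary posterior; the three cue features, weights and labels below are synthetic stand-ins:

```python
import numpy as np

# Hypothetical sketch: combine per-pixel cue features (brightness, color,
# texture gradients at one orientation) into a Pb estimate via logistic regression.
rng = np.random.default_rng(0)
n = 1000
X = rng.random((n, 3))                       # cue features in [0, 1]
y = (X @ np.array([2.0, 1.0, 3.0]) + rng.normal(0, 0.5, n) > 3.0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(2000):                        # batch gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # Pb estimate per pixel
    w += lr * (X.T @ (y - p)) / n
    b += lr * np.mean(y - p)

print("learned cue weights:", w, "bias:", b)
```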

Page 10: Machine learning based contour boundary detection from images

Human subjects label ground truth figure/ground assignments in natural images.

“Shapemes” encode high-level knowledge in a generic way, capturing local figure/ground cues.

A conditional random field (CRF) incorporates junction cues and enforces global consistency.

Page 11: Machine learning based contour boundary detection from images

Shapemes (clusters of local shapes)

Pb edge maps; human-marked boundaries

Color image contour/junction

Page 12: Machine learning based contour boundary detection from images

Boosted Edge Learning (BEL): Probabilistic Boosting-Tree (PBT) classification;

Features: gradient + Haarlet, computed over a large image patch;

Learns to detect edges from images with labeled ground truth.

Page 13: Machine learning based contour boundary detection from images

PBT Training:

Page 14: Machine learning based contour boundary detection from images
Page 15: Machine learning based contour boundary detection from images

Sparseland model and dictionary learning by K-SVD;

Edge detection as a pixelwise classification problem:
◦ is a patch centered on an edge pixel or not?

Contour training: class-specific edge classifier

Shape training: shape-based object classifier

Classification: edge classifier, then shape classifier
◦ Bike, motorbike, person or car?

[Figure: class-specific edge detection example ("Person?")]

Page 16: Machine learning based contour boundary detection from images

Learning-based boundary detection: SIFT-based features, dimensionality reduction by PCA, boosting (AdaBoost, GentleBoost and MadaBoost);

Boundary grouping: use a normalized saliency criterion; fractional-linear programming finds graph cycles with minimum cost.

Page 17: Machine learning based contour boundary detection from images

Sparse Code Gradients (SCG): by sparse coding (K-SVD); gradient and color, plus depth & surface normals (optional); linear classifier (SVM) with contrast features (SCG); globalization by optionally computing a spectral gradient (like gPb).

Page 18: Machine learning based contour boundary detection from images

Definition: straight lines, T-junctions, Y-junctions, corners, curves, parallel lines; learned (by k-means clustering) from patches of human-generated contours: the number of classes is in the hundreds (150 in the paper); Daisy descriptors are used for shift invariance;

Low-level image features: gradient, color, orientation, etc.; classifier: random decision forest for sketch-token labeling from image patches.

Sketch Tokens

Like “Shapeme”?
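A hedged sketch of the two learning stages, with synthetic stand-ins for the human-drawn contour patches and the low-level feature channels (and 10 token classes instead of 150):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-ins for flattened 15x15 patches of human-drawn contours.
contour_patches = rng.random((500, 225))

# Stage 1: learn sketch-token classes by k-means clustering.
tokens = KMeans(n_clusters=10, n_init=10, random_state=0).fit(contour_patches)
labels = tokens.labels_

# Stage 2: a random forest maps low-level features of each image patch
# (gradient, color, orientation channels; random stand-ins here) to tokens.
image_features = rng.random((500, 64))
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(image_features, labels)
print(clf.predict(image_features[:5]))
```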

Page 19: Machine learning based contour boundary detection from images
Page 20: Machine learning based contour boundary detection from images

Use a deep Boltzmann machine to learn the hierarchical architecture of shape priors: low-level local features and high-level global features;

Apply the learned architecture to model shape variations of global and local structures;

A data-driven variational method to perform object extraction based on shape probabilistic representation.

Page 21: Machine learning based contour boundary detection from images

[Figure: original image; result with the learned shape; result by the sparse learned shape]

Page 22: Machine learning based contour boundary detection from images

Integration from multiple scales and semantic levels via multi-streams of interlinked, layered, non-linear “deep” processing;

◦ Deep belief net with a variant of the mean-and-covariance RBM;

Unsupervised feature learning;

◦ Supervised boundary prediction by a feed-forward NN.

Page 23: Machine learning based contour boundary detection from images
Page 24: Machine learning based contour boundary detection from images

X. Ren and J. Malik, "Learning a Classification Model for Segmentation", ICCV, 2003.

D. Martin, C. Fowlkes, and J. Malik, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues", IEEE T-PAMI, 2004.

P. Dollár, Z. Tu, and S. Belongie, "Supervised Learning of Edges and Object Boundaries", CVPR, 2005.

X. Ren, C. Fowlkes, and J. Malik, "Figure/Ground Assignment in Natural Images", ECCV, 2006.

J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, "Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation", ECCV, 2008.

I. Kokkinos, "Highly Accurate Boundary Detection and Grouping", CVPR, 2010.

X. Ren and L. Bo, "Discriminatively Trained Sparse Code Gradients for Contour Detection", NIPS, 2012.

J. Lim, C. L. Zitnick, and P. Dollár, "Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection", CVPR, 2013.

Chen, Yu, Hu, and Zeng, "Deep Learning Shape Priors for Object Segmentation", CVPR, 2013.

Kivinen, Williams, and Heess, "Visual Boundary Prediction: A Deep Neural Prediction Network and Quality Dissection", AISTATS, 2014.

Page 25: Machine learning based contour boundary detection from images
Page 26: Machine learning based contour boundary detection from images

“Machine Learning is programming computers to optimize a performance criterion using example data or past experience.”
◦ Supervised/unsupervised models: labeled/unlabeled data;
◦ Semi-supervised models: both labeled and unlabeled data;
◦ Online learning: incremental update;
◦ Ensemble classifiers: bagging, stacking, boosting, random forest, …;
◦ Reinforcement learning: learn by interacting with an environment.

Types of ML algorithms
◦ Prediction: predicting a variable from data;
◦ Classification: assigning records to predefined groups;
◦ Clustering: splitting records into groups based on similarity;
◦ Association learning: seeing what often appears together with what.

Relationship with other fields
◦ Artificial intelligence: emulates how the brain works with programs; ML is a branch of AI;
◦ Data mining: building models in order to detect patterns;
◦ Statistical analysis: probabilistic models, on which to infer with data;
◦ Information retrieval: retrieval of information from a collection of data.

Page 27: Machine learning based contour boundary detection from images

Unsupervised learning tries to find hidden structure in unlabeled data; since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution;

It is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data.

Approaches to unsupervised learning include:
◦ Clustering;
◦ Hidden Markov models;
◦ Blind signal separation (PCA, ICA, NMF, SVD, …).

Unsupervised methods in NNs:
◦ Self-Organizing Map: a topographic organization in which nearby locations in the map represent inputs with similar properties;
◦ Adaptive Resonance Theory: allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of the vigilance parameter.

Page 28: Machine learning based contour boundary detection from images

Supervised learning is the task of inferring a function from labeled training data. The training data consist of a set of training examples.

Each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, used for mapping new examples.

There are four major issues to consider in supervised learning:
◦ tradeoff between bias and variance;
◦ amount of training data relative to the complexity of the "true" function;
◦ dimensionality of the input space: curse of dimensionality;
◦ degree of noise in the desired output values: over-fitting.

There are several ways the setting can be generalized:
◦ Semi-supervised learning: the desired output values are provided only for a subset of the training data; the remaining data are unlabeled.
◦ Active learning: instead of assuming that all of the training examples are given at the start, interactively collect new examples, typically by making queries to a human user.

Page 29: Machine learning based contour boundary detection from images

Training/testing data split (e.g. 70%/30%)

Unbalanced data (one class has more data than the others)

◦ Sampling, learning algorithm modification (cost-sensitive), ensemble,…

Feature extraction

◦ Sparse coding, vector quantization,…

Curse of Dimensionality: Sensitivity to “noise”

◦ Dimension reduction, manifold learning/distance metric learning

Linear or non-linear model

◦ Local/Global minimum (convex/concave obj. function): Learning rate

◦ Regularization: L-1/L-2 norm

◦ Kernel trick: mapping nonlinear feature space to high dim. linear space

Discriminative or generative model

◦ Bottom up (conditional distribution) /Top down (joint distribution)

Over-fitting: Learn the “noise”

◦ Cross validation with grid search

Performance evaluation

◦ Precision/recall, confusion matrix, ROC (receiver operating characteristic) curve

Page 30: Machine learning based contour boundary detection from images

Principal Component Analysis (PCA) uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components.

PCA is sensitive to the relative scaling of the original variables.

Closely related to the Karhunen–Loève transform (KLT), the Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.;
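A minimal numpy sketch of PCA computed via SVD on synthetic data: center the data first; the rows of Vt are the principal directions, ordered by decreasing variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 2.0 * X[:, 0]            # introduce correlation between variables

Xc = X - X.mean(axis=0)             # center (PCA is scale-sensitive: standardize if needed)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                     # rows: principal directions
explained_var = S**2 / (len(X) - 1) # variance carried by each component
scores = Xc @ Vt.T                  # data expressed in the principal-component basis

print(explained_var)                # first component has the largest variance
```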

Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;

Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions.

Page 31: Machine learning based contour boundary detection from images
Page 32: Machine learning based contour boundary detection from images

Independent component analysis (ICA) is for separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and all statistically independent from each other.

◦ ICA is a special case of blind source separation.

Assumptions: the source signals are independent of each other; the distribution of values in each source signal is non-Gaussian.

Three effects of mixing signals:

◦ Independence: the sources are independent, but their mixtures may not be;

◦ Normality: mixtures are closer to Gaussian than any of the original variables;

◦ Complexity: the complexity of a mixture is greater than that of its simplest constituent source signal.

Preprocessing: centering, whitening and dimension reduction;

ICA finds the independent components (latent variables) by maximizing the statistical independence of the estimated components;

Definitions of independence for ICA:

◦ Minimization of mutual information (KL divergence or entropy);

◦ Maximization of non-Gaussianity (kurtosis and negative entropy).
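A small usage sketch with scikit-learn's FastICA on two synthetic sources (a square wave and a sinusoid); FastICA performs the centering/whitening preprocessing internally and maximizes non-Gaussianity:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))             # square wave (non-Gaussian source)
s2 = np.sin(5 * t)                      # sinusoid
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.4, 1.0]])  # mixing matrix
X = S @ A.T                             # observed signal mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # recovered sources (up to order and scale)
print(S_est.shape)
```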

Page 33: Machine learning based contour boundary detection from images

[Figure: initial signals → mixed signals → after whitening → after ICA]

Page 34: Machine learning based contour boundary detection from images

Mixture model is a probabilistic model for representing the presence of subpopulations within an overall population;

“Mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population;

A Gaussian mixture model can be Bayesian or non-Bayesian; a variety of approaches focus on maximum likelihood estimation (MLE), such as expectation maximization (EM), or on maximum a posteriori (MAP) estimation;

EM is used to determine the parameters of a mixture with an a priori given number of components (a variant version can adapt the number during iteration);

◦ Expectation step: the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point;

◦ Maximization step: plug-in estimates, mixing coefficients and component model parameters, are re-computed for the distribution parameters;

◦ Each successive EM iteration will not decrease the likelihood.

Alternatives to EM for mixture models:
◦ mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC);
◦ spectral methods based on SVD;
◦ graphical models: MRF or CRF.
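A minimal numpy EM sketch for a two-component 1-D Gaussian mixture over pooled synthetic observations (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pooled observations from two hidden subpopulations.
x = np.r_[rng.normal(-2, 1.0, 300), rng.normal(3, 1.5, 200)]

# Initial guesses for mixing weights, means, standard deviations.
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: partial membership of each data point in each component.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing coefficients and component parameters.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sd)   # approaches the generating parameters
```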

Page 35: Machine learning based contour boundary detection from images
Page 36: Machine learning based contour boundary detection from images

Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H such that all three matrices have no negative elements.

The different types arise from using different cost functions for measuring the divergence between V and W·H, and possibly from regularization of the W and/or H matrices;
◦ squared error, Kullback–Leibler divergence, or total variation (TV);

NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis);

pLSA is a statistical technique for two-mode analysis (extended naturally to higher modes), modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions;
◦ the parameters are learned using the EM algorithm;

pLSA is based on a mixture decomposition derived from a latent class model, rather than on downsizing the occurrence tables by SVD as in LSA.

Note: an extended model, LDA (Latent Dirichlet Allocation), adds a Dirichlet prior on the per-document topic distribution.
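A small numpy sketch of NMF with the Lee–Seung multiplicative updates for the squared-error cost (one of the cost functions listed above); the data matrix is a random non-negative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 30))            # non-negative data matrix
k = 5
W = rng.random((20, k)) + 1e-3
H = rng.random((k, 30)) + 1e-3

# Multiplicative updates for the squared-error cost ||V - WH||^2;
# they keep W and H non-negative and do not increase the cost.
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

print(np.linalg.norm(V - W @ H))    # reconstruction error after fitting
```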

Page 37: Machine learning based contour boundary detection from images

Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)

Page 38: Machine learning based contour boundary detection from images

A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;

In an HMM the state is not visible, but the output, dependent on the state, is visible.
◦ Each state has a probability distribution over the possible output tokens;
◦ The sequence of tokens generated by an HMM gives some information about the sequence of states.

Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;

A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;

Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);

Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).
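A minimal log-domain Viterbi decoder for a toy two-state HMM (all probabilities are made-up illustrative numbers):

```python
import numpy as np

A = np.log(np.array([[0.7, 0.3],        # state transition probabilities
                     [0.4, 0.6]]))
B = np.log(np.array([[0.5, 0.4, 0.1],   # output (emission) probabilities
                     [0.1, 0.3, 0.6]]))
pi = np.log(np.array([0.6, 0.4]))       # initial state distribution
obs = [0, 1, 2, 2, 1]                   # observed token sequence

delta = pi + B[:, obs[0]]               # best log-prob of paths ending in each state
back = []
for o in obs[1:]:
    trans = delta[:, None] + A          # trans[i, j]: best path to i, then i -> j
    back.append(trans.argmax(axis=0))   # best predecessor for each state j
    delta = trans.max(axis=0) + B[:, o]

# Backtrack the most likely state trajectory (dynamic programming).
state = int(delta.argmax())
path = [state]
for bp in reversed(back):
    state = int(bp[state])
    path.append(state)
print(path[::-1])
```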

Page 39: Machine learning based contour boundary detection from images

Logistic regression is a probabilistic statistical classification model;

The probabilities of the possible outcomes of a single trial are modeled as a function of explanatory variables via the logistic function;

Training: maximizes the conditional likelihood P(y|x) directly;

Page 40: Machine learning based contour boundary detection from images

Convex optimization (logistic function): w = argmax_w P(Y|X, w);
◦ add a regularization term as well, to control overfitting;
◦ iterative solution: a gradient descent method.

In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, usually as Gaussian distributions;
◦ apply the Metropolis–Hastings algorithm (a more general MCMC method than Gibbs sampling), based on a proposal (jumping) distribution: the proposal distribution Q proposes the next point that the random walk might move to.

Page 41: Machine learning based contour boundary detection from images

The Naive Bayes classifier assumes features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps:

◦ Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.

◦ Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class, and then classifies the test sample according to the largest posterior probability.

The class-conditional independence assumption greatly simplifies the training step since you can estimate the one-dimensional class-conditional density for each feature individually;

◦ While the class-conditional independence between features is not true in general, research shows that this optimistic assumption works well in practice;

◦ This assumption of class-conditional independence allows the Naive Bayes classifier to estimate the parameters required for accurate classification while using less training data than many other classifiers;

◦ This makes it particularly effective for datasets containing many predictors or features.
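A minimal Gaussian Naive Bayes sketch on synthetic 2-D data: the training step fits per-class priors and independent per-feature 1-D Gaussians; the prediction step picks the largest posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, (100, 2))
X1 = rng.normal([2, 3], 1.0, (100, 2))
X = np.vstack([X0, X1]); y = np.r_[np.zeros(100), np.ones(100)]

# Training step: per-class prior, feature means and variances
# (one 1-D density per feature: the class-conditional independence assumption).
stats = {}
for c in (0, 1):
    Xc = X[y == c]
    stats[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)

def log_posterior(x, c):
    prior, mu, var = stats[c]
    ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return np.log(prior) + ll.sum()    # sum of independent 1-D log densities

# Prediction step: classify by the largest posterior probability.
x_new = np.array([1.5, 2.0])
print(max((0, 1), key=lambda c: log_posterior(x_new, c)))
```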

Page 42: Machine learning based contour boundary detection from images

Supported distributions in NB classifiers
◦ Naive Bayes is based on estimating P(x|y), the probability or probability density of features x given class y.
◦ Support for normal (Gaussian), kernel, multinomial, and multivariate multinomial distributions:
  Normal (Gaussian) distribution: features have normal distributions in each class;
  Kernel: computes a separate kernel density estimate for each class based on the training data for that class;
  Multinomial distribution ("bag of words" model): each feature is the count of one word; classification is based on the relative frequencies of the words;
  Multivariate multinomial distribution: features are categorical, with categories distinct from the class levels of the response variable.

Page 43: Machine learning based contour boundary detection from images

Separable Data
◦ An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class.
◦ "Margin" means the maximal width of the slab parallel to the hyperplane that has no interior data points.
◦ The support vectors are the data points that are closest to the separating hyperplane.

Page 44: Machine learning based contour boundary detection from images

Mathematical Formulation: Primal.

min over f, ξ:  ||f||²_K + C Σᵢ₌₁..ₗ ξᵢ
s.t.  yᵢ f(xᵢ) ≥ 1 − ξᵢ,  ξᵢ ≥ 0,  for all i

Variables ξᵢ are slack variables measuring the error made at point (xᵢ, yᵢ).

Mathematical Formulation: Dual.

min over α:  ½ Σᵢ₌₁..ₗ Σⱼ₌₁..ₗ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ) − Σᵢ₌₁..ₗ αᵢ
s.t.  0 ≤ αᵢ ≤ C for all i,  Σᵢ₌₁..ₗ αᵢ yᵢ = 0

Page 45: Machine learning based contour boundary detection from images

Non-separable Data
◦ Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin, meaning a hyperplane that separates many, but not all, data points.

Nonlinear Transformation with Kernels
◦ Some binary classification problems do not have a simple hyperplane as a useful separating criterion;
◦ Theory of reproducing kernels: polynomial, radial basis or sigmoid functions;
◦ Nonlinear kernels can use identical calculations and solution algorithms, and obtain classifiers that are nonlinear.
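A short usage sketch with scikit-learn's SVC: concentric rings admit no separating hyperplane in the input space, but an RBF kernel separates them; C controls the soft margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Concentric rings: not linearly separable in the input space.
r = np.r_[rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)]
a = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[r * np.cos(a), r * np.sin(a)]
y = np.r_[np.zeros(200), np.ones(200)]

# RBF kernel: the computations stay in input space but are mathematically
# equivalent to inner products in a high-dimensional feature space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y), "support vectors per class:", clf.n_support_)
```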

Page 46: Machine learning based contour boundary detection from images

Random Field: F = {F1, F2, …, FM} is a family of random variables on a set S in which each Fi takes a value fi in a label set L.

Markov Random Field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property.
◦ Generative model for the joint probability p(x);
◦ Potentials allow no direct probabilistic interpretation;
◦ Define potential functions Ψ on maximal cliques that map a joint assignment to a non-negative real number;
◦ Requires normalization.

An MRF is an undirected graphical model.

Page 47: Machine learning based contour boundary detection from images

Conditional, not joint, probabilistic sequential models p(y|x)

Allow arbitrary, non-independent features on the observation sequence X

Specify the probability of possible label sequences given an observation sequence

The probability of a transition between labels may depend on past and future observations

Relax strong independence assumptions; no model of p(x) required

A CRF is an MRF plus "external" variables, where the "internal" variables Y of the MRF are unobservable and the "external" variables X are observable

Linear-chain CRF: the transition score depends on the current observation

◦ Inference by DP as in HMMs; learning by forward–backward as in HMMs

Optimization for learning a CRF: discriminative model

◦ Conjugate gradient, stochastic gradient, …

Page 48: Machine learning based contour boundary detection from images

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models;

Ensembles combine many weak learners to produce a strong learner;

◦ Term ensemble is for methods that generate multiple hypotheses using the same base learner;

Ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would; but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data;

Empirically, ensembles tend to yield better results when there is a significant diversity among the models;

Popular types:

◦ Bagging, boosting, stacking, stochastic discrimination, random subspace, …

◦ Random forest, derived from the random subspace method, constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.

Page 49: Machine learning based contour boundary detection from images

Sample the training set to generate random independent bootstrap replicates, construct a classifier on each, and aggregate them by a majority vote in the final decision rule (hence "bootstrap aggregating");

Bootstrapping is based on random sampling with replacement;

Therefore, taking a bootstrap replicate (random selection with replacement) of the training set can sometimes avoid, or pick up fewer, misleading training objects;

Consequently, a classifier constructed on such a training set may have a better performance.
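A minimal bagging sketch (bootstrap replicates of the training set, one decision tree per replicate, majority vote), using scikit-learn trees and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Bootstrap aggregating: sample with replacement, train, majority-vote.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))       # bootstrap replicate
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees]) # (25, n) predictions
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (majority == y).mean())
```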

Page 50: Machine learning based contour boundary detection from images

At each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set; this in effect maximizes the margins between training objects;

Classifiers are constructed on weighted versions of the training set, which depend on previous classification results;

Boosting originated from the Probably Approximately Correct (PAC) learning theory;

AdaBoost was the first algorithm that could adapt to the weak learners;

Variants of AdaBoost (Adaptive Boosting):

◦ LogitBoost: stagewise fitting of an additive logistic regression model;

◦ GentleBoost: the update is fm(x) = P(y=1|x) − P(y=0|x) instead of the log-ratio ½ log(P(y=1|x)/P(y=0|x)).
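A minimal discrete AdaBoost sketch with decision stumps on synthetic data, showing the re-weighting step (misclassified objects get larger weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                              # labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))          # uniform initial weights
stumps, alphas = [], []

for _ in range(30):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()               # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)         # misclassified objects get larger weights
    w /= w.sum()
    stumps.append(stump); alphas.append(alpha)

F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```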

Page 51: Machine learning based contour boundary detection from images
Page 52: Machine learning based contour boundary detection from images
Page 53: Machine learning based contour boundary detection from images

In an SVM, one performs global optimization in order to maximize the minimal margin, while in boosting one maximizes the margin locally for each training object;

An SVM uses the L-2 norm for both the hypothesis and the weight vector, while boosting uses the L-∞ norm for the hypothesis vector and the L-1 norm for the weight vector;

It is shown that if the number of relevant weak hypotheses k is a small fraction of the total number of weak hypotheses, then the margin associated with boosting will be much larger than the one associated with an SVM;

SVM training corresponds to quadratic programming, while boosting corresponds only to linear programming;

Through the method of kernels, an SVM can perform low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional "virtual" space; instead, boosting employs greedy search: the re-weighting of the examples changes the distribution with respect to which the correlation is measured, thus guiding the weak learner to find different correlated coordinates.

Page 54: Machine learning based contour boundary detection from images

With discrete stochastic processes, arbitrary numbers of very weak models are generated and combined to separate the points in multi-dimensional spaces.
◦ Can be regarded as a method of dimensionality reduction;
◦ "Uniformity": two points from the same class are equally likely to be captured by a weak model of a given size;
◦ "Enrichment": weak models do not have the same chance of capturing points from different classes.

SD has the property that the very complex and accurate classifiers produced in this way retain the ability, characteristic of their weak component pieces, to generalize to new data;

It is in combining these weak models that the discriminative power is developed.

SD simply transforms the multi-dimensional feature vectors to points coming from two univariate normal distributions;

These two univariate normal distributions separate further as the number of weak models increases, which intuitively is similar to how people accumulate knowledge.

Page 55: Machine learning based contour boundary detection from images
Page 56: Machine learning based contour boundary detection from images

Classifiers are constructed in random subspaces of the data feature space, usually combined by simple majority voting in the final decision rule;

It also relies on a stochastic process that randomly selects a number of components of the given feature vector in constructing each classifier;

Geometrically this is equivalent to projecting all the points onto the selected subspace;

The random subspace method effectively takes advantage of high dimensionality.

Page 57: Machine learning based contour boundary detection from images

• Defined as a set {D, X, Y} such that

DX = Y (dictionary D, sparse codes X, signals Y)

Page 58: Machine learning based contour boundary detection from images

• Given D and yi, how do we find xi?

• Constraint: xi is sufficiently sparse;

• Finding the exact solution is difficult;

• Is an approximate solution good enough?

Page 59: Machine learning based contour boundary detection from images

Greedy methods: project the residual onto some atom; the residual is updated iteratively in the direction of the chosen atom;
◦ Matching pursuit, orthogonal matching pursuit;

L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);

Gradient-based methods for finding new search directions:
◦ Projected gradient descent;
◦ Coordinate descent;

Homotopy: a set of solutions indexed by a (regularization) parameter:
◦ LARS (Least Angle Regression);

First-order/proximal methods: generalized gradient descent:
◦ solve the proximal operator efficiently;
◦ soft-thresholding for the L1-norm;
◦ accelerated by Nesterov's optimal first-order method;

Iterative reweighting schemes:
◦ L2-norm: Chartrand and Yin (2008);
◦ L1-norm: Candès et al. (2008).

Page 60: Machine learning based contour boundary detection from images

OMP loop (inputs: dictionary D and signal y; output: sparse code x):

1. Select the atom dk with maximum projection on the residual;
2. Solve xk = arg min ||y − Dk xk|| over the selected atoms Dk;
3. Update the residual: r = y − Dk xk;
4. Check the terminating condition (target sparsity or residual norm).
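A minimal numpy implementation of the loop above (unit-norm dictionary columns assumed; the synthetic test recovers a 3-sparse code):

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal matching pursuit: greedy sparse code with at most k atoms.
    D: (m, n) dictionary with unit-norm columns; y: (m,) signal."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # 1. Select the atom with maximum projection on the residual.
        j = int(np.abs(D.T @ residual).argmax())
        support.append(j)
        # 2. Least squares on the selected atoms: xk = argmin ||y - Dk xk||.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        # 3. Update the residual.
        residual = y - D[:, support] @ coef
        # 4. Terminating condition.
        if np.linalg.norm(residual) < 1e-10:
            break
    return x

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 60))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(60); x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
print(np.nonzero(omp(D, D @ x_true, 3))[0])   # recovers atoms {3, 17, 41}
```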

Page 61: Machine learning based contour boundary detection from images

What D to use? A fixed overcomplete set of bases offers no adaptivity.
◦ Steerable wavelets; bandlets, curvelets, contourlets; DCT basis; Gabor functions; …

Data-adaptive dictionary: learn it from data;

K-SVD: a generalization of the k-means clustering process used for Vector Quantization (VQ);
◦ an iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.

Other methods of dictionary learning:
◦ non-negative matrix decompositions;
◦ sparse PCA (sparse dictionaries);
◦ fused-lasso regularizations (piecewise-constant dictionaries).

Extending the models: sparsity + self-similarity = group sparsity.

Page 62: Machine learning based contour boundary detection from images

K-SVD loop:

Initialize the dictionary
• Select atoms from the input; atoms can be image patches; patches are overlapping.

Sparse coding (OMP)
• Use OMP or any pursuit method; output sparse codes for all signals; minimize the representation error.

Update the dictionary, one atom at a time.
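A compact K-SVD sketch under these steps: scikit-learn's orthogonal_mp for the sparse-coding stage, then a rank-1 SVD update of one atom at a time (random synthetic training signals):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 500))                  # training signals as columns
D = rng.normal(size=(20, 40))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms

for _ in range(10):                             # K-SVD iterations
    X = orthogonal_mp(D, Y, n_nonzero_coefs=3)  # sparse-coding stage (OMP)
    for k in range(D.shape[1]):                 # update one atom at a time
        users = np.nonzero(X[k])[0]             # signals that use atom k
        if len(users) == 0:
            continue
        X[k, users] = 0.0
        E = Y[:, users] - D @ X[:, users]       # residual without atom k
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                       # best rank-1 fit: new atom
        X[k, users] = S[0] * Vt[0]              # and its coefficients

print(np.linalg.norm(Y - D @ X))                # representation error
```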

Page 63: Machine learning based contour boundary detection from images

Representation learning attempts to automatically learn good features or representations;

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high-level features);

They become effective via unsupervised pre-training + supervised fine-tuning;
◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.

They deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised learning, regularizers);

Semi-supervised: manifold-structure assumption;
◦ labeled data is scarce and unlabeled data is abundant.

Page 64: Machine learning based contour boundary detection from images

• Supervised training of deep models (e.g. many-layered nets) is too hard (an optimization problem);
Learn a prior from unlabeled data.

• Shallow models are not suited for learning high-level abstractions; ensembles or forests do not learn features first; graphical models could be deep nets, but mostly are not.

• Unsupervised learning can be "local learning";
Resembles boosting, with each layer being like a weak learner.

• Learning is weak in directed graphical models with many hidden variables;
Sparsity and regularizers.

• Traditional unsupervised learning methods are not easy to extend to multiple levels of representation.
Layer-wise unsupervised learning is the solution.

• Multi-task learning (transfer learning and self-taught learning);
• Other issues: scalability & parallelism with the burden of big data.

Page 65: Machine learning based contour boundary detection from images

A neural network = running several logistic regressions at the same time;

◦ Neuron=logistic regression or…

Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule)

◦ Online learning: stochastic/incremental gradient descent;

◦ Batch learning: conjugate gradient descent.

Page 66: Machine learning based contour boundary detection from images

A CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input;
◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling;
◦ related to the generative MRF / discriminative CRF:
CNN = Field-of-Experts MRF = ML inference in a CRF;
◦ generates 'patterns of patterns' for pattern recognition.

Each layer combines (merges, smooths) patches from previous layers
◦ Pooling/sampling (e.g., max or average) filters: compress and smooth the data;
◦ Convolution filters: (translation invariance) unsupervised;
◦ Local contrast normalization: increases sparsity, improves optimization/invariance.

C layers: convolutions; S layers: pooling/sampling.

Page 67: Machine learning based contour boundary detection from images

Convolutional Networks are trainable multistage architectures composed of multiple stages;

Input and output of each stage are sets of arrays called feature maps;

At output, each feature map represents a particular feature extracted at all locations on input;

Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;

A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;

◦ A fully connected layer: softmax transfer function for posterior distribution.

Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;

Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

◦ In the rectified function, gi is a trainable gain parameter, and it might be followed by a contrast normalization N;

Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;

Supervised training is performed using a form of SGD to minimize the prediction error;

◦ Gradients are computed with the back-propagation method.

Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.

(* denotes the discrete convolution operator)

Page 68: Machine learning based contour boundary detection from images
Page 69: Machine learning based contour boundary detection from images

• A hybrid model: can be trained as generative or discriminative model;

• Deep architecture: multiple layers (learn features layer by layer);

• Multi layer learning is difficult in sigmoid belief networks.

• Top two layers are undirected connections, Restricted Boltzmann Machine (RBM);

• Lower layers get top down directed connections from layers above;

• Unsupervised or self-taught pre-learning provides a good initialization;

• Greedy layer-wise unsupervised training for RBM;

• Supervised fine-tuning

• Generative: wake-sleep algorithm (Up-down);

• Discriminative: back propagation (bottom-up);

A belief net is a directed acyclic graph composed of stochastic variables.

Page 70: Machine learning based contour boundary detection from images

• Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one hidden layer);

• Learning internal representations that become increasingly complex;

• High-level representations built from a large supply of unlabeled inputs;

• Pre-training: learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph);

• Generative fine-tuning: different from DBN

• Positive and negative phase

• Discriminative fine-tuning: the same as for DBNs

• Back propagation.

Page 71: Machine learning based contour boundary detection from images

• Denoising Auto-Encoder: Multilayer NNs with target output=input;

• Auto-encoder learns the salient variation like a nonlinear PCA;

• Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning

• Drop the decode layer each time

• Performs better than stacking RBMs;

• Supervised training on the last layer using final features;

• (option) Supervised training on the entire network to fine- tune all weights of the neural net;

• Empirically not quite as accurate as DBNs.

Page 72: Machine learning based contour boundary detection from images

Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of sums are called M-estimators; where are the stationary points of the likelihood function (i.e. the zeroes of its derivative, the score function)?

• Online gradient descent samples a subset of the summand functions at every step; the true gradient is approximated by the gradient at a single example;

• Shuffle the training set at each pass.

• There is a compromise between the two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples.

• SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
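A minimal mini-batch SGD sketch on a convex least-squares objective (a sum of per-example losses), with the training set shuffled at each pass:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + rng.normal(0, 0.1, 1000)

w, lr, batch = np.zeros(5), 0.05, 32
for epoch in range(20):
    perm = rng.permutation(len(X))         # shuffle the training set each pass
    for i in range(0, len(X), batch):
        idx = perm[i:i + batch]            # mini-batch: compromise between a
        Xb, yb = X[idx], y[idx]            # single example and the full gradient
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(w)    # close to w_true (convex objective: converges to the global minimum)
```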

Page 73: Machine learning based contour boundary detection from images

Back Propagation

• Back propagation is a multi-layer network training method.
• Error propagation
• Forward propagation of a training pattern's input through the multilayer network to generate the output activations;
• Backward propagation of the output activations (logistic or soft-max) through the multilayer network, using the training pattern's target, to generate the deltas of all output and hidden units (the chain rule);
• Weight update
• Multiply each unit's output delta and input activation to get the weight gradient;
• Subtract a ratio (i.e. the learning rate) of the gradient from the weight.
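A tiny numpy sketch of the three steps (forward pass, backward pass via the chain rule, weight update) on the XOR problem, using squared-error deltas with sigmoid units:

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny 2-layer network learning XOR: 2 inputs, 4 sigmoid hidden units, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
sig = lambda z: 1 / (1 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Forward propagation of the inputs through the network.
    h = sig(X @ W1 + b1)
    out = sig(h @ W2 + b2)
    # Backward propagation: deltas of output and hidden units (chain rule).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Weight update: gradient = input activation x output delta.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # approaches [0, 1, 1, 0]; a different seed may be
                              # needed if training stalls in a local minimum
```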

Page 74: Machine learning based contour boundary detection from images

Too large a learning rate
◦ causes oscillation in searching for the minimal point

Too small a learning rate
◦ leads to too slow convergence to the minimal point

Adaptive learning rate
◦ At the beginning, the learning rate can be large while the current point is far from the optimal point;
◦ Gradually, the learning rate will decay as time goes by.

It should not be too large or too small:
◦ annealing rate 𝛼(𝑡) = 𝛼(0)/(1 + 𝑡/𝑇)
◦ 𝛼(𝑡) eventually goes to zero, but at the beginning it is almost a constant.

Page 75: Machine learning based contour boundary detection from images

Classical Momentum (CM) is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations: given the objective function f(θ),

Vt+1 = µVt − ε𝛻f(θt),  θt+1 = θt + Vt+1,

with ε > 0 the learning rate, µ ∈ [0, 1] the momentum coefficient and 𝛻f(θt) the gradient at θt;

Nesterov's Accelerated Gradient (NAG) is also a 1st-order optimization method, with a better convergence rate guarantee than gradient descent:

Vt+1 = µVt − ε𝛻f(θt + µVt),  θt+1 = θt + Vt+1;

For convex objectives, momentum-based methods outperform SGD in the early or transient stages of optimization, but are equally effective in the final stage;

Hessian-free (HF) methods and truncated Newton methods work by optimizing a local quadratic model of the objective via the linear conjugate gradient (CG) algorithm;

◦ If CG is terminated after just one step, HF becomes equivalent to NAG.
Page 76: Machine learning based contour boundary detection from images

Dropout: set the output of each hidden neuron to zero w.p. 0.5.
◦ Motivation: combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging.
◦ The units which are "dropped out" in this way do not contribute to the forward pass and do not participate in back propagation.
◦ So every time an input is presented, the NN samples a different architecture, but all these architectures share weights.
◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units.
◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units.
◦ Without dropout, the network exhibits substantial overfitting.
◦ Dropout roughly doubles the number of iterations required to converge.

Maxout takes the maximum across multiple feature maps;
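A minimal sketch of the dropout mask; this uses the common "inverted dropout" scaling (divide by 1 − p at training time), which is equivalent to the formulation above but needs no rescaling in the test-time forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5, train=True):
    """Inverted dropout: zero each unit w.p. p and scale survivors by 1/(1-p)."""
    if not train:
        return h                        # full network at test time
    mask = rng.random(h.shape) >= p     # a different sampled architecture each pass
    return h * mask / (1.0 - p)

h = rng.normal(size=(4, 8))             # hidden activations for a mini-batch
print(dropout_forward(h))               # about half the units are zeroed
```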

Page 77: Machine learning based contour boundary detection from images

Markov Chain: a stochastic process in which future states are independent of past states given the present state.
◦ A Markov chain will typically converge to a stable distribution.

Markov Chain Monte Carlo: sampling using 'local' information
◦ Devise a Markov chain whose stationary distribution is the target.
An ergodic MC must be aperiodic, irreducible, and positive recurrent.
◦ Monte Carlo integration to get quantities of interest.

Metropolis–Hastings method: sampling from a target distribution
◦ Create a Markov chain whose transition matrix does not depend on the normalization term.
◦ Make sure the chain has a stationary distribution and that it is equal to the target distribution (acceptance ratio).
◦ After a sufficient number of iterations, the chain will converge to the stationary distribution.

Gibbs sampling is a special case of M-H sampling.
◦ The Hammersley–Clifford theorem: get the joint distribution from the complete conditional distributions.

Hybrid Monte Carlo: a gradient sub-step for each Markov chain step.
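A minimal random-walk Metropolis–Hastings sketch targeting an unnormalized 1-D mixture density; note that the acceptance ratio never needs the normalization term:

```python
import numpy as np

rng = np.random.default_rng(0)
# Target: unnormalized density of a mixture of two Gaussians.
target = lambda x: np.exp(-0.5 * (x - 3) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

x, samples = 0.0, []
for _ in range(20000):
    x_new = x + rng.normal(0, 1.0)                # symmetric proposal Q (random walk)
    accept = min(1.0, target(x_new) / target(x))  # acceptance ratio: no normalization
    if rng.random() < accept:
        x = x_new
    samples.append(x)

samples = np.array(samples[5000:])                # discard burn-in iterations
print(samples.mean(), samples.std())              # moments of the target mixture
```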

Page 78: Machine learning based contour boundary detection from images

Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;

Mean field replaces M with a (simpler) subset M(F), on which A*(μ) has a closed form (note: F is a disconnected graph);
◦ The density becomes a factorized product distribution in this sub-family;
◦ Objective: KL divergence.

Mean field is a structured variational approximation approach:
◦ coordinate ascent (deterministic);

Compared with stochastic approximation (sampling):
◦ faster, but maybe not exact.

Page 79: Machine learning based contour boundary detection from images

Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a quicker way to learn RBMs;

◦ contrastive divergence as the new objective;

◦ take gradients and ignore a term which is usually very small.

Steps:
◦ Start with a training vector on the visible units.
◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);

CD learning is biased: it does not work as true gradient descent.
Improvement: persistent CD explores more modes in the distribution
◦ Rather than starting from data samples, begin sampling from the mode samples obtained from the last gradient update.
◦ Still suffers from divergence of the likelihood due to missed modes.

Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
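A minimal CD-1 sketch for a small binary RBM on random toy data: one alternating Gibbs step, then the gradient approximated as data statistics minus reconstruction statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(0, 0.1, (n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
sig = lambda z: 1 / (1 + np.exp(-z))
data = (rng.random((100, n_v)) < 0.5).astype(float)   # toy binary training vectors

lr = 0.1
for _ in range(100):
    v0 = data
    # Positive phase: start with a training vector, sample the hidden units.
    ph0 = sig(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One alternating Gibbs step (CD-1): reconstruct visibles, resample hiddens.
    pv1 = sig(h0 @ W.T + b_v)
    ph1 = sig(pv1 @ W + b_h)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(data)
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

print(np.abs(data - sig(sig(data @ W + b_h) @ W.T + b_v)).mean())  # recon. error
```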

Page 80: Machine learning based contour boundary detection from images

A pre-trained DBN is a generative model;

Do a stochastic bottom-up pass (wake phase)
◦ Get samples from the factorial distribution (visible first, then generate hidden);
◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.

Do a few iterations of sampling in the top-level RBM
◦ Adjust the weights in the top-level RBM.

Do a stochastic top-down pass (sleep phase)
◦ Get visible and hidden samples generated by the generative model, using data coming from nowhere!
◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
◦ Any guarantee of improvement? No!

The "wake–sleep" algorithm tries to make the representation economical (in the spirit of Shannon's coding theory).

Page 81: Machine learning based contour boundary detection from images

Deep networks tend to have more problems with local minima than shallow networks during supervised training

Train the first layer using unlabeled data
◦ Supervised or semi-supervised: use more unlabeled data.

Freeze the first-layer parameters and train the second layer

Repeat this for as many layers as desired
◦ Build more robust features

Use the outputs of the final layer to train the last supervised layer (leave the early weights frozen)

Fine-tune the full network with a supervised approach;

This avoids the problems of training a deep net in a purely supervised fashion.
◦ Each layer gets full learning
◦ Helps with ineffective early-layer learning
◦ Helps with deep network local minima

Page 82: Machine learning based contour boundary detection from images

Take advantage of the unlabeled data;

Regularization Hypothesis

◦ Pre-training is “constraining” parameters in a region relevant to unsupervised dataset;

◦ Better generalization (representations that better describe unlabeled data are more discriminative for labeled data) ;

Optimization Hypothesis

◦ Unsupervised training initializes lower level parameters near localities of better minima than random initialization can.

Only need fine tuning in the supervised learning stage.

Page 83: Machine learning based contour boundary detection from images

Pre-training in one stage
◦ Positive phase: clamp the observed units, sample the hidden units, using a variational approximation (mean-field);
◦ Negative phase: sample both observed and hidden units, using persistent sampling (stochastic approximation: MCMC).

Pre-training in two stages
◦ Approximate a posterior distribution over the states of hidden units (with a simpler directed deep model such as a DBN or stacked DAE);
◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of hidden units.
Options: CAST, contrastive divergence, stochastic approximation, …

Page 84: Machine learning based contour boundary detection from images