SINGULAR LEARNING THEORY
Part I: Statistical Learning
Shaowei Lin (Institute for Infocomm Research, Singapore)
21-25 May 2013, Motivic Invariants and Singularities Thematic Program
Notre Dame University
Overview
Probability
Statistics
Bayesian
Regression
Algebraic Geometry
Asymptotic Theory:

∫_{[0,1]²} (1 − x²y²)^{N/2} dx dy ≈ √(π/8) N^{−1/2} log N − √(π/8) (1/log 2 − 2 log 2 − γ) N^{−1/2}
− (1/4) N^{−1} log N + (1/4) (1/log 2 + 1 − γ) N^{−1} − (√(2π)/128) N^{−3/2} log N + · · ·

(γ is the Euler-Mascheroni constant.)
[Diagram: a square with corners A, B, C, D relating Statistical Learning, Algebraic Geometry, Algebraic Statistics, and Singular Learning Theory]
Overview
Probability
Statistics
Bayesian
Regression
Part I: Statistical Learning
Part II: Real Log Canonical Thresholds
Part III: Singularities in Graphical Models
Probability
Random Variables
A probability space (ξ, F, P) consists of
• a sample space ξ, which is the set of all possible outcomes,
• a collection♯ F of events, which are subsets of ξ,
• an assignment P : F → [0, 1] of probabilities to events.

A (real-valued) random variable X : ξ → R^k is
• a function from the sample space to a real vector space,
• a measurement of the possible outcomes.
• X ∼ P means "X has the distribution given by P".

♯ σ-algebra: contains ∅ and is closed under complement and countable union. Probability measure: P(∅) = 0, P(ξ) = 1, countable additivity for disjoint events. Measurable function: for all x ∈ R^k, the preimage of {y ∈ R^k : y ≤ x} is in F.
Example. Rolling a fair die.
ξ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}, F = {∅, {⚀}, {⚀, ⚁}, . . .}
X ∈ {1, 2, 3, 4, 5, 6}, P(X=1) = 1/6, P(X≤3) = 1/2
Discrete and Continuous Random Variables
If ξ is finite, we say X is a discrete random variable.
• probability mass function p(x) = P(X = x), x ∈ R^k.
If ξ is infinite, we define the
• cumulative distribution function (CDF) F(x) = P(X ≤ x), x ∈ R^k,
• probability density function♯ (PDF) p(y) satisfying
  F(x) = ∫_{y ∈ R^k : y ≤ x} p(y) dy, x ∈ R^k.

If the PDF exists, then X is a continuous random variable.

♯ Radon-Nikodym derivative of F(x) with respect to the Lebesgue measure on R^k. We can also define PDFs for discrete variables if we allow the Dirac delta function.

The probability mass/density function is often informally referred to as the distribution of X.
Gaussian Random Variables
Example. Multivariate Gaussian distribution X ∼ N(µ, Σ).
X ∈ R^k, mean µ ∈ R^k, covariance Σ ∈ R^{k×k} with Σ ≻ 0 (positive definite).

p(x) = (2π)^{−k/2} (det Σ)^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
[Figure: PDFs of univariate Gaussian distributions X ∼ N(µ, σ²)]
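As a sanity check of this normalization, here is a small Python sketch (my addition, not from the slides) comparing the formula against scipy.stats.multivariate_normal on random inputs:

```python
# Evaluate the multivariate Gaussian PDF directly and compare with scipy.
import numpy as np
from scipy.stats import multivariate_normal

k = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=k)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)      # a random positive-definite covariance
x = rng.normal(size=k)

quad = (x - mu) @ np.linalg.solve(Sigma, x - mu)
p = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

assert np.isclose(p, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```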
Basic Concepts
The expectation E[X] is the integral of the random variable X with respect to its probability measure, i.e. the "average".

Discrete variables: E[X] = ∑_{x ∈ R^k} x P(X = x)
Continuous variables: E[X] = ∫_{R^k} x p(x) dx
The variance E[(X − E[X])²] measures the "spread" of X.

The conditional probability♯ P(A|B) of two events A, B ∈ F is the probability that A will occur given that we know B has occurred.
• If P(B) > 0, then P(A|B) = P(A ∩ B)/P(B).

♯ The formal definition depends on the notion of conditional expectation.
Example. Weather forecast.

|            | Rain | No Rain |
|------------|------|---------|
| Thunder    | 0.2  | 0.1     |
| No Thunder | 0.3  | 0.4     |

P(Rain|Thunder) = 0.2/(0.1 + 0.2) = 2/3
Independence
Let X ∈ R^k, Y ∈ R^l, Z ∈ R^m be random variables.

X, Y are independent (X ⊥⊥ Y) if
• P(X ∈ S, Y ∈ T) = P(X ∈ S) P(Y ∈ T) for all measurable subsets S ⊂ R^k, T ⊂ R^l,
• i.e. "knowing X gives no information about Y".

X, Y are conditionally independent given Z (X ⊥⊥ Y | Z) if
• P(X ∈ S, Y ∈ T | Z = z) = P(X ∈ S | Z = z) P(Y ∈ T | Z = z) for all z ∈ R^m and measurable subsets S ⊂ R^k, T ⊂ R^l,
• i.e. "any dependence between X and Y is due to Z".

Example. Hidden variables.
Favorite color X ∈ {red, blue}, favorite food Y ∈ {salad, steak}. If X, Y are dependent, one may ask if there is a hidden variable, e.g. gender Z ∈ {female, male}, such that X ⊥⊥ Y | Z.
Statistics
Statistical Model
Let ∆ denote the space of distributions on the outcome space ξ.

Model: a family M of probability distributions, i.e. a subset of ∆.

Parametric model: a family M of distributions p(·|ω) indexed by parameters ω in a space Ω, i.e. we have a map Ω → ∆.
Example. Biased coin tosses.
Number of heads in two tosses of a coin: H ∈ ξ = {0, 1, 2}.
Space of distributions: ∆ = {p ∈ R^3_{≥0} : p(0) + p(1) + p(2) = 1}
Probability of getting heads: ω ∈ Ω = [0, 1] ⊂ R
Parametric model for H:
p(0|ω) = (1 − ω)²
p(1|ω) = 2ω(1 − ω)
p(2|ω) = ω²
Implicit equation: 4 p(0) p(2) − p(1)² = 0
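That the parametrized distributions satisfy this implicit equation is easy to verify symbolically; a minimal sympy sketch (my addition):

```python
# The parametrized coin model lies on the curve 4*p(0)*p(2) - p(1)^2 = 0.
import sympy as sp

w = sp.symbols('omega')
p0, p1, p2 = (1 - w)**2, 2*w*(1 - w), w**2

assert sp.simplify(4*p0*p2 - p1**2) == 0   # the implicit equation holds for all omega
assert sp.simplify(p0 + p1 + p2) == 1      # the p(h|omega) form a distribution
```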
Maximum Likelihood
A sample X1, . . . , XN of X is a set of independent, identically distributed (i.i.d.) random variables with the same distribution as X.

Goal: Given a statistical model {p(·|ω) : ω ∈ Ω} and a sample, find a distribution p(·|ω̂) that best describes the sample.

A statistic f(X1, . . . , XN) is a function of the sample. An important statistic is the maximum likelihood estimate (MLE). It is a parameter ω̂ that maximizes the likelihood function

L(ω) = ∏_{i=1}^N p(Xi|ω).
Example. Biased coin tosses.
Suppose the table below summarizes a sample of H of size 100.

| H     | 0  | 1  | 2  |
|-------|----|----|----|
| Count | 25 | 45 | 30 |

Then L(ω) = 2⁴⁵ ω¹⁰⁵ (1 − ω)⁹⁵, which is maximized at ω̂ = 105/200.
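A numerical check (my sketch, assuming scipy is available) that the likelihood peaks at ω̂ = 105/200:

```python
# Maximize the coin likelihood L(omega) numerically and compare with 105/200.
import numpy as np
from scipy.optimize import minimize_scalar

counts = {0: 25, 1: 45, 2: 30}   # observed counts of H, sample size N = 100

def neg_log_likelihood(w):
    p = {0: (1 - w)**2, 1: 2*w*(1 - w), 2: w**2}
    return -sum(n * np.log(p[h]) for h, n in counts.items())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
assert np.isclose(res.x, 105/200, atol=1e-4)
```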
Kullback-Leibler Divergence
Let X1, . . . , XN be a sample of a discrete variable X. The empirical distribution is the function

q(x) = (1/N) ∑_{i=1}^N δ(x − Xi)

where δ(·) is the Kronecker delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q‖p) = ∑_{x ∈ R^k} q(x) log( q(x)/p(x) ).
Proposition. ML distributions q̂(·) = p(·|ω̂) ∈ M minimize the KL divergence of q̂ from the empirical distribution q(·):

K(q‖q̂) = ∑_{x ∈ R^k} q(x) log q(x) − (1/N) log L(ω̂)

where the first sum is the entropy term, which does not depend on ω̂.
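A quick numerical check of this identity for the coin model (my addition):

```python
# Verify K(q||p) = sum_x q(x) log q(x) - (1/N) log L(omega) on the coin model.
import numpy as np

counts = np.array([25, 45, 30])       # sample of H with N = 100
N = counts.sum()
q = counts / N                        # empirical distribution of H

w = 0.4                               # an arbitrary parameter value
p = np.array([(1 - w)**2, 2*w*(1 - w), w**2])

lhs = np.sum(q * np.log(q / p))                  # K(q||p)
rhs = np.sum(q * np.log(q)) - np.sum(counts * np.log(p)) / N
assert np.isclose(lhs, rhs)
```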
Kullback-Leibler Divergence
Let X1, . . . , XN be a sample of a continuous variable X. The empirical distribution is the generalized function

q(x) = (1/N) ∑_{i=1}^N δ(x − Xi)

where δ(·) is the Dirac delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q‖p) = ∫_{R^k} q(x) log( q(x)/p(x) ) dx.

Proposition. ML distributions q̂(·) = p(·|ω̂) ∈ M minimize the KL divergence of q̂ from the empirical distribution q(·):

K(q‖q̂) = ∫_{R^k} q(x) log q(x) dx − (1/N) log L(ω̂)

where the first integral is the entropy term, which does not depend on ω̂.
Kullback-Leibler Divergence
Example. Population mean and variance.
Let X ∼ N(µ, σ²) be the height of a random Singaporean. Given a sample X1, . . . , XN, estimate the mean µ and the variance σ².

Now, the Kullback-Leibler divergence is

K(q‖p(·|µ, σ)) = (1/(2σ²N)) ∑_{i=1}^N (Xi − µ)² + log σ + constant.
Differentiating this function gives us the MLE

µ̂ = (1/N) ∑_{i=1}^N Xi,   σ̂² = (1/N) ∑_{i=1}^N (Xi − µ̂)².

The MLE for the model mean is the sample mean. The MLE for the model variance is the sample variance.
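These closed forms can be confirmed by direct numerical minimization; a sketch on synthetic data (the sample is simulated, my addition):

```python
# Numerically recover the Gaussian MLE and compare with sample mean/variance.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(loc=170.0, scale=7.0, size=500)   # hypothetical height sample
N = len(X)

def neg_log_likelihood(theta):                   # -log L up to an additive constant
    mu, log_sigma = theta
    return 0.5 * np.sum((X - mu)**2) / np.exp(2 * log_sigma) + N * log_sigma

res = minimize(neg_log_likelihood, x0=[150.0, 2.0])
mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])

assert np.isclose(mu_hat, X.mean(), atol=1e-3)
assert np.isclose(sigma2_hat, X.var(), rtol=1e-3)   # np.var uses the 1/N convention
```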
Mixture Models
A mixture of distributions p1(·), . . . , pm(·) is a convex combination

p(x) = ∑_{i=1}^m αi pi(x),   x ∈ R^k,

i.e. the mixing coefficients αi are nonnegative and sum to one.
Example. Gaussian mixtures.
Mixing univariate Gaussians N(µi, σi²), i = 1, . . . , m, produces distributions of the form

p(x) = ∑_{i=1}^m ( αi / √(2πσi²) ) exp( −(x − µi)² / (2σi²) ).

This mixture model is therefore described by the parameters

ω = (α1, . . . , αm, µ1, . . . , µm, σ1, . . . , σm)

and is frequently used in cluster analysis.
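A small sketch of working with such a mixture: sampling via a hidden component label and checking that the density integrates to one (the parameters are arbitrary, my choice):

```python
# Sample from a two-component Gaussian mixture and evaluate its PDF.
import numpy as np

alphas = np.array([0.3, 0.7])          # mixing coefficients: nonnegative, sum to one
mus    = np.array([-2.0, 1.5])
sigmas = np.array([0.5, 1.0])

def mixture_pdf(x):
    x = np.asarray(x, dtype=float)
    comps = alphas / np.sqrt(2 * np.pi * sigmas**2) \
            * np.exp(-(x[..., None] - mus)**2 / (2 * sigmas**2))
    return comps.sum(axis=-1)

rng = np.random.default_rng(2)
z = rng.choice(len(alphas), size=10_000, p=alphas)   # hidden component labels
samples = rng.normal(mus[z], sigmas[z])

grid = np.linspace(-10.0, 10.0, 4001)                # the PDF integrates to ~1
dx = grid[1] - grid[0]
assert np.isclose(mixture_pdf(grid).sum() * dx, 1.0, atol=1e-4)
```

The hidden label z here is exactly the kind of hidden variable that, as discussed later, makes mixture models singular.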
Bayesian Statistics
Interpretations of Probability Theory
Bayes’ Rule
Updating our belief in event A based on an observation B:

P(A|B) = ( P(B|A) / P(B) ) · P(A)

where P(A|B) is the posterior (new belief) and P(A) is the prior (old belief).
Example. Biased coin toss.
Let θ denote P(heads) of a coin. Determine if the coin is fair (θ = 1/2) or biased (θ = 3/4).
Old belief: P(fair) = 0.9.
Now, suppose we observed a sample with 8 heads and 2 tails.
New belief:

P(fair|sample) = P(sample|fair) P(fair) / P(sample)
= (1/2)⁸ (1/2)² (0.9) / [ (1/2)⁸ (1/2)² (0.9) + (3/4)⁸ (1/4)² (0.1) ] ≈ 0.584
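The arithmetic in one short script (my addition):

```python
# Reproduce the posterior P(fair | 8 heads, 2 tails) ~ 0.584.
theta_fair, theta_biased = 1/2, 3/4
prior_fair = 0.9
heads, tails = 8, 2

lik_fair   = theta_fair**heads   * (1 - theta_fair)**tails
lik_biased = theta_biased**heads * (1 - theta_biased)**tails

posterior_fair = (lik_fair * prior_fair
                  / (lik_fair * prior_fair + lik_biased * (1 - prior_fair)))
print(round(posterior_fair, 3))   # 0.584
```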
Parameter Estimation
Let D = {X1, . . . , XN} denote a sample of X, i.e. "the data".

Frequentists: Compute the maximum likelihood estimate.
Bayesians: Treat the parameters ω ∈ Ω as random variables with priors p(ω) that require updating.

Posterior distribution on parameters ω ∈ Ω:
p(ω|D) = p(D|ω) p(ω) / ∫_Ω p(D|ω) p(ω) dω

Posterior mean: ∫_Ω ω p(ω|D) dω
Posterior mode: argmax_{ω ∈ Ω} p(ω|D)

Predictive distribution on outcomes x ∈ ξ:
p(x|D) = ∫_Ω p(x|ω) p(ω|D) dω
These integrals are difficult to compute or even estimate!
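For a one-dimensional parameter space, though, simple grid quadrature already works; a sketch for the biased-coin model with a uniform prior on Ω = [0, 1] (the prior is my choice, not from the slides):

```python
# Posterior mean for the biased-coin model by grid quadrature, uniform prior.
import numpy as np

counts = {0: 25, 1: 45, 2: 30}              # the data D, sample size N = 100
w = np.linspace(1e-6, 1 - 1e-6, 100_001)    # grid on Omega = [0, 1]
dw = w[1] - w[0]

log_lik = (counts[0] * np.log((1 - w)**2)
           + counts[1] * np.log(2 * w * (1 - w))
           + counts[2] * np.log(w**2))
lik = np.exp(log_lik - log_lik.max())       # rescale; constants cancel below

posterior = lik / (lik.sum() * dw)          # p(omega|D) = p(D|omega)p(omega)/Z
posterior_mean = (w * posterior).sum() * dw
print(posterior_mean)                       # ~0.525, close to the MLE 105/200
```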
Model Selection
Which model M1, . . . , Mm best describes the sample D?

Frequentists: Pick the model containing the ML distribution.
Bayesians: Assign priors p(Mi), p(ω|Mi) and compute

P(Mi|D) = P(D|Mi) P(Mi) / P(D) ∝ P(D|Mi) P(Mi),

where P(D|Mi) is the likelihood integral

P(D|Mi) = ∫_Ω ∏_{j=1}^N p(Xj|ω, Mi) p(ω|Mi) dω.

Parameter estimation is a form of model selection! For each ω ∈ Ω, we define a model Mω with one distribution.
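As a concrete illustration, a crude Monte Carlo estimate of the likelihood integral for the coin model, sampling ω from a uniform prior (my choice; the next slides discuss better methods):

```python
# Estimate log P(D|M) = log E_prior[L(omega)] for the coin model by Monte Carlo.
import numpy as np

counts = {0: 25, 1: 45, 2: 30}
rng = np.random.default_rng(3)
w = rng.uniform(size=1_000_000)             # draws from the prior p(omega|M)

log_lik = (counts[0] * np.log((1 - w)**2)
           + counts[1] * np.log(2 * w * (1 - w))
           + counts[2] * np.log(w**2))

m = log_lik.max()                           # log-sum-exp for numerical stability
log_marginal = m + np.log(np.mean(np.exp(log_lik - m)))
print(log_marginal)
```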
Estimating Integrals
Generally, there are three ways to estimate statistical integrals.

1. Exact methods
Compute a closed-form formula for the integral, e.g. Baldoni·Berline·De Loera·Köppe·Vergne, 2010; Lin·Sturmfels·Xu, 2009.

2. Numerical methods
Approximate using Markov chain Monte Carlo (MCMC) and other sampling techniques.

3. Asymptotic methods
Analyze how the integral behaves for large samples. Rewrite the likelihood integral as

ZN = ∫_Ω e^{−N f(ω)} φ(ω) dω

where f(ω) = −(1/N) log L(ω) and φ(ω) is the prior on Ω.
Bayesian Information Criterion
Laplace approximation: If f(ω) is uniquely minimized at the MLE ω̂ and the Hessian ∂²f(ω̂) is full rank, then

−log ZN = N f(ω̂) + (dim Ω / 2) log N + O(1)

as the sample size N → ∞.

Bayesian information criterion (BIC): Select the model that minimizes

N f(ω̂) + (dim Ω / 2) log N.
[Figure: graphs of e^{−N f(ω)} for N = 1 and N = 10. Integral = volume under graph.]
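A numerical check of this approximation for the (regular) coin model with a uniform prior (my sketch): the gap between −log ZN and the BIC score settles to a constant, i.e. the O(1) term.

```python
# -log Z_N minus the BIC score N f(w_hat) + (1/2) log N stays O(1) as N grows.
import numpy as np

w = np.linspace(1e-9, 1 - 1e-9, 2_000_001)   # grid on Omega, uniform prior
dw = w[1] - w[0]
q = np.array([0.25, 0.45, 0.30])             # fixed empirical distribution of H

f = -(q[0] * np.log((1 - w)**2)              # f(w) = -(1/N) log L(w)
      + q[1] * np.log(2 * w * (1 - w))
      + q[2] * np.log(w**2))

for N in [100, 1_000, 10_000]:
    a = -N * f
    m = a.max()
    log_ZN = m + np.log(np.exp(a - m).sum() * dw)
    bic = N * f.min() + 0.5 * np.log(N)      # dim Omega = 1
    print(N, -log_ZN - bic)                  # converges to a constant
```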
Singularities in Statistical Models
Informally, the model is singular at ω0 ∈ Ω if the Laplace approximation fails when the empirical distribution is p(·|ω0).

Formally, if we define the Kullback-Leibler function

K(ω) = ∫_ξ p(x|ω0) log( p(x|ω0)/p(x|ω) ) dx,

then ω0 is a singularity when the Hessian ∂²K(ω0) is not full rank.
Statistical models with hidden variables, e.g. mixture models, often contain many singularities.
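A toy illustration (my example, not from the slides): for the model Y = abX + ε with X, ε ∼ N(0, 1) and parameters ω = (a, b), one gets K(a, b) = (ab − a0b0)²/2. At the true parameter (a0, b0) = (0, 0) the Hessian is degenerate:

```python
# Rank of the Hessian of K at omega0 = (0, 0) for the toy model K = (a*b)^2 / 2.
import sympy as sp

a, b = sp.symbols('a b', real=True)
K = (a * b)**2 / 2                        # KL function when the true parameter is (0,0)

H = sp.hessian(K, (a, b)).subs({a: 0, b: 0})
print(H)                                  # Matrix([[0, 0], [0, 0]])
print(H.rank())                           # 0 < dim Omega = 2, so (0,0) is a singularity
```

The set of true parameters {ab = 0} is a pair of crossing lines, and it is this kind of crossing that breaks the Laplace approximation.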
Linear Regression
Least Squares
Suppose we have random variables Y ∈ R, X ∈ R^d that satisfy

Y = ω · X + ε.

Parameters ω ∈ R^d; noise ε ∼ N(0, 1); data (Xi, Yi), i = 1 .. N.

• Commonly computed quantities
MLE: argmin_ω ∑_{i=1}^N |Yi − ω · Xi|²
Penalized MLE: argmin_ω ∑_{i=1}^N |Yi − ω · Xi|² + π(ω)

• Commonly used penalties (see the sketch below)
LASSO: π(ω) = |ω|₁ · β
Bayesian Info Criterion (BIC): π(ω) = |ω|₀ · log N
Akaike Info Criterion (AIC): π(ω) = |ω|₀ · 2

• Commonly asked questions
Model selection (e.g. which factors are important?)
Parameter estimation (e.g. how important are the factors?)
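A minimal sketch of penalized least squares with the LASSO penalty, on synthetic data (my addition; note that scikit-learn's Lasso scales the squared error by 1/(2N), so its alpha plays the role of β only up to that factor):

```python
# Plain least squares vs. the LASSO penalty pi(omega) = |omega|_1 * beta.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, d = 100, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]                  # only three relevant factors
Y = X @ w_true + rng.normal(size=N)            # noise ~ N(0, 1)

w_mle = np.linalg.lstsq(X, Y, rcond=None)[0]   # argmin sum |Y_i - w.X_i|^2
w_lasso = Lasso(alpha=0.1).fit(X, Y).coef_

print(np.round(w_mle, 2))     # every coordinate nonzero
print(np.round(w_lasso, 2))   # most irrelevant coordinates exactly 0
```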
Sparsity Penalty
The best model is usually selected using a score

argmin_{u ∈ Ω} l(u) + π(u)

where the likelihood term l(u) measures the fitting error of the model while the penalty π(u) measures its complexity.

Recently, sparse penalties derived from statistical considerations were found to be highly effective.

Bayesian info criterion (BIC) ↔ marginal likelihood integral
Akaike info criterion (AIC) ↔ Kullback-Leibler divergence
Compressive sensing ↔ ℓ1-regularization of BIC
Singular learning theory plays an important role in these derivations.
Paradigms in Statistical Learning
• Probability: the theory of random phenomena.
Statistics: the theory of making sense of data.
Learning: the art of prediction using data.

• Frequentists: "True distribution, maximum likelihood."
Bayesians: "Belief updates, maximum a posteriori."
Interpretations do not affect the correctness of probability theory, but they greatly affect the statistical methodology.

• We often have to balance the complexity of the model with its fitting error via suitable probabilistic criteria. The learning algorithm also needs to be computable to be useful.
Thank you!
“Algebraic Methods for Evaluating Integrals in Bayesian Statistics”
http://math.berkeley.edu/~shaowei/swthesis.pdf
(PhD dissertation, May 2011)
References
1. V. I. Arnol'd, S. M. Gusein-Zade and A. N. Varchenko: Singularities of Differentiable Maps, Vol. II, Birkhäuser, Boston, 1985.
2. V. Baldoni, N. Berline, J. A. De Loera, M. Köppe and M. Vergne: How to integrate a polynomial over a simplex, Mathematics of Computation 80 (2010) 297–325.
3. A. Bravo, S. Encinas and O. Villamayor: A simplified proof of desingularisation and applications, Rev. Mat. Iberoamericana 21 (2005) 349–458.
4. D. A. Cox, J. B. Little and D. O'Shea: Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra, Springer-Verlag, New York, 1997.
5. R. Durrett: Probability: Theory and Examples (4th edition), Cambridge University Press, 2010.
6. M. Evans, Z. Gilula and I. Guttman: Latent class analysis of two-way contingency tables by Bayesian methods, Biometrika 76 (1989) 557–563.
7. H. Hironaka: Resolution of singularities of an algebraic variety over a field of characteristic zero I, II, Ann. of Math. (2) 79 (1964) 109–203.
8. S. Lin, B. Sturmfels and Z. Xu: Marginal likelihood integrals for mixtures of independence models, J. Mach. Learn. Res. 10 (2009) 1611–1631.
9. S. Lin: Algebraic methods for evaluating integrals in Bayesian statistics, PhD dissertation, Dept. of Mathematics, UC Berkeley (2011).
10. L. Mackey: Fundamentals: Probability and Statistics, slides for CS 281A: Statistical Learning Theory at UC Berkeley, Aug 2009.
11. S. Watanabe: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics 25, Cambridge University Press, Cambridge, 2009.