

Computational Statistics and Data Analysis 55 (2011) 1715–1725. doi:10.1016/j.csda.2010.10.026


Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints

Salvatore Ingrassia (a,*), Roberto Rocci (b)

(a) Dipartimento Impresa, Culture e Società, Università di Catania, Corso Italia 55, 95129 Catania, Italy
(b) Dipartimento SEFEMEQ, Università di Roma ‘‘Tor Vergata’’, Roma, Italy

* Corresponding author. E-mail addresses: [email protected] (S. Ingrassia), [email protected] (R. Rocci).

Article info

Article history: received 20 January 2010; received in revised form 28 October 2010; accepted 29 October 2010; available online 5 November 2010.

Keywords: Mixture models; EM algorithm; Degeneracy; Dynamic constraints

Abstract

EM algorithms for multivariate normal mixture decomposition have recently been proposed in order to maximize the likelihood function in a constrained parameter space that has no singularities and a reduced number of spurious local maxima. However, such approaches require some a priori information about the eigenvalues of the covariance matrices. The behavior of the EM algorithm near a degenerate solution is investigated. The theoretical results obtained suggest a new kind of constraint based on the dissimilarity between two consecutive updates of the eigenvalues of each covariance matrix. The performance of such a ‘‘dynamic’’ constraint is evaluated on the grounds of some numerical experiments.


1. The problem

The EM algorithm is a well-known and largely studied general-purpose method for maximum likelihood estimation in incomplete data problems, see e.g. Dempster et al. (1977) and McLachlan and Krishnan (2008). For a given i.i.d. random sample {x_n}_{n=1,...,N} of size N drawn from the density f(x; ψ), where x ∈ R^q and the parameter ψ assumes values in a subset Ψ of a suitable Euclidean space, the EM algorithm generates a sequence of estimates {ψ^(m)}_m, where ψ^(0) denotes the initial guess and ψ^(m) ∈ Ψ for m ∈ N, so that the corresponding sequence of log-likelihood values {L(ψ^(m))}_m is non-decreasing. In this paper we focus on mixtures of k q-variate normal distributions with density

f(x; \psi) = \alpha_1 p(x; \mu_1, \Sigma_1) + \cdots + \alpha_k p(x; \mu_k, \Sigma_k),    (1)

where the α_j's are the mixing weights and p(x; μ_j, Σ_j) is the density function of the j-th q-variate normal component of the mixture, with mean vector μ_j and covariance matrix Σ_j (j = 1, ..., k). Thus ψ = {α_j, μ_j, Σ_j, j = 1, ..., k} ∈ Ψ ⊂ R^{k[1+q+(q²+q)/2]−1}. Here the E-step, on the (m+1)-th iteration, computes the quantities

\phi^{(m+1)}_{nj} = \alpha^{(m)}_j \, p(x_n; \mu^{(m)}_j, \Sigma^{(m)}_j),    (2)

u^{(m+1)}_{nj} = \frac{\phi^{(m+1)}_{nj}}{\sum_{h=1}^{k} \phi^{(m+1)}_{nh}},    (3)




Fig. 1. Two different parameter estimates: correct estimate (left), degenerate estimate (right). The ellipses refer to the 95% confidence regions.

for n = 1, ..., N and j = 1, ..., k; the M-step gives

\alpha^{(m+1)}_j = \frac{1}{N} \sum_{n=1}^{N} u^{(m+1)}_{nj},    (4)

\mu^{(m+1)}_j = \frac{\sum_{n=1}^{N} u^{(m+1)}_{nj} \, x_n}{\sum_{n=1}^{N} u^{(m+1)}_{nj}}, \qquad j = 1, \ldots, k,    (5)

\Sigma^{(m+1)}_j = \frac{\sum_{n=1}^{N} u^{(m+1)}_{nj} \, (x_n - \mu^{(m+1)}_j)(x_n - \mu^{(m+1)}_j)'}{\sum_{n=1}^{N} u^{(m+1)}_{nj}}, \qquad j = 1, \ldots, k.    (6)

For the sake of simplicity, in the rest of the paper the superscripts (m) and (m+1) will be suppressed when the parameters involved in the formulas refer to the same iteration.
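To make the updates concrete, the following sketch (a minimal Python/NumPy illustration written for this exposition, not the authors' code; the function name em_step and the variable layout are ours) performs one EM iteration (2)–(6) for a k-component q-variate Gaussian mixture.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alpha, mu, Sigma):
    # X: (N, q) data; alpha: (k,) weights; mu: (k, q) means; Sigma: (k, q, q) covariances.
    N, q = X.shape
    k = alpha.shape[0]
    # E-step, Eqs. (2)-(3): phi_nj = alpha_j * N(x_n; mu_j, Sigma_j), then posteriors u_nj.
    phi = np.column_stack([alpha[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                           for j in range(k)])
    u = phi / phi.sum(axis=1, keepdims=True)
    # M-step, Eqs. (4)-(6): weighted proportions, means and covariances.
    w = u.sum(axis=0)
    alpha_new = w / N
    mu_new = (u.T @ X) / w[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(k):
        Xc = X - mu_new[j]
        Sigma_new[j] = (u[:, j, None] * Xc).T @ Xc / w[j]
    return alpha_new, mu_new, Sigma_new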

It is well known that the likelihood of Gaussian mixtures is unbounded and may present spurious local maxima, which occur as a consequence of a fitted component having a very small variance or generalized variance (i.e. the determinant of the covariance matrix) compared to the others, see e.g. Day (1969) and Biernacki (2004). Such a component usually corresponds to a cluster containing few data points that are either relatively close together or, in the case of multivariate data, almost lying in a lower-dimensional subspace; in the following we shall refer to such components as degenerate. We illustrate this issue with a numerical example. Consider a sample of size N = 200 generated from a mixture of three bivariate normal distributions (k = 3 and q = 2) with the following parameters:

\alpha = (0.3, 0.4, 0.3)', \qquad \mu_1 = (0, 3)', \quad \mu_2 = (1, 5)', \quad \mu_3 = (-3, 8)',

\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad
\Sigma_2 = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}, \qquad
\Sigma_3 = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.

Fig. 1 shows the estimated normal components both when the correct solution is attained and when a degenerate solution is attained. Note that the degenerate solution concentrates one component on only two points of the sample data.
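A sample of this kind can be drawn, for instance, as follows (an illustrative sketch under the parameters above; the seed is arbitrary and the variable names are ours).

import numpy as np

rng = np.random.default_rng(0)                        # arbitrary seed, for reproducibility only
alpha = np.array([0.3, 0.4, 0.3])
mu = np.array([[0., 3.], [1., 5.], [-3., 8.]])
Sigma = np.array([[[1., 0.], [0., 2.]],
                  [[1., -1.], [-1., 2.]],
                  [[2., 1.], [1., 2.]]])
labels = rng.choice(3, size=200, p=alpha)             # component of each of the N = 200 points
X = np.stack([rng.multivariate_normal(mu[z], Sigma[z]) for z in labels])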

The behavior of the EM algorithm near a degenerate solution has been investigated in the case of univariate normal mixtures by Biernacki and Chretien (2003). They proved that there exists a domain of attraction leading the EM to degeneracy and that the speed of convergence to a singularity is very fast. In particular, they showed that the variance of a degenerate component tends to zero at an exponential rate.

In this paper we first extend such a result to the multivariate case; afterwards, based on this theoretical background, we propose a new kind of dynamic constraint based on the dissimilarity between two consecutive updates of the eigenvalues of each covariance matrix.

The rest of the paper is organized as follows. In the next section we present some theoretical results showing that the smallest eigenvalue of the degenerate component tends to zero at an exponential rate. In Section 3 a new kind of dynamic constraint is introduced. Afterwards, the effectiveness of such constraints is evaluated on the grounds of some numerical experiments based on both simulated and real data, see Section 4. Finally, concluding remarks are given in Section 5.

2. Theoretical results

In this section we extend previous results due to Biernacki and Chretien (2003) to the multivariate case.


Let D be a subset of {1, 2, ..., N} with d elements and assume d ≤ q. Let us denote by j_0 (1 ≤ j_0 ≤ k) a degenerate component of the mixture (1), and set the vector

v_0 = \left[ \{1/\phi_{nj_0}\}_{n \in D}, \ \{\phi_{nj_0}\}_{n \notin D} \right]',    (7)

where φ_{nj0} is defined in (2). For a degenerate component the Euclidean norm ‖v_0‖ is small. Furthermore, we shall consider the following assumption about the eigenvalues and the eigenvectors of the covariance matrix Σ_{j0}:

Assumption. Let γ_min be the eigenvector corresponding to the smallest eigenvalue of Σ_{j0} and x̄_D be the mean vector of the points in D. Then there exists β > 0 such that

\min_{n \notin D} \left[ (x_n - \bar{x}_D)' \gamma_{\min} \right]^2 \geq \beta.    (8)

2.1. First order expansions

Let us set φ̄_{nj0} = Σ_{j≠j0} φ_{nj} (n = 1, ..., N) and consider the conditional probabilities u_{nj} defined in (3). The results of this section are based on the following Taylor expansions of u_{nj0} in a neighborhood of v_0 = 0:

u_{nj_0} = \frac{\phi_{nj_0}}{\phi_{nj_0} + \bar\phi_{nj_0}} =
\begin{cases}
1 - \dfrac{\bar\phi_{nj_0}}{\phi_{nj_0}} + o(\|v_0\|) & \text{if } n \in D, \\[2mm]
\dfrac{\phi_{nj_0}}{\bar\phi_{nj_0}} + o(\|v_0\|) & \text{if } n \notin D,
\end{cases}    (9)

where v_0 is introduced in (7) and o(‖v_0‖) denotes second order or higher terms. In the following we compute the first-order expansions of the parameter updates (4)–(6). To begin with, let us consider the mixing weight:

\alpha_{j_0} = \frac{1}{N} \sum_{n=1}^{N} u_{nj_0}
= \frac{1}{N} \left[ \sum_{n \in D} \left( 1 - \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} \right) + \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \right] + o(\|v_0\|)
= \frac{1}{N} \left[ d - \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} + \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \right] + o(\|v_0\|).

Afterwards, let us consider the mean vector μ_{j0}:

\mu_{j_0} = \frac{\sum_{n=1}^{N} u_{nj_0} x_n}{\sum_{n=1}^{N} u_{nj_0}}
= \frac{\sum_{n \in D} u_{nj_0} x_n + \sum_{n \notin D} u_{nj_0} x_n}{d - \sum_{n \in D} \dfrac{\bar\phi_{nj_0}}{\phi_{nj_0}} + \sum_{n \notin D} \dfrac{\phi_{nj_0}}{\bar\phi_{nj_0}} + o(\|v_0\|)}

= \frac{1}{d^2} \left[ \sum_{n \in D} \left( 1 - \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} \right) x_n + \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} x_n \right] \left[ d + \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} - \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \right] + o(\|v_0\|)

= \frac{1}{d^2} \left[ d \sum_{n \in D} x_n - d \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} x_n + d \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} x_n + \sum_{n \in D} x_n \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} - \sum_{n \in D} x_n \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \right] + o(\|v_0\|).

Let us denote by x̄_D the mean of the points in D, that is x̄_D = (1/d) Σ_{n∈D} x_n; then we get

\mu_{j_0} = \bar{x}_D + \frac{1}{d} \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} (x_n - \bar{x}_D) - \frac{1}{d} \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} (x_n - \bar{x}_D) + o(\|v_0\|).

Finally, let us consider the first-order Taylor expansion of the covariance matrix Σ_{j0}:

\Sigma_{j_0} = \frac{\sum_{n=1}^{N} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}} - (\bar{x}_D - \mu_{j_0})(\bar{x}_D - \mu_{j_0})'

= \frac{\sum_{n \in D} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}} + \frac{\sum_{n \notin D} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}} + O(\|v_0\|),

since (x̄_D − μ_{j0})(x̄_D − μ_{j0})' = O(‖v_0‖). If we set

S_D = \frac{\sum_{n \in D} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}}, \qquad
S_{D^c} = \frac{\sum_{n \notin D} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}},    (10)

then we get

\Sigma_{j_0} = S_D + S_{D^c} + O(\|v_0\|).    (11)

Now we note that

S_{D^c} = \frac{\sum_{n \notin D} u_{nj_0}(x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{\sum_{n=1}^{N} u_{nj_0}}
= \frac{\sum_{n \notin D} \dfrac{\phi_{nj_0}}{\bar\phi_{nj_0}} (x_n - \bar{x}_D)(x_n - \bar{x}_D)'}{d - \sum_{n \in D} \dfrac{\bar\phi_{nj_0}}{\phi_{nj_0}} + \sum_{n \notin D} \dfrac{\phi_{nj_0}}{\bar\phi_{nj_0}}} + o(\|v_0\|)

= \frac{1}{d^2} \left[ \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} (x_n - \bar{x}_D)(x_n - \bar{x}_D)' \right] \left[ d + \sum_{n \in D} \frac{\bar\phi_{nj_0}}{\phi_{nj_0}} - \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \right] + O(\|v_0\|)

= \frac{1}{d} \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} (x_n - \bar{x}_D)(x_n - \bar{x}_D)' + O(\|v_0\|).    (12)

2.2. Rate of convergence of the smallest eigenvalue of Σ_{j0}

In this subsection we compute the rate of convergence of the smallest eigenvalue of the covariance matrix Σ_{j0} of a degenerate component j_0 and show that it tends to zero at an exponential rate. Let λ^{(m)}_min ≡ λ^{(m)}_1 ≤ ··· ≤ λ^{(m)}_q ≡ λ^{(m)}_max denote the eigenvalues of Σ^{(m)}_{j0} in non-decreasing order, and γ^{(m)}_min ≡ γ^{(m)}_1, ..., γ^{(m)}_q the corresponding eigenvectors. Our first result generalizes Lemma 1 in Biernacki and Chretien (2003).

Lemma. There exists ε > 0 such that if ‖v_0‖ ≤ ε then

\lambda_{\min} \leq \varepsilon^{2/q},

and

\left[ (x_n - \mu_{j_0})' \gamma_{\min} \right]^2 \leq -q \, \varepsilon^{2/q} \ln \varepsilon^{2/q}, \qquad \text{for } n \in D.

Proof. For n ∈ D, let us consider

\phi_{nj_0} = \frac{\alpha_{j_0}}{|2\pi \Sigma_{j_0}|^{1/2}} \exp\left\{ -\frac{1}{2} (x_n - \mu_{j_0})' \Sigma_{j_0}^{-1} (x_n - \mu_{j_0}) \right\}.

Since ‖v_0‖ ≤ ε, then 1/φ_{nj0} ≤ ε and thus

\frac{(2\pi)^{q/2} |\Sigma_{j_0}|^{1/2}}{\alpha_{j_0}} \exp\left\{ \frac{1}{2} (x_n - \mu_{j_0})' \Sigma_{j_0}^{-1} (x_n - \mu_{j_0}) \right\} \leq \varepsilon.    (13)

Moreover (2π)^{q/2}/α_{j0} > 1 and exp{(1/2)(x_n − μ_{j0})'Σ_{j0}^{-1}(x_n − μ_{j0})} ≥ 1, and thus we obtain the first inequality

|\Sigma_{j_0}|^{1/2} \leq \varepsilon \;\Longrightarrow\; \lambda_{\min}^{q} \leq \varepsilon^{2} \;\Longrightarrow\; \lambda_{\min} \leq \varepsilon^{2/q}.    (14)

Now assume 0 < ε ≤ e^{−q/2} < 1. From (13) it follows that

|\Sigma_{j_0}|^{1/2} \exp\left\{ \frac{1}{2} (x_n - \mu_{j_0})' \Sigma_{j_0}^{-1} (x_n - \mu_{j_0}) \right\} \leq \varepsilon,

and taking the logarithm of both sides we get

\ln |\Sigma_{j_0}|^{1/2} + \frac{1}{2} (x_n - \mu_{j_0})' \Sigma_{j_0}^{-1} (x_n - \mu_{j_0}) \leq \ln \varepsilon.

Since

(x_n - \mu_{j_0})' \Sigma_{j_0}^{-1} (x_n - \mu_{j_0}) = \sum_{i=1}^{q} \lambda_i^{-1} (x_n - \mu_{j_0})' \gamma_i \gamma_i' (x_n - \mu_{j_0}) \geq \lambda_{\min}^{-1} \left[ (x_n - \mu_{j_0})' \gamma_{\min} \right]^2,

then (noting that 2 ln ε < 0)

\lambda_{\min}^{-1} \left[ (x_n - \mu_{j_0})' \gamma_{\min} \right]^2 \leq 2 \ln \varepsilon - \ln |\Sigma_{j_0}| \leq -\ln \prod_{i=1}^{q} \lambda_i = -\sum_{i=1}^{q} \ln \lambda_i \leq -q \ln \lambda_{\min}.

Finally, using (14) we get

\left[ (x_n - \mu_{j_0})' \gamma_{\min} \right]^2 \leq -q \, \lambda_{\min} \ln \lambda_{\min} \leq -q \, \varepsilon^{2/q} \ln \varepsilon^{2/q},

because −x ln x is an increasing function of x for 0 < x ≤ e^{−1}, and thus −λ_min ln λ_min ≤ −ε^{2/q} ln ε^{2/q} when λ_min ≤ ε^{2/q} according to (14) (note that ε^{2/q} ≤ e^{−1} since ε ≤ e^{−q/2}). This completes the proof. □

Note that the above result reduces to Lemma 1 in Biernacki and Chretien (2003) for q = 1.

Theorem. Let j_0 be a degenerate component of the mixture (1), with 1 ≤ j_0 ≤ k. Let S^{(m)}_{D^c} be the covariance matrix defined in (10) at iteration m ∈ N. There exists ε > 0 such that if ‖v_0‖ ≤ ε then

y' S^{(m+1)}_{D^c} y < \frac{\delta}{\left( \lambda^{(m)}_{\min} \right)^{q/2}} \exp\left\{ -\frac{\beta}{4 \lambda^{(m)}_{\min}} \right\} + o(\|v_0\|),    (15)

for each y ∈ R^q such that ‖y‖ = 1 and for a suitable constant δ > 0, where β has been defined in (8).

Proof. For the sake of simplicity, throughout this proof the superscript (m+1) will be suppressed, while the superscript − will denote the estimate at the previous iteration m. Taking (12) into account, we consider first

y' S_{D^c} y = y' \left[ \frac{1}{d} \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} (x_n - \bar{x}_D)(x_n - \bar{x}_D)' + O(\|v_0\|) \right] y
= \frac{1}{d} \sum_{n \notin D} \frac{\phi_{nj_0}}{\bar\phi_{nj_0}} \left[ y'(x_n - \bar{x}_D) \right]^2 + o(\|v_0\|)
\leq \sum_{n \notin D} \frac{\phi_{nj_0} \, \|x_n - \bar{x}_D\|^2}{\bar\phi_{nj_0}} + o(\|v_0\|).

The proof is based on two steps: first we compute an upper bound on φ_{nj0}, and afterwards an upper bound on ‖x_n − x̄_D‖²/φ̄_{nj0}.

To begin with, since α⁻_{j0}/(2π)^{q/2} < 1, we have

\phi_{nj_0} = \frac{\alpha^-_{j_0}}{|2\pi \Sigma^-_{j_0}|^{1/2}} \exp\left[ -\frac{1}{2} (x_n - \mu^-_{j_0})' (\Sigma^-_{j_0})^{-1} (x_n - \mu^-_{j_0}) \right]
= \frac{\alpha^-_{j_0}}{|2\pi \Sigma^-_{j_0}|^{1/2}} \exp\left[ -\frac{1}{2} (x_n - \mu^-_{j_0})' \left( \sum_{i=1}^{q} \frac{\gamma^-_i {\gamma^-_i}'}{\lambda^-_i} \right) (x_n - \mu^-_{j_0}) \right]
< \frac{1}{(\lambda^-_{\min})^{q/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{q} \frac{\left[ (x_n - \mu^-_{j_0})' \gamma^-_i \right]^2}{\lambda^-_i} \right]
\leq \frac{1}{(\lambda^-_{\min})^{q/2}} \exp\left[ -\frac{1}{2} \frac{\left[ (x_n - \mu^-_{j_0})' \gamma^-_{\min} \right]^2}{\lambda^-_{\min}} \right].    (16)

In the following we shall prove that [(x_n − μ⁻_{j0})'γ⁻_min]² ≥ β/2 for n ∉ D, where β has been defined in (8). As a matter of fact, we have

\left[ (x_n - \mu^-_{j_0})' \gamma^-_{\min} \right]^2 = \left\{ \left[ (x_n - \bar{x}_D) + (\bar{x}_D - \mu^-_{j_0}) \right]' \gamma^-_{\min} \right\}^2
= \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 + \left[ (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right]^2 + 2 \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right] \left[ (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right]
\geq \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 + 2 \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right] \left[ (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right]
\geq \frac{\left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2}{2} + \frac{\left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2}{2} - 2 \left| (x_n - \bar{x}_D)' \gamma^-_{\min} \right| \cdot \left| (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right|
\geq \frac{\beta}{2} + \frac{\beta}{2} - 2 \left| (x_n - \bar{x}_D)' \gamma^-_{\min} \right| \cdot \left| (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right|.    (17)

Since Σ_{n∈D}[(x_n − μ⁻_{j0})'γ⁻_min]² = Σ_{n∈D}[(x_n − x̄_D)'γ⁻_min]² + d[(x̄_D − μ⁻_{j0})'γ⁻_min]², from the Lemma we have

\sum_{n \in D} \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 + d \left[ (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right]^2 \leq -dq \, \varepsilon^{2/q} \ln \varepsilon^{2/q},    (18)

yielding

\left[ (\bar{x}_D - \mu^-_{j_0})' \gamma^-_{\min} \right]^2 \leq -q \, \varepsilon^{2/q} \ln \varepsilon^{2/q}.

Therefore, since |(x_n − x̄_D)'γ⁻_min| is finite and the right-hand side above vanishes as ε → 0, there exists ε > 0 such that β/2 − 2|(x_n − x̄_D)'γ⁻_min| · |(x̄_D − μ⁻_{j0})'γ⁻_min| ≥ 0, and finally we get

\left[ (x_n - \mu^-_{j_0})' \gamma^-_{\min} \right]^2 \geq \frac{\beta}{2},

so that from (16) the following upper bound on φ_{nj0} is finally attained:

\phi_{nj_0} < \frac{1}{(\lambda^-_{\min})^{q/2}} \exp\left\{ -\frac{\beta}{4 \lambda^-_{\min}} \right\}.

In order to complete the proof, we remark that, since not all components of the mixture are degenerate, there exists a > 0 such that φ̄_{nj0} > a at any iteration of the EM algorithm. Thus let us set

\delta = \frac{\sum_{n \notin D} \|x_n - \bar{x}_D\|^2}{a},

and note that

\|x_n - \bar{x}_D\|^2 \geq \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 \geq \beta > 0,

which implies δ > 0. This completes the proof. □

Now let us consider the main theoretical result of the paper, which generalizes Theorem 2 in Biernacki and Chretien (2003).

Corollary. Let j_0 be a degenerate component of the mixture (1), with 1 ≤ j_0 ≤ k. Let Σ^{(m+1)}_{j0} be the estimate of the covariance matrix Σ_{j0} at iteration m+1 ∈ N given in (6). There exists ε > 0 such that if ‖v_0‖ ≤ ε then

\lambda_{\min}\!\left( \Sigma^{(m+1)}_{j_0} \right) < \frac{\delta}{\left[ \lambda_{\min}\!\left( \Sigma^{(m)}_{j_0} \right) \right]^{q/2}} \exp\left\{ -\frac{\beta}{4 \, \lambda_{\min}\!\left( \Sigma^{(m)}_{j_0} \right)} \right\} + o(\|v_0\|),    (19)

where β has been defined in (8).

Proof. For y ∈ R^q such that ‖y‖ = 1, let us consider y'Σ_{j0}y. From (11) and (15), ignoring the second order terms, we obtain

y' \Sigma_{j_0} y = y'(S_D + S_{D^c}) y < y' S_D y + \frac{\delta}{(\lambda^-_{\min})^{q/2}} \exp\left\{ -\frac{\beta}{4 \lambda^-_{\min}} \right\} + o(\|v_0\|),

and therefore

\lambda_{\min}(\Sigma_{j_0}) = \min_{\|y\|=1} y' \Sigma_{j_0} y < \min_{\|y\|=1} y' S_D y + \frac{\delta}{(\lambda^-_{\min})^{q/2}} e^{-\beta/(4\lambda^-_{\min})} + o(\|v_0\|).

Since the matrix S_D has rank at most d − 1 < q, its smallest eigenvalue is zero, which implies

\min_{\|y\|=1} y' S_D y = 0.

This completes the proof and shows that, in the near degeneracy case, the smallest eigenvalue of Σ_{j0} tends to zero at an exponential rate. □
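To get a feeling for the speed implied by (19), one can iterate the bound numerically, ignoring the o(‖v_0‖) term; in the sketch below the values of δ, β, q and of the starting eigenvalue are arbitrary illustrative choices, not quantities taken from the paper, and the bound is informative only once λ_min is already small compared with β.

import numpy as np

def iterate_bound(lam, delta=1.0, beta=1.0, q=2, n_iter=6):
    # Iterate the leading term of (19): lam <- delta * lam**(-q/2) * exp(-beta / (4 * lam)).
    path = [lam]
    for _ in range(n_iter):
        nxt = delta * lam ** (-q / 2) * np.exp(-beta / (4.0 * lam))
        path.append(nxt)
        if nxt < 1e-300:        # the bound has already collapsed below machine range
            break
        lam = nxt
    return path

print(iterate_bound(0.02))      # e.g. [0.02, ~1.9e-4, 0.0]: the collapse is extremely fast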


2.3. Discussion about the validity of Assumption (8)

The results of the previous two sections are partly based on the validity of (8). We carried out some simulation studies where the degeneracy was mainly caused by the fact that the sample size was small relative to the number q of dimensions, and we found the assumption to be always true. However, this does not imply that the assumption is true in general, unless d = q. In fact, in this case we can state, with probability 1, that the vector (x_m − x̄_D), where m ∉ D, does not belong to the subspace spanned by the d vectors {(x_n − x̄_D), n ∈ D}, which has dimension d − 1. In other words, it means that, with probability 1, there exists ν > 0 such that

\nu = \min_{\|y\|=1} \left\{ \left[ (x_m - \bar{x}_D)' y \right]^2 + \sum_{n \in D} \left[ (x_n - \bar{x}_D)' y \right]^2 \right\} > 0.    (20)

From (18) we get

\sum_{n \in D} \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 \leq -dq \, \varepsilon^{2/q} \ln \varepsilon^{2/q},    (21)

so that, combining (20) with (21), we obtain

\nu \leq \left[ (x_m - \bar{x}_D)' \gamma^-_{\min} \right]^2 + \sum_{n \in D} \left[ (x_n - \bar{x}_D)' \gamma^-_{\min} \right]^2 \leq \left[ (x_m - \bar{x}_D)' \gamma^-_{\min} \right]^2 - dq \, \varepsilon^{2/q} \ln \varepsilon^{2/q},

and

\nu + dq \, \varepsilon^{2/q} \ln \varepsilon^{2/q} \leq \left[ (x_m - \bar{x}_D)' \gamma^-_{\min} \right]^2.

By noting that ε^{2/q} ln ε^{2/q} → 0 when ε → 0, it follows that there exists an ε > 0 such that ν + dq ε^{2/q} ln ε^{2/q} > 0, and the assumption is true.

3. A dynamic constraint on the eigenvalues

The last result presented in the previous section states that if the EM algorithm fits a degenerate component, say j_0, then the smallest eigenvalue of Σ_{j0} tends to zero at an exponential rate. This suggests that during the EM iterations the eigenvalues may vary very rapidly. Such behavior is particularly dangerous when the current estimates of the parameters are far from the optimal solution, as in the first iterations of the algorithm. Thus, we conjecture that such bad behavior can be prevented by bounding the eigenvalue variations between two consecutive iterations. As a matter of fact, the idea of constraining eigenvalues is not new; in particular, constrained EM algorithms for finite mixtures of multivariate normal distributions have been recently proposed in Ingrassia (2004) and Ingrassia and Rocci (2007). These proposals moved from the results of Hathaway (1985), who considered a constrained parameter space for the maximum likelihood estimation of normal mixtures of the kind

\min_{1 \leq h \neq j \leq k} \lambda\!\left( \Sigma_h \Sigma_j^{-1} \right) \geq c > 0, \quad \text{with } 0 < c \leq 1,    (22)

which leads to a constrained (global) maximum likelihood formulation of the problem and a clearly reduced number of spurious maxima of the likelihood function. Ingrassia (2004) proved that (22) holds whenever the eigenvalues of the covariance matrices satisfy the constraints

a \leq \lambda_i(\Sigma_j) \leq b,    (23)

with a/b ≥ c, since

\lambda_{\min}\!\left( \Sigma_h \Sigma_j^{-1} \right) \geq \frac{\lambda_{\min}(\Sigma_h)}{\lambda_{\max}(\Sigma_j)} \geq \frac{a}{b} \geq c > 0.

In Ingrassia and Rocci (2007) this issue has been further deepened; in particular, weaker constraints have been proposed and conditions assuring the monotonicity of constrained EM algorithms have been investigated. Indeed, in order to implement the constraint (23), they first rewrite each covariance matrix as Σ_j = Γ_j Λ_j Γ_j', where Λ_j = diag(λ_{1j}, ..., λ_{qj}) is the diagonal matrix of the eigenvalues in non-decreasing order, and Γ_j is the orthonormal matrix of the standardized eigenvectors of Σ_j. Then, they split the M-step into four separate conditional maximizations with respect to α_j, μ_j, Γ_j and Λ_j, respectively. This procedure leads to the maximum value because at each step the conditional maximum depends only on the values of the parameters obtained in the previous steps. The constraints (23) are implemented by updating the eigenvalues as

\lambda_{ij} = \min(b, \max(a, l_{ij})),    (24)

where l_{ij} is the update of λ_{ij} computed in the unconstrained M-step of the EM algorithm. It has been shown that such a modification does not destroy the monotonicity property of the EM algorithm, see Ingrassia and Rocci (2007) for details. However, it is clear that this kind of constraint requires some a priori information about the intervals where the eigenvalues lie.


We remark that computational problems concerning parameter estimation of multivariate normal mixtures, according to the likelihood approach, have also been recently addressed by considering penalized approaches, see e.g. Chen and Tan (2009), and a doubly smoothed maximum likelihood estimator, see Seo and Lindsay (2010).

In order to bound the eigenvalue variations between two consecutive iterations, we impose some constraints on the dissimilarity between the update and the current values of the eigenvalues of each covariance matrix. We refer to such constraints as ‘‘dynamic constraints’’ in order to point out that the bound on an eigenvalue at the current iteration depends on the value of the eigenvalue computed at the previous step of the algorithm. On the contrary, we shall refer to constraints of the type (23) as ‘‘static’’, because the interval remains fixed during the whole computation.

The idea is that, at each iteration, the reduction of the smallest eigenvalue (and the increment of the largest one) should not exceed a fixed percentage of the previous update. Thus, at iteration m+1 the eigenvalues are updated under the constraints

\lambda_{\min}\!\left( \Sigma^{(m)}_j \right) / \vartheta_a \; \leq \; \lambda^{(m+1)}_{ij} \; \leq \; \vartheta_b \, \lambda_{\max}\!\left( \Sigma^{(m)}_j \right),    (25)

with ϑ_a, ϑ_b > 1. A monotone algorithm implementing (25) can be easily derived by using the constrained EM algorithm of Ingrassia and Rocci (2007) previously described. The update (24) becomes

\lambda^{(m+1)}_{ij} = \min\!\left( \vartheta_b \, \lambda_{\max}\!\left( \Sigma^{(m)}_j \right), \; \max\!\left( \lambda_{\min}\!\left( \Sigma^{(m)}_j \right) / \vartheta_a, \; l^{(m+1)}_{ij} \right) \right).    (26)

In other words, the update of λ_{ij} is l^{(m+1)}_{ij} or, if it lies outside the interval (25), the closest extreme.

It is important to note that this implementation does not lead to an EM algorithm, because in the ‘‘M-step’’ the complete log-likelihood function is not necessarily maximized. However, by noting that at iteration m+1 the complete log-likelihood is increased by every update of λ_{ij} lying in the interval

\left[ \min\!\left( \lambda^{(m)}_{ij}, \, l^{(m+1)}_{ij} \right), \; \max\!\left( \lambda^{(m)}_{ij}, \, l^{(m+1)}_{ij} \right) \right],

we can conclude that in the M-step the complete log-likelihood is always increased. This kind of algorithm, where the complete log-likelihood is increased instead of maximized, has been referred to as generalized EM in Dempster et al. (1977). A discussion of its convergence properties can be found in Wu (1983).
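As an illustration, the following sketch (our own minimal Python/NumPy rendering, not the authors' implementation; the function and argument names are ours) applies the dynamically constrained update (26) to the covariance matrix of one component; the static update (24) is obtained by simply replacing the per-iteration bounds with fixed constants a and b.

import numpy as np

def constrained_cov_update(S_new, S_prev, theta_a=1.111, theta_b=1.111):
    # S_new: covariance of one component from the unconstrained M-step (6);
    # S_prev: estimate of the same covariance at the previous iteration.
    prev_eigs = np.linalg.eigvalsh(S_prev)
    lower = prev_eigs.min() / theta_a           # admissible interval (25)
    upper = prev_eigs.max() * theta_b
    # eigendecomposition Sigma = Gamma Lambda Gamma' of the unconstrained update
    l, Gamma = np.linalg.eigh(S_new)
    lam = np.clip(l, lower, upper)              # update (26): clip to the closest extreme
    return (Gamma * lam) @ Gamma.T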

4. Numerical experiments

In this section we present numerical experiments on both simulated and real data in order to evaluate and compare the performance of the proposed dynamic constraint under different settings. In particular, we consider the following six algorithms:

U: Unconstrained. Ordinary EM. One random starting point.

U2: Unconstrained. Ordinary EM. Two random starting points; the solution giving the highest likelihood value is chosen.

LCS: Lower dynamically constrained, strong. Constrained EM algorithm with ϑ_a = 1.111 and ϑ_b = +∞. One random starting point.

LCW: Lower dynamically constrained, weak. Constrained EM algorithm with ϑ_a = 2 and ϑ_b = +∞. One random starting point.

LUCS: Lower and upper dynamically constrained, strong. Constrained EM algorithm with ϑ_a = ϑ_b = 1.111. One random starting point.

LUCW: Lower and upper dynamically constrained, weak. Constrained EM algorithm with ϑ_a = ϑ_b = 2. One random starting point.

In order to prevent non-invertibility of the covariance matrices, their eigenvalues have been constrained to be greater than 10^{−14}. Each starting point consists of a matrix of weighted means, where the weights are randomly generated, and a set of covariance matrices each equal to a diagonal matrix having half of the mean of the variances of the observed variables on the main diagonal. The prior probabilities are computed from the random weights used to compute the means. The first starting point is the same for each algorithm.
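One possible reading of this initialization is sketched below (our interpretation in Python/NumPy; the authors do not provide code and the helper name random_start is ours).

import numpy as np

def random_start(X, k, rng):
    N, q = X.shape
    W = rng.random((N, k))                           # random membership weights
    W = W / W.sum(axis=1, keepdims=True)
    alpha0 = W.sum(axis=0) / N                       # priors from the same random weights
    mu0 = (W.T @ X) / W.sum(axis=0)[:, None]         # matrix of weighted means
    s = 0.5 * X.var(axis=0).mean()                   # half the mean of the observed variances
    Sigma0 = np.tile(s * np.eye(q), (k, 1, 1))       # identical diagonal covariances
    return alpha0, mu0, Sigma0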

4.1. Simulated data

The sample data have been generated by mixtures of three (k = 3) multivariate normal distributions with:

• prior probabilities α = [0.2, 0.3, 0.5]';
• mean vectors generated independently from a normal distribution with mean zero and standard deviation σ_μ equal to 1 or 3;
• eigenvalues of the covariance matrix of the j-th component generated independently from a uniform distribution on the interval [0.01 × j, j];
• eigenvectors of the covariance matrices obtained by orthonormalizing matrices generated independently from a standard normal distribution.


Table 1. Mean values over 250 samples of the log-likelihood (L), number of iterations (it) and modified rand index (mr) for the six algorithms in 8 different conditions.

σμ  q   N     metric       U         U2        LCS       LCW       LUCS      LUCW
1   4   60    L         −61.82    −58.70    −61.96    −61.84    −61.30    −61.77
              it            51       104        60        51        59        51
              mr          0.26      0.28      0.26      0.26      0.33      0.28
1   4   240   L        −313.27   −310.34   −313.31   −313.25   −313.37   −313.48
              it           108       210       111       108        98       104
              mr          0.53      0.56      0.53      0.53      0.54      0.53
1   12  60    L           6.62     42.60     37.95      6.37     55.86     18.09
              it            12        24        87        16        96        16
              mr          0.29      0.34      0.33      0.29      0.56      0.38
1   12  240   L        −583.06   −547.07   −576.42   −581.38   −542.99   −575.67
              it            46        90        54        46        43        43
              mr          0.74      0.84      0.75      0.74      0.86      0.76
3   4   60    L          42.83     49.91     47.31     42.14     49.35     43.77
              it            35        69        60        37        51        34
              mr          0.67      0.73      0.75      0.66      0.84      0.72
3   4   240   L         137.30    146.54    139.84    137.77    136.86    138.54
              it            39        76        54        39        54        38
              mr          0.91      0.95      0.91      0.91      0.91      0.91
3   12  60    L         494.42    620.06    596.66    531.21    631.93    561.67
              it             9        18       201        29       210        30
              mr          0.62      0.66      0.76      0.64      0.92      0.78
3   12  240   L        1022.40   1054.30   1051.80   1026.40   1039.40   1025.80
              it            20        40        52        20        55        17
              mr          0.93      0.96      0.97      0.94      0.94      0.93

A different set of means and covariance matrices has been generated for each sample. The six algorithms have been tested in eight different conditions obtained by combining the following two-level factors:

1. sample size: small (N = 60), large (N = 240);
2. number of variables: small (q = 4), large (q = 12);
3. component separation: small (σ_μ = 1), large (σ_μ = 3).

For each combination of the three factors, 250 samples have been generated. The aim was to analyze the performance of the algorithms in terms of mean values of:

• (L) the log-likelihood computed at the final estimate;
• (#iter) the number of iterations;
• (mr) the modified rand index of Hubert and Arabie (1985) between the estimated and the true classification, computed by assigning each observation to the component corresponding to the maximum posterior probability. This index is less than 1, and equal to 1 if the two partitions are equal;
• (msd) the unweighted sum of squared differences between the true and the estimated conditional probabilities. In formulas,

msd = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j=1}^{k} \left( u_{nj} - \hat{u}_{nj} \right)^2.

This index is equal to zero when the true and the estimated conditional probabilities coincide. In our simulations it has been computed by minimizing over all possible permutations of the labels of the estimated components; a small sketch of this computation is given after this list.

The algorithms were stopped when the relative increment of the log-likelihood was less than 10^{−7}. The results are listed in Table 1. To begin with, let us consider the first case where all factors are at a small level. The four dynamically constrained algorithms are always at least as good as the ordinary EM algorithm and outperform it when the variation of the eigenvalues is bounded in a given range, that is for the LUCS and LUCW algorithms. Of course, the performance of the ordinary EM increases when two starting points are considered, but its goodness of recovery, measured in terms of the modified rand index, remains lower than that of LUCS. Looking at the average number of iterations, we can see that LUCS has a performance which is superior to that of U2, saving about 40% of computational time. It is interesting to note that in this case U2 gives, on average, maxima of the log-likelihood that are greater than those obtained with LUCS even though its efficiency in recovery is not as good. This is probably due to the fact that it stops more frequently on spurious local maxima.

Those differences almost disappear in the second situation, when the sample size increases from 60 to 240, while they hold, and become more evident, in the third case where the number of variables involved is q = 12 instead of q = 4. In this situation, LUCS strongly outperforms the EM algorithm but turns out to be too costly in terms of computational time.


Table 2. Percentage of degeneracies over 250 samples for the six algorithms in 8 different conditions.

σμ  q   N     U       U2      LCS     LCW     LUCS    LUCW
1   4   60    0.40    1.20    0.40    0.40    0.40    0.40
1   4   240   0.00    0.00    0.00    0.00    0.00    0.00
1   12  60    10.80   20.00   15.20   9.60    18.40   10.40
1   12  240   0.00    0.00    0.00    0.00    0.00    0.00
3   4   60    5.20    9.60    4.00    4.40    1.60    3.20
3   4   240   0.80    0.40    0.80    0.40    0.00    0.80
3   12  60    38.80   65.60   56.40   44.40   59.60   49.20
3   12  240   2.80    3.20    0.80    1.20    2.00    1.60

Table 3. Iteration at which the degeneracy occurs. Means over 250 samples for the six algorithms in 8 different conditions.

σμ  q   N     U     U2    LCS    LCW    LUCS   LUCW
1   4   60    49    36    307    84     318    84
1   4   240   –     –     –      –      –      –
1   12  60    8     8     301    47     301    47
1   12  240   –     –     –      –      –      –
3   4   60    22    18    302    53     301    55
3   4   240   27    19    301    54     –      60
3   12  60    6     6     301    47     301    47
3   12  240   10    10    301    47     301    47

LUCW should be preferred because it is better than U and U2 with a number of iterations only slightly greater than that of U. Increasing the sample size from 60 to 240, the differences in terms of number of iterations disappear but LUCS remains the best in terms of efficiency in recovery. The last four situations differ from the first ones only in the separation of the components. The means of the components have been generated by a normal distribution with zero mean and standard deviation equal to 3 instead of 1. This modification increased the ratio of the trace of the between covariance matrix to the total one from about 0.3 to 0.8. As expected, the overall performance of the algorithms improves but the relative differences in terms of efficiency in recovery hold. In Table 1 we do not report the mean values of msd because it turned out to have a correlation of −0.93 with the modified rand index.

From this simulation we can conclude that the best strategy is to bound the eigenvalue variations in a given range. As expected, the choice of the parameters ϑ_a and ϑ_b is important. However, in the simulation we have seen that large values of ϑ do not always perform at their best but still improve on the ordinary EM without significantly increasing the computational complexity.

In the same simulation study we also checked whether the dynamic constraints help in preventing the degeneracy phenomenon. In Table 2 the percentages of degeneracies for each algorithm in the eight different situations are reported. A run has been considered degenerate if one or more eigenvalues become less than 10^{−14} during the iterations. It seems clear that the constraints decrease the occurrence of a degeneracy but do not prevent it. This is due, at least in part, to the fact that in our simulation a solution that gives a good recovery of the true classification is usually degenerate. As an example, consider the case where N = 60 and q = 12. In this case there is a high probability of having less than 12 observations coming from the first component, which has a prior probability of 0.2. If the algorithm recognizes the classification correctly, then the estimated covariance matrix will necessarily be singular. This is the reason why the greatest number of degeneracies occurs when the separation is large and the number of variables is equal to 12 but the sample size is small. However, we note that the effect of the dynamic constraints is a delay of the degeneracy. This is evident in Table 3, where the averages of the iterations at which the degeneracy occurred are reported.

4.2. An analysis on real data

To conclude our numerical experiments, we tested the six algorithms on a well-known data set already studied by other researchers (Forina et al., 1988). The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The constituents are: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, non-flavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. On these data we ran the algorithms, excluding U2, 250 times starting from 250 different points in the parameter space, the same for each algorithm. In this case degeneracy occurred only for the LCS algorithm, in 2.8% of the cases. The results concerning the goodness of recovery are depicted in Fig. 2 in terms of the modified rand index. It is interesting to note that this different experiment leads to the same conclusions as the previous one.
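This experiment can be approximated with standard tools. The sketch below (not the authors' code) uses scikit-learn's copy of the same wine data and its unconstrained EM, so it corresponds roughly to algorithm U rather than to the constrained variants; reg_covar only adds 10^{-14} to the covariance diagonals, a rough analogue of the lower bound on the eigenvalues used above, and the random initialization differs from the one described by the authors.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X, y = load_wine(return_X_y=True)               # 178 wines, 13 constituents, 3 cultivars
scores = []
for seed in range(50):                           # 50 random starts for brevity (the paper uses 250)
    gm = GaussianMixture(n_components=3, covariance_type='full', init_params='random',
                         n_init=1, reg_covar=1e-14, random_state=seed).fit(X)
    scores.append(adjusted_rand_score(y, gm.predict(X)))
print(np.mean(scores), np.min(scores), np.max(scores))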


Fig. 2. Box-plots of the distribution of the modified rand index over 250 different starting points.

5. Concluding remarks

In this paper we have addressed two main issues. The first part has been devoted to the extension to the multivariate case of some results about the convergence of the EM algorithm proposed in Biernacki and Chretien (2003). In particular, our main result concerns the generalization of Theorem 2 in that paper, and we showed that, near degeneracy, the smallest eigenvalue of the degenerate component tends to zero at an exponential rate. Based on this result, in the second part of the paper we have proposed a new set of constraints based on the dissimilarity between two consecutive updates of the eigenvalues of each covariance matrix. We remark that such constraints have a different background with respect to the ones proposed in Ingrassia (2004) and Ingrassia and Rocci (2007). The latter are based on a constrained formulation of the likelihood function for mixture models; the constraints proposed here are, in some sense, of an algorithmic type, being based on the convergence properties of the EM algorithm.

Our results highlighted that, in some way, the convergence of the EM algorithm to some spurious maximum is also due to the properties of the algorithm itself. Indeed, since in the near degeneracy case the covariance matrices converge at an exponential rate toward singularity, in such cases a covariance matrix can model some spurious small group of data quite quickly, and this increases the probability of the algorithm getting stuck in some spurious maximum.

In general, the dynamic constraints always performed at least as well as the unconstrained EM algorithm, and good performances were attained when both bounds on the variation of the eigenvalues were implemented. In this context, the comparison between the LUCS and LUCW strategies (corresponding to two different choices of the constants ϑ_a, ϑ_b in (25)) showed that a narrower range (strong constraint, LUCS) is preferable.

Acknowledgements

The authors sincerely thank the Associate editor and the anonymous referees for their very helpful comments and suggestions.

References

Biernacki, C., 2004. An asymptotic upper bound of the likelihood to prevent Gaussian mixtures from degenerating. Technical report, Université de Franche-Comté.
Biernacki, C., Chretien, S., 2003. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with the EM. Statistics & Probability Letters 61, 373–382.
Chen, J., Tan, X., 2009. Inference for multivariate normal mixtures. Journal of Multivariate Analysis 100, 1367–1383.
Day, N.E., 1969. Estimating the components of a mixture of two normal distributions. Biometrika 56, 463–474.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39, 1–38.
Forina, M., Leardi, R., Armanino, C., Lanteri, S., 1988. PARVUS: an extendible package for data exploration, classification and correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Genoa, Italy.
Hathaway, R.J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics 13, 795–800.
Hubert, L., Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193–218.
Ingrassia, S., 2004. A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods & Applications 13, 151–166.
Ingrassia, S., Rocci, R., 2007. A constrained monotone EM algorithm for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis 51, 5339–5351.
McLachlan, G.J., Krishnan, T., 2008. The EM Algorithm and Extensions, 2nd ed. John Wiley & Sons, New York.
Seo, B., Lindsay, B.G., 2010. A computational strategy for doubly smoothed MLE exemplified in the normal mixture model. Computational Statistics & Data Analysis 54, 1930–1941.
Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. Annals of Statistics 11, 95–103.