
Robust Metric Learning by Smooth Optimization

Kaizhu Huang
NLPR, Institute of Automation
Chinese Academy of Sciences
Beijing, 100190 China

Rong Jin
Dept. of CSE
Michigan State University
East Lansing, MI 48824

Zenglin Xu
Saarland University & MPI for Informatics
Campus E14, 66123 Saarbrücken, Germany

Cheng-Lin Liu
NLPR, Institute of Automation
Chinese Academy of Sciences
Beijing, 100190 China

Abstract

Most existing distance metric learning methods assume perfect side information, usually given in the form of pairwise or triplet constraints. In many real-world applications, however, the constraints are derived from side information such as users' implicit feedback and citations among articles. As a result, these constraints are usually noisy and contain many mistakes. In this work, we aim to learn a distance metric from noisy constraints by robust optimization in a worst-case scenario, which we refer to as robust metric learning. We formulate the learning task initially as a combinatorial optimization problem and show that it can be elegantly transformed into a convex programming problem. We present an efficient learning algorithm based on smooth optimization [7]. It has a worst-case convergence rate of O(1/√ε) for smooth optimization problems, where ε is the desired error of the approximate solution. Finally, our empirical study with UCI data sets demonstrates the effectiveness of the proposed method in comparison to state-of-the-art methods.

1 Introduction

Distance metric learning is a fundamental research topic in machine learning. A number of studies [1, 9, 8, 15, 12, 13, 6] have shown that, with appropriately learned distance metrics, both classification accuracy and clustering performance can be improved significantly. Depending on the nature of the side information, a distance metric can be learned either from pairwise constraints or from triplet constraints. In this work, we focus on distance metric learning from triplet constraints because of its empirical success. The goal is to learn a distance function f, parameterized by a metric matrix A, that is consistent with a set of triplet constraints D = {(x_i, y_i, z_i) | f(x_i, y_i) ≤ f(x_i, z_i), i = 1, …, N} [6]. In the simplest form, the distance function f associated with A between a pair of data points (x_i, x_j) is computed as (x_i − x_j)^⊤ A (x_i − x_j). For each triplet (x_i, y_i, z_i) in D, we assume instance x_i is more similar to y_i than to z_i, leading to a smaller distance between x_i and y_i than between x_i and z_i.
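For illustration, the following NumPy sketch (ours, not from the paper; the function names are hypothetical) computes the squared Mahalanobis distance d_A and checks whether a triplet constraint is satisfied under a given metric matrix A:

```python
import numpy as np

def mahalanobis_dist(A, x, y):
    """Squared distance d_A(x, y) = (x - y)^T A (x - y) under metric matrix A."""
    diff = x - y
    return float(diff @ A @ diff)

def triplet_satisfied(A, x, y, z):
    """True if x is closer to y than to z under the metric A."""
    return mahalanobis_dist(A, x, y) < mahalanobis_dist(A, x, z)

# Example: with A = I the metric reduces to the squared Euclidean distance.
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 4))
print(triplet_satisfied(np.eye(4), x, y, z))
```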

Most metric learning algorithms assume perfect side information (i.e., perfect triplet constraints in our case). This, however, is not always the case in practice: in many real-world applications, the constraints are derived from side information such as users' implicit feedback and citations among articles. As a result, these constraints are usually noisy and contain many mistakes. We refer to the problem of learning a distance metric from noisy side information as robust distance metric learning. Feeding the noisy constraints directly into a metric learning algorithm will inevitably degrade its performance; more seriously, it may even result in worse performance than the straightforward Euclidean distance metric, as demonstrated in our empirical study.

To address this problem, we propose a general framework for robust distance metric learning from noisy or uncertain side information. We formulate the learning task initially as a combinatorial integer optimization problem, with each {0, 1} integer variable indicating whether the corresponding triplet constraint is noisy or not. We then prove that this problem can be transformed into a semi-infinite programming problem [4] and further into a convex optimization problem.


We show that the final problem can be solved by Nesterov's smooth optimization method [7], which has a worst-case convergence rate of O(1/√ε) for smooth objective functions, where ε is the required error level. This is significantly faster than the sub-gradient method, whose convergence rate is usually O(1/ε²). The proposed method can also be adapted to other distance metric learning algorithms, e.g., the Large Margin Nearest Neighbor (LMNN) method [12], when the side information is noisy.

The rest of this paper is organized as follows. In the next section, we briefly discuss related work. In Section 3, we review distance metric learning with perfect side information. We then introduce the robust metric learning framework in Section 4. In Section 5, we evaluate the proposed framework on five real-world data sets. Finally, we conclude in Section 6 with some remarks.

2 Related Work

Many algorithms have been proposed for distance metric learning in the literature. Exemplar algorithms include Information-Theoretic Metric Learning (ITML) [3], Relevant Component Analysis (RCA) [1], the Dissimilarity-ranking Vector Machine (D-ranking VM) [8], Generalized Sparse Metric Learning (GSML) [5], the method proposed by Xing et al. [15], and Large Margin Nearest Neighbor (LMNN) [12, 13]. In particular, the last two methods are regarded as state-of-the-art approaches in this field. Specifically, ITML maps the metric space to a space of Gaussian distributions and searches for the best metric by learning the optimal Gaussian that is consistent with the given constraints; [11] extends this work using ideas from information geometry. In [15], a distance metric is learned to shrink the average distance within must-link pairs while simultaneously enlarging the average distance within cannot-link pairs. In [1], RCA seeks a distance metric that minimizes the covariance matrix imposed by the equivalence constraints. In [12, 13], LMNN is proposed to learn a large margin nearest neighbor classifier. Huang et al. proposed a unified sparse distance metric learning method [5] that provides an interesting viewpoint for understanding many sparse metric learning methods. Beyond empirical results, the generalization performance of regularized distance metric learning was analyzed in a recent study [6]. Further discussion of distance metric learning can be found in the survey [17].

All of the above methods assume perfect side information. When the constraints are noisy, directly applying these methods may significantly degrade their performance, as verified in our experiments. To alleviate the effect of noisy side information, Wu et al. proposed a probabilistic framework for metric learning, which is, however, specially designed for RCA [14]. In contrast, we propose a robust optimization framework that is capable of learning from noisy side information. Another important contribution of this paper is that we formulate the resulting robust optimization problem as a convex programming problem and solve it using Nesterov's efficient smooth optimization method. We show that the proposed methodology can be adapted to other metric learning methods such as LMNN, D-ranking VM, and the sparse method proposed in [5].

Finally, we note that our work is closely related to the Robust Support Vector Machine (RSVM) [16], which aims to learn a support vector machine from noisily labeled training data. Our work differs from RSVM in that we adopt a worst-case analysis in our formulation, whereas RSVM uses a best-case analysis. As illustrated in [2], unlike the best-case analysis, the worst-case analysis usually leads to a generalization error bound, making it theoretically sound. Moreover, the worst-case analysis often results in a convex-concave optimization problem, which can be solved efficiently, while the best-case analysis can lead to a non-convex optimization problem that is usually difficult to solve.

3 Learning Distance Metric from Perfect Side Information

We first describe the framework for learning a distance metric from perfect side information, which in our case is given in the form of triplet constraints. In particular, we represent the training examples as a collection of triplets, i.e., D = {(x_i, y_i, z_i), i = 1, …, N}, where x_i, y_i, z_i ∈ R^d. In each triplet (x_i, y_i, z_i), (x_i, y_i) forms a must-link (or similar) pair and (x_i, z_i) forms a cannot-link (or dissimilar) pair. We expect d_A(x_i, y_i) to be significantly smaller than d_A(x_i, z_i), where d_A(x, y) = (x − y)^⊤ A (x − y). Assuming all the training triplets in D are correct, we can cast distance metric learning as the following optimization problem:

    min_{A≽0}  (λ/2)‖A‖_F² + Σ_{i=1}^N ℓ(d_A(x_i, z_i) − d_A(x_i, y_i))      (1)


where ℓ(z) is a pre-defined loss function. The computational challenge of Eq. (1) arises from handling the linear matrix inequality (LMI) A ≽ 0. It can be addressed in two ways. First, we can simplify the constraint A ≽ 0 by assuming that

    A ∈ S = { A = Σ_{k=1}^m λ_k ξ_k ξ_k^⊤ : λ = (λ_1, …, λ_m) ∈ R_+^m },

where ξ_k ∈ R^d, k = 1, …, m, are base vectors derived from the existing data points. In this way, we approximate the original Semi-Definite Programming (SDP) problem by a quadratic optimization problem, which can therefore be solved efficiently. We can further improve this scheme by greedily searching for the base vectors ξ_k. Let A be the current solution; we aim to update it as A + αξξ^⊤ with unknown α and ξ ∈ R^d. We choose the base vector by maximizing the decrease of the objective function, i.e., by solving

    min_{‖ξ‖_2 = 1}  ξ^⊤ Z ξ,

where Z is defined as

    Z = λA + Σ_{i=1}^N ∂ℓ(d_A(x_i, z_i) − d_A(x_i, y_i)) K_i,

with K_i = (x_i − z_i)(x_i − z_i)^⊤ − (x_i − y_i)(x_i − y_i)^⊤. Hence, ξ is the eigenvector of Z associated with its smallest eigenvalue.
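To make the greedy base-selection step concrete, here is a small NumPy sketch (ours, not the authors' implementation; it assumes the hinge loss used later in the paper, so ∂ℓ is a subgradient) that assembles Z and returns the eigenvector of its smallest eigenvalue:

```python
import numpy as np

def greedy_base_vector(A, triplets, lam):
    """Return the next base vector xi as the minimum eigenvector of Z.

    `triplets` is a list of (x, y, z) arrays; the hinge loss ell(t) = max(1 - t, 0)
    is assumed, so its subgradient is -1 when the constraint is active, else 0."""
    Z = lam * A.copy()
    for x, y, z in triplets:
        K_i = np.outer(x - z, x - z) - np.outer(x - y, x - y)
        margin = (x - z) @ A @ (x - z) - (x - y) @ A @ (x - y)
        dloss = -1.0 if margin < 1.0 else 0.0   # subgradient of the hinge loss
        Z += dloss * K_i
    evals, evecs = np.linalg.eigh(Z)            # eigenvalues in ascending order
    return evecs[:, 0]                          # eigenvector of the smallest eigenvalue
```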

The second approach is to learn the distance metric A with an online learning algorithm, assuming that the training triplets are provided sequentially. Details can be found in [6].

4 Learning Distance Metric from Noisy Side Information

We now turn to the case of learning a distance metric from noisy side information. Assume we know a priori that at most a fraction 1 − η ∈ [0, 1) of the triplets in D are uncertain. The value of η can be estimated with reasonable accuracy by randomly sampling triplets from D and checking whether their labels are correct. We emphasize that although the percentage of noisy or uncertain triplets can be reliably estimated, we do not know which subset of the triplets in D is uncertain, which makes this a challenging learning problem.

4.1 Robust Model

To address the uncertainty in the triplets, we explore the idea of robust optimization [2]. We search for a distance metric A that minimizes the loss function over any fraction η of the triplets in D, i.e.,

    min_{t, A≽0}  t + (λ/2)‖A‖_F²
    s.t.  t ≥ Σ_{i=1}^N q_i ℓ(d_A(x_i, z_i) − d_A(x_i, y_i))   ∀ q ∈ Q(η),      (2)

where the set Q(η) is defined as

    Q(η) = { q ∈ {0, 1}^N : Σ_{i=1}^N q_i ≤ Nη }.

As indicated by Eq. (2), the learned distance metric A minimizes the loss over every possible subset of training triplets whose cardinality is no more than Nη. Note that the number of constraints in (2) is exponential in Nη, which makes the problem computationally infeasible. The following lemma allows us to simplify (2) by replacing the combinatorial set Q(η) with its convex hull Q̃(η).

Lemma 1. Eq. (2) is equivalent to the following optimization problem:

    min_{t, A≽0}  t + (λ/2)‖A‖_F²
    s.t.  t ≥ Σ_{i=1}^N q_i ℓ(d_A(x_i, z_i) − d_A(x_i, y_i))   ∀ q ∈ Q̃(η),      (3)

where Q̃(η), the convex hull of Q(η), is defined as

    Q̃(η) = { q ∈ [0, 1]^N : Σ_{i=1}^N q_i ≤ Nη }.

Lemma 1 follows directly from the fact that Eq. (3) is linear in q, so its optimum is attained at an extreme point of Q̃(η).

Eq. (3) can be viewed as a semi-infinite programming problem [4], since the number of constraints is infinite. For the rest of the paper, we assume a hinge loss for ℓ(z), i.e., ℓ(z) = max(1 − z, 0) = (1 − z)_+. The following lemma transforms the semi-infinite programming problem in Eq. (3) into a convex-concave optimization problem.

Lemma 2. Eq. (3) is equivalent to the following optimization problem:

    max_{A≽0}  min_{q ∈ Q̃(η)}  −(λ/2)‖A‖_F² − Σ_{i=1}^N q_i (1 − tr(A K_i))_+      (4)

where K_i = (x_i − z_i)(x_i − z_i)^⊤ − (x_i − y_i)(x_i − y_i)^⊤.


The following lemma simplifies the convex-concave optimization problem in Eq. (4) into a minimization problem over q alone.

Lemma 3. The convex-concave optimization problem in (4) is equivalent to the following minimization problem:

    min_{q ∈ Q̃(η)}  L(q) ≜ −Σ_{i=1}^N q_i + (1/(2λ)) ‖ π_{S+}( Σ_{i=1}^N q_i K_i ) ‖_F²,      (5)

where π_{S+}(B) projects the matrix B onto the SDP cone.

Lemma 3 can be verified by setting the first-order derivative of the objective in (4) with respect to A to zero, which gives A = (1/λ) π_{S+}(Σ_{i=1}^N q_i K_i), and substituting this A back into (4).

We could solve (5) by a sub-gradient approach. The main difficulty in deploying the sub-gradient approach is computing the sub-gradient of the projection function π_{S+}(·). The following lemma allows us to compute it effectively.

Lemma 4. The gradient of the objective function in Eq. (5), denoted by ∂L(q), is computed as

    ∂L(q) = −1 + (1/λ) h,      (6)

where h = (h_1, …, h_N) and each h_i is computed as

    h_i = tr( K_i π_{S+}( Σ_{j=1}^N q_j K_j ) ),      (7)

where π_{S+}(A) projects the matrix A onto the SDP cone.
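In practice, π_{S+} can be computed from an eigendecomposition by zeroing out negative eigenvalues, and the gradient in Lemma 4 then follows directly. A small NumPy sketch (illustrative only; the function names are ours):

```python
import numpy as np

def project_psd(B):
    """Project a symmetric matrix onto the SDP cone by clipping negative eigenvalues."""
    evals, evecs = np.linalg.eigh(B)
    return (evecs * np.clip(evals, 0.0, None)) @ evecs.T

def gradient_L(q, Ks, lam):
    """Gradient of Eq. (5): dL/dq_i = -1 + tr(K_i * pi_{S+}(sum_j q_j K_j)) / lam."""
    P = project_psd(sum(qj * Kj for qj, Kj in zip(q, Ks)))
    h = np.array([np.trace(K @ P) for K in Ks])
    return -1.0 + h / lam
```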

Using Eq. (6), we can adopt the Rosen gradient method [10], which projects the gradient onto the domain Q̃(η) and solves the optimization problem in Eq. (5) iteratively. In each step, the best step size can be obtained by an efficient line search. However, it is well known that the sub-gradient method converges slowly, particularly for non-smooth objective functions. In the following, we improve the learning efficiency by using Nesterov's smooth optimization method [7].

4.2 Smooth Optimization

We intend to solve the optimization problem in Eq. (5) using the smooth optimization method [7]. First, it is easy to verify that the objective function in (5) is Lipschitz continuous with respect to the L2 norm of q, with Lipschitz constant L bounded by

    L ≤ √( Σ_{i=1}^N (1 + |h_i|/λ)² ) ≤ √( Σ_{i=1}^N [1 + tr(K_i Z)/λ]² ),      (8)

where Z = Σ_{i=1}^N (x_i − z_i)(x_i − z_i)^⊤.
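The bound in Eq. (8) is straightforward to evaluate once the triplet matrices K_i are formed. A small NumPy sketch (ours) of this computation:

```python
import numpy as np

def lipschitz_bound(triplets, Ks, lam):
    """Upper bound on L from Eq. (8): sqrt(sum_i [1 + tr(K_i Z)/lam]^2),
    with Z = sum_i (x_i - z_i)(x_i - z_i)^T."""
    Z = sum(np.outer(x - z, x - z) for x, _, z in triplets)
    terms = [(1.0 + np.trace(K @ Z) / lam) ** 2 for K in Ks]
    return float(np.sqrt(sum(terms)))
```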

We can deploy the smooth optimization approach to optimize Eq. (5); it has a convergence rate of O(1/√ε), where ε is the target error. This is significantly better than the convergence rate of sub-gradient descent (i.e., O(1/ε²)). Algorithm 1 shows the smooth optimization algorithm for Eq. (5). Two optimization subproblems arise when applying Algorithm 1, namely those in Eqs. (10) and (11). Both have the following general form:

    min_q  (L/2)‖q − a‖_2² + (q − b)^⊤ s
    s.t.   q ∈ [0, 1]^N,  q^⊤ 1 ≤ Nη,      (16)

where a, b, and s are given vectors. Below, we discuss how to optimize the problem in Eq. (16) efficiently. First, it is straightforward to verify that Eq. (16) can be cast into the following convex-concave optimization problem:

    max_{γ≥0}  min_{q ∈ [0,1]^N}  (L/2)‖q − a‖_2² + (q − b)^⊤ s + γ (q^⊤ 1 − Nη),

where γ ≥ 0 is the Lagrange multiplier of the constraint q^⊤ 1 ≤ Nη. Solving the inner minimization yields

    q_i = π_{[0,1]}( a_i − (s_i + γ)/L ),

where π_{[0,1]}(·) clips its argument to the interval [0, 1].

By the KKT conditions, γ (q^⊤ 1 − Nη) = 0. Hence, to determine the value of γ, we use the following procedure (a code sketch of this projection is given after the list):

• We first try γ = 0. If Σ_{i=1}^N q_i ≤ Nη when γ = 0, this is the optimal solution.

• Otherwise, we search for γ > 0 such that q^⊤ 1 = Nη. Since each q_i is monotonically non-increasing in γ, a bisection search can be used to identify the γ that satisfies q^⊤ 1 = Nη.
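The following NumPy sketch (ours; the dual variable is called gamma here and the function name solve_q_subproblem is hypothetical) implements this projection for the general form (16) using the bisection search described above:

```python
import numpy as np

def solve_q_subproblem(a, s, L, eta, tol=1e-10, max_iter=100):
    """Solve min_q (L/2)|q - a|^2 + q^T s over q in [0,1]^N with sum(q) <= N*eta.
    The constant term -b^T s is dropped; gamma is the dual variable of the sum constraint."""
    N = len(a)
    budget = N * eta

    def q_of(gamma):
        return np.clip(a - (s + gamma) / L, 0.0, 1.0)

    q = q_of(0.0)
    if q.sum() <= budget:              # constraint inactive: gamma = 0 is optimal
        return q
    lo, hi = 0.0, 1.0
    while q_of(hi).sum() > budget:     # grow the upper bracket until feasible
        hi *= 2.0
    for _ in range(max_iter):          # bisection: sum(q(gamma)) is non-increasing in gamma
        mid = 0.5 * (lo + hi)
        if q_of(mid).sum() > budget:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return q_of(hi)
```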

The theorem below gives the convergence rate of the smooth optimization algorithm described in Algorithm 1.

Theorem 1. After running Algorithm 1 for T iterations, the following inequality holds:

    |L(q_T) − L(q*)| ≤ 2LNη / ((T+1)(T+2)),

where q* = arg min_{q ∈ Q̃(η)} L(q).

The proof follows directly from Theorem 2 in [7]. The theorem below bounds the number of iterations Algorithm 1 performs before it terminates.


Algorithm 1  Smooth Optimization Algorithm for Solving (5)

1: Input
   • ε: predefined error tolerance for the optimization
2: Initialization
   • Compute the Lipschitz constant L using Eq. (8)
   • Initialize the solution z_0 = 0 ∈ R^N
   • Set t = 0
3: repeat
4:   Compute the solution q_t
     • Compute

         A_t = (1/λ) π_{S+}( Σ_{i=1}^N z_{t,i} K_i ),   ∇L(z_t) = −1 + h/λ,   L(z_t) = −z_t^⊤ 1 + (λ/2)‖A_t‖_F²,      (9)

       where π_{S+}(M) projects the matrix M onto the SDP cone and h = (h_1, …, h_N) with
       h_i = tr( K_i π_{S+}( Σ_{j=1}^N z_{t,j} K_j ) ) = λ tr(K_i A_t)
     • Solve the following optimization problem

         q̃ = arg min_{q ∈ Q̃(η)} { (L/2)‖q − z_t‖_2² + (q − z_t)^⊤ ∇L(z_t) }      (10)

5:   Set q_t = arg min_{q′} { L(q′) : q′ ∈ {q_{t−1}, q̃, z_t} }
6:   Compute z_{t+1}
     • Solve the following optimization problem

         w_t = arg min_{q ∈ Q̃(η)} { (L/2)‖q‖_2² + Σ_{j=0}^t ((j+1)/2) (q − z_j)^⊤ ∇L(z_j) }      (11)

     • Set

         z_{t+1} = (2/(t+3)) w_t + ((t+1)/(t+3)) q_t      (12)

7:   Compute the duality gap ∆_t
     • Set

         Â_t = (4/((t+1)(t+2))) Σ_{j=0}^t ((j+1)/2) A_j      (13)

     • Compute the duality gap ∆_t as

         ∆_t = L(q_t) − H(Â_t),      (14)

       where H(Â_t) is computed as

         H(Â_t) = −(λ/2)‖Â_t‖_F² − max_{q ∈ Q̃(η)} Σ_{i=1}^N q_i (1 − tr(Â_t K_i))      (15)

8: until ∆_t ≤ ε
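For concreteness, the following condensed NumPy sketch mirrors the structure of Algorithm 1 under the hinge loss. It is an illustration rather than the authors' Matlab implementation; the helper names (project_psd, prox_step, robust_metric_learning) are ours, and the bisection-based subproblem solver is the one sketched above. It assumes the triplet matrices K_i and a Lipschitz estimate L (e.g., from Eq. (8)) are given.

```python
import numpy as np

def project_psd(B):
    """Project a symmetric matrix onto the SDP cone (clip negative eigenvalues)."""
    w, V = np.linalg.eigh(B)
    return (V * np.clip(w, 0.0, None)) @ V.T

def prox_step(a, s, L, budget):
    """min_q (L/2)|q - a|^2 + q^T s  s.t.  q in [0,1]^N, sum(q) <= budget  (cf. Eq. (16))."""
    q_of = lambda g: np.clip(a - (s + g) / L, 0.0, 1.0)
    if q_of(0.0).sum() <= budget:
        return q_of(0.0)
    lo, hi = 0.0, 1.0
    while q_of(hi).sum() > budget:
        hi *= 2.0
    for _ in range(100):                        # bisection on the dual variable
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if q_of(mid).sum() > budget else (lo, mid)
    return q_of(hi)

def robust_metric_learning(Ks, lam, eta, L, eps=1e-3, max_iter=500):
    """A condensed sketch of Algorithm 1 (Nesterov smoothing for Eq. (5))."""
    N = len(Ks)
    budget = N * eta
    z = np.zeros(N)
    q_best, val_best = None, np.inf
    grad_acc = np.zeros(N)                      # sum_j (j+1)/2 * grad(z_j), for Eq. (11)
    A_acc = np.zeros_like(Ks[0])                # sum_j (j+1)/2 * A_j, for Eq. (13)
    A_hat = np.zeros_like(Ks[0])

    def eval_point(q):
        """Objective value, A(q), and gradient at q, following Eq. (9) and Lemma 4."""
        A = project_psd(sum(qi * Ki for qi, Ki in zip(q, Ks))) / lam
        grad = -1.0 + np.array([np.trace(Ki @ A) for Ki in Ks])
        val = -q.sum() + 0.5 * lam * np.linalg.norm(A, "fro") ** 2
        return val, A, grad

    for t in range(max_iter):
        _, A_t, grad = eval_point(z)
        q_tilde = prox_step(z, grad, L, budget)          # Eq. (10)
        for cand in (q_tilde, z.copy()):                 # step 5: keep best primal iterate
            val = eval_point(cand)[0]
            if val < val_best:
                q_best, val_best = cand, val
        grad_acc += 0.5 * (t + 1) * grad                 # Eq. (11)
        w = prox_step(np.zeros(N), grad_acc, L, budget)
        z = 2.0 / (t + 3) * w + (t + 1.0) / (t + 3) * q_best   # Eq. (12)
        A_acc += 0.5 * (t + 1) * A_t                     # Eqs. (13)-(15): duality gap
        A_hat = 4.0 / ((t + 1) * (t + 2)) * A_acc
        slack = 1.0 - np.array([np.trace(A_hat @ Ki) for Ki in Ks])
        pos = np.sort(slack[slack > 0.0])[::-1]          # maximize sum q_i*slack_i over Q~(eta)
        k = int(min(np.floor(budget), len(pos)))
        frac = (budget - np.floor(budget)) * (pos[k] if k < len(pos) else 0.0)
        H = -0.5 * lam * np.linalg.norm(A_hat, "fro") ** 2 - (pos[:k].sum() + frac)
        if val_best - H <= eps:                          # stop when the duality gap is small
            break
    return A_hat, q_best
```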


Theorem 2. The number of iterations T that Algorithm 1 performs before it quits is bounded as

    T < √( 2LNη / ε ) − 1.

Proof. First, we have

    Σ_{t=0}^T ((t+1)/2) [ L(z_t) + (q − z_t)^⊤ ∇L(z_t) ]
      ≤ Σ_{t=0}^T ((t+1)/2) { −z_t^⊤ 1 − (λ/2)‖A_t‖_F² + Σ_{i=1}^N z_{t,i} tr(A_t K_i) − (q − z_t)^⊤ (1 − h/λ) }
      = Σ_{t=0}^T ((t+1)/2) { −q^⊤ 1 − (λ/2)‖A_t‖_F² + Σ_{i=1}^N q_i tr(A_t K_i) }
      ≤ Q(T) { −q^⊤ 1 − (λ/2)‖Â_T‖_F² + Σ_{i=1}^N q_i tr(Â_T K_i) },

where Q(T) = (T+1)(T+2)/4. The last step uses the concavity of −(λ/2)‖A‖_F² + Σ_i q_i tr(A K_i) in A, together with the definition of Â_T in Eq. (13). Using Theorem 2 of Nesterov [7], we have

    L(q_T) ≤ 2LNη / ((T+1)(T+2)) + min_{q ∈ Q̃(η)} { −q^⊤ 1 − (λ/2)‖Â_T‖_F² + Σ_{i=1}^N q_i tr(Â_T K_i) }
           ≤ 2LNη / ((T+1)(T+2)) + H(Â_T).

Since ∆_T = L(q_T) − H(Â_T), it follows that ∆_T ≤ 2LNη / ((T+1)(T+2)), so the stopping condition ∆_T ≤ ε is reached once (T+1)(T+2) ≥ 2LNη/ε, which gives the bound in the theorem.

4.3 Extension to Other Metric Learning Methods

As shown by Huang et al. [5], many distance metric learning formulations can be viewed as the same optimization problem with different regularization terms. For instance, LMNN can be formulated as a similar problem regularized by a trace term, i.e.,

    min_{A≽0}  Σ_{i=1}^N ℓ(d_A(x_i, z_i) − d_A(x_i, y_i)) + (λ/2) tr(W A),

where W = (1/N) Σ_{i=1}^N (x_i − y_i)(x_i − y_i)^⊤. Using techniques similar to those described in this work, we can extend LMNN to handle noisy class assignments. Similarly, we can extend Generalized Sparse Metric Learning [5] and the D-ranking Vector Machine [8] to their robust versions.

Table 1: Description of the UCI data sets.

Data Set    | # Samples | # Dimensions | # Classes
Sonar       | 208       | 60           | 2
Iris        | 150       | 4            | 3
Ionosphere  | 351       | 34           | 2
Heart       | 270       | 13           | 2
Wine        | 178       | 13           | 3

5 Experiments

Experimental setup. Five UCI data sets are used in the experiments.¹ Table 1 summarizes their statistics. We compare the proposed robust metric learning method, denoted RML, with three baseline methods: LMNN [12], Xing's method [15], and the plain Euclidean distance metric (EUCL). Depending on the optimization method used, we denote the proposed method as RMLgd when using sub-gradient optimization and RMLnes when using Nesterov's smooth optimization.

We first apply each metric learning method to learn a distance metric from the training data, and then use the k-Nearest Neighbor (kNN) classifier to evaluate the quality of the learned metric. We follow the same setting as in [12] and use the category information to generate the side information. More specifically, in the training data, given that y is one of x's k nearest neighbors, x and y form a must-link pair if they share the same class label and a cannot-link pair if they belong to different classes. We split each data set randomly into a training set and a test set, with 85% of the data used for training. We train on the triplet set obtained from the training set and evaluate the classification performance of kNN on the test set. To generate the noisy side information, we randomly switch a correct triplet (x_i, y_i, z_i) to (x_i, z_i, y_i) with probability 1 − η.
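The noise-injection step can be reproduced with a few lines of NumPy (an illustrative sketch; the function name is ours):

```python
import numpy as np

def corrupt_triplets(triplets, eta, rng=None):
    """Flip each correct triplet (x, y, z) to (x, z, y) with probability 1 - eta,
    mimicking the noise-injection protocol used in the experiments."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = []
    for x, y, z in triplets:
        if rng.random() < 1.0 - eta:
            noisy.append((x, z, y))   # corrupted: similar/dissimilar roles swapped
        else:
            noisy.append((x, y, z))
    return noisy
```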

All the metric learning methods involve a trade-off parameter λ. It is tuned over the range {2^−3, 2^−2, 2^−1, 2^0, 2^1, 2^2, 2^3, 2^4, 2^5, 2^6, 100}. The parameter k is empirically set to 5. We repeat each experiment 10 times and report results averaged over the 10 runs.

Experimental results. Table 2 reports the classification errors of kNN using the distance metric learned from noisy constraints with η = 0.8 (i.e., 20% of the constraints are uncertain).

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html


Table 2: Test error for the proposed algorithm and three baseline algorithms with η = 0.8.

Data Set        | RMLnes        | RMLgd         | LMNN          | Xing          | EUCL
Sonar (%)       | 17.42 ± 5.73  | 17.96 ± 5.25  | 36.45 ± 7.91  | 17.42 ± 6.31  | 19.68 ± 6.17
Iris (%)        | 1.82 ± 3.18   | 1.97 ± 3.53   | 2.27 ± 3.21   | 2.73 ± 2.36   | 2.73 ± 2.58
Ionosphere (%)  | 11.13 ± 4.12  | 11.13 ± 4.12  | 11.89 ± 4.27  | 13.77 ± 3.78  | 14.15 ± 4.81
Heart (%)       | 16.00 ± 7.38  | 15.93 ± 6.32  | 17.50 ± 5.00  | 19.75 ± 7.12  | 17.55 ± 6.39
Wine (%)        | 1.48 ± 2.59   | 1.81 ± 2.52   | 6.67 ± 5.74   | 2.96 ± 1.56   | 4.07 ± 4.43

We observe that the proposed robust framework consistently performs better than the baseline algorithms. We also observe that RMLnes and RMLgd, the two implementations of the proposed framework using different optimization strategies, yield almost identical performance. For certain data sets, LMNN and Xing's method are significantly affected by the noisy constraints and learn a distance metric that is considerably worse than the Euclidean distance metric (i.e., EUCL). The degradation is significant for Xing's method on Heart and for LMNN on Sonar. These results clearly indicate the negative effect of noisy constraints. In contrast, the proposed method always outperforms the baseline Euclidean distance metric given the noisy constraints, indicating that our method handles the uncertainty appropriately and consequently improves performance.

To show the efficiency of Nesterov's smooth optimization method, the first five graphs of Fig. 1 plot the convergence curves of Nesterov's method and the sub-gradient method on all data sets. Evidently, Nesterov's method converges significantly faster than the sub-gradient method, which again verifies the advantage of the smooth optimization method. All algorithms are implemented in Matlab, and all experiments are conducted on a machine with an Intel Core 2 2.66 GHz CPU, 4 GB of RAM, and Windows XP.

To examine how η affects the performance of the different metric learning methods, we plot their test errors for different values of η on the Wine data set in the last graph of Fig. 1. For convenience, the parameter λ is set to 2^6 for all metric learning methods, while k remains 5. Interestingly, RMLnes is quite robust against the noise as long as the noise does not dominate the side information. With heavy noise in the side information, it may fail to find a better metric than EUCL. This is understandable, since very heavy noise may "bury" the latent best metric and make its recovery unlikely. In addition, LMNN appears quite sensitive to noisy side information on this data set; Xing's method is also fairly robust, but performs worse than RMLnes when η is reasonably large.

6 Conclusion

In this paper, we proposed a robust distance metric learning framework formulated in a worst-case scenario. In contrast to most existing studies, which usually assume perfect side information, we developed a method that learns from noisy or uncertain side information. We formulated the learning task initially as a combinatorial optimization problem and then showed that it can be transformed into a convex programming problem. Finally, we exploited Nesterov's efficient smooth optimization method to solve the problem. One appealing feature of the proposed framework is that it can be adapted to other metric learning methods in order to handle noisy constraints. Experiments on several UCI data sets showed that the proposed framework is highly effective in dealing with noisy side information.

Acknowledgements

The work of Rong Jin was supported by the National Science Foundation (IIS-0643494) and the National Institutes of Health (1R01GM079688-01). The work of Kaizhu Huang and Cheng-Lin Liu was supported by the National Natural Science Foundation of China (NSFC) under grant no. 60825301.

References

[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6:937–965, 2005.

[2] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.

[3] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In International Conference on Machine Learning (ICML), 2007.

[4] R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods, and applications. SIAM Review, 35(3):380–429, 1993.


Figure 1: The first five graphs: convergence curves (normalized objective value vs. epoch) of the sub-gradient method and Nesterov's method on the five data sets (Sonar, Iris, Ionosphere, Heart, Wine). The bottom-right graph: test error rate (%) vs. η for the different methods on the Wine data set.

[5] K. Huang, Y. Ying, and C. Campbell. GSML: A unified framework for sparse metric learning. In Proceedings of the 2009 IEEE International Conference on Data Mining (ICDM), 2009.

[6] R. Jin, S. Wang, and Y. Zhou. Regularized distance metric learning: Theory and algorithm. In Advances in Neural Information Processing Systems (NIPS), 2009.

[7] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[8] H. Ouyang and A. Gray. Learning dissimilarities by ranking: from SDP to QP. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[9] R. Rosales and G. Fung. Learning sparse metrics via linear programming. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 367–373, 2006.

[10] J. B. Rosen. The gradient projection method for nonlinear programming: Part I, linear constraints. Journal of the Society for Industrial and Applied Mathematics, 3(1):181–217, 1960.

[11] S. Wang and R. Jin. An information geometry approach for distance metric learning. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 591–598, 2009.

[12] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NIPS), 2006.

[13] K. Weinberger and L. Saul. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[14] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu. Distance metric learning from uncertain side information with application to automated photo tagging. In Proceedings of the 2009 ACM International Conference on Multimedia (MM), 2009.

[15] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side information. In Advances in Neural Information Processing Systems (NIPS), 2002.

[16] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.

[17] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.