
NEW STATISTIC IN P-VALUE ESTIMATION FOR ANOMALY DETECTION

Jing Qian
Boston University, Boston, MA

Venkatesh Saligrama
Boston University, Boston, MA

ABSTRACT

Given n nominal samples, a query point η and a significance level α, the uniformly most powerful test for anomaly detection is to test p(η) ≤ α, where p(η) is the p-value function of η. In [1] a p-value estimator is proposed which is based on ranking a statistic over all data samples, and is shown to be asymptotically consistent. Building on this framework, we propose a new statistic for p-value estimation. It is based on the average of the K nearest neighbor (K-NN) distances of η within a K-NN graph constructed from the n nominal training samples. We also provide a bootstrapping strategy for estimating p-values which leads to better robustness. We then theoretically justify the asymptotic consistency of our approach through a finite sample analysis. Synthetic and real experiments demonstrate the superiority of our scheme.

Index Terms— Anomaly Detection, p-value, k-NN graph

1. INTRODUCTION

Anomaly detection problems, also called outlier detection [2], novelty detection [3] or one-class classification [4, 5], arise in many applications where failure to detect anomalous activities or events could result in severe consequences, and have been studied extensively. Typically a training set of nominal samples is given, and the task of anomaly detection is to design a detection rule that maximizes the power of detecting whether a query point η is anomalous, while keeping the false alarm probability no higher than a specified significance level α.

Approaches for anomaly detection can be divided into two categories: parametric and non-parametric. Usually the underlying nominal density is unknown, so parametric approaches [6], which assume a nominal distribution, can suffer from model mismatch and result in poor performance.

In recent years non-parametric approaches have become popular. These methods include the one-class SVM [7], minimum volume (MV) set estimation [8] and the GEM approach for determining MV sets [9]. The one-class SVM approach is computationally efficient, but cannot directly control the desired false alarm probability and usually does not generate satisfactory performance. The MV set estimation approach suffers from the complexity of approximating high-dimensional quantities, either the multivariate density or the MV set boundaries. The GEM approach involves seeking the K-point minimum spanning tree of n samples and is computationally intractable. A simplified surrogate of the GEM approach is also proposed in [9], which, while simpler than GEM, is no longer asymptotically consistent and loses the provable optimality.

In [1] a non-parametric approach for anomaly detection is proposed, which is based on a score function that maps the data samples into the interval [0, 1]. For a query point η, an anomaly is declared by directly thresholding the score of η at the desired significance level α. This score function of η is the ranking of η among all nominal samples based on some statistic G, and is essentially an estimate of the p-value function at η. In [1] it is proven that directly thresholding the p-value function at α is the uniformly most powerful (UMP) test when the anomalies are drawn from a mixture of the unknown nominal density and the uniform distribution, and that this score function of η converges to the true p-value at η asymptotically.

In this paper we propose a new statistic G within the same framework as [1]. Our statistic G is based on the average of the K nearest neighbor (K-NN) distances of η within a K-NN graph constructed from all nominal samples. Through a finite sample analysis we show that our scheme not only inherits the asymptotic consistency in estimating the p-value function, thus retaining UMP optimality for anomaly detection, but also achieves a better convergence rate than the K-LPE scheme of [1]. Moreover, we provide a U-statistic bootstrapping strategy for reducing the variance in computing G, which leads to improved robustness in p-value estimation.

The rest of this paper is organized as follows. Sec. 2 describes our new statistic G for p-value estimation, the U-statistic bootstrapping strategy to compute G, and the anomaly detection procedure based on our scheme. Sec. 3 presents the finite sample analysis and asymptotic consistency. Synthetic and real experiments demonstrating improved performance and robustness are reported in Sec. 4. Sec. 5 concludes the paper.


2. MAIN IDEA AND ALGORITHMS

Let S = {x1, ..., xn} be the given n-point nominal training set, drawn i.i.d. from some d-dimensional underlying density f0. A test point is assumed to come from a mixture of the nominal distribution f0 and another known density f1, which for simplicity is assumed to be the uniform distribution. Let η be the query point. The task of anomaly detection is to declare whether η is consistent with nominal data, H0 : η ∼ f0, or deviates from nominal, H1 : η ∼ f1, under the specified significance level P(declare H1 | H0) ≤ α.

In [1] it has been proven that the uniformly most powerful test for the above problem is

    φ(η) = H1 if p(η) ≤ α, and H0 otherwise,    (1)

where, under the assumption that f1 is uniformly distributed, p(η) is the p-value function of η:

    p(η) = P0( x : f1(x)/f0(x) ≥ f1(η)/f0(η) ) = ∫_{x : f0(x) ≤ f0(η)} f0(x) dx.    (2)

For a query point η, the p-value function can be estimated through a ranking of η among all nominal points {x1, ..., xn} based on some statistic G, as follows:

    p̂(η) = (1/n) Σ_{i=1}^{n} I{G(η) ≤ G(xi)},    (3)

where I denotes the indicator function.
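For concreteness, a minimal NumPy sketch of the ranking estimate in Eq.(3) is shown below; it assumes the statistic G has already been computed for the query and for all nominal points, and the function name is ours, not from [1].

```python
import numpy as np

def rank_pvalue(G_eta, G_train):
    """Eq.(3): fraction of nominal points whose statistic G is at least
    as large as the statistic of the query point."""
    return float(np.mean(np.asarray(G_train, dtype=float) >= G_eta))
```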

Ideally we would like to choose G to be the nominal density f0(·), so that p̂(η) approximates p(η). Since the underlying density f0 is unknown, surrogates reflecting the relative density at each point are used. In [1] two forms of G are adopted:
(1) ε-neighborhood: G(x) = −Nε(x), the negative of the number of neighbors within an ε-ball of x among the n nominal points.
(2) K-nearest neighbor distance: G(x) = D(K)(x), the distance from x to its K-th nearest neighbor among the n nominal points.

2.1. aK-LPE Anomaly Detection Algorithm

We propose a new statistic G of the following form:

    G(x) = (1/K) Σ_{i = K−⌊(K−1)/2⌋}^{K+⌊K/2⌋} D(i)(x),    (4)

where D(i)(x) denotes the distance from x to its i-th nearest neighbor. Our G is thus the average of x's (roughly) K/2-th to 3K/2-th nearest neighbor distances among the n points.
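A minimal NumPy sketch of Eq.(4) follows; the brute-force distance computation and the function name are our illustrative choices (in practice a K-NN graph or k-d tree would be used), and the reference set is assumed to contain at least K + ⌊K/2⌋ points.

```python
import numpy as np

def ak_statistic(x, ref_points, K):
    """Eq.(4): average of the (K - floor((K-1)/2))-th through
    (K + floor(K/2))-th nearest-neighbor distances of x among ref_points,
    i.e. roughly the K/2-th to 3K/2-th NN distances."""
    x = np.asarray(x, dtype=float)
    ref = np.asarray(ref_points, dtype=float)
    dists = np.sort(np.linalg.norm(ref - x, axis=1))  # D(1) <= D(2) <= ...
    lo = K - (K - 1) // 2                             # first order statistic used
    hi = K + K // 2                                   # last order statistic used
    return float(dists[lo - 1:hi].mean())             # average of K distances
```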

We then provide a U-statistic bootstrapping strategy to compute G. This resampling technique can reduce the variance and increase the robustness [10].

U-statistic Resampling for Computing G: Given n = 2m nominal points,
(a) Randomly split the data into two equal parts: S1 = {x1, ..., xm}, S2 = {xm+1, ..., x2m}.
(b) Use the points in S2 to calculate G for each xi ∈ S1 according to Eq.(4), and vice versa.
(c) Re-split the data and repeat the above steps B times. Let Gb(xi) be the statistic of xi obtained from the b-th resampling. Use the average as the final statistic:

    G(xi) = (1/B) Σ_{b=1}^{B} Gb(xi),    i = 1, 2, ..., n.

The above procedure computes G for the n nominal points using themselves. For a new test point η the steps for calculating G(η) follow similarly. Notice that the number of points used to calculate G for the nominal training points {x1, ..., xn} and for the test point η should be identical, so that G reflects the same property at xi and at η: the average of the K/2-th to 3K/2-th nearest neighbor distances among a total of m = n/2 points.
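The resampling procedure and its extension to a query point can be sketched in NumPy as follows, assuming n is even and reusing ak_statistic from the earlier sketch; all function names are ours.

```python
import numpy as np

def resampled_statistics(X, K, B=10, rng=None):
    """U-statistic resampling for the training points: split the n = 2m
    nominal points into halves, score each half against the other with
    ak_statistic, repeat B times, and average."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)                      # assumed even (n = 2m)
    m = n // 2
    G = np.zeros(n)
    for _ in range(B):
        perm = rng.permutation(n)
        S1, S2 = perm[:m], perm[m:]
        for i in S1:
            G[i] += ak_statistic(X[i], X[S2], K)
        for i in S2:
            G[i] += ak_statistic(X[i], X[S1], K)
    return G / B

def resampled_statistic_query(eta, X, K, B=10, rng=None):
    """G(eta) computed against random half-samples of the training set,
    so that it uses the same number of points (m = n/2) as above."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    m = len(X) // 2
    vals = [ak_statistic(eta, X[rng.permutation(len(X))[:m]], K) for _ in range(B)]
    return float(np.mean(vals))
```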

The algorithm for anomaly detection based on our new statistic G, which we call averaged K-LPE, or aK-LPE, is as follows:

aK-LPE anomaly detection:
1. Input: nominal training data {x1, ..., xn}, query point η, false alarm rate α.
2. Training stage: calculate G for every nominal point xi according to the above U-statistic bootstrapping strategy: G(xi), i = 1, 2, ..., n.
3. Testing stage:
(a) Calculate G for the query point η according to the above U-statistic bootstrapping strategy: G(η).
(b) Calculate the p-value estimate p̂(η) according to Eq.(3).
(c) Claim η to be anomalous if p̂(η) ≤ α; otherwise claim η to be nominal.
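Putting the sketches above together, an end-to-end run of the training and testing stages might look as follows; the toy data, α and the other parameter values are placeholders, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 2))   # placeholder nominal data
K, B, alpha = 30, 10, 0.05

# Training stage: G(x_i) for every nominal point via resampling.
G_train = resampled_statistics(X_train, K, B)

# Testing stage: G(eta), p-value estimate, decision.
eta = np.array([4.0, 4.0])
G_eta = resampled_statistic_query(eta, X_train, K, B)
p_hat = rank_pvalue(G_eta, G_train)
print("anomalous" if p_hat <= alpha else "nominal", p_hat)
```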

Run Time Complexity: Computing the distances from one point to n other points requires O(dn), and sorting these distances requires O(n log n). The training stage therefore requires O(Bn²(d + log n)) and the testing stage requires O(Bn(d + log n)), where the number of resamplings B is a small constant. Compared to the K-LPE of [1], the time complexity of the testing stage is decreased, because for every query point K-LPE needs to recalculate G for {x1, ..., xn}, at a complexity of O(n²(d + log n)).

Recently, [11] proposed an anomaly detection method based on bipartite k-NN graphs (BP-kNNG). They use another form of G, based on a bipartite partition of the nominal training data, for p-value estimation. In practice their ideas are quite similar to K-LPE and to our aK-LPE, except that they do not incorporate the bootstrapping strategy for reducing the variance.


3. ANALYSIS

In this section we establish the asymptotic consistency of our method for estimating the p-value through a finite sample analysis. We first show that the expectation of our p-value estimate converges to the true p-value function. Then we show that the empirical p-value estimate concentrates around its expectation.

For simplicity let n = m1(m2 + 1) and divide the n points into S = S0 ∪ S1 ∪ ... ∪ Sm1, where S0 = {x1, ..., xm1} and each Sj, j = 1, ..., m1, contains m2 points. Sj is used to compute G for η and for xj ∈ S0, and S0 is used to compute the rank of η:

    p̂(η) = (1/m1) Σ_{j=1}^{m1} I{G(xj; Sj) > G(η; Sj)}.    (5)
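As an illustration only, a NumPy sketch of this split-sample construction (used purely for the analysis) is given below; the contiguous block layout and the function name are our choices, and ak_statistic is the sketch from Sec. 2.1.

```python
import numpy as np

def split_rank_pvalue(eta, X, m1, m2, K):
    """Eq.(5): rank eta against S0 = {x_1, ..., x_m1}, where G for both x_j
    and eta is computed on the disjoint block S_j of m2 points.
    Assumes len(X) >= m1 * (m2 + 1) and m2 >= K + K // 2."""
    X = np.asarray(X, dtype=float)
    S0 = X[:m1]
    count = 0
    for j in range(m1):
        Sj = X[m1 + j * m2 : m1 + (j + 1) * m2]   # disjoint block for x_j
        count += ak_statistic(S0[j], Sj, K) > ak_statistic(eta, Sj, K)
    return count / m1
```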

Suppose the nominal density f = f0 satisfies the following regularity conditions: f is continuous and lower-bounded on a compact support C, f(x) ≥ fmin > 0; it is smooth, i.e. ||∇f(x)|| ≤ λ, where ∇f(x) is the gradient of f(·) at x; and flat regions are not allowed, i.e. for all x ∈ C and all σ > 0, P{y : |f(y) − f(x)| < σ} ≤ Mσ, where M is a constant.

Theorem 1. By choosing K properly, as m2 → ∞, we have

    |E[p̂(η)] − p(η)| → 0.    (6)

Proof. We only provide a brief outline of the proof. First,

    E_S[p̂(η)] = E_{S∖S0}[ E_{S0}[ (1/m1) Σ_{j=1}^{m1} I{G(η;Sj) < G(xj;Sj)} ] ]
              = (1/m1) Σ_{j=1}^{m1} E_{xj}[ E_{Sj}[ I{G(η;Sj) < G(xj;Sj)} ] ]
              = E_x[ P_{S1}( G(η;S1) < G(x;S1) ) ].

Fix x and let Fx(y1, ..., ym2) = G(x) − G(η), where y1, ..., ym2 are the m2 points in S1. It follows that

    P_{S1}( G(η) < G(x) ) = P_{S1}( Fx − EFx > −EFx ).

Applying McDiarmid's inequality to Fx after checking the bounded-difference condition on {y1, ..., ym2}, carefully adding the indicator function I{EFx > 0}, and taking the expectation with respect to x, we obtain

    | E_x[ P_{S1}(Fx > 0) ] − P_x( EFx > 0 ) | ≤ E_x[ exp( −c1 (EFx)² l² / m2 ) ],

where c1 is some constant. Divide the support C = C1 ∪ C2, where C1 contains points whose density is far from f(η) and C2 contains the rest. The proof follows by showing that for x ∈ C1 the exponential term above goes to 0 and I{EFx > 0} = I{f(η) > f(x)}, while C2 has small measure. Skipping the technical steps, the result is

    |E_S[p̂(η)] − p(η)| ≤ exp( −c2 K^(2+4/d) / m2^(1+4/d) ) + c3 (K/m2)^(1/d).

Choosing K = m2^α with (d+4)/(2d+4) < α < 1 finishes the proof.

Theorem 2. Let p̂(η) be defined as in Eq.(5). For any ε > 0, with probability at least 1 − 2 exp(−2 m1 ε²),

    |p̂(η) − E[p̂(η)]| ≤ ε.

Proof. p̂(η) = (1/m1) Σ_{j=1}^{m1} Yj, where the Yj = I{G(xj; Sj) > G(η; Sj)} are independent across j and Yj ∈ [0, 1]. Applying Hoeffding's inequality finishes the proof.

Combining the two theorems, the asymptotic consistency is immediate. Our method also has a better convergence rate than the O(n^(−6/5d)) rate of [1]. In [11] MSE convergence rate results are provided for the BP-kNNG algorithm, but these only study E[|p̂(η) − p(η)|²] asymptotically as n → ∞, whereas our analysis gives explicit bounds for the finite sample case: for any finite m1, m2, d, with probability at least 1 − δ,

    |p̂(η) − p(η)| ≤ g(K, m2; d) + √( ln(2/δ) / (2 m1) ),

where g(K, m2; d) denotes the bound in the proof of Theorem 1.
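For a sense of scale, the stochastic term of this bound is easy to evaluate numerically; the values of δ and m1 below are illustrative, not settings from the paper.

```python
import numpy as np

delta, m1 = 0.05, 500
# Second term of the finite-sample bound: sqrt(ln(2/delta) / (2 * m1)).
stochastic_term = np.sqrt(np.log(2.0 / delta) / (2 * m1))
print(f"with probability >= {1 - delta}, the ranking error is at most {stochastic_term:.3f}")
```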

Also note that our proof follows similar lines when G is chosen differently, for example as the average of the squares of the K/2-th to 3K/2-th nearest neighbor distances.

4. EXPERIMENTS

We first demonstrate the robustness of our method in estimating the p-value function. Fig. 1 shows the p-value estimates of K-LPE and aK-LPE for a Gaussian mixture density Σ_{i=1}^{2} αi N(μi, Σi), where α1 = 0.8, α2 = 0.2, μ1 = [4.5; 0], μ2 = [−0.5; 0], Σ1 = diag(2, 1), Σ2 = I, with n = 1000 and K = 30. It is clear that our method is more robust than K-LPE.
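A sketch that reproduces this synthetic setup with the earlier functions is given below; the evaluation grid and the random seed are our additions, and plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 30

# Two-component Gaussian mixture: weights 0.8 / 0.2, means [4.5, 0] and
# [-0.5, 0], covariances diag(2, 1) and the identity.
comp = rng.random(n) < 0.8
X = np.where(comp[:, None],
             rng.multivariate_normal([4.5, 0.0], np.diag([2.0, 1.0]), n),
             rng.multivariate_normal([-0.5, 0.0], np.eye(2), n))

# Empirical p-value surface over a grid of query points.
G_train = resampled_statistics(X, K)
xs, ys = np.meshgrid(np.linspace(-2, 8, 40), np.linspace(-3, 3, 40))
p_hat = np.array([rank_pvalue(resampled_statistic_query([x, y], X, K), G_train)
                  for x, y in zip(xs.ravel(), ys.ravel())]).reshape(xs.shape)
```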

[Figure 1 shows contour plots (levels 0.1, 0.3, 0.5, 0.7) of the empirical p-value estimates over the (x1, x2) plane.]

Fig. 1. Empirical p-value estimates of K-LPE (upper) and aK-LPE (lower) for a Gaussian mixture density. Our method (aK-LPE) is much more robust than K-LPE.


[Figure 2 shows ROC curves (true positives vs. false positives) for aK-LPE and K-LPE; panel (a): USPS 5 vs. else, panel (b): LetterRec 6 vs. else.]

Fig. 2. ROC curves for 5 vs. else of the 256-dim USPS digit data set and 6 vs. else of the 16-dim Letter Recognition data set. Our method aK-LPE performs better than K-LPE.

We apply the aK-LPE and K-LPE anomaly detection algorithms to the 256-dim USPS digit and 16-dim Letter Recognition data sets from the UCI data repository [12]. In [1], K-LPE has been shown to significantly outperform the baseline one-class SVM algorithm, so we do not include it. Figs. 2 and 3 show that our aK-LPE performs better than K-LPE.

[Figure 3 plots ROC curves (true positive vs. false positive rate) for our method and several baselines (SIRV, Ledoit-Wolf, Tyler ML, sample covariance).]

Fig. 3. ROC curves for the UMich sensor received signal strength (RSS) data set. Our method performs the best.

The UMich RSS data set [13] consists of sensor received signal strength (RSS) measurements collected by Mica2 sensor nodes deployed inside and outside a lab room, with anomaly patterns occurring when students walk into and out of the lab. There are 14 sensors randomly deployed inside and outside the lab, giving 14 × 13 = 182 sensor pairs in total. Measurements are taken 0.5 seconds apart over 30 minutes. In our experiment we use n = 900 measurements for training and the remaining measurements for testing, and at each time instance we determine whether there is activity, with K = 20. Fig. 3 demonstrates that our method outperforms the other methods.

5. CONCLUSIONS

Based on the K-LPE anomaly detection framework of [1], we propose a new statistic for estimating the p-value function p(η), based on the average of K nearest neighbor distances of η among the given n nominal training samples; it makes use of the information from multiple neighbors of η. We also provide a U-statistic strategy for computing G which leads to more robustness. We then justify the asymptotic consistency through a finite-sample analysis. Synthetic and real experiments demonstrate the superiority of our ideas.

6. REFERENCES

[1] M. Zhao and V. Saligrama, "Anomaly detection with score functions based on nearest neighbor graphs," in Advances in Neural Information Processing Systems, vol. 22, 2009.

[2] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in Proceedings of the ACM SIGMOD Conference, 2000.

[3] M. Markou and S. Singh, "Novelty detection: a review – part 1: statistical approaches," Signal Processing, vol. 83, pp. 2481–2497, 2003.

[4] D. Tax and K.-R. Müller, "Feature extraction for one-class classification," in Artificial Neural Networks and Neural Information Processing, 2003.

[5] R. Vert and J. Vert, "Consistency and convergence rates of one-class SVMs and related algorithms," Journal of Machine Learning Research, vol. 7, pp. 817–854, 2006.

[6] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice-Hall, New Jersey, 1993.

[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, pp. 1443–1471, 2001.

[8] C. Scott and R. D. Nowak, "Learning minimum volume sets," Journal of Machine Learning Research, vol. 7, pp. 665–704, 2006.

[9] A. O. Hero, "Geometric entropy minimization (GEM) for anomaly detection and localization," in Neural Information Processing Systems Conference, vol. 19, 2006.

[10] V. Koroljuk and Y. Borovskich, Theory of U-Statistics (Mathematics and Its Applications), Kluwer Academic Publishers Group, 1994.

[11] K. Sricharan and A. O. Hero III, "Efficient anomaly detection using bipartite k-NN graphs," in NIPS, 2011.

[12] A. Frank and A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.

[13] N. Patwari, A. O. Hero III, and K. Sricharan, "CRAWDAD data set umich/rss," http://crawdad.cs.dartmouth.edu/umich/rss.
