NEW STATISTIC IN P-VALUE ESTIMATION FOR ANOMALY DETECTION
Jing Qian
Boston University, Boston, MA
Venkatesh Saligrama
Boston University, Boston, MA
ABSTRACT
Given n nominal samples, a query point η, and a significance level α, the uniformly most powerful test for anomaly detection is to test p(η) ≤ α, where p(η) is the p-value function of η. In [1] a p-value estimator is proposed that is based on ranking a statistic over all data samples, and is shown to be asymptotically consistent. Building on this framework, we propose a new statistic for p-value estimation. It is based on the average of the K nearest neighbor (K-NN) distances of η within a K-NN graph constructed from the n nominal training samples. We also provide a bootstrapping strategy for estimating p-values that leads to better robustness. We then theoretically justify the asymptotic consistency of our ideas through a finite sample analysis. Synthetic and real experiments demonstrate the superiority of our scheme.
Index Terms— Anomaly Detection, p-value, k-NN graph
1. INTRODUCTION
Anomaly detection problems, also called outlier detection [2], novelty detection [3], or one-class classification [4, 5], arise in many applications where failure to detect anomalous activities or events could have severe consequences, and they have been studied extensively. Typically a training set of nominal samples is given, and the task of anomaly detection is to design a detection rule that maximizes the power to detect whether a query point η is anomalous, while controlling the false alarm probability to be no higher than a specified significance level α.
Approaches for anomaly detection can be divided into two categories: parametric and non-parametric. Usually the underlying nominal density is unknown, so parametric approaches for anomaly detection [6], which assume a nominal distribution, can suffer from model mismatch and result in poor performance.
In recent years non-parametric approaches have become popular. These methods include the one-class SVM [7], minimum volume (MV) set estimation [8], and the GEM approach for determining MV sets [9]. The one-class SVM approach is computationally efficient, but cannot directly control the desired false alarm probability and usually does not yield satisfactory performance. The MV set estimation approach suffers from the complexity of approximating high-dimensional quantities, either the multivariate density or the MV set boundaries. The GEM approach involves seeking the K-point minimum spanning tree of n samples and is computationally intractable. A simplified surrogate of the GEM approach is also proposed in [9]; while simpler than GEM, it is no longer asymptotically consistent and loses the provable optimality.
In [1] a non-parametric approach for anomaly detection is proposed, based on a score function that maps the data samples into the interval [0, 1]. For a query point η, an anomaly is declared by directly thresholding the score of η at the desired significance level α. This score function of η is the ranking of η among all nominal samples based on some statistic G, and is essentially an estimate of the p-value function at η. In [1] it is proven that directly thresholding the p-value function with respect to α is the uniformly most powerful (UMP) test when the anomalies are drawn from a mixture of the unknown nominal density and the uniform distribution, and that this score function of η converges to the true p-value at η asymptotically.
In this paper we propose a new statistic G within the framework of [1]. Our statistic G is based on the average of the K nearest neighbor (K-NN) distances of η within a K-NN graph constructed from all nominal samples. Through a finite sample analysis we show that our scheme not only inherits the asymptotic consistency in estimating the p-value function, thus retaining the UMP optimality for anomaly detection, but also achieves a better convergence rate than the K-LPE scheme of [1]. Moreover, we provide a U-statistic bootstrapping strategy for reducing the variance in computing G, which leads to improved robustness for p-value estimation.
The rest of this paper is organized as follows. Sec. 2 describes our new statistic G for p-value estimation, the U-statistic bootstrapping strategy to compute G, and the anomaly detection procedure based on our scheme. Sec. 3 presents the finite sample analysis and asymptotic consistency. Synthetic and real experiments are reported in Sec. 4, demonstrating improved performance and robustness. Sec. 5 concludes the paper.
2. MAIN IDEA AND ALGORITHMS
Let S = {x_1, ..., x_n} be the given n-point nominal training set, drawn i.i.d. from some d-dimensional underlying density f_0. A test point is assumed to come from a mixture of the nominal distribution f_0 and another known density f_1, which for simplicity is assumed to be the uniform distribution. Let η be the query point. The task of anomaly detection is to declare whether η is consistent with nominal data, H_0 : η ∼ f_0, or deviates from nominal, H_1 : η ∼ f_1, subject to the specified significance level P(declare H_1 | H_0) ≤ α.
In [1] it has been proven that the uniformly most powerful
test for the above problem is:
\[
\phi(\eta) = \begin{cases} H_1, & p(\eta) \le \alpha \\ H_0, & \text{otherwise} \end{cases} \tag{1}
\]
where, under the assumption that f_1 is uniformly distributed, p(η) is the p-value function of η:
\[
p(\eta) = P_0\!\left( x : \frac{f_1(x)}{f_0(x)} \ge \frac{f_1(\eta)}{f_0(\eta)} \right) = \int_{\{x \,:\, f_0(x) \le f_0(\eta)\}} f_0(x)\, dx \tag{2}
\]
For a query point η, the p-value function can be estimated through a ranking of η among all nominal points {x_1, ..., x_n} based on some statistic G, as follows:
\[
\hat{p}(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{G(\eta) \le G(x_i)\}} \tag{3}
\]
where \(\mathbb{I}\) denotes the indicator function.
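For concreteness, here is a minimal Python sketch of the ranking estimator in Eq. (3); the function name and NumPy usage are our own illustration, not code from [1]:

```python
import numpy as np

def p_value_estimate(G_eta, G_nominal):
    """Estimate the p-value of a query point by ranking its statistic
    G(eta) against the statistics G(x_i) of the n nominal points (Eq. 3)."""
    G_nominal = np.asarray(G_nominal)
    # Fraction of nominal points whose statistic is at least as large as G(eta).
    return np.mean(G_eta <= G_nominal)
```

Anomaly detection then reduces to checking `p_value_estimate(G_eta, G_nominal) <= alpha`.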
Ideally we would like to choose G to be the nominal density f_0(·) of the samples, so that p̂(η) approximates p(η). Since the underlying density f_0 is unknown, surrogates reflecting the relative density at points are used. In [1] two forms of G are adopted:
(1) ε-neighborhood: G(x) = −N_ε(x), the negative of the number of neighbors within an ε-ball of x among the n nominal points.
(2) K-nearest neighbor distance: G(x) = D^{(K)}(x), the distance from x to its K-th nearest neighbor among the n nominal points.
2.1. aK-LPE Anomaly Detection Algorithm
We propose a new statistic G which has the following form:
\[
G(x) = \frac{1}{K} \sum_{i = K - \lfloor (K-1)/2 \rfloor}^{K + \lfloor K/2 \rfloor} D^{(i)}(x) \tag{4}
\]
Our G is the average of x's (K/2)-th through (3K/2)-th nearest neighbor distances among the n points.
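A minimal sketch of the statistic in Eq. (4), assuming `points` excludes x itself (the function name is ours):

```python
import numpy as np

def aknn_statistic(x, points, K):
    """Average of the (K - floor((K-1)/2))-th through (K + floor(K/2))-th
    nearest-neighbor distances of x among `points` (Eq. 4), i.e. roughly
    the K/2-th to 3K/2-th NN distances; `points` should not contain x."""
    d = np.linalg.norm(points - x, axis=1)  # distances from x to all points
    d.sort()                                # ascending nearest-neighbor distances
    lo = K - (K - 1) // 2                   # ~K/2-th neighbor (1-based index)
    hi = K + K // 2                         # ~3K/2-th neighbor (1-based index)
    return d[lo - 1 : hi].mean()            # average of K consecutive NN distances
```

Note that the slice always contains exactly K distances, so the `.mean()` matches the 1/K prefactor of Eq. (4).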
We then provide a U-statistic bootstrapping strategy to
compute G. This resampling technique can reduce the vari-
ance and increase the robustness [10].
U-statistic Resampling for Computing G: Given n = 2m nominal points,
(a) Randomly split the data into two equal parts: S_1 = {x_1, ..., x_m}, S_2 = {x_{m+1}, ..., x_{2m}}.
(b) Points in S_2 are used to calculate G for x_i ∈ S_1 according to Eq. (4), and vice versa.
(c) Re-split the data and repeat the above steps B times. Let G_b(x_i) be the statistic of x_i obtained from the b-th resampling. We then use the average as the final statistic:
\[
G(x_i) = \frac{1}{B} \sum_{b=1}^{B} G_b(x_i), \quad i = 1, 2, ..., n
\]
The above algorithm describes the procedure for computing G for the n nominal points using themselves. For a new test point η the steps for calculating G(η) follow similarly. Notice that the number of points used to calculate G for the nominal training points {x_1, ..., x_n} and for the test point η should be identical, so that G reflects the same property at x_i and η: the average of the (K/2)-th through (3K/2)-th nearest neighbor distances among a total of m = n/2 points.
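A minimal sketch of this resampling strategy, reusing `aknn_statistic` from above (function names and the NumPy random API usage are our own, and n = 2m is assumed):

```python
import numpy as np

def u_statistic_G(train, K, B, rng=None):
    """U-statistic resampling for G on n = 2m nominal points: repeatedly
    split the data in half, score each half against the other via Eq. (4),
    and average the B resampled statistics."""
    rng = np.random.default_rng(rng)
    n = len(train)
    m = n // 2
    G_sum = np.zeros(n)
    for _ in range(B):
        perm = rng.permutation(n)
        S1, S2 = perm[:m], perm[m:2 * m]
        for i in S1:        # S2 is used to score points in S1 ...
            G_sum[i] += aknn_statistic(train[i], train[S2], K)
        for i in S2:        # ... and vice versa
            G_sum[i] += aknn_statistic(train[i], train[S1], K)
    return G_sum / B

def u_statistic_G_query(eta, train, K, B, rng=None):
    """Score a query point against random half-samples of the training set,
    so G(eta) is computed among the same number m = n/2 of points."""
    rng = np.random.default_rng(rng)
    m = len(train) // 2
    return np.mean([aknn_statistic(eta, train[rng.permutation(len(train))[:m]], K)
                    for _ in range(B)])
```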
The algorithm for anomaly detection based on our new statistic G, which we call averaged K-LPE, or aK-LPE, is as follows:
aK-LPE anomaly detection:
1. Input: Nominal training data {x_1, ..., x_n}, query point η, false alarm rate α.
2. Training Stage: Calculate G for every nominal point x_i according to the above U-statistic bootstrapping strategy: G(x_i), i = 1, 2, ..., n.
3. Testing Stage:
(a) Calculate G for the query point η according to the above U-statistic bootstrapping strategy: G(η).
(b) Calculate the p-value estimate p̂(η) according to Eq. (3).
(c) Declare η anomalous if p̂(η) ≤ α; otherwise declare η nominal.
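Putting the pieces together, a hedged end-to-end sketch of the procedure above, reusing our earlier helper functions:

```python
def ak_lpe_detect(eta, train, alpha, K=30, B=10, rng=0):
    """aK-LPE: train on the nominal sample, then declare eta anomalous
    when its estimated p-value falls at or below alpha."""
    G_train = u_statistic_G(train, K, B, rng)            # training stage
    G_eta = u_statistic_G_query(eta, train, K, B, rng)   # testing stage
    p_hat = p_value_estimate(G_eta, G_train)             # Eq. (3)
    return p_hat <= alpha, p_hat
```

In a deployment one would compute `G_train` once and reuse it across queries, which is exactly the testing-stage saving discussed below.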
Run Time Complexity: Computing the distances from one point to n other points requires O(dn), and sorting these distances requires O(n log n). So the training stage requires O(Bn^2(d + log n)) and the testing stage requires O(Bn(d + log n)). Note that the number of resamplings B is a small constant. Compared to K-LPE of [1], the time complexity of the testing stage is reduced, because for every query point K-LPE needs to recalculate G for {x_1, ..., x_n}, at a complexity of O(n^2(d + log n)).
Recently [11] proposed an anomaly detection method based on bipartite k-NN graphs (BP-kNNG). It uses another form of G, based on a bipartite partition of the nominal training data, for p-value estimation. In practice their ideas are quite similar to K-LPE and our aK-LPE, except that they do not incorporate the bootstrapping strategy for reducing the variance.
3. ANALYSIS
In this section we establish the asymptotic consistency of our method for estimating the p-value through a finite sample analysis. We first show that the expectation of our p-value estimate converges to the true p-value function. Then we show that the empirical p-value estimate concentrates at its expectation.
For simplicity let n = m_1(m_2 + 1) and divide the n points into S = S_0 ∪ S_1 ∪ ... ∪ S_{m_1}, where S_0 = {x_1, ..., x_{m_1}} and each S_j, j = 1, ..., m_1, has m_2 points. S_j is used to compute G for η and x_j ∈ S_0, and S_0 is used to compute the rank of η:
\[
\hat{p}(\eta) = \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{G(x_j; S_j) > G(\eta; S_j)\}} \tag{5}
\]
Suppose the nominal density f = f_0 satisfies the following regularity conditions: f is continuous and lower-bounded on a compact support C, f(x) ≥ f_min > 0; it is smooth, i.e., ||∇f(x)|| ≤ λ, where ∇f(x) is the gradient of f(·) at x; and flat regions are not allowed, i.e., ∀x ∈ C, ∀σ > 0, P{y : |f(y) − f(x)| < σ} ≤ Mσ, where M is a constant.
Theorem 1. By choosing K properly, as m_2 → ∞, we have
\[
\left| E\left[\hat{p}(\eta)\right] - p(\eta) \right| \to 0. \tag{6}
\]
Proof. We only provide a brief outline for the proof.
\[
\begin{aligned}
E_S[\hat{p}(\eta)] &= E_{S \setminus S_0}\!\left[ E_{S_0}\!\left[ \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{G(\eta; S_j) < G(x_j; S_j)\}} \right] \right] \\
&= \frac{1}{m_1} \sum_{j=1}^{m_1} E_{x_j}\!\left[ E_{S_j}\!\left[ \mathbb{I}_{\{G(\eta; S_j) < G(x_j; S_j)\}} \right] \right] \\
&= E_x\!\left[ P_{S_1}\!\left( G(\eta; S_1) < G(x; S_1) \right) \right]
\end{aligned}
\]
Fix x and let F_x(y_1, ..., y_{m_2}) = G(x) − G(η), where y_1, ..., y_{m_2} are the m_2 points in S_1. It follows that
\[
P_{S_1}\left( G(\eta) < G(x) \right) = P_{S_1}\left( F_x - E F_x > -E F_x \right).
\]
Applying McDiarmid's inequality to F_x after checking the bounded-difference condition on {y_1, ..., y_{m_2}}, carefully adding the indicator function \(\mathbb{I}_{\{E F_x > 0\}}\), and taking the expectation w.r.t. x, we have
\[
\left| E\left[ P_{S_1}(F_x > 0) \right] - P_x\left( E F_x > 0 \right) \right| \le E_x\!\left[ e^{-c_1 \frac{(E F_x)^2}{l^2} m_2} \right]
\]
where c_1 is some constant. Divide the support C = C_1 ∪ C_2, where C_1 contains points whose density is far from f(η) and C_2 contains the rest. The proof proceeds by showing that for x ∈ C_1 the above exponential term goes to 0 and \(\mathbb{I}_{\{E F_x > 0\}} = \mathbb{I}_{\{f(\eta) > f(x)\}}\), while the remainder C_2 has small measure. We skip the technical steps and directly present the result:
\[
\left| E_S[\hat{p}(\eta)] - p(\eta) \right| \le \exp\left( -c_2 \frac{K^{2 + 4/d}}{m_2^{1 + 4/d}} \right) + c_3 \left( \frac{K}{m_2} \right)^{1/d}
\]
Let K = m_2^α with (d+4)/(2d+4) < α < 1, and the proof is finished.
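As a concrete instance of this rate condition (our own arithmetic, not from the paper): for d = 2 the condition reads (d+4)/(2d+4) = 6/8 = 0.75 < α < 1, so choosing, e.g., K = m_2^{0.8} makes the exponential term exp(−c_2 m_2^{0.2}) and the second term c_3 m_2^{−0.1} both vanish as m_2 → ∞.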
Theorem 2. Let p̂(η) be defined as in Eq. (5). For any ε > 0, with probability at least 1 − 2 exp(−2 m_1 ε^2),
\[
\left| \hat{p}(\eta) - E[\hat{p}(\eta)] \right| \le \varepsilon.
\]
Proof. \(\hat{p}(\eta) = \frac{1}{m_1} \sum_{j=1}^{m_1} Y_j\), where \(Y_j = \mathbb{I}_{\{G(x_j; S_j) > G(\eta; S_j)\}}\) is independent across j and Y_j ∈ [0, 1]. Applying Hoeffding's inequality finishes the proof.
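To make the concentration concrete (our own arithmetic): with m_1 = 1000 ranking points and ε = 0.05, Theorem 2 gives |p̂(η) − E[p̂(η)]| ≤ 0.05 with probability at least 1 − 2e^{−5} ≈ 0.987.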
Combining the two theorems, the asymptotic consistency is immediate. Our method has a better convergence rate than that of [1] (O(n^{-6/5d})). In [11] MSE convergence rate results are provided for the BP-kNNG algorithm. Notice that this only studies E[|p̂(η) − p(η)|^2] asymptotically as n → ∞, while our analysis presents explicit bounds for the finite sample case: for any finite m_1, m_2, d, with probability at least 1 − δ,
\[
\left| \hat{p}(\eta) - p(\eta) \right| \le g(K, m_2; d) + \sqrt{\frac{\ln(2/\delta)}{2 m_1}},
\]
where g(K, m_2; d) denotes the bound in the proof of Thm. 1.
Also notice that our proof follows similar lines when G is chosen differently, for example as the average of the squares of the (K/2)-th through (3K/2)-th nearest neighbor distances.
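As a hedged illustration of this flexibility, the squared-distance variant changes only one line of our earlier `aknn_statistic` sketch:

```python
import numpy as np

def aknn_statistic_sq(x, points, K):
    """Variant of Eq. (4): average of the *squared* (K/2)-th through
    (3K/2)-th nearest neighbor distances (our illustration, not the
    statistic used in the paper's experiments)."""
    d = np.sort(np.linalg.norm(points - x, axis=1))
    return (d[K - (K - 1) // 2 - 1 : K + K // 2] ** 2).mean()
```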
4. EXPERIMENTS
We first demonstrate the robustness of our method in estimating the p-value function. Fig. 1 shows the p-value estimates of K-LPE and aK-LPE for a Gaussian mixture density \(\sum_{i=1}^{2} \alpha_i N(\mu_i, \Sigma_i)\), where α_1 = 0.8, α_2 = 0.2, μ_1 = [4.5; 0], μ_2 = [−0.5; 0], Σ_1 = diag(2, 1), Σ_2 = I, with n = 1000 and K = 30. It is clear that our method performs more robustly than K-LPE.
[Figure 1: two contour plots of the estimated p-value (contour levels 0.1, 0.3, 0.5, 0.7) over the (x1, x2) plane, x1 ∈ [−2, 8], x2 ∈ [−3, 3].]
Fig. 1. Empirical p-value estimates of K-LPE (upper) and aK-LPE (lower) for a Gaussian mixture density. Our method (aK-LPE) performs much more robustly than K-LPE.
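For reproducibility, a sketch of the synthetic setup under the parameters stated above, reusing our earlier helper functions (the random seed and the particular query points are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, B = 1000, 30, 10

# Gaussian mixture: 0.8 * N([4.5, 0], diag(2, 1)) + 0.2 * N([-0.5, 0], I)
from_first = rng.random(n) < 0.8
train = np.where(from_first[:, None],
                 rng.multivariate_normal([4.5, 0.0], np.diag([2.0, 1.0]), n),
                 rng.multivariate_normal([-0.5, 0.0], np.eye(2), n))

G_train = u_statistic_G(train, K, B, rng=0)

# p-value estimates at a dense (mode) point and a tail point of the mixture
for eta in (np.array([4.5, 0.0]), np.array([7.5, 2.5])):
    G_eta = u_statistic_G_query(eta, train, K, B, rng=0)
    print(eta, p_value_estimate(G_eta, G_train))
```

Evaluating `p_value_estimate` over a grid on [−2, 8] × [−3, 3] and drawing level sets reproduces contour plots of the kind shown in Fig. 1.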
[Figure 2: ROC curves (true positives vs. false positives) of aK-LPE and K-LPE; (a) USPS 5 vs. else, (b) LetterRec 6 vs. else.]
Fig. 2. ROC curves for 5 vs. else of 256-dim USPS digit and 6 vs.
else of 16-dim Letter Recognition data sets. Our method aK-LPE
performs better than K-LPE.
We apply the aK-LPE and K-LPE anomaly detection algorithms to the 256-dim USPS digit and 16-dim Letter Recognition data sets from the UCI data repository [12]. In [1] K-LPE was shown to significantly outperform the baseline one-class SVM algorithm, so we do not include the latter. Figs. 2 and 3 show that our aK-LPE performs better than the alternatives.
[Figure 3: ROC curves (true positive vs. false positive) comparing Ours, UMich, SIRV, Ledoit-Wolf, Tyler ML, and Sample covariance.]
Fig. 3. ROC curves for UMich sensor received signal strength
(RSS) data set. Our method performs the best.
The UMich RSS data set [13] consists of received signal strength (RSS) measurements collected by Mica2 sensor nodes deployed inside and outside a lab room, with anomaly patterns occurring when students walk into and out of the lab. There are 14 sensors randomly deployed inside and outside the lab (14 × 13 = 182 sensor pairs in total). Measurements are taken 0.5 seconds apart over 30 minutes. In our experiment we use n = 900 measurements for training and the remaining for testing, and at each time instant we determine whether there is activity, with K = 20. Fig. 3 demonstrates that our method outperforms the other methods.
5. CONCLUSIONS
Based on the K-LPE anomaly detection framework of [1], we propose a new statistic for estimating the p-value function p(η), based on the average of the K nearest neighbor distances of η among the given n nominal training samples; it thus makes use of the information from multiple neighbors of η. We also provide a U-statistic strategy for computing G which leads to improved robustness. We then justify the asymptotic consistency through a finite-sample analysis. Synthetic and real experiments demonstrate the superiority of our ideas.
6. REFERENCES
[1] M. Zhao and V. Saligrama, "Anomaly detection with score functions based on nearest neighbor graphs," Advances in Neural Information Processing Systems, vol. 22, 2009.
[2] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in Proceedings of the ACM SIGMOD Conference, 2000.
[3] M. Markou and S. Singh, "Novelty detection: a review – part 1: statistical approaches," Signal Processing, vol. 83, pp. 2481–2497, 2003.
[4] D. Tax and K. R. Muller, "Feature extraction for one-class classification," in Artificial Neural Networks and Neural Information Processing, 2003.
[5] R. Vert and J. Vert, "Consistency and convergence rates of one-class SVMs and related algorithms," Journal of Machine Learning Research, vol. 7, pp. 817–854, 2006.
[6] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice-Hall, New Jersey, 1993.
[7] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, pp. 1443–1471, 2001.
[8] C. Scott and R. D. Nowak, "Learning minimum volume sets," Journal of Machine Learning Research, vol. 7, pp. 665–704, 2006.
[9] A. O. Hero, "Geometric entropy minimization (GEM) for anomaly detection and localization," in Neural Information Processing Systems Conference, 2006, vol. 19.
[10] V. Koroljuk and Y. Borovskich, Theory of U-statistics (Mathematics and Its Applications), Kluwer Academic Publishers Group, 1994.
[11] K. Sricharan and A. O. Hero III, "Efficient anomaly detection using bipartite k-NN graphs," in NIPS, 2011.
[12] A. Frank and A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.
[13] N. Patwari, A. O. Hero III, and K. Sricharan, "CRAWDAD data set umich/rss," http://crawdad.cs.dartmouth.edu/umich/rss.