anomaly detection


Upload: chul-kim

Post on 21-Apr-2017


Page 1: Anomaly detection

Anomaly Detection

Advanced Intelligence Technology Research Society
Chul Kim ([email protected])

2016-07-09

Page 2: Anomaly detection

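What is anomaly detection?
- An anomaly is a sample that departs from the main stream of the data.
- In data mining, anomaly detection means identifying items, events, or observations that do not conform to an expected pattern or the normal range.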

outlier

Page 3: Anomaly detection

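What is anomaly detection? (cont.)
- The minimum and maximum of a data set are not necessarily outliers (Min, Max ≠ Outlier).
- 1.5 x IQR rule: a value is treated as an outlier when it lies below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR.
- IQR (Interquartile Range) = Q3 - Q1

[Figure: box plot with Min and Max marked]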
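The 1.5 x IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `iqr_outliers`, the sample data, and the choice of the "inclusive" quartile method are all assumptions, and other quartile conventions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: the maximum (102) is an outlier, the minimum (10) is not.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(data)
print(outliers)  # [102]
```

In this sample the minimum stays inside the fences while the maximum does not, which is the point of "Min, Max ≠ Outlier".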

Page 4: Anomaly detection

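What is anomaly detection? (cont.)
- Anomalous values are typically interpreted as a symptom of a problem.
- They are rare phenomena that do not follow the usual statistical definitions.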

Page 5: Anomaly detection

What is anomaly detection? (cont.)
- Clustering algorithms can detect the micro-clusters formed by anomalous patterns.

Page 6: Anomaly detection

History
- Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986.

- Early approaches relied on thresholds for normal behavior, statistical preprocessing, soft computing, and inductive learning.

Page 7: Anomaly detection

History (cont.)

Page 8: Anomaly detection

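Applications
- Cyber intrusion detection, credit card fraud detection, fault detection, system health monitoring, IoT, etc.

- Detecting ecosystem disturbances
- Frequently used to remove anomalous values from data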

Page 9: Anomaly detection

Three categories
1. Unsupervised anomaly detection
- Detects anomalies in unlabeled data
- Example: detecting anomalies with the K-means clustering algorithm

2. Supervised anomaly detection
- Both Normal and Abnormal labels are available
- Uses a classification model (SVM, Random forests, Logistic regression, Robust, KNN, etc.)

Page 10: Anomaly detection

Three categories (cont.)
3. Semi-supervised anomaly detection
- Only Normal labels are available; anomalies are extracted by comparing the likelihood produced by a model of normal behavior
- NKIA’s LRSTSD-based Anomaly Detection
- Twitter’s Seasonal Hybrid ESD (S-H-ESD) based Anomaly Detection

[Figures: NKIA’s Anomaly Detection; Twitter’s Anomaly Detection]
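The semi-supervised idea (fit a model on Normal data only, then compare likelihoods) can be sketched as follows. This is a minimal illustration under strong assumptions: a single Gaussian stands in for the normal model, the CPU readings are made up, and the three-sigma cut-off is a hypothetical choice. It is not the NKIA or Twitter method.

```python
from statistics import NormalDist

def fit_normal_model(normal_samples):
    """Fit a Gaussian to samples that are all labeled Normal."""
    return NormalDist.from_samples(normal_samples)

def is_anomaly(model, x, min_likelihood):
    """Flag x when its likelihood (density) under the normal model is too low."""
    return model.pdf(x) < min_likelihood

# Hypothetical training data: CPU utilization (%) observed during normal operation.
normal_cpu = [41.0, 39.5, 40.2, 42.1, 38.9, 40.7, 41.3, 39.8]
model = fit_normal_model(normal_cpu)

# Hypothetical cut-off: the density three standard deviations away from the mean.
cutoff = model.pdf(model.mean + 3 * model.stdev)
flags = [is_anomaly(model, x, cutoff) for x in (40.5, 55.0)]
print(flags)  # [False, True]
```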

Page 11: Anomaly detection

Input data
- Univariate
- Multivariate

Page 12: Anomaly detection

Input data (cont.)
Data types
- Binary
- Categorical
- Continuous
- Hybrid

Page 13: Anomaly detection

Types of anomalies
Point Anomalies
- A value that lies away from the bulk of the data set

Page 14: Anomaly detection

Types of anomalies (cont.)
Contextual Anomalies
- A value that is out of place within its context
- Requires a notion of context
- Also referred to as conditional anomalies (rules)

Page 15: Anomaly detection

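Types of anomalies (cont.)
Collective Anomalies
- A collection of related data instances that is anomalous with respect to the whole data set, even when the individual instances are not anomalous on their own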

Page 16: Anomaly detection

Output of Anomaly Detection

Label
- Label of normal or anomaly
- In the classification approach: true|false or a class

Score
- Rank
- Range 0 to 1
- Requires a threshold parameter
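The two output styles are related: a score becomes a label once a threshold is chosen, and scores also give a ranking. A minimal sketch, where the function names, the example scores, and the threshold of 0.75 are all illustrative assumptions:

```python
def scores_to_labels(scores, threshold):
    """Map anomaly scores in [0, 1] to labels using a threshold parameter."""
    return ["anomaly" if s >= threshold else "normal" for s in scores]

def rank_by_score(scores):
    """Return the indices ordered from most to least anomalous."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical scores for four observations.
scores = [0.05, 0.92, 0.40, 0.77]
labels = scores_to_labels(scores, threshold=0.75)
ranking = rank_by_score(scores)
print(labels)   # ['normal', 'anomaly', 'normal', 'anomaly']
print(ranking)  # [1, 3, 2, 0]
```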

Page 17: Anomaly detection

Evaluating anomaly detection

F-Measure
- Evaluates supervised learning (classification) problems
- Formula:
  Recall (R) = TP / (TP + FN)
  Precision (P) = TP / (TP + FP)
  F-measure = 2 * R * P / (R + P)

The Area Under an ROC Curve
- AUC (Area Under the Curve)
- Detection rate (true positive rate) vs. false alarm rate (false positive rate)
- Range 0 to 1
- Equation terms: m = # of TP, n = # of TN

Two-way cross table (Crosstable); Normal is treated as the positive class:

                     Actual class
Predicted class      Normal    Anomaly
Normal               TP        FP
Anomaly              FN        TN

Grading table for AUC scores:

Score        Label
.90 ~ 1      Excellent (A)
.80 ~ .90    Good (B)
.70 ~ .80    Fair (C)
.60 ~ .70    Poor (D)
.50 ~ .60    Fail (F)

[Figure: ROC (Receiver Operating Characteristic) curves]
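Both measures can be computed directly from counts and scores. The sketch below is illustrative only: the counts and scores are made up, and the AUC uses the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve but may not be the equation shown on the slide.

```python
def f_measure(tp, fp, fn):
    """F-measure = 2*R*P / (R + P) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical counts: R = P = 0.8, so the F-measure is also about 0.8.
f = f_measure(tp=8, fp=2, fn=2)
# Hypothetical scores: 5 of the 6 positive/negative pairs are ordered correctly.
a = auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3])
print(round(f, 3), round(a, 3))
```

An AUC of about 0.83 would fall in the "Good (B)" band of the grading table.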

Page 18: Anomaly detection

Taxonomy*

Page 19: Anomaly detection

Well-known anomaly detection techniques

Page 20: Anomaly detection

Twitter’s Anomaly Detection R pack.

Twitter open-sourced their R package for anomaly detection. They call their algorithm Seasonal Hybrid ESD (S-H-ESD), which is built on Generalized ESD. Sometimes anomalies can mess up your modeling.

Page 21: Anomaly detection

Twitter’s Anomaly Detection R pack.(cont.)

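# Install the package from GitHub (requires devtools)
install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

# Plotting dependencies
install.packages("gtable")
install.packages("scales")

# raw_data is the example data set bundled with the package
data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot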

Page 22: Anomaly detection

Twitter’s Anomaly Detection R pack.(cont.)

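# Assumes the CSV holds the observations in a single column
v <- read.csv("D:/r/tsd_paper/cpu_5m_02.csv")
# period = number of observations per seasonal cycle
res2 = AnomalyDetectionVec(v, max_anoms=0.02, period=72, direction='both', plot=TRUE)
res2$plot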

Page 23: Anomaly detection

Twitter’s Anomaly Detection R pack.(cont.)

Usage

AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05,
  only_last = NULL, threshold = "None", e_value = FALSE, longterm = FALSE,
  piecewise_median_period_weeks = 2, plot = FALSE, y_log = FALSE,
  xlabel = "", ylabel = "count", title = NULL, verbose = FALSE)

Arguments
- x : Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.
- max_anoms : Maximum number of anomalies that S-H-ESD will detect as a percentage of the data.
- direction : Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
- alpha : The level of statistical significance with which to accept or reject anomalies.
- only_last : Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
- threshold : Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
- e_value : Add an additional column to the anoms output containing the expected value.
- longterm : Increase anom detection efficacy for time series that are greater than a month. See Details below.
- piecewise_median_period_weeks : The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014). Defaults to 2.

Page 24: Anomaly detection

Twitter’s Anomaly Detection R pack.(cont.)

Arguments (cont.)
- plot : A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
- y_log : Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the rest of the data.
- xlabel : X-axis label to be added to the output plot.
- ylabel : Y-axis label to be added to the output plot.
- title : Title for the output plot.
- verbose : Enable debug messages.

Page 25: Anomaly detection

Twitter’s Anomaly Detection R pack. (cont.)
To understand how Twitter’s algorithm works, you need to know:
- Student t-distribution
- Extreme Studentized Deviate (ESD) test
- Generalized ESD
- Linear regression
- LOESS
- STL (Seasonal and Trend decomposition using Loess)

Page 26: Anomaly detection

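Twitter’s Anomaly Detection R pack. (cont.)
Student t-distribution
- A distribution mainly used when estimating the mean of a normally distributed population

[Figure: probability density function (PDF) of the t-distribution]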
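For reference, the standard density of Student's t-distribution with \nu degrees of freedom is given below; the symbol names are the conventional ones and may differ from those on the slide.

```latex
f(t) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}
            {\sqrt{\nu \pi}\; \Gamma\!\left(\frac{\nu}{2}\right)}
       \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu + 1}{2}}
```

As \nu grows, this density approaches the standard normal density, which is why the t-distribution matters mostly for small samples.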

Page 27: Anomaly detection

Twitter’s Anomaly Detection R pack. (cont.)
Extreme Studentized Deviate (ESD) test
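A standard statement of the ESD test (also known as Grubbs' test) for a single outlier in a sample of size n is sketched below. Here \bar{x} is the sample mean, s the sample standard deviation, and t_{p,\nu} the p-quantile of the t-distribution with \nu degrees of freedom; the notation is a conventional choice and may differ from the slide.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject ``no outlier'' at level } \alpha \text{ if }
G > \frac{n - 1}{\sqrt{n}}
    \sqrt{\frac{t_{p,\,n-2}^{2}}{n - 2 + t_{p,\,n-2}^{2}}},
\qquad
p = 1 - \frac{\alpha}{2n}
```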

Page 28: Anomaly detection

Twitter’s Anomaly Detection R pack. (cont.)
Generalized ESD
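The generalized ESD test extends the single-outlier test to at most r outliers. A standard formulation, in conventional notation that may differ from the slide: for i = 1, ..., r, compute R_i on the data remaining after the i - 1 most extreme observations have been removed, and compare it with the critical value \lambda_i.

```latex
R_i = \frac{\max_{j} \lvert x_j - \bar{x} \rvert}{s},
\qquad
\lambda_i = \frac{(n - i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n - i - 1 + t_{p,\,n-i-1}^{2}\right)(n - i + 1)}},
\qquad
p = 1 - \frac{\alpha}{2\,(n - i + 1)}
```

The number of outliers is the largest i for which R_i > \lambda_i. For i = 1 the critical value reduces to that of the single-outlier ESD test.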

Page 29: Anomaly detection

Twitter’s Anomaly Detection R pack. (cont.)
Seasonality (linear regression, LOESS, STL)
The generalized ESD works when you have a set of points from a normal distribution, but real data has some seasonality. This is where STL comes in. It decomposes the data into a seasonal part, a trend, and whatever is left over using local regression (LOESS), which fits a low-order polynomial to a subset of the data and stitches the fits together by weighting them. Since you can remove the trend and seasonal part with LOESS, you should be left with something that is more or less normally distributed. You can apply generalized ESD to what is left over to detect anomalies.
# STL: “Seasonal and Trend decomposition using Loess”

[Figures: Seasonality; Local regression (LOESS); Polynomial regression]
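The "remove the seasonal part, then test the remainder" idea can be sketched without any library. This is a deliberately simplified stand-in, not STL and not Twitter's implementation: the seasonal profile is taken as the per-phase median (no trend removal), and a fixed modified z-score cut-off based on the median and MAD replaces the generalized ESD critical values. The function names, the cut-off of 3.5, and the sample series are assumptions.

```python
import statistics

def seasonal_residuals(series, period):
    """Subtract the median of each seasonal phase from the series."""
    phase_median = [statistics.median(series[p::period]) for p in range(period)]
    return [x - phase_median[i % period] for i, x in enumerate(series)]

def robust_anomalies(series, period, cutoff=3.5):
    """Return indices whose residual has a large median/MAD-based score."""
    resid = seasonal_residuals(series, period)
    med = statistics.median(resid)
    mad = statistics.median([abs(r - med) for r in resid])
    if mad == 0:
        # Degenerate case: any residual that differs from the median stands out.
        return [i for i, r in enumerate(resid) if r != med]
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, r in enumerate(resid)
            if 0.6745 * abs(r - med) / mad > cutoff]

# Hypothetical series: a period-4 pattern with small noise and one injected spike.
base = [10, 20, 30, 20]
series = [base[i % 4] + (i % 3 - 1) * 0.5 for i in range(24)]
series[13] += 25
anomalies = robust_anomalies(series, period=4)
print(anomalies)  # [13]
```

Using the median and MAD rather than the mean and standard deviation keeps the spike itself from inflating the scale estimate, which is the same motivation behind the "Hybrid" part of S-H-ESD.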

Page 30: Anomaly detection

Twitter: Introducing practical and robust anomaly detection in a time series

Global/Local
- At Twitter, we observe distinct seasonal patterns in most of the time series.
- Global: global anomalies typically extend above or below expected seasonality and are therefore not subject to seasonality and underlying trend.
- Local: anomalies which occur inside seasonal patterns are masked and thus are much more difficult to detect in a robust fashion.

Positive/Negative
- Positive: for example, the surge in tweets during the Super Bowl (used for capacity planning for events)
- Negative: for example, a point-in-time decrease in QPS (Queries Per Second), which helps uncover potential hardware or data collection issues

Page 31: Anomaly detection

Subspace- and correlation-based outlier detection for high-dimensional data
- Reduce dimensionality using principal component analysis (PCA) or factor analysis

- Detect anomalies by computing the contrast of subspaces

Page 32: Anomaly detection

Subspace- and correlation-based outlier detection for high-dimensional data (cont.)
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

Page 33: Anomaly detection

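RNN (Replicator neural networks)
- A method that reproduces the input pattern while minimizing the reconstruction error
- Builds a model of normal data and uses it to extract anomalies

[Figure: a schematic view of a fully connected Replicator Neural Network]

OF_i = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - o_{ij})^2

- OF_i = anomaly factor score of the i-th record
- n = number of features
- x_{ij} = observed value of column j of the i-th record
- o_{ij} = value of column j of the i-th record as reproduced by the RNN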
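The anomaly factor of a record is the mean squared difference between its observed features and the values the network reproduces. A minimal sketch of that scoring step only: the function name and the numbers are hypothetical, and the reproduced values are hard-coded rather than coming from a trained network.

```python
def anomaly_factor(observed, reproduced):
    """Mean squared reconstruction error over the features of one record."""
    if len(observed) != len(reproduced):
        raise ValueError("observed and reproduced must have the same length")
    n = len(observed)
    return sum((x - o) ** 2 for x, o in zip(observed, reproduced)) / n

# Hypothetical record with three normalized features and its reproduced values.
well_reproduced = anomaly_factor([0.20, 0.90, 0.40], [0.21, 0.88, 0.40])
poorly_reproduced = anomaly_factor([0.20, 0.90, 0.40], [0.25, 0.50, 0.40])
print(well_reproduced < poorly_reproduced)  # True
```

Records the network reproduces poorly get a higher score and are ranked as more anomalous.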

Page 34: Anomaly detection

LOF (Local Outlier Factor)
- Density-based anomaly detection using KNN
- Provides a score, which makes the result easy to interpret, but there is some delay time
- Unsupervised anomaly detection

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.

Page 35: Anomaly detection

LOF (Local Outlier Factor) (cont.)
Formula:
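The standard LOF definitions are restated here in conventional notation, which may differ from the slide: d(A, B) is the distance between A and B, k-distance(B) is the distance from B to its k-th nearest neighbor, and N_k(A) is the set of k nearest neighbors of A.

```latex
\begin{aligned}
\text{reach-dist}_k(A, B) &= \max\{\, k\text{-distance}(B),\; d(A, B) \,\} \\
\mathrm{lrd}_k(A) &= \left( \frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}{\lvert N_k(A) \rvert} \right)^{-1} \\
\mathrm{LOF}_k(A) &= \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{\lvert N_k(A) \rvert \cdot \mathrm{lrd}_k(A)}
\end{aligned}
```

A score near 1 means the point is about as dense as its neighbors; a score well above 1 marks an outlier.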

Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor.
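The reachability distance, local reachability density, and LOF score can be sketched in plain Python. This is a small brute-force illustration, not ELKI's or Rlof's implementation: it assumes distinct points, takes exactly k neighbors (ties at the k-distance are broken by index), and needs Python 3.8+ for `math.dist`. The function names and sample points are hypothetical.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of point i."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def lof_scores(points, k):
    """Compute the Local Outlier Factor of every point (brute force)."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(most_anomalous)  # 4
```

The four square corners score 1.0 (same density as their neighbors), while the distant point scores far above 1.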

Page 36: Anomaly detection

LOF(Local Outlier Factor)(cont.)

LOF scores as visualized by ELKI. While the upper right cluster has a comparable density to the outliers close to the bottom left cluster, they are detected correctly.

Page 37: Anomaly detection

LOF(Local Outlier Factor)(cont.)

LOF scores of cpu util. vs. Time by Rlof

Page 38: Anomaly detection

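LRSTSD (Log regression seasonality based approach of time series decomposition)
Anomaly score formula:

[Figures: anomaly score; 1-day network traffic (Tx); 7-day network traffic (Tx)]

Terms of the formula:
- the i-th error
- the i-th observed value
- the i-th predicted upper bound
- the i-th predicted lower bound
- the overall value (parameter)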

Page 39: Anomaly detection

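Conclusion
- Anomaly detection can remove noise when building a predictive model, so better prediction accuracy can be expected.
- It detects false readings and data collection failures, which makes appropriate responses such as resampling and correction possible.
- Analyzing the relationship between observed anomalies and problems lets it serve as an early detection technique, for example for failure prediction.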

Page 40: Anomaly detection

References
• https://en.wikipedia.org/wiki/Anomaly_detection
• http://datascience.stackexchange.com/questions/2313/machine-learning-where-is-the-difference-between-one-class-binary-class-and-m
• https://en.wikipedia.org/wiki/Outlier#Detection
• https://www.semanticscholar.org/paper/Outlier-Detection-Using-Replicator-Neural-Networks-Hawkins-He/87a09c777dcecab4883e328669ef2af1ba8dd7be
• http://neuro.bstu.by/ai/To-dom/My_research/Papers-0/For-research/D-mining/Anomaly-D/KDD-cup-99/NN/dawak02.pdf
• http://slideplayer.com/slide/4194183/
• http://link.springer.com/chapter/10.1007%2F978-981-10-0281-6_118#page-1
• https://cran.r-project.org/web/packages/Rlof/index.html
• https://warrenmar.wordpress.com/tag/seasonal-hybrid-esd/
• https://ko.wikipedia.org/wiki/%EC%8A%A4%ED%8A%9C%EB%8D%98%ED%8A%B8_t_%EB%B6%84%ED%8F%AC
• https://en.wikipedia.org/wiki/Soft_computing
• https://www.google.com/trends/explore#q=anomaly%2C%20%2Fm%2F02vnd10%2C%20%2Fm%2F0bs2j8q&cmpt=q&tz=Etc%2FGMT-9
• http://www.slideserve.com/sidonie/data-mining-for-anomaly-detection
• http://www.physics.csbsju.edu/stats/box2.html
• http://study.com/academy/lesson/maximums-minimums-outliers-in-a-data-set-lesson-quiz.html
• http://www.sfu.ca/~jackd/Stat203/Wk02_1_Full.pdf
• http://slideplayer.com/slide/6321088/
• http://gim.unmc.edu/dxtests/roc3.htm
• http://www.cs.ru.nl/~tomh/onderwijs/dm/dm_files/roc_auc.pdf
• http://togaware.com/papers/dawak02.pdf
• https://en.wikipedia.org/wiki/Grubbs%27_test_for_outliers
• https://github.com/twitter/AnomalyDetection
• https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series