Anomaly Detection

Advanced Intelligence Technology Research Society
김철 (Kim Chul) ([email protected])

2016-07-09

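What is anomaly detection?
- An anomaly is a sample that deviates from the mainstream of the data.
- In data mining, anomaly detection means identifying items, events, or observations that do not conform to an expected pattern or to the normal range.

[Figure label: outlier]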

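What is anomaly detection? (cont.)
- Min/Max ≠ Outlier: the minimum and maximum of a data set are not necessarily outliers.
- 1.5 x IQR rule
- IQR (Interquartile Range) = Q3 - Q1

[Figure: box plot with Max and Min labeled]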
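The 1.5 x IQR rule can be sketched in a few lines of base R. This is a minimal illustration, assuming the usual fences Q1 - 1.5 x IQR and Q3 + 1.5 x IQR; the function name `iqr_outliers` and the sample vector are hypothetical and not part of the original slides.

```r
# Flag values outside the 1.5 x IQR fences (illustrative helper, not from the slides).
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]            # IQR = Q3 - Q1
  lower <- q[1] - k * iqr       # lower fence
  upper <- q[2] + k * iqr       # upper fence
  x < lower | x > upper         # TRUE where the value is an outlier
}

# Made-up sample: the maximum (50) is an outlier, the minimum (10) is not.
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 50)
flags <- iqr_outliers(x)
x[flags]
```

Note that the minimum of the sample is inside the fences, which is the point of "Min/Max ≠ Outlier".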

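What is anomaly detection? (cont.)
- Anomalous values are typically interpreted as a symptom of a problem.
- They are rare phenomena that do not follow the usual statistical definition.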

What is anomaly detection? (cont.)
- A clustering algorithm can detect the micro-clusters formed by anomalous patterns.

History
- Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986.
- Early approaches: normal thresholds, preprocessing of statistics, soft computing, and inductive learning.

https://en.wikipedia.org/wiki/Intrusion_detection_systems

https://en.wikipedia.org/wiki/Dorothy_E._Denning

http://destiny738.tistory.com/462

http://www.aistudy.co.kr/cognitive/machine_jang.htm#_bookmark_16838b8

History (cont.)

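Applications
- Cyber intrusion detection, credit card fraud, fault detection, system health monitoring, IoT, etc.
- Detecting ecosystem disturbances
- Frequently used to remove anomalous values from data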

Three categories
1. Unsupervised anomaly detection
- Detects anomalies in unlabeled data
- Anomaly detection with the K-means clustering algorithm
2. Supervised anomaly detection
- Both Normal and Abnormal labels exist
- Uses a classification model (SVM, Random forests, Logistic, Robust, KNN, etc.)

Three categories (cont.)
3. Semi-supervised anomaly detection
- Only Normal labels exist; anomalies are extracted by comparing observations against the likelihood produced by a model of normal behavior.
- NKIA’s LRSTSD based Anomaly Detection
- Twitter’s Seasonal Hybrid ESD (S-H-ESD) based Anomaly Detection

[Figures: NKIA’s Anomaly Detection; Twitter’s Anomaly Detection]
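As a rough illustration of the unsupervised, K-means-based category, the sketch below scores each point by its distance to the centroid of its assigned cluster and flags the largest scores. The synthetic data, the choice of two clusters, and the 99th-percentile threshold are all assumptions made for the example; the slides do not specify how K-means is used.

```r
set.seed(42)

# Synthetic data (assumption): two normal clusters plus one odd point.
normal  <- matrix(rnorm(200, mean = 0, sd = 1), ncol = 2)
shifted <- matrix(rnorm(200, mean = 6, sd = 1), ncol = 2)
odd     <- matrix(c(3, -8), ncol = 2)
X <- rbind(normal, shifted, odd)          # the odd point is row 201

# Cluster without labels, then score by distance to the assigned centroid.
fit <- kmeans(X, centers = 2, nstart = 10)
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
score <- sqrt(rowSums((X - assigned_centers)^2))

# Threshold choice is arbitrary here: flag the top 1% of scores.
threshold <- quantile(score, 0.99)
flagged <- which(score > threshold)
flagged
```

A distance-to-centroid score is only one of several ways to turn a clustering into an anomaly score; cluster size or density could be used instead.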

Input data
- Univariate
- Multivariate

Input data (cont.)
Data structure
- Binary
- Categorical
- Continuous
- Hybrid

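Types of anomalies
Point Anomalies
- Values that fall outside the bulk of the data set

Types of anomalies (cont.)
Contextual Anomalies
- Values that are out of place in their context
- Requires a notion of context
- Refers to conditional anomalies (Rules)

Types of anomalies (cont.)
Collective Anomalies
- A collection of related data instances that is anomalous with respect to the whole data set, even when each instance on its own is not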

Output of Anomaly Detection

Label
- Label of normal or anomaly
- In a classification approach: true|false or a class

Score
- Rank
- Range: 0 to 1
- Requires a threshold parameter

Evaluating anomaly detection
F-Measure
- Evaluates supervised learning and classification problems
- Formula:
  Recall (R) = TP / (TP + FN)
  Precision (P) = TP / (TP + FP)
  F-measure = 2*R*P / (R + P)

The Area Under an ROC Curve
- AUC (Area Under the Curve)
- Detection Rate = TP / (TP + FN), False Alarm Rate = FP / (FP + TN)
- Range: 0 to 1
- Equation legend: m = # of TP, n = # of TN

Two-way cross table (Crosstable)

                        Actual class
                        Normal    Anomaly
  Predicted   Normal    TP        FP
  class       Anomaly   FN        TN

Grading table

  Score        Label
  .90 ~ 1      Excellent (A)
  .80 ~ .90    Good (B)
  .70 ~ .80    Fair (C)
  .60 ~ .70    Poor (D)
  .50 ~ .60    Fail (F)

[Figure: ROC (Receiver Operating Characteristic) curves]
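The evaluation measures above can be computed directly from labels and scores. The sketch below is an illustration under stated assumptions: it treats "anomaly" as the positive class (the `positive` argument is a hypothetical parameter), and it computes AUC with the rank-based (Mann-Whitney) formula, where `m` is the number of positive samples and `n` the number of negative samples. The slides' own AUC equation is not available, so this may differ from it. All data are made up.

```r
# Recall, precision and F-measure for a chosen positive class.
f_measure <- function(actual, predicted, positive = "anomaly") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  c(recall = recall, precision = precision,
    f = 2 * recall * precision / (recall + precision))
}

# Rank-based AUC: probability that a random positive outranks a random negative.
auc_rank <- function(score, is_positive) {
  m <- sum(is_positive)      # number of positives (assumed meaning of m)
  n <- sum(!is_positive)     # number of negatives (assumed meaning of n)
  r <- rank(score)           # ties receive average ranks
  (sum(r[is_positive]) - m * (m + 1) / 2) / (m * n)
}

# Made-up labels, predictions and anomaly scores.
actual    <- c("anomaly", "anomaly", "normal", "normal", "normal", "anomaly")
predicted <- c("anomaly", "normal", "normal", "anomaly", "normal", "anomaly")
score     <- c(0.9, 0.4, 0.2, 0.6, 0.1, 0.8)

metrics <- f_measure(actual, predicted)
auc <- auc_rank(score, actual == "anomaly")
metrics
auc
```

The resulting AUC can then be read against the grading table (for example, a value between .80 and .90 would be "Good").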

Taxonomy*

Well-known anomaly detection techniques

Twitter’s Anomaly Detection R package

Twitter open-sourced their R package for anomaly detection. They call their algorithm Seasonal Hybrid ESD (S-H-ESD), which is built on Generalized ESD. Sometimes anomalies can mess up your modeling.

https://github.com/twitter/AnomalyDetection

Twitter’s Anomaly Detection R package (cont.)

    # Install the package from GitHub and load it
    install.packages("devtools")
    devtools::install_github("twitter/AnomalyDetection")
    library(AnomalyDetection)
    install.packages("gtable")
    install.packages("scales")

    # Detect anomalies in the bundled time series (timestamp + observation)
    data(raw_data)
    res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
    res$plot

    # Detect anomalies in a series without timestamps; the period is given explicitly
    v <- read.csv("D:/r/tsd_paper/cpu_5m_02.csv")
    res2 = AnomalyDetectionVec(v, max_anoms=0.02, period=72, direction='both', plot=TRUE)
    res2$plot


Usage

    AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05,
      only_last = NULL, threshold = "None", e_value = FALSE, longterm = FALSE,
      piecewise_median_period_weeks = 2, plot = FALSE, y_log = FALSE,
      xlabel = "", ylabel = "count", title = NULL, verbose = FALSE)

Arguments
- x : Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.
- max_anoms : Maximum number of anomalies that S-H-ESD will detect as a percentage of the data.
- direction : Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
- alpha : The level of statistical significance with which to accept or reject anomalies.
- only_last : Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
- threshold : Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
- e_value : Add an additional column to the anoms output containing the expected value.
- longterm : Increase anom detection efficacy for time series that are greater than a month. See Details below.
- piecewise_median_period_weeks : The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014). Defaults to 2.


Arguments (cont.)
- plot : A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
- y_log : Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the rest of the data.
- xlabel : X-axis label to be added to the output plot.
- ylabel : Y-axis label to be added to the output plot.
- title : Title for the output plot.
- verbose : Enable debug messages.

Twitter’s Anomaly Detection R package (cont.)
To understand how Twitter’s algorithm works, you need to know:
- Student t-distribution
- Extreme Studentized Deviate (ESD) test
- Generalized ESD
- Linear regression
- LOESS
- STL (Seasonal Trend LOESS)

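Twitter’s Anomaly Detection R package (cont.)
Student t-distribution
- A distribution mainly used when estimating the mean of a normally distributed population

[Figure: PDF of the t-distribution; x-axis: t]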

https://ko.wikipedia.org/wiki/%EC%8A%A4%ED%8A%9C%EB%8D%98%ED%8A%B8_t_%EB%B6%84%ED%8F%AC

Twitter’s Anomaly Detection R package (cont.)
Extreme Studentized Deviate (ESD) test

https://en.wikipedia.org/wiki/Grubbs'_test_for_outliers
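For reference, the ESD test (Grubbs' test) for a single outlier can be sketched as follows. The statistic is G = max|x_i - mean(x)| / sd(x), and the two-sided critical value uses a Student t quantile with N - 2 degrees of freedom. The function name `grubbs_test` and the sample data are hypothetical; the test assumes approximately normal data.

```r
# Two-sided Grubbs (ESD) test for a single outlier; illustrative helper.
grubbs_test <- function(x, alpha = 0.05) {
  n <- length(x)
  dev <- abs(x - mean(x))
  g <- max(dev) / sd(x)                        # test statistic G
  t2 <- qt(alpha / (2 * n), df = n - 2)^2      # squared t critical value
  g_crit <- ((n - 1) / sqrt(n)) * sqrt(t2 / (n - 2 + t2))
  list(statistic = g, critical = g_crit,
       suspect = which.max(dev),               # index of the most extreme value
       reject = g > g_crit)                    # TRUE: the suspect is an outlier
}

# Made-up measurements with one extreme value at position 9.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 14.5)
res <- grubbs_test(x)
res$suspect
res$reject
```

Because the test handles only one outlier at a time, several outliers can mask each other, which motivates the generalized version.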

Twitter’s Anomaly Detection R package (cont.)
Generalized ESD
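A minimal sketch of the generalized ESD procedure (Rosner's test) is given below, assuming approximately normal data and a user-supplied upper bound on the number of outliers. It repeatedly removes the most extreme point, compares each statistic R_i with its critical value lambda_i, and reports the removed points up to the largest i for which R_i > lambda_i. The function name `generalized_esd` and the data are hypothetical, and this is not the code used inside Twitter's package.

```r
# Generalized ESD: detect up to max_outliers outliers; returns their indices.
generalized_esd <- function(x, max_outliers, alpha = 0.05) {
  n <- length(x)
  idx <- seq_along(x)          # indices still in the sample
  removed <- integer(0)        # indices removed so far, in order
  n_out <- 0                   # largest i with R_i > lambda_i
  for (i in seq_len(max_outliers)) {
    dev <- abs(x[idx] - mean(x[idx]))
    r_i <- max(dev) / sd(x[idx])
    worst <- idx[which.max(dev)]
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda_i <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    removed <- c(removed, worst)
    idx <- setdiff(idx, worst)
    if (r_i > lambda_i) n_out <- i
  }
  sort(removed[seq_len(n_out)])
}

# Made-up data with two outliers (positions 13 and 18).
x <- c(10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7,
       10.0, 10.1, 25, 9.9, 10.2, 10.0, 9.8, -5, 10.1, 10.0)
outliers <- generalized_esd(x, max_outliers = 4)
outliers
```

Unlike a single application of the ESD test, the upper bound only needs to be at least the true number of outliers.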

Twitter’s Anomaly Detection R package (cont.)
Seasonality (linear regression, LOESS, STL)

The generalized ESD works when you have a set of points from a normal distribution, but real data has some seasonality. This is where STL comes in. It decomposes the data into a seasonal part, a trend, and whatever is left over, using local regression (LOESS), which fits a low-order polynomial to a subset of the data and stitches the fits together by weighting them. Since you can remove the trend and seasonal part with LOESS, you should be left with something that is more or less normally distributed. You can apply generalized ESD to what is left over to detect anomalies.

STL: “Seasonal and Trend decomposition using Loess”

[Figures: Seasonality; Local regression (LOESS); Polynomial regression]
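The decompose-then-test idea can be sketched with base R's `stl()`. This is a simplified illustration, not Twitter's S-H-ESD implementation: the series is synthetic, and the remainder is screened with a plain median/MAD rule (threshold 5, chosen arbitrarily) instead of generalized ESD.

```r
set.seed(7)

# Synthetic hourly series (assumption): trend + daily seasonality + noise.
period <- 24
n <- period * 14
tt <- seq_len(n)
y <- 10 + 0.01 * tt + 3 * sin(2 * pi * tt / period) + rnorm(n, sd = 0.5)
y[200] <- y[200] + 10                     # inject one anomaly

# STL: split the series into seasonal, trend and remainder components.
series <- ts(y, frequency = period)
fit <- stl(series, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Screen the remainder with a robust z-score (stand-in for generalized ESD).
z <- abs(remainder - median(remainder)) / mad(remainder)
anomalies <- which(z > 5)
anomalies
```

Setting `robust = TRUE` keeps the injected spike from being absorbed into the seasonal or trend estimates, so it shows up in the remainder.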

Twitter: Introducing practical and robust anomaly detection in a time series
Global/Local
- At Twitter, we observe distinct seasonal patterns in most of the time series.
- Global: global anomalies typically extend above or below expected seasonality and are therefore not subject to seasonality and underlying trend.
- Local: anomalies that occur inside seasonal patterns are masked and are thus much more difficult to detect in a robust fashion.
Positive/Negative
- Positive: e.g., a surge of Tweets during the Super Bowl (used for capacity planning around events)
- Negative: e.g., a point-in-time decrease in queries per second (QPS); helps uncover potential hardware or data collection issues

Subspace- and correlation-based outlier detection for high-dimensional data
- Reduce dimensionality using principal component analysis (PCA) and factor analysis (dimension reduction)
- Detect anomalies by computing the contrast of subspaces

Subspace- and correlation-based outlier detection for high-dimensional data (cont.)
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

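RNN (Replicator neural networks)
- Reproduces the input pattern at the output while minimizing the error
- Builds a model of normal behavior and uses it to extract anomalous values

[Figure: A schematic view of a fully connected Replicator Neural Network]

$AF_i = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - o_{ij})^2$

- $AF_i$ = Anomaly Factor score of the i-th element
- $n$ = # of features
- $x_{ij}$ = observed value of column j of the i-th element
- $o_{ij}$ = normalized value of column j of the i-th element as reproduced by the RNN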
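Given the observed values and the values reproduced by a replicator network, the per-element score is the mean squared reconstruction error over the features. The sketch below only computes the score; it does not train a network. The function name `anomaly_factor` and both matrices are made up for illustration, with `O` standing in for the output of a trained network.

```r
# Mean squared reconstruction error per row (one score per element).
anomaly_factor <- function(X, O) {
  stopifnot(all(dim(X) == dim(O)))
  rowMeans((X - O)^2)
}

# Hypothetical normalized inputs (X) and network reconstructions (O).
X <- rbind(c(0.1, 0.2, 0.3),
           c(0.9, 0.1, 0.8))
O <- rbind(c(0.1, 0.2, 0.3),    # reproduced exactly: score 0
           c(0.3, 0.1, 0.2))    # reproduced poorly: higher score
af <- anomaly_factor(X, O)
af
```

Elements that the network reproduces poorly receive high scores and are the candidates for anomalies.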

LOF (Local Outlier Factor)
- Density-based anomaly detection using k-nearest neighbors (KNN)
- Provides a score, which makes results easy to interpret, but computation takes some time
- Unsupervised anomaly detection

[Figure: Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.]

LOF (Local Outlier Factor) (cont.)
Formula:

[Figure: Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor.]
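The standard LOF definitions can be written out in base R as a sketch: reach-dist_k(A, B) = max(k-distance(B), d(A, B)), the local reachability density lrd_k(A) is the inverse of the mean reachability distance from A to its k nearest neighbors, and LOF_k(A) is the mean of lrd_k(B) / lrd_k(A) over those neighbors. The function name `lof_scores` and the data are hypothetical; ties are broken arbitrarily and duplicate points are assumed absent. For real use, a package such as Rlof is likely a better choice.

```r
# Local Outlier Factor with Euclidean distances; illustrative, O(n^2) memory.
lof_scores <- function(X, k) {
  D <- as.matrix(dist(X))                   # pairwise distances
  n <- nrow(D)
  nn <- matrix(0L, nrow = n, ncol = k)
  for (i in seq_len(n)) {
    o <- order(D[i, ])
    nn[i, ] <- o[o != i][seq_len(k)]        # k nearest neighbors, excluding i
  }
  k_dist <- D[cbind(seq_len(n), nn[, k])]   # distance to the k-th neighbor
  lrd <- numeric(n)
  for (i in seq_len(n)) {
    # reachability distance from i to each of its neighbors
    reach <- pmax(k_dist[nn[i, ]], D[i, nn[i, ]])
    lrd[i] <- 1 / mean(reach)               # local reachability density
  }
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# Made-up data: one dense cluster plus a distant point (row 31).
set.seed(3)
cluster <- matrix(rnorm(60), ncol = 2)
X <- rbind(cluster, c(8, 8))
scores <- lof_scores(X, k = 5)
which.max(scores)
```

Scores near 1 indicate a density similar to the neighbors; values well above 1 indicate a point in a sparser region than its neighbors.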

LOF(Local Outlier Factor)(cont.)

LOF scores as visualized by ELKI. While the upper right cluster has a comparable density to the outliers close to the bottom left cluster, they are detected correctly.

https://en.wikipedia.org/wiki/Environment_for_DeveLoping_KDD-Applications_Supported_by_Index-Structures

LOF(Local Outlier Factor)(cont.)

[Figure: LOF scores of CPU utilization vs. time, computed with Rlof]

https://cran.r-project.org/web/packages/Rlof/index.html

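LRSTSD (Log regression seasonality based approach of time series decomposition)
Anomaly score formula:

[Figures: Anomaly score; 1-day network traffic Tx; 7-day network traffic Tx]

Terms in the formula:
- the i-th error
- the i-th observed value
- the i-th predicted upper bound
- the i-th predicted lower bound
- the overall value (Parameter)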

http://link.springer.com/chapter/10.1007%2F978-981-10-0281-6_118

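Conclusion
- Anomaly detection can remove noise when building a predictive model, so an improvement in prediction accuracy can be expected.
- It detects false readings and collection failures in the data, which makes appropriate responses such as resampling and correction possible.
- It supports analysis of the association between observed anomalies and problems, so it can serve as an early-detection technique for problems, including failure prediction.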

