outlier detection cs selab. 양 근 석. outlier detection 이상치 – 속성의 값이 일반적...
TRANSCRIPT
Outlier Detection
CS SELab.양 근 석
Outlier Detection
• 이상치–속성의 값이 일반적 ( 혹은 예상 ) 값 보다 상당히
편차가 심한 큰 값
• 이상치 탐지 목적–대부분의 다른 객체들과 다른 객체들을 찾는 것
Outlier Detection
Outlier Detection
• 이상치의 응용–사기 탐지–침입 탐지–환경 파괴–공중 위생–의료
Outlier Detection
• 이상치 발생 원인–상이한 클래스에 속한 데이터–자연 변화–데이터 측정과 수집오류
Univariate Outlier Detection(Cont.)
• set.seed(3147)• x = rnorm(100)• summary(x)
• boxplot.stats(x)
Univariate Outlier Detection(Cont.)
• boxplot(x)
Univariate Outlier Detection(Cont.)
• y = rnorm(100)• df = data.frame(x, y)• rm(x, y)• head(df)
Univariate Outlier Detection(Cont.)
• attach(df)• a = which(x %in% boxplot.stats(x)$out)
• b = which(y %in% boxplot.stats(y)$out)• detach(df)
Univariate Outlier Detection(Cont.)
• outlier.list1 = intersect(a, b)
• plot(df)
Univariate Outlier Detection(Cont.)
• points(df[outlier.list1,], col=“red”, pch=“+”, cex=2)
Univariate Outlier Detection(Cont.)
• outlier.list2 = union(a, b)
• plot(df)
Univariate Outlier Detection
• points(df[outlier.list2,], col=“blue”, pch=“x”, cex=1.5)
Outlier Detection with LOF(Cont.)
• LOF(Local Outlier Factor) algorithm• 밀도 기반으로 이상점을 찾는 알고리즘
Outlier Detection with LOF(Cont.)
• 장점–지표로 하나의 값 만을 반환하기 때문에 해석 용이
• 단점–계산 시간이 너무 오래걸림 , 숫자만 가능
Outlier Detection with LOF(Cont.)
• install.packages(“DMwR”)• library(DMwR)• iris2 = iris[,1:4]• Outlier.scores = lofactor(iris2, k=5)
Outlier Detection with LOF(Cont.)
• plot(density(outlier.scores))
• plot(outlier.scores)
Outlier Detection with LOF(Cont.)
• outliers = order(outlier.scores, decreasing=T)[1:5]
• print(outliers)
• print(iris2[outliers,])
Outlier Detection with LOF(Cont.)
• n = nrow(iris2)• Labels = 1:n• Labels[-outliers] = “.”• Biplot(prcomp(iris2), cex=.8, xlabs=labels)
Outlier Detection with LOF
Outlier Detection by Clustering(Cont.)
• K-Means 알고리즘
Outlier Detection by Clustering(Cont.)
• iris2 = iris[,1:4]• kmeans.result = kmeans(iris2, centers=3)
Outlier Detection by Clustering(Cont.)
• kmeans.result
Outlier Detection by Clustering(Cont.)
• center = kmeans.result$centers[kmeans.result$cluster,]
• distances = sqrt(rowSums((iris2 – centers)^2))• outliers = order(distances, decreasing=T)[1:5]• print(outliers)
Outlier Detection by Clustering(Cont.)
• print(iris2[outliers,])
Outlier Detection by Clustering(Cont.)
• plot(iris2[,c(“Sepal.Length”, “Sepal.Width”)], pch=“o”, col = kmeans.result$cluster, cex=0.3)
Outlier Detection by Clustering(Cont.)
• points(kmeans.result$centers[,c(“Sepal.Length”, “Sepal.Width”)], col=1:3, pch=8, cex=1)
Outlier Detection by Clustering
• points(iris2[outliers, c(“Sepal.Length”, “Sepal.Width”)], pch=“+”, col=4)
Outlier Detection from Time Series Data(Cont.)
• AirPassengers
Outlier Detection from Time Series Data(Cont.)
• Plot(AirPassengers)
Outlier Detection from Time Series Data(Cont.)
• f = stl(AirPassengers, “periodic”, robust=TRUE)• outliers = which(f$weights < 1e – 8)
• plot(f)
Outlier Detection from Time Series Data
• points(time(sts)[outliers], sts[, “remainder”][outliers], pch=“x”, col=“red”)