outlier detection cs selab. 양 근 석. outlier detection 이상치 – 속성의 값이 일반적...

Outlier Detection

CS SELab.양 근 석

Outlier Detection

• 이상치–속성의 값이 일반적 ( 혹은 예상 ) 값 보다 상당히

편차가 심한 큰 값

• 이상치 탐지 목적–대부분의 다른 객체들과 다른 객체들을 찾는 것

Outlier Detection

Outlier Detection

• 이상치의 응용–사기 탐지–침입 탐지–환경 파괴–공중 위생–의료

Outlier Detection

• 이상치 발생 원인–상이한 클래스에 속한 데이터–자연 변화–데이터 측정과 수집오류

Univariate Outlier Detection(Cont.)

• set.seed(3147)• x = rnorm(100)• summary(x)

• boxplot.stats(x)


• boxplot(x)


• y = rnorm(100)• df = data.frame(x, y)• rm(x, y)• head(df)


• attach(df)• a = which(x %in% boxplot.stats(x)$out)

• b = which(y %in% boxplot.stats(y)$out)• detach(df)


• outlier.list1 = intersect(a, b)

• plot(df)


• points(df[outlier.list1,], col=“red”, pch=“+”, cex=2)


• outlier.list2 = union(a, b)

• plot(df)

Univariate Outlier Detection

• points(df[outlier.list2,], col=“blue”, pch=“x”, cex=1.5)

Outlier Detection with LOF(Cont.)

• LOF(Local Outlier Factor) algorithm• 밀도 기반으로 이상점을 찾는 알고리즘


• 장점–지표로 하나의 값 만을 반환하기 때문에 해석 용이

• 단점–계산 시간이 너무 오래걸림 , 숫자만 가능


• install.packages(“DMwR”)• library(DMwR)• iris2 = iris[,1:4]• Outlier.scores = lofactor(iris2, k=5)


• plot(density(outlier.scores))

• plot(outlier.scores)


• outliers = order(outlier.scores, decreasing=T)[1:5]

• print(outliers)

• print(iris2[outliers,])


• n = nrow(iris2)• Labels = 1:n• Labels[-outliers] = “.”• Biplot(prcomp(iris2), cex=.8, xlabs=labels)

Outlier Detection with LOF

Outlier Detection by Clustering(Cont.)

• K-Means 알고리즘


• iris2 = iris[,1:4]• kmeans.result = kmeans(iris2, centers=3)


• kmeans.result


• center = kmeans.result$centers[kmeans.result$cluster,]

• distances = sqrt(rowSums((iris2 – centers)^2))• outliers = order(distances, decreasing=T)[1:5]• print(outliers)


• print(iris2[outliers,])


• plot(iris2[,c(“Sepal.Length”, “Sepal.Width”)], pch=“o”, col = kmeans.result$cluster, cex=0.3)


• points(kmeans.result$centers[,c(“Sepal.Length”, “Sepal.Width”)], col=1:3, pch=8, cex=1)

Outlier Detection by Clustering

• points(iris2[outliers, c(“Sepal.Length”, “Sepal.Width”)], pch=“+”, col=4)

Outlier Detection from Time Series Data(Cont.)

• AirPassengers


• Plot(AirPassengers)


• f = stl(AirPassengers, “periodic”, robust=TRUE)• outliers = which(f$weights < 1e – 8)

• plot(f)

Outlier Detection from Time Series Data

• points(time(sts)[outliers], sts[, “remainder”][outliers], pch=“x”, col=“red”)

outlier detection cs selab. 양 근 석. outlier detection 이상치 – 속성의 값이 일반적...

Documents