4. Representing Data and Engineering Features


TRANSCRIPT

Page 1: 4.representing data and engineering features

Machine Learning with Python

Representing Data and Engineering Features

1

Page 2: 4.representing data and engineering features

Contacts: Haesun Park

Email : [email protected]

Meetup: https://www.meetup.com/Hongdae-Machine-Learning-Study/

Facebook : https://facebook.com/haesunrpark

Blog : https://tensorflow.blog

2

Page 3: 4.representing data and engineering features

Book: 파이썬 라이브러리를 활용한 머신러닝, by 박해선 (Haesun Park).

(The Korean translation of Introduction to Machine Learning with Python by Andreas Müller & Sarah Guido.)

Chapters 1 and 2 of the translation can be read for free on the blog.

A preview of the original (English) book is available online.

Github: https://github.com/rickiepark/introduction_to_ml_with_python/

3

Page 4: 4.representing data and engineering features

Data Representation and Feature Engineering

4

Page 5: 4.representing data and engineering features

Continuous vs. Categorical

Continuous output → regression; categorical output → classification.

Continuous feature: real-valued data, e.g., 0.13493, 100.0.

Categorical (discrete) feature: non-continuous numbers or attributes, e.g., pixel intensity, brand, shopping category.

There are no in-between values and no inherent ordering between categories, e.g., books vs. clothes.

Feature engineering: finding the data representation best suited to a given application.

A good data representation matters far more than parameter search, e.g., the scale difference between inches and cm.

5

Page 6: 4.representing data and engineering features

Categorical Variables

1994 census database: income data for adults in the US (Adult Data Set).

Binary output: a classification problem, <=50K vs. >50K.

The dataset contains both continuous features and categorical features.

Linear binary classifier: $\hat{y} = w_0 \times x_0 + w_1 \times x_1 + \cdots + w_p \times x_p + b > 0$

Page 7: 4.representing data and engineering features

One-hot encoding (dummy variables): replaces a categorical variable with several new features that take the value 0 or 1 (binary features).

7

* In statistics, dummy encoding instead represents the last category by setting all the new variables to 0 (to avoid column rank deficiency).

Page 8: 4.representing data and engineering features

pandas.read_csv

8
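The loading code was shown only as a screenshot on this slide; below is a rough sketch of reading the UCI adult data with pandas.read_csv. The local file name "adult.data" and the column subset are assumptions.

import pandas as pd

# adult.data (UCI repository) has no header row, so column names are given explicitly
data = pd.read_csv(
    "adult.data", header=None, index_col=False,
    names=["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "gender",
           "capital-gain", "capital-loss", "hours-per-week", "native-country",
           "income"])
# keep a few columns for illustration: continuous (age, hours-per-week),
# categorical (workclass, education, gender, occupation) and the target (income)
data = data[["age", "workclass", "education", "gender",
             "hours-per-week", "occupation", "income"]]
print(data.head())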

Page 9: 4.representing data and engineering features

pandas.get_dummies

9

Categorical data is converted automatically (see the sketch below).
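A minimal sketch, continuing from the read_csv sketch above: get_dummies one-hot encodes the object/categorical columns and leaves numeric columns such as age and hours-per-week untouched.

data_dummies = pd.get_dummies(data)   # `data` from the read_csv sketch above
print("Original features:", list(data.columns))
print("Features after get_dummies:", list(data_dummies.columns))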

Page 10: 4.representing data and engineering features

Applying logistic regression

10

Features up to occupation_ Transport-moving are included (the income_ target dummies are excluded).

Apply get_dummies before splitting into training and test sets, so both sets get exactly the same dummy columns; a sketch follows below.
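A sketch of this step, continuing from the get_dummies sketch. The exact dummy column names (e.g., "occupation_ Transport-moving", "income_ >50K") depend on the whitespace in the raw file, so treat them as assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# all columns from age up to the last occupation dummy; the income_ dummies are the target
features = data_dummies.loc[:, "age":"occupation_ Transport-moving"]
X = features.values
y = data_dummies["income_ >50K"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(max_iter=1000)   # max_iter raised so the solver converges
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))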

Page 11: 4.representing data and engineering features

Categorical variables encoded as numbers

A feature is not necessarily continuous just because it is stored as a number.

e.g., if workclass were answered as a multiple-choice question, it could be stored as 1, 2, 3, ...

Whether a feature is continuous or categorical can often only be decided by knowing what the feature means.

e.g., five-star ratings, movie age ratings (categorical, but with an ordering).

pandas' get_dummies or scikit-learn's OneHotEncoder can be used.

OneHotEncoder could only be applied to categorical variables encoded as numbers (in the scikit-learn versions current at the time; newer versions also handle strings).

11

Page 12: 4.representing data and engineering features

Example: converting a numeric categorical variable

12
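A small sketch of the idea on this slide: get_dummies ignores integer columns unless you cast them to strings or list them explicitly.

import pandas as pd

demo_df = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
                        "Categorical Feature": ["socks", "fox", "socks", "box"]})

# by default only the string column is encoded
print(list(pd.get_dummies(demo_df).columns))

# cast to str (or pass columns=...) to also encode the integer column
demo_df["Integer Feature"] = demo_df["Integer Feature"].astype(str)
print(list(pd.get_dummies(demo_df, columns=["Integer Feature",
                                            "Categorical Feature"]).columns))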

Page 13: 4.representing data and engineering features

Binning

13

Page 14: 4.representing data and engineering features

wave dataset + linear regression and decision tree regression

14
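A sketch of the comparison shown on this slide, assuming the book's mglearn helper package is installed.

import numpy as np
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = mglearn.datasets.make_wave(n_samples=120)          # 1-D regression toy dataset
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

tree = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
linreg = LinearRegression().fit(X, y)
# tree.predict(line) gives a step-like curve, linreg.predict(line) a straight line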

Page 15: 4.representing data and engineering features

Binning

Splits a single continuous feature into intervals, turning it into several categorical features (see the sketch below).

15

Bin [10]: 2.4 <= x < 3; bin [11]: 3 <= x (everything beyond the last edge).

The feature goes from continuous to categorical.
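A sketch of the binning step with numpy (X from the wave sketch above); 10 bins are assumed, matching the [10]/[11] annotation on the slide.

import numpy as np

bins = np.linspace(-3, 3, 11)              # 11 edges -> 10 equal-width bins on [-3, 3]
which_bin = np.digitize(X, bins=bins)      # bin index per sample; values >= 3 fall into bin 11
print("bin edges:", bins)
print("first samples:", X[:3].ravel(), "->", which_bin[:3].ravel())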

Page 16: 4.representing data and engineering features

OneHotEncoder

16

(the sparse output is converted to a plain numpy array)

With the binned features the linear model becomes more flexible (one constant per bin), while the decision tree actually gets worse.
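A sketch of this slide, continuing from the digitize sketch. The book obtains a plain numpy array by passing sparse=False to OneHotEncoder; .toarray() is used here for the same effect.

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

encoder = OneHotEncoder()                                   # which_bin from the digitize sketch
X_binned = encoder.fit_transform(which_bin).toarray()       # dense array, one column per bin
line_binned = encoder.transform(np.digitize(line, bins=bins)).toarray()

# both models now predict one constant value per bin
lin_pred = LinearRegression().fit(X_binned, y).predict(line_binned)
tree_pred = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y).predict(line_binned)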

Page 17: 4.representing data and engineering features

Interactions and Polynomials

17

Page 18: 4.representing data and engineering features

Binning + the original feature

18

Page 19: 4.representing data and engineering features

Binning * the original feature (a sketch covering this and the previous slide follows below)

19
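A sketch for the two slides above, continuing from the binning sketches: the original feature is added back either alongside the bin indicators ("+") or multiplied into each of them ("*").

import numpy as np

X_combined = np.hstack([X, X_binned])             # 1 original + 10 bin columns -> (120, 11)
X_product = np.hstack([X_binned, X * X_binned])   # bin indicator and per-bin slope -> (120, 20)
print(X_combined.shape, X_product.shape)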

Page 20: 4.representing data and engineering features

PolynomialFeatures: like the binning examples, adds interaction or polynomial (squared, cubed, ...) terms to the original features.

20
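A sketch of PolynomialFeatures on the 1-D wave data; fitting LinearRegression on the expanded features is the polynomial regression of the next slide. X, y, and line come from the earlier wave sketch.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# polynomials up to x**10; include_bias=False drops the constant-1 column
poly = PolynomialFeatures(degree=10, include_bias=False)
X_poly = poly.fit_transform(X)                    # shape (120, 10)
print(poly.get_feature_names_out())               # ['x0', 'x0^2', ..., 'x0^10']

poly_reg = LinearRegression().fit(X_poly, y)      # polynomial regression
curve = poly_reg.predict(poly.transform(line))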

Page 21: 4.representing data and engineering features

Polynomial regression

21

There are regions where the polynomial fit is overly sensitive.

Page 22: 4.representing data and engineering features

Comparison with SVR

22

Page 23: 4.representing data and engineering features

boston dataset

23

Number of degree-2 terms from the 13 boston features (products with repetition): $\binom{14}{2} = \frac{14!}{2!\,12!} = 91$

Page 24: 4.representing data and engineering features

boston + Ridge vs RandomForestRegressor

24

Page 25: 4.representing data and engineering features

Univariate Nonlinear Transformations

25

Page 26: 4.representing data and engineering features

Univariate transformations

Transform feature values by applying mathematical functions such as log, exp, or sin.

Linear models and neural networks in particular are sensitive to the scale of the data.

e.g., the computer memory price example from Chapter 2.

26

(plotted on a log scale)

Page 27: 4.representing data and engineering features

Count data

27

Page 28: 4.representing data and engineering features

Count data + Ridge

28
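A sketch of the count-data experiment: synthetic Poisson counts, then Ridge on the raw counts vs. Ridge after a log(X + 1) transform. The data-generating details follow the book's example but should be read as an assumption.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)
X_counts = rnd.poisson(10 * np.exp(X_org))        # skewed integer count features
y_counts = np.dot(X_org, w)

X_tr, X_te, y_tr, y_te = train_test_split(X_counts, y_counts, random_state=0)
print("Ridge on raw counts: {:.3f}".format(Ridge().fit(X_tr, y_tr).score(X_te, y_te)))

# log(X + 1): the +1 avoids log(0); the distribution becomes much more symmetric
X_tr_log, X_te_log = np.log(X_tr + 1), np.log(X_te + 1)
print("Ridge on log(X + 1): {:.3f}".format(Ridge().fit(X_tr_log, y_tr).score(X_te_log, y_te)))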

Page 29: 4.representing data and engineering features

Automatic Feature Selection

29

Page 30: 4.representing data and engineering features

Univariate statistics

Ways to judge whether a feature is useful: univariate statistics, model-based selection, iterative selection.

Analysis of variance (ANOVA) compares the per-class means:

$F = \dfrac{SS_{between}/(k-1)}{(SS_{tot} - SS_{between})/(n-k)}$

$SS_{between} = \sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2, \qquad SS_{tot} = \sum_{j=1}^{n} (x_j - \bar{x})^2$

In SelectKBest or SelectPercentile, specify f_classif for classification and f_regression for regression.

30

Page 31: 4.representing data and engineering features

cancer + noise

31

Page 32: 4.representing data and engineering features

SelectPercentile + LogisticRegression

32

Performance improves and the model becomes easier to interpret.
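A sketch of the cancer + noise experiment from these two slides: 50 noise features are appended to the 30 real ones, and SelectPercentile (ANOVA F-test by default) keeps the best half before fitting LogisticRegression.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))            # 50 pure-noise features
X_w_noise = np.hstack([cancer.data, noise])                # 30 real + 50 noise = 80 features

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=.5)

select = SelectPercentile(percentile=50).fit(X_train, y_train)   # f_classif by default
X_train_sel = select.transform(X_train)
X_test_sel = select.transform(X_test)

lr = LogisticRegression(max_iter=5000)
print("All features:      {:.3f}".format(lr.fit(X_train, y_train).score(X_test, y_test)))
print("Selected features: {:.3f}".format(lr.fit(X_train_sel, y_train).score(X_test_sel, y_test)))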

Page 33: 4.representing data and engineering features

Model-based selection

Uses the feature_importances_ values (decision trees) or coef_ values (linear models) of a supervised model.

Default threshold: 1e-5 when the model has an L1 penalty (Lasso), otherwise the mean of the importances.

33

The threshold can also be given as a string such as "mean" or "1.3*median".

Page 34: 4.representing data and engineering features

SelectFromModel + LogisticRegression

34
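A sketch of model-based selection on the same cancer + noise data (X_train, y_train from the sketch above), using a random forest's importances with the "median" threshold.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")                          # keep features above the median importance
select.fit(X_train, y_train)
X_train_model = select.transform(X_train)
print(X_train.shape, "->", X_train_model.shape)  # 80 features -> 40 features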

Page 35: 4.representing data and engineering features

Iterative feature selection

Models are built either by adding features one at a time or by starting from all features and removing them one at a time.

Because many models have to be trained, this is computationally expensive.

Recursive feature elimination (RFE) starts from a model with all features and repeatedly removes the least important feature until the specified number of features remains.

It uses the feature_importances_ values (decision trees) or coef_ values (linear models).

35

Page 36: 4.representing data and engineering features

RFE

36
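A sketch of RFE on the same cancer + noise data: a random forest is refit repeatedly, dropping the least important feature each round until 40 remain. RFE.score then evaluates with the internal estimator, which is the point of the next slide.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=40)
select.fit(X_train, y_train)                 # cancer + noise data from above
mask = select.get_support()                  # boolean mask of the kept features
print("Features kept:", mask.sum())
print("Test score: {:.3f}".format(select.score(X_test, y_test)))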

Page 37: 4.representing data and engineering features

RFE + LogisticRegression: the model used inside RFE can also be used for the final evaluation.

37

Logistic regression and the random forest reach similar performance here.

Page 38: 4.representing data and engineering features

Utilizing Expert Knowledge

38

Page 39: 4.representing data and engineering features

Feature engineering and domain knowledge

To predict airfares well, you need not only the date, airline, origin, and destination, but also lunar-calendar holidays and school vacation periods.

Predicting Citi Bike rentals in New York.

39

Figure: number of rentals per 3-hour interval; the x-axis is Unix epoch time, with the training data followed by the test data.

Page 40: 4.representing data and engineering features

Unix Time + RandomForestRegressor

40

A characteristic of tree-based models: they cannot extrapolate beyond the training range.

[248, 1]
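A sketch of this experiment; mglearn.datasets.load_citibike() (a pandas Series of rental counts with a DatetimeIndex) is assumed to be available, as in the book.

import numpy as np
import mglearn
from sklearn.ensemble import RandomForestRegressor

citibike = mglearn.datasets.load_citibike()
y = citibike.values
# POSIX time (seconds) as the single feature -> shape (248, 1)
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9

n_train = 184                                    # first 184 intervals for training, the rest is "the future"
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# trees cannot extrapolate beyond the training range, so the test score is very poor
print("Test R^2: {:.2f}".format(rf.score(X_test, y_test)))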

Page 41: 4.representing data and engineering features

Hour + RandomForestRegressor

41

Page 42: 4.representing data and engineering features

Hour,week + RandomForestRegressor

42

[248, 2]
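A sketch covering this slide and the previous one (hour only, then hour + weekday); citibike, n_train, y_train, y_test, and the imports come from the POSIX-time sketch above.

X_hour = citibike.index.hour.values.reshape(-1, 1)                        # (248, 1)
X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                         citibike.index.hour.values.reshape(-1, 1)])      # (248, 2)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_hour_week[:n_train], y_train)
print("Test R^2: {:.2f}".format(rf.score(X_hour_week[n_train:], y_test)))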

Page 43: 4.representing data and engineering features

Hour,week + LinearRegression

43

The linear model interprets the time features as continuous values.

Page 44: 4.representing data and engineering features

OneHotEncoder + LinearRegression

44

[248, 15] : 7 weekday columns + 8 time-of-day columns
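A sketch of one-hot encoding the two integer time features and fitting LinearRegression (X_hour_week, n_train, y_train, y_test from the sketch above).

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

enc = OneHotEncoder()
X_onehot = enc.fit_transform(X_hour_week).toarray()   # 7 weekday + 8 hour columns -> (248, 15)

lr = LinearRegression().fit(X_onehot[:n_train], y_train)
print("Test R^2: {:.2f}".format(lr.score(X_onehot[n_train:], y_test)))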

Page 45: 4.representing data and engineering features

PolynomialFeatures + LinearRegression

45

[248, 120] : $\binom{15}{2} + 15 = 105 + 15 = 120$
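A sketch of the interaction features: all pairwise products of the 15 one-hot columns (interaction_only=True, no bias) plus the originals give 105 + 15 = 120 features. X_onehot and the split come from the previous sketch.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_onehot)                  # (248, 120)

lr = LinearRegression().fit(X_poly[:n_train], y_train)
print("Test R^2: {:.2f}".format(lr.score(X_poly[n_train:], y_test)))
# lr.coef_ (next slide) gives one learned weight per day/hour combination feature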

Page 46: 4.representing data and engineering features

LinearRegression's coef_: with a linear model you can inspect the learned model parameters (coefficients).

46