Seoul Fine Dust Data Analysis


Upload: dong-hee-lee

Post on 15-Apr-2017


Page 1: Seoul Fine Dust Data Analysis

Principles and Practice in Data Mining

2012314261 LEE DONG HEE

Seoul City Weather Data Analysis

2016. 12. 09.

Prof. Seo yuran

Page 2: Seoul Fine Dust Data Analysis

INDEX

01 PROBLEM

02 ANALYSIS PROCESS

03 CONCLUSION

Page 3: Seoul Fine Dust Data Analysis

01 PROBLEM

Page 4: Seoul Fine Dust Data Analysis

01 | PROBLEM

Page 5: Seoul Fine Dust Data Analysis

01 | PROBLEM

Page 6: Seoul Fine Dust Data Analysis

01 | PROBLEM

1.1 Background

In recent years, episodes of high local pollutant concentration have occurred as a result of regional characteristics. It is therefore necessary to analyze their causes on a scientific basis.

1.2 Purpose

The relationship between fine dust and meteorological factors is identified and formulated through statistical techniques, providing a basis for predicting and managing fine dust in Seoul.

Page 7: Seoul Fine Dust Data Analysis

01 | PROBLEM

1.3 Data Source

ASOS (Automated Synoptic Observing System) : "59 variables"

• Temperature
• Wind Speed
• Sunshine
• …

PM10 : "1 variable"

• Fine dust

▪ Area : Seoul City

▪ Period : 2010 - 2015

▪ Rows : 2190 (365 × 6)

▪ Columns : 60 (59 + 1)

Page 8: Seoul Fine Dust Data Analysis

02 ANALYSIS PROCESS

Page 9: Seoul Fine Dust Data Analysis

02 ANALYSIS PROCESS

1. Exploring Data

2. Refining Data

3. Creating Model & Verification

Page 10: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.1 Exploring Data

Basic Statistic

        Mean         Mean                       Mean Relative
        Temperature  Wind Speed  Precipitation  Humidity       Radiation  Sunshine  PM
NA      0            0           1320           0              13         1         8
MIN     -14.5        1.1         0              20.1           0.25       0         3.9
MEDIAN  14           2.5         1.5            60             11.57      7.2       41.05
MEAN    12.68        2.69        10.03          60.33          12.24      6.29      45.95
MAX     31.8         7.5         301.5          99.8           29.74      13.5      658.2
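The per-column figures in the Basic Statistic table (NA count, minimum, median, mean, maximum) could be computed along the following lines. This is a minimal sketch in Python using only the standard library; the function name `summarize`, the use of `None` for a missing observation, and the sample readings are illustrative assumptions, not the tooling or data behind the slide.

```python
import statistics

def summarize(values):
    """Return NA count, min, median, mean, and max for one column.

    None stands in for a missing (NA) observation; it is counted
    but excluded from the other statistics.
    """
    present = [v for v in values if v is not None]
    return {
        "NA": len(values) - len(present),
        "MIN": min(present),
        "MEDIAN": statistics.median(present),
        "MEAN": statistics.fmean(present),
        "MAX": max(present),
    }

# Hypothetical daily PM readings, not taken from the Seoul dataset.
pm_sample = [41.0, None, 35.5, 120.0, 28.5]
summary = summarize(pm_sample)
```

Applying `summarize` to each of the seven columns would yield one table column per call.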

Page 11: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.1 Exploring Data

[Figure: PM Trend Graph (x-axis: Date, y-axis: PM)]

Page 12: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.1 Exploring Data

Missing Value : NA

NA

Precipitation 1320 ☞ 0

Radiation 13 ☞ 0

Sunshine 1 ☞ 0

PM 8 ☞ 0

All missing values are changed to '0', because 'NA' is taken to mean that the meteorological instrument observed nothing.

Outliers

[Figure: Boxplot of PM]

The outlier's value is 658.2, and the next highest value is 292. The outlier is therefore changed to 300, because an outlier is generally replaced with the upper limit of the data.
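The two cleaning steps on this slide, replacing NA with 0 and pulling the single PM outlier down to 300, can be sketched as follows. The helper `clean_column` and the four sample readings are hypothetical; only the fill value 0, the cap 300, and the values 292 and 658.2 come from the slide.

```python
def clean_column(values, fill=0.0, cap=None):
    """Replace missing values with `fill` and limit values to `cap`.

    None stands in for NA. When `cap` is None, no upper limit is applied.
    """
    cleaned = []
    for v in values:
        if v is None:
            v = fill  # NA is treated as "nothing observed"
        if cap is not None and v > cap:
            v = cap   # the outlier is replaced by the chosen upper limit
        cleaned.append(v)
    return cleaned

# Hypothetical PM readings that include one NA and the reported outlier.
pm_raw = [45.0, None, 292.0, 658.2]
pm_clean = clean_column(pm_raw, fill=0.0, cap=300.0)
```

With a cap of 300, the second-highest value 292 is left untouched and only 658.2 is changed, which matches the treatment described on the slide.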

Page 13: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.1 Exploring Data

[Figure: Correlation Coefficient]
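The slide does not say which correlation coefficient was used; assuming the usual Pearson coefficient, the value for one meteorological variable against PM could be computed as in this sketch. The function name and the sample values are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: PM falls as wind speed rises.
wind = [1.0, 2.0, 3.0, 4.0]
pm = [80.0, 60.0, 40.0, 20.0]
r = pearson(wind, pm)
```

Repeating the call for each meteorological variable against PM gives one row of a correlation matrix.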

Page 14: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.2 Refining Data

Add a variable : Degree (factor type)

Fine Dust Levels    PM (µg/m³)
Good                0 ~ 30
Normal              31 ~ 80
Bad                 81 ~ 150
Very Bad            151 ~

Variables : Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree
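Deriving the Degree factor from PM can be sketched as a simple threshold lookup. The table gives integer ranges only, so this sketch assumes that each upper bound is inclusive and that a fractional reading such as 30.5 falls into the next level; the function name `pm_degree` is hypothetical.

```python
def pm_degree(pm):
    """Map a PM concentration (µg/m³) to its fine dust level.

    Assumption: upper bounds 30, 80, and 150 are inclusive.
    """
    if pm <= 30:
        return "Good"
    if pm <= 80:
        return "Normal"
    if pm <= 150:
        return "Bad"
    return "Very Bad"

# Hypothetical readings, one per level.
degrees = [pm_degree(v) for v in [12.0, 45.95, 120.0, 300.0]]
```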

Page 15: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.2 Refining Data

Create Standardization data

Because the scales differ from variable to variable, the variables are standardized for accurate modeling.

Train Set : 80 % Test Set : 20 %

Separate Train Set and Test Set randomly from the dataset

• Train Set is used to train the model
• Test Set is used to evaluate the performance of the trained model
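Standardization and the random 80/20 split can be sketched as below. The slide does not give the exact formula or random seed, so this assumes z-scores based on the sample standard deviation and a fixed seed chosen purely for reproducibility; both function names are illustrative.

```python
import random
import statistics

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical temperatures and row indices, not the Seoul data.
z = standardize([10.0, 12.0, 14.0, 16.0, 18.0])
train, test = train_test_split(list(range(10)))
```

Shuffling before cutting is what makes the split random; every row ends up in exactly one of the two sets.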

Page 16: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

Variable Selection

• These variables were selected from the 60 variables in the raw data on the basis of a relevant paper about weather. Therefore, no general variable selection method is used to select specific variables in this process.

Selected variables : Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree

Page 17: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

Analysis Methods

PCA

Multinomial Logistic Regression

Neural Network

Page 18: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

PCA

Page 19: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

PCA : Parallel Analysis

Parallel analysis suggests that the number of factors = 3

Page 20: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

PCA : 3 Principal Components

• PC1 = 0.08425922*Mean Temperature + (-0.01601653)*Mean Wind Speed + 0.36003954*Precipitation + 0.52249043*Mean Relative Humidity + (-0.51271657)*Radiation + (-0.57196228)*Sunshine

• PC2 = (-0.7977584)*Mean Temperature + 0.2431814*Mean Wind Speed + (-0.1878170)*Precipitation + (-0.2777519)*Mean Relative Humidity + (-0.4220239)*Radiation + (-0.1179781)*Sunshine

• PC3 = 0.080656658*Mean Temperature + 0.899196707*Mean Wind Speed + 0.387333525*Precipitation + 0.009005798*Mean Relative Humidity + 0.160962973*Radiation + 0.094458149*Sunshine
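Each principal component score is a weighted sum of the six input variables, using the loadings listed on this slide. The sketch below assumes the inputs are the standardized values of each variable, in the order Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine; the names `LOADINGS` and `pc_scores` and the sample day are hypothetical.

```python
# Loadings copied from the slide, one list per principal component.
LOADINGS = {
    "PC1": [0.08425922, -0.01601653, 0.36003954,
            0.52249043, -0.51271657, -0.57196228],
    "PC2": [-0.7977584, 0.2431814, -0.1878170,
            -0.2777519, -0.4220239, -0.1179781],
    "PC3": [0.080656658, 0.899196707, 0.387333525,
            0.009005798, 0.160962973, 0.094458149],
}

def pc_scores(z):
    """Return the three component scores for one standardized observation."""
    return {
        name: sum(w * x for w, x in zip(weights, z))
        for name, weights in LOADINGS.items()
    }

# Hypothetical day: temperature one standard deviation above the mean,
# every other variable exactly at its mean.
scores = pc_scores([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

For this input each score equals the temperature loading of its component, which makes the dominant negative weight of temperature in PC2 easy to see.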

Page 21: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

a. Multinomial Logistic Regression

Page 22: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

b. Neural Network (Basic Variables)

Page 23: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Creating Model

c. Neural Network (Principal Component)

Page 24: Seoul Fine Dust Data Analysis

02 | ANALYSIS PROCESS

2.3 Verification

Confusion Matrix

Model                                   Accuracy
Multinomial Logistic Regression         62.01%
Neural Network (Basic Variable)         62.92%
Neural Network (Principal Component)    68.27%
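Accuracy is read off a confusion matrix as the share of test observations on the diagonal, that is, those whose predicted Degree equals the actual Degree. The sketch below shows the calculation; the 4 × 4 matrix is invented for illustration and is not one of the matrices from the slides.

```python
def accuracy(confusion):
    """Fraction of observations on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical counts; rows are actual levels, columns are predicted
# levels, both in the order Good, Normal, Bad, Very Bad.
confusion = [
    [50, 10, 0, 0],
    [12, 60, 8, 0],
    [0, 9, 20, 1],
    [0, 0, 2, 3],
]
acc = accuracy(confusion)
```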

Page 25: Seoul Fine Dust Data Analysis

03 CONCLUSION

Page 26: Seoul Fine Dust Data Analysis

03 | CONCLUSION

Correlation analysis showed that the correlation coefficients between PM10 concentration and the meteorological factors ranged from -0.200 to 0.058.

In the model results, the neural network using principal components achieved higher prediction accuracy than the one using the basic variables, and the neural network achieved higher prediction accuracy than multinomial logistic regression.