Seoul Fine Dust Data Analysis

Principles and Practice in Data Mining

2012314261 LEE DONG HEE

Seoul City Weather Data Analysis

2016. 12. 09.

Prof. Seo yuran


INDEX

01 PROBLEM

02 ANALYSIS PROCESS

03 CONCLUSION

01 | PROBLEM


1.1 Background
In recent years, high concentrations of local pollution have occurred due to regional characteristics. Therefore, it is necessary to analyze the causes scientifically.

1.2 Purpose
The relationship between fine dust and meteorological factors is identified and formulated through statistical techniques. This provides a basis for predicting and managing fine dust in Seoul.


1.3 Data Source

ASOS (Automated Synoptic Observing System)
• Temperature
• Wind Speed
• Sunshine
…
“59 variables”

PM10
• Fine dust
“1 variable”

▪ Area : Seoul City

▪ Period : 2010 - 2015

▪ Rows : 2190 (365 X 6)

▪ Columns : 60 (59 + 1)

02 | ANALYSIS PROCESS


1. Exploring Data
2. Refining Data
3. Creating Model & Verification


2.1 Exploring Data
Basic Statistics

Statistic  Mean Temperature  Mean Wind Speed  Precipitation  Mean Relative Humidity  Radiation  Sunshine  PM
NA         0                 0                1320           0                       13         1         8
MIN        -14.5             1.1              0              20.1                    0.25       0         3.9
MEDIAN     14                2.5              1.5            60                      11.57      7.2       41.05
MEAN       12.68             2.69             10.03          60.33                   12.24      6.29      45.95
MAX        31.8              7.5              301.5          99.8                    29.74      13.5      658.2

2.1 Exploring Data
PM Trend Graph

[Figure: PM trend graph, PM plotted against Date]


2.1 Exploring Data
Missing Values (NA) and Outliers

Number of NA values and their replacement
Precipitation : 1320 ☞ 0
Radiation : 13 ☞ 0
Sunshine : 1 ☞ 0
PM : 8 ☞ 0

All missing values are replaced with 0, because 'NA' is taken to mean that the meteorological instrument observed nothing.

Boxplot of PM

The outlier value is 658.2, and the next highest value is 292. The outlier is therefore changed to 300, because an outlier is commonly replaced with an upper limit of the data.
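The two cleaning steps above can be sketched in a few lines of Python. This is only an illustration: the sample list and the function names `fill_missing` and `cap_outliers` are hypothetical, and `None` stands in for NA.

```python
# Hypothetical sample of daily PM readings; None stands in for NA.
pm = [41.0, None, 292.0, 658.2, 3.9]


def fill_missing(values, fill=0.0):
    """Replace missing entries (None) with a constant (0 on the slide)."""
    return [fill if v is None else v for v in values]


def cap_outliers(values, upper=300.0):
    """Cap any value above the chosen upper limit (300 on the slide)."""
    return [min(v, upper) for v in values]


pm_clean = cap_outliers(fill_missing(pm))
print(pm_clean)  # [41.0, 0.0, 292.0, 300.0, 3.9]
```

Capping keeps the extreme day in the dataset while limiting its influence on later steps such as standardization.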

2.1 Exploring Data
Correlation Coefficient

[Figure: correlation coefficient plot]
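As a reminder of what a correlation coefficient measures, here is a minimal sketch of the Pearson correlation between one meteorological variable and PM. It assumes the Pearson coefficient is the one used, which the slides do not state; the function name `pearson` and the four sample values are made up for illustration and are not taken from the Seoul dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical values: wind speed rising while PM falls.
wind = [1.0, 2.0, 3.0, 4.0]
pm = [80.0, 60.0, 50.0, 30.0]
r = pearson(wind, pm)
print(round(r, 3))
```

A value near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates a weak one.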


2.2 Refining Data
Add a variable : Degree (factor type)

Fine Dust Level  PM (µg/m3)
Good             0 ~ 30
Normal           31 ~ 80
Bad              81 ~ 150
Very Bad         151 ~

Variables : Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree
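The Degree variable can be derived from PM with a simple threshold function. The sketch below is one possible reading of the table: the function name `pm_degree` is hypothetical, and treating each upper bound as inclusive (so that a fractional value such as 30.5 falls into Normal) is an assumption, since the table lists only integer ranges.

```python
def pm_degree(pm):
    """Map a PM value (µg/m3) to the fine dust level from the table.

    Assumption: upper bounds are inclusive, so 30 is Good and 30.5 is Normal.
    """
    if pm <= 30:
        return "Good"
    if pm <= 80:
        return "Normal"
    if pm <= 150:
        return "Bad"
    return "Very Bad"


# The median and the capped maximum PM from the basic statistics.
print(pm_degree(41.05))  # Normal
print(pm_degree(300))    # Very Bad
```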



2.2 Refining Data
Create Standardized Data

Because each variable is on a different scale, the variables are standardized for accurate modeling.

Separate the Train Set and Test Set randomly from the dataset

Train Set : 80 % Test Set : 20 %

• The Train Set is used to fit the model
• The Test Set is used to evaluate the performance of the fitted model
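A minimal sketch of the standardization and the random 80/20 split, using only the Python standard library. The function names, the fixed seed, and the use of the sample standard deviation are assumptions for illustration; the slides do not say which tool or settings were used.

```python
import random
import statistics


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1 (z-scores).

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


def train_test_split(rows, train_frac=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Row indices standing in for the 2190 daily records.
rows = list(range(2190))
train, test = train_test_split(rows)
print(len(train), len(test))  # 1752 438
```

Shuffling before cutting keeps the split random rather than chronological, which matches the slide's description.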



2.3 Creating Model
Variable Selection

• These variables were selected from the 60 variables in the raw data, based on a relevant paper about weather. Therefore, no general variable selection method was used in this process.

Selected variables : Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree



2.3 Creating Model
Analysis Methods

• PCA
• Multinomial Logistic Regression
• Neural Network

2.3 Creating Model
PCA : Parallel Analysis

Parallel analysis suggests that the number of factors = 3


2.3 Creating Model
PCA : 3 Principal Components

• PC1 = 0.08425922*Mean Temperature - 0.01601653*Mean Wind Speed + 0.36003954*Precipitation + 0.52249043*Mean Relative Humidity - 0.51271657*Radiation - 0.57196228*Sunshine

• PC2 = -0.7977584*Mean Temperature + 0.2431814*Mean Wind Speed - 0.1878170*Precipitation - 0.2777519*Mean Relative Humidity - 0.4220239*Radiation - 0.1179781*Sunshine

• PC3 = 0.080656658*Mean Temperature + 0.899196707*Mean Wind Speed + 0.387333525*Precipitation + 0.009005798*Mean Relative Humidity + 0.160962973*Radiation + 0.094458149*Sunshine
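Each principal component is a weighted sum of the six variables, so a day's component scores can be computed as dot products with the coefficients listed above. The sketch below copies those coefficients; the function name `pc_scores` and the input values are hypothetical, and it assumes the inputs are the standardized variables, which the slides do not state explicitly.

```python
# Coefficients copied from the slide, in the order: Mean Temperature,
# Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine.
LOADINGS = {
    "PC1": [0.08425922, -0.01601653, 0.36003954,
            0.52249043, -0.51271657, -0.57196228],
    "PC2": [-0.7977584, 0.2431814, -0.1878170,
            -0.2777519, -0.4220239, -0.1179781],
    "PC3": [0.080656658, 0.899196707, 0.387333525,
            0.009005798, 0.160962973, 0.094458149],
}


def pc_scores(z):
    """Project one observation (six values) onto the three components."""
    return {
        name: sum(w * x for w, x in zip(weights, z))
        for name, weights in LOADINGS.items()
    }


# Hypothetical standardized values for a single day (assumption).
z_day = [0.5, -1.0, 0.0, 1.2, -0.3, -0.8]
scores = pc_scores(z_day)
print(scores)
```

Judging by the coefficients, PC1 weights humidity and precipitation positively against radiation and sunshine, PC2 is dominated by temperature, and PC3 by wind speed.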

2.3 Creating Model
a. Multinomial Logistic Regression
b. Neural Network (Basic Variables)
c. Neural Network (Principal Components)


2.3 Verification
Confusion Matrix

Accuracy
Multinomial Logistic Regression : 62.01%
Neural Network (Basic Variables) : 62.92%
Neural Network (Principal Components) : 68.27%
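Accuracy is read off a confusion matrix as the share of test cases on the diagonal. A minimal sketch follows; the function name `accuracy` and the 4x4 matrix of counts are invented for illustration (rows are actual levels, columns are predicted levels, in the order Good, Normal, Bad, Very Bad) and do not reproduce the models' real results.

```python
def accuracy(confusion):
    """Fraction of correctly classified cases: diagonal sum / total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts; rows = actual level, columns = predicted level.
confusion = [
    [50, 20, 0, 0],
    [15, 200, 10, 0],
    [0, 30, 25, 1],
    [0, 2, 3, 2],
]
print(f"{accuracy(confusion):.2%}")
```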

03 | CONCLUSION


Correlation analysis showed that the correlation coefficients between PM10 concentration and the meteorological factors ranged from -0.200 to 0.058, i.e. only weak linear relationships.

In the model results, the neural network on principal components achieved higher prediction accuracy than the neural network on the basic variables (68.27% vs. 62.92%), and the neural network achieved higher prediction accuracy than multinomial logistic regression (62.01%).
