DATA MINING FINAL REPORT

Vipin Saini M964011062 許博淞 M964020009 陳昀志 M964020043

Outline

• Introduction
• DM Methodology (Step 1 to Step 3)
• DM Methodology (Step 4 to Step 8)
• DM Methodology (Step 9 to Step 10)
• Conclusion

Introduction

• Direct marketing
• Response rate
• Telecommunications company
• Publicly available business data
• Addition of random companies

Step 2: Records

Some characteristics of each prospect:

• Number of employees at a particular office
• Number of employees for the entire company
• Annual sales (in thousands) at a particular office
• Annual sales (in thousands) for the entire company
• Whether or not the company does business outside the United States
• Annual advertising expense
• Whether the company has moved recently or is a new business
• The type of ownership
• Specific industry code
• General industry code
• Age of the company (in years)

Step 3: Data Type

Correcting the data types:

• Make sure "Buyer" is of type Yes/No.
• Change "Age" to integer.
• Make sure "International" is of type string or Boolean.
• Change "Local Employees" to integer.
• Change "Local Sales" to integer.
• Change "Industry Type" to categorical.
• Change "Total Employees" to integer.
• Change "Total Sales" to integer.
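The type corrections listed on this slide were made in PolyAnalyst. As a rough illustration only, the same corrections could be sketched in plain Python as below. The column names come from the slide; the helper names (`parse_bool`, `parse_int`, `coerce_row`), the accepted yes/no spellings, and the sample row are assumptions made for this example.

```python
def parse_bool(text):
    """Map common yes/no spellings to True/False; anything else is missing."""
    value = text.strip().lower()
    if value in {"yes", "y", "true", "1"}:
        return True
    if value in {"no", "n", "false", "0"}:
        return False
    return None

def parse_int(text):
    """Return an int, or None when the cell is blank or not numeric."""
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        return None

# Column names follow the slide; the converter chosen for each is an assumption.
CONVERTERS = {
    "Buyer": parse_bool,
    "Age": parse_int,
    "International": parse_bool,
    "Local Employees": parse_int,
    "Local Sales": parse_int,
    "Industry Type": str.strip,
    "Total Employees": parse_int,
    "Total Sales": parse_int,
}

def coerce_row(row):
    """Apply the converter for each known column; leave other columns as text."""
    return {name: CONVERTERS.get(name, str)(value) for name, value in row.items()}

# Made-up raw row, as it might be read from a CSV file.
raw = {"Buyer": "Yes", "Age": "12", "International": "No",
       "Local Employees": "25", "Local Sales": "1800",
       "Industry Type": " E ", "Total Employees": "140", "Total Sales": ""}
clean = coerce_row(raw)
print(clean)
```

Blank or non-numeric cells become `None` rather than zero, so missing values stay visible to the later steps.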

Step 4: Create a Model Set

The number of employees and the sales volume both vary with the size of the company, so together these characteristics mainly paint a picture of company size.

Derived ratios: Employee Ratio, Sales Ratio, Productivity Ratio
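The slide names the three ratios but does not define them. The sketch below assumes one plausible set of definitions: local value divided by company-wide value for the Employee and Sales ratios, and local sales per local employee for the Productivity Ratio. These definitions, the function names, and the sample record are all assumptions for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when it cannot be computed."""
    if numerator is None or denominator in (None, 0):
        return None
    return numerator / denominator

def add_ratios(record):
    """Return a copy of the record with the three derived ratios added."""
    enriched = dict(record)
    # Assumed definitions; the slides name the ratios but do not define them.
    enriched["Employee Ratio"] = safe_ratio(
        record["Local Employees"], record["Total Employees"])
    enriched["Sales Ratio"] = safe_ratio(
        record["Local Sales"], record["Total Sales"])
    enriched["Productivity Ratio"] = safe_ratio(
        record["Local Sales"], record["Local Employees"])
    return enriched

# Made-up record for illustration.
example = {"Local Employees": 25, "Total Employees": 100,
           "Local Sales": 500, "Total Sales": 4000}
enriched = add_ratios(example)
print(enriched["Employee Ratio"], enriched["Sales Ratio"], enriched["Productivity Ratio"])
```

A ratio that cannot be computed is returned as `None`, which would play the role of the "N/A" branch that appears for Sales Ratio in the decision tree.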

Step 4: Create a Model Set

After applying the new rules, the World dataset now contains redundant columns.

Step 5: Fix Problems with the Data

Categorical variables with too many values
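The slide does not say how this problem was handled. One common approach, shown here only as a sketch, is to fold categories that occur rarely into a single catch-all value. The function name `collapse_rare`, the threshold, and the sample codes are assumptions for illustration.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Made-up industry codes; the real dataset's codes are not shown in the slides.
codes = ["7372", "7372", "7372", "5812", "5812", "0191", "8011"]
collapsed = collapse_rare(codes, min_count=2)
print(collapsed)
# ['7372', '7372', '7372', '5812', '5812', 'Other', 'Other']
```

Another option consistent with the dataset description would be to use the general industry code in place of the specific one.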

Step 6: Transform the Data

Create a training set and a testing set.

Total records: 13,117
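The split itself was done in PolyAnalyst. As a minimal sketch of a random 50/50 split, assuming nothing about the tool's internals, the records can be shuffled and cut in the middle. The function name `split_half` and the fixed seed are assumptions for illustration.

```python
import random

def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Stand-in record identifiers for the 13,117 records.
record_ids = range(13117)
training, testing = split_half(record_ids)
print(len(training), len(testing))  # 6558 6559
```

With 13,117 records the two halves hold 6,558 and 6,559 records, which matches the totals of the training and testing confusion matrices reported later in the deck.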

Step 7: Build Model

We used PolyAnalyst 5.0 to mine the data.

Step 7: Build Model

We used our edited MarketData.CSV file as the source. After the software filtered out the missing values, it produced the decision tree.

The Decision Tree

[Figure: the decision tree. The root splits on Local Employee (< 23 / >= 23). Lower-level splits shown: Age (< 3 / >= 3 and < 2 / >= 2), Local Employee (< 10 / >= 10), Sales Ratio (< 0.0027 / >= 0.0027 / N/A), Industry Category (A to I), and Employee Ratio (threshold 0.214).]

The Decision Tree

The resulting decision tree has:

• Number of non-terminal nodes: 41
• Number of leaves: 91
• Depth of the tree: 8

Step 8: Assess Model

Decision tree results on the training set:

• Total classification error: 14.04%
• Classification accuracy: 85.96%
• Classification error for class No: 14.89%
• Classification error for class Yes: 13.01%

Confusion matrix (rows: actual class, columns: predicted class)

Actual    No      Yes     Undefined
No        3018    528     49
Yes       379     2535    49

Step 8: Assess Model

If we contact only the top 40% of prospects as ranked by the model, we capture about 80% of the actual responders.
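One reading of this statement is a cumulative gains measure: rank prospects by model score, take the top fraction, and ask what share of all responders falls inside it. The sketch below illustrates that calculation. The function name `cumulative_gain` and the ten scored prospects are made up for the example and are not the report's data.

```python
def cumulative_gain(scores, labels, fraction):
    """Share of all responders found in the top `fraction` of prospects by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = int(round(len(ranked) * fraction))
    found = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)
    return found / total if total else 0.0

# Ten made-up prospects: model score and whether they responded (1) or not (0).
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
gain = cumulative_gain(scores, labels, 0.4)
print(gain)  # 0.8
```

In this toy example the top 40% of prospects contains four of the five responders, so the gain is 0.8.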

Step 9: Deploy Models

The testing set is a randomly selected 50% of the records in the whole dataset.

• Total classification error: 15.54%
• Classification accuracy: 84.46%
• Classification error for class No: 16.56%
• Classification error for class Yes: 14.19%

Confusion matrix (rows: actual class, columns: predicted class)

Actual    No      Yes     Undefined
No        3074    610     45
Yes       396     2395    39
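The error rates reported for the training and testing sets can be reproduced from the two confusion matrices if records with an "undefined" prediction are left out of the denominators. That treatment is inferred from the numbers, not stated in the slides. The sketch below recomputes the rates under that assumption; the function name `error_rates` and the dictionary layout are choices made for this example.

```python
def error_rates(matrix):
    """Compute error rates (in percent) from {actual: {predicted: count}}.

    Records predicted as 'undefined' are ignored, which is an assumption
    inferred from the reported figures.
    """
    classes = list(matrix)
    per_class = {}
    wrong_total = 0
    decided_total = 0
    for actual in classes:
        decided = sum(matrix[actual][predicted] for predicted in classes)
        wrong = decided - matrix[actual][actual]
        per_class[actual] = 100 * wrong / decided
        wrong_total += wrong
        decided_total += decided
    return 100 * wrong_total / decided_total, per_class

# Counts copied from the training and testing confusion matrices.
training = {"No": {"No": 3018, "Yes": 528, "undefined": 49},
            "Yes": {"No": 379, "Yes": 2535, "undefined": 49}}
testing = {"No": {"No": 3074, "Yes": 610, "undefined": 45},
           "Yes": {"No": 396, "Yes": 2395, "undefined": 39}}

for name, matrix in (("training", training), ("testing", testing)):
    total, per_class = error_rates(matrix)
    print(name, round(total, 2), {k: round(v, 2) for k, v in per_class.items()})
```

Rounded to two decimals, the results agree with the reported figures: 14.04%, 14.89% and 13.01% for training, and 15.54%, 16.56% and 14.19% for testing. The gap of about 1.5 percentage points between the two sets suggests the tree is not badly overfitted.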

Step 10: Assess Result

[Figure: first split of the decision tree. Root splits into Local Employee < 23 (class No) and Local Employee >= 23 (class Yes).]

Step 10: Assess Result

• Companies with 23 or more local employees are more likely to respond (class label Yes, ratio 75.5%). Bigger companies with more employees tend to respond.
• Companies with fewer than 23 local employees are unlikely to respond (class label No, ratio 72.9%). Small companies tend not to respond.

Step 10: Assess Result

[Figure: decision tree branch for Local Employee >= 23, split on Industry Category (A to I). The Industry Category = E node splits on Employee Ratio: < 0.214 (class No) and >= 0.214 (class Yes).]

Step 10: Assess Result

• If the Employee Ratio is below 0.214, the response rate is low (class label No, ratio 85.7%).
• If the Employee Ratio is 0.214 or above, the response rate is high (class label Yes, ratio 66.2%).
• For bigger companies in Industry Category E, the response therefore depends on the Employee Ratio.
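The rules read off the tree so far can be written as a small function. This is only a sketch of the branches discussed in the slides, not the full tree, which has 41 non-terminal nodes. The function name `predict_response` and its return format are assumptions; the thresholds and percentages are the ones quoted in the slides.

```python
def predict_response(local_employees, industry_category, employee_ratio):
    """Return (class label, share of that class at the node) for one prospect.

    Covers only the branches described in the slides.
    """
    if local_employees < 23:
        return "No", 0.729
    if industry_category == "E" and employee_ratio is not None:
        if employee_ratio < 0.214:
            return "No", 0.857
        return "Yes", 0.662
    # Fallback: overall figure for the Local Employee >= 23 node,
    # used because the other industry branches are not detailed in the slides.
    return "Yes", 0.755

print(predict_response(40, "E", 0.10))  # ('No', 0.857)
print(predict_response(40, "E", 0.50))  # ('Yes', 0.662)
print(predict_response(12, "B", 0.50))  # ('No', 0.729)
```

Returning the node's class share alongside the label makes it easy to rank prospects as well as classify them.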

Step 10: Assess Result

[Figure: decision tree branch for Local Employee < 23, with splits on Age (< 3 / >= 3 and < 2 / >= 2), Local Employee (< 10 / >= 10), and Sales Ratio (< 0.0027 / >= 0.0027 / N/A). The Sales Ratio >= 0.0027 leaf is class Yes.]

Step 10: Assess Result

• If the Sales Ratio is 0.0027 (0.27%) or more, the response rate is high (class label Yes, ratio 98.2%).
• A newly established company with a good sales ratio is likely to respond.

Conclusion

We used a decision tree to identify targets for direct marketing.

Knowing a company's industry category lets us extract more information from the mining results.

Thank you for listening!