DATA MINING FINAL REPORT
Vipin Saini (M964011062)
許博淞 (M964020009)
陳昀志 (M964020043)
Outline
• Introduction
• DM Methodology (Step 1 ~ Step 3)
• DM Methodology (Step 4 ~ Step 8)
• DM Methodology (Step 9 ~ Step 10)
• Conclusion
Introduction
• Direct marketing
• Response rate
• Telecommunications company
• Publicly available business data
• Addition of random companies
Step2-Records
Some characteristics about each prospect:
• Number of employees at a particular office
• Number of employees for the entire company
• Annual sales (in thousands) at a particular office
• Annual sales (in thousands) for the entire company
• Whether or not the company does business outside the United States
• Annual advertising expense
• Whether the company has moved recently or is a new business
• The type of ownership
• Specific industry code
• General industry code
• Age of the company (in years)
Step3-Data Type
Correcting the data types:
• Make sure "Buyer" is of type Yes/No.
• Change the type of "Age" to integer.
• Make sure the "International" type is string or Boolean.
• Change "Local Employees" to integer.
• Change "Local Sales" to integer.
• Change "Industry Type" to categorical.
• Change "Total Employees" to integer.
• Change "Total Sales" to integer.
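The type corrections listed above could be sketched outside PolyAnalyst roughly as follows. This is a minimal illustration, not the procedure the team used: the helper names (`to_bool`, `to_int`, `coerce_row`), the accepted Yes/No spellings, and the two sample rows are all assumptions made for the example.

```python
import csv
import io

def to_bool(value):
    """Map common Yes/No spellings to a Boolean; anything else becomes None."""
    text = value.strip().lower()
    if text in {"yes", "y", "true", "1"}:
        return True
    if text in {"no", "n", "false", "0"}:
        return False
    return None

def to_int(value):
    """Convert a cell to int, returning None for blank or non-numeric cells."""
    try:
        return int(float(value))
    except (ValueError, OverflowError):
        return None

# Column names follow the slide; the converter choice per column is an assumption.
CONVERTERS = {
    "Buyer": to_bool,
    "Age": to_int,
    "International": to_bool,
    "Local Employees": to_int,
    "Local Sales": to_int,
    "Industry Type": str,  # kept as a label, i.e. treated as categorical
    "Total Employees": to_int,
    "Total Sales": to_int,
}

def coerce_row(row):
    """Return a copy of the row with every known column converted."""
    return {name: CONVERTERS.get(name, str)(cell) for name, cell in row.items()}

# Two made-up rows, only to exercise the converters.
sample = io.StringIO(
    "Buyer,Age,International,Local Employees,Local Sales,Industry Type,Total Employees,Total Sales\n"
    "Yes,12,No,30,4500,E,120,18000\n"
    "No,,Yes,5,300,C,5,300\n"
)
rows = [coerce_row(r) for r in csv.DictReader(sample)]
```

Blank cells become `None` rather than 0, so missing values stay distinguishable from real zeros in later steps.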
Step4 Create a Model Set
The number of employees and the sales figures vary with the size of the company. Taken together, these characteristics give a picture of company size.
Employee Ratio, Sales Ratio, Productivity Ratio
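The slides name the three derived variables but do not define them. The sketch below assumes the most natural reading: Employee Ratio = local employees / total employees, Sales Ratio = local sales / total sales, and Productivity Ratio = local sales / local employees. These definitions, the dictionary keys, and the function names are assumptions for illustration only.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when it cannot be computed."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

def add_size_ratios(record):
    """Return a copy of the record with the three ratio columns added.

    The formulas are assumed; the original slides do not state them.
    """
    out = dict(record)
    out["Employee Ratio"] = safe_ratio(record["Local Employees"], record["Total Employees"])
    out["Sales Ratio"] = safe_ratio(record["Local Sales"], record["Total Sales"])
    out["Productivity Ratio"] = safe_ratio(record["Local Sales"], record["Local Employees"])
    return out

# Made-up company used only to show the arithmetic.
example = add_size_ratios(
    {"Local Employees": 30, "Total Employees": 120, "Local Sales": 4500, "Total Sales": 18000}
)
```

Returning `None` for an incomputable ratio mirrors the "Sales Ratio = N/A" branch that appears in the decision tree later in the report.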
Step4: Create a Model Set
With our newly applied rules, the World dataset now has redundant columns.
Step5: Fix Problems with the Data
Categorical variables with too many values
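The slide does not say how the high-cardinality variables were handled. One common approach, shown here purely as an assumption, is to fold rarely seen values into a single catch-all label; the function name, the threshold, and the sample codes are hypothetical.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="Other"):
    """Replace every value seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Made-up industry codes, only for illustration.
codes = ["7372", "7372", "7372", "5812", "5812", "0111", "8011"]
collapsed = collapse_rare(codes, min_count=2)
```

Another option consistent with the record layout is to rely on the general industry code instead of the specific one, since it has far fewer distinct values.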
Step6: Transform the data
Create a training set and a testing set.
Total records: 13117
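A random 50/50 split of the 13117 records can be sketched as below. The report states only that the testing set is a random 50% of the data; the seed, the function name, and the use of integer record IDs are assumptions.

```python
import random

def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Record IDs stand in for the real rows.
train, test = split_half(range(13117))
```

With an odd total, the halves have 6558 and 6559 records, which matches the row totals of the two confusion matrices reported in Steps 8 and 9.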
Step7: Build Model
We used PolyAnalyst 5.0 to mine the data.
Step7: Build Model
We used our edited MarketData.CSV file as the source. After the software filtered out missing values, we obtained the decision tree.
The Decision Tree
[Figure: decision tree diagram. Node labels: Root; Local Employee < 23, Local Employee >= 23; Age < 3, Age >= 3; Local Employee < 10, Local Employee >= 10; Age < 2, Age >= 2; Sales Ratio < 0.0027, Sales Ratio >= 0.0027, Sales Ratio = N/A; Industry Category = A, B, C, D, E, F, G, H, I; Employee Ratio < 0.214]
The Decision Tree
We made a decision tree with:
• Number of non-terminal nodes: 41
• Number of leaves: 91
• Depth of the tree: 8
Step 8: Assess model
Result of the decision tree on the training set:
• Total classification error: 14.04%
• Classification accuracy: 85.96%
• Classification error for class No: 14.89%
• Classification error for class Yes: 13.01%
Real / Predicted    No      Yes     Undefined
No                  3018    528     49
Yes                 379     2535    49
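The reported training percentages can be reproduced from the confusion matrix above if records with an "undefined" prediction are left out of the denominators. That treatment is inferred from the arithmetic, not stated in the report, and the function names are illustrative.

```python
def class_error(matrix, actual, classes):
    """Share of `actual` records that were predicted as a different class.

    Records predicted as "undefined" are excluded (an inferred convention).
    """
    row = matrix[actual]
    decided = sum(row[c] for c in classes)
    return (decided - row[actual]) / decided

def total_error(matrix, classes):
    """Share of all decided records that were misclassified."""
    decided = sum(matrix[a][p] for a in classes for p in classes)
    correct = sum(matrix[c][c] for c in classes)
    return (decided - correct) / decided

# Training-set confusion matrix from the report: matrix[actual][predicted].
training = {
    "No": {"No": 3018, "Yes": 528, "undefined": 49},
    "Yes": {"No": 379, "Yes": 2535, "undefined": 49},
}
classes = ["No", "Yes"]

print(f"{total_error(training, classes):.2%}")         # 14.04%
print(f"{class_error(training, 'No', classes):.2%}")   # 14.89%
print(f"{class_error(training, 'Yes', classes):.2%}")  # 13.01%
```

For example, the class No error is 528 / (3018 + 528) = 14.89%, and the total error is (528 + 379) / 6460 = 14.04%.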
Step 8:Assess model
If we contact only the top 40% of records as ranked by the model, we capture about 80% of the actual responders.
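A figure of this kind is usually read off a cumulative gains chart: rank the records by predicted score and count what share of all responders falls in the top slice. The sketch below shows that calculation on made-up scores; the data, the function name, and the assumption that the model outputs a score per record are not from the report.

```python
def cumulative_gain(scored, fraction):
    """Share of all actual responders found in the top `fraction` of records,
    ranked by predicted score (highest first)."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    cutoff = int(len(ranked) * fraction)
    total = sum(actual for _, actual in ranked)
    captured = sum(actual for _, actual in ranked[:cutoff])
    return captured / total

# Made-up (score, responded) pairs: 10 prospects, 5 of them responders.
scored = [
    (0.95, 1), (0.90, 1), (0.85, 1), (0.80, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0),
]
gain = cumulative_gain(scored, 0.4)
```

In this toy data the top 4 of 10 prospects contain 4 of the 5 responders, so the gain at 40% is 0.8.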
Step 9. Deploy models
The testing set is a randomly selected 50% of the records in the whole dataset.
• Total classification error: 15.54%
• Classification accuracy: 84.46%
• Classification error for class No: 16.56%
• Classification error for class Yes: 14.19%
Real / Predicted    No      Yes     Undefined
No                  3074    610     45
Yes                 396     2395    39
Step 10. Assess result
[Figure: first split of the decision tree. Root splits into Local Employee < 23 and Local Employee >= 23; leaf labels: Yes, No]
Step 10. Assess result
• Companies with 23 or more local employees are more likely to respond (class label Yes, ratio 75.5%).
  Larger companies with more employees tend to respond.
• Companies with fewer than 23 local employees are unlikely to respond (class label No, ratio 72.9%).
  Small companies tend not to respond.
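Taken on its own, the first split acts as a one-rule classifier. The sketch below encodes it with the two leaf ratios reported on this slide; the function name is hypothetical, and treating the ratios as class shares within each leaf is an interpretation.

```python
# Leaf ratios reported for the first split of the tree.
SHARE_YES_LARGE = 0.755  # Local Employee >= 23, class label Yes
SHARE_NO_SMALL = 0.729   # Local Employee < 23, class label No

def first_split(local_employees):
    """Return (predicted class, reported share of that class in the leaf)."""
    if local_employees >= 23:
        return "Yes", SHARE_YES_LARGE
    return "No", SHARE_NO_SMALL
```

Under that interpretation, roughly 1 - 0.729 = 27.1% of the small companies still respond, so the split separates the groups without making either one pure.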
Step 10. Assess result
[Figure: decision tree branch. Root splits into Local Employee < 23 and Local Employee >= 23; Industry Category = A, B, C, D, E, F, G, H, I; Employee Ratio < 0.214 and Employee Ratio >= 0.214; leaf labels: Yes, No]
Step 10. Assess result
• If the Employee Ratio is smaller than 0.214, the response rate is low (class label No, ratio 85.7%).
• If the Employee Ratio is 0.214 or larger, the response rate is high (class label Yes, ratio 66.2%).
  For larger companies in Industry Category E, the response depends on the Employee Ratio.
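The branch discussed here can be written as a nested rule: Local Employee >= 23, then Industry Category = E, then a test on Employee Ratio. The sketch below is an illustrative encoding of that single path; the dictionary keys and function name are assumptions, and records outside the branch are simply not classified.

```python
def category_e_branch(record):
    """Classify a record that falls on the Industry Category E branch.

    Returns (class, reported leaf ratio), or None when the record is not on
    this branch. Assumes "Employee Ratio" is present and numeric.
    """
    if record["Local Employees"] < 23 or record["Industry Category"] != "E":
        return None
    if record["Employee Ratio"] < 0.214:
        return "No", 0.857
    return "Yes", 0.662

# Made-up records to exercise each outcome.
low = category_e_branch({"Local Employees": 40, "Industry Category": "E", "Employee Ratio": 0.10})
high = category_e_branch({"Local Employees": 40, "Industry Category": "E", "Employee Ratio": 0.50})
other = category_e_branch({"Local Employees": 40, "Industry Category": "C", "Employee Ratio": 0.50})
```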
Step 10. Assess result
[Figure: decision tree branch. Root splits into Local Employee < 23 and Local Employee >= 23; Age < 3 and Age >= 3; Local Employee < 10 and Local Employee >= 10; Age < 2 and Age >= 2; Sales Ratio < 0.0027, Sales Ratio >= 0.0027, Sales Ratio = N/A; leaf label: Yes]
Step 10. Assess result
• If the Sales Ratio is at least 0.0027 (0.27%), the response rate is high (class label Yes, ratio 98.2%).
  A newly established company with a good sales ratio is likely to respond.
Conclusion
We used a decision tree to approach target marketing.
Knowing a company's industry category lets us extract more information from the mining results.
Thank you for listening!