Neural Networks and Data Mining Folie 1
Artificial Neural Networks and Data Mining
Uwe Lämmel
Wismar Business School
www.wi.hs-wismar.de/~laemmel
Content
Data Mining
Classification: approach
Data Mining Cup
– 2004: Who will cancel?
– 2007: Who will get a rebate coupon?
– 2008: How long will someone participate in a lottery?
– 2009: Forecast of book sales figures
– 2010 ?
Clustering: approach
– Behaviour of bank customers
Data Mining
Data Mining is the
– systematic and automated discovery and extraction
– of previously unknown knowledge
– out of huge amounts of data.
"KDD – Knowledge Discovery in Databases" is used as a synonym.
The notion is misleading: Gold Mining extracts gold, but Data Mining extracts knowledge, not data.
Data Mining – Applications
classification
clustering
association
prediction
text mining
web mining
clustering:
– partitioning a data set into subsets (clusters), so that the data in each subset (ideally) share some common features
– similarity or proximity with respect to some defined distance measure
– builds the classes
classification:
– items are placed in subsets (classes)
– classes have known properties
  – customer is bad, average, good
  – pattern recognition
  – …
– a set of training items is used to train the classification algorithm
Content
Data Mining
Classification: approach using NN
Data Mining Cup
Clustering: approach
Classification using NN
prerequisite:
– set of training patterns (many patterns)
approach:
– code the values
– divide the set of training patterns into:
  – training set
  – test set
– build a network
– train the network using the training set
– check the network quality using the test set
[Diagram: real data → training patterns → coded patterns → training set / test set]
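The coding and splitting steps can be sketched as follows. This is only a minimal illustration: the records, the attribute names, the value range 18 to 80 and the 70/30 split are assumptions, not taken from the slides.

```python
import random

def one_hot(value, categories):
    """Code a categorical value as a 0/1 vector, one position per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def scale(value, lo, hi):
    """Code a numeric value into the range [0, 1]."""
    return (value - lo) / (hi - lo)

def split_patterns(patterns, test_fraction=0.3, seed=42):
    """Shuffle the coded patterns and divide them into training and test set."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical real data: (age, customer rating, class label)
raw = [(25, "good", 1), (47, "bad", 0), (33, "average", 1), (61, "bad", 0),
       (52, "good", 1), (38, "average", 0), (29, "good", 1), (44, "bad", 0),
       (57, "average", 1), (35, "good", 0)]
ratings = ["bad", "average", "good"]

# Coded pattern = (input vector, teaching output)
coded = [([scale(age, 18, 80)] + one_hot(rating, ratings), label)
         for age, rating, label in raw]
training_set, test_set = split_patterns(coded)
print(len(training_set), "training patterns,", len(test_set), "test patterns")
```

With this coding each pattern needs four input neurons: one for the scaled age and three for the rating.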
Development of an NN-application
[Flow chart]
build a network architecture
input of training pattern
calculate network output
compare to teaching output
– error is too high: modify weights
– quality is good enough: use test set data
evaluate output
compare to teaching output
– error is too high: change parameters
– quality is good enough
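The development cycle can be sketched with a single sigmoid neuron trained by the delta rule. This is an assumption-laden toy, not the network used in the lecture: the patterns, the learning rate 0.5 and the error threshold 0.05 are all hypothetical.

```python
import math

def output(weights, inputs):
    """Output of a single sigmoid neuron; weights[0] is the bias weight."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1.0 / (1.0 + math.exp(-net))

def mean_error(weights, patterns):
    """Mean squared difference between network output and teaching output."""
    return sum((t - output(weights, x)) ** 2 for x, t in patterns) / len(patterns)

def train(patterns, rate=0.5, max_error=0.05, max_epochs=5000):
    """Present the training patterns and modify weights until the error is low enough."""
    weights = [0.0] * (len(patterns[0][0]) + 1)
    for _ in range(max_epochs):
        if mean_error(weights, patterns) <= max_error:
            break                          # quality is good enough
        for x, t in patterns:              # error is too high: modify weights
            o = output(weights, x)
            delta = rate * (t - o) * o * (1.0 - o)
            weights[0] += delta
            for i, xi in enumerate(x):
                weights[i + 1] += delta * xi
    return weights

# Hypothetical coded patterns: (inputs, teaching output)
training_set = [((0.1, 0.2), 0), ((0.9, 0.8), 1), ((0.2, 0.3), 0),
                ((0.8, 0.9), 1), ((0.3, 0.1), 0), ((0.7, 0.7), 1)]
test_set = [((0.15, 0.25), 0), ((0.85, 0.85), 1)]

weights = train(training_set)
test_error = mean_error(weights, test_set)   # use test set data, evaluate output
print("training error:", mean_error(weights, training_set))
print("test error:", test_error)
```

If the test error were too high, the outer loop of the flow chart would apply: change parameters (architecture, learning rate, coding) and train again.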
Build an Artificial Neural Network
Number of Input Neurons?
– depends on the number of attributes
– depends on the coding
Number of Output Neurons?
– depends on the coding of the class attribute
Number of Hidden Neurons?
– experiments necessary
– generally: not more than input neurons
– a quarter … half of the number of input neurons may work
– see capacity of a neural network
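The rules of thumb above can be written down as a small helper. The attribute names and code lengths are hypothetical, and the result is only a starting range for experiments.

```python
def input_neurons(attribute_codings):
    """Number of input neurons = sum of the code lengths of all attributes."""
    return sum(attribute_codings.values())

def hidden_neuron_range(n_inputs):
    """Rule of thumb: a quarter to half of the number of input neurons."""
    return max(1, n_inputs // 4), max(1, n_inputs // 2)

# Hypothetical coding: attribute name -> number of neurons used to code it
codings = {"age": 1, "rating": 3, "region": 5, "new_customer": 1}

n_in = input_neurons(codings)
low, high = hidden_neuron_range(n_in)
print(f"{n_in} input neurons, try {low} to {high} hidden neurons")
```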
Experiments using the JavaNNS
Build a network
Load training pattern
Open the Error Graph
Open the Control Panel
Initialize the network
Try different learning parameters: 0.1, 0.2, 0.5, 0.8
Start Learning
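Outside JavaNNS, the same experiment with different learning parameters can be imitated in a few lines. The sketch below does not use JavaNNS at all; it trains one sigmoid neuron on a hypothetical pattern set (logical AND) for a fixed number of epochs and reports the final error per learning parameter.

```python
import math

def train_error(patterns, rate, epochs=200):
    """Train one sigmoid neuron online and return the final summed squared error."""
    w = [0.0] * (len(patterns[0][0]) + 1)     # w[0] is the bias weight

    def out(x):
        net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        return 1.0 / (1.0 + math.exp(-net))

    for _ in range(epochs):
        for x, t in patterns:
            o = out(x)
            delta = rate * (t - o) * o * (1.0 - o)
            w[0] += delta
            for i, xi in enumerate(x):
                w[i + 1] += delta * xi
    return sum((t - out(x)) ** 2 for x, t in patterns)

# Hypothetical training patterns: logical AND
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

errors = {rate: train_error(patterns, rate) for rate in (0.1, 0.2, 0.5, 0.8)}
for rate, err in errors.items():
    print(f"learning parameter {rate}: error {err:.4f}")
```

The printed values play the role of the end points of the curves in the Error Graph.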
Getting Results
Evaluate the error
Finally:
– make the test pattern set the current one
– Save Data …
– include output files
– save as a .res file
Evaluate the .res file
Experiments
How can we improve the results?
– Data pre-processing?
– Architecture of ANN?
– Learning Parameters?
– Evaluation of the results: post-processing?
Record your work!
Data Mining Cup www.data-mining-cup.de
annual competition for students
runs April – May/June
real world problem:
– problem
– set of training data
– set of data for classification
– to be developed: classification
supported by many companies (data/software)
~ 200 – 300 participants
workshop (user day)
DMC2004: A Mailing Action
mailing action of a company:
– special offer
– estimated annual income per customer (see table)
given:
– 10,000 sets of customer data containing 1,000 cancellers (training)
problem:
– test set contains 10,000 customer data sets
– Who will cancel?
– Whom to send an offer?

                 customer will cancel   customer will not cancel
gets an offer    43.80 €                66.30 €
gets no offer     0.00 €                72.00 €
Mailing Action – Aim?
no mailing action:
– 9,000 x 72.00 = 648,000
everybody gets an offer:
– 1,000 x 43.80 + 9,000 x 66.30 = 640,500
maximum (100% correct classification):
– 1,000 x 43.80 + 9,000 x 72.00 = 691,800
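The three scenarios can be recomputed from the income table. The sketch below works in cents to avoid rounding problems; the dictionary keys and function name are chosen for illustration only.

```python
# Estimated annual income per customer in cents, from the income table
INCOME = {("cancel", "offer"): 4380, ("loyal", "offer"): 6630,
          ("cancel", "no offer"): 0, ("loyal", "no offer"): 7200}

CANCELLERS, LOYAL = 1_000, 9_000   # 10,000 customers, 1,000 of them cancellers

def total_income(canceller_action, loyal_action):
    """Total income in EUR if all cancellers / all loyal customers get the given action."""
    cents = (CANCELLERS * INCOME[("cancel", canceller_action)]
             + LOYAL * INCOME[("loyal", loyal_action)])
    return cents / 100

no_mailing = total_income("no offer", "no offer")   # nobody gets an offer
everybody = total_income("offer", "offer")          # everybody gets an offer
perfect = total_income("offer", "no offer")         # 100% correct classification

print(no_mailing, everybody, perfect)
```

Sending the offer to everybody is worse than no mailing action at all, so the classification only pays off if the target group is chosen well.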
Goal Function: Lift
basis: no mailing action: 9,000 · 72.00
goal = extra income:
lift_M = 43.80 · c_M + 66.30 · nk_M – 72.00 · nk_M
(c_M: cancellers who get an offer, nk_M: non-cancellers who get an offer)
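A small sketch of the lift as a function. It assumes that cM counts the cancellers who get an offer and nkM the non-cancellers who get an offer; this reading follows from the income table but is not spelled out on the slide. Amounts are handled in cents.

```python
def lift(c_m, nk_m):
    """Extra income in EUR compared with no mailing action.

    c_m:  cancellers who get an offer (assumed meaning of cM)
    nk_m: non-cancellers who get an offer (assumed meaning of nkM)
    """
    cents = 4380 * c_m + 6630 * nk_m - 7200 * nk_m
    return cents / 100

print(lift(1000, 9000))   # everybody gets an offer
print(lift(1000, 0))      # only the cancellers get an offer
print(lift(0, 0))         # no mailing action
```

Each correctly addressed canceller adds 43.80, each loyal customer who gets an offer costs 5.70, so the lift is positive only while fewer than about 7.7 loyal customers are mailed per canceller.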
Data
[Figure: excerpt of the data table with 32 input data columns; annotations mark the results column, important attributes and missing values]
Feed Forward Network – What to do?
train the net with the training set (10,000)
test the net using the test set (another 10,000)
– classify all 10,000 customers into cancellers or loyal customers
– evaluate the additional income
Results
data mining cup 2002
neural network project 2004
gain:
– additional income from the mailing action if the target group was chosen according to the analysis
DMC 2007: Rebate System
Check-out couponing allows individual coupons to be generated at the check-out.
The coupon is printed at the end of the sales slip, depending on the current customer.
Questions:
– How can the retailer identify whether a customer is a potential couponing customer?
– Which coupons will he or she respond to?
Couponing
Print:
– coupon A
– coupon B
– no coupon
50,000 customer cards for training
Classify another 50,000 customers!
Cost function:
– coupon not redeemed (false assignment to A or B): –1
– coupon A redeemed (correct assignment to A): +3
– coupon B redeemed (correct assignment to B): +6
Maximize the value!
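The cost function can be sketched as a per-customer score that is summed over all decisions. The labels "A", "B" and "N" (no coupon) and the example decisions are hypothetical; the sketch also assumes that printing no coupon scores 0, which the slide does not state explicitly.

```python
def coupon_value(assigned, redeems):
    """Value of one decision.

    assigned: coupon printed for the customer, "A", "B" or "N" (no coupon)
    redeems:  coupon the customer would redeem, "A", "B" or "N" (none)
    """
    if assigned == "N":
        return 0                      # assumption: no coupon, no gain, no cost
    if assigned == redeems:
        return 3 if assigned == "A" else 6
    return -1                         # coupon printed but not redeemed

def total_value(decisions):
    """Sum of the values over all (assigned, redeems) pairs."""
    return sum(coupon_value(a, r) for a, r in decisions)

# Hypothetical decisions for five customers
decisions = [("A", "A"), ("B", "B"), ("A", "N"), ("N", "B"), ("B", "A")]
print(total_value(decisions))   # 3 + 6 - 1 + 0 - 1
```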
Data Understanding
What is the meaning of the attributes?
Type and range of values?
20–20–2 Network
Profit = 3·AA + 6·BB – (NA + NB + BA + AB)
results:
– winner 2007: 7,890
– my version: 6,714
– our students: 6,468 (73/230)
DMC2008: Participation in a Lottery
Predicting, at the beginning of the lottery, how long participants will participate:
0 – The first ticket has not been paid for
1 – Only the ticket for the first class has been paid for
2 – Only the first two classes were played
3 – The lottery was played until the end but no ticket purchased for the following lottery
4 – At least first ticket for the following lottery purchased
cost matrix
Data
113,476 patterns!
69 attributes
– new customer (yes/no)
– age
– bank
– car
– …
100–40–20–5 Network
results:
1,030,240 RWTH Aachen (1)
…
1,024,535 RWTH Aachen (8)
865,565 Bauhaus Univ. Weimar (100)
Univ. Wismar: 878,550 – 835,035 – 1,494,315 (212)
DMC 2009 – online bookshop "Libri"
Sales figures training:
– more than 1,800 books
– 2,418 shops
Sales figures forecast:
– 8 books
– 2,394 shops
DMC 2010: Revenue maximisation by intelligent couponing
Many customers only make an order in an online shop once
Decision: whether to send a voucher worth € 5.00 to those who would not have decided to re-order by themselves.
32,427 data sets for training
32,428 data sets for prediction
37 attributes per set + target attribute in training set
Content
Data Mining
Classification: approach
Data Mining Cup
Clustering: approach
– Behaviour of bank customers
Clustering Transaction Data
Co-operation:
– Hochschule Wismar
– HypoVereinsbank
– Medienhaus Rostock
Issue:
– What information can be extracted from turnover time series?
Strategy:
1. Clustering time series data
2. Assign customers/accounts to clusters
3. Examine clusters
Transaction Data & Time Series
Original financial data not suitable:
– Order of values is important
– Time displacements are problematic
Corporate clients
223 branches
Cumulated transactions per
– Month
– Account
– Type of transaction
... for a total of 6 years
Fourier versus Original Data
No displacement: similarity detected on both transaction curve and frequency spectrum
Data is displaced: frequency spectrum shows similarity
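Why the frequency spectrum helps with displaced data can be shown with a plain discrete Fourier transform. The sketch assumes that only the magnitudes of the spectrum are compared and that the displacement is circular (values pushed out at the end re-enter at the start); the turnover values are hypothetical, and the slides do not say which transform settings were actually used.

```python
import cmath

def magnitude_spectrum(series):
    """Magnitudes of the discrete Fourier transform of a time series."""
    n = len(series)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series)))
            for k in range(n)]

def distance(a, b):
    """Euclidean distance between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical monthly turnover of one account and the same curve displaced by 3 months
original = [10, 12, 30, 12, 10, 9, 8, 9, 10, 11, 10, 9]
displaced = original[-3:] + original[:-3]

curve_distance = distance(original, displaced)
spectrum_distance = distance(magnitude_spectrum(original),
                             magnitude_spectrum(displaced))
print("distance of the curves:  ", curve_distance)
print("distance of the spectra: ", spectrum_distance)
```

The curves are far apart because the peak sits in different months, while the magnitude spectra agree up to rounding, so a clustering on the spectra puts both accounts into the same cluster.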
Using a classification model
[Diagram: customer turnover time series processed in three steps]
1. Building the model: Sequence A (t0 … tm) → Preprocessing → Clustering → Initial Cluster → Classification Model
2. Applying the model: Sequence B (t0+n … tm+n) → Preprocessing → Classification Model → New Cluster
3. Comparing cluster assignments: Initial Cluster vs. New Cluster, identical or different?
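The comparison of cluster assignments in step 3 amounts to counting the accounts whose cluster changed between the two sequences. A minimal sketch, with hypothetical account ids and cluster numbers:

```python
def change_rate(initial, new):
    """Share of accounts whose cluster assignment differs between two periods."""
    common = initial.keys() & new.keys()      # accounts present in both periods
    changed = sum(1 for account in common if initial[account] != new[account])
    return changed / len(common)

# Hypothetical assignments: account id -> cluster number
initial_cluster = {"acc1": 3, "acc2": 7, "acc3": 7, "acc4": 12, "acc5": 1}
new_cluster = {"acc1": 3, "acc2": 8, "acc3": 7, "acc4": 12, "acc5": 1}

rate = change_rate(initial_cluster, new_cluster)
print(f"{rate:.0%} of the accounts changed their cluster")
```

Grouping the same count by business sector would give a variability figure per sector.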
Clustering & Prediction Results
140,000 records
1 record = 1 account
6x5 SOM = max. 30 clusters
average changes of cluster assignments: approx. 19%
Variability per Business Sector
22.3%  Taxi                 239/1070
22.3%  Ship Broker Offices  64/471
20.9%  Churches             228/1091
20.2%  Trucking             1010/5008