K236: Basis of Data Science
Lecture 6: Data Preprocessing
Lecturers: Tu Bao Ho and Hieu Chi Dam
TAs: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science 6/9
2. Introduction to data science 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27
The data analysis process
[Flowchart: data organized by function, in five numbered stages:
1. Create/select target database; select sampling technique and sample data
2. Supply missing values; eliminate noisy data
3. Normalize values; transform values; transform to a different representation; create derived attributes; find important attributes & value ranges
4. Select DM task(s); select DM method(s); extract knowledge
5. Test knowledge; refine knowledge
Supporting layers: query & report generation, aggregation & sequences, advanced methods, data warehousing. Coverage: Lecture 6; Lectures 7-9 and 10-14; Lecture 8.]
Outline
1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation
Why preprocess the data?
Common properties of large real-world databases:
• Incomplete: lacking attribute values or certain attributes of interest
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
This is the veracity problem: no quality data, no quality analysis results!
Major tasks in data preprocessing
1. Data cleaning
2. Data integration
3. Data reduction (instances and dimensions)
4. Data transformation
Major tasks in data preprocessing
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a representation reduced in volume but producing the same or similar analytical results
• Data discretization: part of data reduction, of particular importance for numerical data
Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing data
• Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
• Missing data may need to be inferred.
Missing values in databases
• Missing values may hide a true answer underlying the data
• Many data mining programs cannot be applied to data that includes missing values
[Example table: the class attribute takes values norm, lt-norm, gt-norm; the other six attributes all have missing values]
Missing values in databases
Methods for handling missing values:
1. Ignore the tuples
2. Fill in the missing value manually (tedious, often infeasible)
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean (or mode for a categorical attribute) over all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
7. Others
[Example table: the attributes' missing entries are filled by methods 2, 4, 5, 3, 6, and 6 respectively, with fill values such as 'yes'/'no', 'unknown', 29, 'none', 13, and 'dna']
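Methods 3-5 above can be sketched in a few lines; this is a minimal illustration on a hypothetical `ages` column (the data and names are made up, not the lecture's labor data).

```python
# Sketch of missing-value imputation: fill None entries with a global
# constant (method 3), the attribute mean (method 4), or the mode for a
# categorical attribute (part of method 5).
def impute(values, strategy="mean", constant=None):
    """Return a copy of `values` with None entries filled in."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = sum(present) / len(present)
    elif strategy == "mode":
        fill = max(set(present), key=present.count)  # most frequent value
    return [fill if v is None else v for v in values]

ages = [29, None, 31, None, 40]
print(impute(ages, "mean"))           # None replaced by the mean of 29, 31, 40
colors = ["red", None, "red", "blue"]
print(impute(colors, "mode"))         # None replaced by the mode 'red'
```

Method 5 would apply the same `impute` call separately to the tuples of each class; method 6 typically requires a predictive model such as decision-tree induction.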
Noisy data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data
How to handle noisy data?
• Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering: detect and remove outliers
• Combined computer and human inspection: detect suspicious values and have a human check them
• Regression: smooth by fitting the data to regression functions
How to handle noisy data?
Binning smooths a sorted data value by consulting its "neighborhood", that is, the values around it (local smoothing):
• Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
• Smoothing by bin medians: each bin value is replaced by the bin median
• Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries; each value is replaced by the closest boundary
How to handle noisy data?
• The original data: 9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25
• Sort the data in increasing order and partition into (equi-depth) bins of four values: 4, 8, 9, 15 | 21, 21, 24, 25 | 26, 28, 29, 34
• Smoothing by bin means: 9, 9, 9, 9, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25
• Smoothing by bin boundaries (each value replaced by the closest boundary): 4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34
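The two smoothers in the example above can be sketched directly, using the lecture's twelve values and equi-depth bins of four:

```python
# Equi-depth binning smoothers: replace each value by its bin mean, or
# by the closest bin boundary (min or max of the bin).
def smooth_by_means(data, depth):
    out = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        out += [sum(bin_) / len(bin_)] * len(bin_)
    return out

def smooth_by_boundaries(data, depth):
    out = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        lo, hi = bin_[0], bin_[-1]      # data is already sorted
        out += [lo if v - lo <= hi - v else hi for v in bin_]
    return out

data = sorted([9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25])
print(smooth_by_boundaries(data, 4))
# [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```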
How to handle noisy data?
• Outliers may be detected by cluster analysis: values that fall outside of the set of clusters may be considered outliers
How to handle noisy data?
• Combined computer and human inspection: output patterns with surprising content to a list; a human can then identify the actual garbage ones.
• Regression: smooth by fitting the data to a function:
  - Linear regression: find the best line to fit two variables, e.g., y = x + 1
  - Multiple linear regression: more than two variables; the data are fit to a multidimensional surface
[Figure: a regression line y = x + 1 through (x, y) data; a point X1 with observed value Y1 is smoothed to the fitted value Y1']
Data integration
• Data integration combines data from multiple sources (multiple DBs, data cubes, flat files) into a coherent data store.
• Schema integration (the entity identification problem): how can equivalent entities from multiple data sources be matched up?
• Redundancy: an attribute may be redundant if it can be "derived" from another table.
Data integration
• Redundancy can be detected by correlation analysis, e.g., how strongly one attribute implies another. The correlation coefficient of attributes A and B is

  r_{A,B} = Σ(A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the means, and σ_A and σ_B the standard deviations of A and B.
• Detection and resolution of data value conflicts is another integration issue.
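The correlation coefficient can be computed directly; a value near +1 or −1 suggests one attribute is largely redundant given the other:

```python
# Pearson correlation coefficient r_{A,B} for redundancy detection,
# using sample standard deviations (n - 1 in the denominator).
from math import sqrt

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return cov / ((n - 1) * sa * sb)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]          # b = 2a: perfectly correlated, so redundant
print(round(corr(a, b), 6))   # 1.0
```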
Strategies for data reduction
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Data cube aggregation
• Aggregation operations are applied to the data in the construction of a data cube.
[Figure: on the left, sales are shown per quarter; on the right, the data are aggregated to give the annual sales]
Data cube aggregation
[Figure: a data cube for multidimensional analysis of sales data, showing annual sales per item type for each branch of the company]
Data compression: Attribute selection
Attribute subset selection (also called "feature selection"):
• Stepwise forward selection
• Stepwise backward elimination
• Combination of forward selection and backward elimination
• Many other methods
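A minimal sketch of stepwise forward selection: the greedy loop is the point here. The `score` function (e.g., cross-validated accuracy of a classifier trained on the candidate subset) is an assumption, illustrated with a toy scorer.

```python
# Greedy forward selection: start from the empty set and repeatedly add
# the attribute whose addition gives the best subset score.
def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: a made-up usefulness weight per attribute.
weights = {"income": 3, "age": 2, "zip": 1}
print(forward_selection(["zip", "age", "income"],
                        lambda s: sum(weights[a] for a in s), 2))
# ['income', 'age']
```

Stepwise backward elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.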
Data compression: Wavelet transforms
• Discrete wavelet transform (DWT): a linear signal processing technique that, applied to a data vector D, transforms it to a numerically different vector D' of wavelet coefficients.
• Compression: store only a small fraction of the strongest wavelet coefficients.
[Figure: a real data signal, its wavelet transform (WT) at resolution levels J = -1 and J = -2, and the reconstruction (RWT)]
Data compression: PCA
• Principal Components Analysis transforms data points from k dimensions into c dimensions (c ≤ k) with minimum loss of information.
• PCA searches for c orthogonal vectors that can best be used to represent the data; the original data are then projected onto a much smaller space of c dimensions (the c principal components).
• Only used for numerical data.
[Figure: objects O1-O5 in 3-D space (X, Y, Z) with two candidate projection axes. Question: reducing to one dimension, which of Z1 and Z2 is better?]
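PCA is commonly computed via the singular value decomposition of the centered data; a short sketch with NumPy, projecting 2-D points onto their first principal component (c = 1), in the spirit of the Z1/Z2 question above (the data points are made up):

```python
# PCA via SVD: center the data, take the top right-singular vector as
# the first principal component, and project onto it.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)             # center each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:1].T                   # coordinates along the 1st component
print(Z.shape)                      # (6, 1): 2-D reduced to 1-D
```

The "better" axis is the one with the larger singular value, i.e., the one along which the projected data retain the most variance.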
Numerosity reduction
• Can we reduce the data volume by choosing alternative, 'smaller' forms of data representation?
• Parametric methods: a model is used to estimate the data, so that typically only the model parameters need be stored instead of the actual data, e.g., regression and log-linear models (y = ax + b)
• Non-parametric methods for storing reduced representations of the data include:
  - Histograms
  - Clustering
  - Sampling
Numerosity reduction: Histograms
[Figure: singleton buckets, where each bucket represents one price-value/frequency pair, versus an equi-width histogram, where values are aggregated so that each bucket has a uniform width of $10]
Numerosity reduction: Clustering
[Figure: a 2-D plot of customer data with respect to customer locations in a city, showing three data clusters; each cluster "center" is marked with a "+"]
Numerosity reduction: Sampling
• Simple random sample without replacement of size n (SRSWOR)
• Simple random sample with replacement of size n (SRSWR)
• Cluster sample
• Stratified sample: draw an equal proportion (e.g., 1/2) from each stratum
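Three of the schemes above can be sketched with the standard library (the data and strata are made up for illustration):

```python
# SRSWOR, SRSWR, and a stratified sample drawing an equal proportion
# (here 1/2) from each stratum.
import random
random.seed(0)                      # for a reproducible illustration

data = list(range(100))

srswor = random.sample(data, 10)                   # without replacement
srswr = [random.choice(data) for _ in range(10)]   # with replacement

strata = {"young": list(range(40)), "old": list(range(40, 100))}
stratified = [v for s in strata.values()
              for v in random.sample(s, len(s) // 2)]
print(len(srswor), len(stratified))   # 10 50
```

A cluster sample would instead pick whole strata (clusters) at random and keep all of their members.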
Data transformation
• Smoothing: remove noise from the data
• Aggregation: summarization or aggregation operations are applied to the data
• Generalization: low-level or "primitive" data are replaced by higher-level concepts through the use of concept hierarchies
• Normalization: attribute data are scaled so as to fall within a small specified range, say 0.0 to 1.0
• Attribute construction: new attributes are constructed and added from the given set of attributes to help the mining process, e.g., from continuous to discrete (discretization) and from discrete to continuous (word embedding)
Min-max and z-score normalization
• Min-max normalization: suppose min_A and max_A are the minimum and maximum values of attribute A. We map a value v of A to v' in the range [new_min_A, new_max_A] by

  v' = ((v − min_A) / (max_A − min_A)) (new_max_A − new_min_A) + new_min_A

  Example: suppose min_A and max_A of income are $12,000 and $98,000, and we want to map income to the range [0.0, 1.0]. Then $73,600 is transformed to

  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0 = 0.716

• Z-score normalization: the values of attribute A are normalized based on the mean Ā and standard deviation σ_A of A:

  v' = (v − Ā) / σ_A

  Example: if the mean and standard deviation of income are $54,000 and $16,000, then $73,600 is transformed to

  (73,600 − 54,000) / 16,000 = 1.225
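Both normalizations are one-liners; here they are applied to the slide's income example:

```python
# Min-max normalization to [new_min, new_max], and z-score normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
```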
Discretization
• Three types of attributes:
  - Nominal (categorical): red, yellow, blue, green
  - Ordinal: small, middle, large, extremely large
  - Continuous: real numbers
• Discretization divides the range of a continuous attribute into intervals:
  - Some classification algorithms only accept categorical attributes
  - It reduces data size
  - It prepares the data for further analysis
Discretization
• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Entropy-based discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

• The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
• The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., the information gain falls below a threshold δ:

  Ent(S) − E(T, S) < δ

• Experiments show that it may reduce data size and improve classification accuracy.
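One step of this procedure, choosing the boundary T that minimizes E(S, T) for a list of (value, class) samples, can be sketched as follows (the tiny sample set is made up):

```python
# Entropy-based binary discretization: evaluate every candidate boundary
# (midpoint between adjacent sorted values) and keep the one minimizing
# the weighted entropy E(S, T).
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_split(samples):
    values = sorted(samples)                # sort (value, class) pairs by value
    best = None
    for i in range(1, len(values)):
        left = [c for _, c in values[:i]]
        right = [c for _, c in values[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        t = (values[i - 1][0] + values[i][0]) / 2
        if best is None or e < best[0]:
            best = (e, t)
    return best[1]

samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
print(best_split(samples))   # 6.5: cleanly separates the two classes
```

The recursive variant would re-apply `best_split` to each side of the boundary until the information gain drops below the threshold δ.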
What is word embedding?
• Word embedding: mapping a word (or phrase) from its original high-dimensional input space to a lower-dimensional numerical vector space.
• Word2vec is a group of related models that are used to produce word embeddings:
  - These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words.
  - Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector in the space.
• Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another.
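"Close proximity" is usually measured by cosine similarity between word vectors. A toy illustration with made-up 3-D vectors (real word2vec output has hundreds of dimensions and is learned from a corpus):

```python
# Cosine similarity between word vectors: words with similar contexts
# should have similar vectors, hence high cosine similarity.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

emb = {"king":  [0.9, 0.8, 0.1],     # hypothetical embeddings,
       "queen": [0.85, 0.75, 0.2],   # not trained word2vec vectors
       "apple": [0.1, 0.2, 0.9]}
print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))
# True
```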
Some more complex data transformations
A mapping f: X → F takes the input space X to a feature space F where the problem can be solved.
• Latent semantic indexing: a normalized words × documents co-occurrence matrix C is factorized as C ≈ U D V, where U is a words × dims matrix, D a dims × dims matrix, and V a dims × documents matrix.
• Topic models: C is factorized as C ≈ F Q, where F is a words × topics matrix and Q a topics × documents matrix.
Summary
• Data preprocessing is an important issue, as real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.
• Data integration combines data from multiple sources into a coherent data store.
• Data transformation routines convert the data into appropriate forms for analysis.
• Data reduction techniques can be used to obtain a reduced representation of the data while minimizing the loss of information content.
• Automatic generation of concept hierarchies can involve different techniques for numeric data, and may be based on the number of distinct values of attributes for categorical data.
• Data preprocessing remains an active area of research.
Homework
The "labor.arff" data provided by WEKA has 57 instances, 16 descriptive attributes, and a class attribute with two values, 'bad' and 'good'. The attributes of "labor.arff" have many missing values. Do the following:
(1) Use the methods in Lecture 6 to treat the missing values of all attributes in "labor.arff".
(2) Explain why the method you used for each attribute is appropriate.
Submit the written report (pdf) by July 7, 2017.
Hint:
1. You can use the ARFF-Viewer under 'Tools' in WEKA to visualize "labor.arff".
2. You have at least two ways to work on the labor data (labor.arff):
  • Use the tool 'arff2csv.zip' at our website http://www.jaist.ac.jp/~bao/K236/ to convert the data into Excel format, and do your preprocessing on the Excel representation, or
  • Take the 'labor' data from UCI: http://archive.ics.uci.edu/ml/machine-learning-databases/labor-negotiations/C4.5/ and store it in Excel format (or whatever you like) to process.