pre processing

Data preprocessing

Upload: vijay-kumar

Post on 15-Jun-2015

TRANSCRIPT

Page 1: Pre processing

Data preprocessing

Page 2: Pre processing

Data Preprocessing

• Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

• Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors.

Page 3: Pre processing

Data Preprocessing

• Data preprocessing is a proven method of resolving such issues.

• Data preprocessing prepares raw data for further processing.

• Data preprocessing is used in database-driven applications, such as customer relationship management, and in rule-based and learning-based applications (such as neural networks).

Page 4: Pre processing

Number of data preprocessing techniques

• Data cleaning

• Data integration

• Data transformation

• Data reduction

Page 5: Pre processing

Data Preprocessing Techniques

Page 6: Pre processing

Data Preprocessing Techniques

• Data cleaning: can be applied to remove noise and correct inconsistencies in the data.

• Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse.

• Data transformation: operations such as normalization may be applied to bring the data into a form suitable for mining.

• Data reduction: can reduce the data size by, for instance, aggregating, eliminating redundant features, or clustering.

Page 7: Pre processing

Data cleaning

• Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

• If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to them.

• Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust.

Page 8: Pre processing

Data cleaning

• Therefore, a useful preprocessing step is to run your data through some data cleaning routines.
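As a rough illustration of what such cleaning routines might look like, the sketch below fills missing values with the median and flags outliers with an interquartile-range rule. The sample values, the function names, and the 1.5 × IQR threshold are assumptions chosen for illustration; the slides do not prescribe a specific method.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical raw column with two missing entries and one implausible age.
ages = [23, 25, None, 24, 26, 22, 120, None, 27]

filled = fill_missing(ages)
outliers = find_outliers(filled)
cleaned = [v for v in filled if v not in outliers]
```

Whether an outlier should be removed, capped, or kept depends on the application; here it is simply dropped to keep the sketch short.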

Page 9: Pre processing

Data integration

• Suppose you would like to include data from multiple sources in your analysis.

• This would involve integrating multiple databases, data cubes, or files, that is, data integration.

• Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies.
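The naming problem can be sketched as follows: two sources store the same customer identifier under different attribute names, so the names are unified before the records are merged. The sources, the attribute names (`customer_id`, `cust_id`), and the helper functions are all hypothetical and serve only as a minimal example.

```python
# Hypothetical records from two sources that name the same concept differently.
crm_rows = [
    {"customer_id": 1, "city": "Pune"},
    {"customer_id": 2, "city": "Delhi"},
]
billing_rows = [
    {"cust_id": 1, "total": 250.0},
    {"cust_id": 3, "total": 80.0},
]

# Assumed mapping from source-specific attribute names to one shared name.
RENAME = {"cust_id": "customer_id"}

def normalize_keys(row, rename):
    """Return a copy of row with source-specific attribute names unified."""
    return {rename.get(key, key): value for key, value in row.items()}

def integrate(*sources, key="customer_id"):
    """Merge rows that share the same key value into a single record."""
    merged = {}
    for rows in sources:
        for row in rows:
            row = normalize_keys(row, RENAME)
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

integrated = integrate(crm_rows, billing_rows)
```

Customers that appear in only one source keep whatever attributes that source provides, which is one reason further cleaning is usually needed after integration.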

Page 10: Pre processing

Data integration

• Having a large amount of redundant data may slow down or confuse the knowledge discovery process.

• Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration.

• Typically, data cleaning and data integration are performed as a preprocessing step when preparing the data for a data warehouse.

Page 11: Pre processing

Data integration

• Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
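One common way to detect redundant attributes after integration is correlation analysis: if two numeric attributes are almost perfectly correlated, one of them may be derivable from the other. The sketch below is a minimal example under that assumption; the column names, the data, and the 0.95 threshold are illustrative and not taken from the slides.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.95):
    """List attribute pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Hypothetical integrated table: the same price arrived from two sources
# in different units.
columns = {
    "price_usd": [10.0, 20.0, 30.0, 40.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0],
    "quantity": [3.0, 1.0, 4.0, 2.0],
}

pairs = redundant_pairs(columns)
```

A high correlation only suggests redundancy; deciding which attribute to drop still requires knowledge of the data.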

Page 12: Pre processing

Data Transformation

• Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering.

• Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0].

Page 13: Pre processing

Data Transformation

• You soon realize that data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that would contribute toward the success of the mining process.
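Scaling to a range such as [0.0, 1.0] is commonly done with min-max normalization, which maps each value v to (v - min) / (max - min). The sketch below assumes that technique; the function name and the sample incomes are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # A constant column carries no spread; map every value to new_min.
        return [new_min for _ in values]
    return [
        new_min + (v - old_min) / span * (new_max - new_min)
        for v in values
    ]

# Hypothetical attribute with a wide numeric range.
incomes = [12000, 35000, 58000, 98000]
scaled = min_max_normalize(incomes)
```

Without this step, an attribute with a large range (such as income) would dominate distance computations over attributes with small ranges (such as age).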

Page 14: Pre processing

Data Reduction

• Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.

• There are a number of strategies for data reduction.

• These include data aggregation, attribute subset selection, dimensionality reduction, and numerosity reduction.
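Of these strategies, data aggregation is the simplest to sketch: detailed records are rolled up to a coarser level that still answers the analytical question. The example below assumes hypothetical quarterly sales figures and an analysis that only needs annual totals.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
quarterly_sales = [
    ("2013", "Q1", 200.0), ("2013", "Q2", 300.0),
    ("2013", "Q3", 250.0), ("2013", "Q4", 400.0),
    ("2014", "Q1", 220.0), ("2014", "Q2", 310.0),
    ("2014", "Q3", 260.0), ("2014", "Q4", 410.0),
]

def aggregate_by_year(rows):
    """Sum the quarterly amounts into one total per year."""
    totals = defaultdict(float)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
```

Eight records shrink to two, and any analysis that works at the yearly level gives the same result on the reduced data.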

Page 15: Pre processing

Data Reduction

• Data can also be “reduced” by generalization with the use of concept hierarchies, where low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as region or province or state.

• A concept hierarchy organizes the concepts into varying levels of abstraction.

• Data discretization is a form of data reduction that is very useful for the automatic generation of concept hierarchies from numerical data.
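Both ideas can be sketched briefly. The first function climbs one level of a concept hierarchy by replacing cities with provinces; the second discretizes a numeric attribute with equal-width binning, which is one of several possible discretization methods and is an assumption here. The city-to-province mapping, the sample values, and the function names are hypothetical.

```python
# Assumed one-level concept hierarchy: city -> province.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
}

def generalize(cities, hierarchy):
    """Replace each low-level concept with its higher-level concept."""
    return [hierarchy.get(city, "Unknown") for city in cities]

def equal_width_bins(values, n_bins):
    """Assign each value the index of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

locations = generalize(["Vancouver", "Toronto", "Victoria"], CITY_TO_PROVINCE)
age_bins = equal_width_bins([18, 22, 35, 47, 52, 60], n_bins=3)
```

The bin indices can then be labeled (for example "young", "middle-aged", "senior") to form the lowest level of a concept hierarchy for the numeric attribute.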

Page 16: Pre processing

Thank you