What is data science

Post on 05-Jul-2015

What is Data Science

Big Data Dive, 20.09.2012

Data is everywhere

Apps with data

Google Page Rank

Amazon Recommendations

Meteorology

Healthcare

Big Data processing

Definition of Data Science

Data Science is…

• Data Engineering

• Scientific Method

• Math

• Statistics

• Advanced Computing

• Visualization

• Hacker mindset

• Domain Expertise

Data Science is…

• A/B testing

• Association rule learning

• Classification

• Cluster analysis

• Crowdsourcing

• Data fusion and integration

• Data mining

• Ensemble learning

• Genetic algorithms

• Machine learning

• Massive parallel-processing

• Natural language processing

• Neural networks

• Pattern recognition

• Predictive modelling

• Regression

• Sentiment analysis

• Signal processing

• Simulation

• Time series analysis

• Visualization

Data Science is…

• Explore data

• Build model

• Apply model

The most important goal of data science is prediction
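The three steps above (explore data, build model, apply model) can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the data are invented, the function names are hypothetical, and ordinary least squares stands in for whatever model a real project would choose.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def explore(xs, ys):
    """Explore: summarize the data before modelling."""
    return {"n": len(xs), "mean_x": mean(xs), "mean_y": mean(ys)}


def build_model(xs, ys):
    """Build: fit y = slope * x + intercept by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def apply_model(model, x):
    """Apply: predict y for a value of x the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


summary = explore(xs, ys)
model = build_model(xs, ys)
prediction = apply_model(model, 6.0)
print(summary, model, prediction)
```

The final call is the prediction step: the fitted line is evaluated at a point outside the training data.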

Process

Explore data

• Preprocessing

  • Data cleaning

  • Transformations

  • Subset selection

  • Feature selection

  • Discretization

  • Binarization

  • Normalization

  • Generalization

• Investigation

  • Plots

  • Histograms

  • Smoothing

  • Plot matrices

  • Distributions

  • Multidimensional scaling

  • Classification trees

  • Correlation matrices
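Two of the preprocessing steps listed above, normalization and discretization, can be sketched in a few lines. The slides do not say which variants were meant, so this assumes min-max normalization and equal-width binning; the function names and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretize values into n_bins intervals of equal width (bin indices from 0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 32, 47, 60]  # hypothetical sample
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 3)
print(normalized)
print(bins)
```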

Example: Binarization
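The slide's own example is not part of this transcript. As a stand-in, here is a minimal sketch of one common reading of binarization: replacing a categorical attribute with one binary (0/1) attribute per category. The function name and the sample data are hypothetical, and other readings exist (for instance, thresholding a numeric value into 0 or 1).

```python
def binarize(values):
    """Turn a categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical attribute
categories, encoded = binarize(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each output row contains exactly one 1, in the column of the category the original value belonged to.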

Example: Plot Matrices
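The figure from this slide is not part of the transcript. A plot matrix draws one scatter plot for every pair of features; the correlation matrix from the investigation list uses the same pairwise layout, with one number per pair instead of one plot. The sketch below computes that numeric counterpart with Pearson correlation. The data and function names are hypothetical, and it does not draw anything.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of named columns, as a nested dict."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical measurements, invented for illustration only.
data = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "errors": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(data)

for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```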

Build model

• Artificial neural networks

• Association rules

• Bayesian networks

• Clustering

• Decision trees

• Generalized linear models

• Genetic programming

• Inductive logic programming

• Sparse dictionaries

• Support vector machines

• Reinforcement learning

• Representation learning

Example: Decision Trees
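The slide's own example is not part of this transcript. As a stand-in, here is a minimal sketch of the smallest possible decision tree: a single split on one numeric feature (often called a decision stump), chosen to minimize misclassifications. A full decision tree repeats this kind of split recursively on each branch and usually scores splits by Gini impurity or information gain; the data, labels, and function names here are hypothetical.

```python
def fit_stump(xs, labels):
    """Find the threshold on one numeric feature with the fewest misclassifications."""
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values to split")
    # Candidate thresholds lie midway between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        # Try both ways of assigning the two class labels to the two branches.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                1 for x, y in zip(xs, labels)
                if (left if x <= t else right) != y
            )
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    errors, threshold, left, right = best
    return {"threshold": threshold, "left": left, "right": right, "errors": errors}


def predict(stump, x):
    """Apply the fitted split to a new value."""
    return stump["left"] if x <= stump["threshold"] else stump["right"]


# Hypothetical training data: body temperature and a 0/1 "fever" label.
temps = [36.5, 36.8, 37.0, 38.2, 38.9, 39.4]
fever = [0, 0, 0, 1, 1, 1]

stump = fit_stump(temps, fever)
print(stump)
print(predict(stump, 36.0), predict(stump, 40.0))
```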

Apply model

Tools

R

• Open source programming language and software environment

• Designed for statistical computing and graphics

• CRAN (The Comprehensive R Archive Network) – 5300 packages and counting

• In 2010, R became the data mining tool used by more data miners (43%) than any other

Mathematical packages

They make the presentation better

• Google Prediction API

• Microsoft Analysis Services

• Oracle Data Mining

Python

• Well recognized in scientific and engineering computing

• General-purpose scientific libraries: NumPy, SciPy, Matplotlib, python-multiprocessing

• Statistical, data mining, and machine learning packages: scikit-learn, pandas, PyBrain

Thank you!

Andrei Paleyes

apalees@gmail.com

andrey.palees@altoros.com

Skype: andrei.paleyes