What is data science
Post on 05-Jul-2015
What is Data Science
Big Data Dive, 20.09.2012
Data is everywhere
Apps with data
Google PageRank
Amazon Recommendations
Meteorology
Healthcare
Big Data processing
Definition of Data Science
Data Science is…
• Data Engineering
• Scientific Method
• Math
• Statistics
• Advanced Computing
• Visualization
• Hacker mindset
• Domain Expertise
Data Science is…
• A/B testing
• Association rule learning
• Classification
• Cluster analysis
• Crowdsourcing
• Data fusion and integration
• Data mining
• Ensemble learning
• Genetic algorithms
• Machine learning
• Massively parallel processing
• Natural language processing
• Neural networks
• Pattern recognition
• Predictive modelling
• Regression
• Sentiment analysis
• Signal processing
• Simulation
• Time series analysis
• Visualization
Data Science is…
• Explore data
• Build model
• Apply model
The most important goal of data science is
prediction
Process
Explore data
• Preprocessing
  • Data cleaning
  • Transformations
  • Subset selection
  • Feature selection
  • Discretization
  • Binarization
  • Normalization
  • Generalization
• Investigation
  • Plots
  • Histograms
  • Smoothing
  • Plot matrices
  • Distributions
  • Multidimensional scaling
  • Classification trees
  • Correlation matrices
Example: Binarization
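The original slide showed this example as an image, which the transcript does not preserve. As a stand-in, here is a minimal sketch of two common forms of binarization: thresholding a numeric attribute, and turning a categorical attribute into one 0/1 attribute per category. The function names and the sample data are hypothetical, chosen only for illustration.

```python
def binarize_threshold(values, threshold):
    """Map each numeric value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def binarize_categories(values):
    """Encode a categorical column as one 0/1 attribute per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical data, made up for this sketch.
ages = [15, 22, 37, 64]
colors = ["red", "green", "red", "blue"]

is_adult = binarize_threshold(ages, threshold=17)
categories, color_bits = binarize_categories(colors)

print(is_adult)    # [0, 1, 1, 1]
print(categories)  # ['blue', 'green', 'red']
print(color_bits)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```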
Example: Plot Matrices
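The plot from the original slide is not preserved in the transcript. A plot matrix can be sketched with Matplotlib (listed under Python tools in this deck) as a grid of pairwise scatter plots with a histogram of each feature on the diagonal. The `plot_matrix` helper and the measurements are hypothetical, and the layout choices (bin count, figure size) are assumptions rather than anything taken from the talk.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for an interactive window
import matplotlib.pyplot as plt


def plot_matrix(columns):
    """Draw pairwise scatter plots, with a histogram of each feature on the diagonal.

    `columns` maps a feature name to a list of numeric values.
    """
    names = list(columns)
    n = len(names)
    fig, axes = plt.subplots(n, n, figsize=(2.5 * n, 2.5 * n), squeeze=False)
    for i, row_name in enumerate(names):
        for j, col_name in enumerate(names):
            ax = axes[i][j]
            if i == j:
                ax.hist(columns[row_name], bins=5)
            else:
                # x axis: column feature, y axis: row feature
                ax.scatter(columns[col_name], columns[row_name], s=10)
            if i == n - 1:
                ax.set_xlabel(col_name)  # label only the bottom row
            if j == 0:
                ax.set_ylabel(row_name)  # label only the left column
    fig.tight_layout()
    return fig, axes


# Hypothetical measurements, made up for this sketch.
data = {
    "height": [150, 160, 165, 172, 180, 185],
    "weight": [50, 56, 61, 68, 77, 84],
    "age": [31, 24, 45, 38, 29, 52],
}
fig, axes = plot_matrix(data)
```

Calling `fig.savefig("plot_matrix.png")` afterwards writes the grid to a file. Strong pairwise relationships, such as height against weight in this made-up data, show up as narrow diagonal bands in the off-diagonal panels.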
Build model
• Artificial neural networks
• Association rules
• Bayesian networks
• Clustering
• Decision trees
• Generalized linear models
• Genetic programming
• Inductive logic programming
• Sparse dictionaries
• Support vector machines
• Reinforcement learning
• Representation learning
Example: Decision Trees
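The tree diagram from the original slide is not preserved in the transcript. The sketch below shows one simple way a decision tree can be built and applied: greedy binary splits on numeric features chosen by Gini impurity. It is an illustrative toy under those assumptions, not the method used in the talk, and the weather-style data and all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def partition(rows, labels, feature, threshold):
    """Split rows and labels into (left, right) by `row[feature] <= threshold`."""
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return left, right


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left, right = partition(rows, labels, feature, threshold)
            if not left or not right:
                continue
            score = sum(
                len(side) * gini([y for _, y in side]) for side in (left, right)
            ) / len(labels)
            if score < best_score:  # keep only strict improvements
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; a leaf is simply the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left, right = partition(rows, labels, feature, threshold)
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Apply the model: walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical data: [temperature in °C, humidity in %] -> activity.
rows = [[30, 85], [27, 90], [22, 70], [18, 65], [25, 60], [12, 80]]
labels = ["stay", "stay", "play", "play", "play", "stay"]

tree = build_tree(rows, labels)
print(tree)  # {'feature': 1, 'threshold': 70, 'left': 'play', 'right': 'stay'}
print(predict(tree, [20, 75]))  # stay
print(predict(tree, [28, 55]))  # play
```

On this made-up data a single split on humidity separates the two classes, so the tree has one internal node. The same explore, build, apply sequence holds for larger data, where a library such as Scikit-learn would normally replace the hand-written code.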
Apply model
Tools
R
• Open source programming language and software environment
• Designed for statistical computing and graphics
• CRAN (The Comprehensive R Archive Network) – 5300 packages and counting
• In 2010, R became the data mining tool used by more data miners (43%) than any other
Mathematical packages
They make presentation better
• Google Prediction API
• Microsoft Analysis Services
• Oracle Data Mining
Python
• Well established in scientific and engineering computing
• General purpose scientific libraries: NumPy, SciPy, Matplotlib, python-multiprocessing
• Statistical, data mining, machine learning packages: Scikit-learn, Pandas, PyBrain
Thank you!
Andrei Paleyes
apalees@gmail.com
andrey.palees@altoros.com
Skype: andrei.paleyes