Data Scientists
Leonid Zhukov
Higher School of Economics, Moscow, 2013 www.hse.ru
The Sexiest Job of the 21st Century
McKinsey estimates a shortage of 140,000-190,000 data scientists by 2018
Data Scientists wanted!
Supply and demand
Who are Data Scientists?
Some backgrounds are better than others:
• Computer Science
• Statistics (mathematics)
• Natural sciences with a strong quantitative component
• PhDs, but not exclusively
Data Scientist:
• Loves data
• Has an investigative mindset
• Aims to find patterns in data and build data-driven products
• Is a practitioner, not a theorist
• Has hands-on skills
• Domain expertise (*)
• Team player
demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and the right tools to process it just becoming available.
Although data science is generating new opportunities, our capacity to train new data scientists is not keeping up, and nearly two-thirds of respondents foresee a looming shortfall in the number of data scientists over the next five years. This aligns with other research, including a recent McKinsey Global Institute study that predicts a shortage of 190,000 data scientists by the year 2019 [iii]. And when our respondents were asked where the best source for talent was, few looked to today’s business intelligence professionals. Instead, nearly two-thirds looked to today’s university students.
Who is the Data Scientist?
Although the term data science has been around for decades – indeed, most scientists use data of some form – the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn [iv]. But as a new term, the field is still very much in flux, and without evidence about the practitioners, we’re left to speculate about what it may mean. In our survey, we allowed users to self-identify as “data science professionals,” in order to avoid conflicts over terminology in job titles. In this section we’ll attempt to define the data scientist by comparing them with the previous big player in the analytics space, business intelligence professionals.
Twenty years ago, business intelligence was itself a new term, just emerging to take over the various database management and decision support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition,
The best source of new Data Science talent is:
• Students studying computer science: 34%
• Professionals in disciplines other than IT or computer science: 27%
• Students studying fields other than computer science: 24%
• Today's BI professionals: 12%
• Other: 3%
Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception has on everything from disease conditions and GDP to worker productivity and consumer behavior. He works with massive data sets linking perception with actual behavior, and micro- and macroeconomic outcomes. His work has isolated emotional factors that are most highly related to outcomes organizations care about.
From: EMC Data Science Community Survey, 2011
What do Data Scientists do?
• Designs customized systems and tools
• Works with structured and unstructured data
• Creates data processing pipelines
• Analyzes massive datasets (TB, PB)
• Builds predictive models
• Creates visualizations
• Designs data products
• Uses Hadoop, MapReduce, Hive, Python, R
Tools of the trade
• Operating systems:
  • Linux + shell tools
• Big data instruments:
  • Hadoop (MapReduce) + Hadoop tools
  • Hive, Pig
  • NoSQL (HBase, MongoDB, Cassandra, Neo4j)
• Database:
  • SQL
• Programming:
  • Python
  • Java
  • Scala
• Machine Learning:
  • R
  • Matlab
  • Python libraries (NumPy, SciPy, NLTK, SciKit)
  • Java libraries (Mahout)
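To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using the classic word-count task. It is plain Python and does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, standing in for files stored on a cluster.
counts = word_count(["big data tools", "Big data pipelines"])
```

In a real cluster the mapper and reducer would run on many machines and the shuffle would be handled by the framework; the sketch only shows how the work is divided.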
Required skills
• Programming
• Algorithms
• Statistics
• Data mining
• Machine learning
• NLP
• Distributed systems
• Big data tools
• Databases
• Visualization
From: Swami Chandrasekaran, Executive Architect, IBM, Watson Solutions
Data Scientist roles
From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012
Data Science “dream team”
From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013
Data Science project pipeline
• Learning a problem
• Acquiring data
• Parsing data
• Cleaning, filtering and organizing
• Exploring and mining for patterns
• Building models
• Visualizing results
• Communicating findings
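The pipeline steps can be sketched end to end on a toy dataset. The example below is an illustration only: the CSV content, the column names (day, visits, orders), and the choice of a least-squares line as the model are all assumptions, and only the Python standard library is used.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data standing in for the "acquiring data" step.
RAW = """day,visits,orders
1,100,10
2,150,16
3,,
4,200,19
5,250,26
"""


def parse(text):
    """Parsing: turn raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning and filtering: drop incomplete rows, convert to numbers."""
    return [
        {"visits": float(r["visits"]), "orders": float(r["orders"])}
        for r in rows
        if r["visits"] and r["orders"]
    ]


def explore(rows):
    """Exploring: one summary statistic as a stand-in for pattern mining."""
    return mean(r["orders"] / r["visits"] for r in rows)


def fit_line(rows):
    """Building a model: least squares fit of orders = slope * visits + intercept."""
    xs = [r["visits"] for r in rows]
    ys = [r["orders"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


rows = clean(parse(RAW))
conversion = explore(rows)
slope, intercept = fit_line(rows)

# Communicating findings: a plain-text summary in place of a chart.
report = (
    f"{len(rows)} usable days, mean conversion {conversion:.3f}, "
    f"about {slope:.3f} extra orders per extra visit"
)
```

Each function maps to one pipeline stage, so a stage can be replaced (for example, a richer model in place of fit_line) without touching the others.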
Business applications
• Marketing:
  • Market segmentation
  • Product and media mix analysis
  • Customer acquisition and churn modeling
  • Recommendation systems and cross-selling
  • Social media analysis
• Finance & Insurance:
  • Fraud prevention
  • Anomaly detection
  • Credit risk analysis
  • Usage-based insurance modeling
  • Portfolio optimization
• Healthcare and Pharmaceuticals:
  • Genetic analysis
  • Clinical trials analysis
  • Clinical decision support systems
Industry training
Course Outline: Cloudera Introduction to Data Science
Introduction
Data Science Overview
> What Is Data Science?
> The Growing Need for Data Science
> The Role of a Data Scientist
Use Cases
> Finance
> Retail
> Advertising
> Defense and Intelligence
> Telecommunications and Utilities
> Healthcare and Pharmaceuticals
Project Lifecycle
> Steps in the Project Lifecycle
> Lab Scenario Explanation
Data Acquisition
> Where to Source Data
> Acquisition Techniques
Evaluating Input Data
> Data Formats
> Data Quantity
> Data Quality
Data Transformation
> Anonymization
> File Format Conversion
> Joining Datasets
Data Analysis and Statistical Methods
> Relationship Between Statistics and Probability
> Descriptive Statistics
> Inferential Statistics
Fundamentals of Machine Learning
> Overview
> The Three Cs of Machine Learning
> Spotlight: Naïve Bayes Classifiers
> Importance of Data and Algorithms
Recommender Overview
> What Is a Recommender System?
> Types of Collaborative Filtering
> Limitations of Recommender Systems
> Fundamental Concepts
Introduction to Apache Mahout
> What Apache Mahout Is (and Is Not)
> A Brief History of Mahout
> Availability and Installation
> Demonstration: Using Mahout’s Item-Based Recommender
Implementing Recommenders with Apache Mahout
> Overview
> Similarity Metrics for Binary Preferences
> Similarity Metrics for Numeric Preferences
> Scoring
Experimentation and Evaluation
> Measuring Recommender Effectiveness
> Designing Effective Experiments
> Conducting an Effective Experiment
> User Interfaces for Recommenders
Production Deployment and Beyond
> Deploying to Production
> Tips and Techniques for Working at Scale
> Summarizing and Visualizing Results
> Considerations for Improvement
> Next Steps for Recommenders
Conclusion
Appendix A: Hadoop Overview
Appendix B: Mathematical Formulas
Appendix C: Language and Tool Reference
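The outline's topics of item-based recommenders and similarity metrics for binary preferences can be illustrated with a small sketch. It is plain Python and is not Mahout's API or the course's lab code: the preference data, the function names, and the choice of Jaccard similarity with summed scores are assumptions made for illustration.

```python
# Hypothetical binary preferences: user -> set of items the user liked.
PREFS = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b"},
    "cat": {"b", "c", "d"},
    "dan": {"a", "d"},
}


def users_by_item(prefs):
    """Invert the preference data: item -> set of users who liked it."""
    inverted = {}
    for user, items in prefs.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted


def jaccard(left, right):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def recommend(user, prefs, top_n=2):
    """Score each unseen item by its summed similarity to the user's items."""
    item_users = users_by_item(prefs)
    seen = prefs[user]
    scores = {
        candidate: sum(jaccard(item_users[candidate], item_users[s]) for s in seen)
        for candidate in item_users
        if candidate not in seen
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


suggestions = recommend("bob", PREFS)
```

Two items count as similar when largely the same users liked both, which is why the similarity is computed over sets of users rather than over item attributes.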
Cloudera Certified Professional: Data Scientist (CCP:DS)
Establish yourself as an expert by completing the certification exam for data scientists. CCP:DS is the highest level of technical certification Cloudera offers and certifies your knowledge and skills as a data scientist using Apache Hadoop on large data sets. The credential requires both a multiple-choice Data Science Essentials exam and a hands-on, performance-based Data Science Challenge with a real-world problem on a live system.
TRAINING SHEET
Cloudera Introduction to Data Science: Building Recommender Systems
Take your knowledge to the next level with Cloudera’s Data Science Training and Certification
Data scientists build information platforms to ask and answer previously unimaginable questions. Learn how data science helps companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
Cloudera University’s three-day course helps participants understand what data scientists do and the problems they solve. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and, ultimately, prepare for data scientist roles in the field.
Hands-On Hadoop
Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:
> The role of data scientists, vertical use cases, and business applications of data science
> Where and how to acquire data, methods for evaluating source data, and data transformation and preparation
> Types of statistics and analytical methods and their relationship
> Machine learning fundamentals and breakthroughs, the importance of algorithms, and data as a platform
> How to implement and manage recommenders using Apache Mahout and how to set up and evaluate data experiments
> Steps for deploying new analytics projects to production and tips for working at scale
Audience & Prerequisites
This course is suitable for developers, data analysts, and statisticians with basic knowledge of Apache Hadoop: HDFS, MapReduce, Hadoop Streaming, and Apache Hive. Students should have proficiency in a scripting language; Python is strongly preferred, but familiarity with Perl or Ruby is sufficient.
Data Scientist Certification
Following successful completion of the training class, attendees receive a Data Science Essentials practice test. Data Science Essentials plus the Data Science Challenge constitute the Cloudera Certified Professional: Data Scientist (CCP:DS). Certification is a great differentiator; it helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.
“The professionalism and expansive technical knowledge demonstrated by our instructor were incredible. The quality of the Cloudera training was on par with a university.”
GENERAL DYNAMICS
Educational programs
University programs:
• University of Washington: Certificate in Data Science
• UC Berkeley: Master of Information and Data Science program
• New York University: Data Science at NYU
• Columbia University: Institute for Data Sciences and Engineering
• University of Southern California (USC): Master of Science in Data Science
Online MOOC courses:
• Coursera
• edX
• Udacity
Accelerated educational programs:
• Zipfian Academy (12-week intensive program)
• Insight Data Science Fellows program (6-week postdoctoral training)
Conferences
• Industry conferences and meetings:
  • O’Reilly Strata Conference: Making Data Work
  • Hadoop World
  • Big Data Techcon
  • Big Data Innovation summits
• Meetups
• Academic conferences (peer reviewed):
  • IEEE & ACM Supercomputing
  • IEEE Big Data
  • ACM KDD Knowledge Discovery and Data Mining
  • ACM SIGIR Information Retrieval
  • ICML International Conference on Machine Learning
  • ICDM International Conference on Data Mining
  • NIPS Neural Information Processing Systems
  • WWW World Wide Web Conference
  • VLDB Very Large Data Bases
  • ACM CIKM Information and Knowledge Management
  • SIAM SDM International Conference on Data Mining
  • IEEE ICDE Data Engineering
  • IEEE Visualization
Textbooks
Open questions
• How important is domain expertise?
• What is needed more: education or experience?
• The future of data scientists: will they be replaced by software?
20, Myasnitskaya str., Moscow, Russia, 101000 Tel.: +7 (495) 628-8829, Fax: +7 (495) 628-7931
www.hse.ru