data scientist's daily life

DATA SCIENTIST’S DAILY LIFE BRYAN YANG 2015.09

Upload: bryan-yang

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Data Scientist's Daily Life

DATA SCIENTIST’S DAILY LIFE BRYAN YANG 2015.09

Page 2: Data Scientist's Daily Life

ABOUT ME

• Blog: Bryan的行銷研究及資料分析筆記 (Bryan's Notes on Marketing Research and Data Analysis) http://bryannotes.blogspot.tw

• Group: Spark.TW

Page 3: Data Scientist's Daily Life

AGENDA

• Data scientist?

• Big data and data scientist

• Data scientist’s Toolbox

• Data is the biggest

Page 4: Data Scientist's Daily Life

Derive Knowledge

from Big Data

Page 5: Data Scientist's Daily Life

Efficiently

and

Intelligently

Page 6: Data Scientist's Daily Life

FROM BACKEND TO FRONTEND

https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/

Page 7: Data Scientist's Daily Life

WHAT IS BIG DATA?

Page 8: Data Scientist's Daily Life

WHERE DOES THE DATA COME FROM?

• Web Log data

• Machine data

• Transactional data

• Social media data

• …

Page 9: Data Scientist's Daily Life

https://plus.google.com/+DigitalStrategyIE

Page 10: Data Scientist's Daily Life
Page 11: Data Scientist's Daily Life

A WEB SERVICE RECEIVES MORE THAN 50 GB OF LOG DATA PER DAY

TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500 GB

TOTAL SPACE USED IN THE LAST YEAR: 18,000 GB (17.6 TB)
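The totals on this slide are plain multiplication. A minimal sketch, assuming 30-day months and a 360-day year (the values that reproduce 4,500 GB and 18,000 GB) and 1 TB = 1,024 GB; the function and variable names are illustrative only:

```python
GB_PER_DAY = 50  # daily log volume quoted on the slide


def storage_gb(days, gb_per_day=GB_PER_DAY):
    """Total space needed to keep `days` days of logs."""
    return days * gb_per_day


three_months_gb = storage_gb(90)    # assumption: 3 months = 90 days
one_year_gb = storage_gb(360)       # assumption: 1 year = 360 days
one_year_tb = one_year_gb / 1024    # assumption: 1 TB = 1,024 GB

print(three_months_gb, one_year_gb, round(one_year_tb, 1))
```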

Page 12: Data Scientist's Daily Life

• Data Storage / Backup

• 2 TB per HDD

• How do we store more than 2 TB of data?

• $0.3 USD per gigabyte

• Pay 900 USD just to KEEP the data, without doing anything else with it.

• Read/Write Speed

• Read: 131.6 MB/s / Write: 131.4 MB/s

• Spend 393 s (over 6 min) just reading ONE day of data.

• A large number of transactions arriving at once
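The read-time figure can be checked with a back-of-the-envelope calculation. A rough sketch, assuming exactly 50 GB per day, 1 GB = 1,024 MB, and a single disk read sequentially at the quoted 131.6 MB/s; the slide's 393 s is slightly higher, which fits its "more than 50G" wording:

```python
def read_seconds(size_gb, throughput_mb_s, mb_per_gb=1024):
    """Seconds needed to read `size_gb` sequentially from one disk."""
    return size_gb * mb_per_gb / throughput_mb_s


one_day = read_seconds(50, 131.6)          # one day of logs
one_month = read_seconds(50 * 30, 131.6)   # assumption: 30 days of logs

print(f"one day: {one_day:.0f} s, 30 days: {one_month / 3600:.1f} h")
```

On a single disk the time grows linearly with the data, which is the motivation for spreading both storage and reads across many machines.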

Page 13: Data Scientist's Daily Life

HADOOP AND MAPREDUCE

Page 14: Data Scientist's Daily Life

HADOOP AND HDFS

http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/

Page 15: Data Scientist's Daily Life
Page 16: Data Scientist's Daily Life
Page 17: Data Scientist's Daily Life

– DISTRIBUTED ALGORITHM

“The world will change when data is distributed.”

Page 18: Data Scientist's Daily Life

MAP REDUCE

http://www.milanor.net/blog/?p=853
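The MapReduce idea can be shown in a few lines of single-process Python. This is only a sketch of the map, shuffle, and reduce phases, not Hadoop's API; the log lines and function names are made up for illustration:

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, 1) pair per record; the key is the date field."""
    fields = line.split()
    if fields:  # skip blank lines
        yield fields[0], 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical web log lines: date, method, path.
log_lines = [
    "2015-09-01 GET /index.html",
    "2015-09-01 GET /about.html",
    "2015-09-02 GET /index.html",
]
counts = run_job(log_lines)
print(counts)
```

In a real cluster the map and reduce calls run on different nodes, and the shuffle moves data over the network; the program structure stays the same.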

Page 19: Data Scientist's Daily Life

https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/

Page 20: Data Scientist's Daily Life

http://blog.agro-know.com/?p=3810

Page 21: Data Scientist's Daily Life

PERFORMANCE OF HADOOP?

• Not great, but at least the job runs.

• Counting 86,389,084 rows (one day of data) takes 39 sec. (10 nodes, each with 64 GB RAM and 2 × 8-core E5 CPUs)

• How about 39 sec × 30 days?
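The closing question is easy to put into numbers. A small sketch, assuming the job time scales linearly with the number of days scanned (an assumption, since real clusters rarely scale perfectly):

```python
ROWS_PER_DAY = 86_389_084   # row count quoted on the slide
SECONDS_PER_DAY_SCAN = 39   # time to count one day of rows

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY_SCAN
thirty_days_seconds = SECONDS_PER_DAY_SCAN * 30  # assumption: linear scaling

print(f"{rows_per_second:,.0f} rows/s")
print(f"30 days: {thirty_days_seconds} s = {thirty_days_seconds / 60} min")
```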

Page 22: Data Scientist's Daily Life

BEFORE ANALYTICS…

Page 23: Data Scientist's Daily Life

EXTRACT, TRANSFORM, LOAD

http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html
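An ETL job can be as small as three functions. A minimal sketch using only the Python standard library, assuming a made-up log format (timestamp, path, status) and an in-memory SQLite database as the load target; none of these names come from the slides:

```python
import sqlite3

# Hypothetical raw input; the last line is deliberately malformed.
RAW_LOGS = [
    "2015-09-01T10:00:00 /index.html 200",
    "2015-09-01T10:00:05 /missing 404",
    "malformed line",
]


def extract(lines):
    """Extract: keep only lines that have the three expected fields."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield parts


def transform(records):
    """Transform: cut the timestamp down to a day, cast status to int."""
    for timestamp, path, status in records:
        yield timestamp[:10], path, int(status)


def load(rows, conn):
    """Load: write the cleaned rows into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (day TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LOGS)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(row_count)
```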

Page 24: Data Scientist's Daily Life

http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds

Page 25: Data Scientist's Daily Life

http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform

Page 26: Data Scientist's Daily Life

DATA SCIENTIST’S TOOLBOX

Page 27: Data Scientist's Daily Life

LINUX

• The best server choice

• Free of charge, and free as in freedom

• Easy to control the system

• Easy data processing

• Hadoop runs primarily on Linux

Page 28: Data Scientist's Daily Life
Page 29: Data Scientist's Daily Life

POWERFUL SHELL SCRIPTS

Page 30: Data Scientist's Daily Life

SQL DATABASE

• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)

• Standard SQL Language

• Store and Manage data

Page 31: Data Scientist's Daily Life

RELATIONAL DATABASE

Page 32: Data Scientist's Daily Life

TABLE RELATIONS

https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/

Page 33: Data Scientist's Daily Life

http://ghtorrent.org/relational.html

Page 34: Data Scientist's Daily Life

SQL SYNTAX
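A typical query joins related tables and aggregates the result. A small sketch run through Python's built-in sqlite3 module; the `users` and `orders` tables and their contents are hypothetical, chosen only to show SELECT, JOIN, GROUP BY, and ORDER BY together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        amount REAL
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# Orders per user and total spend, biggest spender first.
query = """
    SELECT u.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```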

Page 35: Data Scientist's Daily Life

R & PYTHON

• Basic Analysis Tools

• Easy to Learn

• Many Packages

Page 36: Data Scientist's Daily Life
Page 37: Data Scientist's Daily Life
Page 39: Data Scientist's Daily Life

ETC…

• Excel

• Google Analytics

• Visualisation tools (Tableau)

• Web Crawler

• Version control management (Git)

• ETL and job scheduling tools (Jenkins)

• …

Page 40: Data Scientist's Daily Life

DATA IS THE BIGGEST

Page 41: Data Scientist's Daily Life

– JOSH WILLS

“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

Page 42: Data Scientist's Daily Life

STATISTICS

Page 43: Data Scientist's Daily Life

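WHY DO WE NEED MACHINE LEARNING?

• Clustering: How many groups can these people be divided into?

• Classification: Which group does each person belong to?

• Regression: What is the probability that an event occurs, or that a person belongs to a given group?

• Dimensionality reduction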

Page 44: Data Scientist's Daily Life

CLUSTERING

http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/

source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
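Both links illustrate k-means. A minimal standard-library sketch of the usual assign-then-update loop (Lloyd's algorithm), assuming 2-D points and hand-picked starting centroids; the data is invented for illustration:

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
```

The result depends on the starting centroids, so real implementations usually run several random initialisations and keep the best one.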

Page 45: Data Scientist's Daily Life

CLASSIFICATION

http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
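The linked example uses k-nearest neighbours. A minimal sketch of that classifier, assuming Euclidean distance and a simple majority vote; the training points and labels are invented:

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: (features, label).
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
label_near_a = knn_predict(train, (2, 2))
label_near_b = knn_predict(train, (7, 8))
print(label_near_a, label_near_b)
```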

Page 46: Data Scientist's Daily Life

http://www.astroml.org/sklearn_tutorial/

Page 47: Data Scientist's Daily Life

LOGISTIC REGRESSION

https://www.coursera.org/instructor/andrewng

Page 48: Data Scientist's Daily Life

COST FUNCTION

https://www.coursera.org/instructor/andrewng
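Logistic regression and its cost function fit in a short script. A minimal sketch, assuming a single feature, the usual cross-entropy cost, and plain batch gradient descent; this is not necessarily the exact formulation shown on the slides, and the data is invented:

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, xs, ys, eps=1e-12):
    """Mean cross-entropy between predictions and 0/1 labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(xs)


def fit(xs, ys, lr=0.1, steps=2000):
    """Batch gradient descent on the cross-entropy cost."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: small x values are class 0, large ones class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

cost_before = cost(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
cost_after = cost(w, b, xs, ys)
print(f"cost: {cost_before:.3f} -> {cost_after:.3f}")
```

The learning rate and step count are arbitrary choices that happen to work for this toy data.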

Page 49: Data Scientist's Daily Life

OVERFITTING

https://www.coursera.org/instructor/andrewng

Page 50: Data Scientist's Daily Life

OH MY GOD! HOW DO WE CHOOSE?

Page 51: Data Scientist's Daily Life

MACHINE LEARNING ALGORITHMS

http://amueller.github.io/sklearn_tutorial/

Page 52: Data Scientist's Daily Life

STATISTICS VS ML

STATISTICS

• Focus on understanding data in terms of models

• Interpretability, hypothesis testing

MACHINE LEARNING

• Focus on the analysis of learning algorithms

• Greater focus on prediction

Page 53: Data Scientist's Daily Life

SYSTEMATICS AND AUTOMATION

http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal

Page 54: Data Scientist's Daily Life

http://mlg.postech.ac.kr/projects/

Page 55: Data Scientist's Daily Life

SHOW YOUR DATA AND FINDINGS

Page 56: Data Scientist's Daily Life

http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png

Page 57: Data Scientist's Daily Life

http://www.tableau.com

Page 58: Data Scientist's Daily Life

http://www.tableau.com

Page 59: Data Scientist's Daily Life

http://www.tableau.com

Page 60: Data Scientist's Daily Life

THE REAL CASE

Page 61: Data Scientist's Daily Life

HOW TO START?

Page 62: Data Scientist's Daily Life

• Codecademy http://www.codecademy.com/ Covers many programming languages, e.g. Python, JavaScript, even shell script and SQL.

• Coursera https://www.coursera.org/ A well-known MOOC website for self-learning.

Page 63: Data Scientist's Daily Life

http://nirvacana.com/thoughts/becoming-a-data-scientist/