data scientist's daily life
DATA SCIENTIST’S DAILY LIFE
BRYAN YANG 2015.09
ABOUT ME
• Blog: Bryan的行銷研究及資料分析筆記 (Bryan's Marketing Research and Data Analysis Notes) http://bryannotes.blogspot.tw
• Group: Spark.TW
AGENDA
• Data scientist?
• Big data and data scientist
• Data scientist’s Toolbox
• Data is the biggest
Derive Knowledge from Big Data Efficiently and Intelligently
FROM BACKEND TO FRONTEND
https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/
WHAT IS BIG DATA?
WHERE DO THE DATA COME FROM
• Web Log data
• Machine data
• Transactional data
• Social media data
• …
https://plus.google.com/+DigitalStrategyIE
A WEB SERVICE RECEIVES MORE THAN 50G OF LOG DATA PER DAY
TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500G
TOTAL SPACE USED IN THE LAST YEAR: 18,000G (17.6T)
• Data Storage / Backup
• 2T per HDD
• How do we store MORE than 2T of data?
• $0.3 USD per gigabyte
• Pay 900 USD just for KEEPING the data, without doing anything else with it.
• Read/Write Speed
• Read: 131.6 MB/s / Write: 131.4 MB/s
• Spend 393s (about 6 min) reading just ONE day of data.
• Large numbers of transactions must be handled immediately
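The storage and read-time figures above can be checked with a quick calculation. This is a rough sketch, assuming exactly 50G per day (the slide says "more than 50G") and 1G = 1024 MB; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the storage and read-time numbers.
# Assumptions (not from the slides): exactly 50G/day, 1G = 1024 MB.

DAILY_LOG_GB = 50            # log volume per day
READ_MB_PER_S = 131.6        # sequential read speed quoted on the slide
HDD_CAPACITY_GB = 2 * 1024   # one 2T drive


def total_gb(days: int) -> int:
    """Space needed to keep `days` days of logs."""
    return DAILY_LOG_GB * days


def read_seconds(gigabytes: float) -> float:
    """Time to read `gigabytes` from a single disk at the quoted speed."""
    return gigabytes * 1024 / READ_MB_PER_S


def days_until_full(capacity_gb: float = HDD_CAPACITY_GB) -> float:
    """How many days of logs fit on one drive."""
    return capacity_gb / DAILY_LOG_GB


print("3 months :", total_gb(90), "G")
print("12 months:", total_gb(360), "G =", round(total_gb(360) / 1024, 1), "T")
print("read one day:", round(read_seconds(DAILY_LOG_GB)), "s")
print("one 2T drive fills up in", round(days_until_full()), "days")
```

With exactly 50G the read time comes out slightly under the 393s on the slide, which fits a daily volume a little above 50G. A single 2T drive fills up in roughly six weeks.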
HADOOP AND MAPREDUCE
HADOOP AND HDFS
http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/
“The world will change when data is distributed.”
– DISTRIBUTED ALGORITHM
MAP REDUCE
http://www.milanor.net/blog/?p=853
https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
http://blog.agro-know.com/?p=3810
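The map / shuffle / reduce flow shown in these figures can be sketched in a few lines of plain Python. This is a single-process illustration of the idea only, not Hadoop's API; the log format (timestamp, path, status) and all function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (path, 1) for each well-formed log line."""
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical sample log: "timestamp path status"
SAMPLE_LOG = [
    "2015-09-01T10:00:00 /home 200",
    "2015-09-01T10:00:02 /cart 200",
    "2015-09-01T10:00:05 /home 500",
]
counts = run_job(SAMPLE_LOG)
print(counts)
```

In a real cluster the map calls run on the nodes that hold each data block, and the shuffle moves data across the network; here everything happens in one process.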
PERFORMANCE OF HADOOP?
• Not great, but at least the job runs.
• Counting 86,389,084 rows (one day of data) takes 39 sec. (10 nodes, each with 64G RAM and 2 × 8-core E5 CPUs)
• How about 39 sec × 30 days?
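A quick calculation answers the last question, assuming the job scales linearly with the number of days (an assumption; real jobs carry fixed startup overhead and may scale differently).

```python
# Throughput and scan time derived from the figures on this slide.
# Assumption: scan time grows linearly with the amount of data.

ROWS_PER_DAY = 86_389_084
SECONDS_PER_DAY_OF_DATA = 39


def scan_seconds(days: int) -> int:
    """Estimated time to count `days` days of rows."""
    return SECONDS_PER_DAY_OF_DATA * days


def rows_per_second() -> float:
    """Cluster-wide counting throughput."""
    return ROWS_PER_DAY / SECONDS_PER_DAY_OF_DATA


print(scan_seconds(30), "s =", scan_seconds(30) / 60, "min for 30 days")
print(round(rows_per_second()), "rows per second")
```

Under that assumption a simple count over one month takes about 20 minutes, which is too slow for interactive analysis.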
BEFORE ANALYTICS…
EXTRACT, TRANSFORM, LOAD
http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html
http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds
http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform
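A tiny extract / transform / load pipeline might look like the sketch below. It uses Python's built-in sqlite3 as a stand-in for the warehouse; the log format, table name and function names are assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple

# Hypothetical raw log lines: "timestamp path status"
RAW_LOG = [
    "2015-09-01T10:00:00 /home 200",
    "2015-09-01T10:00:05 /cart 500",
    "malformed line",
    "2015-09-02T08:30:00 /home 200",
]


def extract(lines: Iterable[str]) -> Iterator[str]:
    """Extract: read raw records, dropping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def transform(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Transform: parse into (day, path, status), skipping bad records."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        timestamp, path, status = parts
        yield timestamp[:10], path, int(status)


def load(rows: Iterable[Tuple[str, str, int]], conn: sqlite3.Connection) -> int:
    """Load: write the cleaned rows into a table, return the row count."""
    rows = list(rows)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests (day TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_LOG)), conn)
print(loaded, "rows loaded")
```

Keeping the three steps as separate functions makes each one easy to test and to schedule on its own.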
DATA SCIENTIST’S TOOLBOX
LINUX
• The best server choice
• Free of charge and free as in freedom
• Easy to control the system
• Easy data processing
• Hadoop runs primarily on Linux
POWERFUL SHELL SCRIPT
SQL DATABASE
• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)
• Standard SQL Language
• Store and Manage data
RELATIONAL DATABASE
TABLE RELATION
https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/
http://ghtorrent.org/relational.html
SQL SYNTAX
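A minimal illustration of the core syntax (CREATE, INSERT, SELECT with JOIN and GROUP BY), run through Python's built-in sqlite3 so it needs no server. The `users` and `orders` tables are hypothetical and only serve to show two related tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables: each order points at a user through user_id.
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    amount  REAL NOT NULL
);
INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders (user_id, amount) VALUES (1, 30.0), (1, 12.5), (2, 8.0);
""")

# Join the tables and aggregate the orders of each user.
QUERY = """
SELECT u.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id
GROUP BY u.name
ORDER BY total DESC;
"""
rows = conn.execute(QUERY).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```

The same SELECT runs largely unchanged on MySQL, PostgreSQL or Hive, which is the practical benefit of a standard query language.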
R & PYTHON
• Basic Analysis Tools
• Easy to Learn
• Many Packages
• Example
• http://bryannotes.blogspot.tw/2014/08/r-ptt-wantedsocial-network-analysis.html
• http://bryannotes.blogspot.tw/2014/10/python-k-means-script.html
ETC…
• Excel
• Google Analytics
• Visualisation tools (Tableau)
• Web Crawler
• Version control management (Git)
• ETL and job scheduling tools (Jenkins)
• …
DATA IS THE BIGGEST
“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
– JOSH WILLS
STATISTICS
WHY DO WE NEED MACHINE LEARNING?
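• Clustering: how many groups can these people be divided into?
• Classification: which group does a given person belong to?
• Regression: what is the probability that an event happens, or that a person belongs to a given group?
• Dimensionality reduction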
CLUSTERING
http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
Source: http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
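The k-means loop animated in the links above (assign each point to its nearest centroid, move each centroid to the mean of its points, repeat) can be sketched in plain Python. The starting centroids are passed in by hand here, which is an assumption for the sake of a deterministic example; real implementations usually pick them randomly. The points are made up.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points: Sequence[Point], labels: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    """Update step: move each centroid to the mean of its members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # empty cluster: leave centroid in place
            continue
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate the two steps until the centroids stop moving."""
    centroids = list(centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(POINTS, [(0, 0), (10, 10)])
print(labels, centroids)
```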
CLASSIFICATION
http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
http://www.astroml.org/sklearn_tutorial/
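The first link uses k-nearest-neighbour classification on colour readings. A minimal sketch of that idea, assuming RGB triples as features and a majority vote among the k closest training samples; the training data below is invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], str]  # (features, label)


def knn_predict(train: Sequence[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest training samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tie, the label encountered first among the neighbours wins.
    return votes.most_common(1)[0][0]


# Hypothetical RGB readings with known colour names.
TRAIN: List[Sample] = [
    ((255, 0, 0), "red"), ((250, 10, 5), "red"),
    ((0, 255, 0), "green"), ((10, 240, 20), "green"),
    ((0, 0, 255), "blue"), ((5, 10, 250), "blue"),
]

print(knn_predict(TRAIN, (240, 20, 10)))
print(knn_predict(TRAIN, (20, 230, 30)))
```

There is no training step: the "model" is the stored data itself, so prediction cost grows with the size of the training set.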
LOGISTIC REGRESSION
https://www.coursera.org/instructor/andrewng
COST FUNCTION
https://www.coursera.org/instructor/andrewng
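A one-feature logistic regression with the usual cross-entropy cost, minimised by plain gradient descent, can be sketched as follows. The data (hours studied vs. pass/fail), the learning rate and the step count are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: float, b: float, x: float) -> float:
    """Estimated probability that y = 1 given x."""
    return sigmoid(w * x + b)


def cost(w: float, b: float, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Cross-entropy: J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        h = min(max(predict(w, b, x), eps), 1 - eps)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)


def fit(xs: Sequence[float], ys: Sequence[int],
        lr: float = 0.1, steps: int = 5000) -> Tuple[float, float]:
    """Batch gradient descent on the cost above, starting from w = b = 0."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: hours studied -> passed (1) or failed (0)
XS = [1, 2, 3, 4, 5, 6]
YS = [0, 0, 0, 1, 1, 1]

w, b = fit(XS, YS)
print("cost before:", cost(0.0, 0.0, XS, YS))
print("cost after :", cost(w, b, XS, YS))
print("P(pass | 6 hours):", predict(w, b, 6))
```

The cost penalises confident wrong predictions heavily, which is why it is preferred over squared error for classification.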
OVERFITTING
https://www.coursera.org/instructor/andrewng
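One way to see overfitting in numbers: compare a model that memorises the training data with a simple straight-line fit, then score both on unseen points. This is a toy illustration, not the example from the lecture; the data is synthetic (y = 2x with alternating +1 / -1 noise on the training set) and the function names are made up.

```python
from typing import Callable, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def memorizer(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Overfit model: return the y of the closest training x."""
    def predict(x: float) -> float:
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mse(predict: Callable[[float], float],
        xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of `predict` on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: the true relation is y = 2x.
TRAIN_X = [0, 1, 2, 3, 4, 5]
TRAIN_Y = [1, 1, 5, 5, 9, 9]          # 2x plus alternating +1 / -1 noise
TEST_X = [0.5, 1.5, 2.5, 3.5, 4.5]
TEST_Y = [1, 3, 5, 7, 9]              # noise-free 2x

slope, intercept = fit_line(TRAIN_X, TRAIN_Y)


def line(x: float) -> float:
    return slope * x + intercept


memo = memorizer(TRAIN_X, TRAIN_Y)

print("memorizer train/test MSE:", mse(memo, TRAIN_X, TRAIN_Y), mse(memo, TEST_X, TEST_Y))
print("line      train/test MSE:", mse(line, TRAIN_X, TRAIN_Y), mse(line, TEST_X, TEST_Y))
```

The memorizer is perfect on the training set yet worse on the test set, because it has learned the noise; the line has a higher training error but generalises better.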
OH MY GOD! HOW TO CHOOSE ONE?
MACHINE LEARNING ALGORITHMS
http://amueller.github.io/sklearn_tutorial/
STATISTICS VS ML
STATISTICS
• Focus on understanding data in terms of models
• Interpretability, hypothesis testing
MACHINE LEARNING
• Focus on the analysis of learning algorithms
• Greater focus on prediction
SYSTEMATIZATION AND AUTOMATION
http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal
http://mlg.postech.ac.kr/projects/
SHOW YOUR DATA AND FINDINGS
http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png
http://www.tableau.com
THE REAL CASE
HOW TO START?
• Codecademy http://www.codecademy.com/ (covers many programming languages, e.g. Python, JavaScript, and even shell script and SQL)
• Coursera https://www.coursera.org/ (a well-known MOOC website for self-learning)
http://nirvacana.com/thoughts/becoming-a-data-scientist/