20160708 Python as a Data Processing Platform, Sapporo
Python
Sky
• Python 2000
(**)
• db tech showcase MongoDB
• FB: Ryuji Tamagawa
• Twitter : tamagawa_ryuji
2015
2016
• Python
• Python
• Python
• Python
• NumPy, SciPy, matplotlib, Pandas
• Python
• scikit-learn
• TensorFlow
• Python IPython, Jupyter Notebook, Spyder, Visual Studio
• Python
• Python
• Pandas
• Spark - PySpark DataFrame API
• matplotlib
Part 1 : Python
Python
• GoogleGuido GoogleGoogle 1
NumPy, SciPy, matplotlib → Pandas
-2000 Linux
-2010 Web Trac
Python
Python
• pyODBC
• Web WSGI
Python
• 2.x 3.x 32bit 64bit
64bit
• 2.x
• 3.x 3
• 2.x 3.x
• Ruby?
• R?
• Java?
• Scala?
Python
• Python 'CPython' JIT
PyPy JVM Jython .Net IronPython
• CPython
• CPython 2
• C
• processing PySpark
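Which implementation, version, and word size a given session runs on can be checked from the standard library. The following is a minimal sketch; the helper name `describe_interpreter` is made up for illustration and is not from the slides.

```python
import platform
import sys


def describe_interpreter():
    """Return the implementation name, version tuple, and word size of this interpreter."""
    return {
        # e.g. 'CPython', 'PyPy', 'Jython', or 'IronPython'
        "implementation": platform.python_implementation(),
        # (major, minor, micro), e.g. a 2.x or 3.x release
        "version": tuple(sys.version_info[:3]),
        # sys.maxsize exceeds 2**32 only on 64-bit builds
        "bits": 64 if sys.maxsize > 2**32 else 32,
    }


info = describe_interpreter()
print(info)
```

Running this at the top of a notebook is a quick way to confirm that the intended interpreter (for example, a 64-bit one) is actually in use.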
Python
• Python
• 1 Linux Mac OS Python Python Mac
• Python pip 3.x Python 2.7.9 2.x Python pip Linux Python pip yum apt
• Python Anaconda Python conda
• python 2016 http://qiita.com/y__sama/items/5b62d31cb7e6ed50f02c
NumPy, SciPy, matplotlib, Pandas
• NumPy SciPy
• Pandas Pandas Pandas NumPy
• Anaconda Python
Python
scikit-learn http://scikit-learn.org/stable/
Python
• TensorFlow
Python
Python
IPython
Jupyter, …
IDE Spyder, Rodeo
Visual Studio, PyCharm, PyDev
• GUI IDLE
OK
• IPython
• Anaconda
• pip
• Jupyter Notebook
• Python
• IPython Notebook Python
• Apache Zeppelin http://zeppelin.apache.org
IDE
• R RStudio
• IDE
• 2 Spyder Rodeo
Spyder
• Visual Studio
• Eclipse PyDev
• PyCharm
Part 2 :
Python
1 1.2 1000000L (Python 2)
'abc' u' ' (Python 2)
[1, 2, 3, 'foo', 'bar', 'foo']
(1, 2, 3, 'foo', 'bar', 'foo')
{'k1': 'value1', 'k2': 'value2'}
set([1, 2, 3, 'foo', 'bar'])
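How these containers differ in practice can be sketched as follows. This is an illustrative example built on the literals above; the variable names are assumptions, not from the slides.

```python
items_list = [1, 2, 3, 'foo', 'bar', 'foo']
items_tuple = (1, 2, 3, 'foo', 'bar', 'foo')
mapping = {'k1': 'value1', 'k2': 'value2'}

# A set keeps only distinct elements, so the duplicate 'foo' collapses.
unique = set(items_list)

# Lists are mutable: appending changes the list in place.
items_list.append('baz')

# Tuples are immutable: item assignment raises TypeError.
try:
    items_tuple[0] = 99
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

print(len(items_list), len(unique))  # 7 5
print(mapping['k1'])                 # value1
print(tuple_is_mutable)              # False
```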
• split
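s = 'foo, bar, baz'
items = s.split(',')   # ['foo', ' bar', ' baz']
print(items[0])        # first item: foo
print(items[-1])       # last item: ' baz' (keeps its leading space)
print(items[0][-2:])   # last two characters of the first item: oo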
• list comprehension
• dictionary comprehension
• lambda map, reduce, filter
sList = ['foo', 'bar', 'baz']
lList = [len(s) for s in sList]             # list comprehension
lList = list(map(lambda s: len(s), sList))  # same result with map and a lambda
lDict = {s: len(s) for s in sList}          # dictionary comprehension
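The code above covers `map`; `filter` and `reduce` follow the same pattern. A minimal sketch, reusing `sList` from the slide (the other names are made up for illustration):

```python
# reduce is a builtin on Python 2; on Python 3 it lives in functools
# (the import also works on Python 2.6+).
from functools import reduce

sList = ['foo', 'bar', 'baz']

# filter keeps the items for which the function returns True.
with_a = list(filter(lambda s: 'a' in s, sList))

# The same selection written as a list comprehension with a condition.
with_a_lc = [s for s in sList if 'a' in s]

# reduce folds the list into one value; here, the total number of characters.
total_len = reduce(lambda acc, s: acc + len(s), sList, 0)

print(with_a)     # ['bar', 'baz']
print(total_len)  # 9
```

For simple cases the comprehension form is usually considered easier to read than `filter` with a lambda.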
Pandas
• Pandas
matplotlib / seaborn
• NumPy SciPy
Python
• Pandas + matplotlib OK Pandas NumPy
NumPy / SciPy
Pandas
• Pandas
DataFrame
• R
• RDB2
• index Series Columns
Columns
Series Series Series Index
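The relationship between a DataFrame, its index, its columns, and the Series that make up each column can be sketched as below. The data reuses the sample table that appears later in the deck; using `id` as the row index is an assumption made for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [43, 42, 40], 'color': ['red', 'pink', 'green']},
    index=['sapporo', 'osaka', 'matsumoto'],
    columns=['value', 'color'],  # fix the column order explicitly
)

col = df['value']  # selecting one column yields a Series

print(list(df.columns))            # column labels
print(list(df.index))              # row labels
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # the Series shares the DataFrame's index
print(col['osaka'])                # label-based lookup through that index
```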
Pandas I/O
• CSV JSON RDB Excel
• column
• RDB
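import pandas as pd
df = pd.read_csv(<filename>)   # readers are module-level functions
df = pd.read_json(<filename>)
df.to_csv(<filename>)          # writers are DataFrame methods
df.to_excel(<filename>)
# copy the DataFrame to the clipboard
df.to_clipboard()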
• http://sinhrks.hatenablog.com/entry/2015/01/28/073327
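A round trip through the readers and writers can be tried without touching the file system by using in-memory buffers. This is a sketch under that assumption; the CSV text is made up from the sample table in this deck.

```python
import io

import pandas as pd

csv_text = u"id,value,color\nsapporo,43,red\nosaka,42,pink\nmatsumoto,40,green\n"

# Read: pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write: DataFrame.to_csv writes to a path or a buffer; index=False drops the row numbers.
out = io.StringIO()
df.to_csv(out, index=False)

# Reading the written text back should give an equal DataFrame.
round_trip = pd.read_csv(io.StringIO(out.getvalue()))

# With no path given, DataFrame.to_json returns the JSON text as a string.
json_text = df.to_json(orient='records')

print(out.getvalue())
print(json_text)
```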
0 1
import pandas as pd
df['nValue'] = df['value'] / sum(df['value'])
id         value  color
sapporo    43     red
osaka      42     pink
matsumoto  40     green

id         value  color  nValue
sapporo    43     red    0.344
osaka      42     pink   0.336
matsumoto  40     green  0.32
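The normalization shown in the tables can be reproduced end to end as follows. This sketch builds the DataFrame inline and uses the Series method `.sum()` in place of the built-in `sum()`; both give the same result here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'id': ['sapporo', 'osaka', 'matsumoto'],
        'value': [43, 42, 40],
        'color': ['red', 'pink', 'green'],
    },
    columns=['id', 'value', 'color'],
)

# Divide every value by the column total so that the new column sums to 1.
df['nValue'] = df['value'] / df['value'].sum()

print(df)
```

The division is applied to the whole column at once, so no explicit loop over the rows is needed.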
Python
Spark - PySpark DataFrame API
Python
• Spark PySpark findSpark
Spark
• Python Spark API DataFrame API
• Spark Pandas Spark
[Diagram: PySpark, with a driver and four Spark nodes]
matplotlib / seaborn
• Python NumPy / Pandas
• Jupyter Notebook Spyder
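A minimal sketch of plotting Pandas data with matplotlib, using the sample table from this deck. The non-interactive `Agg` backend and the in-memory PNG buffer are assumptions made so that the example runs without a display; in Jupyter Notebook the figure would normally be shown inline instead.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {
        'id': ['sapporo', 'osaka', 'matsumoto'],
        'value': [43, 42, 40],
        'color': ['red', 'pink', 'green'],
    },
    columns=['id', 'value', 'color'],
)

fig, ax = plt.subplots()
# One bar per row, colored by the 'color' column.
bars = ax.bar(list(df['id']), list(df['value']), color=list(df['color']))
ax.set_ylabel('value')

# Render the figure to PNG bytes in memory instead of a file.
buf = io.BytesIO()
fig.savefig(buf, format='png')
png_bytes = buf.getvalue()
plt.close(fig)

print(len(bars), len(png_bytes) > 0)
```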
Questions ?