2016-10-04 Python and pandas as a data-processing platform (Tokyo)
TRANSCRIPT
Python / pandas (Sky)
• Python user since 2000 (**)
• Spoke at db tech showcase (MongoDB)
• FB: Ryuji Tamagawa / Twitter: tamagawa_ryuji
2015-2016
• Python / pandas
• Python / pandas
• Python
• Python
• NumPy, SciPy, matplotlib, pandas
• Python
• Python: IPython, Jupyter Notebook, Spyder, Visual Studio
• Python / pandas
• Python
• pandas
• Spark - PySpark DataFrame API
• matplotlib
Part 1 : Python
Python
• Created by Guido van Rossum (who worked at Google)
NumPy, SciPy, matplotlib → pandas
• ~2000: Linux
• ~2010: Web (Trac)
Python
• 'Batteries included'
Python
• 2.x vs 3.x; 32-bit vs 64-bit (prefer 64-bit)
• 2.x
• 3.x
• 2.x → 3.x
• Ruby?
• R?
• Java?
• Scala?
Python
• The standard implementation is 'CPython'; others: PyPy (JIT), Jython (JVM), IronPython (.NET)
• CPython
• CPython 2
• C
• multiprocessing, PySpark
Python
• Python comes preinstalled on Linux and Mac OS
• pip ships with recent Python (3.x, and 2.7.9+ in the 2.x line); on Linux, Python and pip can also come from yum / apt
• The Anaconda distribution bundles Python with the conda package manager
• Python environment setup (2016): http://qiita.com/y__sama/items/5b62d31cb7e6ed50f02c
NumPy, SciPy, matplotlib, pandas
• NumPy, SciPy
• pandas is built on top of NumPy
• Included in the Anaconda Python distribution
Python
• scikit-learn: http://scikit-learn.org/stable/
Python
• TensorFlow
Python
IPython
Jupyter, …
IDE: Spyder, Rodeo
Visual Studio, PyCharm, PyDev
• IPython
• Anaconda
• Jupyter Notebook
• Python
• IPython Notebook
Python
• Apache Zeppelin http://zeppelin.apache.org
IDE
• R has RStudio
• IDE
• Two options: Spyder and Rodeo
Spyder
• Visual Studio
• Eclipse (PyDev)
• PyCharm
Part 2 : Python / pandas
Python / pandas
• pandas
• /etc…
Spark
• pandas
multiprocessing
• 64-bit Python can address many GB of memory
• A single Python process uses only one CPU core at a time, because of the GIL
• Use the multiprocessing module (or parallel jobs, e.g. via Jenkins) to exploit multiple CPU cores / processes
1  1.2  1000000
'abc'  ' '
[1, 2, 3, 'foo', 'bar', 'foo']
(1, 2, 3, 'foo', 'bar', 'foo')
{'k1': 'value1', 'k2': 'value2'}
set([1, 2, 3, 'foo', 'bar'])
• split
s = 'foo, bar, baz'
items = s.split(',')
print(items[0])       # foo
print(items[-1])      #  baz (the leading space survives the split)
print(items[0][-2:])  # oo
• lambda; map, reduce, filter
sList = ['foo', 'bar', 'baz']
lList = [len(s) for s in sList]
lList = list(map(lambda s: len(s), sList))  # list() needed in Python 3
lDict = {s: len(s) for s in sList}

lList = []
for s in sList:
    lList.append(len(s))

lDict = {}
for s in sList:
    lDict[s] = len(s)
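A quick check, not in the slides, that these forms are equivalent (in Python 3, `map` returns an iterator, hence the `list()` wrapper):

```python
# Three equivalent ways to build [3, 3, 3] from the strings' lengths
sList = ['foo', 'bar', 'baz']

a = [len(s) for s in sList]             # list comprehension
b = list(map(lambda s: len(s), sList))  # map + lambda
c = []
for s in sList:                         # explicit loop
    c.append(len(s))

print(a == b == c, a)  # True [3, 3, 3]
```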
pandas
• pandas
matplotlib / seaborn
• NumPy / SciPy
Python
• pandas + matplotlib is OK for most cases; pandas itself is built on NumPy
NumPy / SciPy
https://openbook4.me/projects/183
pandas
• pandas provides the DataFrame
• Similar to R's data.frame
• A two-dimensional structure, like an RDB table
• Has an index and columns; each column is a Series, and a Series carries its own Index
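A small illustration (data borrowed from the sample table used later in the deck) of this structure:

```python
import pandas as pd

# Build a DataFrame from a dict of columns, using 'id' as the index
df = pd.DataFrame({'id': ['sapporo', 'osaka', 'matsumoto'],
                   'value': [43, 42, 40],
                   'color': ['red', 'pink', 'green']}).set_index('id')

col = df['value']          # selecting a column yields a Series
print(type(col).__name__)  # Series
print(list(col.index))     # ['sapporo', 'osaka', 'matsumoto']
```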
IDE /
• IDE
• jupyter notebook
• http://sinhrks.hatenablog.com/entry/2015/01/28/073327
Example: normalizing the values so they sum to 1

import pandas as pd
df['nValue'] = df['value'] / sum(df['value'])

Before:
id         value  color
sapporo    43     red
osaka      42     pink
matsumoto  40     green

After:
id         value  color  nValue
sapporo    43     red    0.344
osaka      42     pink   0.336
matsumoto  40     green  0.320
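The snippet above assumes `df` already exists; a self-contained version using the table's data might look like:

```python
import pandas as pd

df = pd.DataFrame({'value': [43, 42, 40],
                   'color': ['red', 'pink', 'green']},
                  index=['sapporo', 'osaka', 'matsumoto'])

# Column arithmetic is vectorized: one expression fills the whole column
df['nValue'] = df['value'] / df['value'].sum()
print(df['nValue'].round(3).tolist())  # [0.344, 0.336, 0.32]
```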
pandas I/O
• CSV, JSON, RDB, Excel
• column
• RDB
import pandas as pd
df = pd.read_csv(<filename>)
df = pd.read_json(<filename>)
df.to_csv(<filename>)    # writer methods live on the DataFrame
df.to_excel(<filename>)
# to the clipboard
df.to_clipboard()
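A sketch (sample data invented) of a CSV round trip through an in-memory buffer, showing that readers are `pd` functions while writers are DataFrame methods:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

buf = io.StringIO()
df.to_csv(buf, index=False)  # writer method on the DataFrame
buf.seek(0)
df2 = pd.read_csv(buf)       # reader function on the pd module

print(df2.equals(df))  # True
```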
pandas.read_csv
• pandas' CSV reader
• usecols : read only the listed columns
• nrows : number of rows to read
• na_values : additional strings to recognize as NA
• parse_dates / infer_datetime_format : parse date columns
• chunksize : read the file in chunks of this many rows
• compression : read compressed (e.g. zip) CSV directly
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
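A small sketch (file contents invented, read from an in-memory buffer standing in for a file) exercising a few of these options:

```python
import io
import pandas as pd

# Inline CSV; 'missing' marks a missing value
csv_text = """id,value,color
sapporo,43,red
osaka,42,pink
matsumoto,missing,green
"""

# usecols limits the columns read; na_values adds 'missing' to the NA markers
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['id', 'value'],
                 na_values=['missing'])
print(df.shape)                         # (3, 2)
print(int(df['value'].isnull().sum()))  # 1

# nrows caps the number of data rows read
head = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(len(head))  # 2

# chunksize returns an iterator of partial DataFrames instead of one big one
sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
print(sizes)  # [2, 1]
```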
Spark - PySpark DataFrame API
• Spark ships with PySpark; the findspark package makes it usable from a plain Python environment
Spark
• From Python, use Spark through the DataFrame API
• Spark's DataFrame API is modeled after pandas
[Diagram: a PySpark driver program communicating with multiple Spark nodes]
Apache Arrow
• feather: a columnar file format for exchanging data frames between Python and R
• pandas 2.0, parquet for Python
Python / pandas
Questions ?