PySpark Essentials (20170630 Sapporo DB Analytics Showcase)

Upload: ryuji-tamagawa

Post on 22-Jan-2018

TRANSCRIPT

Page 1: PySpark Essentials (20170630 Sapporo DB Analytics Showcase)

PySpark@

Page 2

▸ facebook : Ryuji Tamagawa

▸ Twitter : tamagawa_ryuji

▸ FB

▸ Twitter

Page 3
Page 4

8

Page 5

Wes McKinney's blog

▸ http://qiita.com/tamagawa-ryuji

Page 6

▸ pandas PyData

▸ Spark Scala Java

Spark

▸ TB

Page 7
Page 8

▸ Spark Hadoop

▸ PySpark

▸ PySpark

▸ Spark/Hadoop PyData

PySpark

Page 9

Spark Hadoop

Page 10

Spark Hadoop

Hadoop 0.x Spark

[Figure: software stacks for three generations, labeled Hadoop 0.x, Hadoop 1.x, and Hadoop 2.x + Spark.
Hadoop 0.x: OS, HDFS, MapReduce.
Hadoop 1.x: OS, HDFS, MapReduce, Hive etc., HBase.
Hadoop 2.x + Spark: OS, HDFS, YARN, MapReduce, Hive etc., HBase, Impala, SQL, and Spark (Spark Streaming, MLlib, GraphX, Spark SQL). The Spark box appears four times, alongside the labels YARN, Mesos, and Windows.]

Page 11

Spark Hadoop

Hadoop Spark

[Figure: side-by-side comparison labeled MapReduce and Spark.
MapReduce side: alternating map JVM and reduce JVM stages, with HDFS.
Spark side: functions f1 through f7 chained over RDDs inside an Executor JVM, with HDFS.]

Page 12

Spark Hadoop

Spark

▸ Hadoop MapReduce

▸ Spark API MapReduce API

▸ Hadoop

Page 13

PySpark

Page 14

PySpark

(Py)Spark

▸ / Spark

▸ PyData

▸ Spark

▸ Spark Hadoop

PyData

PySpark

Page 15

PySpark

▸ SSD

▸ CPU

Parquet S3

CPU

Page 16

Spark 1.2 PySpark …

(Py)Spark

Page 17

PySpark

Page 18

PySpark

RDD API DataFrame API

▸ RDD Resilient Distributed Dataset = Spark

Java

▸ DataFrame RDD

/ R data.frame

▸ Spark 2.x DataFrame Learning PySpark ML, Structured Streaming, GraphFrames, TensorFrames

▸ Python RDD API DataFrame API Scala / Java

Page 19

PySpark

[Figure, two panels.
Panel "RDD API PySpark": a Driver JVM, Storage, and three Worker nodes, each with an Executor JVM and a Python VM.
Panel "DataFrame API PySpark": the same components (Driver JVM, Storage, three Worker nodes each with an Executor JVM and a Python VM).]

Page 20

PySpark

▸ RDD API Executor JVM Python VM

▸ DataFrame API JVM

▸ UDF Python VM

▸ UDF Scala Java

▸ Spark 2.x DataFrame
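
The contrast on this slide (RDD API work crossing between the Executor JVM and a Python VM, DataFrame API work staying in the JVM, and Python UDFs going back to the Python VM) can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the function names `shout` and `compare_builtin_and_udf`, the column name `name`, and the sample rows are all hypothetical. PySpark is imported inside the function so the plain Python logic can be tried without a cluster.

```python
def shout(text):
    """Plain Python logic; when wrapped as a UDF it runs in the Python VM."""
    return None if text is None else text.upper()


def compare_builtin_and_udf(spark):
    """Return a DataFrame that computes the same value two ways.

    `spark` is an existing SparkSession (assumed to be created by the caller).
    """
    # Imported lazily so this module loads even where PySpark is absent.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Hypothetical sample data with a single string column.
    df = spark.createDataFrame([("spark",), ("hadoop",)], ["name"])

    # Built-in column function: evaluated inside the Executor JVM.
    builtin_col = F.upper(F.col("name")).alias("builtin")

    # Python UDF: values are shipped to a Python worker and back.
    shout_udf = F.udf(shout, StringType())
    udf_col = shout_udf(F.col("name")).alias("python_udf")

    return df.select("name", builtin_col, udf_col)
```

With a session available, `compare_builtin_and_udf(spark).show()` should display matching `builtin` and `python_udf` columns. The results are the same; what differs is where the work runs, which is why a built-in function is usually the better default when one exists.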

Page 21

Spark PyData

Page 22

Spark PyData

Spark PyData

▸ Spark

▸ Python PyData

▸ Parquet

▸ Apache Arrow

Page 23

Spark PyData

PyData

Page 24

Spark PyData

PyData

▸ Anaconda Python
▸ Blaze 'NumPy and pandas interface to Big Data'
▸ dask
▸ Bokeh
▸ Canopy Python
▸ IPython
▸ matplotlib PyData
▸ nose
▸ numba JIT
▸ NumPy PyData
▸ SciPy PyData
▸ Statsmodels
▸ SymPy
▸ pandas NumPy SciPy
▸ scikit-image
▸ scikit-learn PyData

Page 26

Spark PyData

Parquet

https://parquet.apache.org/documentation/latest/

I/O

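Page 27

Spark PyData

Spark

# Read the CSV with an explicit schema, reduce to 20 partitions, write it out.
# save() uses Spark's default data source, which is Parquet unless reconfigured.
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')

fastparquet

import pandas as pd
from fastparquet import write

# Load the CSV with pandas, then write uncompressed Parquet.
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')

pyarrow

import pyarrow as pa
import pyarrow.parquet as pq

# Convert the pandas DataFrame to an Arrow table, then write gzip-compressed Parquet.
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')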
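
One practical payoff of writing Parquet is that a later reader can load only the columns it needs. The sketch below is an illustration rather than part of the talk: it writes a small pandas DataFrame with pyarrow and reads back a single column. The function names, the file name `example.parquet`, and the column names are hypothetical, and it assumes pandas and pyarrow are installed (they are imported lazily inside the functions).

```python
import os
import tempfile


def write_parquet(pdf, path, compression="snappy"):
    """Write a pandas DataFrame to a Parquet file via pyarrow."""
    # Imported lazily; assumes pyarrow is installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Drop the pandas index so only the real columns are stored.
    table = pa.Table.from_pandas(pdf, preserve_index=False)
    pq.write_table(table, path, compression=compression)


def read_columns(path, columns):
    """Read only the named columns back into a pandas DataFrame."""
    import pyarrow.parquet as pq

    return pq.read_table(path, columns=columns).to_pandas()


def demo():
    """Round-trip a tiny table and return the single column read back."""
    import pandas as pd

    pdf = pd.DataFrame({"id": [1, 2, 3], "city": ["Sapporo", "Tokyo", "Osaka"]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")
        write_parquet(pdf, path)
        return read_columns(path, ["city"])
```

Calling `demo()` returns a DataFrame containing only the `city` column. Because Parquet stores data column by column, the `id` values do not need to be decoded for this read.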

Page 28

Spark PyData

▸ pandas CSV Spark

Spark pandas

▸ Spark - pandas

▸ pandas → Spark …

▸ Apache Arrow
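
Inside a running session, moving data between Spark and pandas usually comes down to two calls, `DataFrame.toPandas()` and `SparkSession.createDataFrame()`. The sketch below is an assumption-laden illustration, not the speaker's code: the helper names are hypothetical, and the Arrow setting it toggles was introduced in Spark 2.3 and renamed in Spark 3.0, so it postdates the Spark versions that were current when this talk was given.

```python
def arrow_conf_key(spark_version):
    """Return the config key that enables Arrow-based pandas conversion.

    Spark 2.3/2.4 use `spark.sql.execution.arrow.enabled`; Spark 3.0+ use
    `spark.sql.execution.arrow.pyspark.enabled`. Earlier versions have none.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    if (major, minor) >= (2, 3):
        return "spark.sql.execution.arrow.enabled"
    return None


def spark_to_pandas(spark, sdf):
    """Collect a (small) Spark DataFrame to the driver as a pandas DataFrame."""
    key = arrow_conf_key(spark.version)
    if key is not None:
        # Ask Spark to use Arrow for the transfer where it is supported.
        spark.conf.set(key, "true")
    return sdf.toPandas()


def pandas_to_spark(spark, pdf):
    """Distribute a pandas DataFrame as a Spark DataFrame."""
    return spark.createDataFrame(pdf)
```

Note that `toPandas()` gathers the whole dataset on the driver, so it only suits results that fit in driver memory; filtering or aggregating in Spark first is the usual pattern.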

Page 29

Spark PyData

Apache Arrow

▸ Apache Arrow

▸ PyData / OSS

▸ /

https://arrow.apache.org

Page 30

Spark PyData

Wes blog

▸ pandas Apache Arrow

▸ Blog

▸ PyData Blog

Wes OK

▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis

http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81

Page 31

PySpark

Page 32

▸ pandas PySpark

▸ PySpark DataFrame API

▸ Parquet

CSV

Parquet

▸ UI

Jupyter Notebook Parquet

PySpark

DataFrame API

pandas

PyData Jupyter Notebook

CSV

Page 33