20171012 found IT #9 PySparkの勘所 (Key Points of PySpark)

PySpark found IT project #9

Upload: ryuji-tamagawa

Posted on 21-Jan-2018


TRANSCRIPT

Page 1: 20171012 found  IT #9 PySparkの勘所

PySpark found IT project #9


Page 2

▸ Facebook: Ryuji Tamagawa

▸ Twitter: tamagawa_ryuji

▸ found IT project: FB / Twitter

Page 3

Page 4

Page 5

Wes McKinney's blog (Japanese translations on the speaker's Qiita):

▸ http://qiita.com/tamagawa-ryuji

Page 6
Page 7
Page 8

▸ Spark and Hadoop

▸ PySpark

▸ Spark/Hadoop and PyData

Page 9
Page 10

Page 11

PySpark

▸ SSD

▸ CPU

Parquet / S3

CPU

Page 12
Page 13

https://www.slideshare.net/kumagi/ss-78765920/4

Page 14

▸ groupby

▸ Spark API

Page 15

Spark and Hadoop

Page 16

Spark and Hadoop

[Diagram: how the Hadoop stack evolved]

▸ Hadoop 0.x: OS / HDFS / MapReduce

▸ Hadoop 1.x: OS / HDFS / MapReduce, with Hive, HBase, etc. on top

▸ Hadoop 2.x + Spark: OS / HDFS / YARN or Mesos (or standalone, including on Windows), running Spark (Spark Streaming, MLlib, GraphX, Spark SQL), Hive, HBase, Impala (SQL), etc.

Hadoop 0.x → Hadoop 1.x → Hadoop 2.x + Spark

Page 17

▸ Amazon EMR

▸ Microsoft Azure HDInsight

▸ Cloudera Altus

▸ Databricks Community Edition Spark

▸ PyData + Jupyter PySpark

Page 18

Spark and Hadoop

How Hadoop and Spark execute jobs

[Diagram: MapReduce vs Spark execution]

▸ MapReduce: each stage runs in its own JVM (map JVM → reduce JVM), and every intermediate result is written to and re-read from HDFS.

▸ Spark: a chain of functions (f1 … f7) is applied to RDDs inside long-lived Executor JVMs; HDFS is touched only for the initial input and the final output.
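The contrast above can be sketched with lazy Python generators — a rough, pure-Python analogue (not Spark itself) of how Spark pipelines transformations in memory until an action forces evaluation:

```python
# Lazy, chained transformations in plain Python — a rough analogue of how
# Spark pipelines f1...f7 over an RDD inside one Executor JVM, instead of
# materializing each intermediate result the way MapReduce writes to HDFS.
data = range(10)

f1 = (x + 1 for x in data)         # nothing computed yet
f2 = (x * 2 for x in f1)           # still nothing
f3 = (x for x in f2 if x > 10)     # transformations are just chained

# Only a terminal action (like Spark's collect()) runs the whole pipeline.
result = list(f3)
```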

Page 19

Spark and Hadoop

Spark:

▸ supersedes Hadoop MapReduce

▸ the Spark API is far easier to use than the MapReduce API

▸ runs on a Hadoop cluster, but does not require one

Page 20

PySpark

(Py)Spark

▸ / Spark

▸ PyData

▸ Spark

▸ Spark Hadoop

PyData

PySpark

Page 21

Spark 1.2 PySpark …

(Py)Spark

Page 22

PySpark

Page 23

PySpark

RDD API and DataFrame API

▸ RDD (Resilient Distributed Dataset) = Spark's core distributed collection, implemented on the Java VM

▸ DataFrame: built on top of RDDs, with named columns — like R's data.frame

▸ Python exposes both the RDD API and the DataFrame API, just like Scala / Java

Page 24

PySpark

The move to the DataFrame API — older API → newer API:

▸ RDD → DataFrame / Dataset

▸ MLlib → ML

▸ GraphX → GraphFrames

▸ Spark Streaming → Structured Streaming

Page 25

PySpark

[Diagram: PySpark with the RDD API]
One Driver JVM; each Worker node runs an Executor JVM plus a Python VM over Storage, and records flow between the Executor JVM and the Python VM on every worker.

[Diagram: PySpark with the DataFrame API]
Same process layout, but execution stays inside the Executor JVMs; the per-worker Python VMs are bypassed.

Page 26

PySpark

▸ with the RDD API, every record crosses between the Executor JVM and a Python VM

▸ with the DataFrame API, execution stays inside the JVM

▸ a Python UDF pulls the data back out to the Python VM

▸ writing UDFs in Scala / Java avoids that cost

▸ on Spark 2.x, prefer the DataFrame API
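The RDD-path overhead comes largely from serializing every record across that JVM ↔ Python boundary; PySpark uses pickle for this. A rough pure-Python stand-in for the round trip (the records are illustrative):

```python
# Stand-in for the JVM <-> Python boundary on the RDD path: PySpark
# pickles every record that crosses between the Executor JVM and the
# Python worker. This sketch shows only the round trip itself.
import pickle

records = [{"name": "alice", "n": i} for i in range(1000)]

# serialize (the JVM -> Python direction in real PySpark) ...
blobs = [pickle.dumps(r) for r in records]
# ... and deserialize on the other side
restored = [pickle.loads(b) for b in blobs]
```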

Page 27

Spark and PyData

Page 28

Spark and PyData

▸ Spark

▸ Python and PyData

▸ Parquet

▸ Apache Arrow

Page 30

Spark and PyData

Parquet

https://parquet.apache.org/documentation/latest/

▸ columnar storage format: much smaller than zipped CSV

▸ queries read only the columns they need, cutting I/O

[Diagram: Parquet file layout]
A file is divided into row blocks; inside each row block, values are stored column by column — COLUMN #0 ROW #0 … ROW #N, then COLUMN #1 ROW #0 … ROW #N, and so on through COLUMN #M.
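Why that layout helps can be shown with a toy pure-Python model (illustrative only — not the real Parquet format):

```python
# Toy model of the row-block / column-chunk layout sketched above.
# Reading one column touches only that column's chunk in each row block,
# instead of scanning every full row.

rows = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(6)]

# "Write": split rows into row blocks, then store each column contiguously.
def to_row_groups(rows, group_size):
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return groups

row_groups = to_row_groups(rows, group_size=3)

# "Read" one column: touch one chunk per row block, nothing else.
def read_column(row_groups, col):
    out = []
    for g in row_groups:
        out.extend(g[col])
    return out

scores = read_column(row_groups, "score")
```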

Page 31

Spark and PyData

Spark:

df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')

fastparquet:

import pandas as pd
from fastparquet import write

pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')

pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')

Page 32

Spark and PyData

▸ read CSV with pandas, process at scale with Spark — hand data back and forth as needed

▸ Spark → pandas: toPandas()

▸ pandas → Spark …

▸ Apache Arrow is the key to making these transfers fast

Page 33

Spark and PyData

Apache Arrow

▸ a language-independent, in-memory columnar data format

▸ shared across PyData / OSS projects

▸ enables data interchange without copy-and-convert overhead

https://arrow.apache.org

Page 34

Spark and PyData

Wes's blog

▸ Wes McKinney writes about pandas and Apache Arrow on his blog

▸ Japanese translations are posted with Wes's OK

▸ e.g. "Apache Arrow and the '10 Things I Hate About pandas'"

https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144

Page 35

PySpark

Page 36
Page 37
Page 38