![Page 1: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/1.jpg)
SparkR + Zeppelin
Seattle Spark Meetup
Sept 9, 2015
Felix Cheung
![Page 2: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/2.jpg)
Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
![Page 3: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/3.jpg)
R
• A programming language for statistical computing and graphics
• S – 1975
  • S4 - advanced object-oriented features
• R – 1993
  • S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
![Page 4: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/4.jpg)
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
![Page 5: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/5.jpg)
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL
    sparkR
• or as a standard R package imported in tools like RStudio
    library(SparkR)
    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)
![Page 6: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/6.jpg)
History
• Shivaram Venkataraman & Zongheng Yang, AMPLab – UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
![Page 7: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/7.jpg)
Architecture
(Diagram labels: Native S4 classes & methods, socket, RBackend)
• A set of native S4 classes and methods that live inside a standard R package
• A backend that passes data structures and method calls to Spark Scala/JVM
• A collection of “helper” methods written in Scala
![Page 8: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/8.jpg)
Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant-folding, predicate pushdown, and code generation
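One way to see the Catalyst optimizations listed above is to ask a DataFrame for its query plan. The following is a minimal sketch, not code from the talk: it assumes a local Spark 1.5 installation with `SPARK_HOME` set, and the `faithful` dataset, the app name, and the 50-minute threshold are illustrative choices.

```r
# Assumption: SPARK_HOME points at a Spark 1.5 installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sc <- sparkR.init(master = "local[*]", appName = "explain-sketch")
sqlContext <- sparkRSQL.init(sc)

# Turn a local R data.frame into a distributed SparkR DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Nothing is computed yet: filter() and select() only build up a query plan.
short_waits <- select(filter(df, df$waiting < 50), "eruptions")

# Print the logical and physical plans that Spark derives for this query.
explain(short_waits, extended = TRUE)

sparkR.stop()
```

Comparing the plans printed for different query formulations is a simple way to check what the optimizer does with a given R expression.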
![Page 9: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/9.jpg)
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
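As a rough illustration of the Data Source API bullet, the sketch below reads a JSON file into a SparkR DataFrame and writes it back out as Parquet. It is not taken from the talk: it assumes a Spark 1.5 installation with `SPARK_HOME` set and relies on the `people.json` sample that ships with the Spark distribution; the app name and output path are hypothetical.

```r
# Assumption: SPARK_HOME points at a Spark 1.5 installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sc <- sparkR.init(master = "local[*]", appName = "datasource-sketch")
sqlContext <- sparkRSQL.init(sc)

# Sample file bundled with the Spark distribution.
json_path <- file.path(Sys.getenv("SPARK_HOME"),
                       "examples", "src", "main", "resources", "people.json")

# read.df() picks the data source by name; the schema is inferred from the JSON.
people <- read.df(sqlContext, json_path, source = "json")
printSchema(people)
head(people)

# Write the same data back out through a different data source (Parquet).
out_path <- file.path(tempdir(), "people.parquet")
write.df(people, path = out_path, source = "parquet", mode = "overwrite")

sparkR.stop()
```

Data sources distributed as Spark packages can be requested when the context is created, through the `sparkPackages` argument of `sparkR.init()`, and then used by passing their name as `source`.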
![Page 10: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/10.jpg)
SparkR in Zeppelin
![Page 11: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/11.jpg)
Architecture
(Diagram labels: R, R adaptor)
![Page 12: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/12.jpg)
Demo
![Page 13: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/13.jpg)
DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically download Spark 1.5.0 release
• Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
![Page 14: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/14.jpg)
(extracted from the demo)
Native R
![Page 15: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/15.jpg)
(extracted from the demo)
Native R and dplyr...
Similarly SparkR DataFrame…
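The demo code for this slide did not survive in the extracted text. As a stand-in, here is a hedged sketch of the same idea: one aggregation written first in native R and then against a SparkR DataFrame. The `faithful` dataset and all variable names are illustrative assumptions, the dplyr variant is left out to avoid an extra dependency, and the SparkR half assumes a Spark 1.5 installation with `SPARK_HOME` set.

```r
# Native R: count observations per waiting time in the built-in faithful dataset.
local_counts <- aggregate(eruptions ~ waiting, data = faithful, FUN = length)
names(local_counts)[2] <- "count"
head(local_counts[order(-local_counts$count), ])

# SparkR: the same aggregation on a distributed DataFrame.
# Assumption: SPARK_HOME points at a Spark 1.5 installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sc <- sparkR.init(master = "local[*]", appName = "groupby-sketch")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# groupBy() + summarize() mirror the group-and-count step above.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Sort by descending count and bring the first rows back to the R session.
head(arrange(waiting_counts, desc(waiting_counts$count)))

sparkR.stop()
```

The two halves should agree on the counts; the difference is that the SparkR version runs in the JVM and only the rows requested by `head()` come back to R.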
![Page 16: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/16.jpg)
(extracted from the demo)
SparkR DataFrame…
![Page 17: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/17.jpg)
What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
![Page 18: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/18.jpg)
SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
• More R-like
    df[df$age %in% c(19, 30), 1:2]
    transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
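To put the `glm` line in context, the sketch below fits the same formula end to end and scores the training data. It is an illustration rather than the speaker's demo: it assumes a Spark 1.5 installation with `SPARK_HOME` set, and it assumes `df` was created from R's built-in `iris` dataset, which matches the column names in the formula.

```r
# Assumption: SPARK_HOME points at a Spark 1.5 installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sc <- sparkR.init(master = "local[*]", appName = "glm-sketch")
sqlContext <- sparkRSQL.init(sc)

# createDataFrame() replaces the dots in iris column names with underscores
# (Sepal.Length becomes Sepal_Length), which is why the formula uses underscores.
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM (linear regression) using R formula syntax.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Inspect the fitted coefficients.
summary(model)

# Score the training data; predict() returns a DataFrame with a "prediction" column.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))

sparkR.stop()
```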
![Page 19: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/19.jpg)
Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release
![Page 20: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/20.jpg)
Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
![Page 21: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/21.jpg)
![Page 22: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/22.jpg)
subset
    # Columns can be selected using `[[` and `[`
    df[[2]] == df[["age"]]
    df[,2] == df[,"age"]
    df[,c("name", "age")]
    # Or to filter rows
    df[df$age > 20,]
    # DataFrame can be subset on both rows and Columns
    df[df$name == "Smith", c(1,2)]
    df[df$age %in% c(19, 30), 1:2]
    subset(df, df$age %in% c(19, 30), 1:2)
    subset(df, df$age %in% c(19), select = c(1,2))
![Page 23: SparkR + Zeppelin](https://reader031.vdocuments.pub/reader031/viewer/2022020410/58f9a9a5760da3da068b70c9/html5/thumbnails/23.jpg)
Transform/mutate
    newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
    newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)