Apache Zeppelin and Spark for Enterprise Data Science


Upload: bikas-saha

Post on 15-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Apache Zeppelin and Spark for Enterprise Data Science

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enabling Apache Zeppelin* and Spark* for Data Science in the Enterprise

Bikas Saha, @bikassaha

*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.

Page 2: Apache Zeppelin and Spark for Enterprise Data Science

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 3: Apache Zeppelin and Spark for Enterprise Data Science

Apache Zeppelin

Page 4: Apache Zeppelin and Spark for Enterprise Data Science

Zeppelin makes Big Data Science Easy to Approach

Zero install: just connect via a web browser and you are ready to run

Support for multiple execution platforms (Apache Spark, JDBC, Hive…)

Support for multiple languages (Scala, SQL, Python…)

Support for built-in visualizations

Support for reporting

Support for sharing and collaborative work

Does NOT have machine learning built in: that's where Apache Spark comes in (or your favorite engine, such as Apache Flink, Drill, Hive… and 30+ others)

Page 5: Apache Zeppelin and Spark for Enterprise Data Science

Zeppelin for Sharing

Page 6: Apache Zeppelin and Spark for Enterprise Data Science

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 7: Apache Zeppelin and Spark for Enterprise Data Science

Current Apache Zeppelin and Spark integration

[Diagram: User, Zeppelin Server, Spark Driver, and eight Spark Executors]

Page 8: Apache Zeppelin and Spark for Enterprise Data Science

Architectural Issue with Secure Data Access

[Diagram: User 1, Zeppelin Server, Spark Driver, Spark Executors, and HDFS, with the label "Zeppelin Server User" next to HDFS]

Page 9: Apache Zeppelin and Spark for Enterprise Data Science

Architectural Issues with Multi-Tenancy – Fault Tolerance

[Diagram: User 1 and User 2, Zeppelin Server, one Spark Driver, and Spark Executors]

User 1 failure affects User 2

Heavy-weight Spark drivers
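
The fault-tolerance point on this slide can be illustrated with a toy model. This is a sketch only, not Zeppelin or Spark code: `SharedDriver`, `DriverDead`, and `oom_job` are hypothetical names, and the "crash" is simulated with a `MemoryError`. It shows how a fatal error in one user's job leaves a shared driver unusable for everyone else.

```python
class DriverDead(RuntimeError):
    """Raised when a request reaches a driver that has already crashed."""


class SharedDriver:
    """Hypothetical stand-in for one Spark driver shared by all users."""

    def __init__(self):
        self.alive = True

    def run(self, user, job):
        if not self.alive:
            raise DriverDead(f"{user}: driver is down")
        try:
            return job()
        except MemoryError:
            # A fatal error in one user's job takes the whole process down.
            self.alive = False
            raise


def oom_job():
    raise MemoryError("simulated out-of-memory in user1's job")


driver = SharedDriver()

# User 1 crashes the shared driver.
try:
    driver.run("user1", oom_job)
except MemoryError:
    pass

# User 2 did nothing wrong, but the request now fails as well.
try:
    driver.run("user2", lambda: 1 + 1)
    user2_ok = True
except DriverDead:
    user2_ok = False

print("user2 request succeeded:", user2_ok)
```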

Page 10: Apache Zeppelin and Spark for Enterprise Data Science

Architectural Issues with Multi-Tenancy – Privacy

[Diagram: User 1 and User 2, Zeppelin Server, one Spark Driver, and Spark Executors]

User 1 can access User 2's data
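
One way to picture the privacy problem is a single interpreter scope shared by every notebook. The sketch below is a toy model under that assumption, not Zeppelin's actual implementation: `SharedInterpreter` is a hypothetical class, and plain Python `exec` stands in for the shared Spark context.

```python
class SharedInterpreter:
    """Hypothetical stand-in for one interpreter process shared by all users."""

    def __init__(self):
        self.namespace = {}  # one global scope for every notebook

    def execute(self, user, code):
        # `user` is not used for any check: nothing ties variables to an owner.
        exec(code, self.namespace)


shared = SharedInterpreter()

# User 2 loads some data into the shared scope.
shared.execute("user2", "salaries = [100, 200, 300]")

# User 1 can read it simply by knowing (or guessing) the variable name.
shared.execute("user1", "leaked = sum(salaries)")

print(shared.namespace["leaked"])
```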

Page 11: Apache Zeppelin and Spark for Enterprise Data Science

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Enterprise Ready Big Data Science

Future Roadmap

Page 12: Apache Zeppelin and Spark for Enterprise Data Science

Livy Server as a Session Management Service

[Diagram: Livy Server with an Interactive REST API and a Batch REST API; labels include Session, Remote Context, Remote Spark Driver, Standard Spark Batch Job, and Spark Executors]
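
As a rough illustration of the two REST APIs named on this slide, the sketch below builds (but does not send) the kind of JSON requests a client might post to Livy. The endpoints `/sessions`, `/sessions/{id}/statements`, and `/batches` and the fields `kind`, `code`, and `file` come from Livy's REST API; the host name, the session id `0`, and the HDFS path are placeholders assumed for the example, and `livy_post` is a hypothetical helper.

```python
import json
from urllib.request import Request

# Assumed host name; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy server."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Interactive API: create a session, then submit code statements to it.
# A real client would read the session id from the create-session response;
# 0 is used here only as a placeholder.
create_session = livy_post("/sessions", {"kind": "pyspark"})
run_statement = livy_post(
    "/sessions/0/statements",
    {"code": "sc.parallelize(range(10)).sum()"},
)

# Batch API: submit a standard Spark application (placeholder path).
submit_batch = livy_post("/batches", {"file": "hdfs:///apps/example/job.py"})

for req in (create_session, run_statement, submit_batch):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing any of these objects to `urllib.request.urlopen` would send it, which requires a reachable Livy server.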

Page 13: Apache Zeppelin and Spark for Enterprise Data Science

Secure Data Access - Solved

[Diagram: User, Zeppelin Server with Livy Interpreter, Livy Server with a Session (Remote Spark Driver, Remote Context), Spark Executors, and HDFS, with the label "User" next to HDFS]
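
The slide does not say how the end user's identity reaches the session. One mechanism Livy offers is the `proxyUser` field of the session-creation request, which asks Livy to run the session as the named user (this assumes impersonation is enabled on the Livy server and the cluster). A minimal sketch of building such a request body follows; `session_request_for` and the user name `alice` are hypothetical.

```python
import json


def session_request_for(user, kind="spark"):
    """Return the JSON body for a Livy session that should run as `user`."""
    if not user:
        raise ValueError("an authenticated user name is required")
    # `proxyUser` asks Livy to impersonate the end user, so that data access
    # happens under that identity rather than the server's own account.
    return json.dumps({"kind": kind, "proxyUser": user}, sort_keys=True)


body = session_request_for("alice")
print(body)
```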

Page 14: Apache Zeppelin and Spark for Enterprise Data Science

Multi-Tenancy - Solved

[Diagram: User 1 and User 2, Zeppelin Server with one Livy Interpreter per user, Livy Server with Session 1 and Session 2, each session with its own Remote Spark Driver, Remote Context, and Spark Executor]
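
The session-per-user layout can be sketched as a toy model. It is not Livy code: `Session` and `SessionManager` are hypothetical classes, plain Python `exec` stands in for a remote Spark driver, and a `MemoryError` simulates a crash. The point is that each user gets a private scope, and losing one session leaves the other untouched.

```python
class Session:
    """Hypothetical stand-in for one Livy session with its own remote driver."""

    def __init__(self, owner):
        self.owner = owner
        self.alive = True
        self.namespace = {}  # private scope, not shared with other sessions

    def execute(self, code):
        if not self.alive:
            raise RuntimeError(f"session for {self.owner} is dead")
        try:
            exec(code, self.namespace)
        except MemoryError:
            self.alive = False  # only this session's driver is lost
            raise


class SessionManager:
    """Hands each user a private session, created on first use."""

    def __init__(self):
        self._sessions = {}

    def session_for(self, user):
        if user not in self._sessions:
            self._sessions[user] = Session(user)
        return self._sessions[user]


manager = SessionManager()
manager.session_for("user2").execute("salaries = [100, 200, 300]")

# User 1 crashes their own session.
try:
    manager.session_for("user1").execute("raise MemoryError('simulated crash')")
except MemoryError:
    pass

# User 1 never saw User 2's data, and User 2 keeps working.
user1_sees_data = "salaries" in manager.session_for("user1").namespace
manager.session_for("user2").execute("total = sum(salaries)")
print(user1_sees_data, manager.session_for("user2").namespace["total"])
```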

Page 15: Apache Zeppelin and Spark for Enterprise Data Science

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 16: Apache Zeppelin and Spark for Enterprise Data Science

Near Term Improvements

Session Management

Debuggability

Unified session for all languages

Better visualizations for Machine Learning

Support for Spark 2.0

Page 17: Apache Zeppelin and Spark for Enterprise Data Science

Long Term Improvements

Controlled sharing of sessions for collaboration

Data exploration and browsing with metadata

Taking the model from training to production

Page 18: Apache Zeppelin and Spark for Enterprise Data Science

Thank You