#HSTokyo16 Apache Spark Crash Course

Robert Hryniewicz Data Advocate Twitter: @RobH8z Email: [email protected] Apache Spark Crash Course Hadoop Summit Tokyo 2016

Upload: hadoop-summit

Post on 07-Jan-2017


TRANSCRIPT

Page 1: #HSTokyo16 Apache Spark Crash Course

Robert Hryniewicz, Data Advocate

Twitter: @RobH8z
Email: [email protected]

Apache Spark Crash Course
Hadoop Summit Tokyo 2016

Page 2: #HSTokyo16 Apache Spark Crash Course

© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda

• Background
• Spark Overview
• Zeppelin Overview
• Components of HDP
• Lab (~45 min)

Page 3: #HSTokyo16 Apache Spark Crash Course

Data Sources

→ Internet of Anything (IoAT)
  – Wind Turbines, Oil Rigs, Cars
  – Weather Stations, Smart Grids
  – RFID Tags, Beacons, Wearables

→ User Generated Content (Web & Mobile)
  – Twitter, Facebook, Snapchat, YouTube
  – Clickstream, Ads, User Engagement
  – Payments: PayPal, Venmo

44 ZB in 2020

Page 4: #HSTokyo16 Apache Spark Crash Course

The “Big Data” Problem

Problem
→ A single machine cannot process or even store all the data!

Solution
→ Distribute data over large clusters

Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?

Page 5: #HSTokyo16 Apache Spark Crash Course

Spark Background

Page 6: #HSTokyo16 Apache Spark Crash Course

History of Hadoop & Spark

Page 7: #HSTokyo16 Apache Spark Crash Course

Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

FAST slower slowest

Page 8: #HSTokyo16 Apache Spark Crash Course

What Is Apache Spark?

→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms

Page 9: #HSTokyo16 Apache Spark Crash Course

Why Apache Spark?

→ Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)

→ In-memory computation model
  – Fast!
  – Effective for iterative computations and ML

→ Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)

Page 10: #HSTokyo16 Apache Spark Crash Course

Spark Ecosystem

Spark Core

Spark SQL   Spark Streaming   Spark MLlib   GraphX

Page 11: #HSTokyo16 Apache Spark Crash Course

Apache Spark Basics

Page 12: #HSTokyo16 Apache Spark Crash Course

SparkContext

What is it?

→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)

Page 13: #HSTokyo16 Apache Spark Crash Course

Spark SQL

Page 14: #HSTokyo16 Apache Spark Crash Course

Spark SQL Overview

→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)

→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API

Page 15: #HSTokyo16 Apache Spark Crash Course

DataFrames

→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R

[Figure: a DataFrame drawn as a table with columns Col1, Col2, …, ColN; labels mark a Column and a Row]

Data is described as a DataFrame with rows, columns, and a schema

Page 16: #HSTokyo16 Apache Spark Crash Course

DataFrames

Created from Various Sources

→ DataFrames from HIVE:
  – Reading and writing HIVE tables

→ DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBASE, Avro

[Figure: CSV, Avro, HIVE, Text, and JSON sources feeding through Spark SQL into a DataFrame (Col1, Col2, …, ColN; Column; Row)]

Page 17: #HSTokyo16 Apache Spark Crash Course

SQLContext

→ Entry point into all functionality in Spark SQL
→ All you need is SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)   // sc is the existing SparkContext

HiveContext

→ Superset of functionality provided by basic SQLContext
  – Read data from Hive tables
  – Access to Hive Functions → UDFs
→ Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)

Page 18: #HSTokyo16 Apache Spark Crash Course

Spark SQL Examples

Page 19: #HSTokyo16 Apache Spark Crash Course

Setting up DataFrame API

Create a DataFrame

val flightsDF = …   // ← create from CSV, JSON, Hive etc.

Example:

val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)

Page 20: #HSTokyo16 Apache Spark Crash Course

Setting up SQL API

Register a Temporary Table

flightsDF.registerTempTable("flights")

Page 21: #HSTokyo16 Apache Spark Crash Course

Two API Examples: DataFrame and SQL APIs

DataFrame API

// $"..." column syntax needs import sqlContext.implicits._ outside Zeppelin and the Spark shell
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API

SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15
LIMIT 5

Results

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Page 22: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Page 23: #HSTokyo16 Apache Spark Crash Course

What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

Page 24: #HSTokyo16 Apache Spark Crash Course

Modern Data Applications approach to Insights

Traditional Analytics: Structured & Repeatable
– Structure built to store data
– Start with hypothesis, test against selected data
– Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
– Data is the structure
– Data leads the way: explore all data, identify correlations
– Analyze in motion…

Page 25: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Overview

→ Extension of Spark Core API

→ Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

[Figure: data sources feeding Spark Streaming; surviving labels: ZeroMQ, MQTT]

Page 26: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Page 27: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Discretized Streams (DStreams)

→ High-level abstraction representing continuous stream of data
→ Internally represented as a sequence of RDDs
→ Operation applied on a DStream translates to operations on the underlying RDDs
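The relationship between a stream and its underlying batches can be modelled without Spark. The sketch below is plain Scala, not the Spark Streaming API: `MiniDStream` is a hypothetical stand-in in which each inner `Vector` plays the role of one micro-batch RDD, and the delay values are made up.

```scala
// Plain-Scala model of a discretized stream; NOT the Spark Streaming API.
// MiniDStream is a hypothetical name: each inner Vector stands in for one micro-batch RDD.
final case class MiniDStream[A](batches: Vector[Vector[A]]) {
  // An operation on the stream is applied to every underlying batch.
  def map[B](f: A => B): MiniDStream[B] =
    MiniDStream(batches.map(batch => batch.map(f)))

  def filter(p: A => Boolean): MiniDStream[A] =
    MiniDStream(batches.map(batch => batch.filter(p)))
}

object DStreamSketch {
  // Three micro-batches of departure delays in minutes (made-up values).
  val delays: MiniDStream[Int] =
    MiniDStream(Vector(Vector(19, 3), Vector(34, 25, 7), Vector(94)))

  // Keep only delays above 15 minutes, batch by batch.
  val late: MiniDStream[Int] = delays.filter(_ > 15)
}
```

Batch boundaries survive every operation, which is the property the slide describes: one call on the stream becomes one call per batch.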

Page 28: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Example: flatMap operation
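The slide's figure did not survive extraction, so here is a rough plain-Scala illustration of what a flatMap over a stream of lines does. Collections stand in for RDDs; `FlatMapSketch` and the sample lines are assumptions, not part of the original deck.

```scala
// Plain-Scala sketch of flatMap over micro-batches; collections stand in for RDDs.
object FlatMapSketch {
  // Split every line of every batch into words; batch boundaries are preserved.
  def words(batches: Vector[Vector[String]]): Vector[Vector[String]] =
    batches.map { lines =>
      lines.flatMap(line => line.split(" ").toVector.filter(_.nonEmpty))
    }

  // Two micro-batches of text lines (made-up input).
  val lineBatches: Vector[Vector[String]] = Vector(
    Vector("to be or", "not to be"),
    Vector("that is the question")
  )

  val wordBatches: Vector[Vector[String]] = words(lineBatches)
}
```

Each input line yields zero or more output records, so a batch of two lines becomes a batch of six words while the second batch is processed independently.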

Page 29: #HSTokyo16 Apache Spark Crash Course

Spark Streaming

Window Operations

→ Apply transformations over a sliding window of data, e.g. rolling average
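A rolling average over a sliding window can be sketched in plain Scala as follows. This is not the Spark Streaming window API; `rollingAverage`, its parameters, and the readings are hypothetical, with window length and slide counted in batches.

```scala
// Plain-Scala sketch of a sliding-window average; not the Spark Streaming window API.
object WindowSketch {
  // windowLength and slide are counted in batches, mirroring a window length and
  // sliding interval expressed as multiples of the batch interval.
  def rollingAverage(batches: Vector[Vector[Double]], windowLength: Int, slide: Int): Vector[Double] =
    batches
      .sliding(windowLength, slide)
      .filter(_.size == windowLength) // drop a trailing partial window
      .map { window =>
        val values = window.flatten
        values.sum / values.size
      }
      .toVector

  // Four micro-batches of sensor readings (made-up values).
  val readings: Vector[Vector[Double]] =
    Vector(Vector(1.0, 3.0), Vector(5.0), Vector(7.0, 9.0), Vector(11.0))

  // Window of 2 batches, sliding forward by 1 batch.
  val averages: Vector[Double] = rollingAverage(readings, 2, 1)
}
```

With these inputs the three windows average to 3.0, 7.0, and 9.0; consecutive windows overlap by one batch, which is what makes the average "rolling".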

Page 30: #HSTokyo16 Apache Spark Crash Course

Spark MLlib

Page 31: #HSTokyo16 Apache Spark Crash Course

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Page 32: #HSTokyo16 Apache Spark Crash Course

Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Page 33: #HSTokyo16 Apache Spark Crash Course

Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81
Intercept: 3.05

y = 2.81x + 3.05
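To make the training result concrete, here is a minimal plain-Scala sketch of one-feature least squares. It uses the closed-form solution rather than whatever solver Spark MLlib applies, and `LinearModel` and `fit` are hypothetical names. Only the coefficient 2.81 and intercept 3.05 come from the slide.

```scala
// Plain-Scala sketch of one-feature linear regression (closed-form least squares).
// Not Spark MLlib; LinearModel and fit are hypothetical names for illustration.
final case class LinearModel(coefficient: Double, intercept: Double) {
  def predict(x: Double): Double = coefficient * x + intercept
}

object OneFeatureRegression {
  // points: (feature, label) pairs.
  def fit(points: Seq[(Double, Double)]): LinearModel = {
    require(points.size >= 2, "need at least two points")
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val covXY = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(varX > 0.0, "feature values must not all be equal")
    val slope = covXY / varX
    LinearModel(slope, meanY - slope * meanX)
  }

  // The model reported on the slide: y = 2.81x + 3.05.
  val slideModel: LinearModel = LinearModel(2.81, 3.05)
}
```

Calling `OneFeatureRegression.slideModel.predict(x)` evaluates the slide's line at `x`; `fit` shows where such a coefficient and intercept come from for a small in-memory data set.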

Page 34: #HSTokyo16 Apache Spark Crash Course


Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563
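With two features the model is a dot product of the coefficient vector and the feature vector plus the intercept. The sketch below uses the numbers from the slide; the object and method names are assumptions, and it is plain Scala rather than the MLlib model class.

```scala
// Plain-Scala sketch of prediction with the two-feature model from the slide.
object TwoFeatureRegression {
  val coefficients: Vector[Double] = Vector(0.464, 0.464)
  val intercept: Double = 0.0563

  // prediction = w1 * x1 + w2 * x2 + intercept
  def predict(features: Vector[Double]): Double = {
    require(features.size == coefficients.size, "feature count must match coefficient count")
    coefficients.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}
```

For example, `TwoFeatureRegression.predict(Vector(1.0, 1.0))` adds both coefficients to the intercept.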

Page 35: #HSTokyo16 Apache Spark Crash Course

Spark API for building ML pipelines

[Figure: a Pipeline chains Feature transform 1, Feature transform 2, Combine features, and Linear Regression. Train: fitting the Pipeline on an Input DataFrame produces a PipelineModel. Predict: the PipelineModel turns an Input DataFrame into an Output DataFrame. The model can also be exported (Export Model).]
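The train/predict split in the pipeline figure can be imitated in plain Scala. This is a sketch of the idea only, not the spark.ml API: `Pipeline`, `PipelineModel`, and the three stage functions below are hypothetical, and the final stage is a one-feature least-squares fit on the combined feature.

```scala
// Plain-Scala sketch of the pipeline idea; not the spark.ml API. All names are hypothetical.
object PipelineSketch {
  type Features = Vector[Double]

  // Result of training: the same feature stages plus the fitted line.
  final case class PipelineModel(stages: Seq[Features => Features], slope: Double, intercept: Double) {
    private def transform(row: Features): Features =
      stages.foldLeft(row)((r, stage) => stage(r))
    // Predict: run the feature stages, then apply the fitted line.
    def predict(row: Features): Double = slope * transform(row).head + intercept
  }

  final case class Pipeline(stages: Seq[Features => Features]) {
    // Train: push every row through the stages, then fit least squares
    // on the single combined feature.
    def fit(data: Seq[(Features, Double)]): PipelineModel = {
      val xs = data.map { case (row, _) => stages.foldLeft(row)((r, s) => s(r)).head }
      val ys = data.map(_._2)
      val n = data.size.toDouble
      val meanX = xs.sum / n
      val meanY = ys.sum / n
      val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
      require(varX > 0.0, "combined feature must vary")
      val slope = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / varX
      PipelineModel(stages, slope, meanY - slope * meanX)
    }
  }

  val doubleFirst: Features => Features = f => f.updated(0, f(0) * 2.0) // feature transform 1
  val negateSecond: Features => Features = f => f.updated(1, -f(1))     // feature transform 2
  val combine: Features => Features = f => Vector(f.sum)                // combine features

  val pipeline: Pipeline = Pipeline(Seq(doubleFirst, negateSecond, combine))
}
```

The point of the design is that the fitted model carries the feature stages with it, so prediction applies exactly the same transformations that training did.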

Page 36: #HSTokyo16 Apache Spark Crash Course

Spark GraphX

Page 37: #HSTokyo16 Apache Spark Crash Course

GraphX

→ PageRank
→ Topic Modeling (LDA)
→ Community Detection

Source: ampcamp.berkeley.edu
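As an illustration of the first algorithm in the list, here is a small iterative PageRank in plain Scala. It is not the GraphX implementation (which distributes the graph and normalizes ranks differently); the function name, the damping default of 0.85, and the requirement that every vertex has an outgoing edge are assumptions made to keep the sketch short.

```scala
// Plain-Scala sketch of iterative PageRank; not the GraphX implementation.
object PageRankSketch {
  // edges: (source, destination). Assumes every vertex has at least one outgoing edge.
  def pageRank(edges: Seq[(Int, Int)], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    require(vertices.forall(outDegree.contains), "every vertex needs an outgoing edge")
    val n = vertices.size.toDouble
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0 / n).toMap
    (1 to iterations).foreach { _ =>
      // Each vertex sends rank / outDegree along every outgoing edge.
      val received = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (v, contribs) => v -> contribs.map(_._2).sum }
      ranks = vertices
        .map(v => v -> ((1 - damping) / n + damping * received.getOrElse(v, 0.0)))
        .toMap
    }
    ranks
  }
}
```

On a three-vertex cycle every vertex keeps rank 1/3; adding edges that point at one vertex raises that vertex's rank relative to the others.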

Page 38: #HSTokyo16 Apache Spark Crash Course

Apache Zeppelin & HDP Sandbox

Page 39: #HSTokyo16 Apache Spark Crash Course

What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Page 40: #HSTokyo16 Apache Spark Crash Course

What is a Note/Notebook?

• A web based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features: Sharing / Collaboration / Reports / Import / Export

Page 41: #HSTokyo16 Apache Spark Crash Course

Big Data Lifecycle

Collect → ETL / Process → Analysis → Report → Data Product

Roles: Data Engineer, Data Scientist, Business user, Customer

All in one place in Zeppelin!

Page 42: #HSTokyo16 Apache Spark Crash Course

How does Zeppelin work?

[Figure: the Notebook Author and Collaborators / Report viewers connect to Zeppelin, which talks to the Cluster: Spark | Hive | HBase, any of 30+ backends]

Page 43: #HSTokyo16 Apache Spark Crash Course

HDP Sandbox

What’s included in the HDP Sandbox?

→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr etc.

[Figure: Spark stack on HDP. APIs: Scala, Java, Python, R. Libraries: Spark SQL, Spark Streaming, MLlib, GraphX on the Spark Core Engine. Running on YARN over HDFS (nodes 1 … N).]

Page 44: #HSTokyo16 Apache Spark Crash Course

Access patterns enabled by YARN

Applications
– Batch: needs to happen, but no timeframe limitations
– Interactive: needs to happen at human time
– Real-Time: needs to happen at machine execution time

YARN: Data Operating System
HDFS: Hadoop Distributed File System (nodes 1 … N)

Page 45: #HSTokyo16 Apache Spark Crash Course

Why Apache Spark on YARN?

→ Resource management
  – Share Spark workloads with other workloads (HIVE, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues

[Figure: Client with Spark Driver; Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container running two tasks]

Page 46: #HSTokyo16 Apache Spark Crash Course

Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
  • Not just storage but computation

[Figure: a logical file split into numbered blocks, with copies of each block spread across cluster nodes]
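The block-and-replica idea can be sketched in a few lines of plain Scala. This is an illustration only: real HDFS placement is rack-aware rather than uniformly random, and the node names and the 128 MB block size used in the usage note are assumptions, not values from the slide.

```scala
import scala.util.Random

// Plain-Scala sketch of block splitting and replica placement; not the HDFS placement
// policy, which is rack-aware. Names and sizes are assumptions for illustration.
object HdfsBlockSketch {
  // Number of fixed-size blocks needed for a file (the last block may be partial).
  def blockCount(fileBytes: Long, blockBytes: Long): Int = {
    require(fileBytes >= 0 && blockBytes > 0, "sizes must be non-negative and block size positive")
    ((fileBytes + blockBytes - 1) / blockBytes).toInt
  }

  // Place each block on `replication` distinct nodes chosen at random.
  def placeReplicas(
      blocks: Int,
      nodes: Vector[String],
      replication: Int,
      rng: Random
  ): Map[Int, Set[String]] = {
    require(nodes.size >= replication, "need at least as many nodes as replicas")
    (0 until blocks).map(b => b -> rng.shuffle(nodes).take(replication).toSet).toMap
  }
}
```

For instance, a 600 MB file with 128 MB blocks needs `blockCount(600L * 1024 * 1024, 128L * 1024 * 1024)` blocks, and `placeReplicas` then assigns each block to three different nodes, so losing any single node leaves two copies of every block.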

Page 47: #HSTokyo16 Apache Spark Crash Course

There’s more to HDP

Hortonworks Data Platform 2.4.x

GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider

DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System (nodes 1 … N)

SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud

Page 48: #HSTokyo16 Apache Spark Crash Course

Hortonworks Data Cloud

Page 49: #HSTokyo16 Apache Spark Crash Course


Page 50: #HSTokyo16 Apache Spark Crash Course


Page 51: #HSTokyo16 Apache Spark Crash Course

Bringing Multitenancy to Apache Zeppelin

Page 52: #HSTokyo16 Apache Spark Crash Course

Introducing Livy

→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service

[Figure: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
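Because Livy is a REST interface, a client only needs an HTTP library. The sketch below builds (but does not send) a session-creation request with the JDK HTTP client, assuming Java 11 or later. The host name is hypothetical, 8998 is Livy's default port, and `LivySketch` is an illustrative name rather than part of any Livy client library.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

// Sketch of a Livy REST call built with the JDK HTTP client (Java 11+).
// The host name is hypothetical; 8998 is Livy's default port.
object LivySketch {
  val livyUrl = "http://livy-host.example.com:8998"

  // POST /sessions asks Livy to start an interactive session;
  // "kind" selects the interpreter, e.g. "spark" for Scala.
  def createSessionRequest(kind: String): HttpRequest =
    HttpRequest
      .newBuilder(URI.create(s"$livyUrl/sessions"))
      .header("Content-Type", "application/json")
      .POST(BodyPublishers.ofString(s"""{"kind": "$kind"}"""))
      .build()
}
```

Against a running Livy server, the request could be sent with `java.net.http.HttpClient.newHttpClient().send(request, java.net.http.HttpResponse.BodyHandlers.ofString())`; the JSON response describes the new session, whose id is then used for follow-up calls.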

Page 53: #HSTokyo16 Apache Spark Crash Course

Security Across Zeppelin-Livy-Spark

[Figure: LDAP → Zeppelin (Shiro) → Ispark Group Interpreter → Livy APIs (SPNego: Kerberos) → Livy Server → Kerberos → Spark on YARN (Driver)]

Page 54: #HSTokyo16 Apache Spark Crash Course

Reasons to Integrate with Livy

→ Bring Sessions to Apache Zeppelin
  – Isolation
  – Session sharing

→ Enable efficient cluster resource utilization
  – Default Spark interpreter keeps YARN/Spark job running forever
  – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)

→ Identity Propagation
  – Send user identity from Zeppelin > Livy > Spark on YARN

Page 55: #HSTokyo16 Apache Spark Crash Course

SparkContext Sharing

[Figure: Client 1 and Client 2 use Session-1 and Client 3 uses Session-2 on the Livy Server; Session-1 maps to Spark Session-1 and Session-2 to Spark Session-2, each with its own SparkContext]

Page 56: #HSTokyo16 Apache Spark Crash Course

Sample Architecture

Page 57: #HSTokyo16 Apache Spark Crash Course

Managed Dataflow

SOURCES   REGIONAL INFRASTRUCTURE   CORE INFRASTRUCTURE

Page 58: #HSTokyo16 Apache Spark Crash Course

High-Level Overview

[Figure: two groups of IoT Devices, each behind an IoT Edge (single node), feed a Data Center (on prem/cloud) containing a NiFi Hub, Data Broker, Column DB (HBase/Cassandra), Data Store (HDFS/S3), and a Live Dashboard]

Page 59: #HSTokyo16 Apache Spark Crash Course

What’s new in Spark 2.0

Page 60: #HSTokyo16 Apache Spark Crash Course

Spark 2.0

→ API Improvements
  – SparkSession (spark): new entry point (replaces SQLContext and HiveContext)
  – Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
  – Structured Streaming / Continuous Application (concept of an infinite DataFrame)
  – Temporary Table → Temporary View

→ Performance Improvements
  – Tungsten Phase 2: multi-stage codegen
  – ORC & Parquet file improvements

→ Machine Learning
  – ML pipeline (DataFrame-based) API is the new primary API; the RDD-based MLlib API is in maintenance mode
  – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)

→ Spark SQL
  – More SQL support (new ANSI SQL parser, subquery support)

Page 61: #HSTokyo16 Apache Spark Crash Course

What’s the latest at Hortonworks?

→ HDP 2.5: Batch Processing (DATA AT REST)
→ HDF 2.0: Streaming Apps (DATA IN MOTION)

Modern Data Applications: ACTIONABLE INTELLIGENCE

Page 62: #HSTokyo16 Apache Spark Crash Course

Lab Preview

Page 63: #HSTokyo16 Apache Spark Crash Course

Lab Setup Instructions

http://tinyurl.com/hwx-spark-intro

Lab Options
- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud → Setup info: http://hortonworks.github.io/hdp-aws/index.html

Page 64: #HSTokyo16 Apache Spark Crash Course

Hortonworks Community Connection

Page 65: #HSTokyo16 Apache Spark Crash Course

Community Engagement

Participate now at: community.hortonworks.com

9,500+ Registered Users
21,000+ Answers
32,500+ Technical Assets

One Website!

Page 66: #HSTokyo16 Apache Spark Crash Course

Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Page 67: #HSTokyo16 Apache Spark Crash Course

Robert Hryniewicz
Email: [email protected]
Twitter: @RobH8z

Thanks!