#hstokyo16 apache spark crash course

Post on 07-Jan-2017

TRANSCRIPT

Robert Hryniewicz, Data Advocate

Twitter: @RobH8z
Email: rhryniewicz@hortonworks.com

Apache Spark Crash Course
Hadoop Summit Tokyo 2016

© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda

• Background
• Spark Overview
• Zeppelin Overview
• Components of HDP
• Lab ~45 min

Data Sources

→ Internet of Anything (IoAT)
  – Wind Turbines, Oil Rigs, Cars
  – Weather Stations, Smart Grids
  – RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
  – Twitter, Facebook, Snapchat, YouTube
  – Clickstream, Ads, User Engagement
  – Payments: PayPal, Venmo

44 ZB in 2020

The "Big Data" Problem

Problem
→ A single machine cannot process or even store all the data!

Solution
→ Distribute data over large clusters

Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

Memory (FAST), hard drive (slower), network (slowest)

What Is Apache Spark?

→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?

→ Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
→ Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)

Spark Ecosystem

Spark Core

Spark SQL, Spark Streaming, Spark MLlib, GraphX

Apache Spark Basics

SparkContext

What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)

Spark SQL

Spark SQL Overview

→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API

DataFrames

→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R

[Figure: a DataFrame drawn as a grid of rows and columns, Col1 Col2 … ColN]

Data is described as a DataFrame with rows, columns, and a schema

DataFrames

Created from Various Sources

→ DataFrames from HIVE:
  – Reading and writing HIVE tables
→ DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBASE, Avro

[Figure: CSV, Avro, HIVE, Text, and JSON sources feed Spark SQL, which yields a DataFrame of rows and columns, Col1 Col2 … ColN]

SQLContext

→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext

→ Superset of functionality provided by basic SQLContext
  – Read data from Hive tables
  – Access to Hive Functions → UDFs
→ Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)

Spark SQL Examples

Setting up DataFrame API

Create a DataFrame

val flightsDF = … ← Create from CSV, JSON, Hive etc.

Example:

val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)

Setting up SQL API

Register a Temporary Table

flightsDF.registerTempTable("flights")

Two API Examples: DataFrame and SQL APIs

DataFrame API

flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API

SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5

Results

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Spark Streaming

What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

Modern Data Applications approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way: explore all data, identify correlations
• Analyze in motion…

Spark Streaming

Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

[Figure: streaming data sources; surviving labels: ZeroMQ, MQTT]

Spark Streaming

Spark Streaming

Discretized Streams (DStreams)
→ High-level abstraction representing a continuous stream of data
→ Internally represented as a sequence of RDDs
→ An operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming

Example: flatMap operation
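
As a rough illustration of the flatMap example, the sketch below models a DStream as a sequence of micro-batches using plain Scala collections, with no Spark dependency. The names MicroBatch, flatMapStream, and words, and the sample lines, are hypothetical and invented here. In Spark Streaming itself the same idea is expressed with DStream.flatMap, which applies the function to each underlying RDD.

```scala
object DStreamFlatMapSketch {
  // Hypothetical stand-in for one RDD of a DStream:
  // the lines received during one batch interval.
  type MicroBatch = Seq[String]

  // Apply flatMap to every micro-batch independently, the way a DStream
  // operation is applied to each underlying RDD.
  def flatMapStream(stream: Seq[MicroBatch])(f: String => Seq[String]): Seq[Seq[String]] =
    stream.map(batch => batch.flatMap(f))

  // Split a line into words, dropping empty tokens.
  def words(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val stream: Seq[MicroBatch] = Seq(
      Seq("spark streaming", "hello world"), // batch at time 1
      Seq("one more line")                   // batch at time 2
    )
    // Each input batch of lines becomes one output batch of words.
    flatMapStream(stream)(words).foreach(println)
  }
}
```

Each batch is transformed on its own, so the number of batches stays the same while the number of records per batch may grow.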

Spark Streaming

Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average
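
The rolling average mentioned above can be sketched in plain Scala, again treating the stream as a sequence of micro-batches. This is only a model of the idea: rollingAverage and its parameters are hypothetical names, and window sizes are counted in batches here, whereas Spark Streaming's DStream.window takes a window duration and a slide duration.

```scala
object WindowSketch {
  // Rolling average over a sliding window of `windowLength` batches,
  // advancing `slideInterval` batches at a time.
  def rollingAverage(batches: Seq[Seq[Double]],
                     windowLength: Int,
                     slideInterval: Int): Seq[Double] =
    batches
      .sliding(windowLength, slideInterval)
      .filter(_.size == windowLength) // keep only full windows
      .map { window =>
        val values = window.flatten
        if (values.isEmpty) Double.NaN else values.sum / values.size
      }
      .toList

  def main(args: Array[String]): Unit = {
    // Three micro-batches of illustrative sensor readings.
    val batches = Seq(Seq(1.0, 3.0), Seq(5.0), Seq(7.0, 9.0))
    // Window of 2 batches, sliding by 1 batch.
    println(rollingAverage(batches, windowLength = 2, slideInterval = 1))
  }
}
```

With a window of two batches and a slide of one, every batch after the first contributes to two consecutive averages.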

Spark MLlib

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81
Intercept: 3.05

y = 2.81x + 3.05
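
To make the training result concrete, here is a minimal plain-Scala sketch of fitting a line to one feature with closed-form ordinary least squares, plus a prediction helper that defaults to the coefficient and intercept shown above. This is a swapped-in textbook method for illustration, not necessarily the solver Spark's LinearRegression uses, and the names fit and predict and the sample points are hypothetical.

```scala
object LinearFitSketch {
  // Ordinary least squares for one feature: returns (coefficient, intercept).
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    require(points.size >= 2, "need at least two points")
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val covXY = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(varX > 0, "x values must not all be equal")
    val coefficient = covXY / varX
    (coefficient, meanY - coefficient * meanX)
  }

  // Prediction with a fitted line; defaults are the values from the slide.
  def predict(x: Double, coefficient: Double = 2.81, intercept: Double = 3.05): Double =
    coefficient * x + intercept

  def main(args: Array[String]): Unit = {
    // Illustrative points that lie exactly on y = 2x + 1.
    val (coef, icpt) = fit(Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)))
    println(s"coefficient=$coef intercept=$icpt")
    println(predict(1.0)) // evaluates y = 2.81x + 3.05 at x = 1
  }
}
```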

Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563

Spark API for building ML pipelines

[Figure: a Pipeline chains Feature transform 1, Feature transform 2, Combine features, and Linear Regression. Train: input DataFrame → Pipeline → PipelineModel. Predict: input DataFrame → PipelineModel → output DataFrame. The PipelineModel can be exported (Export Model).]
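
A pipeline of this shape might look roughly like the sketch below, which assumes the Spark ML library (spark.ml) is on the classpath. It combines two raw feature columns with VectorAssembler and fits a LinearRegression stage. The column names x1, x2, and label are assumptions made for illustration, and the separate feature-transform stages from the diagram are omitted to keep the example short.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

object PipelineSketch {
  // Train: input DataFrame -> Pipeline -> PipelineModel.
  // Assumes numeric columns "x1", "x2" and a numeric "label" column.
  def train(training: DataFrame): PipelineModel = {
    // "Combine features": pack the raw columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")

    // Final stage: the linear regression estimator.
    val lr = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    new Pipeline().setStages(Array(assembler, lr)).fit(training)
  }

  // Predict: input DataFrame -> PipelineModel -> output DataFrame
  // with an added "prediction" column.
  def predict(model: PipelineModel, input: DataFrame): DataFrame =
    model.transform(input)
}
```

Because the fitted PipelineModel carries every stage, the same feature handling is applied at training and prediction time.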

Spark GraphX

GraphX

→ PageRank
→ Topic Modeling (LDA)
→ Community Detection

Source: ampcamp.berkeley.edu

Apache Zeppelin & HDP Sandbox

What's Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

What is a Note/Notebook?

• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features: Sharing/Collaboration/Reports/Import/Export

Big Data Lifecycle

[Figure: lifecycle stages Collect, ETL/Process, Analysis, Report, Data Product; roles Data Engineer, Data Scientist, Business user, Customer]

All in one place in Zeppelin!

How does Zeppelin work?

[Figure: the Notebook Author and Collaborators/Report viewers connect to Zeppelin, which connects to the Cluster: Spark | Hive | HBase, any of 30+ backends]

HDP Sandbox

What's included in the HDP Sandbox?
→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr etc.

[Figure: Spark stack. APIs: Scala, Java, Python, R. Spark SQL, Spark Streaming, MLlib, and GraphX sit on the Spark Core Engine, on top of YARN and HDFS across nodes 1 … N]

Access patterns enabled by YARN

Applications: Batch, Interactive, Real-Time

Batch: needs to happen, but with no timeframe limitations
Interactive: needs to happen at human time
Real-Time: needs to happen at machine execution time

[Figure: applications run on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), across nodes 1 … N]

Why Apache Spark on YARN?

→ Resource management
  – Share Spark workloads with other workloads (HIVE, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues

[Figure: Spark on YARN. Labels: Client, Spark Driver, Spark Application Master (YARN container), and three Spark Executors, each in a YARN container running Tasks]

Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation

[Figure: a logical file is split into blocks 1 to 4, and copies of each block are spread across the nodes of the cluster]
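
The block-and-replica idea can be sketched as a toy model in plain Scala. It computes how many blocks a file needs and assigns each block to three distinct, randomly chosen nodes. The function names, node names, and the 128 MB block size used in the example are assumptions for illustration, and the random choice is a simplification: real HDFS placement also takes rack topology into account.

```scala
import scala.util.Random

object BlockPlacementSketch {
  // Number of blocks needed for a file (the last block may be partly filled).
  def blockCount(fileSizeBytes: Long, blockSizeBytes: Long): Long = {
    require(fileSizeBytes >= 0 && blockSizeBytes > 0)
    (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes
  }

  // Assign each block (numbered from 1) to `replication` distinct nodes
  // chosen at random.
  def place(blocks: Long,
            nodes: Seq[String],
            replication: Int = 3,
            rng: Random = new Random()): Map[Long, Seq[String]] = {
    val candidates = nodes.distinct
    require(candidates.size >= replication, "not enough nodes for the replication factor")
    (1L to blocks).map(b => b -> rng.shuffle(candidates).take(replication)).toMap
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val blocks = blockCount(600 * mb, 128 * mb) // illustrative file and block sizes
    val placement = place(blocks, Seq("node1", "node2", "node3", "node4", "node5"))
    placement.toSeq.sortBy(_._1).foreach { case (block, replicas) =>
      println(s"block $block -> ${replicas.mkString(", ")}")
    }
  }
}
```

Losing any single node still leaves two copies of every block it held, which is the fault tolerance the slide refers to.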

There's more to HDP

Hortonworks Data Platform 2.4.x

GOVERNANCE & INTEGRATION
  Data Lifecycle & Governance: Falcon, Atlas
  Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
  Batch: MapReduce
  Script: Pig
  Search: Solr
  SQL: Hive
  NoSQL: HBase, Accumulo, Phoenix
  Stream: Storm
  In-memory
  Others: ISV Engines
  Tez, Slider

SECURITY
  Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
  Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
  Scheduling: Oozie

DATA MANAGEMENT
  YARN: Data Operating System
  HDFS: Hadoop Distributed File System (nodes 1 … N)

Deployment Choice: Linux, Windows, On-Premise, Cloud

Hortonworks Data Cloud

Bringing Multitenancy to Apache Zeppelin

Introducing Livy

→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service

[Figure: a Livy Client talks HTTP to the Livy Server, which talks HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext]
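
As a hedged sketch of what "REST interface" means in practice, the code below builds the two requests a client typically sends to Livy: one to create an interactive session (POST /sessions) and one to run a code statement in it (POST /sessions/{id}/statements). The helper names and the base URL in the usage note are assumptions for illustration; the JSON escaping is deliberately minimal, and error handling and response parsing are left out.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object LivyRestSketch {
  // Minimal JSON string escaping, enough for short code snippets.
  def jsonString(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case c    => sb.append(c)
    }
    sb.append('"').toString
  }

  // Body for POST /sessions: ask Livy for an interactive Scala session.
  val createSessionBody: String = """{"kind": "spark"}"""

  // Path and body for submitting code to an existing session.
  def statementPath(sessionId: Int): String = s"/sessions/$sessionId/statements"
  def statementBody(code: String): String = s"""{"code": ${jsonString(code)}}"""

  // POST a JSON body and return the HTTP status code.
  def postJson(baseUrl: String, path: String, body: String): Int = {
    val conn = new URL(baseUrl + path).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      try out.write(body.getBytes(StandardCharsets.UTF_8)) finally out.close()
      conn.getResponseCode
    } finally conn.disconnect()
  }
}
```

Against a running server, a call such as postJson("http://localhost:8998", "/sessions", LivyRestSketch.createSessionBody) would request a new session; the host and port here are assumed values and depend on how Livy is deployed.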

Security Across Zeppelin-Livy-Spark

[Figure: security flow from Zeppelin through the Livy Server to Spark on YARN. Labels: LDAP, Shiro, Zeppelin, Ispark Group Interpreter, SPNego: Kerberos, Livy APIs, Livy Server, Kerberos, Spark on YARN, Driver]

Reasons to Integrate with Livy

→ Bring Sessions to Apache Zeppelin
  – Isolation
  – Session sharing
→ Enable efficient cluster resource utilization
  – Default Spark interpreter keeps YARN/Spark job running forever
  – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity Propagation
  – Send user identity from Zeppelin > Livy > Spark on YARN

SparkContext Sharing

[Figure: on the Livy Server, Client 1 and Client 2 use Session-1 and Client 3 uses Session-2; Session-1 maps to Spark Session-1 and Session-2 to Spark Session-2, each with its own SparkContext]

Sample Architecture

Managed Dataflow

[Figure: dataflow across SOURCES, REGIONAL INFRASTRUCTURE, and CORE INFRASTRUCTURE]

High-Level Overview

[Figure: IoT Devices connect to IoT Edge (single node) instances, which feed a NiFi Hub in the Data Center (on prem/cloud). Other labels: Data Broker, Column DB (HBase/Cassandra), Data Store (HDFS/S3), Live Dashboard]

What's new in Spark 2.0

Spark 2.0

→ API Improvements
  – SparkSession (spark): new entry point (replaces SQLContext and HiveContext)
  – Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
  – Structured Streaming / Continuous Application (concept of an infinite DataFrame)
  – Temporary Table → Temporary View
→ Performance Improvements
  – Tungsten Phase 2: multi-stage codegen
  – ORC & Parquet file improvements
→ Machine Learning
  – ML pipeline (spark.ml) is the new primary API; the RDD-based MLlib API is in maintenance mode
  – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ Spark SQL
  – More SQL support (new ANSI SQL parser, subquery support)
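
The API changes listed above can be seen in a small Spark 2.x sketch, assuming Spark SQL is on the classpath. It uses SparkSession as the entry point and createOrReplaceTempView in place of registerTempTable. The object and function names, the local master setting, and the third sample row are assumptions made for illustration; in Zeppelin the session is typically already available as spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Spark2Sketch {
  // Count flights delayed by more than `minDelay`, querying a temporary view.
  def delayedCount(spark: SparkSession, flights: DataFrame, minDelay: Int): Long = {
    flights.createOrReplaceTempView("flights") // Spark 2.0 name for a temp table
    spark
      .sql(s"SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > $minDelay")
      .count()
  }

  def main(args: Array[String]): Unit = {
    // SparkSession replaces SQLContext and HiveContext as the entry point.
    val spark = SparkSession.builder()
      .appName("spark2-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Small in-memory DataFrame; the last row is made up for this example.
    val flights = Seq(("IAD", "TPA", 19), ("IND", "BWI", 34), ("SFO", "LAX", 5))
      .toDF("Origin", "Dest", "DepDelay")

    println(delayedCount(spark, flights, 15))
    spark.stop()
  }
}
```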

What's the latest at Hortonworks?

→ HDP 2.5 – Batch Processing
→ HDF 2.0 – Streaming Apps

[Figure: DATA AT REST and DATA IN MOTION combine into ACTIONABLE INTELLIGENCE for Modern Data Applications]

Lab Preview

Lab Setup Instructions

http://tinyurl.com/hwx-spark-intro

Lab Options
- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud → Setup info: http://hortonworks.github.io/hdp-aws/index.html

Hortonworks Community Connection

Community Engagement

Participate now at: community.hortonworks.com

9,500+ Registered Users
21,000+ Answers
32,500+ Technical Assets

One Website!

Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Robert Hryniewicz
E: rhryniewicz@hortonworks.com
T: @RobH8z

Thanks!