#hstokyo16 apache spark crash course

Robert Hryniewicz, Data Advocate

Twitter: @RobH8z

Email: rhryniewicz@hortonworks.com

Apache Spark Crash Course, Hadoop Summit Tokyo 2016

© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda

• Background
• Spark Overview
• Zeppelin Overview
• Components of HDP
• Lab ~45 min

Data Sources

• Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables

• User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo

44 ZB in 2020

The “Big Data” Problem

Problem
• A single machine cannot process or even store all the data!

Solution
• Distribute data over large clusters

Difficulty
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates

At least an order of magnitude difference between memory and hard drive/network speed

FAST, slower, slowest

What Is Apache Spark?

• Apache open source project originally developed at AMPLab (University of California Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?

• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)

• In-memory computation model
– Fast!
– Effective for iterative computations and ML

• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)

Spark Ecosystem

Spark Core

Spark SQL, Spark Streaming, Spark MLlib, GraphX

Apache Spark Basics

SparkContext

What is it?

• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code (in Zeppelin)
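A minimal sketch of using the SparkContext, assuming a standalone run on a local master. The app name `sc-sketch`, the `local[*]` master and the sample numbers are illustrative assumptions; in Zeppelin or spark-shell `sc` already exists, and `getOrCreate` simply reuses it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// App name and local master are assumptions for a standalone run.
// If a SparkContext is already running, getOrCreate returns that one.
val conf = new SparkConf().setAppName("sc-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Distribute a local collection across the cluster as an RDD.
val numbers = sc.parallelize(1 to 10)

// map is a lazy transformation; reduce is an action that triggers the job.
val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
println(sumOfSquares)
```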

Spark SQL

Spark SQL Overview

• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)

• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames

• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R

[Figure: a DataFrame drawn as a table with columns Col1, Col2, …, ColN]

Data is described as a DataFrame with rows, columns, and a schema

DataFrames

Created from Various Sources

• DataFrames from Hive:
– Reading and writing Hive tables

• DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro

[Figure: sources such as CSV and Avro loaded through Spark SQL into a DataFrame with columns Col1, Col2, …, ColN]

SQLContext

• Entry point into all functionality in Spark SQL
• All you need is a SparkContext

val sqlContext = new SQLContext(sc)

HiveContext

• Superset of functionality provided by basic SQLContext
– Read data from Hive tables
– Access to Hive Functions → UDFs

val hc = new HiveContext(sc)

Use when your data resides in Hive

Spark SQL Examples

Setting up DataFrame API

Create a DataFrame

val flightsDF = … ← Create from CSV, JSON, Hive etc.

Example:

val path = "examples/flights.json"

val flightsDF = sqlContext.read.json(path)

Setting up SQL API

Register a Temporary Table

flightsDF.registerTempTable("flights")

Two API Examples: DataFrame and SQL APIs

DataFrame API

flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API

SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5

Results

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
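The two queries above can be run end to end as in the following sketch. It assumes a standalone run (or a fresh shell) on a local master; the app name and the four in-memory rows are hypothetical stand-ins for examples/flights.json, which is not available here. The SQL text is passed to `sqlContext.sql`, which returns a DataFrame just like the DataFrame API does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// App name and local master are assumptions for a standalone run.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flights-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables toDF and the $"column" syntax

// Hypothetical rows standing in for examples/flights.json.
val flightsDF = Seq(
  ("IAD", "TPA", 19),
  ("IND", "BWI", 34),
  ("IND", "JAX", 5),
  ("IND", "LAS", 67)
).toDF("Origin", "Dest", "DepDelay")

// DataFrame API: select columns, then keep delays above 15 minutes.
val viaDataFrame = flightsDF
  .select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15)

// SQL API: register the DataFrame under a table name, then query it.
flightsDF.registerTempTable("flights")
val viaSql = sqlContext.sql(
  "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")

viaDataFrame.show()
viaSql.show()
```

Both forms describe the same query, so they should return the same rows; which one to use is largely a matter of taste.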

Spark Streaming

What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing + Batch Processing = All Data Analytics
real-time (now) + historical (past)

Modern Data Applications approach to Insights

Traditional Analytics: Structured & Repeatable
– Structure built to store data
– Start with hypothesis, test against selected data
– Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
– Data is the structure
– Data leads the way: explore all data, identify correlations
– Analyze in motion…

Spark Streaming

Overview

• Extension of Spark Core API

• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant

[Figure: data sources, including ZeroMQ, feeding Spark Streaming]

Spark Streaming

Discretized Streams (DStreams)

• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming

Example: flatMap operation
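The idea that one DStream operation becomes one operation per underlying RDD can be sketched without a cluster. In this illustration plain Scala collections stand in for the Spark types (an assumption made for readability): each inner `Seq` plays the role of one RDD (one micro-batch) and the outer `Seq` plays the role of the DStream. The names `Batch`, `MiniStream` and `flatMapStream` are hypothetical and not part of Spark.

```scala
// Stand-ins for Spark types: one Batch ~ one RDD, one MiniStream ~ one DStream.
type Batch[A] = Seq[A]
type MiniStream[A] = Seq[Batch[A]]

// Applying flatMap to the stream means applying flatMap to every batch.
def flatMapStream[A, B](stream: MiniStream[A])(f: A => Seq[B]): MiniStream[B] =
  stream.map(batch => batch.flatMap(f))

// Two micro-batches of text lines (hypothetical input).
val lines: MiniStream[String] = Seq(
  Seq("to be or", "not to be"), // batch at time 1
  Seq("that is the question")   // batch at time 2
)

// Split every line into words, batch by batch.
val words = flatMapStream(lines)(line => line.split(" ").toSeq)
words.foreach(batch => println(batch.mkString(", ")))
```

The batch boundaries survive the operation: the result is still one collection of words per time interval.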

Spark Streaming

Window Operations

• Apply transformations over a sliding window of data, e.g. rolling average
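A rolling average over a sliding window can be sketched with plain Scala collections, again treating each inner `Seq` as one micro-batch. The readings, the window length of 3 batches and the slide of 1 batch are hypothetical values, and `rollingAverage` is an illustrative helper rather than a Spark function. In Spark Streaming itself the window is expressed on the DStream (for example with its `window` operation), with durations instead of batch counts.

```scala
// Each inner Seq is one micro-batch of readings (hypothetical values).
val batches: Seq[Seq[Double]] = Seq(
  Seq(1.0, 3.0), Seq(5.0), Seq(7.0, 9.0), Seq(11.0))

// Average all readings inside each window of `windowLength` batches,
// moving the window forward by `slide` batches each time.
def rollingAverage(stream: Seq[Seq[Double]], windowLength: Int, slide: Int): Seq[Double] =
  stream.sliding(windowLength, slide).map { window =>
    val values = window.flatten
    values.sum / values.size
  }.toList

val averages = rollingAverage(batches, windowLength = 3, slide = 1)
println(averages)
```

With four batches and a window of three, two windows are produced: the first covers batches 1 to 3, the second batches 2 to 4.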

Spark MLlib

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
.........

Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81 Intercept: 3.05

y = 2.81x + 3.05
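To show what the coefficient and intercept mean, here is a sketch that fits a line to a handful of points with closed-form ordinary least squares in plain Scala. This is a swapped-in technique for illustration: Spark's own linear regression may use a different solver, and the four points are hypothetical, not the scatterData set from the slide, so the result differs from 2.81 and 3.05. `fitLine` is an illustrative helper name.

```scala
// Hypothetical (x, y) points; they lie exactly on y = 2x + 1.
val points: Seq[(Double, Double)] =
  Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0))

// Closed-form least squares for one feature:
// slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
// intercept = meanY - slope * meanX
def fitLine(data: Seq[(Double, Double)]): (Double, Double) = {
  val n = data.size.toDouble
  val meanX = data.map(_._1).sum / n
  val meanY = data.map(_._2).sum / n
  val covXY = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
  val slope = covXY / varX
  (slope, meanY - slope * meanX)
}

val (coefficient, intercept) = fitLine(points)
println(f"y = $coefficient%.2fx + $intercept%.2f")
```

The coefficient is the slope of the fitted line and the intercept is its value at x = 0, which is how the line y = 2.81x + 3.05 in the training result is read.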

Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563

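Spark API for building ML pipelines

[Figure: an Input DataFrame passes through a Pipeline whose stages are Feature transform, Combine features and Linear Regression; the resulting PipelineModel is used to Predict an Output DataFrame and can be exported (Export Model)]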

Spark GraphX

GraphX

• PageRank
• Topic Modeling (LDA)
• Community Detection

Source: ampcamp.berkeley.edu

Apache Zeppelin & HDP Sandbox

What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more

What is a Note/Notebook?

• A web-based GUI for small code snippets
• Write code snippets in the browser
• Zeppelin sends code to the backend for execution
• Zeppelin gets data back from the backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other features: Sharing/Collaboration/Reports/Import/Export

Big Data Lifecycle

Stages: Collect, ETL/Process, Analysis, Report, Data Product

Roles: Business user, Customer, Data Scientist, Data Engineer

All in one place in Zeppelin!

How does Zeppelin work?

[Figure: the Notebook Author and Collaborators/Report viewers use Zeppelin, which talks to the Cluster: Spark | Hive | HBase, any of 30+ backends]

HDP Sandbox

What’s included in the HDP Sandbox?

• Zeppelin
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
• And many more components: Hive, Solr etc.

[Figure: Spark stack on YARN. Languages: Scala, Java, Python, R; Spark Core Engine; Spark SQL, Spark Streaming, MLlib, GraphX]

Access patterns enabled by YARN

YARN: Data Operating System

HDFS: Hadoop Distributed File System

Applications: Batch, Interactive, Real-Time

• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time

Why Apache Spark on YARN?

• Resource management
– Share Spark workloads with other workloads (Hive, Solr, etc.)

• Utilizes existing HDP cluster infrastructure

• Scheduling and queues

[Figure: Spark on YARN. Components shown: Client, Spark Driver, Spark Application Master (in a YARN container), and three Spark Executors, each in its own YARN container running Tasks]

Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster

Processing Data Locality
• Not just storage but computation
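The block-and-replica idea can be sketched in plain Scala. This is a toy model under stated assumptions: the 300 MB file, the 128 MB block size, the five node names and the fixed random seed are all hypothetical, `splitIntoBlocks` and `placeBlocks` are illustrative helpers, and real HDFS placement also takes rack topology into account rather than being purely random.

```scala
import scala.util.Random

// Split a file of `fileSize` bytes into fixed-size blocks; the last block may be smaller.
def splitIntoBlocks(fileSize: Long, blockSize: Long): Seq[Long] = {
  val fullBlocks = (fileSize / blockSize).toInt
  val remainder = fileSize % blockSize
  Seq.fill(fullBlocks)(blockSize) ++ (if (remainder > 0) Seq(remainder) else Seq.empty)
}

// Place every block on `replicas` distinct nodes chosen at random.
def placeBlocks(blockCount: Int, nodes: Seq[String], replicas: Int, rng: Random): Map[Int, Seq[String]] =
  (0 until blockCount).map(i => i -> rng.shuffle(nodes).take(replicas)).toMap

val mb = 1024L * 1024L
val blocks = splitIntoBlocks(300 * mb, 128 * mb) // assumed file and block sizes
val nodes = Seq("node1", "node2", "node3", "node4", "node5") // hypothetical cluster
val placement = placeBlocks(blocks.size, nodes, replicas = 3, rng = new Random(42))

placement.toSeq.sortBy(_._1).foreach { case (block, replicaNodes) =>
  println(s"block $block -> ${replicaNodes.mkString(", ")}")
}
```

Because each block lives on three different nodes, losing one node leaves two copies of every block, and a computation can be scheduled on any node that already holds the block it needs.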

[Figure: a Logical File is split into Blocks, which are distributed across the Cluster]

There’s more to HDP

Hortonworks Data Platform 2.4.x

• Governance & Integration
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

• Data Access
– MapReduce, Script, Search, HBase, Accumulo, Phoenix, Stream, In-memory, Others (ISV Engines)
– Tez, Slider
– YARN: Data Operating System

• Data Management
– HDFS: Hadoop Distributed File System

• Security
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption

• Operations
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling

• Deployment Choice: Linux, Windows, On-Premise, Cloud

Hortonworks Data Cloud

Bringing Multitenancy to Apache Zeppelin

Introducing Livy

• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service

[Figure: the Livy Client talks to the Livy Server over HTTP; the Livy Server talks over HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext]

Security Across Zeppelin-Livy-Spark

[Figure: Zeppelin (IsparkGroupInterpreter) calls the Livy APIs on the Livy Server using SPNego: Kerberos; the Livy Server reaches the Driver in Spark on YARN using Kerberos]

Reasons to Integrate with Livy

• Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing

• Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)

• Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN

SparkContext Sharing

[Figure: Client 1, Client 2 and Client 3 connect to Session-1 and Session-2 on the Livy Server; each session maps to its own Spark session (Spark Session-1, Spark Session-2) with its own SparkContext]

Sample Architecture

High-Level Overview

Managed Dataflow: SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE

[Figure: IoT Devices, IoT Edge (single node), NiFi Hub, Data Broker, Column DB, Data Store, Live Dashboard, Data Center (on prem/cloud), HDFS/S3, HBase/Cassandra]

What’s new in Spark 2.0

Spark 2.0

• API Improvements
– SparkSession (spark): new entry point (replaces SQLContext and HiveContext)
– Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
– Structured Streaming / Continuous Application (concept of an infinite DataFrame)
– Temporary Table → Temporary View

• Performance Improvements
– Tungsten Phase 2: multi-stage code generation
– ORC & Parquet file improvements

• Machine Learning
– ML pipeline is the new API; the RDD-based MLlib API is in maintenance mode
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)

• Spark SQL
– More SQL support (new ANSI SQL parser, subquery support)
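A minimal sketch of the Spark 2.0 entry point and the temporary view, assuming Spark 2.x on a local master. The app name and the three in-memory rows are hypothetical; in spark-shell 2.x a session named `spark` already exists and `getOrCreate` returns it.

```scala
import org.apache.spark.sql.SparkSession

// App name and local master are assumptions for a standalone run.
val spark = SparkSession.builder()
  .appName("spark2-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables toDF on local collections

// Hypothetical rows; no SQLContext or HiveContext is needed.
val flightsDF = Seq(
  ("IAD", "TPA", 19),
  ("IND", "BWI", 34),
  ("IND", "JAX", 5)
).toDF("Origin", "Dest", "DepDelay")

// A temporary view takes the place of the earlier temporary table.
flightsDF.createOrReplaceTempView("flights")

val delayed = spark.sql(
  "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
delayed.show()
```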

What’s the latest at Hortonworks?

• HDP 2.5: Batch Processing (DATA AT REST)
• HDF 2.0: Streaming Apps (DATA IN MOTION)

ACTIONABLE INTELLIGENCE

Modern Data Applications

Lab Preview

Lab Setup Instructions

http://tinyurl.com/hwx-spark-intro

Lab Options

- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud → Setup info: http://hortonworks.github.io/hdp-aws/index.html

Hortonworks Community Connection

Community Engagement

9,500+ Registered Users

21,000+ Answers

32,500+ Technical Assets

One Website!

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Robert Hryniewicz

E: rhryniewicz@hortonworks.com

T: @RobH8z

Thanks!
