#hstokyo16 apache spark crash course
TRANSCRIPT
Robert Hryniewicz, Data Advocate
Twitter: @RobH8z  Email: [email protected]
Apache Spark Crash Course, Hadoop Summit Tokyo 2016
© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda
• Background
• Spark Overview
• Zeppelin Overview
• Components of HDP
• Lab ~45 min

Data Sources
→ Internet of Anything (IoAT)
  – Wind Turbines, Oil Rigs, Cars
  – Weather Stations, Smart Grids
  – RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
  – Twitter, Facebook, Snapchat, YouTube
  – Clickstream, Ads, User Engagement
  – Payments: Paypal, Venmo
44 ZB in 2020

The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates
At least an order of magnitude difference between memory and hard drive / network speed
FAST, slower, slowest

What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
→ Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
→ Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)

Spark Ecosystem
Spark Core
Spark SQL, Spark Streaming, Spark MLlib, GraphX

Apache Spark Basics

SparkContext
What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)

Spark SQL

Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API

DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Figure: a DataFrame drawn as a grid of rows and columns (Col1, Col2, …, ColN)]
Data is described as a DataFrame with rows, columns, and a schema

DataFrames
Created from Various Sources
[Figure: sources (CSV, Avro, Hive, Text, JSON) → Spark SQL → DataFrame of rows and columns (Col1, Col2, …, ColN)]
→ DataFrames from Hive:
  – Reading and writing Hive tables
→ DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBase, Avro

SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext
→ Superset of functionality provided by basic SQLContext
  – Read data from Hive tables
  – Access to Hive Functions → UDFs
→ Use when your data resides in Hive
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)

Spark SQL Examples

Setting up DataFrame API
Create a DataFrame
val flightsDF = … ← Create from CSV, JSON, Hive etc.
Example:
val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)

Setting up SQL API
Register a Temporary Table
flightsDF.registerTempTable("flights")

Two API Examples: DataFrame and SQL APIs
DataFrame API
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Spark Streaming

What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
  – Structure built to store data
  – Start with hypothesis, test against selected data
  – Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
  – Data is the structure
  – Data leads the way: explore all data, identify correlations
  – Analyze in motion…

Spark Streaming
Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant
[Figure labels: ZeroMQ, MQTT]

Spark Streaming
Discretized Streams (DStreams)
→ High-level abstraction representing a continuous stream of data
→ Internally represented as a sequence of RDDs
→ An operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming
Example: flatMap operation
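
To make the DStream idea concrete, here is a minimal plain-Scala sketch that needs no Spark. It models a stream as a sequence of micro-batches, each standing in for one RDD, and shows that a flatMap on the stream is simply a flatMap on every batch. The names `Batch`, `flatMapStream`, `lines` and `words` are made up for illustration and are not Spark API; in Spark Streaming the corresponding call would be `flatMap` on a DStream.

```scala
// Toy model: a "stream" is a sequence of micro-batches; each batch stands in for one RDD.
// All names here are illustrative, not Spark API.
type Batch[A] = Seq[A]

// Applying flatMap to the stream means applying flatMap to every underlying batch.
def flatMapStream[A, B](stream: Seq[Batch[A]])(f: A => Seq[B]): Seq[Batch[B]] =
  stream.map(batch => batch.flatMap(f))

// Three micro-batches of text lines, e.g. one per batch interval.
val lines: Seq[Batch[String]] = Seq(
  Seq("to be or", "not to be"),
  Seq("that is"),
  Seq("the question")
)

// Split every line into words, batch by batch.
val words: Seq[Batch[String]] = flatMapStream(lines)(line => line.split(" ").toSeq)

words.zipWithIndex.foreach { case (batch, i) =>
  println(s"batch $i: ${batch.mkString(", ")}")
}
```

Batch boundaries are preserved: each input batch yields exactly one output batch, which is the property the DStream bullets describe.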

Spark Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average
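
A rolling average over a sliding window can be sketched in plain Scala as follows. This is a toy model, not Spark code: window length and slide interval are counted in batches rather than in time, and `rollingAverage` and `readings` are illustrative names. In Spark Streaming, a DStream's `window(windowDuration, slideDuration)` method plays the equivalent role.

```scala
// Toy model of a windowed operation: a rolling average over per-batch values.
// windowLength and slideInterval are counted in batches here; the names are illustrative.
def rollingAverage(values: Seq[Double], windowLength: Int, slideInterval: Int): List[Double] =
  values
    .sliding(windowLength, slideInterval)    // consecutive windows of batches
    .filter(_.size == windowLength)          // drop a trailing partial window
    .map(window => window.sum / window.size)
    .toList

// One made-up reading per batch interval.
val readings = Seq(10.0, 20.0, 30.0, 40.0, 50.0)

// Window of 3 batches, sliding forward 1 batch at a time.
val averages = rollingAverage(readings, windowLength = 3, slideInterval = 1)
println(averages.mkString(", "))
```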

Spark MLlib

Where Can We Use Machine Learning (Data Science)
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Training Result
Coefficients: 2.81  Intercept: 3.05
y = 2.81x + 3.05
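
As a rough illustration of what "training" computes for a single feature, the sketch below fits a line with closed-form ordinary least squares in plain Scala. This is a swapped-in technique for illustration, not the way Spark's LinearRegression is implemented. The input is only the eleven (label, feature) rows visible on the scatter-data slide, so the fitted numbers are not expected to reproduce the 2.81 and 3.05 reported for the full dataset. `fitLine` is a hypothetical helper name.

```scala
// Ordinary least squares for one feature: y ≈ slope * x + intercept.
// The (label, feature) pairs are the rows visible on the scatter-data slide;
// the full dataset is not shown there.
val rows: Seq[(Double, Double)] = Seq(
  (-12.0, -4.9), (-6.0, -4.5), (-7.2, -4.1), (-5.0, -3.2), (-2.0, -3.0), (-3.1, -2.1),
  (-4.0, -1.5), (-2.2, -1.2), (-2.0, -0.7), (1.0, -0.5), (-0.7, -0.2)
)

def fitLine(data: Seq[(Double, Double)]): (Double, Double) = {
  val n = data.size.toDouble
  val meanY = data.map(_._1).sum / n
  val meanX = data.map(_._2).sum / n
  // slope = covariance(x, y) / variance(x)
  val slope = data.map { case (y, x) => (x - meanX) * (y - meanY) }.sum /
              data.map { case (_, x) => (x - meanX) * (x - meanX) }.sum
  val intercept = meanY - slope * meanX
  (slope, intercept)
}

val (slope, intercept) = fitLine(rows)
println(f"y = $slope%.2fx + $intercept%.2f")
```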

Linear Regression (two features)
Coefficients: [0.464, 0.464]  Intercept: 0.0563
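
Once trained, a linear model predicts by taking the dot product of its coefficients with the feature vector and adding the intercept. The plain-Scala sketch below applies that formula to the coefficients reported on the one-feature and two-feature slides. The helper `predict` and the input points are made up for illustration.

```scala
// A linear model's prediction: dot product of coefficients and features, plus the intercept.
def predict(coefficients: Seq[Double], intercept: Double, features: Seq[Double]): Double = {
  require(coefficients.size == features.size, "one coefficient per feature")
  coefficients.zip(features).map { case (c, x) => c * x }.sum + intercept
}

// One-feature model from the training-result slide: y = 2.81x + 3.05.
// The input x = 1.0 is an arbitrary illustrative point.
val y1 = predict(Seq(2.81), 3.05, Seq(1.0))

// Two-feature model; the input point (1.0, 2.0) is made up for illustration.
val y2 = predict(Seq(0.464, 0.464), 0.0563, Seq(1.0, 2.0))

println(s"one feature: $y1, two features: $y2")
```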

Spark API for building ML pipelines
[Figure: a Pipeline chains Feature transform 1 → Feature transform 2 → Combine features → Linear Regression. Train: input DataFrame + Pipeline → PipelineModel. Predict: input DataFrame + PipelineModel → output DataFrame. The PipelineModel can also be exported (Export Model).]
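
The train and predict flow can be sketched with the spark.ml Pipeline API roughly as below. This assumes Spark 2.x with spark-sql and spark-mllib on the classpath and a local master. The training rows are made up so that they lie exactly on label = 2*x1 + 3*x2 + 1, and the feature-transform stages from the figure are left out to keep the sketch short, so only the "combine features" and regression stages appear.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in Zeppelin a session is normally provided already.
val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
import spark.implicits._

// Made-up training data lying exactly on label = 2*x1 + 3*x2 + 1.
val training = Seq(
  (1.0, 0.0, 3.0), (0.0, 1.0, 4.0), (1.0, 1.0, 6.0), (2.0, 1.0, 8.0), (1.0, 3.0, 12.0)
).toDF("x1", "x2", "label")

// Stage 1: combine the raw columns into a single features vector.
val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
// Stage 2: the estimator.
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(assembler, lr))

// Train: Pipeline + input DataFrame => PipelineModel
val model: PipelineModel = pipeline.fit(training)

// Predict: PipelineModel + input DataFrame => output DataFrame with a "prediction" column
val predictions = model.transform(training)
predictions.select("x1", "x2", "label", "prediction").show()

// The fitted regression is the last stage of the PipelineModel.
val lrModel = model.stages.last.asInstanceOf[LinearRegressionModel]
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
```

Keeping every step inside one Pipeline means the same transformations are applied at training time and at prediction time.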

Spark GraphX

GraphX
→ PageRank
→ Topic Modeling (LDA)
→ Community Detection
Source: ampcamp.berkeley.edu

Apache Zeppelin & HDP Sandbox

What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features: Sharing / Collaboration / Reports / Import / Export

Big Data Lifecycle
[Figure: Collect → ETL/Process → Analysis → Report → Data Product; roles shown: Data Engineer, Data Scientist, Business user, Customer]
All in one place in Zeppelin!

How does Zeppelin work?
[Figure: Notebook Author and Collaborators/Report viewers ↔ Zeppelin ↔ Cluster (Spark | Hive | HBase, any of 30+ backends)]

HDP Sandbox
What’s included in the HDP Sandbox?
→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr etc.
[Figure: APIs (Scala, Java, Python, R) on top of the Spark Core Engine with Spark SQL, Spark Streaming, MLlib and GraphX, running on YARN over HDFS nodes 1…N]

Access patterns enabled by YARN
[Figure: Batch, Interactive and Real-Time applications on YARN (Data Operating System) over HDFS (Hadoop Distributed File System), nodes 1…N]
→ Batch: needs to happen, but no timeframe limitations
→ Interactive: needs to happen at human time
→ Real-Time: needs to happen at machine execution time

Why Apache Spark on YARN?
→ Resource management
  – Share Spark workloads with other workloads (Hive, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Figure: Client (Spark Driver), Spark Application Master in a YARN container, and three Spark Executors, each in its own YARN container running two Tasks]

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
  • Not just storage but computation
[Figure: a logical file split into blocks 1–4, with three copies of each block spread across the cluster]
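
The block-and-replica idea can be sketched in plain Scala as below. This is a toy model of the slide's description only: sizes are in arbitrary units, the node names and the helpers `splitIntoBlocks` and `placeReplicas` are made up, and replicas are placed purely at random, whereas real HDFS placement also takes rack topology into account.

```scala
import scala.util.Random

// Toy model: split a file into fixed-size blocks and place 3 copies of each block
// on distinct nodes chosen at random. Sizes and node names are illustrative only.
val replicationFactor = 3

// Returns (blockId, blockLength) pairs; the last block may be shorter than blockSize.
def splitIntoBlocks(fileSize: Long, blockSize: Long): Seq[(Int, Long)] = {
  val blockCount = ((fileSize + blockSize - 1) / blockSize).toInt  // ceiling division
  (0 until blockCount).map { i =>
    val length = math.min(blockSize, fileSize - i * blockSize)
    (i, length)
  }
}

// Picks replicationFactor distinct nodes for every block.
def placeReplicas(blockIds: Seq[Int], nodes: Seq[String], rng: Random): Map[Int, Seq[String]] =
  blockIds.map(id => id -> rng.shuffle(nodes).take(replicationFactor)).toMap

val nodes = (1 to 6).map(i => s"node$i")
val blocks = splitIntoBlocks(fileSize = 500L, blockSize = 128L)
val placement = placeReplicas(blocks.map(_._1), nodes, new Random(42))

blocks.foreach { case (id, length) =>
  println(s"block $id ($length units) -> ${placement(id).mkString(", ")}")
}
```

Because every block lives on three different nodes, losing one node still leaves two copies of each block it held, and a task can be scheduled on any node that already holds the block it needs.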

There’s more to HDP
Hortonworks Data Platform 2.4.x
GOVERNANCE & INTEGRATION
  – Data Lifecycle & Governance: Falcon, Atlas
  – Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
  – Batch: MapReduce
  – Script: Pig
  – Search: Solr
  – SQL: Hive
  – NoSQL: HBase, Accumulo, Phoenix
  – Stream: Storm
  – In-memory
  – Others: ISV Engines
  – Tez, Slider
DATA MANAGEMENT
  – YARN: Data Operating System
  – HDFS: Hadoop Distributed File System (nodes 1…N)
SECURITY
  – Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
  – Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
  – Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premise, Cloud

Hortonworks Data Cloud

Bringing Multitenancy to Apache Zeppelin

Introducing Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service
[Figure: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]

Security Across Zeppelin-Livy-Spark
[Figure labels: Zeppelin (Shiro, LDAP, Ispark Group Interpreter); SPNego: Kerberos; Livy Server (Livy APIs); Kerberos; Spark on YARN (Driver)]

Reasons to Integrate with Livy
→ Bring Sessions to Apache Zeppelin
  – Isolation
  – Session sharing
→ Enable efficient cluster resource utilization
  – Default Spark interpreter keeps YARN/Spark job running forever
  – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity Propagation
  – Send user identity from Zeppelin > Livy > Spark on YARN

SparkContext Sharing
[Figure: the Livy Server hosts Session-1 and Session-2, each backed by its own Spark session and SparkContext; Client 1 and Client 2 share Session-1, Client 3 uses Session-2]

Sample Architecture

Managed Dataflow
[Figure labels: SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE]

High-Level Overview
[Figure labels: IoT Devices; IoT Edge (single node); NiFi Hub; Data Broker; Column DB; Data Store; Live Dashboard; Data Center (on prem/cloud); HDFS/S3; HBase/Cassandra]

What’s new in Spark 2.0

Spark 2.0
→ API Improvements
  – SparkSession (spark): new entry point (replaces SQLContext and HiveContext)
  – Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
  – Structured Streaming / Continuous Application (concept of an infinite DataFrame)
  – Temporary Table → Temporary View
→ Performance Improvements
  – Tungsten Phase 2: multi-stage code generation
  – ORC & Parquet file improvements
→ Machine Learning
  – DataFrame-based ML Pipeline API is the primary API; the RDD-based MLlib API is in maintenance mode
  – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ Spark SQL
  – More SQL support (new ANSI SQL parser, subquery support)
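
A short sketch of how the API changes look in code, assuming Spark 2.x with spark-sql on the classpath and a local master. The first three rows are taken from the flight-delay results shown earlier in the deck; the fourth row (SFO to LAX, no delay) is made up so the filter has something to exclude.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// SparkSession is the single entry point in Spark 2.0
// (in Zeppelin it is usually pre-created as `spark`).
val spark = SparkSession.builder().master("local[*]").appName("spark2-sketch").getOrCreate()
import spark.implicits._

// Three rows from the earlier results, plus one made-up on-time flight.
val flightsDF: DataFrame = Seq(
  ("IAD", "TPA", 19), ("IND", "BWI", 34), ("IND", "JAX", 25), ("SFO", "LAX", 0)
).toDF("Origin", "Dest", "DepDelay")

// DataFrame is an alias for Dataset[Row], so this assignment needs no conversion.
val asDataset: Dataset[Row] = flightsDF

// Temporary Table -> Temporary View
flightsDF.createOrReplaceTempView("flights")

// SQL goes through the session rather than a separate SQLContext or HiveContext.
val delayed = spark.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
delayed.show()
```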

What’s the latest at Hortonworks?
→ HDP 2.5 – Batch Processing
→ HDF 2.0 – Streaming Apps
[Figure: DATA AT REST + DATA IN MOTION → ACTIONABLE INTELLIGENCE (Modern Data Applications)]

Lab Preview

Lab Setup Instructions
http://tinyurl.com/hwx-spark-intro
Lab Options
- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud → Setup info: http://hortonworks.github.io/hdp-aws/index.html

Hortonworks Community Connection

Community Engagement
Participate now at: community.hortonworks.com
9,500+ Registered Users
21,000+ Answers
32,500+ Technical Assets
One Website!

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Robert Hryniewicz
Email: [email protected]  Twitter: @RobH8z
Thanks!