spark summit eu talk by bas geerdink
TRANSCRIPT
![Page 1: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/1.jpg)
Lambda Architecture in the IoTFast Data Analytics with Spark Streaming and MLlib
Bas GeerdinkING
![Page 2: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/2.jpg)
Bas Geerdink
• Chapter Lead in Analytics area at ING
• Master degree in Artificial Intelligence and
Informatics
• Spark Certified Developer
• @bgeerdink
• https://www.linkedin.com/in/geerdink
Who am I?
![Page 3: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/3.jpg)
The Internet of …
![Page 4: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/4.jpg)
• Data
– Streaming data from more sources
• Use cases
– Combining data streams
• Technology
– Fast processing and scalability
• Challenges
– Security: encrypt the sensors/network/server
What’s new in the IoT?
![Page 5: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/5.jpg)
What do we want in the IoT?
• FAST data: process stream of events (sensory data)
• BIG data: process files/tables in batches (static data)
• ONE analytics engine
• ONE querying/visualization tool
• Scalable environment
• Reactive software
![Page 6: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/6.jpg)
Lambda Architecture
Source: Nathan Marz (2013)
![Page 7: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/7.jpg)
Gamma/Kappa/Omega/…
Architecture• Lambda Criticism:
– Too complex
– Two code bases
– Two data stores
• Alternatives:– Treat a stream as a mini-batch
– Treat a batch as a stream of records
– Combine batch and stream within one system
![Page 8: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/8.jpg)
Big Data Reference Architecture
Source: Geerdink (2013)
![Page 9: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/9.jpg)
Use case: Smart Parking
Recommend the best car park when driving to Amsterdam
Show the car park with the highest score, determined by
• Batch (every x minutes), data from car parks
• Stream (every x seconds), GPS data of cars
![Page 10: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/10.jpg)
![Page 11: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/11.jpg)
![Page 12: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/12.jpg)
Stream Process
• Get car events (GPS data)
• Filter events (business rules)
• Store events
• For each car park in the neighborhood:
– Get feature set (location, capacity, usage, …)
– Combine event with previous events (running sum)
– Update feature set
– Predict score (machine learning) and update database
![Page 13: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/13.jpg)
Batch Process
• Get car park data
• Clean up (remove old car data)
• For each car park in the data set:
– Get recent set of car data (running sum)
– Update feature set
– Predict score (machine learning) and update database
![Page 14: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/14.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 15: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/15.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 16: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/16.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 17: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/17.jpg)
Event Processing with Kafka
![Page 18: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/18.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 19: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/19.jpg)
Stream: Data Preparation
![Page 20: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/20.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 21: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/21.jpg)
Stream: Create Recommendation
![Page 22: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/22.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 23: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/23.jpg)
Batch Processing
![Page 24: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/24.jpg)
Lambda Architecture
GPS
updates
Message
Broker
Analytics
Engine
File
System
Capacity
updates
Data Science
Environment
Machine learning models
Business Rules
Environment
Scores
Batch
Stream
1.
Get
features
2.
Apply
rules
3.
Predict
score
Business rules
Features
Selection
APIFront-end
Data Scientist
User
![Page 25: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/25.jpg)
Data Science with Zeppelin
![Page 26: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/26.jpg)
Summary
• The IoT requires new architecture principles: in-memory,
distributed, reactive and scalable are the new normal
• Lambda/Kappa/Omega reference architectures are
designed for combining batch and streaming data flows
• Open source tooling such as Kafka, Cassandra, and
Spark adhere to these principles and design patterns
![Page 27: Spark Summit EU talk by Bas Geerdink](https://reader031.vdocuments.pub/reader031/viewer/2022030317/586f75d01a28ab10258b61a7/html5/thumbnails/27.jpg)
THANK YOU!ING is hiring
#SparkSummit