kafka - github pagesszjug.github.io/files/20150917-big-data/apache_kafka_by... · 2017-04-09 ·...

43
Melanga Dissanayake SEPTEMBER 17, 2015 Apache Kafka

Upload: others

Post on 30-Mar-2020

40 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Melanga Dissanayake

SEPTEMBER 17, 2015

Apache Kafka

Page 2: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

About Me

2

Melanga Dissanayake

Senior Software Engineer - EPAM System (Shenzhen)

• Over 10 years of enterprise application development experience

• Focused on financial domain

Page 3: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

3

Big data is like teenage sex: everyone talks about it,

nobody knows how to do it, everyone thinks everyone else is doing it, so everyone claims they

are doing it…(Dan Ariely)

Page 4: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

4

What is Big Data?

Page 5: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

5

What is Big Data?

3Vs

• Petabytes • Records • Transactions • Tables, files

• Batch • Near Time • Real Time • Streams

• Structured • Unstructured • Semi structured • All the above

Volume

Velocity Variety

Page 6: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

What is Big Data?

6

• Traditional Queue

• Website Activity Tracking

• Metrics - Operational data

• Log Aggregation

• Stream Processing

• Event Sourcing

• Commit Log

Page 7: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Where do we put Big Data?

7

• Traditional Database?

• Flat files?

Page 8: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

HDFS

8

• HDFS - Hadoop Distributed File System

• Highly fault-tolerant

• Java based file system

• Runs on commodity hardware

• Concurrent data access (Coordinated by YARN)

Page 9: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

How do we put these Big Data

9

• CSV file dump

• ETL

• Other messaging systems (ActiveMQ, RabbitMQ, etc)

Page 10: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

Data pipeline starts like this

10

Source: Cloudera

Client Backend

Page 11: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

Then we reuse them

11

Source: Cloudera

Client Backend

Client

Client

Client

Page 12: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

Then we add more backends

12

Source: Cloudera

Client Backend

Client

Client

Client

Backend

Page 13: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

Then it starts to look like this

13

Source: Cloudera

Client Backend

Client

Client

Client

Backend

Backend

Backend

Page 14: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

with may be some of this

14

Source: Cloudera

Client Backend

Client

Client

Client

Backend

Backend

Backend

Page 15: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Data pipeliens

ended up having

15

Source: LinkedIn

Page 16: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

What is Apache Kafka

Apache Kafka is an open-source message broker rethought as a distributed commit log.

Developed using Scala and heavily influenced by transaction logs.

16

Page 17: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Little bit of history

Apache Kafka was initially developed by to pipeline the data across various internal systems.

Developed as a an internal project in early 2011 project was released under open-source license.

Graduated from Apache Incubator on October 2012.

17

Page 18: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

18

Kafka in nutshell

producerproducer producer

consumer consumer consumer

kafka cluster

Page 19: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

19

Kafka Architecture

consumer

broker broker

Zookeeper

consumer

broker broker

producer producer

Page 20: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Kafka Architecture

• Communication between all nodes based on high performance simple binary API over TCP

• Runs on as a cluster of brokers which is a one or more servers in this case

• High performance low level APIs for producer/consumer

• REST API via Kafka REST Proxy

20

Page 21: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Topics & Partitions

• A Topic is a category or feed name to which messages are published.

• Each topic separated to partitioned log where messages are kept.

• Partitions are replicated and distributed across the Kafka cluster for high availability and fault tolerance.

21

Page 22: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Partition

22

10001

10002

10003

10004

10005

10006

10007

10008

10009

10010

producer

Send

10010Write

consumer

consumer

consumer

• Multiple consumers can read from same topic on their own pace

• Messages are kept on log for predefined period of time

• Consumer maintain the message offset

Read

Read

Read

Page 23: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Partition

• Consumers can go away

23

10001

10002

10003

10004

10005

10006

10007

10008

10009

10010

producer

Send

10010Write

consumer

consumer

Read

Read

Page 24: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Partition

24

10001

10002

10003

10004

10005

10006

10007

10008

10009

10010

producer

Send

10010Write

consumer

consumer

consumer

• and come back

Read

Read

Read

Page 25: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

25

Kafka Architecture

consumer

broker broker

Zookeeper

consumer

broker broker

producer producer

Page 26: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Consumers and Consumer Groups

Kafka addresses 2 traditional messaging models

• Queuing

• Publish-subscribe

using “consumer group”, a single consumer abstraction

26

Page 27: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

27

Consumers and Consumer Groups

Page 28: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

28

Consumers and Consumer Groups

Consumer Group A

ConsumerConsumer

Consumer Group B

Consumer

Partition 0 Partition 0 ……………………………………………

Topic

Page 29: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

29

Message order and parallelism

• Retain messages in-order and handover to consumer in-order.

• Message order is not guaranteed when it’s come to parallel processing unless it’s a

exclusive consumer per queue.

• Partition on topics

• Partition is assigned to a consumer group so that each portion consumed by a single

consumer in consumer group.

• Message order guaranteed on partition basis

• Message key can be used to order explicitly (consumer)

• One consumer instance per partition within the consumer group

TRADITIONAL QUEUE

KAFKA WAY - THE PARTITION

Page 30: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

30

Replication

• Each portion of a topic has a 1 leader and 0 or more replicas

• Partitions are selected “round-robin” to balance the load unless it is required to maintain the order

• Leader handles all writes to the partition

Server1 Server1 Server1

A:0

A:1

B:0

A:0

A:1

B:0

A:0

A:1

Controller

Page 31: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

31

Durability and throughput

• Durability can be configured on producer level

• Durability ~ Throughput

• ISR - group of in-sync replicas for a partition

Durability Behaviour Per Event Latency

Highest ACK all ISRs have received Highest

Medium ACK once the leader has received Medium

Lowest No ACK required Lowest

Page 32: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

32

Durability and throughput

Page 33: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

33

Use case - LinkedIn

Source: LinkedIn

Traditional jerry-rigged pipped architecture

Page 34: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

34

Use case - LinkedIn

Source: LinkedIn

Page 35: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

35

Use case - LinkedIn

Source: LinkedIn

Stream-centric data architecture

Page 36: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

36

What is Stream Processing?

Page 37: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Stream Processing

37

Stream processing is a generalisation of Batch processing

Page 38: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Stream Processing

38

Transform

Transform

Transform

Intermediate Kafka Topic

Transform

Transform

Transform

Output Kafka Topic

Data Store

Consumer

Input Kafka Topic

cat input | grep “foo” | wc

Page 39: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Stream Processing with Kafka

39

+ = Stream Processing

Page 40: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Spark

40

• Data-Parallel computation • Micro batch processing • APIs in Java, Scala, Python • In-memory storage

Page 41: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Storm

41

• Even-Parallel computation • One-at-a-time processing • Micro batch processing is possible with Trident • APIs in Java, Scala, Python, Clojure, Ruby, etc • Suitable for processing complex event data • Transform unstructured data in to desired format

Page 42: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

Samza

42

• One-at-a-time processing • Based on messages and partitions • APIs in Java, Scala • Suitable for processing large amount od data

Page 43: Kafka - GitHub Pagesszjug.github.io/files/20150917-big-data/Apache_Kafka_by... · 2017-04-09 · Apache Kafka was initially developed by to pipeline the data across various internal

43