approaching real-time-hadoop

Approaching real-time: Things you can do before going Impala Chris Huang SPN Hadoop Architect

Upload: chris-huang

Post on 27-Jan-2015


TRANSCRIPT

Page 1: Approaching real-time-hadoop

Approaching real-time: Things you can do before going Impala

Chris Huang, SPN Hadoop Architect

Page 2: Approaching real-time-hadoop

04/10/2023

Confidential | Copyright 2013 TrendMicro Inc. 2

About – Chris Huang

• Chris Huang
– SPN Hadoop Architect
– SPN Dumbo Team
– Hadoop.TW Active Member

Page 3: Approaching real-time-hadoop

About – SPN

• SPN, Smart Protection Network
– Proactive cloud-based threat-blocking technology

• 2013 Big Data Foresight Forum
– Scaling Big Data Mining Infrastructure: The Smart Protection Network
  http://www.slideshare.net/chenhsiu/scaling-bigdatamininginfra2

Page 4: Approaching real-time-hadoop

Page 5: Approaching real-time-hadoop

Batch vs. Real-time

Batch, High Throughput
Q: How can I transport 10,000 people from Taipei to Kaohsiung?

Real-time, Timely Information
Q: What’s the fastest way to Taipei Train Station?

Page 6: Approaching real-time-hadoop

67%: Query Hadoop using Hive

51%: Load data into Hadoop in less than 90 mins

54%: Use HBase for real-time data access

* Cloudera customer survey Aug. 2012

Time is Money!

Page 7: Approaching real-time-hadoop

From Batch to Real-time

• Bridge the gap between batch and now
• 80/20 rule
– Hadoop solves 80% easily
– Remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!

Page 8: Approaching real-time-hadoop

What is Real-time?

• Real-time is NOT always “faster than batch”
– If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled MR jobs

(Figure: timeline of three consecutive hourly jobs.)

How do we get a result at 1:33?

Page 9: Approaching real-time-hadoop

So, you want to talk about Impala?

Page 10: Approaching real-time-hadoop

NO

Page 11: Approaching real-time-hadoop

Impala is not a silver bullet

* Here Impala denotes any interactive query solution, including Apache Drill, Apache Tez + Stinger

Page 12: Approaching real-time-hadoop

You can do a lot before using Impala

Page 13: Approaching real-time-hadoop

3 Arrows for Real-time Applications

HBase (20%)

SolrCloud (60%)

Streaming (20%)

Page 14: Approaching real-time-hadoop

Example Case

Page 15: Approaching real-time-hadoop

Question 1

• Suppose we get a malicious C&C URL: hxxp://www.thebadguy.com/?info=12345678

• Yesterday, who accessed that URL? From where, and how? How often?

Page 16: Approaching real-time-hadoop

Very Simple

Page 17: Approaching real-time-hadoop

But we have 5 billion lines of log per day

Page 18: Approaching real-time-hadoop

It takes about 20 minutes

~1 hour if you’re not lucky

Page 19: Approaching real-time-hadoop

And we may query 50,000 times a day

Page 20: Approaching real-time-hadoop

We need a real-time (interactive) system

Page 21: Approaching real-time-hadoop

1st Arrow: HBase

Page 22: Approaching real-time-hadoop

Make Good Use of HBase Row Key

Region  Start Key                       End Key
R1      net.pwnnetwork#201208           net.tlm100.f19e100f#201304
R2      net.tlm100.f19e100f#201304      nl.efkobeton.www#201211
R3      nl.efkobeton.www#201211         no.rubrikk#201305
R4      no.rubrikk#201305               org.saintalphonsus.www#201304
R5      org.saintalphonsus.www#201304   pl.opole.uni.socjologia.www#201301

Row key format: com.domain.reverse#YYYYMMDD

Easily retrieve data with a row key scan

Hadoop in Taiwan 2012 – Designing a High-Performance HBase Schema: Understanding HBase
http://www.youtube.com/watch?v=8DMzNmVrXEI
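The row key layout on this slide can be sketched in a few lines of Python. This is only an illustration of the com.domain.reverse#YYYYMMDD idea: the helper names (reverse_domain, row_key, scan_range) are hypothetical, and no HBase client is involved. The point is that keys for one domain sort next to each other, so a date range becomes a single contiguous scan.

```python
from datetime import date


def reverse_domain(domain: str) -> str:
    """Turn 'www.example.com' into 'com.example.www'."""
    return ".".join(reversed(domain.lower().split(".")))


def row_key(domain: str, day: date) -> str:
    """Build a key in the com.domain.reverse#YYYYMMDD layout."""
    return f"{reverse_domain(domain)}#{day:%Y%m%d}"


def scan_range(domain: str, first_day: date, last_day: date) -> tuple[str, str]:
    """Return (start, stop) keys for a scan; start is inclusive, stop exclusive.

    Appending a NUL character yields the smallest string that sorts after
    the last key, so rows for last_day are still inside the range.
    """
    return row_key(domain, first_day), row_key(domain, last_day) + "\x00"


# Hypothetical usage: all of September 2013 for one domain.
start, stop = scan_range("www.thebadguy.com", date(2013, 9, 1), date(2013, 9, 30))
print(start)  # com.thebadguy.www#20130901
```

A real deployment would pass the start and stop keys to whatever scan call its HBase client offers.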

Page 23: Approaching real-time-hadoop

Compute Once, Import Once

• Clarify your use case
• Compute the whole thing once
– Daily job + hourly job
• Import into HBase using Bulk Loading
• On the fly query, with constant query time

Page 24: Approaching real-time-hadoop

If You Really Care About Real-Time

• Delta data are not big, don’t use MR
• Write another program to calculate on the fly
• Dynamically put into HBase
– Row key: com.domain.reverse#YYYYMMDD_HHmmss
• Query from both hourly batch and delta data
• Drop delta data in next hourly batch

(Figure: timeline between the 2 am and 3 am batches, with the delta data in between.)
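The batch-plus-delta pattern above can be sketched as follows. This is a minimal illustration under stated assumptions: two in-memory counters stand in for the HBase tables, the function names are hypothetical, and the hourly job is assumed to deliver the counts for the hour the delta was covering.

```python
from collections import Counter

# In-memory stand-ins for two HBase tables (assumption; no HBase client used).
batch_counts = Counter()  # written by the hourly job
delta_counts = Counter()  # written on the fly between two hourly jobs


def record_delta(domain_key: str, hits: int = 1) -> None:
    """Count an access that arrived after the last hourly batch."""
    delta_counts[domain_key] += hits


def query(domain_key: str) -> int:
    """Answer from both sources: batch result plus not-yet-batched delta."""
    return batch_counts[domain_key] + delta_counts[domain_key]


def hourly_batch_finished(hour_counts: dict) -> None:
    """Fold the new hour into the batch table, then drop the delta data."""
    batch_counts.update(hour_counts)  # Counter.update adds to existing counts
    delta_counts.clear()


# Hypothetical timeline: the 2 am batch has run, three accesses arrive,
# then the 3 am batch covers those same three accesses.
batch_counts.update({"com.thebadguy.www": 40})
for _ in range(3):
    record_delta("com.thebadguy.www")
before = query("com.thebadguy.www")
hourly_batch_finished({"com.thebadguy.www": 3})
after = query("com.thebadguy.www")
print(before, after)
```

The answer stays the same across the batch boundary, which is the property the slide is after. Events that arrive while the batch is being loaded would need extra care that this sketch leaves out.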

Page 25: Approaching real-time-hadoop

But...Life suffers because of “but”

Page 26: Approaching real-time-hadoop

Question 2

• Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)

Page 27: Approaching real-time-hadoop

HBase does not have secondary indexes (yet)

Page 28: Approaching real-time-hadoop

2nd Arrow: SolrCloud

Page 29: Approaching real-time-hadoop

Lucene, Solr, SolrCloud

TW Hadoop User Group Q1 Meetup - Solr Tutorial
http://www.slideshare.net/chenhsiu/20130310-solr-tuorial

Page 30: Approaching real-time-hadoop

What is Lucene?

• Full-text search library
• Written in Java
• Indexing & searching
• One of the top 5 Apache projects

Page 31: Approaching real-time-hadoop

Inverted Index

https://developer.apple.com/library/mac/#documentation/userexperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
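For readers without the slide image, an inverted index can be sketched in plain Python. This is a toy illustration of the concept, not Lucene's implementation: the tokenizer is a simple whitespace split, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents.
docs = {
    1: "malicious site hosted in Japan",
    2: "benign site hosted in Taiwan",
    3: "malicious URL seen in Taiwan",
}
index = build_inverted_index(docs)
print(search(index, "malicious taiwan"))
```

Looking up a term costs one dictionary access regardless of how many documents exist, which is why this structure suits interactive queries.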

Page 32: Approaching real-time-hadoop

What is Solr?

• Enterprise search server based on Lucene
– NOT a database
• Advanced full-text search capabilities
• Flexible and adaptable with XML configuration
• Extensible plug-in architecture
• REST-like APIs
• Web admin interface
• Runs inside a Java servlet container such as Jetty or Tomcat

Page 33: Approaching real-time-hadoop

Use Hadoop MapReduce for Indexing

Lucene Indexing Flow

Page 34: Approaching real-time-hadoop

Use SolrCloud for Scalable, Fault Tolerant Query

Solr: Index Query Flow

Page 35: Approaching real-time-hadoop

What is SolrCloud?

Page 36: Approaching real-time-hadoop

Indexing in SolrCloud

Page 37: Approaching real-time-hadoop

Searching in SolrCloud

Page 38: Approaching real-time-hadoop

Question 2

• Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)

A = load 'date://2013/09/28' using NSCTmProxyURLFProtobufLoader();
B = foreach A generate value.addr.peerIp as ip, value.NSCLog.URL as url, Location(value.addr.peerIp) as loc;
C = foreach B generate ip, url, loc.countryName as cn, CONCAT(CONCAT((chararray)loc.latitude, ','), (chararray)loc.longitude) as loc;
store C into 'solrcloud://$COLLECTION' using SolrStorage('ip_s,url_domain,cn_s,loc_p', '$USERNAME', '$PASSWORD');

hxxp://$SERVER:8983/solr/$SHARD/select?q=cn_s:Japan+url_s:com*&wt=json&indent=true&rows=5&sort=geodist(loc_p,30.0,130.0)+asc
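The select URL above can also be assembled programmatically so that the query string is escaped correctly. The sketch below uses only Python's standard library and builds the URL without sending it. The function name, the server name solr.example.com, and the core name are assumptions; the field names and parameters are copied from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(server: str, core: str, lat: float, lon: float, rows: int = 5) -> str:
    """Build a Solr /select URL for the slide's query, sorted by distance."""
    params = {
        "q": "cn_s:Japan url_s:com*",  # the '+' in the slide's URL is an encoded space
        "wt": "json",
        "indent": "true",
        "rows": rows,
        "sort": f"geodist(loc_p,{lat},{lon}) asc",
    }
    return f"http://{server}:8983/solr/{core}/select?{urlencode(params)}"


# Hypothetical server and core names.
url = build_select_url("solr.example.com", "shard1", 30.0, 130.0)
decoded = parse_qs(urlsplit(url).query)
print(decoded["sort"][0])
```

Decoding the query string again, as the last lines do, is a quick way to confirm that the parameters survive escaping unchanged.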

Page 39: Approaching real-time-hadoop

That’s it?

YES

Page 40: Approaching real-time-hadoop

If You Really Care About Real-Time

• Delta data are not big, don’t use MR
• Write another program to calculate on the fly
• Solr supports dynamic indexing
– Send your data to Solr to create a delta index
• Query from both batch index and delta index
• Drop delta index in next hourly batch

(Figure: timeline between the 2 am and 3 am batches, with the delta data in between.)

Page 41: Approaching real-time-hadoop

Domain/IP Census

Page 42: Approaching real-time-hadoop

www.facebook.com

Page 43: Approaching real-time-hadoop

Excellent!

Page 44: Approaching real-time-hadoop

But...Life suffers because of “but”

Page 45: Approaching real-time-hadoop

We need to identify the use case first

Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? How often?

Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)

Page 46: Approaching real-time-hadoop

3rd Arrow: Streaming

Page 47: Approaching real-time-hadoop

Question 1 Revisited

Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? How often?

• Can you send an email when there is contact with a specific C&C server?

• Can you monitor connections from a specific client IP to a list of C&C servers?

• I found a certain pattern in C&C URL paths. Can you give me an hourly update of the top 10 path groupings?

• Report the SHA-1 of the parent process behind each C&C connection to the Virus DB for sourcing
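Requests like these are handled per event as data arrives, not by a scheduled job. The sketch below shows the shape of such a handler for two of them: raising an alert on contact with a watched C&C host, and keeping an hourly top-N count of URL paths. Everything here is a hypothetical illustration: the watch list, the event fields, and the in-memory state are assumptions, and a real system would run this logic inside a stream processor.

```python
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit

CC_HOSTS = {"www.thebadguy.com"}      # hypothetical C&C watch list
alerts = []                           # stand-in for "send an email"
paths_per_hour = defaultdict(Counter)


def handle_event(ts: datetime, client_ip: str, url: str) -> None:
    """Process one proxy-log event as it arrives."""
    # Logs in this deck use the defanged 'hxxp' scheme; normalize it first.
    parts = urlsplit(url.replace("hxxp://", "http://", 1))
    if parts.hostname not in CC_HOSTS:
        return
    alerts.append((ts, client_ip, parts.hostname))
    hour = ts.replace(minute=0, second=0, microsecond=0)
    paths_per_hour[hour][parts.path or "/"] += 1


def top_paths(hour: datetime, n: int = 10) -> list:
    """Return the n most frequent C&C URL paths seen in the given hour."""
    return paths_per_hour[hour].most_common(n)


# Hypothetical events.
events = [
    (datetime(2013, 9, 28, 1, 5), "10.0.0.5", "hxxp://www.thebadguy.com/gate.php?id=1"),
    (datetime(2013, 9, 28, 1, 33), "10.0.0.7", "hxxp://www.thebadguy.com/gate.php?id=2"),
    (datetime(2013, 9, 28, 1, 40), "10.0.0.5", "hxxp://www.example.org/index.html"),
    (datetime(2013, 9, 28, 1, 50), "10.0.0.9", "hxxp://www.thebadguy.com/?info=12345678"),
]
for event in events:
    handle_event(*event)
print(top_paths(datetime(2013, 9, 28, 1)))
```

Because the counters are updated per event, the top paths for the current hour are available at any moment, including at 1:33.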

Page 48: Approaching real-time-hadoop

The Messaging

OSDC.TW 2012 - TME: Open Source Realtime Big Data Processing Platform
http://cloud.github.com/downloads/trendmicro/tme/TME_Introduction_OSDC.tw2012%20.pdf

Page 49: Approaching real-time-hadoop

Let’s dump the data

Page 50: Approaching real-time-hadoop

You need lots of workers!

Page 51: Approaching real-time-hadoop

Your boss won’t buy you another 100 servers

Page 52: Approaching real-time-hadoop

NextGen MapReduce (YARN)

Page 53: Approaching real-time-hadoop

Storm-YARN

Storm-on-YARN: Convergence of Low-Latency and Big-Data
http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2

Page 54: Approaching real-time-hadoop

Continuous Processing

• Calculate data on the fly, endless processing
• Hook up your processing anytime
– Or store scripts on ZooKeeper
• Leverage your existing Hadoop cluster
• Dynamically scale your workers in and out

Page 55: Approaching real-time-hadoop

Summary

Page 56: Approaching real-time-hadoop

3 Arrows for Real-time Applications

HBase (20%)

SolrCloud (60%)

Streaming (20%)

Page 57: Approaching real-time-hadoop

80/20 Rule

As close as possible, don’t overdo it

Page 58: Approaching real-time-hadoop

Why not just use Impala?

Page 59: Approaching real-time-hadoop

The same problem, anyway

Page 60: Approaching real-time-hadoop

Q&A

Page 61: Approaching real-time-hadoop

You’re Brilliant
We’re hiring!

Page 62: Approaching real-time-hadoop
