Approaching real-time: Things you can do before going Impala
Chris Huang, SPN Hadoop Architect
04/10/2023
Confidential | Copyright 2013 TrendMicro Inc. 2
About – Chris Huang
• Chris Huang
  – SPN Hadoop Architect
  – SPN Dumbo Team
  – Hadoop.TW Active Member
About – SPN
• SPN, Smart Protection Network
  – Proactive cloud-based threat-interception technology
• 2013 Big Data Foresight Forum
  – Scaling Big Data Mining Infrastructure: The Smart Protection Network
    http://www.slideshare.net/chenhsiu/scaling-bigdatamininginfra2
Batch vs. Real-time
• Batch, High Throughput
  – Q: How can I transport 10,000 people from Taipei to Kaohsiung?
• Real-time, Timely Information
  – Q: What’s the fastest way to Taipei Train Station?
• 67% query Hadoop using Hive
• 51% load data into Hadoop in less than 90 mins
• 54% use HBase for real-time data access
* Cloudera customer survey Aug. 2012
Time is Money!
From Batch to Real-time
• Bridge the gap between batch and now
• 80/20 rule
  – Hadoop solves 80% easily
  – Remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!
What is Real-time?
• Real-time is NOT always “faster than batch”
  – If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled MR jobs
[Figure: timeline of three consecutive hourly jobs]
How to get a result at 1:33?
So, you want to talk about Impala?
NO
Impala is not a silver bullet
* Here Impala denotes any interactive query solution, including Apache Drill, Apache Tez + Stinger
You can do a lot before using Impala
3 Arrows for Real-time Applications
HBase (20%)
SolrCloud (60%)
Streaming (20%)
Example Case
Question 1
• Suppose we get a malicious C&C URL: hxxp://www.thebadguy.com/?info=12345678
• Yesterday, who accessed that URL? From where, and how? What’s the frequency?
Very Simple
But we have 5 billion lines of log per day
It takes about 20 minutes
~1 hour if you’re not lucky
And we may query 50,000 times a day
We need a real-time (interactive) system
1st Arrow: HBase
Make Good Use of HBase Row Key
Region  Start Key                      End Key
R1      net.pwnnetwork#201208          net.tlm100.f19e100f#201304
R2      net.tlm100.f19e100f#201304     nl.efkobeton.www#201211
R3      nl.efkobeton.www#201211        no.rubrikk#201305
R4      no.rubrikk#201305              org.saintalphonsus.www#201304
R5      org.saintalphonsus.www#201304  pl.opole.uni.socjologia.www#201301
Row key format: com.domain.reverse#YYYYMMDD
Easily retrieve data with a row key scan
Hadoop in Taiwan 2012: Designing a High-Performance HBase Schema, Understanding HBase
http://www.youtube.com/watch?v=8DMzNmVrXEI
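The row key layout above can be sketched in a few lines of Python. This is only an illustration of the com.domain.reverse#YYYYMMDD idea, not the team's actual code: the function names (reverse_domain, row_key, prefix_scan_range) are hypothetical, and no HBase client is involved. It shows how to build a key and how to derive the start/stop keys of a scan that covers every date for one domain.

```python
def reverse_domain(domain):
    """Reverse the labels of a domain: www.efkobeton.nl -> nl.efkobeton.www."""
    return ".".join(reversed(domain.lower().split(".")))


def row_key(domain, yyyymmdd):
    """Build a row key in the com.domain.reverse#YYYYMMDD layout."""
    return f"{reverse_domain(domain)}#{yyyymmdd}"


def prefix_scan_range(domain):
    """Return (start, stop) keys covering every date of one domain.

    The stop key is the prefix with its last character ('#') bumped to the
    next code point ('$'), so the range is [start, stop) in sorted order.
    """
    start = reverse_domain(domain) + "#"
    stop = start[:-1] + chr(ord("#") + 1)
    return start, stop


if __name__ == "__main__":
    print(row_key("www.efkobeton.nl", "20121105"))
    print(prefix_scan_range("www.efkobeton.nl"))
```

Because the domain is reversed, rows for the same domain (and its parent zone) sort next to each other, which is what makes a single range scan sufficient.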
Compute Once, Import Once
• Clarify your use case
• Compute the whole thing once
  – Daily job + hourly job
• Import into HBase using Bulk Loading
• On-the-fly query, with constant query time
If You Really Care About Real-Time
• Delta data are not big, don’t use MR
• Write another program to calculate on the fly
• Dynamically put into HBase
  – Row key: com.domain.reverse#YYYYMMDD_HHmmss
• Query from both hourly batch and delta data
• Drop delta data in next hourly batch
[Figure: timeline between 2 am and 3 am, with the delta data marked]
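The batch-plus-delta pattern can be sketched as follows. This is a minimal in-memory stand-in, assuming plain dictionaries in place of HBase tables; the class name DeltaStore and its methods are hypothetical and chosen only for illustration. Batch rows use the com.domain.reverse#YYYYMMDD key, delta rows use com.domain.reverse#YYYYMMDD_HHmmss, a query sums both, and loading the next hourly batch drops the delta rows it now covers.

```python
from datetime import datetime


class DeltaStore:
    """In-memory stand-in for hourly batch rows plus on-the-fly delta rows."""

    def __init__(self):
        self.batch = {}  # "com.domain.reverse#YYYYMMDD" -> hit count
        self.delta = {}  # "com.domain.reverse#YYYYMMDD_HHmmss" -> hit count

    def put_delta(self, rdomain, ts, hits):
        """Record hits seen since the last batch, keyed down to the second."""
        key = f"{rdomain}#{ts:%Y%m%d_%H%M%S}"
        self.delta[key] = self.delta.get(key, 0) + hits

    def load_batch(self, rows, covered_until):
        """Load batch results and drop delta rows older than covered_until."""
        self.batch.update(rows)
        cutoff = f"{covered_until:%Y%m%d_%H%M%S}"
        self.delta = {
            k: v for k, v in self.delta.items()
            if k.split("#", 1)[1] >= cutoff
        }

    def query(self, rdomain, yyyymmdd):
        """Sum the batch row and all delta rows for one domain and day."""
        prefix = f"{rdomain}#{yyyymmdd}"
        total = self.batch.get(prefix, 0)
        total += sum(v for k, v in self.delta.items()
                     if k.startswith(prefix + "_"))
        return total


if __name__ == "__main__":
    store = DeltaStore()
    store.load_batch({"com.thebadguy.www#20130928": 10},
                     datetime(2013, 9, 28, 2, 0, 0))
    store.put_delta("com.thebadguy.www", datetime(2013, 9, 28, 2, 33, 0), 2)
    print(store.query("com.thebadguy.www", "20130928"))
```

The delta table stays small because it only ever holds the data since the last batch, which is why a simple program can maintain it without MapReduce.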
But...Life suffers because of “but”
Question 2
• Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
HBase does not have secondary indexes (yet)
2nd Arrow: SolrCloud
Lucene, Solr, SolrCloud
TW Hadoop User Group Q1 Meetup - Solr Tutorial
http://www.slideshare.net/chenhsiu/20130310-solr-tuorial
What is Lucene?
• Full-text search library
• Written in Java
• Indexing & searching
• One of the top 5 Apache projects
Inverted Index
https://developer.apple.com/library/mac/#documentation/userexperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
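To make the inverted index idea concrete, here is a toy version in Python. It is a sketch of the general technique only (whitespace tokenization, exact-term matching, AND semantics), not how Lucene implements it; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "malicious site hosted in Japan",
        2: "benign site hosted in Taiwan",
        3: "malicious URL",
    }
    index = build_inverted_index(docs)
    print(search(index, "malicious site"))
```

A lookup touches only the posting sets of the query terms, so query cost depends on how many documents match rather than on the total corpus size.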
What is Solr?
• Enterprise search server based on Lucene
  – NOT a database
• Advanced full-text search capabilities
• Flexible and adaptable with XML configuration
• Extensible plug-in architecture
• REST-like APIs
• Web admin interface
• Runs inside a Java servlet container such as Jetty and Tomcat
Use Hadoop MapReduce for Indexing
[Figure: Lucene Indexing Flow]
Use SolrCloud for Scalable, Fault Tolerant Query
[Figure: Solr: Index Query Flow]
What is SolrCloud?
Indexing in SolrCloud
Searching in SolrCloud
Question 2
• Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)

Index with Pig:
A = load 'date://2013/09/28' using NSCTmProxyURLFProtobufLoader();
B = foreach A generate value.addr.peerIp as ip, value.NSCLog.URL as url, Location(value.addr.peerIp) as loc;
C = foreach B generate ip, url, loc.countryName as cn, CONCAT(CONCAT((chararray)loc.latitude, ','), (chararray)loc.longitude) as loc;
store C into 'solrcloud://$COLLECTION' using SolrStorage('ip_s,url_domain,cn_s,loc_p', '$USERNAME', '$PASSWORD');

Query Solr:
hxxp://$SERVER:8983/solr/$SHARD/select?q=cn_s:Japan+url_s:com*&wt=json&indent=true&rows=5&sort=geodist(loc_p,30.0,130.0)+asc
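The select URL above can also be assembled programmatically, which avoids hand-escaping the query. The sketch below uses only the Python standard library; the field names (cn_s, url_s, loc_p) and port 8983 come from the slide, while the function name build_select_url and the host solr.example.com are hypothetical. It uses a plain http scheme rather than the defanged hxxp shown on the slide.

```python
from urllib.parse import urlencode


def build_select_url(server, core, country, domain_prefix, lat, lon, rows=5):
    """Build a Solr select URL filtered by country and domain prefix,
    sorted by distance from (lat, lon) using the geodist() function."""
    params = {
        "q": f"cn_s:{country} url_s:{domain_prefix}*",
        "wt": "json",
        "indent": "true",
        "rows": rows,
        "sort": f"geodist(loc_p,{lat},{lon}) asc",
    }
    # urlencode percent-escapes ':' '*' '(' ',' and turns spaces into '+'
    return f"http://{server}:8983/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # solr.example.com and "shard1" are placeholder values
    print(build_select_url("solr.example.com", "shard1", "Japan", "com",
                           30.0, 130.0))
```

The resulting URL carries the same parameters as the one on the slide, only with reserved characters percent-encoded.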
That’s it?
YES
If You Really Care About Real-Time
• Delta data are not big, don’t use MR
• Write another program to calculate on the fly
• Solr supports dynamic indexing
  – Send your data to Solr to create a delta index
• Query from both batch index and delta index
• Drop delta index in next hourly batch
[Figure: timeline between 2 am and 3 am, with the delta data marked]
Domain/IP Census
www.facebook.com
Excellent!
But...Life suffers because of “but”
We need to identify the use case first
Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency?
Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
3rd Arrow: Streaming
Question 1 Revisited
Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency?
• Can you send an email when there is contact with a specific C&C server?
• Can you monitor connections from a specific client IP to a list of C&C servers?
• I found a certain pattern in C&C URL paths. Can you give me an hourly update of the top 10 path groupings?
• Report the SHA-1 of the C&C connection’s parent process to the Virus DB for sourcing
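Requests like these are a natural fit for processing events as they arrive instead of querying afterwards. The following Python sketch handles two of them, alerting on contact with a C&C host and keeping an hourly top 10 of path groupings. It is a single-process illustration under assumptions: events are (hour, client_ip, url) tuples, a "path grouping" is taken to be the first path segment, and the names process and path_group are hypothetical. A real deployment would run equivalent logic on a streaming framework.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def path_group(url):
    """Group a URL by its first path segment, e.g. /gate/a -> /gate."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return "/" + parts[0] if parts else "/"


def process(events, cnc_hosts, on_alert):
    """Consume (hour, client_ip, url) events one at a time.

    Calls on_alert(client_ip, url) for each contact with a known C&C host
    and returns {hour: top-10 list of (path group, count)}.
    """
    per_hour = defaultdict(Counter)
    for hour, client_ip, url in events:
        if urlsplit(url).hostname in cnc_hosts:
            on_alert(client_ip, url)  # e.g. send an email here
            per_hour[hour][path_group(url)] += 1
    return {hour: counts.most_common(10) for hour, counts in per_hour.items()}


if __name__ == "__main__":
    events = [
        (13, "10.0.0.1", "http://www.thebadguy.com/gate/a?info=1"),
        (13, "10.0.0.2", "http://www.thebadguy.com/gate/b"),
        (13, "10.0.0.3", "http://www.example.org/index"),
    ]
    report = process(events, {"www.thebadguy.com"},
                     lambda ip, url: print("ALERT", ip, url))
    print(report)
```

Each new question becomes another small handler on the same event stream, rather than another full scan over the day's logs.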
The Messaging
OSDC.TW 2012 - TME: Open Source Realtime Big Data Processing Platform
http://cloud.github.com/downloads/trendmicro/tme/TME_Introduction_OSDC.tw2012%20.pdf
Let’s dump the data
You need lots of workers!
Your boss won’t buy you another 100 servers
NextGen MapReduce (YARN)
Storm-YARN
Storm-on-YARN: Convergence of Low-Latency and Big-Data
http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2
Continuous Processing
• Calculate data on the fly, endless processing
• Hook up your processing anytime
  – Or store scripts on ZooKeeper
• Leverage your existing Hadoop cluster
• Dynamically scale your workers in and out
Summary
3 Arrows for Real-time Applications
HBase (20%)
SolrCloud (60%)
Streaming (20%)
80/20 Rule
As close as possible, don’t overdo it
Why not just use Impala?
The same problem, anyway
Q&A
You’re Brilliant. We’re hiring!