cloudera - using morphlines for on the-fly etl by wolfgang hoschek

33
1 Using Morphlines for onthefly ETL Wolfgang Hoschek (@whoschek) SF Data Engineering Meetup July 2013

Upload: hakka-labs

Post on 06-May-2015

1.175 views

Category:

Technology


3 download

DESCRIPTION

In this talk Senior Software engineer Wolfgang Hoschek from Cloudera discusses Morphlines, the easy way to build and integrate ETL apps for Hadoop. The talk was recorded at the SumbleUpon offices. Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. Wolfgang Hoschek is a Software Engineer on the Platform team and the lead developer on Morphlines. He is a former CERN fellow and received his Ph.D from the Technical University of Vienna, Austria, and M.S from the University of Linz, Austria.

TRANSCRIPT

Page 1: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

1

Using  Morphlines  for  on-­‐the-­‐fly  ETL  

Wolfgang  Hoschek  (@whoschek)  SF  Data  Engineering  Meetup  July  2013  

Page 2: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Agenda  

•  Big  Data,  ETL  and  Search  –  seMng  the  stage  

•  Cloudera  Morphlines  Architecture  •  Component  Deep  Dive  •  Cloudera  Search  Use  Cases  

• What’s  next?  

Feel  free  to  ask  quesUons  as  we  go!  

Page 3: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Example  ETL  Use  Case:    Distributed  Search  on  Hadoop  

Flume  Hue  UI  

Custom  UI  

Custom  App  

Solr  

Solr  

Solr  

SolrCloud  query  

query  

query  

Index  (ETL)  

Hadoop  Cluster  

MR  

HDFS  

Index  (ETL)  

HBase  Index  (ETL)  

Page 4: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Cloudera  Morphlines  Architecture  

Solr  

Solr  

Solr  

SolrCloud  

Logs,  tweets,  social  media,  html,  

images,  pdf,  text….  

Anything  you  want  to  index  

Flume,  MR  Indexer,  HBase  indexer,  etc...    Or  your  applicaUon!  

Morphline  Library  

Morphlines  can  be  embedded  in  any  applicaUon…  

Your  App!  

Page 5: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Cloudera  Morphlines  

•  Open  Source  framework  for  simple  ETL  •  Consume  any  kind  of  data  from  any  kind  of  data  source,  process  and  load  into  any  app  or  storage  system  

•  Designed  for  Near  Real  Time  apps  &  Batch  apps  •  Ships  as  part  Cloudera  Developer  Kit  (CDK)  and  Cloudera  Search  

•  It’s  a  Java  library  •  ASL  licensed  on  github  hbps://github.com/cloudera/cdk  

•  Similar  to  Unix  pipelines,  but  more  convenient  &  efficient  •  ConfiguraUon  over  coding  (reduce  Ume  &  skills)  •  Supports  common  file  formats  

•  Log  Files  &  Text  •  Avro,  Sequence  file  •  JSON,  HTML  &  XML  •  Etc…  (pluggable)  

•  Extensible  set  of  transformaUon  commands  

Page 6: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

ExtracUon,  TransformaUon  and  Loading  

•  Chain  of  pipelined  commands  

•  Simple  and  flexible  data  mapping  &  transformaUon    

•  Reusable  across  mulUple  index  workloads  

•  Over  Ume,  extend  and  re-­‐use  across  plagorm  workloads  

syslog   Flume  Agent  

Solr  sink  

Command:  readLine  

Command:  grok  

Command:  loadSolr  

Solr  

Event  

Record  

Record  

Record  

Document  

Morph

line  Library  

Page 7: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Like  a  Unix  Pipeline  

•  Like  Unix  pipelines  where  the  data  model  is  generalized  to  work  with  streams  of  generic  records,  including  arbitrary  binary  payloads  

•  Designed  to  be  embedded  into  Hadoop  components  such  as  Search,  Flume,  MapReduce,  Pig,  Hive,  Sqoop  

Page 8: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Stdlib  +  plugins  

•  Framework  ships  with  a  set  of  frequently  used  high  level  transformaUon  and  I/O  commands  that  can  be  combined  in  applicaUon  specific  ways  

•  The  plugin  system  allows  the  adding  of  new  transformaUons  and  I/O  commands  and  integrates  exisUng  funcUonality  and  third  party  systems  in  a  straighgorward  manne  

Page 9: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Flexible  Data  Model  

•  A  record  is  a  set  of  named  fields  where  each  field  has  an  ordered  list  of  one  or  more  Java  Objects  (i.e.  Guava’s  ArrayListMulUmap)  

•  Field  can  have  mulUple  values  and  any  two  records  need  not  use  common  field  names  

•  Corresponds  exactly  to  Solr/Lucene  data  model  •  Pass  not  only  structured  data,  but  also  arbitrary  binary  data  

Page 10: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Passing  Binary  Data  

•  _abachment_body  field  (opUonal)  •  java.io.InputStream  or  Java  byte[]    

•  opUonal  fields  assist  w/  detecUng  &  parsing  data  type  •  _abachment_mimetype  field  

•  e.g.  "applicaUon/pdf"    

•  _abachment_charset  field  •  e.g.  "UTF-­‐8"  

•  _abachment_name  field  •  e.g.  "cars.pdf”  

•  Conceptually  similar  to  email  and  HTTP  headers/body  

Page 11: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Processing  Model  

• Morphline  commands  manipulate  conUnuous  or  arbitrarily  large  streams  of  records  

•  A  command  transforms  a  record  into  zero  or  more  records  

•  The  output  records  of  a  command  are  passed  to  the  next  command  in  the  chain  

•  A  command  can  contain  nested  commands    •  A  morphline  is  a  tree  of  commands,  essenUally  a  push-­‐based  data  flow  engine  

Page 12: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Processing  Model  Non-­‐Goals  

•  Designed  to  embedded  into  mulUple  host  systems,  thus…  •  No  noUon  of  persistence  or  durability  or  distributed  compuUng  or  node  failover  

•  Basically  just  a  chain  of  in-­‐memory  transformaUons  in  the  current  thread  

•  No  need  to  manage  mulUple  nodes  or  threads  -­‐    already  covered  by  host  systems  such  as  MapReduce,  Flume,  Storm,  etc.    

•  However,  a  morphline  does  support  passing  noUficaUons  •  E.g.  BEGIN_TRANSACTION,  COMMIT_TRANSACTION,  ROLLBACK_TRANSACTION,  SHUTDOWN  

Page 13: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Performance  and  Scaling  

•  The  runUme  compiles  morphline  on  the  fly    

•  The  runUme  processes  all  commands  of  a  given  morphline  in  the  same  thread    

•  For  scalability,  deploy  many  morphline  instances  on  a  cluster  in  many  Flume  agents  and  MapReduce  tasks  

Page 14: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Syntax  

•  HOCON  format  (Human-­‐OpUmized  Config  Object  NotaUon)  

•  Basically  JSON  slightly  adjusted  for  the  configuraUon  file  use  case    

•  Came  out  of  typesafe.com  •  Also  used  by  Akka  and  Play  frameworks  

Page 15: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Example:  Indexing  log4j  w/  stacktraces  

juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed com.amazonaws.AmazonClientException: Unable to calculate a request signature at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71) at java.util.TimerThread.run(Timer.java:505) Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90) at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68) ... 14 more Caused by: java.lang.IllegalArgumentException: Empty key at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96) at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87) ... 15 more juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect

Record  1  

Record  2  

Record  3  

Page 16: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Example:  Indexing  log4j  w/  stacktraces  

morphlines : [ { id : morphline1 importCommands : ["com.cloudera.**", "org.apache.solr.**"]

commands : [ { readMultiLine { regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+... \\d+ more)|(^\\s*Caused by:.+)" what : previous charset : UTF-8 } } { logDebug { format : "output record: {}", args : ["@{}"] } } { loadSolr {} ] } ]

Page 17: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Example:  Escape  to  Java  Code  

morphlines : [ { id : morphline1 importCommands : ["com.cloudera.**", "org.apache.solr.**"] commands : [ { java { code: """ List tags = record.get("tags"); if (!tags.contains("hello")) { return false; } tags.add("world"); return child.process(record); """ } } ] } ]

Page 18: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Current  Command  Library  

•  Integrate  with  and  load  into  Apache  Solr  

•  Flexible  log  file  analysis  •  Single-­‐line  record,  mulU-­‐line  records,  CSV  files    •  Regex  based  pabern  matching  and  extracUon    

•  IntegraUon  with  Avro,  JSON,  XML,  HTML    •  IntegraUon  with  Apache  Hadoop  Sequence  Files  •  IntegraUon  with  SolrCell  and  all  Apache  Tika  parsers    

•  Auto-­‐detecUon  of  MIME  types  from  binary  data  using  Apache  Tika  

Page 19: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Current  Command  Library  (cont’d)  

•  ScripUng  support  for  dynamic  java  code    •  OperaUons  on  fields  for  assignment  and  comparison  •  OperaUons  on  fields  with  list  and  set  semanUcs    •  if-­‐then-­‐else  condiUonals    •  A  small  rules  engine  (tryRules)  •  String  and  Umestamp  conversions    •  slf4j  logging  •  Yammer  metrics  and  counters    •  Decompression  and  unpacking  of  arbitrarily  nested  container  file  formats  

•  etc  

Page 20: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Plugin  Commands  

•  Easy  to  add  new  I/O  &  transformaUon  cmds    

•  Integrate  exisUng  funcUonality  and  third  party  systems  

•  Implement  Java  interface  Command  or  subclass  AbstractCommand

•  Add  it  to  Java  classpath  

•  No  registraUon  or  other  administraUve  acUon  required  

Page 21: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Morphline  Example  –  syslog  with  grok  

morphlines  :  [    {        id  :  morphline1        importCommands  :  ["com.cloudera.**",  "org.apache.solr.**"]        commands  :  [            {  readLine  {}  }                                                    {                  grok  {                      dicUonaryFiles  :  [/tmp/grok-­‐dicUonaries]                                                                                  expressions  :  {                          message  :  """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_Umestamp}  %{SYSLOGHOST:syslog_hostname}  %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?:  %{GREEDYDATA:syslog_message}"""                    }                }            }            {  loadSolr  {}  }                    ]    }  ]  

Example  Input  <164>Feb    4  10:46:14  syslog  sshd[607]:  listening  on  0.0.0.0  port  22  Output  Record  syslog_pri:164  syslog_Umestamp:Feb    4  10:46:14  syslog_hostname:syslog  syslog_program:sshd  syslog_pid:607  syslog_message:listening  on  0.0.0.0  port  22.  

Page 22: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

PotenUal  New  Plugin  Commands  

•  Extract,  clean,  transform,  join,  integrate,  enrich  and  decorate  records  

•  Examples  •  join  records  with  external  data  sources  such  as  relaUonal  databases,  key-­‐value  stores,  local  files  or  IP  Geo  lookup  tables.    

•  Perform  DNS  resoluUon,  expand  shortened  URLs  •  fetch  linked  metadata  from  social  networks  •  do  senUment  analysis  &  annotate  record  accordingly  •  conUnuously  maintain  stats  over  sliding  windows  •  compute  exact  or  approx.  disUnct  values  &  quanUles  

Page 23: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Use  Case:  Cloudera  Search  

An  Integrated  Part  of  the  Hadoop  System  

One  pool  of  data  

One  security  framework  

One  set  of  system  resources  

One  management  interface  

Page 24: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

What  is  Cloudera  Search?  

•  Full-­‐text,  interacUve  search  and  faceted  navigaUon  

•  Batch,  near  real-­‐Ume,  and  on-­‐demand  indexing  •  Apache  Solr  integrated  with  CDH  

•  Established,  mature  search  with  vibrant  community  

•  Separate  runUme  like  MapReduce,  Impala  •  Incorporated  as  part  of  the  Hadoop  ecosystem  

•  Open  Source  •  100%  Apache,  100%  Solr  

•  Standard  Solr  APIs  

Page 25: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

ETL  for  Distributed  Search  on  Apache  Hadoop  

Flume  Hue  UI  

Custom  UI  

Custom  App  

Solr  

Solr  

Solr  

SolrCloud  query  

query  

query  

Index  (ETL)  

Hadoop  Cluster  

MR  

HDFS  

Index  (ETL)  

HBase  Index  (ETL)  

Page 26: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Near  Real  Time  ETL  &  Indexing  with  Flume  

Log  File  

Apache  Solr  and  Apache  Flume  •  Data  ingest  at  scale  •  Flexible  extracUon  and  mapping  

•  Indexing  at  data  ingest  •  Packaged  as  Flume  Morphline  Solr  Sink  

HDFS  

Flume  Agent  

Indexer  w/  Morphline  

Other  Log  File  

Flume  Agent  

Indexer  w/  Morphline  

26  

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf

Flume.conf  

Page 27: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Cloudera  Manager  Flume  Morphline  GUI  

27

Page 28: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Scalable  Batch  ETL  &  Indexing  

Index  shard  

Files  

Index  shard  

Indexer  w/  Morphline  

Files  

Solr  server  

Indexer  w/  Morphline  

Solr  server  

28

HDFS  

Solr  and  MapReduce  •  Flexible,  scalable  batch  indexing  

•  Start  serving  new  indices  with  no  downUme  

•  On-­‐demand  indexing,  cost-­‐efficient  re-­‐indexing  

•  Packaged  as  MapReduceIndexerTool  

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

Page 29: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

MapReduceIndexerTool  

29

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

S0_0_0

Extractors(Mappers)

Leaf Shards(Reducers)

Root Shards(Mappers)

S0_0_1S0S0_1_0

S0_1_1

S1_0_0

S1_0_1S1S1_1_0

S1_1_1

Input Files

...

...

...

...

•  Morphline  runs  inside  Mapper  

Page 30: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Near  Real  Time  indexing  of  Apache  HBase  

HDFS  

HBase  

interacUve  load  

Lily  HBase  Indexer(s)  

with  Morphline  Tr

iggers  on  

updates   Solr  server  

Solr  server  Solr  server  Solr  server  Solr  server  

Search  

+   =  Large  scale  tabular  data  immediate  access  &  updates  fast  &  flexible  informaDon  discovery  

B IG  DATA  DATAMANAGEMENT  

Page 31: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Batch  &  Near  Real  Time  ETL  

Tweets

Flume Solr

Hue UI

HDFS

MapReduceIndexerTool, Impala, HBase, Mahout, EDW, MR, etc

Lily HBase Indexer

HdfsSink

Query

MapReduceIndexerTool

Log Formats

Social Media

HTML

Images

PDF

Custom UI

Query

Custom App

...

Morphline

Morphline

MorphlineSinkMorphline

HBase

OLTP

Page 32: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

What’s  next  

• More  work  on  Apache  HBase  IntegraUon  

•  IntegraUon  into  Apache  Crunch  •  Stream  AnalyUcs  

Page 33: Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Conclusion  

•  Cloudera  Development  Kit  w/  Morphlines    •  Open  Source  -­‐  ASL  License  •  Version  0.4.1  shipping  

•  Extensive  documentaUon  •  Send  your  quesUons  and  feedback  to  cdk-­‐dev  mailing  list  

•  Also  ships  integrated  with  Cloudera  Search  

•  Free  QuickStart  VM  also  available!