Strata + Hadoop World 2012 Keynote: Beyond Batch - Doug Cutting
DESCRIPTION
Hadoop started as an offline, batch-processing system. It made it practical to store and process much larger datasets than before. Subsequently, more interactive, online systems emerged, integrating with Hadoop. First among these was HBase, the key/value store. Now scalable interactive query engines are beginning to join the Hadoop ecosystem. Realtime is gradually becoming a viable peer to batch in big data.
TRANSCRIPT
1
Beyond Batch
Doug Cutting, October 2012
2
Hadoop Started As Batch
MapReduce
• Simple, powerful
• Kills a lot of birds
• Efficient, scalable
• Compute at storage
• Shared platform
• Used by Pig, Hive, etc.
• Incredibly useful!
• But not sufficient
3
Big Data Is Not (Just) Batch
Its true themes are:
• Scalability
• Affordability
• Commodity hardware
• Open-source software
• Distributed & reliable
• Schema on read
• Data beats algorithms
4
HBase: First Non-Batch Component
Online key/value store
• Complement to batch
• Online put/get
• Batch load & analyze
• Best of both
• Popular combination
• A step towards the future…
5
Holy Grail Of Big Data
• Open source, commodity HW, etc.
• Linear scaling
• To scale, just buy more hardware
• On many axes
• Storage capacity
• Throughput & latency
  • of batch & query
• Transactions, Joins, Indexes
  • and batch!
6
Google Gives Us A Map
Google publication        Apache project
2004  GFS & MapReduce     2006  Hadoop       batch programs
2005  Sawzall             2008  Pig & Hive   batch queries
2006  BigTable            2008  HBase        online key/value
...   ...                 ...   ...          ...
2012  Spanner             ?     ?            holy grail?
5 years – 26 authors!
7
Impala Is Latest Step
Google publication        Open source project
2004  GFS & MapReduce     2006  Hadoop       batch programs
2005  Sawzall             2008  Pig & Hive   batch queries
2006  BigTable            2008  HBase        online key/value
2010  Dremel/F1           2012  Impala       online queries
2012  Spanner             ?     ?            transactions, etc.
8
@cutting #bigquestions