cloud optimized big data
DESCRIPTION
What makes a big-data platform 'cloud-optimized'? Here's our (Qubole's) shot at it. @Cloud-Asia 2014.
TRANSCRIPT
Cloud-Optimized Big-Data as a Service
Joydeep Sen Sarma
Co-Founder Qubole, Apache-Hive
About Me
• @Facebook (2007-2011):
  – First Hadoop Engineer
  – Founder - Apache Hive project, PMC Member
  – Contributor to Apache Hadoop/HBase
• Founder Qubole (2012-)
  – Hadoop-as-a-Service
  – 30+ customers: Pinterest, Quora, Mediamath, Tubemogul …
  – Design/Code/Ops/Support/…
Big Data Cloud
• Elasticity:
  – Workloads are Bursty
  – Allows easy rolling upgrades and testing
• Lower TCO:
  – Cloud Storage is Inexpensive (2-3c/GB/month, globally replicated)
  – Zero cost to try new projects
  – Upgrade to new hardware easily (no cluster migrations!)
• Global:
  – Easily set up where employees/customers/entities are located
• Collaboration:
  – Zero-Copy sharing of data with Partners and across Departments
  – Easy access to great public data sets
• As-a-Service delivery model vastly lowers Operational Cost
Cloud-Optimized Big Data?
• Optimized for lower TCO
• Optimized for Speed
• Optimized for Operations/Support
Cloud-Optimized Big Data
Optimized for lower TCO
select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner org.apache…
AdCo Hadoop
Automated LifeCycle Mgmt
insert overwrite table dest select … from ads join campaigns on … group by …;
Auto-Scaling
[Figure labels: StarCluster, AWS, Master (Job Tracker), Slaves, Map Tasks, Reduce Tasks, Progress, Demand, Supply]
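The diagram's Demand and Supply labels suggest a comparison between pending tasks and available slots. A minimal sketch of such a policy follows, assuming a fixed number of task slots per slave. The names `target_slaves`, `scaling_delta`, `slots_per_slave`, `min_slaves`, and `max_slaves` are hypothetical and chosen for illustration; this is not a description of Qubole's actual algorithm.

```python
import math


def target_slaves(pending_map_tasks, pending_reduce_tasks,
                  slots_per_slave, min_slaves=1, max_slaves=100):
    """Return how many slaves are needed to run all pending tasks at once.

    Hypothetical policy for illustration: demand is the number of pending
    tasks, supply is slaves * slots_per_slave.
    """
    if slots_per_slave <= 0:
        raise ValueError("slots_per_slave must be positive")
    demand = pending_map_tasks + pending_reduce_tasks
    needed = math.ceil(demand / slots_per_slave)
    # Clamp to the configured cluster bounds.
    return max(min_slaves, min(max_slaves, needed))


def scaling_delta(current_slaves, **demand):
    """Positive result: add that many slaves; negative: remove that many."""
    return target_slaves(**demand) - current_slaves
```

A real policy would probably also weigh job progress and billing boundaries before removing nodes; the sketch only covers the demand/supply comparison.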
Spot Instances
On average 50-60% cheaper than regular instances
• Fallback to regular instances when Spot unavailable
• Replace regular instances with Spot when available
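The two rules above can be sketched as a small placement policy. This is an illustrative sketch only: the function names, the `spot_available` input (how many Spot instances could be obtained right now), and the flat `spot_discount` are all assumptions, and no real AWS API is called.

```python
def plan_instances(needed, spot_available):
    """Split `needed` nodes between Spot and regular (on-demand) instances.

    Prefer Spot; fall back to regular instances for any shortfall.
    """
    spot = min(needed, max(spot_available, 0))
    return {"spot": spot, "regular": needed - spot}


def replacements(running_regular, spot_available):
    """Number of regular instances that can be swapped for Spot right now."""
    return min(running_regular, max(spot_available, 0))


def blended_hourly_cost(plan, regular_price, spot_discount):
    """Hourly cost of a plan; spot_discount is a fraction such as 0.5."""
    spot_price = regular_price * (1.0 - spot_discount)
    return plan["spot"] * spot_price + plan["regular"] * regular_price
```

For example, with 10 nodes needed and only 4 Spot instances obtainable, the plan is 4 Spot plus 6 regular; at a 50% Spot discount that costs 8 regular-instance-hours per hour instead of 10.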
Using Fast but ‘Thin’ nodes
• C3 instances: 50% better performance at 20% lower cost
• Little local storage
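To see what the C3 figures imply for cost per unit of work, here is a small worked calculation. It assumes "50% better performance" means 1.5x throughput and "20% lower cost" means 0.8x the hourly price; both readings are assumptions, as is the function name.

```python
def relative_cost_per_unit_work(speedup, price_ratio):
    """Cost of finishing a fixed amount of work, relative to the baseline.

    speedup: throughput relative to the baseline node (1.5 = 50% faster).
    price_ratio: hourly price relative to the baseline node (0.8 = 20% cheaper).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return price_ratio / speedup


c3_ratio = relative_cost_per_unit_work(speedup=1.5, price_ratio=0.8)
```

Under those assumptions `c3_ratio` is roughly 0.53, so the same job would cost about 47% less on the faster nodes.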
Modify Hadoop to use Network drives for overflow
[Figure labels: Map-Reduce, HDFS, Local SSD, Network Drives, Disk I/O, Overflow]
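The overflow idea can be sketched as a directory-selection rule: write to local SSD while it has room, and spill to network drives only when it does not. The sketch below is an assumption about how such a rule might look, not the actual Hadoop modification; `pick_spill_dir` and its parameters are hypothetical names.

```python
import shutil


def pick_spill_dir(local_dirs, network_dirs, needed_bytes, free_bytes=None):
    """Return the first directory with room for `needed_bytes`.

    Local (SSD) directories are tried first; network-drive directories are
    used only as overflow. `free_bytes` is an injectable function
    (path -> free bytes) so the policy can be exercised without real disks.
    """
    if free_bytes is None:
        # Default: ask the operating system how much space is free.
        free_bytes = lambda path: shutil.disk_usage(path).free
    for path in list(local_dirs) + list(network_dirs):
        if free_bytes(path) >= needed_bytes:
            return path
    raise OSError("no directory has %d bytes free" % needed_bytes)
```

Passing a fake `free_bytes` (for example a dictionary's `get` method) makes it easy to check that small writes stay local and only large ones overflow.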
Cloud-Optimized Big Data
Optimized for Speed
• Optimize I/O to AWS S3
  – Faster Split Computation (8x)
  – Prefetching S3 files (30%)
  – Zero-Copy writes to S3
• JVM Reuse (1.2-2x speedup)
• Columnar File Caches on local disks (1.2-2x speedup)
• 30-50% cost savings because of cluster consolidation
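Prefetching, one of the S3 optimizations listed above, can be illustrated with a small read-ahead loop: while the caller processes one object, the next few are already downloading. This is a generic sketch, not Qubole's implementation. `read_with_prefetch`, `depth`, and the `fetch` callable are assumptions; `fetch` stands in for whatever performs the actual S3 GET.

```python
from concurrent.futures import ThreadPoolExecutor


def read_with_prefetch(keys, fetch, depth=2):
    """Yield (key, data) in order while up to `depth` later keys download.

    `fetch` is any callable key -> bytes (for example a wrapper around an
    S3 GET); it is a stand-in, not a real S3 client API.
    """
    keys = list(keys)
    if depth < 1:
        raise ValueError("depth must be at least 1")
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Start the first `depth` downloads immediately.
        pending = [pool.submit(fetch, k) for k in keys[:depth]]
        next_index = len(pending)
        for i, key in enumerate(keys):
            data = pending[i].result()  # blocks only if not yet downloaded
            # Keep the window full: start the next download before yielding.
            if next_index < len(keys):
                pending.append(pool.submit(fetch, keys[next_index]))
                next_index += 1
            yield key, data
```

Results come back in the original key order, so a reader that consumes files sequentially can use it as a drop-in iterator.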
Faster, Faster ..
• 5x Faster than nearest competitor (Hive against S3)
• Presto-as-a-Service
  – 3-22x faster SQL against S3
  – (as tested by customer)
Cloud-Optimized Big Data
Optimized for Operations/Support
Rolling Upgrades
• @Facebook: we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label
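The "start new cluster, reassign label" approach works because jobs target a stable label rather than a specific cluster. A minimal sketch of that indirection follows; the `ClusterLabels` class and the cluster ids are hypothetical and do not reflect Qubole's API.

```python
class ClusterLabels:
    """Maps stable labels (what jobs target) to concrete cluster ids."""

    def __init__(self):
        self._by_label = {}

    def assign(self, label, cluster_id):
        """Point `label` at `cluster_id`; return the previous id or None."""
        previous = self._by_label.get(label)
        self._by_label[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster id that new jobs for `label` should run on."""
        return self._by_label[label]


labels = ClusterLabels()
labels.assign("prod", "cluster-v1")           # existing cluster
old = labels.assign("prod", "cluster-v2")     # upgraded cluster takes over
```

After the second `assign`, new jobs submitted to "prod" resolve to the upgraded cluster, and the id returned in `old` can be drained and shut down once its running jobs finish.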
Support
Chat
Email
Visually browse Historical Jobs