atlhug 20150625
Cardlytics & Drill Use Case: Matching Big Data
David Kim
Principal Engineer
2015.06.25
About Cardlytics
© 2013 Cardlytics. Proprietary and Confidential. 2
• Privately held company leveraging a proprietary purchase-driven intelligence platform to provide actionable insights into consumer behavior to numerous organizations, using consumer purchase data that we have exclusive rights to
• Founded in 2008 by Scott Grimes (CEO) and Lynne Laube (COO), both former executives at Capital One
• Headquartered in Atlanta, we have 320 employees with offices in NY, Chicago, San Francisco & London
• Owns multiple patents and nearly 700 banking relationships in the US and the UK, representing over 100 million households and $1 trillion in yearly spend
Problem Statement
A customer (advertiser) requested analysis that would give them insight into their own business and customer base, so that they could make better business decisions.
• Must match advertiser customers to Cardlytics customers
• Matches must be highly confident and unique
Our Approach: Pattern Matching
[Figure: pattern-matching diagram; axis label: time]
Challenges
• Matches must be unique
• Matches must be highly confident
• Limited information available to match data points (no PII)
• Missing data points
• Scale (Drill)
» Depending on the advertiser, data points are sparse or densely packed
Scale Issues with Dense Data Points
Scale Issues
• 60M x 40M = 2.4 quadrillion (2.4 × 10^15) potential matches evaluated
• 120M x 120M = 14.4 quadrillion (1.44 × 10^16) potential matches evaluated
• 590M x 130M = 76.7 quadrillion (7.67 × 10^16) potential matches evaluated
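The number of potential matches is the product of the two data set sizes, so it grows very quickly. A minimal Python sketch for checking these counts follows; the function and constant names are illustrative only and do not come from the talk.

```python
# Candidate-pair counts for a full cross match: every advertiser data point
# is compared against every Cardlytics data point.
MILLION = 10**6
TRILLION = 10**12


def candidate_pairs(left_millions: int, right_millions: int) -> int:
    """Number of pairs evaluated when both sides are given in millions."""
    return (left_millions * MILLION) * (right_millions * MILLION)


for left, right in [(60, 40), (120, 120), (590, 130)]:
    pairs = candidate_pairs(left, right)
    print(f"{left}M x {right}M = {pairs:.2e} pairs ({pairs / TRILLION:,.0f} trillion)")
```

Because a million times a million is a trillion, the product of two counts given in millions comes out directly in trillions of pairs.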
Our Environment…
SQL Server: 64 cores (32 physical), 256GB RAM, direct-attached storage w/enterprise disks
Hadoop Cluster: 10 nodes, 32 cores/node, 128GB RAM, 12 x 4TB consumer grade disks
Actual Results…
• POC 1: 60M customer data points x 40M Cardlytics data points collected over 2 years
» SQL Server: ~20 hours
• POC 2: 120M x 120M over 6 months
» SQL Server: 1-2 months
» Hive: Killed after several days (estimated to take about a week)
» Drill: 17-18 hours yielding 91+B matching data points
• POC 3: 590M x 130M over 1 year
» Drill: ~17 hours to yield 1.3T matches and 72TB
» Required some tweaking and turning some secret knobs
…PROBABLY
…from the MapR Drill team
Compliments of Jacques Nadeau/Aman Sinha
• store.format
• store.parquet.block-size
• planner.broadcast_threshold
• planner.broadcast_factor
• planner.join.row_count_estimate_factor
• planner.enable_multiphase_agg
• planner.enable_mux_exchange
• exec.min_hash_table_size
• planner.enable_hashjoin
• select * from sys.options;
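The options listed above can be changed per session with Drill's `ALTER SESSION SET` statement. The sketch below only renders such statements as text. The values are hypothetical placeholders: the talk does not say which settings were actually used, and `alter_session` is a helper made up for this illustration.

```python
# Hypothetical values for illustration only; the talk does not give the
# settings that were actually used.
EXAMPLE_OPTIONS = {
    "planner.enable_hashjoin": True,
    "planner.enable_multiphase_agg": True,
    "planner.broadcast_threshold": 10_000_000,
    "store.format": "parquet",
}


def alter_session(name: str, value) -> str:
    """Render one Drill `ALTER SESSION SET` statement for an option."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        rendered = "true" if value else "false"
    elif isinstance(value, (int, float)):
        rendered = str(value)
    else:
        # String values are single-quoted, with embedded quotes doubled.
        rendered = "'" + str(value).replace("'", "''") + "'"
    return f"ALTER SESSION SET `{name}` = {rendered};"


statements = [alter_session(name, value) for name, value in EXAMPLE_OPTIONS.items()]
for statement in statements:
    print(statement)
```

Running `select * from sys.options;` before and after applying such statements is a simple way to confirm which values are in effect.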
Other Nuggets
• Drill is memory intensive
• You will always know more about your data than Drill
• Hadoop and Drill are great tools, but they don’t solve stupidity
• Some of the basic principles of querying a dataset still apply
» Intelligent batching
» Applying filters early to work with smaller datasets
» Bringing back only the data that you need
» Partitioning
» Understanding the configurations and internals of your tools
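The querying principles above (filter early, partition, batch, return only what is needed) can be illustrated with a toy Python sketch. This is not the Cardlytics matching logic, which the talk does not describe. The record layout, the exact-amount comparison, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: (record id, day number, amount in cents)
Record = Tuple[str, int, int]


def bucket_by_day(records: Iterable[Record], first_day: int, last_day: int) -> Dict[int, List[Record]]:
    """Filter to the date range first, then partition the survivors by day."""
    buckets: Dict[int, List[Record]] = defaultdict(list)
    for rec in records:
        if first_day <= rec[1] <= last_day:  # apply the filter early
            buckets[rec[1]].append(rec)
    return buckets


def match_in_batches(left, right, first_day, last_day):
    """Compare records only within the same day bucket.

    Returns (matches, comparisons), where matches holds only the id pairs.
    """
    left_b = bucket_by_day(left, first_day, last_day)
    right_b = bucket_by_day(right, first_day, last_day)
    matches = []
    comparisons = 0
    for day in sorted(left_b.keys() & right_b.keys()):  # one batch per day
        for l_id, _, l_amt in left_b[day]:
            for r_id, _, r_amt in right_b[day]:
                comparisons += 1
                if l_amt == r_amt:  # toy match rule, an assumption
                    matches.append((l_id, r_id))  # bring back only the ids
    return matches, comparisons


left = [("a1", 1, 500), ("a2", 1, 750), ("a3", 2, 500), ("a4", 9, 500)]
right = [("c1", 1, 500), ("c2", 2, 500), ("c3", 2, 900), ("c4", 9, 500)]
matches, comparisons = match_in_batches(left, right, first_day=1, last_day=2)
print(matches, comparisons, "of", len(left) * len(right), "possible comparisons")
```

In this toy data set, filtering and partitioning reduce the work from 16 comparisons for the full cross product to 4, which is the same idea that keeps a large join tractable.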
"Louis, I think this is the beginning of a beautiful friendship."
Our close partnership with MapR includes…
• Semi-weekly check-ins with Drill dev team
• Weekly check-ins with MapR product managers
• Improving Drill with real world applications, tests, and data
• Input to future roadmap
» Large IN-clause
» DST support
» Auto-partitioning
» Windowing functions
» Support for inserts
Grab a seat at the cool kids’ table!!
Careers @Cardlytics
http://cardlytics.com/cardlytics/?s=career
Apache Drill
https://drill.apache.org/
https://drill.apache.org/docs/
MapR
https://www.mapr.com/products/product-overview/apache-drill
Michael Fabacher, VP of Data Development [email protected]
David Kim, Principal Engineer [email protected]