helsinki spark meetup nov 20 2015

146
Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles After Dark 1.5 High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations Chris Fregly Principal Data Solutions Engineer IBM Spark Technology Center ** We’re Hiring -- Only Nice People, Please!! ** November 20, 2015

Upload: chris-fregly

Post on 19-Jan-2017

642 views

Category:

Software


5 download

TRANSCRIPT

Page 1: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

After Dark 1.5

High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing,

Text Analytics, and Recommendations

Chris Fregly Principal Data Solutions Engineer IBM Spark Technology Center

** We’re Hiring -- Only Nice People, Please!! **

November 20, 2015

Page 2: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

2

Streaming Data Engineer Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Founder Advanced Apache Meetup

Author Advanced .

Due 2016

My Ma’s First Time in California

Page 3: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Random Slide: More Ma “First Time” Pics

3

In California Using Chopsticks Using “New” iPhone

Page 4: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th)

Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd)

Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th)

Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza.io (Nov 10th)

4

San Francisco Advanced Spark (Nov 12th) Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th)

Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th)

Budapest Spark Meetup (Nov 26th) Singapore Strata Conference (Dec 1st) San Francisco Advanced Spark (Dec 8th)

Mountain View Advanced Spark (Dec 10th) Toronto Spark Meetup (Dec 14th) Austin Data Days Conference (Jan 2016)

Page 5: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Advanced Apache Spark Meetup Meetup Metrics 1600+ Members in just 4 mos! Top 5 Most Active Spark Meetup!! Meetup Goals   Dig deep into codebase of Spark and related projects   Study integrations of Cassandra, ElasticSearch,

Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R   Surface and share patterns and idioms of these

well-designed, distributed, big data components

Page 6: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

All Slides and Code Are Available!

advancedspark.com slideshare.net/cfregly

github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor

6

Page 7: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

What is “ After Dark”? Spark-based, Advanced Analytics Reference App End-to-End, Scalable, Real-time Big Data Pipeline Demonstration of Spark & Related Big Data Projects

7

github.com/fluxcapacitor

Page 8: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Tools of This Talk

8

  Kafka   Redis   Docker   Ganglia   Cassandra   Parquet, JSON, ORC, Avro   Apache Zeppelin Notebooks   Spark SQL, DataFrames, Hive   ElasticSearch, Logstash, Kibana   Spark ML, GraphX, Stanford CoreNLP

github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor

Page 9: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Themes of this Talk  Filter  Off-Heap  Parallelize  Approximate  Find Similarity  Minimize Seeks  Maximize Scans  Customize for Workload  Tune Performance At Every Layer

9

  Be Nice, Collaborate! Like a Mom!!

Page 10: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Presentation Outline

 Spark Core: Tuning & Mechanical Sympathy

 Spark SQL: Query Optimizing & Catalyst

 Spark Streaming: Scaling & Approximations

 Spark ML: Featurizing & Recommendations 10

Page 11: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Spark Core: Tuning & Mechanical Sympathy Understand and Acknowledge Mechanical Sympathy

Study AlphaSort and 100Tb GraySort Challenge

Dive Deep into Project Tungsten

11

Page 12: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Mechanical Sympathy Hardware and software working together in harmony. - Martin Thompson http://mechanical-sympathy.blogspot.com

Whatever your data structure, my array will beat it. - Scott Meyers Every C++ Book, basically

12

Hair Sympathy

- Bruce Jenner

Page 13: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark and Mechanical Sympathy

13

Project Tungsten (Spark 1.4-1.6+)

GraySort Challenge (Spark 1.1-1.2)

Minimize Memory and GC Maximize CPU Cache Locality

Saturate Network I/O Saturate Disk I/O

Page 14: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

AlphaSort Technique: Sort 100 Bytes Recs

14

Value

Ptr Key Dereference Not Required! AlphaSort

List [(Key, Pointer)] Key is directly available for comparison

Naïve List [Pointer] Must dereference key for comparison

Ptr Dereference for Key Comparison

Key

Page 15: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Line and Memory Sympathy Key (10 bytes)+Pointer (*4 bytes)*Compressed OOPs = 14 bytes

15

Key Ptr

Not CPU Cache-line Friendly!

Ptr Key-Prefix

2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes

Key (10 bytes)+Pad (2 bytes)+Pointer (4 bytes) = 16 bytes Key Ptr

Pad

/Pad CPU Cache-line Friendly!

Page 16: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Performance Comparison

16

Page 17: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Similar Trick: Direct Cache Access (DCA) Pull out packet header along side pointer to payload

17

Page 18: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Line Sizes

18

MyLaptop

MySoftLayerBareMetal

Page 19: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Cache Hits: Sequential v Random Access

19

Page 20: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Mechanical Sympathy CPU Cache Lines and Matrix Multiplication

20

Page 21: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Naïve Matrix Multiplication

// Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];

21

Bad: Row-wise traversal, not using CPU cache line,

ineffective pre-fetching

Page 22: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ];

// Modify dot product calculation for B Transpose for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];

22

Good: Full CPU cache line, effective prefetching

OLD: res[ i ][ j ] += matA[ i ][ k ] * matB [ k ] [ j ];

Reference jbefore k

Page 23: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Instrumenting and Monitoring CPU Use Linux perf command!

23

http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Page 24: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication

24

Page 25: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Results of Matrix Multiply Comparison

Naïve Matrix Multiply

25

Cache-Friendly Matrix Multiply ~27x ~13x ~13x ~2x

perf stat -XX:-Inline –event \ L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \

LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

~10x 55 hp 550 hp L1-dcache-load-misses 7916379222 214482863052

27.09355591

L1-dcache-prefetch-misses 4248568884 114878389282.70393142

8

LLC-load-misses 336612743 449261229613.3465306

6

LLC-prefetch-misses 4300544980 497580590.01157017

5cache-misses 320086472 4447068200 13.8933338

stalled-cycles-frontend 1575227969401 52463114772913.33050934

8

elapsed@me 1073 22982.14163124

6

Page 26: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Mechanical Sympathy CPU Cache Lines and Lock-Free Thread Sync

26

Page 27: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Naïve Tuple Counters object CacheNaiveTupleIncrement { var tuple = (0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = { this.synchronized { tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement) tuple } } }

27

Page 28: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Naïve Case Class Counters case class MyTuple(left: Int, right: Int) object CacheNaiveCaseClassCounters { var tuple = new MyTuple(0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = { this.synchronized { tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)

tuple } } }

28

Page 29: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU Cache Friendly Lock-Free Counters object CacheFriendlyLockFreeCounters { // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each) val tuple = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalLong = 0L var updatedLong = 0L do {

originalLong = tuple.get() val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter val updatedRightInt = originalRightInt + rightIncrement // increment right counter val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter

updatedLong = updatedLeftInt // update the new long with the left counter updatedLong = updatedLong << 32 // shift the new long left updatedLong += updatedRightInt // update the new long with the right counter

} while (tuple.compareAndSet(originalLong, updatedLong) == false) updatedLong } }

29

Q: Why not @volatile long?

A: Java Memory Model does not guarantee synchronousupdates of 64-bit longs or doubles

Page 30: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync

30

Page 31: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Results of Counters Comparison Naïve Tuple Counters

Naïve Case Class Counters

31

Cache Friendly Lock-Free Counters

~2x ~1.5x

~3.5x ~2x ~2x

~1.5x

~1.5x

~1.5x

Page 32: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Profiling Visualizations: Flame Graphs

32 Example: Spark Word Count

Java Stack Traces (-XX:+PreserveFramePointer)

Plateausare Bad!!

Page 33: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

100TB Daytona GraySort Challenge Focus on Network and Disk I/O Optimizations

Improve Data Structs/Algos for Sort & Shuffle

Saturate Network and Disk Controllers 33

Page 34: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Winning Results

34

Spark Goals   Saturate Network I/O   Saturate Disk I/O

(2013) (2014)

Page 35: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Winning Hardware Configuration Compute 206 Workers, 1 Master (AWS EC2 i2.8xlarge) 32 Intel Xeon CPU E5-2670 @ 2.5 Ghz 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4 3 GBps mixed read/write disk I/O per node

Network AWS Placement Groups, VPC, Enhanced Networking Single Root I/O Virtualization (SR-IOV) 10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)

35

Page 36: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Winning Software Configuration Spark 1.2, OpenJDK 1.7 Disable caching, compression, spec execution, shuffle spill Force NODE_LOCAL task scheduling for optimal data locality HDFS 2.4.1 short-circuit local reads, 2x replication Empirically chose between 4-6 partitions per cpu 206 nodes * 32 cores = 6592 cores 6592 cores * 4 = 26,368 partitions 6592 cores * 6 = 39,552 partitions 6592 cores * 4.25 = 28,000 partitions (empirical best)

Range partitioning takes advantage of sequential keyspace Required ~10s of sampling 79 keys from in each partition

36

Page 37: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

New Sort Shuffle Manager for Spark 1.2 Original “hash-based” New “sort-based” ①  Use less OS resources (socket buffers, file descriptors) ②  TimSort partitions in-memory ③  MergeSort partitions on-disk into a single master file ④  Serve partitions from master file: seek once, sequential scan

37

Page 38: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Asynchronous Network Module Switch to asyncronous Netty vs. synchronous java.nio Switch to zero-copy epoll Use only kernel-space between disk and network controllers

Custom memory management spark.shuffle.blockTransferService=netty

Spark-Netty Performance Tuning spark.shuffle.io.preferDirectBuffers=true Reuse off-heap buffers spark.shuffle.io.numConnectionsPerPeer=8 (for example) Increase to saturate hosts with multiple disks (8x800 SSD)

38

Details in SPARK-2468

Page 39: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Custom Algorithms and Data Structures Optimized for sort & shuffle workloads o.a.s.util.collection.TimSort[K,V] Based on JDK 1.7 TimSort Performs best with partially-sorted runs Optimized for elements of (K,V) pairs Sorts impl of SortDataFormat (ie. KVArraySortDataFormat)

o.a.s.util.collection.AppendOnlyMap Open addressing hash, quadratic probing Array of [(K, V), (K, V)] Good memory locality Keys never removed, values only append

39

Page 40: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Daytona GraySort Challenge Goal Success

1.1 Gbps/node network I/O (Reducers) Theoretical max = 1.25 Gbps for 10 GB ethernet

3 GBps/node disk I/O (Mappers)

40

Aggregate Cluster

Network I/O!

220 Gbps / 206 nodes ~= 1.1 Gbps per node

Page 41: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Shuffle Performance Tuning Tips Hash Shuffle Manager (Deprecated) spark.shuffle.consolidateFiles (Mapper) o.a.s.shuffle.FileShuffleBlockResolver

Intermediate Files Increase spark.shuffle.file.buffer (Reducer) Increase spark.reducer.maxSizeInFlight if memory allows

Use Smaller Number of Larger Executors Minimizes intermediate files and overall shuffle More opportunity for PROCESS_LOCAL

SQL: BroadcastHashJoin vs. ShuffledHashJoin spark.sql.autoBroadcastJoinThreshold Use DataFrame.explain(true) or EXPLAIN to verify

41

Many Threads (1 per CPU)

Page 42: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Project Tungsten Data Struts & Algos Operate Directly on Byte Arrays

Maximize CPU Cache Locality, Minimize GC

Utilize Dynamic Code Generation

42

SPARK-7076 (Spark 1.4)

Page 43: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Quick Review of Project Tungsten Jiras

43

SPARK-7076 (Spark 1.4)

Page 44: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Why is CPU the Bottleneck? CPU is used for serialization, hashing, compression! Network and Disk I/O bandwidth are relatively high GraySort optimizations improved network & shuffle Partitioning, pruning, and predicate pushdowns Binary, compressed, columnar file formats (Parquet)

44

Page 45: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Yet Another Spark Shuffle Manager! spark.shuffle.manager = hash (Deprecated) < 10,000 reducers Output partition file hashes the key of (K,V) pair Mapper creates an output file per partition Leads to M*P output files for all partitions sort (GraySort Challenge) > 10,000 reducers Default from Spark 1.2-1.5 Mapper creates single output file for all partitions Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory Uses custom data structures and algorithms for sort-shuffle workload Wins Daytona GraySort Challenge tungsten-sort (Project Tungsten) Default since 1.5 Modification of existing sort-based shuffle Uses com.misc.Unsafe for self-managed memory and garbage collection Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms Perform joins, sorts, and other operators on both serialized and compressed byte buffers

45

Page 46: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CPU & Memory Optimizations Custom Managed Memory Reduces GC overhead Both on and off heap Exact size calculations

Direct Binary Processing Operate on serialized/compressed arrays Kryo can reorder/sort serialized records LZF can reorder/sort compressed records

More CPU Cache-aware Data Structs & Algorithms o.a.s.sql.catalyst.expression.UnsafeRow o.a.s.unsafe.map.BytesToBytesMap

Code Generation (default in 1.5) Generate source code from overall query plan 100+ UDFs converted to use code generation

46

UnsafeFixedWithAggregationMap TungstenAggregationIterator

CodeGenerator GeneratorUnsafeRowJoiner

UnsafeSortDataFormat UnsafeShuffleSortDataFormat

PackedRecordPointer UnsafeRow

UnsafeInMemorySorter UnsafeExternalSorter UnsafeShuffleWriter

Mostly Same Join Code, UnsafeProjection

UnsafeShuffleManager UnsafeShuffleInMemorySorter UnsafeShuffleExternalSorter

Details in SPARK-7075

Page 47: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

sun.misc.Unsafe

47

Info addressSize() pageSize()

Objects allocateInstance() objectFieldOffset()

Classes staticFieldOffset() defineClass() defineAnonymousClass() ensureClassInitialized()

Synchronization monitorEnter() tryMonitorEnter() monitorExit() compareAndSwapInt() putOrderedInt()

Arrays arrayBaseOffset() arrayIndexScale()

Memory allocateMemory() copyMemory() freeMemory() getAddress() – not guaranteed after GC getInt()/putInt() getBoolean()/putBoolean() getByte()/putByte() getShort()/putShort() getLong()/putLong() getFloat()/putFloat() getDouble()/putDouble() getObjectVolatile()/putObjectVolatile()

Used by Tungsten

Page 48: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark + com.misc.Unsafe

48

org.apache.spark.sql.execution. aggregate.SortBasedAggregate aggregate.TungstenAggregate aggregate.AggregationIterator aggregate.udaf aggregate.utils SparkPlanner rowFormatConverters UnsafeFixedWidthAggregationMap UnsafeExternalSorter UnsafeExternalRowSorter UnsafeKeyValueSorter UnsafeKVExternalSorter local.ConvertToUnsafeNode local.ConvertToSafeNode local.HashJoinNode local.ProjectNode local.LocalNode local.BinaryHashJoinNode local.NestedLoopJoinNode joins.HashJoin joins.HashSemiJoin joins.HashedRelation joins.BroadcastHashJoin joins.ShuffledHashOuterJoin (not yet converted) joins.BroadcastHashOuterJoin joins.BroadcastLeftSemiJoinHash joins.BroadcastNestedLoopJoin joins.SortMergeJoin joins.LeftSemiJoinBNL joins.SortMergerOuterJoin Exchange SparkPlan UnsafeRowSerializer SortPrefixUtils sort basicOperators aggregate.SortBasedAggregationIterator aggregate.TungstenAggregationIterator datasources.WriterContainer datasources.json.JacksonParser datasources.jdbc.JDBCRDD Window

org.apache.spark. unsafe.Platform unsafe.KVIterator unsafe.array.LongArray unsafe.array.ByteArrayMethods unsafe.array.BitSet unsafe.bitset.BitSetMethods unsafe.hash.Murmur3_x86_32 unsafe.map.BytesToBytesMap unsafe.map.HashMapGrowthStrategy unsafe.memory.TaskMemoryManager unsafe.memory.ExecutorMemoryManager unsafe.memory.MemoryLocation unsafe.memory.UnsafeMemoryAllocator unsafe.memory.MemoryAllocator (trait/interface) unsafe.memory.MemoryBlock unsafe.memory.HeapMemoryAllocator unsafe.memory.ExecutorMemoryManager unsafe.sort.RecordComparator unsafe.sort.PrefixComparator unsafe.sort.PrefixComparators unsafe.sort.UnsafeSorterSpillWriter serializer.DummySerializationInstance shuffle.unsafe.UnsafeShuffleManager shuffle.unsafe.UnsafeShuffleSortDataFormat shuffle.unsafe.SpillInfo shuffle.unsafe.UnsafeShuffleWriter shuffle.unsafe.UnsafeShuffleExternalSorter shuffle.unsafe.PackedRecordPointer shuffle.ShuffleMemoryManager util.collection.unsafe.sort.UnsafeSorterSpillMerger util.collection.unsafe.sort.UnsafeSorterSpillReader util.collection.unsafe.sort.UnsafeSorterSpillWriter util.collection.unsafe.sort.UnsafeShuffleInMemorySorter util.collection.unsafe.sort.UnsafeInMemorySorter util.collection.unsafe.sort.RecordPointerAndKeyPrefix util.collection.unsafe.sort.UnsafeSorterIterator network.shuffle.ExternalShuffleBlockResolver scheduler.Task rdd.SqlNewHadoopRDD executor.Executor

org.apache.spark.sql.catalyst.expressions. regexpExpressions BoundAttribute SortOrder SpecializedGetters ExpressionEvalHelper UnsafeArrayData UnsafeReaders UnsafeMapData Projection LiteralGeneartor UnsafeRow JoinedRow SpecializedGetters InputFileName SpecificMutableRow codegen.CodeGenerator codegen.GenerateProjection codegen.GenerateUnsafeRowJoiner codegen.GenerateSafeProjection codegen.GenerateUnsafeProjection codegen.BufferHolder codegen.UnsafeRowWriter codegen.UnsafeArrayWriter complexTypeCreator rows literals misc stringExpressions

Over 200 source files affected!!

Page 49: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Traditional Java Object Row Layout 4-byte String

Multi-field Object

49

Page 50: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Custom Data Structures for Workload UnsafeRow

(Dense Binary Row)

TaskMemoryManager (Virtual Memory Address)

BytesToBytesMap (Dense Binary HashMap)

50

Dense, 8-bytes per field (word-aligned)

Key Ptr

AlphaSort-Style (Key + Pointer)

OS-Style Memory Paging

Page 51: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

UnsafeRow Layout Example

51

Pre-Tungsten

Tungsten

Page 52: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Custom Memory Management o.a.s.memory. TaskMemoryManager & MemoryConsumer Memory management: virtual memory allocation, pageing Off-heap: direct 64-bit address On-heap: 13-bit page num + 27-bit page offset

o.a.s.shuffle.sort. PackedRecordPointer 64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset))

o.a.s.unsafe.types. UTF8String Primitive Array[Byte]

52

2^13 pages * 2^27 page size = 1 TB RAM per Task

Page 53: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

UnsafeFixedWidthAggregationMap

Aggregations o.a.s.sql.execution. UnsafeFixedWidthAggregationMap Uses BytesToBytesMap In-place updates of serialized data No object creation on hot-path Improved external agg support No OOM’s for large, single key aggs

o.a.s.sql.catalyst.expression.codegen. GenerateUnsafeRowJoiner Combine 2 UnsafeRows into 1

o.a.s.sql.execution.aggregate. TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 Steps: hash-based agg (grouping), then sort-based agg Supports spilling and external merge sorting

53

Page 54: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Equality Bitwise comparison on UnsafeRow No need to calculate equals(), hashCode() Row 1

Equals! Row 2

54

Page 55: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Joins Surprisingly, not many code changes o.a.s.sql.catalyst.expressions. UnsafeProjection Converts InternalRow to UnsafeRow

55

Page 56: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Sorting o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeInMemorySorter UnsafeExternalSorter RecordPointerAndKeyPrefix UnsafeShuffleWriter

AlphaSort-Style Cache Friendly

56

Ptr Key-Prefix

2x CPU Cache-line Friendly!

Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.

Supports merging compressed records if compression CODEC supports it (LZF)

Page 57: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spilling Efficient Spilling Exact data size is known No need to maintain heuristics & approximations Controls amount of spilling

Spill merge on compressed, binary records! If compression CODEC supports it

57

UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()

Exact Peak Memory for Spark Jobs

Page 58: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Code Generation Problem Boxing causes excessive object creation Expensive expression tree evals per row JVM can’t inline polymorphic impls

Solution Codegen by-passes virtual function calls Defer source code generation to each operator, UDF, UDAF Use Scala quasiquote macros for Scala AST source code gen Rewrite and optimize code for overall plan, 8-byte align, etc Use Janino to compile generated source code into bytecode

58

Page 59: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles IBM | spark.tc

Spark SQL UDF Code Generation 100+ UDFs now generating code

More to come in Spark 1.6+

Details in SPARK-8159, SPARK-9571

Each Implements Expression.genCode() !

Page 60: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Creating a Custom UDF with Codegen Study existing implementations https://github.com/apache/spark/pull/7214/files

Extend base trait o.a.s.sql.catalyst.expressions.Expression.genCode()

Register the function o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()

Augment DataFrame with new UDF (Scala implicits) o.a.s.sql.functions.scala

Don’t forget about Python! python.pyspark.sql.functions.py 60

Page 61: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Benefits from Project Tungsten? Users of DataFrames All Spark SQL Queries Catalyst

All RDDs Serialization, Compression, and Aggregations

61

Page 62: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Project Tungsten Performance Results Query Time

Garbage Collection

62

OOM’d on Large Dataset!

Page 63: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Presentation Outline

 Spark Core: Tuning & Mechanical Sympathy

 Spark SQL: Query Optimizing & Catalyst

 Spark Streaming: Scaling & Approximations

 Spark ML: Featurizing & Recommendations 63

Page 64: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Spark SQL: Query Optimizing & Catalyst Explore DataFrames/Datasets/DataSources, Catalyst

Review Partitions, Pruning, Pushdowns, File Formats

Create a Custom DataSource API Implementation

64

Page 65: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

DataFrames Inspired by R and Pandas DataFrames

Schema-aware Cross language support

SQL, Python, Scala, Java, R Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serializing to Python DataFrame is container for logical plan

Lazy transformations represented as tree Only logical plan is sent from Python -> JVM

Only results returned from JVM -> Python UDF and UDAF Support

Custom UDF support using registerFunction() Experimental UDAF support (ie. HyperLogLog)

Supports existing Hive metastore if available Small, file-based Hive metastore created if not available

*DataFrame.rdd returns underlying RDD if needed

65

Use DataFrames instead of RDDs!!

Page 66: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark and Hive Early days, Shark was “Hive on Spark” Hive Optimizer slowly replaced with Catalyst Always use HiveContext – even if not using Hive! If no Hive, a small Hive metastore file is created

Spark 1.5+ supports all Hive versions 0.12+ Separate classloaders for isolation Breaks dependency between Spark internal Hive

version and User’s external Hive version

66

Page 67: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Catalyst Optimizer Optimize DataFrame Transformation Tree Subquery elimination: use aliases to collapse subqueries Constant folding: replace expression with constant Simplify filters: remove unnecessary filters Predicate/filter pushdowns: avoid unnecessary data load Projection collapsing: avoid unnecessary projections Create Custom Rules Rules are Scala Case Classes val newPlan = MyFilterRule(analyzedPlan) 67

Implements oas.sql.catalyst.rules.Rule

Apply to any plan stage

Page 68: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

DataSources API Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory

Execution (o.a.s.sql.execution.commands.scala) RunnableCommand (trait/interface): Common commands like EXPLAIN ExplainCommand(impl: case class) CacheTableCommand(impl: case class)

Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all predicates/filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl)

68

Page 69: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Native Spark SQL DataSources

69

Page 70: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Query Plan Debugging

70

gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)

DataFrame.queryExecution.logical

DataFrame.queryExecution.analyzed

DataFrame.queryExecution.optimizedPlan

DataFrame.queryExecution.executedPlan

Page 71: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Query Plan Visualization & Metrics

71

Effectiveness of Filter

CPU Cache Friendly

Binary Format Cost-based Join Optimization

Similar to MapReduce

Map-side Join

Peak Memory for Joins and Aggs

Page 72: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json")

.load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json ("file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 72

json() convenience method

Page 73: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

JDBC Data Source Add Driver to Spark JVM System Classpath $ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame val jdbcConfig = Map("driver" -> "org.postgresql.Driver", "url" -> "jdbc:postgresql:hostname:port/database", "dbtable" -> ”schema.tablename") df.read.format("jdbc").options(jdbcConfig).load()

SQL CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)

73

Page 74: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parquet Data Source Configuration

spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=true spark.sql.parquet.cacheMetadata=true spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

74

Page 75: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

ORC Data Source Configuration spark.sql.orc.filterPushdown=true

DataFrames val gendersDF = sqlContext.read.format("orc") .load("file:/root/pipeline/datasets/dating/genders") gendersDF.write.format("orc").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders")

SQL CREATE TABLE genders USING orc OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

75

Page 76: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Third-Party Spark SQL DataSources

76

spark-packages.org

Page 77: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CSV DataSource (Databricks) Github https://github.com/databricks/spark-csv

Maven com.databricks:spark-csv_2.10:1.2.0

Code val gendersCsvDF = sqlContext.read .format("com.databricks.spark.csv") .load("file:/root/pipeline/datasets/dating/gender.csv.bz2") .toDF("id", "gender")

77

toDF() is required if CSV does not contain header

Page 78: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

ElasticSearch DataSource (Elastic.co) Github https://github.com/elastic/elasticsearch-hadoop

Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>")

78

Page 79: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Elasticsearch Tips

Change id field to not_analyzed to avoid indexing

Use term filter to build and cache the query

Perform multiple aggregations in a single request

Adapt scoring function to current trends at query time

79

Page 80: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

AWS Redshift Data Source (Databricks) Github https://github.com/databricks/spark-redshift

Maven com.databricks:spark-redshift:0.5.0

Code val df: DataFrame = sqlContext.read

.format("com.databricks.spark.redshift") .option("url", "jdbc:redshift://<hostname>:<port>/<database>…") .option("query", "select x, count(*) my_table group by x") .option("tempdir", "s3n://tmpdir") .load(...)

80

UNLOAD and copy to tmp bucket in S3 enables

parallel reads

Page 81: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

DB2 and BigSQL DataSources (IBM) Coming Soon!

81

Page 82: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Cassandra DataSource (DataStax) Github https://github.com/datastax/spark-cassandra-connector

Maven com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write .format("org.apache.spark.sql.cassandra") .mode(SaveMode.Append) .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)

82

Page 83: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Cassandra Pushdown Support spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Pushdown Predicate Rules 1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate 2. Only push down primary key column predicates with = or IN predicate. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column,

only one predicate is allowed. 5. For cluster column predicates, only last predicate can be non-EQ predicate

including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.

6. There is no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them

is equality or IN predicate.

83

Page 84: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

New Cassandra DataSource By-pass CQL optimized for transactional data Instead, do bulk reads/writes directly on SSTables Similar to 5 year old Netflix Open Source project Aegisthus

Promotes Cassandra to first-class Analytics Option Potentially only part of DataStax Enterprise?! Please mail a nasty letter to your local DataStax office

84

Page 85: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Rumor of REST DataSource (Databricks) Coming Soon?

Ask Michael Armbrust Spark SQL Lead @ Databricks

85

Page 86: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Custom DataSource (Me and You!) Coming Right Now!

86

DEMO ALERT!!

Page 87: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Create a Custom DataSource Study Existing Native & Third-Party Data Sources Native Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation

Third-Party DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation!

87

Page 88: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Create a Custom DataSource

88

Page 89: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Contribute a Custom Data Source spark-packages.org Managed by Contains links to external github projects Ratings and comments Declare Spark version support for each package

Examples https://github.com/databricks/spark-csv https://github.com/databricks/spark-avro https://github.com/databricks/spark-redshift

89

Page 90: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parquet Columnar File Format Based on Google Dremel

Collaboration with Twitter and Cloudera

Self-describing, evolving schema

Fast columnar aggregation

Supports filter pushdowns

Columnar storage format

Excellent compression

90

Page 91: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Compression Run Length Encoding: Repeated data Dictionary Encoding: Fixed set of values

Delta, Prefix Encoding: Sorted data

91

Page 92: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Demonstrate File Formats, Partition Schemes, and Query Plans

92

Page 93: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Hive JDBC ODBC ThriftServer Allow BI Tools to Query and Process Spark Data Register Permanent Table CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT) USING org.apache.spark.sql.json OPTIONS (path "datasets/dating/ratings.json.bz2")

Register Temp Table ratingsDF.registerTempTable("ratings_temp")

Configuration spark.sql.thriftServer.incrementalCollect=true spark.driver.maxResultSize > 10gb (default)

93

Page 94: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Query and Process Spark Data from BI Tools

94

Page 95: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Presentation Outline

 Spark Core: Tuning & Mechanical Sympathy

 Spark SQL: Query Optimizing & Catalyst

 Spark Streaming: Scaling & Approximations

 Spark ML: Featurizing & Recommendations 95

Page 96: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Spark Streaming: Scaling & Approximations Discuss Delivery Guarantees, Parallelism, and Stability

Compare Receiver and Receiver-less Impls

Demonstrate Stream Approximations

96

Page 97: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Non-Parallel Receiver Implementation

97

Page 98: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Receiver Implementation (Kinesis)   KinesisRDD partitions store relevant offsets   Single receiver required to see all data/offsets   Kinesis offsets not deterministic like Kafka   Partitions rebuild from Kinesis using offsets   No Write Ahead Log (WAL) needed   Optimizes happy path by avoiding the WAL   At least once delivery guarantee

98

Page 99: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parallel Receiver-less Implementation (Kafka)

99

Page 100: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Receiver-less Implementation (Kafka)   KafkaRDD partitions store relevant offsets   Each partition acts as a Receiver   Tasks/Executors pull from Kafka in parallel   Partitions rebuild from Kafka using offsets   No Write Ahead Log (WAL) needed   Optimizes happy path by avoiding the WAL   At least once delivery guarantee

100

Page 101: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Maintain Stability of Stream Processing Rate Limiting Since Spark 1.2 Fixed limit on number of messages per second Potential to drops messages on the floor

Back Pressure Since Spark 1.5 (TypeSafe Contribution) More dynamic than rate limiting Push back on reliable, buffered source (Kafka, Kinesis) Fundamentals of Control Theory and Observability

101

Page 102: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Streaming Approximations HyperLogLog and CountMin Sketch

102

Page 103: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

HyperLogLog (HLL) Approx Distinct Count   Approximate count distinct   Twitter’s Algebird   Better than HashSet   Low, fixed memory   Only 1.5K, 2% error,10^9 counts (tunable) Redis HLL: 12K per key, 0.81%, 2^64 counts   Spark’s countApproxDistinctByKey()   Streaming example in Spark codebase

103

http://research.neustar.biz/

Page 104: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

CountMin Sketch (CMS) Approx Count   Approximate count   Twitter’s Algebird   Better than HashMap   Low, fixed memory   Known error bounds   Large num counters   Streaming example in Spark codebase

104

Page 105: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Using HLL and CMS for Streaming Count Approximations

105

Page 106: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Monte Carlo Simulations From Manhattan Project (Atomic bomb) Simulate movement of neutrons Law of Large Numbers (LLN) Average of results of many trials Converge on expected value SparkPi example in Spark codebase 1 Argument: # of trials Pi ~= # red dots

/ # total dots * 4

106

Page 107: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Using a Monte Carlo Simulation to Estimate Pi

107

Page 108: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Streaming Best Practices Get Data Out of Streaming ASAP Processing interval may exceed batch interval Leads to unstable streaming system

Please Don’t… Use updateStateByKey() like an in-memory DB Put streaming jobs on the request/response hot path

Use Separate Jobs for Different Batch Intervals Small Batch Interval: Store raw data (Redis, Cassandra, etc) Medium Batch Interval: Transform, join, process data High Batch Interval: Model training

Gotchas Tune streamingContext.remember()

Use Approximations!! 108

Page 109: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Presentation Outline

 Spark Core: Tuning & Mechanical Sympathy

 Spark SQL: Query Optimizing & Catalyst

 Spark Streaming: Scaling & Approximations

 Spark ML: Featurizing & Recommendations 109

Page 110: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Spark ML: Featurizing & Recommendations Understand Similarity and Dimension Reduction

Demonstrate Sampling and Bucketing

Generate Recommendations

110

Page 111: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Live, Interactive Demo! sparkafterdark.com

111

Page 112: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Audience Participation Needed!!

112

-> You are

here ->

Audience Instructions   Navigate to sparkafterdark.com

  Click 3 actresses and 3 actors   Wait for us to analyze together!

Note: This is totally anonymous!! Project Links   https://github.com/fluxcapacitor/pipeline

  https://hub.docker.com/r/fluxcapacitor

Page 113: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Similarity

113

Page 114: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Similarity Euclidean Linear-based measure Suffers from Magnitude bias Cosine Angle-based measure Adjusts for magnitude bias Jaccard Set intersection / union Suffers Popularity bias Log Likelihood Netflix “Shawshank” Problem Adjusts for popularity bias

114

Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1

z!

Page 115: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

All-Pairs Similarity Comparison Compare everything to everything aka. “pair-wise similarity” or “similarity join” Naïve shuffle: O(m*n^2); m=rows, n=cols Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis

115

Dimension reduction!!

Page 116: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Dimension Reduction Sampling and Bucketing

116

Page 117: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: DIMSUM Sampling “Dimension Independent Matrix Square Using MR” Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities(…)

Twitter: 40% efficiency gain vs. Cosine Similarity 117

Page 118: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: LSH Bucketing “Locality Sensitive Hashing” Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Parallel compare bucket contents O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets

ie. 500k x 500k matrix O(1.25e17) -> O(1.25e13); b=50

118

github.com/mrsqueeze/spark-hash

Page 119: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce n: Remove Most Frequent Value Eliminate most-frequent value Represent other values with (index,value) pairs Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Note: Choose most frequent value (may not be 0)

119

(index,value)

(index,value)

Page 120: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Recommendations Summary Statistics and Top-K Historical Analysis

Collaborative Filtering and Clustering

Text Featurization and NLP

120

Page 121: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Recommendations Non-personalized No preference or behavior data for user, yet aka “Cold Start Problem” Personalized User-Item Similarity Items that others with similar prefs have liked

Item-Item Similarity Items similar to your previously-liked items

121

Page 122: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Recommendation Terminology Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll

Feature Engineering Dimension reduction, polynomial expansion

Hyper-parameter Tuning K-Folds Cross Validation, Grid Search

Pipelines/Workflows Chaining together Transformers and Evaluators

122

Page 123: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Single Machine ML Algorithms Stay Local, Distribute As Needed

Helps migration of existing single-node algos to Spark

Convert between Spark and Pandas DataFrames

New “pdspark” package: integration w/ scikitlearn, R

123

Page 124: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Non-Personalized Recommendations Use Aggregate Data to Generate Recommendations

124

Page 125: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Top Users by Like Count “I might like users who have the most-likes overall

based on historical data.” SparkSQL, DataFrames: Summary Stat, Aggs

125

Page 126: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Top Influencers by Like Graph “I might like the most-influential users in overall like graph.”

GraphX: PageRank

126

Page 127: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Generate Non-Personalized Recommendations

127

Page 128: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Personalized Recommendations Understand Similarity and Personalized Recommendations

128

Page 129: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Like Behavior of Similar Users “I like the same people that you like.

What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity

129

Page 130: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Generate Personalized Recommendations using

Collaborative Filtering & Matrix Factorization

130

Page 131: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Similar Text-based Profiles as Me “Our profiles have similar keywords and named entities.

We might like each other!” MLlib: Word2Vec, TF/IDF, k-skip n-grams

131

Page 132: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Similar Profiles to Previous Likes

132

“Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!”

MLlib: Word2Vec, TF/IDF, Doc Similarity

Page 133: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Relevant, High-Value Emails “Your initial email references a lot of things in my profile.

I might like you for making the effort!” MLlib: Word2Vec, TF/IDF, Entity Recognition

133

^ Her Email < My Profile

Page 134: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Demo! Feature Engineering for Text/NLP Use Cases

134

Page 135: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

The Future of Recommendations

135

Page 136: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Eigenfaces: Facial Recognition “Your face looks similar to others that I’ve liked.

I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity

136 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 137: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  NLP Conversation Starter Bot! “If your responses to my generic opening lines are positive, I may read your profile.”

MLlib: TF/IDF, DecisionTrees, Sentiment Analysis

137

Positive Negative

Page 138: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles 138

Maintaining the Spark

Page 139: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

⑨  Recommendations for Couples “I want Mad Max. You want Message In a Bottle.

Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity

GraphX: Nearest Neighbors, Shortest Path similar similar •  plots -> <- actors

139

Page 140: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

Final Recommendation!

140

Page 141: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

  Get Off the Computer & Meet People! Thank you, Helsinki!! Chris Fregly @cfregly IBM Spark Technology Center San Francisco, CA, USA

Relevant Links advancedspark.com Signup for the book & global meetup! github.com/fluxcapacitor/pipeline Clone, contribute, and commit code! hub.docker.com/r/fluxcapacitor/pipeline/wiki Run all demos in your own environment with Docker!

141

Page 142: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

More Relevant Links http://meetup.com/Advanced-Apache-Spark-Meetup http://advancedspark.com http://github.com/fluxcapacitor/pipeline http://hub.docker.com/r/fluxcapacitor/pipeline http://sortbenchmark.org/ApacheSpark2014.pd https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches) http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do) https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html http://www.brendangregg.com/perf.html https://perf.wiki.kernel.org/index.php/Tutorial http://techblog.netflix.com/2015/07/java-in-flames.html http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do 142

Page 143: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles

What’s Next?

143

Page 144: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

What’s Next? Autoscaling Spark Workers Completely Docker-based Docker Compose and Docker Machine

Lots of Demos and Examples! Zeppelin & IPython/Jupyter notebooks Advanced streaming use cases Advanced ML, Graph, and NLP use cases

Performance Tuning and Profiling Work closely with Brendan Gregg & Netflix Surface & share more low-level details of Spark internals

144

Page 145: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th)

Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd)

Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th)

Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza.io (Nov 10th)

145

San Francisco Advanced Spark (Nov 12th) Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th)

Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th)

Budapest Spark Meetup (Nov 26th) Singapore Strata Conference (Dec 1st) San Francisco Advanced Spark (Dec 8th)

Mountain View Advanced Spark (Dec 10th) Toronto Spark Meetup (Dec 14th) Austin Data Days Conference (Jan 2016)

Page 146: Helsinki Spark Meetup Nov 20 2015

Click to edit Master text styles

Click to edit Master text styles

IBM Spark spark.tc

Click to edit Master text styles Power of data. Simplicity of design. Speed of innovation.

IBM Spark