cloud computing applications - hazelcast, spark and ignite

Cloud Computing Applications Hazelcast, Spark and Ignite Joseph S. Kuo a.k.a. CyberJos

Upload: joseph-kuo

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Cloud Computing Applications - Hazelcast, Spark and Ignite

Cloud Computing Applications Hazelcast, Spark and Ignite

Joseph S. Kuo a.k.a. CyberJos

Page 2: Cloud Computing Applications - Hazelcast, Spark and Ignite

About Me

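.Played with a bunch of languages and architectures while majoring in mathematics at university

.22 years of programming experience, 17 years of Java experience

.Worked as an IT instructor; previously employed at a game cloud platform company, a global e-commerce company, a well-known information security company, and a social trend analysis company

.Hopes to keep writing code and playing with technology for a lifetime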

Page 3: Cloud Computing Applications - Hazelcast, Spark and Ignite

Agenda

.Briefing of Hazelcast

.More about Hazelcast

.Spark Introduction

.Hazelcast and Spark

.About Apache Ignite

.Things between Ignite and Hazelcast

Page 4: Cloud Computing Applications - Hazelcast, Spark and Ignite

Briefing of Hazelcast

Page 5: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 6: Cloud Computing Applications - Hazelcast, Spark and Ignite

What is Hazelcast?

Hazelcast is an in-memory data grid that distributes data evenly among the nodes of a computing cluster and shares the available processing power and storage space to provide services. It also tolerates failures and the loss of nodes.

Page 7: Cloud Computing Applications - Hazelcast, Spark and Ignite

Features

.Distributed Caching: Queue, Set, List, Map, MultiMap, Lock, Topic, AtomicReference, AtomicLong, IdGenerator, Ringbuffer, Semaphores

.Distributed Compute: Entry Processor, Executor Service, User Defined Services

.Distributed Query: Query, Aggregators, Listener with Predicate, MapReduce

Page 8: Cloud Computing Applications - Hazelcast, Spark and Ignite

Features (Cont.)

.Integrated Clustering: Hibernate 2nd Level Cache, Grails 3, JCA Resource Adapter

.Standards: JCache, Apache jclouds

.Cloud and Virtualization Support: Docker, AWS, Azure, Discovery Service Provider Interface, Kubernetes, ZooKeeper Discovery

.Client-Server Protocols: Memcache, Open Binary Client Protocol, REST

Page 9: Cloud Computing Applications - Hazelcast, Spark and Ignite

Use Cases

.In-Memory Data Grid

.Caching

.In-Memory NoSQL

.Messaging

.Application Scaling

.Clustering

Page 10: Cloud Computing Applications - Hazelcast, Spark and Ignite

In-Memory Data Grid

.Scale-out Computing: shared CPU power

.Resilience: failure & data loss/performance

.Programming Model: easily code clusters

.Fast, Big Data: handle large sets in RAM

.Dynamic Scalability: join/leave a cluster

.Elastic Main Memory: memory pool

Page 11: Cloud Computing Applications - Hazelcast, Spark and Ignite

Caching

.Elastic Memcached: Hazelcast has been used as an alternative to Memcached.

.Hibernate 2nd Level Cache: Hibernate organizes caching into 1st and 2nd level caches, and Hazelcast can serve as the 2nd level cache.

.Spring Cache: Hazelcast supports Spring Cache, which allows it to plug into any Spring application.

Page 12: Cloud Computing Applications - Hazelcast, Spark and Ignite

In-Memory NoSQL

.Scalability: size of RAM vs. disk. By joining nodes in a cluster, we can pool RAM to store maps, and the CPU and RAM resources become available to the network.

.Volatility: volatility of RAM vs. disk. It uses P2P data distribution so that there is no single point of failure. By default, data is stored in two locations in the cluster.

Page 13: Cloud Computing Applications - Hazelcast, Spark and Ignite

In-Memory NoSQL (Cont.)

.Rebalancing: It automatically rebalances data if a node crashes. Shuffling data has a negative effect as it consumes network, CPU and RAM.

.Going Native: The High-Density Memory Store can avoid GC pauses. It uses NIO DirectByteBuffers and does not require any defragmentation.

Page 14: Cloud Computing Applications - Hazelcast, Spark and Ignite

Messaging

Hazelcast provides Topic, a distribution mechanism for publishing messages that are delivered to multiple subscribers. Publishing and subscribing are cluster-wide. Messages are ordered, that is, listeners process the messages in the order in which they were published.

Page 15: Cloud Computing Applications - Hazelcast, Spark and Ignite

Application Scaling

.Elastic Scalability: new servers join a cluster automatically

.Super Speeds: memory transaction speed

.High Availability: can deploy in backup pairs or even WAN replicated

.Fault Tolerance: no single point of failure

.Cloud Readiness: deploy right into EC2

Page 16: Cloud Computing Applications - Hazelcast, Spark and Ignite

Clustering

Hazelcast can easily handle session clustering with in-memory performance, reliability, and linear scalability as you add new nodes. This is a great way to ensure that session information is maintained when you are clustering web servers. You can also use a similar pattern for managing user identities.

Page 17: Cloud Computing Applications - Hazelcast, Spark and Ignite

Dependency

.Maven
    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast</artifactId>
        <version>3.7.2</version>
    </dependency>

.Gradle
    dependencies {
        compile 'com.hazelcast:hazelcast:3.7.2'
    }

Page 18: Cloud Computing Applications - Hazelcast, Spark and Ignite

More about Hazelcast

Page 19: Cloud Computing Applications - Hazelcast, Spark and Ignite

What’s New in Hazelcast 3.4

.High-Density Memory Store

.Hazelcast Configuration Import

.Back Pressure

Page 20: Cloud Computing Applications - Hazelcast, Spark and Ignite

What’s New in Hazelcast 3.5

.Async Back Pressure

.Client Configuration Import

.Cluster Quorum

.Hazelcast Client Protocol

.Listener for Lost Partitions

.Increased Visibility of Slow Operations

.Sub-Listener Interfaces for Map Listener

Page 21: Cloud Computing Applications - Hazelcast, Spark and Ignite

What’s New in Hazelcast 3.6

.High-Density Memory Store for Map

.Discovery SPI

.Client Protocol & Version Compatibility

.Support for cloud providers by jclouds®

.Hot Restart Persistence

.Lite Members

.Lots of Features for Hazelcast JCache

.Hazelcast Docker image

Page 22: Cloud Computing Applications - Hazelcast, Spark and Ignite

What’s New in Hazelcast 3.7

.Custom Eviction Policies

.Discovery SPI for Azure

.Hazelcast CLI with Scripting

.OpenShift and CloudFoundry Plugin

.Apache Spark Connector

.Alignment of WAN Replication Clusters

.Fault Tolerant Executor Service

Page 23: Cloud Computing Applications - Hazelcast, Spark and Ignite

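Sample Code

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import java.util.Map;

    public class GetStartedMain {
        public static void main(final String[] args) {
            // Start an embedded member with the default configuration.
            Config cfg = new Config();
            HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
            // Distributed map shared by all members of the cluster.
            Map<Long, String> map = instance.getMap("test");
            map.put(1L, "Demo");
            System.out.println(map.get(1L));
        }
    }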

Page 24: Cloud Computing Applications - Hazelcast, Spark and Ignite

Sharding – 4 nodes

Page 25: Cloud Computing Applications - Hazelcast, Spark and Ignite

How is Data Partitioned?

Data entries are distributed into partitions by hashing the key (or the name of the data structure):

.the key or name is serialized (converted into a byte array),

.this byte array is hashed, and

.the hash result is taken modulo the number of partitions.

Page 26: Cloud Computing Applications - Hazelcast, Spark and Ignite

Partition ID

The result of this modulo - MOD (hash result, partition count) - is the partition in which the data will be stored, that is the partition ID. For ALL members you have in your cluster, the partition ID for a given key will always be the same.
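
The three steps can be sketched in plain Java. This is only an illustration under stated assumptions: `PartitionIdSketch` and its methods are hypothetical names, Java serialization and `Arrays.hashCode` stand in for Hazelcast's own serialization and hash function, and 271 is used because it is Hazelcast's documented default partition count.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class PartitionIdSketch {
    // Assumed partition count for illustration (Hazelcast's documented default).
    static final int PARTITION_COUNT = 271;

    // Step 1: serialize the key into a byte array (plain Java serialization here).
    static byte[] serialize(Serializable key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Steps 2 and 3: hash the bytes, then take the hash modulo the partition count.
    static int partitionId(Serializable key, int partitionCount) {
        int hash = Arrays.hashCode(serialize(key)); // stand-in hash, not Hazelcast's
        return Math.floorMod(hash, partitionCount); // floorMod keeps the ID non-negative
    }

    public static void main(String[] args) {
        for (String key : new String[] {"alice", "bob", "carol"}) {
            System.out.println(key + " -> partition " + partitionId(key, PARTITION_COUNT));
        }
    }
}
```

Because the computation depends only on the key bytes and the partition count, every member that runs it gets the same partition ID for the same key, which is the property the slide relies on.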

Page 27: Cloud Computing Applications - Hazelcast, Spark and Ignite

Partition Table

When we start a member, a partition table is created within it. This table stores the partition IDs and the cluster members to which they belong. The purpose of this table is to make all members (including lite members) in the cluster aware of this information, ensuring that each member knows where the data is.

Page 28: Cloud Computing Applications - Hazelcast, Spark and Ignite

Partition Table (Cont.)

The oldest member in the cluster (the one that started first) periodically sends the partition table to all members. In this way each member in the cluster is informed about any changes to partition ownership. The ownerships may be changed when a new member joins the cluster, or when a member leaves the cluster.

Page 29: Cloud Computing Applications - Hazelcast, Spark and Ignite

Repartitioning

Repartitioning is the process of redistributing partition ownerships. It happens:

.When a member joins the cluster.

.When a member leaves the cluster.

In these cases, the partition table in the oldest member is updated with the new partition ownerships.
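
A toy model of a partition table may make this concrete. The sketch below is hypothetical: `RepartitionSketch`, `assign` and `moved` are illustrative names, and the round-robin assignment is a simplification chosen for readability, not Hazelcast's actual ownership algorithm (which also has to consider backups and migration cost).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepartitionSketch {
    // Builds a partition table (partition ID -> owning member) by round-robin.
    static Map<Integer, String> assign(List<String> members, int partitionCount) {
        Map<Integer, String> table = new LinkedHashMap<>();
        for (int id = 0; id < partitionCount; id++) {
            table.put(id, members.get(id % members.size()));
        }
        return table;
    }

    // Counts how many partitions have a different owner in the second table.
    static int moved(Map<Integer, String> before, Map<Integer, String> after) {
        int count = 0;
        for (Map.Entry<Integer, String> entry : before.entrySet()) {
            if (!entry.getValue().equals(after.get(entry.getKey()))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int partitions = 12; // small count so the printed tables stay readable
        List<String> members =
                new ArrayList<>(Arrays.asList("member-1", "member-2", "member-3"));
        Map<Integer, String> before = assign(members, partitions);

        members.add("member-4"); // a member joins: ownerships are redistributed
        Map<Integer, String> afterJoin = assign(members, partitions);
        System.out.println("after join:  " + afterJoin);
        System.out.println("partitions moved: " + moved(before, afterJoin));

        members.remove("member-2"); // a member leaves: ownerships change again
        System.out.println("after leave: " + assign(members, partitions));
    }
}
```

The `moved` count hints at why shuffling data is costly: every partition that changes owner has to be copied over the network.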

Page 30: Cloud Computing Applications - Hazelcast, Spark and Ignite

Topology - Embedded

Page 31: Cloud Computing Applications - Hazelcast, Spark and Ignite

Topology - Client/Server

Page 32: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 33: Cloud Computing Applications - Hazelcast, Spark and Ignite

Spark Introduction

Page 34: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 35: Cloud Computing Applications - Hazelcast, Spark and Ignite

What is Spark?

.Spark is a fast and general-purpose cluster computing system. It provides high-level APIs and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools.

.It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Page 36: Cloud Computing Applications - Hazelcast, Spark and Ignite

Advantages

.Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

.Ease of Use: Write applications quickly. Spark offers over 80 high-level operators to build parallel applications.

Page 37: Cloud Computing Applications - Hazelcast, Spark and Ignite

Advantages (Cont.)

.Generality: Combine SQL, streaming and complex analytics libraries seamlessly in the same application.

.Run Everywhere: Supports multiple cluster managers and distributed storage systems.

Page 38: Cloud Computing Applications - Hazelcast, Spark and Ignite

Features

.Resilient distributed dataset (RDD)

.Fault Tolerant

.Map-reduce cluster computing

.Built-in libraries

.Languages: Java, Scala, Python and R

.Interactive shell (Python, Scala, R) and web-based UI

Page 39: Cloud Computing Applications - Hazelcast, Spark and Ignite

RDD

Resilient distributed dataset is a read-only distributed data set of elements partitioned across the nodes of the cluster that can be operated on in parallel. It can stay in memory and fall back to disk gracefully. An RDD in memory (cached) can be reused efficiently across parallel operations. Finally, RDD automatically recovers from node failures.

Page 40: Cloud Computing Applications - Hazelcast, Spark and Ignite

RDD Operations

Two types of things that can be done on an RDD:

.transformations, like map and filter, that result in another RDD

.actions like count that result in an output
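
The difference between the two can be illustrated without a Spark cluster. The sketch below is an analogy only: it uses the standard Java Stream API, where intermediate operations play the role of lazy transformations and a terminal operation plays the role of an action. `LazyPipelineSketch`, `evenSquares` and `mapCalls` are hypothetical names, and none of this is Spark code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {
    // Counts how many elements the map step has actually processed.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // "Transformations": describe the pipeline, but compute nothing yet.
    static Stream<Integer> evenSquares(List<Integer> data) {
        return data.stream()
                .map(x -> {
                    mapCalls.incrementAndGet();
                    return x * x;
                })
                .filter(x -> x % 2 == 0);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = evenSquares(Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println("map calls before the action: " + mapCalls.get());

        // "Action": collecting forces the whole pipeline to run and yields output.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result);
        System.out.println("map calls after the action: " + mapCalls.get());
    }
}
```

In Spark the same split lets the engine see the whole chain of transformations before running anything, which is what the execution graph on the later slides is built from.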

Page 41: Cloud Computing Applications - Hazelcast, Spark and Ignite

RDD Operations (Cont.)

Page 42: Cloud Computing Applications - Hazelcast, Spark and Ignite

RDD Fault Recovery

Page 43: Cloud Computing Applications - Hazelcast, Spark and Ignite

Directed Acyclic Graph

Page 44: Cloud Computing Applications - Hazelcast, Spark and Ignite

Cluster Topology

Page 45: Cloud Computing Applications - Hazelcast, Spark and Ignite

Dependency

.Maven
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>

.Gradle
    dependencies {
        compile 'org.apache.spark:spark-core_2.11:2.0.0'
    }

Page 46: Cloud Computing Applications - Hazelcast, Spark and Ignite

Spark Node with Docker

.Pull image (Docker 2.0)
    docker pull maguowei/spark

.Launch a Spark node
    docker run -it -p 4040:4040 maguowei/spark pyspark
    docker run -it -p 4040:4040 maguowei/spark spark-shell

.Monitoring
    http://localhost:4040/

Page 47: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 48: Cloud Computing Applications - Hazelcast, Spark and Ignite

Spark Cluster with Docker

.Launch master image (driver program)
    docker run -it -h sandbox1 -p 7077:7077 -p 8080:8080 maguowei/spark bash

.Append text to /etc/hosts
    172.17.0.2 sandbox1
    172.17.0.3 sandbox2

.Launch the master node
    /opt/spark-2.0.0-bin-hadoop2.7/sbin/start-master.sh

.Monitoring
    http://localhost:8080/

Page 49: Cloud Computing Applications - Hazelcast, Spark and Ignite

Spark Cluster with Docker (Cont.)

.Launch worker images
    docker run -it -h sandbox2 maguowei/spark bash

.Append text to /etc/hosts
    172.17.0.2 sandbox1
    172.17.0.3 sandbox2

.Launch a worker node
    /opt/spark-2.0.0-bin-hadoop2.7/sbin/start-slave.sh spark://sandbox1:7077

.Run tasks
    docker exec <CONTAINER_ID> run-example <class> <arg>

Page 50: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 51: Cloud Computing Applications - Hazelcast, Spark and Ignite

Use the same version everywhere

Use the same version everywhere

Use the same version everywhere

Page 52: Cloud Computing Applications - Hazelcast, Spark and Ignite

Very important, so say it 3 times

Page 53: Cloud Computing Applications - Hazelcast, Spark and Ignite

Hazelcast and Spark

Page 54: Cloud Computing Applications - Hazelcast, Spark and Ignite

What is this Connector?

A plug-in which allows maps and caches to be used as shared RDD caches by Spark using the Spark RDD API.

Page 55: Cloud Computing Applications - Hazelcast, Spark and Ignite

What is this Connector?

Clients Clients

\ /

Hazelcast (MapReduce) Spark (MapReduce)

\ /

Hazelcast Spark Connector

Page 56: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 57: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 58: Cloud Computing Applications - Hazelcast, Spark and Ignite

Features

.Read/Write support for Hazelcast Maps

.Read/Write support for Hazelcast Caches

Page 59: Cloud Computing Applications - Hazelcast, Spark and Ignite

Requirements

.Hazelcast 3.7.x

.Apache Spark 1.6.1

Page 60: Cloud Computing Applications - Hazelcast, Spark and Ignite

Dependency

.Maven
    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast-spark</artifactId>
        <version>0.1</version>
    </dependency>

.Gradle
    dependencies {
        compile 'com.hazelcast:hazelcast-spark:0.1'
    }

Page 61: Cloud Computing Applications - Hazelcast, Spark and Ignite

Properties

The options for the SparkConf object:

.hazelcast.server.addresses: 127.0.0.1:5701 (comma-separated list)

.hazelcast.server.groupName: dev

.hazelcast.server.groupPass: dev-pass

.hazelcast.spark.valueBatchingEnabled: true

.hazelcast.spark.readBatchSize: 1000

.hazelcast.spark.writeBatchSize: 1000

.hazelcast.spark.clientXmlPath

Page 62: Cloud Computing Applications - Hazelcast, Spark and Ignite

Creating the SparkContext

    SparkConf conf = new SparkConf()
        .set("hazelcast.server.addresses", "127.0.0.1:5701")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.valueBatchingEnabled", "true")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000");

    JavaSparkContext jsc = new JavaSparkContext("spark://127.0.0.1:7077", "appname", conf);
    // Provide Hazelcast functions to the Spark context.
    HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

Page 63: Cloud Computing Applications - Hazelcast, Spark and Ignite

Read Data from Hazelcast

    // read
    HazelcastJavaRDD rddFromMap = hsc.fromHazelcastMap("map-name-to-be-loaded");
    HazelcastJavaRDD rddFromCache = hsc.fromHazelcastCache("cache-name-to-be-loaded");

Page 64: Cloud Computing Applications - Hazelcast, Spark and Ignite

Write Data to Hazelcast

    import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

    // Double-brace initialization: the inner braces are an instance initializer.
    // parallelize is called on the JavaSparkContext (jsc) created earlier.
    JavaPairRDD<Object, Long> rdd = jsc.parallelize(new ArrayList<Object>() {{
        add(1);
        add(2);
        add(3);
    }}).zipWithIndex();

    // write
    javaPairRddFunctions(rdd).saveToHazelcastMap(name);
    javaPairRddFunctions(rdd).saveToHazelcastCache(name);

Page 65: Cloud Computing Applications - Hazelcast, Spark and Ignite

About Apache Ignite

Page 66: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 67: Cloud Computing Applications - Hazelcast, Spark and Ignite

What is Ignite?

Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.

Page 68: Cloud Computing Applications - Hazelcast, Spark and Ignite

Features

.Data Grid

.Compute Grid

.Streaming and CEP

.Data Structures

.Messaging and Events

.Service Grid

Page 69: Cloud Computing Applications - Hazelcast, Spark and Ignite

Data Grid

Page 70: Cloud Computing Applications - Hazelcast, Spark and Ignite

Data Grid

.Distributed Caching: Key-Value Store, Partitioning & Replication, Client-Side Cache

.Cluster Resiliency: Self-Healing Cluster

.Memory Formats: On-heap, Off-heap, Tiered Storage

.Marshalling: Binary Protocol

.Distributed Transactions and Locks: ACID, Deadlock-free, Cross-partition, Locks

Page 71: Cloud Computing Applications - Hazelcast, Spark and Ignite

Data Grid (Cont.)

.Distributed Query: SQL Queries, Joins, Continuous Queries, Indexing, Consistency, Fault-Tolerance

.Persistence: Write-Through, Read-Through, Write-Behind Caching, Automatic Persistence

.Standards: JCache, SQL, JDBC, OSGi

.Integrations: DB, Hibernate L2 Cache, Session Clustering, Spring Caching

Page 72: Cloud Computing Applications - Hazelcast, Spark and Ignite

Computing Grid

Page 73: Cloud Computing Applications - Hazelcast, Spark and Ignite

Computing Grid

.Distributed Closure Execution

.Clustered Executor Service

.MapReduce and ForkJoin

.Load Balancing

.Fault-Tolerance

.Job Scheduling

.Checkpointing

Page 74: Cloud Computing Applications - Hazelcast, Spark and Ignite

Streaming and CEP

Ignite streaming makes it possible to process continuous, never-ending streams of data in a scalable and fault-tolerant fashion. The rates at which data can be injected into Ignite can be very high and easily exceed millions of events per second on a moderately sized cluster.

Page 75: Cloud Computing Applications - Hazelcast, Spark and Ignite

Streaming and CEP

Page 76: Cloud Computing Applications - Hazelcast, Spark and Ignite

Data Structures

.Queue and Set

.Atomic Types

.CountDownLatch

.IdGenerator

.Semaphore

Page 77: Cloud Computing Applications - Hazelcast, Spark and Ignite

Messaging and Events

.Topic Based Messaging

.Point-to-Point Messaging

.Event Notifications

.Automatic Batching

Page 78: Cloud Computing Applications - Hazelcast, Spark and Ignite

Service Grid

Page 79: Cloud Computing Applications - Hazelcast, Spark and Ignite

Dependency

.Maven
    <dependency>
        <groupId>org.apache.ignite</groupId>
        <artifactId>ignite-core</artifactId>
        <version>1.7.0</version>
    </dependency>

.Gradle
    dependencies {
        compile 'org.apache.ignite:ignite-core:1.7.0'
    }

Page 80: Cloud Computing Applications - Hazelcast, Spark and Ignite

Things between Ignite & Hazelcast

Page 81: Cloud Computing Applications - Hazelcast, Spark and Ignite

Benchmark Fight

.GridGain posted: GridGain vs Hazelcast Benchmarks

.It was also posted to Hazelcast Forum

.Hazelcast CEO removed that post

.Hazelcast fought back and claimed that GridGain cheated

.GridGain re-tested and clarified

Page 82: Cloud Computing Applications - Hazelcast, Spark and Ignite
Page 83: Cloud Computing Applications - Hazelcast, Spark and Ignite

Difference

                     Ignite          Hazelcast
Off-heap Memory      Configurable    Enterprise
Off-heap Indexing    Yes             No
Continuous Query     Yes             Enterprise
SSL Encryption       Yes             Enterprise
SQL Query            Full ANSI 99    Limited
Join Query           Yes             No
Data Consistency     Yes             Partial

Page 84: Cloud Computing Applications - Hazelcast, Spark and Ignite

Difference (Cont.)

                     Ignite                                   Hazelcast
Deadlock-free        Yes                                      No
Computing Grid       MapReduce, ForkJoin, LoadBalance, ...    MapReduce
Streaming/CEP        Yes                                      No
Service Grid         Yes                                      No
Language             .NET/C#/C++/Node.js                      .NET/C#/C++
Data Structures      Fewer                                    More
Plug-in              Fewer                                    More

Page 85: Cloud Computing Applications - Hazelcast, Spark and Ignite

It doesn’t matter which you select

Page 86: Cloud Computing Applications - Hazelcast, Spark and Ignite

How you use it does matter

Page 87: Cloud Computing Applications - Hazelcast, Spark and Ignite

References

.Hazelcast: http://hazelcast.org/

.Hazelcast Doc: http://hazelcast.org/documentation/

.Spark: http://spark.apache.org/

.Hazelcast Spark Connector: https://github.com/hazelcast/hazelcast-spark

.Apache Ignite: https://ignite.apache.org/

.Sample Code: https://github.com/CyberJos/jcconf2016-hazelcast-spark

Page 88: Cloud Computing Applications - Hazelcast, Spark and Ignite

Thank You!!