cloud computing applications - hazelcast, spark and ignite

Post on 16-Apr-2017


Cloud Computing Applications Hazelcast, Spark and Ignite

Joseph S. Kuo a.k.a. CyberJos

About Me

.Played with a pile of languages and frameworks while studying mathematics in college

.22 years of programming experience, 17 years with Java

.Has worked as an IT instructor, and at a gaming cloud platform company, a global e-commerce company, a well-known information security company, and a social trend analysis company

.Hopes to keep writing code and playing with technology for life

Agenda

.Briefing of Hazelcast

.More about Hazelcast

.Spark Introduction

.Hazelcast and Spark

.About Apache Ignite

.Things between Ignite and Hazelcast

Briefing of Hazelcast

What is Hazelcast?

Hazelcast is an in-memory data grid that distributes data evenly among the nodes of a computing cluster and pools the available processing power and storage space to provide services. It also tolerates member failures and node loss.

Features

.Distributed Caching: Queue, Set, List, Map, MultiMap, Lock, Topic, AtomicReference, AtomicLong, IdGenerator, Ringbuffer, Semaphores

.Distributed Compute: Entry Processor, Executor Service, User Defined Services

.Distributed Query: Query, Aggregators, Listener with Predicate, MapReduce

Features (Cont.)

.Integrated Clustering: Hibernate 2nd Level Cache, Grails 3, JCS Resource Adapter

.Standards: JCache, Apache jclouds

.Cloud and Virtualization Support: Docker, AWS, Azure, Discovery Service Provider Interface, Kubernetes, ZooKeeper Discovery

.Client-Server Protocols: Memcache, Open Binary Client Protocol, REST

Use Cases

.In-Memory Data Grid

.Caching

.In-Memory NoSQL

.Messaging

.Application Scaling

.Clustering

In-Memory Data Grid

.Scale-out Computing: shared CPU power

.Resilience: failure & data loss/performance

.Programming Model: easily code clusters

.Fast, Big Data: handle large sets in RAM

.Dynamic Scalability: join/leave a cluster

.Elastic Main Memory: memory pool

Caching

.Elastic Memcached: Hazelcast has been used as an alternative to Memcached.

.Hibernate 2nd Level Cache: It organizes caching into 1st and 2nd level caches.

.Spring Cache: It supports Spring Cache which allows it to plug in to any Spring application.

In-Memory NoSQL

.Scalability: size of RAM vs. disk. By joining nodes into a cluster, we can pool RAM to store maps, and each node's CPU and RAM resources become available to the whole cluster.

.Volatility: volatility of RAM vs. disk. It uses P2P data distribution so there is no single point of failure. By default, data is stored in two locations in the cluster.

In-Memory NoSQL (Cont.)

.Rebalancing: It automatically rebalances data if a node crashes. Shuffling data has a cost, as it consumes network, CPU and RAM.

.Going Native: The High-Density Memory Store avoids GC pauses. It uses NIO DirectByteBuffers and does not require any defragmentation.

Messaging

Hazelcast provides Topic, a distribution mechanism for publishing messages that are delivered to multiple subscribers. Publishing and subscribing are cluster-wide. Messages are ordered: listeners process the messages in the order in which they were actually published.
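A minimal sketch of this cluster-wide publish/subscribe pattern using Hazelcast's ITopic (the topic name "news" is made up for illustration; this assumes the Hazelcast 3.7.x API shown elsewhere in these slides):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;

public class TopicDemo {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // Topics are cluster-wide: every member or client subscribed to
        // "news" receives each published message, in publish order.
        ITopic<String> topic = hz.getTopic("news");
        topic.addMessageListener(msg ->
                System.out.println("Received: " + msg.getMessageObject()));

        topic.publish("Hello, subscribers!");
    }
}
```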

Application Scaling

.Elastic Scalability: new servers join a cluster automatically

.Super Speeds: memory transaction speed

.High Availability: can deploy in backup pairs or even WAN replicated

.Fault Tolerance: no single point of failure

.Cloud Readiness: deploy right into EC2

Clustering

Hazelcast easily handles session clustering with in-memory performance, linear scalability as you add nodes, and reliability. This is a great way to ensure that session information is maintained when you cluster web servers. You can also use a similar pattern for managing user identities.

Dependency

.Maven

<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
    <version>3.7.2</version>
</dependency>

.Gradle

dependencies {
    compile 'com.hazelcast:hazelcast:3.7.2'
}

More about Hazelcast

What’s New in Hazelcast 3.4

.High-Density Memory Store

.Hazelcast Configuration Import

.Back Pressure

What’s New in Hazelcast 3.5

.Async Back Pressure

.Client Configuration Import

.Cluster Quorum

.Hazelcast Client Protocol

.Listener for Lost Partitions

.Increased Visibility of Slow Operations

.Sub-Listener Interfaces for Map Listener

What’s New in Hazelcast 3.6

.High-Density Memory Store for Map

.Discovery SPI

.Client Protocol & Version Compatibility

.Support for cloud providers by jclouds®

.Hot Restart Persistence

.Lite Members

.Lots of Features for Hazelcast JCache

.Hazelcast Docker image

What’s New in Hazelcast 3.7

.Custom Eviction Policies

.Discovery SPI for Azure

.Hazelcast CLI with Scripting

.OpenShift and CloudFoundry Plugin

.Apache Spark Connector

.Alignment of WAN Replication Clusters

.Fault Tolerant Executor Service

Sample Code

import java.util.Map;

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GetStartedMain {
    public static void main(final String[] args) {
        Config cfg = new Config();
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
        Map<Long, String> map = instance.getMap("test");
        map.put(1L, "Demo");
        System.out.println(map.get(1L));
    }
}

Sharding – 4 nodes

How Is Data Partitioned?

Data entries are distributed into partitions by hashing the key (or name):

.The key or name is serialized (converted into a byte array),

.this byte array is hashed, and

.the result of the hash is taken modulo the number of partitions.

Partition ID

The result of this modulo operation - MOD(hash result, partition count) - is the partition in which the data will be stored: the partition ID. For all members in your cluster, the partition ID for a given key is always the same.
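As a back-of-the-envelope illustration, the three steps can be sketched in plain Java. This is an assumption-laden stand-in: Hazelcast actually serializes the key with its own serialization and hashes it with MurmurHash3, while this sketch uses UTF-8 bytes and Arrays.hashCode. 271 is Hazelcast's default partition count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionDemo {
    // Illustrative only: not Hazelcast's real serialization or hash function.
    static int partitionId(String key, int partitionCount) {
        byte[] serialized = key.getBytes(StandardCharsets.UTF_8); // step 1: serialize
        int hash = Arrays.hashCode(serialized);                   // step 2: hash
        return Math.abs(hash % partitionCount);                   // step 3: modulo
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition ID on every member.
        System.out.println(partitionId("test", 271));
        System.out.println(partitionId("test", 271)); // identical result
    }
}
```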

Partition Table

When we start a member, a partition table is created within it. This table stores the partition IDs and the cluster members to which they belong. The purpose of this table is to make all members (including lite members) in the cluster aware of this information, ensuring that each member knows where the data is.

Partition Table (Cont.)

The oldest member in the cluster (the one that started first) periodically sends the partition table to all members. In this way each member in the cluster is informed about any changes to partition ownership. The ownerships may be changed when a new member joins the cluster, or when a member leaves the cluster.

Repartitioning

Repartitioning is the process of redistribution of partition ownerships:

.When a member joins the cluster.

.When a member leaves the cluster.

In these cases, the partition table in the oldest member is updated with the new partition ownerships.

Topology - Embedded

Topology - Client/Server

Spark Introduction

What is Spark?

.Spark is a fast and general-purpose cluster computing system. It provides high-level APIs and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools.

.It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Advantages

.Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

.Ease of Use: Write applications quickly. Spark offers over 80 high-level operators for building parallel applications.

Advantages (Cont.)

.Generality: Combine SQL, streaming and complex analytics libraries seamlessly in the same application.

.Run Everywhere: Supports multiple cluster managers and distributed storage systems.

Features

.Resilient distributed dataset (RDD)

.Fault Tolerant

.Map-reduce cluster computing

.Built-in libraries

.Languages: Java, Scala, Python and R

.Interactive shell (Python, Scala, R) and web-based UI

RDD

A resilient distributed dataset is a read-only, distributed collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It can stay in memory and fall back to disk gracefully. An RDD cached in memory can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

RDD Operations

Two types of things that can be done on an RDD:

.Transformations, like map and filter, which result in another RDD

.Actions, like count, which result in an output
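A minimal sketch of the two kinds of operations using Spark's Java API (the app name and the local master URL are illustrative choices, not from the slides):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOpsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-ops").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations (filter, map) are lazy and yield another RDD.
        JavaRDD<Integer> doubledEvens = nums.filter(n -> n % 2 == 0)
                                            .map(n -> n * 2);

        // Actions (count, collect) trigger execution and produce output.
        System.out.println(doubledEvens.count()); // prints 2
        sc.stop();
    }
}
```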

RDD Operations (Cont.)

RDD Fault Recovery

Directed Acyclic Graph

Cluster Topology

Dependency

.Maven

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

.Gradle

dependencies {
    compile 'org.apache.spark:spark-core_2.11:2.0.0'
}

Spark Node with Docker

.Pull image (Docker 2.0)
docker pull maguowei/spark

.Launch a Spark node
docker run -it -p 4040:4040 maguowei/spark pyspark
docker run -it -p 4040:4040 maguowei/spark spark-shell

.Monitoring
http://localhost:4040/

Spark Cluster with Docker

.Launch the master image (driver program)
docker run -it -h sandbox1 -p 7077:7077 -p 8080:8080 maguowei/spark bash

.Append to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2

.Launch the master node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-master.sh

.Monitoring
http://localhost:8080/

Spark Cluster with Docker (Cont.)

.Launch worker images
docker run -it -h sandbox2 maguowei/spark bash

.Append to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2

.Launch a worker node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-slave.sh spark://sandbox1:7077

.Run tasks
docker exec <CONTAINER_ID> run-example <class> <arg>

Same version for all places! Same version for all places! Same version for all places!

(Very important, so say it 3 times.)

Hazelcast and Spark

What is this Connector?

A plug-in which allows maps and caches to be used as shared RDD caches by Spark using the Spark RDD API.

What is this Connector?

Clients talk to both Hazelcast (MapReduce) and Spark (MapReduce); the Hazelcast Spark Connector sits underneath and bridges the two.

Features

.Read/Write support for Hazelcast Maps

.Read/Write support for Hazelcast Caches

Requirements

.Hazelcast 3.7.x

.Apache Spark 1.6.1

Dependency

.Maven<dependency> <groupId>com.hazelcast</groupId> <artifactId>hazelcast-spark</artifactId> <version>0.1</version></dependency>

.Gradledependencies { compile 'com.hazelcast:hazelcast-spark:0.1'}

Properties

The options for the SparkConf object:

.hazelcast.server.addresses: 127.0.0.1:5701 (comma-separated list)

.hazelcast.server.groupName: dev

.hazelcast.server.groupPass: dev-pass

.hazelcast.spark.valueBatchingEnabled: true

.hazelcast.spark.readBatchSize: 1000

.hazelcast.spark.writeBatchSize: 1000

.hazelcast.spark.clientXmlPath

Creating the SparkContext

SparkConf conf = new SparkConf()
    .set("hazelcast.server.addresses", "127.0.0.1:5701")
    .set("hazelcast.server.groupName", "dev")
    .set("hazelcast.server.groupPass", "dev-pass")
    .set("hazelcast.spark.valueBatchingEnabled", "true")
    .set("hazelcast.spark.readBatchSize", "5000")
    .set("hazelcast.spark.writeBatchSize", "5000");

JavaSparkContext jsc = new JavaSparkContext("spark://127.0.0.1:7077", "appname", conf);
// Provide Hazelcast functions to the Spark context.
HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

Read Data from Hazelcast

// read
HazelcastJavaRDD rddFromMap = hsc.fromHazelcastMap("map-name-to-be-loaded");
HazelcastJavaRDD rddFromCache = hsc.fromHazelcastCache("cache-name-to-be-loaded");

Write Data to Hazelcast

import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

JavaPairRDD<Object, Long> rdd = hsc.parallelize(new ArrayList<Object>() {{
    add(1);
    add(2);
    add(3);
}}).zipWithIndex();

// write
javaPairRddFunctions(rdd).saveToHazelcastMap(name);
javaPairRddFunctions(rdd).saveToHazelcastCache(name);

About Apache Ignite

What is Ignite?

Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.

Features

.Data Grid

.Compute Grid

.Streaming and CEP

.Data Structures

.Messaging and Events

.Service Grid

Data Grid

Data Grid

.Distributed Caching: Key-Value Store, Partitioning & Replication, Client-Side Cache

.Cluster Resiliency: Self-Healing Cluster

.Memory Formats: On-heap, Off-heap, Tiered Storage

.Marshalling: Binary Protocol

.Distributed Transactions and Locks: ACID, Deadlock-free, Cross-partition, Locks

Data Grid (Cont.)

.Distributed Query: SQL Queries, Joins, Continuous Queries, Indexing, Consistency, Fault-Tolerance

.Persistence: Write-Through, Read-Through, Write-Behind Caching, Automatic Persistence

.Standards: JCache, SQL, JDBC, OSGi

.Integrations: DB, Hibernate L2 Cache, Session Clustering, Spring Caching

Compute Grid

Compute Grid

.Distributed Closure Execution

.Clustered Executor Service

.MapReduce and ForkJoin

.Load Balancing

.Fault-Tolerance

.Job Scheduling

.Checkpointing

Streaming and CEP

Ignite streaming allows you to process continuous, never-ending streams of data in a scalable and fault-tolerant fashion. Data can be injected into Ignite at very high rates, easily exceeding millions of events per second on a moderately sized cluster.
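A minimal sketch of high-rate ingestion with Ignite's IgniteDataStreamer (the cache name "events" and the entry count are illustrative; this assumes the default no-argument node startup):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamDemo {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<Integer, String> cache = ignite.getOrCreateCache("events");

        // The data streamer batches entries and load-balances puts across
        // the cluster, which is how Ignite sustains high ingestion rates.
        try (IgniteDataStreamer<Integer, String> streamer =
                     ignite.dataStreamer("events")) {
            for (int i = 0; i < 1_000; i++) {
                streamer.addData(i, "event-" + i);
            }
        } // close() flushes any remaining buffered entries

        System.out.println(cache.size());
    }
}
```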

Streaming and CEP

Data Structures

.Queue and Set

.Atomic Types

.CountDownLatch

.IdGenerator

.Semaphore

Messaging and Events

.Topic Based Messaging

.Point-to-Point Messaging

.Event Notifications

.Automatic Batching

Service Grid

Dependency

.Maven

<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <version>1.7.0</version>
</dependency>

.Gradle

dependencies {
    compile 'org.apache.ignite:ignite-core:1.7.0'
}

Things between Ignite & Hazelcast

Benchmark Fight

.GridGain posted: GridGain vs Hazelcast Benchmarks

.It was also posted to Hazelcast Forum

.Hazelcast CEO removed that post

.Hazelcast fought back and claimed that GridGain cheated

.GridGain re-tested and clarified

Difference

Feature              Ignite                                  Hazelcast
Off-heap Memory      Configurable                            Enterprise
Off-heap Indexing    Yes                                     No
Continuous Query     Yes                                     Enterprise
SSL Encryption       Yes                                     Enterprise
SQL Query            Full ANSI 99                            Limited
Join Query           Yes                                     No
Data Consistency     Yes                                     Partial

Difference (Cont.)

Feature              Ignite                                  Hazelcast
Deadlock-free        Yes                                     No
Computing Grid       MapReduce, ForkJoin, LoadBalance, ...   MapReduce
Streaming/CEP        Yes                                     No
Service Grid         Yes                                     No
Languages            .Net/C#/C++/Node.js                     .Net/C#/C++
Data Structures      Less                                    More
Plug-ins             Less                                    More

It doesn’t matter which you select

How you use it does matter

References

.Hazelcast: http://hazelcast.org/

.Hazelcast Doc: http://hazelcast.org/documentation/

.Spark: http://spark.apache.org/

.Hazelcast Spark Connector: https://github.com/hazelcast/hazelcast-spark

.Apache Ignite: https://ignite.apache.org/

.Sample Code: https://github.com/CyberJos/jcconf2016-hazelcast-spark

Thank You!!
