An Overview of VIEW

Scientific Workflows for Big Data Prof. Shiyong Lu Big Data Research Laboratory Department of Computer Science Wayne State University [email protected]

Upload: shiyong-lu

Post on 15-Jan-2015

Page 1: An Overview of VIEW

Scientific Workflows for Big Data

Prof. Shiyong Lu

Big Data Research Laboratory

Department of Computer Science

Wayne State University

[email protected]

Page 2: An Overview of VIEW

Today’s data-intensive science

Jim Gray: Turing Award laureate

Looking for needle in haystack

Looking into haystack

Page 3: An Overview of VIEW

Big Data Challenges

Ian Foster: Father of Grid Computing

Looking for needle in haystack

Looking needle in haystack

For Big Data, data management and movement is a frequent challenge…
• between facilities, archives, researchers…
• many files, large data volumes
• with security, reliability, performance…

Page 4: An Overview of VIEW

Big Data Challenges

Looking for needle in haystack

Capture, Curation, Storage, Search, Sharing, Analysis, Visualization

Page 5: An Overview of VIEW

Big Data Science

15 PB/year

173 TB/day

500 MB/sec

Large Hadron Collider (LHC)

Higgs discovery is “only possible because of the extraordinary achievements of … grid computing” - Rolf Heuer, CERN DG

Page 6: An Overview of VIEW

Data management challenges

[Figure: Data flows at Argonne National Lab, in TB/day (estimates), among external sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, long-term storage, and data analysis. Credit: Ian Foster]

Page 7: An Overview of VIEW

Big Data demands new CS research

For example, existing clustering algorithms are typically cubic in N, and when N is too big, they do not work! - Jim Gray

Page 8: An Overview of VIEW

What is Big Data?

•Definition of Big Data:

“…refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”

from nsf.gov website

Page 9: An Overview of VIEW

Big Data Challenges

•Challenges of Big Data:

“national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.”

from nsf.gov website

Page 10: An Overview of VIEW

Big Data demands big workflows

Reminiscent of

Page 11: An Overview of VIEW

And thousands of parallel executions

Managing big workflows and large-scale parallel execution is a big CS challenge!

Page 12: An Overview of VIEW

Outline

1. Introduction

2. VIEW: A Prototypical SWFMS

3. A Scientific Workflow Composition Model

4. A Collectional Data Model

5. Conclusions and Future Work

Page 13: An Overview of VIEW

Introduction

Data Intensive Science
• From computation intensive to data intensive.
• A new research cycle: from data capture and data curation to data analysis and data visualization.
• “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” (“Beyond the Data Deluge”, Science, Vol. 323, no. 5919, pp. 1297–1298, 2009.)

Page 14: An Overview of VIEW

Introduction

Scientific Workflow

A formal specification of a scientific process.

Represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.

Applications: Bioinformatics, Oceanography, Neuroinformatics, Astronomy, etc.

Page 15: An Overview of VIEW

Introduction

Scientific Workflow Management System (SWFMS)
• Supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow.

Existing SWFMSs:
• Taverna
• Kepler
• Pegasus
• VisTrails
• VIEW
• …

Page 16: An Overview of VIEW

Our VIEW System

Page 22: An Overview of VIEW

Our VIEW System

Enables scientists to design workflows

Provides a runtime system to execute workflows
• on a dedicated VIEW server
• in a Cloud computing environment

Supports efficient collection, storage, querying, and visualization of workflow provenance

Is currently used in several bioinformatics applications, including genomic recombination and gene conversion data analysis

Page 23: An Overview of VIEW

An Example Workflow in VIEW

Example workflows in

Page 24: An Overview of VIEW

An Example Workflow in VIEW

Page 25: An Overview of VIEW

VIEW 1-2-3

Step 1: Drag and drop inputs and outputs, and computational modules

Page 26: An Overview of VIEW

VIEW 1-2-3

Step 2: Link them into a scientific workflow

Page 27: An Overview of VIEW

VIEW 1-2-3

Step 3: Click the Run button to get the result!

Page 29: An Overview of VIEW

An Example Workflow in VIEW

FiberFlow: transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical intelligence in neural diseases.

Page 30: An Overview of VIEW

VIEW: A Prototypical SWFMS

Minimal complexity for users, with sophisticated techniques behind the scenes.
• Provides a clear and simple abstraction for manipulating and coordinating resources.

Service-oriented architecture.

Intuitive, user-friendly GUI.

Page 31: An Overview of VIEW

A Reference Architecture for SWFMSs

Service-oriented architecture of VIEW

Page 32: An Overview of VIEW

A Reference Architecture for SWFMSs

Other advantages of :

Page 33: An Overview of VIEW

A Reference Architecture for SWFMSs

Other advantages of :

VIEW workflows can be executed in other systems (specifications are not tied to a particular SWFMS)

Use of open standards (Web Services, XML) promotes collaboration, interoperability and extensibility of the system

Workflow and data models implemented in VIEW are specifically geared towards heavy scientific data

Page 34: An Overview of VIEW

A Reference Architecture for SWFMSs

Page 35: An Overview of VIEW

VIEW: A Prototypical SWFMS

A typical scientific workflow execution diagram.

Page 36: An Overview of VIEW

Workflow Engine

The Workflow Engine is the heart of the system.
• Workflow orchestration.
• Workflow execution.
• Coordination of other subsystems.

The Workflow Engine in VIEW.
• Dataflow based.
• Pure workflow composition.
• Workflow constructs.

Page 37: An Overview of VIEW

SWL

Example of our proposed scientific workflow specification language (SWL).

Page 38: An Overview of VIEW

Primitive Workflow Specification

Example SWL specification of a primitive workflow.

Page 39: An Overview of VIEW

Workflow Execution

Workflow Execution
• Primitive workflow
• Unary construct based workflow
• Graph based workflow
  • A workflow graph is a composition of workflows by binary constructs.
  • Optimistic scheduling.

Page 40: An Overview of VIEW

Workflow Database Schema

Page 41: An Overview of VIEW

Data Product Manager

Data Product Manager
• Solid data model.
• Scalable data storage.
• Convenient data access.
• Data independence.

Data Product Manager is based on the collectional data model.

Page 42: An Overview of VIEW

DPM Architecture

Architecture of the Data Product Manager.

[Figure: the Data Product Manager consists of a Data Access Layer, a Data Mapping Layer, and a Data Storage Layer. The storage layer shows a Main Server with a Master Node Database, plus Node Databases backed by Relational Databases and File Repositories holding Data Set 1 and Data Set 2.]

Page 43: An Overview of VIEW

DPL

Example of the XML description of a collectional data product.

Page 44: An Overview of VIEW

Data Storage

VIEW supports two ways of storage:
• A collection can be stored in a table containing a set of its key/value pairs, whose values are references to existing collections.
• A collection can be expanded and stored in two tables.
  • The Group By operator.
  • The Compress operator.

Page 45: An Overview of VIEW

Data Typing

A Data Product is a Collection, a List, or Empty.

The List type
• Introduced in the workflow engine.
• Each element is a data product.
• Heterogeneous.

Page 46: An Overview of VIEW

Collectional Data Querying

Operators are implemented in primitive workflows.
• Arithmetic operators.
• Boolean operators.
• Collectional operators.
• List operators.

Queries are implemented in workflow compositions.

Page 47: An Overview of VIEW

Example

Given a table Reference < Student, Company, GradTime >, find the total number of students who were offered jobs by each company in each graduation year; sort the result by descending GradTime and ascending Company.

SQL query:

SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob

FROM Reference

GROUP BY Company, GradTime

ORDER BY GradTime DESC, Company ASC;
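To make the query concrete, here is a minimal sketch that runs it with Python's built-in sqlite3 module. The sample rows, the column types, and the in-memory database are assumptions for illustration only; they are not part of VIEW or of the original slides.

```python
import sqlite3

# In-memory database with made-up sample rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Reference (Student TEXT, Company TEXT, GradTime INTEGER)"
)
rows = [
    ("Ann", "Acme", 2014),
    ("Bob", "Acme", 2014),
    ("Ann", "Acme", 2014),  # duplicate offer; COUNT(DISTINCT ...) ignores it
    ("Cid", "Beta", 2014),
    ("Dee", "Acme", 2013),
]
conn.executemany("INSERT INTO Reference VALUES (?, ?, ?)", rows)

QUERY = """
SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC
"""
result = conn.execute(QUERY).fetchall()
for company, grad_time, number_of_job in result:
    print(company, grad_time, number_of_job)
```

With these sample rows, the most recent graduation year is listed first and companies are ordered alphabetically within each year.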

Page 48: An Overview of VIEW

Example of Query Workflow

Query Workflow.

Page 49: An Overview of VIEW

Key Requirements for Workflow Modeling

R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.

Page 50: An Overview of VIEW

A Scientific Workflow Model

Workflows are the basic and the only operands for workflow composition.

Task components (e.g. Web services) are constructed to primitive workflows (a.k.a. tasks) which are the basic building blocks of scientific workflows.

[Figure: a workflow W3 composed from workflows W1 and W2 using the Map construct (M), with input ports i1 … ik and output port o1.]

Page 51: An Overview of VIEW

A Scientific Workflow Model

A workflow construct is a mapping from a set of workflows to a workflow.
• Unary workflow constructs
• Binary workflow constructs
• …

A construct C takes a set of workflows W1, ...., Wn as input, and composes them into Wc as the output workflow.

Page 52: An Overview of VIEW

A Scientific Workflow Model

Our proposed scientific workflow model consists of the following two layers:
• The logical layer contains the workflow interface that models the input ports and output ports of a workflow.
• The physical layer contains the workflow body that models the physical implementation of the workflow.
  • Primitive workflows.
  • Graph-based workflows.
  • Unary-construct-based workflows.

Page 53: An Overview of VIEW

Unary Workflow Constructs

Dataflow-based Unary Workflow Constructs

Page 54: An Overview of VIEW

The Map Construct

The Map construct enables the parallel processing of a collection of data products based on a workflow that can only process a single data product.

Example:

[Figure: Map construct (M) applied to workflow W1, giving workflow W2. For the input [[1, 2], [3, 6], [4, 7]], W1 runs once per element ([1, 2], [3, 6], [4, 7]); the values 2, 18, and 28 appear as the per-element outputs.]
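The Map construct can be sketched as a higher-order function that lifts a single-product workflow to one over a list of products. This is an illustrative model in Python, not VIEW's implementation: workflows are modeled as plain callables, the thread pool stands in for parallel execution, and `multiply_pair` is an assumed W1 chosen because it reproduces the values 2, 18, 28 from the example.

```python
from concurrent.futures import ThreadPoolExecutor

def map_construct(workflow):
    """Return a workflow that applies `workflow` to every element of a list."""
    def mapped(products):
        # Each element is processed by an independent run of the workflow;
        # pool.map preserves the input order of the results.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, products))
    return mapped

def multiply_pair(pair):
    """Assumed W1: consumes one data product (a pair) and emits one value."""
    a, b = pair
    return a * b

w2 = map_construct(multiply_pair)
print(w2([[1, 2], [3, 6], [4, 7]]))
```

Because the result of `map_construct` is itself a workflow, it can be passed to another construct, which is the property the composition slides rely on.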

Page 55: An Overview of VIEW

The Reduce Construct

The Reduce construct enables the aggregation of a list of data products to a single data product based on a workflow that aggregates a limited (two or more) number of input data products.

Example:

[Figure: Reduce construct (R) applied to Add (W3) with initial value 0 on the list [3, 5, 9]. The chained Add runs produce 3, then 8, then 17.]
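A minimal sketch of the Reduce construct as a left fold, assuming workflows are modeled as Python callables and that the aggregating workflow takes exactly two inputs (the slide allows two or more). The names `reduce_construct` and `add` are illustrative, not VIEW identifiers.

```python
from itertools import accumulate

def add(a, b):
    """Two-input aggregation workflow (Add in the example)."""
    return a + b

def reduce_construct(workflow, initial):
    """Return a workflow that folds a list into a single data product."""
    def reduced(products):
        result = initial
        for product in products:
            # The previous output is chained into the next run.
            result = workflow(result, product)
        return result
    return reduced

w3 = reduce_construct(add, 0)
print(w3([3, 5, 9]))

# Intermediate outputs of the chained runs, starting from the initial value.
steps = list(accumulate([3, 5, 9], add, initial=0))
print(steps)
```

The intermediate list shows the same sequence as the example: starting from 0, the runs yield 3, 8, and 17.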

Page 56: An Overview of VIEW

The Tree Construct

The Tree construct
• Enables parallel aggregation of a collection of data products.
• Aggregates a collection pairwise as a binary tree until a single aggregated product is generated.

The Tree construct can be applied to associative workflows.

Example:

[Figure: Tree construct (T) applied to Add (W4) on the list [0, 3, 5, 9]. Pairwise Add runs produce 3 and 14, which are then added to give 17.]
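The pairwise aggregation can be sketched as follows. This is a sequential illustration in Python of the tree shape only; the runs within each level are independent and could execute in parallel. Carrying an unpaired last element up to the next level is an assumption, since the slides do not say how odd-sized levels are handled.

```python
def add(a, b):
    """Associative two-input workflow (Add in the example)."""
    return a + b

def tree_construct(workflow):
    """Return a workflow that aggregates a list pairwise, level by level."""
    def aggregated(products):
        level = list(products)
        if not level:
            raise ValueError("Tree construct needs at least one data product")
        while len(level) > 1:
            # Combine neighbours (0,1), (2,3), ...; these runs are independent.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                next_level.append(level[-1])  # assumed: odd element carried up
            level = next_level
        return level[0]
    return aggregated

w4 = tree_construct(add)
print(w4([0, 3, 5, 9]))
```

For [0, 3, 5, 9] the first level yields 3 and 14 and the second level yields 17. Associativity is what guarantees that this grouping gives the same answer as a left-to-right fold.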

Page 57: An Overview of VIEW

The Conditional Construct

The Conditional construct enables the conditional execution of a workflow based on a condition on one of the inputs.

Example:

[Figure: Conditional construct (C) applied to a Projection workflow (W4) with input ports i1 and i2; sample inputs include the list [2, 3]. Two runs are shown: one with predicate p = (PI1 < PI2) where p = true, and one with predicate p = (PI1 >= PI2) where p = false. A Fail outcome also appears in the figure.]
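A minimal sketch of the Conditional construct, assuming workflows are Python callables and that a run whose predicate is false yields a failure marker instead of executing. The `FAIL` sentinel, the use of Add as the guarded workflow, and the sample inputs are illustrative assumptions; only the predicate form PI1 < PI2 is taken from the slide.

```python
FAIL = object()  # assumed sentinel for "workflow was not executed"

def add(i1, i2):
    """Guarded workflow used for illustration."""
    return i1 + i2

def conditional_construct(workflow, predicate):
    """Return a workflow that runs `workflow` only if `predicate` holds."""
    def guarded(*inputs):
        if predicate(*inputs):
            return workflow(*inputs)
        return FAIL
    return guarded

# Predicate on the inputs, written like the slide's p = (PI1 < PI2).
guarded_add = conditional_construct(add, lambda pi1, pi2: pi1 < pi2)

print(guarded_add(2, 3))          # predicate true: Add runs
print(guarded_add(3, 2) is FAIL)  # predicate false: Add is skipped
```

The guarded result is again a workflow with the same input ports, so it can be composed further like any other workflow.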

Page 58: An Overview of VIEW

The Loop Construct

The Loop construct enables cyclic executions of a workflow.

The output of the workflow is repeatedly fed back to a specified input port until the predicate evaluates to true.

Example:

[Figure: Loop construct (L) applied to Add with input i2 fixed to 1 and predicate p = (PI1 > 100). Starting from 0, each output (1, 2, …) is fed back to i1 while p = false; when p = true the result is 101.]
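The feedback behaviour can be sketched as below. Two points are assumptions rather than documented VIEW semantics: the predicate is tested on each output before it is fed back, and a `max_iterations` guard is added so the sketch cannot spin forever.

```python
def add(i1, i2):
    """Workflow being iterated (Add in the example)."""
    return i1 + i2

def loop_construct(workflow, feedback_port, predicate, max_iterations=10_000):
    """Return a workflow that re-runs `workflow`, feeding its output back."""
    def looped(*inputs):
        args = list(inputs)
        for _ in range(max_iterations):
            output = workflow(*args)
            if predicate(output):
                return output
            args[feedback_port] = output  # feed the output back to the port
        raise RuntimeError("loop predicate never became true")
    return looped

# Feed the output back into port 0 (i1) until it exceeds 100.
count_up = loop_construct(add, feedback_port=0, predicate=lambda v: v > 100)
print(count_up(0, 1))
```

Starting from 0 with the second input fixed at 1, the outputs are 1, 2, 3, and so on, and the loop stops at the first output greater than 100.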

Page 59: An Overview of VIEW

The Curry Construct

The Curry construct allows users to fix one of the input ports with a specified argument and thus reduce the number of input ports.

By applying multiple Curry constructs, a workflow that takes multiple arguments can be translated into a chain of workflows each with a single argument.

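Example:

[Figure: Curry construct (U) applied to Add (W8) with input port i2 fixed to 4. The resulting workflow has a single input port i1; for input 1 it outputs 5.]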
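A minimal sketch of the Curry construct, assuming workflows are Python callables with positional input ports numbered from 0. The helper name `curry_construct` is illustrative and not a VIEW identifier.

```python
def add(i1, i2):
    """Two-input workflow (Add in the example)."""
    return i1 + i2

def curry_construct(workflow, port, value):
    """Return a workflow with input `port` fixed to `value`."""
    def curried(*inputs):
        args = list(inputs)
        args.insert(port, value)  # put the fixed argument back in its slot
        return workflow(*args)
    return curried

# Fix the second port (i2) to 4; the result has a single input port.
add_four = curry_construct(add, port=1, value=4)
print(add_four(1))

# Applying Curry again fixes the remaining port, leaving no inputs.
five = curry_construct(add_four, port=0, value=1)
print(five())
```

Repeated application removes one input port at a time, which is how a multi-argument workflow can be turned into a chain of single-argument workflows.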

Page 60: An Overview of VIEW

Workflow Composition

Example of the composition of Map and Map constructs.
• A workflow that increases all the numbers in a nested list by 1.

[Figure: workflow W9 is built by applying the Map construct (M) twice to Add with input i2 fixed to 1. For the input [[1, 2, 3], [4, 5, 6]], six Add runs (1+1, 2+1, 3+1, 4+1, 5+1, 6+1) produce 2, 3, 4, 5, 6, 7.]
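The nested-list example can be sketched by composing small construct functions. This is an illustrative Python model with sequential execution; the helper names are assumptions, and fixing i2 to 1 through a Curry-style helper is one way to read the figure.

```python
def add(i1, i2):
    return i1 + i2

def curry_construct(workflow, port, value):
    """Fix one input port of `workflow` to `value`."""
    def curried(*inputs):
        args = list(inputs)
        args.insert(port, value)
        return workflow(*args)
    return curried

def map_construct(workflow):
    """Lift `workflow` to operate on every element of a list."""
    def mapped(products):
        return [workflow(p) for p in products]
    return mapped

increment = curry_construct(add, port=1, value=1)  # Add with i2 = 1
w9 = map_construct(map_construct(increment))       # Map applied twice

print(w9([[1, 2, 3], [4, 5, 6]]))
```

Each application of Map peels off one level of nesting, so two applications reach the numbers inside a list of lists.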

Page 61: An Overview of VIEW

Workflow Composition

Example of the composition of Map and Reduce constructs.
• A workflow for parallel summation of each row in a matrix.

[Figure: workflow W11 is built by applying the Reduce construct (R) to Addition with initial value 0, then the Map construct (M) to the result. For the input [[1, 2, 3], [4, 5, 6]], the row [1, 2, 3] reduces to 6 and the row [4, 5, 6] reduces to 15.]
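A minimal sketch of the row-summation workflow, assuming workflows are Python callables. The thread pool is only a stand-in for parallel execution of the per-row reductions, and the helper names are illustrative rather than VIEW identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

def addition(i1, i2):
    return i1 + i2

def reduce_construct(workflow, initial):
    """Fold a list into one data product with a two-input workflow."""
    def reduced(products):
        result = initial
        for product in products:
            result = workflow(result, product)
        return result
    return reduced

def map_construct(workflow):
    """Apply `workflow` to each element; runs are independent of each other."""
    def mapped(products):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, products))
    return mapped

sum_row = reduce_construct(addition, 0)  # Reduce applied to Addition
w11 = map_construct(sum_row)             # Map applied to the reduced workflow

print(w11([[1, 2, 3], [4, 5, 6]]))
```

Each row is reduced independently, so the rows can be summed in parallel while each individual sum is still a chain of Addition runs.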

Page 62: An Overview of VIEW

Workflow Composition

Example of complicated workflow composition.

A workflow to calculate the greatest common divisor.

[Figure: workflow graph for the greatest common divisor, composed from Modulus, Split, Merge, and Projection tasks (workflows W13 to W17) using the Loop (L), Map (M), Curry (U), and G2W constructs, with loop predicate p = (PI(2) == 0).]
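The idea behind the composition can be sketched with Euclid's algorithm expressed as a loop over a pair, stopping when the second element is 0, which matches the predicate p = (PI(2) == 0). The decomposition below is a simplified assumption: it collapses the Split, Modulus, and Merge tasks into one step function and does not reproduce the exact graph. It also assumes the second input is greater than 0.

```python
import math

def modulus_step(pair):
    """Split the pair, take the modulus, and merge into the next pair."""
    a, b = pair
    return [b, a % b]

def loop_construct(workflow, predicate, max_iterations=1_000):
    """Re-run `workflow` on its own output until `predicate` holds."""
    def looped(product):
        for _ in range(max_iterations):
            product = workflow(product)
            if predicate(product):
                return product
        raise RuntimeError("loop predicate never became true")
    return looped

def projection(pair):
    """Project the first element of the final pair."""
    return pair[0]

def gcd_workflow(a, b):
    looped = loop_construct(modulus_step, lambda pair: pair[1] == 0)
    return projection(looped([a, b]))

print(gcd_workflow(48, 18))
print(gcd_workflow(48, 18) == math.gcd(48, 18))
```

For 48 and 18 the pair evolves as [18, 12], [12, 6], [6, 0], and projecting the first element of the final pair gives the divisor.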

Page 63: An Overview of VIEW

A Collectional Data Model

A collectional data model should:

Support collection-oriented datasets.
• Scientists often work with collection-oriented datasets, such as arrays, lists, tables, or file collections.
• A collection-oriented data model enables data parallelism in scientific workflows.

Support nested data structures.
• Scientific data is often hierarchically organized.
• Scientific workflow tasks often produce collections of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections.

Provide well-defined operators and their arbitrary compositions to manipulate and query scientific data collections.

Page 64: An Overview of VIEW

A Collectional Data Model

A relation is a pair < R, r > where R is a schema of the relation and r is an instance of that schema.

A relation schema can be defined as an unordered tuple < c1 : d1, c2 : d2, …, cn : dn > where c1, c2, …, cn are column names and d1, d2, …, dn are domain names.

A relation instance is a table with rows (called tuples) and named columns (called attributes).

Page 65: An Overview of VIEW

A Collectional Data Model

A collection schema is a pair < K, V >.
• K, the key, is a pair k : d where k is the key name and d is the domain name.
• V, the value, is either a relation schema or a collection schema.

A collection instance is a set of key-value pairs (pi, qi) (i ∈ {1,…,m}).
• Each pi is a scalar value.
• Each qi is either a relation instance or a collection instance.

Page 66: An Overview of VIEW

A Collectional Data Model

An example: Parameters< Model : String, Experiments : Integer, <Concentration : Double, Degree : Integer >>.

Page 67: An Overview of VIEW

The Collectional Operators

We extend the relational operators to the collectional operators, of which collections are the only operands.
• Six primitive operators: union, set difference, selection, projection, Cartesian product, and renaming.

The set of the collections is closed under those operators.

A relation can be defined as a collection whose height and cardinality are equal to 1. The collectional operators will then reduce to the relational operators.

Page 68: An Overview of VIEW

The Collectional Operators

The union and the set difference operators can only be applied on union-compatible collections.

[Figure: two union-compatible collections keyed by Model. The first maps m1 to Result 26 and m2 to Result 32; the second maps m2 to Result 32 and m3 to Result 31.]

Page 69: An Overview of VIEW

The Collectional Operators

Example of the union operator and the set difference operator.

[Figure: the union of the two collections, keyed by Model, maps m1 to Result 26, m2 to Result 32, and m3 to Result 31; the set difference maps m1 to Result 26.]
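The two operators can be sketched on a toy representation, assuming a collection instance is a Python dict from key to a relation instance, and a relation instance is a set of tuples. The recursive treatment of values under a shared key and the dropping of keys whose relation becomes empty are assumptions; the slides only show the example result.

```python
def union(c1, c2):
    """Union of two union-compatible collections (dict: key -> set of tuples)."""
    keys = set(c1) | set(c2)
    return {k: c1.get(k, set()) | c2.get(k, set()) for k in keys}

def difference(c1, c2):
    """Set difference; keys whose relation becomes empty are dropped (assumed)."""
    result = {}
    for key, relation in c1.items():
        remaining = relation - c2.get(key, set())
        if remaining:
            result[key] = remaining
    return result

# Collections keyed by Model, each value a one-column Result relation.
first = {"m1": {(26,)}, "m2": {(32,)}}
second = {"m2": {(32,)}, "m3": {(31,)}}

print(union(first, second))
print(difference(first, second))
```

On the example data the union has keys m1, m2, and m3, while the difference keeps only m1, matching the figure.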

Page 70: An Overview of VIEW

The Collectional Operators

Example of the Cartesian product Operator and the Renaming Operator.

[Figure: the Cartesian product of collection M1 (m1, m2) and collection M2 (m2, m3, renamed), keyed by M1.Model and M2.Model. The resulting (M1.Result, M2.Result) pairs are (26, 32), (26, 31), (32, 32), and (32, 31).]

Page 71: An Overview of VIEW

The Collectional Operators

Example of the selection operator.

[Figure: result of a selection, showing Model m2, Experiment 1, with the row Concentration 7.1, Degree 15, …]

Page 72: An Overview of VIEW

The Collectional Operators

Example of the projection operator.

[Figure: result of a projection, keyed by Experiment (1, 2, …), with columns Concentration, Degree, … and rows (7.0, 15, …), (7.1, 15, …) and (7.0, 30, …), (7.1, 30, …).]

Page 73: An Overview of VIEW

Key Features of VIEW

F1: VIEW features the first uniform workflow model, in which workflows are the only building blocks. In VIEW, tasks are primitive workflows, and no workflow construct discriminates between workflows and tasks. Such a model greatly simplifies workflow design: a workflow designer only needs to compose complex workflows from simpler ones, without first encapsulating workflows as tasks or vice versa during composition.

Page 74: An Overview of VIEW

F2: VIEW offers powerful workflow composition: workflow constructs can be composed with one another to arbitrary levels. This often yields VIEW workflows that are more concise and more efficient to execute, and that can be hard to model in other workflow systems.

Page 75: An Overview of VIEW

F3: VIEW features a pure dataflow-based workflow language SWL, including the dataflow counterparts of controlflow-style constructs, such as conditional and loop. Existing workflow languages often require both controlflow and dataflow constructs, resulting in complex or even obscure semantics and non-trivial workflow design.

Page 76: An Overview of VIEW

F4: VIEW supports the cloud MapReduce programming model not only at the job level, but also at the workflow level. Therefore, one can apply the Map and Reduce constructs to an arbitrary workflow an arbitrary number of times. As a result, VIEW can process nested lists of data products in parallel using multiple runs of a workflow.

Page 77: An Overview of VIEW

F5: VIEW features a collectional data model that supports not only traditional primitive data types, such as integer, float, double, boolean, char, string, but also files, relations, hierarchical collections (hierarchical key-value pairs) to support parallel processing of data collections.

Page 78: An Overview of VIEW

F6: VIEW supports a high-level graph-based provenance query language OPQL. In most cases, users can formulate lineage queries easily without the need of writing recursive queries or knowing the underlying database schema.

Page 79: An Overview of VIEW

F7: VIEW features the first service-oriented architecture that conforms to the reference architecture for scientific workflow management systems (SWFMSs). This architecture greatly facilitates interoperability and subsystem reusability in the community. It also provides a generic infrastructure upon which a domain-specific scientific workflow application system (SWFAS) can be easily developed with custom interfaces for various platforms and devices.

Page 80: An Overview of VIEW

Conclusions and Future Works

A scientific workflow composition model.

A collectional data model.

A prototypical SWFMS.

Future work:

Formalization of the scientific workflow algebra and collectional algebra.
• Completeness.
• Integration.

Collaborative scientific workflow composition.
• Concurrent design and composition.
• Concurrent execution.

Page 81: An Overview of VIEW

VIEW application

Fiber tract analysis for Epilepsy.

Page 82: An Overview of VIEW

VIEW application

Computational detection of MARS in genome.

Page 83: An Overview of VIEW

VIEW application

DNA analysis for the bacterium E. coli

Page 84: An Overview of VIEW

VIEW application

Simulation of Nereis succinea mate search behavior.

Page 85: An Overview of VIEW

Big Data is a Pyramid

Can you contribute a piece too?

Page 86: An Overview of VIEW

Big Data Research Laboratory

Wayne State University

viewsystem.org