an overview of view

Scientific Workflows for Big Data

Prof. Shiyong Lu

Big Data Research Laboratory

Department of Computer Science

Wayne State University

shiyong@wayne.edu

Today’s data-intensive science

Jim Gray: Turing Award laureate

Looking for a needle in a haystack

Looking into the haystack

Big Data Challenges

Ian Foster: Father of Grid Computing

Looking for a needle in a haystack

For Big Data, data management and movement is a frequent challenge…
• between facilities, archives, researchers…
• many files, large data volumes
• with security, reliability, performance…

Big Data Challenges

Looking for a needle in a haystack

Capture, Curation, Storage, Search, Sharing, Analysis, Visualization

Big Data Science

15 PB/year

173 TB/day

500 MB/sec

Large Hadron Collider (LHC)

Higgs discovery is “only possible because of the extraordinary achievements of … grid computing” - Rolf Heuer, CERN DG

Data management challenges

[Figure: Argonne data flows in TB/day (estimates). Labels in the figure: External sources, Advanced Photon Source, Argonne Leadership Computing Facility, Data analysis, Short-term storage, Long-term storage. Legible flow values: 150 and 100.]

Data flows at Argonne National Lab

Credit: Ian Foster

Big Data demands new CS research

For example, existing clustering algorithms are typically cubic in N, and when N is too big, they do not work! - Jim Gray

What is Big Data?

•Definition of Big Data:

“…refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”

from nsf.gov website

Big Data Challenges

•Challenges of Big Data:

“national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.”

from nsf.gov website

Big Data demands big workflows

Reminiscent of

And thousands of parallel executions

Managing big workflows and large-scale parallel execution is a big CS challenge!

Outline

1. Introduction

2. VIEW: A Prototypical SWFMS

3. A Scientific Workflow Composition Model

4. A Collectional Data Model

5. Conclusions and Future Work

Introduction

Data Intensive Science
• From computation intensive to data intensive.
• A new research cycle: from data capture and data curation to data analysis and data visualization.
• “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” (“Beyond the Data Deluge”, Science, Vol. 323, no. 5919, pp. 1297-1298, 2009.)

Introduction

Scientific Workflow

A formal specification of a scientific process.

Represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.

Applications: Bioinformatics, Oceanography, Neuroinformatics, Astronomy, etc.

Introduction

Scientific Workflow Management System (SWFMS)
• Supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow.
• Existing SWFMSs: Taverna, Kepler, Pegasus, VisTrails, VIEW, …

Our VIEW System

• Enables scientists to design workflows.
• Provides a runtime system to execute workflows
   • on a dedicated VIEW server
   • in a Cloud computing environment
• Supports efficient collection, storage, querying, and visualization of workflow provenance.
• Is currently used in several bioinformatics applications, including genomic recombination and gene conversion data analysis.

An Example Workflow in VIEW

Example workflows in

VIEW 1-2-3

Step 1: Drag and drop inputs, outputs, and computational modules.

Step 2: Link them into a scientific workflow.

Step 3: Click the run button and you get the result!

Kids Play VIEW

FiberFlow: Transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical intelligence in neural diseases.

VIEW: A Prototypical SWFMS

• Minimum complexity for users, with the heavy machinery kept behind the scenes.
• Provides a clear and simple abstraction for manipulating and coordinating resources.
• Service-oriented architecture.
• Intuitive, user-friendly GUI.

A Reference Architecture for SWFMSs

Service-oriented architecture of VIEW

Other advantages of VIEW:

VIEW workflows can be executed in other systems (specifications are not tied to a particular SWFMS)

Use of open standards (Web Services, XML) promotes collaboration, interoperability and extensibility of the system

Workflow and data models implemented in VIEW are specifically geared towards heavy scientific data

VIEW: A Prototypical SWFMS

A typical scientific workflow execution diagram.

Workflow Engine

The Workflow Engine is the heart of the system.
• Workflow orchestration.
• Workflow execution.
• Coordination of other subsystems.

The Workflow Engine in VIEW:
• Dataflow based.
• Pure workflow composition.
• Workflow constructs.

Example of our proposed scientific workflow specification language (SWL).

Primitive Workflow Specification

Example SWL specification of a primitive workflow.

Workflow Execution

Workflow execution covers three kinds of workflow:
• Primitive workflow.
• Unary construct based workflow.
• Graph based workflow.
   • A workflow graph is a composition of workflows by binary constructs.
   • Optimistic scheduling.

Workflow Database Schema

Data Product Manager

The Data Product Manager provides:
• Solid data model.
• Scalable data storage.
• Convenient data access.
• Data independence.

The Data Product Manager is based on the collectional data model.

DPM Architecture

Architecture of the Data Product Manager.

[Figure: Data Product Manager architecture. Labels in the figure: Data Access Layer, Data Mapping Layer, Data Storage Layer; Main Server with a Master Node Database; Data Set 1 and Data Set 2, each with a Node Database, Relational Databases, and File Repositories.]

Example of the XML description of a collectional data product.

Data Storage

VIEW supports two ways of storage:
• A collection can be stored in a table containing a set of its key/value pairs, whose values are references to existing collections.
• A collection can be expanded and stored in two tables.
   • The Group By operator.
   • The Compress operator.

Data Typing

A Data Product is a Collection, a List, or Empty.

The List type:
• Introduced in the workflow engine.
• Each element is a data product.
• Heterogeneous.

Collectional Data Querying

Operators are implemented in primitive workflows.
• Arithmetic operators.
• Boolean operators.
• Collectional operators.
• List operators.

Queries are implemented in workflow compositions.

Example

Given a table Reference < Student, Company, GradTime >, find the number of distinct students who received offers from each company in each graduation year; sort the result by descending GradTime and then ascending Company.

SQL query:

    SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
    FROM Reference
    GROUP BY Company, GradTime
    ORDER BY GradTime DESC, Company ASC;
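To make the expected result of this query concrete, here is a minimal sketch that runs it with Python's built-in sqlite3 module. The sample rows are hypothetical (the slide gives only the schema), and SQLite stands in for whatever relational backend VIEW actually uses.

```python
import sqlite3

# Hypothetical sample rows; the slide specifies the schema but no data.
rows = [
    ("Ann", "Acme", 2012),
    ("Bob", "Acme", 2012),
    ("Ann", "Acme", 2012),  # repeated offer, counted once by DISTINCT
    ("Cat", "Bolt", 2013),
    ("Dan", "Acme", 2013),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Reference (Student TEXT, Company TEXT, GradTime INTEGER)"
)
conn.executemany("INSERT INTO Reference VALUES (?, ?, ?)", rows)

query = """
SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC;
"""
result = conn.execute(query).fetchall()
print(result)
```

In VIEW the same query would be expressed as a composition of operator workflows rather than as SQL text; the sketch only shows what such a composition is expected to compute.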

Example of Query Workflow

Query Workflow.

Key Requirements for Workflow Modeling

R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.

A Scientific Workflow Model

Workflows are the basic and the only operands for workflow composition.

Task components (e.g. Web services) are wrapped as primitive workflows (a.k.a. tasks), which are the basic building blocks of scientific workflows.

[Figure: workflows W1 and W2, each drawn with input ports i1, …, ik and output port o1.]

A workflow construct is a mapping from a set of workflows to a workflow.
• Unary workflow constructs
• Binary workflow constructs
• …

A construct C takes a set of workflows W1, …, Wn as input, and composes them into Wc as the output workflow.

Our proposed scientific workflow model consists of the following two layers:
• The logical layer contains the workflow interface that models the input ports and output ports of a workflow.
• The physical layer contains the workflow body that models the physical implementation of the workflow.
   • Primitive workflows.
   • Graph-based workflows.
   • Unary-construct-based workflows.

Unary Workflow Constructs

Dataflow-based Unary Workflow Constructs

The Map Construct

The Map construct enables the parallel processing of a collection of data products based on a workflow that can only process a single data product.

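Example:

[Figure: Map applied to a workflow W1 with input ports i1, …, ik and output port o1. The input list [[1,2],[3,6],[4,7]] is split into [1,2], [3,6], and [4,7], and each element is processed by its own run of W1.]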
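The Map construct can be sketched in a few lines of Python. This is an illustration only: it assumes a workflow can be modeled as a plain Python callable, and `map_construct` and `w1` are hypothetical names (the slide does not say what W1 computes, so a simple pair sum is used).

```python
from concurrent.futures import ThreadPoolExecutor


def map_construct(workflow):
    """Lift a workflow over one data product to a workflow over a list."""
    def mapped(products):
        # Each element gets its own run; pool.map keeps the input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, products))
    return mapped


def w1(pair):
    # Hypothetical single-product workflow: sum a two-element list.
    return pair[0] + pair[1]


mapped_w1 = map_construct(w1)
result = mapped_w1([[1, 2], [3, 6], [4, 7]])
print(result)
```

Because `map_construct` returns a workflow of the same kind it accepts, the result can itself be passed to another construct, which is the point of a uniform model.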

The Reduce Construct

The Reduce construct enables the aggregation of a list of data products into a single data product, based on a workflow that aggregates a fixed number (two or more) of input data products.

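Example:

[Figure: Reduce applied to the two-input Add workflow (inputs i1, i2; output o1) over the list [3,5,9], starting from the value 0. The runs of Add are chained, so each run's output feeds the next run's input.]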
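A minimal sketch of the Reduce construct, again assuming workflows are modeled as Python callables. `reduce_construct` is a hypothetical name, and the initial value 0 is taken from the figure.

```python
def reduce_construct(workflow, initial):
    """Fold a two-input workflow over a list, left to right."""
    def reduced(products):
        accumulator = initial
        for product in products:
            # The previous output is fed into the next run of the workflow.
            accumulator = workflow(accumulator, product)
        return accumulator
    return reduced


def add(a, b):
    return a + b


total = reduce_construct(add, 0)([3, 5, 9])
print(total)
```

The standard library's `functools.reduce(add, [3, 5, 9], 0)` computes the same value; the explicit loop is kept to mirror the chained runs in the figure.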

The Tree Construct

The Tree construct:
• Enables parallel aggregation of a collection of data products.
• Aggregates a collection pairwise, as a binary tree, until a single aggregated product is generated.

The Tree construct can be applied to associative workflows.

Example:

[Figure: Tree applied to Add over [0,3,5,9]. Add(0,3) = 3 and Add(5,9) = 14 run in parallel, then Add(3,14) = 17.]
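The pairwise aggregation can be sketched as follows, assuming workflows are Python callables. `tree_construct` is a hypothetical name, and the sketch runs each level sequentially even though the pairs within a level are independent and could run in parallel.

```python
def tree_construct(workflow):
    """Aggregate a list pairwise, level by level, until one product remains."""
    def tree(products):
        level = list(products)
        if not level:
            raise ValueError("Tree needs at least one data product")
        while len(level) > 1:
            # Combine neighbours (0,1), (2,3), ...; these runs are independent.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                next_level.append(level[-1])  # odd element carried upward
            level = next_level
        return level[0]
    return tree


def add(a, b):
    return a + b


total = tree_construct(add)([0, 3, 5, 9])
print(total)
```

The grouping differs from a left-to-right Reduce, which is why the workflow must be associative for the two to agree.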

The Conditional Construct

The Conditional construct enables the conditional execution of a workflow based on a condition on one of the inputs.

Example:

[Figure: a Projection workflow guarded by a predicate, with input [2,3]. With p = (PI1 < PI2), p is true and Projection runs, producing 3. With p = (PI1 >= PI2), p is false and the workflow fails.]
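A possible reading of the Conditional construct, sketched in Python. Several details are assumptions: workflows are modeled as callables, PI1 and PI2 are read as the first and second elements of the input list, Projection is taken to return the second element, and failure is modeled with a hypothetical `WorkflowFailed` exception.

```python
class WorkflowFailed(Exception):
    """Hypothetical signal that a guarded workflow did not run."""


def conditional_construct(workflow, predicate):
    """Run the workflow only when the predicate holds on its input."""
    def guarded(product):
        if predicate(product):
            return workflow(product)
        raise WorkflowFailed("predicate evaluated to false")
    return guarded


def projection(pair):
    # Assumed behaviour: project the second element of the input list.
    return pair[1]


runs = conditional_construct(projection, lambda pair: pair[0] < pair[1])
fails = conditional_construct(projection, lambda pair: pair[0] >= pair[1])

print(runs([2, 3]))
try:
    fails([2, 3])
    outcome = "ran"
except WorkflowFailed:
    outcome = "failed"
print(outcome)
```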

The Loop Construct

The Loop construct enables cyclic executions of a workflow.

The output of the workflow is repeatedly fed back to a specified input port until the predicate evaluates to true.

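Example:

[Figure: Loop applied to Add with the constant 1 on one input port and predicate p = (PI1 > 100). While p is false, the output o1 is fed back to i1; when p becomes true, the loop ends with the value 101.]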
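The feedback behaviour can be sketched as below. The assumptions are that workflows are Python callables, that the workflow runs at least once before the predicate is checked, and that the starting value is 0 (the figure shows only the final value 101). `loop_construct` and the `max_iterations` guard are hypothetical additions.

```python
def loop_construct(workflow, predicate, max_iterations=100_000):
    """Feed the workflow's output back to its input until predicate is true."""
    def looped(value):
        for _ in range(max_iterations):
            value = workflow(value)
            if predicate(value):
                return value
        raise RuntimeError("predicate never became true")
    return looped


def add_one(x):
    # Add with its second input port fixed to 1.
    return x + 1


count_past_100 = loop_construct(add_one, lambda x: x > 100)
result = count_past_100(0)
print(result)
```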

The Curry Construct

The Curry construct allows users to fix one of the input ports with a specified argument and thus reduce the number of input ports.

By applying multiple Curry constructs, a workflow that takes multiple arguments can be translated into a chain of workflows each with a single argument.

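Example:

[Figure: Curry applied to the two-input Add workflow (inputs i1, i2; output o1), fixing one input port so that the resulting workflow exposes a single input port i1.]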
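A minimal sketch of the Curry construct, assuming workflows are Python callables with positional input ports. `curry_construct` is a hypothetical name; `functools.partial` offers similar behaviour for leading arguments.

```python
def curry_construct(workflow, position, argument):
    """Fix the input port at `position` to `argument`, dropping that port."""
    def curried(*remaining):
        args = list(remaining)
        args.insert(position, argument)
        return workflow(*args)
    return curried


def add(a, b):
    return a + b


def add3(a, b, c):
    return a + b + c


increment = curry_construct(add, 1, 1)  # fix i2 to 1
# Applying Curry repeatedly leaves a chain of single-argument workflows.
add_1_2 = curry_construct(curry_construct(add3, 0, 1), 0, 2)

print(increment(41))
print(add_1_2(3))
```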

Workflow Composition

Example of the composition of Map and Map constructs: a workflow (W9 in the figure) that increases all the numbers in a nested list.

[Figure: Map applied twice to the Add workflow (inputs i1, i2; output o1), with example input [[1,2,3],[4,5,6]].]
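Composing Map with Map can be sketched as follows. The amount added to each number is not legible in the figure, so an increment of 1 is assumed; `map_construct` is a hypothetical, sequential stand-in for the Map construct.

```python
from functools import partial


def map_construct(workflow):
    """Sequential stand-in for Map: apply the workflow to every element."""
    return lambda products: [workflow(p) for p in products]


def add(a, b):
    return a + b


increment = partial(add, 1)  # Add with one input port fixed to 1 (assumed)

# Map over rows, and within each row Map over numbers.
w9 = map_construct(map_construct(increment))
result = w9([[1, 2, 3], [4, 5, 6]])
print(result)
```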

Example of the composition of Map and Reduce constructs: a workflow for parallel summation of each row in a matrix.

[Figure: Map applied to a Reduce of the Addition workflow with initial value 0. For the input [[1,2,3],[4,5,6]] the row sums are 6 and 15.]
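The row-summation workflow can be sketched by nesting a Reduce inside a Map. As before, modeling workflows as Python callables and the names `map_construct` and `reduce_construct` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_construct(workflow):
    """Run the workflow once per element, in parallel, keeping order."""
    def mapped(products):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, products))
    return mapped


def reduce_construct(workflow, initial):
    """Fold a two-input workflow over a list, starting from `initial`."""
    return lambda products: reduce(workflow, products, initial)


def addition(a, b):
    return a + b


row_sums = map_construct(reduce_construct(addition, 0))
result = row_sums([[1, 2, 3], [4, 5, 6]])
print(result)
```

Each row is reduced by its own run, so the rows are summed independently of one another.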

Example of complicated workflow composition: a workflow to calculate the greatest common divisor.

[Figure: workflows W14 and W16 composed from Modulus, Split, Merge, and Projection, with predicate p = (PI(2) == 0).]
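The exact wiring of the figure is not recoverable, so the following is only one plausible reading: a Loop over a step that splits a pair (a, b), computes a mod b, and merges (b, a mod b), stopping when the second element is 0, followed by a Projection of the first element. All function names are hypothetical.

```python
def modulus(a, b):
    return a % b


def gcd_step(pair):
    a, b = pair                 # Split the pair into its two elements
    return [b, modulus(a, b)]   # Merge into the next pair


def loop_construct(workflow, predicate):
    """Feed the output back to the input until the predicate holds."""
    def looped(value):
        while True:
            value = workflow(value)
            if predicate(value):
                return value
    return looped


def projection(pair):
    return pair[0]


def gcd_workflow(a, b):
    if b == 0:
        return a  # nothing to iterate; avoids a modulus by zero
    looped = loop_construct(gcd_step, lambda pair: pair[1] == 0)  # PI(2) == 0
    return projection(looped([a, b]))


print(gcd_workflow(48, 18))
```

The loop terminates because the second element strictly decreases on every run.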

A Collectional Data Model

A collectional data model should:
• Support collection-oriented datasets.
   • Scientists often work with collection-oriented datasets, such as arrays, lists, tables, or file collections.
   • A collection-oriented data model enables data parallelism in scientific workflows.
• Support nested data structures.
   • Scientific data is often hierarchically organized.
   • Scientific workflow tasks often produce collections of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections.
• Provide well-defined operators and their arbitrary compositions to manipulate and query scientific data collections.

A relation is a pair < R, r > where R is a schema of the relation and r is an instance of that schema.

A relation schema can be defined as an unordered tuple < c1 : d1, c2 : d2, …, cn : dn > where c1, c2, …, cn are column names and d1, d2, …, dn are domain names.

A relation instance is a table with rows (called tuples) and named columns (called attributes).

A collection schema is a pair < K, V >.
• K, the key, is a pair k : d where k is the key name and d is the domain name.
• V, the value, is either a relation schema or a collection schema.

A collection instance is a set of key-value pairs (pi, qi) (i ∈ {1,…,m}).
• Each pi is a scalar value.
• Each qi is either a relation instance or a collection instance.

An example: Parameters < Model : String, Experiments : Integer, < Concentration : Double, Degree : Integer >>.
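One way to picture an instance of the Parameters schema is as nested key-value pairs that bottom out in a relation. The sketch below is an assumption about representation, not VIEW's storage format: collections are Python dicts, relations are lists of row dicts, and the model names and values are hypothetical.

```python
# Parameters < Model : String, Experiments : Integer,
#              < Concentration : Double, Degree : Integer >>
parameters = {
    "ModelA": {                                   # key Model : String
        1: [{"Concentration": 7.0, "Degree": 15}],   # key Experiments : Integer
        2: [{"Concentration": 7.1, "Degree": 30}],
    },
    "ModelB": {
        1: [{"Concentration": 7.1, "Degree": 15}],
    },
}


def flatten(collection, prefix=()):
    """Expand a nested collection into flat rows, one per relation tuple."""
    rows = []
    for key, value in collection.items():
        if isinstance(value, dict):      # value is a nested collection
            rows.extend(flatten(value, prefix + (key,)))
        else:                            # value is a relation instance
            for row in value:
                rows.append(prefix + (key, row["Concentration"], row["Degree"]))
    return rows


flat = flatten(parameters)
for row in flat:
    print(row)
```

Every key is a scalar and every value is either a relation or another collection, matching the definition of a collection instance.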

The Collectional Operators

We extend the relational operators to collectional operators, whose only operands are collections.
• Six primitive operators: union, set difference, selection, projection, Cartesian product, and renaming.
• The set of collections is closed under these operators.
• A relation can be defined as a collection whose height and cardinality are equal to 1. The collectional operators then reduce to the relational operators.
• The union and set difference operators can only be applied to union-compatible collections.

[Figure: two collections, each with the labels Model and Result.]

Example of the union operator and the set difference operator.

[Figure: collections with the labels Model and Result; a value of 26 is legible.]

Example of the Cartesian product operator and the renaming operator.

[Figure: the product of two collections renamed M1 and M2, with columns M1.model, M2.model, M1.Result, and M2.Result. Legible (M1.Result, M2.Result) rows appear to be (26, 32), (26, 31), (32, 32), and (32, 31).]

Example of the selection operator.

[Figure: a collection with the labels Model, Experiment, Concentration, Degree, …; the legible row has Concentration 7.1 and Degree 15.]

Example of the projection operator.

[Figure: a collection with the labels Experiment, Concentration, Degree, …; legible (Concentration, Degree) rows are (7.0, 15), (7.1, 15), (7.0, 30), and (7.1, 30).]

Key Features of VIEW

F1: VIEW features the first uniform workflow model, in which workflows are the only building blocks. In VIEW, tasks are primitive workflows and no workflow construct distinguishes workflows from tasks. Such a model greatly simplifies workflow design: a workflow designer only needs to compose complex workflows from simpler ones, without first encapsulating workflows as tasks or vice versa during composition.

F2: VIEW offers powerful workflow composition: workflow constructs can be composed with one another to arbitrary depth. This often yields VIEW workflows that are concise and efficient to execute, and that can be hard to model in other workflow systems.

F3: VIEW features a pure dataflow-based workflow language, SWL, including the dataflow counterparts of controlflow-style constructs such as conditional and loop. Existing workflow languages often require both controlflow and dataflow constructs, resulting in complex or even obscure semantics and non-trivial workflow design.

F4: VIEW supports the cloud MapReduce programming model not only at the job level, but also at the workflow level. Therefore, one can apply the Map and Reduce constructs to an arbitrary workflow an arbitrary number of times. As a result, VIEW can process nested lists of data products in parallel using multiple runs of a workflow.

F5: VIEW features a collectional data model that supports not only traditional primitive data types, such as integer, float, double, boolean, char, and string, but also files, relations, and hierarchical collections (hierarchical key-value pairs) to support parallel processing of data collections.

F6: VIEW supports a high-level graph-based provenance query language, OPQL. In most cases, users can formulate lineage queries easily, without writing recursive queries or knowing the underlying database schema.

F7: VIEW features the first service-oriented architecture that conforms to the reference architecture for scientific workflow management systems (SWFMSs). This architecture greatly facilitates interoperability and subsystem reusability in the community. It also provides a generic infrastructure upon which a domain-specific scientific workflow application system (SWFAS) can be easily developed, with custom interfaces for various platforms and devices.

Conclusions and Future Work

• A scientific workflow composition model.
• A collectional data model.
• A prototypical SWFMS.

Future work:
• Formalization of the scientific workflow algebra and collectional algebra.
   • Completeness.
   • Integration.
• Collaborative scientific workflow composition.
   • Concurrent design and composition.
   • Concurrent execution.

VIEW applications

• Fiber tract analysis for epilepsy.
• Computational detection of MARS in genome.
• DNA analysis for the bacterium E. coli.
• Simulation of Nereis succinea mate search behavior.

Big Data is a Pyramid

Can you contribute a piece too?

Big Data Research Laboratory

Wayne State University

viewsystem.org
