
On-demand Exploration of Distributed Datasets in Large-scale Simulation Studies

Tahsin Kurc
Biomedical Informatics Department
Ohio State University


Collaborators


Outline

• Dynamic Data Driven Applications
• Energy and Environment Studies
  – Instrumented Oil Field: Effective Reservoir Management
• Software Support
  – DataCutter
  – GridDB-Lite
  – Active Proxy-G
  – Compiler Support
• Applications from Biomedical Research
  – Translational Biomedical Research
  – Dynamic Contrast Enhanced MRI


Dynamic Data Driven Applications

• Many applications involve on-demand dataset exploration to
  – extract features and patterns
  – understand the structure or the function
  – study patterns and changes over time
• In most cases, however, we have only partial knowledge of the physical environment.
  – Simulation is a powerful mechanism to search the space for solutions
  – Simulations can be driven by experimental/field data
  – Analysis of simulation data drives new simulations or new field measurements

[Figure: cycle linking Data (different types of data), Analysis (data products, new models), and Workflow (gather/generate more data)]

Instrumented Oil Field


Effective Oil Reservoir Management

• Implementing effective oil and gas production
  – Optimizing well placement
  – Efficient exploration of possible production strategies
• Challenges: geologic uncertainty, operational flexibility, and large, detailed flow models
• Simulate multiple realizations of multiple geostatistical models and production strategies
• Evaluate geologic uncertainty and production strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios
  – Integration of a robust, Grid-based computational and data handling infrastructure
  – Distributed databases of reservoir and geophysical data
  – Storage and computing resources at multiple institutions

[Figure: cycle linking Data (seismic, well pressures, reservoir simulations), Analysis (production rates, bypassed oil, net present value), and Workflow (run new reservoir simulations). Arrow labels: generate requests for new simulations and new seismic studies; obtain initial conditions, boundary conditions, and input parameters for simulations; store and index simulation results; summary data from datasets; spatio-temporal queries]


[Figure: reservoir management loop with four stages: Production Simulation via Reservoir Modeling (Model 1, Model 2, ..., Model N); Monitor Production by acquiring Time Lapse Observations of Seismic Data; Revise Knowledge of Reservoir Model via Imaging and Inversion of Seismic Data; Modify Production Strategy using an Optimization Criterion. Supporting components: Data Management and Manipulation Tools, Data Analysis Tools (e.g., Visualization). Other labels: Data, Parameters, New Model or Parameters, Data Analysis, Sea bed, Reservoir]


Data Storage and Analysis

[Figure: IPARS simulations are run for Geostatistics models (Model 1, Model 2, ..., Model n; m realizations) and Production Strategies (Well Pattern 1, Well Pattern 2, ..., Well Pattern p). Reservoir Datasets are held behind a Storage Resource Broker and a Storage Area Network. Filters RD, SUM, DIFF, and AVG run on Node 1 through Node 20 as Transparent Copies (one copy per node)]


Characteristics

• Spatio-temporal datasets: datasets describe physical scenarios
• Data products often involve results from an ensemble of datasets
• Synthesis of different types of data: measured data, simulated data
• Common operations: subsetting, filtering, interpolations, projections, comparisons, generalized reductions
• Tools for
  – Efficient simulation and optimization
  – Distributed data storage and indexing
  – Data query and retrieval
  – Data filtering, generalized reductions, data product generation
  – Data caching and replication
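The common operations named on this slide (subsetting, filtering, generalized reduction over an ensemble) can be illustrated with a minimal Python sketch. The record layout (time, cell id, oil saturation), the dataset names, and all function names are assumptions made for illustration; none of them come from the tools described in the talk.

```python
from collections import defaultdict

# Hypothetical ensemble: each dataset is a list of (time, cell_id, soil) records.
ensemble = {
    "realization-1": [(0, 0, 0.9), (0, 1, 0.4), (1, 0, 0.8), (1, 1, 0.3)],
    "realization-2": [(0, 0, 0.7), (0, 1, 0.6), (1, 0, 0.6), (1, 1, 0.5)],
}

def subset(records, t_min, t_max):
    """Subsetting: keep records whose time step lies in [t_min, t_max)."""
    return (r for r in records if t_min <= r[0] < t_max)

def generalized_reduction(records, key, init, accumulate):
    """Fold records into one accumulator per key. The record order does not
    matter when `accumulate` is commutative and associative."""
    acc = defaultdict(lambda: init)
    for r in records:
        acc[key(r)] = accumulate(acc[key(r)], r)
    return dict(acc)

def mean_soil_per_cell(ensemble, t_min, t_max):
    """Data product: mean oil saturation per cell across all datasets."""
    merged = (r for recs in ensemble.values() for r in subset(recs, t_min, t_max))
    sums = generalized_reduction(
        merged,
        key=lambda r: r[1],                      # group by cell id
        init=(0.0, 0),                           # (running sum, count)
        accumulate=lambda a, r: (a[0] + r[2], a[1] + 1),
    )
    return {cell: total / count for cell, (total, count) in sums.items()}

result = mean_soil_per_cell(ensemble, 0, 1)  # time step 0 only
```

Because the accumulator is a running (sum, count) pair, partial results computed on different nodes could be merged in any order, which is what makes this kind of operation easy to distribute.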


Software Support

• DataCutter: Component Framework for Combined Task/Data Parallelism
  – Filtering/Program coupling Service: Distributed C++ component framework
• GridDB-Lite: Large Data Query Layered on DataCutter
  – Range query
  – Value-based selects
  – Client can be a parallel program
  – Indexing: multilevel hierarchical indexes based on the R-tree indexing method
• Active Proxy-G: Active Semantic Data Cache
  – Employ user semantics to cache and retrieve data
  – Store and reuse results of computations
• Compiler Support
  – Generalized reductions and SQL select queries
  – Compiler-generated data extraction objects, filters, query plans


DataCutter

• Indexing Service: Multilevel hierarchical indexes based on R-tree indexing method
• Filtering Service: Distributed C++ component framework
• User defines sequence of pipelined components (filters and filter groups)
• Pleasingly parallel
• Generalized reduction
• User directive to generate and instantiate copies of filters
• Multiple filter groups can be active simultaneously
• Flow control between transparent filter copies
• Replicate individual filters
• Transparent: single stream illusion

[Figure: Combined Data/Task Parallelism. Filter copies R0 to R2, E0 ... EK, EK+1 ... EN, Ra0 to Ra2, and M placed on hosts host1 to host5 across Cluster 1, Cluster 2, and Cluster 3]

Download at www.datacutter.org
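The filter and transparent-copy ideas on this slide can be sketched in a few lines of Python. This is an illustration only: DataCutter itself is a C++ framework, and the function names below (read_filter, make_threshold_filter, sum_filter, transparent_copies) are hypothetical, not part of its API.

```python
from itertools import cycle

def read_filter(chunks):
    """Source filter: emits data buffers (here, lists of numbers)."""
    for chunk in chunks:
        yield chunk

def make_threshold_filter(threshold):
    """Build a filter that keeps only values above `threshold` in each buffer."""
    def threshold_filter(buffers):
        for buf in buffers:
            yield [v for v in buf if v > threshold]
    return threshold_filter

def sum_filter(buffers):
    """Sink filter: a generalized reduction (sum) over all incoming buffers."""
    return sum(sum(buf) for buf in buffers)

def transparent_copies(filter_fn, n_copies, buffers):
    """Hand buffers round-robin to n_copies instances of filter_fn and merge
    their outputs, so the downstream filter sees one logical stream."""
    queues = [[] for _ in range(n_copies)]
    for queue, buf in zip(cycle(queues), buffers):
        queue.append(buf)
    for queue in queues:
        yield from filter_fn(iter(queue))

chunks = [[1, 5, 9], [2, 8], [7, 3], [10]]

# Pipeline with two transparent copies of the middle filter.
total = sum_filter(
    transparent_copies(make_threshold_filter(4), 2, read_filter(chunks))
)

# Same pipeline with a single copy, for comparison.
total_single = sum_filter(make_threshold_filter(4)(read_filter(chunks)))
```

The copies deliver buffers in a different order than the single-copy pipeline, which is harmless here because the sink is a commutative and associative reduction.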


Integrating DataCutter with existing Grid toolkits: SRB, Globus, NWS

– SRB integration: subset and filter datasets
– Globus integration: DataCutter uses Globus’ resource discovery, resource allocation, authentication, and authorization services
– Network Weather Service (NWS) integration: NWS is used for system monitoring
– Integration with GridFTP planned
– OGSI-compliant interfaces


GridDB-Lite

Support efficient selection of the data of interest from distributed scientific datasets and transfer of data from storage clusters to compute clusters

• Data Subsetting Model
  – Virtual Tables
  – Select Queries
  – Distributed Arrays

SELECT RID, TIME, X, Y, Z
FROM Ipars-Realiz-1, Ipars-Realiz-2
WHERE TIME>=1000 AND TIME<1200 AND SOIL>0.7
  AND Speed(OVelX, OVelY, OVelZ)<50.0
GROUP-BY-PROCESSOR MyPartitioner(RID, TIME)

Services
• Query
• Meta-data
• Indexing
• Data Source
• Filtering
• Partition Generation
• Data Mover
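The semantics of the select query on this slide can be sketched over an in-memory table. This is not GridDB-Lite code: the sample rows are invented, Speed is assumed to be the magnitude of the velocity vector, and my_partitioner is a hypothetical stand-in for the user-supplied MyPartitioner.

```python
import math

def row(rid, time, soil, vel):
    """Build one hypothetical tuple of the virtual table."""
    return {"RID": rid, "TIME": time, "X": float(rid), "Y": 0.0, "Z": 0.0,
            "SOIL": soil, "OVelX": vel[0], "OVelY": vel[1], "OVelZ": vel[2]}

rows = [
    row(1, 1000, 0.80, (3.0, 4.0, 0.0)),    # passes every predicate
    row(2, 1100, 0.50, (1.0, 1.0, 1.0)),    # fails SOIL > 0.7
    row(3, 1200, 0.90, (1.0, 1.0, 1.0)),    # fails TIME < 1200
    row(4, 1150, 0.90, (40.0, 40.0, 0.0)),  # fails Speed < 50.0
    row(5, 1199, 0.75, (1.0, 2.0, 2.0)),    # passes every predicate
]

def speed(vx, vy, vz):
    """Assumed meaning of the user-defined Speed function: vector magnitude."""
    return math.sqrt(vx * vx + vy * vy + vz * vz)

def my_partitioner(rid, time, n_procs):
    """Hypothetical partitioner: a simple modulo rule on (RID, TIME)."""
    return (rid + time) % n_procs

def select(rows, n_procs):
    """Apply the WHERE clause, project the SELECT columns, and group the
    resulting tuples by destination processor."""
    parts = {p: [] for p in range(n_procs)}
    for r in rows:
        if (1000 <= r["TIME"] < 1200 and r["SOIL"] > 0.7
                and speed(r["OVelX"], r["OVelY"], r["OVelZ"]) < 50.0):
            p = my_partitioner(r["RID"], r["TIME"], n_procs)
            parts[p].append((r["RID"], r["TIME"], r["X"], r["Y"], r["Z"]))
    return parts

parts = select(rows, n_procs=2)
```

The GROUP-BY-PROCESSOR clause is what distinguishes this from an ordinary select: the result is not one table but one partition per processor of the (possibly parallel) client.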


GridDB-Lite: Select Operation on Grid Data

Distributed Array


Active Semantic Cache - Data Transformation Model

New query: compute data product I
Cache has a data product J available

1. Can J help compute I?
2. How can J help compute I?

[Figure labels: S1, S2, projected subqueries]
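The two questions on this slide can be made concrete for the simplest possible case, where a query and a cached product are each described by a half-open range (for example, a range of time steps). This is a sketch under that assumption only; the slides do not spell out how Active Proxy-G represents queries, and the function names are hypothetical.

```python
def overlap(query, cached):
    """Question 1: return the part of `query` covered by `cached`, or None.
    Ranges are half-open (lo, hi) pairs."""
    lo, hi = max(query[0], cached[0]), min(query[1], cached[1])
    return (lo, hi) if lo < hi else None

def plan(query, cached):
    """Question 2: split `query` into the range reused from the cache and the
    remaining subqueries that still have to be computed."""
    hit = overlap(query, cached)
    if hit is None:
        return None, [query]
    remainder = []
    if query[0] < hit[0]:
        remainder.append((query[0], hit[0]))
    if hit[1] < query[1]:
        remainder.append((hit[1], query[1]))
    return hit, remainder

# Query I asks for time steps 1000..1199; cached product J covers 900..1099.
reused, subqueries = plan(query=(1000, 1200), cached=(900, 1100))

# A cached product that does not intersect the query cannot help.
miss = plan(query=(0, 10), cached=(20, 30))
```

For this input the cache supplies (1000, 1100) and one subquery, (1100, 1200), remains to be sent to an application server.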


Active Proxy-G Ecosystem

[Figure: Active Proxy-G sits between Clients, Application Servers, and Cache Servers. Edge labels: query / results (clients), subquery / subquery results (application servers), insert and search aggregates (cache servers)]


Compiler Support

• Specify dataset layout
  – Distribution of dataset files
  – Layout of data within a file
• Specify portion of dataset
  – e.g., through SQL
• Compiler-generated instances of data extraction and filter objects
• Compiler-generated plans for multi-query batches


System Architecture

[Figure components: Input query in SQL3; Parser; Query Optimizer; Transformed query; GridDB-Lite; Raw data files; Meta-data Descriptor (Dataset description, Dataset list); Parser; Index() & Extractor() code generation]


Description file

    [IPARS]
    RID = INT2
    TIME = INT4
    X = FLOAT
    Y = FLOAT
    Z = FLOAT
    POIL = FLOAT
    PWAT = FLOAT
    ……

Data list file

    [bh]
    DatasetDescription = IPARS
    io = file
    Dim = 17x65x65
    Npart = 8
    …
    Osumed1 = osumed01.epn.osc.edu, osumed02.epn.osc.edu, …
    0 = bh-10-1 osumed1 /scratch1/bh-10-1
    1 = bh-10-2 osumed1 /scratch1/bh-10-2
    ……

Meta-data

    { Group "ROOT" {
        DATASET "bh" {
          DATATYPE { IPARS }
          DATASPACE { RANK 3 }
          DATAINDEX { RID, TIME }
          PARTS { 9503, 9503, 9537, 9554, 9503, 9707, 9520, 9520 }
          DATA { DATASET SPACIAL, DATASET POIL, DATASET PWAT, …… }
        }
        Group "SUBGROUP" {
          DATASET "SPACIAL" {
            DATATYPE { }
            DATASPACE {
              SKIP 4 LINES
              LOOP PARTS {
                X SPACE Y SPACE Z
                SKIP 1 LINE
              }
            }
            DATA { PART in (0,1,2,3,4,5,6,7).0.PART.5.init }
          }
          DATASET "POIL" {
            DATATYPE { }
            DATASPACE {
              LOOP TIME {
                SKIP 1 double
                LOOP PARTS { POIL }
              }
            }
            DATA { PART in (0,1,2,3,4,5,6,7).0.PART.5.0 }
            ……
          }
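As an illustration of how a description file like the one on this slide could drive generated extraction code, the sketch below parses the visible part of the [IPARS] section and derives a binary record layout. The mapping from type names to sizes (INT2 as a 2-byte integer, INT4 as a 4-byte integer, FLOAT as a 4-byte float) and the little-endian packed layout are assumptions; the slide elides further fields, and the real compiler's output is not shown in the talk.

```python
import configparser
import struct

# Only the fields visible on the slide; the original list continues.
DESCRIPTION = """
[IPARS]
RID = INT2
TIME = INT4
X = FLOAT
Y = FLOAT
Z = FLOAT
POIL = FLOAT
PWAT = FLOAT
"""

# Assumed mapping from descriptor type names to struct format codes.
TYPE_CODES = {"INT2": "h", "INT4": "i", "FLOAT": "f"}

def record_format(text, section):
    """Return (field names, struct format string) for one schema section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep attribute names case-sensitive
    parser.read_string(text)
    fields = list(parser[section].items())
    fmt = "<" + "".join(TYPE_CODES[type_name] for _, type_name in fields)
    return [name for name, _ in fields], fmt

names, fmt = record_format(DESCRIPTION, "IPARS")
record_size = struct.calcsize(fmt)

# Round-trip one hypothetical record through the derived layout.
packed = struct.pack(fmt, 7, 1000, 1.0, 2.0, 3.0, 0.5, 0.25)
unpacked = dict(zip(names, struct.unpack(fmt, packed)))
```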


Current Status

• Datasets
  – A 1.5TB dataset
    • 207 simulations, selected from several geostatistics models and well patterns
    • Each simulation is ~6.9GB: 10,000 time steps, 9,000 grid elements, 8 scalars + 3 vectors = 17 variables
  – A 5TB dataset
    • 500 simulations, selected from several geostatistics models and well patterns
    • Each simulation is ~10GB: 2,000 time steps, 65K grid elements, 8 scalars + 3 vectors = 17 variables
• Stored at
  – SDSC: HPSS and 30TB Storage Area Network system
  – UMD: 9TB disks on 50 nodes: PIII-650, 768MB, Switched Ethernet
  – OSU: 7.2TB disks on 24 nodes: PIII-900, 512MB, Switched Ethernet
• Data Analysis
  – Economic model assessment
  – Bypassed oil regions
  – Representative realization selection for more simulations


Bypassed Oil

• RD -- Read data filter. Accesses the datasets.
• CC -- Connected component filter. Performs connected component analysis to find oil regions per time step.
• MT -- Merge over time. Combines results over multiple time steps to identify bypassed oil.

Filter pipeline: RD → CC → MT → Client

• Query: Find all the datasets in D that have bypassed oil pockets with at least Tcc grid cells.
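A minimal sketch of the CC and MT stages on a small 2D grid follows. The slides do not define how oil regions are thresholded or how time steps are merged, so two rules here are assumptions: a cell counts as oil when its saturation exceeds soil_min, and a cell counts as bypassed when it belongs to a region of at least t_cc cells at every time step. The grid values are invented.

```python
from collections import deque

def connected_components(mask):
    """CC stage: 4-connected components of True cells in a 2D grid."""
    rows, cols = len(mask), len(mask[0])
    seen, comps = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            comp, todo = set(), deque([(r, c)])
            seen.add((r, c))
            while todo:  # breadth-first flood fill
                y, x = todo.popleft()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        todo.append((ny, nx))
            comps.append(comp)
    return comps

def bypassed_cells(soil_steps, soil_min, t_cc):
    """MT stage (assumed rule): cells that lie in an oil region of at least
    t_cc cells at every time step."""
    persistent = None
    for soil in soil_steps:
        mask = [[value > soil_min for value in row] for row in soil]
        cells = set()
        for comp in connected_components(mask):
            if len(comp) >= t_cc:
                cells |= comp
        persistent = cells if persistent is None else persistent & cells
    return persistent or set()

# Two time steps of oil saturation on a 3x3 grid (invented values).
steps = [
    [[0.9, 0.9, 0.1], [0.9, 0.1, 0.1], [0.1, 0.1, 0.8]],
    [[0.8, 0.8, 0.1], [0.2, 0.1, 0.1], [0.1, 0.1, 0.9]],
]
pocket = bypassed_cells(steps, soil_min=0.7, t_cc=2)
has_bypassed_oil = len(pocket) >= 2
```

A dataset would be returned by the query when has_bypassed_oil is true; running the same function over every dataset in D gives the full answer.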


Representative Realization

• Select the simulation/realization whose values are closest to user-defined criteria
  – Analyze that simulation or use its initial conditions for further simulation studies
• Find the dataset, among a set of datasets,
  – whose values of oil concentration, water pressure, and gas pressure are closest to the average of these values across the set of datasets
• User selects
  – A set of datasets (D) and a set of time steps (T1, T2, …, TN)
• Query: Find the dataset that is closest to the average.

min over datasets in D of Σ(all grid points) ( |Oc - Ocavg| + |Wp - Wpavg| + |Gp - Gpavg| )
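The query above translates directly into code. In the sketch below each dataset maps the variable names Oc, Wp, and Gp to a flat list of values, one per grid point (and, by assumption, per selected time step, flattened into the same list). The dataset names and values are invented, and the three terms are summed without any scaling, exactly as in the formula on the slide.

```python
def mean_field(fields):
    """Point-wise average of equally sized lists of values."""
    n = len(fields)
    return [sum(values) / n for values in zip(*fields)]

def representative(datasets):
    """Return the name of the dataset that minimizes
    sum over grid points of |Oc - Ocavg| + |Wp - Wpavg| + |Gp - Gpavg|."""
    variables = ("Oc", "Wp", "Gp")
    avg = {v: mean_field([d[v] for d in datasets.values()]) for v in variables}

    def distance(d):
        return sum(abs(x - a) for v in variables for x, a in zip(d[v], avg[v]))

    return min(datasets, key=lambda name: distance(datasets[name]))

# Three hypothetical realizations over two grid points.
datasets = {
    "A": {"Oc": [0.2, 0.4], "Wp": [100.0, 110.0], "Gp": [50.0, 60.0]},
    "B": {"Oc": [0.4, 0.6], "Wp": [120.0, 130.0], "Gp": [70.0, 80.0]},
    "C": {"Oc": [0.6, 0.8], "Wp": [140.0, 150.0], "Gp": [90.0, 100.0]},
}
closest = representative(datasets)
```

Both the averages and the per-dataset distances are sums over grid points, so each can be computed as a generalized reduction over data that stays distributed across storage nodes.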


[Figure: reservoir management loop with four stages: Production Simulation via Reservoir Modeling; Monitor Production by acquiring Time Lapse Observations of Seismic Data; Revise Knowledge of Reservoir Model via Imaging and Inversion of Seismic Data; Modify Production Strategy using an Optimization Criterion]


Seismic Data Collection


Seismic Modeling of Reservoirs

[Figure components: Reservoir Datasets; Data Conversion; Data Manipulation Tools; Distributed Execution of multiple Seismic Sim instances; Seismic Datasets; Data Manipulation Tools; Visualization Tools]


Seismic Imaging and Inversion

1. Extract a subset of seismic traces from the dataset
2. Allocate and initialize 3D velocity and image volume
3. foreach shot point p do
   3.1 read all traces for p;
   3.2 add contribution of each trace; (Kirchhoff time migration)
4. enddo

Step 3.2 is commutative and associative

[Figure: seismic dataset organization. Labels: Sp (or CDP) # & position; Receiver group # & position; Array #; Component #. Image labels: Sea bed, Reservoir]
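Because step 3.2 is commutative and associative, partial images built on different nodes can be merged in any order. The sketch below shows only that reduction structure. The contribution function is a placeholder, not Kirchhoff migration math, and the trace records (shot, cells, amplitude) are invented. Integer amplitudes are used so that the serial and distributed results compare exactly; with floating-point data they would agree only up to rounding.

```python
from functools import reduce

def contribution(trace, n_cells):
    """Placeholder for the per-trace migration kernel: adds the trace
    amplitude to each image cell the trace touches."""
    image = [0] * n_cells
    for cell in trace["cells"]:
        image[cell] += trace["amplitude"]
    return image

def add_images(a, b):
    """Element-wise sum of two images (commutative and associative)."""
    return [x + y for x, y in zip(a, b)]

def migrate(traces, n_cells):
    """Steps 3 and 4 on one node: accumulate all trace contributions."""
    partial_images = (contribution(t, n_cells) for t in traces)
    return reduce(add_images, partial_images, [0] * n_cells)

traces = [
    {"shot": 0, "cells": [0, 1], "amplitude": 2},
    {"shot": 0, "cells": [1, 2], "amplitude": 3},
    {"shot": 1, "cells": [2, 3], "amplitude": 5},
    {"shot": 1, "cells": [0, 3], "amplitude": 7},
]

# One node processes every shot point.
serial = migrate(traces, 4)

# Two nodes process one shot point each; a global merge then combines the
# partial images, here deliberately in reverse order.
partials = [migrate([t for t in traces if t["shot"] == s], 4) for s in (0, 1)]
merged = reduce(add_images, reversed(partials))
```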


Seismic Imaging and Conversion

[Figure labels: Reduction, Global Merge]


Translational Biomedical Research

Biology, Imaging, Bioinformatics

Disease mechanism, disease classification, diagnosis, treatment


Translational Research: Types of Information

[Figure: 22.0 Kb gene map with exon ticks (1, 5, 10, 16) and markers Taq1B, Intron12, I405V, G84A]


[Figure: cycle linking Data (diagnosis, treatment, laboratory, imaging, proteomic, gene expression, gene sequence), Analysis (pharmacokinetics, minimum effective pharmacological dose, drug toxicity), and Workflow (rule-based protocols; plan tests and treatments; plan patient consenting, specimen collection and analysis). Arrow labels: drives accrual, protocol changes, choice of laboratory, imaging, and genomic testing; data-driven algorithms (patient accrual, clinical, laboratory, genomic testing); generates requests for data; data streamed to analysis; request for data updates]


Dynamic Contrast Enhanced MRI (DCE-MRI) Analyses

• DCE-MRI: MRI imaging using a contrast agent
  – The agent serves to increase local signal intensity and indicate the diffusion patterns in various types of tissue
• A distinctive feature of malignant lesions is the leakiness of their vasculature
• Fit pharmacokinetic model ODEs
  – Intensity of the enhancement as amplitude (Amp)
  – Redistribution rate constant (K21)
  – Elimination rate constant (Kel)
  – Time lag (T)
• Tumor characterization using texture analysis and feature detection techniques
• Register images within a single time-dependent study to correct for patient motion
• Images obtained with varying time/space resolution

[Figure: contrast agent kinetics. Labels: blood flow, Gd-chelates, capillary wall, erythrocyte, tumor cell, intracellular space, extracellular space, diffusion in extracellular space, re-diffusion into the intravascular space]


[Figure: DCE-MRI processing flowchart. Boxes: DCE-MRI Image Acquisition; Severe image artifact; Detectable Motion in Dataset; Motion Correctable; Motion Correction; No DCE Evaluation; Co-registration (cross-modality or longitudinal study); Computer-aided tumor detection; Computer-aided tumor characterization; Algorithmic tumor classification (3D sub-classification of heterogeneous lesion). Decision branches are labeled YES and NO]