On-demand Exploration of Distributed Datasets in Large-scale Simulation
Studies
Tahsin Kurc
Biomedical Informatics Department
Ohio State University
Collaborators
Outline
• Dynamic Data Driven Applications
• Energy and Environment Studies
 – Instrumented Oil Field: Effective Reservoir Management
• Software Support
 – DataCutter
 – GridDB-Lite
 – ActiveProxy-G
 – Compiler Support
• Applications from Biomedical Research
 – Translational Biomedical Research
 – Dynamic Contrast Enhanced MRI
Dynamic Data Driven Applications
• Many applications involve on-demand dataset exploration to
 – extract features and patterns
 – understand the structure or the function
 – study patterns and changes over time
• In most cases, however, we have partial knowledge of the physical environment.
 – Simulation is a powerful mechanism to search the space for solutions
 – Simulations can be driven by experimental/field data
 – Analysis of simulation data drives new simulations or new field measurements
[Figure: Analysis (data products, new models), Workflow (gather/generate more data), Data (different types of data)]
Instrumented Oil Field
Effective Oil Reservoir Management
• Implementing effective oil and gas production
 – Optimizing well placement
 – Efficient exploration of possible production strategies
• Challenges: Geologic uncertainty, operational flexibility, and large, detailed flow models
[Figure: Analysis (production rates, bypass oil, net present value), Workflow (run new reservoir simulations), Data (seismic, well pressures, reservoir simulations). Connecting labels: generate requests for new simulations, new seismic studies; obtain initial, boundary conditions, input parameters for simulations; store and index simulation results; summary data from datasets; spatio-temporal queries]
• Simulate multiple realizations of multiple geostatistical models and production strategies
• Evaluate geologic uncertainty and production strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios
 – Integration of a robust, Grid-based computational and data handling infrastructure
 – Distributed databases of reservoir and geophysical data
 – Storage and computing resources at multiple institutions
[Figure: four-stage cycle]
• Production Simulation via Reservoir Modeling
• Revise Knowledge of Reservoir Model via Imaging and Inversion of Seismic Data
• Monitor Production by acquiring Time Lapse Observations of Seismic Data
• Modify Production Strategy using an Optimization Criteria
[Figure: Model 1, Model 2, …, Model N with Data Analysis; labels: Sea bed, Reservoir, New Model or Parameters, Data Parameters, Data Management and Manipulation Tools, Data Analysis Tools (e.g., Visualization), Storage Resource Broker, Reservoir Datasets, Storage Area Network]
[Figure: Filters RD, SUM, AVG, DIFF on Node 1 … Node 20; Transparent Copies (one copy per node)]
[Figure: Data Storage and Analysis. IPARS inputs: Geostatistics (Model 1, Model 2, …, Model n; m realizations) and Production Strategies (Well Pattern 1, Well Pattern 2, …, Well Pattern p)]
Characteristics
• Spatio-temporal datasets – datasets describe physical scenarios
• Data products often involve results from ensemble of datasets
• Synthesis of different types of data: measured data, simulated data
• Common operations: subsetting, filtering, interpolations, projections, comparisons, generalized reductions
• Tools for
 – Efficient simulation and optimization
 – Distributed data storage and indexing
 – Data query and retrieval
 – Data filtering, generalized reductions, data product generation
 – Data caching and replication
Software Support
• DataCutter: Component Framework for Combined Task/Data Parallelism
 – Filtering/Program coupling Service: Distributed C++ component framework
• GridDB Lite: Large Data Query Layered on DataCutter
 – Range Query
 – Value based selects
 – Client can be parallel program
 – Indexing: Multilevel hierarchical indexes based on R-tree indexing method
• Active Proxy G: Active Semantic Data Cache
 – Employ user semantics to cache and retrieve data
 – Store and reuse results of computations
• Compiler Support
 – Generalized reductions and SQL select queries
 – Compiler generated data extraction objects, filters, query plans
DataCutter
• Indexing Service: Multilevel hierarchical indexes based on R-tree indexing method
• Filtering Service: Distributed C++ component framework
• User defines sequence of pipelined components (filters and filter groups)
• Pleasingly Parallel
• Generalized Reduction
• User directive to generate and instantiate copies of filters
• Multiple filter groups can be active simultaneously
• Flow control between transparent filter copies
• Replicate individual filters
• Transparent: single stream illusion
9/11/2002 DataCutter 19
Combined Data/Task Parallelism
[Figure: filter copies R0, R1, R2, Ra0, Ra1, Ra2, E0 … EK, EK+1 … EN, and M placed on host1 through host5 across Cluster 1, Cluster 2, and Cluster 3]
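The filter-stream idea with transparent copies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DataCutter's C++ API: the names `run_copies`, `run_pipeline`, and the `END` marker are hypothetical, and in-process queues stand in for the streams that connect filters across hosts.

```python
import queue
import threading

END = object()  # end-of-stream marker (a convention invented for this sketch)

def run_copies(work, in_q, out_q, copies):
    """Start `copies` transparent copies of one filter sharing the same streams."""
    def loop():
        while True:
            buf = in_q.get()
            if buf is END:
                in_q.put(END)          # pass the marker on to the sibling copies
                break
            out_q.put(work(buf))       # process one buffer, write to output stream
    threads = [threading.Thread(target=loop) for _ in range(copies)]
    for t in threads:
        t.start()
    return threads

def run_pipeline(buffers, extract, copies=3):
    """Read -> extract (several copies) -> merge, connected by two streams."""
    s1, s2 = queue.Queue(), queue.Queue()
    workers = run_copies(extract, s1, s2, copies)
    for b in buffers:                  # the "read" filter writes buffers to s1
        s1.put(b)
    s1.put(END)
    for t in workers:
        t.join()
    total = 0                          # the "merge" filter: a generalized reduction
    while not s2.empty():
        total += s2.get()
    return total
```

Calling `run_pipeline([[1, 2, 3], [4, 5], [6]], extract=sum)` sums each buffer in whichever copy picks it up and then adds the partial sums. The producer and the merge step see one logical stream regardless of how many copies are running, which is the "single stream illusion" idea.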
Integrating DataCutter with existing Grid toolkits
SRB, Globus, NWS
 – SRB integration: Subset and filter datasets
 – Globus integration: DataCutter uses Globus’ resource discovery, resource allocation, authentication, and authorization services.
 – Network Weather Service (NWS) integration: NWS is used for system monitoring.
 – Integration with GridFTP planned
 – OGSI compliant interfaces
Download at www.datacutter.org
GridDB-Lite
Support efficient selection of the data of interest from distributed scientific datasets and transfer of data from storage clusters to compute clusters
• Data Subsetting Model
 – Virtual Tables
 – Select Queries
 – Distributed Arrays

SELECT RID, TIME, X, Y, Z
FROM Ipars-Realiz-1, Ipars-Realiz-2
WHERE TIME>=1000 AND TIME<1200 AND SOIL>0.7
      AND Speed(OVelX, OVelY, OVelZ)<50.0
GROUP-BY-PROCESSOR MyPartitioner(RID, TIME)

Services
• Query
• Meta-data
• Indexing
• Data Source
• Filtering
• Partition Generation
• Data Mover
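What the select query asks for can be sketched in plain Python. This is a sketch under assumptions, not GridDB-Lite code: rows are modeled as dictionaries, `Speed` is assumed to be the velocity magnitude, and `my_partitioner` is a hypothetical stand-in for the user-supplied `MyPartitioner`.

```python
import math

def speed(vx, vy, vz):
    """Assumed meaning of Speed(): magnitude of the oil velocity vector."""
    return math.sqrt(vx * vx + vy * vy + vz * vz)

def select(tables, t_min=1000, t_max=1200, soil_min=0.7, speed_max=50.0):
    """Yield (RID, TIME, X, Y, Z) for rows that satisfy the WHERE clause."""
    for rows in tables:                     # FROM: scan each virtual table
        for r in rows:
            if (t_min <= r["TIME"] < t_max
                    and r["SOIL"] > soil_min
                    and speed(r["OVelX"], r["OVelY"], r["OVelZ"]) < speed_max):
                yield (r["RID"], r["TIME"], r["X"], r["Y"], r["Z"])

def my_partitioner(rid, time, nprocs):
    """Hypothetical partitioner: pick the destination processor for a tuple."""
    return (rid + time) % nprocs

def group_by_processor(tuples, nprocs):
    """GROUP-BY-PROCESSOR: bucket selected tuples by destination processor."""
    parts = [[] for _ in range(nprocs)]
    for t in tuples:
        parts[my_partitioner(t[0], t[1], nprocs)].append(t)
    return parts
```

In the real system the scan would be driven by the indexing and data source services rather than a full pass over every row; the sketch only shows the selection and partitioning semantics.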
GridDB Lite: Select Operation on Grid Data
[Figure: select operation producing a Distributed Array]
Active Semantic Cache - Data Transformation Model
New query: compute data product I
Cache has a data product J available
1. Can J help computing I?
2. How can J help computing I?
[Figure: regions S1 and S2, with projected subqueries]
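The two questions about whether and how a cached product J can help compute I can be illustrated with a one-dimensional range query. The sketch assumes queries and cached products are half-open ranges (for example, over time steps); `plan_with_cache` is a hypothetical name, and the real cache works on user-defined semantics rather than plain intervals.

```python
def plan_with_cache(query, cached):
    """Split a half-open range `query` into a reusable part and subqueries.

    Returns (reused, subqueries). `reused` is None when the cached
    product cannot help at all.
    """
    q_lo, q_hi = query
    c_lo, c_hi = cached
    lo, hi = max(q_lo, c_lo), min(q_hi, c_hi)
    if lo >= hi:                    # question 1: no overlap, J cannot help
        return None, [query]
    subqueries = []                 # question 2: reuse the overlap, compute the rest
    if q_lo < lo:
        subqueries.append((q_lo, lo))
    if hi < q_hi:
        subqueries.append((hi, q_hi))
    return (lo, hi), subqueries
```

For a query over `(0, 100)` with `(40, 60)` cached, the overlap is served from the cache and the remaining pieces `(0, 40)` and `(60, 100)` become subqueries for the application server.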
Active Proxy-G Ecosystem
[Figure: Clients send queries to Active Proxy-G and receive results; Active Proxy-G exchanges subqueries and subquery results with Application Servers, and insert, search, and aggregate requests with Cache Servers]
Compiler Support
• Specify Dataset layout
 – Distribution of dataset files
 – Layout of data within a file
• Specify portion of dataset
 – e.g. through SQL
• Compiler generated instances of data extraction, filter objects
• Compiler generated plans for multi-query batches
System Architecture
[Figure labels: Input query in SQL3; Parser; Query Optimizer; Transformed query; GridDB-Lite; Raw data files; Meta-data Descriptor; Parser; Dataset description; Dataset list; Index() & Extractor() code generation]
Description file

[IPARS]
RID = INT2
TIME = INT4
X = FLOAT
Y = FLOAT
Z = FLOAT
POIL = FLOAT
PWAT = FLOAT
……

Data list file

[bh]
DatasetDescription = IPARS
io = file
Dim = 17x65x65
Npart = 8
…
Osumed1 = osumed01.epn.osc.edu, osumed02.epn.osc.edu, …

0 = bh-10-1 osumed1 /scratch1/bh-10-1
1 = bh-10-2 osumed1 /scratch1/bh-10-2
……
{
Group “ROOT” {
  DATASET “bh” {
    DATATYPE { IPARS }
    DATASPACE { RANK 3 }
    DATAINDEX { RID, TIME }
    PARTS { 9503, 9503, 9537, 9554, 9503, 9707, 9520, 9520 }
    DATA { DATASET SPACIAL, DATASET POIL, DATASET PWAT, …… }
  }
  Group “SUBGROUP” {
    DATASET “SPACIAL” {
      DATATYPE { }
      DATASPACE {
        SKIP 4 LINES
        LOOP PARTS {
          X SPACE Y SPACE Z
          SKIP 1 LINE
        }
      }
      DATA { PART in (0,1,2,3,4,5,6,7) .0.PART.5.init }
    }
    DATASET “POIL” {
      DATATYPE { }
      DATASPACE {
        LOOP TIME {
          SKIP 1 double
          LOOP PARTS { POIL }
        }
      }
      DATA { PART in (0,1,2,3,4,5,6,7) .0.PART.5.0 }
    ……
}
Meta-data
Current Status
• Datasets
 – A 1.5TB Dataset
   • 207 simulations, selected from several Geostatistics models and well patterns
   • Each simulation is ~6.9GB: 10,000 time steps, 9,000 grid elements, 8 scalars + 3 vectors = 17 variables
 – A 5TB Dataset
   • 500 simulations, selected from several Geostatistics models and well patterns
   • Each simulation is ~10GB: 2,000 time steps, 65K grid elements, 8 scalars + 3 vectors = 17 variables
• Stored at
 – SDSC: HPSS and 30TB Storage Area Network System
 – UMD: 9TB disks on 50 nodes: PIII-650, 768MB, Switched Ethernet
 – OSU: 7.2TB disks on 24 nodes: PIII-900, 512MB, Switched Ethernet
• Data Analysis
 – Economic model assessment
 – Bypassed oil regions
 – Representative Realization Selection for more simulations
Bypassed Oil
• RD -- Read data filter. Access data sets.
• CC -- Connected component filter. Perform connected component analysis to find oil regions per time step.
• MT -- Merge over time. Combine over multiple time steps for bypassed oil.
[Figure: filter pipeline RD → CC → MT → Client]
• Query: Find all the datasets in D that have bypassed oil pockets with at least Tcc grid cells.
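The RD, CC, and MT stages can be sketched on a small 2D grid. The criteria used here are assumptions made for illustration, not taken from the slides: a cell counts as oil when its saturation exceeds `soil_min`, regions are 4-connected, and a dataset is reported when at least `tcc` cells stay inside a large oil region at every selected time step.

```python
def connected_components(mask):
    """CC filter: 4-connected components of True cells in a 2D grid."""
    rows, cols = len(mask), len(mask[0])
    seen, comps = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, comp = [(r, c)], set()
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    comp.add((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

def merge_over_time(per_step_cells):
    """MT filter: keep cells that are in a large oil region at every time step."""
    merged = None
    for cells in per_step_cells:
        merged = cells if merged is None else merged & cells
    return merged or set()

def has_bypassed_oil(steps, soil_min, tcc):
    """RD -> CC -> MT for one dataset; `steps` is a list of 2D saturation grids."""
    per_step = []
    for grid in steps:                       # RD stand-in: one grid per time step
        mask = [[v > soil_min for v in row] for row in grid]
        big = [c for c in connected_components(mask) if len(c) >= tcc]
        per_step.append(set().union(*big))   # cells in regions with >= tcc cells
    return len(merge_over_time(per_step)) >= tcc
```

Running `has_bypassed_oil` over every dataset in D and keeping those that return `True` answers the query under these assumptions.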
Representative Realization
• Select the simulation/realization that has values closest to a user-defined criterion.
 – Analyze that simulation or use its initial conditions for further simulation studies.
• Find the dataset, among a set of datasets, whose values of oil concentration, water pressure, and gas pressure are closest to the average of these values across the set of datasets
• User selects
 – A set of datasets (D) and a set of time steps (T1,T2,…,TN).
• Query: Find the dataset that is closest to the average.
min Σ(all grid points) | Oc – Ocavg | + | Wp – Wpavg | + | Gp – Gpavg|
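The distance above can be evaluated directly. A minimal sketch, assuming each dataset has already been restricted to the selected time steps and is given as a list of (Oc, Wp, Gp) tuples, one per grid point; `representative` is a hypothetical name.

```python
def representative(datasets):
    """Return the index of the dataset closest to the ensemble average.

    Each dataset is a list of (Oc, Wp, Gp) tuples, one per grid point.
    """
    n = len(datasets)
    npoints = len(datasets[0])
    # Per-grid-point average of each variable across all datasets.
    avg = [tuple(sum(d[p][k] for d in datasets) / n for k in range(3))
           for p in range(npoints)]

    def distance(d):
        # Sum over grid points of |Oc - Ocavg| + |Wp - Wpavg| + |Gp - Gpavg|.
        return sum(abs(d[p][k] - avg[p][k])
                   for p in range(npoints) for k in range(3))

    return min(range(n), key=lambda i: distance(datasets[i]))
```

The averaging step is a generalized reduction over the whole set, so in a distributed setting each node could compute partial sums before the distances are evaluated.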
[Figure: four-stage cycle]
• Production Simulation via Reservoir Modeling
• Revise Knowledge of Reservoir Model via Imaging and Inversion of Seismic Data
• Monitor Production by acquiring Time Lapse Observations of Seismic Data
• Modify Production Strategy using an Optimization Criteria
Seismic Data Collection
Seismic Modeling of Reservoirs
[Figure labels: Reservoir Datasets; Data Conversion; Data Manipulation Tools; Seismic Sim (several instances, Distributed Execution); Seismic Datasets; Visualization Tools]
[Figure: seismic dataset layout: Array #; Component #; Sp (or CDP) # & position; Receiver group # & position]
Seismic Imaging and Inversion
1. Extract a subset of seismic traces from the dataset
2. Allocate and initialize 3D velocity and image volume
3. foreach shot point p do
   3.1 read all traces for p;
   3.2 add contribution of each trace; (Kirchhoff time migration)
4. enddo
Step 3.2 is commutative and associative
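Because step 3.2 is commutative and associative, shot points can be split across nodes, accumulated locally, and merged in any order. The sketch below shows only that reduction structure. It is not Kirchhoff time migration: each trace is reduced to a hypothetical `(position, amplitude)` pair added into a 1D image, and the function names are illustrative.

```python
from functools import reduce

def migrate_shots(shots, nx):
    """Local reduction: accumulate each trace's contribution into a 1D image."""
    image = [0.0] * nx
    for traces in shots:                 # step 3: loop over shot points
        for pos, amp in traces:          # step 3.2: add contribution of each trace
            image[pos] += amp
    return image

def global_merge(a, b):
    """Element-wise sum of two partial images."""
    return [x + y for x, y in zip(a, b)]

def distributed_image(shots, nx, nodes):
    """Assign shot points to nodes, reduce locally, then merge globally."""
    parts = [shots[i::nodes] for i in range(nodes)]
    partials = [migrate_shots(p, nx) for p in parts]
    return reduce(global_merge, partials)
```

With exact arithmetic the result is the same for any number of nodes; with floating-point data the merged image can differ in the last bits depending on the order of summation.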
[Figure: Seismic Imaging and Conversion: Reduction on each node followed by a Global Merge; labels: Sea bed, Reservoir]
Translational Biomedical Research
[Figure: Biology, Imaging, Bioinformatics; Disease mechanism, disease classification, diagnosis, treatment]
Translational Research: Types of Information
[Figure: gene diagram, 22.0 Kb; labels: Exon 1, 5, 10, 16; Taq1B; Intron 12; I405V; G84A]
[Figure: Analysis, Workflow, Data]
• Analysis: Pharmacokinetics, minimum effective pharmacological dose, drug toxicity
• Workflow: Rule based protocols, plan tests and treatments, plan patient consenting, specimen collection and analysis
• Data: Diagnosis, Treatment, Laboratory, Imaging, Proteomic, Gene Expression, Gene Sequence
• Connecting labels: Drives accrual, protocol changes, choice of laboratory, imaging, genomic testing; Data driven algorithms - patient accrual, clinical, laboratory, genomic testing; Generates requests for data; Data streamed to Analysis; Request for data updates
Dynamic Contrast Enhanced MRI (DCE-MRI) Analyses
• DCE-MRI
 – MRI imaging using contrast agent.
 – The agent serves to increase local signal intensity and indicate the diffusion patterns in various types of tissue.
• Distinctive feature of malignant lesions is the leakiness of their vasculature.
• Fit pharmacokinetic model ODEs
 – Intensity of the enhancement as amplitude (Amp)
 – Redistribution rate constant (K21)
 – Elimination rate constant (Kel)
 – Time lag (T)
• Tumor characterization using texture analysis and feature detection techniques
• Register images within single time dependent study to correct for patient motion
• Images obtained with varying time/space resolution
[Figure labels: tumor cell; blood flow; Gd-chelates; capillary wall; erythrocyte; intracellular space; extracellular space; diffusion in extracellular space; re-diffusion into the intravascular space]
[Flowchart: DCE-MRI Image Acquisition (MRI). Decision nodes (YES/NO): Severe image artifact; Detectable Motion in Dataset; Motion Correctable. Other nodes: No DCE Evaluation; Motion Correction; Co-registration (cross-modality or longitudinal study); Computer-aided tumor detection; Computer-aided tumor characterization; Algorithmic tumor classification (3D sub classification of heterogeneous lesion)]