Metadata-based tools at the ENCODE Portal


Upload: encode-dcc

Post on 04-Aug-2015


Metadata-driven tools to access data from the ENCODE project

Esther T. Chan¹, Cricket A. Sloan¹, Eurie L. Hong¹, Venkat S. Malladi¹, Laurence D. Rowe¹, J. Seth Strattan¹, Jean M. Davidson¹, Marcus Ho¹, Nikhil R. Podduturi¹, Benjamin C. Hitz¹, Forrest Tanaka¹, Brian J. Lee², Katrina Learned², Matt Simison¹, W. James Kent², J. Michael Cherry¹

1) Department of Genetics, School of Medicine, Stanford University, Stanford, CA 94305; 2) Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064

The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Its current data corpus comprises more than 4000 experiments across more than 400 cell lines and tissues, using a wide array of experimental techniques to survey the chromatin structure and the regulatory and transcriptional landscape of the human and mouse genomes. All ENCODE experimental data, metadata and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. As the volume of data increases, identifying and organizing data sets for presentation to users becomes challenging. Here, we describe the web interface, search tools and underlying database that the DCC has built for simple and intuitive access to ENCODE data and metadata. Extensive, structured metadata collected from ENCODE data producers describe the experimental variables, such as the biological samples, specific reagents, and protocols, needed to replicate the assays and their subsequent analysis. These metadata drive a faceted browsing interface that lets users filter and retrieve particular slices of the large data corpus. Elasticsearch-driven real-time indexing also allows users to perform full-text searches to directly access specific data of interest. Planned features for the revamped ENCODE public portal include access to data standards, quality metrics for data files, data visualization tools, and uniform processing and analysis pipelines. Data and metadata from the ENCODE project can currently be accessed at https://www.encodeproject.org.

APPROACH

I. Principles driving metadata definition

• Provide transparency about the experimental process
• Communicate key variables of each experiment
• Capture the data provenance of computational analyses
• Use ontologies and controlled vocabularies to standardize terminology and promote interoperability with other data resources

II. Detailed capture of metadata in JSON

BIOSAMPLE
• Type (e.g. tissue, cell line)
• Source
• Product id
• Lot id
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs
• Link to donor

HUMAN DONOR
• Species
• Age
• Sex
• Life stage
• Health status
• Ethnicity
• De-identified ID

LIBRARY
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range
• Strand specificity
• Size selection method
• Protocol document

III. Relationships between metadata objects reflect underlying experimental processes

[Diagram: metadata objects linked by "has" relationships. Objects shown, with their multiplicities: EXPERIMENT, REPLICATE [1..n], LIBRARY [1..n], FILES, FILE [0..n], CONSTRUCT [0..n], DONOR [1..n], BIOSAMPLE (1..n), ANTIBODY [0..1], TREATMENT [0..n], RNAi [0..1]]
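The bracketed multiplicities attached to the metadata objects above (for example [1..n], [0..n] and [0..1]) can be read as bounds on how many linked objects are allowed. The following is a minimal Python sketch of that reading. The function names and the parsing rule are illustrative assumptions, not part of the ENCODE software, and the sketch does not encode which object links to which.

```python
import re

# Hypothetical helper: accepts "[1..n]", "[0..1]" and the "(1..n)" variant
# that also appears on the poster.
_SPEC = re.compile(r"^[\[(](\d+)\.\.(\d+|n)[\])]$")


def parse_multiplicity(spec):
    """Return (minimum, maximum); maximum is None when the upper bound is 'n'."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised multiplicity: {spec!r}")
    low = int(match.group(1))
    high = None if match.group(2) == "n" else int(match.group(2))
    return low, high


def allows(spec, count):
    """Return True if `count` linked objects satisfies the multiplicity."""
    low, high = parse_multiplicity(spec)
    return count >= low and (high is None or count <= high)
```

Under this reading, an object marked [0..1] may be absent but may not appear twice, while one marked [1..n] must appear at least once.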

ENCODE portal: https://www.encodeproject.org
Help documentation: https://www.encodeproject.org/help/getting-started
Code repository: http://www.github.com/ENCODE-DCC
@ENCODE-DCC
encode-[email protected]

Browse data collections by assay type, biosamples, antibodies and annotations.

| Data type | Description | ENCODE accession format |
| --- | --- | --- |
| Experiment | An ENCODE-produced experiment with 2 or more biological replicates.* | ENCSR###XXX |
| Dataset | A collection of data files, e.g. associated with an ENCODE analysis or publication. | ENCSR###XXX |
| Biosample | A distinct growth of a cell line, excised tissue, or whole organism used in an assay. | ENCBS###XXX |
| Antibody lot | A distinct lot of an antibody, identified by its product ID, lot number and source, used in an assay. | ENCAB###XXX |
| Donor | A distinct donor or strain from which the biosample was obtained. | ENCDO###XXX |
| File | A data file (raw or processed) with a unique md5sum. | ENCFF###XXX |
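Because each data type has its own accession prefix, the type of an object can usually be inferred from its accession alone. The sketch below illustrates this in Python. It assumes that "###" stands for three digits and "XXX" for three uppercase letters, which the table does not state explicitly. The function name and the accessions used with it are made up.

```python
import re

# Prefix-to-type mapping taken from the table above. ENCSR is shared by
# experiments and datasets, so the prefix alone cannot tell them apart.
PREFIX_TYPES = {
    "ENCSR": "Experiment or Dataset",
    "ENCBS": "Biosample",
    "ENCAB": "Antibody lot",
    "ENCDO": "Donor",
    "ENCFF": "File",
}

# Assumption: "###" means three digits and "XXX" three uppercase letters.
ACCESSION = re.compile(r"^(ENC[A-Z]{2})\d{3}[A-Z]{3}$")


def accession_type(accession):
    """Return the data type for an accession, or None if it is not recognised."""
    match = ACCESSION.match(accession)
    if match is None:
        return None
    return PREFIX_TYPES.get(match.group(1))
```

For example, a made-up accession such as `ENCFF000AAA` would be classified as a file, while a string with an unknown prefix or the wrong length returns `None`.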

USING THE ENCODE DATA PORTAL

I. Navigate the home page for the ENCODE portal

Find data for over 4000 experiments across more than 1700 different biosamples and 300+ antibodies.

II. Browse data collections, e.g. assay types.

III. Search by term

A faceted browsing interface allows users to filter data collections and search results by relevant experimental metadata properties.
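The idea behind faceted browsing can be sketched in a few lines of Python: each selected facet value narrows the result set, and the remaining records are counted per value of another property. The records and property names below are invented for illustration. The portal itself is described as using Elasticsearch, which this in-memory sketch does not attempt to reproduce.

```python
from collections import Counter

# Made-up records; the keys are hypothetical metadata properties.
EXPERIMENTS = [
    {"assay": "ChIP-seq", "biosample": "skin", "target": "CTCF"},
    {"assay": "ChIP-seq", "biosample": "liver", "target": "CTCF"},
    {"assay": "RNA-seq", "biosample": "skin", "target": None},
]


def apply_filters(records, filters):
    """Keep records whose metadata matches every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]


def facet_counts(records, prop):
    """Count how many records fall under each value of one metadata property."""
    return Counter(r.get(prop) for r in records)


# Select one facet value, then count the remaining records by biosample.
selected = apply_filters(EXPERIMENTS, {"assay": "ChIP-seq"})
print(facet_counts(selected, "biosample"))
```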

Visualize signal and segment files via trackhubs in a genome browser.

IV. Search by accession

FEATURES IN DEVELOPMENT

Download raw and processed data files directly from the experiment page. Batch downloads of files coming soon.

https://www.encodeproject.org

See poster 1653S for more details on programmatic data retrieval using a REST API.

Access the underlying ENCODE data referenced in a publication directly using the unique accessions.

Enter a search term for a biosample (e.g. skin), an assay name (e.g. ChIP-seq), or a protein target of an antibody (e.g. CTCF).
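A term search of this kind could also be expressed as a URL. The Python sketch below only builds the URL and makes no network request. The `/search/` path, the `searchTerm` parameter and the `type` filter are assumptions that the poster does not spell out, so check them against the portal's help documentation before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.encodeproject.org"


def search_url(term, **filters):
    """Build a search URL for a free-text term plus optional property filters."""
    # Assumption: the portal exposes a /search/ path that accepts a
    # "searchTerm" query parameter; any filters are added as extra parameters.
    params = {"searchTerm": term, **filters}
    return f"{BASE_URL}/search/?{urlencode(params)}"


print(search_url("CTCF"))
print(search_url("skin", type="Experiment"))
```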

I. ENCODE standard analysis pipelines

[Figure labels: Primary Data (fastq), Mapped Reads (bam), QA Metrics, Signal detection]

II. Search by region of interest

Find ENCODE datasets that overlap a region of interest, specified by genomic coordinates, rs ID (SNP), gene name, etc.

Figure 1 from Boyle et al. Genome Res. 2012 Sep;22(9):1790-7
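For the region search, the core operation is an interval overlap test. The Python sketch below covers only queries given as genomic coordinates, since an rs ID or gene name would first have to be mapped to coordinates. The region syntax, the closed-interval convention and the dataset names are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text):
    """Parse 'chr1:1000-2000' (commas allowed in the numbers) into a Region."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", text.strip())
    if match is None:
        raise ValueError(f"not a genomic region: {text!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(match.group(1), start, end)


def overlaps(a, b):
    """Closed intervals on one chromosome overlap if neither ends before the other starts."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


# Made-up dataset intervals keyed by placeholder names.
DATASETS = {
    "dataset_A": Region("chr1", 500, 1500),
    "dataset_B": Region("chr1", 5000, 6000),
    "dataset_C": Region("chr2", 500, 1500),
}

query = parse_region("chr1:1,000-2,000")
hits = [name for name, region in DATASETS.items() if overlaps(query, region)]
```

A linear scan like this is enough to show the idea. A real service over thousands of datasets would more plausibly use an index, which is outside the scope of this sketch.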

Find documentation on standards, software, formats and how-tos.

JSON is an open-standard, human- and machine-readable data exchange format that expresses data as attribute-value pairs.
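As an illustration of attribute-value pairs, the Python sketch below serializes a biosample-like record to JSON and reads it back. The key names and values are invented. They loosely follow the metadata fields listed on the poster and are not the actual ENCODE schema.

```python
import json

# Hypothetical biosample record: key names and values are invented for
# illustration and do not reproduce the real ENCODE schema.
biosample = {
    "type": "cell line",
    "source": "example-vendor",
    "product_id": "EX-0001",
    "lot_id": "LOT-42",
    "passage_number": 5,
    "donor": {"species": "human", "sex": "female", "life_stage": "adult"},
}

# Serialize to human-readable JSON text, then parse it back.
text = json.dumps(biosample, indent=2, sort_keys=True)
restored = json.loads(text)
print(text)
```

The nested `donor` object shows how a link from one metadata object to another can be represented as a value that is itself a set of attribute-value pairs.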

The ENCODE Data Analysis Center is defining standard analysis pipelines for ChIP-seq, RNA-seq, DNase-seq and whole genome bisulfite-seq. These ENCODE pipelines are being implemented on the DNAnexus cloud platform and will be available for users to run on their own data.