
An Automated Infrastructure to Support High-Throughput Bioinformatics

Gianmauro Cuccuru, Simone Leo, Luca Lianas, Michele Muggiri, Andrea Pinna, Luca Pireddu, Paolo Uva, Andrea Angius, Giorgio Fotia, Gianluigi Zanetti

CRS4

Pula, CA, Italy

[email protected]

Abstract-The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters and ensuring reproducibility, error control and usability by non-technical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.

Keywords-Bioinformatics; NGS; MapReduce.

I. INTRODUCTION

The data-intensive revolution [1], [2] in the life sciences is being driven by the increasing diffusion of massively parallel data acquisition systems, next-generation sequencing (NGS) machines [3] being among the most cited examples. One of the main challenges brought forth by this phenomenon is to develop scalable computing tools that can keep up with such a massive data generation throughput [4]-[6]. However, efficient data processing is only part of the problem: additional issues include dealing with highly structured data where individual components are connected by multiple relationships, keeping track of data provenance, managing complex processing workflows, minimizing operational costs and providing simple access interfaces to non-technical users.

In this paper, we describe our experience in developing a fully automated infrastructure for the analysis of DNA sequencing data produced by the CRS4 NGS facility - currently the largest in Italy by throughput, number of samples processed and amount of data generated. The system, which has been in production use since July 2013, has allowed us to reduce the amount of human resources required to process the data from four to one full-time individual. The infrastructure has been built by composing open source tools - many written at CRS4 - with new purpose-built software which will also be contributed to the open source community. In particular, we developed Orione [7], an online framework for data-intensive analysis and integration of NGS data. Based on Galaxy (see Section III-C), Orione supports the whole life cycle of bacterial and eukaryotic research data: from production, to annotation, to publication and reuse.

This work was partially supported by a Wellcome Trust Strategic Award [095931/Z/11/Z] and by the Sardinian Regional Authorities. Parts of S.L.'s and L.P.'s activities were performed within the context of the Ph.D. program in Biomedical Engineering at the University of Cagliari, Italy.

The remainder of this article is structured as follows. We begin by describing the CRS4 NGS facilities in Section II. Then, in Section III, we give an overview of the system architecture and its components, following with a discussion on overall system performance in Section IV. Section V delineates the related work in this area, after which we conclude and describe future work in Section VI.

II. CRS4 NEXT-GENERATION SEQUENCING LAB

CRS4 hosts a high-throughput genotyping and sequencing facility that is directly interconnected to its computational resources (3000 cores, 4.5 PB storage). With three Illumina HiSeq2000 and two Illumina Genome Analyzer IIx, ours is the largest NGS platform in Italy, with a cumulative output of up to about 20 TB of raw sequencing data every ten days. The NGS lab has been used for complex large-scale genetic analysis in a broad range of applications, including whole-genome and transcriptome sequencing, metagenomics, elucidation of DNA binding sites for chromatin and regulatory proteins, and human exome and targeted resequencing using enrichment strategies (oligonucleotide array capture-based). A substantial part of the analysis has been done in the context of two studies on the genetics of autoimmune diseases and longevity. The former is based on the sequencing of thousands of individuals from Sardinia, which is one of the main reservoirs of genetic variation in Europe and also one of the regions with the highest incidences of autoimmune diseases worldwide; the latter is one of the largest longitudinal studies of the Sardinian founder population [8], [9]. Over the past five years, the NGS Lab has processed more than 1500 whole-genome resequencing, 800 RNA-Seq and 200 exome sequencing samples [10], [11].


Figure 1. Overall architecture of the automation and processing system used at CRS4. The Automator is programmed to orchestrate operations and control the other components, which have specific duties pertaining to data processing, storage, metadata maintenance and interaction with the users.

III. GENERAL SYSTEM ARCHITECTURE

The data production rate of the NGS laboratory presents a significant challenge with respect to operator effort, data management complexity and processing throughput. We have developed a system that can efficiently and autonomously perform the standard primary processing of the data produced by the NGS lab, thus preparing it for further ad-hoc analysis by bioinformaticians or for shipping to external collaborators.

The system achieves scalability via three main features. The first is the automatic execution and monitoring of standard operations, which reduces the human effort required to process the data, thus lowering the occurrence of errors and allowing the analysis to scale to large numbers of datasets. The second is the handling of provenance information on all datasets, which makes it possible to reconstruct their history back to the original raw data: this is crucial to effectively manage large data collections, as it allows data interdependencies to be queried quickly and facilitates integration between multiple studies. Finally, the system is designed for high processing throughput, which is a strict requirement given the growing volumes of data produced by modern data-intensive acquisition technologies.

Fig. 1 summarizes the overall system architecture: computational engines are the core analysis tools that process raw data to yield the final results; OMERO.biobank handles metadata storage and provenance tracking; iRODS acts as a single point of access for all datasets; the workflow manager takes care of composing and executing the various steps that make up each analysis pipeline; the sample submission system allows researchers to provide detailed specifications on input samples and request specific processing; finally, the automator performs global orchestration of the other components in order to minimize human intervention and increase reliability. Each component is described in more detail in a dedicated subsection.

A. Automator

The automator performs standard processing on data produced by the laboratory, including format conversion, demultiplexing, quality control and preparation for archival or shipment. When appropriate, further sample-specific workflows are also run. The system is based on a reactive, event-driven design. For example, the activity diagram in Fig. 2 shows what happens when a sequencing run is completed: an event announces that the run is finished; the automator reacts by executing the appropriate handler, which registers the new datasets with our iRODS [12] catalogue and with OMERO.biobank (see Sec. III-B).

Figure 2. Activity diagram illustrating the registration process for a new sequencing run.

The system's kernel is implemented by an event queue built with RabbitMQ [13]; clients can add new events to the queue to notify the system that something has occurred: for instance, a periodic check adds an event when new data is ready for processing; one or more daemons monitor the queue and execute appropriate actions for each event. The design allows multiple instances of the automator to run concurrently, thus making the system more robust to node failures and other technical problems.
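To make the event-driven design more concrete, the following is a minimal sketch of what a queue consumer of this kind could look like, written with the standard pika client for RabbitMQ; the queue name, event format and handler registry are illustrative assumptions, not details of the actual automator.

```python
import json
import pika

# Hypothetical registry mapping event types to handler functions.
HANDLERS = {}

def handler(event_type):
    """Register a function as the handler for a given event type."""
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register

@handler("run_finished")
def on_run_finished(payload):
    # Placeholder for the real work: register the new datasets with
    # iRODS and create the corresponding objects in OMERO.biobank.
    print("sequencing run finished:", payload.get("run_dir"))

def on_message(channel, method, properties, body):
    event = json.loads(body)
    func = HANDLERS.get(event.get("type"))
    if func is not None:
        func(event.get("payload", {}))
    # Acknowledge only after the handler has completed successfully.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="automator.events", durable=True)
channel.basic_consume(queue="automator.events", on_message_callback=on_message)
channel.start_consuming()
```

Acknowledging a message only after its handler completes means that, if a consumer dies mid-processing, RabbitMQ redelivers the event to another instance; this is what makes it safe to run several concurrent automator daemons against the same queue.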

In addition to the event-dispatching kernel, the automator consists of a number of purpose-built event handlers that are specific to the process implemented at CRS4, and a software library to communicate programmatically with the other components. In fact, the automator does not execute operations directly on the data; instead, these are grouped into workflows that are defined and executed through the workflow manager (see Sec. III-C). The automator monitors the execution of these workflows and, when they complete, registers the new datasets in OMERO.biobank along with a detailed description of the operations that generated them, thus ensuring reproducibility. The automator's role in the overall architecture is therefore that of a middleware layer that drives the automation, integrates the various components and executes specialized site-specific operations.

B. OMERO.biobank

OMERO.biobank is a robust, extensible and scalable traceability framework developed to support large-scale experiments in data-intensive biology. The data management system is built on top of the core services of OME Remote Objects (OMERO) [14], an open source software platform that includes a number of storage mechanisms, remoting middleware, an API and client applications.

At its core, OMERO.biobank's data model (see Fig. 3) consists of entities (e.g., biological samples, analysis results) connected by actions that keep track of provenance information. The system is designed to avoid strong bindings with respect to static data flow patterns: each entity is only aware of the action from which it derives, and vice versa. Additional information is provided by devices, which are linked to actions and hold all the details required to describe hardware components (e.g., an NGS machine), software programs or whole pipelines involved in the data generation process. An example of pipelines mapped as devices is given by Galaxy workflows, which can be easily manipulated with the BioBlend.objects package (see Sec. III-C2).

OMERO.biobank's kernel is complemented by an indexing system that maintains a persistent version of the traceability structure by mapping entities to nodes and actions to edges in a graph database. Implemented with Neo4j [15], the system makes it possible to manage a large number of items: at CRS4, we are currently handling over 130000 entities linked by over 190000 actions. The data persistence layer and the graph index are synchronized by an event-driven mechanism implemented with RabbitMQ: all save, delete and update transactions are mapped to messages, sent to the events queue and consumed by a daemon that updates the graph database. During the query process, users interact with the index engine transparently: queries are redirected to the graph database, which responds with a list of nodes and edges used to retrieve the actual data from OMERO.
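As an illustration of how such a graph index can be queried for provenance, the sketch below uses the Neo4j Python driver to walk backwards from a dataset to all of its ancestors. The node labels, relationship types and property names are hypothetical and do not necessarily reflect the actual OMERO.biobank schema.

```python
from neo4j import GraphDatabase

# Placeholder connection details for the graph index.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Hypothetical schema: (source:Entity)-[:TARGETS]->(:Action)-[:PRODUCES]->(derived:Entity).
# Following such edges backwards from a dataset yields all of its ancestors.
QUERY = """
MATCH (d:Entity {vid: $vid})<-[:PRODUCES|TARGETS*]-(ancestor:Entity)
RETURN DISTINCT ancestor.vid AS vid, ancestor.type AS type
"""

def provenance(vid):
    with driver.session() as session:
        return [dict(record) for record in session.run(QUERY, vid=vid)]

if __name__ == "__main__":
    for item in provenance("sample-entity-vid"):  # placeholder identifier
        print(item["vid"], item["type"])
```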

C. Workflow Manager

The continuously increasing size of the data produced in the life sciences has led to a progressive intensification of the effort required for their analysis. Large and diverse datasets must be processed by workflows consisting of many steps, each with its own configuration parameters. In addition, the entire analysis process should be transparent and reproducible, and the analysis frameworks usable and cost-effective for biomedical researchers. Since keeping track of all information associated with complex pipelines can be very time consuming and error prone, easy-to-use data processing platforms that can automate at least part of the process are highly sought-after.

Our workflow management system has been specifically designed to address the above challenges. It is based on Galaxy [16], an open platform for biomedical data analysis that provides a standard way to encapsulate computational tools and datasets in a graphical user interface (GUI), together with a mechanism to keep track of execution history. The system consists of two main components: Orione, a highly customized Galaxy instance, and BioBlend.objects, an API that enables programmatic interaction with Galaxy entities.

Figure 3. Traceability graph for an exome processing workflow (see Fig. 5) stored within OMERO.biobank. Rectangles represent entities, while circles represent actions.

1) Orione: Orione [7] is an online framework for integrated analysis of NGS data. It includes bioinformatics tools covering end-to-end data analysis for bacteria (resequencing, de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation, metagenomics and metatranscriptomics) and eukaryotes (RNA-Seq, whole genome and exome sequencing and variant annotation - see Fig. 4). Orione has been built by integrating publicly available research tools into Galaxy. Since some of these tools had never been included in Galaxy before, we developed wrappers and user interfaces for them. In addition, Orione includes several data libraries and workflows newly developed by CRS4. Figs. 5 and 6 show two examples related, respectively, to exome processing and variant annotation. To ensure scalability, Orione is configured to run computationally intensive operations on our HPC cluster, including Hadoop-based tools (see Sec. III-F2).


Figure 4. Overall schema of the main functionalities provided by Orione: boxes represent collections of tools performing specific tasks.

Figure 5. Exome processing workflow implemented in Orione, from raw reads (FASTQ) to annotated variants (VCF).

Figure 6. A workflow for variant annotation in Orione.

2) BioBlend.objects: Galaxy (see Sec. III-C1) provides a simple and effective way of accessing a wide array of computational tools for the life sciences. However convenient, though, it is not well-suited to automated bulk processing, which is ever more frequently required to extend the useful life of experimental data as models (e.g., reference genomes) and software tools get updated. For this reason, Galaxy includes a RESTful API that allows programmatic access to a consistent subset of its workflow management infrastructure. This API, however, is rather low-level, requiring the user to directly build HTTP requests, handle (de)serialization and manage error cases. A higher level, dictionary-based API is provided by BioBlend [17], a Python package that greatly simplifies interaction with the Galaxy server.
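For instance, listing the workflows available on a Galaxy server takes only a few lines with BioBlend; in this sketch the server URL and API key are placeholders, and the dictionaries returned mirror the raw REST responses.

```python
from bioblend.galaxy import GalaxyInstance

# Placeholder server URL and API key.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# BioBlend returns plain dictionaries that mirror the REST API responses.
for wf in gi.workflows.get_workflows():
    print(wf["id"], wf["name"])
```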

While offering substantial improvements over the basic API, however, BioBlend still consists of one-to-one mappings of generic Python dictionaries to REST resources, with no explicit modeling of the relationships among the main Galaxy entities (workflows, histories, datasets and data libraries). Also, by passing to the client the same data structures sent by the server, BioBlend provides no isolation from changes in the Galaxy API. The above issues have been addressed by BioBlend.objects, an object-based API developed at CRS4 as a fork of the original BioBlend project (http://github.com/crs4/bioblend). The API offers an object-oriented interface that simplifies development and isolates client code from server-side changes, as well as an explicit modeling of the relationships between the various objects: for instance, library objects expose a method to retrieve all datasets related to that specific library. BioBlend.objects plays a key role in the automation mechanism (see Sec. III-A) used at CRS4 to run its sequencing workflows and keep track of the dataset production process by storing all relevant information in its computable biobank framework (see Sec. III-B).
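The following sketch illustrates the object-oriented style described above; the library and workflow names are hypothetical, and exact method names and signatures may differ between BioBlend.objects versions.

```python
from bioblend.galaxy.objects import GalaxyInstance

# Placeholder connection details.
gi = GalaxyInstance("https://galaxy.example.org", api_key="YOUR_API_KEY")

# Entities are modeled as objects that know about their relationships:
# a library, for instance, can enumerate its own datasets directly.
library = gi.libraries.list(name="sequencing_runs")[0]
datasets = library.get_datasets()
for ds in datasets:
    print(ds.name, ds.id)

# Workflows and histories are objects as well; a run ties them together.
# Here the workflow's input (assumed to be labelled '0') is mapped to a
# library dataset and the outputs are collected in a fresh history.
workflow = gi.workflows.list(name="exome_processing")[0]
history = gi.histories.create(name="exome_run_001")
workflow.run({"0": datasets[0]}, history)
```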

D. Sample Submission System

NGS labs such as the one at our institute (see Sec. II) are characterized by a huge data throughput. As the rate of samples to process increases, manually performing and tracking operations becomes increasingly difficult, costly and error-prone. At CRS4 we addressed this issue by integrating a sample submission and tracking system into our data processing framework. Based on the Galaxy sample tracking platform [18], the system provides an intuitive interface for managing operations on samples, including submission by researchers, management by sequencing technicians and organization via projects, with complete status reports to laboratory personnel. We have extended the original system to handle native Illumina flow cell descriptors and support the retrieval of all such information via a web service. The system's core component is a specialized Galaxy server that acts as an entry point for the submission of sample details; after all required information has been entered, the user submits the set of samples as a "sequencing project", which includes billing information; sequencing technicians can then assign samples to flow cells and monitor the status of sequencing runs via interactive plots. Through the web service, sample descriptions and flow cell information are made available to the other components in the overall framework.


E. iRODS

NGS platforms generate a significant amount of data split over a large number of files and datasets. In addition, frequent collaborations among geographically dispersed entities introduce a requirement for fast and controlled remote data access. To simplify this process, CRS4 adopted iRODS as a front-end to its large-scale heterogeneous storage system (about 4.5 PB distributed in various boxes). The service stores and maintains sequencing datasets in a way that allows users to safely access and manage them through a variety of clients, such as web browsers and command line interfaces. It relies on general-purpose file systems to store data and on SQL databases for metadata. Designed to scale to millions of files and petabytes of data, iRODS is a key component of our infrastructure, providing a single point of access to data sets that may be distributed across a number of disjoint storage systems.
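As an example of programmatic access, the sketch below retrieves a registered dataset and its descriptive metadata through the python-irodsclient package; the zone, logical paths and credentials are placeholders, and the choice of this particular client is an assumption made for illustration.

```python
from irods.session import iRODSSession

# Placeholder connection details for an iRODS zone.
with iRODSSession(host="irods.example.org", port=1247,
                  user="analyst", password="secret", zone="crs4Zone") as session:
    # Fetch a data object (a registered sequencing dataset) by logical path.
    obj = session.data_objects.get("/crs4Zone/ngs/run_0042/sample_01.fastq.gz")

    # Descriptive metadata (attribute/value/unit triples) attached to the object.
    for avu in obj.metadata.items():
        print(avu.name, "=", avu.value)

    # Stream the first few bytes to confirm access.
    with obj.open("r") as f:
        print(f.read(64))
```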

F. Compute Engines

1) Seal: Seal is a suite of tools that harnesses the popular Hadoop [19] distributed computing platform to process sequencing data. In the past years, Hadoop has established itself as the de facto standard for large scale data processing, allowing both commercial and academic institutions to deal with projects of unprecedented size [20]. Seal's main goal is to remove current processing bottlenecks by providing a suite of scalable, distributed applications to perform common time-consuming sequence processing operations. The current version includes the following Hadoop-based distributed processing tools:

Bcl2Qseq: extract reads in Qseq format from Illumina base call files (BCL);
Demux: demultiplex reads from a multiplexed sequencing run;
Prq: convert reads from the qseq or fastq formats to the prq format for alignment with Seqal;
Seqal: BWA-based distributed read mapping and duplicate identification;
ReadSort: distributed read sorting based on read id or alignment position;
RecabTable: extract empirical base quality statistics for recalibration.

Seal components have been shown to scale well both in the size of the input data and in the number of computational nodes available [6]. Moreover, thanks to the robust platform provided by Hadoop, the effort required by operators to run the analyses on a large cluster is generally reduced compared to conventional HPC approaches, since Hadoop transparently handles most hardware and network problems.

To simplify their use and incorporate them into the process automation mechanisms, Seal tools have been integrated into Galaxy, thus allowing their usage as workflow components. Incidentally, the toolbox has also been independently integrated into other high-level workflow tools such as Cloudgene [21]. In addition to being called directly, Seal can also be used as a library, lending its functionality to new custom and complementary applications such as SeqPig [22], a scripting language for processing sequencing data on Hadoop that is also used at CRS4.

2) Special Purpose Tools: At CRS4, the main NGS processing infrastructure is complemented by a series of specialized tools that address several other bioinformatics problems. These tools make heavy use of Pydoop [23], our Python MapReduce and HDFS API for Hadoop.
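To give a flavor of the Pydoop API, here is a minimal MapReduce job that counts aligned reads per chromosome; the assumption that input records are tab-separated lines with the reference name in the third field is made purely for the example.

```python
"""Count aligned reads per chromosome with Pydoop's MapReduce API.

Assumes one tab-separated alignment record per input line, with the
reference (chromosome) name in the third field.
"""
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):
    def map(self, context):
        fields = context.value.split("\t")
        if len(fields) > 2:
            # Emit one count for the chromosome this read aligned to.
            context.emit(fields[2], 1)


class Reducer(api.Reducer):
    def reduce(self, context):
        # Sum the per-read counts for each chromosome key.
        context.emit(context.key, sum(context.values))


def __main__():
    pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))


if __name__ == "__main__":
    __main__()
```

A script of this kind is typically launched with Pydoop's submit command, pointing it at input and output directories on the distributed file system.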

In addition to the sequencing data produced at our site, we have recently processed a considerable number of Roche 454 reads in the context of the safety assessment of a novel hematopoietic stem cell gene therapy (HSC-GT) approach for the treatment of metachromatic leukodystrophy (MLD) [24]. The technique consists of infusing the patient with autologous HSCs, transduced with viral vectors that can express the enzyme whose absence causes the disease. Despite its clinical efficacy, however, GT can give rise to adverse events collectively known as insertional mutagenesis: the genomic proximity of the vector's integration site (IS) can be altered, activating the expression of harmful genes. Thus, investigating the distribution of vectors across the genome is fundamental to guarantee the safety of the procedure. To identify ISs, host DNA is amplified through polymerase chain reaction (PCR) and sequenced. At this point, computational tools are needed to trim out viral and artificial subsequences, map the reads to a reference genome, apply various filters and annotate ISs with nearby genomic features. At CRS4, we developed a custom pipeline that performs the whole analysis, from the raw sequencing data to the annotated sites. The pipeline includes a distributed, Hadoop-based version of the ubiquitous BLAST aligner for read mapping; moreover, wrappers have been developed to make all pipeline steps accessible via Galaxy.

Genomic data processing at CRS4 is not limited to NGS reads: the large-scale population studies we are involved in [25], [26] also include genotyping data gathered through high-density single nucleotide polymorphism (SNP) microarrays. In such genome-wide association studies (GWAS), hundreds of thousands of genetic variants are analyzed simultaneously across the genome to assess possible risk factors for complex diseases. For some technologies, the accuracy of genotype calling (GC) is heavily dependent on the batch of samples being processed [27], so that the most reliable strategy consists of analyzing all available arrays as a single, possibly huge, group. However, conventional GC software does not scale well to large batch sizes: this led us to develop a MapReduce workflow that offers both greater scalability and flexibility than previous solutions [28]. Fig. 7 compares the performance of our implementation to that of the gold standard single-core implementation included in the Affymetrix Power Tools [29].


Figure 7. Time required to perform genotype calling for different dataset sizes (number of input CEL files). Each measurement has been repeated three times (error bars are not visible at this scale). The baseline corresponds to the single-core implementation included in the Affymetrix Power Tools, while the "hadoop" curve represents our MapReduce implementation, run on a 30-node Hadoop cluster.

3) Hadoocca: While the Hadoop platform is a strong vector for computational scalability [30], it imposes some requirements on the underlying computational infrastructure that are not compatible with the established resource allocation patterns used on HPC clusters, like the one at CRS4. Namely, Hadoop has its own mechanisms for job submission, queueing, and scheduling that conflict with HPC batch scheduling systems. In addition, Hadoop needs to run daemons and store data locally on the nodes it uses, essentially assuming their exclusive and long-term allocation.

Thus, to support Hadoop at CRS4, we devised a strategy to make both scheduling paradigms co-exist in an efficient and manageable manner. We implemented a dynamic Hadoop-node allocation system that seamlessly integrates with our existing HPC infrastructure (based on Open Grid Scheduler). The system occupies resources on demand (see Fig. 8), improving node utilization over static allocation approaches without breaking existing scheduling policies. Therefore, it provides a low-cost and low-risk path to testing and adopting Hadoop, which effectively allowed our HPC center to set up a Hadoop cluster with minimal investment, albeit with some trade-offs. Specifically, our setup - which we called Hadoocca - foregoes the Hadoop Distributed File System (HDFS) and instead relies on our shared parallel file system. Thus, though our approach does not benefit from the advantages of HDFS [31], it makes it possible to run HPC and Hadoop jobs at the same time in the same computing environment. The system is being used in production at CRS4 to run computational biology pipelines and other workloads on a 3200-core HPC cluster that is shared with other jobs.

Figure 8. Progression of CPU core assignment by Hadoocca during the execution of a workflow. The assigned core count varies adapting to load and all machines are automatically released once the analysis is concluded.
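We do not detail Hadoocca's implementation here, but the essence of demand-driven allocation behind the ramp-up shown in Fig. 8 can be sketched roughly as follows: a small controller watches the Hadoop load and grows or shrinks a pool of worker jobs submitted to the batch scheduler through the usual Open Grid Scheduler commands (qsub, qdel). Every name, threshold and script in the sketch is hypothetical.

```python
import subprocess
import time

# Hypothetical tuning knobs.
TASKS_PER_WORKER = 8   # queued Hadoop tasks that justify one extra node
MAX_WORKERS = 100      # upper bound on nodes borrowed from the HPC cluster
POLL_SECONDS = 60

def submit_worker():
    """Submit a batch job that starts a Hadoop worker daemon on one node.
    -terse makes qsub print only the job id; start_hadoop_worker.sh is a
    placeholder script that would launch and supervise the daemon."""
    out = subprocess.check_output(
        ["qsub", "-terse", "-b", "y", "start_hadoop_worker.sh"])
    return out.strip().decode()

def release_worker(job_id):
    """Delete the batch job, returning its node to the regular HPC pool."""
    subprocess.check_call(["qdel", job_id])

def scale_pool(get_pending_tasks):
    """Grow or shrink the worker pool to follow the Hadoop load.
    get_pending_tasks is a callable supplied by the caller, e.g. a wrapper
    around the Hadoop job tracker's queue statistics."""
    workers = []
    while True:
        wanted = min(MAX_WORKERS, get_pending_tasks() // TASKS_PER_WORKER)
        while len(workers) < wanted:
            workers.append(submit_worker())
        while len(workers) > wanted:
            release_worker(workers.pop())
        time.sleep(POLL_SECONDS)
```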

IV. PRODUCTION CAPABILITIES

With the introduction of the framework described in the previous section, CRS4 has been able to scale its operations while containing research costs. Specifically, the number of full-time individuals required to operate the processing went down from approximately three to less than one, freeing resources for downstream, research-specific analysis. Its adoption has also enforced complete digital tracking of all analysis operations and datasets. In addition to ensuring reproducibility, this feature provides an important source of information for the quantitative monitoring, evaluation and management of the facility. In addition, the automation system, together with the high-throughput distributed computing applications, has allowed the center to reach its throughput targets. Fig. 9 shows the number of samples processed and the corresponding amount of gzip-compressed sequence data generated each week since the framework went into production. The system has coped with peak loads of over 200 samples per week and about 2 TB of compressed data (approx. six flow cells, or 20 TB of raw input data) per week. This rate is already sufficient to handle the capacity of CRS4's sequencing facility, but we believe the system could scale to higher numbers. Fig. 10 shows the rate of data production since system start-up.

Figure 9. Weekly throughput of operations at the CRS4 sequencing facility. For the period from the last week of July 2013 to the first week of February 2014, the graph shows the number of samples processed each week and the corresponding volume of gzip-compressed sequence data generated.

Figure 10. Growth of compressed output data accumulated at the CRS4 sequencing facility since July 2013, when the automated system fully entered production use, to February 2014.

V. RELATED WORK

The development of a comprehensive data infrastructure for the management and analysis of NGS data has been pursued extensively in different contexts and with varying goals in mind. However, automatically piloting and monitoring standard operations as well as ensuring reproducibility and traceability of analysis are issues that have been less comprehensively addressed.

Previous work has been carried out at The Genome Analysis Centre (TGAC), an institute in the UK that conducts research in genomics and bioinformatics. Their work [32] is primarily focused on the initial analysis of sequencing data and provides a number of tools, packages and pipelines to ascertain, store, and expose quality metrics. The computed quality metrics and contamination screening analyses are stored using a flexible MySQL database and API - useful for storing any run metric or metadata. Furthermore, an iRODS layer is provided, through which data can be annotated with descriptive metadata, enabling consolidated searching and discovery of grouped datasets. In principle, the combination of these tools offers the potential to provide richer contexts for downstream analysis.

There are, however, a number of issues that have not been addressed. For example, there is no support for automated selection of the processing pipeline based on the nature of the sequencing project. Furthermore, the lack of integration with an analysis platform such as Galaxy hinders the possibility of automatically and rapidly exposing sequence data for downstream analysis.

Since 2010, iRODS has been running as a production data management system at the Wellcome Trust Sanger Institute (WTSI), one of the world's major sequencing centres. The WTSI uses iRODS as an archive system [33]. Currently, WTSI users are mainly using iRODS for managing and accessing sequencing Binary Alignment/Map (BAM) files for further analysis and research. Moreover, the WTSI uses iRODS to manage user-defined metadata related to BAM files, whereas more advanced uses (e.g., metadata queries, and management of experimental output for further analysis) are currently under investigation on various internal testbeds. In addition to the WTSI, iRODS has also been used in several other large-scale biological and biomedical initiatives and institutes, including the Broad Institute, the Genome Biology Unit at the University of Helsinki, and the National Center for Microscopy and Imaging Research (NCMIR) at UCSD.

iPlant is a collaborative 5-year, NSF-funded effort to develop a cyberinfrastructure, based on iRODS, to address a series of grand challenges in plant science. Interestingly, the iPlant data infrastructure [34] is designed to support preservation of the experimental provenance of data and of the computational transformations applied to them, providing support for rerunning a workflow using the same data from reference databases or for reproducing experiments and the processing done on resulting experimental data.

The UPPNEX initiative provides high-performance computing resources and large-scale storage together with a software infrastructure for NGS research in Sweden [35]. Currently managing about 300 projects concurrently, UPPNEX is being used by three sequencing platforms, each with their own data delivery workflow; research groups may then analyze their data using the installed software or with custom pipelines. UPPNEX uses iRODS to facilitate moving data between different types of storage resources and to share resources with other domains. Currently, most of the installed software at UPPNEX is only available via command line interface, and limited support is provided for workflow management systems such as Galaxy. In addition, there is no evidence of how provenance information is treated or whether support for reproducibility and traceability is offered to the users.

VI. CONCLUSIONS

We have described our experience in constructing a fully automated infrastructure to support the analysis of data produced by our NGS facility. The system, in production since July 2013, integrates open source tools - either internally developed or publicly available - into a framework that can autonomously handle the primary transformation process and support downstream analysis. The automation middleware is built around a distributed event queue, drives a workflow manager and executes custom housekeeping tasks. The system, which is undergoing continuous development, processes the output of the CRS4 NGS lab, with peak weekly data production periods of over 200 samples and 2 TB of compressed sequence data. As reusable components become available, we plan to release them to the community as open source software.

AUTHOR CONTRIBUTIONS

Automator: LP, LL, AP; BioBlend.objects: GC, LL, SL, LP; Compute Engines: LP, SL; Hadoocca: MM, LP; OMERO.biobank: GC, LL, SL, GZ; Orione: GC, GF, AP, PU; Sample Submission System: GC; Workflow Manager: GC, LP, LL, SL. AA manages the NGS Lab and oversaw the integration of the new submission system. GF and GZ oversaw the work and acquired funding. All authors contributed to this writing.

REFERENCES

[1] V. Marx, "Biology: The big challenges of big data," Nature, vol. 498, June 2013.
[2] K. M. Tolle, D. S. W. Tansley, and A. J. G. Hey, "The Fourth Paradigm: Data-Intensive Scientific Discovery," Proceedings of the IEEE, vol. 99, no. 8, pp. 1334-1337, 2011.
[3] C. S. Pareek, R. Smoczynski, and A. Tretyn, "Sequencing technologies and genome sequencing," Journal of Applied Genetics, vol. 52, no. 4, pp. 413-435, Dec. 2011.
[4] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, p. R134, 2009.
[5] pMap: parallel sequence mapping tool. [Online]. Available: http://bmi.osu.edu/hpc/software/pmap/pmap.html
[6] L. Pireddu, S. Leo, and G. Zanetti, "Seal: a distributed short read mapping and duplicate removal tool," Bioinformatics, vol. 27, no. 15, pp. 2159-2160, 2011.
[7] G. Cuccuru et al., "Orione, a web-based framework for NGS analysis in microbiology," Bioinformatics, 2014, in press.
[8] V. Orru et al., "Genetic variants regulating immune cell levels in health and disease," Cell, vol. 155, no. 1, pp. 242-256, Sep. 2013.
[9] P. Francalacci et al., "Low-pass DNA sequencing of 1200 Sardinians reconstructs European Y-chromosome phylogeny," Science, vol. 341, no. 6145, pp. 565-569, Aug. 2013.
[10] A. Pangrazio et al., "Exome sequencing identifies CTSK mutations in patients originally diagnosed as intermediate osteopetrosis," Bone, vol. 59, pp. 122-126, Feb. 2014.
[11] T. Pippucci et al., "A novel null homozygous mutation confirms CACNA2D2 as a gene mutated in epileptic encephalopathy," PLoS ONE, vol. 8, no. 12, p. e82154, Jan. 2013.
[12] A. Rajasekar et al., "iRODS primer: Integrated rule-oriented data system," Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 2, no. 1, pp. 1-143, 2010.
[13] RabbitMQ. [Online]. Available: https://www.rabbitmq.com/
[14] C. Allan et al., "OMERO: flexible, model-driven data management for experimental biology," Nature Methods, vol. 9, no. 3, pp. 245-253, Mar. 2012.
[15] Neo4j. [Online]. Available: http://www.neo4j.org/
[16] J. Goecks, A. Nekrutenko, J. Taylor, and the Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, no. 8, p. R86, 2010.
[17] C. Sloggett, N. Goonasekera, and E. Afgan, "BioBlend: automating pipeline analyses within Galaxy and CloudMan," Bioinformatics, vol. 29, no. 13, pp. 1685-1686, 2013.
[18] The Galaxy sample tracking website. [Online]. Available: https://wiki.galaxyproject.org/Admin/Sample%20Tracking/Next%20Gen
[19] The Hadoop website. [Online]. Available: http://hadoop.apache.org/
[20] List of institutions that use Hadoop in education or production. [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy
[21] S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter, "Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds," BMC Bioinformatics, vol. 13, no. 1, p. 200, 2012.
[22] A. Schumacher et al., "SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop," Bioinformatics, vol. 30, no. 1, pp. 119-120, Jan. 2014.
[23] S. Leo and G. Zanetti, "Pydoop: a Python MapReduce and HDFS API for Hadoop," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 819-825.
[24] A. Biffi et al., "Lentiviral hematopoietic stem cell gene therapy benefits metachromatic leukodystrophy," Science, vol. 341, no. 6148, p. 1233158, Aug. 2013.
[25] S. Naitza et al., "A genome-wide association scan on the levels of markers of inflammation in Sardinians reveals associations that underpin its complex regulation," PLoS Genetics, vol. 8, no. 1, p. e1002480, 2012.
[26] S. Sanna et al., "Variants within the immunoregulatory CBLB gene are associated with multiple sclerosis," Nature Genetics, vol. 42, no. 6, pp. 495-497, Jun. 2010.
[27] J. T. Leek et al., "Tackling the widespread and critical impact of batch effects in high-throughput data," Nature Reviews Genetics, vol. 11, pp. 733-739, 2010.
[28] S. Leo, L. Pireddu, and G. Zanetti, "SNP genotype calling with MapReduce," in Proceedings of the Third International Workshop on MapReduce and its Applications, 2012, pp. 49-56.
[29] Affymetrix Power Tools. [Online]. Available: http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx
[30] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly, June 2009.
[31] E. Sammer, Hadoop Operations, 1st ed. O'Reilly Media, Inc., 2012.
[32] R. M. Leggett, R. H. Ramirez-Gonzalez, B. J. Clavijo, D. Waite, and R. P. Davey, "Sequencing quality assessment tools to enable data-driven informatics for high throughput genomics," Frontiers in Genetics, vol. 4, p. 288, Dec. 2013.
[33] G.-T. Chiang, P. Clapham, G. Qi, K. Sale, and G. Coates, "Implementing a genomic data management system using iRODS in the Wellcome Trust Sanger Institute," BMC Bioinformatics, vol. 12, no. 1, p. 361, Jan. 2011.
[34] C. Jordan, D. Stanzione, D. Ware, J. Lu, and C. Noutsos, "Comprehensive data infrastructure for plant bioinformatics," in 2010 IEEE International Conference on Cluster Computing Workshops and Posters (Cluster Workshops), pp. 1-5, Sep. 2010.
[35] M. Dahlö, S. Lampa, P. I. Olason, J. Hagberg, and O. Spjuth, "Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data," GigaScience, vol. 2, no. 1, p. 9, Jan. 2013.
