ISMB2014 Reading Group: Intro + Deep Learning of the Tissue-Regulated Splicing Code


TRANSCRIPT

Page 1: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

ISMB2014 Reading Group

September 11, 2014, at AIST CBRC

Page 2: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

What is ISMB? • Quoted from the FAQ: Intelligent Systems for Molecular Biology (ISMB) is the annual meeting of the International Society for Computational Biology (ISCB). Over the past eighteen years the ISMB conference has grown to become the largest bioinformatics conference in the world. The ISMB conferences provide a multidisciplinary forum for disseminating the latest developments in bioinformatics. ISMB brings together scientists from computer science, molecular biology, mathematics, and statistics. Its principal focus is on the development and application of advanced computational methods for biological problems.

Page 3: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

ISMB 2014

• Venue: Boston, USA
• Dates: July 11-15
• Proceedings: published as a special issue of Bioinformatics

• Acceptance rate: 37/191 ≈ 19.4%
  – accepted at 1st round: 29 papers
  – invited to 2nd round: 16 papers
  – accepted at 2nd round: 9 papers
  – withdrawn after acceptance: 1 paper

Page 4: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Next year?

• Held jointly with ECCB ⇒ ISMB/ECCB 2015
• Venue: Dublin, Ireland
• Dates: July 10-14
• Submission deadline: January 9 (no New Year's break!)

• And beyond?
  – 2016: Orlando, USA
  – 2017: Prague, Czech Republic
  – 2018: Chicago, USA

Page 5: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Kengo Sato, Faculty of Science and Technology, Keio University

[email protected]  

Vol. 30 ISMB 2014, pages i121-i129 | BIOINFORMATICS | doi:10.1093/bioinformatics/btu277

Deep learning of the tissue-regulated splicing code
Michael K. K. Leung (1,2), Hui Yuan Xiong (1,2), Leo J. Lee (1,2) and Brendan J. Frey (1,2,3,*)
1 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario M5S 3G4; 2 Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada; 3 Canadian Institute for Advanced Research, Toronto, Ontario M5G 1Z8, Canada

ABSTRACT

Motivation: Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS.

Methods: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters.

Results: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Alternative splicing (AS) is a process whereby the exons of a primary transcript may be connected in different ways during pre-mRNA splicing. This enables the same gene to give rise to splicing isoforms containing different combinations of exons, and as a result different protein products, contributing to the cellular diversity of an organism (Wang and Burge, 2008). Furthermore, AS is regulated during development and is often tissue dependent, so a single gene can have multiple tissue-specific functions. The importance of AS lies in the evidence that at least 95% of human multi-exon genes are alternatively spliced and that the frequency of AS increases with species complexity (Barbosa-Morais et al., 2012; Pan et al., 2008).

One mechanism of splicing regulation occurs at the level of the sequences of the transcript. The presence or absence of certain regulatory elements can influence which exons are kept, while others are removed, before a primary transcript is translated into proteins. Computational models that take into account the combinatorial effects of these regulatory elements have been successful in predicting the outcome of splicing (Barash et al., 2010).

Previously, a 'splicing code' that uses a Bayesian neural network (BNN) was developed to infer a model that can predict the outcome of AS from sequence information in different cellular contexts (Xiong et al., 2011). One advantage of Bayesian methods is that they protect against overfitting by integrating over models. When the training data are sparse, as is the case for many datasets in the life sciences, the Bayesian approach can be beneficial. It was shown that the BNN outperforms several common machine learning algorithms, such as multinomial logistic regression (MLR) and support vector machines, for AS prediction in mouse trained using microarray data. There are several practical considerations when using BNNs.

They often rely on methods like Markov Chain Monte Carlo (MCMC) to sample models from a posterior distribution, which can be difficult to speed up and scale up to a large number of hidden variables and a large volume of training data. Furthermore, computation-wise, it is relatively expensive to get predictions from a BNN, which requires computing the average predictions of many models.

Recently, deep learning methods have surpassed the state-of-the-art performance for many tasks (Bengio et al., 2013). Deep learning generally refers to methods that map data through multiple levels of abstraction, where higher levels represent more abstract entities. The goal is for an algorithm to automatically learn complex functions that map inputs to outputs, without using hand-crafted features or rules (Bengio, 2009). One implementation of deep learning comes in the form of feedforward neural networks, where levels of abstraction are modeled by multiple non-linear hidden layers.

With the increasingly rapid growth in the volume of 'omic' data (e.g. genomics, transcriptomics, proteomics), deep learning has the potential to produce meaningful and hierarchical representations that can efficiently be used to describe complex biological phenomena. For example, deep networks may be useful for modeling multiple stages of a regulatory network at the sequence level and at higher levels of abstraction.

Ensemble methods are a class of algorithms that are popular owing to their generally good performance (Caruana and Niculescu-Mizil, 2006), and are often used in the life sciences (Touw et al., 2013). The strength of ensemble methods comes from combining the predictions of many models. Random forests is an example, as is the Bayesian model averaging method previously used to model the regulation of splicing. Recently, neural network learning has been improved using a technique called dropout, which makes neural networks behave like an ensemble method (Hinton and Srivastava, 2012). Dropout works by randomly removing hidden neurons during the presentation of each training example. The outcome is that instead of training a single model with N hidden variables, it approximates
*To whom correspondence should be addressed.


ISMB2014 Reading Group @ AIST CBRC

Page 6: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Alternative splicing

• In humans, at least 95% of genes undergo alternative splicing.

(Wikipedia)

Page 7: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Deep Neural Networks (DNN)

• Expressive power of deep neural networks
• But extremely hard to train

Deep Neural Networks
• Simple to construct
  – Sigmoid nonlinearity for hidden layers
  – Softmax for the output layer
• But, backpropagation does not work well (if randomly initialized)
  – Deep networks trained with backpropagation (without unsupervised pretraining) perform worse than shallow networks

(Bengio et al., NIPS 2007)

Page 8: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Deep Neural Networks (DNN)

• Several breakthroughs
  – pre-training with autoencoders [Hinton et al., 2006]
  – stabilizing training with dropout [Srivastava et al., 2014]

• Overwhelming wins in competitions across many fields
  – image recognition, speech recognition, compound activity prediction, ...

• Still relatively few applications in bioinformatics
  – protein contact map prediction [Eickholt et al., 2012]

Page 9: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Stacked Autoencoder

Sparse autoencoder [Okatani, 2013]:
• Train so that the input samples are reconstructed well, either by backpropagation or as a Boltzmann machine
• Regularize so that the hidden layer activates sparsely

Pretraining a DNN:
• Train an autoencoder layer by layer → overcomes overfitting
• "greedy layerwise pretraining" [Hinton06]

• Unsupervised learning, one layer at a time
• Each layer is trained to reconstruct its input well
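To make the layer-wise procedure concrete, here is a minimal NumPy sketch of pretraining a stack of tied-weight sigmoid autoencoders, assuming a squared-error reconstruction objective and plain batch gradient descent; the names train_autoencoder and greedy_pretrain are hypothetical, and the sparsity regularization and Boltzmann-machine variant mentioned above are omitted for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=200, seed=0):
    # One tied-weight autoencoder layer: encode h = sigmoid(X W + b),
    # decode r = sigmoid(h W^T + c), minimize squared reconstruction error.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.05, size=(X.shape[1], n_hidden))
    b = np.zeros(n_hidden)    # hidden bias
    c = np.zeros(X.shape[1])  # reconstruction bias
    for _ in range(epochs):
        h = sigmoid(X @ W + b)
        r = sigmoid(h @ W.T + c)
        dr = (r - X) * r * (1.0 - r)              # error signal at the decoder
        dh = (dr @ W) * h * (1.0 - h)             # backpropagated to the encoder
        W -= lr * (X.T @ dh + dr.T @ h) / len(X)  # tied weights: both paths
        b -= lr * dh.mean(axis=0)
        c -= lr * dr.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    # Greedy layer-wise pretraining: each layer is an autoencoder trained on
    # the hidden representation produced by the layer below it [Hinton06].
    params, rep = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(rep, n_hidden)
        params.append((W, b))
        rep = sigmoid(rep @ W + b)
    return params  # used to initialize a DNN before supervised fine-tuning

X = np.random.default_rng(1).random((100, 40))  # 100 toy samples, 40 features
print([W.shape for W, b in greedy_pretrain(X, [20, 10])])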

Page 10: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Dropout

• Train while randomly removing hidden units
• Has the same effect as ensemble learning


(a) Standard Neural Net. (b) After applying dropout.

Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.

its posterior probability given the training data. This can sometimes be approximated quite well for simple or small models (Xiong et al., 2011; Salakhutdinov and Mnih, 2008), but we would like to approach the performance of the Bayesian gold standard using considerably less computation. We propose to do this by approximating an equally weighted geometric mean of the predictions of an exponential number of learned models that share parameters.

Model combination nearly always improves the performance of machine learning methods. With large neural networks, however, the obvious idea of averaging the outputs of many separately trained nets is prohibitively expensive. Combining several models is most helpful when the individual models are different from each other, and in order to make neural net models different, they should either have different architectures or be trained on different data. Training many different architectures is hard because finding optimal hyperparameters for each architecture is a daunting task and training each large network requires a lot of computation. Moreover, large networks normally require large amounts of training data and there may not be enough data available to train different networks on different subsets of the data. Even if one was able to train many different large networks, using them all at test time is infeasible in applications where it is important to respond quickly.

Dropout is a technique that addresses both these issues. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term "dropout" refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. The choice of which units to drop is random. In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.


[Srivastava  et  al.,  2014]
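A minimal sketch of the mechanism described in the excerpt, with a hypothetical helper name (dropout_forward): at training time each unit is kept with probability p, and at test time activations are scaled by p so their expected value matches training.

import numpy as np

def dropout_forward(a, p_retain, rng, train=True):
    # Training: sample a thinned sub-network by zeroing units independently.
    # Test: keep all units and scale by p_retain (weight-scaling approximation).
    if train:
        return a * (rng.random(a.shape) < p_retain)
    return a * p_retain

rng = np.random.default_rng(0)
h = np.ones((4, 6))                               # toy hidden activations
print(dropout_forward(h, 0.5, rng))               # about half the units dropped
print(dropout_forward(h, 0.5, rng, train=False))  # all units, scaled by 0.5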

Page 11: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Deep Neural Networks (DNN)

• Expressive power of deep neural networks

Deep Neural Networks
• Simple to construct
  – Sigmoid nonlinearity for hidden layers
  – Softmax for the output layer
• But, backpropagation does not work well (if randomly initialized)
  – Deep networks trained with backpropagation (without unsupervised pretraining) perform worse than shallow networks

(Bengio et al., NIPS 2007)


Different Levels of Abstraction

• Hierarchical Learning
  – Natural progression from low level to high level structure, as seen in natural complexity
  – Easier to monitor what is being learnt and to guide the machine to better subspaces
  – A good lower level representation can be used for many distinct tasks

[Lee, 2010]

Page 12: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Problem setting

• Predict whether or not an exon undergoes splicing

(ESEs and ISEs) and silencers (ESSs and ISSs), which are 6-8 nucleotides long and were identified without regard to possible tissue-dependent roles (refs 24-26), and 314 5-7-nucleotide-long motifs that are conserved in intronic sequences neighbouring alternative exons (ref. 27). There are also 460 region-specific counts of 1-3-nucleotide 'short motifs', because such features were previously associated with alternative splicing (ref. 28). We included 57 'transcript structure' features implicated in determining spliced transcript levels, such as exon/intron lengths, regional probabilities of secondary structures (ref. 29), and whether exon inclusion/exclusion introduces a premature termination codon (PTC).

In addition to the feature compendium, we constructed a set of ~1,800 'unbiased motifs' by performing a de novo search (ref. 10) for each tissue type and direction of splicing change (Supplementary Information 3). Later, we report results obtained with and without using these features.

Assembling a high-information code

Our method seeks a code that is able to predict the splicing patterns of all exons as accurately as possible, based solely on the tissue type and proximal RNA features. The putative features for a particular exon are appended to make a feature vector r, and the corresponding prediction in tissue type c is denoted p(c,r). Like q, p(c,r) consists of probabilities of increased inclusion or exclusion, or no change. The code is combinatorial and accounts for how features cooperate or compete in a given tissue type, by specifying a subset of important features, thresholds on feature values and softmax parameters (ref. 30) relating active feature combinations to the prediction p(c,r) (Supplementary Information 4).

We use a measure of 'code quality' that is based on information theory (ref. 31) (see Methods). It can be viewed as the amount of information about genome-wide tissue-dependent splicing accounted for by the code. A code quality of zero indicates that the predictions are no better than guessing, whereas a higher code quality indicates improved prediction capability.

To assemble a code, our method recursively selects features from the compendium, while optimizing their thresholds and softmax parameters to maximize code quality (Supplementary Information 5). The code quality increased monotonically during assembly, but diminished gains were observed after 200 features were included (Fig. 1b, c, based on fivefold cross-validation). The final assembled code contained ~200 features. When a code was assembled using the compendium plus the unbiased motifs, the increase in code quality did not exceed 1 s.d. in error (data not shown), but, interestingly, some of the unbiased motifs that did not correspond to any compendium features were selected and subsequently experimentally verified (see later).

To quantify the contributions of its different components, we compared our final assembled code to partial codes whose only inputs were the tissue type, previously described motifs, conservation levels, or the compendium with transcript structure features or conservation levels removed (Fig. 1d).

Predicting alternative splicing

On the task of distinguishing alternatively spliced exons from constitutively spliced exons, our method achieves a true positive rate of more than 60% at a false positive rate of 1% (Supplementary Information 6). To address the more difficult challenge of predicting tissue-dependent regulation, we applied the code to various sets of unique test exons (exons not similar to those used during code assembly) and verified the predictions using microarray data, PCR with reverse transcription (RT-PCR) and focused studies (see later and Supplementary Information 5).

We first asked whether the theoretical ranking of the different codes shown in Fig. 1d corresponds well to their relative abilities to predict microarray-assessed tissue-dependent regulation (see Methods). Indeed, the final assembled code achieved significantly higher accuracy than the partial codes (Fig. 2a). For exons in genes with median expression in the top 20th percentile, at a false positive rate of 1%, a true positive rate of 21% was achieved, and this rose to 51% for a false positive rate of 10%.

We next asked how well the splicing code predicts significant differences in the percentage exon inclusion between pairs of tissues, for cases where the predicted difference is large (Fig. 2b and Supplementary Fig. 12). For microarray data, the splicing code correctly predicted the direction of change (positive or negative) in 82.4% of cases (P < 1 × 10^-30, Binomial test; see Methods). For RT-PCR evaluation, 14 exons that the splicing code predicted would exhibit significant tissue-dependent splicing were profiled in 14 diverse tissues. The splicing code correctly predicted the direction of change in 93.3% of cases (P < 1 × 10^-10, Binomial test). A scatterplot comparing predictions and measurements (Fig. 2c) illustrates that the code is able to predict an exon's direction of regulation better than its percentage inclusion level. Figure 2d shows RT-PCR data and predictions for four representative exons.

To assess whether the code recapitulates results from experimental studies of individual exons and tissue-specific splicing factors, we surveyed 97 CNS- and/or muscle-regulated exons targeted by Nova, Fox, PTB, nPTB and/or unknown factors (refs 18, 19, 32-39). For each test exon, we extracted its features, applied the code and examined whether or not it correctly predicts splicing patterns in CNS or muscle tissues (Supplementary Table 3). The code's predictions were correct for 74% of the combined set of 97 exons (P < 1 × 10^-41, Bernoulli test), 65%

[Figure residue from Barash et al. (2010), Figure 1; only labels are recoverable: panel a shows RNA feature extraction (short motifs, known motifs, new motifs, transcript structure) feeding the splicing code together with the tissue type, over 300-nt windows around the alternatively spliced exon, to give the predicted change in exon inclusion; panels b, c plot code quality (bits) and feature detection rate against the number of RNA features during code assembly; panel d compares code quality across codes derived using different feature sets (known motifs only, tissue type only, conservation only, without transcript structure, without conservation, final assembled code).]

Figure 1 | Assembling the splicing code. a, The code extracts hundreds of RNA features (known/new/short motifs and transcript structure features) from any exon of interest (red), its neighbouring exons (yellow) and intervening introns (blue). It then predicts whether or not the exon is alternatively spliced, and if so, whether the exon's inclusion level will increase or decrease in a given tissue, relative to others. b, c, Code assembly proceeds by recursively adding features to maximize an information measure of code quality (b), and different feature types are preferred at different stages of assembly (c). d, The final assembled code achieves higher code quality than simpler codes derived using previously reported features and feature subsets. Cons, conservation; w/o, without. Error bars represent 1 s.d.


[Barash  et  al.,  2010]

Page 13: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Model

Inputs into the first hidden layer consist of F = 1393 genomic features x_{f=1...F} describing an exon, neighboring introns and adjacent exons. To improve learning, the features were normalized by the maximum of the absolute value across all exons. The purpose of this hidden layer is to reduce the dimensionality of the input and learn a better representation of the feature space.

The identities of the two tissues, which consist of two 1-of-T binary variables t_{i=1...T} and t_{j=1...T}, are then appended to the vector of outputs of the first hidden layer, together forming the input into the second hidden layer. For this work, T = 5 for the five tissues available in the RNA-Seq data. We added a third hidden layer as we found it improved the model's performance. The weighted outputs from the last hidden layer are used as input into a softmax function for classification in the prediction h_k(x, t, θ), which represents the probability of each splicing pattern k:

$$h_k = \frac{\exp\left(\sum_m \theta^{\mathrm{last}}_{k,m}\, a^{\mathrm{last}}_m\right)}{\sum_{k'} \exp\left(\sum_m \theta^{\mathrm{last}}_{k',m}\, a^{\mathrm{last}}_m\right)} \qquad (3)$$

To learn a set of model parameters θ, we used the cross-entropy cost function E on predictions h(x, t, θ) given targets y(x, t), which is minimized during training:

$$E = -\sum_n \sum_{k=1}^{C} y_{n,k} \log\left(h_{n,k}\right) \qquad (4)$$

where n denotes the training examples, and k indexes C classes. We are interested in two types of predictions. The first task is to predict

the PSI value given a particular tissue type and a set of genomic features. To generate the targets for training, we created C = 3 classes, which we label as low, medium and high categories. Each class contains a real-valued variable obtained by summing the probability mass of the PSI distribution over equally split intervals of 0-0.33, 0.33-0.66 and 0.66-1. They represent the probability that a given exon and tissue type has a PSI value ranging from these corresponding intervals, hence are soft class labels. We will refer to this as the 'low, medium, high' (LMH) code, with targets y^LMH_k(x, t_i).

The second task describes the ΔPSI between two tissues for a particular exon. We again generate three classes, and call them decreased inclusion, no change and increased inclusion, which are similarly generated, but from the ΔPSI distributions. We chose an interval that more finely differentiates tissue-specific AS for this task, where a difference of >0.15 would be labeled as a change in PSI levels. We summed the probability mass over the intervals of -1 to -0.15 for decreased inclusion, -0.15 to 0.15 for no change and 0.15 to 1 for increased inclusion. The purpose of

this target is to learn a model that is independent of the chosen PSI class intervals in the LMH code. For example, the expected PSI of two tissues t_i and t_j for an exon could be 0.40 and 0.60. The LMH code would be trained to predict medium for both tissues, whereas this tissue difference code would predict that t_j has increased inclusion relative to t_i. We will refer to this as the 'decrease, no change, increase' (DNI) code, with targets y^DNI_k(x, t_i, t_j). Both the LMH and DNI codes are trained jointly, reusing the same

hidden representations learned by the model. For the LMH code, two softmax classification outputs predict the PSI for each of the two tissues that are given as input into the DNN. A third softmax classification function predicts the ΔPSI for the two tissues. We note that two PSI predictions are included in the model's output so we have a complete set of predictions that use the full input features. The total cost of the model used during optimization is the sum of the cross-entropy functions (4) for both prediction tasks.

The BNN architecture used for comparison is the same as previously described (Xiong et al., 2011), but trained on RNA-Seq data with the expanded feature set and LMH as targets. Although hidden variables were shared across tissues in both the BNN and DNN, a different set of weights was used following the single hidden layer to predict the splicing pattern for each tissue separately in the BNN (Supplementary Fig. S3). In the current DNN, the tissue identities are inputs and are jointly represented by hidden variables together with genomic features. For the BNN to make tissue difference predictions in the same manner as the DNI code, we fitted a MLR on the predicted LMH outputs for each tissue pair (Supplementary Fig. S4).
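The data flow just described can be summarized in a short NumPy sketch, under stated assumptions: sigmoid hidden units, randomly initialized weights, hidden-layer sizes shrunk well below those the paper reports, and hypothetical names (forward, heads). It shows genomic features entering hidden layer 1, the two 1-of-T tissue vectors joining at hidden layer 2, and three softmax heads (LMH for each tissue plus DNI) whose cross-entropies, Eq. (4), are summed into the joint cost.

import numpy as np

rng = np.random.default_rng(0)
F, T, H1, H2, H3, C = 1393, 5, 500, 100, 80, 3  # layer sizes shrunk for the toy

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable Eq. (3)
    return e / e.sum(axis=-1, keepdims=True)

W1 = rng.normal(0.0, 0.05, (F, H1))
W2 = rng.normal(0.0, 0.05, (H1 + 2 * T, H2))  # tissue identities join here
W3 = rng.normal(0.0, 0.05, (H2, H3))
heads = {name: rng.normal(0.0, 0.05, (H3, C)) for name in ("lmh_i", "lmh_j", "dni")}

def forward(x, ti, tj):
    # Features into hidden layer 1; append the two 1-of-T tissue vectors;
    # two more hidden layers; one softmax output per prediction task.
    a1 = sigmoid(x @ W1)
    a2 = sigmoid(np.concatenate([a1, ti, tj]) @ W2)
    a3 = sigmoid(a2 @ W3)
    return {name: softmax(a3 @ Wk) for name, Wk in heads.items()}

def cross_entropy(y, h):
    return -(y * np.log(h)).sum()  # Eq. (4); soft targets are allowed

x = rng.random(F)                     # toy normalized genomic features
ti, tj = np.eye(T)[0], np.eye(T)[3]   # e.g. brain and liver
out = forward(x, ti, tj)
y = {"lmh_i": np.array([0.1, 0.2, 0.7]),  # toy soft targets
     "lmh_j": np.array([0.2, 0.5, 0.3]),
     "dni":   np.array([0.6, 0.3, 0.1])}
print(sum(cross_entropy(y[k], out[k]) for k in heads))  # joint training cost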

2.3 Training the model

The first hidden layer was trained as an autoencoder to reduce the dimensionality of the feature space in an unsupervised manner. An autoencoder is trained by supplying the input through a non-linear hidden layer, and reconstructing the input, with tied weights going into and out of the hidden layer. This method of pretraining the network has been used in deep architectures to initialize learning near a good local minimum (Erhan et al., 2010; Hinton and Salakhutdinov, 2006). We used an autoencoder instead of other dimensionality reduction techniques like principal component analysis because it naturally fits into the DNN architecture, and because a non-linear technique may discover a better and more compact representation of the features.

In the second stage of training, the weights from the input layer to the first hidden layer (learned from the autoencoder) are fixed, and 10 additional inputs corresponding to tissues are appended. A one-hot encoding


Page 14: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Features

• 1392 features describing the exon and its flanking exons and introns [Barash et al., 2010]

  – k-mers
  – translatability
  – lengths
  – conservation
  – sequence motifs (transcription factor binding sites)
  – ...
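As an illustration of the first item, here is a minimal sketch of k-mer count features for one sequence region; kmer_counts is a hypothetical helper, and the real compendium combines such counts with the motif, length, conservation and transcript-structure features listed above.

from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    # Count every length-k substring and report the counts in a fixed feature
    # order (all 4^k possible k-mers over the DNA alphabet).
    s = seq.upper()
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return [counts[''.join(p)] for p in product('ACGT', repeat=k)]

features = kmer_counts('CTCTCTGCATGC', 3)  # hypothetical intronic window
print(len(features), sum(features))        # 64 features over 10 positions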

(This slide repeats the excerpt and Figure 1 from Barash et al. (2010) shown on Page 12.)

Page 15: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Output

• Discretize PSI (percent spliced in) [Katz et al., 2010]
  – LMH code
    • Low: 0-0.33, Medium: 0.33-0.66, High: 0.66-1
  – DNI code
    • Decrease: tissue i > tissue j
    • No change: tissue i ≈ tissue j (|ΔPSI| < 0.15)
    • Increase: tissue i < tissue j

• Multiple outputs learned jointly
  – stabilizes training
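A minimal sketch of how such soft targets can be computed, assuming the PSI and ΔPSI distributions are represented by posterior samples (the paper estimates these distributions from RNA-Seq junction reads); interval_mass, lmh_targets and dni_targets are hypothetical names.

import numpy as np

def interval_mass(samples, edges):
    # Probability mass of an empirical distribution over consecutive intervals.
    s = np.asarray(samples, dtype=float)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (s <= hi) if hi == edges[-1] else (s < hi)
        out.append(np.mean((s >= lo) & upper))
    return np.array(out)

def lmh_targets(psi_samples):
    # Soft LMH labels: mass over 0-0.33 (low), 0.33-0.66 (medium), 0.66-1 (high).
    return interval_mass(psi_samples, [0.0, 0.33, 0.66, 1.0])

def dni_targets(dpsi_samples):
    # Soft DNI labels: mass over -1..-0.15 (decrease), -0.15..0.15 (no change)
    # and 0.15..1 (increase).
    return interval_mass(dpsi_samples, [-1.0, -0.15, 0.15, 1.0])

rng = np.random.default_rng(0)
psi_i = rng.beta(8, 4, size=1000)  # toy PSI samples for tissue i
psi_j = rng.beta(4, 8, size=1000)  # toy PSI samples for tissue j
print(lmh_targets(psi_i), dni_targets(psi_j - psi_i))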

(This slide repeats the model excerpt and Fig. 1 from Leung et al. shown on Page 13.)

Page 16: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Training the DNN

• Weights initialized randomly from a normal distribution
• Stacked autoencoder + dropout
• One extra trick: rather than plain stochastic gradient descent over a random order, training starts from the exons with the largest differences across tissues (see the sketch below)
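The slide gives no further detail on this schedule, so the following is only a sketch of the idea under an assumed interface: score each exon by its largest |ΔPSI| across tissue pairs and present the most variable exons first.

import numpy as np

def order_by_tissue_variability(dpsi):
    # dpsi: exons x tissue-pairs matrix of expected PSI differences.
    variability = np.abs(np.asarray(dpsi)).max(axis=1)
    return np.argsort(-variability)  # most tissue-variable exons first

rng = np.random.default_rng(0)
dpsi = rng.uniform(-1.0, 1.0, size=(8, 10))  # toy expected-ΔPSI estimates
print(order_by_tissue_variability(dpsi))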

Page 17: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Hyperparameter optimization

• 5-fold cross validation, optimized on AUC
  – training: 3 folds (training the DNN)
  – validation: 1 fold (hyperparameter optimization)
  – test: 1 fold (evaluation)

• Used spearmint [Snoek et al., 2012], a Gaussian-process-based method


S2 Hyperparameters Optimization

We used a Bayesian optimization package called spearmint (Snoek et al., 2012) to search for a joint setting of hyperparameters that optimizes the model's performance on validation data. Data was split into 5 approximately equal folds at random for cross validation. Each fold contained a unique set of exons that are not found in any of the other folds. Three of the folds were used for training, one for validation, whose performance was used for hyperparameter selection, and one for testing. For each fold of the data partition, a separate hyperparameter optimization procedure was performed to ensure a set of test data is always held out from the optimization. The model performance of each selection of hyperparameters was scored by the sum of the AUCs from both the LMH and DNI codes on validation data, and therefore required the setting to perform well on both tasks. The optimal set of hyperparameters was then used to re-train the model using both training and validation data. Five models were trained this way from the different folds of data. Predictions made for the corresponding test data from all models were then evaluated and reported.

The hyperparameters that were optimized and their search ranges are: (1) the learning rate for each of the two tasks (0.1 to 2.0), (2) the number of hidden units in each layer (30 to 9000), (3) the L1 penalty (0.0 to 0.25), (4) the standard deviation of the normal distribution used to initialize the weights (0.001 to 0.200), (5) the momentum schedule defined as the number of epochs to linearly increase the momentum from 0.50 to 0.99 (50 to 1500), and (6) the minibatch size (500 to 8500). The number of training epochs was fixed to 1500. In our experience, a good set of hyperparameters was generally found in approximately 2 days, where experiments were run on a single GPU (Nvidia GTX Titan). The selected sets of hyperparameters are shown in Table S2. There is a large range of acceptable values for the number of hidden units in the second layer.

Table S2. The hyperparameters selected to train the deep neural network. Some are listed in ranges to reflect the variations from the different folds as well as hyperparameters from the top performing runs within a given fold.

Hyperparameter              Selected
Hidden Units (layer 1)      450-650
Hidden Units (layer 2)      4500-6000
Hidden Units (layer 3)      400-600
L1 Regularization           0-0.05
Learning Rate (LMH code)    1.40-1.50
Learning Rate (DNI code)    1.80-2.00
Momentum Rate               1250
Minibatch Size              1500
Weight Initialization       0.05-0.09
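A sketch of the cross-validation protocol described above, with random search standing in for spearmint's Gaussian-process search (the spearmint API is not reproduced here); sample_hparams and train_eval_fn are hypothetical callables supplied by the user, and train_eval_fn is assumed to return the summed LMH+DNI validation AUC.

import numpy as np

def cv_hyperparameter_search(n_examples, sample_hparams, train_eval_fn,
                             n_trials=20, n_folds=5, seed=0):
    # Split exons into folds; for each held-out test fold, pick the
    # hyperparameters with the best validation score, retrain on
    # train+validation, and report the test score.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), n_folds)
    results = []
    for t in range(n_folds):
        v = (t + 1) % n_folds  # rotate the validation fold
        train = np.concatenate([f for i, f in enumerate(folds) if i not in (t, v)])
        best = max((sample_hparams(rng) for _ in range(n_trials)),
                   key=lambda hp: train_eval_fn(train, folds[v], hp))
        results.append((best, train_eval_fn(np.concatenate([train, folds[v]]),
                                            folds[t], best)))
    return results

# toy usage: the "validation AUC" is a dummy function of the learning rate
def sample_hparams(rng):
    return {"lr": rng.uniform(0.1, 2.0), "h1": int(rng.integers(30, 9000))}
def train_eval_fn(train_idx, eval_idx, hp):
    return 1.0 - abs(hp["lr"] - 1.45)
print(cv_hyperparameter_search(100, sample_hparams, train_eval_fn)[0][0])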

Page 18: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Experiments

• Environment
  – implemented in Python with Gnumpy [Tieleman, 2010]
  – run on an Nvidia GTX Titan GPU

• Data
  – splicing patterns of 11,019 exons, derived from RNA-Seq data of five mouse tissues [Brawand et al., 2011]

Supplementary Information for Deep learning of the tissue-regulated splicing code

Michael K.K. Leung, Hui Yuan Xiong, Leo J. Lee, and Brendan J. Frey

Engineering and Medicine, University of Toronto, 10 King's College Road, Toronto, ON, Canada M5S 3G4

S1 Dataset Description

The dataset consists of 11,019 mouse alternative exons in five tissue types profiled from RNA-Seq data prepared by Brawand et al. (2011). As explained in the main text, a distribution of percent-spliced-in (PSI) was estimated for each exon and tissue. From this distribution, three real values were calculated by summing the probability mass over equally split intervals of 0 to 0.33 (low), 0.33 to 0.66 (medium), and 0.66 to 1 (high). They represent the probability that the given exon within a tissue type has a PSI value ranging from these intervals, hence are soft assignments into each category. The models were trained using these soft labels. Table S1 shows the distribution of exons in each category, counted by selecting the label with the largest value.

Table S1. The number of exons classified as low, medium, and high for each mouse tissue. Exons with large tissue variability (TV) are displayed in a separate column. The proportion of medium-category exons that have large tissue variability is higher than for the other two categories.

         Brain        Heart        Kidney       Liver        Testis
         All    TV    All    TV    All    TV    All    TV    All    TV
Low      1782   579   1191   460   1287   528   1001   413   1216   452
Medium   669    456   384    330   345    294   254    220   346    270
High     5229   1068  4060   919   4357   941   3606   757   4161   887
Total    7680   2103  5635   1709  5989   1763  4861   1390  5723   1609

Page 19: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Results: comparison with prior work

• LMH code (all exons)  • LMH code (high tissue variability)

Since the BNN consists only of the PSI prediction for each tissue separately at the output (Xiong et al., 2011), for the BNN to make tissue difference predictions in the same manner as the DNI code, we used a MLR on the predicted outputs for each tissue pair. For a fair comparison, we similarly trained a MLR on the LMH outputs of the DNN to make DNI predictions, and report that result separately. In both cases, the inputs to the MLR are the LMH predictions for two tissues as well as their logarithm. Schematics of the BNN and MLR architectures can be found in Supplementary Figures S3 and S4.

3 RESULTS

We present three sets of results that compare the test performance of the BNN, DNN and MLR for splicing pattern prediction. The first is the PSI prediction from the LMH code tested on all exons. The second is the PSI prediction evaluated only on targets where there are large variations across tissues for a given exon. These are events where ΔPSI ≥ 0.15 for at least one pair of tissues, to evaluate the tissue specificity of the model. The third result shows how well the code can classify ΔPSI between the five tissue types. Hyperparameter tuning was used in all methods. The averaged predictions from all partitions and folds are used to evaluate the model's performance on their corresponding test dataset. Similar to training, we tested on exons and tissues that have at least 10 junction reads.

For the LMH code, as the same prediction target can be generated by different input configurations, and there are two LMH outputs, we compute the predictions for all input combinations containing the particular tissue and average them into a single prediction for testing. To assess the stability of the LMH predictions, we calculated the percentage of instances in which there is a prediction from one tissue input configuration that does not agree with another tissue input configuration in terms of class membership, for all exons and tissues. Of all predictions, 91.0% agreed with each other, 4.2% have predictions that are in adjacent classes (i.e. low and medium, or medium and high), and 4.8% otherwise. Of those predictions that agreed with each other, 85.9% correspond to the correct class label on test data, 51.2% for the predictions with adjacent classes and 53.8% for the remaining predictions. This information can be used to assess the confidence of the predicted class labels. Note that predictions spanning adjacent classes may be indicative that the PSI value is somewhere between the two classes, and the above analysis using hard class labels can underestimate the confidence of the model.

3.1 Performance comparison

Table 1a reports AUC_LMH_All for PSI predictions from the LMH code on all tissues and exons. The performance of the DNN in the low and high categories is comparable with the BNN, but excels at the medium category, with especially large gains in brain, heart and kidney. Because a large portion of the exons exhibit low tissue variability (Section 1 of Supplementary Material), evaluating the performance of the model on all exons may mask the performance gain of the DNN. This assumes that exons with high tissue variability are more difficult to predict, where a computational model must learn how AS interprets genomic features differently in different cellular environments. To more carefully see the tissue specificity of the different methods, Table 1b reports AUC_LMH_TV evaluated on the

subset of events that exhibit large tissue variability. Here, the DNN significantly outperforms the BNN in all categories and tissues. The improvement in tissue specificity is evident from the large gains in the medium category, where exons are more likely to have large tissue variability. In both comparisons, the MLR performed poorly compared with both the BNN and DNN. Next, we look at how well the different methods can predict

"PSI between two tissues, where it must determine the directionof change. This is shown in Table 2. As described above, "PSIpredictions for the BNN were made by training a MLR classifieron the LMH outputs (BNN-MLR). To make the comparisonfair, we included the performance of the DNN in making "PSIpredictions by also using a MLR classifier (DNN-MLR) on theLMH outputs. Finally, we evaluated the "PSI predictions dir-ectly from the DNI code, as well as the MLR baseline method,where the inputs include the tissue types.

Table 1. Comparison of the LMH code's AUC performance across methods

(a) AUC_LMH_All

Tissue   Method   Low         Medium      High
Brain    MLR      81.3 ± 0.1  72.4 ± 0.3  81.5 ± 0.1
         BNN      89.2 ± 0.4  75.2 ± 0.3  88.0 ± 0.4
         DNN      89.3 ± 0.5  79.4 ± 0.9  88.3 ± 0.6
Heart    MLR      84.6 ± 0.1  73.1 ± 0.3  83.6 ± 0.1
         BNN      91.1 ± 0.3  74.7 ± 0.3  89.5 ± 0.2
         DNN      90.7 ± 0.6  79.7 ± 1.2  89.4 ± 1.1
Kidney   MLR      86.7 ± 0.1  75.6 ± 0.2  86.3 ± 0.1
         BNN      92.5 ± 0.4  78.3 ± 0.4  91.6 ± 0.4
         DNN      91.9 ± 0.6  82.6 ± 1.1  91.2 ± 0.9
Liver    MLR      86.5 ± 0.2  75.6 ± 0.2  86.5 ± 0.1
         BNN      92.7 ± 0.3  77.9 ± 0.6  92.3 ± 0.5
         DNN      92.2 ± 0.5  80.5 ± 1.0  91.1 ± 0.8
Testis   MLR      85.6 ± 0.1  72.3 ± 0.4  85.2 ± 0.1
         BNN      91.1 ± 0.3  75.5 ± 0.6  90.4 ± 0.3
         DNN      90.7 ± 0.6  76.6 ± 0.7  89.7 ± 0.7

(b) AUC_LMH_TV

Tissue   Method   Low         Medium      High
Brain    MLR      71.1 ± 0.2  58.8 ± 0.2  70.8 ± 0.1
         BNN      77.9 ± 0.5  61.1 ± 0.5  76.5 ± 0.7
         DNN      82.8 ± 1.0  69.5 ± 1.1  81.1 ± 0.4
Heart    MLR      73.9 ± 0.3  58.6 ± 0.4  72.7 ± 0.1
         BNN      78.1 ± 0.3  58.9 ± 0.3  75.7 ± 0.3
         DNN      82.0 ± 1.1  67.4 ± 1.3  79.7 ± 1.2
Kidney   MLR      79.7 ± 0.3  64.3 ± 0.2  79.4 ± 0.2
         BNN      83.9 ± 0.5  66.4 ± 0.5  83.3 ± 0.6
         DNN      86.2 ± 0.6  73.2 ± 1.3  85.3 ± 1.2
Liver    MLR      80.1 ± 0.5  63.7 ± 0.3  79.4 ± 0.3
         BNN      84.9 ± 0.7  65.4 ± 0.7  84.4 ± 0.7
         DNN      87.7 ± 0.6  69.4 ± 1.2  84.8 ± 0.8
Testis   MLR      77.3 ± 0.2  60.8 ± 0.3  77.0 ± 0.1
         BNN      81.1 ± 0.5  63.9 ± 0.9  81.0 ± 0.5
         DNN      84.6 ± 1.1  67.8 ± 0.9  83.5 ± 0.9

Notes: ± indicates 1 standard deviation; top performances were shown in bold in the original.
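For reference, a per-class AUC like those in Table 1 can be computed one-vs-rest from the model's class scores; below is a minimal rank-based (Mann-Whitney) implementation on toy data, where the hard label is taken as the class with the largest soft target. The paper's exact evaluation protocol may differ in detail.

import numpy as np

def auc(scores, positives):
    # Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.
    scores = np.asarray(scores, dtype=float)
    positives = np.asarray(positives, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = positives.sum(), (~positives).sum()
    return (ranks[positives].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y_soft = rng.dirichlet(np.ones(3), size=500)     # toy LMH soft targets
h = y_soft + rng.normal(0.0, 0.2, y_soft.shape)  # toy model scores
hard = y_soft.argmax(axis=1)
for k, name in enumerate(("low", "medium", "high")):
    print(name, auc(h[:, k], hard == k))         # one-vs-rest AUC per class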



BNN: Bayesian NN [Xiong et al., 2011];  MLR: Multinomial Logistic Regression

Page 20: ISMB2014 Reading Group Intro + Deep learning of the tissue-regulated splicing code

Model of the prior work

S3 Model Architectures

Fig. S3. Architecture of the Bayesian neural network (Xiong et al., 2011) used for comparison, where low-medium-high predictions are made separately for each tissue.

Fig. S4. Input and output configuration for training a multinomial logistic regression classifier to utilize the outputs of the low-medium-high code to make tissue difference predictions.

[Figure content: Fig. S3 — genomic features feed a network whose outputs are separate Low/Medium/High (LMH) units for each of the five tissues. Fig. S4 — the log-transformed LMH outputs, log(L), log(M) and log(H), for a pair of tissues i and j are the inputs to a multinomial logistic regression whose outputs are Decrease / No change / Increase (the tissue difference code).]

Page 21: ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code

Results: comparison with prior work

• DNI code
  – {B,D}NN-MLR (see the sketch below):
    • The {B,D}NN outputs the LMH code
    • An MLR that takes the LMH code as input predicts the DNI code
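A minimal sketch of this two-stage pipeline, under assumed interfaces (not the authors' code) and with random stand-in data:

```python
# Sketch of the {B,D}NN-MLR step: a multinomial logistic regression maps the
# log-transformed LMH outputs of two tissues to the DNI code
# (decrease / no change / increase), as in Fig. S4. All data are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
lmh_i = rng.dirichlet(np.ones(3), size=500)    # P(L/M/H) for tissue i
lmh_j = rng.dirichlet(np.ones(3), size=500)    # P(L/M/H) for tissue j
X = np.log(np.hstack([lmh_i, lmh_j]))          # log(L), log(M), log(H) x 2
y = rng.integers(0, 3, size=500)               # stand-in DNI labels

mlr = LogisticRegression(max_iter=1000).fit(X, y)   # softmax over 3 classes
print(mlr.predict_proba(X[:3]))   # P(decrease / no change / increase)
```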

Table 2a shows the AUC_DvI for classifying decrease versus increase inclusion for all pairs of tissues. Both the DNN-MLR and DNN outperform the BNN-MLR by a good margin. Comparing the DNN with DNN-MLR, the DNN shows some gain in differentiating brain and heart AS patterns from other tissues. The performance of differentiating the remaining tissues (kidney, liver and testis) from each other is similar between the DNN and DNN-MLR. We note that the similarity between the DNN and DNN-MLR in terms of performance can be due to the use of soft labels for training. Using MLR directly on the genomic features and tissue types performs rather poorly, with predictions no better than random.

The models are further evaluated on predicting whether there is a difference in splicing patterns for all tissues, without specifying the direction. AUC_Change is computed on all exons and tissue pairs; this is shown in Table 2b. The results indicate that this is a less demanding task, as the models can potentially use just the genomic features to determine whether an exon will have tissue variability. The difference in performance between the methods is smaller than for AUC_DvI. However, as the evaluation is over all pairs of tissues, the DNN, which has access to the tissue types in the input, does significantly better. Although this is also true for the MLR, it still performed worst overall. This suggests that in the proposed architecture, where tissue types are given as an input, the MLR lacks the capacity to learn a representation that can jointly use tissue types and genomic features to make tissue-specific predictions. Both results from Table 2 show that there is an advantage to learning a DNI code rather than just learning the LMH code.

To test whether the predictions generalize to RNA-Seq data from a different experiment, we selected data for two mouse tissues, namely the brain and the heart, from (Barbosa-Morais et al., 2012), and analyzed how our model, which is trained with data from (Brawand et al., 2011), performs. Table 3 shows the set of evaluations on the DNN identical to that of Tables 1 and 2, tested on this RNA-Seq data. For the brain, there is a ~1–4% decrease in AUC_LMH_All and a ~4–5% decrease in AUC_LMH_TV. For the heart, the model's performance on the two datasets is equivalent to within 1 standard deviation for both AUC_LMH_All and AUC_LMH_TV. A decrease in performance of ~7% is observed in AUC_DvI for brain versus heart. There is an increase in AUC_Change, but that is owing to only two tissues being evaluated as opposed to five, where the AUC would be pulled down by the other tissues with lower performances if they were present.

Overall, the decrease in performance is not unexpected, owing to differences in PSI estimates from variations in the experimental setup. To see how PSI differed, we computed the expected PSI values for brain and heart in all exons from both sets of experiments, and evaluated their Pearson correlation. For the brain, the correlation is 0.945, and for the heart, it is 0.974. This can explain why there is a larger decrease in performance for brain, which is a particularly heterogeneous tissue, and hence can vary more between experiments depending on how the samples were prepared. We note that the performance of the DNN on this dataset is still better than the BNN's predictions on the original dataset. Viewed as a whole, the results indicate that our model can indeed be useful for splicing pattern predictions for PSI estimates computed from other datasets. It also shows that our RNA-Seq processing pipeline is consistent.
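A small illustrative sketch of this consistency check, with synthetic stand-in PSI values (`psi_a`, `psi_b` are hypothetical, not the paper's data):

```python
# Pearson correlation between expected PSI estimates for the same exons
# computed from two different RNA-Seq experiments.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
psi_a = rng.uniform(0.0, 1.0, size=2000)                       # experiment A
psi_b = np.clip(psi_a + rng.normal(0.0, 0.05, 2000), 0, 1)     # experiment B
r, _ = pearsonr(psi_a, psi_b)
print(f"Pearson r = {r:.3f}")   # the paper reports 0.945 (brain), 0.974 (heart)
```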

Table 2. Comparison of the DNI code's performance in terms of the AUC for decrease versus increase (AUC_DvI) and change versus no change (AUC_Change)

(a) AUC_DvI

Tissue pair        MLR         BNN-MLR     DNN-MLR     DNN
Brain vs Heart     50.3 ± 0.2  65.3 ± 0.3  77.9 ± 0.1  79.4 ± 0.7
Brain vs Kidney    48.8 ± 0.8  73.7 ± 0.2  83.0 ± 0.1  83.3 ± 0.8
Brain vs Liver     48.3 ± 1.1  69.1 ± 0.4  81.6 ± 0.1  82.5 ± 0.6
Brain vs Testis    51.2 ± 0.5  72.9 ± 0.5  82.3 ± 0.2  82.9 ± 0.7
Heart vs Kidney    50.0 ± 1.5  72.6 ± 0.3  82.4 ± 0.1  86.1 ± 1.0
Heart vs Liver     47.8 ± 1.7  66.7 ± 0.4  81.3 ± 0.1  85.1 ± 1.1
Heart vs Testis    51.1 ± 0.5  68.3 ± 0.7  82.4 ± 0.1  84.8 ± 0.8
Kidney vs Liver    49.4 ± 0.8  54.7 ± 0.6  76.8 ± 0.5  76.2 ± 1.0
Kidney vs Testis   51.9 ± 0.5  65.0 ± 0.8  79.9 ± 0.2  82.5 ± 1.0
Liver vs Testis    51.3 ± 0.6  65.0 ± 0.9  79.1 ± 0.1  81.8 ± 1.3

(b) AUC_Change

Method    Change vs No change
MLR       74.7 ± 0.1
BNN-MLR   76.6 ± 0.8
DNN-MLR   79.9 ± 0.8
DNN       86.5 ± 1.0

Note: ± indicates 1 standard deviation; top performances are shown in bold in the original.

Table 3. Performance of the DNN evaluated on a different RNA-Seq experiment

(a) AUC_LMH_All

Tissue  Low         Medium      High
Brain   88.1 ± 0.5  76.1 ± 1.0  87.0 ± 0.6
Heart   90.7 ± 0.5  78.4 ± 1.3  89.0 ± 1.0

(b) AUC_LMH_TV

Tissue  Low         Medium      High
Brain   79.1 ± 0.9  66.1 ± 1.0  77.6 ± 0.8
Heart   82.6 ± 1.0  65.3 ± 1.2  78.8 ± 1.1

(c) AUC_DvI

Method    Brain versus Heart
DNN-MLR   72.9 ± 0.1
DNN       74.2 ± 1.5

(d) AUC_Change

Method    Change versus No change
DNN-MLR   81.7 ± 1.0
DNN       91.9 ± 0.7


Page 22: ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code

Results: important features

Table 4 lists the top features ranked by backpropagated signal magnitude (which indicates that these features need to change the least to affect the prediction the most, and are hence important; note also that all of our features are normalized). The table also indicates general trends in the direction of change for each feature over the dataset. If more than 5% of the examples do not follow the general direction of change, it is indicated by both an up and a down arrow. Some of the splicing rules inferred by the model can be seen. For example, the presence of splicing enhancers promotes the splicing of the alternative exon, leading to higher inclusion; a shorter alternative exon is more likely to be spliced out; and the strength and position of acceptor and donor sites can lead to different splicing patterns.

Next, we wanted to see how features are used in a tissue-specific manner. Using the set of exons with high tissue variability, we computed the backpropagation signal to the inputs, with the output targets changed in the same manner as above, for each tissue separately. Figure 3 shows the sum of the magnitudes of the gradient, normalized by the number of examples in each tissue, for the top 50 features. We can observe that the sensitivity of the model's predictions to each feature differs between tissues. The profiles for kidney and liver tend to be more similar to each other than to the others, which agrees well with the model's weaker performance in differentiating these two tissues. This figure also provides a view of how genomic features are differentially used by the DNN, modulated by the input tissue types. In both Table 4 and Figure 3, the backpropagation signals were computed on examples from the test set, for all five partitions and folds.
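To make the backpropagation-based analysis concrete, here is an illustrative sketch using a toy one-hidden-layer network with random weights (an assumption for illustration, not the authors' model): the gradient of an output unit with respect to the inputs is computed via the chain rule, and features are ranked by their mean absolute gradient.

```python
# Illustrative input-gradient feature analysis on a toy network.
import numpy as np

rng = np.random.default_rng(3)
n, d, h = 256, 30, 16
X = rng.normal(size=(n, d))                    # normalized genomic features
W1, b1 = rng.normal(size=(d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)) * 0.1, np.zeros(1)

# Forward pass with tanh hidden units.
H = np.tanh(X @ W1 + b1)
y = H @ W2 + b2                                # e.g., a "high inclusion" unit

# Backpropagate the output signal to the inputs: dy/dX.
dH = (1 - H**2) * W2.T                         # chain rule through tanh
grad_X = dH @ W1.T                             # (n, d) input gradients

importance = np.abs(grad_X).mean(axis=0)       # mean |gradient| per feature
print("most sensitive feature indices:", np.argsort(importance)[::-1][:5])
```

Features with large mean |gradient| need to change the least to move the output, which is the sense of "important" used in the passage above.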

4 CONCLUSIONS

In this work, we introduced a computational model that extends the previous splicing code with new prediction targets and improved tissue-specificity, using a learning algorithm that scales well with the volume of data and the number of hidden variables. The approach is based on DNNs, which can be trained rapidly with the aid of GPUs, thereby allowing the models to have a large set of parameters and deal with complex relationships present in the data. We demonstrated that deep architectures can be beneficial even with a sparse biological dataset. We further described how the input features can be analyzed in terms of their influence on the model's predictions.

Fig. 2. Plot of the change in AUC_LMH_All obtained by substituting the values in each feature group by their median. Feature groups that are more important to the predictive performance of the model have lower values. The groups are sorted by the mean over multiple partitions and folds, with the standard deviations shown. The number of features in each feature group is indicated in brackets.
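A minimal sketch of this median-substitution ablation, using a hypothetical sklearn-style model and made-up feature groups (not the authors' feature set):

```python
# Median-substitution importance: replace each feature group's values with
# the group's median and measure the drop in AUC; larger drops indicate
# more important groups.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def group_importance(model, X, y, groups):
    """`groups` maps a name to a list of column indices."""
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    scores = {}
    for name, cols in groups.items():
        X_sub = X.copy()
        X_sub[:, cols] = np.median(X[:, cols], axis=0)   # ablate the group
        scores[name] = roc_auc_score(y, model.predict_proba(X_sub)[:, 1])
    return base, scores

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.1 * X[:, 3] + rng.normal(0, 0.5, 500) > 0).astype(int)
model = LogisticRegression().fit(X, y)
groups = {"g1": [0, 1], "g2": [2, 3], "g3": [4, 5]}      # hypothetical groups
base, scores = group_importance(model, X, y, groups)
print(base, scores)   # g1 should drop the AUC the most on this toy data
```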

Table 4. The top 25 features (unordered) of the splicing code that describe low and high percent inclusion

Feature description                                           Low   High
Strength of the I1 acceptor site                              ↓     ↑
Strength of the I2 donor site                                 ↓     ↑
Strength of the I1 donor site                                 ↑     ↓
Mean conservation score of first 100 bases in 3' end of I1    ↑↓    ↑↓
Mean conservation score of first 100 bases in 5' end of I2    ↑↓    ↑↓
Counts of Burge's exonic splicing enhancer in A               ↓     ↑
Counts of Chasin's exonic splicing enhancer in A              ↓     ↑
Log base 10 length of exon A                                  ↓     ↑
Log base 10 length ratio between A and I2                     ↓     ↑
Whether exon A introduces frame shift                         ↑↓    ↑↓
Predicted nucleosome positioning in 3' end of A               ↑↓    ↑↓
Frequency of AGG in exon A                                    ↑     ↓
Frequency of CAA in exon A                                    ↓     ↑
Frequency of CGA in exon A                                    ↓     ↑
Frequency of TAG in exon A                                    ↑     ↓
Frequency of TCG in exon A                                    ↓     ↑
Frequency of TTA in exon A                                    ↑     ↓
Translatability of C1-A                                       ↓     ↑
Translatability of C1-A-C2                                    ↓     ↑
Translatability of C1-C2                                      ↑     ↓
Counts of Yeo's 'GTAAC' motif cluster in 5' end of I2         ↓     ↑
Counts of Yeo's 'TGAGT' motif cluster in 5' end of I2         ↓     ↑
Counts of Yeo's 'GTAGG' motif cluster in 5' end of I2         ↓     ↑
Counts of Yeo's 'GTGAG' motif cluster in 5' end of I2         ↓     ↑
Counts of Yeo's 'GTAAG' motif cluster in 5' end of I2         ↓     ↑

Note: The direction of the arrows indicates that a feature's value should in general be increased (↑) or decreased (↓) to change the PSI predictions to low or high. Feature details can be found in Section 4 of the Supplementary Material.


Page 23: ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code

Summary

• Developed a method that predicts splicing patterns with high accuracy using DNNs.

• Showed that, with an appropriate training procedure, DNNs can learn even from sparse data.

Page 24: ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code

Impressions

• Why was this paper accepted at ISMB?
  – It uses Deep Learning, which is currently in fashion.
  – The problem setting itself is a long-standing one, but it was solved well with a state-of-the-art method.
  – The transfer-learning-like model that learns multiple outputs jointly may be novel.

Page 25: ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code

Impressions (cont.)

• Will DNNs catch on in bioinformatics?
  – In fields that have already been studied exhaustively, the improvements fall short of expectations (e.g., some areas of natural language processing).
  – The number of parameters inevitably grows, so a fair amount of data is required. ⇒ omics measurement technologies
  – At the same time, the computational cost becomes enormous. ⇒ GPGPU
  – There are few implementations that biology-oriented researchers can use casually. ⇒ Python with Theano
  – Hyperparameter selection is laborious. ⇒ only those with time on their hands can attempt it.
  ⇒ It probably will not become as popular as SVMs?