
    Bayesian Modelling of Multivariate Quantitative Traits Using Seemingly Unrelated Regressions

    Claudio J. Verzilli,1* Nigel Stallard,2 and John C. Whittaker1

    1 Department of Epidemiology and Public Health, Imperial College London, London, United Kingdom
    2 Medical and Pharmaceutical Statistics Research Unit, The University of Reading, Reading, United Kingdom

    We investigate a Bayesian approach to modelling the statistical association between markers at multiple loci and multivariate quantitative traits. In particular, we describe the use of Bayesian Seemingly Unrelated Regressions (SUR) whereby genotypes at the different loci are allowed to have non-simultaneous effects on the phenotypes considered, with residuals from each regression assumed correlated. We present results from simulations showing that, under rather general conditions that are likely to hold in real situations, the Bayesian SUR approach has increased probability of selecting the true model compared to univariate analyses. Finally, we apply our methods to data from subjects genotyped for 12 SNPs in the apolipoprotein E (APOE) gene. Phenotypes relate to response to treatment with atorvastatin and include changes in total cholesterol, low-density lipoprotein cholesterol, and triglycerides. Missing genotype data are naturally accommodated in our Bayesian framework by imputing them using a nested haplotype phasing algorithm. Genet. Epidemiol. 28:313–325, 2005. © 2005 Wiley-Liss, Inc.

    Key words: pharmacogenetics; multiple traits; Markov chain Monte Carlo; Bayesian methods

    Contract grant sponsor: Wellcome Trust; Contract grant number: GR 068213.
    *Correspondence to: Dr. Claudio J. Verzilli, Department of Epidemiology and Public Health, Imperial College London, St Mary's Campus, Norfolk Place, London W2 1PG, UK. E-mail: [email protected]
    Received 27 September 2004; Accepted 3 January 2005
    Published online 23 March 2005 in Wiley InterScience (www.interscience.wiley.com)
    DOI: 10.1002/gepi.20072

    INTRODUCTION

    The focus of much recent research in human genetics has been on how to exploit the wealth of information brought about by the numerous genome sequence variation projects, important corollaries of the international Human Genome Project. The availability of ever-improving marker maps offers great promise of successfully employing association approaches to find susceptibility genes for complex traits. As the most abundant source of DNA variation, single nucleotide polymorphisms (SNPs) are arguably the most commonly used genetic marker. Extensive libraries containing hundreds of thousands of SNPs across the human genome are being compiled [International SNP Map Working Group, 2001; Thorisson and Stein, 2003; International HapMap Consortium, 2003] and made available on numerous web-based databases [Brookes, 2001; Hirikawa, 2002; Smigielski et al., 2000; Klein and Altman, 2004]. At the same time, the statistical challenges that the analysis of this large amount of data poses can be formidable. One of the main difficulties relates to the fact that we can expect only small-to-moderate effects of individual genes or interactions thereof on one or more complex traits of interest. In pharmacogenetics studies, for instance, any association between drug response and individual genetic variants might be influenced, among other factors, by variation in gene expression levels, post-translational modification of proteins, and drug dose [McCarthy and Hilfiker, 2000]. Thus, in general, the number of subjects needed to detect as statistically significant any association between gene SNPs and a complex trait at commonly used levels of power can be very large. These difficulties are compounded by the low heterozygosity of SNPs as opposed to, for instance, microsatellite markers, the low minor allele frequencies that lead to data sparseness, and the large number of hypotheses tested that need to be adjusted for if the risk of finding false-positive associations is not to be increased.

    Another important issue is that any association could be the result of the variants being in tight linkage disequilibrium (LD) with a causative, unassayed, polymorphic site rather than being directly involved in the etiological pathway. In such cases, haplotypic or multilocus effects may be non-negligible and an analysis based on haplotypes or models with fully-saturated genotypic effects may have more power to detect associations than a single-point, SNP-based approach [Drysdale et al., 2000; Subrahmanyan et al., 2001].

    Statistical methods for the analysis of multiple SNPs in their association to complex traits have received the attention of many authors [Hoh and Ott, 2003]. Nelson et al. [2001] propose the Combinatorial Partitioning Method to find patterns or partitions of multi-locus genotypes that minimise the within-partition variability in the (quantitative) trait of interest while maximising the variability across partitions. They point out, however, that their method is valid only for exploratory purposes and further research is needed to ascertain its power and coverage of nominal levels of significance [Moore et al., 2002]. Focusing on variants at different sites within a small genetic region, Cordell and Clayton [2002] propose a stepwise regression approach to identify the relative importance of genotype effects at polymorphic sites on a binary trait. Their models allow testing of the statistical significance of additive and dominance effects of SNPs at each site as well as all possible two-way, inter-loci interactions. Furthermore, if the phase of the genotype data is known, haplotype effects can also be tested, accounting for the possibility mentioned earlier that any association might be the result of the typed SNPs being in tight LD with causative, untyped polymorphisms.

    The methods cited, as do the majority of those reported in the literature, focus on the association between genetic variants at multiple loci and a single trait. In many cases, however, data on more than one phenotype are collected. This is, however, seldom fully exploited in any analysis. In this article, we propose a Bayesian version of the Seemingly Unrelated Regression (SUR) method [Zellner, 1962; Denison et al., 2002], which is used to model the association between a set of multivariate quantitative traits and unphased SNP genotype data. In the SUR model, the SNP alleles at the different loci are allowed to have possibly different effects on each of the phenotypes considered, with residuals from each regression assumed to be correlated. Therefore, different sets of SNPs can be selected for each trait. In simulation studies, we show that the Bayesian multivariate SUR approach leads to higher posterior probability being attached to the true generating model as opposed to Bayesian univariate analyses of the same data, especially for small sample sizes. Thus, the SUR model is more successful at detecting weak main and/or interaction effects associated with polymorphisms compared to univariate methods. Furthermore, missing genotype data are naturally accommodated in our Bayesian framework by treating them as augmented data and sampling from their conditional posterior distribution.

    We apply our approach to data from subjects genotyped for 12 SNPs in the apolipoprotein E (APOE) gene, including the two SNPs determining the common E2, E3 and E4 alleles [Rall et al., 1982]. The aim was to relate the 12 polymorphisms to response to treatment with atorvastatin, a hypolipidemic pharmacological agent, over a 52-week period. Among the outcomes measured were changes in total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), and triglycerides (TG) since the start of the trial. In a previous study, these have been shown to be associated with the three common alleles in men but not women, with E2 lowering cholesterol levels and E4 raising them [Pedro-Botet et al., 2001].

    The article is structured as follows. The next section describes the proposed method while introducing the notation used throughout. Results from simulation studies are presented, followed by the application to genotype data in the APOE region. We end with a discussion of the advantages and disadvantages of the proposed approach.

    METHODS

    SEEMINGLY UNRELATED REGRESSIONS MODEL

    Before presenting the Bayesian formulation of the Seemingly Unrelated Regressions (SUR) model, we illustrate the parameterisation used for the genotype model and present the generic expression for a system of SUR. Let $y_1, y_2, \ldots, y_M$ denote measurements on $M$ continuous phenotypes taken on $n$ subjects, with $y_m$ an $n \times 1$ vector, $m = 1, \ldots, M$. The same subjects are genotyped at a set of $L$ SNP loci. Let $q_l$ and $Q_l$ indicate the alleles present at locus $l$, $l = 1, \ldots, L$, so that the genotype at this locus may be $q_l q_l$, $q_l Q_l$ or $Q_l Q_l$. A linear model can be used to describe the statistical association between genotypes and phenotypes. A saturated model would include distinct terms for the genotypic effects at each locus and all multilocus interaction effects. In the simple case of just two loci, a saturated model contains nine parameters corresponding to the nine possible two-locus genotypes.

    An alternative parameterisation of the genotypic effects, standard in quantitative genetics and adopted here, models the additive and dominance effects of the genotypes at the different loci. For generic subject $i$ and locus $l$, the additive and dominance effects are specified by introducing additional variables $x_{ila}$ and $x_{ild}$, say, coded $-1, 0, 1$ and $-0.5, 0.5, -0.5$, respectively, for the three possible genotypes $q_{il}q_{il}$, $q_{il}Q_{il}$, and $Q_{il}Q_{il}$. The coding scheme for additive term $x_{la}$ implies that the effect of being homozygous for allele $Q_l$ is twice that of being homozygous for $q_l$, whereas the coefficient for dominance term $x_{ld}$ measures any deviation from the additive assumption. Thus, for generic trait $m$ and subject $i$, a two-locus model with additive and dominance effects at the first locus and only additive effects at the second, plus two-way interactions between these main effects, can be written as

    $$y_{mi} = \beta^{m}_{0} + \beta^{m}_{1a} x_{i1a} + \beta^{m}_{1d} x_{i1d} + \beta^{m}_{2a} x_{i2a} + \beta^{m}_{12aa} x_{i1a} x_{i2a} + \beta^{m}_{12da} x_{i1d} x_{i2a} + \epsilon_{mi} \tag{1}$$

    where $\epsilon_{mi}$ is a zero-mean normally distributed random variable, $\epsilon_{mi} \sim N(0, \sigma^2_m)$. The advantage of this parameterisation in designed experiments is that it leads to orthogonal contrasts so that parameter estimates do not depend on the current complexity of the model, allowing for testing of non-nested models. However, as noted by Cordell and Clayton [2002], this property is unlikely to apply to population-based studies. This means that, in the latter case, any model comparison should be carried out between nested models, so that terms describing dominance effects at each locus should be included only if the corresponding additive terms appear in the model, and interaction terms can only involve loci with main additive or dominance effects already present in the model (as in expression (1)). We will return to this point later in the article when describing our Bayesian model search strategy. To simplify the notation, we indicate with $X_m$ the $n \times (K_m + 1)$ design matrix coding for all additive, dominance and interaction effects, with $K_m$ representing the number of terms in the model for phenotype $m$ excluding the intercept term, $m = 1, \ldots, M$, so that $y_m = X_m \beta_m + \epsilon_m$ with $\beta_m = (\beta^{m}_{0}, \ldots, \beta^{m}_{K_m})'$ a $(K_m + 1) \times 1$ vector of coefficients and $\epsilon_m = (\epsilon_{m1}, \epsilon_{m2}, \ldots, \epsilon_{mn})'$ an $n \times 1$ vector of zero-mean random error terms. Then, a system of SUR can be written as [Zellner, 1962]

    $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_M \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_M \end{bmatrix}. \tag{2}$$

    We can write (2) more compactly as

    $$y = X\beta + \epsilon \tag{3}$$

    with $y = (y_1', y_2', \ldots, y_M')'$, $\beta = (\beta_1', \beta_2', \ldots, \beta_M')'$, $\epsilon = (\epsilon_1', \epsilon_2', \ldots, \epsilon_M')'$, and $X$ an $N \times K$ block-diagonal matrix, $N = nM$, $K = \sum_m K_m$. The vector of random error terms $\epsilon$ in (2) and (3) is assumed to be zero-mean normally distributed with variance-covariance matrix given by

    $$\Sigma = \Sigma_\epsilon \otimes I$$

    where $I$ is an $n \times n$ identity matrix and $\Sigma_\epsilon$ an $M \times M$ matrix with generic entry $\sigma_{mm'} = E(\epsilon_{mi}\epsilon_{m'i})$ for $m, m' = 1, \ldots, M$, $i = 1, \ldots, n$. In this framework, expression (2) allows for a differential effect of SNP genotypes on phenotypes as well as the possibility that some loci might be associated with some of the quantitative traits modelled but not all of them. Notice that the SUR model is equivalent to the univariate, single-equation approach when $\Sigma$ is diagonal, while the standard multivariate regression model is a special case of (2) corresponding to having $X_1 = X_2 = \cdots = X_M$.
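
    The construction described above (genotype coding, the design row of expression (1), the block-diagonal matrix in (2), and the Kronecker-structured covariance) can be sketched in a few lines. This is only an illustration using the Python standard library: the function names, the toy genotypes and the 2 x 2 residual covariance are hypothetical, and the coding -1, 0, 1 / -0.5, 0.5, -0.5 is the standard additive/dominance coding assumed here.

```python
def code_genotype(n_copies_Q):
    """Additive and dominance codes for a genotype carrying 0, 1 or 2 copies of Q."""
    if n_copies_Q not in (0, 1, 2):
        raise ValueError("genotype must carry 0, 1 or 2 copies of allele Q")
    additive = n_copies_Q - 1                      # qq -> -1, qQ -> 0, QQ -> 1
    dominance = 0.5 if n_copies_Q == 1 else -0.5   # heterozygote vs homozygotes
    return additive, dominance


def design_row_eq1(g1, g2):
    """Row of X_m for the two-locus model of expression (1)."""
    x1a, x1d = code_genotype(g1)
    x2a, _ = code_genotype(g2)
    return [1.0, x1a, x1d, x2a, x1a * x2a, x1d * x2a]


def block_diag(blocks):
    """Place matrices (lists of rows) along the diagonal, zeros elsewhere."""
    n_cols = sum(len(b[0]) for b in blocks)
    out, offset = [], 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            out.append([0.0] * offset + list(row) + [0.0] * (n_cols - offset - width))
        offset += width
    return out


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


# Hypothetical toy data: copies of Q at (locus 1, locus 2) for n = 3 subjects.
genotypes = [(0, 2), (1, 1), (2, 0)]
X1 = [design_row_eq1(g1, g2) for g1, g2 in genotypes]       # trait 1: model (1)
X2 = [[1.0, code_genotype(g1)[0]] for g1, _ in genotypes]   # trait 2: additive locus 1 only
X = block_diag([X1, X2])                                    # design matrix of (2)

sigma_eps = [[2.0, 0.5], [0.5, 1.0]]                        # hypothetical M x M covariance
identity = [[float(i == j) for j in range(3)] for i in range(3)]
Sigma = kron(sigma_eps, identity)                           # covariance of stacked residuals
```

    Because the traits may use different design matrices, $X$ has a different number of columns in each diagonal block, which is what distinguishes the SUR system from standard multivariate regression.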

    In the frequentist setting, there are several ways of estimating parameters in (2); for a recent review see Foschi et al. [2003]. The statistical significance of genotypic effects can then be assessed using, for instance, a stepwise procedure based on likelihood ratios. Our approach is fully Bayesian and uses a Reversible Jump (RJ) algorithm to obtain samples from the conditional posterior distribution of models given the data. The posterior probabilities thus obtained are then used to compare the fit of different models via Bayes factors. There are two main advantages with this approach. First, variable selection using the Bayesian paradigm allows us to make probabilistic statements about the relative plausibility of different models. There is also some evidence showing that the Bayesian approach achieves better coverage of the nominal levels of significance than a frequentist stepwise selection of important predictors [Viallefont et al., 2001]. Second, the Bayesian framework naturally handles incomplete genotype data by treating missing data as extra parameters and sampling from their conditional posterior distribution. In particular, in the application to the APOE data, we describe a nested haplotype phasing algorithm that is used for this purpose, assuming that genotype data are missing at random in the sense of Rubin [1976]. In this way, the extra uncertainty due to data incompleteness is taken into account, while making a fuller use of the available data.

    BAYESIAN SUR

    Before proceeding, we need to introduce further notation. We indicate with $\mathcal{M} = \{M_1, M_2, \ldots, M_{full}\}$ the set of all possible models for a given maximum degree of interaction between polymorphic loci. For example, in the case of two loci, a model for the full genotypic effects on trait $m$ can be written as

    $$y_{mi} = \beta^{m}_{0} + \sum_{l=1,2} \sum_{j=a,d} \beta^{m}_{lj} x_{ilj} + \sum_{j=a,d} \sum_{j'=a,d} \beta^{m}_{12jj'} x_{i1j} x_{i2j'} + \epsilon_{mi}. \tag{4}$$

    Next, consider a random vector $\theta$ that completely specifies which genotype effects characterize each model. That is, $\theta$ is a random vector of varying dimension and contains the indexes of the columns of the matrix $X_{full}$ that specify the various nested models, with $\theta_{M_{full}} = 1$ a vector with all entries one. Then, given the data $D$, comparison between any two models $M_0$ and $M_1$ can be based on the Bayes factor

    $$BF_{10} = \frac{p(M_1 \mid D)}{p(M_0 \mid D)} \Big/ \frac{p(M_1)}{p(M_0)} = \frac{p(\theta_{M_1} \mid D)}{p(\theta_{M_0} \mid D)} \Big/ \frac{p(\theta_{M_1})}{p(\theta_{M_0})} \tag{5}$$

    where, for instance, $\theta_{M_0}$ is the value of $\theta$ that corresponds to a null model $M_0$. In the next subsection, we outline a hybrid sampling strategy used to obtain samples from $p(\theta_M \mid D)$, which are then used to approximate (5), and discuss the choice of prior distributions for parameters $\beta$, $\Sigma$, and $\theta$. A more detailed description of the algorithm used is given in the Appendix.
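
    As a small illustration of how expression (5) might be approximated from sampler output, the sketch below divides the posterior odds, estimated from how often each model was visited, by the prior odds. The function name and the counts and prior values in the example are hypothetical.

```python
def bayes_factor_from_counts(visits_m1, visits_m0, prior_m1, prior_m0):
    """Approximate BF_10 in (5): posterior odds (from visit counts) over prior odds."""
    if visits_m0 == 0:
        raise ValueError("model M0 was never visited; posterior odds are undefined")
    if prior_m1 <= 0 or prior_m0 <= 0:
        raise ValueError("both models need positive prior probability")
    posterior_odds = visits_m1 / visits_m0   # the number of draws T cancels out
    prior_odds = prior_m1 / prior_m0
    return posterior_odds / prior_odds


# Hypothetical example: M1 visited 6000 times and M0 1000 times,
# with prior probabilities 0.25 and 0.5 respectively.
bf_10 = bayes_factor_from_counts(6000, 1000, 0.25, 0.5)
```

    Dividing by the prior odds matters here because the constrained model space prior does not give all models equal prior probability.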

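    PRIOR DISTRIBUTIONS AND POSTERIOR SAMPLING

    The joint distribution of the phenotypes and model parameters can be written as

    $$p(Y, \beta, \Sigma, \theta; X) = p(Y \mid \beta, \Sigma, \theta; X)\, p(\beta, \Sigma, \theta)$$

    where, from (3), the likelihood on the right-hand side is

    $$p(Y \mid \beta, \Sigma, \theta; X) = |\Sigma|^{-M/2} (2\pi)^{-N/2} \exp\{-\tfrac{1}{2} (Y - X\beta)' \Sigma^{-1} (Y - X\beta)\}.$$

    Following Denison et al. [2002], we assume independent prior distributions for the vector of regression coefficients $\beta$ and the precision matrix $\Sigma^{-1}$, with $\beta \sim N(0, \Omega^{-1})$, $\Sigma^{-1} \sim Wi(a, S)$, and partition the joint prior as

    $$p(\beta, \Sigma^{-1}, \theta) = p(\beta \mid \theta)\, p(\Sigma^{-1})\, p(\theta) = N(0, \Omega^{-1})\, Wi(a, S)\, p(\theta) \tag{6}$$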

    where $Wi(a, S)$ is a Wishart distribution with parameters $a$, a scalar, and $S$, an $M \times M$ positive-definite matrix. For the reasons mentioned in the previous section, we adopt a model space prior $p(\theta)$ that imposes certain constraints on the terms entering each model [Chipman, 1996]. Namely, dominance terms are only allowed if the corresponding additive terms are already in the model. Similarly, interactions between loci may be included only if the corresponding main effects are present in the model. Thus, in the case of two loci in expression (4), the model space prior factors as

    $$p(\theta) = p(\theta_0)\, p(\theta_{1a})\, p(\theta_{1d} \mid \theta_{1a})\, p(\theta_{2a})\, p(\theta_{2d} \mid \theta_{2a}) \prod_{j=a,d} \prod_{j'=a,d} p(\theta_{12jj'} \mid \theta_{1j}, \theta_{2j'})$$

    where $p(\theta_0) = 1$ and

    $$p(\theta_{12jj'} = 1 \mid \theta_{1j}, \theta_{2j'}) = \begin{cases} 0.5 & \text{if } (\theta_{1j}, \theta_{2j'}) = (1, 1) \\ 0 & \text{otherwise.} \end{cases}$$

    A similar dependence applies to $p(\theta_{ld} \mid \theta_{la})$, $l = 1, 2$, for the dominance terms.

    The independent priors assumption (6) makes it easy to sample from the relevant conditional posterior distributions. In particular, a hybrid sampling strategy is used that alternates an RJ step for updating $\theta$ with draws from full conditional distributions for $\beta$ and $\Sigma^{-1}$. Specifically, to update $\theta$, a move is proposed with equal probability to modify the current model by adding an explanatory variable, deleting an explanatory variable, or replacing a term that is currently included in the model with a new one. All these moves have to satisfy the restriction that the resulting model nests or is nested in the current one, for the reasons mentioned above. Any move from $\theta$ to $\theta'$ is then accepted with probability

    $$\min\left\{1, \frac{p(y \mid \Sigma, \theta')}{p(y \mid \Sigma, \theta)} \times \frac{p(\theta')}{p(\theta)} \times \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}\right\}$$

    where $p(y \mid \Sigma, \theta)$ is the partial marginal likelihood (integrating over the vector of coefficients $\beta$) and $q(\cdot \mid \cdot)$ is a proposal distribution that is non-symmetric due to the constraints imposed on the model space.
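
    The hierarchical model space prior and the acceptance probability above can be sketched as follows for the two-locus case. This is a minimal illustration, not the sampler used in the paper: the dictionary keys, the function names, and the prior inclusion probability of 0.5 for the additive terms are assumptions (the text fixes 0.5 only for terms whose parent terms are present), and the partial marginal likelihoods are taken as given inputs on the log scale.

```python
import math


def model_prior(theta, p_include=0.5):
    """Prior p(theta) for the two-locus model (4) under the nesting constraints.

    theta maps the terms '1a', '1d', '2a', '2d', '12aa', '12ad', '12da', '12dd'
    to 0 or 1. The intercept is always included, so p(theta_0) = 1.
    """
    prob = 1.0
    for locus in ("1", "2"):
        additive, dominance = theta[locus + "a"], theta[locus + "d"]
        # Additive term: prior inclusion probability p_include (an assumption).
        prob *= p_include if additive else 1.0 - p_include
        # Dominance term allowed only when the additive term is present.
        p_dom = p_include if additive else 0.0
        prob *= p_dom if dominance else 1.0 - p_dom
    for j in "ad":
        for k in "ad":
            parents_in = theta["1" + j] == 1 and theta["2" + k] == 1
            p_int = p_include if parents_in else 0.0
            prob *= p_int if theta["12" + j + k] else 1.0 - p_int
    return prob


def acceptance_probability(log_ml_new, log_ml_old, prior_new, prior_old,
                           q_reverse, q_forward):
    """min(1, marginal-likelihood ratio x prior ratio x proposal ratio)."""
    if prior_new == 0.0:
        return 0.0   # proposed model violates the nesting constraints
    log_ratio = (log_ml_new - log_ml_old
                 + math.log(prior_new / prior_old)
                 + math.log(q_reverse / q_forward))
    return math.exp(min(0.0, log_ratio))


TERMS = ("1a", "1d", "2a", "2d", "12aa", "12ad", "12da", "12dd")
null_model = dict.fromkeys(TERMS, 0)
additive_locus1 = dict(null_model, **{"1a": 1})
invalid_model = dict(null_model, **{"1d": 1})   # dominance without additive term
```

    Working with log marginal likelihoods avoids numerical underflow when the two models fit the data very differently.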

    RESULTS

    SIMULATION STUDIES

    In this section, we present results from simulation studies comparing the Bayesian SUR approach with univariate Bayesian analyses in terms of posterior probabilities of selecting the genotype effects that are known to be non-zero. Since the generating model is known, the performance of the two approaches can thus be compared objectively.

    Various scenarios are considered that differ in the size of the generated sample and degree of correlation among residuals. In all cases, the total number of loci considered is fixed at 12 and we assume that we are interested in their potential association with three quantitative traits. SNP genotype data were generated by sampling with replacement from the APOE data set mentioned above and described more thoroughly in the next section. Under the first scenario, data are generated from the SUR model

    $$y_1 = 10 + 1.1 x_{1a} + 0.9 x_{3a} + \epsilon_1 \tag{7}$$

    $$y_2 = 10 + 0.9 x_{1a} + 0.95 x_{5a} + 0.75 x_{5d} + \epsilon_2 \tag{8}$$

    $$y_3 = 10 + 0.9 x_{4a} + 0.8 x_{8a} + 0.8 x_{12a} + \epsilon_3 \tag{9}$$

    so that, for instance, we have an additive effect of locus 1 on $Y_1$ and $Y_2$ and additive and dominance effects of locus 5 on $Y_2$. Residuals are normally distributed $N(0, \Sigma_\epsilon)$ and strongly positively correlated with

    $$\Sigma_\epsilon = \begin{pmatrix} 2 & 0.92 & 0.8 \\ & 2 & 0.88 \\ & & 2 \end{pmatrix}. \tag{10}$$
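
    A rough sketch of how one subject could be simulated under this first scenario is given below. Several ingredients are assumptions rather than details taken from the paper: the coefficient signs in (7) to (9) are taken as positive, the entries of the residual covariance matrix are as read from expression (10), and genotypes are drawn under Hardy-Weinberg proportions with a hypothetical allele frequency instead of being resampled from the APOE data.

```python
import math
import random

# Residual covariance as read from expression (10); treat the entries as an assumption.
SIGMA_EPS = [[2.0, 0.92, 0.8],
             [0.92, 2.0, 0.88],
             [0.8, 0.88, 2.0]]


def cholesky(a):
    """Lower-triangular factor L of a positive-definite matrix, so that a = L L'."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_subject(rng, chol, allele_freq=0.3, n_loci=12):
    """Draw genotypes at 12 loci and three correlated traits for one subject."""
    # Copies of allele Q per locus under Hardy-Weinberg (hypothetical frequency).
    copies = [sum(rng.random() < allele_freq for _ in range(2)) for _ in range(n_loci)]
    xa = [c - 1 for c in copies]                       # additive coding -1, 0, 1
    xd = [0.5 if c == 1 else -0.5 for c in copies]     # dominance coding
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    eps = [sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
    # Loci are 1-based in the text and 0-based here; positive signs are assumed.
    y1 = 10 + 1.1 * xa[0] + 0.9 * xa[2] + eps[0]
    y2 = 10 + 0.9 * xa[0] + 0.95 * xa[4] + 0.75 * xd[4] + eps[1]
    y3 = 10 + 0.9 * xa[3] + 0.8 * xa[7] + 0.8 * xa[11] + eps[2]
    return copies, (y1, y2, y3)


rng = random.Random(2005)
chol = cholesky(SIGMA_EPS)
data = [simulate_subject(rng, chol) for _ in range(100)]   # one data set of size 100
```

    Multiplying independent standard normal draws by the Cholesky factor is one standard way of obtaining residuals with the required covariance.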

    Table I shows quartiles of posterior probabilities associated with the true model for the two methods considered. In order to compare the univariate and SUR approaches, for each of the three outcomes, these probabilities are estimated as the proportion of times the relevant subvector $\tilde{\theta}_m$, corresponding to (7)–(9), appears in $T$ draws from $p(\theta \mid D)$, that is

    $$p(\tilde{\theta}_m \mid D) \approx \frac{1}{T} \sum_{t=1}^{T} I(\theta^{t}_{m} = \tilde{\theta}_m) \tag{11}$$

    where $I(\cdot)$ is the indicator function, which is one if its argument is true, $m = 1, 2, 3$. Results corresponding to different sizes of the generated data sets are reported where, for each combination of method and size of data set, quartiles are constructed from 200 independent replications of the Markov chain. Each chain was run for 200,000 iterations and a sample of $T = 10{,}000$ draws was retained every 10 steps after a burn-in run of 100,000 iterations. The latter was deemed sufficient from inspection of the Markov chains for the elements of $\Sigma$. For the SUR model we set $\Omega = 0.01I$, $S = 0.01I$ and $a = 0.01$, whereas in the univariate case, the only change involves the prior on the single precision (inverse variance) term for the residuals, which is Gamma(0.01, 0.01).
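
    The estimator in (11) and the bookkeeping of the chain settings can be sketched as below. The helper names and the toy draws are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the text does not say how the quartiles were computed.

```python
import statistics


def posterior_model_probability(draws, target):
    """Estimate (11): fraction of retained draws whose sub-vector equals the target."""
    target = tuple(target)
    return sum(tuple(d) == target for d in draws) / len(draws)


def retained_draws(total_iterations, burn_in, thin):
    """Number of draws kept when thinning the post-burn-in part of the chain."""
    return (total_iterations - burn_in) // thin


# Settings quoted in the text: 200,000 iterations, burn-in 100,000, every 10th kept.
T = retained_draws(200_000, 100_000, 10)

# Hypothetical indicator sub-vectors for one trait over four retained draws.
toy_draws = [(1, 0, 1), (1, 0, 1), (0, 0, 1), (1, 0, 1)]
prob_true = posterior_model_probability(toy_draws, (1, 0, 1))

# Quartiles across replications (hypothetical values standing in for 200 runs).
replicate_probs = [0.48, 0.55, 0.68, 0.70, 0.79, 0.81]
q1, q2, q3 = statistics.quantiles(replicate_probs, n=4, method="inclusive")
```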

    As can be seen from Table I, for each of the three phenotypes considered, the Bayesian SUR model leads to higher median probabilities being attached to the loci that are truly causative. This advantage becomes more apparent for smaller sizes of the generated data sets. Interestingly, because of the low heterozygosity at loci 8 and 12 in the real data from which the simulated data are sampled, the univariate analysis for $y_3$ fails completely to select the true model for sample size $N = 100$ and does very poorly even for $N = 300$, whereas the SUR model leads to a higher probability of selecting the true model.

    TABLE I. Quartiles of posterior probabilities associated with the true model for different sizes of data sets^a

    Size of data set             100                   300                   500
                           Q1     Q2     Q3      Q1     Q2     Q3      Q1     Q2     Q3
    Univariate analyses
      y1                  0.03   0.09   0.32    0.59   0.74   0.80    0.67   0.76   0.82
      y2                  0.03   0.05   0.17    0.69   0.77   0.84    0.71   0.80   0.87
      y3                  0.00   0.00   0.00    0.12   0.39   0.60    0.49   0.66   0.77
    SUR model
      y1                  0.48   0.68   0.79    0.84   0.89   0.92    0.83   0.90   0.94
      y2                  0.57   0.81   0.89    0.92   0.93   0.95    0.93   0.95   0.97
      y3                  0.09   0.36   0.54    0.67   0.81   0.88    0.82   0.86   0.89

    ^a Data are simulated from the model given by expressions (7)–(9) and (10). Values refer to 200 replications.

    The univariate and SUR methods were also contrasted in terms of Bayes factors comparing the true model versus the most frequently visited model among the rest, using (11) to approximate (5). This gives a measure of the confidence with which the true model is chosen in each case. Results are shown in Table II. It can be seen that, for small values of $N$, it is very difficult to successfully select the true model using a univariate analysis of the data. This is in contrast with the SUR approach, which, at least for the first two phenotypes, convincingly favours the correct model over any other model.

    Finally, we also considered quartiles of the number of models visited, which showed that the SUR approach tends to favour the model generating the data more decisively than the univariate approach, thus leading to a smaller probability of accepting a model other than the true one (data not shown).

    The advantage of the multivariate SUR approach in selecting the true causative loci decreases as the inter-residual correlations get smaller, as shown under our second simulation scenario. Here, the true effects are still given by expressions (7)–(9) while the matrix $\Sigma_\epsilon$ of correlation between residuals is now

    $$\Sigma_\epsilon = \begin{pmatrix} 2 & 0.5 & 0.5 \\ & 2 & 0.45 \\ & & 2 \end{pmatrix}. \tag{12}$$

    Results are shown in Tables III and IV. The two methods perform rather poorly for $N = 100$, while the multivariate SUR model has higher probability of selecting the causative loci for $y_3$ compared to the univariate analysis for $N = 300$. As in the previous simulation setting, we found that the number of models visited is smaller under the SUR approach than under the univariate analyses, although the differences are not as large as those under scenario 1. Finally, the two approaches yield the same results apart from simulation noise when decreasing the true inter-residual correlations further (results not shown).

    TABLE II. Quartiles of Bayes factors comparing the true model versus the most probable among the rest of the models visited for different sizes of data sets^a

    Size of data set              100                     300                     500
                           Q1      Q2      Q3      Q1      Q2      Q3      Q1      Q2      Q3
    Univariate analyses
      y1                   0.6     1.0     6.9     2.7     6.4    20.7     2.1    18.0    20.3
      y2                   1.5     4.0     8.5    33.3    81.5   118.2    16.0    99.5   151.5
      y3                   0.0     0.1     0.6     1.0     2.9    13.6     3.6    10.7    22.3
    SUR model
      y1                   1.9     7.4    19.0     6.0    10.1    22.1    12.8    56.1   146.2
      y2                  13.2    67.0   194.5    30.4   145.9   298.7    85.7   250.4   461.0
      y3                   1.0     2.8     8.6     3.4    18.2    64.9    10.2    19.9    35.4

    ^a Data are simulated from the model given by expressions (7)–(9) and (10). Values refer to 200 replications.

    TABLE III. Quartiles of posterior probabilities associated with the true model for different sizes of data sets^a

    Size of data set             100                   300                   500
                           Q1     Q2     Q3      Q1     Q2     Q3      Q1     Q2     Q3
    Univariate analyses
      y1                  0.17   0.29   0.48    0.59   0.71   0.77    0.62   0.73   0.75
      y2                  0.02   0.10   0.33    0.65   0.73   0.81    0.49   0.81   0.83
      y3                  0.00   0.00   0.01    0.05   0.33   0.57    0.07   0.52   0.57
    SUR model
      y1                  0.25   0.42   0.65    0.65   0.74   0.81    0.67   0.75   0.78
      y2                  0.06   0.18   0.47    0.72   0.81   0.85    0.78   0.85   0.89
      y3                  0.00   0.00   0.01    0.38   0.57   0.68    0.31   0.61   0.74

    ^a Data are simulated from the model given by expressions (7)–(9) and (12). Values refer to 200 replications.

    APOE GENOTYPE DATA

    In this section, the Bayesian SUR approach is applied to data from 327 subjects genotyped for 12 SNPs in the APOE locus. There is a wealth of literature reporting associations between variants in the APOE region and metabolic regulation of cholesterol as well as Alzheimer's disease that mainly focus on the three common variants (for reviews see Eichner et al. [2002] and Tanzi and Bertram [2001], whereas recent findings on the molecular mechanisms underlying the cholesterol-AD connection are reviewed in Puglielli et al. [2003]). However, some authors have recently argued that there could be important substructures within the E2, E3, and E4 alleles and further investigation is needed in order to characterise the full extent of allelic heterogeneity in the APOE gene [Fullerton et al., 2000; Nickerson et al., 2000]. The objective of the study reported here was to investigate any association between the polymorphic sites and response to treatment with atorvastatin at 10 mg/day for 52 weeks in terms of changes in total cholesterol (TC), low density lipoprotein cholesterol (LDL-C), and triglycerides (TG). In the analysis, we use logarithm-transformed triglycerides values as the empirical distribution of the untransformed values was highly skewed. Data on non-genetic variables were also available, namely each patient's age, sex, and body mass index, and were included in our analysis allowing for interactions with additive genetic effects only. For some patients, genotype data at some loci are missing. In this setting, the missing data are likely to be missing at random [Rubin, 1976]; that is, for any subject, the probability of not observing his or her genotype data at any locus does not depend on the true, unobserved genotype at that locus. Therefore, the missing data mechanism could be ignored without affecting the correctness of the results, though at some cost in efficiency. However, we make a fuller use of the available data by devising an imputation scheme that exploits the genetic nature of the data at hand. In particular, we use a nested haplotype phasing algorithm to impute the missing genotype data. Various authors have recently proposed Bayesian approaches to haplotype reconstruction from population data [Stephens et al., 2001; Niu et al., 2002; Lin et al., 2002]. A review of these methods is given in Stephens and Donnelly [2003]. Our approach uses the "naïve Gibbs sampler" described in appendix A of Stephens et al. [2001] with the adaptation for missing genotype data given in Lin et al. [2002]. This entails adding an extra step to the algorithm outlined in the previous section. Informally, given current values of $\beta$, $\Sigma$, and $\theta$, for a generic subject with missing genotype data, if $H_i$ are his/her current reconstructed haplotypes compatible with his/her multilocus genotype, newly sampled haplotypes $H_i'$ are accepted with probability given by

    $$\min\left\{1, \frac{p(y \mid X(H_i'), \beta, \Sigma, \theta)}{p(y \mid X(H_i), \beta, \Sigma, \theta)}\right\}. \tag{13}$$

    TABLE IV. Quartiles of Bayes factors comparing the true model versus the most probable among the rest of the models visited for different sizes of data sets^a

    Size of data set              100                     300                     500
                           Q1      Q2      Q3      Q1      Q2      Q3      Q1      Q2      Q3
    Univariate analyses
      y1                   1.0     1.75    8.5     2.2     7.1    19.4     2.1     8.0    21.4
      y2                   3.2     4.0    23.0    19.0    47.5   109.6    27.5    98.4   104.3
      y3                   0.0     0.2     1.5     1.0     3.7    13.0     2.0     4.4     6.0
    SUR model
      y1                   2.1     5.1    12.4     4.0     8.9    23.5     3.1    10.4    22.2
      y2                   4.0     6.2    25.3     9.2    64.4   159.7   130.9   156.6   232.3
      y3                   0.0     0.5     2.5     2.6     9.8    19       5.1     5.6     8.8

    ^a Data are simulated from (7)–(9) and (12). Values refer to 200 replications.

    In expression (13), we use $X(H_i)$ to highlight the fact that the reconstructed haplotypes uniquely determine the subject's missing genotypes or the entries of the covariate matrix $X$. Thus, although our model is a model for genotypic effects, we use the nested haplotype phasing as a convenient way of imputing the missing data. A detailed description of this additional step is given in the Appendix.
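
    The accept/reject step in expression (13) can be sketched as follows. This is an illustration only: the function names are hypothetical, haplotypes are represented as lists of 0/1 alleles (1 standing for allele Q), missing genotypes are marked with `None`, and the log-likelihoods under the current and proposed haplotypes are assumed to be computed elsewhere.

```python
import math
import random


def haplotypes_to_genotype(hap1, hap2):
    """Copies of allele Q at each locus implied by a pair of haplotypes."""
    return [a + b for a, b in zip(hap1, hap2)]


def compatible(hap1, hap2, observed):
    """True if the haplotype pair agrees with every observed (non-missing) genotype."""
    return all(obs is None or a + b == obs
               for a, b, obs in zip(hap1, hap2, observed))


def accept_haplotypes(log_lik_current, log_lik_proposed, rng=random):
    """Accept the proposed pair with probability min(1, likelihood ratio), as in (13)."""
    accept_prob = math.exp(min(0.0, log_lik_proposed - log_lik_current))
    return rng.random() < accept_prob, accept_prob


# Hypothetical subject with the genotype at the second of three loci missing.
observed = [1, None, 1]
proposed = ([0, 1, 1], [1, 1, 0])
if compatible(*proposed, observed):
    imputed_genotype = haplotypes_to_genotype(*proposed)   # fills in the missing locus
```

    Only the missing loci can change between proposals, so the likelihood ratio depends on the data solely through the rows of $X$ belonging to that subject.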

    The results presented refer to a sample of 20,000 models obtained after a burn-in run of 20,000 iterations with a total chain length of 500,000 iterations. Figure 1 shows, for each of the three outcomes considered, the probability that the genotype effects at each locus are different from zero conditional on the observed data, $\Pr(\beta_j \neq 0 \mid D)$, estimated as the proportion of sampled models containing the corresponding terms. For each of the 12 loci considered, the additive and dominance effects are plotted side-by-side. Also shown are analogous probabilities for the non-genetic predictors age, BMI and sex. Similar plots for two-locus interaction terms did not indicate any important effects and are not presented. None of the polymorphisms considered has any important effect on all three outcomes simultaneously, apart from locus L650, where the probability of effect is small, and L4870, the latter on TC and LDL-C only. Notice that L4870 is one of the polymorphisms involved in the APOE major alleles. Among the non-genetic effects considered, sex is an important predictor of changes in LDL-C.
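
    The inclusion probabilities plotted in Figure 1 are simple proportions over the sampled models, which can be sketched as below. The function name and the four toy models are hypothetical; the term labels merely imitate the naming used in the paper.

```python
from collections import Counter


def inclusion_probabilities(sampled_models):
    """Proportion of sampled models containing each term, estimating Pr(beta_j != 0 | D)."""
    counts = Counter(term for model in sampled_models for term in set(model))
    total = len(sampled_models)
    return {term: n / total for term, n in counts.items()}


# Hypothetical sample of four models for one trait (the empty set is the null model).
toy_models = [
    {"L4870a"},
    set(),
    {"L4870a", "sex"},
    {"L650a", "L650d"},
]
probs = inclusion_probabilities(toy_models)
```

    Terms that never appear in a sampled model are simply absent from the result, which corresponds to an estimated inclusion probability of zero.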

    Figure 2 shows posterior densities of regression coefficients for those predictors that appear to be important in Figure 1. These were obtained from the subset of sampled models that contained the corresponding terms (shown as proportions of the total by the heavily shaded parts of the horizontal bars above each graph). Males and subjects homozygous for the mutant allele at locus L4870 show a smaller decrease in TC and LDL-C over the 52-week treatment period compared to females and heterozygous or homozygous wild-type subjects. Also, we report in Table V those models with posterior frequencies greater than or equal to 5%: for changes in triglycerides and TC, the null model containing just the intercept term is the first and second most frequently visited one, respectively.

    Table VI shows, for each of the three phenotypes considered, the Bayes factors for the composite hypothesis of any genetic effect being different from zero versus the null hypothesis of no genetic effects. Overall, conditional on the observed data, there is no evidence of any genetic effects on the outcomes considered. We also report Bayes factors testing the hypothesis of a non-zero effect at locus L4870 (which is one of the loci involved in the APOE major alleles), which again show no real evidence for important effects in this data set. With regard to these findings, it should be noted that, despite the consistent association of the APOE locus with LDL-C concentrations in the general population, results from lipid pharmacogenetic studies are less clear-cut [Ordovas, 2004].

    [Figure 1: three panels (Total Cholesterol, Triglycerides, LDL-Cholesterol); horizontal axis: probability of a non-zero effect, ticks at 0.0, 0.4 and 0.8; rows: L650, L718, L1405, L1494, L1557, L1765, L2096, L2931, L4870, L5008, L7327, L7379, BMI, Age, Gender.]

    Fig. 1. Posterior probabilities of non-zero effects of predictors on each of the three phenotypes considered. For genetic effects, additive and dominance terms at each locus are shown side-by-side (additive on the left).


    Finally, the posterior densities of the inter-residual correlations are shown in Figure 3. Whereas the correlations between regression residuals for TC and triglycerides and for triglycerides and LDL-C appear to be modest, there is strong positive correlation between the residuals of the TC and LDL-C regressions.

    DISCUSSION

    In this article, we have described the use of Bayesian Seemingly Unrelated Regressions to model multivariate quantitative traits in genetic association studies. In simulation studies, the

    [Figure 2: three density panels, "TC: additive (L4870)" (horizontal axis 1.5 to 3.0), "LDL: additive (L4870)" (3.0 to 4.0) and "LDL: Gender" (-1.9 to -1.5); vertical axis: Density; a horizontal bar scaled from 0 to 1 sits above each panel.]

    Fig. 2. Posterior densities of additive effects at locus L4870 on changes in TC and LDL-C and gender effect (0 = male, 1 = female) on changes in LDL-C. These were estimated from the subset of sampled models that included the corresponding terms (shown as proportion of the total sampled in the horizontal bars).

    TABLE V. SUR models with posterior probability greater than or equal to 5% (a)

    Trait              Model                             Posterior model probability (%)
    Total cholesterol  L4870a                            23.3
                       Intercept only                    20.3
                       L650a L650d                        6.4
    Triglycerides      Intercept only                    36.2
                       L650a L650d                        8.6
                       L650a                              5.0
    LDL cholesterol    L4870a Gender                     17.8
                       Gender                            14.0
                       L650a L4870a Gender Gender:L650a   6.3
                       L650a Gender Gender:L650a          5.5
                       L650a Gender                       5.0
                       Intercept only                     0.9

    (a) For each of the three phenotypes, marginal probabilities are shown. Values were estimated from a posterior sample of 10,000 models.
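    The posterior model probabilities in Table V are frequencies over the sampled models. A minimal sketch of that tabulation follows; it assumes each draw is recorded as a collection of term labels (an empty collection standing for the intercept-only model), and the toy draws and the 10% threshold are hypothetical, not the study's output.

```python
from collections import Counter


def model_frequencies(sampled_models, threshold=0.05):
    """Return (model, frequency) pairs with frequency >= threshold, most frequent first.

    Each sampled model is an iterable of term labels; term order is ignored.
    """
    counts = Counter(frozenset(model) for model in sampled_models)
    total = sum(counts.values())
    kept = [(model, n / total) for model, n in counts.items() if n / total >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy posterior sample of 20 models (hypothetical).
draws = [("L4870a",)] * 12 + [()] * 7 + [("L650a", "L650d")] * 1
top = model_frequencies(draws, threshold=0.10)
```

    Here `top` lists the L4870a model first and the intercept-only model second; the model visited once falls below the threshold and is dropped.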

    TABLE VI. Bayes factors using the SUR model

    Trait              Any genetic effect vs.   Additive effect at
                       no genetic effects       locus L4870
    Total cholesterol  <0.01                    0.86
    Triglycerides      <0.01                    0.02
    LDL                <0.01                    1.10


    proposed approach leads to increased probability of selecting the true models compared to Bayesian univariate methods, whenever the residuals from the univariate models are highly correlated. The latter circumstance is expected to hold with many real datasets, as it is quite common in association studies to collect data on two or more related quantitative phenotypes. There could, therefore, be advantages in adopting a multivariate approach in such cases whereas, as expected, the univariate and multivariate SUR models give similar results when inter-residual correlations are low.

    The SUR model offers a flexible way of parameterising genotypic effects. In particular, the method is suited to cases where one is not prepared to assume that polymorphic loci will have an effect on all the traits considered simultaneously, this being the underlying assumption in the more common multivariate regression model. At the same time, the SUR model encompasses the multivariate regression model as a particular case as discussed in Methods. The Bayesian formulation then has several advantages. Namely, it is straightforward in this framework to accommodate missing genotype data by iteratively imputing them using, for instance, the nested haplotype phasing step described in the Appendix. It is easy to envisage wider applications of this approach as, for example, in haplotype-based association studies with unphased genotype data. In addition, complex hypotheses can be tested using Bayes factors estimated from the sampled models. Finally, if prior information is available this can be easily incorporated. We may, for instance, modify the probability of inclusion for certain model terms or exclude certain interaction terms if this is corroborated by prior knowledge. The Bayesian SUR model presented here has been implemented as an R package [R Development Core Team, 2004], which is available on request from the first author; the model definition requires no more effort on the part of the user than its univariate counterparts.

    A difficulty with our approach, which is common to all model-based multilocus analysis methods, is how to deal with very many loci. In such cases, a model search strategy based on uniform priors might not be effective in traversing the space of possible models as this can be of high dimension, especially if between-locus interaction terms are to be included. A possible solution would be to modify the ratio of move proposals at each iteration of the Markov chain, by giving more weight to interaction terms between loci whose main effects appear to be important in the current sample of models. A drawback of this strategy is the risk of missing important interaction effects between loci that, individually, have no important main effects.

    The choice of the model space prior distribution requires careful consideration. We have used a uniform prior that incorporates relations between

    [Figure 3: three density panels labelled 12, 13 and 23, with horizontal axes spanning 0.1 to 0.5, 0.84 to 0.92 and -0.1 to 0.3 respectively; vertical axis: Density.]

    Fig. 3. Densities of the posterior samples of correlations among SUR residuals.


    predictors by imposing certain conditional dependences. An alternative would be to consider a uniform prior on the complexity or the number of terms entering each model, with models of equal complexity having the same a priori probability. A related problem is how to deal with potentially high correlation between predictors as a result of SNPs being in tight LD. This may have undesirable consequences on model search as the Markov chain will tend to mix slowly over clusters of similar models, biasing posterior model probabilities away from any good models, if uniform model priors are used. This will not, however, affect model comparisons based on Bayes factors [Chipman et al., 2001].
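    To make the alternative prior on model complexity concrete, the sketch below assigns equal mass to each possible number of terms and splits that mass evenly among models of the same size. It is an illustration only: it assumes an unrestricted model space of `n_terms` candidate terms (the nesting restrictions on interaction terms discussed in the text are ignored), and the function name and the choice of 12 terms are hypothetical.

```python
from math import comb


def complexity_prior(k: int, n_terms: int) -> float:
    """Prior probability of one specific model containing k of n_terms candidate terms.

    The number of terms k is uniform on 0..n_terms, and all comb(n_terms, k)
    models of size k share that mass equally.
    """
    if not 0 <= k <= n_terms:
        raise ValueError("k must lie between 0 and n_terms")
    return 1.0 / ((n_terms + 1) * comb(n_terms, k))


n_terms = 12  # hypothetical number of candidate terms
# Summing over all models of every size should recover total probability 1.
total = sum(comb(n_terms, k) * complexity_prior(k, n_terms) for k in range(n_terms + 1))
```

    Under this prior a single mid-sized model receives far less mass than the null or the full model, in contrast to a prior that is uniform over models.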

    The application of the Bayesian SUR model to the SNP genotype data in the APOE region showed no evidence of genetic effects on any of the three traits considered (Table VI). However, there appears to be a non-zero additive effect on changes in TC and LDL-C at a locus involved in one of the APOE major alleles (L4870) although the corresponding Bayes factor was not large (Table VI). None of the two-way interaction terms between loci and between additive effects at each locus and non-genetic variables had a large posterior probability of being greater than zero. There was, however, a significant gender effect on changes in LDL-C over the treatment period with females showing a sharper decrease.

    ACKNOWLEDGMENTS

    The authors thank Daniel Chasman for helpful comments on the manuscript. C.J.V. was supported by a postdoctoral research grant from the Wellcome Trust (GR 068213).

    REFERENCES

    Brookes AJ. 2001. HGBASE: a unified human SNP database. Trends Genet 17:229.

    Chipman H. 1996. Bayesian variable selection with related predictors. Can J Stat 24:17-36.

    Chipman H, George EI, McCulloch RE. 2001. The practical implementation of Bayesian model selection. In: Lahiri P, editor. Model selection. Lecture notes monographs. Inst Math Stat 38:65-116.

    Cordell HJ, Clayton DG. 2002. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet 70:124-141.

    Denison DGT, Holmes CC, Mallick BK, Smith AFM. 2002. Bayesian methods for non-linear classification and regression. Chichester: John Wiley.

    Drysdale CM, McGraw DW, Stack CB, Stephens JC, Judson RS, Nandabalan K, Arnold K, Ruano G, Liggett SB. 2000. Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci USA 97:10483-10488.

    Eichner JE, Dunn ST, Perveen G, Thompson DM, Stewart KE, Stroehla BC. 2002. Apolipoprotein E polymorphism and cardiovascular disease: a HuGE review. Am J Epidemiol 155:487-495.

    Foschi P, Belsley DA, Kontoghiorghes EJ. 2003. A comparative study of algorithms for solving seemingly unrelated regressions models. Comput Stat Data Anal 44:3-35.

    Fullerton SM, Clark AG, Weiss KM, Nickerson DA, Taylor SL, Stengard JH, Salomaa V, Vartiainen E, Perola M, Boerwinkle E, Sing CF. 2000. Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am J Hum Genet 67:881-900.

    Hirikawa M. 2002. HOWDY: an integrated database system for human genome research. Nucleic Acids Res 30:152-157.

    Hoh J, Ott J. 2003. Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.

    International HapMap Consortium. 2003. The international HapMap project. Nature 426:789-796.

    International SNP Map Working group. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928-933.

    Klein TE, Altman RB. 2004. PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base. Pharmacogenom J 4:1.

    Lin S, Cutler DJ, Zwick ME, Chakravarti A. 2002. Haplotype inference in random population samples. Am J Hum Genet 71:1129-1137.

    McCarthy JJ, Hilfiker R. 2000. The use of single-nucleotide polymorphism maps in pharmacogenomics. Nat Biotechnol 18:505-508.

    Moore JH, Lamb JM, Brown NJ, Vaughan DE. 2002. A comparison of combinatorial partitioning and linear regression for the detection of epistatic effects of the ACE I/D and PAI-1 4G/5G polymorphisms on plasma PAI-1 levels. Clin Genet 62:74-79.

    Nelson MR, Kardia SL, Ferrell RE, Sing CF. 2001. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res 11:458-470.

    Nickerson DA, Taylor SL, Fullerton SM, Weiss KM, Clark AG, Stengard JH, Salomaa V, Boerwinkle E, Sing CF. 2000. Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res 10:1532-1545.

    Niu T, Qin ZS, Xu X, Liu JS. 2002. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 70:157-169.

    Ordovas JM. 2004. Pharmacogenetics of lipid diseases. Hum Genomics 1:111-125.

    Pedro-Botet J, Schaefer EJ, Bakker-Arkema RG, Black DM, Stein EM, Corella D, Ordovas JM. 2001. Apolipoprotein E genotype affects plasma lipid response to atorvastatin in a gender specific manner. Atherosclerosis 158:183-193.

    Puglielli L, Tanzi RE, Kovacs DM. 2003. Alzheimer's disease: the cholesterol connection. Nat Neurosci 6:345-351.

    R Development Core Team. 2004. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

    Rall SC Jr, Weisgraber KH, Mahley RW. 1982. Human apolipoprotein E. The complete amino acid sequence. J Biol Chem 257:4171-4178.


    Rubin DB. 1976. Inference and missing data. Biometrika 63:581-592.

    Smigielski EM, Sirotkin K, Ward M, Sherry ST. 2000. dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res 28:352-355.

    Stephens M, Smith NJ, Donnelly P. 2001. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68:978-989.

    Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162-1169.

    Subrahmanyan L, Eberle MA, Clark AG, Kruglyak L, Nickerson DA. 2001. Sequence variation and linkage disequilibrium in the human T-cell receptor beta (TCRB) locus. Am J Hum Genet 69:381-395.

    Tanzi RE, Bertram L. 2001. New frontiers in Alzheimer's disease genetics. Neuron 32:181-184.

    Thorisson GA, Stein LD. 2003. The SNP Consortium website: past, present and future. Nucleic Acids Res 31:124-127.

    Viallefont V, Raftery AE, Richardson S. 2001. Variable selection and Bayesian model averaging in epidemiological case-control studies. Stat in Med 20:3215-3230.

    Zellner A. 1962. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J Am Stat Soc 57:348-368.

    APPENDIX

    Let $y_1, y_2, \ldots, y_M$ denote measurements on $M$ continuous phenotypes taken on $n$ subjects, with $y_m$ an $n \times 1$ vector, $m = 1, \ldots, M$. Indicate with $G = (G_1, \ldots, G_n)$ the set of multilocus, possibly incomplete genotypes with $G_{il} \in \{q_l q_l, q_l Q_l, Q_l Q_l\}$, $i = 1, \ldots, n$, $l = 1, \ldots, L$. $H = (H_1, \ldots, H_n)$ is the corresponding set of (unknown) haplotype pairs where $H_i = (h_i, h_i')$. Notice that these are treated as latent variables and we sample from their posterior distribution for the purposes of imputing unknown genotype data. We assume independent priors for $\beta$, $\Sigma^{-1}$, $\theta$ and $H$ and partition the joint prior as

    $$p(\beta, \Sigma^{-1}, \theta, H) = p(\beta \mid \theta)\, p(\Sigma^{-1})\, p(\theta)\, p(H) = N(0, \Omega^{-1})\, \mathrm{Wi}(\alpha, S)\, p(\theta)\, \mathrm{Di}(g).$$

    In the expression above, we have assumed a uniform distribution for $\theta$, $\mathrm{Wi}(\alpha, S)$ is a Wishart distribution with parameters $\alpha$, a scalar, and $S$, an $M \times M$ positive-definite matrix, and $\mathrm{Di}(g)$ is a Dirichlet prior on the $2^d = D$ possible haplotype assignments with $d$ the number of segregating sites in the dataset. The choice of model space prior $p(\theta)$ is discussed in Methods. We set $g_1 = g_2 = \cdots = g_D = 1$, which corresponds to a uniform prior density on the unit simplex of dimension $D - 1$ [Denison et al., 2002]; that is, we assume that all haplotype assignments are, a priori, equally probable. This choice of prior for haplotype assignments leads to the simplest version of algorithm 2 in the appendix of Stephens et al. [2001], which has been shown to give phasing accuracy that is similar to that of the EM algorithm. A hybrid algorithm is then used to obtain samples from the target posterior density $p(\beta, \Sigma^{-1}, \theta, H \mid y)$. In particular, starting from a random imputation of missing genotype data using site-specific frequencies, a random guess of $H^0$ and some initial values $\{\beta^0, \Sigma^0, \theta^0\}$, one iterates between the following steps:

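    * Draw new values of $\beta$ from $N(\hat{\beta}, K)$ with $K = (\Omega + X'\Sigma^{-1}X)^{-1}$ and $\hat{\beta} = K X'\Sigma^{-1} y$;

    * Draw a new value of $\Sigma_\varepsilon^{-1}$ from $\mathrm{Wi}(\alpha + \tfrac{1}{2}N,\; S + R'R)$ where $R$ is the $n \times M$ matrix of residuals given by

    $$R = \begin{pmatrix} \varepsilon_{11} & \varepsilon_{12} & \cdots & \varepsilon_{1M} \\ \varepsilon_{21} & \varepsilon_{22} & \cdots & \varepsilon_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon_{n1} & \varepsilon_{n2} & \cdots & \varepsilon_{nM} \end{pmatrix}$$

    and $\varepsilon_{im}$ is the error defined as $\varepsilon_{im} = y_{im} - x_{im}'\beta_m$;

    * Draw a new value of the parameter vector $\theta$, using a reversible Metropolis step. Specifically, with equal probability, a move is proposed to modify the current model by adding an explanatory variable, deleting an explanatory variable or replacing a term that is currently included in the model with a new one. All these moves have to satisfy the restriction that the resulting model nests or is nested in the current one for the reasons mentioned in Methods. Any move from $\theta$ to $\theta'$ is then accepted with probability given by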

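    $$\omega = \min\left\{1,\; \frac{p(y \mid \Sigma, \theta')}{p(y \mid \Sigma, \theta)} \cdot \frac{p(\theta')}{p(\theta)} \cdot \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}\right\} = \min\left\{1,\; \frac{p(y \mid \Sigma, \theta')}{p(y \mid \Sigma, \theta)}\, Z\right\} = \min\left\{1,\; \frac{|\Omega'|^{1/2}\, |K'|^{1/2} \exp(-\tfrac{1}{2} a')}{|\Omega|^{1/2}\, |K|^{1/2} \exp(-\tfrac{1}{2} a)}\, Z\right\}$$

    where in the expressions above, $q(\cdot \mid \cdot)$ is a proposal distribution that is asymmetric because of the restrictions on the model space and

    $$a = \mathrm{tr}(S\Sigma^{-1}) + c + y'\Sigma^{-1}y - \hat{\beta}' K^{-1} \hat{\beta}.$$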

    * Update the haplotype reconstruction of each individual in a random order, using a different order at each iteration of the Markov chain. In particular, for individual $i$, remove his or her current haplotype pair $H_i$ from $H$ and, from the remaining set $H_{-i}$,


    form a list $h = (h_1, \ldots, h_R)$ of distinct haplotypes with corresponding counts $c = (c_1, \ldots, c_R)$ of the number of times they appear in $H_{-i}$. Calculate the vector $p = (p_1, \ldots, p_R)$ where, for $r = 1, \ldots, R$, $p_r = 0$ if no haplotypes can be found in $h$ that are compatible with the observed genotype data for subject $i$, $G_i^{obs}$; $p_r = \sum_{s=1}^{S} (c_r + 1)(c_s + 1) - 1$ if $h_r$ is compatible with $G_i^{obs}$ and there are $S$ possible complement haplotypes in $h$; and $p_r = c_r$ otherwise. Then, if $k$ is the number of sites at which the individual is heterozygous or has missing genotype data, with probability $2^k / (\sum_r c_r + 2^k)$ reconstruct his or her haplotypes completely at random (by randomly choosing the phase at each observed heterozygous site and randomly imputing missing genotypes using locus-specific frequencies). Otherwise, update the haplotype reconstruction $h_{i1}$ to $h_r$ with probability $c_r / \sum_{r'} c_{r'}$. This automatically updates the complement haplotype $h_{i2}$ at observed heterozygous sites. For loci with missing data, if there are $U$ possible complements of the newly proposed haplotype $h_{i1}$ in the list $h$, then choose $h_{i2} = h_u$ with probability $c_u / \sum_{u'}^{U} c_{u'}$. If no complement can be found in $h$, then unknown sites for both $h_{i1}$ and $h_{i2}$ are reconstructed randomly. Finally accept the proposed haplotype pair $H_i'$ with probability

    $$Q = \min\left\{1,\; \frac{p(y \mid X(H_i'), \beta, \Sigma, \theta)}{p(y \mid X(H_i), \beta, \Sigma, \theta)}\right\},$$

    where we highlight the fact that the newly proposed haplotype reconstruction may change the entries of the design matrix $X$ for subject $i$ at loci with missing data. Also, the acceptance ratio above only applies to subjects with missing genotype data, as in all other cases we have $Q = 1$.
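    The conditional draw of the regression coefficients in the first step of the sampler can be sketched as follows. This is an illustration under stated assumptions rather than the authors' R implementation: responses are stacked trait by trait, the design matrix is block diagonal with one block per trait, the residual precision is taken to be the Kronecker product of the inverse M x M residual covariance with an n x n identity, and the function name `draw_beta`, the prior precision of 0.01 and the simulated data are all hypothetical.

```python
import numpy as np


def draw_beta(X_blocks, y_blocks, Sigma, Omega, rng):
    """One draw of the stacked coefficient vector from its Gaussian full conditional.

    X_blocks: list of M design matrices (n x p_m), one per trait.
    y_blocks: list of M response vectors of length n.
    Sigma:    M x M residual covariance matrix.
    Omega:    prior precision matrix for the stacked coefficients.
    """
    n = y_blocks[0].shape[0]
    widths = [Xm.shape[1] for Xm in X_blocks]
    # Block-diagonal design: rows of trait m only touch trait m's coefficients.
    X = np.zeros((n * len(X_blocks), sum(widths)))
    col = 0
    for m, Xm in enumerate(X_blocks):
        X[m * n:(m + 1) * n, col:col + widths[m]] = Xm
        col += widths[m]
    y = np.concatenate(y_blocks)
    # Precision of the stacked residual vector (assumed Kronecker structure).
    W = np.kron(np.linalg.inv(Sigma), np.eye(n))
    K = np.linalg.inv(Omega + X.T @ W @ X)
    K = (K + K.T) / 2.0  # remove numerical asymmetry before sampling
    beta_hat = K @ X.T @ W @ y
    return rng.multivariate_normal(beta_hat, K), beta_hat, K


# Simulated two-trait example with strongly correlated residuals (hypothetical data).
rng = np.random.default_rng(0)
n = 500
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
X1 = np.column_stack([np.ones(n), rng.integers(0, 3, n)])  # intercept + additive genotype code
X2 = np.column_stack([np.ones(n), rng.integers(0, 2, n)])  # intercept + binary covariate
errors = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
y1 = X1 @ np.array([1.0, 0.5]) + errors[:, 0]
y2 = X2 @ np.array([-1.0, 2.0]) + errors[:, 1]
beta_draw, beta_hat, K = draw_beta([X1, X2], [y1, y2], Sigma, 0.01 * np.eye(4), rng)
```

    Because each trait has its own design block, a predictor can enter one regression without entering the others, which is the feature that distinguishes the SUR model from the standard multivariate regression model.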
