hank statisticswithmatlab test&anova 040317

Upload: marko-krstic

Post on 25-Feb-2018


  • 7/25/2019 Hank Statisticswithmatlab Test&Anova 040317

    1/12

    2004/3/17

    [2]

    Assumptions and Test for Normality: Histogram, QQplot; Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test

    Parametric Hypothesis Testing: z-test, t-test, two-sample t-test

    Non-Parametric Hypothesis Testing: Sign test, Wilcoxon signed-rank test, Wilcoxon rank-sum test (Mann-Whitney U test)

    More than two Groups: One-Way Analysis of Variance (ANOVA)

    Microarray Data: Paired data, Unpaired data, Complex data

    [3]

    A hypothesis test is a procedure for determining if an assertion about a characteristic of a population is reasonable.

    For example, suppose that someone says that the average price of a gallon of regular unleaded gas in Massachusetts is $1.15. How would you decide whether this statement is true?

    You could try to find out what every gas station in the state was charging and how many gallons they were selling at that price. That approach might be definitive, but it could end up costing more than the information is worth.

    A simpler approach is to find out the price of gas at a small number of randomly chosen stations around the state and compare the average price to $1.15.

    Of course, the average price you get will probably not be exactly $1.15 due to variability in price from one station to the next.

    Suppose your average price was $1.18. Is this three cent difference a result of chance variability, or is the original assertion incorrect?

    A hypothesis test can provide an answer.
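The gas price question can be sketched as a one-sample z-test, the case the slides list under `ztest` (mean of one sample with known variance). This is a minimal illustration in standard-library Python, not the toolbox implementation. The number of stations (20) and the known standard deviation ($0.04) are assumptions made up for the example; only the $1.15 claim and the $1.18 sample mean come from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test; sigma is the (assumed known) population SD."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical survey: 20 stations, known SD of $0.04 (both values are assumptions).
z, p = z_test(sample_mean=1.18, mu0=1.15, sigma=0.04, n=20)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

With a smaller sample or a larger standard deviation the same three cent difference would give a smaller z and a larger p-value, which is why the difference alone cannot settle the question.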

    [4]


    [13]

    Example H0: the gene is not differentially expressed.

    The test is significant = Reject H0

    False Positive = (Reject H0 | H0 true) = concluding that a gene is differentially expressed when in fact it is not.

    [14]

    Question: What if I do a t-test on a pair of samples and fail to reject the null hypothesis? Does this mean that there is no significant difference?

    Answer: Maybe yes, maybe no.

    For a two-sample t-test, power is the probability of rejecting the hypothesis that the means are equal when they are in fact not equal. Power is one minus the probability of Type-II error.

    The power of the test depends upon the sample size, the magnitudes of the variances, the alpha level, and the actual difference between the two population means.

    Usually you would only consider the power of a test when you failed to reject the null hypothesis.

    High power is desirable (0.7 to 1.0). High power means that there is a high probability of rejecting the null hypothesis when the null hypothesis is false.

    This is a critical measure of precision in hypothesis testing and needs to be considered with care.
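How power depends on sample size, variance, alpha, and the true difference can be sketched with a normal approximation. This is an assumption-laden simplification: it treats the common standard deviation as known and uses the normal rather than the t distribution, so it slightly overstates power for small samples. The function name `approx_power` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test.

    delta: true difference between the population means
    sigma: common standard deviation (assumed known and equal in both groups)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / (sigma * sqrt(2.0 / n_per_group))
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power grows with the number of samples per group (illustrative inputs).
for n in (5, 10, 20, 40):
    print(n, round(approx_power(delta=1.0, sigma=1.0, n_per_group=n), 3))
```

When the true difference is zero the function returns alpha, the false positive rate; one minus the returned value is the approximate Type-II error probability.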

    [15]

    Two measurements are independent if knowing the value of one measurement does not give information about the value of the other.

    For any gene, the measurements of expression in two different patients are independent.

    Replicate measurements from the same patient are not independent (for example, replicate features on an array).

    [16]

    Function     Description
    ranksum      Wilcoxon rank sum test that two populations are identical (unpaired) (Mann-Whitney test)
    signrank     Wilcoxon signed rank test of equality of medians (paired)
    signtest     Sign test for paired samples (paired)
    ttest2       Hypothesis testing for the difference in means of two samples (unpaired)
    ttest        Hypothesis testing for a single sample mean (paired)
    ztest        Hypothesis testing for the mean of one sample with known variance
    kstest2      Kolmogorov-Smirnov test to compare the distribution of two samples
    kstest       Kolmogorov-Smirnov test of the distribution of one sample
    lillietest   Lilliefors test for goodness of fit to a normal distribution
    jbtest       Jarque-Bera test for goodness-of-fit to a normal distribution
    anova1       One-Way Analysis of Variance (ANOVA)


    [17]

    Paired or one-sample t-test (Related samples)

    Unpaired or two-sample t-test (Independent samples)

    [18]

    The gene acetyl-Coenzyme A acetyltransferase 2 (ACAT2) is on the microarray used for the breast cancer data.

    We can use a paired t-test to determine whether or not the gene is differentially expressed following doxorubicin chemotherapy.

    The samples from before and after chemotherapy have been hybridized on separate arrays, with a reference sample in the other channel.

    Normalize the data.

    Because this is a reference sample experiment, we calculate the log ratio of the experimental sample relative to the reference sample for before and after treatment in each patient.

    Calculate a single log ratio for each patient that represents the difference in gene expression due to treatment by subtracting the log ratio for the gene before treatment from the log ratio of the gene after treatment.

    Perform the t-test: t = 3.22, compared to t(19).

    The p-value for a two-tailed one-sample t-test is 0.0045, which is significant at the 1% significance level.

    Conclude: this gene has been significantly down-regulated following chemotherapy at the 1% level.
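The paired procedure above reduces to a one-sample t statistic on the per-patient differences. The sketch below shows that arithmetic in standard-library Python; the five log ratios are invented for illustration and are not the breast cancer data, and the function name `paired_t` is hypothetical. Turning the statistic into a p-value needs the t distribution with n - 1 degrees of freedom, which is what the `ttest` function named in the slides would supply.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic on per-patient differences (after minus before)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1  # statistic and degrees of freedom


# Hypothetical log ratios for five patients (illustrative values only).
before = [0.10, -0.20, 0.05, 0.30, -0.15]
after = [0.45, 0.10, 0.20, 0.70, 0.05]
t, df = paired_t(before, after)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The sign of t follows the direction of subtraction, so swapping the two arguments flips it.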

    [19]

    The gene metallothionein IB is on the Affymetrix array used for the leukemia data.

    To identify whether or not this gene is differentially expressed between the AML and ALL patients.

    To identify genes which are up- or down-regulated in AML relative to ALL.

    Steps: the data are log transformed; t = -3.4177, p = 0.0016.

    Conclude that the expression of metallothionein IB is significantly higher in AML than in ALL at the 1% level.
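For two independent groups the statistic is the difference in means divided by a pooled standard error. A minimal sketch of the equal-variance version follows; the expression values are made up and are not the leukemia data, and `pooled_t` is a hypothetical helper name. The slides name `ttest2` as the toolbox function for this comparison.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return t, df


# Hypothetical log-transformed expression values (illustrative only).
aml = [7.9, 8.4, 8.1, 8.8, 8.3]
all_ = [7.2, 7.6, 7.1, 7.8, 7.4, 7.5]
t, df = pooled_t(all_, aml)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

With the first argument being the group with the lower mean, t is negative, which is one way a negative statistic can accompany a conclusion that the second group is higher.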

    [20]

    The distribution of the data being tested is normal.

    For a paired t-test, it is the distribution of the subtracted data that must be normal.

    For an unpaired t-test, the distribution of both data sets must be normal.

    Homogeneous: the variances of the two populations are equal.

    Plots: Histogram, Density Plot, QQplot.

    Test for Normality: Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test.

    Test for equality of the two variances: Variance ratio F-test.

    Note:

    If the two populations are symmetric, and if the variances are equal, then the t-test may be used.

    If the two populations are symmetric, and the variances are not equal, then use the two-sample unequal-variance t-test (Welch's t-test).
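The variance check and the unequal-variance alternative can be sketched as follows. Putting the larger variance in the numerator of the ratio is one common convention and is an assumption here, as are the helper names `variance_ratio` and `welch_t` and the sample values. The degrees of freedom use the Welch-Satterthwaite approximation.

```python
from math import sqrt
from statistics import mean, variance


def variance_ratio(x, y):
    """F statistic for comparing two variances: larger sample variance over smaller."""
    vx, vy = variance(x), variance(y)
    return max(vx, vy) / min(vx, vy)


def welch_t(x, y):
    """Unequal-variance (Welch) t statistic with Welch-Satterthwaite df."""
    nx, ny = len(x), len(y)
    a, b = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return t, df


# Hypothetical groups with visibly different spread (illustrative only).
narrow = [5.0, 5.1, 4.9, 5.2, 4.8]
wide = [4.0, 6.5, 5.5, 3.0, 7.0, 6.0]
print("variance ratio:", round(variance_ratio(narrow, wide), 2))
print("Welch t, df:", welch_t(narrow, wide))
```

When the two sample variances and sizes are equal, the Welch statistic and its degrees of freedom coincide with the pooled version.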


    [29]

    Given n pairs of data, the sign test tests the hypothesis that the median of the differences in the pairs is zero.

    The test statistic is the number of positive differences.

    If the null hypothesis is true, then the numbers of positive and negative differences should be approximately the same.

    In fact, the number of positive differences will have a Binomial distribution with parameters n and p (p = 0.5 under the null hypothesis).
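Because the count of positive differences is Binomial(n, 0.5) under the null hypothesis, an exact p-value can be computed directly. The sketch below is one way to do it; discarding zero differences and doubling the smaller tail for a two-sided p-value are conventions assumed here, and `sign_test` is a hypothetical helper rather than the toolbox `signtest`.

```python
from math import comb


def sign_test(differences):
    """Exact two-sided sign test; zero differences are discarded."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    k = sum(1 for d in nonzero if d > 0)  # test statistic: number of positives
    # Under H0 the count of positives is Binomial(n, 0.5).
    pmf = [comb(n, i) / 2 ** n for i in range(n + 1)]
    lower = sum(pmf[: k + 1])  # P(X <= k)
    upper = sum(pmf[k:])       # P(X >= k)
    return k, n, min(1.0, 2.0 * min(lower, upper))


# Hypothetical paired differences: eight positive, two negative.
k, n, p = sign_test([0.4, 0.1, 0.3, 0.2, 0.5, 0.6, 0.2, 0.1, -0.3, -0.1])
print(f"{k} positive out of {n}, two-sided p = {p:.4f}")
```

Only the signs enter the calculation, so the test ignores how large each difference is.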

    [30]

    Null hypothesis: both samples were drawn from populations with the same median.

    The sum of the ranks for the "positive" (up-regulated) values is calculated and compared against a precomputed table to obtain a p-value.

    Sort the absolute values of the differences from smallest to largest.

    Assign ranks to the absolute values.

    Find the sum of the ranks of the positive differences.

    If the null hypothesis is true, the sum of the ranks of the positive differences should be about the same as the sum of the ranks of the negative differences.
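The ranking steps above can be sketched as follows. Dropping zero differences and giving tied absolute values their average rank are common conventions assumed here; the slides do not say how ties are handled. The function returns only the two rank sums; the p-value lookup is left out, and `signed_rank_sums` is a hypothetical name.

```python
def signed_rank_sums(differences):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(nonzero, key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute values and give each its average rank.
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for d in ordered[i:j]:
            if d > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j
    return w_plus, w_minus


# Hypothetical paired differences (illustrative only).
print(signed_rank_sums([1, -2, 3, -4, 5]))
```

The two sums always add up to n(n + 1)/2 for n non-zero differences, so a large imbalance between them is evidence against the null hypothesis.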

    [31]

    The data from the two groups are combined and given ranks (1 for the smallest, 2 for the second smallest, ...).

    The ranks for the larger group are summed and that number is compared against a precomputed table to obtain a p-value.
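A minimal sketch of the pooling and ranking step, returning the rank sum of each group so that either one can be looked up. Average ranks for ties are an assumed convention, the helper name `rank_sums` is hypothetical, and the table lookup for the p-value is not included.

```python
def rank_sums(x, y):
    """Rank the pooled data (ties get average ranks) and sum the ranks per group."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    sums = [0.0, 0.0]
    i = 0
    while i < len(pooled):
        # Extend j over the run of values tied with pooled[i].
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for _, group in pooled[i:j]:
            sums[group] += avg_rank
        i = j
    return sums[0], sums[1]


# Hypothetical expression values for two unpaired groups (illustrative only).
print(rank_sums([1, 3, 5], [2, 4, 6, 8]))
```

The two sums always total N(N + 1)/2 for N pooled observations, so knowing one sum determines the other.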

    [32]



    [41]

    The permutation test is a test in which the null hypothesis allows the inference to be reduced to a randomization problem.

    The process of randomization makes it possible to ascribe a probability distribution to the difference in the outcome possible under H0.

    The outcome data are analyzed many times (once for each acceptable assignment that could have been possible under H0) and then compared with the observed result, without dependence on additional distributional or model-based assumptions.

    Perform a permutation test (general):

    1. Analyze the problem and choose the null hypothesis.

    2. Choose the test statistic T.

    3. Calculate the value of the test statistic for the observed data: tobs.

    4. Apply the randomization principle and look at all possible permutations; this gives the distribution of the test statistic T under H0.

    5. Calculation of the p-value.

    Ref: Mansmann, U. (2002), Practical microarray analysis: resampling and the Bootstrap. Heidelberg.

    [42]

    The permutation test allows determining the statistical significance of the score for every gene.
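For one gene measured in two small groups, a permutation test can be sketched by enumerating every relabelling. The choices below are assumptions for illustration: the test statistic is the difference in group means, and the two-sided p-value is the fraction of relabellings whose statistic is at least as extreme in absolute value as the observed one. Full enumeration is only practical for small samples; larger data sets would normally use a random subset of permutations.

```python
from itertools import combinations


def permutation_test(x, y):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(x) + list(y)
    n_x, n_y = len(x), len(y)
    total = sum(pooled)

    def mean_diff(sum_x):
        return sum_x / n_x - (total - sum_x) / n_y

    t_obs = mean_diff(sum(x))
    extreme = count = 0
    # Each way of choosing n_x values for the first group is one relabelling.
    for chosen in combinations(range(len(pooled)), n_x):
        t = mean_diff(sum(pooled[i] for i in chosen))
        count += 1
        if abs(t) >= abs(t_obs) - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return t_obs, extreme / count


# Hypothetical expression values for one gene in two groups.
t_obs, p = permutation_test([1, 2, 3], [4, 5, 6])
print(f"observed difference = {t_obs}, permutation p = {p}")
```

The smallest attainable p-value is limited by the number of distinct relabellings, so very small groups cannot reach very small p-values.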

    [43]

    It often happens in research practice that you need to compare more than two groups (e.g., drug 1, drug 2, and placebo), or compare groups created by more than one independent variable while controlling for the separate influence of each of them (e.g., Gender, type of Drug, and size of Dose). In these cases, you need to analyze the data using Analysis of Variance, which can be considered to be a generalization of the t-test.

    In fact, for two group comparisons, ANOVA will give results identical to a t-test.

    When the design is more complex, ANOVA offers numerous advantages that t-tests cannot provide (even if you run a series of t-tests comparing various cells of the design).

    Analysis of Variance (ANOVA) allows us to extend this to more than two populations or measurements (treatments). That is, we can test the following:

    Are all the means from more than two populations equal?

    Are all the means from more than two treatments on one population equal? (This is equivalent to asking whether the treatments have any overall effect.)
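The one-way ANOVA F statistic compares the variation between group means with the variation within groups. The sketch below computes it from scratch; the data are invented and `one_way_f` is a hypothetical helper, not the toolbox `anova1`. It also illustrates the two-group case, where F equals the square of the equal-variance t statistic.

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F statistic with its between- and within-group df."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical responses for drug 1, drug 2, and placebo (illustrative only).
print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))

# With two groups, F is the square of the pooled two-sample t statistic.
print(one_way_f([[1, 2, 3], [2, 3, 4]]))
```

A large F means the group means differ by more than the within-group scatter would explain; the p-value comes from the F distribution with the two returned degrees of freedom.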

    [44]



    [45]

    To identify the genes that are differentially expressed in one or more of these four groups.

    ARP1 (actin-related protein 1).

    [46]

    Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall.

    Jarque, C. M. and Bera, A. K. (1980). Efficient tests for normality, homoscedasticity, and serial independence of regression residuals. Economics Letters 6, 255-9.

    Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data, Journal of Computational Biology, 7: 819-837.

    Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown, The American Statistical Association Journal.

    Martinez, W. L. (2002). Computational statistics handbook with MATLAB, Boca Raton: Chapman & Hall/CRC.

    Runyon, R. P. (1977). Nonparametric statistics: a contemporary approach, Reading, Mass.: Addison-Wesley Pub. Co.

    Statistics Toolbox User's Guide, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/toolbox/stats/stats.shtml

    Stekel, D. (2003). Microarray bioinformatics, New York: Cambridge University Press.

    Tsai, C. A., Chen, Y. J. and Chen, J. (2003). Testing for differentially expressed genes with microarray data, Nucleic Acids Research 31, No 9, e52.

    Turner, J. R. and Thayer, J. F. (2001). Introduction to analysis of variance: design, analysis, & interpretation, Thousand Oaks, Calif.: Sage Publications.

E-mail: [email protected]

http://www.sinica.edu.tw/~hmwu/