simmons comprehensive cancer center

COMPARING GROUPS – PART 1 CONTINUOUS DATA

Min Chen, Ph.D.

Assistant Professor

Quantitative Biomedical Research Center

Department of Clinical Sciences

Bioinformatics Shared Resource

Simmons Comprehensive Cancer Center

Lecture 4

July 9, 2013

Min Chen (QBRC/CCBSR) Comparing groups Continuous Data 1 Lec 4 1 / 38

OUTLINE

1 REVIEW

2 INTRODUCTION

3 COMPARISON OF TWO GROUPS

Parametric tests


REVIEW: 100(1−α)% CONFIDENCE INTERVAL OF THE MEAN

Lower limit:

$$L = \bar{X} - z_{\alpha/2} \times \frac{s}{\sqrt{n}}$$

Upper limit:

$$U = \bar{X} + z_{\alpha/2} \times \frac{s}{\sqrt{n}}$$

Standard Normal Distribution:

µ = 0, σ = 1
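The two limits above can be sketched in a few lines of Python. This is only an illustration: the function name `z_confidence_interval` is made up here, and the numbers are the summary statistics of the "Cases" group from Example 3 later in this lecture (mean 207.3, SD 35.6, n = 100).

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, alpha=0.05):
    """Large-sample CI: mean -/+ z_{alpha/2} * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 critical value
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width

# Summary statistics of the "Cases" group in Example 3
lower, upper = z_confidence_interval(207.3, 35.6, 100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With n = 100 the normal critical value is a reasonable choice; for small samples the t-based interval reviewed next would be used instead.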


REVIEW OF CONFIDENCE INTERVAL FROM SMALL SAMPLE

As a rule of thumb, if the sample size n < 30, use the formula below.

100(1−α)% confidence interval:

$$\bar{X} \pm t_{\alpha/2,\,n-1} \times \frac{s}{\sqrt{n}}$$

where $t_{\alpha/2,\,n-1}$ is the upper α/2 critical value, i.e., the (1−α/2) quantile, of the t-distribution with (n−1) degrees of freedom.
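As a rough sketch of the small-sample interval, the code below applies the formula to the Week 0 values of Example 1 (n = 8), which appear later in this lecture. The Python standard library has no t-distribution, so the helpers `t_pdf`, `t_cdf` and `t_quantile` are hypothetical stand-ins written for this illustration (numerical integration plus bisection); a statistics package would normally supply the critical value.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection (this sketch assumes 0.5 <= p < 1)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

week0 = [7.85, 12.03, 21.84, 13.94, 16.68, 41.78, 14.97, 12.07]  # Example 1, Week 0
n = len(week0)
t_crit = t_quantile(0.975, n - 1)            # (1 - alpha/2) quantile, alpha = 0.05
half_width = t_crit * stdev(week0) / math.sqrt(n)
lower, upper = mean(week0) - half_width, mean(week0) + half_width
print(f"t critical value: {t_crit:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```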


REVIEW: INTERPRETATION OF CI

The CI:

$$\Pr\big(L(X) \le \theta \le U(X)\big) = 1 - \alpha.$$

It is tempting to state "the probability that θ lies between two numbers, L and U, is (1−α)".

- Wrong, because θ is a fixed number; once L and U are computed, the interval either contains θ or it does not.
- L(X) and U(X) are random variables, not numbers.
- On average, 100(1−α)% (e.g., 95%) of the calculated intervals will contain the true population parameter θ.
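The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under assumed settings: the population mean and SD, the sample size, the number of replicates and the random seed are all arbitrary choices made for illustration, not values from the lecture.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(2013)
TRUE_MEAN, TRUE_SD = 10.0, 2.0       # hypothetical population (assumption)
n, replicates, alpha = 50, 2000, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)

covered = 0
for _ in range(replicates):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(n)
    # The true mean is fixed; the interval is what varies between replicates.
    if centre - half_width <= TRUE_MEAN <= centre + half_width:
        covered += 1

coverage = covered / replicates
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95. Each individual interval either contains the true mean or it does not; the 95% describes the procedure, not any single interval.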


RELATIONSHIP BETWEEN TYPE I ERROR (α) AND POWER


PARAMETRIC VS NON-PARAMETRIC

Parametric tests

Assume data follow some known distribution

E.g., Normal, t-distribution, chi-square, Binomial distribution, etc.

Compare means, variances

Non-parametric tests

Don’t assume a form of distribution

Compare other measures of central tendency (e.g., median, or location

shift)

Useful for skewed data, small samples, ordinal data


NOTATION

                      Population parameter    Sample value
Mean                  µ                       X̄
Standard deviation    σ                       s
Variance              σ²                      s²
Sample size                                   n
Sample value                                  x_i


ONE SAMPLE t TEST

Recall the one-sample t-test:

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

Test statistic for comparing the mean of one group against a fixed

value.

The general form of a t-statistic is

$$t = \frac{\text{difference of means}}{\text{standard error}}.$$

The t-statistic follows a t-distribution!


STUDENT’S t-DISTRIBUTION

Here is how to generate a Student's t random variable:

$$T_\nu = \frac{Z}{\sqrt{V/\nu}},$$

where

Z is a standard normal random variable;

V has a chi-squared distribution with ν degrees of freedom (df), i.e.,

$$V = \sum_{i=1}^{\nu} Z_i^2,$$

where the $Z_i$ are iid standard normal r.v.'s. (Recall $E[Z_i^2] = 1$, so $E[V] = \nu$.)

Z and V are independent.
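The construction above can be tried out directly. The sketch below draws T from independent standard normal values exactly as in the definition; ν = 10, the number of draws and the seed are arbitrary choices for illustration. The value 2.228 is the tabulated 0.975 quantile of the t-distribution with 10 df, so about 5% of draws should fall beyond it in absolute value.

```python
import random
from math import sqrt

random.seed(4)
nu, draws = 10, 20000   # illustrative choices, not from the lecture

def student_t_draw(nu):
    z = random.gauss(0.0, 1.0)                                # Z ~ N(0, 1)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))   # V ~ chi-squared(nu)
    return z / sqrt(v / nu)                                   # Z and V are independent

samples = [student_t_draw(nu) for _ in range(draws)]
tail_fraction = sum(abs(t) > 2.228 for t in samples) / draws
print(f"Fraction with |T| > 2.228: {tail_fraction:.3f}")
```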


t: A FAMILY OF DISTRIBUTIONS IDENTIFIED BY df

Recall

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{(\bar{X} - \mu_0)/(\sigma/\sqrt{n})}{\sqrt{s^2/\sigma^2}}.$$

Approaches Normal distribution as df increases.


SMALL SAMPLE VS. LARGE SAMPLE

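Recall that for a CI, as a rule of thumb, if the sample size n < 30, use the t statistic for the 100(1−α)% confidence interval:

$$\bar{X} \pm t_{\alpha/2,\,n-1} \times \frac{s}{\sqrt{n}},$$

while for large samples we have

$$\bar{X} \pm z_{\alpha/2} \times \frac{s}{\sqrt{n}}.$$

The reason is that when the sample size is large,

$$t_{\alpha/2,\,n-1} \approx z_{\alpha/2}.$$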
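To see how quickly the t critical value approaches the normal one, the sketch below prints the 0.975 quantile for a few degrees of freedom. The helpers `t_pdf`, `t_cdf` and `t_quantile` are hypothetical, written only for this illustration with numerical integration and bisection; the chosen df values are arbitrary.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection (this sketch assumes 0.5 <= p < 1)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for df in (5, 10, 29, 100, 1000):
    print(f"df = {df:4d}: t = {t_quantile(0.975, df):.3f}")
print(f"normal   : z = {z:.3f}")
```

The gap between t and z is already small around 30 df, which is where the rule of thumb comes from.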


OUTLINE

1 REVIEW

2 INTRODUCTION

3 COMPARISON OF TWO GROUPS

Parametric tests


COMPARING MEANS OF PAIRED SAMPLES

In paired samples each data point in one sample is matched to another

data point in the second sample.

Same subject

- Measured at 2 time points
- Before and after intervention
- Two eyes (Left, Right)
- Two organs (Heart, Liver)

Matched subjects

- Experimental animal, Pair-fed Match
- Male, Female


COMPARING MEANS OF TWO INDEPENDENT SAMPLES

Two independent samples

Subjects are unrelated in two separate groups;

Sample sizes may be different in each group, (n1,n2)

Variances in each group may be

- Equal, $\sigma_1^2 = \sigma_2^2$
- Unequal, $\sigma_1^2 \neq \sigma_2^2$


EXAMPLE 1

In a hypertension research study, subjects are given dietary counseling to

restrict their sodium intake. Data on urinary sodium from 8 subjects at

Baseline (Week 0), and Week 1, are shown.

Subject   Week 0   Week 1   Change
1           7.85     9.59     1.74
2          12.03    34.50    22.47
3          21.84     4.55   -17.29
4          13.94    20.78     6.84
5          16.68    11.69    -4.99
6          41.78    32.51    -9.27
7          14.97     5.46    -9.51
8          12.07    12.95     0.88
X̄          17.65    16.50    -1.14
s          10.56    11.63    12.22

Change = Week 1 − Week 0.


EXAMPLE 1 (CONTD.)


Q1: Paired samples or two independent samples?

Q2: Is there a change in mean levels of urinary sodium after 1 week?


PAIRED t-TEST

Example 1 has paired sample data (since same subject was measured at

two time points).

Compute the mean and standard deviation of the differences.

$$H_0: \mu_1 - \mu_2 = c \quad \text{vs.} \quad H_a: \mu_1 - \mu_2 \neq c$$

$$t = \frac{\bar{X}_d - c}{s_d/\sqrt{n}},$$

which follows a t-distribution with (n−1) degrees of freedom under H0.

If $|t| > t^*_{n-1}(1-\alpha/2)$, reject H0. Here $t^*_{n-1}(1-\alpha/2)$ is the (1−α/2) quantile of $T_{n-1}$.

$$\text{P-value} = 2 \times \Pr(T_{n-1} > |t|).$$
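As a sketch, the paired t-test can be worked through on the Example 1 data with c = 0 and differences taken as Week 1 minus Week 0. The two-sided P-value is computed as twice the upper tail area. The helpers `t_pdf` and `t_cdf` are hypothetical, written for this illustration because the Python standard library has no t-distribution.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

week0 = [7.85, 12.03, 21.84, 13.94, 16.68, 41.78, 14.97, 12.07]
week1 = [9.59, 34.5, 4.55, 20.78, 11.69, 32.51, 5.46, 12.95]

diffs = [after - before for before, after in zip(week0, week1)]
n = len(diffs)
d_bar, s_d = mean(diffs), stdev(diffs)
c = 0.0                                    # hypothesised mean difference
t_stat = (d_bar - c) / (s_d / math.sqrt(n))
df = n - 1
p_value = 2 * (1 - t_cdf(abs(t_stat), df))  # two-sided

print(f"mean difference = {d_bar:.2f}, sd = {s_d:.2f}")
print(f"t = {t_stat:.3f} on {df} df, two-sided P = {p_value:.2f}")
```

With |t| this small on 7 df, H0 is not rejected at α = 0.05: these data show no evidence of a change in mean urinary sodium after one week.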


REJECTION REGIONS


PAIRED T-TEST USING EXCEL – EXAMPLE 1

Values shown in bold red have been modified from original data.


EXAMPLE 2

A study was performed to compare the mean ERG (electroretinogram)

amplitude of patients with different genetic types of retinitis pigmentosa

(RP), a genetic eye disease that often results in blindness. Data were

collected in patients aged 18-29 years with different genetic types.

Genetic type   Mean ± SD     N
Dominant       0.85 ± 0.18   62
Recessive      0.38 ± 0.21   35

Table shows values for the natural log of ERG amplitude.


EXAMPLE 2 (CONTD.)

Q1: Paired samples or two independent samples?

Q2: Is there a difference in mean log(ERG) amplitude between patients

with dominant RP versus those with the recessive form?


TWO-SAMPLE t-TEST WITH EQUAL VARIANCES

Example 2 has two independent samples.

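$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \neq \mu_2$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},$$

which follows a t-distribution with (n1 + n2 − 2) degrees of freedom, where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

is the pooled variance.

If $|t| > t^*_{n_1+n_2-2}(1-\alpha/2)$, reject H0.

$$\text{P-value} = 2 \times \Pr(T_{n_1+n_2-2} > |t|).$$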
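A minimal sketch of the pooled-variance calculation, applied to the Example 2 summary statistics (Dominant: 0.85 ± 0.18, n = 62; Recessive: 0.38 ± 0.21, n = 35). The function name `pooled_t` is made up for this illustration. Only the test statistic and its degrees of freedom are computed; the P-value would additionally need a t-distribution CDF, which is left out here.

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df   # pooled variance
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)                   # standard error
    return (mean1 - mean2) / se, df, sp2

t_stat, df, sp2 = pooled_t(0.85, 0.18, 62, 0.38, 0.21, 35)
print(f"pooled variance = {sp2:.4f}")
print(f"t = {t_stat:.2f} on {df} df")
```

A t-statistic of this size lies far beyond any conventional critical value, so H0 is rejected.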


TWO-SAMPLE t-TEST FOR EQUAL VARIANCES USING EXCEL – EXAMPLE 2


COMPARING VARIANCES

In Example 2, the two-sample t-test for independent samples assumed that the variances were equal:

Variance of Group 1 = Variance of Group 2

Note that the sample variances are $s_1^2 = 0.18^2 = 0.032$ and $s_2^2 = 0.21^2 = 0.044$.

Is the equal variance assumption true?


COMPARING VARIANCES

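To compare variances, we conduct a hypothesis test to examine whether the ratio of variances is equal to 1.

$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1 \quad \text{vs.} \quad H_a: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$$

Test statistic:

$$f = \frac{s_1^2}{s_2^2},$$

which follows an F-distribution under H0.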


F-DISTRIBUTION

Here is how to generate an F random variable:

$$F_{\nu_1,\nu_2} = \frac{V_1/\nu_1}{V_2/\nu_2},$$

where

V1 and V2 have chi-squared distributions with ν1 and ν2 degrees of

freedom (df), respectively.

V1 and V2 are independent.

Recall E[V] = ν .
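The construction can be simulated directly from standard normal draws. In this sketch ν1 = 5, ν2 = 20, the number of draws and the seed are arbitrary illustrative choices. The sample mean is compared with the known mean of an F-distribution, ν2/(ν2 − 2) for ν2 > 2, and the mean exceeding the median reflects the right skew.

```python
import random
from statistics import median

random.seed(7)
nu1, nu2, draws = 5, 20, 10000   # illustrative choices, not from the lecture

def chi_squared_draw(df):
    """Sum of df squared independent standard normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(nu1, nu2):
    """Ratio of two independent chi-squared draws, each divided by its df."""
    return (chi_squared_draw(nu1) / nu1) / (chi_squared_draw(nu2) / nu2)

samples = [f_draw(nu1, nu2) for _ in range(draws)]
sample_mean = sum(samples) / draws
sample_median = median(samples)
print(f"sample mean   = {sample_mean:.3f} (theory: {nu2 / (nu2 - 2):.3f})")
print(f"sample median = {sample_median:.3f}")
```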


F-DISTRIBUTION

The F-distribution is a family of distributions identified by numerator and denominator degrees of freedom (df).

F-distributions

- are always right-skewed;
- have numerator and denominator df.

Recall that, under H0 ($\sigma_1^2 = \sigma_2^2$),

$$f = \frac{s_1^2}{s_2^2} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}.$$


REJECTION REGIONS FOR THE F-TEST


F-TEST FOR COMPARING VARIANCES

$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1 \quad \text{vs.} \quad H_a: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$$

Test statistic:

$$f = \frac{s_1^2}{s_2^2},$$

which follows an F-distribution with (n1 − 1, n2 − 1) degrees of freedom.

If $f > F_{n_1-1,n_2-1}(1-\alpha/2)$ or $f < F_{n_1-1,n_2-1}(\alpha/2)$, reject H0.

If f ≥ 1, then P-value = $2 \times \Pr(F_{n_1-1,n_2-1} > f)$;

If f < 1, then P-value = $2 \times \Pr(F_{n_1-1,n_2-1} < f)$.
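The F-test can be sketched on the Example 2 standard deviations (0.18 with n = 62, 0.21 with n = 35). The helpers `f_pdf`, `f_lower_tail` and `f_test` are hypothetical, written for this illustration: the tail area is obtained by numerically integrating the F density, and the upper tail uses the identity Pr(F(d1, d2) > f) = Pr(F(d2, d1) < 1/f). The density helper assumes both df exceed 2, which holds here.

```python
import math

def f_pdf(x, d1, d2):
    """Density of F(d1, d2); this sketch assumes d1 > 2, so the density is 0 at x = 0."""
    if x <= 0:
        return 0.0
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
             + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - (d1 + d2) / 2 * math.log1p(d1 * x / d2))

def f_lower_tail(x, d1, d2, steps=2000):
    """Pr(F(d1, d2) < x) by Simpson's rule on [0, x]."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_test(sd1, n1, sd2, n2):
    """Two-sided F-test for equality of two variances."""
    f = sd1 ** 2 / sd2 ** 2
    d1, d2 = n1 - 1, n2 - 1
    if f >= 1:
        tail = f_lower_tail(1 / f, d2, d1)   # equals Pr(F(d1, d2) > f)
    else:
        tail = f_lower_tail(f, d1, d2)       # Pr(F(d1, d2) < f)
    return f, (d1, d2), min(1.0, 2 * tail)

f_stat, dfs, p_value = f_test(0.18, 62, 0.21, 35)
print(f"f = {f_stat:.3f} on {dfs} df, two-sided P = {p_value:.2f}")
```

For Example 2 the P-value is above 0.05, so the equal-variance assumption behind the pooled t-test is not contradicted by the data.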


F-TEST FOR EQUALITY OF VARIANCES USING EXCEL – EXAMPLE 2


TWO-SAMPLE t-TEST WITH UNEQUAL VARIANCES

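$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \neq \mu_2$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},$$

which approximately follows a t-distribution with d′ degrees of freedom, where

$$d' = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}.$$

Round d′ down to the nearest integer and call it d″.

If $|t| > t^*_{d''}(1-\alpha/2)$, reject H0.

$$\text{P-value} = 2 \times \Pr(T_{d''} > |t|).$$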


EXAMPLE 3

A research study aimed to assess the familial aggregation of cholesterol levels by collecting data on children aged 2 to 14 years. Cholesterol levels (mg/dL) were collected in one group of children (say, "cases") whose fathers had died from heart disease. Data were also collected in a historical control group of children of the same age.

Group                Mean ± SD      N
Cases                207.3 ± 35.6   100
Historical Control   193.4 ± 17.3   74


EXAMPLE 3 (CONTD.)

Paired sample or two independent samples?

Is there a difference in mean cholesterol levels between Cases and

Historical Control group?

Which statistical test should we use?


F-TEST FOR EQUALITY OF VARIANCES USING EXCEL – EXAMPLE 3


TWO-SAMPLE t-TEST FOR UNEQUAL VARIANCES USING EXCEL – EXAMPLE 3
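As an alternative to a spreadsheet, the unequal-variance test can be sketched in Python on the Example 3 summary statistics (Cases: 207.3 ± 35.6, n = 100; Historical Control: 193.4 ± 17.3, n = 74). The names `welch_t`, `t_pdf` and `t_cdf` are hypothetical helpers written for this illustration; the degrees of freedom use the standard Satterthwaite approximation, rounded down, and the P-value is two-sided.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and approximate df when variances are not assumed equal."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    d_prime = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, d_prime, math.floor(d_prime)   # floor gives d''

t_stat, d_prime, d_int = welch_t(207.3, 35.6, 100, 193.4, 17.3, 74)
p_value = 2 * (1 - t_cdf(abs(t_stat), d_int))
print(f"t = {t_stat:.2f}, d' = {d_prime:.1f}, d'' = {d_int}")
print(f"two-sided P = {p_value:.4f}")
```

The P-value is well below 0.05, so H0 of equal mean cholesterol levels is rejected.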


ADVANTAGES OF PAIRED SAMPLES

Suppose we want to test H0: µ1 = µ2 vs. Ha: µ1 ≠ µ2.

The test statistic is based on X̄1 − X̄2. Its variance is:

$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) - 2\rho_{12}\sqrt{\mathrm{Var}(\bar{X}_1) \cdot \mathrm{Var}(\bar{X}_2)}$$

A positive correlation ρ12 in paired samples reduces the variance of the difference, yielding a more powerful test than the independent-sample design.
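The variance formula can be checked numerically on the Example 1 data. In this sketch the population quantities are replaced by their sample estimates (an assumption made for illustration; with only 8 pairs the correlation estimate is rough), and the standard error from the formula is compared with the one computed directly from the paired differences.

```python
from math import sqrt
from statistics import mean, stdev

week0 = [7.85, 12.03, 21.84, 13.94, 16.68, 41.78, 14.97, 12.07]
week1 = [9.59, 34.5, 4.55, 20.78, 11.69, 32.51, 5.46, 12.95]
n = len(week0)

s1, s2 = stdev(week0), stdev(week1)
m0, m1 = mean(week0), mean(week1)
# Sample correlation between the two measurements on the same subject
cov = sum((a - m0) * (b - m1) for a, b in zip(week0, week1)) / (n - 1)
r = cov / (s1 * s2)

var1, var2 = s1 ** 2 / n, s2 ** 2 / n              # estimated Var of each mean
se_paired = sqrt(var1 + var2 - 2 * r * sqrt(var1 * var2))
se_if_independent = sqrt(var1 + var2)              # what ignoring the pairing would give

diffs = [b - a for a, b in zip(week0, week1)]
se_direct = stdev(diffs) / sqrt(n)                 # SE used by the paired t-test

print(f"sample correlation r = {r:.2f}")
print(f"SE from formula      = {se_paired:.3f}")
print(f"SE from differences  = {se_direct:.3f}")
print(f"SE ignoring pairing  = {se_if_independent:.3f}")
```

The first two standard errors agree, and both are smaller than the value obtained by ignoring the pairing, because the estimated correlation is positive.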



