basic statistics ii. significance/hypothesis tests

Basic Statistics II

Significance/hypothesis tests

RCT comparing drug A and drug B for the treatment of hypertension

• 50 patients allocated to A

• 50 patients allocated to B

• Outcome = systolic BP at 3 months

Results

Group A

Mean = 145, sd = 9.9

Group B

Mean = 135, sd = 10.0

Null hypothesis : “μ (A) = μ (B)”

[i.e. difference equals 0]

Alternative hypothesis : “μ (A) ≠ μ (B)”

[i.e. difference doesn’t equal zero]

[where μ = population mean]

Statistical problem

When can we conclude that the observed difference, mean(A) – mean(B), is large enough to suspect that μ (A) – μ (B) is not zero?

P-value :

“probability of obtaining the observed data, or data more extreme, if the null hypothesis were true”

[e.g. if no difference in systolic BP between two groups]

How do we evaluate the probability?

Test Statistic

• Numerical value which can be compared with a known statistical distribution

• Expressed in terms of the observed data and the data expected if the null hypothesis were true

Test statistic

[mean (A) – mean (B)] / sd [mean(A) – mean(B)]

[the sd of the difference in means is its standard error]

Under the null hypothesis this ratio will approximately follow a Normal distribution with mean = 0 and sd = 1

Hypertension example

Test statistic = [mean (A) – mean (B)] / sd [mean(A)-mean(B)]

= [145 – 135] / 1.99 ≈ 5.0

[where 1.99 = √(9.9²/50 + 10.0²/50)]

→ p <0.001
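
The arithmetic above can be reproduced with a short sketch. It assumes the two groups are independent and uses the large-sample Normal approximation described on the earlier slide; the variable names are illustrative only.

```python
import math

# Summary statistics from the hypertension example
mean_a, sd_a, n_a = 145.0, 9.9, 50
mean_b, sd_b, n_b = 135.0, 10.0, 50

# Standard error of the difference between two independent means
se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Test statistic: observed difference divided by its standard error
z = (mean_a - mean_b) / se_diff

# Two-sided p-value from the standard Normal distribution
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"SE = {se_diff:.2f}, test statistic = {z:.2f}, p = {p_two_sided:.1e}")
```

A statistics package running a two-sample t-test would use the t distribution instead, which gives a slightly larger p-value but the same conclusion here.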

Interpretation

Drug B results in lower systolic blood pressure in patients with hypertension than does Drug A

Two-sample t-test

Compares two independent groups of Normally distributed data

Significance test example I

Null hypothesis : “μ (A) = μ (B)”

[i.e. difference equals 0]

Alternative hypothesis : “μ (A) ≠ μ (B)”

[i.e. difference doesn’t equal zero]

Two-sided test

Null hypothesis :

“μ (A) ≤ μ (B)”

Alternative hypothesis :

“μ (A) > μ (B)”

One-sided test

A one-sided test is only appropriate if a difference in the opposite direction would have the same meaning or result in the same action as no difference

Paired-sample t-test

Compares two dependent groups of Normally distributed data

Paired-sample t-test

Mean daily dietary intake of 11 women measured over 10 pre-menstrual and 10 post-menstrual days

Dietary intake example

Pre-menstrual (n=11):

Mean=6753kJ, sd=1142

Post-menstrual (n=11):

Mean=5433kJ, sd=1217

Difference

Mean=1320, sd=367


Test statistic = 1320/[367/sqrt(11)]

= 11.9

p<0.001
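
As a check on the figures above, here is a minimal sketch of the paired calculation from the summary statistics alone. The critical values are copied from standard t tables for 10 degrees of freedom (an addition not on the slides); the variable names are illustrative.

```python
import math

# Summary of the within-woman differences (pre minus post), in kJ
mean_diff, sd_diff, n_pairs = 1320.0, 367.0, 11

# Standard error of the mean difference
se_mean_diff = sd_diff / math.sqrt(n_pairs)

# Paired t statistic and its degrees of freedom
t_stat = mean_diff / se_mean_diff
df = n_pairs - 1

# Two-sided critical values of t with 10 df, from standard t tables
T_CRIT_5_PERCENT = 2.228
T_CRIT_0_1_PERCENT = 4.587

print(f"t = {t_stat:.1f} on {df} df")
print("p < 0.001" if t_stat > T_CRIT_0_1_PERCENT else "p >= 0.001")
```

The test works on the 11 differences only, which is why the sd of the differences (367) is used rather than the two group sds.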


Dietary intake during the pre-menstrual period was significantly greater than that during the post-menstrual period

The equivalent non-parametric tests

• Mann-Whitney U-test

• Wilcoxon matched pairs signed rank sum test

Non-parametric tests

• Based on the ranks of the data

• Use complicated formulae

• Hence a computer package is recommended

Significance test example II

Type I error

Significant result when the null hypothesis is true

[probability = significance level, conventionally 0.05]

Type II error

Non-significant result when null hypothesis is false

[Power = 1 – probability of a Type II error]
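
The meaning of the 0.05 Type I error rate can be illustrated by simulation. The sketch below is an illustration under assumed values, not part of the original examples: both groups are drawn from the same population (mean 140, sd 10, chosen arbitrarily), so every "significant" result is a Type I error.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable


def z_test_p(a, b):
    """Two-sided p-value for a difference in means (Normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


n_trials = 2000
false_positives = 0
for _ in range(n_trials):
    # Null hypothesis is true by construction: identical populations
    a = [random.gauss(140, 10) for _ in range(50)]
    b = [random.gauss(140, 10) for _ in range(50)]
    if z_test_p(a, b) < 0.05:
        false_positives += 1

type_i_rate = false_positives / n_trials
print(f"Proportion of 'significant' results under the null: {type_i_rate:.3f}")
```

The printed proportion should be close to 0.05. Giving the two populations different means would instead estimate the power of the test.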

The chi-square test

Used to investigate the relationship between two qualitative variables

The analysis of cross-tabulations

The chi-square test

Compares proportions in two independent samples

Chi-square test example

In an RCT comparing infra-red stimulation (IRS) with placebo on pain caused by osteoarthritis,

9/12 in IRS group ‘improved’ compared with 4/13 in placebo group


            Improve?
            Yes    No    Total
Placebo       4     9       13
IRS           9     3       12
Total        13    12       25

Placebo : 4/13 = 31% improve

IRS: 9/12 = 75% improve

Cross-tabulations

The chi-square test tests the null hypothesis of no relationship between ‘group’ and ‘improvement’ by comparing the observed frequencies with those expected if the null hypothesis were true

Cross-tabulations

Expected frequency

= (row total × column total) / grand total

Chi-square test example

            Improve?
            Yes    No    Total
Placebo       4     9       13
IRS           9     3       12
Total        13    12       25

Expected value for ‘4’ = 13 x 13 / 25

= 6.8

Expected values

            Improve?
            Yes    No    Total
Placebo     6.8   6.2       13
IRS         6.2   5.8       12
Total        13    12       25
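
The expected values can be generated for every cell at once. This is a small sketch of the row total × column total / grand total rule applied to the observed table; the variable names are illustrative.

```python
# Observed frequencies: rows are Placebo and IRS, columns are improved Yes / No
observed = [[4, 9],
            [9, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell if group and improvement were unrelated
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(["Placebo", "IRS"], expected):
    print(label, [round(e, 1) for e in row])
```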

Test Statistic

= Σ (observed freq – expected freq)² / expected freq

[summed over all cells of the table]

Test Statistic

= Σ (O – E)² / E

= (4 – 6.8)²/6.8 + (9 – 6.2)²/6.2

+ (9 – 6.2)²/6.2 + (3 – 5.8)²/5.8

= 4.9 [using unrounded expected values] → p=0.027
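
A minimal sketch of the full calculation follows. It relies on the fact that, for a 2 × 2 table, the statistic has 1 degree of freedom, so the p-value can be obtained from the Normal distribution without a statistics package; the variable names are illustrative.

```python
import math

# Observed frequencies: rows are Placebo and IRS, columns are improved Yes / No
observed = [[4, 9],
            [9, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all four cells, using unrounded expected values
chi_square = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / grand_total
        chi_square += (observed[i][j] - e) ** 2 / e

# With 1 df, P(chi-square > x) equals P(|Z| > sqrt(x)) for a standard Normal Z
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```

Larger tables have more degrees of freedom, and the shortcut in the last step no longer applies.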


Statistically significant difference in improvement between the IRS and placebo groups

Small samples

The chi-square test is valid if:

at least 80% of the expected frequencies exceed 5 and all the expected frequencies exceed 1

Small samples

If criterion not satisfied then combine or delete rows and columns to give bigger expected values

Small samples

Alternatively:

Use Fisher’s Exact Test

[calculates probability of observed table of frequencies - or more extreme tables-under null hypothesis]
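
For the IRS table, Fisher's exact test can be sketched with nothing more than binomial coefficients. The two-sided rule used below (sum the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption here; packages may define "more extreme" differently.

```python
from math import comb

# Observed 2 x 2 table: Placebo (a, b) and IRS (c, d), columns improved Yes / No
a, b, c, d = 4, 9, 9, 3
row1, row2 = a + b, c + d
col1 = a + c
n = row1 + row2


def table_weight(x):
    """Number of ways to get x improvers in the placebo row with margins fixed."""
    return comb(row1, x) * comb(row2, col1 - x)


# All values of the top-left cell that are possible given the margins
lowest = max(0, col1 - row2)
highest = min(row1, col1)
weights = {x: table_weight(x) for x in range(lowest, highest + 1)}
total = comb(n, col1)

# Two-sided p-value: tables as probable as, or less probable than, the observed one
observed_weight = weights[a]
p_fisher = sum(w for w in weights.values() if w <= observed_weight) / total

print(f"Fisher's exact test: p = {p_fisher:.3f}")
```

Working with integer counts until the final division avoids rounding problems when comparing table probabilities.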

Yates’ Correction

Improves the estimation of the discrete distribution of the test statistic by the continuous chi-square distribution

Chi-square test with Yates’ correction

Subtract ½ from the absolute O – E difference before squaring

Σ (|O – E| – ½)² / E
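
Applied to the IRS example, the corrected statistic can be sketched as follows, again using the 1 degree of freedom shortcut for the p-value. The variable names are illustrative.

```python
import math

# Observed frequencies: rows are Placebo and IRS, columns are improved Yes / No
observed = [[4, 9],
            [9, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (|O - E| - 0.5)^2 / E over all four cells
yates_chi_square = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / grand_total
        yates_chi_square += (abs(observed[i][j] - e) - 0.5) ** 2 / e

# 1 df, so the p-value follows from the standard Normal distribution
p_yates = math.erfc(math.sqrt(yates_chi_square / 2))

print(f"chi-square with Yates' correction = {yates_chi_square:.2f}, p = {p_yates:.3f}")
```

For this small table the corrected statistic is about 3.3 rather than 4.9, and the corresponding p-value (about 0.07) is above 0.05, so the correction matters when samples are small.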

Significance test example III

McNemar’s test

Compares proportions in two matched samples

McNemar’s test example

                        Severe cold age 14
                        Yes     No    Total
Severe cold    Yes      212    144      356
age 12         No       256    707      963
               Total    468    851     1319


Null hypothesis:

the proportions saying ‘yes’ on the 1st and 2nd occasions are the same, i.e. the frequencies for ‘yes, no’ and ‘no, yes’ are equal

McNemar’s test

• Test statistic based on observed and expected ‘discordant’ frequencies

• Similar to that for simple chi-square test


Test statistic = 31.4

=> p <0.001
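
The value 31.4 can be reproduced from the two discordant cells. This sketch uses the form of McNemar's statistic without a continuity correction, which is the version that matches the figure quoted above; the variable names are illustrative.

```python
import math

# Discordant pairs from the severe-cold table
yes_then_no = 144  # severe cold at age 12 but not at age 14
no_then_yes = 256  # severe cold at age 14 but not at age 12

# Under the null hypothesis each discordant cell is expected to hold half of
# the discordant pairs, which reduces the chi-square sum to this ratio
mcnemar_stat = (yes_then_no - no_then_yes) ** 2 / (yes_then_no + no_then_yes)

# 1 df, so the p-value follows from the standard Normal distribution
p_value = math.erfc(math.sqrt(mcnemar_stat / 2))

print(f"McNemar statistic = {mcnemar_stat:.1f}, p = {p_value:.1e}")
```

The concordant cells (212 and 707) do not enter the calculation at all.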

Significant difference between the two ages in the proportion with a severe cold

Significance test example IV

Comparison of means

2 groups: 2-sample t-test

3 or more groups: ANOVA

One-way analysis of variance

Example:

Assessing the effect of treatment on the stress levels of a cohort of 60 subjects.

3 age-groups: 15-25, 26-45, 46-65

Stress measured on scale 0-100

Stress levels

Group Mean (SD)

15-25 (n=20) 52.8 (11.2)

26-45 (n=20) 33.4 (15.0)

46-65 (n=20) 35.6 (11.7)

[Figure: graph of stress level (axis 0 to 80) by age group]

ANOVA

Source            Sum of squares    Df    Mean square    F       Sig
Between groups            4513.6     2         2256.8    13.8    <0.001
Within groups             9294.8    57          163.1
Total                    13808.4    59
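
The entries in the ANOVA table are linked by simple arithmetic, sketched below. The between-groups sum of squares is rebuilt from the group means; the within-groups sum of squares is copied from the table because the raw data are not available. The closed-form p-value is valid only because the between-groups degrees of freedom equal 2.

```python
# Group sizes and means from the stress-level table
groups = {"15-25": (20, 52.8), "26-45": (20, 33.4), "46-65": (20, 35.6)}

n_total = sum(n for n, _ in groups.values())
grand_mean = sum(n * m for n, m in groups.values()) / n_total

# Between-groups sum of squares: spread of group means around the grand mean
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in groups.values())

# Within-groups sum of squares taken from the ANOVA table (raw data not given)
ss_within = 9294.8

df_between = len(groups) - 1
df_within = n_total - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Upper tail of the F distribution; this closed form holds only for 2 numerator df
p_value = (1 + df_between * f_stat / df_within) ** (-df_within / 2)

print(f"F({df_between}, {df_within}) = {f_stat:.1f}, p = {p_value:.1e}")
```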

Interpretation

Significant difference between the three age-groups with respect to stress levels

But what about the specific (pairwise) differences?

Stress levels

Group Mean (SD)

15-25 (n=20) 52.8 (11.2)

26-45 (n=20) 33.4 (15.0)

46-65 (n=20) 35.6 (11.7)

Multiple comparisons

• Comparing each pair of means in turn gives a high probability of finding a significant result by chance

• A multiple comparison method (e.g. Scheffé, Duncan, Newman-Keuls) makes appropriate adjustment

Scheffé's test

Comparison

15-25 vs. 26-45 p<0.001

15-25 vs. 46-65 p<0.001

26-45 vs. 46-65 p=0.86


Comparison of medians

2 groups: Mann-Whitney

3 or more groups: Kruskal-Wallis

Kruskal-Wallis

Example:

Stress levels

Overall comparison of 3 groups:

p<0.001

Multiple comparisons

• There are no non-parametric equivalents to the multiple comparison tests such as Scheffé's

• Need to apply Bonferroni’s correction to multiple Mann-Whitney U-tests

Bonferroni’s correction

For k comparisons:

multiply each p-value by k (to a maximum of 1)

Mann-Whitney U-test

Comparison

15-25 vs. 26-45 p<0.001

15-25 vs. 46-65 p<0.001

26-45 vs. 46-65 p=0.68

Need to multiply each p-value by 3
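
The adjustment for the three Mann-Whitney comparisons can be sketched as below. The slides report two of the p-values only as "<0.001", so 0.001 is used as an upper bound for them; that substitution is an assumption made for illustration.

```python
# Unadjusted p-values; 0.001 stands in for the reported "p<0.001" (upper bound)
raw_p = {
    "15-25 vs. 26-45": 0.001,
    "15-25 vs. 46-65": 0.001,
    "26-45 vs. 46-65": 0.68,
}

k = len(raw_p)  # number of comparisons

# Bonferroni: multiply each p-value by k, never letting it exceed 1
adjusted_p = {name: min(1.0, p * k) for name, p in raw_p.items()}

for name, p in adjusted_p.items():
    print(f"{name}: adjusted p = {p:.3f}")
```

After adjustment the first two comparisons remain below 0.05 (at most 0.003), and the third is capped at 1.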

Significance test example V
