Basic Statistics II
Significance/hypothesis tests
RCT comparing drug A and drug B for the treatment of hypertension
• 50 patients allocated to A
• 50 patients allocated to B
• Outcome = systolic BP at 3 months
Results
Group A
Mean = 145, sd = 9.9
Group B
Mean = 135, sd = 10.0
Null hypothesis : “μ (A) = μ (B)”
[ie. difference equals 0]
Alternative hypothesis : “μ (A) ≠ μ (B)”
[ie. difference doesn’t equal zero]
[where μ = population mean]
Statistical problem
When can we conclude that
the observed difference
mean(A) - mean(B)
is large enough to suspect that
μ (A) - μ (B) is not zero?
P-value :
“probability of obtaining the observed data, or data more extreme, if the null hypothesis were true”
[eg. if no difference in systolic BP between two groups]
How do we evaluate the probability?
Test Statistic
• Numerical value which can be compared with a known statistical distribution
• Expressed in terms of the observed data and the data expected if the null hypothesis were true
Test statistic
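[mean(A) – mean(B)] / sd[mean(A) – mean(B)]
[where sd[mean(A) – mean(B)] is the standard error of the difference in means]
Under the null hypothesis, and for large samples, this ratio approximately follows a Normal distribution with mean = 0 and sd = 1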
Hypertension example
Test statistic = [mean(A) – mean(B)] / sd[mean(A) – mean(B)]
= [145 – 135] / 1.99 = 5.0
[where 1.99 = sqrt(9.9²/50 + 10.0²/50)]
→ p < 0.001
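The arithmetic of this example can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming equal group sizes of 50 and the large-sample Normal approximation stated above; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the hypertension example
mean_a, sd_a, n_a = 145.0, 9.9, 50
mean_b, sd_b, n_b = 135.0, 10.0, 50

# Standard error of the difference between two independent means
se_diff = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Test statistic: observed difference divided by its standard error
z = (mean_a - mean_b) / se_diff

# Two-sided p-value from the standard Normal distribution
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"se = {se_diff:.2f}, z = {z:.2f}, p = {p_two_sided:.2g}")
```

With samples this large the Normal and t distributions give practically the same answer.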
Interpretation
Mean systolic blood pressure at 3 months is significantly lower with Drug B than with Drug A in patients with hypertension
Two-sample t-test
Compares two independent groups of Normally distributed data
Significance test example I
Null hypothesis : “μ (A) = μ (B)”
[ie. difference equals 0]
Alternative hypothesis : “μ (A) ≠ μ (B)”
[ie. difference doesn’t equal zero]
Two-sided test
Null hypothesis : “μ (A) = μ (B) or μ (A) < μ (B)”
[ie. μ (A) ≤ μ (B)]
Alternative hypothesis : “μ (A) > μ (B)”
One-sided test
A one-sided test is only appropriate if a difference in the opposite direction would have the same meaning, or result in the same action, as no difference
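The practical difference between the two kinds of test is which tail(s) of the distribution count as "extreme". A small sketch, assuming a Normally distributed test statistic and the one-sided alternative μ (A) > μ (B); the values of z are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def p_values(z):
    """Return (two-sided, one-sided) p-values for a Normal test statistic z.

    The one-sided value is for the alternative mu(A) > mu(B),
    so only large positive z counts as evidence against the null.
    """
    upper_tail = 1 - NormalDist().cdf(z)
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return two_sided, upper_tail

# Hypothetical test statistics, not taken from the slides
for z in (1.8, -1.8):
    two_sided, one_sided = p_values(z)
    print(f"z = {z:+.1f}: two-sided p = {two_sided:.3f}, one-sided p = {one_sided:.3f}")
```

A difference in the "wrong" direction (negative z here) gives a large one-sided p-value, so it is treated exactly like no difference.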
Paired-sample t-test
Compares two dependent groups of Normally distributed data
Paired-sample t-test
Mean daily dietary intake of 11 women measured over 10 pre-menstrual and 10 post-menstrual days
Dietary intake example
Pre-menstrual (n=11):
Mean=6753kJ, sd=1142
Post-menstrual (n=11):
Mean=5433kJ, sd=1217
Difference
Mean=1320, sd=367
Dietary intake example
Test statistic = 1320/[367/sqrt(11)]
= 11.9
p<0.001
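The paired calculation uses only the mean and sd of the 11 within-woman differences. A minimal sketch, assuming the p-value is judged against a tabulated critical value of the t distribution with 10 degrees of freedom (4.587 is the standard two-sided 0.1% table value) rather than computed exactly.

```python
from math import sqrt

# Summary of the paired differences (pre minus post) from the dietary intake example
mean_diff, sd_diff, n_pairs = 1320.0, 367.0, 11

# Standard error of the mean difference, then the paired t statistic
se_mean_diff = sd_diff / sqrt(n_pairs)
t_stat = mean_diff / se_mean_diff
df = n_pairs - 1

# Two-sided 0.1% critical value of t with 10 df, taken from standard tables
T_CRIT_0_001_DF10 = 4.587

print(f"t = {t_stat:.1f} on {df} df")
print("p < 0.001" if abs(t_stat) > T_CRIT_0_001_DF10 else "p >= 0.001")
```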
Dietary intake example
Dietary intake during the pre-menstrual period was significantly greater than that during the post-menstrual period
The equivalent non-parametric tests
• Mann-Whitney U-test (equivalent of the two-sample t-test)
• Wilcoxon matched pairs signed rank sum test (equivalent of the paired-sample t-test)
Non-parametric tests
• Based on the ranks of the data
• Use complicated formulae
• Hence a computer package is recommended
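To show what "based on the ranks" means, here is a rough sketch of the Mann-Whitney U statistic computed by hand. The blood pressure readings are hypothetical, the function names are illustrative, and the sketch stops at the statistic: turning U into a p-value needs tables or a statistics package.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) computed from the rank sum of group A in the pooled data."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a

# Hypothetical systolic BP readings, not taken from the slides
u_a, u_b = mann_whitney_u([150, 142, 138, 155], [131, 138, 129, 140])
print(u_a, u_b)
```

Only the ordering of the observations enters the calculation, which is why the test does not require Normally distributed data.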
Significance test example II
Type I error
Significant result when the null hypothesis is true
[probability conventionally set at 0.05]
Type II error
Non-significant result when the null hypothesis is false
[Power = 1 – probability of Type II error]
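The two error rates can be made concrete with the hypertension example. The sketch below assumes a two-sided Normal test at the 0.05 level and a hypothetical true difference of 5 mmHg, which is not a figure from the slides; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(true_diff, se_diff, alpha=0.05):
    """Probability of a significant two-sided Normal test when the true difference is true_diff."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_diff / se_diff
    # Chance that the test statistic falls in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Standard error from the hypertension example (sd 9.9 and 10.0, 50 patients per group)
se_diff = sqrt(9.9**2 / 50 + 10.0**2 / 50)

# With no true difference, the chance of a significant result is the Type I error rate
type_i_rate = power_two_sided_z(0.0, se_diff)

# Hypothetical true difference of 5 mmHg, chosen only for illustration
power = power_two_sided_z(5.0, se_diff)
type_ii_rate = 1 - power

print(f"Type I = {type_i_rate:.3f}, power = {power:.2f}, Type II = {type_ii_rate:.2f}")
```

Larger samples shrink the standard error and so raise the power for the same true difference.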
The chi-square test
Used to investigate the relationship between two qualitative variables
The analysis of cross-tabulations
The chi-square test
Compares proportions in two independent samples
Chi-square test example
In an RCT comparing infra-red stimulation (IRS) with placebo on pain caused by osteoarthritis,
9/12 in IRS group ‘improved’ compared with 4/13 in placebo group
Chi-square test example
              Improve?
            Yes    No    Total
Placebo       4     9      13
IRS           9     3      12
Total        13    12      25
Placebo : 4/13 = 31% improve
IRS: 9/12 = 75% improve
Cross-tabulations
The chi-square test tests the null hypothesis of no relationship between ‘group’ and ‘improvement’ by comparing the observed frequencies with those expected if the null hypothesis were true
Cross-tabulations
Expected frequency
= (row total x column total) / grand total
Chi-square test example
              Improve?
            Yes    No    Total
Placebo       4     9      13
IRS           9     3      12
Total        13    12      25
Expected value for ‘4’ = 13 x 13 / 25
= 6.8
Expected values
              Improve?
            Yes    No    Total
Placebo     6.8   6.2      13
IRS         6.2   5.8      12
Total        13    12      25
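The expected values follow mechanically from the row and column totals. A short sketch using the observed table from the IRS example; the variable names are illustrative.

```python
observed = [
    [4, 9],  # Placebo: improved, not improved
    [9, 3],  # IRS: improved, not improved
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell = row total x column total / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(("Placebo", "IRS"), expected):
    print(label, [round(e, 1) for e in row])
```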
Test Statistic
= Σ (observed freq – expected freq)² / expected freq
[summed over all cells of the table]
Test Statistic
= Σ (O – E)² / E
= (4 – 6.8)²/6.8 + (9 – 6.2)²/6.2
+ (9 – 6.2)²/6.2 + (3 – 5.8)²/5.8
= 4.9 [using unrounded expected values] → p = 0.027
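The same calculation can be checked in a few lines. This is a minimal sketch for a 2 x 2 table only: it relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard Normal, so its upper tail probability can be written with the complementary error function.

```python
from math import erfc, sqrt

observed = [[4, 9], [9, 3]]  # rows: Placebo, IRS; columns: improved yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all four cells, using unrounded expected values
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

# For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.1f}, p = {p_value:.3f}")
```

Tables larger than 2 x 2 have more degrees of freedom and need the general chi-square distribution from a statistics package.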
Chi-square test example
Statistically significant difference in improvement between the IRS and placebo groups
Small samples
The chi-square test is valid if:
at least 80% of the expected frequencies exceed 5 and all the expected frequencies exceed 1
Small samples
If criterion not satisfied then combine or delete rows and columns to give bigger expected values
Small samples
Alternatively:
Use Fisher’s Exact Test
[calculates the probability of the observed table of frequencies, or of more extreme tables, under the null hypothesis]
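For a 2 x 2 table the exact probabilities come from counting arrangements with the margins held fixed. The sketch below applies this to the IRS table and assumes one common two-sided convention: "more extreme" means any table no more probable than the observed one. Other conventions exist and give slightly different p-values.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def ways(x):
        # Number of arrangements with x in the top-left cell and fixed margins
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed_ways = ways(a)
    # Integer counts are compared, so there is no floating-point tie problem
    extreme = sum(ways(x) for x in range(lowest, highest + 1) if ways(x) <= observed_ways)
    return extreme / comb(total, col1)

p_value = fisher_exact_two_sided([[4, 9], [9, 3]])
print(f"p = {p_value:.3f}")
```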
Yates’ Correction
Improves the estimation of the discrete distribution of the test statistic by the continuous chi-square distribution
Chi-square test with Yates’ correction
Subtract ½ from the absolute value of each O – E difference before squaring
Σ (|O – E| – ½)² / E
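Applied to the IRS table, the correction makes the test statistic smaller and the p-value larger. A brief sketch under the same assumptions as the uncorrected test (2 x 2 table, 1 degree of freedom); the variable names are illustrative.

```python
from math import erfc, sqrt

observed = [[4, 9], [9, 3]]  # rows: Placebo, IRS; columns: improved yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (|O - E| - 1/2)^2 / E over all four cells
chi_square_yates = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square_yates += (abs(o - e) - 0.5) ** 2 / e

# Upper tail of chi-square with 1 degree of freedom
p_value_yates = erfc(sqrt(chi_square_yates / 2))

print(f"corrected chi-square = {chi_square_yates:.2f}, p = {p_value_yates:.3f}")
```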
Significance test example III
McNemar’s test
Compares proportions in two matched samples
McNemar’s test example
                        Severe cold age 14
                        Yes     No    Total
Severe cold    Yes      212    144      356
age 12         No       256    707      963
               Total    468    851     1319
McNemar’s test example
Null hypothesis =
proportions saying ‘yes’ on the 1st and 2nd occasions are the same
ie. the frequencies for ‘yes,no’ and ‘no,yes’ are equal
McNemar’s test
• Test statistic based on observed and expected ‘discordant’ frequencies
• Similar to that for simple chi-square test
McNemar’s test example
Test statistic = (144 – 256)² / (144 + 256) = 31.4
→ p < 0.001
Significant difference in the proportion with severe colds between ages 12 and 14
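Only the two discordant cells enter the calculation. The sketch below assumes the rows of the table are age 12 and the columns age 14, uses no continuity correction (which reproduces the 31.4 quoted), and refers the result to chi-square with 1 degree of freedom.

```python
from math import erfc, sqrt

# Discordant pairs from the severe cold example
yes_then_no = 144  # severe cold at age 12, not at age 14
no_then_yes = 256  # no severe cold at age 12, severe cold at age 14

# Under the null hypothesis each discordant cell is expected to hold half the discordant pairs
expected = (yes_then_no + no_then_yes) / 2
statistic = ((yes_then_no - expected) ** 2 + (no_then_yes - expected) ** 2) / expected

# Upper tail of chi-square with 1 degree of freedom
p_value = erfc(sqrt(statistic / 2))

print(f"McNemar statistic = {statistic:.1f}, p = {p_value:.2g}")
```

The 212 and 707 children whose answers did not change carry no information about a shift between the two ages.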
Significance test example IV
Comparison of means
2 groups:            2-sample t-test
3 or more groups:    ANOVA
One-way analysis of variance
Example:
Assessing the effect of treatment on the stress levels of a cohort of 60 subjects.
3 age-groups: 15-25, 26-45, 46-65
Stress measured on scale 0-100
Stress levels
Group            Mean (SD)
15-25 (n=20)     52.8 (11.2)
26-45 (n=20)     33.4 (15.0)
46-65 (n=20)     35.6 (11.7)
Graph of stress levels
[Figure: stress level (vertical axis, 0 to 80) plotted by age group]
ANOVA
                  Sum of squares    Df    Mean square    F       Sig
Between groups        4513.6         2      2256.8       13.8    <0.001
Within groups         9294.8        57       163.1
Total                13808.4        59
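Most of this table can be rebuilt from the group means and SDs given earlier. A sketch, assuming only those summary figures are available: the between-groups sum of squares matches the table, while the within-groups value comes out slightly lower than 9294.8 because the published SDs are rounded.

```python
# (n, mean, sd) for each age group, from the stress levels table
groups = {
    "15-25": (20, 52.8, 11.2),
    "26-45": (20, 33.4, 15.0),
    "46-65": (20, 35.6, 11.7),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between groups: spread of the group means around the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within groups: pooled spread of individuals around their own group mean
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

# F is the ratio of the two mean squares
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}, F = {f_ratio:.1f}")
```

The significance level requires the F distribution on 2 and 57 degrees of freedom, which is normally read from a statistics package.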
Interpretation
Significant difference between the three age-groups with respect to stress levels
But what about the specific (pairwise) differences?
Multiple comparisons
• Comparing each pair of means in turn gives a high probability of finding a significant result by chance
• A multiple comparison method (eg. Scheffé, Duncan, Newman-Keuls) makes appropriate adjustment
Scheffé’s test
Comparison
15-25 vs. 26-45 p<0.001
15-25 vs. 46-65 p<0.001
26-45 vs. 46-65 p=0.86
Comparison of medians
2 groups:            Mann-Whitney
3 or more groups:    Kruskal-Wallis
Kruskal-Wallis
Example:
Stress levels
Overall comparison of 3 groups:
p<0.001
Multiple comparisons
• There are no non-parametric equivalents to the multiple comparison tests such as Scheffé’s
• Need to apply Bonferroni’s correction to multiple Mann-Whitney U-tests
Bonferroni’s correction
For k comparisons between means:
multiply each p value by k [a result above 1 is reported as 1]
Mann-Whitney U-test
Comparison
15-25 vs. 26-45 p<0.001
15-25 vs. 46-65 p<0.001
26-45 vs. 46-65 p=0.68
Need to multiply each p-value by 3
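The adjustment for the three pairwise comparisons can be sketched as follows. Since the slides report two of the p-values only as "<0.001", the value 0.001 is used here as an upper bound, which is an assumption; the function name is illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capping the result at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Pairwise Mann-Whitney results; 0.001 stands in as an upper bound for "p<0.001"
raw = {
    "15-25 vs. 26-45": 0.001,
    "15-25 vs. 46-65": 0.001,
    "26-45 vs. 46-65": 0.68,
}

adjusted = dict(zip(raw, bonferroni(list(raw.values()))))

for comparison, p in adjusted.items():
    print(f"{comparison}: adjusted p = {p:.3f}")
```

The two comparisons involving the 15-25 group remain significant after adjustment, while 0.68 x 3 exceeds 1 and is therefore reported as 1.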
Significance test example V