TRANSCRIPT
Cross-tabulation
金枝玉孽
無間道
神鵰俠侶
白娘子傳奇
Objectives
To study the use of Crosstab for data analysis.
To study certain measures of association.
Content
Introduction
Crosstab
Measures of Association
Introduction
X                   Y                   Method
Nominal             Nominal             Chi-squared test in Crosstab
Nominal             Ordinal or above    ANOVA
Ordinal or above    Nominal             Discriminant Analysis
Ordinal or above    Ordinal or above    Regression Analysis
Introduction
A cross-tabulation involves the simultaneous counting of the number of observations that fall into each combination of categories of two or more variables.
Age group    Highly loyal    Moderately loyal    Brand switchers    Total
<30          30              42                  18                 90
30-40        14              20                  31                 65
>40          34              25                  16                 75
Total        78              87                  65                 230
1. Introduction
Example: Testing the effectiveness of a coupon in increasing consumer awareness of Brand A:
                Before coupon                  After coupon
                Aware   Not aware   Total      Aware   Not aware   Total
Test area       250     350         600        330     170         500
Control area    160     240         400        160     220         380
Total           410     590         1000       490     390         880
1. Introduction
In percentage of column totals:
                Before coupon                  After coupon
                Aware   Not aware   Total      Aware   Not aware   Total
Test area       61      59          60         67      44          57
Control area    39      41          40         33      56          43
Total           100     100         100        100     100         100
Introduction
In percentage of row totals:
                Before coupon                  After coupon
                Aware   Not aware   Total      Aware   Not aware   Total
Test area       42      58          100        66      34          100
Control area    40      60          100        42      58          100
Total           41      59          100        56      44          100
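The percentage tables can be recomputed from the counts with base R. This is only a sketch: the object names are illustrative, and the after-coupon test-area counts are taken as 330 aware and 170 not aware.

```r
# Counts from the coupon example (rows: area, columns: awareness).
labels <- list(area = c("Test", "Control"), aware = c("Aware", "Not aware"))
before <- matrix(c(250, 350,
                   160, 240), nrow = 2, byrow = TRUE, dimnames = labels)
after  <- matrix(c(330, 170,
                   160, 220), nrow = 2, byrow = TRUE, dimnames = labels)

# margin = 1 divides by row totals, margin = 2 by column totals.
row_pct_before <- round(100 * prop.table(before, margin = 1))
col_pct_before <- round(100 * prop.table(before, margin = 2))
row_pct_after  <- round(100 * prop.table(after, margin = 1))

print(addmargins(after))   # counts with row and column totals
print(row_pct_before)
print(row_pct_after)
```

Row percentages answer the question of interest here: within each area, what share of consumers is aware of the brand before and after the coupon.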
Simpson’s Paradox
Lurking variables can change or reverse a relation between two categorical variables!
          1997                    2007
Age       Fraction   Income       Fraction   Income
<=45      0.5        $60,000      0.7        $70,000
>45       0.5        $120,000     0.3        $130,000
Mean                 $90,000                 $88,000
Simpson’s Paradox
          Buyer   Nonbuyer   Total   Buyer %
Male      700     300        1000    70%
Female    400     600        1000    40%
Total     1100    900        2000    55%
Simpson’s Paradox
Luxury    Buyer   Nonbuyer   Total   Buyer %
Male      40      160        200     20%
Female    200     600        800     25%
Total     240     760        1000    24%
Plain     Buyer   Nonbuyer   Total   Buyer %
Male      660     140        800     82.5%
Female    200     0          200     100%
Total     860     140        1000    86%
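The reversal in the buyer tables can be checked in a few lines of R. This is a minimal sketch; the helper name buyer_rate is illustrative and not part of the slides.

```r
# Buyer counts by sex within each product type (lurking variable).
labels <- list(sex = c("Male", "Female"), outcome = c("Buyer", "Nonbuyer"))
luxury <- matrix(c(40, 160,
                   200, 600), nrow = 2, byrow = TRUE, dimnames = labels)
plain  <- matrix(c(660, 140,
                   200, 0), nrow = 2, byrow = TRUE, dimnames = labels)

# Share of buyers in each row of a table.
buyer_rate <- function(tab) tab[, "Buyer"] / rowSums(tab)

combined <- luxury + plain      # collapsing over product type
print(buyer_rate(combined))     # males buy more often overall
print(buyer_rate(luxury))       # females buy more often within luxury
print(buyer_rate(plain))        # females buy more often within plain
```

Males mostly shop for the plain product, which has a high buying rate for everyone, so the aggregate table favours males even though females have the higher rate within each product type.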
Chi-squared Test
The chi-square test assumes a multinomial experiment. The multinomial experiment is analogous to tossing n balls at k boxes, where:
1. Each ball must fall in one of the boxes.
2. The probability, p_i, that a ball will fall in box i remains the same in repeated tosses.
3. The trials are independent.
4. At the conclusion of the experiment, we have n_1 balls in box one, n_2 balls in box two, and so on.
Multinomial Distribution
n independent trials, each permitting k mutually exclusive outcomes whose respective probabilities are P_1, P_2, ..., P_k ($\sum_{i=1}^{k} P_i = 1$).
Each P_i remains constant throughout the n trials.
We are interested in the probability of getting X_1 outcomes of the 1st kind, X_2 outcomes of the 2nd kind, ..., X_k outcomes of the kth kind ($\sum_{i=1}^{k} X_i = n$):
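The formula itself did not survive in the transcript. For reference, the standard multinomial probability mass function, written with the symbols defined above, is:

```latex
P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k)
  = \frac{n!}{x_1!\, x_2! \cdots x_k!}\; P_1^{x_1} P_2^{x_2} \cdots P_k^{x_k},
\qquad \sum_{i=1}^{k} x_i = n .
```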
Chi-squared Test
The chi-square test can be used to test whether the observed association between the variables in the cross-tabulation is statistically significant. This is called the test of independence.
Example (expected counts in parentheses):
Age group    Highly loyal    Moderately loyal    Brand switchers    Total
<30          30 (30.5)       42 (34.1)           18 (25.4)          90
30-40        14 (22.1)       20 (24.5)           31 (18.4)          65
>40          34 (25.4)       25 (28.4)           16 (21.2)          75
Total        78              87                  65                 230
Chi-squared Test
Chi-squared statistic = $\sum (O - E)^2 / E$, with degrees of freedom = (r - 1)(c - 1)
$H_0$: Age and Loyalty are independent
$$\chi^2 = \frac{(30 - 30.5)^2}{30.5} + \frac{(42 - 34.1)^2}{34.1} + \cdots + \frac{(16 - 21.2)^2}{21.2} = 21.01$$
$$df = (3 - 1)(3 - 1) = 4$$
$$P(\chi^2_{4} \ge 21.01) = 0.0003 < 0.05, \text{ so reject } H_0.$$
Age group    Highly loyal    Moderately loyal    Brand switchers    Total
<30          30 (30.5)       42 (34.1)           18 (25.4)          90
30-40        14 (22.1)       20 (24.5)           31 (18.4)          65
>40          34 (25.4)       25 (28.4)           16 (21.2)          75
Total        78              87                  65                 230
Each cell shows the observed count O followed by the expected count E in parentheses.
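The age-by-loyalty test can be reproduced with base R's chisq.test. This is a sketch with illustrative object names; because R does not round the expected counts, the statistic may differ slightly from the slide's 21.01.

```r
# Observed counts: age group (rows) by loyalty (columns).
loyalty <- matrix(c(30, 42, 18,
                    14, 20, 31,
                    34, 25, 16), nrow = 3, byrow = TRUE,
                  dimnames = list(age = c("<30", "30-40", ">40"),
                                  loyalty = c("High", "Moderate", "Switcher")))

test <- chisq.test(loyalty)

# Expected count in each cell = row total * column total / n.
print(round(test$expected, 1))

# The statistic is the sum of (O - E)^2 / E over all cells.
manual <- sum((loyalty - test$expected)^2 / test$expected)

print(test$statistic)   # chi-squared statistic
print(test$parameter)   # degrees of freedom, (3 - 1) * (3 - 1)
print(test$p.value)     # compare with the 0.05 significance level
```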
Chi-Square Distribution Table
CrossTable
• The CrossTable( ) function in the gmodels package produces cross-tabulations and test results.
• install.packages("gmodels") if necessary.
• library(gmodels)
• sales <- read.table('sales.csv', header = TRUE, sep = ',')
• CrossTable(sales$gender, sales$purchase, expected = TRUE, prop.chisq = FALSE, chisq = TRUE, fisher = TRUE, sresid = TRUE, format = "SPSS")
CrossTable
Total Observations in Table: 582

             | purchase
      gender |        no |       yes | Row Total |
-------------|-----------|-----------|-----------|
      female |       102 |       271 |       373 |
             |   125.615 |   247.385 |           |
             |   27.346% |   72.654% |   64.089% |
             |   52.041% |   70.207% |           |
             |   17.526% |   46.564% |           |
             |    -2.107 |     1.501 |           |
-------------|-----------|-----------|-----------|
        male |        94 |       115 |       209 |
             |    70.385 |   138.615 |           |
             |   44.976% |   55.024% |   35.911% |
             |   47.959% |   29.793% |           |
             |   16.151% |   19.759% |           |
             |     2.815 |    -2.006 |           |
-------------|-----------|-----------|-----------|
Column Total |       196 |       386 |       582 |
             |   33.677% |   66.323% |           |
-------------|-----------|-----------|-----------|
CrossTable
Statistics for All Table Factors
Pearson's Chi-squared test
-------------------------------------------------------
Chi^2 = 18.64021    d.f. = 1    p = 1.578558e-05

Pearson's Chi-squared test with Yates' continuity correction
-------------------------------------------------------
Chi^2 = 17.85923    d.f. = 1    p = 2.378625e-05

Standardized Residual = $\dfrac{\text{Count} - \text{Expected Count}}{\sqrt{\text{Expected Count}}}$
CrossTable
Fisher's Exact Test for Count Data
-----------------------------------------------------------
Sample estimate odds ratio: 0.4611142

Alternative hypothesis: true odds ratio is not equal to 1
p = 2.408778e-05
95% confidence interval: 0.3179706 0.6676994

Alternative hypothesis: true odds ratio is less than 1
p = 1.352295e-05
95% confidence interval: 0 0.6308124

Alternative hypothesis: true odds ratio is greater than 1
p = 0.999994
95% confidence interval: 0.3367543 Inf

Minimum expected frequency: 70.38488
Yates’s continuity correction
Chi-squared Test
Someone gives you a 52-card deck. You draw a card, record its suit, replace the card, shuffle the deck and repeat that process 200 times, obtaining the following table:
Diamonds    Clubs    Hearts    Spades
46          54       49        51
Does the distribution of suits appear to be standard?
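The card question is a goodness-of-fit test against equal suit probabilities. A minimal sketch with base R's chisq.test follows; the variable names are illustrative.

```r
# Observed suit counts from 200 draws with replacement.
suits <- c(Diamonds = 46, Clubs = 54, Hearts = 49, Spades = 51)

# A standard deck gives each suit probability 1/4, so E = 50 per suit.
gof <- chisq.test(suits, p = rep(1 / 4, 4))

print(gof$expected)
print(gof$statistic)   # sum of (O - E)^2 / E
print(gof$parameter)   # degrees of freedom = 4 - 1
print(gof$p.value)
```

A large p-value here means the observed counts are consistent with a standard deck.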
Chi-squared Test
To use the test, each expected cell count should be at least one, and not more than 20% of the expected frequencies (E) should be less than 5.
Pooling of categories is a method to solve the problem of having more than 20% of the cells with expected frequency < 5.
Measures of Association
• Need the vcd package to get measures of association
• install.packages("vcd") if needed
• library(vcd)
• tab <- xtabs(~gender + purchase, data = sales) # produce a table between gender and purchase
• summary(assocstats(tab))
Measures of Association
Call: xtabs(formula = ~gender + purchase)
Number of cases in table: 582
Number of factors: 2
Test for independence of all factors:
    Chisq = 18.64, df = 1, p-value = 1.579e-05
                    X^2    df    P(> X^2)
Likelihood Ratio    18.368  1    1.8212e-05
Pearson             18.640  1    1.5786e-05

Phi-Coefficient   : 0.179
Contingency Coeff.: 0.176
Cramer's V        : 0.179
Magnitude of Effect
For phi, Contingency Coefficient, Cramer’s V:
small ≈ 0.1
moderate ≈ 0.3
large ≈ 0.5
Measures of Association
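$$\text{Phi} = \sqrt{\frac{\chi^2}{n}}$$
$$\text{Cramer's V} = \sqrt{\frac{\chi^2}{n(k - 1)}}$$
$$\text{Contingency Coefficient} = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$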
The likelihood ratio
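As a cross-check on the assocstats output, the coefficients can be computed directly from the gender-by-purchase counts. This is a sketch: the likelihood-ratio line uses the standard form G² = 2 Σ O ln(O/E), which the transcript does not show, and the object names are illustrative.

```r
# Gender (rows) by purchase (columns), counts from the CrossTable output.
tab <- matrix(c(102, 271,
                 94, 115), nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("female", "male"),
                              purchase = c("no", "yes")))
n <- sum(tab)
k <- min(dim(tab))                      # smaller of rows and columns

# Pearson chi-squared without Yates' correction, as in the formulas.
pearson <- chisq.test(tab, correct = FALSE)
chi2 <- unname(pearson$statistic)

phi         <- sqrt(chi2 / n)
cramers_v   <- sqrt(chi2 / (n * (k - 1)))
contingency <- sqrt(chi2 / (chi2 + n))

# Likelihood-ratio statistic (standard form, assumed here).
g2 <- 2 * sum(tab * log(tab / pearson$expected))

print(round(c(phi = phi, V = cramers_v, C = contingency, G2 = g2), 3))
```

For a table with a dimension of 2, k - 1 = 1, so Cramer's V and phi coincide.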
Measures of Association
Introduction
There is a plethora of indexes for measuring the strength of association. Factors affecting the value of a particular measure:
Skewed marginal distributions - only a few indices are impervious to marginal distributions.
Nonsquare tables - some measures cannot attain their maximum if the table is nonsquare.
Nominal Measures
These measures only provide some indication of the strength of the association, but they cannot show direction and nature of the relationship.
Odds Ratio
The Odds Ratio
Consider the following table:
                Loyal    Disloyal
White collar    30       42
Blue collar     14       20
Look at the odds of being white-collar. In the case of Loyal, these odds are 30 to 14, or about 2:1. In the case of Disloyal, these odds are 42 to 20, also about 2:1.
The ratio of these two odds is called the odds ratio:
odds ratio = n11n22/(n12n21) = (30 × 20)/(42 × 14) ≈ 1.02
Odds Ratio with a Stratifying Variable
Data come from a cohort study or case-control study that is stratified; for example, the data may be separated (stratified) by the sex or, as here, the age of the people studied. Consider the following tables:
Assuming a constant odds ratio across age strata, test whether the odds ratio is 1 and report a p-value.
High age    High alcohol consumption    Low alcohol consumption
Case        25                          21
Control     29                          128
Odds Ratio with a Stratifying Variable
> mymatrix1 <- matrix(c(8,5,52,164),nrow=2,byrow=TRUE)
> colnames(mymatrix1) <- c("High","Low")
> rownames(mymatrix1) <- c("Case","Control")
> print(mymatrix1)
        High Low
Case       8   5
Control   52 164
> mymatrix2 <- matrix(c(25,21,29,128),nrow=2,byrow=TRUE)
> colnames(mymatrix2) <- c("High","Low")
> rownames(mymatrix2) <- c("Case","Control")
> print(mymatrix2)
        High Low
Case      25  21
Control   29 128
Odds Ratio with a Stratifying Variable
The Mantel-Haenszel odds ratio estimates the odds ratio for the association between case status and alcohol consumption, controlling for the possible confounding effects of the stratifying variable (age here).
> install.packages("lawstat")
> library("lawstat")
> myarray <- array(c(mymatrix1,mymatrix2),dim=c(2,2,2))
> cmh.test(myarray)
Cochran-Mantel-Haenszel Chi-square Test
data: myarray
CMH statistic = 32.181, df = 1.000, p-value = 0.000, MH Estimate = 5.197, Pooled Odd Ratio = 4.575, Odd Ratio of level 1 = 5.046, Odd Ratio of level 2 = 5.255
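The same analysis can be sketched without an extra package, using mantelhaen.test from base R's stats package. The stratum labels are an assumption (the first matrix is taken to be the low-age stratum, since the slide labels only the high-age table), and the helper odds_ratio is illustrative.

```r
# Case-control counts by alcohol consumption, one 2x2 table per age stratum.
low_age  <- matrix(c(8, 5, 52, 164), nrow = 2, byrow = TRUE)    # assumed label
high_age <- matrix(c(25, 21, 29, 128), nrow = 2, byrow = TRUE)

strata <- array(c(low_age, high_age), dim = c(2, 2, 2),
                dimnames = list(status = c("Case", "Control"),
                                alcohol = c("High", "Low"),
                                age = c("Low age", "High age")))

# Cross-product odds ratio n11 * n22 / (n12 * n21) for one 2x2 table.
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

or_by_stratum <- apply(strata, 3, odds_ratio)
or_pooled <- odds_ratio(low_age + high_age)   # ignores the strata

mh <- mantelhaen.test(strata)                 # tests common odds ratio = 1

print(round(or_by_stratum, 3))
print(round(or_pooled, 3))
print(mh$estimate)                            # Mantel-Haenszel common odds ratio
print(mh$p.value)
```

The stratum-specific odds ratios are close to each other, which is what makes a single common odds ratio a reasonable summary.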
Measures of Association
The log odds ratio is insensitive to the marginal distributions. The following tables have the same odds ratio, (75 × 100)/(15 × 10) = (750 × 100)/(15 × 100) = 50, and hence the same log odds ratio:
75    15        750    15
10    100       100    100
85    115       850    115
The odds ratio is also invariant under interchange of rows and columns; hence, the odds ratio is a symmetric measure.
Measures of Association
Yule's Y (or the Coefficient of Colligation)
Y^ = (sqrt(n11n22) - sqrt(n12n21))/(sqrt(n11n22) + sqrt(n12n21))
variance = (1/n11 + 1/n21 + 1/n12 + 1/n22)(1 - Y^2)^2/16
Pearson's Product Moment Correlation
The following formula is used:
r = (n11n22 - n12n21)/sqrt(n1.n2.n.1n.2)
r lies between -1.0 and 1.0 and equals 0 if the variables are independent.
Measures of Association
Cramer introduced the following variant (it can achieve the maximum value of 1):
V = sqrt(χ²/(n(k - 1)))
k is the smaller of the number of rows and columns. If one of the table's dimensions is 2, V and phi are the same.
Another version is Tschuprow's T = sqrt(χ²/(n sqrt[(I - 1)(J - 1)]))
T varies between 0 and 1 and can attain its maximum of 1 only when I = J.
Measures of Association
Proportional Reduction in Error
Refer to the following table:
               Like the product    Dislike the product    Total
High income    26                  8                      34
Low income     17                  33                     50
Total          43                  41                     84
Measures of Association
Suppose that we want to predict what category a randomly selected person would fall in when asked the question "Do you like the product?". If we had no knowledge of the row variable, the best bet is always "like the product", and we shall be wrong in 41 of the 84 cases. If we know that the subject comes from the high income group, we shall predict "like the product" and be wrong in only 8 cases. If we know that he/she is in the low income group, we shall predict "dislike the product" and be wrong in 17 cases. Hence, we have reduced our number of prediction errors to 8 + 17 = 25 cases. Proportional reduction in error is:
lambdaC/R = (41 - 25)/41 = 0.39
Measures of Association
That is, 39% of the errors in predicting the column variables are eliminated by knowing the row variable.
Lambda is asymmetric and varies between 0, indicating no ability at all to eliminate errors in the column variable on the basis of the row variable, and 1, indicating an ability to eliminate all errors in the column variable predictions, given knowledge of the row variable. In general, the asymmetric lambda for predicting the column variable is:
lambdaC/R = (ΣfKR* - FC*)/(n - FC*)
fKR* is the maximum frequency found within each category of the row variable (here 26 and 33).
FC* is the maximum frequency among the marginal totals of the column variable (here 43).
Measures of Association
For the rows: lambdaR/C = (ΣfLC* - FR*)/(n - FR*)
A symmetric measure: lambda = (ΣfKR* - FC* + ΣfLC* - FR*)/(2n - FC* - FR*)
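The three lambda formulas can be applied to the income-by-opinion table in a few lines of R. This is a sketch; the variable names mirror the symbols above but are otherwise illustrative.

```r
# Income (rows) by opinion of the product (columns).
tab <- matrix(c(26, 8,
                17, 33), nrow = 2, byrow = TRUE,
              dimnames = list(income = c("High", "Low"),
                              opinion = c("Like", "Dislike")))
n <- sum(tab)

sum_f_kr <- sum(apply(tab, 1, max))   # largest cell in each row, summed
sum_f_lc <- sum(apply(tab, 2, max))   # largest cell in each column, summed
f_c <- max(colSums(tab))              # largest column total
f_r <- max(rowSums(tab))              # largest row total

lambda_col_given_row <- (sum_f_kr - f_c) / (n - f_c)
lambda_row_given_col <- (sum_f_lc - f_r) / (n - f_r)
lambda_symmetric <- (sum_f_kr - f_c + sum_f_lc - f_r) / (2 * n - f_c - f_r)

print(round(c(col_given_row = lambda_col_given_row,
              row_given_col = lambda_row_given_col,
              symmetric = lambda_symmetric), 3))
```

The two asymmetric values differ because knowing income helps predict opinion more than knowing opinion helps predict income in this table.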
Ordinal Measures of Association
Kendall’s tau, or the Kendall rank correlation coefficient.
A pair of cases is concordant (P), if the values of both variables for one case are higher (or both are lower) than the corresponding values for the other case. The pair is discordant (Q) if the value of one variable for a case is larger than the corresponding value for the other case, and the direction is reversed for the second variable. When the 2 cases have identical values on one or on both variables, they are tied.
Ordinal Measures of Association
If the preponderance of pairs is concordant, the association is said to be positive, otherwise it is negative. If concordant and discordant pairs are equally likely, no association is said to exist. The following are some measures:
Kendall tau-a = (P - Q) / total number of pairs.
tau-b = (P - Q)/sqrt[(P + Q + Tx)(P + Q + Ty)] where Tx = pairs tied only on X and Ty = pairs tied only on Y.
tau-c = 2m(P - Q)/(n^2(m - 1)) where m is the smaller of the number of rows and columns.
Goodman and Kruskal’s gamma = (P - Q)/(P + Q); gamma = 0 when the variables are independent.
Somers’ d = (P - Q)/(P + Q + Ty)
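Counting concordant and discordant pairs from a table is mechanical, so a short sketch may help. It reuses the age-by-loyalty counts from the chi-squared example and assumes, for illustration only, that both variables are ordered from the first row and column upward; the function pair_counts is hypothetical.

```r
tab <- matrix(c(30, 42, 18,
                14, 20, 31,
                34, 25, 16), nrow = 3, byrow = TRUE)

pair_counts <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  P <- 0; Q <- 0
  for (i in seq_len(nr - 1)) {
    below <- (i + 1):nr
    for (j in seq_len(nc)) {
      # Cells below and to the right are higher on both variables.
      if (j < nc) P <- P + tab[i, j] * sum(tab[below, (j + 1):nc])
      # Cells below and to the left are higher on X but lower on Y.
      if (j > 1) Q <- Q + tab[i, j] * sum(tab[below, 1:(j - 1)])
    }
  }
  # Pairs in the same row (or column) but in different cells.
  tied_only <- function(v) (sum(v)^2 - sum(v^2)) / 2
  list(P = P, Q = Q,
       Tx = sum(apply(tab, 1, tied_only)),   # tied on X (rows) only
       Ty = sum(apply(tab, 2, tied_only)))   # tied on Y (columns) only
}

pc <- pair_counts(tab)
n <- sum(tab)
m <- min(dim(tab))

tau_a    <- (pc$P - pc$Q) / (n * (n - 1) / 2)
tau_b    <- (pc$P - pc$Q) / sqrt((pc$P + pc$Q + pc$Tx) * (pc$P + pc$Q + pc$Ty))
tau_c    <- 2 * m * (pc$P - pc$Q) / (n^2 * (m - 1))
gamma_gk <- (pc$P - pc$Q) / (pc$P + pc$Q)
somers_d <- (pc$P - pc$Q) / (pc$P + pc$Q + pc$Ty)

print(unlist(pc))
print(round(c(tau_a = tau_a, tau_b = tau_b, tau_c = tau_c,
              gamma = gamma_gk, somers_d = somers_d), 4))
```

All five measures share the numerator P - Q and differ only in how ties enter the denominator.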
Ordinal Measures of Association
Measures of Association
Interval Data
Pearson's correlation coefficient and a lot of other measures can be used.
Kappa
For measuring agreement. The 2 variables must have the same range of values.
Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement)
Example:
Rater 1 rated 44.4% of the customers as loyal.
Rater 2 rated 40.3% of the customers as loyal.
                    Rater 1
Rater 2             Loyal    Moderately loyal    Brand switcher    Total
Loyal               17       4                   8                 29    40.3%
Moderately loyal    5        12                  0                 17    23.6%
Brand switcher      10       3                   13                26    36.1%
Total               32       19                  21                72
                    44.4%    26.4%               29.2%
Kappa
If the ratings are independent, 17.9% (44.4% X 40.3%) of the customers would be rated as loyal by both; 6.2% would be rated as moderately loyal by both, and 10.5% would be rated as brand switcher by both.
Therefore, (17.9 + 6.2 +10.5)%=34.6% would be classified the same merely by chance.
Now, observed percentage of customers classified the same = 42/72 = 58.3%
And the largest possible non-chance agreement = 1- 34.6%
Then Kappa = (0.583 - 0.346)/(1 - 0.346) = 0.362.
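The kappa arithmetic can be reproduced from the rating table. This is a sketch: it takes the one cell the transcript leaves blank (moderately loyal by brand switcher) to be 0, which is what the row and column totals imply, and the variable names are illustrative.

```r
# Rater 2 (rows) by Rater 1 (columns).
levels <- c("Loyal", "Moderately loyal", "Brand switcher")
ratings <- matrix(c(17,  4,  8,
                     5, 12,  0,   # blank cell assumed to be 0
                    10,  3, 13), nrow = 3, byrow = TRUE,
                  dimnames = list(rater2 = levels, rater1 = levels))
n <- sum(ratings)

# Observed agreement: share of customers on the diagonal.
observed <- sum(diag(ratings)) / n

# Chance agreement: product of the two raters' marginal shares, summed.
expected <- sum(rowSums(ratings) * colSums(ratings)) / n^2

cohen_kappa <- (observed - expected) / (1 - expected)

print(round(c(observed = observed, expected = expected,
              kappa = cohen_kappa), 3))
```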
Kappa
Kappa is always less than or equal to 1.
Kappa = 1: perfect agreement; values less than 1 imply less than perfect agreement.
Kappa = 0: agreement is at chance level.
Kappa < 0: agreement is worse than chance; kappa cannot fall below -1.
Poor agreement = less than 0.20
Fair agreement = 0.20 to 0.40
Moderate agreement = 0.40 to 0.60
Good agreement = 0.60 to 0.80
Very good agreement = 0.80 to 1.00
Measures of Association
Entering data: weight cases
Entering data: weight cases
SPSS Crosstabs From the pull down menu Analyze, select Descriptive
Statistics and choose Crosstabs to open up the following dialogue box.
Select Gender from the variable list and move it to the Row variable box.
For the column variable, select Shopping Duty.
SPSS Crosstabs
Click Cells in the previous dialog box to bring up the following.
Select Expected under Counts to compute the expected frequency of each cell.
Select Standardized under Residuals.
SPSS Crosstabs
Click Continue to go back to the previous dialog box, then click Statistics to display the following dialog box.
Select Chi-square, Contingency Coefficient, and Phi and Cramer's V.
Click Continue and OK to get the output.
SPSS Crosstabs
Test of Independence
Ho: Gender and shopping duty are independent (or in other words, there is no gender difference in shopping duty).
Ha: Gender and shopping duty are dependent (or in other words, there is gender difference in shopping duty).
Since the Sig. (or p-value) associated with the Likelihood Ratio is 0.490 > 0.05, we would not reject Ho and conclude that there is no difference in shopping duty between male and female respondents.
Recode To combine categories, choose Transform
from the main menu, and then select Recode and Into Different Variables.
In the dialog box that pops up, select Duty and put it into the Numeric Variable -> Output box.
Recode Make up a name for the Output Variable.
Let’s call the new variable rec_duty which stands for recoded duty. Type recoded duty in the textbox under Label.
Recode
Click the Old and New Values button to go into another dialog box.
Recode
Click Range under Old Values and enter the numbers 1 and 2, which represent the duty = yes group and the duty = shared responsibility group. Then enter the number 1 into the New Value box.
Recode
Now enter 3 into Old Value textbox and 2 into the New Value text box as shown below
Recode
Specify the value labels of rec_duty as shown below:
Recode
Fisher’s Exact Test When drinking tea, a woman claimed to be able to
distinguish whether milk or tea was added to the cup first.
To test this claim, she was given eight cups of tea. In four of the cups, tea was added first, and in four of the cups, milk was added first.
The order in which the cups were presented to her was randomized.
She was told that there were four cups of each type, so that she should make four predictions of each order.
Ho: The order in which milk or tea is poured into a cup and the taster’s guess of the order are independent.
Ha: The taster can correctly guess the order in which milk or tea is poured into a cup.
Fisher’s Exact Test
Guess * Actual Crosstabulation
                                      Actual
Guess                                 Milk First    Tea First    Total
Milk First    Count                   3             1            4
              Expected Count          2.0           2.0          4.0
Tea First     Count                   1             3            4
              Expected Count          2.0           2.0          4.0
Total         Count                   4             4            8
              Expected Count          4.0           4.0          8.0

Chi-Square Tests
                                Value       df    Asymp. Sig. (2-sided)    Exact Sig. (2-sided)    Exact Sig. (1-sided)
Pearson Chi-Square              2.000(b)    1     .157
Continuity Correction(a)        .500        1     .480
Likelihood Ratio                2.093       1     .148
Fisher's Exact Test                                                        .486                    .243
Linear-by-Linear Association    1.750       1     .186
N of Valid Cases                8
a. Computed only for a 2x2 table
b. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.00.
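The exact p-values in the SPSS output can be reproduced with base R's fisher.test. A minimal sketch, with illustrative object names:

```r
# Taster's guesses (rows) against the actual pouring order (columns).
order_labels <- c("Milk first", "Tea first")
tea <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(guess = order_labels, actual = order_labels))

# One-sided alternative: the taster guesses better than chance,
# i.e. the odds ratio is greater than 1.
one_sided <- fisher.test(tea, alternative = "greater")$p.value
two_sided <- fisher.test(tea)$p.value

print(round(c(one_sided = one_sided, two_sided = two_sided), 3))
```

With only eight cups, three correct guesses out of four for each order are not enough to reject independence at the 0.05 level.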
http://www.swogstat.org/stat/public/fisher.htm
Y
X
The output consists of three p-values:
Left: Use this when the alternative to independence is that there is negative association between the variables. That is, the observations tend to lie in lower left and upper right.
Right: Use this when the alternative to independence is that there is positive association between the variables. That is, the observations tend to lie in upper left and lower right.
2-Tail: Use this when there is no prior alternative.
TABLE = [ 3 , 7 , 5 , 10 ]
Left : p-value = 0.6069
Right : p-value = 0.72639
2-Tail : p-value = 1
         yes    no    total
yes      3      7     10
no       5      10    15
total    8      17    25
Fisher’s Exact Test
Fisher's exact test returns exact one-tailed and two-tailed p-values for a given frequency table.
The probability of observing a given set of frequencies A, B, C, and D in a 2 x 2 contingency table, given fixed row and column marginal totals and sample size N, is:
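The formula is missing from the transcript. The standard hypergeometric expression, written with the cell labels A, B, C, D and the sample size N used on these slides, is:

```latex
p = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{N!\,A!\,B!\,C!\,D!}
```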
Fisher’s Exact Test
Fisher's exact test computes the probability, given the observed marginal frequencies, of obtaining exactly the frequencies observed and any configuration more extreme.
Example:
2 (A) 3 (B) 5 (A+B)
6 (C) 4 (D) 10 (C+D)
8 (A+C) 7 (B+D) 15 (N)
Fisher’s Exact Test
All configurations with the same marginal frequencies include:
Fisher’s Exact Test
Thus, the one-tailed probability for this table would be: .326 + .093 + .007 = .426
...whereas the two-tailed probability would be: .326 + .093 + .007 + .163 + .019 = .608
The probability for the fourth configuration is not included because it is less extreme (more probable) than the observed frequency configuration.
Since the two-tailed p = 0.608 > 0.05, the null hypothesis is not rejected.
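The enumeration of configurations can be sketched with the hypergeometric density dhyper from base R. The parameterization below (top-left cell A as the count drawn) is one convenient choice, and the variable names are illustrative.

```r
# Margins of the example table: row totals 5 and 10, first column total 8.
row1 <- 5; row2 <- 10; col1 <- 8
a_observed <- 2

# With fixed margins, A can range over 0..5; dhyper gives P(A = a).
a_values <- 0:min(row1, col1)
prob <- dhyper(a_values, m = row1, n = row2, k = col1)
names(prob) <- a_values
print(round(prob, 3))

p_observed <- prob[as.character(a_observed)]

# One-tailed: the observed table plus those more extreme in the same direction.
one_tailed <- sum(prob[a_values <= a_observed])

# Two-tailed: every configuration no more probable than the observed one
# (small tolerance guards against floating-point ties).
two_tailed <- sum(prob[prob <= p_observed * (1 + 1e-7)])

print(round(c(one_tailed = one_tailed, two_tailed = two_tailed), 3))
```

The configuration with A = 3 has probability of about .392, which exceeds the observed .326, so it is the one left out of the two-tailed sum.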
Standardized Residuals
Standardized Residual = $\dfrac{\text{Count} - \text{Expected Count}}{\sqrt{\text{Expected Count}}}$
Symmetric Measures
Magnitude of Effect
For phi, Contingency Coefficient, Cramer’s V:
trivial effect if |value| < 0.1
small effect if 0.1 < |value| < 0.3
medium effect if 0.3 < |value| < 0.5
large effect if |value| > 0.5
One Sample Chi-Square Test Example
One-Sample Chi-Square Example
Null Ho: O = E
Statistical test One-sample chi-square
Significance level .05
Calculated value 9.89
Critical test value 7.815
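The critical value in this summary can be looked up with qchisq. This is a sketch under one assumption: the slide does not state the degrees of freedom, and df = 3 (four categories) is inferred here because it is the value that yields 7.815 at the .05 level.

```r
alpha <- 0.05
df <- 3                    # assumed: not stated on the slide
statistic <- 9.89          # calculated value from the slide

critical <- qchisq(1 - alpha, df)                       # upper-tail critical value
p_value  <- pchisq(statistic, df, lower.tail = FALSE)   # upper-tail p-value
reject   <- statistic > critical

print(round(critical, 3))
print(p_value)
print(reject)
```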
References:
Reference book Chapter 16.
END