cross-tabulation. 金枝玉孽無間道神鵰俠侶白娘子傳奇 objectives to study the use of...

Cross-tabulation

金枝玉孽

無間道

神鵰俠侶

白娘子傳奇

Objectives

To study the use of Crosstab for data analysis.

To study certain measures of association.

Content

Introduction, Crosstab, Measures of Association

Introduction

X                 Y                 Method
Nominal           Nominal           Chi-squared test in Crosstab
Nominal           Ordinal or above  ANOVA
Ordinal or above  Nominal           Discriminant Analysis
Ordinal or above  Ordinal or above  Regression Analysis

Introduction

A cross-tabulation involves the simultaneous counting of the number of observations that fall into each combination of categories of two or more variables.

Age group  Highly loyal  Moderately loyal  Brand switchers  Total
<30                  30                42               18     90
30-40                14                20               31     65
>40                  34                25               16     75
Total                78                87               65    230

Introduction

Example: Testing the effectiveness of a coupon in increasing consumer awareness of Brand A:

              Before coupon              After coupon
              Aware  Not aware  Total    Aware  Not aware  Total
Test area       250        350    600      330        170    500
Control area    160        240    400      160        220    380
Total           410        590   1000      490        390    880

Introduction

In percentage of column totals:

              Before coupon              After coupon
              Aware  Not aware  Total    Aware  Not aware  Total
Test area        61         59     60       67         44     57
Control area     39         41     40       33         56     43
Total           100        100    100      100        100    100

Introduction

In percentage of row totals:

              Before coupon              After coupon
              Aware  Not aware  Total    Aware  Not aware  Total
Test area        42         58    100       66         34    100
Control area     40         60    100       42         58    100
Total            41         59    100       56         44    100

Simpson’s Paradox

Lurking variables can change or reverse a relation between two categorical variables!

          1997                 2007
Age       Fraction  Income     Fraction  Income
<=45      0.5       $60,000    0.7       $70,000
>45       0.5       $120,000   0.3       $130,000
Mean                $90,000              $88,000
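The two yearly means follow directly from the age fractions and incomes in the table. As a worked check, writing $\bar{x}$ for the overall mean income (a symbol introduced here for illustration):

$$\bar{x}_{1997} = 0.5(60{,}000) + 0.5(120{,}000) = 90{,}000, \qquad \bar{x}_{2007} = 0.7(70{,}000) + 0.3(130{,}000) = 88{,}000$$

Income rose by $10,000 within each age group, yet the overall mean fell because the share of the lower-income group grew from 0.5 to 0.7.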

Simpson’s Paradox

          Buyer  Nonbuyer  Total  Buyer %
Male        700       300   1000      70%
Female      400       600   1000      40%
Total      1100       900   2000      55%

Simpson’s Paradox

Luxury    Buyer  Nonbuyer  Total  Buyer %
Male         40       160    200      20%
Female      200       600    800      25%
Total       240       760   1000      24%

Plain     Buyer  Nonbuyer  Total  Buyer %
Male        660       140    800    82.5%
Female      200         0    200     100%
Total       860       140   1000      86%
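The reversal can be checked numerically. The following is a minimal R sketch using the luxury and plain counts from the tables; the object and function names (`luxury`, `plain`, `buyer_rate`) are chosen here for illustration and are not part of the original slides.

```r
# Buyer / non-buyer counts by gender within each product type (from the slides).
luxury <- matrix(c(40, 160,
                   200, 600), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Male", "Female"), c("Buyer", "Nonbuyer")))
plain <- matrix(c(660, 140,
                  200, 0), nrow = 2, byrow = TRUE,
                dimnames = list(c("Male", "Female"), c("Buyer", "Nonbuyer")))

# Proportion of buyers in each row of a 2 x 2 table.
buyer_rate <- function(tab) tab[, "Buyer"] / rowSums(tab)

rate_luxury   <- buyer_rate(luxury)
rate_plain    <- buyer_rate(plain)
rate_combined <- buyer_rate(luxury + plain)  # collapsing over product type

print(round(rbind(luxury = rate_luxury,
                  plain = rate_plain,
                  combined = rate_combined), 3))
```

Within each product type the female buyer rate is higher, while in the combined table the male rate is higher; product type acts as the lurking variable.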

Chi-squared Test

The chi-square test assumes a multinomial experiment. The multinomial experiment is analogous to tossing n balls at k boxes, where:

1. Each ball must fall in one of the boxes.

2. The probability, $p_i$, that a ball will fall in box i remains the same in repeated tosses.

3. The trials are independent.

4. At the conclusion of the experiment, we have $n_1$ balls in box one, $n_2$ balls in box two, and so on.

Multinomial Distribution

n independent trials permit k mutually exclusive outcomes whose respective probabilities are $P_1, P_2, \ldots, P_k$, with $\sum_{i=1}^{k} P_i = 1$.

Each $P_i$ remains constant throughout the n trials.

We are interested in the probability of getting $X_1$ outcomes of the 1st kind, $X_2$ outcomes of the 2nd kind, ..., $X_k$ outcomes of the kth kind, with $\sum_{i=1}^{k} X_i = n$:

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\,P_1^{x_1} P_2^{x_2} \cdots P_k^{x_k}$$

Chi-squared Test

The chi-square test can be used to test whether the observed association between the variables in the cross-tabulation is statistically significant. This is called the test of independence.

Example (expected counts in parentheses):

Age group  Highly loyal  Moderately loyal  Brand switchers  Total
<30           30 (30.5)         42 (34.1)        18 (25.4)     90
30-40         14 (22.1)         20 (24.5)        31 (18.4)     65
>40           34 (25.4)         25 (28.4)        16 (21.2)     75
Total                78                87               65    230

Chi-squared Test

Chi-squared statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, with degrees of freedom $df = (r - 1)(c - 1)$, where O is the observed count and E is the expected count of each cell in the age-by-loyalty table (E is the value in parentheses).

$H_0$: Age and Loyalty are independent.

$$\chi^{2*} = \frac{(30 - 30.5)^2}{30.5} + \frac{(42 - 34.1)^2}{34.1} + \cdots + \frac{(16 - 21.2)^2}{21.2} = 21.01$$

$$df = (3 - 1)(3 - 1) = 4$$

$$P(\chi^2_4 \ge 21.01) = 0.0003 < 0.05, \text{ so reject } H_0.$$
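The test of independence for the age-by-loyalty table can be reproduced in base R. This is a sketch rather than the slides' own code; the object names are illustrative. Because it uses unrounded expected counts, the statistic comes out near 21.08 rather than the 21.01 on the slide, which appears to have been computed from expected counts rounded to one decimal.

```r
# Observed counts: age group (rows) by loyalty segment (columns).
observed <- matrix(c(30, 42, 18,
                     14, 20, 31,
                     34, 25, 16),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(age = c("<30", "30-40", ">40"),
                                   loyalty = c("Highly loyal", "Moderately loyal",
                                               "Brand switchers")))

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Cross-check against the built-in test.
builtin <- chisq.test(observed)

print(round(expected, 1))
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```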

Chi-Square Distribution Table

CrossTable

• The CrossTable() function in the gmodels package produces cross-tabulations and test results.

• install.packages("gmodels")  # if necessary
• library(gmodels)
• sales <- read.table('sales.csv', header = TRUE, sep = ',')
• with(sales, CrossTable(gender, purchase, expected = TRUE, prop.chisq = FALSE, chisq = TRUE, fisher = TRUE, sresid = TRUE, format = "SPSS"))

CrossTable

Total Observations in Table:  582

             | purchase
      gender |       no  |      yes  | Row Total |
-------------|-----------|-----------|-----------|
      female |      102  |      271  |      373  |
             |  125.615  |  247.385  |           |
             |  27.346%  |  72.654%  |  64.089%  |
             |  52.041%  |  70.207%  |           |
             |  17.526%  |  46.564%  |           |
             |   -2.107  |    1.501  |           |
-------------|-----------|-----------|-----------|
        male |       94  |      115  |      209  |
             |   70.385  |  138.615  |           |
             |  44.976%  |  55.024%  |  35.911%  |
             |  47.959%  |  29.793%  |           |
             |  16.151%  |  19.759%  |           |
             |    2.815  |   -2.006  |           |
-------------|-----------|-----------|-----------|
Column Total |      196  |      386  |      582  |
             |  33.677%  |  66.323%  |           |
-------------|-----------|-----------|-----------|

CrossTable

Statistics for All Table Factors

Pearson's Chi-squared test
-------------------------------------------------------
Chi^2 = 18.64021     d.f. = 1     p = 1.578558e-05

Pearson's Chi-squared test with Yates' continuity correction
-------------------------------------------------------
Chi^2 = 17.85923     d.f. = 1     p = 2.378625e-05

$$\text{Standardized Residual} = \frac{\text{Count} - \text{Expected Count}}{\sqrt{\text{Expected Count}}}$$

CrossTable

Fisher's Exact Test for Count Data
-----------------------------------------------------------
Sample estimate odds ratio: 0.4611142

Alternative hypothesis: true odds ratio is not equal to 1
p = 2.408778e-05
95% confidence interval: 0.3179706 0.6676994

Alternative hypothesis: true odds ratio is less than 1
p = 1.352295e-05
95% confidence interval: 0 0.6308124

Alternative hypothesis: true odds ratio is greater than 1
p = 0.999994
95% confidence interval: 0.3367543 Inf

Minimum expected frequency: 70.38488

Yates’s continuity correction
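For a 2 x 2 table, Yates's correction is usually written as $\chi^2 = \sum (|O - E| - 0.5)^2 / E$; that formula is the standard definition and is assumed here, since the slide itself shows only the title. The base-R sketch below (object names are illustrative) applies `chisq.test()` to the gender-by-purchase counts with and without the correction.

```r
# Gender-by-purchase counts from the CrossTable output.
gender_purchase <- matrix(c(102, 271,
                             94, 115),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(gender = c("female", "male"),
                                          purchase = c("no", "yes")))

# correct = FALSE gives the plain Pearson statistic;
# correct = TRUE applies the continuity correction (2 x 2 tables only).
plain_test <- chisq.test(gender_purchase, correct = FALSE)
yates_test <- chisq.test(gender_purchase, correct = TRUE)

print(c(pearson = unname(plain_test$statistic),
        yates   = unname(yates_test$statistic)))
```

The corrected statistic is always the smaller of the two, so the correction makes the test slightly more conservative.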

Chi-squared Test

Someone gives you a 52-card deck. You draw a card, record its suit, replace the card, shuffle the deck, and repeat that process 200 times, obtaining the following table:

Diamonds  Clubs  Hearts  Spades
      46     54      49      51

Does the distribution of suits appear to be standard?
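One way to answer this is a chi-squared goodness-of-fit test against equal suit probabilities of 1/4, which is what a standard deck implies. A minimal base-R sketch, with illustrative object names:

```r
# Observed suit counts from 200 draws with replacement.
suit_counts <- c(Diamonds = 46, Clubs = 54, Hearts = 49, Spades = 51)

# Goodness-of-fit test against a standard deck (each suit has probability 1/4).
gof <- chisq.test(suit_counts, p = rep(1 / 4, 4))
print(gof)
```

Each expected count is 50, so the statistic is (16 + 16 + 1 + 1)/50 = 0.68 on 3 degrees of freedom, far too small to suggest a non-standard deck.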

Chi-squared Test

To use the test, each expected cell count should be at least one, and not more than 20% of the expected frequencies (E) should be less than 5.

Pooling of categories is a method to solve the problem of having more than 20% of the cells with expected frequency < 5.

Measures of Association

• The vcd package is needed to get measures of association.
• install.packages("vcd")  # if needed
• library(vcd)
• tab <- xtabs(~gender + purchase, data = sales)  # produce a table between gender and purchase
• summary(assocstats(tab))

Call: xtabs(formula = ~gender + purchase)
Number of cases in table: 582
Number of factors: 2
Test for independence of all factors:
        Chisq = 18.64, df = 1, p-value = 1.579e-05

                    X^2 df   P(> X^2)
Likelihood Ratio 18.368  1 1.8212e-05
Pearson          18.640  1 1.5786e-05

Phi-Coefficient   : 0.179
Contingency Coeff.: 0.176
Cramer's V        : 0.179

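Magnitude of Effect

For phi, Contingency Coefficient, Cramer’s V:

small ≈ 0.1
moderate ≈ 0.3
large ≈ 0.5

$$\text{Phi} = \sqrt{\frac{\chi^2}{n}}, \qquad \text{Cramer's } V = \sqrt{\frac{\chi^2}{n(k - 1)}}, \qquad \text{Contingency Coefficient} = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$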
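As a check, the three coefficients can be computed by hand from the Pearson statistic reported for the gender-by-purchase table (Chi^2 = 18.64021, n = 582). This is a small sketch with illustrative variable names; k is the smaller of the number of rows and columns, here 2.

```r
chi_sq <- 18.64021  # Pearson chi-squared from the CrossTable output
n      <- 582       # total observations
k      <- 2         # smaller of the number of rows and columns

phi              <- sqrt(chi_sq / n)
contingency_coef <- sqrt(chi_sq / (chi_sq + n))
cramers_v        <- sqrt(chi_sq / (n * (k - 1)))

print(round(c(phi = phi,
              contingency_coef = contingency_coef,
              cramers_v = cramers_v), 3))
```

The results match the assocstats output (0.179, 0.176, 0.179); all three fall between the "small" and "moderate" benchmarks.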

The likelihood ratio
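The likelihood-ratio statistic is commonly defined as $G^2 = 2\sum O \ln(O/E)$. That definition is assumed here, since the slide shows only the title. A worked check with the observed and expected counts from the gender-by-purchase CrossTable output:

$$G^2 = 2\left[102\ln\frac{102}{125.615} + 271\ln\frac{271}{247.385} + 94\ln\frac{94}{70.385} + 115\ln\frac{115}{138.615}\right] \approx 18.37$$

This agrees with the Likelihood Ratio value of 18.368 in the assocstats output.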


Introduction

There is a plethora of indexes for measuring the strength of association. Factors affecting the value of a particular measure:

Skewed marginal distributions - only a few indices are impervious to marginal distributions.

Nonsquare tables - some measures cannot attain their maximum if the table is nonsquare.

Nominal Measures

These measures only provide some indication of the strength of the association, but they cannot show direction and nature of the relationship.

Odds Ratio

Consider the following table:

              Loyal  Disloyal
White collar     30        42
Blue collar      14        20

Look at the odds of being white-collar. Among the Loyal, these odds are 30 to 14, or about 2:1. Among the Disloyal, these odds are 42 to 20, also about 2:1.

The ratio of these two odds is called the odds ratio:

$$\text{odds ratio} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$
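As a worked check for the white-collar/blue-collar table, with $\widehat{OR}$ denoting the sample odds ratio (notation introduced here):

$$\text{odds}_{\text{Loyal}} = \frac{30}{14} \approx 2.14, \qquad \text{odds}_{\text{Disloyal}} = \frac{42}{20} = 2.10, \qquad \widehat{OR} = \frac{30 \times 20}{42 \times 14} = \frac{600}{588} \approx 1.02$$

An odds ratio this close to 1 indicates almost no association between occupation and loyalty in this table.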

Odds Ratio with a Stratifying Variable

The data come from a cohort study or case-control study that is stratified; for example, the data may be separated (stratified) by the sex or, as here, the age group of the people studied. Consider the following tables:

High age   High alcohol consumption  Low alcohol consumption
Case                             25                       21
Control                          29                      128

Assuming a constant odds ratio across age strata, test whether the odds ratio is 1 and report a p-value.

Odds Ratio with a Stratifying Variable

> mymatrix1 <- matrix(c(8,5,52,164),nrow=2,byrow=TRUE)
> colnames(mymatrix1) <- c("High","Low")
> rownames(mymatrix1) <- c("Case","Control")
> print(mymatrix1)
        High Low
Case       8   5
Control   52 164
> mymatrix2 <- matrix(c(25,21,29,128),nrow=2,byrow=TRUE)
> colnames(mymatrix2) <- c("High","Low")
> rownames(mymatrix2) <- c("Case","Control")
> print(mymatrix2)
        High Low
Case      25  21
Control   29 128

Odds Ratio with a Stratifying Variable

The Mantel-Haenszel odds ratio estimates the odds ratio for the association between case/control status and alcohol consumption, controlling for the possible confounding effects of the stratifying variable (age here).

> install.packages("lawstat")
> library("lawstat")
> myarray <- array(c(mymatrix1,mymatrix2),dim=c(2,2,2))
> cmh.test(myarray)

        Cochran-Mantel-Haenszel Chi-square Test

data:  myarray
CMH statistic = 32.181, df = 1.000, p-value = 0.000, MH Estimate = 5.197, Pooled Odd Ratio = 4.575, Odd Ratio of level 1 = 5.046, Odd Ratio of level 2 = 5.255
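The same quantities can be obtained without the lawstat package. The sketch below swaps in base R's `mantelhaen.test()` and a small helper, `odds_ratio()`, which is defined here for illustration; the stratum labels "1" and "2" are placeholders.

```r
# The two 2 x 2 strata (rows: Case/Control, columns: High/Low alcohol).
stratum1 <- matrix(c(8, 5, 52, 164), nrow = 2, byrow = TRUE)
stratum2 <- matrix(c(25, 21, 29, 128), nrow = 2, byrow = TRUE)

strata <- array(c(stratum1, stratum2), dim = c(2, 2, 2),
                dimnames = list(status  = c("Case", "Control"),
                                alcohol = c("High", "Low"),
                                stratum = c("1", "2")))

# Cross-product (odds) ratio of a single 2 x 2 table.
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

stratum_or <- apply(strata, 3, odds_ratio)      # odds ratio within each stratum
pooled_or  <- odds_ratio(stratum1 + stratum2)   # ignores the stratifying variable

# Mantel-Haenszel test and common odds ratio estimate (no continuity correction).
mh <- mantelhaen.test(strata, correct = FALSE)

print(round(c(stratum_or, pooled = pooled_or,
              mh_estimate = unname(mh$estimate)), 3))
```

The pooled odds ratio is lower than both stratum-specific values, which is why an estimate that adjusts for the stratifying variable is preferred over simply collapsing the tables.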


The log odds ratio is insensitive to the marginal distributions. The following tables have the same log odds ratio:

  75   15        750   15
  10  100        100  100
  85  115        850  115

The odds ratio is also invariant under interchange of rows and columns; hence, the odds ratio is a symmetric measure.
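A quick numerical check of both claims, as a base-R sketch with illustrative names: the two tables share the same odds ratio even though the first column of the second table is ten times larger, and transposing a table leaves the odds ratio unchanged.

```r
# Cross-product (odds) ratio of a 2 x 2 table.
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

table_a <- matrix(c(75, 15,
                    10, 100), nrow = 2, byrow = TRUE)
table_b <- matrix(c(750, 15,
                    100, 100), nrow = 2, byrow = TRUE)

or_a          <- odds_ratio(table_a)
or_b          <- odds_ratio(table_b)
or_transposed <- odds_ratio(t(table_a))  # rows and columns interchanged

print(c(or_a = or_a, or_b = or_b, or_transposed = or_transposed,
        log_or = log(or_a)))
```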


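Yule's Y (or the Coefficient of Colligation)

$$\hat{Y} = \frac{\sqrt{\hat{\theta}} - 1}{\sqrt{\hat{\theta}} + 1}, \qquad \hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}} \text{ (the sample odds ratio)}$$

$$\text{variance} = \frac{1}{16}\left(\frac{1}{n_{11}} + \frac{1}{n_{21}} + \frac{1}{n_{12}} + \frac{1}{n_{22}}\right)\left(1 - \hat{Y}^2\right)^2$$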

Pearson's Product Moment Correlation

The following formula is used:

$$r = \frac{n_{11}\,n_{22} - n_{12}\,n_{21}}{\sqrt{n_{1\cdot}\,n_{2\cdot}\,n_{\cdot 1}\,n_{\cdot 2}}}$$

r lies between -1.0 and 1.0 and equals 0 if the variables are independent.


Cramer introduced the following variant (it can achieve the maximum value of 1):

$$V = \sqrt{\frac{\chi^2}{n(k - 1)}}$$

k is the smaller of the number of rows and columns. If one of the table's dimensions is 2, V and phi are the same.

Another version is Tschuprow's T, for a table with I rows and J columns:

$$T = \sqrt{\frac{\chi^2}{n\sqrt{(I - 1)(J - 1)}}}$$

T varies between 0 and 1 and can attain its maximum of 1 only when I = J.


Proportional Reduction in Error

Refer to the following table:

             Like the product  Dislike the product  Total
High income                26                    8     34
Low income                 17                   33     50
Total                      43                   41     84


Suppose that we want to predict what category a randomly selected person would fall in when asked the question "Do you like the product?". If we had no knowledge of the row variable, the best guess is always "like the product", and we shall be wrong in 41 of the 84 cases. If we know that the subject comes from the high income group, we shall predict "like the product" and be wrong in only 8 cases. If we know that he/she is in the low income group, we shall predict "dislike the product" and be wrong in 17 cases. Hence, we have reduced our number of prediction errors to 8 + 17 = 25 cases. The proportional reduction in error is:

lambdaC/R = (41 - 25)/41 = 0.39


That is, 39% of the errors in predicting the column variable are eliminated by knowing the row variable.

Lambda is asymmetric and varies between 0, indicating no ability at all to eliminate errors in the column variable on the basis of the row variable, and 1, indicating an ability to eliminate all errors in the column variable predictions, given knowledge of the row variable. In general, the asymmetric lambda for the column variable is:

$$\lambda_{C/R} = \frac{\sum f_{KR}^{*} - F_{C}^{*}}{n - F_{C}^{*}}$$

$f_{KR}^{*}$ is the maximum frequency found within each subclass of the row variable.

$F_{C}^{*}$ is the maximum frequency among the marginal totals of the column variable.

For the rows, with $f_{LC}^{*}$ the maximum frequency within each subclass of the column variable and $F_{R}^{*}$ the maximum frequency among the marginal totals of the row variable:

$$\lambda_{R/C} = \frac{\sum f_{LC}^{*} - F_{R}^{*}}{n - F_{R}^{*}}$$

A symmetric measure:

$$\lambda = \frac{\sum f_{KR}^{*} - F_{C}^{*} + \sum f_{LC}^{*} - F_{R}^{*}}{2n - F_{C}^{*} - F_{R}^{*}}$$
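The three lambda measures can be computed for the income-by-opinion table as follows. This is a base-R sketch; the variable names are chosen here for illustration and the column total for "Like" is taken as 26 + 17 = 43.

```r
# Income (rows) by opinion of the product (columns).
tab <- matrix(c(26, 8,
                17, 33), nrow = 2, byrow = TRUE,
              dimnames = list(income  = c("High", "Low"),
                              opinion = c("Like", "Dislike")))
n <- sum(tab)

sum_row_max   <- sum(apply(tab, 1, max))  # largest cell in each row, summed
sum_col_max   <- sum(apply(tab, 2, max))  # largest cell in each column, summed
max_col_total <- max(colSums(tab))        # largest column marginal total
max_row_total <- max(rowSums(tab))        # largest row marginal total

# Predicting the column variable from the row variable, and vice versa.
lambda_col_given_row <- (sum_row_max - max_col_total) / (n - max_col_total)
lambda_row_given_col <- (sum_col_max - max_row_total) / (n - max_row_total)

lambda_symmetric <- (sum_row_max - max_col_total + sum_col_max - max_row_total) /
  (2 * n - max_col_total - max_row_total)

print(round(c(col_given_row = lambda_col_given_row,
              row_given_col = lambda_row_given_col,
              symmetric = lambda_symmetric), 3))
```

The first value reproduces the 0.39 worked out in the text; the other two show that the measure depends on which variable is being predicted.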

Ordinal Measures of Association

Kendall’s tau, or the Kendall rank correlation coefficient.

A pair of cases is concordant (P), if the values of both variables for one case are higher (or both are lower) than the corresponding values for the other case. The pair is discordant (Q) if the value of one variable for a case is larger than the corresponding value for the other case, and the direction is reversed for the second variable. When the 2 cases have identical values on one or on both variables, they are tied.

Ordinal Measures of Association

If the preponderance of pairs is concordant, the association is said to be positive, otherwise it is negative. If concordant and discordant pairs are equally likely, no association is said to exist. The following are some measures:

Kendall's tau-a = (P - Q) / (total number of pairs).

tau-b = (P - Q)/sqrt[(P + Q + Tx)(P + Q + Ty)], where Tx = pairs tied on X only and Ty = pairs tied on Y only.

tau-c = 2m(P - Q)/[n^2(m - 1)], where m is the smaller of the number of rows and columns.

Goodman and Kruskal’s gamma = (P - Q)/(P + Q); gamma is expected to be 0 when the variables are independent.

Somers’ d = (P - Q)/(P + Q + Ty)
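Counting concordant and discordant pairs directly from a table can be sketched as follows. The helper `count_pairs()` is written here for illustration, and the example reuses the age-by-loyalty counts, treating both variables as ordered in the row and column order shown (an assumption made only for this example).

```r
# Count concordant (P) and discordant (Q) pairs in a two-way table whose
# rows and columns are both in increasing order.
count_pairs <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  P <- 0; Q <- 0
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      # Cells below and to the right are concordant with cell (i, j).
      if (i < nr && j < nc) P <- P + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
      # Cells below and to the left are discordant with cell (i, j).
      if (i < nr && j > 1)  Q <- Q + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
    }
  }
  c(P = P, Q = Q)
}

loyalty <- matrix(c(30, 42, 18,
                    14, 20, 31,
                    34, 25, 16), nrow = 3, byrow = TRUE)

pairs <- count_pairs(loyalty)
n     <- sum(loyalty)

gamma_hat <- unname((pairs["P"] - pairs["Q"]) / (pairs["P"] + pairs["Q"]))
tau_a     <- unname((pairs["P"] - pairs["Q"]) / (n * (n - 1) / 2))

print(pairs)
print(c(gamma = gamma_hat, tau_a = tau_a))
```

Here discordant pairs slightly outnumber concordant ones, so both measures are small and negative.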

Ordinal Measures of Association


Interval Data

Pearson's correlation coefficient and many other measures can be used.

Kappa

For measuring agreement. The 2 variables must have the same range of values.

Kappa = (Agreement - Expected Agreement) / (1 – Expected Agreement)

Example (rows: Rater 2; columns: Rater 1):

                  Loyal  Moderately loyal  Brand switcher  Total
Loyal                17                 4               8     29  40.3%
Moderately loyal      5                12               0     17  23.6%
Brand switcher       10                 3              13     26  36.1%
Total                32                19              21     72
                  44.4%             26.4%           29.2%

Rater 1 rated 44.4% of the customers as loyal. Rater 2 rated 40.3% of the customers as loyal.

Kappa

If the ratings are independent, 17.9% (44.4% X 40.3%) of the customers would be rated as loyal by both; 6.2% would be rated as moderately loyal by both, and 10.5% would be rated as brand switcher by both.

Therefore, (17.9 + 6.2 +10.5)%=34.6% would be classified the same merely by chance.

Now, observed percentage of customers classified the same = 42/72 = 58.3%

And the largest possible non-chance agreement = 1- 34.6%

Then Kappa = (0.583 - 0.346)/(1 - 0.346) = 0.362.
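The kappa calculation can be reproduced from the rating table. This is a base-R sketch with illustrative names; the cell for (Moderately loyal, Brand switcher) is taken to be 0, which is what the row and column totals imply.

```r
# Rows: Rater 2; columns: Rater 1.
# The 0 in row 2, column 3 is inferred from the marginal totals.
ratings <- matrix(c(17, 4, 8,
                     5, 12, 0,
                    10, 3, 13), nrow = 3, byrow = TRUE)
n <- sum(ratings)

# Observed agreement: proportion of customers on the diagonal.
observed_agreement <- sum(diag(ratings)) / n

# Agreement expected by chance: sum of products of matching marginal proportions.
expected_agreement <- sum(rowSums(ratings) * colSums(ratings)) / n^2

kappa_hat <- (observed_agreement - expected_agreement) / (1 - expected_agreement)

print(round(c(observed = observed_agreement,
              expected = expected_agreement,
              kappa = kappa_hat), 3))
```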

Kappa

Kappa is always less than or equal to 1.

Kappa = 1: perfect agreement; values less than 1 imply less than perfect agreement.

Kappa = 0: agreement is at chance level.

Kappa < 0: agreement is worse than chance. Kappa cannot fall below -1.

Poor agreement = less than 0.20
Fair agreement = 0.20 to 0.40
Moderate agreement = 0.40 to 0.60
Good agreement = 0.60 to 0.80
Very good agreement = 0.80 to 1.00

Entering data: weight cases

SPSS Crosstabs

From the pull-down menu Analyze, select Descriptive Statistics and choose Crosstabs to open the Crosstabs dialog box.

Select Gender from the variable list and move it to the Row variable box.

For the column variable, select Shopping Duty.

SPSS Crosstabs

Click Cells in the previous dialog box to bring up the Cell Display dialog box.

Select Expected under Counts to compute the expected frequency of each cell.

Select Standardized under Residuals.

Click Continue to go back to the previous dialog box.

SPSS Crosstabs

Then click Statistics to display the Statistics dialog box.

Select Chi-square, Contingency Coefficient, and Phi and Cramer's V.

Click Continue and OK to get the output.

SPSS Crosstabs

Test of Independence

Ho: Gender and shopping duty are independent (in other words, there is no gender difference in shopping duty).

Ha: Gender and shopping duty are dependent (in other words, there is a gender difference in shopping duty).

Since the Sig. (or p-value) associated with the Likelihood Ratio is 0.490 > 0.05, we do not reject Ho: there is insufficient evidence of a difference in shopping duty between male and female respondents.

Recode

To combine categories, choose Transform from the main menu, and then select Recode and Into Different Variables.

In the dialog box that pops up, select Duty and put it into the Numeric Variable -> Output box.

Make up a name for the Output Variable. Let’s call the new variable rec_duty, which stands for recoded duty. Type recoded duty in the textbox under Label.

Click the Old and New Values button to go into another dialog box.

Click Range under Old Values and enter the numbers 1 and 2, which represent the duty = yes group and the duty = shared responsibility group. Then enter the number 1 into the New Value box.

Now enter 3 into the Old Value textbox and 2 into the New Value textbox.

Finally, specify the value labels of rec_duty.

Fisher’s Exact Test

When drinking tea, a woman claimed to be able to distinguish whether milk or tea was added to the cup first.

To test this claim, she was given eight cups of tea. In four of the cups, tea was added first, and in four of the cups, milk was added first.

The order in which the cups were presented to her was randomized.

She was told that there were four cups of each type, so that she should make four predictions of each order.

Ho: The order in which milk or tea is poured into a cup and the taster’s guess of the order are independent.

Ha: The taster can correctly guess the order in which milk or tea is poured into a cup.

Fisher’s Exact Test

Guess * Actual Crosstabulation

                                       Actual
                                Milk First  Tea First  Total
Guess  Milk First  Count                 3          1      4
                   Expected Count      2.0        2.0    4.0
       Tea First   Count                 1          3      4
                   Expected Count      2.0        2.0    4.0
Total              Count                 4          4      8
                   Expected Count      4.0        4.0    8.0

Chi-Square Tests

                               Value   df  Asymp. Sig.  Exact Sig.  Exact Sig.
                                           (2-sided)    (2-sided)   (1-sided)
Pearson Chi-Square             2.000b   1  .157
Continuity Correction a         .500    1  .480
Likelihood Ratio               2.093    1  .148
Fisher's Exact Test                                     .486        .243
Linear-by-Linear Association   1.750    1  .186
N of Valid Cases                   8

a. Computed only for a 2x2 table
b. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.00.
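The exact significance values for the tea-tasting table can be reproduced with base R's `fisher.test()`. A minimal sketch, with illustrative object names:

```r
# Guess (rows) by actual order (columns) for the eight cups.
tea <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(guess  = c("Milk first", "Tea first"),
                              actual = c("Milk first", "Tea first")))

# One-sided: the taster does better than chance (positive association).
one_sided <- fisher.test(tea, alternative = "greater")$p.value
# Two-sided: any departure from independence.
two_sided <- fisher.test(tea)$p.value

print(round(c(one_sided = one_sided, two_sided = two_sided), 3))
```

With 70 equally likely ways to choose four cups out of eight, getting at least three right has probability (16 + 1)/70 = 0.243, matching the Exact Sig. (1-sided) column.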

70

http://www.swogstat.org/stat/public/fisher.htm

The output consists of three p-values:

Left: Use this when the alternative to independence is that there is negative association between the variables. That is, the observations tend to lie in lower left and upper right.

Right: Use this when the alternative to independence is that there is positive association between the variables. That is, the observations tend to lie in upper left and lower right.

2-Tail: Use this when there is no prior alternative.

Example:

        yes  no  total
yes       3   7     10
no        5  10     15
total     8  17     25

TABLE = [ 3 , 7 , 5 , 10 ]
Left : p-value = 0.6069
Right : p-value = 0.72639
2-Tail : p-value = 1

Fisher’s Exact Test

Fisher's exact test returns exact one-tailed and two-tailed p-values for a given frequency table.

The probability of observing a given set of frequencies A, B, C, and D in a 2 x 2 contingency table, given fixed row and column marginal totals and sample size N, is:

$$p = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{N!\,A!\,B!\,C!\,D!}$$

Fisher's exact test computes the probability, given the observed marginal frequencies, of obtaining exactly the frequencies observed and any configuration more extreme.

Example:

2 (A)     3 (B)     5 (A+B)
6 (C)     4 (D)     10 (C+D)
8 (A+C)   7 (B+D)   15 (N)

All configurations with the same marginal frequencies include:

A  B  C  D  Probability
0  5  8  2  .007
1  4  7  3  .093
2  3  6  4  .326 (observed)
3  2  5  5  .392
4  1  4  6  .163
5  0  3  7  .019

Thus, the one-tailed probability for this table would be: .326 + .093 + .007 = .426, whereas the two-tailed probability would be: .326 + .093 + .007 + .163 + .019 = .608. The probability for the fourth configuration is not included because it is less extreme (more probable) than the observed frequency configuration.

Since the two-tailed p = 0.608 > 0.05, the null hypothesis is not rejected.
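The configuration probabilities follow the hypergeometric distribution, so they can be checked with `dhyper()`. The sketch below is an illustration with names chosen here; it treats the first-row total (5) as the "successes", the second-row total (10) as the "failures", and the first-column total (8) as the number drawn.

```r
# Possible values of cell A given the margins (row totals 5 and 10, first column total 8).
a_values <- 0:5

# P(A = a) for each configuration with the same marginal totals.
config_prob <- dhyper(a_values, m = 5, n = 10, k = 8)

observed_prob <- config_prob[a_values == 2]

# One-tailed: the observed table and those more extreme in the same direction.
one_tailed <- sum(config_prob[a_values <= 2])
# Two-tailed: every configuration no more probable than the observed one.
two_tailed <- sum(config_prob[config_prob <= observed_prob * (1 + 1e-7)])

# Cross-check with the built-in test.
builtin <- fisher.test(matrix(c(2, 3, 6, 4), nrow = 2, byrow = TRUE))$p.value

print(round(config_prob, 3))
print(round(c(one_tailed = one_tailed, two_tailed = two_tailed, builtin = builtin), 3))
```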

Standardized Residuals

$$\text{Standardized Residual} = \frac{\text{Count} - \text{Expected Count}}{\sqrt{\text{Expected Count}}}$$

Symmetric Measures

Magnitude of Effect

For phi, Contingency Coefficient, Cramer’s V:

trivial if |value| < 0.1
small if 0.1 ≤ |value| < 0.3
medium if 0.3 ≤ |value| < 0.5
large if |value| ≥ 0.5

One-Sample Chi-Square Test Example

Null hypothesis       Ho: O = E
Statistical test      One-sample chi-square
Significance level    .05
Calculated value      9.89
Critical test value   7.815

Since the calculated value 9.89 exceeds the critical value 7.815, Ho is rejected at the .05 level.

References:

Reference book Chapter 16.

