
1. POISSON REGRESSION

We are counting occurrences of some fairly rare event.

Variable Y is called the dependent variable; variables X, Z, W are independent variables, also called regressors or predictors. We model the behavior of the mean value of Y:

μ = e^z,   z = C + b₁·X + b₂·Z + b₃·W.

Estimates Ĉ, b̂₁, b̂₂, b̂₃ of the coefficients are calculated from the data.

Data

Y has a Poisson distribution. We expect the mean of Y to be close in magnitude to the variance of Y. Possible values of Y are 0, 1, 2, 3, ...

Regressors can be interval or categorical random variables.

Main steps

1) Check if Y has a Poisson distribution.

2) Check if the normed deviance is close to 1.

3) Check if the maximum likelihood statistic is statistically significant. If its p-value ≥ 0,05, the model is unacceptable.

4) Check if all regressors are significant (Wald test p < 0,05). If not – drop them from the model. We do not pay attention to the p-value for the model constant (intercept).
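For readers working outside SPSS, the same steps can be sketched in Python with statsmodels. This is a minimal illustration on synthetic data; all names (df, x1, x2, y) are placeholders, not variables from the text.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from scipy import stats

    # Synthetic Poisson counts; x1, x2, y are placeholder names.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = rng.poisson(np.exp(0.5 + 0.3 * df["x1"] - 0.2 * df["x2"]))

    # Steps 2-4: fit mu = exp(C + b1*x1 + b2*x2) and inspect the diagnostics.
    fit = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Poisson()).fit()
    print(fit.deviance / fit.df_resid)   # normed deviance, should be close to 1
    print(fit.summary())                 # Wald tests (z, p) for each coefficient

    # Omnibus (likelihood-ratio) test against the intercept-only model.
    null = smf.glm("y ~ 1", data=df, family=sm.families.Poisson()).fit()
    lr = 2 * (fit.llf - null.llf)
    print(stats.chi2.sf(lr, df=2))       # p < 0.05: the model is acceptable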


Poisson regression with SPSS

1. Data

File ESS4FR. Variables:

agea – respondent's age,

hhmmb – number of household members,

imptrad – important to keep traditions,

eduyrs – years of formal schooling,

cldcrsv – help for childcare (0 – very bad, ..., 10 – very good).

We will model the number of household members other than the respondent by agea and cldcrsv. We investigate respondents for whom imptrad ≤ 2 and eduyrs ≤ 10.

Use Select Cases → If condition is satisfied → If and write imptrad <= 2 & eduyrs <= 10. Then Continue → OK.

The dependent variable (we call it numbhh) can be created with the help of Transform → Compute Variable.
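Outside SPSS, the filtering and the new variable could look like the pandas sketch below. The CSV file name is hypothetical (an assumed export of the SPSS file), and numbhh is computed as hhmmb − 1 on the assumption that the respondent is counted in hhmmb.

    import pandas as pd

    # "ESS4FR.csv" is a hypothetical export of the SPSS data file.
    ess = pd.read_csv("ESS4FR.csv")
    # Select Cases: imptrad <= 2 & eduyrs <= 10.
    sub = ess[(ess["imptrad"] <= 2) & (ess["eduyrs"] <= 10)].copy()
    # Compute Variable: household members other than the respondent
    # (assumes the respondent is included in hhmmb).
    sub["numbhh"] = sub["hhmmb"] - 1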

2. Preliminary analysis

First we check whether numbhh resembles a Poisson variable. Analyze → Descriptive Statistics → Frequencies.

Good Poisson regression model:

Normed deviance is close to 1;

Maximum likelihood has p < 0,05.

For all regressors Wald test p < 0,05.


Further, open Statistics and check Mean and Variance.

Statistics (numbhh)
  N Valid          281
  N Missing          0
  Mean          1.0036
  Variance       1.482

We see that the mean of numbhh (1,0036) is close in magnitude to its variance (1,482). Thus numbhh satisfies one of the most important properties of a Poisson variable.

It is also possible to check whether a random variable has a Poisson distribution with the help of the Kolmogorov-Smirnov test:

Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S.

One-Sample Kolmogorov-Smirnov Test (numbhh)
  N                                        281
  Poisson Parameter: Mean               1.0036
  Most Extreme Differences: Absolute      .066
                            Positive      .066
                            Negative     -.026
  Kolmogorov-Smirnov Z                   1.111
  Asymp. Sig. (2-tailed)                  .169
  Test distribution is Poisson.

One-Sample Kolmogorov-Smirnov Test (numbhh)
  N                                        281
  Normal Parameters: Mean               1.0036
                     Std. Deviation    1.21743
  Most Extreme Differences: Absolute      .302
                            Positive      .302
                            Negative     -.205
  Kolmogorov-Smirnov Z                   5.060
  Asymp. Sig. (2-tailed)                  .000
  Test distribution is Normal.


We see that we may assume that numbhh has a Poisson distribution (p = 0,169), but not a normal one (p = 0,000).
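These two preliminary checks can also be sketched in Python. SciPy's Kolmogorov-Smirnov test is not designed for discrete distributions with an estimated parameter, so the sketch below substitutes a chi-square goodness-of-fit test against Poisson(sample mean); the data are a synthetic stand-in for numbhh.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    numbhh = rng.poisson(1.0, size=281)          # synthetic stand-in for the data

    print(numbhh.mean(), numbhh.var(ddof=1))     # should be of similar magnitude

    # Chi-square goodness of fit against Poisson(sample mean),
    # pooling counts >= 4 so that no expected cell is tiny.
    lam = numbhh.mean()
    obs = np.bincount(np.minimum(numbhh, 4), minlength=5).astype(float)
    exp = len(numbhh) * np.append(stats.poisson.pmf(np.arange(4), lam),
                                  stats.poisson.sf(3, lam))
    chi2, p = stats.chisquare(obs, exp, ddof=1)  # one df lost estimating lambda
    print(p)                                     # p > 0.05: Poisson is plausible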

3. SPSS Poisson regression options

Choose Analyze → Generalized Linear Models → Generalized Linear Models.

Choose Type of Model and check Poisson loglinear.



Click on Response and move numbhh into Dependent Variable. Click on Predictors and move both regressors agea and cldcrsv into Covariates. (We have no categorical variables; those would be moved into Factors.)

After choosing Model, both variables should be moved into the Model window. In Statistics additionally check Include exponential parameter estimates. Then → OK.


4. Results

In the Goodness of Fit table we find the normed deviance. We see that it is close to 1 (0,919). Thus the Poisson regression model fits our data. It remains to decide which regressors are statistically significant.

Goodness of Fit
                                          Value    df   Value/df
  Deviance                              230.635   251       .919
  Scaled Deviance                       230.635   251
  Pearson Chi-Square                    188.314   251       .750
  Scaled Pearson Chi-Square             188.314   251
  Log Likelihood                       -301.040
  Akaike's Information Criterion (AIC)  608.080
  Finite Sample Corrected AIC (AICC)    608.176
  Bayesian Information Criterion (BIC)  618.692
  Consistent AIC (CAIC)                 621.692

In the Omnibus Test table we find the p-value of the maximum likelihood (likelihood-ratio) statistic. Since p < 0,05, we conclude that at least one regressor is statistically significant.

Omnibus Test
  Likelihood Ratio Chi-Square    df    Sig.
                      112.919     2    .000


In the Tests of Model Effects table we see the Wald test p-values for all regressors. We do not check the p-value for the intercept. Both p < 0,05; therefore both regressors (agea and cldcrsv) are statistically significant and should remain in the model.

Tests of Model Effects
  Source        Wald Chi-Square (Type III)   df   Sig.
  (Intercept)             41.188              1   .000
  agea                   105.703              1   .000
  cldcrsv                 14.395              1   .000

In the Parameter Estimates table the information about the Wald p-values is repeated. Moreover, the table contains estimates of the model's coefficients (column B).

Parameter Estimates
  Parameter      B     Std. Error   95% Wald CI          Wald Chi-Square  df   Sig.   Exp(B)   95% Wald CI for Exp(B)
                                    Lower      Upper                                           Lower      Upper
  (Intercept)  1.535     .2392       1.066      2.004         41.188       1   .000    4.642    2.905      7.419
  agea         -.035     .0034       -.042      -.028        105.703       1   .000     .966     .959       .972
  cldcrsv       .099     .0261        .048       .150         14.395       1   .000    1.104    1.049      1.162
  (Scale)          1 (fixed)

We can see that the coefficient for agea is negative: −0,035 < 0. This means that as the respondent's age increases, the number of household members decreases. The mathematical expression of the model is

μ̂ = exp{1,535 − 0,035·agea + 0,099·cldcrsv}.

Here μ̂ is the mean number of household members other than the respondent. Forecasting means inserting given values of agea and cldcrsv into the above formula.
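For instance, a worked forecast from the fitted equation; the input values agea = 40 and cldcrsv = 5 are illustrative choices, not taken from the text.

    import math

    # agea = 40 and cldcrsv = 5 are made-up illustrative inputs.
    mu_hat = math.exp(1.535 - 0.035 * 40 + 0.099 * 5)
    print(round(mu_hat, 2))   # about 1.88 expected other household members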

5. Categorical regressor

Categorical regressors are included via Generalized Linear Models → Predictors → Factors.


Do not forget to add ctzcntr into the Model window (ctzcntr – whether the respondent is a citizen of the country: 1 – yes, 2 – no). Then, in the table Parameter Estimates:

Parameter Estimates (excerpt)
  Parameter      B      Std. Error   95% Wald CI
                                     Lower      Upper
  (Intercept)   1.239     .3115        .629      1.850
  agea          -.036     .0034       -.043      -.029
  [ctzcntr=1]    .352     .2319       -.103       .806
  [ctzcntr=2]       0 (set to zero: redundant parameter)
  cldcrsv        .104     .0263        .053       .156

We get additional information about both ctzcntr categories. The model can then be written as

ln μ̂ = 1,239 − 0,036·agea + 0,104·cldcrsv + { 0,352, if ctzcntr = 1;  0, if ctzcntr = 2 }.
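In the Python sketch, the SPSS Factors box corresponds to dummy coding with C() in the model formula. The data below are synthetic and the coefficients only roughly mimic the fitted model above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Synthetic data with placeholder values.
    rng = np.random.default_rng(2)
    df = pd.DataFrame({"agea": rng.integers(18, 90, 300),
                       "cldcrsv": rng.integers(0, 11, 300),
                       "ctzcntr": rng.choice([1, 2], 300)})
    mu = np.exp(1.2 - 0.035 * df["agea"] + 0.1 * df["cldcrsv"]
                + 0.35 * (df["ctzcntr"] == 1))
    df["numbhh"] = rng.poisson(mu)

    # C(...) with a Treatment reference reproduces SPSS's [ctzcntr=1] vs
    # [ctzcntr=2] coding, the latter set to zero as redundant.
    fit = smf.glm("numbhh ~ agea + cldcrsv + C(ctzcntr, Treatment(reference=2))",
                  data=df, family=sm.families.Poisson()).fit()
    print(fit.params)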


2. NEGATIVE BINOMIAL REGRESSION

We are counting occurrences of some events.

Variable Y is called the dependent variable; variables X, Z, W are independent variables, also called regressors or predictors. We model the behavior of the mean value of Y:

μ = e^z,   z = C + b₁·X + b₂·Z + b₃·W.

Estimates Ĉ, b̂₁, b̂₂, b̂₃ of the coefficients are calculated from the data. NB regression is an alternative to Poisson regression. The main difference is that the variance of Y is larger than the mean of Y.

Data

Y has a negative binomial distribution. We expect the mean of Y to be smaller than the variance of Y. Possible values of Y are 0, 1, 2, 3, ...

Regressors can be interval or categorical random variables.

Main steps

1) Check if the variance of Y is greater than the mean of Y. Otherwise, NB regression is not applicable.

2) Check if the normed deviance is close to 1.

3) Check if the maximum likelihood statistic is statistically significant. If its p-value ≥ 0,05, the model is unacceptable.

4) Check if all regressors are significant (Wald test p < 0,05). If not – drop them from the model. We do not pay attention to the p-value for the model constant (intercept).
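A minimal NB sketch in Python follows; data and names are synthetic placeholders. In statsmodels the NB dispersion parameter alpha is estimated by maximum likelihood, analogous to the "Parameter – Estimate value" option used in SPSS below.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic overdispersed counts: the gamma mixture below yields a
    # negative binomial with dispersion alpha = 0.5 (placeholder names).
    rng = np.random.default_rng(3)
    df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
    mu = np.exp(0.8 + 0.4 * df["x1"] - 0.3 * df["x2"])
    df["y"] = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=300))

    print(df["y"].mean(), df["y"].var())   # step 1: variance exceeds the mean

    # NB regression with alpha estimated by ML.
    fit = smf.negativebinomial("y ~ x1 + x2", data=df).fit()
    print(fit.summary())                   # includes the estimated alpha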


Negative binomial regression with SPSS

1. Data

File ESS4SE. Variables:

emplno – respondent's number of employees,

emplnof – father's number of employees (1 – no employees, 2 – 1–24 employees, 3 – more than 25 employees),

brwmny – borrowing money for living (1 – very difficult, ..., 5 – very easy),

eduyrs – years of formal schooling.

We will model the dependence of emplno on emplnof, brwmny and eduyrs. Variable emplnof has only one observation greater than 2 (i.e., in the third category). Therefore, with Recode we create a new dichotomous variable emplnof2 (0 – no employees, 1 – at least one employee).

2. SPSS options for the negative binomial regression

Analyze → Generalized Linear Models → Generalized Linear Models.

Click on Type of Model. Do not choose Negative binomial with log link (that option fixes the NB ancillary parameter at 1 instead of estimating it).

Good Negative Binomial regression model:

Normed deviance is close to 1;

Maximum likelihood has p < 0,05.

For all regressors Wald test p < 0,05.


Check Custom → Distribution: Negative binomial, Link function: Log, Parameter: Estimate value.

Click on Response and put emplno into Dependent Variable. In Predictors put the variables eduyrs and brwmny into Covariates; put the categorical variable emplnof2 into Factors.


Choose Model and put all variables into Model window.

3. Results

At the beginning of the output we see descriptive statistics. Observe that the standard deviation of emplno (and therefore, all the more, its variance) is greater than its mean.

Categorical Variable Information
                            N    Percent
  Factor emplnof2   .00    33     50.0%
                   1.00    33     50.0%
                   Total   66    100.0%

In the Goodness of Fit table we see that the normed deviance is 0,901, that is, the overall fit of the model to the data is quite good.

Goodness of Fit (excerpt)
                     Value    df   Value/df
  Deviance          54.989    61       .901
  Scaled Deviance   54.989    61

Continuous Variable Information
                                                  N   Minimum   Maximum    Mean   Std. Deviation
  Dependent variable
    emplno (number of employees
    respondent has/had)                          66         0       763   14.73           93.831
  Covariates
    eduyrs (years of full-time
    education completed)                         66         5        23   11.71            3.732
    brwmny (borrow money to make
    ends meet, difficult or easy)                66         1         5    3.68            1.069


The Omnibus Test table contains the maximum likelihood (likelihood-ratio) statistic and its p-value. Since p < 0,05, we conclude that at least one regressor is statistically significant.

Omnibus Test
  Likelihood Ratio Chi-Square    df    Sig.
                       23.777     3    .000

Tests of Model Effects contains Wald tests for each regressor. All regressors are statistically significant (we do not check the p-value for the intercept).

Tests of Model Effects
  Source        Wald Chi-Square (Type III)   df   Sig.
  (Intercept)               .151              1   .698
  emplnof2                 6.298              1   .012
  eduyrs                   4.959              1   .026
  brwmny                   7.399              1   .007

The Parameter Estimates table contains the parameter estimates (surprise, surprise):

Parameter Estimates
  Parameter             B      Std. Error   95% Wald CI          Wald Chi-Square  df   Sig.   Exp(B)   95% Wald CI for Exp(B)
                                            Lower      Upper                                           Lower      Upper
  (Intercept)         1.590     2.1316      -2.588      5.768          .556        1   .456    4.904     .075    319.831
  [emplnof2=.00]     -1.629      .6493      -2.902      -.357         6.298        1   .012     .196     .055       .700
  [emplnof2=1.00]         0 (set to zero: redundant parameter)
  eduyrs               .286      .1286        .034       .539         4.959        1   .026    1.332    1.035      1.714
  brwmny              -.753      .2768      -1.295      -.210         7.399        1   .007     .471     .274       .810
  (Scale)                 1 (fixed)
  (Negative binomial)  5.327    1.2084       3.415      8.310

Estimated model:

ln μ̂ = 1,590 + 0,286·eduyrs − 0,753·brwmny + { −1,629, if emplnof2 = 0;  0, if emplnof2 = 1 }.

Here μ̂ is the mean number of employees.
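A worked forecast from this equation; the input values eduyrs = 12, brwmny = 3 and emplnof2 = 1 are illustrative choices, not taken from the text.

    import math

    # eduyrs = 12, brwmny = 3, emplnof2 = 1 are illustrative inputs.
    log_mu = 1.590 + 0.286 * 12 - 0.753 * 3   # emplnof2 = 1 contributes 0
    print(round(math.exp(log_mu), 1))         # about 15.8 employees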



3. PROBIT REGRESSION

Model

We are modelling a two-valued variable. Probit regression can be used whenever logistic regression applies, and vice versa.

Model scheme

Variable Y is the dependent variable; X, Z, W are independent variables (regressors). Typically the values of Y are coded 0 or 1. The model is constructed for P(Y = 0):

P(Y = 0) = Φ(C + b₁·X + b₂·Z + b₃·W).

Here Φ(⋅) is the distribution function of the standard normal random variable. An equivalent expression is

Φ⁻¹(P(Y = 0)) = C + b₁·X + b₂·Z + b₃·W.

Here Φ⁻¹(⋅) is the inverse function, also known as the probit function.

If b₁ > 0, then P(Y = 0) grows as X grows. If b₁ < 0, then P(Y = 1) grows as X grows.
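A minimal probit sketch in Python. One caveat: statsmodels' probit models P(endog = 1), so to mirror the setup above, which models P(Y = 0), we regress the indicator of Y = 0. Data and names are synthetic placeholders.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import norm

    # Synthetic data: P(y = 1) = Phi(-0.5 + 0.8*x), so P(y = 0) = Phi(0.5 - 0.8*x).
    rng = np.random.default_rng(4)
    df = pd.DataFrame({"x": rng.normal(size=400)})
    df["y"] = (rng.random(400) < norm.cdf(-0.5 + 0.8 * df["x"])).astype(int)
    df["y0"] = (df["y"] == 0).astype(int)      # indicator of Y = 0

    fit = smf.probit("y0 ~ x", data=df).fit()
    print(fit.params)   # estimates of C and b1 (true values here: 0.5 and -0.8)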

Data

a) Variable Y is dichotomous. The data for Y contain at least 20 % zeros and at least 20 % ones.

b) If the model contains many categorical variables, the data should contain at least 5 observations for each combination of categories.

c) There is no strong correlation between the regressors.


Model fit

The model fits the data if:

the maximum likelihood test has p < 0,05;

the Wald test has p < 0,05 for all regressors;

a large share of both Y = 1 and Y = 0 cases is classified correctly;

for all observations, Cook's distance ≤ 1;

the (pseudo) coefficient of determination is ≥ 0,20.

Probit regression with SPSS

1. Data

File LAMS. Variables:

K2 – university,

K33_2 – studies were important for achieving my present position (1 – absolutely unimportant, ..., 5 – very important),

K35_1 – my present occupation corresponds to my bachelor studies (1 – agree, 2 – rather agree than disagree, 3 – rather disagree than agree, 4 – disagree),

K36_1 – I use professional skills obtained during studies (1 – never, ..., 5 – very frequently),

K37_1 – satisfaction with my profession (1 – not at all, ..., 5 – very much).

With Recode we create from K36_1 a new two-valued variable Y (0 – if the respondent rarely applies professional skills obtained during studies, 1 – if frequently). Model scheme:

Y = f(K35_1, K33_2, K37_1).



2. SPSS options

Analyze → Generalized Linear Models → Generalized Linear Models.

Choose Type of Model and check Binary probit.


Open Response and put Y into Dependent Variable.

Open Predictors and move K37_1 and K33_2 into Covariates (with some reservation we treat these variables as interval ones). Regressor K35_1 takes only 4 values and is therefore treated as a categorical variable; move K35_1 into Factors.

Open the Model window and move all regressors to the right.


Open Save and check Predicted category and Cook's distance.

3. Results

The model is constructed for P(Y = 0). In Categorical Variable Information we check that there is a sufficient number of respondents for each value of the categorical variables (Y included).

Categorical Variable Information
                                              N     Percent
  Dependent Variable Y     .00               100     31.1%
                          1.00               222     68.9%
                          Total              322    100.0%
  Factor K35_1 (correspondence of current job to the field of bachelor / integrated studies)
                          1 Definitely yes   153     47.5%
                          2 Rather yes        95     29.5%
                          3 Rather no         36     11.2%
                          4 Definitely no     38     11.8%
                          Total              322    100.0%

In the Omnibus Test table we check that the p-value of the maximum likelihood test is sufficiently small: p = 0,000 < 0,05.

Omnibus Test
  Likelihood Ratio Chi-Square    df    Sig.
                      163.847     5    .000
  Dependent Variable: Y
  Model: (Intercept), K35_1, K37_1, K33_2

The Parameter Estimates table contains the parameter estimates and Wald tests for the significance of each regressor. We do not check the significance of the Intercept. Categorical variable K35_1 was replaced by 4 dummy variables, one of which is not statistically significant. However, a single insignificant dummy is not a rational reason to drop K35_1 from the model.

Parameter Estimates
  Parameter       B      Std. Error   95% Wald CI          Wald Chi-Square  df   Sig.
                                      Lower      Upper
  (Intercept)    4.853     .7092       3.463      6.243         46.832       1   .000
  [K35_1=1]     -1.577     .3272      -2.218      -.936         23.229       1   .000
  [K35_1=2]     -1.018     .3226      -1.650      -.385          9.953       1   .002
  [K35_1=3]      -.261     .3722       -.991       .468           .493       1   .482
  [K35_1=4]          0 (set to zero: redundant parameter)
  K37_1          -.273     .1141       -.496      -.049          5.720       1   .017
  K33_2          -.780     .1151      -1.005      -.554         45.859       1   .000
  (Scale)            1 (fixed)
  Dependent Variable: Y
  Model: (Intercept), K35_1, K37_1, K33_2

We obtained four models, which differ by the constant only. They can be written in the following way:

P̂(Y = 0) = P(rarely applies knowledge at work) = Φ(z),

z = 4,85 − 0,273·K37_1 − 0,78·K33_2 + { −1,57, if K35_1 = 1;  −1,02, if K35_1 = 2;  −0,26, if K35_1 = 3;  0, if K35_1 = 4 }.

The signs of the coefficients agree with the general logic of the model. The coefficient for K37_1 is negative: the larger the value of K37_1 (the more satisfied the respondent is with the profession), the less probable it is that knowledge is rarely used. The other signs can be explained similarly.

We treated probit regression as a special case of the generalized linear model. Therefore, one can check the magnitude of the normed deviance in the Goodness of Fit table. We see that it is close to unity (1,156), which indicates a good fit of the model. Note that for probit regression the small p-value of the maximum likelihood test (found in Omnibus Test) is more important. If all of the model's characteristics except the deviance show good fit, we assume the model is acceptable.


Goodness of Fit
                                          Value    df   Value/df
  Deviance                               49.722    43      1.156
  Scaled Deviance                        49.722    43
  Pearson Chi-Square                     48.218    43      1.121
  Scaled Pearson Chi-Square              48.218    43
  Log Likelihood                        -47.932
  Akaike's Information Criterion (AIC)  107.865
  Finite Sample Corrected AIC (AICC)    108.131
  Bayesian Information Criterion (BIC)  130.512
  Consistent AIC (CAIC)                 136.512

To check for outliers, choose Analyze → Descriptive Statistics → Descriptives. Move the variable CooksDistance into Variable(s). Choose OK.

Descriptive Statistics
                                   N    Minimum   Maximum    Mean     Std. Deviation
  CooksDistance Cook's Distance   322     .000      .039     .00324        .006749
  Valid N (listwise)              322

The maximal Cook's distance is 0,039 < 1. Therefore, there are no outliers in our data.
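The same check can be sketched in Python; fitting the probit as a GLM makes influence measures available via get_influence(). Data and names are synthetic placeholders.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    # Synthetic 0/1 indicator of Y = 0 (placeholder data).
    rng = np.random.default_rng(5)
    x = rng.normal(size=300)
    y0 = (rng.random(300) < norm.cdf(0.5 - 0.8 * x)).astype(int)

    # Probit fitted as a binomial GLM with a probit link.
    fit = sm.GLM(y0, sm.add_constant(x),
                 family=sm.families.Binomial(link=sm.families.links.Probit())).fit()
    cooks = fit.get_influence().cooks_distance[0]
    print(cooks.max())   # values well below 1 suggest no influential outliers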

To obtain the classification table, choose Analyze → Descriptive Statistics → Crosstabs. Move Y into Row(s) and PredictedValue into Column(s). Choose Cells and check Row. Then Continue and OK.

Y * PredictedValue Crosstabulation
                            Predicted Category Value
                              .00       1.00      Total
  Y   .00    Count             66         34        100
             % within Y      66.0%      34.0%     100.0%
      1.00   Count             17        205        222
             % within Y       7.7%      92.3%     100.0%
  Total      Count             83        239        322
             % within Y      25.8%      74.2%     100.0%


Of the 100 respondents who rarely use professional skills obtained during studies, 66 are correctly classified (66 %). Of the 222 respondents who frequently use such skills, 205 are correctly classified (92,3 %). Recalling the Categorical Variable Information table and its percentages (31,1 % and 68,9 %, respectively), we see that the probit model forecasts much better than random guessing. Final conclusion: the probit regression model fits the data sufficiently well.

4. Forecasting

A single value can be forecast in the following way. Assume the previous model is applied to a respondent for whom K33_2 = 4, K35_1 = 1, K37_1 = 4. We add an additional row to the data, writing 4 in column K33_2, 1 in column K35_1, 4 in column K37_1 and 1 in column filter_$. The remaining columns are left empty.

We repeat the probit analysis, but in the Save window we also check Predicted value of mean of response. A new column MeanPredicted appears in the data, containing the probabilities P(Y = 0). For our respondent we get the probability 0,175. Therefore it is unlikely that this respondent rarely applies the skills from his studies; most probably he does apply them in his professional work.
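The same forecast can be reproduced by hand from the fitted equation; the small difference from SPSS's 0,175 comes from the rounding of the coefficients.

    from scipy.stats import norm

    # z from the fitted equation for K33_2 = 4, K35_1 = 1, K37_1 = 4.
    z = 4.85 - 0.273 * 4 - 0.78 * 4 - 1.57
    print(round(norm.cdf(z), 3))   # about 0.176, close to MeanPredicted = 0.175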