7 bivariate eda

Upload: rahuliiim

Post on 03-Apr-2018

TRANSCRIPT

  • 7/28/2019 7 Bivariate Eda

    bivariate EDA and regression analysis

    [Figure: plot with axes labeled "length" and "width"]

    [Figure: plot with axes labeled "distance from quarry" and "weight of core"]

    [Figure, shown on three consecutive slides: scatterplot of AG_C1_2 (vertical axis, -5 to 3) against AG_C1_1 (horizontal axis, -4 to 5)]

    scatterplot matrix

    [Figure: scatterplot matrix of AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1 against AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2]

    [Figure: scatterplot matrix of all eight variables: AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1, AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2]

    [Figure: scatterplot of AG_C2_1 (vertical axis, -10 to 10) against AG_C1_1 (horizontal axis, -4 to 5)]

    scatterplots

    scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them

    simpler kinds of summaries may be useful

    more compact; often capture less detail

    may support more extended mathematical analyses

    may reveal fundamental relationships

    [Figure: scatterplot of AG_C1_2 (vertical axis, -5 to 3) against AG_C1_1 (horizontal axis, -4 to 5)]


    y = a + bx

    y = a + bx

    [Figure: straight line on x and y axes (each 1 to 6), passing through the points (x1, y1) and (x2, y2)]

    a = y-intercept

    b = slope

    $b = \Delta y / \Delta x$

    $b = (y_2 - y_1)/(x_2 - x_1)$
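The slope and intercept definitions can be checked with a few lines of Python. This is only a sketch: the function name `line_through` and the two example points are illustrative assumptions, not part of the slides.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    b = (y2 - y1) / (x2 - x1)  # slope: change in y over change in x
    a = y1 - b * x1            # intercept: solve y1 = a + b*x1 for a
    return a, b


# Hypothetical points (x1, y1) = (1, 2) and (x2, y2) = (5, 4)
a, b = line_through((1, 2), (5, 4))
print(f"y = {a} + {b}x")
```

Any two distinct points on the same line give the same a and b, which is why the slope can be read off any pair of points.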

    y = a + bx

    we can predict values of y from values of x

    predicted values of y are called "y-hat"

    the predicted values (y) are often regarded as dependent on the (independent) x values

    try to assign independent values to the x-axis, dependent values to the y-axis

    $\hat{y} = a + bx$

    y = a + bx

    becomes a concise summary of a point distribution, and a model of a relationship

    may have important explanatory and predictive value

    how do we come up with these lines?

    various options:

    by eye

    calculating a "Tukey Line" (resistant to outliers)

    locally weighted regression (LOWESS)

    least squares regression

    linear regression

    linear regression and correlation analysis are generally concerned with fitting lines to real data

    least squares regression is one of the main tools

    attempts to minimize deviation of observed points from the regression line

    maximizes its potential for prediction

    standard approach minimizes the squared variation in y

    $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    Note:

    these are the vertical deviations

    this is a "sum-squared-error" approach

    regressing x on y would involve defining the line

    $\hat{x}_i = c + d y_i$

    by minimizing

    $\sum (x_i - \hat{x}_i)^2$

    calculating a line that minimizes this value (the sum of squared vertical deviations in y) is called "regressing y on x"

    appropriate when we are trying to predict y from x

    this is also called "Model I Regression"

    start by calculating the slope (b):

    $b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

    numerator: covariance

    once you have the slope, you can calculate the y-intercept (a):

    $a = \bar{y} - b\bar{x} = \frac{\sum y_i - b \sum x_i}{n}$
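The two steps (slope first, then intercept) can be sketched in plain Python. The function name `least_squares` and the five data points are hypothetical, chosen only to illustrate the arithmetic of regressing y on x.

```python
def least_squares(xs, ys):
    """Regress y on x; return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations (the covariance term)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations in x
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx           # slope
    a = y_bar - b * x_bar   # intercept: the line passes through (x-bar, y-bar)
    return a, b


# Hypothetical data lying close to y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Because only deviations in y are minimized, swapping the arguments (regressing x on y) generally yields a different line.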


    regression pathologies

    things to avoid in regression analysis


    Tukey Line

    resistant to outliers

    divide cases into thirds, based on x-axis

    identify the median x and y values in upper and lower thirds

    slope (b)= (My3-My1)/(Mx3-Mx1)

    intercept (a) = median of all values yi-b*xi
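A minimal sketch of the Tukey Line steps follows. The slides do not say how to split the cases when n is not divisible by three; taking n // 3 cases for each outer third is an assumption made here, as are the function name `tukey_line` and the example data.

```python
from statistics import median


def tukey_line(xs, ys):
    """Resistant line: return (a, b) for y = a + b*x."""
    pairs = sorted(zip(xs, ys))   # order the cases along the x-axis
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three cases")
    k = n // 3                    # assumed size of the lower and upper thirds
    lower, upper = pairs[:k], pairs[-k:]
    mx1 = median([x for x, _ in lower])
    my1 = median([y for _, y in lower])
    mx3 = median([x for x, _ in upper])
    my3 = median([y for _, y in upper])
    b = (my3 - my1) / (mx3 - mx1)                 # slope from the outer medians
    a = median([y - b * x for x, y in pairs])     # median of all yi - b*xi
    return a, b


# Hypothetical data on y = 1 + 2x, with one gross outlier in the last case
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 5, 7, 9, 11, 13, 15, 17, 100]
print(tukey_line(xs, ys))
```

Because only medians enter the calculation, the single outlier in this example leaves the fitted line unchanged, whereas a least squares fit would be pulled toward it.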


    Correlation

    regression concerns fitting a linear model to observed data

    correlation concerns the degree of fit between observed data and the model...

    if most points lie near the line:

    the fit of the model is good

    the two variables are strongly correlated

    values of y can be well predicted from x

    Pearson's r

    this is assessed using the product-moment correlation coefficient:

    $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

    = covariance (the numerator), standardized by a measure of variation in both x and y

    [Figure: x-y plot with a point (xi, yi) and regions marked + and -, shown alongside the formula for r]

    unlike the covariance, r is unit-less

    ranges between -1 and 1

    0 = no correlation

    -1 and 1 = perfect negative and positive correlation (respectively)

    r is symmetrical

    correlation between x and y is the same as between y and x

    no question of independence or dependence

    recall, this symmetry is not true of regression
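These properties of r can be checked numerically. The sketch below assumes the product-moment formula for r and the least squares slope; the helper names and the data are hypothetical.

```python
from math import sqrt


def sums_of_squares(xs, ys):
    """Return (sxy, sxx, syy): sums of cross-products and squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    sxy, sxx, syy = sums_of_squares(xs, ys)
    return sxy / sqrt(sxx * syy)


def slope(xs, ys):
    """Slope of ys regressed on xs."""
    sxy, sxx, _ = sums_of_squares(xs, ys)
    return sxy / sxx


xs = [1, 2, 3, 4, 5]        # hypothetical data
ys = [2, 4, 5, 4, 10]

print(pearson_r(xs, ys), pearson_r(ys, xs))      # symmetrical: same value
print(pearson_r([100 * x for x in xs], ys))      # unit-less: rescaling x leaves r unchanged
print(slope(xs, ys), slope(ys, xs))              # regression is not symmetrical
```

The two slopes differ because regressing y on x minimizes vertical deviations, while regressing x on y minimizes horizontal ones.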

    regression/correlation

    one can assess the strength of a relationship by seeing how knowledge of one variable improves the ability to predict the other

    if you ignore x, the best predictor of y will be the mean of all y values (y-bar)

    if the y measurements are widely scattered, prediction errors will be greater than if they are close together

    we can assess the dispersion of y values around their mean by:

    $\sum (y_i - \bar{y})^2$

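    [Figure: deviations of the points from the mean $\bar{y}$, summarized by $\sum (y_i - \bar{y})^2$, compared with deviations from the predicted values $\hat{y}_i$, summarized by $\sum (y_i - \hat{y}_i)^2$]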

    coefficient of determination (r2)

    $r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$

    describes the proportion of variation that is "explained" or accounted for by the regression line

    r2 = .5

    half of the variation is explained by the regression

    half of the variation in y is explained by variation in x
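As a rough check of the "proportion of variation explained" reading, the sketch below compares squared deviations around y-bar with squared deviations around the fitted line. It assumes the usual definition of r2 as one minus the ratio of residual to total sum of squares; the function name and data are hypothetical.

```python
def r_squared(xs, ys):
    """Proportion of the variation in y accounted for by regressing y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # Prediction error when x is ignored: deviations from the mean of y
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    # Prediction error when x is used: deviations from y-hat
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


xs = [1, 2, 3, 4, 5]        # hypothetical data
ys = [2, 4, 5, 4, 10]
print(round(r_squared(xs, ys), 3))
```

For a least squares line this value equals the square of Pearson's r for the same data.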

    [Figure: plot with labels $\bar{y}$ and $\hat{y}_i$]

    correlation and percentages

    much of what we want to learn about association between variables can be learned from counts

    ex: are high counts of bone needles associated with high counts of end scrapers?

    sometimes, similar questions are posed of percent-standardized data

    ex: are high proportions of decorated pottery associated with high proportions of copper bells?

    caution

    these are different questions and have different implications for formal regression

    percents will show at least some level of correlation even if the underlying counts do not

    spurious correlation (negative)

    "closed-sum" effect

    case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
    1       15    14    94    59    76    13     8    97    10     95
    2       35     1    89    95    23    77    14     9    27     43
    3       20    96    73    31    90    65    74    60    85     27
    4       23    59     7    52    33    83    71    35    57     90
    5       36    90    86    15    97    54    52    41    34      3
    6       79     2    26     5    11    68    74    44    13     87
    7       40    99    28    66    77    23    69    22    63     36
    8       95    36    22    75    21    48    95    58    74     68
    9       27     0    58    99    32    30     5     5   100     75
    10      67    93    98    61    62    94     3    16    43     48

    [Slide annotations mark subsets of the variables: 10 vars., 5 vars., 3 vars., 2 vars.]
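The closed-sum effect can be reproduced from the ten cases tabulated above. The sketch below correlates the first two variables as counts, then as percents of the row total over the first k variables. Which columns the slide's "10 vars., 5 vars., 3 vars., 2 vars." subsets contain is not recoverable from the transcript, so using the first k columns is an assumption, as are the function names.

```python
from math import sqrt

counts = [
    [15, 14, 94, 59, 76, 13, 8, 97, 10, 95],
    [35, 1, 89, 95, 23, 77, 14, 9, 27, 43],
    [20, 96, 73, 31, 90, 65, 74, 60, 85, 27],
    [23, 59, 7, 52, 33, 83, 71, 35, 57, 90],
    [36, 90, 86, 15, 97, 54, 52, 41, 34, 3],
    [79, 2, 26, 5, 11, 68, 74, 44, 13, 87],
    [40, 99, 28, 66, 77, 23, 69, 22, 63, 36],
    [95, 36, 22, 75, 21, 48, 95, 58, 74, 68],
    [27, 0, 58, 99, 32, 30, 5, 5, 100, 75],
    [67, 93, 98, 61, 62, 94, 3, 16, 43, 48],
]


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def percent_r(k):
    """r between v1 and v2 after converting the first k variables to row percents."""
    p1 = [100 * row[0] / sum(row[:k]) for row in counts]
    p2 = [100 * row[1] / sum(row[:k]) for row in counts]
    return pearson_r(p1, p2)


raw_r = pearson_r([row[0] for row in counts], [row[1] for row in counts])
print("counts:", round(raw_r, 2))
for k in (10, 5, 3, 2):
    print(f"percents ({k} vars.):", round(percent_r(k), 2))
```

With only two variables the percents must sum to 100, so one is exactly 100 minus the other and r is forced to -1 regardless of the counts.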

    [Figure: five plots of r on a -1.0 to 1.0 axis, titled: original counts; percents (10 vars.); percents (5 vars.); percents (3 vars.); percents (2 vars.)]

    [Figure: five scatterplots: C_V2 against C_V1 (axes 0 to 100); P10_V2 against P10_V1 (axes 0 to 20); T5_V2 against T5_V1; T3_V2 against T3_V1; T2_V2 against T2_V1]

    regression assumptions

    both variables are measured at the interval scale or above

    variation is the same at all points along the regression line (variation is homoscedastic)

    residuals

    vertical deviations of points around the regression

    for case i, residual = $y_i - \hat{y}_i$ [$y_i - (a + bx_i)$]

    residuals in y should not show patterned variation either with x or y-hat

    normally distributed around the regression line

    residual error should not be autocorrelated (errors/residuals in y are independent)

    standard error of the regression

    recall: "standard error" of an estimate (SEE) is like a standard deviation

    can calculate an SEE for residuals associated with a regression formula

    $S_{\hat{y}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n}}$

    to the degree that the regression assumptions hold, there is a 68% probability that true values of y lie within 1 SEE of y-hat

    95% within 2 SEE

    can plot lines showing the SEE

    y-hat = a+bx +/- SEE
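The residuals and the SEE band can be sketched as follows. The divisor n follows what the slide's formula appears to show; many texts divide by n - 2 instead, so treat the divisor as an assumption. Function names and data are hypothetical.

```python
from math import sqrt


def fit(xs, ys):
    """Least squares regression of y on x; return (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def standard_error(xs, ys, a, b):
    """SEE: root mean squared residual (divisor n is an assumption)."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # yi - y-hat_i
    return sqrt(sum(e ** 2 for e in residuals) / len(xs))


xs = [1, 2, 3, 4, 5]        # hypothetical data
ys = [2, 4, 5, 4, 10]
a, b = fit(xs, ys)
see = standard_error(xs, ys, a, b)

# Lines one SEE above and below the regression line, evaluated at each x
for x in xs:
    y_hat = a + b * x
    print(x, round(y_hat - see, 2), round(y_hat, 2), round(y_hat + see, 2))
```

The 68% and 95% readings of these bands depend on the residuals being roughly normal and homoscedastic, as the assumptions above require.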

    data transformations and regression

    read Shennan, Chapter 9 (esp. pp. 151-173)

    [Figure: two scatterplots of VAR2 against VAR1 (both axes 0 to 200)]

    [Figure: scatterplot of VAR2 (0 to 200) against VAR1 (40 to 160)]

    [Figure: scatterplot of VAR2 (0 to 200) against VAR1T (0 to 15)]

    let VAR1T = sqr(VAR1)
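The effect of such a transformation can be illustrated with made-up data. The sketch reads `sqr` as the square root (consistent with VAR1T running from 0 to about 15 when VAR1 runs to 200); the data and the curved relationship are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


var1 = [0, 25, 50, 75, 100, 125, 150, 175, 200]   # hypothetical values
var2 = [14 * sqrt(v) for v in var1]               # hypothetical curved relationship

var1t = [sqrt(v) for v in var1]                   # counterpart of: let VAR1T = sqr(VAR1)

r_raw = pearson_r(var1, var2)
r_transformed = pearson_r(var1t, var2)
print(round(r_raw, 3), round(r_transformed, 3))
```

When the relationship is curved in this way, a straight line describes VAR2 against VAR1T better than VAR2 against VAR1, which is the reason for transforming before regressing.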

    distribution and fall-off models

    ex: density of obsidian vs. distance from the quarry:

    [Figure: scatterplot of DENSITY (0 to 6) against DIST (0 to 80)]

    [Figure: scatterplot of DENSITY (0 to 6) against DIST (0 to 80)]

    Plot of Residuals against Predicted Values

    [Figure: RESIDUAL (-1 to 2) against ESTIMATE (-1 to 4)]

    [Figure: DENSITY (ticks 1 to 6) against DIST (0 to 80)]

    [Figure: LG_DENS (-3 to 2) against DIST (0 to 80)]

    LG_DENS = log(DENSITY)

    [Figure: LG_DENS (-3 to 2) against DIST (0 to 80), with fitted line]

    y = 1.70 - .05x

    [remember y is logged density]

    [Figure: two plots of DENSITY (0 to 6) against DISTANCE (0 to 80)]

    log y = 1.70 - .05x

    fplot y = exp(1.70-.05*x)
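Back-transforming the fitted line gives the fall-off curve on the original density scale. The sketch below uses the coefficients from the slides and assumes the natural logarithm, since the plotting command uses exp; the function name is hypothetical.

```python
from math import exp

A = 1.70    # intercept of the fitted line on the log scale (from the slides)
B = -0.05   # slope per unit of distance (from the slides)


def predicted_density(distance):
    """Undo the log transformation: y = exp(a + b*x)."""
    return exp(A + B * distance)


for d in (0, 20, 40, 60, 80):
    print(d, round(predicted_density(d), 2))
```

A straight line in log(density) becomes an exponential decay in density: each additional unit of distance multiplies the predicted density by exp(-0.05).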


    begin

    PLOT DENSITY*DISTANCE / FILL=1,0,0

    fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL=''

    XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0

    XMAX=80 YMIN=0 YMAX=6

    end


    transformation summary

    correcting left skew:

    $x^4$ stronger

    $x^3$ strong

    $x^2$ mild

    correcting right skew:

    $\sqrt{x}$ weak

    $\log(x)$ mild

    $-1/x$ strong

    $-1/x^2$ stronger
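The ordering of the right-skew corrections can be illustrated by measuring skewness after each transformation. The moment-based skewness coefficient and the small right-skewed data set below are assumptions chosen for illustration.

```python
from math import exp, log, sqrt


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (positive = right skew)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: exponentials of evenly spaced values
x = [exp(z) for z in (-2, -1, 0, 1, 2)]

ladder = {
    "x": x,
    "sqrt(x)": [sqrt(v) for v in x],
    "log(x)": [log(v) for v in x],
    "-1/x": [-1 / v for v in x],
}

for name, values in ladder.items():
    print(f"{name:8s} skewness = {skewness(values):+.2f}")
```

For this data set the square root reduces the right skew, the log removes it, and -1/x overcorrects into left skew, so the strength of the transformation should be matched to the degree of skew.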