
MEASURES OF CENTRAL TENDENCY & VARIABILITY + NORMAL DISTRIBUTION

Day 3, Summer 2015

7/28/2015

Fang Chen, ECNU 陈芳 华东师大英语系 (Fang Chen, English Department, East China Normal University)

DISTRIBUTION

Symmetry

Modality: unimodal or bimodal (单峰，双峰)

Skewness: positive or negative skew (正偏或负偏)

Kurtosis

CHAPTER 4: Measures of Central Tendency 集中趋势

One major purpose of statistical procedures is to summarize raw data in a meaningful way so that we can draw conclusions.

e.g. You wonder how the students in your colleague’s class did on the final exam this year. There is a number you REALLY want to know: ___________

Statistics that describe central tendency are numerical values that describe the center of a distribution of scores for a variable.


CENTRAL TENDENCY

Three common measures of central tendency:

Mode 众数

Median 中数

Mean 平均数

FINDING THE MODE 众数

Value N %

1 2 11.8

2 1 5.9

3 1 5.9

4 2 11.8

5 3 17.6

6 4 23.5

7 2 11.8

8 1 5.9

9 1 5.9

Total 17 100


Create a frequency distribution for a set of values and find the value that occurs most frequently.

X = {1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7}
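As a minimal sketch of this procedure (the variable names are illustrative and not part of the original slides), the frequency distribution and the mode of X could be computed in Python with the standard library:

```python
from collections import Counter

# The 17 scores listed on the slide.
X = [1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7]

# Frequency distribution: value -> number of occurrences.
freq = Counter(X)

# Print the table: value, count (N), and percentage of all scores.
for value in sorted(freq):
    print(value, freq[value], round(100 * freq[value] / len(X), 1))

# The mode is the value (or values) with the highest frequency.
highest = max(freq.values())
modes = [value for value, count in freq.items() if count == highest]
print("Mode:", modes)
```

Returning a list of modes rather than a single value is a design choice here: a bimodal set of scores would otherwise silently lose one of its modes.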

FINDING THE MEDIAN 中数

X = {1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9}

X = {1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9}
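The two sets differ in size (15 and 16 scores), so they illustrate the odd and even cases. A small sketch, assuming the usual convention of averaging the two middle scores when N is even (the function name is hypothetical):

```python
def median(scores):
    """Middle score for odd N; mean of the two middle scores for even N."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9]       # N = 15
even_set = [1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9]   # N = 16

median_odd = median(odd_set)
median_even = median(even_set)
print(median_odd, median_even)
```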

THE MEAN / AVERAGE 平均数

The most common measure of central tendency.

Defined as the average of all the observed scores.

Usually has to be calculated.

The statistical notation for the mean is:

µ when we are dealing with populations

$\bar{X}$ when we are dealing with samples

We calculate the mean with:

$$\bar{X} = \frac{\sum X}{N}$$
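The formula can be checked directly. This sketch reuses the 17 scores from the mode slide, which is an assumption made for illustration only:

```python
# Scores reused from the mode example (an illustrative choice).
X = [1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7]

N = len(X)          # number of observations
total = sum(X)      # the sum of all scores
mean_X = total / N  # the mean: sum of scores divided by N

print(total, N, mean_X)
```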

COMPARE AND CONTRAST

The more symmetric a distribution is, the closer these three measures of central tendency will be.

If a distribution is truly normal (symmetric and unimodal), then the mean, median, and mode will be exactly the same.

Unfortunately, this rarely happens. We must choose the measure that best suits our purposes and data.

SOME ADVANTAGES AND DISADVANTAGES - MODE

Advantages:
  Any randomly selected observation, Xi, is more likely to be the mode than any other score.
  It is the only measure of central tendency that can be used with nominal data.
  It is not affected by extreme scores.

Disadvantages:
  It depends on the sample of data and may not be representative of the population.
  It can depend on the way the data are grouped.
  It cannot be defined by a simple mathematical equation.

ADVANTAGES AND DISADVANTAGES - MEDIAN

Advantages:
  It is unaffected by extreme scores (outliers).
  …

Disadvantages:
  It depends on the sample of data and is not easily generalized to the greater population.
  It does not enter statistical equations readily and is therefore more difficult to work with than the mean.
  It may not be an actual value observed in the data.

ADVANTAGES AND DISADVANTAGES - MEAN

Advantages:
  The mean can be defined mathematically with a simple equation and can easily be manipulated algebraically.
  Sample means are more stable estimates of the population's central tendency than sample medians or modes.

Disadvantages:
  It is influenced by extreme values (very sensitive to outliers).
  The sample mean may not be an actual value observed in the data.
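To make the sensitivity to outliers concrete, here is a hedged sketch that reuses the 17 scores from the mode slide and replaces the single score of 9 with a made-up extreme value of 90 (the outlier is an assumption for illustration, not data from the course):

```python
import statistics

scores = [1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7]

# Hypothetical outlier: the one score of 9 is recorded as 90 instead.
with_outlier = [90 if x == 9 else x for x in scores]

mean_before = statistics.mean(scores)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves because every score enters the sum; the median depends only on the middle position, so a single extreme score leaves it unchanged in this example.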

CHAPTER 5: Measures of Variability 分散趋势/变异性

VARIABILITY / DISPERSION 变异性

Variability describes how the data are distributed around a measure of central tendency (e.g., the mean).

Measures of variability describe the way and the degree to which the data are spread out.

Measures of variability quantify how similar the scores in a sample are to one another.

CONSIDER THE FOLLOWING:

Two classes were assigned to the same teacher. In the first class, every child comes from a family in which at least one parent is a teacher or professor. In the second class, the children come from a variety of family backgrounds.

How similar do you expect the pretest scores within each of the two groups to be?

THE RESULTING DATA…


THE DATA FROM A GRAPHICAL PERSPECTIVE

Sample 1: more variability

Sample 2: less variability

MEASURES OF VARIABILITY

The range 全距

The interquartile range 四分位距

Deviation 离差

Average deviation

Mean absolute deviation

Variance 方差

Standard deviation 标准差

RANGE 全距

The distance between the lowest and highest value.

Data from the previous example:

The range can be heavily influenced by extreme scores.

THE INTERQUARTILE RANGE 四分位距

The interquartile range is the range of the middle 50% of the observations.

A trimmed statistic: how much from the lower end and the upper end respectively?

Calculated by taking the difference between the 75th percentile and the 25th percentile.

Percentile: the score value below which a given percentage of the observations fall.

FINDING THE INTERQUARTILE RANGE

Using the data from our example:

Sample 1: P25 = 37 and P75 = 77, for an interquartile range of 40 score points

Sample 2: P25 = 68 and P75 = 93, for an interquartile range of 25 score points

The interquartile range has the opposite problem from the range: it discards too much of the data.
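The class data behind P25 and P75 are not reproduced in these notes, so the following sketch uses made-up scores. It also assumes one particular percentile convention (`method="inclusive"` in Python's `statistics.quantiles`); SPSS and textbooks may use slightly different rules and give slightly different quartiles.

```python
import statistics

# Hypothetical scores, chosen only for illustration.
scores = [35, 40, 45, 50, 55, 60, 65, 70, 100]

# Range: distance between the lowest and highest value.
score_range = max(scores) - min(scores)

# Quartiles (25th, 50th, 75th percentiles) under the "inclusive" convention.
p25, p50, p75 = statistics.quantiles(scores, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the observations.
iqr = p75 - p25

print("range:", score_range)
print("P25:", p25, "P75:", p75, "IQR:", iqr)
```

The single extreme score of 100 widens the range but does not enter the interquartile range, which is the trade-off the slide describes.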

DEVIATION 离差

The difference between every data point and the mean

The average deviation

The mean absolute deviation (MAD)

Variance

Standard deviation

AVERAGE DEVIATION

We could find $d_i = (X_i - \bar{X})$ for each observed value.

Then use

$$\operatorname{mean}(d) = \frac{\sum_{i=1}^{N} d_i}{N}$$

to look at how far, on average, the observations are from the mean.

While the logic is sound, the average deviation for any sample will always be equal to zero. Why?
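One way to answer the question is a short worked derivation that uses only the definition of the mean, $\bar{X} = \sum X_i / N$, so that $\sum X_i = N\bar{X}$:

```latex
\begin{aligned}
\sum_{i=1}^{N} d_i
  &= \sum_{i=1}^{N} (X_i - \bar{X}) \\
  &= \sum_{i=1}^{N} X_i - N\bar{X} \\
  &= N\bar{X} - N\bar{X} \\
  &= 0
\end{aligned}
```

The positive and negative deviations cancel exactly, so their sum, and therefore their average, is zero for any set of scores.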

There are two ways to eliminate the problem of positive and negative deviations cancelling out:

Take the absolute value of each deviation (ignore the sign); this leads to the mean absolute deviation (MAD).

Square each deviation, since the square of a negative number is positive.

MAD: Mean absolute deviation

$$\mathrm{MAD} = \frac{\sum \lvert X_i - \bar{X} \rvert}{N}$$

Not convenient for statistical manipulation

VARIANCE

We start by finding how each observed value differs from the mean:

$$(X_i - \bar{X})$$

To get rid of the negative deviations, we square each of these values:

$$(X_i - \bar{X})^2$$

Then we sum the squared deviations (often called the “sum of squares”):

$$\sum_{i=1}^{N} (X_i - \bar{X})^2$$

Finally, we calculate the average.

VARIANCE: FINAL EQUATIONS

Population variance:

$$\sigma_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

Sample variance:

$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

STANDARD DEVIATION - SD 标准差

Because we squared the deviations while calculating the variance, we have altered the original scale. This makes the variance difficult to interpret.

To convert this back to the original scale, we take the square root of the variance, which is called the standard deviation.

σ is the population standard deviation

s is the sample standard deviation

Think of the SD as a measure of how far, on average, our data values deviate from the mean.

STANDARD DEVIATION: FINAL EQUATIONS

Population standard deviation:

$$\sigma_X = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

Sample standard deviation:

$$s_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
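The definitional formulas for the variance and standard deviation translate almost line by line into code. This is a sketch on made-up scores (the data and function names are assumptions for illustration); the last lines cross-check the results against Python's `statistics` module.

```python
import math
import statistics

# Hypothetical scores, chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

def sum_of_squares(data):
    """Sum of squared deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def population_variance(data):
    """Divide the sum of squares by N."""
    return sum_of_squares(data) / len(data)

def sample_variance(data):
    """Divide the sum of squares by n - 1."""
    return sum_of_squares(data) / (len(data) - 1)

pop_var = population_variance(scores)
pop_sd = math.sqrt(pop_var)      # back on the original scale
samp_var = sample_variance(scores)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd, samp_var, samp_sd)

# Cross-check against the standard library.
print(statistics.pvariance(scores), statistics.variance(scores))
```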

OUR EXAMPLE…


BACK TO OUR EXAMPLE…

A loose interpretation:

Class 1 deviated, either positively or negatively, on average, 24 points from the mean.

Class 2 deviated, either positively or negatively, on average, 12 points from the mean.

In general, we can conclude that the values in class 2 tend to be more similar to one another (homogeneous) than those in class 1.

Interpretation in terms of our example: Teachers’ kids all performed very similarly, whereas those from other families were much more variable in their performance.

CHARACTERISTICS OF SD

Basically a measure of the average of the deviations of each score from the mean.

Can be used to build confidence intervals to see how many scores fall below or above the mean (more on this in Chapter 6).

COMPUTATIONAL FORMULAE

The formulae presented for both variance and standard deviations up to this point are referred to as the definitional formulae.

For hand calculation, another equation is easier to use.

Not much difference if you are using computer programs.


DON’T BE SCARED….

Definitional:

$$\sigma_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Computational:

$$\sigma_X^2 = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$$

$$s_X^2 = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$$
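A quick numerical check that the definitional and computational forms agree, on made-up scores (the data are an assumption for illustration):

```python
import math

# Hypothetical scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

# Definitional form: sum of squared deviations from the mean.
mean = sum(scores) / n
ss_definitional = sum((x - mean) ** 2 for x in scores)

# Computational form: sum of X squared minus (sum of X) squared over n.
sum_x = sum(scores)
sum_x_squared = sum(x * x for x in scores)
ss_computational = sum_x_squared - sum_x ** 2 / n

# Either sum of squares can then be divided by N or by n - 1.
pop_var = ss_computational / n
samp_var = ss_computational / (n - 1)

print(ss_definitional, ss_computational, pop_var, samp_var)
```

The computational form only needs the two running totals (the sum of the scores and the sum of their squares), which is why it is convenient for hand calculation.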

THE PERPETUAL QUESTION: “WHY DIVIDE BY n-1 FOR SAMPLE STATISTICS?”

Adjustment to produce an unbiased estimate.

1. Concrete examples in the book. Howell p Gravetter & Wallnau

P100-101

2. Algebraic proof.
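As an informal illustration of what "unbiased" means here, the sketch below enumerates every possible sample of size 2 (drawn with replacement) from a made-up three-score population and averages the sample variances. The population, the sample size, and all names are assumptions chosen for illustration; this is neither the algebraic proof nor a reproduction of a textbook example.

```python
from itertools import product

# Made-up population of three scores.
population = [0, 3, 9]
N = len(population)
mu = sum(population) / N
pop_variance = sum((x - mu) ** 2 for x in population) / N

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / divisor

n = 2
# All 9 equally likely samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:      ", pop_variance)
print("average of SS / n:        ", avg_div_n)
print("average of SS / (n - 1):  ", avg_div_n_minus_1)
```

In this small example, dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it.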

http://halfdone.files.wordpress.com/2009/02/thinker.jpg

REPRESENTING DISTRIBUTIONS WITH GRAPHICS --- BOXPLOT

A boxplot (or box-and-whisker plot) includes a measure of central tendency (the median) and a measure of dispersion (the interquartile range).

Hinges = 1st and 3rd quartiles = 25th and 75th percentiles

H-spread: the range between the two quartiles

Whisker: reaches at most 1.5 * H-spread beyond the top and bottom of the box
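The pieces of a boxplot can be computed before anything is drawn. This is a sketch on made-up scores; it assumes the common convention that each whisker ends at the most extreme score still within 1.5 * H-spread of the box, and it uses the "inclusive" quartile rule from Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS reports.

```python
import statistics

# Hypothetical scores, chosen only for illustration.
scores = [35, 40, 45, 50, 55, 60, 65, 70, 100]

# Hinges (quartiles) and median.
lower_hinge, median, upper_hinge = statistics.quantiles(
    scores, n=4, method="inclusive"
)

# H-spread: distance between the two hinges.
h_spread = upper_hinge - lower_hinge

# Fences: the farthest a whisker is allowed to reach.
lower_fence = lower_hinge - 1.5 * h_spread
upper_fence = upper_hinge + 1.5 * h_spread

# Whiskers end at the most extreme scores inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is flagged as an outlier.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(lower_hinge, median, upper_hinge, h_spread)
print(whisker_low, whisker_high, outliers)
```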

BOXPLOT

[Figure: boxplot of Score (axis from 35 to 55), with the median, quartile location, hinge, interquartile range, whisker, and outlier (*) labelled]

SPSS

At least two routes:

Graphs → Boxplot

Analyze → Descriptive statistics → Explore

KEY TERMS

Describing distribution (4): _______________, _______________, _______________, _______________

Measures of central tendency (3): ______________, _______________, _____________

Measures of variability (2): ______________, _______________

Displaying distribution (1): _______________

BREAK

Activity 1

THE NORMAL DISTRIBUTION & Z-SCORES

Summer 2015

OVERVIEW

Probability for discrete vs. continuous data

The normal distribution

Standard normal distribution

z-transformations and z-scores

Using z-scores to find probabilities

For discrete variables, we think in terms of the probability of a specific outcome.

We have a known number (10) of purple, red and white marbles. What is the probability of choosing a red marble?

FREQUENCY, AREA, AND PROBABILITY FOR DISCRETE VARIABLES

The pie chart represents the frequency distribution of red, purple and white marbles in a bag.

[Pie chart: slices of 10%, 40% and 50%]

For continuous variables, we think in terms of the probability of obtaining a value that falls within a range.

With our distribution of scores, what is the probability that somebody will have an IQ score of 92?

IQ Score Range   Frequency   Proportion   Cumulative
71-75            1           0.02         0.02
76-80            2           0.04         0.06
81-85            4           0.08         0.14
86-90            5           0.1          0.24
91-95            7           0.14         0.38
96-100           11          0.22         0.6
101-105          8           0.16         0.76
106-110          5           0.1          0.86
111-115          3           0.06         0.92
116-120          3           0.06         0.98
121-125          1           0.02         1
Total            50          1

As with the pie chart earlier, we can relate area to probability. The area of each bar corresponds to the probability of its interval.

How many potential ranges could we create?

What would this do?

AN INTERVAL OF 20 POINTS / 3 GROUPS

91-110: 31/50 = 0.62

AN INTERVAL OF 10 POINTS / 6 GROUPS

91-100: 18/50 = 0.36

WITH AN INTERVAL OF 5 POINTS

96-100: 11/50 = 0.22
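The proportions for the 20-, 10- and 5-point intervals can be recomputed from the IQ frequency table by merging 5-point bins. A sketch (the dictionary layout and the function name are illustrative assumptions; the frequencies are the ones in the table):

```python
# Frequency of each 5-point bin, keyed by the bin's lowest score.
freq = {71: 1, 76: 2, 81: 4, 86: 5, 91: 7, 96: 11,
        101: 8, 106: 5, 111: 3, 116: 3, 121: 1}
total = sum(freq.values())

def proportion_between(low, high):
    """Proportion of scores in the bins whose lowest score is in [low, high)."""
    count = sum(n for start, n in freq.items() if low <= start < high)
    return count / total

# Cumulative proportions, as in the last column of the table.
cumulative = []
running = 0
for start in sorted(freq):
    running += freq[start]
    cumulative.append(round(running / total, 2))

p_20_points = proportion_between(91, 111)  # scores 91-110
p_10_points = proportion_between(91, 101)  # scores 91-100
p_5_points = proportion_between(96, 101)   # scores 96-100

print(total, cumulative)
print(p_20_points, p_10_points, p_5_points)
```

Narrower intervals capture fewer scores, so the proportion attached to each interval shrinks as the intervals get finer.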

WITH AN INTERVAL OF 2 POINTS

94-96: 7/50 = 0.14

A CHANGE OF CONCEPT

The probability of exactly any single value is 0, because we can keep breaking the intervals down into finer and finer ones without limit, so the width of each bar shrinks toward 0.

But we want to talk about a specific value in our observations, and we want to keep using probability to interpret the score. For this we use the probability density function (PDF).

Each x value corresponds to exactly one PDF value, which plays a role similar to the frequency and is the height of the normal curve at that point.

How does this work?

PDF


PROBABILITY DENSITY FUNCTION / PDF 概率密度函数

For every X value, we can plug the value into the function and get a number f(X), which is the height of the normal curve at that X value. We call it the density. This is the y value in your z-table. The largest y value is at the center of the normal distribution, where z = 0.

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X-\mu)^2 / (2\sigma^2)}, \quad \text{where } \pi = 3.14,\ e = 2.718$$

E.g.

$$f(90) = \frac{1}{11.33\sqrt{2 \times 3.14}}\,(2.718)^{-(90-97.74)^2 / (2 \times 11.33^2)} = 0.0279$$
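The worked example for f(90) can be reproduced with a few lines of code. This sketch uses the IQ mean (97.74) and SD (11.33) from the slides; the function name is hypothetical, and Python's full-precision values of π and e are used in place of 3.14 and 2.718.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve (the density) at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 97.74, 11.33

density_at_90 = normal_pdf(90, mu, sigma)
density_at_mean = normal_pdf(mu, mu, sigma)  # the peak of the curve

print(round(density_at_90, 4))
print(round(density_at_mean, 4))
```

The density is largest at the mean and falls off symmetrically on either side, which is the bell shape of the curve.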

GRAPHING THE PDF AND RELATING IT TO AREA

[Figure: three plots titled "Graphing probability density function", showing Density (0.0000 to 0.0400) against IQ Scores (70 to 130), with x-axis intervals of 10, 5 and 2 points]

PERCENTILES

Percentile: the point below which a specified percentage of scores in the distribution fall.

Percentile rank: the percentage of scores equal to or less than the given score. Finding the percentile rank involves integration in calculus.

You don’t have to do that calculation yourself; someone has already prepared a table for us (the z table). We just need to know how to use it.

A percentile is a score; a percentile rank is a percentage.

Both can be found for discrete or continuous data.

NORMAL DISTRIBUTION 正态分布

The normal distribution is important because:

Many dependent variables are assumed to be normally distributed in the population.

The sampling distribution of the mean is normally distributed (more coming).

Many statistical models are based on an assumption of a normally distributed variable.

NORMAL DISTRIBUTION

[Figure: normal curve of Density (0.0000 to 0.0400) against IQ Scores (70 to 130)]

Bell-shaped curve

Unimodal

Symmetric: mean, median and mode are all in the center

Not skewed

Extends from -∞ to +∞

The total area under the curve is 1

NORMAL DISTRIBUTION

About 68% of the distribution lies within 1 SD of the mean, about 95% lies within 2 SD of the mean, and about 99.7% lies within 3 SD of the mean.

We can immediately make some inferences.
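These three percentages can be checked numerically. The sketch below relies on the standard identity that, for a normal distribution, the proportion within k SD of the mean equals erf(k / √2); the function name is illustrative.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

within = {k: proportion_within(k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```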

STANDARD NORMAL DISTRIBUTION 标准正态分布

The standard normal distribution is just a special case of the normal distribution, with mean = 0 and SD = 1. Any normal distribution can be transformed into the standard normal distribution.

Why bother transforming, or standardizing, a distribution?

HOW MANY TABLES DO WE NEED?

For our IQ data, the mean is 97.74 and SD = 11.33. One SD below the mean is 97.74 - 11.33 = 86.41; one SD above the mean is 97.74 + 11.33 = 109.07. The percentile rank of 84.13% corresponds to a raw score of 109.07.

For SAT scores, mean = 500 and SD = 100. One SD below the mean is 400; one SD above the mean is 600. The percentile rank of 84.13% corresponds to a raw score of 600.

……

STANDARD NORMAL DISTRIBUTION

There are some general rules we can follow to do this:

Any constant can be added to or subtracted from every value, and the mean of that variable will shift by that same constant (see the Activity 1 question).

Likewise, if we multiply each value by a constant, the resulting mean will be multiplied by the same constant.

STANDARDIZED SCORES 标准分

When we transform our variables to the z-distribution (the standard normal distribution), we are standardizing our scores.

This essentially means we put all of our values on the same scale and end up with a distribution with mean = 0 and SD = 1.

We call the process the z-transformation.

The standardized scores that come out of this process are called z-scores.

Z-SCORE TRANSFORMATION

$$z_i = \frac{X_i - \mu}{\sigma}$$

X is our original data
µ is the mean of the population
σ is the population standard deviation

The end result will be a set of standardized scores.

All scores that are below the mean will be negative and all scores above the mean will be positive.

We can interpret the value of the z-score as how many standard deviations above or below the mean a score lies.

A z-score of 1.0 is a score that is exactly 1 SD above the mean.

A z-score of -1.5 is a score that is exactly 1.5 SD below the mean.

Z-SCORE EXAMPLE

Test score: Mean = 50, Standard deviation = 10

So the z-score if you received a 60 is

$$z = \frac{X - \mu}{\sigma} = \frac{60 - 50}{10} = \frac{10}{10} = 1$$

and the z-score if you received a 45 is

$$z = \frac{X - \mu}{\sigma} = \frac{45 - 50}{10} = \frac{-5}{10} = -0.5$$
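The two z-scores, and the claim that a standardized set of scores ends up with mean 0 and SD 1, can be checked with a short sketch. The function name and the second set of raw scores are assumptions for illustration.

```python
import statistics

def z_score(x, mu, sigma):
    """How many SDs the score x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# The two scores from the example: mean = 50, SD = 10.
z_60 = z_score(60, 50, 10)
z_45 = z_score(45, 50, 10)
print(z_60, z_45)

# Standardizing a whole (hypothetical) set of scores.
raw = [30, 40, 50, 60, 70]
mu = statistics.mean(raw)
sigma = statistics.pstdev(raw)   # population SD, dividing by N
z_scores = [z_score(x, mu, sigma) for x in raw]

z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
print(z_scores, z_mean, z_sd)
```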

SO?

Now we can refer to the z-table to see what percentile rank a score value of 60 or 45 corresponds to.

A full z-score table can be found in Howell, pp. 604-607, Table E-10.

A z-score of 1 corresponds to a percentile rank of 0.8413. This means 84.13% of scores fall at or below a z-score of 1, or the raw score of 60.

A z-score of -.5 corresponds to a percentile rank of 0.3085. This means 30.85% of scores fall at or below a z-score of -.5, or a raw score of 45.
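For readers without the printed table at hand, the same lookup can be done in software. This sketch assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; its `cdf` method returns the proportion of the distribution at or below a value, which plays the role of the z-table here. The function name is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def percentile_rank(x, mu, sigma):
    """Proportion of scores at or below x, assuming a normal distribution."""
    z = (x - mu) / sigma
    return standard_normal.cdf(z)

# The example from the slides: mean = 50, SD = 10.
rank_60 = percentile_rank(60, 50, 10)
rank_45 = percentile_rank(45, 50, 10)

print(round(rank_60, 4), round(rank_45, 4))
```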

FINDING THE PERCENTILE RANK OF A RAW SCORE

Step 1: Change the raw scores to z-scores using

$$z = \frac{X - \mu}{\sigma}$$

Step 2: Look in the z-table to find the percentile rank.

Example: A population has a mean of 400 and a population SD of 100. What are the percentile ranks corresponding to the following raw scores? What do they mean?

1) A score of 500
2) A score of 300
3) A score of 275

LET ME JUST BE REDUNDANT…

Percentile rank refers to the percentage of scores at or below the score of interest.

There are no negative z values in the table.

If the z value you calculated is positive, look for the number under the larger portion column.

If the z value is negative, look for the number under the smaller portion column.

FINDING THE RAW SCORE FROM A PERCENTILE RANK

Step 1: Using the z-table, find the corresponding z-score.

Step 2: Transform the z-score back to the raw score using

$$X = Z \cdot \sigma + \mu$$

Example: We know a distribution has a mean of 400 and an SD of 100. What raw score corresponds to the

1) 95th percentile?
2) 50th percentile?
3) 33rd percentile?
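The two steps can be sketched in code, with `NormalDist.inv_cdf` (Python 3.8+) standing in for reading the z-table backwards. The function name is hypothetical, and results may differ slightly from hand calculation because the printed table rounds z to two decimals.

```python
from statistics import NormalDist

def raw_score(percentile_rank, mu, sigma):
    """Raw score at a given percentile rank (a proportion between 0 and 1)."""
    z = NormalDist().inv_cdf(percentile_rank)  # Step 1: rank -> z-score
    return z * sigma + mu                      # Step 2: z-score -> raw score

# The example from the slides: mean = 400, SD = 100.
x_95 = raw_score(0.95, 400, 100)
x_50 = raw_score(0.50, 400, 100)
x_33 = raw_score(0.33, 400, 100)

print(x_95, x_50, x_33)
```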

WHAT ELSE?

With a population mean of 400 and a population SD of 100, we can also answer more complex questions like

1) What percent of scores are between 300 and 540?
2) What percent of scores are between 475 and 605?

Step 1: Transform the raw scores into z-scores. 300: z = -1, 540: z = 1.4, 475: z = 0.75, 605: z = 2.05

Step 2: Find the proportion corresponding to each z-score.

Step 3: Combine the proportions, by either addition or subtraction, to get the proportion between the two scores.

2) For a z-score of -1, this is the mean to z area:

For a z-score of 1.4, this is the mean to z area:

We can add the mean to z areas to calculate the percentage of scores falling in the range:
p(-1 < z < 1.4) = p(-1 < z < 0) + p(0 < z < 1.4)

3) We can subtract the two areas as necessary.
p(0.75 < z < 2.05) = p(0 < z < 2.05) - p(0 < z < 0.75)
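Both questions can also be answered by subtracting cumulative proportions, which is equivalent to adding or subtracting mean-to-z areas. A sketch using `statistics.NormalDist` (Python 3.8+); the function name is hypothetical, and the results may differ in the fourth decimal from values read off a printed z-table.

```python
from statistics import NormalDist

def proportion_between(low, high, mu, sigma):
    """Proportion of a normal distribution falling between low and high."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(high) - dist.cdf(low)

# Population mean = 400, population SD = 100, as on the slides.
p_300_540 = proportion_between(300, 540, 400, 100)  # z from -1 to 1.4
p_475_605 = proportion_between(475, 605, 400, 100)  # z from 0.75 to 2.05

print(round(p_300_540, 4), round(p_475_605, 4))
```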

HOW ELSE COULD WE USE THIS?

Given our conversation about probability in the last class:

We might want to describe how unusual a particular score might be in the population.

Used for hypothesis testing.

Activity.

SUMMARY

The PDF is introduced to get to probability for continuous variables.

How to transform any score within a distribution into a z-score (that is, standardize the raw score)?

How to find the percentile rank of a z-score? It is the proportion of scores that fall at or below the z-score of interest.

How to find the raw score that corresponds to a certain percentile rank?

How to find the percentage of scores that fall between any two raw scores?