MEASURES OF CENTRAL TENDENCY & VARIABILITY + NORMAL DISTRIBUTION
Day 3, Summer 2015
7/28/2015
Fang Chen (陈芳), English Department, East China Normal University (ECNU, 华东师大英语系)
DISTRIBUTION
Symmetry
Modality: unimodal or bimodal (单峰, 双峰)
Skewness: positive or negative (正偏或负偏)
Kurtosis
CHAPTER 4: Measures of Central Tendency 集中趋势
One major purpose of statistical procedures is to summarize raw data in a meaningful way so that we can draw conclusions.
e.g. You wonder how the students in your colleague’s class did on the final exam this year. There is one number you REALLY want to know: ___________
Statistics that describe central tendency are numerical values that describe the center of a distribution of scores for a variable.
CENTRAL TENDENCY
Three common measures of central tendency:
• Mode 众数
• Median 中数
• Mean 平均数
FINDING THE MODE 众数
Create a frequency distribution for a set of values and find the value that occurs most frequently.
X = {1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7}

Value   N    %
1       2    11.8
2       1    5.9
3       1    5.9
4       2    11.8
5       3    17.6
6       4    23.5
7       2    11.8
8       1    5.9
9       1    5.9
Total   17   100
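The frequency-table procedure can be sketched in Python. This is only an illustration using the 17 scores from the slide; the variable names are our own.

```python
from collections import Counter

# The 17 scores listed on the slide
scores = [1, 7, 6, 4, 6, 3, 8, 4, 6, 5, 2, 5, 6, 9, 1, 5, 7]

freq = Counter(scores)   # maps each value to how often it occurs
n = len(scores)

# Frequency table rows: (value, count, percent of all scores)
table = [(value, count, round(100 * count / n, 1))
         for value, count in sorted(freq.items())]

# The mode is the value with the largest count
mode_value, mode_count = freq.most_common(1)[0]

for row in table:
    print(*row)
print("mode:", mode_value, "occurs", mode_count, "times")
```

If two values tied for the highest count, `most_common(1)` would report only one of them, so a bimodal data set needs an explicit check of the counts.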
FINDING THE MEDIAN 中数
Odd number of scores (N = 15): X = {9, 8, 7, 7, 6, 6, 6, 5, 5, 4, 4, 3, 2, 1, 1}
Even number of scores (N = 16): X = {9, 8, 7, 7, 6, 6, 6, 6, 5, 5, 4, 4, 3, 2, 1, 1}
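A minimal sketch of finding the median for the two score sets on this slide, one with an odd and one with an even number of scores. The helper name `median_of` is our own; the standard library's `statistics.median` does the same job.

```python
import statistics

odd_set = [9, 8, 7, 7, 6, 6, 6, 5, 5, 4, 4, 3, 2, 1, 1]       # N = 15
even_set = [9, 8, 7, 7, 6, 6, 6, 6, 5, 5, 4, 4, 3, 2, 1, 1]   # N = 16

def median_of(values):
    """Middle score of the ordered data; average the two middle scores if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of(odd_set))    # middle (8th) score of 15
print(median_of(even_set))   # average of the 8th and 9th scores of 16
```

Note that with an even N the median (here 5.5) need not be a value that was actually observed.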
THE MEAN / AVERAGE 平均数
• The most common measure of central tendency.
• Defined as the average of all the observed scores.
• Usually has to be calculated.
The statistical notation for the mean is:
• $\mu$ when we are dealing with populations
• $\bar{X}$ when we are dealing with samples
We calculate the mean with:
$$\bar{X} = \frac{\sum X}{N}$$
COMPARE AND CONTRAST
The more symmetric a distribution is, the closer these three measures of central tendency will be.
If a distribution is truly normal (symmetric and unimodal), then the mean, median, and mode will be exactly the same.
Unfortunately, this rarely happens. We must choose the measure that best suits our purposes and data.
SOME ADVANTAGES AND DISADVANTAGES - MODE
Advantages:
• Any randomly selected observation, $X_i$, is more likely to be the mode than any other score.
• It is the only measure of central tendency that can be used with nominal data.
• It is not affected by extreme scores.
Disadvantages:
• Depends on the sample of data and may not be representative of the population.
• Can depend on the way the data are grouped.
• Cannot be defined by a simple mathematical equation.
ADVANTAGES AND DISADVANTAGES - MEDIAN
Advantages:
• It is unaffected by extreme scores (outliers) …
Disadvantages:
• Depends on the sample of data and is not easily generalized to the greater population.
• Does not enter statistical equations readily and is therefore more difficult to work with than the mean.
• May not be an actual value observed in the data.
ADVANTAGES AND DISADVANTAGES - MEAN
Advantages:
• The mean can be defined mathematically with a simple equation and can easily be manipulated algebraically.
• The sample mean is a more stable estimate of the population's central tendency than the sample median or mode.
Disadvantages:
• Influenced by extreme values (very sensitive to outliers).
• The sample mean may not be an actual value observed in the data.
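To make the sensitivity to outliers concrete, here is a small sketch with made-up scores (hypothetical data, not from the course). It compares the three measures before and after one extreme score is added.

```python
import statistics

# Hypothetical exam scores (not from the slides)
scores = [4, 5, 5, 6, 7]
with_outlier = scores + [60]   # the same scores plus one extreme value

def summarize(values):
    """Return (mean, median, mode) for a list of scores."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

print("without outlier:", summarize(scores))
print("with outlier:   ", summarize(with_outlier))
```

In this example the mean jumps from 5.4 to 14.5, while the median only moves from 5 to 5.5 and the mode stays at 5.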
CHAPTER 5: Measures of Variability 分散趋势/变异性
VARIABILITY / DISPERSION 变异性
Variability is defined as how the data are distributed around a measure of central tendency (e.g., the mean).
Measures of variability describe the way and degree to which the data is spread
Measures of variability quantify how similar the scores in a sample are to one another.
CONSIDER THE FOLLOWING:
Two classes were assigned to the same teacher. In the first class, all the kids come from families where at least one parent is a teacher/professor. In the second class, the kids come from a variety of family backgrounds.
How similar do you expect the pretest scores within the two groups to be?
THE RESULTING DATA…
THE DATA FROM A GRAPHICAL PERSPECTIVE
Sample 1: more variability
Sample 2: less variability
MEASURES OF VARIABILITY
• The range 全距
• The interquartile range 四分位距
• Deviation 离差
• Average deviation
• Mean absolute deviation
• Variance 方差
• Standard deviation 标准差
RANGE 全距
The distance between the lowest and highest value.
Data from the previous example:
The range can be heavily influenced by extreme scores.
THE INTERQUARTILE RANGE 四分位距
The interquartile range is the range of the middle 50% of the observations.
A trimmed statistic: how much from the lower end and the upper end respectively?
Calculated by taking the difference between the 75th percentile and 25th percentile
Percentile: the percentage of observations that are below a particular score value.
FINDING THE INTERQUARTILE RANGE
Using the data from our example:
Sample 1: P25 = 37 and P75 = 77, for an interquartile range of 40 score points
Sample 2: P25 = 68 and P75 = 93, for an interquartile range of 25 score points
The interquartile range has the opposite problem from the range: it discards too much of the data.
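The contrast between the range and the interquartile range can be sketched as follows. The scores are hypothetical (the class data from the example are not reproduced in this transcript), and the quartiles use Python's "inclusive" method, which is only one of several percentile conventions; a textbook or SPSS may give slightly different quartiles.

```python
import statistics

def value_range(values):
    """Distance between the lowest and highest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """P75 minus P25, using the 'inclusive' quartile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # hypothetical scores
with_outlier = scores + [500]                    # one extreme score added

print(value_range(scores), interquartile_range(scores))
print(value_range(with_outlier), interquartile_range(with_outlier))
```

One extreme score stretches the range from 80 to 490, while the interquartile range only moves from 40 to 45.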
DEVIATION 离差
The difference between every data point and the mean
• The average deviation
• The mean absolute deviation (MAD)
• Variance
• Standard deviation
AVERAGE DEVIATION
We could find $d_i = (X_i - \bar{X})$ for each observed value.
Then use
$$\text{mean}(d) = \frac{\sum_{i=1}^{N} d_i}{N}$$
to look at, on average, how far the observations are from the mean.
While the logic is sound, the average deviation for any sample will always be equal to zero. Why?
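One way to see why, written out as a short derivation that uses only the definition of the mean, $\bar{X} = \sum X_i / N$:

$$\sum_{i=1}^{N} d_i = \sum_{i=1}^{N} (X_i - \bar{X}) = \sum_{i=1}^{N} X_i - N\bar{X} = N\bar{X} - N\bar{X} = 0$$

The positive and negative deviations cancel exactly, so dividing their sum by $N$ also gives zero.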
There are two ways to eliminate the problem of positive and negative deviations cancelling out:
• Take the absolute value of each deviation (ignore the sign), which gives the MAD.
• Square each deviation, since the square of a negative number is positive.
MAD: Mean absolute deviation
$$\mathrm{MAD} = \frac{\sum |X_i - \bar{X}|}{N}$$
Not convenient for statistical manipulation.
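A minimal sketch of the mean absolute deviation next to the plain average deviation. The eight scores are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores, mean = 5

n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]

# Signed deviations cancel, so their average is 0
average_deviation = sum(deviations) / n

# Ignoring the sign gives the mean absolute deviation
mad = sum(abs(d) for d in deviations) / n

print("average deviation:", average_deviation)
print("MAD:", mad)
```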
VARIANCE
We start by finding how each observed value differs from the mean: $(X_i - \bar{X})$
To get rid of the negative deviations, we square each of these values: $(X_i - \bar{X})^2$
Then we sum the squared deviations (often called the “sum of squares”): $\sum_{i=1}^{N} (X_i - \bar{X})^2$
Finally, we calculate the average.
VARIANCE: FINAL EQUATIONS
Population variance:
$$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
Sample variance:
$$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
STANDARD DEVIATION - SD 标准差
Because we squared the deviations while calculating the variance, we have altered the original scale. This makes the variance difficult to interpret.
To convert this back to the original scale, we take the square root, called the standard deviation.
• σ is the population standard deviation
• s is the sample standard deviation
Think of SD as a measure of how far our data values deviate from the mean, on average.
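STANDARD DEVIATION: FINAL EQUATIONS
Population standard deviation:
$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$
Sample standard deviation:
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$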
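The variance and standard deviation formulas can be traced step by step in Python. This is a sketch on hypothetical scores; the last lines cross-check the hand computation against the standard library's `statistics` functions.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores

n = len(data)
mean = sum(data) / n

# Sum of squares: squared deviations from the mean, added up
sum_of_squares = sum((x - mean) ** 2 for x in data)

pop_var = sum_of_squares / n          # divide by N for a population
samp_var = sum_of_squares / (n - 1)   # divide by n - 1 for a sample

pop_sd = math.sqrt(pop_var)           # back on the original score scale
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd)
print(samp_var, samp_sd)

# Cross-check with the standard library
print(statistics.pvariance(data), statistics.pstdev(data))
print(statistics.variance(data), statistics.stdev(data))
```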
OUR EXAMPLE…
BACK TO OUR EXAMPLE…
A loose interpretation:
• Class 1 deviated, either positively or negatively, on average, 24 points from the mean.
• Class 2 deviated, either positively or negatively, on average, 12 points from the mean.
In general, we can conclude that the values in class 2 tend to be more similar to one another (homogeneous) than those in class 1.
Interpretation in terms of our example: Teachers’ kids all performed very similarly, whereas those from other families were much more variable in their performance.
CHARACTERISTICS OF SD
• Basically a measure of the average of the deviations of each score from the mean.
• Can be used to build confidence intervals to see how many scores fall below or above the mean (more on this in Chapter 6).
COMPUTATIONAL FORMULAE
The formulae presented for both variance and standard deviation up to this point are referred to as the definitional formulae.
For hand calculation, another equation is easier to use.
It makes little difference if you are using a computer program.
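DON’T BE SCARED….
Definitional:
$$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \qquad s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Computational:
$$\sigma_x^2 = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \qquad s_x^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$$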
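A quick numerical check that the definitional and computational forms agree, sketched on hypothetical scores. The computational form needs only the sum of the scores and the sum of the squared scores, which is why it is easier by hand.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores
n = len(data)
mean = sum(data) / n

# Definitional: sum of squared deviations from the mean
definitional_ss = sum((x - mean) ** 2 for x in data)

# Computational: sum of X squared minus (sum of X) squared over N
sum_x = sum(data)
sum_x_squared = sum(x ** 2 for x in data)
computational_ss = sum_x_squared - sum_x ** 2 / n

print(definitional_ss, computational_ss)
print("population variance:", computational_ss / n)
print("sample variance:", computational_ss / (n - 1))
```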
THE PERPETUAL QUESTION: “WHY DIVIDE BY n-1 FOR SAMPLE STATISTICS”?
Adjustment to produce an unbiased estimate.
1. Concrete examples in the book: Howell p; Gravetter & Wallnau pp. 100-101
2. Algebraic proof.
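A small exhaustive check can make the "unbiased" claim tangible. The sketch below uses a hypothetical three-score population (not taken from either textbook), lists every possible sample of size 2 drawn with replacement, and averages the sample variances computed with each divisor.

```python
from itertools import product

population = [0, 3, 9]   # hypothetical population of three scores
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # true population variance

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

n = 2
# All 9 equally likely samples of size 2, drawn with replacement
samples = list(product(population, repeat=n))

mean_dividing_by_n = sum(sample_variance(s, n) for s in samples) / len(samples)
mean_dividing_by_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)

print("population variance:     ", sigma2)
print("average when dividing n:  ", mean_dividing_by_n)
print("average when dividing n-1:", mean_dividing_by_n_minus_1)
```

Across all possible samples, dividing by n underestimates the population variance on average (7 versus 14 here), while dividing by n - 1 recovers it exactly.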
REPRESENTING DISTRIBUTIONS WITH GRAPHICS: BOXPLOT
A boxplot (or box-and-whisker plot) includes a measure of central tendency (the median) and a measure of dispersion (the interquartile range).
• Hinges = 1st and 3rd quartiles = 25th and 75th percentiles
• H-spread: the range between the two quartiles
• Whisker: extends up to 1.5 × H-spread from the top and bottom of the box
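The boxplot quantities can be computed directly, as in this sketch on hypothetical scores. The quartiles here come from Python's "inclusive" method, which is an assumption: Tukey's hinges and SPSS's percentiles can differ slightly from it on some data sets.

```python
import statistics

scores = [35, 40, 42, 44, 45, 46, 48, 50, 70]   # hypothetical scores

# Hinges (quartiles) and the median
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
h_spread = q3 - q1

# Whiskers may reach at most 1.5 * H-spread beyond the box
lower_fence = q1 - 1.5 * h_spread
upper_fence = q3 + 1.5 * h_spread

inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)   # whiskers end at real data points
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("median:", median, "hinges:", q1, q3, "H-spread:", h_spread)
print("whiskers:", whisker_low, whisker_high, "outliers:", outliers)
```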
BOXPLOT
[Figure: boxplot of Score (axis from 35 to 55), with labels for the median, quartile location, hinge, interquartile range, whisker, and an outlier marked *]
SPSS
At least two routes:
• Graphs → Boxplot
• Analyze → Descriptive statistics → Explore
KEY TERMS
Describing distribution (4): _______________, _______________, _______________, _______________
Measures of central tendency (3): ______________, _______________, _____________
Measures of variability (2): ______________, _______________
Displaying distribution (1): _______________
BREAK
Activity 1
THE NORMAL DISTRIBUTION & Z-SCORES
Summer 2015
OVERVIEW
• Probability for discrete vs. continuous data
• The normal distribution
• Standard normal distribution
• z-transformations and z-scores
• Using z-scores to find probabilities
Think of discrete variables with the notion of a probability of a specific outcome.
We have a known number (10) of purple, red & white marbles. What is the probability of choosing a red marble?
FREQUENCY, AREA, AND PROBABILITY FOR DISCRETE VARIABLES
The pie chart to the left represents the frequency distribution of red, purple and white marbles in a bag.
[Pie chart: slices of 10%, 40%, and 50%]
We think of continuous variables with the idea of a probability of obtaining a value that falls within a range.
With our distribution of scores, what is the probability that somebody will have an IQ score of 92?
IQ Score Range   Frequency   Proportion   Cumulative
71-75            1           0.02         0.02
76-80            2           0.04         0.06
81-85            4           0.08         0.14
86-90            5           0.10         0.24
91-95            7           0.14         0.38
96-100           11          0.22         0.60
101-105          8           0.16         0.76
106-110          5           0.10         0.86
111-115          3           0.06         0.92
116-120          3           0.06         0.98
121-125          1           0.02         1.00
Total            50          1.00
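The proportion and cumulative columns follow mechanically from the frequencies, as this sketch shows. The frequencies are the ones in the table; the variable names are our own.

```python
from itertools import accumulate

ranges = ["71-75", "76-80", "81-85", "86-90", "91-95", "96-100",
          "101-105", "106-110", "111-115", "116-120", "121-125"]
freqs = [1, 2, 4, 5, 7, 11, 8, 5, 3, 3, 1]

total = sum(freqs)

# Proportion = frequency / total; cumulative = running sum of proportions
proportions = [f / total for f in freqs]
cumulative = [c / total for c in accumulate(freqs)]

for r, f, p, c in zip(ranges, freqs, proportions, cumulative):
    print(f"{r:8} {f:3} {p:5.2f} {c:5.2f}")

# Probability of a score in a wider interval: add up the bars it covers
p_91_to_110 = sum(freqs[4:8]) / total
print("P(91-110) =", round(p_91_to_110, 2))
```

Adding the bars for 91-95 through 106-110 gives 31/50 = 0.62, which is the area of the single wide bar when the scores are regrouped into 20-point intervals.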
As with the pie chart earlier, we can relate area to probability. The area of each bar corresponds to the probability of its interval.
How many potential ranges could we create?
What would this do?
AN INTERVAL OF 20 POINTS / 3 GROUPS
91-110: 31/50 = 0.62
AN INTERVAL OF 10 POINTS / 6 GROUPS
91-100: 18/50 = 0.36
WITH AN INTERVAL OF 5 POINTS
96-100: 11/50 = 0.22
WITH AN INTERVAL OF 2 POINTS
94-96: 7/50 = 0.14
A CHANGE OF CONCEPT
The probability of exactly any single value is 0, because we can break the intervals down into finer and finer ones without limit, so the width of each bar shrinks toward 0.
But we want to talk about a specific value in our observations, and we want to use probability to interpret that score. For this we use the probability density function (PDF).
Each x value corresponds to exactly one PDF value, which plays a role similar to frequency and is the height of the normal curve at x.
How does this work?
PROBABILITY DENSITY FUNCTION / PDF 概率密度函数
For every X value, we can plug the value into the function and get an f(X) number, which is the height of the normal curve at that X value. We call it the density. This is the y value in your z-table. The largest y value is at the center of the normal distribution, where z = 0.
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X-\mu)^2 / 2\sigma^2}, \quad \text{where } \pi \approx 3.14 \text{ and } e \approx 2.718$$
E.g.
$$f(90) = \frac{1}{11.33\sqrt{2 \times 3.14}}\,(2.718)^{-(90-97.74)^2 / (2 \times 11.33^2)} = 0.0279$$
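The density calculation can be reproduced in a few lines. This sketch uses the IQ mean (97.74) and SD (11.33) from the worked example; the function name `normal_pdf` is our own, and `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 97.74, 11.33   # IQ example from the slide

density_at_90 = normal_pdf(90, mu, sigma)
density_at_mean = normal_pdf(mu, mu, sigma)   # the peak of the curve, where z = 0

print(round(density_at_90, 4))
print(round(density_at_mean, 4))

# Cross-check against the standard library
print(round(NormalDist(mu, sigma).pdf(90), 4))
```

The density is a curve height, not a probability: probabilities for continuous variables come from areas under this curve.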
GRAPHING THE PDF AND RELATING IT TO AREA
[Figure: three panels titled "Graphing probability density function", each plotting Density (0.0000 to 0.0400) against IQ Scores (70 to 130), with x-axis ticks every 10, 5, and 2 points respectively]
PERCENTILES
Percentile: the point below which a specified percentage of scores in the distribution fall.
Percentile rank: the percentage of scores equal to or less than the given score. Getting the percentile rank involves integration in calculus.
You don’t have to do that calculation: someone has already prepared a table for us (the z-table). We just need to know how to use it.
A percentile is a score; a percentile rank is a percentage.
Both can be found for discrete or continuous data.
NORMAL DISTRIBUTION 正态分布
Normal distribution is important because:
• Many dependent variables are assumed to be normally distributed in the population.
• The sampling distribution of the mean is normally distributed (more coming).
• Many statistical models are based on an assumption of a normally distributed variable.
NORMAL DISTRIBUTION
[Figure: "Graphing probability density function", Density (0.0000 to 0.0400) against IQ Scores (70 to 130)]
• Bell-shaped curve
• Unimodal
• Symmetric: mean, median and mode are all in the center
• Not skewed
• Extends from -∞ to +∞
• The total area under the curve is 1
NORMAL DISTRIBUTION
About 68% of the distribution lies within 1 SD of the mean, about 95% lies within 2 SD of the mean, and about 99.7% lies within 3 SD of the mean.
We can immediately make some inferences.
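These percentages can be checked numerically with the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the helper name `area_within` is our own.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1

def area_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")
```

The exact figure for 2 SD is a little above 95% (about 95.45%), which is why the rule is stated as an approximation.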
STANDARD NORMAL DISTRIBUTION 标准正态分布
The standard normal distribution is just a special case of the normal distribution, with mean = 0 and SD = 1. Any normal distribution can be transformed into a standard normal distribution.
Why bother transforming, or standardizing, a distribution?
HOW MANY TABLES DO WE NEED?
For our IQ data, the mean is 97.74 and SD = 11.33, so one SD below the mean is 97.74 - 11.33 = 86.41 and one SD above the mean is 97.74 + 11.33 = 109.07. The percentile rank of 84.13% corresponds to a raw score of 109.07.
For SAT scores, mean = 500 and SD = 100, so one SD below the mean is 400 and one SD above the mean is 600. The percentile rank of 84.13% corresponds to a raw score of 600.
……
STANDARD NORMAL DISTRIBUTION
There are some general rules we can follow to do this:
• Any constant can be added to or subtracted from every value, and the result will shift the mean of that variable by that same constant (see the Activity 1 question).
• Likewise, if we multiply each value by a constant, the resulting mean will be multiplied by the same constant.
STANDARDIZED SCORES 标准分
When we transform our variables to the z-distribution (the standard normal distribution), we are standardizing our scores.
This essentially means we put all of our values on the same scale and end up with a distribution of mean=0 and SD=1.
We call the process the z-transformation. The standardized scores that come out of this process are called z-scores.
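A short sketch of the z-transformation on hypothetical raw scores: subtract the mean from every score, then divide by the standard deviation. The data are made up, and the population formulas (`pstdev`) are used on the assumption that the scores are treated as a whole population.

```python
import statistics

raw = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical raw scores

mu = statistics.fmean(raw)        # mean of the raw scores
sigma = statistics.pstdev(raw)    # population SD of the raw scores

# Subtracting mu shifts the mean to 0; dividing by sigma rescales the SD to 1
z_scores = [(x - mu) / sigma for x in raw]

print(z_scores)
print("mean of z:", statistics.fmean(z_scores))
print("SD of z:  ", statistics.pstdev(z_scores))
```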
Z-SCORE TRANSFORMATION
$$z_i = \frac{X_i - \mu}{\sigma}$$
• X is our original data
• µ is the mean of the population
• σ is the population standard deviation
The end result will be a set of standardized scores.
• All scores below the mean will be negative and all scores above the mean will be positive.
• We can interpret the value of the z-score as how many standard deviations above or below the mean a score is.
• A z-score of 1.0 is a score that is exactly 1 SD above the mean.
• A z-score of -1.5 is a score that is exactly 1.5 SD below the mean.
Z-SCORE EXAMPLE
Test score: Mean = 50, Standard deviation = 10
So the z-score if you received a 60 is
$$z = \frac{X - \mu}{\sigma} = \frac{60 - 50}{10} = \frac{10}{10} = 1$$
and the z-score if you received a 45 is
$$z = \frac{X - \mu}{\sigma} = \frac{45 - 50}{10} = \frac{-5}{10} = -0.5$$
SO?
Now we can refer to the z-table to see what percentile rank a score of 60 or 45 corresponds to.
A full z-score table can be found in Howell p604-607 Table E-10.
A z-score of 1 corresponds to a percentile rank of 0.8413. This means 84.13% of scores fall at or below a z-score of 1, or a raw score of 60.
A z-score of -.5 corresponds to a percentile rank of 0.3085. This means 30.85% of scores fall at or below a z-score of -.5, or a raw score of 45.
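The table lookup can be mimicked in code, which is handy for checking hand work. This sketch uses the test-score example (mean 50, SD 10) and the cumulative distribution function of Python's `statistics.NormalDist` in place of the printed z-table; the function name `percentile_rank` is our own.

```python
from statistics import NormalDist

def percentile_rank(raw_score, mu, sigma):
    """Proportion of scores at or below raw_score in a normal distribution."""
    z = (raw_score - mu) / sigma   # step 1: standardize
    return NormalDist().cdf(z)     # step 2: area at or below z

mu, sigma = 50, 10   # test-score example from the slide

print(round(percentile_rank(60, mu, sigma), 4))   # z = 1
print(round(percentile_rank(45, mu, sigma), 4))   # z = -0.5
```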
FINDING THE PERCENTILE RANK OF A RAW SCORE
Step 1: Change the raw scores to z-scores using
$$z = \frac{X - \mu}{\sigma}$$
Step 2: Look in the z-table to find the percentile rank.
Example: A population mean of 400, with a population SD of 100. What are the percentile ranks corresponding to the following raw scores? What do they mean?
1) A score of 500
2) A score of 300
3) A score of 275
LET ME JUST BE REDUNDANT…
Percentile rank refers to the percentage of scores at or below the score of interest.
There are no negative z values in the table.
• If the z value you calculated is positive, look for the number under the larger portion column.
• If the z value is negative, look up its absolute value and read the number under the smaller portion column.
FINDING THE RAW SCORE FROM A PERCENTILE RANK
Step 1: Using the z-table, find the corresponding z-score.
Step 2: Transform the z-score back to the raw score using
$$X = Z\sigma + \mu$$
Example: We know a distribution has a mean of 400 and an SD of 100. What raw score corresponds to the
1) 95th percentile?
2) 50th percentile?
3) 33rd percentile?
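The two steps can be sketched in code for the mean 400, SD 100 example. Here `NormalDist().inv_cdf` stands in for reading the z-table backwards; the function name `raw_score_at` is our own. Values read from a printed table may differ slightly because the table is rounded.

```python
from statistics import NormalDist

def raw_score_at(percentile_rank, mu, sigma):
    """Raw score at or below which the given proportion of scores fall."""
    z = NormalDist().inv_cdf(percentile_rank)   # step 1: proportion -> z-score
    return z * sigma + mu                       # step 2: X = Z * sigma + mu

mu, sigma = 400, 100

for rank in (0.95, 0.50, 0.33):
    print(f"{rank:.2f} -> {raw_score_at(rank, mu, sigma):.1f}")
```

The 50th percentile comes back as the mean itself, which is a quick sanity check on the direction of the transformation.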
WHAT ELSE?
A population mean of 400, with a population SD of 100.
We can also answer more complex questions like
1) What percent of scores are between 300 and 540?
2) What percent of scores are between 475 and 605?
Step 1: Transform the raw scores into z-scores. 300: z = -1, 540: z = 1.4, 475: z = 0.75, 605: z = 2.05
Step 2: Find the proportions corresponding to the z-scores.
Step 3: Combine the proportions, by addition or subtraction, to get the area between the two scores.
1) For a z-score of -1, this is the mean to z area:
For a z-score of 1.4, this is the mean to z area:
We can add the mean to z areas to calculate the percentage of scores falling in the range:
p(-1 < z < 1.4) = p(-1 < z < 0) + p(0 < z < 1.4)
2) We can subtract the two areas as necessary.
p(0.75 < z < 2.05) = p(0 < z < 2.05) - p(0 < z < 0.75)
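Both range questions can be answered with the same idea in code: the area between two scores is the area at or below the upper score minus the area at or below the lower score. This sketch assumes the mean 400, SD 100 population from the example; the function name `proportion_between` is our own.

```python
from statistics import NormalDist

def proportion_between(low, high, mu, sigma):
    """Proportion of a normal distribution falling between two raw scores."""
    z_low = (low - mu) / sigma
    z_high = (high - mu) / sigma
    standard_normal = NormalDist()
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

mu, sigma = 400, 100

print(round(proportion_between(300, 540, mu, sigma), 4))   # z from -1 to 1.4
print(round(proportion_between(475, 605, mu, sigma), 4))   # z from 0.75 to 2.05
```

Subtracting cumulative areas gives the same result as adding or subtracting mean-to-z areas from the table, whether the two z-scores fall on opposite sides of the mean or on the same side.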
HOW ELSE COULD WE USE THIS?
Given our conversation about probability in the last class:
• We might want to describe how unusual a particular score might be in the population.
• Used for hypothesis testing.
Activity.
SUMMARY
The PDF is introduced to get to probability for continuous variables.
How do we transform any score within a distribution into a z-score (that is, standardize the raw scores)?
How do we find the percentile rank of a z-score? It is the proportion of scores that fall at or below the z-score of interest.
How do we find the raw score that corresponds to a certain percentile rank?
How do we find the percentage of scores that fall between any two raw scores?