new chapter six - university of maine systemmedia.usm.maine.edu/~jbeaudry/research literacy/ch...
CHAPTER SIX Descriptive Statistics and Data Displays
Chapter Objectives:
• Understand what descriptive statistics do
• Understand counts and percentages, and how they are displayed
• Understand measures of central tendency and how they are displayed
• Understand correlations and how they are displayed
• Understand frequency distributions and how they are displayed
• Understand measures of variability and how they are displayed
• Understand range, spread, variance, and standard deviation and how they are
displayed
• Understand what the normal curve represents and how it is used
_____________________________________________________________________
When one thinks of quantification, numbers come to mind; but quantification
is about more than numbers. It is about how numerical data are transformed into
descriptive statistics and how these calculations are presented to the reader.
• Descriptive statistics are the calculations that report on the status of
variables at a given moment.
At their most elementary level, descriptive statistics report data as counts,
percentages, measures of central tendency like means or averages, and
correlations. At more complex levels, they include frequency distributions and
measures of variability. Test reports and surveys often depend on descriptive
statistics to communicate results.
While all data can be presented as statistics in tables that organize data
into columns and rows, visual displays like charts, graphs and data plots are also
effective in communicating results; they have become an essential tool in
quantitative research. This chapter explains how data are described statistically
and how they are displayed visually.
Counts and Percentages
Counts and percentages are the simplest numerical calculations that describe
quantitative data.
• Counts describe “how many” and are usually represented by the letter n.
• Percentages are calculations that show the fraction of a whole, where
the whole is represented as 100.
Counts and percentages are usually represented visually as pie charts, bar graphs,
and line graphs.
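As a minimal sketch of how counts and percentages are computed, the Python snippet below tallies a small list of category labels. The ten responses and the function name `counts_and_percentages` are hypothetical and are used only for illustration; they are not the 2012 SAT data.

```python
from collections import Counter

# Hypothetical responses: the type of high school attended by 10 students.
school_types = [
    "public", "public", "public", "public", "public", "public", "public",
    "independent", "religious", "religious",
]

def counts_and_percentages(values):
    """Return {category: (count, percentage)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)  # n is the total count, the "whole" that stands for 100%
    return {category: (count, 100 * count / n) for category, count in counts.items()}

summary = counts_and_percentages(school_types)
for category, (count, percent) in summary.items():
    print(f"{category:12s} n = {count:2d}  {percent:5.1f}%")
```

The same counts could feed a pie chart or a bar graph; only the display changes, not the calculation.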
Pie Charts
Pie charts compare how many or what percentage of subjects falls within a
particular category on a single variable. They organize data into circular charts to
represent the comparison. For example, in the pie chart below, the variable is
“type of high school,” and the categories are public, independent, and religious
schools. The pie chart shows the percentage of students from each category who
took the SAT in 2012.
Figure 1: Pie chart of Type of High School for 2012 SAT
Bar Graphs
Like pie charts, bar graphs compare how many or what percentage of
subjects fall within a particular category on a single variable. They do so by
organizing data into either horizontal or vertical bars. One axis of the graph shows
the labels (or categories) that are being compared; the other axis shows the values
(counts, percentages, raw scores).
Simple bar graphs display comparisons of categories on a single variable.
For example, the vertical bar graph below presents the same data as the pie
chart shown above.
Figure 2: Bar Graph of Type of High School for 2012 SAT
Stacked bar graphs also display comparisons of categories on a single
variable. In this case, the graph shows all of the categories in one bar that is
divided into segments that represent the proportion of subjects in each category.
Below is a vertical stacked bar graph of the same data that were represented in
the simple bar graph and the pie chart above.
Figure 3: Stacked Bar Graph of Type of High School for 2012 SAT
Grouped bar charts display more complex comparisons by adding sub-categories,
like gender. The graph below compares the percentage of students
that fall within one of seven categories of courses on the single variable ‘type of
computer courses’ and further compares the percentage of males and females
within each category.
Figure 4: Course-taking Patterns for High School for 2012 SAT by Gender
Line Graphs
Line graphs display change in a variable over time. Each data point on the
line represents a count or a percentage that is linked in time with the other data
points on the line. Line graphs are used to look at trends for groups and individuals.
The line graph below shows the trend in percentages of African American and
Mexican American SAT test-takers.
Figure 5: Line Graph of Percent of High School Test-takers for 2012 SAT
[Figure 5 placeholder. Title: Percent African-American and Mexican-American SAT
Test-takers. x-axis: Year of Test (2006 to 2013). y-axis: Percent of Test-taking
Students (0 to 20). Lines: African-American, Mexican-American. Source note: SAT
National Sample of Test-takers (approx. 1.5 million total per year).]
Correlations
• A correlation describes the strength and direction of a relationship between
and among variables. It answers the question: What is the relationship
between and among variables and how do they interact with each other?
• A positive correlation means that as the value of one variable increases, so
does the value of the other.
• A negative correlation means that as the value of one variable increases,
the value of the other variable decreases.
A correlation can be expressed both visually and as a statistic.
Visual Representation
A scatter plot is the visual representation of a correlation that expresses
the direction and strength of a relationship. The scatter plot below is a visual
representation of the relationship between two variables: female median income
in each state and the percent of people in that state who have a bachelor’s
degree. The x-axis provides data about the median income of female workers in
each state, and the y-axis provides the data on the percent of people who have
bachelor’s degrees. Notice that the two variables move in the same direction: as
the percent of people in a state with bachelor’s degrees goes up, so does the
median income of female workers in the state, and vice versa. The scatter plot
tells us that this is a positive correlation; as one variable goes up, so does the
other.
Figure 6: Scatterplot of Median Income of Female Workers by the Percent
of People with Bachelor’s Degrees in Each State
A scatter plot also provides information about the strength of the relationship.
The closer the data points are to a straight line, the stronger the relationship. The
straight line drawn through the points is called the line of best fit; if every point
fell exactly on it, the correlation would be perfect. A visual examination of the
plot shows the values clustering moderately close to the line.
Statistic Representation
• The coefficient of correlation (r) is the statistic that precisely
describes the direction and strength of a correlation.
The direction of a relationship is expressed by the plus (+) or minus (-) sign that
appears before the r. The strength of a relationship is interpreted by how close
the r-value is to +1.0 or -1.0. A perfect positive correlation is 1.0 and would
appear on a scatter plot as a straight, diagonal line in a positive direction. A
perfect negative correlation is -1.0 and would appear on a scatter plot as a
straight line in a negative direction. We suggest the following as a guide for
evaluation of the strength of a correlation; the values may be positive or negative.
Weak correlations: r = +/- .24 or less
Moderate correlations: r = +/- .25 to .49
Moderately strong correlations: r = +/- .50 to .74
Strong correlations: r = +/- .75 to .99
Our study of female level of education and statewide earnings had a
correlation value of r = .61; this is a moderately strong correlation.
The coefficient of correlation is a very handy statistic that is used for a
variety of purposes in educational research. It can also be misused and
misunderstood. It is important to remember that r describes a relationship and is
not an indication of cause and effect.
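The sketch below shows one way the coefficient of correlation could be computed and then labeled with the strength guide suggested above. It uses the standard Pearson formula; the function names and the five pairs of state values are hypothetical, and treating values between the guide's bands (for example, .245) by simple cut-offs is our assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient of correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

def strength_label(r):
    """Apply the chapter's suggested guide; the sign of r is ignored."""
    size = abs(r)
    if size < 0.25:
        return "weak"
    if size < 0.50:
        return "moderate"
    if size < 0.75:
        return "moderately strong"
    return "strong"  # a value of exactly 1.0 is a perfect correlation

# Hypothetical data for five states (not the data behind Figure 6).
percent_with_degree = [20, 25, 30, 35, 40]
median_income_thousands = [30, 33, 31, 38, 40]

r = pearson_r(percent_with_degree, median_income_thousands)
print(f"r = {r:+.2f} ({strength_label(r)})")
```

The label describes only how closely the two variables move together; it says nothing about whether one causes the other.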
Measures of Central Tendency
Measures of central tendency provide information about midpoints of scores.
There are three measures of central tendency: mean, median, and mode.
• Mean is the measure of central tendency that is the arithmetic
average: the sum of values divided by the number in the sample or
population.
• Median is the measure of central tendency that is the mid-point in a
distribution of scores; that is, it denotes the point below which and
above which 50% of scores occur.
• Mode is the measure of central tendency that represents the most
commonly occurring score(s).
Of the three measures of central tendency, the mean is the most widely reported
and is an important computation for more complex statistical reasoning.
Standardized test reports for the SAT show the mean scaled scores for the total
number of test-takers and also for groups like females and males. Teachers
usually average grades to determine a final grade for a student. Survey
researchers average responses on numerical scales in reporting results.
However, the mean does not always provide a full picture. For example, consider
the math scores in room 115:
98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58
A simple calculation shows the mean and median representing two different
midpoints for this distribution: the mean score = 83.8; the median score = 86.0.
According to the mean, we could conclude that a student scoring 84 is at the
average of the class, but according to the median, that same student is in the
lower half of the class scores.
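The three measures of central tendency for the room 115 scores can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are ours.

```python
import statistics

# Math scores in room 115, as listed above.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

mean_score = statistics.mean(room_115)      # sum of scores divided by n
median_score = statistics.median(room_115)  # midpoint of the ranked scores
modes = statistics.multimode(room_115)      # most frequently occurring score(s)

print(f"n      = {len(room_115)}")
print(f"mean   = {mean_score:.1f}")
print(f"median = {median_score:.1f}")
print(f"modes  = {modes}")
```

Two scores, 88 and 86, each occur five times, so this distribution has two modes.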
Line Graphs
Line graphs are used to display changes in means over time. The line
graph below displays changes in means on a single variable (SAT writing scores)
for a span of 6 years.
Figure 7: SAT Mean Writing Scores (2006-2012)
Note that the same data are represented in the line graph above and in the table
below.
Year         2006  2007  2008  2009  2010  2011  2012
SAT Scores    497   493   493   492   491   489   488
Figure 8: Data Table of SAT Writing Scores Over Time (2006-2012)
The line graph below displays mean scores for critical reading and math for a span
of 30 years.
Figure 9: Line Graph Showing Trend of SAT Critical Reading and Mathematics
Scores Over Time (2006-2012)
Bar Graphs
Bar graphs are also used to display means. The bar graph below compares
mean SAT scores of Maine students across the types of high school courses in
visual and performing arts (x-axis); the mean SAT scores for reading, writing, and
math are shown on the y-axis.
Figure 10: Types of High School Arts Course Taken by Students in the State of
Maine and the Average SAT Scores for those Students
[Figure 10 placeholder. Title: Types of HS Arts Courses Taken and SAT. x-axis:
Types of Courses Taken (Music Performance; Drama: Play/Production; Music: Study
or Appreciation; Studio Art/Design; Drama: Study Appreciation; Dance;
Photography/Film; Art History; None). y-axis: Maine SAT Scores 2013 (400 to 600).
Bars: Math, Writing, Reading. Source note: Maine SAT 2013 (Sample = 14,501
students).]
Frequency Distributions
• Frequency distributions group data into categories and show the
number of times each category occurs.
Grouping data into categories makes them more accessible and easier to
interpret. Frequency distributions are visually represented in stem and leaf plots,
frequency tables, and frequency polygons or histograms. Below, we describe
each of these representations and show how the scores in room 115 (98, 94, 94,
90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86, 84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66,
58) could be presented in each.
Stem and Leaf Plots
In a stem and leaf plot, the numbers on the left side are the stems and the
numbers on the right side are the leaves. The stems are the tens digits of the
scores on the test, and the leaves are the units digits. In the plot below, “5 | 8”
represents a score of 58, and the series of numbers “8 | 0 0 0” represents the
fact that three students achieved a score of “80”.
Figure 11: Stem and Leaf Plot for Math Test Scores for Room 115
Note that the stem and leaf plot also provides information about central tendency.
To determine the mean, we can add all of the scores and divide by the number of
scores. To determine the median, we can find the score above which and below
which half of the scores fall. In fact, the stem and leaf plot is perhaps the quickest
way to find the median.
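A stem and leaf plot like the one described above can be rebuilt from the room 115 scores with a few lines of Python. This is a sketch for two-digit scores only; the function name `stem_and_leaf` is ours.

```python
from collections import defaultdict

room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

def stem_and_leaf(scores):
    """Group scores by tens digit (stem); list units digits (leaves) in order."""
    plot = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # e.g. 58 -> stem 5, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(room_115)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Because the leaves are printed in rank order, counting in to the 13th and 14th of the 26 leaves locates the median directly.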
Frequency Tables
In a frequency table, grouped data are presented in three columns. The first
column on the left shows the range of scores that defines a category; the next
column shows the number of scores that fall within each category; the third
column shows the same information as a percentage. In the frequency table
below, the room 115 scores are organized into five categories of scores: 90-100,
80-89, 70-79, 60-69, 50-59.
Score Frequency %
90-100 4 15.4 %
80-89 18 69.2 %
70-79 2 7.7 %
60-69 1 3.85 %
50-59 1 3.85 %
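The frequencies and percentages in the table can be reproduced by counting how many scores fall in each category. The sketch below assumes the five categories listed above; the function name `frequency_table` is ours.

```python
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

# Score categories from the table, as inclusive (low, high) pairs.
categories = [(90, 100), (80, 89), (70, 79), (60, 69), (50, 59)]

def frequency_table(scores, categories):
    """Return rows of (label, frequency, percentage) for each category."""
    n = len(scores)
    rows = []
    for low, high in categories:
        frequency = sum(1 for score in scores if low <= score <= high)
        rows.append((f"{low}-{high}", frequency, 100 * frequency / n))
    return rows

table = frequency_table(room_115, categories)
print(f"{'Score':8s}{'Freq':>5s}{'%':>8s}")
for label, frequency, percent in table:
    print(f"{label:8s}{frequency:5d}{percent:8.1f}")
```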
Histograms
In a histogram, the grouped data are represented as bars, as in a bar graph.
Figure 12: Histogram of Math Test Scores for Room 115
[Figure 12 placeholder. Title: Math Test Score Distribution Room 115. x-axis: Math
Test Score (50-59, 60-69, 70-79, 80-89, 90-100). y-axis: Number of Students (0 to
20).]
Measures of Variability
• Measures of variability are mathematical calculations that show the
extent to which scores and values diverge from the average or mean.
Variability also refers to the extent to which these data points differ from
each other.
Measures of variability provide a more complete description of the data than do
measures of central tendency or frequency distributions. High variability indicates
that scores or values are widely scattered and that the group represented by the
data is heterogeneous. Low variability indicates that scores and values are tightly
clustered around the mean and that the group represented by the data is
homogeneous. The most common measures of variability are the spread, range,
quartiles, variance, and standard deviation.
• Spread simply reports the highest and lowest score. The spread of scores in
room 115 is from a low score of 58 to a high score of 98.
• Range is the arithmetic difference between the highest and lowest scores.
The range of scores in room 115 is 40 (98 minus 58).
• Quartiles divide ranked data into four equal parts.
• Standard deviation is a calculation that summarizes how far, on average,
scores deviate from the mean.
• Variance is the standard deviation squared and is very important in many
more complex calculations.
The formula for calculating standard deviation is presented below.
Figure 13: Formula to Calculate Measures of Variability (Standard
deviation and variance)
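The figure itself is not reproduced in this text. Written with the symbols the chapter uses (x for a score, µ for the mean, Σ for the sum, N for the number of scores, σ for the standard deviation), the population form of the formulas would read as follows; the subscript i is our own notation.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{2}
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{2}}
```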
While this may seem daunting, it is not so difficult to understand. The key
elements in the calculation include each score value, the deviation score (which
is the difference between each score and the mean), and the sample size. The first
step is to calculate the mean; the next step is to determine a deviation score for
each value (x-µ); the next step is to square each deviation score. Then the
squared deviation scores are added together (Σ), and the sum is divided by the
sample size (N); the result is the variance. The final step is to take the square
root (√) of that figure; the result is the standard deviation (σ). It is not necessary
to calculate SD by hand; this can be done with a computer program like SPSS or
SAS, a scientific calculator, or a web-based calculator
(http://easycalculation.com/statistics/standard-deviation.php). The important thing
is to understand what standard deviation means and how it is used.
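The steps just described can be sketched in Python and applied to the room 115 scores. The function name `population_sd` is ours, and the sketch divides by N exactly as the steps say.

```python
import math

room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

def population_sd(scores):
    """Follow the five steps in the text; return (standard deviation, variance)."""
    n = len(scores)
    mean = sum(scores) / n                      # step 1: the mean
    deviations = [x - mean for x in scores]     # step 2: deviation scores
    squared = [d ** 2 for d in deviations]      # step 3: square each deviation
    variance = sum(squared) / n                 # step 4: sum, then divide by N
    return math.sqrt(variance), variance        # step 5: take the square root

sd, variance = population_sd(room_115)
score_range = max(room_115) - min(room_115)     # range: highest minus lowest

print(f"range    = {score_range}")
print(f"variance = {variance:.2f}")
print(f"SD       = {sd:.2f}")
```

Many statistical programs divide by N - 1 rather than N when the scores are treated as a sample, which gives a slightly larger standard deviation for the same data.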
Box and Whisker Plots
Box-and-whisker plots (also known as box plots) use the median and
quartiles to summarize variability of data. The ends of the box are the lower and
upper quartiles; the whiskers extend to the lowest and highest scores in the
distribution. The central line in the box is the median. Box-and-whisker plots
provide a succinct visual summary of the spread, central tendency, and shape of
the distribution. The box-and-whisker plot in Figure 14 shows a summary of data
on an interval scale. The box plot shows the symmetry of this distribution, the
range, shape and median.
Figure 14: Box-and-Whisker Plot Model
The box plot in Figure 15 shows the test scores of those students on
subsidized lunch at one point in time. The box contains the middle 50%
of the data. The upper edge (hinge) of the box indicates the 75th percentile of the
data set, and the lower hinge indicates the 25th percentile. The range of the middle
two quartiles is known as the inter-quartile range. The line in the box indicates
the median value of the data. The ends of the vertical lines or "whiskers" indicate
the minimum and maximum data values, excluding any outliers. The points outside
the ends of the whiskers are outliers or suspected outliers.
Figure 15: Box-and-whisker plot of Free and Reduced Lunch (2008)
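The numbers that a box-and-whisker plot draws (minimum, lower quartile, median, upper quartile, maximum) can be computed directly. The sketch below applies them to the room 115 math scores rather than the subsidized lunch data, which are not listed in this chapter. Two choices here are our assumptions: the "inclusive" quartile method of Python's `statistics.quantiles`, and the common convention of flagging scores more than 1.5 inter-quartile ranges beyond the box as suspected outliers.

```python
import statistics

room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

def five_number_summary(scores):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, median, q3, max(scores)

low, q1, median, q3, high = five_number_summary(room_115)
iqr = q3 - q1  # inter-quartile range: the width of the box

# Assumed convention (not from the chapter): 1.5 x IQR fences for outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in room_115 if s < lower_fence or s > upper_fence]

print(f"min = {low}, Q1 = {q1}, median = {median}, Q3 = {q3}, max = {high}")
print(f"IQR = {iqr}, suspected outliers = {sorted(outliers)}")
```

Other software may use a slightly different quartile rule, so the box edges can differ by a fraction of a point from one program to another.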
Tables of Means and Standard Deviations
Tables report means and standard deviations alongside each other and
provide rich data about the characteristics of groups. For example, look at the
table below.
Math Scores in Room 115 and Room 120
Class      Mean   SD
Room 115   83.8   8.2
Room 120   84.1   3.3
Figure 16: Math Scores in Rooms 115 and 120
If we look only at means (M = 83.8; M = 84.1), the two classes appear very similar.
However, the different standard deviations (SD = 8.2; SD = 3.3) tell a different
story. They indicate that Room 115 has more variability in scores and is more
heterogeneous than Room 120.
The Normal Distribution Curve
• The normal distribution curve (also known as the Gaussian distribution
and the bell curve) is a visual display that shows the frequency
distribution of naturally occurring phenomena and values, such as
height and weight, and is used extensively in quantitative research.
The values that occur most frequently are clustered around the midpoint of
the curve, and the least frequently occurring values (the lowest and highest) are
located at the two ends, or tails. For instance, the curve below shows the frequency
distribution of blood pressure values.
Figure 17: Distribution of Diastolic Blood Pressure Readings (Mean = 84, SD =
12)
The bell curve has the following characteristics:
• The mean, median, and mode are the same
• The curve is smooth and symmetrical
• 68.2% of the values fall within one SD of the mean (plus or minus),
and 95.4% of the values fall within two SDs of the mean (plus or
minus)
Normal distribution for IQ
Note that the mean and median = 100, and the SD = 15. About 68% of
scores fall within one standard deviation of the mean and include scores from 85
to 115 (100 +/- 15). About 95% of scores fall within two standard deviations and
include scores from 70 to 130, the usual cut-off scores for labeling students as
being “developmentally disabled” or “gifted.”
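The 68% and 95% figures can be checked against an idealized normal curve with the mean and SD given for IQ. The sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is ours.

```python
from statistics import NormalDist

# Idealized IQ distribution: mean = 100, SD = 15.
iq = NormalDist(mu=100, sigma=15)

def share_within(dist, k):
    """Proportion of values within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

within_one = share_within(iq, 1)   # scores from 85 to 115
within_two = share_within(iq, 2)   # scores from 70 to 130

print(f"within 1 SD (85 to 115): {100 * within_one:.1f}%")
print(f"within 2 SD (70 to 130): {100 * within_two:.1f}%")
```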
The curve below shows SAT scores. The mean score is 500; each
standard deviation represents 100 points.
Figure 18: SAT Scores as a Normal Distribution (Mean = 500, SD = 100)
Percentiles are also derived from the bell curve and are reported to show
how scores stand in relation to each other. For instance, a score at the 75th
percentile is equal to or higher than 75% of all scores within the sample group
and lower than 25% of scores.
Percentiles are not equal intervals along the curve. The table below makes this
clear. It shows that a 30-point difference in SAT scores at the middle of the curve
(between scores of 570 and 600) yields a 10-point difference in percentile rank,
while a 30-point difference at the tail of the curve (between scores of 760 and 790)
yields no difference in percentile rank.
Score   Percentile     Score   Percentile
570     69             760     99
600     79             790     99
The bell curve, means, and standard deviations are the most
important tools available to the quantitative researcher. We will be exploring how
they are used in later chapters.
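The unequal spacing of percentiles can also be illustrated on an idealized normal curve with mean 500 and SD 100. This is a sketch under that assumption only: published SAT percentiles come from actual test-taker data, so the values below will not match the percentile table exactly, but the pattern is the same. The helper name `percentile_rank` is ours.

```python
from statistics import NormalDist

# Idealized SAT curve (assumption): mean = 500, SD = 100.
sat = NormalDist(mu=500, sigma=100)

def percentile_rank(dist, score):
    """Percent of the distribution at or below the given score."""
    return 100 * dist.cdf(score)

# The same 30-point gap, once near the middle and once in the tail.
middle_gap = percentile_rank(sat, 600) - percentile_rank(sat, 570)
tail_gap = percentile_rank(sat, 790) - percentile_rank(sat, 760)

print(f"570 to 600: {middle_gap:.1f} percentile points")
print(f"760 to 790: {tail_gap:.1f} percentile points")
```

Near the middle of the curve many scores are packed closely together, so a small score difference passes many test-takers; in the tail there are few scores left to pass.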
Chapter Summary
• Descriptive research focuses on the state of variables
• Descriptive statistics include counts, percentages, measures of central
tendency, correlations, frequency distributions, and measures of variability
• Measures of central tendency are the mean, median, and mode
• Correlations show the strength and direction of a relationship and are
calculated by the coefficient of correlation (r).
• Measures of variability include range, spread, quartiles, variance, and
standard deviation
• Descriptive statistics are represented visually in tables, graphs, and charts of
varying complexity.
• The normal distribution curve (also known as the Gaussian distribution and
the bell curve) shows the frequency distribution of naturally occurring
phenomena.
• The normal curve is used in normative tests and measures and is an
important tool in quantitative reasoning.
Concepts and Terms
Descriptive Statistics Counts Percentages
Central Tendency Mean Median
Mode Frequency distribution Variability
Spread Range Standard deviation
Variance Normal distribution/bell/ Gaussian curve
Percentile Quartile
Pie chart Bar graph Stacked bar graph
Grouped bar graph Line graph Stem and leaf plot
Frequency table Histogram Box and whisker plot
Table of means and standard deviations
Review, Consolidation, and Extension of Knowledge
1. Explain the difference between mean and median and how this might
impact decision-making.
2. Go to this URL: nces.ed.gov/nceskids/createagraph and follow the
instructions to create a pie graph, bar graph, and line graph of any data
that are available to you or that you can access on the internet.
3. Construct a frequency table of any data available to you or that you can
access on the Internet. Then using the url above, construct a histogram.
4. Construct a stem and leaf plot of any data available to you or that you can
access on the internet.
5. Go to the url below and watch a demonstration of how a normal
distribution curve is created
http://www.theexhibitguys.com/Galton_Probability_Machine.html