new chapter six - university of maine systemmedia.usm.maine.edu/~jbeaudry/research literacy/ch...


Upload: dinhquynh

Post on 03-Jul-2018

CHAPTER SIX Descriptive Statistics and Data Displays

Chapter Objectives:

• Understand what descriptive statistics do

• Understand counts and percentages, and how they are displayed

• Understand measures of central tendency and how they are displayed

• Understand correlations and how they are displayed

• Understand frequency distributions and how they are displayed

• Understand measures of variability and how they are displayed

• Understand range, spread, variance, and standard deviation and how they are

displayed

• Understand what the normal curve represents and how it is used

_____________________________________________________________________

When one thinks of quantification, numbers come to mind; but quantification

is about more than numbers. It is about how numerical data are transformed into

descriptive statistics and how these calculations are presented to the reader.

• Descriptive statistics are the calculations that report on the status of

variables at a given moment.

At their most elementary level, descriptive statistics report data as counts,

percentages, measures of central tendency like means or averages, and

correlations. At more complex levels, they include frequency distributions and

measures of variability. Test reports and surveys often depend on descriptive

statistics to communicate results.

While all data can be presented as statistics in tables that organize data

into columns and rows, visual displays like charts, graphs and data plots are also

effective in communicating results; they have become an essential tool in

quantitative research. This chapter explains how data are described statistically

and how they are displayed visually.

Counts and Percentages

Counts and percentages are the simplest numerical calculations that describe

quantitative data.

• Counts describe “how many” and are usually represented by the letter n.

• Percentages are calculations that show the fraction of a whole, where

the whole is represented as 100.

Counts and percentages are usually represented visually as pie charts, bar graphs,

and line graphs.
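As a minimal sketch of how counts and percentages are computed, the Python example below tallies a small, hypothetical list of test-takers by type of high school. The data values and variable names are assumptions made for illustration; they are not the 2012 SAT figures.

```python
from collections import Counter

# Hypothetical sample (not real SAT data): type of high school for ten test-takers.
school_types = ["public", "public", "independent", "public", "religious",
                "public", "public", "religious", "public", "independent"]

counts = Counter(school_types)   # counts: "how many" fall in each category
n = sum(counts.values())         # n: the total number of subjects

# Percentage: each category's share of the whole, where the whole is 100.
percentages = {category: 100 * count / n for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:<12} n = {count:<3} {percentages[category]:.1f}%")
```

The same counts or percentages are what a pie chart or bar graph would display, one slice or bar per category.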

Pie Charts

Pie charts compare how many or what percentage of subjects falls within a

particular category on a single variable. They organize data into circular charts to

represent the comparison. For example, in the pie chart below, the variable is

“type of high school,” and the categories are public, independent, and religious

schools. The pie chart shows the percentage of students from each category who

took the SAT in 2012.

Figure 1: Pie chart of Type of High School for 2012 SAT

Bar Graphs

Like pie charts, bar graphs compare how many or what percentage of

subjects fall within a particular category on a single variable. They do so by

organizing data into either horizontal or vertical bars. On one axis of the graph are

the labels (or categories) that are being compared; on the other axis are the values

(counts, percentages, raw scores).

Simple bar graphs display comparisons of categories on a single variable.

For example, the vertical bar graph below presents the same data as the pie

chart shown above.

Figure 2: Bar Graph of Type of High School for 2012 SAT

Stacked bar graphs also display comparisons of categories on a single

variable. In this case, the graph shows all of the categories on one bar that is

divided into segments that represent the proportion of subjects in each category.

Below is a vertical stacked bar graph of the same data that were represented in

the simple bar graph and the pie chart above.

Figure 3: Stacked Bar Graph of Type of High School for 2012 SAT

Grouped bar charts display more complex comparisons by adding sub-categories,

like gender. The graph below compares the percentage of students that fall within

one of seven categories of courses on the single variable “type of computer

courses” and further compares the percentage of males and females within each

category.

Figure 4: Course-taking Patterns for High School for 2012 SAT by Gender

Line Graphs

Line graphs display change in a variable over time. Each data point on the

line represents a count or a percentage that is linked in time with the other data

points on the line. Line graphs are used to look at trends for groups and individuals.

The line graph below shows the trend in percentages of African American and

Mexican American SAT test-takers.

Figure 5: Line Graph of Percent of High School Test-takers for 2012 SAT

[Figure 5: line graph titled “Percent African-American and Mexican-American SAT Test-takers.” X-axis: Year of Test (2006 to 2013). Y-axis: Percent of Test-taking Students (0 to 20). Lines: African-American, Mexican-American. Source note: SAT National Sample of Test-takers (approx. 1.5 million total per year).]

Correlations

• A correlation describes the strength and direction of a relationship between

and among variables. It answers the question: What is the relationship

between and among variables and how do they interact with each other?

• A positive correlation means that as one variable increases, so does the

other.

• A negative correlation means that as the value of one variable increases,

the value of the other variable decreases. A correlation can be expressed

both visually and as a statistic.

Visual Representation

A scatter plot is the visual representation of a correlation that expresses

the direction and strength of a relationship. The scatter plot below is a visual

representation of the relationship between two variables: female median income

in each state and the percent of people in that state who have a bachelor’s

degree. The x-axis provides data about the median income of female workers in

each state, and the y-axis provides the data on the percent of people who have

bachelor’s degrees. Notice that the two variables move in the same direction; as

the percent of people in a state with bachelor’s degrees goes up, so does the

median income of female workers in that state, and vice versa. The scatter plot tells us

that this is a positive correlation; as one variable goes up, so does the other.

Figure 6: Scatterplot of Median Income of Female Workers by the Percent

of People with Bachelor’s Degrees in Each State

A scatter plot also provides information about the strength of the relationship.

The closer the data points are to a straight line, the stronger the relationship. The

straight line drawn through the points is called the line of best fit; if every point fell

exactly on it, the correlation would be perfect. A visual examination of the plot shows

the values clustering moderately close to the line.

Statistical Representation

• The coefficient of correlation (r) is the statistic that precisely

describes the direction and strength of a correlation.

The direction of a relationship is expressed by the plus (+) or minus (-) sign that

appears before the r. The strength of a relationship is interpreted by how close

the r-value is to +1.0 or -1.0. A perfect positive correlation is 1.0 and would

appear on a scatter plot as a straight, diagonal line in a positive direction. A

perfect negative correlation is -1.0 and would appear on a scatter plot as a

straight line in a negative direction. We suggest the following as a guide for

evaluation of the strength of a correlation; the values may be positive or negative.

Weak correlations: r = +/- .24 or less

Moderate correlations: r = +/- .25 to .49

Moderately strong correlations: r = +/- .50 to .74

Strong correlations: r = +/- .75 to .99

Our study of education levels and the median income of female workers across states

had a correlation value of r = .61; this is a moderately strong correlation.
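The sketch below shows one way to compute the coefficient of correlation and label its strength with the guide above. It uses the standard Pearson formula. The function names and the five-point sample data are hypothetical and chosen only for illustration. A perfect correlation of +/- 1.0 is grouped with "strong" here for simplicity.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of correlation (r) for two equal-length lists of values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and each variable's sum of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

def describe_strength(r):
    """Label r using the chapter's guide; the sign gives direction, not strength."""
    size = abs(r)
    if size < 0.25:
        return "weak"
    if size < 0.50:
        return "moderate"
    if size < 0.75:
        return "moderately strong"
    return "strong"  # includes a perfect correlation of +/- 1.0

# Hypothetical paired values (not the state-level data in the scatter plot).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.2f} ({direction}, {describe_strength(r)})")
print(f"r = .61 is {describe_strength(0.61)}")
```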

The coefficient of correlation is a very handy statistic that is used for a

variety of purposes in educational research. It can also be misused and

misunderstood. It is important to remember that r describes a relationship and is

not an indication of cause and effect.

Measures of Central Tendency

Measures of central tendency provide information about midpoints of scores.

There are three measures of central tendency: mean, median, and mode.

• Mean is the measure of central tendency that is the arithmetic

average: the sum of values divided by the number in the sample or

population.

• Median is the measure of central tendency that is the mid-point in a

distribution of scores; that is, it denotes the point below which and

above which 50% of scores occur.

• Mode is the measure of central tendency that represents the most

commonly occurring score(s).

Of the three measures of central tendency, the mean is the most widely reported

and is an important computation for more complex statistical reasoning.

Standardized test reports for the SAT show the mean scaled scores for the total

number of test-takers and also for groups like females and males. Teachers

usually average grades to determine a final grade for a student. Survey

researchers average responses on numerical scales in reporting results.

However, the mean does not always provide a full picture. For example, consider

the math scores in room 115:

98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,

84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58

A simple calculation shows that the mean and median represent two different

midpoints for this distribution: the mean score is 83.8 and the median score is 86.0.

According to the mean, we could conclude that a student scoring 84 is at the

average of the class, but according to the median, that same student is in the

lower half of the class scores.
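The three measures can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module and the room 115 scores listed above; the variable names are our own.

```python
from statistics import mean, median, multimode

# Math scores in room 115, as listed in the text.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

class_mean = mean(room_115)        # sum of the values divided by how many there are
class_median = median(room_115)    # midpoint: half of the scores fall on each side
class_modes = multimode(room_115)  # the most commonly occurring score(s)

print(f"mean   = {class_mean:.1f}")
print(f"median = {class_median:.1f}")
print(f"modes  = {class_modes}")
```

Two scores, 88 and 86, each occur five times, so this distribution has two modes.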

Line Graphs

Line graphs are used to display changes in means over time. The line

graph below displays changes in means on a single variable (SAT writing scores)

for a span of 6 years.

Figure 7: SAT Mean Writing Scores (2006-2012)

Note that the same data are represented in the line graph above and in the table

below.

Year         2006  2007  2008  2009  2010  2011  2012

SAT Scores   497   493   493   492   491   489   488

Figure 8: Data Table of SAT Writing Scores Over Time (2006-2012)

The line graph below displays mean scores for critical reading and math for a span

of 30 years.

Figure 9: Line Graph Showing Trend of SAT Critical Reading and Mathematics

Scores Over Time (2006-2012)

Bar Graphs

Bar graphs are also used to display means. The bar graph below compares

mean SAT scores in reading, writing, and math (y-axis) for Maine students

grouped by the type of high school visual and performing arts course they took

(x-axis).

Figure 10: Types of High School Arts Course Taken by Students in the State of

Maine and the Average SAT Scores for those Students

[Figure 10: bar graph titled “Types of HS Arts Courses Taken and SAT.” X-axis: Types of Courses Taken (Music Performance; Drama: Play/Production; Music: Study or Appreciation; Studio Art/Design; Drama: Study Appreciation; Dance; Photography/Film; Art History; None). Y-axis: Maine SAT Scores 2013 (400 to 600). Bars: Math, Writing, Reading. Source note: Maine SAT 2013 (Sample = 14,501 students).]

Frequency Distributions

• Frequency distributions group data into categories and show the

number of times each category occurs.

Grouping data into categories makes them more accessible and easier to

interpret. Frequency distributions are visually represented in stem and leaf plots,

frequency tables, and frequency polygons or histograms. Below, we describe

each of these representations and show how the scores in room 115 (98, 94, 94,

90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86, 84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66,

58) could be presented in each.

Stem and Leaf Plots

In a stem and leaf plot, the numbers on the left side are the stems and the numbers

on the right side are the leaves. The stems are the tens digits of the scores on the

test, and the leaves are the units digits. In the plot below, “5 | 8” represents a score of 58, and the series of

numbers “8 | 0 0 0” represents the fact that three students achieved a score of

“80”.

Figure 11: Stem and Leaf Plot for Math Test Scores for Room 115

Note that the stem and leaf plot also provides information about central tendency.

To determine the mean, we can add all of the scores and divide by the number of

scores. To determine the median, we can find the score above which and below

which half of the scores fall. In fact, the stem and leaf plot is perhaps the quickest

way to find the median.
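A stem and leaf plot is simple enough to build by hand, but a short sketch can make the construction explicit. The function name `stem_and_leaf` is our own, and the code assumes whole-number scores below 1,000 so that the tens digit works as the stem.

```python
from collections import defaultdict

# Math scores in room 115, as listed in the text.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

def stem_and_leaf(scores):
    """Return plot lines with the tens digit as the stem and units digits as leaves."""
    leaves_by_stem = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)   # e.g. 58 -> stem 5, leaf 8
        leaves_by_stem[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(leaves_by_stem.items())]

lines = stem_and_leaf(room_115)
for line in lines:
    print(line)
```

For these scores, the first line printed is `5 | 8` and the last is `9 | 0 4 4 8`. Because the leaves are printed in order, counting in from either end to the middle of the 26 leaves locates the median.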

Frequency Tables

In a frequency table, grouped data are presented in three columns. The first

column on the left shows the range of scores that define a category; the next

column shows the number of times each category appears in the data; the third

column shows the same information as a percentage. In the frequency table

below, the room 115 scores are organized into five categories of scores: 90-100,

80-89, 70-79, 60-69, 50-59.

Score Frequency %

90-100 4 15.4 %

80-89 18 69.2 %

70-79 2 7.7 %

60-69 1 3.85 %

50-59 1 3.85 %
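The frequency table above can be reproduced with a short sketch. The function name `frequency_table` and the tuple layout of each row are assumptions made for illustration.

```python
# Math scores in room 115, as listed in the text.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

# Score categories from the frequency table: (low, high), both inclusive.
categories = [(90, 100), (80, 89), (70, 79), (60, 69), (50, 59)]

def frequency_table(scores, categories):
    """Return one (label, frequency, percent) row per category."""
    rows = []
    for low, high in categories:
        frequency = sum(1 for s in scores if low <= s <= high)
        percent = 100 * frequency / len(scores)
        rows.append((f"{low}-{high}", frequency, percent))
    return rows

table = frequency_table(room_115, categories)
print(f"{'Score':<8}{'Frequency':>10}{'%':>8}")
for label, frequency, percent in table:
    print(f"{label:<8}{frequency:>10}{percent:>8.1f}")
```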

Histograms

In a histogram, the grouped data are represented as bars, as in a bar graph.

Figure 12: Histogram of Math Test Scores for Room 115

[Figure 12: histogram titled “Math Test Score Distribution Room 115.” X-axis: Math Test Score (50-59, 60-69, 70-79, 80-89, 90-100). Y-axis: Number of Students (0 to 20).]
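As a rough, text-only stand-in for the histogram, the sketch below prints one row per score category with a `#` for each student. The function name `text_histogram` is hypothetical; a plotting library would normally be used for a publication-quality chart.

```python
# Math scores in room 115, as listed in the text.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

# Score categories in the order shown on the histogram's axis: (low, high), inclusive.
bins = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 100)]

def text_histogram(scores, bins):
    """Return one line per bin: the label, a bar of '#' marks, and the count."""
    lines = []
    for low, high in bins:
        count = sum(1 for s in scores if low <= s <= high)
        label = f"{low}-{high}"
        lines.append(f"{label:<7}| {'#' * count} ({count})")
    return lines

histogram = text_histogram(room_115, bins)
for line in histogram:
    print(line)
```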

Measures of Variability

• Measures of variability are mathematical calculations that show the

extent to which scores and values diverge from the average or mean.

Variability also refers to the extent to which these data points differ from

each other.

Measures of variability provide a more complete description of the data than do

measures of central tendency or frequency distributions. High variability indicates

that scores or values are widely scattered and that the group represented by the

data is heterogeneous. Low variability indicates that scores and values are tightly

clustered around the mean and that the group represented by the data is

homogeneous. The most common measures of variability are the spread, range,

quartiles, variance, and standard deviation.

• Spread simply reports the highest and lowest score. The spread of scores in

room 115 is from a low score of 58 to a high score of 98.

• Range is the arithmetic difference between the highest and lowest scores.

The range of scores in room 115 is 40 (98 minus 58).

• Quartiles divide ranked data into four equal parts.

• Standard deviation is a calculation that summarizes how far scores typically

fall from the mean; it is the square root of the variance.

• Variance is the standard deviation squared and is very important in many

more complex calculations.

The formula for calculating standard deviation is presented below.

Figure 13: Formula to Calculate Measures of Variability (Standard

deviation and variance)
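For readers who prefer notation, the formula can be restated from the steps described in this section. This is a sketch, not a copy of the figure. Here x_i is each score, µ is the mean, and N is the number of scores; the subscript i is our own notation.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad\text{and}\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
```

The first expression is the standard deviation; the second, without the square root, is the variance.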

While this may seem daunting, it is not so difficult to understand. The key

elements in the calculation include each score value, the deviation score (which

is the difference in each score from the mean), and the sample size. The first

step is to calculate the mean; the next step is to determine a deviation score for

each value (x-µ); the next step is to square each deviation score. Then the

squared deviation scores are added together, and the sum (Σ) is divided by the

sample size (1/N). The final step is to take the square root (√) of that figure; the

result is the standard deviation (σ). It is not necessary to calculate SD by hand;

this can be done with a computer program like SPSS or SAS, a scientific

calculator, or a web-based calculator such as

http://easycalculation.com/statistics/standard-deviation.php. The important thing

is to understand what standard deviation means and how it is used.
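The steps described above can be sketched in Python using the room 115 scores. Dividing by N, as the text describes, gives what is usually called the population standard deviation. Many programs instead divide by N - 1 (the sample standard deviation), which gives a slightly larger value. Which version a given report uses is an assumption you may need to check.

```python
from math import sqrt
from statistics import pstdev, stdev

# Math scores in room 115, as listed in the text.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

n = len(room_115)
mu = sum(room_115) / n                     # step 1: calculate the mean
deviations = [x - mu for x in room_115]    # step 2: deviation score (x - mu) for each value
squared = [d ** 2 for d in deviations]     # step 3: square each deviation score
variance = sum(squared) / n                # step 4: add them up and divide by N
sigma = sqrt(variance)                     # step 5: the square root is the SD

low, high = min(room_115), max(room_115)   # spread: lowest and highest scores
score_range = high - low                   # range: highest minus lowest

print(f"spread: {low} to {high}, range = {score_range}")
print(f"variance = {variance:.1f}, SD (divide by N) = {sigma:.1f}")
print(f"SD (divide by N - 1) = {stdev(room_115):.1f}")
```

The hand calculation agrees with the library function `pstdev`, which also divides by N.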

Box and Whisker Plots

Box-and-whisker plots (also known as box plots) use the median and

quartiles to summarize variability of data. The ends of the box are the lower and

upper quartiles; the whiskers extend to the lowest and highest scores in the

distribution. The central line in the box is the median. Box-and-whisker plots

provide a succinct visual summary of the spread, central tendency, and shape of

the distribution. The box-and-whisker plot in Figure 14 shows a summary of data

on an interval scale. The box plot shows the symmetry of this distribution, the

range, shape, and median.

Figure 14: Box-and-Whisker Plot Model

The box plot in Figure 15 shows the test scores of those students on

subsidized lunch at one point in time. The box contains the middle 50% of the

data. The upper edge (hinge) of the box indicates the 75th percentile of the

data set, and the lower hinge indicates the 25th percentile. The range of the middle

two quartiles is known as the inter-quartile range. The line in the box indicates

the median value of the data. The ends of the vertical lines or "whiskers" indicate

the minimum and maximum data values, excluding any outliers. The points outside

the ends of the whiskers are outliers or suspected outliers.
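The quantities a box plot draws can be computed directly. The sketch below uses the room 115 math scores rather than the subsidized lunch data, which are not listed in this chapter. It also assumes the common convention of flagging values more than 1.5 times the inter-quartile range beyond the box as suspected outliers. That rule is an assumption of this example and is not stated in the text. Quartile values can differ slightly between programs, depending on the interpolation method used.

```python
from statistics import quantiles

# Math scores in room 115, as listed earlier in the chapter.
room_115 = [98, 94, 94, 90, 88, 88, 88, 88, 88, 86, 86, 86, 86, 86,
            84, 84, 82, 82, 82, 80, 80, 80, 78, 76, 66, 58]

# Cut points that divide the ranked data into four equal parts.
q1, q2, q3 = quantiles(room_115, n=4)   # lower hinge, median, upper hinge
iqr = q3 - q1                           # inter-quartile range: width of the box

# Assumed convention: points beyond 1.5 x IQR from the box are suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in room_115 if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"suspected outliers: {sorted(outliers)}")
```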

Figure 15: Box-and-whisker plot of Free and Reduced Lunch (2008)

Tables of Means and Standard Deviations

Tables report means and standard deviations alongside each other and

provide rich data about the characteristics of groups. For example, look at the

table below.

Math Scores in Rooms 115 and 120

Class      Mean   SD

Room 115   83.8   8.2

Room 120   84.1   3.3

Figure 16: Math Scores in Rooms 115 and 120

If we look only at means (M = 83.8; M = 84.1), the two classes appear very similar.

However, the different standard deviations (SD = 8.2; SD = 3.3) tell a different

story. They indicate that Room 115 has more variability in scores and is more

heterogeneous than Room 120.

The Normal Distribution Curve

• The normal distribution curve (also known as the Gaussian distribution

and the bell curve) is a visual display that shows the frequency

distribution of naturally occurring phenomena and values, such as

height and weight, and is used extensively in quantitative research.

The values that occur most frequently are clustered around the midpoint of

the curve, and the least frequently occurring values (lowest and highest) are located

at the two ends, or tails. For instance, the curve below shows the frequency

distribution of blood pressure values.

Figure 17: Distribution of Diastolic Blood Pressure Readings (Mean = 84, SD =

12)

The bell curve has the following characteristics:

• The mean, median, and mode are the same

• The curve is smooth and symmetrical

• 68.2% of the values fall within one SD of the mean (plus or minus),

and 95.4% of the values fall within two SDs of the mean (plus or

minus)

Normal distribution for IQ

Note that the mean and median = 100, and the SD = 15. About 68% of

scores fall within one standard deviation of the mean and include scores from 85 to 115

(100 +/- 15). About 95 percent of scores fall within two standard deviations and

include scores from 70 to 130, the usual cut-off scores for labeling students as

being “developmentally disabled” or “gifted.”
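These proportions can be checked against an idealized normal curve. The sketch below uses `NormalDist` from Python's standard `statistics` module with the IQ mean and SD given above; the helper name `share_within` is our own.

```python
from statistics import NormalDist

# Idealized IQ distribution from the text: mean 100, SD 15.
iq = NormalDist(mu=100, sigma=15)

def share_within(dist, k):
    """Proportion of values within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

within_1 = share_within(iq, 1)   # scores from 85 to 115
within_2 = share_within(iq, 2)   # scores from 70 to 130

print(f"within 1 SD (85 to 115): {within_1:.1%}")
print(f"within 2 SD (70 to 130): {within_2:.1%}")
```

The computed shares are roughly 68% and 95%, which agrees with the figures quoted for the bell curve to within rounding.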

The curve below shows SAT scores. The mean score is 500; each

standard deviation represents 100 points.

Figure 18: SAT Scores as a Normal Distribution (Mean = 500, SD = 100)

Percentiles are also derived from the bell curve and are reported to show

how scores stand in relation to each other. For instance, a score at the 75th

percentile is equal to or higher than 75% of all scores within the sample group

and lower than 25% of scores.

Percentiles are not equal intervals along the curve. The table below makes this

clear. It shows that a 30-point difference in SAT scores at the middle of the curve

(between scores of 570 and 600) yields a 10-percentile difference, while a 30-point

difference at the tail of the curve (between scores of 760 and 790) yields no

difference in percentile rank.

The bell curve, means, and standard deviations are the most

important tools available to the quantitative researcher. We will be exploring how

they are used in later chapters.

Score   Percentile

570     69

600     79

760     99

790     99
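The same pattern appears on an idealized normal curve. The sketch below assumes SAT scores follow a perfect normal distribution with a mean of 500 and an SD of 100. Percentiles computed this way will not match the table exactly, since the table presumably reports percentiles observed among actual test-takers. The function name `percentile_rank` is our own.

```python
from statistics import NormalDist

# Assumption: an idealized normal curve with mean 500 and SD 100.
sat = NormalDist(mu=500, sigma=100)

def percentile_rank(score, dist=sat):
    """Percent of the distribution at or below the given score."""
    return 100 * dist.cdf(score)

# The same 30-point difference, near the middle and out in the tail.
middle_gap = percentile_rank(600) - percentile_rank(570)
tail_gap = percentile_rank(790) - percentile_rank(760)

print(f"570 to 600: {middle_gap:.1f} percentile points")
print(f"760 to 790: {tail_gap:.1f} percentile points")
```

Near the middle of the curve, 30 points span several percentile ranks. In the tail, the same 30 points span less than one.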

Chapter Summary

• Descriptive research focuses on the state of variables

• Descriptive statistics include counts, percentages, measures of central

tendency, correlations, frequency distributions, and measures of variability

• Measures of central tendency are the mean, median, and mode

• Correlations show the strength and direction of a relationship and are

calculated by the coefficient of correlation (r).

• Measures of variability include range, spread, quartiles, variance, and

standard deviation

• Descriptive statistics are represented visually in tables, graphs, and charts of

varying complexity.

• The normal distribution curve (also known as the Gaussian distribution and

the bell curve) shows the frequency distribution of naturally occurring

phenomena.

• The normal curve is used in normative tests and measures and is an

important tool in quantitative reasoning.

Concepts and Terms

Descriptive Statistics Counts Percentages

Central Tendency Mean Median

Mode Frequency distribution Variability

Spread Range Standard deviation

Variance Normal distribution/bell/ Gaussian curve

Percentile Quartile

Pie chart Bar graph Stacked bar graph

Grouped bar graph Line graph Stem and leaf plot

Frequency table Histogram Box and whisker plot

Table of means and standard deviations

Review, Consolidation, and Extension of Knowledge

1. Explain the difference between mean and median and how this might

impact decision-making.

2. Go to this url: nces.ed.gov/nceskids/createagraph and follow the

instructions to create a pie graph, bar graph, and line graph of any data

that are available to you or that you can access on the internet.

3. Construct a frequency table of any data available to you or that you can

access on the Internet. Then using the url above, construct a histogram.

4. Construct a stem and leaf plot of any data available to you or that you can

access on the internet.

5. Go to the url below and watch a demonstration of how a normal

distribution curve is created

http://www.theexhibitguys.com/Galton_Probability_Machine.html