Graphs

In this section students will:

  1. Construct a frequency distribution
  2. Construct bar graphs
  3. Construct a stem-and-leaf display and visualize the data distribution
  4. Interpret information displayed in all types of graphs

Stem-and-leaf Graphs

One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small (\(n<30\)).

To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit.

For example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three.

Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

More Stemplots

The stemplot is a quick way to graph data and gives an exact picture of the data.

You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value.

When you graph an outlier, it will appear not to fit the pattern of the graph.

Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.

Example Stemplot

For Susan Dean’s spring pre-calculus class, scores for the first exam were as follows (smallest to largest): 33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100.

Pre-Calculus Spring First Exam Scores

  The decimal point is 1 digit(s) to the right of the |

   3 | 3
   4 | 299
   5 | 355
   6 | 1378899
   7 | 2348
   8 | 03888
   9 | 0244446
  10 | 0

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26% (8/31) were in the 90s or 100, a fairly high number of As.
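The construction steps can be sketched in a few lines of Python. This is a minimal illustration rather than the tool that produced the display above: the function name `stemplot` is hypothetical, and it assumes non-negative whole numbers whose leaf is the last digit.

```python
from collections import defaultdict

# Exam scores from the example above (already sorted).
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]


def stemplot(values):
    """Return stemplot rows for non-negative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # 432 -> stem 43, leaf 2
        leaves[stem].append(str(leaf))
    rows = []
    # Walk every stem from smallest to largest so gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem:>3} | {''.join(leaves.get(stem, []))}")
    return rows


rows = stemplot(scores)
print("\n".join(rows))

# Share of scores in the 90s or at 100, as discussed above.
high = sum(1 for s in scores if s >= 90)
print(f"{high} of {len(scores)} scores ({high / len(scores):.0%})")
```

Stems with no leaves are still printed, so a gap in the data shows up as an empty row.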

President Data

Presidential Age Data

Side-by-side Stemplots

A side-by-side stem-and-leaf plot allows a comparison of two data sets in two columns. The two sets of leaves share the same stems, with one set written to the left of the stems and the other to the right. The previous slide showed the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using these data.

_______________________________________________
  1 | 2: represents 12, leaf unit: 1 
              pres.in      pres.out         
_______________________________________________
    2              32| 4* |                    
    9         9987776| 4. |69              2   
  (13)  4444422111110| 5* |3               3   
  (12)   877776665555| 5. |6688            7   
   10         4421110| 6* |003344         13   
    3             985| 6. |567778         19   
                     | 7* |0111234        (7)  
                     | 7. |7889           13   
                     | 8* |013             9   
                     | 8. |58              6   
                     | 9* |0033            4   
_______________________________________________
n:                 44      39               
_______________________________________________

Line Graphs

Another type of graph that is useful for specific data values is a line graph.

The x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points.

The frequency points are connected using line segments.

In a survey, 50 teenagers between the ages of 13 and 17 indicated how many days per week they play video games.

Days/Week   Frequency
1           2
2           3
3           8
4           7
5           14
6           7
7           9

Video games

Bar Graphs

Bar graphs consist of bars that are separated from each other.

The bars can be rectangles or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal.

Example: By the end of 2011, Facebook had over 146 million users in the United States. Table 2.9 shows three age groups, the number of users in each age group, and the proportion (%) of users in each age group.

Bar Graph Example

Facebook users 2011
Barplot Facebook users

Histograms and Time Series Graphs

In this section students will:

  1. Determine types of graphs appropriate for specific data
  2. Construct histograms, relative frequency histograms
  3. Construct time-series graphs

Histograms

A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis.

The horizontal axis is labeled with what the data represents (for instance, distance from your home to school).

The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability).

The graph will have the same shape with either label.

The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data.

Making a Histogram

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity.

Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places.

For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 – 0.5 = 1.5).

Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire).

Soccer Data

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data, since height is measured. 60, 60.5, 61, 61, 61.5, 63.5, 63.5, 63.5, 64, 64, 64, 64, 64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5, 69.5, 70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71, 72, 72, 72, 72.5, 72.5, 73, 73.5, 74

Histogram Example

The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point. 60 – 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95. The largest value is 74, so 74 + 0.05 = 74.05 is the ending value. Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you choose eight bars. Then the width is (74.05 – 59.95) / 8 = 1.7625, or about 1.76.

NOTE
We will round the width of 1.76 up to two and make each bar or class interval two units wide. Rounding up to two is one way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that some follow for the number of bars or class intervals is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 (about 12.2) and round to 12 bars or intervals.

Create the boundaries by adding the width to the starting point and so on to create the total number of classes needed.

The boundaries are:
59.95
59.95 + 2 = 61.95
61.95 + 2 = 63.95
63.95 + 2 = 65.95
65.95 + 2 = 67.95
67.95 + 2 = 69.95
69.95 + 2 = 71.95
71.95 + 2 = 73.95
73.95 + 2 = 75.95

Count the frequency in each class to draw your histogram.

Histogram Frequencies

Class          Frequency   Relative Frequency
59.95-61.95    5           0.05
61.95-63.95    3           0.03
63.95-65.95    15          0.15
65.95-67.95    40          0.40
67.95-69.95    17          0.17
69.95-71.95    12          0.12
71.95-73.95    7           0.07
73.95-75.95    1           0.01
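The class construction can be reproduced with a short Python sketch. It assumes the same choices made in the example (eight bars, boundaries offset by 0.05, width rounded up to a whole number); the variable names are illustrative, and the heights are stored as value-to-count pairs only to keep the listing short.

```python
import math

# Heights from the soccer example, stored as value: count to save space.
counts = {60: 1, 60.5: 1, 61: 2, 61.5: 1, 63.5: 3, 64: 7, 64.5: 8,
          66: 10, 66.5: 11, 67: 12, 67.5: 7, 68: 2, 69: 10, 69.5: 5,
          70: 6, 70.5: 3, 71: 3, 72: 3, 72.5: 2, 73: 1, 73.5: 1, 74: 1}
heights = [value for value, count in counts.items() for _ in range(count)]

bars = 8                              # chosen number of classes
start = min(heights) - 0.05           # 59.95
end = max(heights) + 0.05             # 74.05
raw_width = (end - start) / bars      # about 1.76
width = math.ceil(raw_width)          # rounded up to 2

frequencies = [0] * bars
for h in heights:
    # No height lies on a boundary, so floor division picks the class.
    frequencies[int((h - start) // width)] += 1

n = len(heights)
for k, f in enumerate(frequencies):
    low = start + k * width
    print(f"{low:.2f}-{low + width:.2f}  {f:>3}  {f / n:.2f}")
```

Because every boundary carries one more decimal place than the data, no height can land exactly on a boundary, which is why simple floor division is safe here.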

Graph histogram

The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.

The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

Histogram of heights

The Graph again

Histogram of heights

Time Series Graph

To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system.

The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity.

The points on the graph are typically connected by straight lines in the order in which they occur.

It is basically a line plot with time as the “independent” variable.

Time series graphs are important tools in various applications of statistics.

When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern.

However, once the same data points are displayed graphically, some features jump out; time series graphs make trends easy to spot.

Annual CPI Time Series Plots

CPI 1913-2023

Volcano

Contour plot (not required for this course)

Volcano

Measures of Location of Data

In this section students will:

  1. Define measure of position and determine most appropriate measure
  2. Compute the median from raw data
  3. Compute percentiles from raw data
  4. Compute quartiles from raw data
  5. Compute the five-number summary from raw data

Measures of Location

The median is a number that measures the center of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values.

It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. The median, \(M\), is called both the second quartile and the 50\(^{th}\) percentile.

The common measures of location are quartiles and percentiles.

Percentiles divide ordered data into hundredths. To score in the 90\(^{th}\) percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

To calculate percentiles, the data must be ordered from smallest to largest.

Quartiles divide ordered data into quarters. The first quartile, \(Q1\), is the middle value of the lower half of the data, and the third quartile, \(Q3\), is the middle value of the upper half of the data.

The first quartile, \(Q1\), is the same as the 25\(^{th}\) percentile, and the third quartile, \(Q3\), is the same as the 75\(^{th}\) percentile. Quartiles may or may not be part of the data.

To calculate quartiles, the data must be ordered from smallest to largest.

Find \(k^{th}\) Percentile

\(k\): the \(k^{th}\) percentile. It may or may not be part of the data.
\(i\): the index (ranking or position of a data value)
\(n\): the total number of data points

Order the data from smallest to largest. Calculate \[i=\frac{k}{100}(n+1)\]

If \(i\) is an integer, then the \(k^{th}\) percentile is the data value in the \(i^{th}\) position in the ordered set of data.

If \(i\) is not an integer, then round \(i\) up and round \(i\) down to the nearest integers. Average the two data values in these two positions in the ordered data set.

Percentiles

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest. Calculate the 20\(^{th}\) and 55\(^{th}\) percentiles.

 [1] 18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72
[26] 73 74 76 77

20\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{20}{100}(29+1)=6\]

55\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{55}{100}(29+1)=16.5\]

The 20\(^{th}\) percentile is the 6\(^{th}\) data point. For the 55\(^{th}\) percentile, \(i=16.5\) is not an integer, so average the 16\(^{th}\) and 17\(^{th}\) data points.

The 6\(^{th}\) data point is 27, so the 20\(^{th}\) percentile is 27. The 16\(^{th}\) and 17\(^{th}\) data points are 52 and 55, so the 55\(^{th}\) percentile is their average, 53.5.
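The percentile procedure translates directly into code. The following is a sketch of this particular method only (software packages often use other definitions); the function name `percentile_value` is hypothetical, and it assumes \(k\) is a whole number so that integer arithmetic can decide whether \(i\) is an integer.

```python
def percentile_value(sorted_data, k):
    """k-th percentile using i = k/100 * (n + 1), for whole-number k."""
    n = len(sorted_data)
    # k * (n + 1) = 100 * whole + remainder, so i = whole + remainder / 100.
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position i falls outside the data")
    if remainder == 0:
        return sorted_data[whole - 1]       # i is an integer
    # i is not an integer: average the values on either side of it.
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

p20 = percentile_value(ages, 20)   # i = 6
p55 = percentile_value(ages, 55)   # i = 16.5
print(p20, p55)
```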

Find Percentile (\(p\)) for a Value in the Dataset

Order the data from smallest to largest.

\(x\): the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile
\(y\): the number of data values equal to the data value for which you want to find the percentile
\(n\): the total number of data points
\(p\): the percentile

Calculate \[p=\frac{x+0.5y}{n}\times 100\]
Then round to the nearest integer.
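As a rough sketch, the formula can be applied to the 29 actor ages from the percentile example. The function name `percentile_rank` and the choice of the ages 58 and 25 are illustrative assumptions, not part of the original exercise.

```python
def percentile_rank(sorted_data, value):
    """Percentile of a value that appears in the data: (x + 0.5y) / n * 100."""
    x = sum(1 for v in sorted_data if v < value)    # values below
    y = sum(1 for v in sorted_data if v == value)   # values equal
    if y == 0:
        raise ValueError("value is not in the data set")
    # Note: Python's round() sends exact .5 cases to the nearest even integer.
    return round((x + 0.5 * y) / len(sorted_data) * 100)


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

rank_58 = percentile_rank(ages, 58)   # x = 18, y = 1
rank_25 = percentile_rank(ages, 25)   # x = 3, y = 1
print(rank_58, rank_25)
```

With 18 ages below 58 and one equal to it, \(p=(18+0.5)/29\times 100\approx 63.8\), so 58 sits at about the 64\(^{th}\) percentile.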

Locations: \(M\) (median) and Quartiles

For finding the median and quartiles (\(Q1\) and \(Q3\); 25\(^{th}\) and 75\(^{th}\) percentiles, respectively), the same formulas from before can be applied

\[i=\frac{k}{100}(n+1)\]

Sleep Survey Data

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results are given below.

Find the 80\(^{th}\) percentile
Find the 90\(^{th}\) percentile
Find the median, \(M\)
Find the first quartile. What is another name for the first quartile?

80\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{80}{100}(50+1)=40.8\]
90\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{90}{100}(50+1)=45.9\]
\(M=50^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{50}{100}(50+1)=25.5\]
\(Q1=25^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{25}{100}(50+1)=12.75\]

The 80\(^{th}\) percentile is the 40.8\(^{th}\) data point (which will be the average of the 40\(^{th}\) and 41\(^{st}\) data points). The 40\(^{th}\) and 41\(^{st}\) data points are 8 and 9, with their average being 8.5.

The 90\(^{th}\) percentile is the 45.9\(^{th}\) data point (which will be the average of the 45\(^{th}\) and 46\(^{th}\) data points). The 45\(^{th}\) and 46\(^{th}\) data points are 9 and 9, with their average being 9.

The median, \(M\), which is also the 50\(^{th}\) percentile, is the 25.5\(^{th}\) data point (which will be the average of the 25\(^{th}\) and 26\(^{th}\) data points). The 25\(^{th}\) and 26\(^{th}\) data points are 7 and 7, with their average being 7.

The first quartile, \(Q1\), which is the 25\(^{th}\) percentile, is the 12.75\(^{th}\) data point (which will be the average of the 12\(^{th}\) and 13\(^{th}\) data points). The 12\(^{th}\) and 13\(^{th}\) data points are 6 and 6, with their average being 6.

NOTE: This is one of many ways to calculate percentiles. This method is not the most accurate, but it is good enough for our purposes.

Interpretation of Percentiles, Quartiles, and Median

NOTE: When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information

Example

On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

Interquartile Range (IQR)

The interquartile range is a number that indicates the spread of the middle half, or the middle 50%, of the data. It is the difference between the third quartile (\(Q3\)) and the first quartile (\(Q1\)).

\[IQR = Q3 - Q1\]
The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is more than \(1.5\times IQR\) below the first quartile or more than \(1.5\times IQR\) above the third quartile. Potential outliers always require further investigation.

NOTE: A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

Salaries IQR example:

First, we have to find \(Q3\) and \(Q1\), but the data need to be ordered from smallest to largest. The salaries are listed below as recorded and then sorted.

 [1]  33000  64500  28000  54000  72000  68500  69000  42000  54000 120000
[11]  40500
 [1]  28000  33000  40500  42000  54000  54000  64500  68500  69000  72000
[11] 120000

With the data sorted, find the positions of \(Q1\) and \(Q3\).

\(Q1=25^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{25}{100}(11+1)=3\]
\(Q3=75^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{75}{100}(11+1)=9\]

The first quartile, \(Q1\), which is the 25\(^{th}\) percentile, is the 3\(^{rd}\) data point, which is \(4.05\times 10^{4}\) (40,500).

The third quartile, \(Q3\), which is the 75\(^{th}\) percentile, is the 9\(^{th}\) data point, which is \(6.9\times 10^{4}\) (69,000).

The Interquartile Range, \(IQR\), is \(IQR=Q3-Q1=6.9\times 10^{4}-4.05\times 10^{4}=2.85\times 10^{4}\)

For outlier detection: any point either \(<Q1-1.5\times IQR\) or \(>Q3+1.5\times IQR\) is a potential outlier. \[1.5(IQR)=1.5\times 2.85\times 10^{4}=4.275\times 10^{4}\]

Lower boundary: \[lower=Q1-1.5\times IQR=4.05\times 10^{4}-4.275\times 10^{4}=-2250\]

Upper boundary: \[upper=Q3+1.5\times IQR=6.9\times 10^{4}+4.275\times 10^{4}=1.1175\times 10^{5}\]
There are no values less than -2250, and there is one value greater than \(1.1175\times 10^{5}\) (111,750): the salary of 120,000, which is a potential outlier.
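The salary calculation can be checked with a short Python sketch. The helper `percentile_value` is a hypothetical name that implements the \(i=\frac{k}{100}(n+1)\) rule used in this section, not a library function.

```python
def percentile_value(sorted_data, k):
    """k-th percentile using i = k/100 * (n + 1), for whole-number k."""
    n = len(sorted_data)
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position i falls outside the data")
    if remainder == 0:
        return sorted_data[whole - 1]
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


# Salaries as recorded, then ordered from smallest to largest.
salaries = sorted([33000, 64500, 28000, 54000, 72000, 68500,
                   69000, 42000, 54000, 120000, 40500])

q1 = percentile_value(salaries, 25)   # 3rd data point
q3 = percentile_value(salaries, 75)   # 9th data point
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the two boundaries is flagged as a potential outlier.
outliers = [s for s in salaries if s < lower or s > upper]
print(q1, q3, iqr, lower, upper, outliers)
```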

The 5# Summary and Boxplots

In this section students will:

  1. Learn the 5# summary
  2. Construct a box-and-whisker plot
  3. Interpret the meaning of a box-and-whisker plot

5# Summary

The 5# summary (five-number summary) consists of five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value.

5# Summary
\(min\)
\(Q1\)
\(M\)
\(Q3\)
\(max\)

Boxplots

Boxplots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data.

A box plot is constructed from the 5# Summary

To construct a box plot, use a horizontal or vertical number line and a rectangular box.

The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The “whiskers” extend from the ends of the box.

The median or second quartile can be between the first and third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.

NOTE: You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers are not extending to the minimum and maximum values. (Most of the plots generated by software will not end the whiskers at the minimum and maximum but with the lower and upper boundaries \(1.5 \times IQR\))

The two whiskers extend from the first quartile to \(Q1-IQR(1.5)\) and from the third quartile to \(Q3+IQR(1.5)\). The median is shown with a line.

NOTE: It is important to start a box plot with a scaled number line. Otherwise the box plot may not be useful.

Boxplot Example

The following data are the heights of 40 students in a statistics class. 59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77

 [1] 59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 68
[26] 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77

Construct a box plot: Minimum value = 59, Maximum value = 77, Q1: First quartile = 64.5, Q2: Second quartile or median= 66, Q3: Third quartile = 70, \(IQR=Q3-Q1=70-64.5=5.5\) and \(IQR(1.5)=5.5(1.5)=8.25\)

Range\(=maximum-minimum=77-59=18\)

Outliers: any point that is either:

\[<Q1-IQR(1.5)\] or \[>Q3+IQR(1.5)\]
The “fences” or boundaries are \(Q1-IQR(1.5)=64.5-8.25=56.25\) and
\(Q3+IQR(1.5)=70+8.25=78.25\). Thus any data point less than 56.25 or greater than 78.25 is to be considered an outlier. The minimum is 59 and the maximum is 77, so this data set has no outliers.
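The five values and the boundaries for the 40 heights can be computed with the sketch below. The helper `percentile_value` is a hypothetical name for the \(i=\frac{k}{100}(n+1)\) rule from the percentile section; other software may return slightly different quartiles.

```python
def percentile_value(sorted_data, k):
    """k-th percentile using i = k/100 * (n + 1), for whole-number k."""
    n = len(sorted_data)
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position i falls outside the data")
    if remainder == 0:
        return sorted_data[whole - 1]
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


# Heights of the 40 students, already sorted.
heights = ([59, 60, 61, 62, 62, 63, 63, 64, 64, 64] + [65] * 9
           + [66, 66, 67, 67, 68, 68, 69] + [70] * 5
           + [71, 71, 72, 72, 73, 74, 74, 75, 77])

q1 = percentile_value(heights, 25)
median = percentile_value(heights, 50)
q3 = percentile_value(heights, 75)
summary = (min(heights), q1, median, q3, max(heights))

iqr = q3 - q1
lower = q1 - 1.5 * iqr     # lower boundary
upper = q3 + 1.5 * iqr     # upper boundary
outliers = [h for h in heights if h < lower or h > upper]

print(summary, (lower, upper), outliers)
```

The upper boundary is measured from \(Q3=70\), giving 78.25, so the tallest student (77) falls inside the boundaries.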

NOTE
Each quarter has approximately 25% of the data
The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), \(66-64.5=1.5\) (second quarter), 70 – 66 = 4 (third quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread
Range = maximum value – the minimum value = 77 – 59 = 18
Interquartile Range: \(IQR = Q3 - Q1 = 70 - 64.5 = 5.5\)
The interval 59–65 contains 19 of the 40 values (47.5%), well over 25% of the data, so it has more data in it than the interval 66 through 70, which contains 12 values (30%)
The middle 50% (middle half) of the data has a range of 5.5 inches

Create a boxplot

  1. Create your scaled axis
  2. Mark where \(Q1\), \(M\), and \(Q3\) are on the axis
  3. Create the box
  4. Mark the outer boundaries
  5. Extend the fences (boundaries) from the box
  6. The median is a line through the box
  7. Mark any outliers (points outside the boundaries)

Boxplot of heights

Measures of Center of the Data

In this section students will:

  1. Define measures of center and determine most appropriate measure
  2. Compute the mean, median, and mode from raw data

Measures of Center

The “center” of a data set is also a way of describing location. The two most widely used measures of the “center” of the data are the mean (average) and the median.

To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts.

NOTE: The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”

The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): \(\overline{x}\). \[\overline{x}=\frac{\sum x_i}{n}\]

The Greek letter \(\mu\) (pronounced “mew”) represents the population mean.

One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.

The median \(M\) is also a center location (the location of the middle most data point). The location of the median and the true value of the median are not the same.

Example Mean and Median

AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows (smallest to largest):

 [1]  3  4  8  8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24 24 25 26 26 27
[26] 27 29 29 31 32 33 33 34 34 35 37 40 44 44 47

\[\overline{x}=\frac{\sum x_i}{n}=\frac{3+4+8+\cdots+44+47}{40}=\frac{943}{40}=23.575\approx 23.6\]
\(M=50^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{50}{100}(40+1)=20.5\]

The median, \(M\), which is the 50\(^{th}\) percentile, is the 20.5\(^{th}\) data point (so average the 20\(^{th}\) and the 21\(^{st}\) data points), which is \(\frac{24+24}{2}=24\).
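Both values can be confirmed with Python's standard `statistics` module. This is only a check of the arithmetic above; for an even number of values, `statistics.median` averages the two middle values, which matches the method used here.

```python
import statistics

# Months survived, from the example above (already sorted).
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18,
          21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33,
          33, 34, 34, 35, 37, 40, 44, 44, 47]

total = sum(months)                  # numerator of the mean
mean = statistics.mean(months)       # total / n
median = statistics.median(months)   # average of 20th and 21st values

print(total, len(months), mean, median)
```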

Measure of Center: Mode

Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.

NOTE
The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.

Mode Example

The number of books checked out from the library from 25 students are as follows: 0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12

Find the mode.
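One way to check an answer is `statistics.multimode` from Python's standard library (available in Python 3.8 and later), which returns every value tied for the highest frequency. The sketch below applies it to the books data and to the two earlier examples.

```python
from statistics import multimode

books = [0, 0, 0, 1, 2, 3, 3, 4, 4, 5, 5, 7, 7, 7, 7, 8, 8, 8, 9,
         10, 10, 11, 11, 12, 12]
scores = [430, 430, 480, 480, 495]
colors = ["red", "red", "red", "green", "green", "yellow",
          "purple", "black", "blue"]

book_modes = multimode(books)     # one mode
score_modes = multimode(scores)   # two modes: bimodal
color_modes = multimode(colors)   # works for qualitative data too

print(book_modes, score_modes, color_modes)
```

The value 7 appears four times in the books data, more often than any other value, so the mode is 7.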

Center Example

Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?

Best measure of center

The median is a better measure of the “center” than the mean because 49 of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
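A quick calculation makes the gap concrete. The sketch below builds the 50 incomes described in the example and compares the two measures using Python's standard `statistics` module.

```python
import statistics

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income, median_income)
```

The mean works out to $129,400, more than four times what 49 of the 50 residents earn, while the median stays at $30,000.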

Skewness and the Mean, Median, and Mode

In this section students will:

  1. Determine the most appropriate measures of center, measure of variation, and measure of position
  2. Interpret the meanings of each measure of center, measure of variation, and measure of position
  3. Recognize basic distribution shapes: uniform, symmetrical, skewed, and bimodal

Symmetry

Histogram of heights

The histogram displays a symmetric distribution of data.

A distribution is symmetric if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other.

In a perfectly symmetric distribution, the mean and the median are the same; in real data that are only roughly symmetric, they are approximately the same.

This example has one mode (unimodal), and the mode is the same as the mean and median.

In a symmetric distribution that has two modes (bimodal), the two modes would be different from the mean and median.

Bimodal

bimodal histogram

Left Skew

Left skewed histogram

A distribution of this type is called skewed to the left because it is pulled out to the left.

Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.

To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode.

Right Skew

Right skewed histogram

A distribution of this type is called skewed to the right because it is pulled out to the right.

Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.

To summarize, if the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.

Measures of the Spread of the Data

In this section students will:

  1. Define measures of variation and determine most appropriate measure
  2. Compute the range, variance, and standard deviation

Measures of Spread

An important characteristic of any set of data is the variation in the data.

In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean.

The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean, on average.

The standard deviation provides a measure of the overall variation in a data set

The standard deviation is always positive or zero.

The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread.

The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

The standard deviation can be used to determine whether a data value is close to or far from the mean.

A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average.

Considering data to be far from the mean if it is more than two standard deviations away is more of an approximate “rule of thumb” than a rigid rule.

In general, the shape of the distribution of the data affects how much of the data is further away than two standard deviations. (You will learn more about this in later chapters.)

Calculating the Standard Deviation

The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample.

The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample.

The lower case letter \(s\) represents the sample standard deviation and the Greek letter \(\sigma\) (sigma, lower case) represents the population standard deviation.

If the sample has the same characteristics as the population, then \(s\) should be a good estimate of \(\sigma\).

Calculating the Variance

To calculate the standard deviation, we need to calculate the variance first.

The variance is the average of the squares of the deviations (the \(x-\overline{x}\) values for a sample, or the \(x-\mu\) values for a population). For a sample, the sum of the squared deviations is divided by \(n-1\) rather than \(n\).

The symbol \(\sigma^2\) represents the population variance; the population standard deviation \(\sigma\) is the square root of the population variance.

The symbol \(s^2\) represents the sample variance; the sample standard deviation \(s\) is the square root of the sample variance.

You can think of the standard deviation as a special average of the deviations.

Variance: \[s^2=\frac{\sum (x_i-\overline{x})^2}{n-1}\]
Standard deviation: \[s=\sqrt{\frac{\sum (x_i-\overline{x})^2}{n-1}}=\sqrt{s^2}\]

Example of Variance and Standard Deviation

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a sample of \(n=20\) fifth grade students. The ages are rounded to the nearest half year: 9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5

The average age is 10.53 years, rounded to two places. Calculate the variance and standard deviation
Variance: \[s^2=\frac{\sum (x_i-\overline{x})^2}{n-1}\]
\[s^2=\frac{(9-10.53)^2+(9.5-10.53)^2+\cdots+(11.5-10.53)^2}{20-1}=\frac{9.7375}{19}=0.5125\]
Standard deviation: \[s=\sqrt{\frac{\sum (x_i-\overline{x})^2}{n-1}}=\sqrt{s^2}\]
\[s=\sqrt{s^2}=\sqrt{0.5125}=0.7158911 \approx 0.72\]
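The same numbers can be reproduced step by step in Python. The sketch below applies the sample formulas directly and then cross-checks them against `statistics.variance` and `statistics.stdev` from the standard library, both of which also divide by \(n-1\).

```python
import math
import statistics

# Ages of the 20 fifth grade students.
ages = ([9, 9.5, 9.5] + [10] * 4 + [10.5] * 4 + [11] * 6 + [11.5] * 3)

n = len(ages)
mean = sum(ages) / n                                # 10.525
squared_devs = sum((x - mean) ** 2 for x in ages)   # numerator of s^2
variance = squared_devs / (n - 1)                   # sample variance
std_dev = math.sqrt(variance)                       # sample std. deviation

print(round(mean, 2), round(variance, 4), round(std_dev, 2))

# Cross-check against the standard library.
lib_variance = statistics.variance(ages)
lib_std_dev = statistics.stdev(ages)
```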

Bivariate Data

When two variables are measured on a single experimental unit, the resulting data are called bivariate data. With bivariate data, the goal is usually to describe the relationship between the two variables.

When both variables are quantitative, a graphing tool used is the scatterplot (an \(x-y\) plot from algebra days). The distribution of the points in this plot may take one of many different forms, and, based on what is seen in the scatterplot, one or more possible relationships may be indicated.

No relationship between \(x\) and \(y\) would be indicated by a general scattering of the points with no apparent pattern in the plot. The simplest relationship between \(x\) and \(y\) would be one in which the values of \(y\) increase or decrease linearly with the values of \(x\).

Scatterplots Eruptions and Decagon

Old Faithful plot: \(y\) increases as \(x\) increases
Decagon plot: \(y\) decreases as \(x\) increases

Correlation

To determine the strength of the relationship between two quantitative variables, we use a measure called correlation.

Definition: Correlation is a calculation that measures the strength and direction (positive or negative) of the linear relationship between two quantitative variables, \(x\) and \(y\).

Correlation \(\neq\) causation

It is extremely important to note that just because two variables have a mathematical correlation IT DOES NOT MEAN \(X\) CAUSES \(Y\)! To establish actual causation, repeatable experimentation must be done.

Correlation logistics

Correlation interpretations:
\(|r|\geq 0.8\): strong
\(0.6\leq |r|<0.8\): moderate
\(0.4\leq |r|<0.6\): fair
\(0\le |r|<0.4\): weak

\[r=\frac{1}{n-1}\sum{\frac{(x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}}=\frac{s_{xy}}{s_x s_y}\]

\[s_{xy}=\frac{\sum x_iy_i-\frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{n-1}\]
\[\sum x_iy_i=x_1y_1+x_2y_2+\cdots+x_ny_n\]
\[\sum x_i=x_1+x_2+\cdots+x_n\]
\[\sum y_i=y_1+y_2+\cdots+y_n\]
\[s_x=\sqrt{\frac{\sum(x_i-\overline{x})^2}{n-1}}\]
\[s_y=\sqrt{\frac{\sum(y_i-\overline{y})^2}{n-1}}\]

Calculate Correlation \(r\)

\[r=\frac{1}{n-1}\sum{\frac{(x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}}=\frac{s_{xy}}{s_x s_y}\]
\[s_{xy}=\frac{\sum x_iy_i-\frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{n-1}\]

Scatterplot of houses

\[s_x=\sqrt{\frac{\sum(x_i-\overline{x})^2}{n-1}}=281.4842\]
\[s_y=\sqrt{\frac{\sum(y_i-\overline{y})^2}{n-1}}=59.7592\]
\[\sum x_iy_i=1360(78.5)+1940(175.7)+\cdots+1480(68.8)=3044383\]

\[\sum x_i=1360+1940+\cdots+1480=20980\]
\[\sum y_i=78.5+175.7+\cdots+68.8=1643.5\]
\[s_{xy}=\frac{3044383-\frac{(20980)(1643.5)}{12}}{11}=15545.2\]
Then \[r=\frac{s_{xy}}{s_xs_y}=\frac{15545.2}{281.4842\times 59.7592}=0.9241\]

Since \(r=0.9241\) is close to 1, we could say that there is a strong, positive, linear relationship between housing price and square footage.
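Because only the summary values for the 12 houses are listed here, the sketch below starts from those sums rather than from the raw data. It uses \(\sum x_iy_i=3044383\), the value that appears in the \(s_{xy}\) computation. The function name `strength` is a hypothetical helper that encodes the interpretation cutoffs given earlier.

```python
# Summary values from the housing example (n = 12 houses).
n = 12
sum_xy = 3044383
sum_x = 20980
sum_y = 1643.5
s_x = 281.4842
s_y = 59.7592

# Sample covariance, then the correlation coefficient.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
r = s_xy / (s_x * s_y)


def strength(r):
    """Label |r| using the cutoffs from the interpretation list."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.6:
        return "moderate"
    if size >= 0.4:
        return "fair"
    return "weak"


direction = "positive" if r > 0 else "negative"
print(round(s_xy, 1), round(r, 4), strength(r), direction)
```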

Empirical Rule (ER)

For data having a distribution that is SYMMETRIC and bell-shaped (the rule will not be valid for non-symmetric distributions), approximately:

68% of observations are within the interval \(\overline{X}\pm 1s\)
95% of observations are within the interval \(\overline{X}\pm 2s\)
99.7% of observations are within the interval \(\overline{X}\pm 3s\)

Empirical Rule

ER example

Example: \(\overline{X}=15\), \(s=2\)

68% of observations are within the interval \[\overline{X}\pm 1s=15 \pm 2=(13,17)\]

95% of observations are within the interval \[\overline{X}\pm 2s=15\pm 2(2)=(11,19)\]

99.7% of observations are within the interval \[\overline{X}\pm 3s=15 \pm 3(2)=(9,21)\]
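The three intervals follow the same pattern, so they can be generated in a loop. The function name `empirical_intervals` is a hypothetical helper, and the percentages are the approximate ones stated by the rule.

```python
def empirical_intervals(mean, s):
    """Return {percent: (low, high)} for 1, 2, and 3 standard deviations."""
    return {percent: (mean - k * s, mean + k * s)
            for k, percent in ((1, 68), (2, 95), (3, 99.7))}


intervals = empirical_intervals(15, 2)
for percent, (low, high) in intervals.items():
    print(f"about {percent}% of observations lie in ({low}, {high})")
```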