In this section students will:
One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small (\(n<30\)).
To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit.
For example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three.
Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.
The stemplot is a quick way to graph data and gives an exact picture of the data.
You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value.
When you graph an outlier, it will appear not to fit the pattern of the graph.
Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.
For Susan Dean’s spring pre-calculus class, scores for the first exam were as follows (smallest to largest): 33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100.
Pre-Calculus Spring First Exam Scores
The decimal point is 1 digit(s) to the right of the |
3 | 3
4 | 299
5 | 355
6 | 1378899
7 | 2348
8 | 03888
9 | 0244446
10 | 0
The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26% (8/31) were in the 90s or 100, a fairly high number of As.
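The stem-and-leaf construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `stemplot` is hypothetical, and it assumes non-negative whole numbers, with the last digit as the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem-and-leaf rows for non-negative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 23 -> stem 2, leaf 3
        leaves[stem].append(leaf)
    rows = []
    # List every stem from the smallest to the largest, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {digits}")
    return rows

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

for row in stemplot(scores):
    print(row)
```

Because the values are sorted first, the leaves on each stem come out in increasing order, as the construction requires.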
A side-by-side stem-and-leaf plot allows a comparison of two data sets in two columns. In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the stems. The previous slide had the data sets of the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using these data.
_______________________________________________
1 | 2: represents 12, leaf unit: 1
pres.in pres.out
_______________________________________________
2 32| 4* |
9 9987776| 4. |69 2
(13) 4444422111110| 5* |3 3
(12) 877776665555| 5. |6688 7
10 4421110| 6* |003344 13
3 985| 6. |567778 19
| 7* |0111234 (7)
| 7. |7889 13
| 8* |013 9
| 8. |58 6
| 9* |0033 4
_______________________________________________
n: 44 39
_______________________________________________
Another type of graph that is useful for specific data values is a line graph.
The x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points.
The frequency points are connected using line segments.
In a survey, 50 teenagers aged 13 to 17 indicated how many days per week they play video games.
| Days/Week | Frequency |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 8 |
| 4 | 7 |
| 5 | 14 |
| 6 | 7 |
| 7 | 9 |
Bar graphs consist of bars that are separated from each other.
The bars can be rectangles or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal.
Example: By the end of 2011, Facebook had over 146 million users in the United States. Table 2.9 shows three age groups, the number of users in each age group, and the proportion (%) of users in each age group.
A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis.
The horizontal axis is labeled with what the data represents (for instance, distance from your home to school).
The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability).
The graph will have the same shape with either label.
The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data.
To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity.
Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places.
For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 – 0.5 = 1.5).
Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire).
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data, since height is measured. 60, 60.5, 61, 61, 61.5, 63.5, 63.5, 63.5, 64, 64, 64, 64, 64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5, 69.5, 70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71, 72, 72, 72, 72.5, 72.5, 73, 73.5, 74
The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point. 60 – 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95. The largest value is 74, so 74 + 0.05 = 74.05 is the ending value. Next, calculate the width of each bar or class interval by subtracting the starting point from the ending value and dividing by the number of bars (you must choose the number of bars you desire). Suppose you choose eight bars. \[\frac{74.05-59.95}{8}=1.7625\approx 1.76\]
NOTE
We will round up to two and make each bar or class interval two units
wide. Rounding up to two is one way to prevent a value from falling on a
boundary. Rounding to the next number is often necessary even if it goes
against the standard rules of rounding. For this example, using 1.76 as
the width would also work. A guideline that is followed by some for the
number of bars or class intervals is to take the square root of the
number of data values and then round to the nearest whole number, if
necessary. For example, if there are 150 values of data, take the square
root of 150 and round to 12 bars or intervals.
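As a quick numerical check of the class-width arithmetic and the square-root guideline, here is a minimal Python sketch. The variable names are illustrative only.

```python
import math

start, end, bars = 59.95, 74.05, 8

# Width before rounding: (ending value - starting point) / number of bars.
width = (end - start) / bars

# Round up to the next whole number, as the note suggests.
rounded_width = math.ceil(width)

# Square-root guideline for the number of bars with 150 data values.
suggested_bars = round(math.sqrt(150))

print(round(width, 2), rounded_width, suggested_bars)
```

The unrounded width is about 1.76, the rounded width is 2, and the guideline suggests 12 bars for 150 values.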
Create the boundaries by adding the width to the starting point and so on to create the total number of classes needed.
The boundaries are:
59.95
59.95 + 2 = 61.95
61.95 + 2 = 63.95
63.95 + 2 = 65.95
65.95 + 2 = 67.95
67.95 + 2 = 69.95
69.95 + 2 = 71.95
71.95 + 2 = 73.95
73.95 + 2 = 75.95
Count the frequency in each class to draw your histogram.
| Class | Frequency | Relative Frequency |
|---|---|---|
| 59.95-61.95 | 5 | 0.05 |
| 61.95-63.95 | 3 | 0.03 |
| 63.95-65.95 | 15 | 0.15 |
| 65.95-67.95 | 40 | 0.4 |
| 67.95-69.95 | 17 | 0.17 |
| 69.95-71.95 | 12 | 0.12 |
| 71.95-73.95 | 7 | 0.07 |
| 73.95-75.95 | 1 | 0.01 |
The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.
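The class counts in the table can be reproduced with a short Python sketch. To keep the listing compact, the 100 heights are stored here as a hypothetical `height_counts` dictionary (value mapped to how often it occurs) rather than as the full list; the starting point 59.95 and width 2 are the ones chosen above.

```python
# Heights in inches, stored as value -> number of players with that height.
height_counts = {
    60: 1, 60.5: 1, 61: 2, 61.5: 1, 63.5: 3, 64: 7, 64.5: 8,
    66: 10, 66.5: 11, 67: 12, 67.5: 7, 68: 2, 69: 10, 69.5: 5,
    70: 6, 70.5: 3, 71: 3, 72: 3, 72.5: 2, 73: 1, 73.5: 1, 74: 1,
}
heights = [h for h, count in height_counts.items() for _ in range(count)]

start, width, n_classes = 59.95, 2, 8

# Class k covers start + k*width up to start + (k+1)*width.
freq = [0] * n_classes
for h in heights:
    freq[int((h - start) // width)] += 1

rel_freq = [f / len(heights) for f in freq]

for k in range(n_classes):
    lower = start + k * width
    print(f"{lower:.2f}-{lower + width:.2f}: {freq[k]:3d}  {rel_freq[k]:.2f}")
```

Because the boundaries carry one more decimal place than the data, no height falls exactly on a boundary, so the integer division assigns every value to a single class.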
The following histogram displays the heights on the x-axis and relative frequency on the y-axis.
To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system.
The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity.
The points on the graph are typically connected by straight lines in the order in which they occur.
It is essentially a line plot with time as the “independent” variable.
Time series graphs are important tools in various applications of statistics.
When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern.
However, once the same data points are displayed graphically, some features jump out; time series graphs make trends easy to spot.
Contour plot (not required for this course)
The median is a number that measures the center of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values.
It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. The median, \(M\), is called both the second quartile and the 50\(^{th}\) percentile.
The common measures of location are quartiles and percentiles.
Percentiles divide ordered data into hundredths. To score in the 90\(^{th}\) percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.
To calculate percentiles, the data must be ordered from smallest to largest.
Quartiles divide ordered data into quarters. The first quartile, \(Q1\), is the middle value of the lower half of the data, and the third quartile, \(Q3\), is the middle value of the upper half of the data.
The first quartile, \(Q1\), is the same as the 25\(^{th}\) percentile, and the third quartile, \(Q3\), is the same as the 75\(^{th}\) percentile. Quartiles may or may not be part of the data.
To calculate quartiles, the data must be ordered from smallest to largest.
\(k\): the \(k^{th}\) percentile. It may or may not be part of the data.
\(i\): the index (ranking or position of a data value)
\(n\): the total number of data points
Order the data from smallest to largest. Calculate \[i=\frac{k}{100}(n+1)\]
If \(i\) is an integer, then the \(k^{th}\) percentile is the data value in the \(i^{th}\) position in the ordered set of data.
If \(i\) is not an integer, then round \(i\) up and round \(i\) down to the nearest integers. Average the two data values in these two positions in the ordered data set.
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest. Calculate the 20\(^{th}\) and 55\(^{th}\) percentiles.
[1] 18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72
[26] 73 74 76 77
20\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{20}{100}(29+1)=6\]
55\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{55}{100}(29+1)=16.5\]
The 20\(^{th}\) percentile is the 6\(^{th}\) data point and the 55\(^{th}\) percentile is the 16.5\(^{th}\) data point (which will be the average of the 16\(^{th}\) and 17\(^{th}\) data points).
The 6\(^{th}\) data point is 27, and the 16\(^{th}\) and 17\(^{th}\) data points are 52 and 55, with their average being 53.5.
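The position rule \(i=\frac{k}{100}(n+1)\) can be sketched in Python as follows. The function name `percentile` is hypothetical, \(k\) is assumed to be a whole number, and clamping the positions to the range 1 to \(n\) is an added safeguard that the procedure above does not mention.

```python
def percentile(sorted_data, k):
    """k-th percentile of ordered data using the position i = k/100 * (n + 1)."""
    n = len(sorted_data)
    numerator = k * (n + 1)          # i = numerator / 100, kept exact with integers
    lo = numerator // 100            # i rounded down
    hi = -(-numerator // 100)        # i rounded up
    # Assumption: keep positions inside 1..n for very small or very large k.
    lo = min(max(lo, 1), n)
    hi = min(max(hi, 1), n)
    # When i is a whole number, lo == hi and the average is that data value.
    return (sorted_data[lo - 1] + sorted_data[hi - 1]) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile(ages, 20))   # position 6
print(percentile(ages, 55))   # position 16.5
```

Integer arithmetic is used for the position because a floating-point product such as `0.2 * 30` can land slightly above 6 and be rounded up to the wrong position.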
Order the data from smallest to largest.
\(x\): the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile
\(y\): the number of data values equal to the data value for which you want to find the percentile
\(n\): the total number of data points
\(p\): the percentile
Calculate \[p=\frac{x+0.5y}{n}\times 100\]
Then round to the nearest integer.
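A minimal Python sketch of this formula, applied to the 29 Academy Award ages listed earlier. The function name `percentile_of_value` and the choice of 58 as the value to look up are illustrative assumptions.

```python
def percentile_of_value(sorted_data, value):
    """Percentile of a data value: p = (x + 0.5*y) / n * 100, rounded."""
    x = sum(1 for v in sorted_data if v < value)    # values below the given value
    y = sum(1 for v in sorted_data if v == value)   # values equal to the given value
    n = len(sorted_data)
    return round((x + 0.5 * y) / n * 100)

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

# For 58: x = 18, y = 1, n = 29, so p = 18.5 / 29 * 100, about 63.8.
print(percentile_of_value(ages, 58))
```

Note that Python's `round` rounds exact halves to the nearest even integer, which could differ from schoolbook rounding when \(p\) ends in exactly .5.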
For finding the median and quartiles (\(Q1\) and \(Q3\); 25\(^{th}\) and 75\(^{th}\) percentiles, respectively), the same formulas from before can be applied
\[i=\frac{k}{100}(n+1)\]
Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results are given below.
Find the 80\(^{th}\) percentile
Find the 90\(^{th}\) percentile
Find the median, \(M\)
Find the first quartile. What is another name for the first quartile?
80\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{80}{100}(50+1)=40.8\]
90\(^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{90}{100}(50+1)=45.9\]
\(M=50^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{50}{100}(50+1)=25.5\]
\(Q1=25^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{25}{100}(50+1)=12.75\]
The 80\(^{th}\) percentile is the 40.8\(^{th}\) data point (which will be the average between the 40\(^{th}\) and 41\(^{st}\) data points). The 40\(^{th}\) and 41\(^{st}\) data points are 8 and 9 with their average being 8.5.
The 90\(^{th}\) percentile is the 45.9\(^{th}\) data point (which will be the average between the 45\(^{th}\) and 46\(^{th}\) data points). The 45\(^{th}\) and 46\(^{th}\) data points are 9 and 9 with their average being 9
The median, \(M\), which is also the 50\(^{th}\) percentile is the 25.5\(^{th}\) data point (which will be the average between the 25\(^{th}\) and 26\(^{th}\) data points). The 25\(^{th}\) and 26\(^{th}\) data points are 7 and 7 with their average being 7
The first quartile, \(Q1\), which is the 25\(^{th}\) percentile is the 12.75\(^{th}\) data point (which will be the average between the 12\(^{th}\) and 13\(^{th}\) data points). The 12\(^{th}\) and 13\(^{th}\) data points are 6 and 6 with their average being 6.
NOTE: This is one of many ways to calculate percentiles. This method is not the most accurate, but it is good enough for our purposes.
NOTE: When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information
Example
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (\(Q3\)) and the first quartile (\(Q1\)).
\[IQR = Q3 - Q1\]
The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is more than \(1.5\times IQR\) below the first quartile or more than \(1.5\times IQR\) above the third quartile. Potential outliers always require further investigation.
NOTE: A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.
Salaries IQR example:
To find \(Q1\) and \(Q3\), first order the data from smallest to largest.
[1] 33000 64500 28000 54000 72000 68500 69000 42000 54000 120000
[11] 40500
[1] 28000 33000 40500 42000 54000 54000 64500 68500 69000 72000
[11] 120000
Next, find the positions of \(Q1\) and \(Q3\) in the ordered data.
\(Q1=25^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{25}{100}(11+1)=3\]
\(Q3=75^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{75}{100}(11+1)=9\]
The first quartile, \(Q1\), which is the 25\(^{th}\) percentile, is the 3\(^{rd}\) data point, which is \(4.05\times 10^{4}\).
The third quartile, \(Q3\), which is the 75\(^{th}\) percentile, is the 9\(^{th}\) data point, which is \(6.9\times 10^{4}\).
The Interquartile Range, \(IQR\), is \(IQR=Q3-Q1=6.9\times 10^{4}-4.05\times 10^{4}=2.85\times 10^{4}\)
For outlier detection: any point either \(<Q1-1.5\times IQR\) or \(>Q3+1.5\times IQR\) will be an outlier \[1.5(IQR)=1.5\times 2.85\times 10^{4}=4.275\times 10^{4}\]
Lower boundary: \[lower=Q1-1.5\times IQR=-2250\]
Upper boundary: \[upper=Q3+1.5\times IQR=1.1175\times 10^{5}\]
There are no values less than -2250 and there is one value that is more than \(1.1175\times 10^{5}\), which is 120,000.
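The salary calculation can be retraced with a short Python sketch. It relies on the fact that, with \(n=11\), the positions for \(Q1\) and \(Q3\) are whole numbers, so no averaging is needed; the variable names are illustrative.

```python
salaries = [33000, 64500, 28000, 54000, 72000, 68500, 69000,
            42000, 54000, 120000, 40500]

ordered = sorted(salaries)
n = len(ordered)

# i = k/100 * (n + 1) gives position 3 for Q1 and position 9 for Q3 when n = 11.
q1 = ordered[25 * (n + 1) // 100 - 1]
q3 = ordered[75 * (n + 1) // 100 - 1]
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [s for s in ordered if s < lower or s > upper]

print(q1, q3, iqr, lower, upper, outliers)
```

For data sets where the position is not a whole number, the two neighboring values would have to be averaged, as described in the percentile procedure.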
The five-number summary consists of five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value.
Five-Number Summary
\(min\)
\(Q1\)
\(M\)
\(Q3\)
\(max\)
Boxplots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data.
A box plot is constructed from the five-number summary.
To construct a box plot, use a horizontal or vertical number line and a rectangular box.
The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The “whiskers” extend from the ends of the box.
The median or second quartile can be between the first and third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.
NOTE: You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers do not extend to the minimum and maximum values. (Most plots generated by software end the whiskers not at the minimum and maximum but at the most extreme data values that lie within the lower and upper boundaries, \(Q1-1.5\times IQR\) and \(Q3+1.5\times IQR\).)
In such plots, the two whiskers extend from the first quartile down toward \(Q1-IQR(1.5)\) and from the third quartile up toward \(Q3+IQR(1.5)\), stopping at the last data value inside each boundary. The median is shown with a line.
NOTE: It is important to start a box plot with a scaled number line. Otherwise the box plot may not be useful.
The following data are the heights of 40 students in a statistics class. 59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77
[1] 59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 68
[26] 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77
Construct a box plot: Minimum value = 59, Maximum value = 77, Q1: First quartile = 64.5, Q2: Second quartile or median= 66, Q3: Third quartile = 70, \(IQR=Q3-Q1=70-64.5=5.5\) and \(IQR(1.5)=5.5(1.5)=8.25\)
Range\(=maximum-minimum=77-59=18\)
Outliers: any point that is either \(<Q1-IQR(1.5)\) or \(>Q3+IQR(1.5)\).
The “fences” or boundaries are \(Q1-IQR(1.5)=64.5-8.25=56.25\) and \(Q3+IQR(1.5)=70+8.25=78.25\). Thus any data point less than 56.25 or greater than 78.25 is considered an outlier. The smallest value is 59 and the largest is 77, so this data set has no outliers.
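The five-number summary and the fences for the 40 heights can be recomputed with the position rule \(i=\frac{k}{100}(n+1)\). The sketch below is an illustration only; the helper name `position_percentile` is hypothetical and it assumes whole-number \(k\) with positions that stay inside the data.

```python
def position_percentile(sorted_data, k):
    """Average the data values at i rounded down and up, i = k/100 * (n + 1)."""
    numerator = k * (len(sorted_data) + 1)
    lo = numerator // 100
    hi = -(-numerator // 100)
    return (sorted_data[lo - 1] + sorted_data[hi - 1]) / 2

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65,
           65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71,
           71, 72, 72, 73, 74, 74, 75, 77]

minimum, maximum = heights[0], heights[-1]
q1 = position_percentile(heights, 25)
med = position_percentile(heights, 50)
q3 = position_percentile(heights, 75)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [h for h in heights if h < lower_fence or h > upper_fence]

print(minimum, q1, med, q3, maximum)
print(lower_fence, upper_fence, outliers)
```

Every height lies between the two fences, so the list of outliers is empty and the whiskers of the box plot reach the minimum and maximum.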
NOTE
Each quarter has approximately 25% of the data.
The spreads of the four quarters are \(64.5-59=5.5\) (first quarter), \(66-64.5=1.5\) (second quarter), \(70-66=4\) (third quarter), and \(77-70=7\) (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.
Range = maximum value – minimum value = 77 – 59 = 18
Interquartile Range: \(IQR = Q3 - Q1 = 70 - 64.5 = 5.5\)
The interval 59–65 has more than 25% of the data (it covers the first quarter and part of the second), so it has more data in it than the interval 66 through 70, which spans the third quarter (about 25% of the data).
The middle 50% (middle half) of the data has a range of 5.5 inches.
Create a boxplot
The “center” of a data set is also a way of describing location. The two most widely used measures of the “center” of the data are the mean (average) and the median.
To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts.
NOTE: The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”
The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): \(\overline{x}\). \[\overline{x}=\frac{\sum x_i}{n}\]
The Greek letter \(\mu\) (pronounced “mew”) represents the population mean.
One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.
The median \(M\) is also a center location (the location of the middle most data point). The location of the median and the true value of the median are not the same.
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows (smallest to largest):
[1] 3 4 8 8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24 24 25 26 26 27
[26] 27 29 29 31 32 33 33 34 34 35 37 40 44 44 47
\[\overline{x}=\frac{\sum
x_i}{n}=\frac{3+4+8+\cdots+44+47}{40}=\frac{943}{40}=23.575\approx
23.6\]
\(M=50^{th}\) percentile: \[i=\frac{k}{100}(n+1)=\frac{50}{100}(40+1)=20.5\]
The median, \(M\), which is the 50\(^{th}\) percentile, is the 20.5\(^{th}\) data point (so average the 20\(^{th}\) and the 21\(^{st}\) data points), which is \(\frac{24+24}{2}=24\).
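The mean and median of the survival times can be checked with Python's standard `statistics` module. For an even number of values, `statistics.median` averages the two middle values, which matches the position method used here.

```python
from statistics import mean, median

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21,
          22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34,
          34, 35, 37, 40, 44, 44, 47]

total = sum(months)        # numerator of the sample mean
x_bar = mean(months)       # total / n
m = median(months)         # average of the 20th and 21st ordered values

print(len(months), total, x_bar, m)
```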
Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.
Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.
NOTE
The mode can be calculated for qualitative data as well as for
quantitative data. For example, if the data set is: red, red, red,
green, green, yellow, purple, black, blue, the mode is red.
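Both mode examples can be reproduced with `statistics.multimode` (available in Python 3.8 and later), which returns every value that shares the highest frequency. The variable names are illustrative.

```python
from statistics import multimode

exam_scores = [430, 430, 480, 480, 495]
colors = ["red", "red", "red", "green", "green",
          "yellow", "purple", "black", "blue"]

score_modes = multimode(exam_scores)   # two values tie for most frequent
color_modes = multimode(colors)        # works for qualitative data too

print(score_modes, color_modes)
```

A list with two entries signals a bimodal data set; a list with one entry signals a single mode.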
The numbers of books checked out from the library by 25 students are as follows: 0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12
Find the mode.
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?
The median is a better measure of the “center” than the mean because 49 of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
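To see how strongly the single large income pulls the mean, the two measures can be computed directly. This is a small illustrative sketch using the incomes described above.

```python
from statistics import mean, median

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

town_mean = mean(incomes)       # (5,000,000 + 49 * 30,000) / 50
town_median = median(incomes)   # middle of the ordered incomes

print(town_mean, town_median)
```

The mean is more than four times the income that 49 of the 50 residents actually earn, while the median equals that income.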
The histogram displays a symmetric distribution of data.
A distribution is symmetric if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other.
In a perfectly symmetric distribution, the mean and the median are the same.
This example has one mode (unimodal), and the mode is the same as the mean and median.
In a symmetric distribution that has two modes (bimodal), the two modes would be different from the mean and median.
A distribution of this type is called skewed to the left because it is pulled out to the left.
Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.
To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode.
A distribution of this type is called skewed to the right because it is pulled out to the right.
Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.
To summarize, if the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.
An important characteristic of any set of data is the variation in the data.
In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean.
The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean, on average.
The standard deviation provides a measure of the overall variation in a data set
The standard deviation is always positive or zero.
The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread.
The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
The standard deviation can be used to determine whether a data value is close to or far from the mean.
A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average.
Considering data to be far from the mean if it is more than two standard deviations away is more of an approximate “rule of thumb” than a rigid rule.
In general, the shape of the distribution of the data affects how much of the data is further away than two standard deviations. (You will learn more about this in later chapters.)
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample.
The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample.
The lower case letter \(s\) represents the sample standard deviation and the Greek letter \(\sigma\) (sigma, lower case) represents the population standard deviation.
If the sample has the same characteristics as the population, then \(s\) should be a good estimate of \(\sigma\).
To calculate the standard deviation, we need to calculate the variance first.
The variance is the average of the squares of the deviations (the \(x-\overline{x}\) values for a sample, or the \(x-\mu\) values for a population)
The symbol \(\sigma^2\) represents the population variance; the population standard deviation \(\sigma\) is the square root of the population variance.
The symbol \(s^2\) represents the sample variance; the sample standard deviation \(s\) is the square root of the sample variance.
You can think of the standard deviation as a special average of the deviations.
Variance: \[s^2=\frac{\sum (x_i-\overline{x})^2}{n-1}\]
Standard deviation: \[s=\sqrt{\frac{\sum (x_i-\overline{x})^2}{n-1}}=\sqrt{s^2}\]
In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a sample of \(n=20\) fifth grade students. The ages are rounded to the nearest half year: 9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5
The average age is 10.53 years, rounded to two places. Calculate the variance and standard deviation.
Variance: \[s^2=\frac{\sum (x_i-\overline{x})^2}{n-1}\]
\[s^2=\frac{(9-10.53)^2+(9.5-10.53)^2+\cdots+(11.5-10.53)^2}{20-1}=\frac{9.7375}{19}=0.5125\]
Standard deviation: \[s=\sqrt{\frac{\sum (x_i-\overline{x})^2}{n-1}}=\sqrt{s^2}\]
\[s=\sqrt{s^2}=\sqrt{0.5125}=0.7158911 \approx 0.72\]
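The sample variance and standard deviation of the 20 ages can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by \(n-1\). The sketch also computes the sum of squared deviations by hand, using the unrounded mean.

```python
from statistics import mean, stdev, variance

ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
        11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

x_bar = mean(ages)                                  # unrounded mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in ages)    # numerator of s^2
s2 = variance(ages)                                 # sample variance, n - 1 divisor
s = stdev(ages)                                     # square root of s2

print(x_bar, round(sum_sq_dev, 4), round(s2, 4), round(s, 2))
```

The numerator 9.7375 is obtained with the unrounded mean 10.525; the rounded value 10.53 in the formula is shown for readability.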
When two variables are measured on a single experimental unit, the resulting data are called bivariate data. With bivariate data, the goal is usually to describe the relationship between the two variables.
When both variables are quantitative, a graphing tool used is the scatterplot (an \(x-y\) plot from algebra days). The distribution of the points in this plot may take one of many different forms, and, based on what is seen in the scatterplot, one or more possible relationships may be indicated.
No relationship between \(x\) and \(y\) would be indicated by a general scattering of the points with no apparent pattern in the plot. The simplest relationship between \(x\) and \(y\) would be one in which the values of \(y\) increase or decrease linearly with the values of \(x\).
Old Faithful plot: \(y\) increases as \(x\) increases
Decagon plot: \(y\) decreases as \(x\) increases
To determine the strength of the relationship between two quantitative variables, we use a measure called correlation.
Definition: Correlation is a calculation that measures the strength and direction (positive or negative) of the linear relationship between two quantitative variables, \(x\) and \(y\).
Correlation \(\neq\) causation
It is extremely important to note that just because two variables have a mathematical correlation, IT DOES NOT MEAN \(X\) CAUSES \(Y\)! To establish actual causation, repeatable experimentation must be done.
Correlation interpretations:
\(|r|\geq 0.8\): strong
\(0.6\leq |r|<0.8\): moderate
\(0.4\leq |r|<0.6\): fair
\(0\le |r|<0.4\): weak
\[r=\frac{1}{n-1}\sum{\frac{(x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}}=\frac{s_{xy}}{s_x s_y}\]
\[s_{xy}=\frac{\sum x_iy_i-\frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{n-1}\]
\[\sum x_iy_i=x_1y_1+x_2y_2+\cdots+x_ny_n\]
\[\sum x_i=x_1+x_2+\cdots+x_n\]
\[\sum y_i=y_1+y_2+\cdots+y_n\]
\[s_x=\sqrt{\frac{\sum(x_i-\overline{x})^2}{n-1}}\]
\[s_y=\sqrt{\frac{\sum(y_i-\overline{y})^2}{n-1}}\]
\[s_x=\sqrt{\frac{\sum(x_i-\overline{x})^2}{n-1}}=281.4842\]
\[s_y=\sqrt{\frac{\sum(y_i-\overline{y})^2}{n-1}}=59.7592\]
\[\sum x_iy_i=1360(78.5)+1940(175.7)+\cdots+1480(68.8)=3044383\]
\[\sum x_i=1360+1940+\cdots+1480=20980\]
\[\sum y_i=78.5+175.7+\cdots+68.8=1643.5\]
\[s_{xy}=\frac{3044383-\frac{(20980)(1643.5)}{12}}{11}=15545.2\]
Then \[r=\frac{s_{xy}}{s_xs_y}=\frac{15545.2}{281.4842\times 59.7592}=0.9241\]
Since this is close to 1, we could say that there is a strong, positive, linear relationship between housing price and square footage.
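The correlation arithmetic can be retraced from the summary quantities alone. The sketch below is only a check of the numbers: it takes \(n=12\), \(\sum x_iy_i=3044383\) (the value used in the \(s_{xy}\) calculation), \(\sum x_i\), \(\sum y_i\), \(s_x\), and \(s_y\) as given, and the helper `strength` is a hypothetical function that applies the interpretation thresholds listed earlier.

```python
n = 12
sum_xy = 3044383
sum_x = 20980
sum_y = 1643.5
s_x = 281.4842
s_y = 59.7592

# Sample covariance from the computational formula.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Correlation coefficient.
r = s_xy / (s_x * s_y)

def strength(r):
    """Label |r| using the interpretation thresholds from the notes."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.6:
        return "moderate"
    if size >= 0.4:
        return "fair"
    return "weak"

print(round(s_xy, 1), round(r, 4), strength(r))
```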
For data having a distribution that is SYMMETRIC and bell-shaped (the rule is not valid for non-symmetric distributions), approximately:
68% of observations are within the interval \(\overline{X}\pm 1s\)
95% of observations are within the interval \(\overline{X}\pm 2s\)
99.7% of observations are within the interval \(\overline{X}\pm 3s\)
Example: \(\overline{X}=15\), \(s=2\)
68% of observations are within the interval \[\overline{X}\pm 1s=15 \pm 2=(13,17)\]
95% of observations are within the interval \[\overline{X}\pm 2s=15\pm 2(2)=(11,19)\]
99.7% of observations are within the interval \[\overline{X}\pm 3s=15 \pm 3(2)=(9,21)\]
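The three intervals follow the same pattern, \(\overline{X}\pm ks\) for \(k=1,2,3\), so they can be generated in one step. The function name `empirical_intervals` is hypothetical.

```python
def empirical_intervals(x_bar, s):
    """Return {percent: (low, high)} for the intervals x_bar +/- k*s, k = 1, 2, 3."""
    coverage = [(1, 68), (2, 95), (3, 99.7)]
    return {pct: (x_bar - k * s, x_bar + k * s) for k, pct in coverage}

intervals = empirical_intervals(15, 2)

for pct, (low, high) in intervals.items():
    print(f"{pct}% of observations: ({low}, {high})")
```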