Statistics is a science, not a branch of mathematics, but it uses mathematical models as essential tools.
Like other branches of science, statistics uses math but is not inherently a math class, just as physics uses math for nearly everything yet is not a math class.
Defined simply, statistics is the science of data. Sample information is obtained, and inferences about the population are made from that sample information. We use descriptive statistics (graphs, numerical summaries, etc.) and inferential statistics (formal methods using probabilities), and we design observational and experimental studies using samples of data.
In this section students will:
The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.
In this course, you will learn how to organize and summarize data. Organizing and summarizing data is called descriptive statistics.
Two ways to summarize data are by graphing and by using numbers (for example, finding an average).
After you have studied probability and probability distributions, you will use formal methods for drawing conclusions from “good” data. The formal methods are called inferential statistics.
Statistical inference uses probability to determine how confident we can be that our conclusions are correct.
Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring.
The expected theoretical probability of heads in any one toss is 1/2 or 0.5. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when there are many repetitions.
For example, if you toss a fair coin four times, the outcomes may not be two heads and two tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails.
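This regularity can be seen in a quick simulation. The sketch below is a hypothetical illustration using Python's `random` module (with a fixed seed so the runs are reproducible); the function name `proportion_heads` is our own invention:

```python
import random

def proportion_heads(n_tosses, seed=0):
    """Simulate n_tosses fair coin flips and return the proportion of heads."""
    rng = random.Random(seed)  # seeded so results are reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# With only 4 tosses the proportion can be far from 0.5;
# with 4,000 tosses it settles close to 0.5.
few = proportion_heads(4)
many = proportion_heads(4000)
print(few, many)
```

Running this with other seeds shows the same pattern: the short run jumps around, while the long run stays near one half.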
In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study.
A parameter is a number that is a property of the population. A statistic, computed from sample data, is an estimate of a population parameter.
Since we considered all math classes to be the population, then the average number of points earned per student over all the math classes is an example of a parameter.
To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population.
From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample, and is an estimate of the parameter.
For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic.
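The relationship between a parameter and a statistic can be sketched in code. The example below invents a hypothetical population of 1,000 students' point totals (the numbers 75 and 10 are assumptions for illustration), computes the population mean (the parameter), then draws one sample of 30 students and computes the sample mean (the statistic):

```python
import random

rng = random.Random(42)

# Hypothetical population: points earned by 1,000 students across all math classes.
population = [rng.gauss(75, 10) for _ in range(1000)]
parameter = sum(population) / len(population)    # population mean (a parameter)

# One math class, treated as a sample of 30 students.
sample = rng.sample(population, 30)
statistic = sum(sample) / len(sample)            # sample mean (a statistic)

print(round(parameter, 1), round(statistic, 1))  # the statistic approximates the parameter
```

The two printed numbers differ, but not by much: the statistic estimates the parameter.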
Data are the actual values of the variable. They may be numbers or they may be words. Datum is a single value. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique.
One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population.
The sample must contain the characteristics of the population in order to be a representative sample. We are interested in both the sample statistic and the population parameter in inferential statistics.
A variable, notated by capital letters such as \(X\) and \(Y\), is a characteristic of interest for each person or thing in a population. Variables may be numerical or categorical.
Numerical variables take on values with equal units such as weight in pounds and time in hours.
Categorical variables place the person or thing into a category.
Two words that come up often in statistics are mean and proportion.
NOTE
The words “mean” and “average” are often used interchangeably. The
substitution of one word for the other is common practice. The technical
term is “arithmetic mean,” and “average” is technically a center
location. However, in practice among non-statisticians, “average” is
commonly accepted for “arithmetic mean.”
In this section students will:
Qualitative data are the result of categorizing or
describing attributes of a population. Hair color, blood type,
ethnic group, the car a person drives, and the street a person lives on
are examples of qualitative data.
Qualitative data are generally described by words or letters.
Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data.
Quantitative data may be either discrete or continuous.
All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values.
All data that are the result of measuring are quantitative continuous data, assuming that we can measure accurately.
NOTE
You may collect data as numbers and report them categorically. For example, the quiz scores for each student are recorded throughout the term. At the end of the term, the quiz scores are reported as A, B, C, D, or F.
Example 1.5 Data Sample of Quantitative Discrete Data
The data are the number of books students carry in their backpacks. You
sample five students. Two students carry three books, one student
carries four books, one student carries two books, and one student
carries one book. The numbers of books (three, four, two, and one) are
the quantitative discrete data.
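A quick sketch of tallying these discrete data with Python's `collections.Counter` (the variable names are our own):

```python
from collections import Counter

# Books carried by the five sampled students: two students carry three books,
# one carries four, one carries two, and one carries one.
books = [3, 3, 4, 2, 1]

freq = Counter(books)        # discrete data take on only certain values
print(sorted(freq.items()))  # [(1, 1), (2, 1), (3, 2), (4, 1)]
```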
Example 1.6 Data Sample of Quantitative Continuous Data
The data are the weights of backpacks with books in them. You sample the
same five students. The weights (in pounds) of their backpacks are 6.2,
7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have
different weights. Weights are quantitative continuous data.
Example 1.8 Data Sample of Qualitative Data
The data are the colors of backpacks. Again, you sample the same five
students. One student has a red backpack, two students have black
backpacks, one student has a green backpack, and one student has a gray
backpack. The colors red, black, black, green, and gray are qualitative
data.
Tables are a good way of organizing and displaying data. But graphs can be even more helpful in understanding the data.
There are no strict rules concerning which graphs to use.
Two graphs that are used to display qualitative data are pie charts and bar graphs.
In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category.
In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.
A Pareto chart consists of bars that are sorted by category size (largest to smallest); it is an ordered bar graph.
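The ordering that defines a Pareto chart can be sketched without any plotting library. The backpack-color counts below are hypothetical:

```python
from collections import Counter

# Hypothetical backpack colors for a small qualitative data set.
colors = ["red", "black", "black", "green", "gray", "black", "red"]
counts = Counter(colors)

# A Pareto chart is a bar graph whose categories are sorted
# from the largest count down to the smallest.
pareto_order = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(pareto_order)  # black first (3), then red (2), then green and gray (1 each)
```

Feeding `pareto_order` to any bar-graph routine would draw the bars in Pareto order.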
Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing.
There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample.
The easiest method to describe is called a simple random sample. Any group of \(n\) individuals is as likely to be chosen as any other group of \(n\) individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected. (Another way to think about it is that every individual has the same chance to be chosen for the sample.)
True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once. However, for practical reasons, in most populations, simple random sampling is done without replacement.
In a college population of 10,000 people, suppose you want to pick a sample of 1,000 randomly for a survey.
For any particular sample of 1,000, if you are sampling with replacement,

- the chance of picking the first person is 1,000 out of 10,000
- the chance of picking a different second person for this sample is 999 out of 10,000
- you replace the first person before picking the next person
For any particular sample of 1,000, if you are sampling without replacement,

- the chance of picking the first person for any particular sample is 1,000 out of 10,000
- the chance of picking a different second person is 999 out of 9,999
- you do not replace the first person before picking the next person
Besides simple random sampling, there are other forms of sampling that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.
To choose a stratified sample, divide the population into groups called strata and then take a proportionate number from each stratum.
To choose a 1-stage cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample.
To choose a 2-stage cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. Next randomly select some of the members from these clusters.
To choose a systematic sample, randomly select a starting point and take every \(k^{th}\) piece of data from a listing of the population.
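A systematic sample is easy to sketch in code. The roster below is hypothetical, and the helper name `systematic_sample` is our own:

```python
import random

def systematic_sample(population, k, seed=None):
    """Randomly pick a starting point in [0, k), then take every k-th member."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]

roster = list(range(100))  # hypothetical listing of 100 people
sample = systematic_sample(roster, k=10, seed=1)
print(sample)              # 10 people, evenly spaced through the list
```

Only the starting point is random; after that, the sample marches through the list in fixed steps of \(k\).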
A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available.
When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The actual process of sampling causes sampling errors. For example, the sample may not be large enough.
Factors not related to the sampling process cause nonsampling errors. A defective counting device can cause a nonsampling error.
In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, each member of the population should have an equally likely chance of being chosen).
When a sampling bias happens, there can be incorrect conclusions drawn about the population that is being studied.
Variation is present in any set of data.
For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid.
It was mentioned previously that two or more random samples taken from the same population, each with close to the same characteristics as the population, will likely still differ from each other.
Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).
In this section students will:
The way a set of data is measured is called its level of measurement.
Data can be classified into four levels of measurement. They are (from lowest to highest level):
Data that is measured using a nominal scale is qualitative. Categories, colors, names, labels and favorite foods along with yes or no responses are examples of nominal level data.
Nominal scale data are not ordered. For example, trying to classify people according to their favorite food does not make any sense. Putting pizza first and sushi second is not meaningful.
Smartphone companies are another example of nominal scale data. Some examples are Sony, Motorola, Nokia, Samsung and Apple. This is just a list and there is no agreed upon order. Some people may favor Apple but that is a matter of opinion.
Nominal scale data cannot be used in calculations.
Data that is measured using an ordinal scale is similar to nominal scale data, but the ordinal scale data can be ordered.
An example of ordinal scale data is a list of the top five national parks in the United States. The top five national parks in the United States can be ranked from one to five but we cannot measure differences between the data.
Another example of using the ordinal scale is a cruise survey where the responses to questions about the cruise are “excellent,” “good,” “satisfactory,” and “unsatisfactory.” These responses are ordered from the most desired response to the least desired, but the differences between two pieces of data cannot be measured.
Like the nominal scale data, ordinal scale data cannot be used in calculations.
Data that is measured using the interval scale is similar to ordinal level data because it has a definite ordering, and the differences between interval scale data can be measured. However, interval scale data has no natural zero starting point.
Temperature scales like Celsius (C) and Fahrenheit (F) are measured by using the interval scale. In both temperature measurements, 40\(^{\circ}\) is equal to 100\(^{\circ}\) minus 60\(^{\circ}\). Differences make sense. But 0 degrees does not because, in both scales, 0 is not the absolute lowest temperature. Temperatures like -10\(^{\circ}\) F and -15\(^{\circ}\) C exist and are colder than 0.
Interval level data can be used in calculations, but one type of comparison cannot be done. 80\(^{\circ}\) C is not four times as hot as 20\(^{\circ}\) C (nor is 80\(^{\circ}\) F four times as hot as 20\(^{\circ}\) F). There is no meaning to the ratio of 80 to 20 (or four to one).
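A short computation makes this concrete: converting between Celsius and Fahrenheit preserves differences (up to the constant factor 9/5) but changes ratios, so "four times as hot" is not a scale-independent statement:

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Differences survive the change of scale (scaled by 9/5)...
print(c_to_f(100) - c_to_f(60))  # 72.0, which is (100 - 60) * 9/5

# ...but ratios do not: the "four to one" ratio in Celsius
# becomes about 2.59 to one in Fahrenheit.
print(80 / 20)                   # 4.0 in Celsius
print(c_to_f(80) / c_to_f(20))   # 176/68, about 2.59, in Fahrenheit
```

Because the ratio depends on which interval scale you happen to use, it carries no physical meaning.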
Data that is measured using the ratio scale takes care of the ratio problem and gives you the most information. Ratio scale data is like interval scale data, but it has a 0 point and ratios can be calculated.
For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100 points). The exams are machine-graded. The data can be put in order from lowest to highest: 20, 68, 80, 92.
The differences between the data have meaning. The score 92 is more than the score 68 by 24 points.

Ratios can be calculated. The lowest possible score is 0, so 80 is four times 20: the score of 80 is four times better than the score of 20.
A frequency is the number of times a value of the data occurs. The sum of the values in the frequency column represents the total number of values included in the sample.
A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of data values. Relative frequencies can be written as fractions, percents, or decimals.
Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row.
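These three columns can be built in a few lines of Python. The data values below are hypothetical:

```python
from collections import Counter

data = [2, 3, 3, 4, 4, 4, 5, 5]  # hypothetical data values

total = len(data)
cumulative = 0.0
print("value  freq  rel. freq  cum. rel. freq")
for value, freq in sorted(Counter(data).items()):
    rel = freq / total       # relative frequency: frequency / total count
    cumulative += rel        # running sum of the relative frequencies
    print(f"{value:5}  {freq:4}  {rel:9.3f}  {cumulative:14.3f}")
```

The last entry in the cumulative column comes out to 1 here; with rounded decimals it may only be close to 1, as the note below explains.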
NOTE
Because of rounding, the relative frequency column may not always sum to
one, and the last entry in the cumulative relative frequency column may
not be one. However, they each should be close to one.
The table represents the heights, in inches, of a sample of 100 male semiprofessional soccer players.
From the previous table, find the following answers
Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.
Answer the following questions.
In this section students will:
The purpose of an experiment is to investigate the relationship between two variables.
When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable.
In a randomized experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable.
The different values of the explanatory variable are called treatments. An experimental unit is a single object or individual to be measured.
Additional variables that can cloud a study are called lurking variables. In order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable.
The researcher must design the experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by the random assignment of experimental units to treatment groups.
When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups.
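Random assignment is straightforward to sketch in code. The subject names, group labels, and the helper name `random_assignment` below are all hypothetical:

```python
import random

def random_assignment(units, treatments, seed=None):
    """Randomly assign experimental units to treatment groups of (near-)equal size."""
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)  # shuffling spreads lurking variables across the groups
    # Deal the shuffled units out to the treatment groups in turn.
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

units = [f"subject{i}" for i in range(12)]
groups = random_assignment(units, ["treatment", "placebo"], seed=7)
print({t: len(g) for t, g in groups.items()})  # {'treatment': 6, 'placebo': 6}
```

Because every arrangement of subjects is equally likely after the shuffle, no systematic difference between the groups remains except the planned treatments.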
When participation in a study prompts a physical response from a participant, it is difficult to isolate the effects of the explanatory variable.
To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment: a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments.
Of course, if you are participating in a study and you know that you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized experiment preserves the power of suggestion.
When a person involved in a research study is blinded, they do not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved with the subjects are blinded.
Completely Randomized Design (CRD): one in which each unit has an equal chance to be chosen for the treatment group; all units are as homogeneous (similar) as possible.
Randomized Complete Block Design (RCBD): when the experimental units are not homogeneous (similar) enough, detecting differences among treatment groups can be difficult. When we think the entire group of units is not similar enough, we create groups, called blocks, for comparing \(t\) treatments in \(b\) blocks. The units within the blocks are similar to each other; all blocks see all treatments in random order.
Matched Pairs: when we take measurements on the same unit, usually once before and once after a treatment (examples include weight loss programs, Coke vs. Pepsi, …)
Example (CRD): we want to test new and old fertilizers on tomato plants and will include a control with no fertilizer, just water.

Example (RCBD): the same fertilizer test, but there are inherent differences in the plants from different nurseries, so nursery is included as a block.
The widespread misuse and misrepresentation of statistical information often gives the field a bad name.
Some say that “numbers don’t lie,” but the people who use numbers to support their claims often do.
Many types of statistical fraud are difficult to spot. Some researchers simply stop collecting data once they have just enough to prove what they had hoped to prove.
They don’t want to take the chance that a more extensive study would complicate their lives by producing data contradicting their hypothesis.
Misuse and misinterpretation of statistical information has plagued the field. While data and numbers do not lie, the people who use them can. Even a good person, when faced with a serious situation in life, can be tempted (and give in to it). “The road to Hell is paved with good intentions.”
Many studies that were either falsified or manipulated have probably impacted your life, either directly or indirectly. One was a study about vaccines and autism; the other, conducted while President Truman was in office, was about the effects of fat and sugar on the human body. Both studies were found to be falsified, but not before they had negative impacts.
There are federal laws and regulations for research studies, overseen by the US Department of Health and Human Services, whose primary goal is protecting research participants. Universities and other research institutions are required to have their planned studies approved by committees, usually called Institutional Review Boards.
Learning some basic statistics can help you become more informed, so that you can analyze statistical studies critically.