What is Statistics?

Statistics is a science, not a branch of mathematics, but uses mathematical models as essential tools.1

Like other branches of science, statistics uses math but is not inherently a math class, just as physics uses math [for everything!] but is not a math class.2

Defined, statistics is the science of data. Sample information is obtained and inferences about the population are made from the sample information. We use descriptive statistics (graphs, numerical summaries, etc.) and inferential statistics (formal methods using probabilities). We also design observational and experimental studies using samples of data.

Sampling and Data

In this section students will:

  1. Define the basic vocabulary
  2. Distinguish between population and sample
  3. Distinguish between parameters and statistics

Definitions of Statistics

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.

In this course, you will learn how to organize and summarize data. Organizing and summarizing data is called descriptive statistics.

Two ways to summarize data are by graphing and by using numbers (for example, finding an average).

After you have studied probability and probability distributions, you will use formal methods for drawing conclusions from “good” data. The formal methods are called inferential statistics.

Statistical inference uses probability to determine how confident we can be that our conclusions are correct.

Probability

Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring.

Consider tossing a fair coin. The expected theoretical probability of heads in any one toss is 1/2 or 0.5. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when there are many repetitions.

For example, if you toss a fair coin four times, the outcomes may not be two heads and two tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails.
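
The long-run pattern can be illustrated with a short simulation. This is a minimal Python sketch, not part of the original notes; the function name `proportion_of_heads` and the seed value are assumptions chosen for illustration.

```python
import random

def proportion_of_heads(n_tosses, rng):
    """Simulate n_tosses of a fair coin and return the fraction that land heads."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses

# A fixed seed (an arbitrary choice) makes the run repeatable.
rng = random.Random(2024)

few = proportion_of_heads(4, rng)       # can be far from 0.5
many = proportion_of_heads(4000, rng)   # tends to be close to 0.5

print(f"4 tosses:    {few:.3f}")
print(f"4000 tosses: {many:.3f}")
```

Changing the seed changes the individual results, but the proportion for 4,000 tosses should stay near 0.5 while the proportion for four tosses can vary widely.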

Population and Parameter

In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study.

A parameter is a number that is a property of the population. A statistic, calculated from a sample, is an estimate of a population parameter.

For example, if we consider all math classes to be the population, then the average number of points earned per student over all the math classes is an example of a parameter.

Sample and Statistic

To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population.

From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample, and is an estimate of the parameter.

For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic.
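
The distinction can be sketched in a few lines of Python. The class names and point totals below are made-up numbers used only for illustration; they do not come from any real course.

```python
from statistics import mean

# Hypothetical end-of-term points for three math classes (made-up numbers).
math_classes = {
    "Class A": [70, 80, 90],
    "Class B": [60, 75, 85, 95],
    "Class C": [50, 65],
}

# Population: every student in every math class.
all_points = [p for points in math_classes.values() for p in points]
parameter = mean(all_points)  # average over the whole population

# Sample: one class only.
statistic = mean(math_classes["Class A"])  # average over the sample

print(f"parameter (all classes): {parameter:.2f}")
print(f"statistic (Class A):     {statistic:.2f}")
```

The statistic generally differs from the parameter; how close it comes depends on how well the sampled class represents all of the classes.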

Data and Sampling

Data are the actual values of the variable. They may be numbers or they may be words. A datum is a single value. Data are the result of sampling from a population.

Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique.

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population.

The sample must contain the characteristics of the population in order to be a representative sample. We are interested in both the sample statistic and the population parameter in inferential statistics.

Variables

A variable, notated by capital letters such as \(X\) and \(Y\), is a characteristic of interest for each person or thing in a population. Variables may be numerical or categorical.

Numerical variables take on values with equal units such as weight in pounds and time in hours.

Categorical variables place the person or thing into a category.

Mean and Average

Two words that come up often in statistics are mean and proportion.

NOTE
The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean,” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”

Key Terms Example

Terminology Definition

Data, Samples, Variation in Sampling

In this section students will:

  1. Classify data as qualitative or quantitative
  2. Classify data as continuous, discrete, or neither
  3. Determine types of graphs appropriate for specific data
  4. Identify the basic techniques for choosing samples

Data

Qualitative data are the result of categorizing or describing attributes of a population. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data.
Qualitative data are generally described by words or letters.

Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data.

Data Types

Quantitative data may be either discrete or continuous.

All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values.

All data that are the result of measuring are quantitative continuous data, assuming that we can measure accurately.

NOTE
You may collect data as numbers and report them categorically. For example, the quiz scores for each student are recorded throughout the term. At the end of the term, the quiz scores are reported as A, B, C, D, or F.

Data Examples

Example 1.5 Data Sample of Quantitative Discrete Data
The data are the number of books students carry in their backpacks. You sample five students. Two students carry three books, one student carries four books, one student carries two books, and one student carries one book. The numbers of books (three, four, two, and one) are the quantitative discrete data.

Example 1.6 Data Sample of Quantitative Continuous Data
The data are the weights of backpacks with books in them. You sample the same five students. The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have different weights. Weights are quantitative continuous data.

Example 1.8 Data Sample of Qualitative Data
The data are the colors of backpacks. Again, you sample the same five students. One student has a red backpack, two students have black backpacks, one student has a green backpack, and one student has a gray backpack. The colors red, black, black, green, and gray are qualitative data.

Qualitative Data

Tables are a good way of organizing and displaying data. But graphs can be even more helpful in understanding the data.

There are no strict rules concerning which graphs to use.

Two graphs that are used to display qualitative data are pie charts and bar graphs.

Pie Chart

In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category.

M&M color pie chart

Student ethnicity pie charts

Bar Graph

In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.

Barplot of M&M colors

Student ethnicity barplot

Pareto Chart

A Pareto chart consists of bars that are sorted into order by category size (largest to smallest). It is an ordered bar graph.

Student ethnicity pareto chart
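
The quantities behind these three graphs can be computed without any plotting library. The sketch below is an illustration only; it uses the backpack colors from Example 1.8 rather than the M&M or ethnicity data, and draws the bars as text.

```python
from collections import Counter

# Backpack colors of the five sampled students (Example 1.8).
colors = ["red", "black", "black", "green", "gray"]

counts = Counter(colors)
total = sum(counts.values())

# Pie chart: each wedge's share of the circle is the category's percent.
percents = {color: 100 * n / total for color, n in counts.items()}

# Pareto chart: bars ordered from the largest category to the smallest.
pareto_order = counts.most_common()

for color, n in pareto_order:
    print(f"{color:<6} {'#' * n} {percents[color]:.0f}%")
```

An ordinary bar graph would use the same counts in any category order; only the Pareto chart requires the sort.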

Random Sampling

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing.

There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample.

The easiest method to describe is called a simple random sample. Any group of \(n\) individuals is as likely to be chosen as any other group of \(n\) individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected. (Another way to think about it is that every individual has the same chance to be chosen for the sample.)

Sampling Examples

True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once. However, for practical reasons, in most populations, simple random sampling is done without replacement.

In a college population of 10,000 people, suppose you want to pick a sample of 1,000 randomly for a survey.

For any particular sample of 1,000, if you are sampling with replacement,

  - the chance of picking the first person is 1,000 out of 10,000
  - the chance of picking a different second person for this sample is 999 out of 10,000
  - you replace the first person before picking the next person

For any particular sample of 1,000, if you are sampling without replacement,

  - the chance of picking the first person for any particular sample is 1,000 out of 10,000
  - the chance of picking a different second person is 999 out of 9,999
  - you do not replace the first person before picking the next person
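
The chances quoted above can be checked with exact fractions. This is a small illustrative sketch; the variable names are assumptions, and the numbers are the ones from the college example.

```python
from fractions import Fraction

population_size = 10_000
sample_size = 1_000

# First pick: same chance under both schemes.
first_pick = Fraction(sample_size, population_size)

# Second pick, with replacement: the first person goes back,
# so the population is still 10,000.
second_with = Fraction(sample_size - 1, population_size)

# Second pick, without replacement: the first person is not returned,
# so only 9,999 people remain.
second_without = Fraction(sample_size - 1, population_size - 1)

print(f"first pick:                  {float(first_pick):.5f}")
print(f"second, with replacement:    {float(second_with):.5f}")
print(f"second, without replacement: {float(second_without):.5f}")
```

The two second-pick chances differ only in the fifth decimal place, which suggests why sampling without replacement is an acceptable practical substitute when the population is large.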

Sampling Designs

Besides simple random sampling, there are other forms of sampling that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.

To choose a stratified sample, divide the population into groups called strata and then take a proportionate number from each stratum.

To choose a 1-stage cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample.

To choose a 2-stage cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. Next randomly select some of the members from these clusters.

To choose a systematic sample, randomly select a starting point and take every \(k^{th}\) piece of data from a listing of the population.

A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available.
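
The random designs described above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: the student ID numbers, strata, and clusters are hypothetical, the function names are made up for illustration, and the systematic sample assumes the random starting point is chosen among the first \(k\) entries. Only the 1-stage cluster sample is shown.

```python
import random

def simple_random_sample(population, n, rng):
    """Every group of n members is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Take the same fraction (a proportionate number) from each stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep all of their members."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

def systematic_sample(listing, k, rng):
    """Pick a random start among the first k entries, then take every k-th entry."""
    start = rng.randrange(k)
    return listing[start::k]

rng = random.Random(42)      # arbitrary seed so the run is repeatable
students = list(range(100))  # hypothetical ID numbers 0..99

strata = {"first-year": students[:40], "second-year": students[40:70],
          "third-year": students[70:90], "fourth-year": students[90:]}
clusters = {f"section {i}": students[10 * i:10 * i + 10] for i in range(10)}

srs = simple_random_sample(students, 10, rng)
strat = stratified_sample(strata, 0.1, rng)
clus = one_stage_cluster_sample(clusters, 2, rng)
syst = systematic_sample(students, 10, rng)

print("simple random:", sorted(srs))
print("stratified:   ", sorted(strat))
print("cluster:      ", sorted(clus))
print("systematic:   ", syst)
```

A 2-stage cluster sample would add one more step: after choosing the clusters, draw a random sample of members from each chosen cluster instead of keeping them all.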

Errors in Sampling

When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The actual process of sampling causes sampling errors. For example, the sample may not be large enough.

Factors not related to the sampling process cause nonsampling errors. A defective counting device can cause a nonsampling error.

In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, each member of the population should have an equally likely chance of being chosen).

When a sampling bias happens, there can be incorrect conclusions drawn about the population that is being studied.

Variation in Data and Samples

Variation is present in any set of data.

For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid.

Two or more samples from the same population, taken randomly and having close to the same characteristics as the population, will likely be different from each other.
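
A short simulation can make this sample-to-sample variation concrete. The population below is entirely hypothetical: it simulates the fill volumes of 1,000 cans as normally distributed around 16 ounces, which is an assumption made only for this sketch.

```python
import random
from statistics import mean

rng = random.Random(7)  # arbitrary seed so the run is repeatable

# Hypothetical population: fill volumes (in ounces) of 1,000 "16-ounce" cans,
# simulated here as normally distributed around 16 (an assumption).
population = [rng.gauss(16, 0.1) for _ in range(1000)]

# Three random samples of 30 cans each from the same population.
sample_means = [mean(rng.sample(population, 30)) for _ in range(3)]

print(f"population mean: {mean(population):.3f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:   {m:.3f}")
```

Each sample mean should land near the population mean, yet no two of them are expected to match exactly.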

Sampling Examples

Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).

  1. A soccer coach selects six players from a group of boys aged eight to ten, seven players from a group of boys aged 11 to 12, and three players from a group of boys aged 13 to 14 to form a recreational soccer team.
  2. A pollster interviews all human resource personnel in five different high tech companies.
  3. A high school educational researcher interviews 50 high school female teachers and 50 high school male teachers.
  4. A medical researcher interviews every third cancer patient from a list of cancer patients at a local hospital.
  5. A high school counselor uses a computer to generate 50 random numbers and then picks students whose names correspond to the numbers.
  6. A student interviews classmates in his algebra class to determine how many pairs of jeans a student owns, on the average.

Frequency, Frequency Tables, and Levels of Measurement

In this section students will:

  1. Use basic vocabulary
  2. Determine the level of measurements for variables
  3. Construct a frequency table, including relative and cumulative frequencies

Levels of Measurement

The way a set of data is measured is called its level of measurement.

Data can be classified into four levels of measurement. They are (from lowest to highest level):

  1. Nominal scale level
  2. Ordinal scale level
  3. Interval scale level
  4. Ratio scale level

Nominal Scale

Data that are measured using a nominal scale are qualitative. Categories, colors, names, labels and favorite foods along with yes or no responses are examples of nominal level data.

Nominal scale data are not ordered. For example, trying to order people according to their favorite food does not make any sense. Putting pizza first and sushi second is not meaningful.

Smartphone companies are another example of nominal scale data. Some examples are Sony, Motorola, Nokia, Samsung and Apple. This is just a list and there is no agreed upon order. Some people may favor Apple but that is a matter of opinion.

Nominal scale data cannot be used in calculations.

Ordinal Scale

Data that are measured using an ordinal scale are similar to nominal scale data, but ordinal scale data can be ordered.

An example of ordinal scale data is a list of the top five national parks in the United States. The top five national parks in the United States can be ranked from one to five but we cannot measure differences between the data.

Another example of using the ordinal scale is a cruise survey where the responses to questions about the cruise are “excellent,” “good,” “satisfactory,” and “unsatisfactory.” These responses are ordered from the most desired response to the least desired, but the differences between two pieces of data cannot be measured.

Like the nominal scale data, ordinal scale data cannot be used in calculations.

Interval Scale

Data that are measured using the interval scale are similar to ordinal level data because they have a definite ordering, but the differences between interval scale data can be measured. However, the data do not have a starting point.

Temperature scales like Celsius (C) and Fahrenheit (F) are measured by using the interval scale. In both temperature measurements, 40\(^{\circ}\) is equal to 100\(^{\circ}\) minus 60\(^{\circ}\). Differences make sense. But 0 degrees does not because, in both scales, 0 is not the absolute lowest temperature. Temperatures like -10\(^{\circ}\) F and -15\(^{\circ}\) C exist and are colder than 0.

Interval level data can be used in calculations, but one type of comparison cannot be done. 80\(^{\circ}\) C is not four times as hot as 20\(^{\circ}\) C (nor is 80\(^{\circ}\) F four times as hot as 20\(^{\circ}\) F). There is no meaning to the ratio of 80 to 20 (or four to one).
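
One way to see why the ratio has no meaning is to express the same two temperatures on different scales and watch the ratio change. The sketch below uses the standard conversion formulas; the Kelvin scale, which does have a true zero, is brought in only for comparison and is not part of the original example.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Standard conversion from degrees Celsius to kelvins."""
    return c + 273.15

hot_c, cool_c = 80, 20

ratio_celsius = hot_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(cool_c)
ratio_kelvin = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cool_c)

print(f"ratio in Celsius:    {ratio_celsius:.2f}")
print(f"ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"ratio in Kelvin:     {ratio_kelvin:.2f}")
```

The same pair of temperatures gives a different ratio on each scale, so "four times as hot" reflects where the Celsius scale puts its zero rather than anything about the temperatures themselves.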

Ratio Scale

Data that are measured using the ratio scale take care of the ratio problem and give you the most information. Ratio scale data are like interval scale data, but they have a 0 point and ratios can be calculated.

For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100 points). The exams are machine-graded. The data can be put in order from lowest to highest: 20, 68, 80, 92.

The differences between the data have meaning. The score 92 is more than the score 68 by 24 points.
Ratios can be calculated. The smallest possible score is 0. So 80 is four times 20. The score of 80 is four times better than the score of 20.

Frequency and Frequency Tables

A frequency is the number of times a value of the data occurs. The sum of the values in the frequency column represents the total number of values included in the sample.

A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of data values. Relative frequencies can be written as fractions, percents, or decimals.

Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row.

NOTE
Because of rounding, the relative frequency column may not always sum to one, and the last entry in the cumulative relative frequency column may not be one. However, they each should be close to one.
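
The three columns of a frequency table can be built with a short loop. This sketch reuses the five book counts from Example 1.5; the variable names are illustrative assumptions.

```python
from collections import Counter

# Number of books carried by the five sampled students (Example 1.5).
books = [3, 3, 4, 2, 1]

counts = Counter(books)
n = len(books)

table = []
running = 0  # running total of frequencies so far
for value in sorted(counts):
    freq = counts[value]
    running += freq
    # (value, frequency, relative frequency, cumulative relative frequency)
    table.append((value, freq, freq / n, running / n))

print("value  freq  rel.freq  cum.rel.freq")
for value, freq, rel, cum in table:
    print(f"{value:>5}  {freq:>4}  {rel:>8.2f}  {cum:>12.2f}")
```

Dividing the running count by the total, rather than adding rounded relative frequencies, is a design choice that keeps the last cumulative entry at exactly one.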

Table Example

The table represents the heights, in inches, of a sample of 100 male semiprofessional soccer players.

Table of heights

Table Example Questions

From the previous table, find the following answers:

  1. The percentage of heights that are less than 65.95 inches
  2. The percentage of heights that are from 67.95 to 71.95 inches
  3. The percentage of heights that are from 67.95 to 73.95 inches
  4. The percentage of heights that are more than 65.95 inches
  5. The number of players in the sample who are between 61.95 and 71.95 inches tall
  6. What kind of data are the heights?
  7. Describe how you could gather this data (the heights) so that the data are characteristic of all male semiprofessional soccer players.

Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.

Death Frequency Table

Table of world natural disaster deaths 2006-2012

Death Frequency Table Questions

Answer the following questions.

  1. What is the frequency of deaths measured from 2006 through 2009?
  2. What percentage of deaths occurred after 2009?
  3. What is the relative frequency of deaths that occurred in 2003 or earlier?
  4. What is the percentage of deaths that occurred in 2004?
  5. What kind of data are the numbers of deaths?
  6. The Richter scale is used to quantify the energy produced by an earthquake. Examples of Richter scale numbers are 2.3, 4.0, 6.1, and 7.0. What kind of data are these numbers?

Experimental Design and Ethics

In this section students will:

  1. Identify different experimental designs
  2. Identify the practical and ethical concerns that arise when conducting a study

Experimental Design Variables

The purpose of an experiment is to investigate the relationship between two variables.

When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable.

In a randomized experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable.

The different values of the explanatory variable are called treatments. An experimental unit is a single object or individual to be measured.

Experiment Logistics

Additional variables that can cloud a study are called lurking variables. In order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable.

The researcher must design the experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by the random assignment of experimental units to treatment groups.

When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups.

Experiment Logistics (cont)

When participation in a study prompts a physical response from a participant, it is difficult to isolate the effects of the explanatory variable.

To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment: a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments.

Of course, if you are participating in a study and you know that you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized experiment preserves the power of suggestion.

When a person involved in a research study is blinded, they do not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved with the subjects are blinded.

Experimental Designs: CRD, RCBD, Matched pairs

Completely Randomized Design (CRD): one in which each unit has an equal chance to be chosen for the treatment group; all units are as homogeneous (similar) as possible.

Randomized Complete Block Design (RCBD): when the experimental units are not homogeneous (similar) enough, detecting differences among treatment groups can be difficult. When we think the entire group of units is not similar enough, we create groups, called blocks, for comparing \(t\) treatments in \(b\) blocks. The units within the blocks are similar to each other; all blocks see all treatments in random order.

Matched Pairs: when we take measurements on the same unit, usually once before and once after a treatment (examples include weight loss programs, Coke vs. Pepsi, …).

Diagram of CRD

We want to test new and old fertilizer on tomato plants, and will include a control with no fertilizer, just water.

CRD
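
A completely randomized assignment for the fertilizer study could be sketched as below. The twelve plant labels, the group size, and the function name are hypothetical choices for illustration; the original notes do not say how many plants are used.

```python
import random

def completely_randomized_design(units, treatments, rng):
    """Shuffle all units, then deal them into equal-sized treatment groups."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: shuffled[i * per_group:(i + 1) * per_group]
        for i, treatment in enumerate(treatments)
    }

plants = [f"plant {i}" for i in range(1, 13)]  # hypothetical units
treatments = ["new fertilizer", "old fertilizer", "water only"]

rng = random.Random(11)  # arbitrary seed so the run is repeatable
assignment = completely_randomized_design(plants, treatments, rng)

for treatment, group in assignment.items():
    print(f"{treatment}: {group}")
```

Because the shuffle ignores every characteristic of the plants, any lurking variable is left to chance rather than to the researcher's choice.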

Diagram of RCBD

We want to test new and old fertilizer on tomato plants, and will include a control with no fertilizer, just water. However, there are inherent differences in the plants from different nurseries, so a block is included.

RCBD
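
With nurseries as blocks, the randomization happens separately inside each block. The sketch below is illustrative only: the nursery names and plant labels are hypothetical, and it assumes each block contains exactly one plant per treatment.

```python
import random

def randomized_complete_block_design(blocks, treatments, rng):
    """Within every block, assign all treatments to the block's units in random order."""
    layout = {}
    for block, units in blocks.items():
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order for each block
        layout[block] = dict(zip(units, order))
    return layout

treatments = ["new fertilizer", "old fertilizer", "water only"]

# Hypothetical blocks: three plants from each of three nurseries.
blocks = {
    "nursery 1": ["plant 1", "plant 2", "plant 3"],
    "nursery 2": ["plant 4", "plant 5", "plant 6"],
    "nursery 3": ["plant 7", "plant 8", "plant 9"],
}

rng = random.Random(5)  # arbitrary seed so the run is repeatable
layout = randomized_complete_block_design(blocks, treatments, rng)

for block, plan in layout.items():
    print(block, plan)
```

Every nursery sees every treatment, so differences between nurseries cannot be mistaken for differences between fertilizers.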

Ethics

The widespread misuse and misrepresentation of statistical information often gives the field a bad name.

Some say that “numbers don’t lie,” but the people who use numbers to support their claims often do.

Many types of statistical fraud are difficult to spot. Some researchers simply stop collecting data once they have just enough to prove what they had hoped to prove.

They don’t want to take the chance that a more extensive study would complicate their lives by producing data contradicting their hypothesis.

Misuse and misinterpretation of statistical information have plagued the field. While data and numbers do not lie, people do. Even a good person, when faced with a serious situation in life, can be tempted (and give in to it). “The road to Hell is paved with good intentions.”

Many studies that were either falsified or manipulated recently have probably impacted your life, either directly or indirectly. For example, one study was about vaccines and autism, and another, done while President Truman was in office, was about the effects of fat and sugar on the human body. Both studies were found to be falsified, but not before they had negative impacts.

Research regulations

There are federal laws and regulations for research studies that are overseen by the US Department of Health and Human Services, with the primary goal of protecting the research participants. Universities and other research institutions are required to have their planned studies approved by committees, usually called Institutional Review Boards. Key requirements include:

  1. Risks must be minimized and reasonable within the scope of the research and projected benefits
  2. Participants must give informed consent and all documentation must be kept
  3. Data collected from individuals must be stored so that their information remains private

Learning some basic statistics can help you be better informed and analyze statistical studies critically.


  1. John Tukey

  2. M. Londe