Sampling Distributions

In this section students will:

  1. Learn about sampling distributions
  2. Understand why sampling distributions are studied

Three Types of Distributions

Data Distribution
the distribution of a variable in a sample

Population Distribution
the probability distribution of a single observation of a variable

Sampling Distribution
the probability distribution of a statistic

Definition of a Sampling Distribution

sampling distribution: a probability distribution of a statistic; it describes the values the statistic takes over all possible random samples of the same size \(n\) from a population, and how often each value occurs in repeated sampling

Given simple random samples of size \(n\) from a population, compute a statistic such as the sample mean (\(\overline{X}\)) or the sample proportion (\(\hat p\)) for each sample. The probability distribution of those values is called a sampling distribution: the distribution of all possible values of that statistic.
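As a minimal sketch of this definition, the R code below enumerates every possible sample of size 2 from a tiny made-up population and tabulates the resulting sample means. The population values and the names `population`, `samples`, `xbars`, and `sampling_dist` are assumptions chosen only for illustration; samples are drawn without replacement.

```r
# Hypothetical small population (values chosen only for illustration)
population <- c(2, 4, 6, 8)
n <- 2  # sample size

# All possible samples of size n (without replacement); one sample per column
samples <- combn(population, n)

# The statistic (sample mean) computed for every possible sample
xbars <- colMeans(samples)

# Sampling distribution: each possible value of the statistic and its probability
sampling_dist <- table(xbars) / length(xbars)
print(sampling_dist)

# The mean of the sampling distribution matches the population mean
print(c(mean_of_xbars = mean(xbars), population_mean = mean(population)))
```

With a realistic population the number of possible samples is far too large to enumerate, which is why the simulations in this section draw repeated random samples instead.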

Using a statistic to estimate a parameter is the main function of inferential statistics, and the sampling distribution describes the properties of that statistic.

Simulation of LLN

The Law of Large Numbers (LLN) states that as the number of repetitions of an experiment increases, the relative frequency obtained in the experiment tends to get closer to the theoretical probability. Even though individual outcomes do not follow any set pattern or order, the long-run observed relative frequency approaches the theoretical probability.

outcomes=function(n,p){
  u=runif(n)
  x=1*(u<=p) # easier way to convert logical (T/F) to numeric (0,1)
  return(x)
}
p=0.26; max.n=500 # p(1) and max sample size
n=5:max.n # n increasing from sample size 5 to 500
phat=numeric(length(n))
for(i in 1:length(n)){
  phat[i]=mean(outcomes(n[i],p))
}
plot(n,phat,type='l',xlab='Count of Trials',ylab='Proportion of successes')
abline(h=p,lty=2)

Distribution of proportion with increasing n

Central Limit Theorem for the Sample Statistics

In this section students will:

  1. Use properties of the Central Limit Theorem (CLT) to estimate the means and standard deviations of sampling distributions of sample means
  2. Use properties of the Central Limit Theorem (CLT) to estimate the proportions and standard deviations of sampling distributions of sample proportions
  3. Use properties of the Central Limit Theorem (CLT) to estimate the totals and standard deviations of sampling distributions of sample totals

CLT for Sample Mean

Definition
The sampling distribution of the sample mean is approximately normal with mean \(\mu\) and standard deviation of the sampling distribution of the sample mean (also called standard error) \(se=\frac{\sigma}{\sqrt{n}}\), provided \(n\) is sufficiently large.

If we take \(n\) observations of a quantitative variable and then compute the mean of those observations in the sample, then \(\bar{x}\) is the sample mean.

Assumptions: Each observation \(x_i\) has the same probability distribution with mean \(\mu\) and standard deviation \(\sigma\), and the observations are independent.

\[\overline{X} \sim N(\mu,se)\]

\[\text{Standard error of the mean: }se=\frac{\sigma}{\sqrt{n}}\]

\[z=\frac{\bar{x}-\mu}{se}\]

As a rule of thumb, sample sizes should be \(n\geq30\) for the sample mean. If the population distribution is itself normal, the sample size requirement can be ignored.

Simulation example

A simulation of the CLT shows how taking many random samples of the same size from the same population produces an approximately normal distribution of the sample means. The examples start from a normal distribution, an exponential distribution, and a binomial distribution.

# Normal example
x=rnorm(500,mean=100,sd=10)
hist(x,prob=TRUE)

CLT simulation for sample mean

x=rnorm(500,mean=100,sd=10)
hist(x,prob=TRUE,ylim=c(0,0.04))
curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")


mean(x)
[1] 99.81855
x=rnorm(500,mean=100,sd=10)
mean(x)
[1] 100.302
mu=100
sigma=10
n=5

xbar=rep(0,500) # storage for 500 sample means
for (i in 1:500) 
{ xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) } # mean of one random sample of size n

hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))


# Exponential distribution example
x=rexp(500)
hist(x,prob=TRUE)


x=rexp(500)
hist(x,prob=TRUE)
curve(dexp(x),add=TRUE,lwd=2,col="red")


mean(x)
[1] 0.9716396
x=rexp(500)
mean(x)
[1] 1.037852
n=30
xbar=rep(0,500) # storage for 500 sample means
for (i in 1:500) 
{ xbar[i]=mean(rexp(n)) } # mean of one exponential sample of size n (rate 1)

hist(xbar,prob=TRUE,breaks=12)
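To check the simulation against the CLT formulas, the sketch below compares the simulated mean and standard deviation of the sample means with \(\mu\) and \(\frac{\sigma}{\sqrt{n}}\). It relies on the fact that `rexp()` with its default rate of 1 has \(\mu=1\) and \(\sigma=1\); the seed, the 5000 repetitions, and the name `reps` are assumptions made for this illustration.

```r
set.seed(1)    # arbitrary seed so the sketch is reproducible
n <- 30        # sample size, as in the exponential example
reps <- 5000   # number of simulated samples (assumption; more than the 500 used in the slides)

# Each replicate is the mean of one exponential sample of size n
xbar <- replicate(reps, mean(rexp(n)))

# Exponential with rate 1: mu = 1 and sigma = 1
print(c(simulated_mean = mean(xbar), theoretical_mean = 1))
print(c(simulated_se = sd(xbar), theoretical_se = 1 / sqrt(n)))
```

The simulated values should land close to the theoretical ones, and the agreement tightens as `reps` grows.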


# Binomial distribution example
# start with n=10, p=.8, 500 iterations

y=rbinom(500,10,.8)
hist(y,prob=TRUE)


mean(y)
[1] 8.012
y=rbinom(500,10,.8)
mean(y)
[1] 8.106
n=30
xbar=rep(0,500) # storage for 500 sample means
for (i in 1:500) 
{ xbar[i]=mean(rbinom(n,10,.8)) } # mean of n binomial(10, 0.8) counts

hist(xbar,prob=TRUE,breaks=15)


CLT of Sample Proportion

Definition
The sampling distribution of the sample proportion is approximately normal with mean \(p\) and standard deviation of the sampling distribution of the sample proportion (also called standard error) \(se=\sqrt{\frac{pq}{n}}\), provided \(n\) is sufficiently large (note that \(q=1-p\)).

If we make \(n\) observations, and count the number of observations on which an outcome happens (call this \(X\)), then \(\hat p=\frac Xn\) is the sample proportion.

Assumptions: \(X\) has a binomial distribution where \(n\) is the number of trials and the probability of the outcome on each trial is \(p\).

\[\hat p \sim N(p,se)\]

\[\text{Standard error of the proportion: } se=\sqrt{\frac{pq}{n}}\]

\[z=\frac{\hat p-p}{se}\]

Sample sizes should be \(n\geq60\) for the sample proportion.
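The same kind of simulation used for the sample mean can be sketched for the sample proportion. The code below draws many binomial counts, converts them to proportions, and compares their mean and standard deviation with \(p\) and \(\sqrt{\frac{pq}{n}}\). The values of `p`, `n`, `reps`, and the seed are assumptions picked for illustration.

```r
set.seed(2)    # arbitrary seed so the sketch is reproducible
p <- 0.26      # hypothetical true proportion
n <- 200       # hypothetical sample size
reps <- 5000   # number of simulated samples

# X ~ Binomial(n, p); the sample proportion is X / n
phat <- rbinom(reps, size = n, prob = p) / n

se <- sqrt(p * (1 - p) / n)  # theoretical standard error
print(c(simulated_mean = mean(phat), theoretical_mean = p))
print(c(simulated_se = sd(phat), theoretical_se = se))
```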

CLT of Sample Total

Definition
The sampling distribution of the sample total is approximately normal with mean \(\tau\) and standard deviation of the sampling distribution of the sample total (also called standard error) \(se=\sigma\sqrt{n}\), provided \(n\) is sufficiently large.

The sample total is the sum of the observations in the sample, which equals the sample mean multiplied by the sample size. If we take \(n\) observations of a quantitative variable and add them, the result \(\hat{\tau}=n\bar{x}\) is the sample total.

Assumptions: Each observation \(x_i\) has the same probability distribution with mean \(\mu\) and standard deviation \(\sigma\), and the observations are independent. The sample total then has mean \(\tau=n\mu\) and standard error \(se=\sqrt{n}\sigma\) (perhaps easier to read as \(se=\sigma\sqrt{n}\)).

\[\hat{\tau} \sim N(\tau,se)\]
\[\text{with } \tau=n\mu \text{ and } se=\sqrt{n}\sigma\]
\[z=\frac{\hat{\tau}-\tau}{se}\]
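
The formulas for the sample total can be checked with a short simulation. This sketch reuses the normal population from the earlier example (\(\mu=100\), \(\sigma=10\)); the sample size, the seed, and the names `reps` and `tau_hat` are assumptions for illustration.

```r
set.seed(3)    # arbitrary seed so the sketch is reproducible
mu <- 100
sigma <- 10
n <- 30        # hypothetical sample size
reps <- 5000   # number of simulated samples

# Each replicate is the total of one random sample of size n
tau_hat <- replicate(reps, sum(rnorm(n, mean = mu, sd = sigma)))

print(c(simulated_mean = mean(tau_hat), theoretical_mean = n * mu))
print(c(simulated_se = sd(tau_hat), theoretical_se = sqrt(n) * sigma))
```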

CLT for sample mean (\(\overline{X}\)) and sample sum/total (\(\hat{\tau}\))


The level of a particular pollutant, nitrogen dioxide (\(NO_2\)), in the exhaust of a hypothetical model of car driven in city traffic has a mean of 2.1 grams per mile (\(g/m\)) and a standard deviation of 0.3 \(g/m\). Suppose a company has a fleet of 35 of these cars.

(a) What are the mean and standard deviation of the sampling distribution of the sample mean?

mean: \(\mu=2.1\); standard error: \(se=\frac{\sigma}{\sqrt{n}}=\frac{0.3}{\sqrt{35}}=0.0507\)

\(\overline{X} \sim N(\mu,se)=N(2.1,0.0507)\)

CLT for \(\overline{X}\) and \(\hat{\tau}\) solutions

(b) Find the probability that the mean \(NO_2\) level is less than 2.03 \(g/m\).

\[P(\overline{X}<2.03)=P\left(Z<\frac{2.03-2.1}{0.0507}\right)=P(Z<-1.38)=0.0837933\]

(c) The EPA mandates that the fleet average for these cars cannot exceed 2.2 \(g/m\). Find the probability that the fleet's mean \(NO_2\) level exceeds the EPA mandate.

\[P(\overline{X}>2.2)=1-P\left(Z<\frac{2.2-2.1}{0.0507}\right)\]
\[=1-P(Z<1.97)=1-0.9755808=0.0244192\]
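
Parts (b) and (c) can also be computed directly with `pnorm()`, as in the sketch below. The variable names `p_b` and `p_c` are assumptions for illustration. Because R keeps the unrounded standard error and \(z\) scores, the results can differ slightly in the later decimal places from a hand calculation that rounds \(z\) to two decimals.

```r
mu <- 2.1      # mean NO2 level (g/m)
sigma <- 0.3   # standard deviation (g/m)
n <- 35        # fleet size

se <- sigma / sqrt(n)  # standard error of the mean

# (b) P(Xbar < 2.03)
p_b <- pnorm(2.03, mean = mu, sd = se)

# (c) P(Xbar > 2.2), using the upper tail directly
p_c <- pnorm(2.2, mean = mu, sd = se, lower.tail = FALSE)

print(c(se = se, p_b = p_b, p_c = p_c))
```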


(d) The fleet's mean \(NO_2\) level exceeds what value only 25% of the time?

Find the \(z\) score that cuts off the top 25%, which is the same as the bottom 75% (the third quartile, \(Q_3\)): \(P(Z<z_0)=0.75 \Rightarrow z_0=0.6744898\). Next, take \(z=\frac{\bar{x}-\mu}{se}\) and solve for \(\bar{x}\): \(\bar{x}=z(se)+\mu\).

\[\bar{x}=(0.6744898)(0.0507)+2.1=2.1341966\]
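
The same cutoff can be sketched in R with `qnorm()`, either through the \(z\) score or in one step on the scale of the sample mean. The names `z0`, `cutoff_via_z`, and `cutoff_direct` are assumptions for illustration.

```r
mu <- 2.1
se <- 0.3 / sqrt(35)  # standard error of the mean

# z score with 75% of the standard normal below it
z0 <- qnorm(0.75)

# Convert back to the scale of the sample mean: xbar = z * se + mu
cutoff_via_z <- z0 * se + mu

# Equivalent one-step calculation
cutoff_direct <- qnorm(0.75, mean = mu, sd = se)

print(c(z0 = z0, cutoff_via_z = cutoff_via_z, cutoff_direct = cutoff_direct))
```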


(e) What are the mean and standard deviation of the sample total amount, in \(g/m\), of \(NO_2\) in the exhaust for the fleet?

\[\tau=n\mu=35(2.1)=73.5\]
\[se=\sqrt{n}\sigma=\sqrt{35}(0.3)=1.7748239\]
\[\hat{\tau} \sim N(\tau,se)=N(73.5,1.7748)\]


(f) Find the probability that the total amount of \(NO_2\) for the fleet is between 70 and 75 \(g/m\).

\[P(70<\hat{\tau}<75)=P\left(\frac{70-73.5}{1.7748}<Z<\frac{75-73.5}{1.7748}\right)\]
\[=P(-1.97<Z<0.85)=P(Z<0.85)-P(Z<-1.97)\]
\[=0.8023375-0.0244192=0.7779183\]
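
As a cross-check, the probability for the sample total can be sketched in R with `pnorm()`. The names `tau`, `se_total`, and `p_f` are assumptions for illustration. R works with the unrounded standard error and \(z\) scores, so its answer can differ in the third decimal place from a hand calculation that rounds \(z\) to two decimals.

```r
mu <- 2.1
sigma <- 0.3
n <- 35

tau <- n * mu                # mean of the sample total
se_total <- sqrt(n) * sigma  # standard error of the sample total

# P(70 < tau_hat < 75) as a difference of two lower-tail probabilities
p_f <- pnorm(75, mean = tau, sd = se_total) - pnorm(70, mean = tau, sd = se_total)

print(c(tau = tau, se_total = se_total, p_f = p_f))
```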

CLT for proportion (\(\hat p\))

The Mars company claims that 10% of the M&M’s it produces are green. Suppose that candies are packaged at random in bags that contain 60 candies.
(a) Describe the sampling distribution of the sample proportion (what should the distribution look like?); calculate the mean proportion and standard deviation of the sampling distribution of the sample proportion of green M&M’s in bags that contain 60 candies (calculate \(p\) and \(se\)).
(b) What is the probability that a bag of 60 candies will have more than 13% green M&M’s?

CLT for \(\hat p\) solutions

(a) Describe the sampling distribution of the sample proportion; calculate the mean proportion and standard deviation of the sampling distribution of the sample proportion of green M&M’s in bags that contain 60 candies.

The distribution of the sample proportion will be approximately normal since \(n \geq 60\). The mean proportion is \(p=0.1\) and the standard error (the standard deviation of the sampling distribution of the sample proportion) is \(\sqrt{\frac{pq}{n}}=\sqrt{\frac{(0.1)(1-0.1)}{60}}=0.0387\). Thus

\[\hat p \sim N(0.1,0.0387)\]


(b) What is the probability that a bag of 60 candies will have more than 13% green M&M’s?

\[P(\hat p>0.13)=P\left(Z>\frac{0.13-0.1}{0.0387}\right)\]
\[=P(Z>0.78)=1-P(Z<0.78)=1-0.7823046\]
\[=0.2177\]
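
The normal approximation for the M&M example can be sketched in R as follows. Since the count of green candies is assumed to be binomial, the sketch also computes the exact binomial probability as a comparison; that comparison is an addition for illustration, not part of the original solution, and the names `p_normal` and `p_exact` are assumptions.

```r
p <- 0.10  # claimed proportion of green candies
n <- 60    # candies per bag

se <- sqrt(p * (1 - p) / n)  # standard error of the sample proportion

# Normal approximation: P(phat > 0.13)
p_normal <- pnorm(0.13, mean = p, sd = se, lower.tail = FALSE)

# Exact binomial: more than 13% of 60 means more than 7.8 candies, i.e. 8 or more
p_exact <- pbinom(7, size = n, prob = p, lower.tail = FALSE)

print(c(se = se, p_normal = p_normal, p_exact = p_exact))
```

The two answers need not agree closely here, because the expected number of green candies (\(np=6\)) is small and the binomial distribution is still noticeably skewed.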