In this section students will:
Data Distribution
the distribution of a variable in a sample
Population Distribution
the probability distribution of a single observation of a variable
Sampling Distribution
the probability distribution of a statistic
Sampling distribution: the probability distribution of a statistic. It describes every value the statistic can take across all possible random samples of the same size \(n\) from a population, and how often each value occurs in repeated sampling.
Given simple random samples of size \(n\) from a population, compute a statistic on each sample, such as the sample mean \(\overline{X}\) or the sample proportion \(\hat p\). The probability distribution of that statistic across all possible samples is called its sampling distribution.
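To make "all possible samples" concrete, here is a minimal sketch that enumerates the sampling distribution of the sample mean for a tiny population. The population values and the sample size of 2 are assumptions chosen only so that every sample can be listed; samples are taken without replacement.

```r
# Hypothetical population of four values (illustration only)
pop <- c(2, 4, 6, 8)
n <- 2

# Every possible sample of size n (without replacement), one per column
samples <- combn(pop, n)

# The statistic (sample mean) computed on each possible sample
xbars <- colMeans(samples)

# Sampling distribution: each possible value of the statistic and its probability
samp_dist <- table(xbars) / length(xbars)
print(samp_dist)

# The mean of the sampling distribution matches the population mean
mean(xbars)
mean(pop)
```

With a realistic population the number of possible samples is far too large to list, which is why the later examples rely on simulation and on the normal approximation.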
Using a statistic to estimate a parameter is the main function of inferential statistics, and the sampling distribution describes the properties of that statistic (its center, spread, and shape).
The Law of Large Numbers (LLN) states that as the number of repetitions of an experiment increases, the relative frequency of an outcome tends to get closer to its theoretical probability. Individual outcomes follow no set pattern or order, but the long-run observed relative frequency approaches the theoretical probability.
outcomes=function(n,p){
u=runif(n)
x=1*(u<=p) # easier way to convert logical (T/F) to numeric (0,1)
return(x)
}
p=0.26; max.n=500 # p(1) and max sample size
n=5:max.n # n increasing from sample size 5 to 500
phat=numeric(length(n))
for(i in 1:length(n)){
phat[i]=mean(outcomes(n[i],p))
}
plot(n,phat,type='l',xlab='Count of Trials',ylab='Proportion of successes')
abline(h=p,lty=2)
Definition
The sampling distribution of the sample mean is approximately
normal with mean \(\mu\) and standard
deviation of the sampling distribution of the sample mean (also called
standard error) \(se=\frac{\sigma}{\sqrt{n}}\), provided
\(n\) is sufficiently large.
If we take \(n\) observations of a quantitative variable and then compute the mean of those observations in the sample, then \(\bar{x}\) is the sample mean.
Assumptions: Each observation \(x_i\) has the same probability distribution with mean \(\mu\) and standard deviation \(\sigma\), and the observations are independent.
\[\overline{X} \sim N(\mu,se)\]
\[\text{Standard error of the mean: }se=\frac{\sigma}{\sqrt{n}}\]
\[z=\frac{\bar{x}-\mu}{se}\]
Sample sizes should be \(n\geq30\) for the sample mean. If a distribution is already inherently normal, the sample size stipulation can be ignored.
A simulation of the Central Limit Theorem (CLT) shows how the means of repeated random samples of the same size from the same population form an approximately normal distribution. The examples below draw from a normal distribution, an exponential distribution, and a binomial distribution.
# Normal example
x=rnorm(500,mean=100,sd=10)
hist(x,prob=TRUE)
x=rnorm(500,mean=100,sd=10)
hist(x,prob=TRUE,ylim=c(0,0.04))
curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")
mean(x)
[1] 99.81855
x=rnorm(500,mean=100,sd=10)
mean(x)
[1] 100.302
mu=100
sigma=10
n=5
xbar=rep(0,500)
for (i in 1:500)
{ xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) }
hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))
# Exponential distribution example
x=rexp(500)
hist(x,prob=TRUE)
curve(dexp(x),add=TRUE,lwd=2,col="red")
mean(x)
[1] 0.9716396
x=rexp(500)
mean(x)
[1] 1.037852
n=30
xbar=rep(0,500)
for (i in 1:500)
{ xbar[i]=mean(rexp(n)) }
hist(xbar,prob=TRUE,breaks=12)
# Binomial distribution example
# start with n=10, p=.8, 500 iterations
y=rbinom(500,10,.8)
hist(y,prob=TRUE)
mean(y)
[1] 8.012
y=rbinom(500,10,.8)
mean(y)
[1] 8.106
n=30
xbar=rep(0,500)
for (i in 1:500)
{ xbar[i]=mean(rbinom(n,10,.8)) }
hist(xbar,prob=TRUE,breaks=15)
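The histograms show the shape of each sampling distribution; the spread can be checked numerically as well. A minimal sketch, assuming the exponential setup from the example (rate 1, so \(\mu=\sigma=1\)). The fixed seed and the larger number of repetitions are choices made here for a steadier estimate and are not part of the original example.

```r
set.seed(1)          # assumed seed, only for reproducibility
n <- 30              # sample size, as in the exponential example
reps <- 5000         # more repetitions than the 500 used above

# Draw reps samples of size n and keep each sample mean
xbar <- replicate(reps, mean(rexp(n)))

sigma <- 1                       # sd of an exponential with rate 1
se_theory <- sigma / sqrt(n)     # theoretical standard error of the mean
se_sim <- sd(xbar)               # spread of the simulated sample means

c(theory = se_theory, simulated = se_sim)
```

The two values should be close, and the simulated mean of `xbar` should sit near \(\mu=1\).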
Definition
The sampling distribution of the sample proportion is
approximately normal with mean \(p\)
and standard deviation of the sampling distribution of the sample
proportion (also called standard error) \(se=\sqrt{\frac{pq}{n}}\), provided \(n\) is sufficiently large (note \(q=1-p\))
If we make \(n\) observations, and count the number of observations on which an outcome happens (call this \(X\)), then \(\hat p=\frac Xn\) is the sample proportion.
Assumptions: \(X\) has a binomial distribution where \(n\) is the number of trials and the probability of the outcome on each trial is \(p\).
\[\hat p \sim N(p,se)\]
\[\text{Standard error of the proportion: } se=\sqrt{\frac{pq}{n}}\]
\[z=\frac{\hat p-p}{se}\]
Sample sizes should be \(n\geq60\) for the sample proportion.
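The same kind of simulation used for the sample mean can be sketched for the sample proportion. The values \(p=0.3\) and \(n=60\), the seed, and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(1)       # assumed seed, only for reproducibility
p <- 0.3          # hypothetical population proportion
n <- 60           # sample size
reps <- 5000      # number of simulated samples

# X ~ Binomial(n, p) for each sample; the sample proportion is X / n
phat <- rbinom(reps, n, p) / n

# Theoretical standard error of the sample proportion
se_theory <- sqrt(p * (1 - p) / n)

c(mean_sim = mean(phat), p = p, se_sim = sd(phat), se_theory = se_theory)
```

The simulated mean of `phat` should be near \(p\) and its standard deviation near \(\sqrt{pq/n}\). A call such as `hist(phat, prob=TRUE)` would show the approximately normal shape.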
Definition
The sampling distribution of the sample total is approximately
normal with mean \(\tau\) and standard
deviation of the sampling distribution of the sample total (also called
standard error) \(se=\sigma\sqrt{n}\),
provided \(n\) is sufficiently
large.
The sample total is the sum of the observations in the sample, which equals the sample mean multiplied by the number of observations. If we take \(n\) observations of a quantitative variable, the sample total is \(\hat{\tau}=n\bar{x}\).
Assumptions: Each observation \(x_i\) has the same probability distribution with mean \(\mu\) and standard deviation \(\sigma\), and the observations are independent. The sample total then has mean \(\tau=n\mu\) and standard error \(se=\sqrt{n}\sigma\) (perhaps easier to read as \(se=\sigma(\sqrt{n})\)).
\[\hat{\tau} \sim N(\tau,se)\]
\[\text{with }\tau=n\mu\text{ and }se=\sqrt{n}\sigma\]
\[z=\frac{\hat{\tau}-\tau}{se}\]
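The formulas for the total can be checked by simulation. This is a minimal sketch under stated assumptions: a normal population with hypothetical values \(\mu=50\) and \(\sigma=8\), samples of size 40, and a fixed seed.

```r
set.seed(1)        # assumed seed, only for reproducibility
mu <- 50           # hypothetical population mean
sigma <- 8         # hypothetical population standard deviation
n <- 40            # sample size
reps <- 5000       # number of simulated samples

# Sum the n observations in each simulated sample
totals <- replicate(reps, sum(rnorm(n, mean = mu, sd = sigma)))

tau <- n * mu                  # theoretical mean of the total
se_theory <- sqrt(n) * sigma   # theoretical standard error of the total

c(mean_sim = mean(totals), tau = tau, se_sim = sd(totals), se_theory = se_theory)
```

Note that the standard error of the total grows with \(\sqrt{n}\), while the standard error of the mean shrinks with \(\sqrt{n}\).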
Example for the sample mean (\(\overline{X}\)) and total (\(\hat{\tau}\))
The level of a particular pollutant, nitrogen dioxide (\(NO_2\)), in the exhaust of a hypothetical model of car driven in city traffic has a mean of 2.1 grams per mile (\(g/m\)) and a standard deviation of 0.3 \(g/m\). Suppose a company has a fleet of 35 of these cars.
(a) What are the mean and standard deviation of the sampling distribution of the sample mean?
Mean: \(\mu=2.1\); standard error: \(se=\frac{\sigma}{\sqrt{n}}=\frac{0.3}{\sqrt{35}}=0.0507\)
\(\overline{X} \sim N(\mu,se)=N(2.1,0.0507)\)
(b) Find the probability that the mean \(NO_2\) level of the fleet is less than 2.03 \(g/m\).
\[P(\overline{X}<2.03)=P\left(Z<\frac{2.03-2.1}{0.0507}\right)=P(Z<-1.38)=0.0837933\]
(c) The EPA mandates that the average \(NO_2\) level of a fleet of these cars cannot exceed 2.2 \(g/m\). Find the probability that the fleet's average \(NO_2\) level exceeds the EPA mandate.
\[P(\overline{X}>2.2)=1-P\left(Z<\frac{2.2-2.1}{0.0507}\right)\]
\[=1-P(Z<1.97)=1-0.9755808=0.0244192\]
(d) The fleet mean \(NO_2\) level exceeds what value at most 25% of the time?
Find the \(z\) score that cuts off the top 25%, which is the same as the bottom 75% (the third quartile, \(Q3\)): \(P(Z<z_0)=0.75 \Rightarrow z_0=0.6744898\). Then use \(z=\frac{\bar{x}-\mu}{se}\) and solve for \(\bar{x}\): \(\bar{x}=z(se)+\mu\)
\[\bar{x}=(0.6744898)(0.0507)+2.1=2.1341966\]
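Parts (a) through (d) can be cross-checked in R with `pnorm` and `qnorm`. This is a sketch, not part of the original solution; the variable names are made up here. Because R keeps the unrounded standard error and \(z\) values, the results may differ from the hand calculations in the later decimal places.

```r
mu <- 2.1      # population mean NO2 level (g/m)
sigma <- 0.3   # population standard deviation (g/m)
n <- 35        # fleet size

# (a) standard error of the sample mean
se <- sigma / sqrt(n)

# (b) P(Xbar < 2.03)
p_b <- pnorm(2.03, mean = mu, sd = se)

# (c) P(Xbar > 2.2)
p_c <- 1 - pnorm(2.2, mean = mu, sd = se)

# (d) value of Xbar with 75% of the distribution below it
x_d <- qnorm(0.75, mean = mu, sd = se)

round(c(se = se, b = p_b, c = p_c, d = x_d), 4)
```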
(e) What are the mean and standard deviation of the sample total amount, in \(g/m\), of \(NO_2\) in the exhaust for the fleet?
\[\tau=n\mu=35(2.1)=73.5\]
\[se=\sqrt{n}\sigma=\sqrt{35}(0.3)=1.7748239\]
\[\hat{\tau} \sim N(\tau,se)=N(73.5,1.7748)\]
(f) Find the probability that the total amount of \(NO_2\) for the fleet is between 70 and 75 \(g/m\).
\[P(70<\hat{\tau}<75)=P\left(\frac{70-73.5}{1.7748}<Z<\frac{75-73.5}{1.7748}\right)\]
\[=P(-1.97<Z<0.85)=P(Z<0.85)-P(Z<-1.97)\]
\[=0.8023375-0.0244192=0.7779183\]
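Parts (e) and (f) can be cross-checked the same way. This is a sketch with variable names chosen here; R works with the unrounded standard error, so the probability may differ slightly from a calculation based on \(z\) scores rounded to two decimals.

```r
mu <- 2.1      # population mean NO2 level (g/m)
sigma <- 0.3   # population standard deviation (g/m)
n <- 35        # fleet size

# (e) mean and standard error of the sample total
tau <- n * mu
se_total <- sqrt(n) * sigma

# (f) P(70 < total < 75)
p_f <- pnorm(75, mean = tau, sd = se_total) - pnorm(70, mean = tau, sd = se_total)

round(c(tau = tau, se = se_total, f = p_f), 4)
```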
Mars company claims that 10% of the M&M’s it produces are green.
Suppose that candies are packaged at random in bags that contain 60
candies.
(a) Describe the sampling distribution of the sample proportion (what
should the distribution look like?); calculate the mean proportion and
standard deviation of the sampling distribution of the sample proportion
of green M&M’s in bags that contain 60 candies (calculate \(p\) and \(se\)).
(b) What is the probability that a bag of 60 candies will have more than
13% green M&M’s?
(a) Describe the sampling distribution of the sample
proportion; calculate the mean proportion and standard deviation of the
sampling distribution of the sample proportion of green M&M’s in
bags that contain 60 candies.
The distribution of the sample proportion will be approximately normal since \(n \geq 60\). The mean proportion is \(p=0.1\) and the standard error (the standard deviation of the sampling distribution of the sample proportion) is \(se=\sqrt{\frac{pq}{n}}=\sqrt{\frac{(0.1)(1-0.1)}{60}}=0.0387\). Thus
\[\hat p \sim N(0.1,0.0387)\]
(b) What is the probability that a bag of 60 candies
will have more than 13% green M&M’s?
\[P(\hat p>0.13)=P\left(Z>\frac{0.13-0.1}{0.0387}\right)\]
\[=P(Z>0.78)=1-P(Z<0.78)=1-0.7823046\]
\[=0.2177\]
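The normal-approximation answer can be checked with `pnorm`. The sketch below also computes the exact binomial probability with `pbinom` as a point of comparison; that comparison is an addition here, not part of the original solution. It relies on the fact that more than 13% of 60 candies means more than 7.8, that is, 8 or more green candies.

```r
p <- 0.10   # claimed proportion of green candies
n <- 60     # candies per bag

# Standard error of the sample proportion
se <- sqrt(p * (1 - p) / n)

# Normal approximation: P(phat > 0.13)
p_normal <- 1 - pnorm(0.13, mean = p, sd = se)

# Exact binomial: P(X >= 8) = 1 - P(X <= 7)
p_exact <- 1 - pbinom(7, size = n, prob = p)

round(c(se = se, normal = p_normal, exact = p_exact), 4)
```

Any gap between the two values reflects how well the normal approximation performs for this \(n\) and \(p\).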