Sampling Distributions of Statistics

Sampling Distributions

In this section students will:

Learn about sampling distributions
Understand why sampling distributions are studied

Three Types of Distributions

Data Distribution
the distribution of a variable in a sample

Population Distribution
the probability distribution of a single observation of a variable

Sampling Distribution
the probability distribution of a statistic

Definition of a Sampling Distribution

sampling distribution: the probability distribution of a statistic; it describes the values the statistic takes across all possible random samples of the same size \(n\) from a population, and how often each value occurs in repeated sampling

Take simple random samples of size \(n\) from a given population and compute a statistic from each one, such as the sample mean \(\overline{X}\) or the sample proportion \(\hat p\). The probability distribution of those computed values is called a sampling distribution. It is the distribution of all possible values of that statistic.
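The idea can be sketched in R by drawing many samples of the same size and recording the statistic from each one. This is a minimal illustration, not a prescribed method: the exponential population with mean 5, the sample size of 40, and the 5000 repetitions are assumptions chosen only for the example.

```r
# Sketch: approximate the sampling distribution of the sample mean by simulation.
# The population (exponential, mean 5), n, and reps are illustrative assumptions.
set.seed(1)        # make the simulation reproducible
n <- 40            # size of each random sample
reps <- 5000       # number of repeated samples
pop_mean <- 5      # mean of the assumed population

# One sample mean per repetition; each value is one draw from the sampling distribution
xbar <- replicate(reps, mean(rexp(n, rate = 1 / pop_mean)))

hist(xbar, prob = TRUE, main = "Simulated sampling distribution of the mean",
     xlab = "Sample mean")
abline(v = pop_mean, lty = 2)        # population mean, for reference
c(mean = mean(xbar), sd = sd(xbar))  # center and spread of the simulated statistic
```

The histogram shows the distribution of the statistic, not of the individual observations, which is the distinction between a sampling distribution and a data distribution.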

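Using a statistic to estimate a population parameter is the main task of inferential statistics. The sampling distribution describes the properties of that statistic (its center, spread, and shape), which indicate how close the estimate is likely to be to the parameter.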

Simulation of LLN

The Law of Large Numbers (LLN) states that as the number of repetitions of an experiment increases, the relative frequency observed in the experiment tends to get ever closer to the theoretical probability. Even though individual outcomes do not follow any set pattern or order, the long-run observed relative frequency approaches the theoretical probability.
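In symbols, one common way to state this (the weak form of the law) is sketched here. The notation is introduced only for illustration: \(X_1,\dots,X_n\) are assumed to be independent trials that each equal 1 with probability \(p\) and 0 otherwise, and \(\hat{p}_n\) is the observed relative frequency after \(n\) trials.

\[\hat{p}_n=\frac{1}{n}\sum_{i=1}^{n}X_i, \qquad P\left(|\hat{p}_n-p|>\varepsilon\right)\to 0 \text{ as } n\to\infty \text{ for every } \varepsilon>0\]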

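# Simulate n Bernoulli trials with success probability p (1 = success, 0 = failure)
outcomes <- function(n, p) {
  u <- runif(n)             # n uniform draws on (0, 1)
  as.integer(u <= p)        # convert logical (TRUE/FALSE) to numeric (1/0)
}
p <- 0.26; max.n <- 500     # success probability and largest sample size
n <- 5:max.n                # sample sizes increasing from 5 to 500
phat <- numeric(length(n))  # one sample proportion per sample size
for (i in seq_along(n)) {
  phat[i] <- mean(outcomes(n[i], p))  # proportion of successes in a fresh sample of size n[i]
}
plot(n, phat, type = 'l', xlab = 'Number of trials', ylab = 'Proportion of successes')
abline(h = p, lty = 2)      # dashed line at the theoretical probability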

Distribution of proportion with increasing n

Central Limit Theorem for the Sample Statistics

In this section students will:

Use properties of the Central Limit Theorem (CLT) to estimate the means and standard deviations of sampling distributions of sample means
Use properties of the Central Limit Theorem (CLT) to estimate the proportions and standard deviations of sampling distributions of sample proportions
Use properties of the Central Limit Theorem (CLT) to estimate the totals and standard deviations of sampling distributions of sample totals

CLT for Sample Mean

Definition
The sampling distribution of the sample mean is approximately normal with mean \(\mu\) and standard deviation of the sampling distribution of the sample mean (also called standard error) \(se=\frac{\sigma}{\sqrt{n}}\), provided \(n\) is sufficiently large.

If we take \(n\) observations of a quantitative variable and then compute the mean of those observations in the sample, then \(\bar{x}\) is the sample mean.

Assumptions: Each observation \(x_i\) has the same probability distribution with mean \(\mu\) and standard deviation \(\sigma\), and the observations are independent.

\[\overline{X} \sim N(\mu,se)\]

\[\text{Standard error of the mean: }se=\frac{\sigma}{\sqrt{n}}\]

\[z=\frac{\bar{x}-\mu}{se}\]

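As a rule of thumb, the sample size should be \(n\geq30\) for the normal approximation to the sample mean to hold. If the population distribution is itself normal, the sample mean is normally distributed for any \(n\), so the sample size requirement can be ignored.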
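The formulas above can be worked through in R. This is a small sketch using hypothetical values (\(\mu=100\), \(\sigma=10\), \(n=36\), and an observed sample mean of 103), none of which come from a real data set.

```r
# Hypothetical values chosen for illustration only
mu <- 100        # population mean
sigma <- 10      # population standard deviation
n <- 36          # sample size (meets the n >= 30 guideline)
xbar_obs <- 103  # sample mean whose upper-tail probability we want

se <- sigma / sqrt(n)      # standard error of the mean
z <- (xbar_obs - mu) / se  # z-score of the observed sample mean
p_upper <- 1 - pnorm(z)    # P(sample mean > 103) under the normal approximation

c(se = se, z = z, p_upper = p_upper)
```

Here the standard error is \(10/\sqrt{36}\approx 1.67\), the z-score is 1.8, and the upper-tail probability is roughly 0.036, so a sample mean this far above \(\mu\) would be fairly unusual under these assumptions.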

Simulation example

A simulation of the CLT shows how taking many random samples of the same size from the same population produces an approximately normal distribution of sample means. The examples use a normal distribution, an exponential distribution, and a binomial distribution.

# Normal example
x=rnorm(500,mean=100,sd=10)
hist(x,prob=TRUE)