Sampling Distributions

Author: Sophia

before you start
This lesson builds on key concepts from an Introduction to Statistics course. Specifically, this tutorial assumes familiarity with the foundational idea of sampling distributions.

1. Introduction to Sampling Distributions

A sampling distribution is a way to understand how a statistic (like an average) from a sample can vary. Imagine you want to know the average amount of money customers spend at a store. Instead of asking every customer, you take several small samples and calculate the average for each sample. The collection of these averages forms a sampling distribution.

Sampling distributions help us understand the variability of our sample statistics. This is important because it allows businesses to make predictions and decisions based on sample data, rather than needing to survey an entire population.

Let’s say a company wants to know the average amount of time customers spend on their website. Instead of tracking every single visitor, they take multiple samples of 100 visitors each and calculate the average time spent for each sample. These averages might be slightly different, but together they form a sampling distribution.

Imagine you take 10 different samples of 100 visitors each and calculate the average time spent on the website for each sample. You might get averages like 5 minutes, 5.2 minutes, 4.8 minutes, etc. If you plot these averages on a graph, you’ll see a distribution of these sample averages. This is your sampling distribution. The histogram below shows a sampling distribution of the average time spent on a website using 10 different samples of 100 visitors each.

Understanding sampling distributions helps businesses make better decisions. For example, if the sampling distribution shows that most sample averages are around 5 minutes, the company can be confident that the true average time spent on the website is close to 5 minutes.
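
The procedure described above can be sketched in a few lines of Python. This is only an illustration, not the data behind the lesson's histogram: the population of visit times is simulated, its exponential shape and 5-minute average are assumptions, and the names `population` and `sample_means` are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed population: 50,000 visit times (in minutes) averaging about 5 minutes.
# The exponential shape is an illustrative assumption, not taken from the lesson.
population = [rng.expovariate(1 / 5) for _ in range(50_000)]

# Draw 10 samples of 100 visitors each and record the average of each sample.
sample_means = [
    statistics.mean(rng.sample(population, 100))
    for _ in range(10)
]

# These 10 averages are a (small) sampling distribution of the mean.
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i:2d}: average time = {m:.2f} minutes")
```

Each run with a different seed gives slightly different averages, which is exactly the sample-to-sample variation a sampling distribution describes.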

key concept
  • Population: The entire group you’re interested in, like all visitors to the website.
  • Sample: A smaller group selected from the population, like 100 visitors.
  • Sample Statistic: This is a number that describes a sample, like the average time spent on a website.
  • Sampling Distribution: The distribution of a statistic (like the average) from many samples.

term to know
Sampling Distribution
Distribution of a given statistic based on a random sample, showing how the statistic varies from sample to sample.

1a. Importance of Sampling Distributions

Sampling distributions are crucial in business data analytics for several reasons:

  1. Making Predictions: Sampling distributions help businesses make predictions about a population based on sample data. For example, if a company wants to know the average amount customers spend, they can use a sampling distribution to estimate this average without surveying every customer.
  2. Understanding Variability: They show how much a sample statistic (like an average) can vary from sample to sample. This helps businesses understand the reliability of their data. For instance, if the average time spent on a website varies little between samples, the company can be more confident in its estimate.
  3. Decision Making: Businesses often make decisions based on sample data. Sampling distributions provide a way to measure the uncertainty of these decisions. For example, a company might use a sampling distribution to decide whether a new marketing strategy is effective, based on sample sales data.
  4. Quality Control: In manufacturing, sampling distributions are used to monitor product quality. By taking samples from production lines and analyzing the distribution of defects, companies can identify and address quality issues more efficiently.
  5. Cost Efficiency: Surveying an entire population can be expensive and time-consuming. Sampling distributions allow businesses to gather insights and make decisions based on smaller, more manageable samples, saving time and resources.

EXAMPLE

A hospital wants to know the average wait time for patients in the emergency room (ER). Instead of tracking every patient, they take several samples of 100 patients each and calculate the average wait time for each sample. The distribution of these averages forms a sampling distribution.

The key aspects are:

  • Population: All patients visiting the ER.
  • Sample: Groups of 100 patients each.
  • Sample Statistic: Average wait time in the ER.
  • Sampling Distribution: The distribution of the average wait times from multiple samples.
The hospital took 50 samples of 100 patients each, calculated the average wait time for each sample, and plotted these 50 averages, as shown in the histogram below.



By analyzing the sampling distribution, the hospital can estimate the true average wait time for all patients. This helps in making decisions about staffing, resource allocation, and process improvements.

The sample averages are between 28 and 32 minutes, which suggests that the average wait time for patients in the ER is likely around 30 minutes, with some variation. This histogram helps the hospital understand the typical wait time and how much it can vary.
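
To see how a histogram like this comes about, here is a rough Python sketch that builds a text histogram from 50 simulated sample averages. The wait-time population (mean 30 minutes, standard deviation 8 minutes) is an assumption chosen for illustration; it is not the hospital's data, and the variable names are hypothetical.

```python
import random
import statistics
from collections import Counter

rng = random.Random(7)  # fixed seed so the run is repeatable

# Assumed population of ER wait times in minutes (mean 30, SD 8, never negative).
population = [max(0.0, rng.gauss(30, 8)) for _ in range(20_000)]

# 50 samples of 100 patients each; one average per sample.
sample_means = [
    statistics.mean(rng.sample(population, 100))
    for _ in range(50)
]

# Group the 50 averages into 1-minute bins and print a text histogram.
bins = Counter(int(m) for m in sample_means)
for start in sorted(bins):
    print(f"{start}-{start + 1} min | {'#' * bins[start]}")
```

Each `#` stands for one sample average, so the printed rows play the role of the histogram bars.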

1b. Sampling Distribution of the Mean (x̄)

Suppose you want to understand the average spending of customers at a store. Instead of asking every single customer, you take several smaller samples and calculate the average spending for each sample. The collection of these sample averages forms what is known as the sampling distribution of the mean (x̄).

The table below shows the collection of these sample means, which forms the sampling distribution of x̄.

Sample    Sample Size    Mean (x̄)
1         100            $48.90
2         100            $46.53
3         100            $51.08
...       ...            ...
30        100            $49.94

The histogram below is a plot of all the sample means from the table above. The histogram visualizes the distribution of the sample means. The histogram helps you understand the central tendency (average) and variability of the sample means.

The graphic below illustrates the construction of another sampling distribution of x̄. Even though the population mean is 82.5, the image visually demonstrates how the means of samples drawn from that population can vary. This sampling distribution of x̄ consists of six samples, each with a sample size of 20.

Source: Sampling Variability - MathBitsNotebook(A2)

Just as with other distributions we have studied, the sampling distribution of x̄ has its own mean, standard deviation, and shape.

Returning to the example of understanding the average spending of customers at a store, no information is available regarding the original population's spending patterns. However, this is acceptable due to the Central Limit Theorem, a statistical concept that offers insights into the sampling distribution of the sample mean (x̄).

The Central Limit Theorem (CLT) provides you with information about the characteristics of the sampling distribution of x̄. The CLT tells us:

  • Regardless of the shape of the population distribution (whether skewed, uniform, etc.), the sampling distribution of x̄ will be approximately normal (bell-shaped) if the sample size is large enough (at least 30 for most distributions).
  • The sampling distribution of x̄ has its own mean and standard deviation.
  • The sampling distribution of x̄ will have a mean equal to the population mean.
$$\mu_{\bar{x}} = \mu_{\text{population}}$$

  • The standard deviation of the sampling distribution of x̄ is the population standard deviation divided by the square root of the sample size (this is called the standard error).
$$\sigma_{\bar{x}} = \frac{\sigma_{\text{population}}}{\sqrt{n}}$$

In practice, the true value of σ is typically unknown and cannot be directly calculated. Therefore, σ is estimated using the sample standard deviation, denoted as s. Consequently, the sample standard error is defined as follows:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
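
The properties above can be checked with a small simulation. The sketch below is one possible illustration rather than a definitive implementation: it assumes a right-skewed (exponential) population with a mean of 50, and the names `theoretical_se`, `empirical_se`, and `estimated_se` are hypothetical. It compares σ/√n with the observed spread of many sample means, and also computes s/√n from a single sample.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100                 # sample size
num_samples = 2_000     # number of repeated samples

# Assumed right-skewed population (exponential with mean 50).
population = [rng.expovariate(1 / 50) for _ in range(100_000)]
mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population standard deviation

# One mean per sample: together these approximate the sampling distribution.
sample_means = [
    statistics.fmean(rng.sample(population, n))
    for _ in range(num_samples)
]

theoretical_se = sigma / math.sqrt(n)          # sigma / sqrt(n)
empirical_se = statistics.stdev(sample_means)  # observed spread of the means

# When sigma is unknown, estimate the standard error from one sample: s / sqrt(n).
one_sample = rng.sample(population, n)
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

print(f"population mean:         {mu:.2f}")
print(f"mean of sample means:    {statistics.fmean(sample_means):.2f}")
print(f"sigma / sqrt(n):         {theoretical_se:.2f}")
print(f"SD of the sample means:  {empirical_se:.2f}")
print(f"s / sqrt(n), one sample: {estimated_se:.2f}")
```

Even though the assumed population is strongly skewed, the mean of the sample means should land near the population mean, and the spread of the sample means should land near σ/√n.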

Let’s see how you might use the Central Limit Theorem to tell you something about a sampling distribution of x̄.

EXAMPLE

A retail chain wants to understand its average monthly sales across all its 250 stores to make informed business decisions. However, analyzing the average sales across all its stores is time-consuming and resource-intensive. Instead, the company asks you to use a sample-based approach to estimate the average monthly sales for all its 250 stores.

You create 50 samples of sales figures, each drawn from 10 stores, and find the average of each sample. That is, you create the sampling distribution of x̄ shown in the spreadsheet below. The first five rows of the sampling distribution for sales_CLT.xlsx are shown below.

A description of each column is:

  • Sample_Number: Identifies the sample (1 to 50).
  • Store_1 to Store_10: The sales figures for each of the 10 stores in that particular sample.
  • Sample_Means: The average sales calculated from the 10 stores in that sample. The sample means in column L form the sampling distribution of x̄.
For this sampling distribution, the number of samples is 50 and the sample size (n) is 10.



Based on the known properties of the sampling distribution of x̄, you can provide the company with an estimate of the average monthly sales across all 250 stores by using:

$$\mu_{\bar{x}} = \mu_{\text{population}}$$

For this problem, μ_x̄ (the mean of the sampling distribution of x̄) can be found by taking the mean of the sample means in column L. In Excel, in cell N2, enter:

=AVERAGE(L2:L51)
You find that the mean of the sampling distribution of x̄ is $4,948. This gives the retail chain a reliable estimate of the average monthly sales across all stores. The mean of the sample means is a more reliable estimate of the population mean than any single sample mean because it aggregates information from multiple samples, reducing the impact of any anomalies in individual samples. This is why the retail chain uses the mean of the sample means to estimate the overall average monthly sales for all its stores.

You can also find the standard error by entering the following formula in cell N3:

=STDEV.S(L2:L51)/SQRT(50)
The standard error is $26. Since the population standard deviation is not known, the sample standard deviation s stands in for σ. Here s is the standard deviation of the 50 sample means in column L, and n = 50 is the number of sample means being averaged, so the Excel formula computes:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{\texttt{STDEV.S(L2:L51)}}{\sqrt{50}}$$

This helps the retail chain understand the precision of their estimate of the population mean, μ. That is, this value helps the retail chain understand how much the mean of the sample means ($4,948) is expected to vary from the true population mean. A smaller standard error indicates a more precise estimate of the population mean, μ.
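
For readers who prefer code to spreadsheets, the two Excel calculations can be mirrored in Python. This is a sketch under stated assumptions: the file sales_CLT.xlsx is not available here, so the 50 rows of 10 store sales are simulated (the 5000 and 600 parameters are made up), and the results will not match the lesson's $4,948 and $26. `statistics.stdev` computes the sample standard deviation, as STDEV.S does.

```python
import math
import random
import statistics

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Stand-in for sales_CLT.xlsx: 50 rows, each holding simulated monthly
# sales for 10 stores. The mean 5000 and SD 600 are assumptions.
rows = [[rng.gauss(5000, 600) for _ in range(10)] for _ in range(50)]

# Column L in the spreadsheet: one mean per sample of 10 stores.
sample_means = [statistics.fmean(row) for row in rows]

# Equivalent of =AVERAGE(L2:L51)
mean_of_means = statistics.fmean(sample_means)

# Equivalent of =STDEV.S(L2:L51)/SQRT(50)
std_error = statistics.stdev(sample_means) / math.sqrt(len(sample_means))

print(f"Mean of the sample means: {mean_of_means:,.0f}")
print(f"Standard error:           {std_error:,.0f}")
```

Replacing `rows` with the real spreadsheet values would reproduce the figures reported in the example.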

terms to know
Sampling Distribution of x̄
A distribution of all possible sample means of a given size from a population, showing how the sample mean varies from sample to sample.
Central Limit Theorem (CLT)
A property that states the sampling distribution of x̄ approaches a normal distribution as the sample size becomes large, regardless of the original population’s distribution.
Standard Error
Measure of the variability of a sample statistic (such as x̄) from the population, calculated as the standard deviation of the sampling distribution of that statistic.

1c. Estimating Population Mean Using Sampling Distributions: A Practical Approach

The mean of the sampling distribution of x̄ is theoretically equal to the population mean, μ. However, in practice, there are several reasons why they might not be exactly the same. For instance:

  • Sampling Variability. Each sample you take is different, and the sample mean x̄ will vary from sample to sample. This variability is natural and expected. The more samples you take, the closer the mean of the sampling distribution will get to the population mean, but with a limited number of samples, there will always be some difference.
  • Sample Size. The size of each sample affects the accuracy. Smaller samples tend to have more variability, which can lead to a greater difference between the sample mean and the population mean. Larger samples tend to produce sample means that are closer to the population mean.
  • Random Error. Random errors can occur due to various factors, such as measurement errors or random fluctuations in the data. These errors can cause the sample mean to differ slightly from the population mean.
  • Finite Number of Samples. In the previous example, you took 50 samples of 10 stores each. While 50 is a reasonable number of samples, it is finite. If you were to take an infinite number of samples, the mean of the sampling distribution of x̄ would converge to the population mean, μ. With a finite number of samples, there will always be some small difference.
The histogram below shows that the mean of the sampling distribution of x̄ is not exactly equal to the population mean, due to some of the factors discussed above.
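
The effect of a finite number of samples can be illustrated with a short simulation. The sketch below assumes a made-up population of 250 store sales figures (not the retail chain's data) and a hypothetical helper, `mean_of_sample_means`, that averages a chosen number of sample means.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed population: monthly sales for 250 stores (illustrative values only).
population = [rng.gauss(5000, 600) for _ in range(250)]
mu = statistics.fmean(population)  # true population mean


def mean_of_sample_means(num_samples, n=10):
    """Average of num_samples sample means, each computed from n stores."""
    means = [
        statistics.fmean(rng.sample(population, n))
        for _ in range(num_samples)
    ]
    return statistics.fmean(means)


# More samples should usually shrink the gap between the estimate and mu.
for k in (5, 50, 500, 5000):
    estimate = mean_of_sample_means(k)
    print(f"{k:5d} samples: mean of sample means = {estimate:8.2f} "
          f"(gap from mu: {estimate - mu:+.2f})")
```

The gap does not shrink in a perfectly steady way, because each run is random, but it tends to get smaller as the number of samples grows.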

1d. Sampling Distribution of the Proportion (p̂)

Imagine you’re working for a company that wants to know the proportion of employees who prefer working remotely versus in the office. Instead of asking every employee, you decide to survey a smaller group, or sample, of employees. The proportion of employees in your sample who prefer working remotely is called p̂. Now, if you took many different samples from your company, each sample would give you a different p̂. The distribution of all these p̂ values from different samples is called the sampling distribution of the proportion (p̂).

The table below shows the collection of these sample proportions, which forms the sampling distribution of p̂.

Sample    Sample Size    Sample Proportion (p̂)
1         100            p̂₁ = 0.58 = 58%
2         100            p̂₂ = 0.62 = 62%
3         100            p̂₃ = 0.54 = 54%
...       ...            ...
30        100            p̂₃₀ = 0.62 = 62%

The histogram below is a plot of all the sample proportions from the table above. The histogram visualizes the distribution of the sample proportions. The histogram helps you understand the central tendency (average) and variability of the sample proportions.

The histogram above was created with 30 samples of size 100. If you were to take 100,000 samples of size 100, the histogram would look like the following:

The reason the histogram becomes smoother and looks more like a normal distribution with more samples is due to the law of large numbers and the Central Limit Theorem working together. As you take more samples, the average of the sample proportions will converge to the population proportion, and the histogram of those proportions will more closely trace the true sampling distribution of p̂.

The Central Limit Theorem states that the distribution of p̂ will approach a normal distribution as the sample size increases. Taking a larger number of samples does not reduce the variability of p̂ itself, which depends on the sample size; it fills in the histogram so that the approximately normal shape becomes easier to see.
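
A quick simulation can show what changes, and what does not, as the number of samples grows. This sketch assumes a true proportion of 0.60 and uses 10,000 samples instead of 100,000 to keep the run short; both choices, and the helper name `sample_proportion`, are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable
p = 0.60                # assumed true proportion preferring remote work
n = 100                 # employees per sample


def sample_proportion():
    """Return p-hat for one simulated sample of n employees."""
    return sum(rng.random() < p for _ in range(n)) / n


few = [sample_proportion() for _ in range(30)]
many = [sample_proportion() for _ in range(10_000)]

# Compare the center and spread of the two collections of p-hat values.
for label, props in (("30 samples", few), ("10,000 samples", many)):
    print(f"{label:>14}: mean = {statistics.fmean(props):.4f}, "
          f"SD = {statistics.stdev(props):.4f}")
```

In a sketch like this, the spread of the p̂ values stays near √(p(1−p)/n) in both runs. What the larger run adds is a fuller histogram and a mean that sits closer to p.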

Below are some key points about the sampling distribution of p̂:

  • Random Sampling. Each sample should be randomly selected to ensure that every employee has an equal chance of being chosen. This helps make your sample representative of the whole company.

  • Sample Size. The number of employees in each sample affects the sampling distribution. Larger samples tend to give more accurate estimates of the true proportion of employees who prefer working remotely.

  • Shape of the Distribution. The sampling distribution of p̂ will approximate a normal distribution as the sample size becomes large.

    For the sampling distribution of p̂, a common rule of thumb is that the sample size, n, should be large enough such that both np and n(1 − p) are greater than 5, where p is the population proportion. This helps ensure that the normal approximation is valid.

  • Mean of the Distribution. The mean of the sampling distribution of p̂ is equal to the true population proportion, p. This means that, on average, p̂ will be a good estimate of p. The formula for the mean of the sampling distribution of p̂ is:

    $$\mu_{\hat{p}} = p$$

    In the previous example, if you knew the true population proportion, p, was 60%, the mean of the sampling distribution of p̂ would also be 60%, and the average of your observed sample proportions would be close to it.

  • Standard Deviation (Standard Error): The standard deviation of the sampling distribution of p̂ is called the standard error. It measures how much p̂ varies from sample to sample. The formula for the standard error of p̂ is:

    $$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$$
The key reason for studying the sampling distributions of x̄ (the sample mean) and p̂ (the sample proportion) is that these distributions help assess how close the sample mean or proportion is to the true population mean or proportion. In other words, they indicate how accurate x̄ is as an estimate for μ (the population mean) and how accurate p̂ is as an estimate for p (the population proportion). These topics will be explored in more detail in future tutorials.
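
The standard error formula and the rule of thumb for the normal approximation can be written as two small Python functions. This is a minimal sketch: the function names are hypothetical, and p = 0.60 with n = 100 is used purely as an illustration based on the remote-work example.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of p-hat: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def normal_approx_ok(p, n, threshold=5):
    """Rule of thumb: both n*p and n*(1 - p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


p, n = 0.60, 100  # illustrative values from the remote-work example
se = proportion_standard_error(p, n)

print(f"Standard error of p-hat: {se:.4f}")
print(f"Normal approximation reasonable? {normal_approx_ok(p, n)}")
```

With these values the standard error is about 0.049, so sample proportions would typically fall within a few percentage points of 60%. Both np = 60 and n(1 − p) = 40 exceed 5, so the rule of thumb is satisfied.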

terms to know
Sampling Distribution of p̂
A distribution of all possible sample proportions of a given sample size from a population, showing how the sample proportion varies from sample to sample.
Law of Large Numbers
A property that states as the size of a sample increases, the sample mean will become closer to the population mean and the sample proportion will become closer to the population proportion.

summary
In this lesson, you learned about sampling distributions. A sampling distribution is the distribution of a given statistic (such as the sample mean or sample proportion) based on a random sample. It shows how the statistic varies from sample to sample. A practical reason to study sampling distributions in business data analytics is to make accurate and reliable inferences about a population based on sample data. This tutorial focused on two specific sampling distributions: sampling distribution of the sample mean and sample proportion. The sampling distribution of the sample mean represents the sample means of all possible samples of a given size from a population. As the sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population’s distribution. The sampling distribution of the sample proportion represents the sample proportions of all possible samples of a given size from a population.

Source: THIS TUTORIAL WAS AUTHORED BY SOPHIA LEARNING. PLEASE SEE OUR TERMS OF USE.
