Use Sophia to knock out your gen-ed requirements quickly and affordably. Learn more
×

Examining and Detecting Outliers

Author: Sophia

what's covered
In this lesson, you will learn to examine outliers and their effects on data sets. You will learn how to use boxplots and the Quartile Method to detect outliers in the data. Specifically, this lesson will cover:

Table of Contents

1. Identification of Outliers Using the Quartile Method

There are guidelines developed to determine if an observation is an outlier or not. These guidelines are based upon the quartiles of the data. A quartile is a number that divides the data into four equal parts. There are three quartiles in any data set. The three quartiles are:

  • The first quartile (Q1) is the 25ᵗʰ percentile of the dataset. It is the value below which 25% of the observations fall, meaning that 75% of the observations are above this value.
  • The second quartile (Q2) is the 50ᵗʰ percentile, also known as the median. Q2 is the value that is "dead center" of the data, such that 50% of the observations are below and 50% of the observations are above this value.
  • The third quartile (Q3) is the 75ᵗʰ percentile. It is the value below which 75% of the observations fall, meaning 25% of the observations are above this value.
To visualize these quartiles and how they split up data into four equal parts, imagine the line in the figure below represents the sporting goods sales data where the data values have been sorted from the smallest value to the largest value. A quartile "cuts" the data into four equal parts such that each part contains 25% of the data values.

The interquartile range (IQR) is the range of the middle 50% of values.

formula to know
Interquartile Range (IQR)
IQR equals Q subscript 3 minus Q subscript 1

You can visually think of IQR as the middle 50% of the data as shown in the figure below.

step by step
The IQR value can be used to detect outliers. Use the following method to determine if a particular observation is an outlier.

  1. Calculate IQR equals Q subscript 3 minus Q subscript 1.
  2. Determine a threshold, which is found by multiplying the IQR by a constant, 1.5.

    Threshold = 1.5 x IRQ

  3. Determine a lower and upper bound for potential outliers.

    Lower bound: Q1Threshold
    Upper bound: Q3 + Threshold

  4. Any observation below the lower bound or above the upper bound is considered to be an outlier.

formula to know
Upper/Lower Bound
Lower Bound = Q1 - Threshold = Q1 – (1.5 * IQR)
Upper Bound = Q3 + Threshold = Q3 + (1.5 * IQR)

learn more
Why is the threshold equal to 1.5 times IQR and not some other number? The short answer is that 1.5 strikes a balance between sensitivity and specificity, effectively identifying outliers while avoiding excessive false positives. For a more detailed explanation, visit 1.5 IQR Rule Explained | Built In.

EXAMPLE

Going back to the sporting good sales example from the last tutorial, you can use Excel to find various quartiles of the data, IQR, and the lower/upper bounds.

Q1 Q3 IQR Lower Bound Upper Bound
439 719 280 19 1139

try it
Find the first and third quartiles of the data, IQR, and the lower and upper bound thresholds.

1. Open sporting_goods_sales.xlsx in Office 365.

2. In the same sheet as your column of data (Sales column), create three columns named Q1, Q3, and IQR as shown in the figure below.



3. In the cell right below Q1 use the QUARTILE() function to find the first quartile (Q1) as shown in the table below.

Q1
=QUARTILE(A2:A86,1)

The first argument of the quartile function is the rows that contain the data (rows A2 to A86 in this case), and the second argument corresponds to the quartile you want to calculate. A value of 1 is for the first quartile (Q1). A value of 2 is for the second quartile (Q2). A value of 3 is for the third quartile (Q3).

The value of Q1 is $439. This means that 25% of the sporting goods sales are less than $439 and 75% of the sales values are more than $439.

4. Perform the same calculation for the third quartile (Q3). The table below shows how to calculate Q3.

Q3
=QUARTILE(A2:A86,3)

The value of Q3 is $719. This means that 75% of the sporting goods sales are less than $719 and 25% of the sales are more than $719.

IQR is the difference between Q1 and Q3.

5. Use the basic formula shown in the table below to calculate IQR (the difference between the first and third quartile).

Q1 Q3 IQR
439 719 =G4–F4

The IQR value is 280.

You can now calculate the lower and upper bounds using the threshold.

6. Create two new columns in the worksheet named Lower Bound and Upper Bound. These will be the bounds to determine if an observation is an outlier or not.

7. Create the corresponding bounds using the threshold (1.5 x IQR).

The table below shows the formula to calculate the Lower Bound.

Q1 Q3 IQR Lower Bound Upper Bound
439 719 280 =F4-(1.5*H4)

The Lower Bound value is 19. If there are any values less than 19, those values would be considered outliers.

In a similar manner the table below shows the formula to calculate the Upper Bound.

Q1 Q3 IQR Lower Bound Upper Bound
439 719 280 19 =G4+(1.5*H4)

The Upper Bound value is 1139. If there are any sales values greater than $1,139, those values would be considered outliers.

8. Check your work by comparing your first and third quartiles, IQR, and lower and upper bounds to the one in the example.

key concept
Any values below the lower bound or above the upper bound should be designated as outliers.

watch
Follow along with this video on identifying outliers using the quartile method.

EXAMPLE

Based on the quartile method, the sales value of $1,441 would be considered an outlier.

try it
Determine if there are any outliers in the sporting goods sales data using the quartile method for detecting outliers.

Using the quartile method, you can determine if there are any sales values less than $19 or any sales values greater than $1,139. Those sales values will be considered an outlier.

Going through this procedure verifies that the bar to the right of the other bars included an outlying observation. The sales value of $1,441 is the only value outside the upper and lower bound and should be considered an outlier.

terms to know
Quartile
The three values in a dataset which divide the data into four equal parts.
Interquartile Range (IQR).
The range between Quartiles 1 and 3, which corresponds to the middle 50% of the data.


2. Using Boxplots to Identify Outliers

A boxplot is a visual summary of a quantitative variable distribution that assists in detecting outliers. A boxplot visually displays the five number summary. The five number summary consists of the minimum values, Q1, Q2, Q3, and the maximum value of the quantitative variable being displayed.

The figure below illustrates the basic elements of a boxplot. The rectangular box spans from Q1 to Q3, which represents the middle 50% of the data. Q2 is drawn in the body of the box to indicate the median value. The lines you see extending from the box are referred to as whiskers. Sometimes boxplots are referred to as box and whisker plots. The whiskers will extend to the minimum and maximum values of the data if there are no outliers in the data. If there are outliers, the whiskers will extend to the minimum and maximum values not considered an outlier. The whiskers attempt to capture data outside the IQR, but the whiskers never reach below the lower bound (Q1 - (1.5 x IQR)) or above the upper bound (Q3 + (1.5 x IQR)). If there are any data values below the lower bound or above the upper bound, they will be individually designated to denote those values are outliers.

EXAMPLE

You can use Excel to construct a boxplot for the sporting goods sales data.

Once again you should use Excel for a task like this. You will not be constructing graphs using manual or hand methods at any real-world business analytics job. Before you begin, remind yourself of the five number summary for this data displayed in the table below.

Five Number Summary Value
Minimum 30
Q1 (first quartile) 439
Q2 (second quartile = median) 560
Q3 (third quartile) 719
Maximum 1441

try it
Construct a boxplot using the sporting good sales data.

Open sporting_good_sales2.xlsx workbook, which now contains first and third quartiles of the data, IQR, and the lower and upper bound thresholds. Or, you can continue with the same Excel file from above.

1. In Excel, highlight all the sales values in the Sales column as shown in the figure.

2. Select the Insert menu. Select the drop-down arrow for the Charts group → Box and Whisker.



A boxplot will appear in the worksheet.



3. To view the numerical values that make up the boxplot select the Chart menu → Data LabelsRight.



The boxplot updates to include the five number summary.



You will notice that the values of Q1 and Q3 displayed in the boxplot are not the same values that you calculated using the QUARTILE() function. When the number of observations in a data set is odd, like the sporting goods sales data (n equals 85), Excel removes the median value when it calculates Q1 - Q3. If you want the median included in the first and third quartile calculations (like you did when you used the QUARTILE() function), select the Inclusive median option for the boxplot by performing the following.

4. Select Format (under Chart menu) → right arrow for Series “Sales”Quartile CalculationInclusive median.



The boxplot will now update to include the median in the calculations for Q1 and Q3 and will match the values you obtained using the QUARTILE() function earlier.



Let’s now remove the ‘1’ that you see on the horizontal axis, provide a label for the vertical axis, and provide the boxplot with a title.

To remove the ‘1’ on the horizontal axis, perform the following.

5. Select Axes (under Chart menu) → Primary Horizontal AxisNone.

6. To add a label to the vertical axis, select Axis Titles (under Chart menu) → Primary Vertical Axis TitleEdit Vertical Axis Title. Enter "frequency."

7. To add a title to the boxplot, select Chart Title (under Chart menu) → Edit Chart Title. Enter "Distribution of Sporting Good Sales" for the title.

8. Check your work and compare your boxplot to the one in the example.

The boxplot confirms that the sales value of $1,441 is an outlier. Notice that the value denoted as the maximum value is $1,068 because this is the maximum value of sales that is not considered an outlier. Since there are no outliers in the lower values of the distribution (that is, there are no values below the lower bound of $19), the lower whisker value is the minimum sales value of $30.

watch
Follow along with this video on constructing a boxplot.

learn more
The boxplot was created in the 1970s by the famous American statistician John Tukey. So, while the plot is over 40 years old, it remains a staple in data exploration. Its simplicity, ability to summarize data distribution, and usefulness for identifying outliers make it indispensable.

Using a boxplot allows you to focus more on the overall distribution of the data to detect outliers.

terms to know
Boxplot
A graph used to detect outliers.
Five Number Summary
Five numerical values computed from the data that describe the range, spread, and location of a quantitative variable.
Data Exploration
Process that allows you to understand the distribution of a data set and includes activities like identifying outliers.


3. Outliers and Their Effect on Data Sets

Outliers should be investigated carefully. Often, they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, you should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.

In the sporting goods sales data set, the histogram and the boxplot both identified the $1,441 sales value as an outlier. You should then ask yourself if the value is an anomaly, or does it have the potential to be repeated. In this case, even though it is an outlier, there is a possibility (although slight) of repetition, so it should remain in the data set, or should it? In other words, someone else could spend $1,441 at the store in the future, but we wouldn’t necessarily want to include it in our sales projections. Keeping this outlier in the data may make sense in some cases and not in others, so the answer is…it depends. Outliers are specific to their data sets, so consider the circumstances before eliminating them from your data.

If you determined the data point was not likely to be replicated, then you should eliminate it from the data set, before you conduct further testing. Ignoring this outlier could lead to skewed insights and misguided strategies. Let us look at some of these now.

IN CONTEXT
Bad Decisions

Even large companies have a tough time with outliers. Here are some real-world examples of how outliers have affected businesses.

Target
What happened? In 2016, Target’s predictive analytics algorithm misjudged sales forecasts due to a data anomaly. The system identified a teenage girl’s pregnancy based on her shopping habits and began sending her targeted ads for pregnancy-related products before her family even knew about it.

What have we learned? This isolated case highlighted how predictive models, designed to forecast product demand at a larger scale, can also make highly personalized predictions—sometimes with unintended consequences.

Blockbuster
What happened? Blockbuster Video's downfall is a classic example of ignoring outlier trends. As streaming services gained popularity, Blockbuster clung to its brick-and-mortar model, dismissing the outlier of declining rentals until it was too late.

What have we learned? Trends are like fashion; sometimes what is in today might be out tomorrow. Just make sure your business is not looking into the past and ignoring the future. How a business reacts to those trends determines whether it remains successful or dies a slow and painful death.

Housing Crisis
What happened? The 2008 financial crisis serves as a cautionary tale of underestimating outlier risks. Mortgage-backed securities, once considered safe investments, unraveled as outliers (subprime mortgages) defaulted, triggering a global economic meltdown.

What have we learned? Just as our bodies give us outliers when we are about to be ill (think fever, chills, etc.), outliers in business can point to financial sickness. As we learned in The Big Short, arrogance and/or ignorance can cause major economic damage. Keep your eyes open for outliers that may help diagnose problems before they become insurmountable.

summary
In this lesson, you focused on identifying outliers using the quartile method, understanding their effects, and utilizing boxplots for visualization. You reviewed how to calculate the interquartile range (IQR) and use it to determine outliers, which are data points that fall significantly outside the typical range. You also examined the impact of outliers on data analysis, such as skewing results and misleading interpretations. By looking at data exploration in businesses, such as the provided example of Blockbuster, you were able to see the impact of ignoring outlier data and its possible detrimental outcomes.

Source: THIS TUTORIAL WAS AUTHORED BY SOPHIA LEARNING. PLEASE SEE OUR TERMS OF USE.

REFERENCES

n.a. (n.d.). 2.5.3. Creating bins. www.peterstatistics.com

Terms to Know
Boxplot

A graph used to detect outliers.

Data Exploration

Process that allows you to understand the distribution of a data set and includes activities like identifying outliers.

Five Number Summary

Five numerical values computed from the data that describe the range, spread, and location of a quantitative variable.

Interquartile Range (IQR)

The range between Quartiles 1 and 3, which corresponds to the middle 50% of the data.

Quartile

The three values in a dataset which divide the data into four equal parts.

Formulas to Know
Interquartile Range (IQR)

IQR equals Q subscript 3 minus Q subscript 1

Upper/Lower Bound

Lower space Bound equals straight Q subscript 1 – Threshold equals straight Q subscript 1 – left parenthesis 1.5 times IQR right parenthesis
Upper space Bound equals straight Q subscript 3 plus Threshold equals straight Q subscript 3 plus left parenthesis 1.5 times IQR right parenthesis