You may recall the term "outliers" when talking about univariate data. However, in bivariate data, outliers are a little bit different.
An outlier is any point that deviates substantially from the overall form of the remainder of the data points.
EXAMPLE
Let's take a look at these two data sets. One thing you might notice is that the values in Table 1 (on the left) seem quite random, whereas in Table 2 (on the right), all the x-values except one are 8, which might be a clue to something.

Table 1: x | Table 1: y | Table 2: x | Table 2: y |
---|---|---|---|
10 | 746 | 8 | 658 |
8 | 677 | 8 | 576 |
13 | 1274 | 8 | 771 |
9 | 711 | 8 | 884 |
11 | 781 | 8 | 847 |
14 | 884 | 8 | 704 |
6 | 608 | 8 | 525 |
4 | 539 | 19 | 1250 |
12 | 815 | 8 | 556 |
7 | 642 | 8 | 791 |
5 | 573 | 8 | 689 |

Graph 1 | Graph 2 |
Types of Outliers | Example |
---|---|
Extreme x-values | This point is an outlier in the x-direction because it is so much farther to the right than the rest of the points. Looking horizontally, though, it sits in the lower-middle part of the range of y-values, so it is not an outlier in the y-direction. |
Extreme y-values | This point is an outlier in the y-direction because it is so much higher than the other points, but it is not an outlier in the x-direction. |
Extreme x- and y-values | This point is an outlier in both the x- and y-directions because it is so much farther to the right and also higher than the rest of the points. |
Neither extreme x- nor y-values | Even though this point is not extreme in either the x- or y-direction, it doesn't fit the overall trend established by the rest of the data. |
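
As a rough numerical check on the first three types, the sketch below flags a point as extreme in x or y using the 1.5 × IQR rule that is often used for univariate data. That rule, the default quartile method of Python's `statistics.quantiles`, and the helper names `fences` and `classify` are assumptions made for illustration; they are not part of this tutorial. The data are the Table 1 values.

```python
from statistics import quantiles

# Table 1 data from the example above.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [746, 677, 1274, 711, 781, 884, 608, 539, 815, 642, 573]


def fences(values, k=1.5):
    """Return (low, high) cutoffs from the k * IQR rule (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(xs, ys):
    """Label each point by the direction(s) in which it is extreme."""
    x_lo, x_hi = fences(xs)
    y_lo, y_hi = fences(ys)
    labels = []
    for x, y in zip(xs, ys):
        extreme_x = not (x_lo <= x <= x_hi)
        extreme_y = not (y_lo <= y <= y_hi)
        if extreme_x and extreme_y:
            labels.append("extreme x and y")
        elif extreme_x:
            labels.append("extreme x")
        elif extreme_y:
            labels.append("extreme y")
        else:
            labels.append("neither")
    return labels


labels = classify(xs, ys)
for point, label in zip(zip(xs, ys), labels):
    if label != "neither":
        print(point, label)
```

Under these assumptions, only the point (13, 1274) is flagged, and only in the y-direction. A check like this cannot detect the fourth type, because such a point is not extreme in either direction; spotting it requires looking at the scatterplot and the overall trend.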
Influential points are points that, if removed, significantly change a statistical measure. Usually the measure in question is the correlation, but removing a point can also change other measures, such as the mean of x or y and the standard deviation of x or y.
Some outliers are influential, and some are not.
EXAMPLE
When the scatterplot on the left includes the outlier, the correlation coefficient is 0.816. However, when we remove the outlier, the correlation coefficient changes to 1. Since this dramatically changes the correlation, this outlier would be considered an influential point.
With outlier: r = 0.816 |
Without outlier: r = 1 |
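
The comparison above can be reproduced with a short sketch. It assumes the scatterplot is drawn from the Table 1 data and that the outlier is the point (13, 1274); the helper name `pearson_r` is hypothetical and simply applies the usual formula for the correlation coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Table 1 data; the outlier is assumed to be (13, 1274).
points = [(10, 746), (8, 677), (13, 1274), (9, 711), (11, 781), (14, 884),
          (6, 608), (4, 539), (12, 815), (7, 642), (5, 573)]
outlier = (13, 1274)
remaining = [p for p in points if p != outlier]

r_with = pearson_r(*zip(*points))
r_without = pearson_r(*zip(*remaining))

print(f"with outlier:    r = {r_with:.3f}")
print(f"without outlier: r = {r_without:.3f}")
```

The correlation without the point is not exactly 1, but it is close enough to round to 1, which is why removing this single point counts as a dramatic change.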
EXAMPLE
When the scatterplot below includes the outlier, the mean of x is 9, the standard deviation of x is 3.3, and the correlation is 0.816. When we remove the outlier, the mean becomes 8 because all the remaining x-values are 8, and the standard deviation becomes 0 because they never deviate from 8. The correlation is reported as 0; strictly speaking, it is undefined, because computing r requires dividing by the standard deviation of x, which is now 0. Either way, the presence of this one point changes all of these measures substantially. That outlier is certainly influential.
With outlier: mean of x = 9, standard deviation of x = 3.3, r = 0.816 |
Without outlier: mean of x = 8, standard deviation of x = 0, r = 0 |
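
These figures can be checked with a minimal sketch, assuming the scatterplot is drawn from the Table 2 data and the outlier is the point (19, 1250). The helpers `pearson_r` and `summarize` are hypothetical names. The sketch uses the sample standard deviation (`statistics.stdev`), which is the version that matches the 3.3 quoted above, and it returns `None` for the correlation when x has no spread, since the formula would divide by zero in that case.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Return Pearson's r, or None when either variable has zero spread."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # r is undefined: the formula would divide by zero
    return sxy / sqrt(sxx * syy)


def summarize(points):
    """Return (mean of x, sample standard deviation of x, correlation)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return mean(xs), stdev(xs), pearson_r(xs, ys)


# Table 2 data; the outlier is assumed to be (19, 1250).
points = [(8, 658), (8, 576), (8, 771), (8, 884), (8, 847), (8, 704),
          (8, 525), (19, 1250), (8, 556), (8, 791), (8, 689)]
outlier = (19, 1250)
remaining = [p for p in points if p != outlier]

with_stats = summarize(points)
without_stats = summarize(remaining)

print("with outlier:   ", with_stats)
print("without outlier:", without_stats)
```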
EXAMPLE
The outlier in the scatterplot below will not have much effect on the correlation or on the least squares regression line for this data set. In this case, a line is an inappropriate model, but if you did fit one, including or removing this point would change the line and the correlation very little.

Source: This tutorial was authored by Jonathan Osters for Sophia Learning.