How to Identify Statistical Outliers with an Interquartile Range in Excel

What are Statistical Outliers?

Statistical outliers are data points that are significantly different from the rest of the data set. They can be caused by measurement errors, experimental errors, or natural variability. Outliers can affect the mean, standard deviation, and other summary statistics of a data set, and may also distort the shape of the distribution.

What is an Interquartile Range?

The interquartile range (IQR) is a measure of variability based on quartiles, the values that divide a sorted data set into four equal parts. The first quartile (Q1) is the 25th percentile, roughly the median of the lower half of the data; the second quartile (Q2) is the median of the whole data set; and the third quartile (Q3) is the 75th percentile, roughly the median of the upper half. Some sources also label the maximum value as the fourth quartile (Q4). The IQR is the difference between Q3 and Q1, and is the range spanned by the middle 50% of the data.

How to Use an Interquartile Range to Identify Outliers?

One way to identify outliers is to use the IQR and a multiplier, usually 1.5, to define the lower and upper bounds of the data. Any data point that falls below Q1 – 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier. The 1.5 multiplier is a convention rather than a strict rule, and the method does not require normally distributed data. It works best when the distribution is roughly symmetric; on strongly skewed data it can flag many legitimate values on the long tail.
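
The rule above can be sketched in a few lines of Python for readers who want to check a result outside Excel. This is a minimal sketch, not part of the Excel workflow: the function names iqr_bounds and find_outliers and the readings list are made up for illustration, and it assumes the (n + 1)-based quartile interpolation that QUARTILE.EXC also uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # method="exclusive" interpolates at positions based on n + 1
    q1, _q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings, invented for this example
readings = [10, 12, 12, 13, 14, 15, 15, 16, 18, 45]

print(iqr_bounds(readings))     # (5.25, 23.25)
print(find_outliers(readings))  # [45]
```

Here Q1 is 12 and Q3 is 16.5, so the IQR is 4.5 and the bounds are 12 – 6.75 = 5.25 and 16.5 + 6.75 = 23.25. Only 45 falls outside them.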

How to Calculate an Interquartile Range and Identify Outliers in Excel?

To calculate an IQR and identify outliers in Excel, follow these steps:

  1. Enter your data in a column, such as A2:A21.
  2. In cells B2, B3, and B4, enter =QUARTILE.EXC($A$2:$A$21,1), =QUARTILE.EXC($A$2:$A$21,2), and =QUARTILE.EXC($A$2:$A$21,3) to calculate Q1, Q2, and Q3. The absolute references ($A$2:$A$21) keep the range fixed if you copy a formula to another cell. QUARTILE.EXC accepts only 1, 2, or 3 as its second argument and returns #NUM! for 4, so if you want the maximum (Q4) in B5, enter =MAX($A$2:$A$21) instead.
  3. In C2, enter the formula =B4-B2 to calculate the IQR. In C3, enter =B2-1.5*C2 to calculate the lower bound, and in C4, enter =B4+1.5*C2 to calculate the upper bound.
  4. In D2, enter the formula =IF(OR(A2<$C$3,A2>$C$4),"Outlier","") to identify outliers. Drag the formula down to D21. This formula will display “Outlier” if the data point is below the lower bound or above the upper bound, and leave the cell blank otherwise. The references to C3 and C4 must be absolute so that they do not shift as the formula is filled down.
  5. You can also use conditional formatting to highlight the outliers in the data column. Select the data column, such as A2:A21, and go to Home > Conditional Formatting > New Rule. Choose “Use a formula to determine which cells to format” and enter the formula =OR(A2<$C$3,A2>$C$4). Choose a format, such as red fill, and click OK.

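Here is an example of the resulting values for 20 sample data points, with the quartiles calculated by QUARTILE.EXC and Q4 shown as the maximum value:

| Data | Q1 | Q2 | Q3 | Q4 (max) | IQR | Lower Bound | Upper Bound | Outlier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 17.25 | 22.5 | 27.75 | 40 | 10.5 | 1.5 | 43.5 | |
| 14 | | | | | | | | |
| 15 | | | | | | | | |
| 16 | | | | | | | | |
| 17 | | | | | | | | |
| 18 | | | | | | | | |
| 19 | | | | | | | | |
| 20 | | | | | | | | |
| 21 | | | | | | | | |
| 22 | | | | | | | | |
| 23 | | | | | | | | |
| 24 | | | | | | | | |
| 25 | | | | | | | | |
| 26 | | | | | | | | |
| 27 | | | | | | | | |
| 28 | | | | | | | | |
| 29 | | | | | | | | |
| 30 | | | | | | | | |
| 31 | | | | | | | | |
| 40 | | | | | | | | |

In this sample, 40 is the largest value but is not flagged, because it is below the upper bound of 43.5 (27.75 + 1.5 * 10.5).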
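
The summary values for the 20 sample data points can be cross-checked outside Excel. The sketch below is only an illustration: the helper name fences is hypothetical, and it assumes that Python's "exclusive" and "inclusive" quantile methods use the same interpolation as QUARTILE.EXC and QUARTILE.INC respectively. It shows how much the bounds move depending on which quartile convention is used.

```python
from statistics import quantiles

# The 20 sample values from the Data column
data = [12, 14, 15, 16, 17, 18, 19, 20, 21, 22,
        23, 24, 25, 26, 27, 28, 29, 30, 31, 40]


def fences(values, method, k=1.5):
    """Return quartiles, IQR, and bounds for one quartile convention."""
    q1, q2, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
            "lower": q1 - k * iqr, "upper": q3 + k * iqr}


for method in ("exclusive", "inclusive"):
    summary = fences(data, method)
    # Collect the values outside the bounds for this convention
    flagged = [v for v in data
               if v < summary["lower"] or v > summary["upper"]]
    print(method, summary, flagged)
```

With the exclusive method the bounds come out as 1.5 and 43.5, and with the inclusive method as 3.5 and 41.5. Neither convention flags 40 for this sample, so a value that merely looks far from the rest is not necessarily outside the 1.5 * IQR bounds.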

What are Other Approaches to Identify Outliers?

There are other approaches to identify outliers, such as:

  • Using a box plot, which is a graphical representation of the data that shows the median, quartiles, and outliers. Outliers are usually marked by dots or asterisks outside the box and whiskers. You can create a box plot in Excel by selecting your data and going to Insert > Charts > Insert Statistic Chart > Box and Whisker.
  • Using a z-score, which is a measure of how many standard deviations a data point is away from the mean. A common rule of thumb is to consider any data point with a z-score greater than 3 or less than -3 as an outlier. You can calculate the z-score in Excel by using the formula =(A2-AVERAGE($A$2:$A$21))/STDEV.S($A$2:$A$21), where A2:A21 is your data range and STDEV.S is the current name for the older STDEV function. The absolute references keep the range fixed when you fill the formula down.
  • Using a modified z-score, which is a variation of the z-score that uses the median and the median absolute deviation (MAD) instead of the mean and the standard deviation. This makes it more robust to outliers. A common rule of thumb is to consider any data point with a modified z-score greater than 3.5 or less than -3.5 as an outlier. You can calculate the modified z-score in Excel by using the formula =0.6745*(A2-MEDIAN($A$2:$A$21))/MEDIAN(ABS($A$2:$A$21-MEDIAN($A$2:$A$21))), where A2:A21 is your data range. Because ABS operates on the whole range here, this is an array formula: in Excel versions without dynamic arrays, confirm it with Ctrl+Shift+Enter.
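
The two score-based rules can be compared side by side with a short Python sketch. The function names and the readings list are made up for illustration, and the sketch assumes the sample standard deviation (as STDEV.S uses) and a MAD that is not zero.

```python
from statistics import mean, median, stdev


def z_scores(values):
    """Classic z-scores using the mean and sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def modified_z_scores(values):
    """Modified z-scores using the median and MAD (assumes MAD != 0)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical readings, invented for this example
readings = [10, 12, 12, 13, 14, 15, 15, 16, 18, 45]

# Apply the rule-of-thumb cutoffs: 3 for z-scores, 3.5 for modified z-scores
by_z = [v for v, z in zip(readings, z_scores(readings)) if abs(z) > 3]
by_mod = [v for v, m in zip(readings, modified_z_scores(readings))
          if abs(m) > 3.5]

print(by_z)    # []
print(by_mod)  # [45]
```

In this made-up sample the plain z-score of 45 is only about 2.77, because the extreme value inflates the standard deviation it is measured against, so the cutoff of 3 misses it. The modified z-score of the same point is about 10.3, which illustrates why the median-based version is described as more robust.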
