Statistical outliers are data points that are significantly different from the rest of the data. They can affect the analysis and interpretation of the data, and sometimes indicate errors or anomalies. Therefore, it is important to identify and highlight outliers in your data set.
In this article, we will show you how to use Excel formulas to find and highlight outliers in your data. We will explain the basic theory behind the method, the steps to follow, and a detailed example with real data. We will also discuss some alternative approaches to deal with outliers.
The Basic Theory: Interquartile Range and Fences
The method we will use is based on the concept of interquartile range (IQR) and fences. The IQR is a measure of the spread of the middle 50% of the data. It is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). The quartiles are the values that divide the data into four equal groups, each containing 25% of the data. You can use the QUARTILE.INC function in Excel to find the quartiles of your data.
The fences are the boundaries that define the acceptable range of the data. Any value that falls outside the fences is considered an outlier. The fences are calculated by adding or subtracting 1.5 times the IQR from the quartiles. The lower fence is Q1 – 1.5 * IQR, and the upper fence is Q3 + 1.5 * IQR. You can use simple arithmetic formulas in Excel to find the fences of your data.
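For readers who want to cross-check the theory outside Excel, here is a minimal Python sketch of the same calculation. It assumes Python 3.8 or later and uses statistics.quantiles with method="inclusive", which interpolates in the same way as QUARTILE.INC. The function name iqr_fences, the parameter k, and the sample values are hypothetical and only serve as an illustration.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence) for a list of numbers."""
    # method="inclusive" uses the same interpolation rule as Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # Q1 - 1.5 * IQR by default
    upper_fence = q3 + k * iqr  # Q3 + 1.5 * IQR by default
    return q1, q3, iqr, lower_fence, upper_fence


# Hypothetical sample data, not taken from the article.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, q3, iqr, lower, upper = iqr_fences(sample)
print(q1, q3, iqr, lower, upper)  # 4.0 8.0 4.0 -2.0 14.0
```

In this sample, the value 30 lies above the upper fence of 14, so it would be treated as an outlier.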
The Steps to Follow: Formulas and Conditional Formatting
To find and highlight outliers in your data using Excel formulas, you need to follow these steps:
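- Select your data range and sort it in ascending order. Sorting is optional, because QUARTILE.INC does not require sorted data, but it makes the extreme values easier to spot.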
- Calculate the quartiles (Q1 and Q3) of your data using the QUARTILE.INC function. For example, if your data is in column A, you can use =QUARTILE.INC(A:A,1) for Q1 and =QUARTILE.INC(A:A,3) for Q3.
- Calculate the IQR of your data by subtracting Q1 from Q3. For example, if Q1 is in cell B1 and Q3 is in cell B2, you can use =B2-B1 for IQR.
- Calculate the fences (lower and upper) of your data by adding or subtracting 1.5 times the IQR from the quartiles. For example, if Q1 is in cell B1, Q3 is in cell B2, and IQR is in cell B3, you can use =B1-1.5*B3 for the lower fence and =B2+1.5*B3 for the upper fence.
- Use the OR function to create a logical formula that returns TRUE if a value is an outlier, i.e., if it is less than the lower fence or greater than the upper fence. For example, if your data is in column A and the lower fence is in cell B4 and the upper fence is in cell B5, you can enter =OR(A1<$B$4,A1>$B$5) in a helper column (for example, cell C1) for the first value in column A. The absolute references ($B$4 and $B$5) keep the formula pointing at the fences when it is copied.
- Copy the formula down the helper column so that it covers the rest of the values in column A.
- Select your data range and go to the Home tab. Click on Conditional Formatting and choose New Rule.
- In the New Formatting Rule dialog box, select Use a formula to determine which cells to format.
- In the Format values where this formula is true box, enter the formula you created in step 5. For example, =OR(A1<$B$4,A1>$B$5). The relative reference (A1) should be the first cell of the selected range.
- Click on Format and choose the formatting style you want to apply to the outliers. For example, you can choose a red fill color and a bold font.
- Click OK to close the Format Cells dialog box and OK again to close the New Formatting Rule dialog box.
- You should see the outliers in your data highlighted with the formatting style you chose.
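The logic of these steps can also be sketched in Python, which may help when checking a worksheet's results. This is only an illustration under stated assumptions: the function name flag_outliers and the readings list are hypothetical, and statistics.quantiles with method="inclusive" stands in for QUARTILE.INC.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return one True/False flag per value, True meaning 'outside the fences'."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Same test as =OR(A1<$B$4,A1>$B$5), evaluated once for every value.
    return [(v < lower_fence) or (v > upper_fence) for v in values]


# Hypothetical readings, not taken from the article.
readings = [12, 14, 15, 15, 16, 17, 18, 19, 45]
flags = flag_outliers(readings)
outliers = [v for v, is_out in zip(readings, flags) if is_out]
print(outliers)  # [45]
```

The list of flags plays the role of the helper column, and the filtered list corresponds to the cells that conditional formatting would highlight.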
Example with Real Data
To illustrate the method, let us use a real data set of the monthly average temperatures (in degrees Celsius) of 12 cities around the world in 2020. The data is taken from World Weather Online. You can download the Excel file with the data here.
We will apply the steps described above to find and highlight the outliers in the data.
- Select the data range (B2:M13) and sort it in ascending order by clicking on the Data tab and choosing Sort. In the Sort dialog box, select Temperature as the column to sort by and Smallest to Largest as the order. Click OK to sort the data.
- Calculate the quartiles (Q1 and Q3) of the data using the QUARTILE.INC function. In cell N2, enter =QUARTILE.INC(B2:M13,1) for Q1 and in cell N3, enter =QUARTILE.INC(B2:M13,3) for Q3. You should get 8.5 for Q1 and 23.5 for Q3.
- Calculate the IQR of the data by subtracting Q1 from Q3. In cell N4, enter =N3-N2 for IQR. You should get 15 for IQR.
- Calculate the fences (lower and upper) of your data by adding or subtracting 1.5 times the IQR from the quartiles. In cell N5, enter =N2-1.5*N4 for the lower fence and in cell N6, enter =N3+1.5*N4 for the upper fence. You should get -14 for the lower fence and 46 for the upper fence.
- Use the OR function to create a logical formula that returns TRUE if a value is an outlier, i.e., if it is less than the lower fence or greater than the upper fence. In cell O2, enter =OR(B2<$N$5,B2>$N$6) for the first value in row 2. You should get FALSE as the result.
- Copy the formula down to the rest of the values in column O. A formula in column O alone only tests the values in column B, so copy it across to column Z as well (O2:Z13) if you want a check for every cell in B2:M13. You should see some TRUE values indicating the outliers in the data.
- Select the data range (B2:M13) and go to the Home tab. Click on Conditional Formatting and choose New Rule.
- In the New Formatting Rule dialog box, select Use a formula to determine which cells to format.
- In the Format values where this formula is true box, enter the formula you created in step 5. For example, =OR(B2<$N$5,B2>$N$6).
- Click on Format and choose the formatting style you want to apply to the outliers. For example, you can choose a red fill color and a bold font.
- Click OK to close the Format Cells dialog box and OK again to close the New Formatting Rule dialog box.
- You should see the outliers in your data highlighted with the formatting style you chose.
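As you can see, the outliers are the values that fall outside the fences, that is, values that are either too low or too high compared to the rest of the data. For example, the temperature of 47.5 in Baghdad in July is above the upper fence of 46, so it is highlighted. By contrast, the temperature of -13.5 in Ulaanbaatar in January is just above the lower fence of -14, so it sits close to the boundary but is not flagged under the 1.5 * IQR rule.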
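The fence arithmetic in this example can be double-checked with a few lines of Python. The sketch below starts from the quartiles reported above (8.5 and 23.5) rather than from the raw data, which is not reproduced here. The helper name is_outlier and the test value 20.0 are hypothetical.

```python
# Quartiles as reported in the example (cells N2 and N3).
q1, q3 = 8.5, 23.5

iqr = q3 - q1                  # cell N4: =N3-N2
lower_fence = q1 - 1.5 * iqr   # cell N5: =N2-1.5*N4
upper_fence = q3 + 1.5 * iqr   # cell N6: =N3+1.5*N4


def is_outlier(value):
    """Same test as =OR(B2<$N$5,B2>$N$6) for a single value."""
    return value < lower_fence or value > upper_fence


print(iqr, lower_fence, upper_fence)  # 15.0 -14.0 46.0
print(is_outlier(47.5))               # True: above the upper fence
print(is_outlier(20.0))               # False: hypothetical mid-range value
```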
Alternative Approaches to Deal with Outliers
The method we used in this article is one of the common ways to find and highlight outliers in Excel using formulas. However, it is not the only way. There are some alternative approaches that you can use depending on your data and your purpose. Here are some of them:
- Use a different multiplier for the IQR. The 1.5 multiplier we used is a standard choice, but you can use a different value to make the fences more or less strict. For example, you can use 2 or 3 instead of 1.5 to identify only the most extreme outliers, or use 1 or 0.5 to identify more potential outliers.
- Use the mean and standard deviation instead of the quartiles and IQR. The mean is the average of the data and the standard deviation is a measure of how much the data varies from the mean. You can use the AVERAGE and STDEV functions in Excel to find these values (in newer versions of Excel, STDEV.S is the current name for STDEV). Then, you can use the formula =ABS([value]-[mean])/[standard deviation] to find the absolute z-score of each value, which is the number of standard deviations it lies away from the mean. A common rule of thumb is that any value with a z-score greater than 3 or less than -3 is an outlier. Because the formula above takes the absolute value, this means any result greater than 3. You can use this formula and conditional formatting to highlight the outliers in your data.
- Use a box plot or a scatter plot to visualize the outliers. A box plot is a type of chart that shows the distribution of the data using the quartiles, the median, and the outliers. A scatter plot is a type of chart that shows the relationship between two variables using dots. You can use the Insert tab in Excel to create these charts and see the outliers in your data. You can also customize the charts to change the appearance and the labels of the outliers.
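The z-score alternative from the list above can be sketched in Python as follows. This is an illustration only: the function name zscore_outliers, the threshold parameter, and the data are hypothetical. statistics.stdev computes the sample standard deviation, which is what Excel's STDEV returns.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return one True/False flag per value, True if |z-score| exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like Excel's STDEV
    # Same idea as =ABS([value]-[mean])/[standard deviation] compared with 3.
    return [abs(v - mu) / sigma > threshold for v in values]


# Hypothetical data: nineteen identical readings and one extreme value.
data = [10] * 19 + [100]
flags = zscore_outliers(data)
print([v for v, is_out in zip(data, flags) if is_out])  # [100]
```

One design point to keep in mind: a single extreme value inflates both the mean and the standard deviation, so in small data sets the z-score of even a very unusual value can stay below 3. The IQR fences are less sensitive to this effect.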