David Schmudde / Sep 10 2019

How to Remove Outliers in Data With Pandas

With One Axis

  1. Create a pandas.Series (a one-dimensional labeled array) holding 200 random values.
  2. Remove the values that fall below the 15th percentile or above the 85th percentile.
import pandas as pd
import numpy as np


size = 200 # number of random values to generate
x = pd.Series(np.random.normal(size=size)) # 200 values
x = x[x.between(x.quantile(.15), x.quantile(.85))] # without outliers

print(x) # Now only 140 values
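
The same filter can be wrapped in a small helper so the bounds are easy to change. This is only a sketch: the function name trim_quantiles, its default bounds, and the fixed seed are assumptions added for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd


def trim_quantiles(series, lower=0.15, upper=0.85):
    """Keep only the values between the lower and upper quantiles (inclusive)."""
    low, high = series.quantile(lower), series.quantile(upper)
    return series[series.between(low, high)]


rng = np.random.default_rng(seed=0)  # fixed seed (assumption) for repeatable output
values = pd.Series(rng.normal(size=200))
trimmed = trim_quantiles(values)

print(len(values), "values before,", len(trimmed), "values after")
```

Because between is inclusive on both ends by default, a value exactly equal to a quantile boundary is kept.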

With Two Axes

This example plots some data over a period of time - a common pattern in data science.

Initial Data Frame and Plot

Start by generating data that can be used in the example. Generate a random set of 200 numbers using numpy.random and an accompanying span of dates using pandas.date_range. Create a pandas.DataFrame containing the random values, indexed by date.

import datetime

todays_date = datetime.datetime.now().date()
dates = pd.date_range(todays_date-datetime.timedelta(10), periods=size, freq='D')

rando_nums = np.random.normal(size=size)

columns = ['rando']
df = pd.DataFrame(rando_nums, index=dates, columns=columns)

df # 200 random numbers indexed by consecutive days, starting 10 days ago

Plot the data frame.


Demonstration: Plot the New Series

For demonstration purposes, assign the rando column to the variable y and build a Boolean mask, removed_outliers, that is true for every value between the 5th and 95th percentiles. Then overlay the values that pass the mask on the previous plot.

y = df['rando']
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))

print(str(y[removed_outliers].size) + "/" + str(size) + " data points remain.") 


Final Data Frame and Plot

The resulting truth table, removed_outliers, contains true and false values at the frequency implied by the requested quantiles. With the 5th and 95th percentiles as bounds, about 90% of the rows are true and about 10% are false.
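
One way to check those frequencies is to count the true and false entries in the mask. The sketch below rebuilds the example so that it runs on its own. The fixed seed and the hard-coded start date are assumptions made for reproducibility, not values from the original post.

```python
import numpy as np
import pandas as pd

size = 200
rng = np.random.default_rng(seed=1)  # fixed seed (assumption)
dates = pd.date_range("2019-09-01", periods=size, freq="D")  # arbitrary start date (assumption)
df = pd.DataFrame(rng.normal(size=size), index=dates, columns=["rando"])

y = df["rando"]
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))

# Count how many rows are True (kept) and False (outliers).
print(removed_outliers.value_counts())

kept_count = int(removed_outliers.sum())
outlier_count = int((~removed_outliers).sum())
print(kept_count, "rows kept,", outlier_count, "rows flagged as outliers")
```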


removed_outliers can be used as a Boolean mask to remove the outliers from the original data frame.

Remember: removed_outliers is indexed by date. Each row is true if its value lies between the lower and upper quantiles and false if it lies at the edges. I want to drop the false rows, i.e. the edge values and their associated dates.

However, if I apply the Boolean mask to the data frame by using df[removed_outliers].index, I get the dates of the true rows, which are the ones I want to keep. To get the dates of the false rows instead, invert the mask (false becomes true and true becomes false) before applying it.

index_names = df[~removed_outliers].index # INVERT removed_outliers!!
print(index_names) # The resulting 20 dates to drop.

Drop the rows whose index labels appear in index_names, created above.

import plotly.graph_objs as go

df.drop(index_names, inplace=True)
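
As a cross-check, dropping the inverted-mask rows should give the same result as selecting the true rows directly with the mask. The sketch below compares the two routes on freshly generated data. The seed, the start date, and the names dropped and kept are assumptions for illustration, and it uses a non-inplace drop so the original frame stays intact.

```python
import numpy as np
import pandas as pd

size = 200
rng = np.random.default_rng(seed=2)  # fixed seed (assumption)
dates = pd.date_range("2019-09-01", periods=size, freq="D")  # arbitrary start date (assumption)
df = pd.DataFrame(rng.normal(size=size), index=dates, columns=["rando"])

y = df["rando"]
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))

# Route 1: find the dates of the False rows and drop them.
index_names = df[~removed_outliers].index
dropped = df.drop(index_names)

# Route 2: keep the True rows directly with the Boolean mask.
kept = df[removed_outliers]

print(dropped.equals(kept))
```

Either route works. The drop approach is handy when the list of removed dates is itself of interest.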


Plot the results using Plotly. It's interactive and looks nicer. Roll over with your mouse for greater insights into the data.

trace = go.Scatter(
   x = df.index,
   y = df['rando'])