David Schmudde / Sep 10 2019

How to Remove Outliers in Data With Pandas

With One Axis

  1. Create a pandas.Series of 200 random values drawn from a normal distribution.
  2. Keep only the values between the 15th and 85th percentiles, which removes the lowest 15% and the highest 15% of the values.

import pandas as pd
import numpy as np

size=200

x = pd.Series(np.random.normal(size=size)) # 200 values
x = x[x.between(x.quantile(.15), x.quantile(.85))] # without outliers

print(x) # Now only 140 values
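
The same filter can be wrapped in a small helper so the cutoffs are explicit. This is only a sketch: the function name trim_by_quantile, its default cutoffs, and the fixed seed are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def trim_by_quantile(series, lower=0.15, upper=0.85):
    """Keep only values between the lower and upper quantiles (inclusive)."""
    low, high = series.quantile(lower), series.quantile(upper)
    return series[series.between(low, high)]


rng = np.random.default_rng(0)  # fixed seed (an assumption) so reruns match
x = pd.Series(rng.normal(size=200))
trimmed = trim_by_quantile(x)

print(len(x), len(trimmed))  # original count, count after trimming
```

Calling trim_by_quantile(x, 0.05, 0.95) would apply the narrower trim used later in this post.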

With Two Axes

This example plots some data over a period of time - a common pattern in data science.

Initial Data Frame and Plot

Start by generating data that can be used in the example. Generate a random set of 200 numbers using numpy.random and an accompanying span of dates using pandas.date_range. Create a pandas.DataFrame containing the random values, indexed by date.

import datetime

todays_date = datetime.datetime.now().date()
# 200 consecutive days, starting 10 days before today
dates = pd.date_range(todays_date - datetime.timedelta(days=10), periods=size, freq='D')

rando_nums = np.random.normal(size=size)

columns = ['rando']
df = pd.DataFrame(rando_nums, index=dates, columns=columns)

df # 200 random numbers indexed by consecutive days

Plot the data frame.

df.plot().get_figure()

Demonstration: Plot the New Series

For demonstration purposes, set the variable y to the 'rando' column, build a Boolean mask, removed_outliers, that is true for values between the 5th and 95th percentiles, and plot the filtered series, y[removed_outliers], for comparison with the previous plot.

y = df['rando']
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))

print(f"{y[removed_outliers].size}/{size} data points remain.")

y[removed_outliers].plot().get_figure()

Final Data Frame and Plot

The resulting Boolean Series, removed_outliers, contains true and false values in the proportions expected for the requested quantiles. With 200 values and cutoffs at .05 and .95, 180 are true and 20 are false:

print(removed_outliers.value_counts())
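
As a check on that expected frequency, the sketch below rebuilds the mask on seeded data and counts each value. The seed and the names rng, mask, kept, and dropped are assumptions for illustration. With 200 distinct values, the .05 and .95 cutoffs each fall between two neighboring data points, so 10 values sit below the lower cutoff and 10 sit above the upper one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
y = pd.Series(rng.normal(size=200))

# True where the value lies between the 5th and 95th percentiles (inclusive)
mask = y.between(y.quantile(.05), y.quantile(.95))

kept = int(mask.sum())        # number of True rows
dropped = int((~mask).sum())  # number of False rows

print(mask.value_counts())
print(kept, dropped)
```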

removed_outliers can be used as a Boolean mask to remove the outliers from the original data frame.

Remember: removed_outliers is indexed by date. Each row is true if the value lies between the lower and upper quantiles and false if it lies at the edges. I want to drop the false rows, i.e. the edge values and their associated dates.

However, if I apply the Boolean mask to the data frame by using df[removed_outliers].index, I get the dates of the true rows, which are the ones I want to keep. To get the dates of the false rows instead, invert the mask (false becomes true and true becomes false) before applying it.

index_names = df[~removed_outliers].index # INVERT removed_outliers!!
print(index_names) # The resulting 20 dates to drop.
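
The effect of the ~ operator is easier to see on a tiny example. This sketch uses a hand-written four-row mask with arbitrary dates; the names to_keep and to_drop are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2019-09-01", periods=4, freq="D")  # arbitrary dates
mask = pd.Series([True, False, True, False], index=dates)

to_keep = mask[mask].index    # dates where the mask is True
to_drop = mask[~mask].index   # dates where the mask is False

print(list(to_drop.strftime("%Y-%m-%d")))  # ['2019-09-02', '2019-09-04']
```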

Drop the rows whose index labels appear in index_names, the set of dates created above.

import plotly.graph_objs as go

df.drop(index_names, inplace=True)

df
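
Dropping by the inverted index should give the same rows as selecting with the mask directly. The sketch below compares the two on seeded data. The seed, the fixed start date, and the names dropped and selected are assumptions for illustration, and drop is called without inplace=True so that the original frame stays intact for the comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption)
dates = pd.date_range("2019-09-01", periods=200, freq="D")  # arbitrary start date
df = pd.DataFrame(rng.normal(size=200), index=dates, columns=["rando"])

y = df["rando"]
mask = y.between(y.quantile(.05), y.quantile(.95))

dropped = df.drop(df[~mask].index)  # drop the rows where the mask is False
selected = df[mask]                 # keep the rows where the mask is True

print(dropped.equals(selected))
print(len(selected))
```

Boolean indexing is the shorter of the two. The drop approach makes the list of removed dates available for inspection first.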

Plot the results using Plotly. It's interactive and looks nicer. Hover over the plot with your mouse to inspect individual data points.

trace = go.Scatter(
    x=df.index,
    y=df['rando'])

go.Figure(data=[trace])