How to Remove Outliers in Data With Pandas
With One Axis
- Create a pandas.Series (a one-dimensional labeled array) with 200 random values.
- Remove all the random numbers that lie below the 0.15 quantile or above the 0.85 quantile.
import pandas as pd
import numpy as np

size = 200
x = pd.Series(np.random.normal(size=size))  # 200 values
x = x[x.between(x.quantile(.15), x.quantile(.85))]  # without outliers
print(x)  # Now only 140 values
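The trimming step above can be wrapped in a small helper so the cut-offs are explicit. This is a minimal sketch rather than part of the original walkthrough: the function name trim_by_quantile and the fixed seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def trim_by_quantile(s: pd.Series, lower: float, upper: float) -> pd.Series:
    """Keep only values between the lower and upper quantiles (inclusive).

    Hypothetical helper for illustration; not a pandas function.
    """
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s[s.between(lo, hi)]


rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
values = pd.Series(rng.normal(size=200))
trimmed = trim_by_quantile(values, 0.15, 0.85)
print(f"{trimmed.size}/{values.size} values remain")
```

Because between is inclusive on both ends by default, values exactly equal to a cut-off are kept.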
With Two Axes
This example plots some data over a period of time - a common pattern in data science.
Initial Data Frame and Plot
Start by generating data that can be used in the example. Generate a random set of 200 numbers using numpy.random and an accompanying span of dates using pandas.date_range. Create a pandas.DataFrame containing the random values, indexed by date.
import datetime

todays_date = datetime.datetime.now().date()
# 200 consecutive daily dates, starting 10 days before today
dates = pd.date_range(todays_date - datetime.timedelta(10), periods=size, freq='D')
rando_nums = np.random.normal(size=size)
columns = ['rando']
df = pd.DataFrame(rando_nums, index=dates, columns=columns)
df  # 200 random numbers indexed by consecutive days
Plot the data frame.
df.plot().get_figure()
Demonstration: Plot the New Series
For demonstration purposes, set the variable y to the y-axis values, build a Boolean mask, removed_outliers, that marks the values between the 0.05 and 0.95 quantiles, and overlay a plot of the remaining values on the previous plot.
y = df['rando']
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))
print(str(y[removed_outliers].size) + "/" + str(size) + " data points remain.")
y[removed_outliers].plot().get_figure()
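Outside a notebook, two separate plot() calls may end up on separate figures, so an explicit overlay can help. The sketch below passes the Axes returned by the first plot to the second one. The fixed dates, the seed, the legend label, and the Agg backend are assumptions made so the example runs as a standalone script.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: running as a script)
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
dates = pd.date_range("2024-01-01", periods=200, freq="D")  # placeholder dates
df = pd.DataFrame(rng.normal(size=200), index=dates, columns=["rando"])

y = df["rando"]
mask = y.between(y.quantile(0.05), y.quantile(0.95))

ax = df.plot()  # original data; returns the matplotlib Axes
y[mask].plot(ax=ax, label="outliers removed")  # draw on the same Axes
ax.legend()
fig = ax.get_figure()
```

The second line traces the first one wherever values were kept and bridges the gaps where outliers were dropped.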
Final Data Frame and Plot
The resulting Boolean Series, removed_outliers, contains True and False in the proportions expected for the requested quantiles (here, 180 True and 20 False):
print(removed_outliers.value_counts())
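As a rough check on the expected frequency, the number of True values can be compared with the share of data that lies between the two quantiles. This sketch assumes 200 distinct values and the default linear interpolation used by quantile; the variable names and seed are placeholders.

```python
import numpy as np
import pandas as pd

size = 200
rng = np.random.default_rng(seed=1)
y = pd.Series(rng.normal(size=size))

lower, upper = 0.05, 0.95
mask = y.between(y.quantile(lower), y.quantile(upper))

expected_kept = round(size * (upper - lower))  # 90% of 200 values
kept = int(mask.sum())        # True counts as 1
dropped = int((~mask).sum())  # rows outside the quantile range
print("kept:", kept, "dropped:", dropped, "expected kept:", expected_kept)
```

With repeated values or a different sample size, the counts can differ slightly from the nominal percentage.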
removed_outliers can be used as a Boolean mask to remove the outliers from the original data frame.
Remember: removed_outliers is indexed by date, where each row is True if the value is between the lowest and highest quantiles and False if the value lies at the edges. I want to drop the False rows, i.e. the values and associated dates that are not between the lowest and highest quantiles.
However, if I apply the Boolean mask to the data frame by using df[removed_outliers].index, it will only keep the True rows. To select the False rows instead, invert the mask with ~ (False → True and True → False) before applying it.
index_names = df[~removed_outliers].index  # INVERT removed_outliers!!
print(index_names)  # The resulting 20 dates to drop.
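To make the inversion concrete, here is a tiny sketch of what ~ does to a Boolean Series. The four values and their labels are made up for illustration.

```python
import pandas as pd

mask = pd.Series([True, False, True, False], index=list("abcd"))
inverted = ~mask  # element-wise NOT: True becomes False and vice versa

values = pd.Series([10, 99, 12, -50], index=list("abcd"))
print(values[mask].index.tolist())      # labels kept by the mask
print(values[inverted].index.tolist())  # labels the mask would discard
```

The second list is the analogue of index_names: the labels to hand to drop.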
Use drop to remove the rows whose index labels appear in index_names, created above.
import plotly.graph_objs as go

df.drop(index_names, inplace=True)
df
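If mutating df in place is not wanted, selecting rows with the mask directly should give the same rows as dropping the inverted selection. The sketch below compares the two routes on stand-in data; the fixed dates, seed, and variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
dates = pd.date_range("2024-01-01", periods=200, freq="D")  # placeholder dates
df = pd.DataFrame(rng.normal(size=200), index=dates, columns=["rando"])

y = df["rando"]
mask = y.between(y.quantile(0.05), y.quantile(0.95))

via_drop = df.drop(df[~mask].index)  # drop the rows flagged as outliers
via_mask = df[mask]                  # keep the rows flagged as inliers

print(via_drop.equals(via_mask))
print(len(df), len(via_mask))  # df itself is unchanged
```

Both routes return a new data frame, which leaves the original data available for comparison.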
Plot the results using Plotly. It's interactive and looks nicer. Hover over the points with your mouse to inspect individual values.
trace = go.Scatter(x=df.index, y=df['rando'])
go.Figure(data=[trace])