# How to Remove Outliers in Data With Pandas

## With One Axis

- Create a `pandas.Series`, a one-dimensional `ndarray` with axis labels, holding 200 random values.
- Remove every value that lies below the 0.15 quantile or above the 0.85 quantile.

The two steps in code:

    import pandas as pd
    import numpy as np

    size = 200
    x = pd.Series(np.random.normal(size=size))  # 200 values
    x = x[x.between(x.quantile(.15), x.quantile(.85))]  # without outliers
    print(x)  # Now only 140 values
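
The same trimming step can be wrapped in a small helper so the cutoffs become parameters. This is a minimal sketch, not part of the original example: the function name `trim_by_quantile`, its default cutoffs, and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def trim_by_quantile(values: pd.Series, lower: float = 0.15, upper: float = 0.85) -> pd.Series:
    """Keep only the values between the lower and upper quantiles (inclusive)."""
    # quantile() with a list returns both cutoffs in one call
    low, high = values.quantile([lower, upper])
    return values[values.between(low, high)]


rng = np.random.default_rng(seed=0)   # seeded so repeated runs give the same data
x = pd.Series(rng.normal(size=200))   # 200 values
trimmed = trim_by_quantile(x)         # defaults match the 0.15 / 0.85 cutoffs used above
print(len(x), len(trimmed))
```

Because `between` is inclusive on both ends by default, values exactly equal to a cutoff are kept.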

## With Two Axes

This example plots some data over a period of time, a common pattern in data science.

### Initial Data Frame and Plot

Start by generating data that can be used in the example. Generate a random set of 200 numbers using `numpy.random` and an accompanying span of dates using `pandas.date_range`. Create a `pandas.DataFrame` containing the random values, indexed by date.

    import datetime

    # Reuses pd, np, and size from the previous example
    todays_date = datetime.datetime.now().date()
    dates = pd.date_range(todays_date - datetime.timedelta(10), periods=size, freq='D')
    rando_nums = np.random.normal(size=size)
    columns = ['rando']
    df = pd.DataFrame(rando_nums, index=dates, columns=columns)
    df  # 200 random numbers indexed by 200 consecutive days, starting 10 days ago

Plot the data frame.

    df.plot().get_figure()

### Demonstration: Plot the New Series

For demonstration purposes, set the variable `y` to the y-axis values, build a Boolean mask, `removed_outliers`, that marks the values between the 0.05 and 0.95 quantiles, and overlay a plot of the remaining values on the previous plot.

    y = df['rando']
    removed_outliers = y.between(y.quantile(.05), y.quantile(.95))
    print(str(y[removed_outliers].size) + "/" + str(size) + " data points remain.")
    y[removed_outliers].plot().get_figure()

### Final Data Frame and Plot

The resulting Boolean mask, `removed_outliers`, contains `True` and `False` in the proportions expected from the quantiles requested. With 200 values, 180 are `True` (kept) and 20 are `False` (outliers):

    print(removed_outliers.value_counts())
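
To make those proportions concrete, here is a self-contained sketch that rebuilds a comparable date-indexed series and counts the mask values. The fixed seed, the start date `2024-01-01`, and the variable name `mask` are assumptions for illustration; the original uses unseeded data starting relative to today's date.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)                          # seeded for repeatable output
dates = pd.date_range("2024-01-01", periods=200, freq="D")   # hypothetical start date
y = pd.Series(rng.normal(size=200), index=dates, name="rando")

# True = value lies between the 0.05 and 0.95 quantiles, False = outlier
mask = y.between(y.quantile(0.05), y.quantile(0.95))

print(mask.value_counts())
print(f"kept fraction: {mask.mean():.2f}")   # mean of a Boolean Series is the share of True
```

Taking the mean of the mask is a quick sanity check that the share of kept values matches the quantile range requested.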

`removed_outliers` can be used as a Boolean mask to remove the outliers from the original data frame.

Remember: `removed_outliers` is indexed by date, where each row is `True` if the value is between the 0.05 and 0.95 quantiles and `False` if the value lies at the edges. I want to `drop` the `False` rows, i.e. the edge values and their associated dates.

However, if I apply the Boolean mask to the data frame by using `df[removed_outliers].index`, I get only the dates of the `True` rows, which are the ones I want to keep. To get the dates of the `False` rows instead, invert the mask with `~` (`False` → `True` and `True` → `False`) before applying it.

    index_names = df[~removed_outliers].index  # INVERT removed_outliers!!
    print(index_names)  # The resulting 20 dates to drop.

`drop` the rows whose index labels are in the set of `index_names` created above.

    import plotly.graph_objs as go

    df.drop(index_names, inplace=True)
    df
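
As a cross-check on the drop step, the sketch below compares it with plain Boolean indexing, which keeps the `True` rows directly and leaves the original frame untouched. Boolean indexing is an alternative not used in the original walkthrough; the seed, the start date, and the names `mask`, `dropped`, and `filtered` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)                          # seeded for repeatable output
dates = pd.date_range("2024-01-01", periods=200, freq="D")   # hypothetical start date
df = pd.DataFrame({"rando": rng.normal(size=200)}, index=dates)

y = df["rando"]
mask = y.between(y.quantile(0.05), y.quantile(0.95))

dropped = df.drop(df[~mask].index)   # drop the dates of the False rows (returns a copy)
filtered = df[mask]                  # keep the True rows directly

print(dropped.equals(filtered))      # both approaches should select the same rows
print(len(filtered))
```

Both versions return a new data frame instead of modifying `df` in place, which makes it easier to rerun the filtering with different cutoffs.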

Plot the results using Plotly. The chart is interactive: hover over a point to see its date and value.

    trace = go.Scatter(x=df.index, y=df['rando'])
    go.Figure(data=[trace])