Forecasting Time Series Data With Prophet III
Originally published as Forecasting Time-Series data with Prophet - Part 3 at pythondata.com.
Introduction
This is the third in a series of posts about using Prophet to forecast time series data. Follow this link for parts 1 & 2 of Forecasting Time-Series Data With Prophet.
In those previous posts, I looked at forecasting monthly sales data 24 months into the future. In this post, I want to look at using the ‘holiday’ construct found within the Prophet library to try to better forecast around specific events. If we look at our sales data (you can find it here), there’s an obvious pattern each December. That pattern could exist for a variety of reasons, but let’s assume that it’s due to a promotion that is run every December.
Import necessary libraries
import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20, 10)
plt.style.use('ggplot')
Matplotlib must be manually registered with Pandas due to a conflict between Prophet and Pandas.
pd.plotting.register_matplotlib_converters()
Read in the data
Read the data in from the retail sales CSV file in the examples folder then set the index to the 'date' column. We are also parsing dates in the data file.
sales_df = pd.read_csv('retail_sales.csv', index_col='date', parse_dates=True)
sales_df.head()
date | sales |
---|---|
2009-10-01 | 338630 |
2009-11-01 | 339386 |
2009-12-01 | 400264 |
2010-01-01 | 314640 |
2010-02-01 | 311022 |
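For readers without the CSV at hand, the read step can be tried on a few inline rows. This is only a minimal sketch: the CSV text is typed in from the table above, and csv_text and sample_df are illustrative names that do not appear in the original notebook.

```python
import io

import pandas as pd

# Inline stand-in for the retail sales file (rows copied from the table above).
csv_text = """date,sales
2009-10-01,338630
2009-11-01,339386
2009-12-01,400264
2010-01-01,314640
2010-02-01,311022
"""

# index_col='date' makes the date column the index;
# parse_dates=True asks pandas to parse that index as datetimes.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)

print(sample_df.index.dtype)
print(sample_df.head())
```

If the index dtype printed here is not a datetime type, the later plotting and Prophet steps are likely to misbehave, so it is a cheap check to run first.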
Prepare for Prophet
As explained in previous Prophet posts, for Prophet to work, we need to change the names of these columns to ds and y.
df = sales_df.reset_index()
df.head()
| | date | sales |
---|---|---|
0 | 2009-10-01 | 338630 |
1 | 2009-11-01 | 339386 |
2 | 2009-12-01 | 400264 |
3 | 2010-01-01 | 314640 |
4 | 2010-02-01 | 311022 |
Let's rename the columns as required by fbprophet. Additionally, fbprophet doesn't like the index to be a datetime; it wants to see ds as a non-index column, so we'll keep the default integer index rather than setting a different one.
df=df.rename(columns={'date':'ds', 'sales':'y'})
df.head()
| | ds | y |
---|---|---|
0 | 2009-10-01 | 338630 |
1 | 2009-11-01 | 339386 |
2 | 2009-12-01 | 400264 |
3 | 2010-01-01 | 314640 |
4 | 2010-02-01 | 311022 |
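The reset-and-rename steps can also be written as a single chain. A minimal sketch on a three-row stand-in for sales_df (sample_sales and prophet_df are illustrative names, not from the original notebook):

```python
import pandas as pd

# Small stand-in for sales_df: a datetime index named 'date' and a 'sales' column.
sample_sales = pd.DataFrame(
    {'sales': [338630, 339386, 400264]},
    index=pd.to_datetime(['2009-10-01', '2009-11-01', '2009-12-01']).rename('date'),
)

# Move the index back into a regular column, then rename to the ds/y pair.
prophet_df = (
    sample_sales
    .reset_index()
    .rename(columns={'date': 'ds', 'sales': 'y'})
)

print(prophet_df.columns.tolist())
print(prophet_df.dtypes)
```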
Now's a good time to take a look at your data. Plot the data using Pandas' plot function.
df.set_index('ds').y.plot().figure
Reviewing the Data
We can see from this data that there is a spike in the same month each year. While this spike could be due to many different reasons, let's assume it's because there's a major promotion that this company runs every year at that time, which is in December for this dataset.
Because we know this promotion occurs every December, we want to use this knowledge to help Prophet better forecast those months, so we'll use Prophet's holiday construct (explained here).
The holiday construct is a Pandas dataframe with the holiday and date of the holiday. For this example, the construct would look like this:
promotions = pd.DataFrame({
    'holiday': 'december_promotion',
    'ds': pd.to_datetime(['2009-12-01', '2010-12-01', '2011-12-01', '2012-12-01',
                          '2013-12-01', '2014-12-01', '2015-12-01']),
    'lower_window': 0,
    'upper_window': 0,
})
This promotions dataframe consists of promotion dates for December in 2009 through 2015. The lower_window and upper_window values, which are measured in days, are set to zero to indicate that we don't want Prophet to consider any dates other than the ones listed.
promotions
| | holiday | ds | lower_window | upper_window |
---|---|---|---|---|
0 | december_promotion | 2009-12-01 | 0 | 0 |
1 | december_promotion | 2010-12-01 | 0 | 0 |
2 | december_promotion | 2011-12-01 | 0 | 0 |
3 | december_promotion | 2012-12-01 | 0 | 0 |
4 | december_promotion | 2013-12-01 | 0 | 0 |
5 | december_promotion | 2014-12-01 | 0 | 0 |
6 | december_promotion | 2015-12-01 | 0 | 0 |
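Typing each date by hand gets tedious over longer histories. As a sketch, the same dataframe can be generated from a year range; make_december_promotions is a hypothetical helper written for this illustration, not part of Prophet or the original notebook. Prophet's documentation asks for the holidays dataframe to list future occurrences as well as past ones, out as far as the forecast runs, so with a 24-month horizon it may be worth extending the range through 2016.

```python
import pandas as pd


def make_december_promotions(first_year, last_year):
    """Hypothetical helper: one 'december_promotion' row per year, dated December 1."""
    dates = pd.to_datetime(
        [f'{year}-12-01' for year in range(first_year, last_year + 1)]
    )
    return pd.DataFrame({
        'holiday': 'december_promotion',
        'ds': dates,
        'lower_window': 0,
        'upper_window': 0,
    })


# Reproduces the seven rows shown above (2009 through 2015).
promotions_generated = make_december_promotions(2009, 2015)
print(promotions_generated)
```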
To continue, we need to log-transform our data:
df['y'] = np.log(df['y'])
df.tail()
| | ds | y |
---|---|---|
67 | 2015-05-01 | 13.044650453675313 |
68 | 2015-06-01 | 13.013059541513272 |
69 | 2015-07-01 | 13.033991074775358 |
70 | 2015-08-01 | 13.030993424699561 |
71 | 2015-09-01 | 12.973670775134828 |
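Because the model is fit on logged values, everything it predicts is also on the log scale and has to be converted back with np.exp before it can be read as sales. A quick round-trip sketch on three of the sales figures (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Three sales values from the original data.
sales = pd.Series([338630, 339386, 400264], dtype='float64')

logged = np.log(sales)      # the scale Prophet will be trained on
restored = np.exp(logged)   # inverse transform, back to the original scale

print(logged.round(4).tolist())
print(np.allclose(restored, sales))
```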
Running Prophet
Now, let's set Prophet up to begin modeling our data, using our promotions dataframe as part of the forecast.
Note: Since we are using monthly data, you'll see a message from Prophet saying "Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this."
This is OK since we are working with monthly data, but you can suppress the message by passing weekly_seasonality=True in the instantiation of Prophet.
model = Prophet(holidays=promotions, weekly_seasonality=True, daily_seasonality=True)
model.fit(df)
We've instantiated and fit the model; now we need to build some future dates to forecast into.
future = model.make_future_dataframe(periods=24, freq='m')
future.tail()
| | ds |
---|---|
91 | 2017-04-30 |
92 | 2017-05-31 |
93 | 2017-06-30 |
94 | 2017-07-31 |
95 | 2017-08-31 |
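One thing that may be worth checking at this point: the future dates above land on month ends (a result of the 'm' frequency), while the history and the promotion dates fall on the first of each month. As far as I understand Prophet's holiday handling, with lower_window and upper_window both at zero a holiday only applies to rows whose date matches exactly, so a month-end row for December would not pick up a December 1 promotion. The sketch below uses pandas only, with date ranges constructed to mimic the two cases. Passing freq='MS' (month start) to make_future_dataframe is one possible way to align them, though the outputs shown in this post were produced with the original setting.

```python
import pandas as pd

# Promotion dates exactly as listed in the promotions dataframe.
promo_dates = pd.to_datetime(['2009-12-01', '2010-12-01', '2011-12-01', '2012-12-01',
                              '2013-12-01', '2014-12-01', '2015-12-01'])

# 24 future months following the last observation (2015-09-01), as month starts.
month_starts = pd.date_range('2015-10-01', periods=24, freq='MS')
# Month-end version of the same months, similar in shape to the future dataframe above.
month_ends = month_starts + pd.offsets.MonthEnd(0)

# Count how many future rows coincide exactly with a promotion date.
print('month-end matches:  ', month_ends.isin(promo_dates).sum())
print('month-start matches:', month_starts.isin(promo_dates).sum())
```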
To forecast this future data, we need to run it through Prophet's model.
forecast = model.predict(future)
The resulting forecast dataframe contains quite a bit of data, but we really only care about a few columns. First, let's look at the full dataframe:
forecast.tail()
We really only want to look at yhat, yhat_lower and yhat_upper, so we can do that with:
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
| | ds | yhat | yhat_lower | yhat_upper |
---|---|---|---|---|
91 | 2017-04-30 | 13.069092385393887 | 12.666347160467316 | 13.461268900286738 |
92 | 2017-05-31 | 13.071883163433725 | 12.637346529844663 | 13.491574989616737 |
93 | 2017-06-30 | 13.05773483351489 | 12.590041629879847 | 13.506115228104957 |
94 | 2017-07-31 | 13.06466187201633 | 12.565657922306391 | 13.540365994068482 |
95 | 2017-08-31 | 13.012393257087005 | 12.479345577381606 | 13.525504713353449 |
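These values are still on the log scale. A minimal sketch of converting a forecast row, including its interval bounds, back to sales units, using the 2017-04-30 row from the table above typed in by hand (forecast_row and value_cols are illustrative names):

```python
import numpy as np
import pandas as pd

# The 2017-04-30 row from the table above, still on the log scale.
forecast_row = pd.DataFrame({
    'ds': pd.to_datetime(['2017-04-30']),
    'yhat': [13.069092385393887],
    'yhat_lower': [12.666347160467316],
    'yhat_upper': [13.461268900286738],
})

value_cols = ['yhat', 'yhat_lower', 'yhat_upper']
# exp undoes the earlier np.log, so the columns read as sales again.
forecast_row[value_cols] = np.exp(forecast_row[value_cols])

print(forecast_row.round(2))
```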
Plotting Prophet results
Prophet has a plotting mechanism called plot. This plot functionality draws the original data (black dots), the model (blue line) and the uncertainty interval of the forecast (shaded blue area).
model.plot(forecast);
Personally, I'm not a fan of this visualization, but I'm not going to build my own in this post; you can see how I do that here.
Additionally, Prophet lets us take a look at the components of our model, including the holidays. This component plot is an important plot, as it lets you see the components of your model, including the trend and seasonality (identified in the yearly pane).
model.plot_components(forecast);
Comparing holidays vs no-holidays forecasts
Let's re-run our Prophet model without our promotions/holidays for comparison.
model_no_holiday = Prophet()
model_no_holiday.fit(df);
future_no_holiday = model_no_holiday.make_future_dataframe(periods=24, freq='m')
future_no_holiday.tail()
| | ds |
---|---|
91 | 2017-04-30 |
92 | 2017-05-31 |
93 | 2017-06-30 |
94 | 2017-07-31 |
95 | 2017-08-31 |
forecast_no_holiday = model_no_holiday.predict(future_no_holiday)
Let's compare the two forecasts now. Note: I doubt there will be much difference in these models due to the small amount of data, but it's a good example to see the process. We'll set the indexes and then join the forecast dataframes into a new dataframe called compared_df.
forecast.set_index('ds', inplace=True)
forecast_no_holiday.set_index('ds', inplace=True)
compared_df = forecast.join(forecast_no_holiday, rsuffix="_no_holiday")
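The rsuffix argument matters here because both forecast dataframes share the same column names, and join raises an error on overlapping columns when no suffix is given. A small sketch with made-up numbers (with_h and without_h are stand-ins for the two forecasts, not names from the original notebook):

```python
import pandas as pd

idx = pd.to_datetime(['2017-07-31', '2017-08-31']).rename('ds')

# Made-up values; both frames deliberately share the column name 'yhat'.
with_h = pd.DataFrame({'yhat': [13.06, 13.01]}, index=idx)
without_h = pd.DataFrame({'yhat': [13.05, 13.03]}, index=idx)

# Clashing columns from the right-hand frame get the suffix appended.
joined = with_h.join(without_h, rsuffix='_no_holiday')

print(joined.columns.tolist())
```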
We are only really interested in the yhat values, so let's remove all the rest and convert the logged values back to their original scale.
compared_df= np.exp(compared_df[['yhat', 'yhat_no_holiday']])
Now, let's take the percentage difference and the average difference for the model with holidays vs that without.
compared_df['diff_per'] = 100*(compared_df['yhat'] - compared_df['yhat_no_holiday']) / compared_df['yhat_no_holiday']
compared_df.tail()
ds | yhat | yhat_no_holiday | diff_per |
---|---|---|---|
2017-04-30 | 474061.52194792114 | 469583.26560153335 | 0.9536660853216669 |
2017-05-31 | 475386.3702518066 | 467836.5237404679 | 1.6137787727593058 |
2017-06-30 | 468707.8037344029 | 477502.74244912295 | -1.8418614036875616 |
2017-07-31 | 471965.8319525281 | 467920.13808058767 | 0.8646120443834493 |
2017-08-31 | 447930.451468062 | 454689.61942474794 | -1.4865454736436081 |
compared_df['diff_per'].mean()
This isn't an enormous difference (<1%), but there is some difference between using holidays and not using holidays.
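The percentage-difference step can be checked by hand on the five rows shown above. A minimal sketch, with the values typed in from the table, so the mean here covers only these five rows and not the whole forecast (tail_df is an illustrative name):

```python
import pandas as pd

# yhat values (original scale) copied from the last five rows of compared_df.
tail_df = pd.DataFrame(
    {
        'yhat': [474061.52194792114, 475386.3702518066, 468707.8037344029,
                 471965.8319525281, 447930.451468062],
        'yhat_no_holiday': [469583.26560153335, 467836.5237404679, 477502.74244912295,
                            467920.13808058767, 454689.61942474794],
    },
    index=pd.to_datetime(['2017-04-30', '2017-05-31', '2017-06-30',
                          '2017-07-31', '2017-08-31']).rename('ds'),
)

# Percentage difference of the holiday forecast relative to the no-holiday one.
tail_df['diff_per'] = (
    100 * (tail_df['yhat'] - tail_df['yhat_no_holiday']) / tail_df['yhat_no_holiday']
)

print(tail_df['diff_per'].round(3).tolist())
print(round(tail_df['diff_per'].mean(), 3))
```

Positive and negative differences offset each other in a plain mean, so tail_df['diff_per'].abs().mean() may be a more useful measure of the typical size of the gap.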
If you know there are holidays or events happening that might help/hurt your forecasting efforts, Prophet allows you to easily incorporate them into your modeling.