Forecasting Time-Series data with Prophet
From Forecasting Time-Series data with Prophet and the associated notebook.
import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt
# %matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)
plt.style.use('ggplot')
pd.plotting.register_matplotlib_converters()
Read in the data
Read the data in from the retail sales CSV file in the examples folder, then set the index to the 'date' column. We are also parsing dates in the data file.
sales_df = pd.read_csv('retail_sales.csv', index_col='date', parse_dates=True)
sales_df.head()
date | sales |
---|---|
2009-10-01 | 338630 |
2009-11-01 | 339386 |
2009-12-01 | 400264 |
2010-01-01 | 314640 |
2010-02-01 | 311022 |
Prepare for Prophet
For Prophet to work, we need to change the names of these columns to 'ds' and 'y', so let's create a new dataframe and keep our old one handy (you'll see why later). The new dataframe will initially be created with an integer index so we can rename the columns.
df = sales_df.reset_index()
df.head()
date | sales | |
---|---|---|
0 | 2009-10-01 | 338630 |
1 | 2009-11-01 | 339386 |
2 | 2009-12-01 | 400264 |
3 | 2010-01-01 | 314640 |
4 | 2010-02-01 | 311022 |
Let's rename the columns as required by fbprophet. Additionally, fbprophet doesn't like the index to be a datetime; it wants to see 'ds' as a non-index column, so we won't set an index different from the integer index.
df=df.rename(columns={'date':'ds', 'sales':'y'})
df.head()
ds | y | |
---|---|---|
0 | 2009-10-01 | 338630 |
1 | 2009-11-01 | 339386 |
2 | 2009-12-01 | 400264 |
3 | 2010-01-01 | 314640 |
4 | 2010-02-01 | 311022 |
Now's a good time to take a look at your data. Plot the data using pandas' plot function:
df.set_index('ds').y.plot()
When working with time-series data, it's good to take a look at the data to determine if trends exist, whether it is stationary, and whether it has any outliers or other anomalies. Facebook Prophet's example uses a log transform as a way to remove some of these anomalies. It isn't the absolute 'best' way to do this, but given that it's the example and a simple data series, I'll follow their lead for now. Taking the log of a number is easily reversible, so you can always get back to your original data.
To log-transform your data, you can use numpy's log() function
df['y'] = np.log(df['y'])
df.tail()
ds | y | |
---|---|---|
67 | 2015-05-01 | 13.044650453675313 |
68 | 2015-06-01 | 13.013059541513272 |
69 | 2015-07-01 | 13.033991074775358 |
70 | 2015-08-01 | 13.030993424699561 |
71 | 2015-09-01 | 12.973670775134828 |
df.set_index('ds').y.plot().get_figure()
As you can see in the above chart, the plot looks the same as the first one but just at a different scale.
Running Prophet
Now, let's set prophet up to begin modeling our data.
Note: Since we are using monthly data, you'll see a message from Prophet saying Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this. This is OK since we are working with monthly data, but you can re-enable weekly seasonality by passing weekly_seasonality=True in the instantiation of Prophet.
model = Prophet()
model.fit(df);
Forecasting is fairly useless unless you can look into the future, so we need to add some future dates to our dataframe. For this example, I want to forecast 2 years into the future, so I'll build a future dataframe with 24 periods since we are working with monthly data. Note the freq='m' inclusion to ensure we are adding 24 months of data.
This can be done with the following code:
future = model.make_future_dataframe(periods=24, freq='m')
future.tail()
ds | |
---|---|
91 | 2017-04-30 |
92 | 2017-05-31 |
93 | 2017-06-30 |
94 | 2017-07-31 |
95 | 2017-08-31 |
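The month-end dates above (2017-04-30, 2017-05-31, …) come from the 'm' frequency, which pandas treats as a month-end offset. A quick sketch of that behavior using pandas directly (the start date is illustrative):

```python
import pandas as pd

# MonthEnd is the offset behind the 'm'/'M' frequency alias:
# each step lands on the last calendar day of a month.
dates = pd.date_range(start='2017-04-30', periods=3, freq=pd.offsets.MonthEnd())

print(list(dates.strftime('%Y-%m-%d')))  # → ['2017-04-30', '2017-05-31', '2017-06-30']
```

So the future dataframe's dates won't line up exactly with the first-of-month dates in the original data; that's expected with this frequency choice.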
To forecast this future data, we need to run it through Prophet's model.
forecast = model.predict(future)
The resulting forecast dataframe contains quite a bit of data, but we really only care about a few columns. First, let's look at the full dataframe:
forecast.tail()
We really only want to look at yhat, yhat_lower and yhat_upper, so we can do that with:
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
91 | 2017-04-30 | 13.05960091142992 | 12.868789705246625 | 13.261308368361817 |
92 | 2017-05-31 | 13.055874205630584 | 12.851931325157535 | 13.278151695777442 |
93 | 2017-06-30 | 13.076325182235378 | 12.858941552645705 | 13.317519252029376 |
94 | 2017-07-31 | 13.056052915211298 | 12.820086479079336 | 13.30861058921531 |
95 | 2017-08-31 | 13.027370310048981 | 12.775521462626092 | 13.301355837963118 |
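Remember that yhat, yhat_lower and yhat_upper are all in log space, since we fit the model on log-transformed sales. Applying np.exp brings them back to dollar sales; note that an additive interval in log space becomes a multiplicative one in the original units. A sketch using approximate values from the last row above:

```python
import numpy as np

# Approximate values from the last forecast row above (log space).
yhat, lower, upper = 13.0274, 12.7755, 13.3014

# Back-transform to the original sales scale.
point = np.exp(yhat)                    # point forecast in dollars
width = np.exp(upper) - np.exp(lower)   # interval width in dollars

# The interval is asymmetric around the point forecast in dollar terms,
# a consequence of exponentiating a symmetric log-space interval.
print(point - np.exp(lower) < np.exp(upper) - point)  # → True
```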
Plotting Prophet results
Prophet has a plotting mechanism called plot. This plot functionality draws the original data (black dots), the model (blue line) and the uncertainty interval of the forecast (shaded blue area).
model.plot(forecast);
Personally, I'm not a fan of this visualization, so I like to break the data up and build the chart myself. The next section describes how I build my own visualization for Prophet modeling.
Visualizing Prophet models
In order to build a useful dataframe to visualize our model versus our original data, we need to combine the output of the Prophet model with our original data set, then we'll build a new chart manually using pandas and matplotlib.
First, let's set our dataframes to have the same index of ds
df.set_index('ds', inplace=True)
forecast.set_index('ds', inplace=True)
Now, we'll combine the original data and our forecast model data
viz_df = sales_df.join(forecast[['yhat', 'yhat_lower','yhat_upper']], how = 'outer')
If we look at the head(), we see the data has been joined correctly, but the scales of our original data (sales) and our model (yhat) are different. We need to rescale the yhat column(s) to get the same scale, so we'll use numpy's exp function to do that.
viz_df.head()
sales | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
2009-10-01 | 338630.0 | 12.72891629967664 | 12.718874776570726 | 12.738810946015391 |
2009-11-01 | 339386.0 | 12.749435087873692 | 12.739232574321054 | 12.75978767893567 |
2009-12-01 | 400264.0 | 12.887443656406615 | 12.876810477508574 | 12.897705777895284 |
2010-01-01 | 314640.0 | 12.662469402438917 | 12.652474375776993 | 12.673040271207338 |
2010-02-01 | 311022.0 | 12.655825281055929 | 12.645359960829193 | 12.665377674190049 |
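The outer join used above keeps every date from both frames and fills NaN wherever one frame has no row for that date. A minimal sketch of that behavior with toy data (names and values are illustrative):

```python
import pandas as pd

# Two small frames with partially overlapping datetime indexes.
actual = pd.DataFrame({'sales': [100.0, 110.0]},
                      index=pd.to_datetime(['2015-01-01', '2015-02-01']))
model = pd.DataFrame({'yhat': [4.7, 4.8]},
                     index=pd.to_datetime(['2015-02-01', '2015-03-01']))

# how='outer' keeps the union of both indexes.
joined = actual.join(model, how='outer')
print(joined)

# 2015-01-01 has sales but no yhat; 2015-03-01 has yhat but no sales.
print(int(joined['yhat'].isna().sum()), int(joined['sales'].isna().sum()))  # → 1 1
```

In our case the head rows all overlap, so no NaNs appear yet; they show up later when the forecast extends past the original data.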
viz_df['yhat_rescaled'] = np.exp(viz_df['yhat'])
viz_df.head()
sales | yhat | yhat_lower | yhat_upper | yhat_rescaled | |
---|---|---|---|---|---|
2009-10-01 | 338630.0 | 12.72891629967664 | 12.718874776570726 | 12.738810946015391 | 337363.51235967537 |
2009-11-01 | 339386.0 | 12.749435087873692 | 12.739232574321054 | 12.75978767893567 | 344357.3095608864 |
2009-12-01 | 400264.0 | 12.887443656406615 | 12.876810477508574 | 12.897705777895284 | 395317.159207676 |
2010-01-01 | 314640.0 | 12.662469402438917 | 12.652474375776993 | 12.673040271207338 | 315675.2904628142 |
2010-02-01 | 311022.0 | 12.655825281055929 | 12.645359960829193 | 12.665377674190049 | 313584.8577497736 |
Let's take a look at the sales and yhat_rescaled data together in a chart.
viz_df[['sales', 'yhat_rescaled']].plot()
You can see from the chart that the model (blue) is pretty good when plotted against the actual signal (orange), but I like to make my visualizations a little easier to understand. To build my 'better' visualization, we'll need to go back to our original sales_df and forecast dataframes.
First things first: we need to find the 2nd-to-last date of the original sales data in sales_df in order to ensure the original sales data and model data charts are connected.
sales_df.index = pd.to_datetime(sales_df.index)  # make sure our index is a datetime object
connect_date = sales_df.index[-2]  # select the 2nd to last date
Using the connect_date, we can now grab only the model data that comes after that date (you'll see why in a minute). To do this, we'll mask the forecast data.
mask = (forecast.index > connect_date)
predict_df = forecast.loc[mask]
predict_df.head()
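Boolean masking on a DatetimeIndex works the same way on a toy frame; everything at or before the connect date is dropped, and only strictly later rows survive. A sketch with made-up values:

```python
import pandas as pd

# Four month-start dates with dummy forecast values.
idx = pd.date_range('2015-07-01', periods=4, freq='MS')
fc = pd.DataFrame({'yhat': [1.0, 2.0, 3.0, 4.0]}, index=idx)

connect_date = idx[-2]          # the 2nd-to-last date (2015-09-01)
mask = fc.index > connect_date  # strictly after connect_date

print(fc.loc[mask].index.tolist())  # only 2015-10-01 remains
```

Because the comparison is strict (>), the row at connect_date itself stays in the original data side, which is what lets the two line segments meet when plotted.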
Now, let's build a dataframe to use in our new visualization. We'll follow the same steps we did before.
viz_df = sales_df.join(predict_df[['yhat', 'yhat_lower', 'yhat_upper']], how='outer')
viz_df['yhat_scaled'] = np.exp(viz_df['yhat'])
Now, if we take a look at the head() of viz_df, we'll see NaNs everywhere except for our original data rows.
viz_df.head()
sales | yhat | yhat_lower | yhat_upper | yhat_scaled | |
---|---|---|---|---|---|
2009-10-01 | 338630.0 | NaN | NaN | NaN | NaN
2009-11-01 | 339386.0 | NaN | NaN | NaN | NaN
2009-12-01 | 400264.0 | NaN | NaN | NaN | NaN
2010-01-01 | 314640.0 | NaN | NaN | NaN | NaN
2010-02-01 | 311022.0 | NaN | NaN | NaN | NaN
If we take a look at the tail() of viz_df, you'll see we have values for the forecasted data and NaNs for the original data series.
viz_df.tail()
sales | yhat | yhat_lower | yhat_upper | yhat_scaled | |
---|---|---|---|---|---|
2017-04-30 | NaN | 13.05960091142992 | 12.868789705246625 | 13.261308368361817 | 469583.26560153335
2017-05-31 | NaN | 13.055874205630584 | 12.851931325157535 | 13.278151695777442 | 467836.5237404679
2017-06-30 | NaN | 13.076325182235378 | 12.858941552645705 | 13.317519252029376 | 477502.74244912295
2017-07-31 | NaN | 13.056052915211298 | 12.820086479079336 | 13.30861058921531 | 467920.13808058767
2017-08-31 | NaN | 13.027370310048981 | 12.775521462626092 | 13.301355837963118 | 454689.61942474794
Time to plot
Now, let's plot everything to get the 'final' visualization of our sales data and forecast with errors.
fig, ax1 = plt.subplots()
ax1.plot(viz_df.sales)
ax1.plot(viz_df.yhat_scaled, color='black', linestyle=':')
ax1.fill_between(viz_df.index, np.exp(viz_df['yhat_upper']), np.exp(viz_df['yhat_lower']), alpha=0.5, color='darkgray')
ax1.set_title('Sales (Orange) vs Sales Forecast (Black)')
ax1.set_ylabel('Dollar Sales')
ax1.set_xlabel('Date')
L = ax1.legend()  # get the legend
L.get_texts()[0].set_text('Actual Sales')  # change the legend text for 1st plot
L.get_texts()[1].set_text('Forecasted Sales')  # change the legend text for 2nd plot
This visualization is much better (in my opinion) than the default fbprophet plot. It is much easier to quickly understand and describe what's happening. The orange line is actual sales data, the black dotted line is the forecast, and the gray shaded area is the uncertainty estimation of the forecast.