Chad Scherrer / Nov 16 2019

Remix of Python by

Bayesian Optimal Pricing

Pricing is a common problem faced by businesses, and one that can be addressed effectively by Bayesian statistical methods. We'll step through a simple example and build the background necessary to extend get involved with this approach.

Let's start with some hypothetical data. A small company has tried a few different price points (say, one week each) and recorded the demand at each price. We'll abstract away some economic issues in order to focus on the statistical approach. Here's the data:

import pymc3 as pmimport matplotlib as mplmpl.use('Agg')import matplotlib.pyplot as pltimport numpy as npimport matplotlib.pyplot as pltimport pandas as pdmpl.rcParams['figure.dpi']= 300np.random.seed(42)a = 50000k = 20c = -2p  = np.linspace(25,55)p0 = np.linspace(30,50,5)eq = a * p**cq0 = np.random.poisson(a * p0**c)π = eq * (p - k)# np.log(prices-k) + np.log(a) + c*np.log(prices)#plt.plot(p,π);plt.figure()plt.plot(p0,q0,'o',c='C1');plt.xlabel('Price')plt.ylabel('Quantity sold')plt.gcf()

3.0s

Python

That middle data point looks like a possible outlier; we'll come back to that.

For now, let's start with some notation. We'll use $P$ and $Q$ to denote price and quantity, respectively. A "0" subscript will denote observed data, so we were able to sell $Q$ units when the price was set to $P_0$ .

Building a Model

I'm not an economist, but resources I've seen describing the relationship between $Q$ and $P$ are often of the form

Q=aP^c\ .

This ignores that $Q$ is not deterministic, but rather follows a distribution determined by $P$ . For statistical analysis, it makes more sense to write this in terms of a conditional expectation,

\mathbb{E}[Q|P] = a P^c\ .

Taking the log of both sides makes this much more familiar:

\log \mathbb{E}[Q|P] = \log a + c \log P\ .

The right side is now a linear function of the unknowns $\log a$ and $c$ , so considering these as parameters gives us a generalized linear model. Conveniently, the reasonable assumption that $Q$ is discrete and comes from independent unit purchases leads us to a Poisson distribution, which fits well with the log link.

Econometrically, this linear relationship in log-log space corresponds to constant price elasticity. This constant elasticity is just the parameter $c$ , so fitting the model will also give us an estimate of the elasticity.

Expressing this in PyMC3 is straightforward:

import pymc3 as pmwith pm.Model() as m:    loga = pm.Cauchy('loga',0,5)    c = pm.Cauchy('c',0,5)    logμ0 = loga + c * np.log(p0)    μ0 = pm.Deterministic('μ0',np.exp(logμ0))    qval = pm.Poisson('q',μ0,observed=q0)

3.2s

Python

There are a few things about this worth noting...

First, we used very broad Cauchy priors. For complex models, too many degrees of freedom can lead to convergence and interpretability problems, similarly to the indentifiability problems that sometimes happen in maximum likelihood estimation. For a simple case like this, it should be enough to make a note of it, in case we run into trouble.

Next, we expect price elasticity of demand $c$ to be negative, and we may even have a particular range in mind. For example, it could be reasonable to instead go with something like

negc = pm.HalfNormal('-c',3)

Shift+Enter to run

Python

We'd then have a minor change to express things in terms of negc.

The μ0 line may seem wordier than expected. Yes, we could have just used μ0 = a * p0 ** c .

Making the linear predictor explicit makes later changes easier, for example if we decided to include variable elasticity. Wrapping the value in Deterministic just means the sampler will save values of μ0 for us. This is a convenience at the cost of additional RAM use, so we'd leave it out for a complex model.

Model Fitting and Diagnostics

PyMC3 makes it easy to sample from the posterior:

with m:    trace = pm.sample()

11.6s

Python

If all parameters are continuous (as in our case), the default is the No-U-Turn Sampler ("NUTS"). Let's get a summary of the result.

np.round(pm.summary(trace),2)

0.8s

Python

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
loga	9.48	1.48	0.09	6.83	12.44	239.75	1.0
c	-1.63	0.41	0.02	-2.44	-0.89	240.91	1.0
μ0__0	51.93	6.08	0.3	40.29	63.25	316.31	1.0
μ0__1	40.25	3.23	0.11	33.45	46.13	632.42	1.0
μ0__2	32.39	2.64	0.09	27.22	37.44	642.34	1.0
μ0__3	26.8	2.9	0.13	21.5	32.34	411.9	1.0
μ0__4	22.67	3.2	0.16	16.92	28.77	329.95	1.0

7 items

This gives us the posterior mean and standard deviation, along with some other useful information:

mc_error estimates simulation error by breaking the trace into batches, computing the mean of each batch, and then the standard deviation of these means.
hpd_* gives highest posterior density intervals. The 2.5 and 97.5 labels are a bit misleading. There are lots of 95% credible intervals, depending on the relative weights of the left and right tails. The 95% HPD interval is the narrowest among these 95% intervals.
n_eff gives the effective number of samples. We took 2000 samples, but there are some significant autocorrelations. For example, our samples from the posterior of $c$ have about as much information as if we had taken 241 independent samples.
Rhat is sometimes called the potential scale reduction factor, and gives us a factor by which the variance might be reduced, if our MCMC chains had been longer. It's computed in terms of the variance between chains vs within each chain. Values near 1 are good.

Back to our problem. Results here look pretty good, with the exception of $n_\text{eff}$ .

Let's have a closer look what's going on. Here's the joint posterior of the parameters:

def fit0(p0, q0):    with pm.Model() as m:        loga = pm.Cauchy('loga', 0, 5)        c = pm.Cauchy('c', 0, 5, testval=-5)        μ0 = pm.Deterministic('μ0', np.exp(loga+c*np.log(p0)))        qval = pm.Poisson('q', μ0, observed=q0)        t = pm.sample()    return ttrace = fit0(p0,q0)plt.figure()plt.plot(trace['c'],trace['loga'],'.',alpha=0.5)plt.xlabel('$c$')plt.ylabel('$\log a$')plt.gcf()

11.1s

Python

This strong correlation is caused by the data not being centered. All of the $\log P_0$ values are positive, so any increase in the slope leads to a decrease in the intercept, and vice versa. If you don't see this immediately, draw an $(x,y)$ cloud of points with positive $x$ , and compare slope and intercept of some lines passing through the cloud.

The correlation doesn't cause too much trouble for NUTS, but it can be a big problem for some other samplers like Gibbs sampling. This is also a case where fast variational inference approximations tend to drastically underestimate the variance.

Just to note, this doesn't mean Bayesian method can't handle correlations. Rather, it's that the representation of the model can have an impact. There are lots of ways of dealing with this, including explicit parameterization of the correlation, as well as several reparameterization tricks. Let's look at a simple one.

Reparameterization

The trouble came from off-center $\log P_0$ values, so let's try taking that into account in the model. A quick adjustment does the job:

with pm.Model() as m2:    α = pm.Cauchy('α',0,5)    β = pm.Cauchy('β',0,5)    logμ0 = α + β * (np.log(p0) - np.log(p0).mean())    μ0 = pm.Deterministic('μ0',np.exp(logμ0))    qval = pm.Poisson('q0',μ0,observed=q0)

0.9s

Python

Note that we've changed the names of the parameters, to keep things straight. If we wanted to transform between the two parameterizations, we could equate the two logμ0 expressions and solve the system of equations (constant terms equal, and $\log P_0$ terms equal).

Let's see how it does.

with m2:    t = pm.sample()np.round(pm.summary(t),2)

3.7s

Python

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
α	3.5	0.08	0.0	3.35	3.66	868.15	1.0
β	-1.66	0.43	0.01	-2.56	-0.9	825.68	1.0
μ0__0	52.4	6.52	0.21	39.63	64.51	901.25	1.0
μ0__1	40.4	3.39	0.11	34.52	47.42	914.26	1.0
μ0__2	32.37	2.67	0.09	27.26	37.52	876.91	1.0
μ0__3	26.69	2.92	0.1	21.54	32.49	877.41	1.0
μ0__4	22.51	3.26	0.11	16.52	28.89	864.41	1.0

7 items

Much better! Note that it's not the parameters that were affected; $n_\text{eff}$ has also improved for the $\mu_0$ values, which are directly comparable across models.

And of course, the parameter correlations are much better:

plt.figure()plt.plot(t['α'],t['β'],'.',alpha=0.5);plt.xlabel('$\\alpha$')plt.ylabel('$\\beta$')plt.gcf()

1.0s

Python

There are a few other useful tools we can explore. Here's the trace plot:

pm.traceplot(t)plt.gcf()

2.8s

Python

Sampling is done using an independent (Markov) chain for each hardware core (by default, though this can be changed). The trace plot gives us a quick visual check that the chains are behaving well. Ideally, the distributions across chains should look very similar (left column) and have very little autocorrelation (right column). Note that since $\mu_0$ is a vector, the plots in the bottom row superimpose its five components.

Sometimes we need a more compact form, especially if there are lots of parameters. In this case the forest plot is helpful.

pm.forestplot(t,varnames=['α','β'], r_hat=True)plt.gcf()

1.4s

Python

Posterior Predictive Checks

The trace we found is a sample from the joint posterior distribution of the parameters. Instead of a point estimate, we have a sample of "possible worlds" that represent reality. To perform inference, we query each and aggregate the results. So, for example...

μ = np.exp(t.α + t.β * (np.log(p).reshape(-1,1) - np.log(p0).mean()))plt.figure()plt.plot(p,μ,c='k',alpha=0.01);plt.plot(p0,q0,'o',c='C1');#plt.plot(p,eq)plt.ylim(10,70)plt.xlabel('Price')plt.ylabel('Quantity')plt.gcf()

3.4s

Python

Each "hair" represents $\mathbb{E}[Q|P]$ to one sample from the posterior. On introducing the data I suggested the point $P_0=40$ may be an outlier. At first glance, the above plot might seem to confirm this. But remember, this is not a conditional distribution of $(Q|P)$ , but of its expectation.

The following boxplot makes this point more explicitly:

import scipyplt.figure()plt.boxplot(scipy.stats.poisson.cdf(q0,mu=t['μ0']));plt.xticks(range(1,6),p0);plt.xlabel('Price')plt.ylabel('$P(Q<Q_0 | P)$');plt.gcf()

1.1s

Python

Think of this as asking each possible world, "In your reality, what's the probability of seeing a smaller $Q$ value than this?" The responses are then aggregated by price. For $P_0=40$ , about half the results are above 0.1. The point is a bit unusual, but hardly an outlier.

This shouldn't be considered a recipe for the way to do posterior predictive checks. There are countless possibilities, and the right approach depends on the problem at hand.

For example, another very common approach is to use the posterior parameter samples to generate a replicated response $Q_0^\text{rep}$ , and then comparing this to the observed response $Q_0$ :

list(np.mean(q0  > np.random.poisson(t['μ0']), 0))

0.3s

Python

[0.532, 0.722, 0.116, 0.427, 0.664]

These are sometimes called Bayesian $p$ -values. For out-of-sample data, results for a well-calibrated model should follow a uniform distribution.

Optimization

We're almost there! Suppose we have a per-unit cost of $20. Then we can calculate the expected profit for each realization:

k = 20π = (p - k).reshape(-1,1) * μ

0.3s

Python

With a few array indexing tricks, we can then calculate the price that maximizes the overall expected profit (integrating over all parameters):

plt.figure()plt.plot(p,π,c='k',alpha=0.01);plt.plot(p,np.mean(π,1).T,c='C1',lw=2,label="$\mathbb{E}[\pi|P]$");plt.fill_between(p,(np.mean(π,1)-np.std(π,1)).T,(np.mean(π,1)+np.std(π,1)).T,alpha=0.1,color='C1')plt.plot(p,(np.mean(π,1)+np.std(π,1)).T,c='C1',lw=1,label="$\mathbb{E}[\pi|P]\ \pm$1 sd");plt.plot(p,(np.mean(π,1)-np.std(π,1)).T,c='C1',lw=1);pmax = p[np.argmax(np.mean(π,1))]plt.vlines(pmax,300,900,colors='C0',linestyles='dashed',label="argmax$_P\ \mathbb{E}[\pi|P]$")plt.ylim(300,900);plt.xlabel("Price $P$")plt.ylabel("Profit $\pi$")plt.legend();plt.gcf()

2.4s

Python

The dashed blue line indicates our first result for an optimal price.

So... are we done? NO!!! There are some lingering issues that we'll address next time. For example, is expected profit the right thing to maximize? In particular, our estimate is well above the best worst-case scenario (highest unshaded point below the curves). Should we be concerned about this?

Also, the mean is very flat near our solution, suggesting the potential for instability in the solution. Can we quantify this, and does it lead us to a different solution?

Part 2 coming soon!

Bayesian Optimal Pricing

Building a Model

Model Fitting and Diagnostics

Reparameterization

Posterior Predictive Checks

Optimization

Runtimes (1)