Bayesian Optimal Pricing

Pricing is a common problem faced by businesses, and one that can be addressed effectively by Bayesian statistical methods. We'll step through a simple example and build the background necessary to get started with this approach.

Let's start with some hypothetical data. A small company has tried a few different price points (say, one week each) and recorded the demand at each price. We'll abstract away some economic issues in order to focus on the statistical approach. Here's the data:
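(The original post shows the observations as a plot. As a stand-in so the later code has something to run against, here's a minimal sketch of the setup. The numbers are illustrative placeholders rather than the post's actual measurements, and p is an assumed grid of candidate prices used for predictions further down.)

import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholder data -- NOT the actual observations from the post
p0 = np.array([30., 35., 40., 45., 50.])   # prices tried, one week each
q0 = np.array([55, 41, 28, 29, 23])         # units sold at each price
p = np.linspace(20, 60, 101)                # assumed grid of candidate prices

plt.plot(p0, q0, 'o')
plt.xlabel('Price')
plt.ylabel('Quantity')
plt.gcf()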

That middle data point looks like a possible outlier; we'll come back to that.

For now, let's start with some notation. We'll use $P$ and $Q$ to denote price and quantity, respectively. A "0" subscript will denote observed data, so we were able to sell $Q_0$ units when the price was set to $P_0$.

Building a Model

I'm not an economist, but resources I've seen describing the relationship between $Q$ and $P$ are often of the form

$$Q = a P^c\ .$$

This ignores that $Q$ is not deterministic, but rather follows a distribution determined by $P$. For statistical analysis, it makes more sense to write this in terms of a conditional expectation,

$$\mathbb{E}[Q|P] = a P^c\ .$$

Taking the log of both sides makes this much more familiar:

$$\log \mathbb{E}[Q|P] = \log a + c \log P\ .$$

The right side is now a linear function of the unknowns $\log a$ and $c$, so considering these as parameters gives us a generalized linear model. Conveniently, the reasonable assumption that $Q$ is discrete and comes from independent unit purchases leads us to a Poisson distribution, which fits well with the log link.

Econometrically, this linear relationship in log-log space corresponds to constant price elasticity. This constant elasticity is just the parameter $c$, so fitting the model will also give us an estimate of the elasticity.
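To see why, recall that price elasticity of demand is the derivative of log-quantity with respect to log-price, which in this model is constant:

$$\epsilon = \frac{\partial \log \mathbb{E}[Q|P]}{\partial \log P} = c\ .$$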

Expressing this in PyMC3 is straightforward:

import pymc3 as pm

with pm.Model() as m:
    # Broad Cauchy priors on the log-intercept and the elasticity
    loga = pm.Cauchy('loga',0,5)
    c = pm.Cauchy('c',0,5)
    # Linear predictor on the log scale; exponentiate to get the Poisson mean
    logμ0 = loga + c * np.log(p0)
    μ0 = pm.Deterministic('μ0',np.exp(logμ0))
    # Observed demand modeled as Poisson counts
    qval = pm.Poisson('q',μ0,observed=q0)

There are a few things about this worth noting...

First, we used very broad Cauchy priors. For complex models, too many degrees of freedom can lead to convergence and interpretability problems, similar to the identifiability problems that sometimes arise in maximum likelihood estimation. For a simple case like this, it's enough to make a note of it in case we run into trouble.

Next, we expect the price elasticity of demand $c$ to be negative, and we may even have a particular range in mind. For example, it could be reasonable to instead go with something like

negc = pm.HalfNormal('-c',3)   # prior on -c, so the elasticity itself is negative

We'd then make a minor change to express things in terms of negc.
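For concreteness, here's a minimal sketch of what that variation could look like (the model name m_neg is just for illustration; it isn't from the original post):

with pm.Model() as m_neg:
    loga = pm.Cauchy('loga',0,5)
    negc = pm.HalfNormal('-c',3)
    # c = -negc, so the elasticity is constrained to be negative
    logμ0 = loga - negc * np.log(p0)
    μ0 = pm.Deterministic('μ0',np.exp(logμ0))
    qval = pm.Poisson('q',μ0,observed=q0)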

The μ0 line may seem wordier than expected. Yes, we could have written it directly as μ0 = np.exp(loga) * p0 ** c.

Making the linear predictor explicit makes later changes easier, for example if we decided to include variable elasticity. Wrapping the value in Deterministic just means the sampler will save values of μ0 for us. This is a convenience at the cost of additional RAM use, so we'd leave it out for a complex model.

Model Fitting and Diagnostics

PyMC3 makes it easy to sample from the posterior:

with m:
    trace = pm.sample()

If all parameters are continuous (as in our case), the default is the No-U-Turn Sampler ("NUTS"). Let's get a summary of the result.

np.round(pm.summary(trace),2)
         mean    sd  mc_error  hpd_2.5  hpd_97.5   n_eff  Rhat
loga     9.48  1.48      0.09     6.83     12.44  239.75   1.0
c       -1.63  0.41      0.02    -2.44     -0.89  240.91   1.0
μ0__0   51.93  6.08      0.30    40.29     63.25  316.31   1.0
μ0__1   40.25  3.23      0.11    33.45     46.13  632.42   1.0
μ0__2   32.39  2.64      0.09    27.22     37.44  642.34   1.0
μ0__3   26.80  2.90      0.13    21.50     32.34  411.90   1.0
μ0__4   22.67  3.20      0.16    16.92     28.77  329.95   1.0

This gives us the posterior mean and standard deviation, along with some other useful information:

  • mc_error estimates simulation error by breaking the trace into batches, computing the mean of each batch, and then the standard deviation of these means (see the sketch after this list).

  • hpd_* gives highest posterior density intervals. The 2.5 and 97.5 labels are a bit misleading. There are lots of 95% credible intervals, depending on the relative weights of the left and right tails. The 95% HPD interval is the narrowest among these 95% intervals.

  • n_eff gives the effective number of samples. We took 2000 samples, but there are some significant autocorrelations. For example, our samples from the posterior of $c$ have about as much information as if we had taken 241 independent samples.

  • Rhat is sometimes called the potential scale reduction factor, and gives us a factor by which the variance might be reduced, if our MCMC chains had been longer. It's computed in terms of the variance between chains vs within each chain. Values near 1 are good.
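
To make the mc_error description concrete, here's a rough batch-means calculation for the posterior samples of c. The batch count below is an arbitrary choice, and PyMC3's own implementation may differ in its details:

def batch_mc_error(samples, n_batches=20):
    # Split the trace into batches, take each batch's mean, and use the spread
    # of those means to estimate the Monte Carlo error of the overall mean.
    batch_means = [b.mean() for b in np.array_split(samples, n_batches)]
    return np.std(batch_means) / np.sqrt(n_batches)

batch_mc_error(trace['c'])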

Back to our problem. Results here look pretty good, with the exception of $n_\text{eff}$.

Let's have a closer look at what's going on. Here's the joint posterior of the parameters:
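
The post shows this as a scatter plot of the posterior samples; one minimal way to produce something similar from the trace is:

plt.figure()
# Each dot is one posterior draw; the tilted cloud shows the correlation
plt.plot(trace['loga'], trace['c'], '.', alpha=0.1)
plt.xlabel('loga')
plt.ylabel('c')
plt.gcf()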

This strong correlation is caused by the data not being centered. All of the $\log P_0$ values are positive, so any increase in the slope leads to a decrease in the intercept, and vice versa. If you don't see this immediately, draw an $(x,y)$ cloud of points with positive $x$, and compare the slope and intercept of some lines passing through the cloud.

The correlation doesn't cause too much trouble for NUTS, but it can be a big problem for some other samplers like Gibbs sampling. This is also a case where fast variational inference approximations tend to drastically underestimate the variance.

Just to note, this doesn't mean Bayesian methods can't handle correlations. Rather, it's that the representation of the model can have an impact. There are lots of ways of dealing with this, including explicit parameterization of the correlation, as well as several reparameterization tricks. Let's look at a simple one.

Reparameterization

The trouble came from the off-center $\log P_0$ values, so let's try taking that into account in the model. A quick adjustment does the job:

with pm.Model() as m2:
    α = pm.Cauchy('α',0,5)
    β = pm.Cauchy('β',0,5)
    # Center the log-prices so the slope and intercept decouple
    logμ0 = α + β * (np.log(p0) - np.log(p0).mean())
    μ0 = pm.Deterministic('μ0',np.exp(logμ0))
    qval = pm.Poisson('q0',μ0,observed=q0)

Note that we've changed the names of the parameters to keep things straight. If we wanted to transform between the two parameterizations, we could equate the two logμ0 expressions and solve the system of equations (constant terms equal, and $\log P_0$ terms equal).
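Working that out, the centered predictor expands to $(\alpha - \beta\,\overline{\log P_0}) + \beta \log P_0$, so matching terms with the first model gives

$$\beta = c \qquad \text{and} \qquad \alpha = \log a + c\,\overline{\log P_0}\ ,$$

where $\overline{\log P_0}$ denotes the mean of the observed log-prices.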

Let's see how it does.

with m2:
    t = pm.sample()
np.round(pm.summary(t),2)
         mean    sd  mc_error  hpd_2.5  hpd_97.5   n_eff  Rhat
α        3.50  0.08      0.00     3.35      3.66  868.15   1.0
β       -1.66  0.43      0.01    -2.56     -0.90  825.68   1.0
μ0__0   52.40  6.52      0.21    39.63     64.51  901.25   1.0
μ0__1   40.40  3.39      0.11    34.52     47.42  914.26   1.0
μ0__2   32.37  2.67      0.09    27.26     37.52  876.91   1.0
μ0__3   26.69  2.92      0.10    21.54     32.49  877.41   1.0
μ0__4   22.51  3.26      0.11    16.52     28.89  864.41   1.0

Much better! Note that it's not only the parameters that were affected; $n_\text{eff}$ has also improved for the $\mu_0$ values, which are directly comparable across models.

And of course, the parameter correlations are much better:

There are a few other useful tools we can explore. Here's the trace plot:

pm.traceplot(t)
plt.gcf()

Sampling is done using an independent (Markov) chain for each hardware core (by default, though this can be changed). The trace plot gives us a quick visual check that the chains are behaving well. Ideally, the distributions across chains should look very similar (left column) and have very little autocorrelation (right column). Note that since μ0\mu_0 is a vector, the plots in the bottom row superimpose its five components.

Sometimes we need a more compact form, especially if there are lots of parameters. In this case the forest plot is helpful.

pm.forestplot(t,varnames=['α','β'], r_hat=True)
plt.gcf()

Posterior Predictive Checks

The trace we found is a sample from the joint posterior distribution of the parameters. Instead of a point estimate, we have a sample of "possible worlds" that represent reality. To perform inference, we query each and aggregate the results. So, for example...

# p is the grid of candidate prices defined with the data above;
# each column of μ is E[Q|P] over that grid for one posterior sample
μ = np.exp(t.α + t.β * (np.log(p).reshape(-1,1) - np.log(p0).mean()))
plt.figure()
plt.plot(p,μ,c='k',alpha=0.01);   # one translucent curve per posterior sample
plt.plot(p0,q0,'o',c='C1');       # observed data
plt.ylim(10,70)
plt.xlabel('Price')
plt.ylabel('Quantity')
plt.gcf()

Each "hair" represents E[QP]\mathbb{E}[Q|P] to one sample from the posterior. On introducing the data I suggested the point P0=40P_0=40 may be an outlier. At first glance, the above plot might seem to confirm this. But remember, this is not a conditional distribution of (QP)(Q|P), but of its expectation.

The following boxplot makes this point more explicitly:

Think of this as asking each possible world, "In your reality, what's the probability of seeing a smaller $Q$ value than this?" The responses are then aggregated by price. For $P_0=40$, about half the results are above 0.1. The point is a bit unusual, but hardly an outlier.
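
Here's a sketch of one way such a boxplot could be produced, assuming the per-world "query" is the Poisson CDF under each sampled μ0 (the post's actual code may differ):

from scipy.stats import poisson

# For each posterior draw of μ0: the probability of a Q value smaller than
# the one actually observed at that price
prob_smaller = poisson.cdf(q0 - 1, t['μ0'])   # shape: (draws, prices)

plt.figure()
plt.boxplot(prob_smaller, labels=p0)
plt.xlabel('Price')
plt.ylabel('P(Q < Q0)')
plt.gcf()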

This shouldn't be considered a recipe for how to do posterior predictive checks. There are countless possibilities, and the right approach depends on the problem at hand.

For example, another very common approach is to use the posterior parameter samples to generate a replicated response $Q_0^\text{rep}$, and then to compare this to the observed response $Q_0$:

# For each price, the fraction of replicated counts falling below the observed count
list(np.mean(q0 > np.random.poisson(t['μ0']), 0))
[0.532, 0.722, 0.116, 0.427, 0.664]

These are sometimes called Bayesian $p$-values. For out-of-sample data, results for a well-calibrated model should follow a uniform distribution.

Optimization

We're almost there! Suppose we have a per-unit cost of $20. Then we can calculate the expected profit for each realization:

k = 20   # per-unit cost
# Profit for each candidate price (rows) under each posterior sample (columns)
π = (p - k).reshape(-1,1) * μ

With a few array indexing tricks, we can then calculate the price that maximizes the overall expected profit (integrating over all parameters):

plt.figure()
# One translucent profit curve per posterior sample
plt.plot(p,π,c='k',alpha=0.01);
plt.plot(p,np.mean(π,1).T,c='C1',lw=2,label=r"$\mathbb{E}[\pi|P]$");
plt.fill_between(p,(np.mean(π,1)-np.std(π,1)).T,(np.mean(π,1)+np.std(π,1)).T,alpha=0.1,color='C1')
plt.plot(p,(np.mean(π,1)+np.std(π,1)).T,c='C1',lw=1,label=r"$\mathbb{E}[\pi|P]\ \pm$1 sd");
plt.plot(p,(np.mean(π,1)-np.std(π,1)).T,c='C1',lw=1);
# Grid price with the highest posterior-mean profit
pmax = p[np.argmax(np.mean(π,1))]
plt.vlines(pmax,300,900,colors='C0',linestyles='dashed',label=r"argmax$_P\ \mathbb{E}[\pi|P]$")
plt.ylim(300,900);
plt.xlabel("Price $P$")
plt.ylabel(r"Profit $\pi$")
plt.legend();
plt.gcf()

The dashed blue line indicates our first result for an optimal price.

So... are we done? NO!!! There are some lingering issues that we'll address next time. For example, is expected profit the right thing to maximize? In particular, our estimate is well above the best worst-case scenario (highest unshaded point below the curves). Should we be concerned about this?

Also, the posterior-mean profit curve is very flat near our solution, suggesting the optimum may be unstable. Can we quantify this, and does it lead us to a different answer?

Part 2 coming soon!
