Markov chain Monte Carlo and tall data

9 minute read


This note is an attempt to convey the main ideas and notions of the talk entitled “On Markov chain Monte Carlo and tall data”, given by Rémi Bardenet (CNRS, Lille, France) during the Stochastic Algorithms Day 2017 (Journée algorithmes stochastiques 2017), held at Université Paris Dauphine (Paris, France) on December 1st, 2017.

Metropolis-Hastings algorithm

The Metropolis-Hastings (MH) algorithm builds a Markov chain \((X_k)_{k\geq 1}\) with a given limiting distribution \(\pi\). It requires the specification of a proposal distribution \(q\) that controls how the chain explores the state space.

Basically, at a state \(X_k\) of the chain, two steps are required to get to the next state:

  1. Draw a candidate \(X^*\) for the new state \(X_{k+1}\) from the proposal distribution \(q(X_k \rightarrow \cdot)\)

  2. Accept the update \(X_{k+1} \leftarrow X^*\) with probability \(\min\big(1,\alpha(X^*,X_k)\big)\), where \(\alpha(X^*,X_k)=\frac{\pi(X^*)}{\pi(X_k)}\) for a symmetric proposal (otherwise, set \(X_{k+1}\leftarrow X_k\))

Notice in particular that \(\pi\) only enters the acceptance probability \(\alpha\) through a ratio. Knowing it up to a multiplicative constant is therefore sufficient.

The samples from such Markov chains are used to compute complex integrals of the form \(\int h(X)\pi(X)\, dX=\mathbb{E}_\pi[h(X)]\). Indeed, if \((x_1,...,x_M)\) are \(M\) samples from the chain, then, under suitable ergodicity conditions, a central limit theorem ensures that:

\[\sqrt{M}\left[\frac{1}{M}\sum\limits_{m=1}^M h(x_m) - \int h(X)\pi(X)\, dX \right] \xrightarrow{d} \mathcal{N}\big(0,\sigma^2_{\text{lim}}(h)\big)\]
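
To make this concrete, here is a minimal sketch of a random-walk Metropolis sampler in Python (the target, step size, and sample counts are illustrative choices, not prescriptions from the talk). Note that the target is only ever evaluated through log-density differences, i.e. up to a constant:

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, rng=None):
    """Random-walk Metropolis: log_target returns log pi up to an additive constant."""
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_samples)
    x, logp = x0, log_target(x0)
    for k in range(n_samples):
        x_star = x + step * rng.standard_normal()   # symmetric Gaussian proposal
        logp_star = log_target(x_star)
        # accept with probability min(1, pi(x*) / pi(x)): only the ratio matters
        if np.log(rng.uniform()) < logp_star - logp:
            x, logp = x_star, logp_star
        chain[k] = x
    return chain

# Example: estimate E[h(X)] with h(x) = x^2 under a standard normal target
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=50_000)
print((chain[5_000:] ** 2).mean())  # discard burn-in; should be close to 1
```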

Bayesian statistics

Statisticians recommend actions based on a joint model \(\pi(\theta,x_1,...,x_N)\) of:

  • \(x_1,...,x_N\) : some observable data

  • \(\theta\) : the (underlying) state of the world

If a loss function \(L(\theta,a)\) quantifies the cost of taking action \(a\) when the state of the world is \(\theta\), then minimizing the expected loss provides a criterion for choosing the "best" action \(a^*\) to recommend. Since the actual state of the world is unknown, the minimization problem becomes:

\[a^*=\mathop{\text{argmin}}\limits_a \int L(\theta,a)\pi(\theta \vert x_1,...,x_N)d\theta\]

where \(\pi(\theta \vert x_1,...,x_N)\) is the posterior distribution of the state of the world \(\theta\) given the observations \(x_1,...,x_N\). One way to compute this integral is to use (independent) samples from \(\pi(\theta \vert x_1,...,x_N)\).

A common situation is the one where:

  • the beliefs on \(\theta\) prior to an experiment (leading to the observations) can be summarized into a distribution \(\pi(\theta)\) (called prior distribution)

  • the measurements/observations \(x_1,...,x_N\) are assumed to be i.i.d. from the known distribution \(\pi(.\vert \theta)\)

Then, Bayes’ identity gives \(\pi(\theta \vert x_1,...,x_N) \propto \pi(\theta)\prod\limits_{i=1}^N\pi(x_i\vert \theta)\)

Therefore, the Metropolis-Hastings algorithm is particularly suited for drawing samples from \(\pi(\theta \vert x_1,...,x_N)\), as it only requires knowing the target distribution up to a multiplicative constant, which is the case here. Indeed, the associated acceptance ratio is given by:

\[\alpha(\theta^*,\theta_k)=\frac{\pi(\theta^*)\prod\limits_{i=1}^N\pi(x_i\vert \theta^*)}{ \pi(\theta_k)\prod\limits_{i=1}^N\pi(x_i\vert \theta_k)}\]

The computation of this probability, which has to be done at each iteration of the algorithm, may be very costly if the number of observations \(N\) is huge, and even more so if these observations are scattered across multiple servers.
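
To make the bottleneck concrete, here is a minimal sketch of this computation in the log domain (the function names are illustrative assumptions). The sum over likelihood terms touches all \(N\) observations at every MH iteration:

```python
import numpy as np

def log_accept_ratio(log_prior, log_lik, theta_star, theta_k, data):
    """log alpha(theta*, theta_k); log_lik(data, theta) is assumed to return
    the N per-observation log-likelihoods. This is an O(N) pass per iteration."""
    return (log_prior(theta_star) - log_prior(theta_k)
            + np.sum(log_lik(data, theta_star) - log_lik(data, theta_k)))
```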

A workaround is to compute a proxy of this probability that only requires a (small) subset of all the observations. Basically, the goal is to design a Bayesian equivalent of the stochastic gradient descent algorithm.

Divide-and-conquer approaches

In these approaches, the dataset is divided into batches. An MCMC algorithm is then run separately on each batch. Finally, the results are combined using one of the following approaches:

Multiplication of smooth approximations of the batch posterior distributions

Presentation

The data \(\boldsymbol x\) is separated into \(S\) batches \(\boldsymbol x =(\boldsymbol x_1,...,\boldsymbol x_S)\). Denote by \(\theta_1^{(s)},...,\theta_M^{(s)}\) the samples from the posterior \(\pi(\theta\vert\boldsymbol x_s)\) obtained by running the MCMC algorithm on batch \(\boldsymbol x_s\). If each \(\pi(\theta\vert\boldsymbol x_s)\) is Gaussian (asymptotically true as the batch size grows, by the Bernstein–von Mises theorem), say \(\pi(\theta\vert\boldsymbol x_s)=\mathcal{N}(\mu_s,\Sigma_s)\), then

\[\prod\limits_{s}\pi(\theta\vert\boldsymbol x_s)\sim\mathcal{N}\left(V \sum\limits_{s}\Sigma_s^{-1}\mu_s,\; V \right), \quad V=\left(\sum\limits_{s}\Sigma_s^{-1}\right)^{-1}\]

On the other hand, if each batch posterior is defined with a fractionated prior, \(\pi(\theta\vert\boldsymbol x_s)\propto \pi(\boldsymbol x_s\vert \theta)\pi(\theta)^{1/S}\), we get

\[\prod\limits_{s}\pi(\theta\vert\boldsymbol x_s) \propto \pi(\theta)\prod\limits_{s}\pi(\boldsymbol x_s\vert\theta) = \pi(\theta)\pi(\boldsymbol x\vert\theta) \propto \pi(\theta \vert \boldsymbol x)\]

Therefore, we have

\[\pi(\theta \vert \boldsymbol x) \propto \mathcal{N}\left(V\sum\limits_{s}\Sigma_s^{-1}\mu_s,\; V\right)\]

where \((\mu_s, \Sigma_s)\) can be approximated by the sample mean and covariance from the MCMC run, namely:

\[\widehat{\mu}_s=\frac{1}{M}\sum\limits_{m=1}^M \theta_m^{(s)},\] \[\widehat{\Sigma}_s=\frac{1}{M}\sum\limits_{m=1}^M (\theta_m^{(s)}-\widehat{\mu}_s)(\theta_m^{(s)}-\widehat{\mu}_s)^T, \quad \widehat{V}=\left(\sum\limits_s \widehat{\Sigma}_s^{-1}\right)^{-1}\]
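
A minimal sketch of this combination step, assuming each chains[s] is an \(M\times d\) array of MCMC samples from batch \(s\) (names are illustrative):

```python
import numpy as np

def combine_gaussian_subposteriors(chains):
    """Combine per-batch MCMC samples via their Gaussian approximations.
    chains[s] is an M x d array of samples from pi(theta | x_s)."""
    mus = [c.mean(axis=0) for c in chains]
    precisions = [np.linalg.inv(np.atleast_2d(np.cov(c, rowvar=False)))
                  for c in chains]
    V = np.linalg.inv(sum(precisions))                    # combined covariance
    mu = V @ sum(P @ m for P, m in zip(precisions, mus))  # combined mean
    return mu, V
```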

Drawback

The mean squared error of the resulting estimators scales exponentially with the number of batches.

Target a more tractable result than the full posterior

Presentation

Given the \(S\) batch posteriors \(\pi(\theta\vert\boldsymbol x_1),...,\pi(\theta\vert\boldsymbol x_S)\) obtained from the MCMC runs on batches \(\boldsymbol x_1,...,\boldsymbol x_S\), instead of trying to estimate the full posterior, we can compute the probability-measure equivalent of their median or barycentre.

Drawback

The statistical meaning of the result is unclear.

Subsampling approaches

The idea is to replace the acceptance probability \(\alpha\) of the Metropolis-Hastings algorithm by an estimator computed on a subset of the whole dataset. To do so, notice that:

\[\log\alpha(\theta^*,\theta)=\log\frac{\pi(\theta^*)}{\pi(\theta)}+\sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\]

In particular, if \(u\sim\mathcal{U}((0,1])\) we have:

\[\begin{aligned} u\le \alpha(\theta^*,\theta) &\Leftrightarrow \log u \le \log \alpha \Leftrightarrow \log u - \log\frac{\pi(\theta^*)}{\pi(\theta)} \le \sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)} \\ &\Leftrightarrow \frac{1}{N}\log\left( u \frac{\pi(\theta)}{\pi(\theta^*)}\right) \le \frac{1}{N} \sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)} \end{aligned}\]

Introducing:

\[\psi(u,\theta,\theta^*):=\frac{1}{N}\log\left( u \frac{\pi(\theta)}{\pi(\theta^*)}\right)\]

and

\[\Lambda_N(\theta,\theta^*):=\frac{1}{N}\sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}=\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}\]

we get,

\[\mathbb{P}\big(\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*)\big)=\mathbb{P}(u\leq \alpha(\theta^*,\theta))=\min(\alpha(\theta^*,\theta),1)\]

Therefore, the MH acceptance decision reduces to checking whether

\[\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*)\]

Computing \(\Lambda_N(\theta,\theta^*)\) constitutes the bottleneck of the method. It can be replaced by an unbiased estimator computed from only a small subset \(\boldsymbol x_s\) of \(S < N\) points drawn uniformly from the whole dataset:

\[\Lambda_s(\theta,\theta^*)=\frac{1}{S}\sum\limits_{x\in\boldsymbol x_s}\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}\]

In particular, this problem is equivalent to the estimation of the mean \(\Lambda_N(\theta,\theta^*)\) of the population

\[\left\lbrace\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}, i\in [\![1,N]\!] \right\rbrace\]

by the mean \(\Lambda_s(\theta,\theta^*)\) of one of its samples. Two methods can be used to check whether \(\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*)\) based on the values of \(\Lambda_s(\theta,\theta^*)\) and \(\psi(u,\theta,\theta^*)\), with a tolerated error \(\epsilon\) (fixed for all iterations):

T-tests on the sample mean \(\Lambda_s(\theta,\theta^*)\)

The hypothesis \(\Lambda_N(\theta,\theta^*)>\psi(u,\theta,\theta^*)\) can be tested using the statistic:

\[t_s=\frac{\Lambda_s(\theta,\theta^*)-\psi(u,\theta,\theta^*)}{\sqrt{(N-S)/(N-1)}\sqrt{\sigma_s^2(\theta,\theta^*)/S}}\]

where

\[\sigma_s^2(\theta,\theta^*) = \frac{1}{S-1}\sum\limits_{x\in\boldsymbol x_s} \left(\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}-\Lambda_s(\theta,\theta^*)\right)^2,\]

is the sample variance. The statistic \(t_s\) is assumed to follow a Student t-distribution with \(S-1\) degrees of freedom (denoted \(\text{Student}(S-1)\)) when \(S\) is large enough.

The test proceeds in two steps:

  • A first test checks whether the population mean \(\Lambda_N(\theta,\theta^*)\) is significantly different from the threshold \(\psi(u,\theta,\theta^*)\). This is the case if the statistic \(t_s\) verifies:

    \[|t_s| > \phi_{S-1}^{-1}(1-\epsilon), \quad \phi_{S-1}: \text{ cdf of Student}(S-1)\]

    which is the condition under which the null hypothesis \(\Lambda_N(\theta,\theta^*) = \psi(u,\theta,\theta^*)\) is rejected (in a Student t-test with significance level \(\epsilon\)). Therefore, we need to have

    \[\delta_s:=1-\phi_{S-1}(|t_s|)<\epsilon\]

    Samples are added to \(\boldsymbol x_s\) as long as this condition is not satisfied.

  • Then, decide \(\Lambda_N(\theta,\theta^*)>\psi(u,\theta,\theta^*)\) if \(\Lambda_s(\theta,\theta^*)>\psi(u,\theta,\theta^*)\), and \(\Lambda_N(\theta,\theta^*)<\psi(u,\theta,\theta^*)\) otherwise. A sketch of this sequential procedure is given below.
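
Here is a minimal sketch of this sequential test, in the spirit of Korattikara et al. (2014). It assumes log_lik is vectorized (returns per-observation log-likelihoods for an array of data points); batch and eps are illustrative parameters:

```python
import numpy as np
from scipy.stats import t as student_t

def approx_mh_test(log_lik, theta_k, theta_star, data, psi, eps,
                   batch=100, rng=None):
    """Grow the subsample until the t-test is confident that Lambda_N differs
    from psi, then decide the comparison from the subsample mean Lambda_s."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(data)
    perm = rng.permutation(N)          # sampling without replacement, in batches
    diffs, S = np.empty(0), 0
    while S < N:
        take = perm[S:S + batch]
        new = log_lik(data[take], theta_star) - log_lik(data[take], theta_k)
        diffs = np.concatenate([diffs, new])
        S = diffs.size
        lam_s, sigma_s = diffs.mean(), diffs.std(ddof=1)
        if S == N or sigma_s == 0.0:   # whole dataset used: decision is exact
            break
        # standard error with the finite-population correction
        se = sigma_s / np.sqrt(S) * np.sqrt((N - S) / (N - 1))
        t_s = (lam_s - psi) / se
        delta = 1.0 - student_t.cdf(abs(t_s), df=S - 1)
        if delta < eps:                # confident enough: stop adding samples
            break
    return diffs.mean() > psi          # decide Lambda_N > psi from Lambda_s
```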

Concentration inequalities

  • For \(\epsilon >0\), the idea is to "locate" \(\Lambda_N(\theta,\theta^*)\) around \(\Lambda_s(\theta,\theta^*)\) with probability \(1-\epsilon\), i.e. compute \((S,c_s(\epsilon))\) such that:
\[\mathbb{P}\big(\vert\Lambda_s(\theta,\theta^*)-\Lambda_N(\theta,\theta^*)\vert\le c_s(\epsilon)\big)\ge 1-\epsilon\]

Then, with probability at least \(1-\epsilon\), we have the following confidence interval:

\[\Lambda_s(\theta,\theta^*)-c_s(\epsilon)\le\Lambda_N(\theta,\theta^*)\le \Lambda_s(\theta,\theta^*)+c_s(\epsilon)\]

which gives a way to check whether \(\Lambda_N(\theta,\theta^*)>\psi(u,\theta,\theta^*)\).

In fact, the sample size \(S\) is chosen so that the size \(c_s(\epsilon)\) of the confidence interval is smaller than the difference \(|\Lambda_s(\theta,\theta^*)-\psi(u,\theta,\theta^*)|\). In the case of sampling without replacement, \(c_s(\epsilon)\) can be shown to depend (linearly) on the sample standard deviation and on the largest magnitude among the sampled values.
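
As an illustration, here is a sketch of such a concentration-based check using the empirical Bernstein bound (an i.i.d.-sampling bound; Bardenet et al. use a Serfling-type variant adapted to sampling without replacement). The argument rng_width is an assumed upper bound on the range of the population values:

```python
import numpy as np

def empirical_bernstein_bound(diffs, eps, rng_width):
    """c_s(eps) such that |Lambda_s - Lambda_N| <= c_s(eps) holds with
    probability >= 1 - eps, for a subsample of size S = len(diffs)."""
    S = len(diffs)
    log_term = np.log(3.0 / eps)
    return (diffs.std() * np.sqrt(2.0 * log_term / S)
            + 3.0 * rng_width * log_term / S)

def confident_decision(lam_s, psi, c):
    """+1/-1 if the confidence interval excludes psi, else 0 (need more samples)."""
    if lam_s - c > psi:
        return 1
    if lam_s + c < psi:
        return -1
    return 0
```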

A way to decrease this term without adding an excessive number of samples is to use proxy functions \(\lbrace\rho_i(\theta,\theta^*), i\in [\![1,N]\!]\rbrace\) verifying:

  • for all \(i\in [\![1,N]\!]\), \(\rho_i(\theta,\theta^*)\approx \log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\)

  • \(\sum\limits_{i=1}^N \rho_i(\theta,\theta^*)\) is easily computable

  • \(\left\lbrace \left\vert\rho_i(\theta,\theta^*)- \log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\right\vert, i\in [\![1,N]\!]\right\rbrace\) can easily be bounded

In that case, the acceptance decision is equivalent to checking whether

\[\frac{1}{N}\sum\limits_{i=1}^N \left[\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}-\rho_i(\theta,\theta^*)\right] \ge \tilde\psi(u,\theta,\theta^*):=\psi(u,\theta,\theta^*)-\frac{1}{N}\sum\limits_{i=1}^N \rho_i(\theta,\theta^*)\]

And therefore the same subsampling reasoning can be applied to the population

\[\left\lbrace\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}-\rho_i(\theta,\theta^*), i\in [\![1,N]\!] \right\rbrace\]

which by construction gives rise to smaller concentration sizes \(c_s(\epsilon)\). Examples of such proxy functions are Taylor expansions of the log-likelihood function \((x,\theta) \mapsto \log \pi(x \vert \theta)\).
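
A sketch of the resulting control-variate estimator of \(\Lambda_N\) (all names are illustrative; ell and rho stand for the per-datum log-likelihood ratio and its proxy):

```python
import numpy as np

def lambda_hat_with_proxy(ell, rho, subsample_idx, rho_sum_full, N):
    """Control-variate estimate of Lambda_N based on the decomposition
        Lambda_N = (1/N) sum_i rho_i  +  (1/N) sum_i (ell_i - rho_i),
    where the first sum (rho_sum_full) is assumed cheap to compute exactly and
    the second is estimated by the mean over a uniform subsample.
    ell(i), rho(i): per-datum log-likelihood ratio and its proxy (e.g. a
    Taylor expansion of log pi(x_i | theta) around a reference parameter)."""
    residuals = np.array([ell(i) - rho(i) for i in subsample_idx])
    return rho_sum_full / N + residuals.mean()
```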

Number of calls

In the original algorithm, at each iteration \(k+1\), given the current value of the parameters \(\theta_k\) and a proposal \(\theta^*\), we need to compute

\[\Lambda_N(\theta_k,\theta^*)=\underbrace{\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\pi(x\vert \theta^*)}_{\text{not computed yet}} -\underbrace{\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\pi(x\vert \theta_k)}_{\text{computed at iteration }k}\]

Therefore \(N\) calls to the log-likelihood function \((x,\theta) \mapsto \log \pi(x \vert \theta)\) are needed at each iteration.
With these subsampling approaches, the sample sizes (and the samples themselves) change from one iteration to the next. So the quantity to compute at iteration \(k+1\) is

\[\Lambda_s(\theta_k,\theta^*)=\underbrace{\frac{1}{\vert \boldsymbol x_s\vert}\sum\limits_{x\in\boldsymbol x_s}\log\pi(x\vert \theta^*)}_{\text{not computed yet}} -\underbrace{\frac{1}{\vert \boldsymbol x_s\vert}\sum\limits_{x\in\boldsymbol x_s}\log\pi(x\vert \theta_k)}_{\text{may differ from iteration }k}\]

Therefore, in general, \(2S\) calls to the log-likelihood function \((x,\theta) \mapsto \log \pi(x \vert \theta)\) are needed at each iteration, where \(S\) varies between iterations. Hence, subsampling methods are interesting as long as, for most (if not all) iterations, we have \(2S \ll N\).

References

  • Bardenet, R. (2017, December). On Markov chain Monte Carlo for tall data. Paper presented at "Journée algorithmes stochastiques", Université Paris Dauphine, Paris, France.

  • Bardenet, R., Doucet, A., Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827.

  • Korattikara, A., Chen, Y., Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 181-189).

  • Minsker, S., Srivastava, S., Lin, L., Dunson, D. (2014, January). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning (pp. 1656-1664).

  • Neiswanger, W., Wang, C., Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.