The Metropolis-Hastings algorithm builds a Markov Chain \((X_k)_{k\geq 1}\) with a given limiting distribution \(\pi(X)\). It requires the specification of a transition distribution that controls the exploration process of the Markov Chain.
Basically, at a state \(X_k\) of the chain, two steps are required to get to the next state:
Draw a candidate \(X^*\) for the new state \(X_{k+1}\) from the transition distribution \(\pi(X_k \rightarrow .)\)
Accept the candidate, \(X_{k+1} \leftarrow X^*\), with probability \(\min(1,\alpha(X^*,X_k))\), where \(\alpha(X^*,X_k)=\frac{\pi(X^*)}{\pi(X_k)}\) (otherwise, \(X_{k+1}\leftarrow X_k\)); this expression of \(\alpha\) assumes a symmetric transition distribution
Notice in particular that \(\pi(X)\) is only involved in the computation of the probability \(\alpha\). Knowing it up to a multiplicative constant is therefore sufficient.
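These two steps can be sketched as follows (a minimal Python illustration; the Gaussian toy target and the random-walk proposal are assumptions of this sketch, and the proposal is symmetric so that \(\alpha\) reduces to the ratio above):

```python
import numpy as np

def metropolis_hastings(log_pi, propose, x0, n_iter, rng):
    """Metropolis sketch: log_pi is the log-target, known only up to an
    additive constant; propose draws a candidate from a symmetric
    transition distribution pi(x -> .)."""
    x = x0
    chain = [x]
    for _ in range(n_iter):
        x_star = propose(x, rng)                 # step 1: draw a candidate X*
        log_alpha = log_pi(x_star) - log_pi(x)   # log pi(X*)/pi(X_k)
        if np.log(rng.uniform()) < log_alpha:    # step 2: accept w.p. min(1, alpha)
            x = x_star
        chain.append(x)
    return np.array(chain)

# Toy target: standard normal, specified only up to its normalizing constant.
rng = np.random.default_rng(0)
samples = metropolis_hastings(
    log_pi=lambda x: -0.5 * x ** 2,
    propose=lambda x, rng: x + rng.normal(scale=1.0),
    x0=0.0, n_iter=5000, rng=rng,
)
```

Note that only differences of \(\log\pi\) appear, which is what makes the normalizing constant irrelevant.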
The samples from such Markov Chains are used to compute complex integrals of the form \(\int h(X)\pi(X) dX=\mathbb{E}[h(X)]\). Indeed, if \((x_1,...,x_M)\) are \(M\) samples from the chain, a central limit theorem ensures that:
\[\sqrt{M}\left[\frac{1}{M}\sum\limits_{m=1}^M h(x_m) - \int h(X)\pi(X) dX \right] \rightarrow \mathcal{N}(0,\sigma^2_{\text{lim}}(h)) \label{sumInt}\]

Statisticians recommend actions based on a model \(\pi(\theta,x_1,...,x_N)\) on:
\(x_1,...,x_N\) : some observable data
\(\theta\) : the (underlying) state of the world
If a loss function \(L(\theta,a)\) measures the cost of taking action \(a\) when the state of the world is \(\theta\), then its minimization provides a criterion for choosing the "best" action \(a^*\) to recommend. Given that the actual state of the world is unknown, the minimization problem becomes:
\[a^*=\mathop{\text{argmin}}\limits_a \int L(\theta,a)\pi(\theta \vert x_1,...,x_N)d\theta\]where \(\pi(\theta \vert x_1,...,x_N)\) is the probability that the world is in state \(\theta\) given the observations \(x_1,...,x_N\). One way to compute this integral is to use (independent) samples from \(\pi(\theta \vert x_1,...,x_N)\).
A common situation is the one where:
the beliefs on \(\theta\) prior to an experiment (leading to the observations) can be summarized into a distribution \(\pi(\theta)\) (called prior distribution)
the measurements/observations \(x_1,...,x_N\) are assumed to be i.i.d. from the known distribution \(\pi(.\vert \theta)\) Then, Bayes’ identity gives \(\pi(\theta \vert x_1,...,x_N) \propto \pi(\theta)\prod\limits_{i=1}^N\pi(x_i\vert \theta)\)
Therefore, the Metropolis-Hastings algorithm is particularly suited for drawing samples from \(\pi(\theta \vert x_1,...,x_N)\), as it only requires knowing the target distribution up to a multiplicative constant, which is the case here. Indeed, the associated acceptance probability is given by:
\[\alpha(\theta^*,\theta_k)=\frac{\pi(\theta^*)\prod\limits_{i=1}^N\pi(x_i\vert \theta^*)}{ \pi(\theta_k)\prod\limits_{i=1}^N\pi(x_i\vert \theta_k)}\]The computation of this probability, which has to be done at each iteration of the algorithm, may be very costly if the number of observations \(N\) is huge, and even more so if these observations are scattered across multiple servers.
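To make the \(\mathcal{O}(N)\) per-iteration cost concrete, here is a sketch of the log-acceptance-ratio computation (the Gaussian model and the function names are illustrative assumptions of this sketch):

```python
import numpy as np

def log_acceptance(theta_star, theta_k, data, log_prior, log_lik):
    """Log of the Metropolis-Hastings acceptance ratio for the posterior
    target: a prior term plus a sum of N log-likelihood ratios.
    Each call costs O(N) likelihood evaluations."""
    return (log_prior(theta_star) - log_prior(theta_k)
            + np.sum(log_lik(data, theta_star)) - np.sum(log_lik(data, theta_k)))

# Toy Gaussian model x_i ~ N(theta, 1) with a standard normal prior.
data = np.random.default_rng(0).normal(loc=1.0, size=10_000)
la = log_acceptance(
    theta_star=1.0, theta_k=0.0, data=data,
    log_prior=lambda t: -0.5 * t ** 2,
    log_lik=lambda x, t: -0.5 * (x - t) ** 2,
)
```

The two sums over the dataset are exactly what the subsampling approaches below try to avoid.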
A workaround is to compute a proxy of this probability that only requires a (small) subset of all the observations. Basically, the goal is to design a Bayesian equivalent of the stochastic gradient descent algorithm.
In these approaches, the dataset is divided into batches. Then, the MCMC algorithms are run separately on each batch. Finally the results are combined using one of the following approaches:
The data \(\boldsymbol x\) is separated into \(S\) batches \(\boldsymbol x =(\boldsymbol x_1,...,\boldsymbol x_S)\). Denote \(\theta_1^{(s)},...,\theta_M^{(s)}\) the samples from the posterior \(\pi(\theta\vert\boldsymbol x_s)\) obtained after running the MCMC algorithm on batch \(\boldsymbol x_s\). If \(\pi(\theta\vert\boldsymbol x_s)\) is Gaussian (asymptotically true as the number of observations grows), and if we write \(\pi(\theta\vert\boldsymbol x_s)\sim\mathcal{N}(\mu_s,\Sigma_s)\), then
\[\prod\limits_{s}\pi(\theta\vert\boldsymbol x_s)\sim\mathcal{N}\left(V \sum\limits_{s}\Sigma_s^{-1}\mu_s,\; V \right), \quad V=\left(\sum\limits_{s}\Sigma_s^{-1}\right)^{-1}\]On the other hand, by sampling from \(\pi(\theta\vert\boldsymbol x_s)\propto \pi(\boldsymbol x_s\vert \theta)\pi(\theta)^{1/S}\), we get
\[\prod\limits_{s}\pi(\theta\vert\boldsymbol x_s) \propto \pi(\theta)\prod\limits_{s}\pi(\boldsymbol x_s\vert\theta) = \pi(\theta)\pi(\boldsymbol x\vert\theta) \propto \pi(\theta \vert \boldsymbol x)\]Therefore, we have
\[\pi(\theta \vert \boldsymbol x) \propto \mathcal{N}\left(V\sum\limits_{s}\Sigma_s^{-1}\mu_s,\; V\right)\]where \((\mu_s, \Sigma_s)\) can be approximated by the sample mean and covariance from the MCMC run, namely:
\[\widehat{\mu}_s=\frac{1}{M}\sum\limits_{m=1}^M \theta_m^{(s)},\] \[\widehat{\Sigma}_s=\frac{1}{M}\sum\limits_{m=1}^M (\theta_m^{(s)}-\widehat{\mu}_s)(\theta_m^{(s)}-\widehat{\mu}_s)^T, \quad \widehat{V}=\left(\sum\limits_s \widehat{\Sigma}_s^{-1}\right)^{-1}\]The mean squared error of the resulting estimators scales exponentially with the number of batches.
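A sketch of this combination rule, assuming each batch has already produced its MCMC samples (here the batch samples are drawn directly from Gaussians for illustration):

```python
import numpy as np

def combine_batch_posteriors(batch_samples):
    """Gaussian-approximation combination of per-batch MCMC samples.

    batch_samples: list of (M, d) arrays theta^(s)_1..M from pi(theta | x_s).
    Returns the mean and covariance of the approximate full posterior."""
    precisions, weighted_means = [], []
    for s in batch_samples:
        mu_s = s.mean(axis=0)
        Sigma_s = np.atleast_2d(np.cov(s, rowvar=False))  # sample covariance
        P_s = np.linalg.inv(Sigma_s)                      # Sigma_s^{-1}
        precisions.append(P_s)
        weighted_means.append(P_s @ np.atleast_1d(mu_s))
    V = np.linalg.inv(sum(precisions))                    # (sum_s Sigma_s^{-1})^{-1}
    mu = V @ sum(weighted_means)                          # V sum_s Sigma_s^{-1} mu_s
    return mu, V

# Two toy 1-d "batch posteriors" centred at 0 and 2, each with unit variance:
rng = np.random.default_rng(0)
mu, V = combine_batch_posteriors([rng.normal(0.0, 1.0, (5000, 1)),
                                  rng.normal(2.0, 1.0, (5000, 1))])
```

With equal unit variances the combined mean is the midpoint and the combined variance is halved, as the precision-weighting formula predicts.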
Given the \(S\) posteriors \(\pi(\theta\vert\boldsymbol x_1),...,\pi(\theta\vert\boldsymbol x_S)\) obtained from the MCMC runs on batches \(\boldsymbol x_1,...,\boldsymbol x_S\), rather than trying to estimate the full posterior, we can compute the equivalent, for probability measures, of the median or the barycentre of these posteriors.
The statistical meaning of the result is unclear.
The idea is to replace the acceptance probability \(\alpha\) of the Metropolis-Hastings algorithm by an estimator computed on a subset of the whole dataset. To do so, notice that:
\[\log\alpha(\theta^*,\theta)=\log\frac{\pi(\theta^*)}{\pi(\theta)}+\sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\]In particular, if \(u\sim\mathcal{U}(]0,1])\) we have:
\[\begin{aligned} u\le \alpha(\theta^*,\theta) &\Leftrightarrow \log u \le \log \alpha \Leftrightarrow \log u - \log\frac{\pi(\theta^*)}{\pi(\theta)} \le \sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)} \\ &\Leftrightarrow \frac{1}{N}\log\left( u \frac{\pi(\theta)}{\pi(\theta^*)}\right) \le \frac{1}{N} \sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)} \end{aligned}\]Introducing:
\[\psi(u,\theta,\theta^*):=\frac{1}{N}\log\left( u \frac{\pi(\theta)}{\pi(\theta^*)}\right)\]and
\[\Lambda_N(\theta,\theta^*):=\frac{1}{N}\sum\limits_{i=1}^N\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}=\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}\]we get,
\[\mathbb{P}\big(\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*)\big)=\mathbb{P}(u\leq \alpha(\theta^*,\theta))=\min(\alpha(\theta^*,\theta),1)\]Therefore, the MH algorithm is based on the verification of whether
\[\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*) \label{accDec}\]Computing \(\Lambda_N(\theta,\theta^*)\) constitutes the bottleneck of the method. It can be replaced by an unbiased estimator computed from only a small subset \(\boldsymbol x_s\) of \(S < N\) (uniform) samples of the whole dataset:
\[\Lambda_s(\theta,\theta^*)=\frac{1}{S}\sum\limits_{x\in\boldsymbol x_s}\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}\]In particular, this problem is equivalent to the estimation of the mean \(\Lambda_N(\theta,\theta^*)\) of the population
\[\left\lbrace\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}, i\in [\![1,N]\!] \right\rbrace\]by the mean \(\Lambda_s(\theta,\theta^*)\) of one of its samples. Two methods can be used to check whether \(\Lambda_N(\theta,\theta^*) \ge \psi(u,\theta,\theta^*)\) based on the values of \(\Lambda_s(\theta,\theta^*)\) and \(\psi(u,\theta,\theta^*)\), with a tolerated error \(\epsilon\) (fixed for all iterations):
The hypothesis \(\Lambda_s(\theta,\theta^*)>\psi(u,\theta,\theta^*)\) can be tested using the statistic:
\[t_s=\frac{\Lambda_s(\theta,\theta^*)-\psi(u,\theta,\theta^*)}{\sqrt{(N-S)/(N-1)}\sqrt{\sigma_s^2(\theta,\theta^*)/S}}\]where
\[\sigma_s^2(\theta,\theta^*) = \frac{1}{S-1}\sum\limits_{x\in\boldsymbol x_s} \left(\log\frac{\pi(x\vert \theta^*)}{\pi(x\vert \theta)}-\Lambda_s(\theta,\theta^*)\right)^2.\]When \(S\) is large enough, \(t_s\) is assumed to follow a Student t-distribution with \(S-1\) degrees of freedom (denoted \(\text{Student}(S-1)\)).
The tests proceed in two steps :
A first test is performed to check whether the population mean \(\Lambda_N(\theta,\theta^*)\) is significantly different from the threshold \(\psi(u,\theta,\theta^*)\). This is the case if the statistic \(t_s\) satisfies:
\[|t_s| > \phi_{S-1}^{-1}(1-\epsilon), \quad \phi_{S-1}: \text{ cdf of Student}(S-1)\]which is the condition for which the null hypothesis \(\Lambda_N(\theta,\theta^*) = \psi(u,\theta,\theta^*)\) is rejected (in a Student t-test with significance \(\epsilon\)). Therefore, we need to have
\[\delta_s:=1-\phi_{S-1}(|t_s|)<\epsilon\]Samples are added to \(\boldsymbol x_s\) as long as this condition is not satisfied.
Decide \(\Lambda_N(\theta,\theta^*)>\psi(u,\theta,\theta^*)\) if \(\Lambda_s(\theta,\theta^*)>\psi(u,\theta,\theta^*)\), and \(\Lambda_N(\theta,\theta^*)<\psi(u,\theta,\theta^*)\) otherwise.
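A sketch of this sequential test (for simplicity, the \(\text{Student}(S-1)\) cdf is replaced by its normal approximation, valid for the large \(S\) used here; precomputing the whole population of log-likelihood ratios is a toy assumption, in practice they are evaluated lazily):

```python
import numpy as np
from statistics import NormalDist

def sequential_test(diffs, psi, eps, batch=100, rng=None):
    """Sequential-test sketch of the acceptance decision.

    diffs: the N population values log pi(x_i|theta*)/pi(x_i|theta).
    Returns the decision Lambda_N > psi together with the sample size used."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(diffs)           # uniform sampling w/o replacement
    N, S = len(diffs), 0
    while True:
        S = min(S + batch, N)
        sub = perm[:S]
        lam, sig = sub.mean(), sub.std(ddof=1)
        if S == N:                          # whole dataset used: exact decision
            return lam > psi, S
        # t-statistic with finite-population correction (N-S)/(N-1)
        t = (lam - psi) / (np.sqrt((N - S) / (N - 1)) * sig / np.sqrt(S))
        delta = 1 - NormalDist().cdf(abs(t))  # normal approx. of Student(S-1)
        if delta < eps:                     # confident enough: decide now
            return lam > psi, S

rng = np.random.default_rng(0)
diffs = rng.normal(0.5, 1.0, size=100_000)
decision, S_used = sequential_test(diffs, psi=0.0, eps=0.05, rng=rng)
```

When the population mean is well separated from the threshold, the test typically decides after a tiny fraction of the \(N\) values.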
Then, with probability \(1-\epsilon\), we have the following confidence interval :
\[\Lambda_s(\theta,\theta^*)-c_s(\epsilon)\le\Lambda_N(\theta,\theta^*)\le \Lambda_s(\theta,\theta^*)+c_s(\epsilon)\]which gives a way to check whether \(\Lambda_N(\theta,\theta^*)>\psi(u,\theta,\theta^*)\).
In fact, the sample size \(S\) is chosen so that the half-width \(c_s(\epsilon)\) of the confidence interval/concentration is smaller than the gap \(|\Lambda_s(\theta,\theta^*)-\psi(u,\theta,\theta^*)|\). In the case of sampling without replacement, \(c_s(\epsilon)\) can be shown to depend (linearly) on the sample standard deviation and on the value of the sample with the largest magnitude.
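As an illustration, an empirical-Bernstein-style bound has exactly this structure (the constants and the range bound `R` are assumptions of this sketch, not the exact expression used in the literature):

```python
import numpy as np

def bernstein_halfwidth(sub, R, eps):
    """Empirical-Bernstein-style half-width c_s(eps) for a subsample:
    linear in the sample standard deviation and in the range bound R."""
    S = len(sub)
    log_term = np.log(3.0 / eps)
    return sub.std() * np.sqrt(2.0 * log_term / S) + 3.0 * R * log_term / S

# The decision can be taken once c_s drops below the observed gap; note how
# the half-width shrinks as the subsample grows:
rng = np.random.default_rng(0)
values = rng.uniform(-1.0, 1.0, size=10_000)   # toy log-likelihood ratios
c_small_S = bernstein_halfwidth(values[:100], R=2.0, eps=0.05)
c_large_S = bernstein_halfwidth(values, R=2.0, eps=0.05)
```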
A way to decrease this term without adding an excessive number of samples is to use proxy functions \(\lbrace\rho_i(\theta,\theta^*), i\in [\![1,N]\!]\rbrace\) verifying:
for all \(i\in [\![1,N]\!]\), \(\rho_i(\theta,\theta^*)\approx \log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\)
\(\sum\limits_{i=1}^N \rho_i(\theta,\theta^*)\) is easily computable
\(\left\lbrace \left\vert\rho_i(\theta,\theta^*)- \log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}\right\vert, i\in [\![1,N]\!]\right\rbrace\) can easily be bounded
In that case, the acceptance decision is equivalent to checking whether
\[\frac{1}{N}\sum\limits_{i=1}^N \left[\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}-\rho_i(\theta,\theta^*)\right] > \tilde\psi(u,\theta,\theta^*):=\psi(u,\theta,\theta^*)-\frac{1}{N}\sum\limits_{i=1}^N \rho_i(\theta,\theta^*)\]and therefore the same subsampling reasoning can be applied to the population
\[\left\lbrace\log\frac{\pi(x_i\vert \theta^*)}{\pi(x_i\vert \theta)}-\rho_i(\theta,\theta^*), i\in [\![1,N]\!] \right\rbrace\]which by construction gives rise to smaller concentration half-widths \(c_s(\epsilon)\). Examples of such proxy functions are Taylor expansions of the log-likelihood function \((x,\theta) \mapsto \log \pi(x \vert \theta)\).
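A sketch of such first-order Taylor proxies, built around a fixed reference point \(\widehat\theta\) (the Gaussian toy model is an assumption of this sketch):

```python
import numpy as np

def make_taylor_proxies(grad_log_lik, x, theta_hat):
    """First-order Taylor proxies rho_i around a reference theta_hat.

    grad_log_lik(x_i, theta) is the gradient of theta -> log pi(x_i|theta).
    The per-datum gradients are computed once, after which the full sum
    sum_i rho_i(theta, theta*) = (sum_i g_i)^T (theta* - theta) costs O(d)."""
    g = np.array([grad_log_lik(xi, theta_hat) for xi in x])  # one-off O(Nd)
    g_sum = g.sum(axis=0)
    rho = lambda i, theta, theta_star: g[i] @ (theta_star - theta)
    rho_sum = lambda theta, theta_star: g_sum @ (theta_star - theta)
    return rho, rho_sum

# Toy Gaussian model: log pi(x|theta) = -0.5 (x - theta)^2 (up to a constant).
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=(200, 1))
rho, rho_sum = make_taylor_proxies(lambda xi, t: xi - t, x, np.array([1.0]))
total = rho_sum(np.array([0.8]), np.array([1.2]))
```

By linearity, `rho_sum` equals the sum of the individual proxies while never touching the data again after the one-off pass.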
In the original algorithm, at each iteration \(k+1\), given the current value of the parameters \(\theta_k\) and a proposal \(\theta^*\), we need to compute
\[\Lambda_N(\theta_k,\theta^*)=\underbrace{\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\pi(x\vert \theta^*)}_{\text{not computed yet}} -\underbrace{\frac{1}{\vert \boldsymbol x\vert}\sum\limits_{x\in\boldsymbol x}\log\pi(x\vert \theta_k)}_{\text{computed at iteration }k}\]Therefore \(N\) calls to the log-likelihood function
\((x,\theta) \mapsto \log \pi(x \vert \theta)\) are needed at each
iteration.
With these new subsampling approaches, the sample sizes (and the samples themselves) change from one iteration to the next. So the quantity to compute at iteration \(k+1\) is
\[\Lambda_s(\theta_k,\theta^*)=\frac{1}{\vert \boldsymbol x_s\vert}\sum\limits_{x\in\boldsymbol x_s}\log\pi(x\vert \theta^*) - \frac{1}{\vert \boldsymbol x_s\vert}\sum\limits_{x\in\boldsymbol x_s}\log\pi(x\vert \theta_k)\]where neither sum can be reused from iteration \(k\), since the subsample \(\boldsymbol x_s\) has changed. Therefore, in general, \(2S\) calls to the log-likelihood function \((x,\theta) \mapsto \log \pi(x \vert \theta)\) are needed at each iteration, where \(S\) varies between iterations. Hence, subsampling methods are interesting as long as, for most (if not all) iterations, we have \(2S \ll N\).
Bardenet, R. (2017, December). On Markov chain Monte Carlo for tall data. Paper presented at "Journée algorithmes stochastiques", Université Paris Dauphine, Paris, France.
Bardenet, R., Doucet, A., Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827.
Korattikara, A., Chen, Y., Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 181-189).
Minsker, S., Srivastava, S., Lin, L., Dunson, D. (2014, January). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning (pp. 1656-1664).
Neiswanger, W., Wang, C., Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.
The goal of supervised Machine Learning algorithms is to predict the values of some unknown variable from known explanatory variables, given a set of observations. Contrary to purely statistical approaches, no model has to be specified to the algorithm. In many applications, both the number of observations \(n\) and the number of explanatory variables \(d\) can be large. Stochastic algorithms are used in such large-scale frameworks to make prediction tractable.
Consider an input/output pair \((X,Y)\in\mathcal{X}\times\mathcal{Y}\), \(\mathcal{Y}\subset \mathbb{R}\), with unknown joint distribution \(\rho\). The goal is to find a function \(\theta : \mathcal{X} \rightarrow \mathbb{R}\) such that \(\theta(X)\) is a good prediction for \(Y\). To do so, we consider a loss function \(l : \mathcal{Y}\times \mathbb{R} \rightarrow \mathbb{R}_+\) such that \(l(y,\theta(x))\) is the loss from predicting \(y\) using \(\theta(x)\).
The Generalization risk (or true risk) is defined as
\[\begin{aligned} \mathcal{R}(\theta):=\mathbb{E}_\rho[l(Y, \theta(X))] \end{aligned}\]and is a measure of the quality of \(\theta(X)\) as a predictor of \(Y\). Given that \(\rho\) is unknown, this risk cannot, in general, be computed directly.
Starting from a set of \(n\) observations \((x_i,y_i)\in\mathcal{X}\times \mathcal{Y}\), \(i\in[\![1,n]\!]\) assumed to be i.i.d. with (unknown) distribution \(\rho\), the Empirical risk (or training error) is defined as
\[\begin{aligned} \widehat{\mathcal{R}}(\theta)=\frac{1}{n}\sum\limits_{i=1}^nl(y_i,\theta(x_i)) \end{aligned}\]and is used in practice to choose the best predictor \(\theta\), considering the observations and some regularisation factor \(\Omega(\theta)\), through the Empirical Risk Minimisation (ERM) problem :
\[\begin{aligned} \theta^* = \mathop{\text{argmin}}\limits_\theta \left[\widehat{\mathcal{R}}(\theta) + \mu\Omega(\theta)\right]=\mathop{\text{argmin}}\limits_\theta \left[\frac{1}{n}\sum\limits_{i=1}^nl(y_i,\theta(x_i)) + \mu\Omega(\theta)\right] \end{aligned}\]From now on, linear functions are considered. We reparametrize:
\[\begin{aligned} \theta(X) = \langle \theta, \Phi(X) \rangle, \quad \theta\in\mathbb{R}^d,\quad \Phi : \mathcal{X}\rightarrow \mathbb{R}^d \end{aligned}\]The large number of explanatory variables \(d\) pushes us to choose a first-order algorithm for the ERM, namely the Gradient Descent (GD) algorithm. Choosing a second-order algorithm such as Newton-Raphson would instead require computing a \(d\times d\) matrix at each iteration: the Hessian of the overall loss.
Each iteration requires \(\mathcal{O}(dn)\) operations to compute the gradient. Stochastic algorithms are used to reduce this cost. In particular, the Stochastic Gradient algorithm only requires \(\mathcal{O}(d)\) operations per iteration.
The idea of Stochastic Gradient Descent (SGD) is to replace, at each iteration \(k\), the gradient of the empirical risk \(\nabla\widehat{\mathcal{R}}(\theta^{(k-1)})\) by a gradient function \(g^{(k)}(\theta^{(k-1)})\) that is cheaper to compute and verifies:
\[\begin{equation} \mathbb{E}[g^{(k)}(\theta^{(k-1)})\vert \mathcal{F}_{k-1} ]=\nabla\widehat{\mathcal{R}}(\theta^{(k-1)}) \tag{1}\label{SGD_origin} \end{equation}\]for a filtration \((\mathcal{F}_{k})_{k\ge 0}\) such that \(\theta^{(k)}\) is \(\mathcal{F}_k\)-measurable.
Notice that the empirical risk can be written:
\[\begin{aligned} \widehat{\mathcal{R}}(\theta)=\frac{1}{n}\sum\limits_{i=1}^nf_i(\theta), \quad f_i : \theta \mapsto l(y_i,\theta(x_i)) \end{aligned}\]Consider the filtration \(\mathcal{F}_k=\sigma\left((x_i,y_i)_{1\le i \le n}, (i_j)_{0\le j \le k}\right)\) where \((i_j)_{j \ge 0}\) are independent indices uniformly sampled from \([\![1,n]\!]\). It can be shown that \(g^{(k)}:=\nabla{f_{i_k}}\) verifies \eqref{SGD_origin}. The SGD algorithm is then described below.
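A minimal sketch of this algorithm (the quadratic toy losses \(f_i(\theta)=\frac{1}{2}(\theta-a_i)^2\) and the step-size schedule are illustrative assumptions of this sketch):

```python
import numpy as np

def sgd(grad_fi, n, theta0, gamma, n_iter, rng):
    """Plain SGD on the empirical risk (1/n) sum_i f_i(theta).

    grad_fi(i, theta): gradient of the i-th loss term; each iteration
    costs O(d), versus O(dn) for a full-gradient step."""
    theta = np.array(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        i = rng.integers(n)                    # uniform index i_k
        theta -= gamma(k) * grad_fi(i, theta)  # noisy gradient step
    return theta

# Toy strongly convex problem: f_i(theta) = 0.5 (theta - a_i)^2,
# whose empirical-risk minimizer is the mean of the a_i.
rng = np.random.default_rng(0)
a = rng.normal(2.0, 1.0, size=1000)
theta = sgd(lambda i, t: t - a[i], n=1000, theta0=0.0,
            gamma=lambda k: 1.0 / k, n_iter=5000, rng=rng)
```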
To reduce the noise introduced by the gradient proxy, we may take the average of all states \(\theta^{(k)}\) as an estimator of the optimum, instead of just the last one. This average \(\bar\theta^{(k)}\) may be computed "online", meaning that it can be updated at each iteration without storing all previous states.
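The online update of this average can be written as:

```python
def online_average(theta_bar, theta_new, k):
    """Update of the running average after observing the (k+1)-th state
    theta^(k), without storing any past state."""
    return theta_bar + (theta_new - theta_bar) / (k + 1)

# Averaging the states 1, 2, 3, 4 one at a time:
bar = 1.0
for k, theta in enumerate([2.0, 3.0, 4.0], start=1):
    bar = online_average(bar, theta, k)
```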
Assume the following properties for the loss function \(l\) :
\(l\) is \(L\)-smooth, i.e. \(L\) is an upper bound on the eigenvalues of the Hessian matrix of \(l\) at any point
\(l\) is \(\mu\)-strongly convex, i.e. \(\mu\) is a lower bound on the eigenvalues of the Hessian matrix of \(l\) at any point
In particular, for any points of evaluation \(u,v\), \(l\) satisfies:
\[\begin{aligned} \langle \nabla l(v), u-v\rangle + \frac{\mu}{2}\Vert u-v\Vert^2 \le l(u)-l(v) \le \langle \nabla l(v), u-v\rangle + \frac{L}{2}\Vert u-v\Vert^2 \end{aligned}\]With some additional assumptions on the observations \(\Phi(x_i)\) (bounded variance and invertible experimental covariance matrix), it can be shown that the empirical risk also satisfies these two properties.
For smooth and convex problems, it was shown that, almost surely, \(\theta_k \rightarrow \theta^*\) if
\[\begin{aligned} \sum\limits_{k=1}^\infty \gamma_k =\infty \quad \text{and} \quad \sum\limits_{k=1}^\infty \gamma_k^2 <\infty \end{aligned}\]and asymptotically, for \(\gamma_k =\frac{\gamma_0}{k}\) with \(\gamma_0 \ge \frac{1}{\mu}\), \(\sqrt{k}(\theta_k - \theta^*) \mathop{\rightarrow}\limits^d \mathcal{N}(0,V)\).
The convergence rate of an optimization algorithm measures the speed of convergence towards the solution of the problem. With our notations, it is given by the speed at which \((\widehat{\mathcal{R}}(\theta^{(k)}))_{k\ge 0}\) converges to the minimum of \(\widehat{\mathcal{R}}\). Let’s compare the convergence rates of GD and (A)SGD for both convex and strongly convex functions.
On the one hand, GD can converge much faster to the optimum \(\theta^*\) than SGD. On the other hand, each iteration of GD is more costly (by a factor \(n\)) than an iteration of SGD. The idea is to come up with a way to get the best of both worlds.
The Stochastic Average Gradient (SAG) algorithm is an SGD-type algorithm in which, at each iteration, the averaged gradient \(g^{(k)}\) given by
\[\begin{aligned} g^{(k)}=\frac{1}{n}\sum\limits_{i=1}^n\nabla f_i(\theta^{(k_i)}), \quad k_i\in[\![0,k]\!] \end{aligned}\]is updated through its \(i_k\)-th term, where \(i_k\) is once again an index randomly chosen at each iteration: the term \(\nabla f_{i_k}(\theta^{(k_{i_k})})\) is replaced by \(\nabla f_{i_k}(\theta^{(k)})\) in the sum. The iterates \((\theta^{(k)})_{k\ge 0}\) are obtained through GD steps in which the gradient is replaced by \(g^{(k)}\).
This algorithm has the same update cost as SGD, but the averaged gradient \(g^{(k)}\) has a smaller variance than the gradient used in SGD. Moreover, choosing a constant step size \(\gamma_k=\frac{1}{2nL}\) yields a convergence rate of \(\mathcal{O}((1- \frac{1}{8Ln})^k)\). However, the \(n\) elementary gradients \(\nabla f_i(\theta^{(k_i)})\) must be stored at all times, as one of them is systematically used in the update of \(g^{(k)}\).
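A sketch of SAG under these conventions (the quadratic toy losses and the zero initialization of the gradient table are assumptions of this sketch; note the \(\mathcal{O}(n)\) memory for the stored gradients):

```python
import numpy as np

def sag(grad_fi, n, theta0, L, n_iter, rng):
    """Stochastic Average Gradient sketch: stores the n elementary
    gradients and refreshes only the i_k-th one per iteration, so each
    update costs O(d) like SGD."""
    theta = np.array(theta0, dtype=float)
    grads = np.zeros((n,) + theta.shape)   # table of nabla f_i(theta^(k_i))
    g_sum = np.zeros_like(theta)           # running sum of the table
    gamma = 1.0 / (2 * n * L)              # constant step size from above
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = grad_fi(i, theta)
        g_sum += g_new - grads[i]          # refresh the i-th table entry
        grads[i] = g_new
        theta -= gamma * (g_sum / n)       # GD step with the averaged gradient
    return theta

# Toy problem: f_i(theta) = 0.5 (theta - a_i)^2 (so L = 1), optimum = mean(a).
rng = np.random.default_rng(0)
a = rng.normal(2.0, 1.0, size=50)
theta = sag(lambda i, t: t - a[i], n=50, theta0=0.0, L=1.0,
            n_iter=20_000, rng=rng)
```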
Using the empirical risk to determine the best predictor \(\theta^*\) given the data raises several problems. First, to avoid over-fitting, a regularisation function must be introduced. Second, the number of iterations needed to reach the optimum is hard to predict.
On the other hand, using the generalisation risk yields a more general approach, given that it measures how accurately we may predict outcomes for previously unseen data (through the expectation). Therefore, regularisation is no longer needed. As we will now see, SGD can be used to minimise the generalisation risk, even though its expression is unknown.
SGD would then replace the gradient of the generalisation risk \(\nabla\mathcal{R}(\theta^{(k-1)})\) by a proxy \(g^{(k)}(\theta^{(k-1)})\) verifying:
\[\begin{equation} \mathbb{E}[g^{(k)}(\theta^{(k-1)})\vert \mathcal{F}_{k-1} ]=\nabla{\mathcal{R}}(\theta^{(k-1)})=\mathbb{E}_\rho[\nabla l(Y, \langle\theta^{(k-1)}, \Phi(X)\rangle)] \tag{2}\label{SGD_origin_GR} \end{equation}\]Considering that the examples \((x_i,y_i)_{1\le i \le n}\) are i.i.d from \(\rho\), we can take for \(g^{(k)}\) at iteration \(1\le k \le n\):
\[\begin{aligned} g^{(k)}(\theta^{(k-1)})=\nabla l(y_k, \langle\theta^{(k-1)}, \Phi(x_k)\rangle) \end{aligned}\]and the filtrations \(\mathcal{F}_{k-1}=\sigma((x_i,y_i)_{1\le i \le k-1})\). Therefore, there is a single pass through the data in this algorithm, and exactly \(n\) iterations are performed. With a convex (resp. strongly convex) loss function and a constant step size, it achieves a convergence rate of \(\mathcal{O}(1/\sqrt{n})\) (resp. \(\mathcal{O}(1/(\mu n))\)).
All the convergence rates previously mentioned correspond to the use of a quadratic loss function \((x,y) \mapsto l(y,\theta(x))=(y-\theta^T\Phi(x))^2\). When using other loss functions, these results no longer hold, and SGD may even fail to converge! For instance, for the logistic loss \((x,y) \mapsto l(y,\theta(x))=\log\left(1+\exp(-y\,\theta^T\Phi(x))\right)\), averaged SGD with constant step size \(\gamma\) was shown to yield errors \(\mathbb{E}[\bar\theta^{(k)}-\theta^*]=\mathcal{O}(\gamma)\) (therefore independent of the number of iterations...).
To understand this problem and come up with solutions, SGD (with constant step size) was formulated as a homogeneous Markov chain.
Consider a \(L\)-smooth and \(\mu\)-strongly convex risk \(\mathcal{R}\). Then SGD with a step size \(\gamma\) may be seen as the generation of a sequence \((\theta^{(k)}_\gamma)_{k\ge 0}\) with recurrence :
\[\begin{aligned} \theta^{(k)}_\gamma=\theta^{(k-1)}_\gamma-\gamma\underbrace{\left(\nabla\mathcal{R}(\theta^{(k-1)}_\gamma)+\varepsilon^{(k)}(\theta^{(k-1)}_\gamma) \right)}_{=g^{(k)}(\theta^{(k-1)}_\gamma)} \end{aligned}\]where \((\varepsilon^{(k)})_{k\ge 1}\) are i.i.d. random noises verifying, for any \(\theta\), \(\mathbb{E}[\varepsilon^{(k)}(\theta)\vert \mathcal{F}_{k-1}]=0\), for the same filtrations as in \eqref{SGD_origin_GR}. Introduced like this, \((\theta^{(k)}_\gamma)_{k\ge 0}\) is a homogeneous Markov chain, which allows us to understand the general behaviour of SGD.
Indeed, this Markov chain converges exponentially fast to its (unique) stationary distribution \(\pi_\gamma\). Consequently, the sequence (and therefore SGD) does not converge to a point, but rather (asymptotically) oscillates around the mean \(\bar\theta_\gamma = \mathbb{E}_{\pi_\gamma}[\theta]\) with fluctuations of order \(\sqrt{\gamma}\). On the other hand, taking as output the average \(\bar\theta^{(k)}_\gamma\) of the \(k+1\) states \(\lbrace\theta^{(i)}_\gamma : 0\le i \le k \rbrace\), a central limit theorem shows that, at rate \(\mathcal{O}(1/\sqrt{k})\):
\[\begin{aligned} \bar\theta^{(k)}_\gamma \mathop{\rightarrow}\limits_{k\rightarrow \infty}^{L^2} \bar\theta_\gamma = \mathbb{E}_{\pi_\gamma}[\theta] \end{aligned}\]Therefore the output of averaged SGD converges at a known rate to \(\bar\theta_\gamma\), which may differ from the optimum \(\theta^*\). The error between the averaged output \(\bar\theta^{(k)}_\gamma\) and the optimum can be decomposed:
\[\begin{aligned}\bar\theta^{(k)}_\gamma-\theta^*=\underbrace{\bar\theta^{(k)}_\gamma-\bar\theta_\gamma}_{\mathcal{O}(1/\sqrt{k})}+\underbrace{\bar\theta_\gamma - \theta^*}_{\text{independent of }k} \end{aligned}\]Notice that if we take \(\theta^{(0)}_\gamma \sim \pi_\gamma\), then by definition of \(\pi_\gamma\),
\[\begin{aligned} \theta^{(0)}_\gamma -\gamma\left(\nabla\mathcal{R}(\theta^{(0)}_\gamma)+\varepsilon^{(1)}(\theta^{(0)}_\gamma) \right)=\theta^{(1)}_\gamma \sim \pi_\gamma \end{aligned}\]Hence, by taking the expectation under \(\pi_\gamma\) we get :
\[\begin{aligned} \mathbb{E}_{\pi_\gamma}[\nabla\mathcal{R}(\theta^{(0)}_\gamma)]=0 \end{aligned}\]and recalling the expression of \(\mathcal{R}\) we get
\[\begin{aligned} \mathbb{E}_{\pi_\gamma}[\nabla\mathcal{R}(\theta)]=\mathbb{E}_{\pi_\gamma}\left[\mathbb{E}_\rho[\nabla l(Y, \Phi(X)^T\theta)]\right]=\mathbb{E}_\rho\left[\mathbb{E}_{\pi_\gamma}[\nabla l(Y, \Phi(X)^T\theta)]\right]=0 \end{aligned}\]In particular, in the quadratic loss case, given that \(\theta \mapsto \nabla l(Y, \Phi(X)^T\theta)=-2(Y-\Phi(X)^T\theta)\Phi(X)\) is an affine function of \(\theta\), we have
\[\begin{aligned} \mathbb{E}_\rho\left[\mathbb{E}_{\pi_\gamma}[\nabla l(Y, \Phi(X)^T\theta)]\right]=\mathbb{E}_\rho\left[\nabla l(Y, \Phi(X)^T\mathbb{E}_{\pi_\gamma}[\theta])\right]=\nabla\mathcal{R}(\mathbb{E}_{\pi_\gamma}[\theta])=0 \end{aligned}\]And therefore,
\[\begin{aligned} \bar\theta_\gamma = \mathbb{E}_{\pi_\gamma}[\theta]=\theta^*. \end{aligned}\]We retrieve the fact that in the quadratic loss case, the averaged SGD converges to the optimum at rate \(\mathcal{O}(1/\sqrt{k})\).
However, in the general case, using a Taylor expansion of \(\mathcal{R}\), we can only show that
\[\begin{equation} \bar\theta_\gamma = \theta^*+\gamma C +\mathcal{O}(\gamma^2) \tag{3}\label{gen_output} \end{equation}\]where \(C\) is a constant independent of \(\gamma\). Therefore averaged SGD only converges to a point "near" the optimum. To get a better estimate of \(\theta^*\), we can use Richardson extrapolation. For instance, with two terms, the idea is to run SGD with step sizes \(\gamma\) and \(2\gamma\), yielding estimates of \(\bar\theta_\gamma\) and \(\bar\theta_{2\gamma}\). Then, from \eqref{gen_output}, notice that:
\[\begin{aligned} 2\bar\theta_\gamma - \bar\theta_{2\gamma} = \theta^* + \mathcal{O}(\gamma^2) \end{aligned}\]is a better estimate of \(\theta^*\). This approach generalizes straightforwardly to more terms (i.e. more runs of SGD).
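A sketch of the two-term extrapolation (the closed-form stand-in for the averaged-SGD output \(\bar\theta_\gamma\) is a toy assumption mimicking \eqref{gen_output}, with \(\theta^*=1\) and \(C=0.3\)):

```python
def richardson(avg_sgd, gamma):
    """Two-term Richardson extrapolation: combining runs at step sizes
    gamma and 2*gamma cancels the O(gamma) bias term."""
    return 2.0 * avg_sgd(gamma) - avg_sgd(2.0 * gamma)

# Toy stand-in for the averaged-SGD output: theta_bar = 1 + 0.3 g + 0.05 g^2.
avg = lambda g: 1.0 + 0.3 * g + 0.05 * g ** 2
estimate = richardson(avg, 0.1)
```

The extrapolated estimate is within \(\mathcal{O}(\gamma^2)\) of \(\theta^*=1\), while each individual run is off by \(\mathcal{O}(\gamma)\).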
Bach, F. (2012). Stochastic gradient methods for machine learning. Technical report. INRIA-ENS, Paris, France. Available at http://www.di.ens.fr/~fbach/fbach_sgd_online.pdf
Dieuleveut, A., Durmus, A., Bach, F. (2017). Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains. arXiv preprint arXiv:1707.06386.
Dieuleveut, A. (2017, December). Stochastic algorithms in Machine Learning. Paper presented at "Journée algorithmes stochastiques", Université Paris Dauphine, Paris, France.
Roux, N. L., Schmidt, M., Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (pp. 2663-2671).
For the first time, a whole session was dedicated to Machine Learning (ML), clearly showing the growing interest of the industry in the subject. It was kicked off by a keynote lecture which focused on how ML could be applied to make better reservoir predictions. In his lecture, Pr. Demyanov insisted on the fact that these algorithms should not be used as black boxes on the data. Instead, they should be carefully designed with the help of a domain expert in order to ensure the quality and the interpretability of their outputs. Geologists, geophysicists and geostatisticians, let’s rejoice! It seems we will not be replaced by an artificial intelligence anytime soon…
Several talks presented applications of some “trendy” ML algorithms to seismic and geological data, among which Convolutional Neural Networks and Generative Adversarial Networks (GANs). In particular, the latter proved to be a very serious contender to the use of multi-point statistics for facies inversion and geological image synthesis.
As for Geostatistics, one can note the return, in an industrial context, of neglected geostatistical methods, such as simulations using the turning bands method or the use of truncated Gaussian fields for facies inversion. As for new methods, we retain the use of the stochastic partial differential equation (SPDE) approach as a new paradigm that makes it easy to integrate local geometric information and to work with non-stationary fields. Estimages (represented by yours truly) and MINES ParisTech (represented by researchers Nicolas Desassis and Didier Renard) presented successful applications of this approach to, respectively, seismic image filtering and facies inversion.
Finally, a need for practical solutions to inversion problems transpired from the talks and from a very interesting discussion that concluded the conference. Geostatistics is most useful when integrated into fully-operational workflows, some of which were presented during the conference. Also, communication is key: designing complex algorithms, even if they work well, does not ensure that they will subsequently be used by practitioners…
Big thanks to EAGE for organizing a great conference. We all look forward to the next one!