The variational approach to concentration

June 10, 2024

$\newcommand{\norm}[1]{\|#1\|} \renewcommand{\Re}{\mathbb{R}} \newcommand{\Mspace}[1]{\mathcal{M}({#1})} \newcommand{\E}{\mathbb{E}} \newcommand{\la}{\langle} \newcommand{\ra}{\rangle} \renewcommand{\Pr}{\mathbb{P}} \newcommand{\sd}{\mathbb{S}^{d-1}} \newcommand{\Tr}{\text{Tr}} \newcommand{\kl}{D_{\text{KL}}} \newcommand{\ind}{\mathbf{1}} \newcommand{\d}{\text{d}}$

Let $$(S_n)$$ be some stochastic process in, say, $$\Re^d$$. For instance, $$S_n = \sum_{i=1}^n X_i$$ for multivariate observations $$X_i\in\Re^d$$. We aim to derive a high-probability bound on

$\norm{S_n} = \sup_{v:\norm{v}=1} \la v,S_n\ra.$

There are several well-known ways to attempt this. Among them are covering, chaining, and Doob decomposition (done, e.g., here). Here we’ll explore a separate (and relatively new) approach.

The idea is to use the variational inequality which is at the core of the PAC-Bayesian methodology (so we could equivalently call this the PAC-Bayes approach to concentration). This lets us simultaneously bound $$\la v,S_n\ra$$ in each direction $$v$$. Recall that a PAC-Bayes bound has the form

$\label{eq:pb-basic-bound} \Pr( \forall \rho \in \Mspace{\Theta}: \text{Something holds}) \geq 1-\delta, \tag{1}$

where $$\Theta$$ is some parameter space and $$\Mspace{\Theta}$$ is the set of all probability measures over that space. A PAC-Bayes bound provides a high probability bound simultaneously over all posteriors. The variational approach to concentration translates this into a high probability bound simultaneously over all directions.

It’s worth taking a moment to understand why this simultaneity property is valuable. A natural thought is to treat $$\la v,S_n\ra$$ as a scalar-valued process and apply well-understood concentration results for real-valued processes. Done naively, this gives a separate bound for each $$v\in\sd$$, i.e., a result of the form:

$\forall v\in\sd, \Pr(\la v,S_n\ra \geq B_n) \leq \delta.$

But now we’re stuck. This is not a bound on $$\sup_{v\in\sd}\la v,S_n\ra$$. The natural fix is a union bound over $$v$$, which would move the “$$\forall v\in\sd$$” inside the probability statement and give us the result. But there are uncountably many such vectors. The approach explored here solves this problem by translating the “$$\forall \rho\in\Mspace{\Theta}$$” in \eqref{eq:pb-basic-bound} into “$$\forall v\in\sd$$”.

This variational approach was pioneered by Catoni and Giulini (here and here), and has since been used by several authors to prove bounds in a variety of settings.

# A general variational inequality

Different authors use seemingly different PAC-Bayesian inequalities to achieve their results. However, we recently showed that all of these inequalities are specific instantiations of the following more general result.

Let $$\Theta$$ be some measurable parameter space and, for each $$\theta\in\Theta$$, let $$N(\theta)$$ be a nonnegative random variable with expected value at most 1 (it’s an e-value, if you like). Then, for any fixed prior $$\nu\in\Mspace{\Theta}$$,

$\Pr(\forall\rho\in\Mspace{\Theta}: \E_\rho \log N(\theta) \leq \kl(\rho\|\nu) + \log(1/\delta))\geq 1-\delta,$

where, as above, $$\Mspace{\Theta}$$ is the set of all probability measures on $$\Theta$$. The variational approach to concentration involves (i) finding some appropriate family of random variables $$N(\theta)$$, (ii) choosing $$\nu$$ and a family of distributions $$\{\rho_\theta:\theta\in\Theta\}$$ such that $$\sup_\theta \kl(\rho_\theta \|\nu)$$ is small, and (iii) applying this “master theorem.”
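For intuition, the proof of this master theorem is the standard two-step PAC-Bayes argument, sketched here. The Donsker-Varadhan variational formula converts the statement about all posteriors into a single statement about the prior:

$\sup_{\rho\in\Mspace{\Theta}}\left(\E_\rho \log N(\theta) - \kl(\rho\|\nu)\right) = \log \E_{\theta\sim\nu} N(\theta).$

By Fubini, $$\E\,\E_{\theta\sim\nu} N(\theta) = \E_{\theta\sim\nu}\E N(\theta) \leq 1$$, so Markov’s inequality gives $$\E_{\theta\sim\nu}N(\theta) \leq 1/\delta$$ with probability at least $$1-\delta$$. Combining the two displays yields the result.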

# Example 1: Sub-Gaussian random vectors

This comes from our paper on time-uniform confidence spheres. Consider $$n$$ iid copies $$X_1,\dots,X_n$$ of a $$\Sigma$$-sub-Gaussian random vector $$X\in\Re^d$$. That is,

$\E\exp(\lambda \la \theta, X\ra) \leq \exp\left(\frac{\lambda^2}{2}\la\theta,\Sigma\theta\ra\right),$

for all $$\lambda\in\Re$$ and $$\theta\in\Re^d$$. This implies that

$N(\theta) = \exp\left\{\lambda\sum_{i\leq n}\la \theta, X_i\ra - \frac{n\lambda^2}{2}\la\theta,\Sigma\theta\ra\right\},$

has expectation at most 1 for any fixed $$\lambda$$: by independence and sub-Gaussianity, $$\E N(\theta) = \prod_{i\leq n}\E\exp\left(\lambda\la \theta,X_i\ra - \frac{\lambda^2}{2}\la\theta,\Sigma\theta\ra\right)\leq 1$$. Let $$\nu$$ be a Gaussian with mean 0 and covariance $$\beta^{-1}I$$ for some $$\beta>0$$, and consider the family of distributions $$\{\rho_u:\norm{u}=1\}$$, where $$\rho_u$$ is a Gaussian with mean $$u$$ and covariance $$\beta^{-1}I$$. By the standard formula for the KL divergence between Gaussians with a common covariance, $$\kl(\rho_u\|\nu) = \frac{\beta\norm{u}^2}{2} = \frac{\beta}{2}$$. Using the master theorem above, we obtain that, with probability $$1-\delta$$, simultaneously for all distributions $$\rho$$,

$\lambda\sum_{i\leq n} \E_\rho \la\theta, X_i\ra \leq \frac{n\lambda^2}{2}\E_\rho \la \theta, \Sigma\theta\ra + \frac{\beta}{2} + \log(1/\delta).$

Now, for $$\rho=\rho_u$$, $$\E_\rho \la \theta, X_i\ra = \la u,X_i\ra$$ and

$\E_\rho \la \theta, \Sigma\theta\ra = \la u,\Sigma u\ra + \beta^{-1}\Tr(\Sigma) \leq \norm{\Sigma} + \beta^{-1}\Tr(\Sigma),$

using basic properties of the expectation of quadratic forms under Gaussian distributions (see e.g. here), and the definition of the operator norm as $$\norm{A} = \sup_{u,v:\norm{u}=\norm{v}=1}\la u,A v \ra$$. Since this holds simultaneously for all $$\rho_u$$, we obtain that, with probability $$1-\delta$$,

$\sup_u \lambda \sum_{i\leq n} \la u,X_i\ra \leq \frac{n\lambda^2}{2}(\norm{\Sigma} + \beta^{-1}\Tr(\Sigma)) + \frac{\beta}{2} + \log(1/\delta).$

The left hand side is equal to $$\lambda \norm{\sum_{i\leq n}X_i}$$ (for $$\lambda>0$$), which gives us our concentration result. One can then optimize $$\lambda$$ using some calculus, as sketched below. This gives state-of-the-art concentration up to an additive factor of $$(\Tr(\Sigma^2)/n)^{1/4}$$.
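To carry out the optimization explicitly: write $$V = \norm{\Sigma} + \beta^{-1}\Tr(\Sigma)$$ and divide the previous display by $$\lambda > 0$$. The right hand side becomes $$\frac{n\lambda V}{2} + \frac{\beta/2 + \log(1/\delta)}{\lambda}$$, which is minimized at $$\lambda = \sqrt{(\beta + 2\log(1/\delta))/(nV)}$$, giving

$\norm{\sum_{i\leq n}X_i} \leq \sqrt{2n\left(\norm{\Sigma} + \beta^{-1}\Tr(\Sigma)\right)\left(\frac{\beta}{2} + \log(1/\delta)\right)}.$

As a numerical sanity check, here is a minimal simulation (a sketch; the dimension, covariance, and the particular choice $$\beta = 2\Tr(\Sigma)/\norm{\Sigma}$$ are illustrative assumptions, not prescriptions from the paper). Gaussian data with covariance $$\Sigma$$ is $$\Sigma$$-sub-Gaussian, so the empirical failure rate should stay below $$\delta$$:

```python
import numpy as np

# Empirically estimate how often ||sum_i X_i|| exceeds the fixed-beta bound
# sqrt(2 n V (beta/2 + log(1/delta))), with V = ||Sigma||_op + Tr(Sigma)/beta,
# for iid X_i ~ N(0, Sigma) (which are Sigma-sub-Gaussian).
rng = np.random.default_rng(0)
d, n, delta = 5, 500, 0.05
Sigma = np.diag(np.linspace(0.5, 2.0, d))   # illustrative covariance
op_norm, trace = np.linalg.norm(Sigma, 2), np.trace(Sigma)
beta = 2 * trace / op_norm                  # illustrative choice of beta
V = op_norm + trace / beta
bound = np.sqrt(2 * n * V * (beta / 2 + np.log(1 / delta)))

trials = 2000
failures = sum(
    np.linalg.norm(rng.multivariate_normal(np.zeros(d), Sigma, size=n).sum(axis=0)) > bound
    for _ in range(trials)
)
print(f"empirical failure rate: {failures / trials:.4f} (guaranteed <= {delta})")
```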

# Example 2: Random matrices with finite Orlicz norm

This example is adapted from Zhivotovskiy (2024). Let $$M_1,\dots,M_n$$ be iid copies of a random matrix $$M$$ with finite sub-exponential Orlicz norm, in the sense that, for some $$C>0$$,

$\norm{\la \theta, M\phi\ra}_{\psi_1} \leq C \la\theta, \Sigma\phi\ra,$

for all $$\theta, \phi\in\Re^d$$ where $$\Sigma = \E M$$. Here

$\norm{Y}_{\psi_1} = \inf\left\{u>0: \E\exp(|Y|/u)\leq 2\right\}.$
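As a quick concrete example of this norm: if $$Y$$ is a standard exponential random variable, then for $$u>1$$,

$\E\exp(|Y|/u) = \int_0^\infty e^{y/u}e^{-y}\,\d y = \frac{u}{u-1},$

which is decreasing in $$u$$ and equals 2 at $$u=2$$, so $$\norm{Y}_{\psi_1} = 2$$. Finiteness of the $$\psi_1$$-norm is equivalent to sub-exponential tails.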

We take our parameter space in the master theorem above to be $$\Theta = \Re^d\times \Re^d$$. Let $$\nu$$ again be a Gaussian with mean 0, now with covariance $$\beta^{-1}\Sigma$$, and let $$\mu_u$$ be a truncated Gaussian with mean $$u$$, covariance $$\beta^{-1}\Sigma$$, and truncation radius $$r$$ (chosen below). Being slightly loose with notation and writing $$\d\mu$$ for the density of $$\mu$$, the density of the truncated Gaussian can be written as

$\d\mu_{u}(x) = \frac{\ind\{\norm{x - u}\leq r\}}{Z}\d \rho_u(x),$

where $$Z$$ is some normalizing constant and $$\rho_u$$ is a non-truncated Gaussian. For a vector $$u\in \Sigma^{1/2}\mathbb{S}^{d-1}$$, the KL-divergence between a truncated normal $$\mu_u$$ and $$\nu$$ is therefore

\begin{align*} \kl(\mu_{u} \| \nu) &= \int\log\left(\frac{1}{Z}\frac{\d \rho_u}{\d\nu}(\theta)\right) {\mu}_{u}(\d\theta) \\ &= \log\left(\frac{1}{Z}\right) + \frac{1}{2}\int (\la \theta, \beta\Sigma^{-1}\theta\ra - \la \theta-u,\beta\Sigma^{-1}(\theta-u)\ra )\mu_{u}(\d\theta) \\ &= \log\left(\frac{1}{Z}\right) + \frac{\beta}{2}\int (2\la \theta,\Sigma^{-1}u\ra - \la u, \Sigma^{-1}u\ra )\mu_{u}(\d\theta) \\ &= \log\left(\frac{1}{Z}\right) + \frac{\beta\la u,\Sigma^{-1} u\ra}{2} \leq \log\left(\frac{1}{Z}\right) + \frac{\beta}{2}, \end{align*}

where the penultimate equality uses $$\E_{\mu_u}\theta = u$$ (the truncated Gaussian is symmetric about its mean) and the last step uses $$\la u,\Sigma^{-1}u\ra = \norm{u'}^2 = 1$$ for $$u = \Sigma^{1/2}u'$$ with $$u'\in\sd$$.

Here $$Z = \Pr(\norm{\theta - u}\leq r)$$ where $$\theta\sim \rho_{u}$$. Equivalently, $$Z=\Pr(\norm{Y}\leq r)$$ where $$Y$$ is a normal with mean $$0$$ and covariance $$\beta^{-1}\Sigma$$. Hence, by Markov’s inequality, $$1 - Z = \Pr(\norm{Y}>r)\leq \E\norm{Y}^2/r^2 = \beta^{-1}\Tr(\Sigma)/r^2$$, since $$\E\norm{Y}^2 = \Tr(\beta^{-1}\Sigma)$$. Thus, taking $$r = \sqrt{2 \beta^{-1}\Tr(\Sigma)}$$ yields $$Z\geq 1/2$$, and we obtain

\begin{align} \kl(\mu_u\|\nu) \leq \log(2) + \frac{\beta}{2}. \end{align}

Therefore, since the KL divergence is additive on product measures, we have that

$\kl(\mu_u\times\mu_v\|\nu\times\nu) \leq 2\log(2) + \beta.$
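To spell out the additivity: the density of a product measure factorizes, so the logarithm splits into a sum and

\begin{align*} \kl(\mu_u\times\mu_v\|\nu\times\nu) &= \int \log\left(\frac{\d\mu_u}{\d\nu}(\theta)\,\frac{\d\mu_v}{\d\nu}(\phi)\right)(\mu_u\times\mu_v)(\d\theta\,\d\phi) \\ &= \kl(\mu_u\|\nu) + \kl(\mu_v\|\nu) \leq 2\log(2) + \beta. \end{align*}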

Now it remains to construct a relevant quantity to use in the master theorem. Consider

$N(\theta,\phi) = \exp\left\{\lambda\sum_{i\leq n}\la\theta, \Sigma^{-1/2}M_i\Sigma^{-1/2}\phi\ra - n\log\E\exp(\lambda\la \theta, \Sigma^{-1/2}M\Sigma^{-1/2}\phi\ra)\right\},$

where the expectation is over $$M$$. This has expectation at most 1: it is the product of the independent terms $$\exp(\lambda\la\theta, \Sigma^{-1/2}M_i\Sigma^{-1/2}\phi\ra)/\E\exp(\lambda\la \theta, \Sigma^{-1/2}M\Sigma^{-1/2}\phi\ra)$$ for $$i\leq n$$, each of which has expectation exactly one. We apply the master theorem with the product distribution $$\mu_u\times \mu_v$$ for $$u,v\in\Sigma^{1/2}\sd$$, where $$\sd = \{x:\norm{x}=1\}$$ is the unit sphere. Therefore, $$u = \Sigma^{1/2}u'$$ and $$v = \Sigma^{1/2}v'$$ for some $$u',v'\in\sd$$. We obtain that with probability $$1-\delta$$, for all $$u,v$$,

\begin{align} &\lambda \sum_{i\leq n}\E_{\mu_u\times \mu_v} \la \theta, \Sigma^{-1/2}M_i\Sigma^{-1/2}\phi\ra \\ &\leq n \E_{\mu_u\times\mu_v} \log \E\exp(\lambda\la\theta, \Sigma^{-1/2}M\Sigma^{-1/2}\phi\ra) + \beta + \log(4/\delta).\label{eq:matrix_bound1}\tag{2} \end{align}

The truncated Gaussian is symmetric about its mean, so the left hand side above becomes

$\E_{\mu_u\times\mu_v}\la \theta, \Sigma^{-1/2}M_i\Sigma^{-1/2}\phi\ra = \la \Sigma^{-1/2}u,M_i\Sigma^{-1/2}v\ra = \la u', M_i v'\ra,$

where, as before, $$u',v'\in\sd$$. It remains to bound the right hand side of \eqref{eq:matrix_bound1}. For this we appeal to a standard exponential inequality, which bounds the MGF of a random variable in terms of its $$\psi_1$$-norm: for any random variable $$Y$$,

$\label{eq:exp_ineq} \E[\exp(\lambda (Y - \E Y))]\leq \exp(4\lambda^2\norm{Y-\E Y}_{\psi_1}^2),\quad \forall |\lambda| \leq \frac{1}{2\norm{Y-\E Y}_{\psi_1}}.\tag{3}$

Applying this with $$Y = \la \theta,\Sigma^{-1/2} M\Sigma^{-1/2}\phi\ra$$ and noting that $$\E Y = \la \theta, \phi\ra$$, we have

\begin{align} & \norm{\la \theta, \Sigma^{-1/2}M \Sigma^{-1/2} \phi\ra - \la \theta, \Sigma^{-1/2}\E M \Sigma^{-1/2} \phi\ra}_{\psi_1} \\ &\leq \norm{\la \theta, \Sigma^{-1/2}M \Sigma^{-1/2} \phi\ra}_{\psi_1} + \norm{\la \theta, \Sigma^{-1/2}\E M \Sigma^{-1/2} \phi\ra}_{\psi_1} \\ &\leq 2 \norm{\la \Sigma^{-1/2}\theta, M \Sigma^{-1/2} \phi\ra}_{\psi_1} \\ &\leq 2C \la \Sigma^{-1/2}\theta, \Sigma\Sigma^{-1/2}\phi\ra = 2C\la \theta, \phi\ra \leq C(\norm{\theta}^2 + \norm{\phi}^2), \end{align}

where the second inequality uses that the $$\psi_1$$-norm of the constant $$\E Y$$ is at most that of $$Y$$ itself, the third uses the assumption on $$M$$, and the last step is AM-GM.

Therefore, \eqref{eq:exp_ineq} yields

\begin{align} \E\exp(\lambda \la\theta,\Sigma^{-1/2}M\Sigma^{-1/2}\phi\ra) &\leq \exp(\lambda \la \theta, \phi\ra + 4C^2\lambda^2(\norm{\theta}^2 + \norm{\phi}^2)^2). \label{eq:Eexp} \tag{4} \end{align}

Note that $$\norm{\theta - u}\leq r$$ on the support of $$\mu_u$$, and $$\norm{u} = \norm{\Sigma^{1/2}u'} \leq \norm{\Sigma^{1/2}} = \sqrt{\norm{\Sigma}}$$, so

$\norm{\theta}^2 \leq (r + \norm{u})^2 \leq \left(\sqrt{2\beta^{-1}\Tr(\Sigma)} + \sqrt{\norm{\Sigma}}\right)^2.$

The same bound holds for $$\norm{\phi}^2$$. Therefore, \eqref{eq:Eexp} gives

\begin{align} &\E_{\mu_u\times\mu_v} \log \E \exp(\lambda\la\theta, \Sigma^{-1/2} M\Sigma^{-1/2}\phi\ra ) \\ &\leq \lambda \la u,v\ra + 16C^2\lambda^2\left(\sqrt{2\beta^{-1}\Tr(\Sigma)} + \sqrt{\norm{\Sigma}}\right)^4 \\ &= \lambda \la u',\Sigma v'\ra + 16C^2\lambda^2\left(\sqrt{2\beta^{-1}\Tr(\Sigma)} + \sqrt{\norm{\Sigma}}\right)^4, \end{align}

assuming that

$|\lambda| \leq \frac{1}{4C(\sqrt{2\beta^{-1}\Tr(\Sigma)} + \sqrt{\norm{\Sigma}})^2}.$
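It is worth recording what the choice $$\beta = 2\Tr(\Sigma)/\norm{\Sigma}$$ (made next) does to these quantities: it equates $$2\beta^{-1}\Tr(\Sigma)$$ with $$\norm{\Sigma}$$, so that

$\left(\sqrt{2\beta^{-1}\Tr(\Sigma)} + \sqrt{\norm{\Sigma}}\right)^4 = \left(2\sqrt{\norm{\Sigma}}\right)^4 = 16\norm{\Sigma}^2, \qquad |\lambda| \leq \frac{1}{16C\norm{\Sigma}}.$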

Choosing $$\beta = 2\Tr(\Sigma)/\norm{\Sigma}$$ and putting everything together, \eqref{eq:matrix_bound1} gives that with probability $$1-\delta,$$

$\sup_{u',v'\in\sd}\lambda\sum_{i\leq n} \la u', (M_i-\Sigma) v'\ra \lesssim nC^2\lambda^2 \norm{\Sigma}^2 + \frac{\Tr(\Sigma)}{\norm{\Sigma}} + \log(1/\delta).$

Dividing by $$n$$ and optimizing over $$\lambda$$ (sketched below) gives a final bound that matches state-of-the-art concentration bounds for random matrices.
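To sketch that last step: write $$k = \Tr(\Sigma)/\norm{\Sigma}$$ for the effective rank. Dividing the previous display by $$n\lambda$$ and recalling that $$\norm{A} = \sup_{u',v'\in\sd}\la u', Av'\ra$$, we get

$\norm{\frac{1}{n}\sum_{i\leq n}(M_i - \Sigma)} \lesssim C^2\lambda\norm{\Sigma}^2 + \frac{k + \log(1/\delta)}{n\lambda}.$

Balancing the two terms suggests $$\lambda \asymp \frac{1}{C\norm{\Sigma}}\sqrt{(k + \log(1/\delta))/n}$$; when this violates the constraint on $$|\lambda|$$ above, clipping $$\lambda$$ at the constraint instead yields the second term below. Either way, one arrives at a Bernstein-type bound of the shape

$\norm{\frac{1}{n}\sum_{i\leq n}(M_i - \Sigma)} \lesssim C\norm{\Sigma}\left(\sqrt{\frac{k + \log(1/\delta)}{n}} + \frac{k + \log(1/\delta)}{n}\right).$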
