Chapter 4: Oracle PAC-Bayes Bounds

4.1 Theory of Oracle PAC-Bayes Bounds

Oracle bounds are theoretical objects that are not suitable for practical applications. Their utility lies in their ability to highlight properties about the behavior of the bounds and they can take the form $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R\left(\hat{\mathbf{w}}\right)\leq\inf_{\mathbf{w}\in\mathcal{W}}R(\mathbf{w})+r_m(\delta)\right)\geq1-\delta.$$ Where $r_m(\delta)$ is a remainder term that tends to $0$ as $m$ tends to $\infty$. Although this bound cannot be computed in practice it is illustrative of the behavior of the bound. Just like empirical bounds, there exist oracle bounds that hold in expectation and in probability.

4.1.1 Oracle PAC-Bayes Bounds in Expectation

Theorem 4.1 (Alquier, 2023) For $\lambda>0$ we have that $$\mathbb{E}_{S\sim\mathcal{D}^m}R(\hat{\rho}_{\lambda})\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(R(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)}{\lambda}\right).$$

Proof

Theorem 4.1.1 (Fubini's Theorem) If $\mathcal{X}_1$ and $\mathcal{X}_2$ are $\sigma$-finite measurable spaces and $f:\mathcal{X}_1\times\mathcal{X}_2$ is measurable and $$\int_{\mathcal{X}_1\times\mathcal{X}_2}\vert f(x_1,x_2)\vert d(x_1,x_2)<\infty,$$ then $$\int_{\mathcal{X}_1}\left(\int_{\mathcal{X}_2}f(x_1,x_2)dx_2\right)dx_1=\int_{\mathcal{X}_2}\left(\int_{\mathcal{X}_1}f(x_1,x_2)dx_2\right)dx_1=\int_{\mathcal{X}_1\times\mathcal{X}_2}f(x_1,x_2)d(x_1,x_2).$$

Proof

For the proof of this theorem please refer to (Rodriguez, 2021).

We proceed from Corollary 3.9 to deduce that $$\begin{align*}\mathbb{E}_{S\sim\mathcal{D}^m}\left(R\left(\hat{\rho}_{\lambda}\right)\right)&\leq\mathbb{E}_{S\sim\mathcal{D}^m}\left(\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\hat{R}(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)}{\lambda}\right)\right)\\&\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)}{\lambda}\right)\right)\\&=\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\rho)\right)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)}{\lambda}\right)\\&=\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\mathbb{E}_{\mathbf{w}\sim\rho}\left(\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\mathbf{w})\right)\right)\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)}{\lambda}\right)\end{align*}$$ where Fubini's theorem has been applied in the last inequality. Recalling that $\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\mathbf{w})\right)=R(\mathbf{w})$ completes the proof of the theorem. $\square$

4.1.2 Oracle PAC-Bayes Bounds in Probability

Theorem 4.2 (Alquier, 2023) For any $\lambda>0$, and $\delta\in(0,1)$ we have that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(R(\rho)+\frac{\lambda C^2}{4m}+\frac{2\mathrm{KL}(\rho,\pi)+\log\left(\frac{2}{\delta}\right)}{\lambda}\right)\right)\geq1-\delta.$$

Proof

Recall the proof of Theorem 3.4 and the subsequent application to the Gibbs posterior that yielded Corollary 3.6. $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\hat{R}(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)}{\lambda}\right)\right)\geq1-\delta.$$ In the proof we utilized the result of Theorem 2.1. The inequality of Theorem 2.1 can be reversed by replacing the $U_i$ by $-U_i$ in its proof. Applying the reverse inequality of Theorem 2.1 in the proof of Theorem 3.4 gives the updated corollary $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\hat{R}(\rho)\leq R(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)}{\lambda}\right)\geq1-\delta.$$ Which holds for all $\rho\in\mathcal{M}(\mathcal{W})$. Applying a union bound on Corollary 3.6 and the updated result above gives $$\mathbb{P}_{S\sim\mathcal{D}^m}\begin{pmatrix}R(\hat{\rho}_{\lambda})\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\hat{R}(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)}{\lambda}\right),\\\hat{R}(\rho)\leq R(\rho)+\frac{\lambda C^2}{8m}+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)}{\lambda}\end{pmatrix}\geq1-2\delta,$$ which holds for all $\rho\in\mathcal{M}(\mathcal{W})$. Using the upper bound on $\hat{R}(\rho)$ from the second event on the first event gives $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})\leq\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\hat{R}(\rho)+\frac{\lambda C^2}{4m}+\frac{2\left(\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)\right)}{\lambda}\right)\right)\geq1-2\delta.$$ We can simply replace the $\delta$ with $\frac{\delta}{2}$ to complete the proof. $\square$

Bernstein's Assumption

Definition 4.3 (Alquier, 2023) Let $\mathbf{w}^*$ denote a minimizer of $R$ when it exists, $$R(\mathbf{w}^*)=\min_{\mathbf{w}\in\mathcal{W}}R(\mathbf{w}).$$ When $\mathbf{w}^*$ exists and there is a constant $K$ such that for any $\mathbf{w}\in\mathcal{W}$ we have that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\left(l(h_{\mathbf{w}}(x_i),y_i)-l(h_{\mathbf{w}^*}(x_i),y_i)\right)^2\right)\leq K\left(R(\mathbf{w})-R(\mathbf{w}^*)\right)$$ we say that Bernstein's assumption is satisfied with constant $K$.

Theorem 4.4 (Alquier, 2023) Assume Bernstein's assumption is satisfied with some constant $K>0$. Take $\lambda=\frac{m}{\max(2K,C)}$ then we have $$\mathbb{E}_{S\sim\mathcal{D}^m}R(\hat{\rho}_{\lambda})-R\left(\mathbf{w}^*\right)\leq2\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(R(\rho)-R\left(\mathbf{w}^*\right)+\frac{\max(2K,C)\mathrm{KL}(\rho,\pi)}{m}\right).$$

Proof

Lemma 4.4.1 Let $g$ denote the Bernstein function defined by $$g(x)=\begin{cases}1&x=0\\\frac{e^x-1-x}{x^2}&x\neq0.\end{cases}$$ Let $U_1,\dots,U_n$ be $\mathrm{i.i.d}$ random variables such that $\mathbb{E}(U_i)$ is finite and $U_i-\mathbb{E}(U_i)\leq C$ almost surely for some $C\in\mathbb{R}$. Then, $$\mathbb{E}\left(\exp\left(t\sum_{i=1}^n\left(U_i-\mathbb{E}(U_i)\right)\right)\right)\leq\exp\left(g(Ct)nt^2\mathrm{Var}(U_i)\right).$$

Proof (Habib, 1998)

We first show that function $g$ is increasing. For $x\neq0$ we have that $$g^\prime(x)=\frac{(x-2)e^x+2+x}{x^3}.$$ Let $h(x)=(x-2)e^x+2+x$ then $h(0)=0$ and $h^\prime(x)=(x-2)e^x+1$, so that $h^\prime(0)=0$ and $h^{\prime\prime}(x)=xe^x$. Therefore, $h^\prime(x)<0$ for $x<0$ and $h^\prime(x)>0$ for $x>0$ which implies that $h(x)\geq0$ for all $x$. This means that $g^{\prime}(x)>0$ and the function $g$ is increasing. So that $$e^x=1+x+x^2g(x)\leq1+x+x^2g(\alpha)$$ for $x\leq\alpha$. Therefore, if we have a random variable $X$ with $\mathbb{E}(X)=0$ and $X\leq\alpha$ it follows that $$\mathbb{E}\left(\exp(X)\right)\leq1+g(\alpha)\mathrm{Var}(X)\leq\exp(g(\alpha)\mathrm{Var}(X)).$$ Applying this conclusion to $\alpha=Ct$, $X=t(U_i-\mathbb{E}(U_i))$ we can conclude that $$\mathbb{E}\left(\exp\left(t(U_i-\mathbb{E}(U_i))\right)\right)\leq\exp\left(g(Ct)t^2\mathrm{Var}(U_i)\right)$$ Therefore, by the independence of the $U_i$ $$\begin{align*}\mathbb{E}\left(\exp\left(t\sum_{i=1}^n\left(U_i-\mathbb{E}(U_i)\right)\right)\right)&=\prod_{i=1}^n\mathbb{E}\left(\exp\left(t(U_i-\mathbb{E}(U_i))\right)\right)\\&\leq\prod_{i=1}^n\exp\left(g(Ct)t^2\mathrm{Var}(U_i)\right)\\&=\exp\left(g(Ct)nt^2\mathrm{Var}(U_i)\right)\end{align*}$$ as required. $\square$

Now fix $\mathbf{w}\in\mathcal{W}$ and apply Lemma 4.4.1 to $U_i=l_i(\mathbf{w}^*)-l_i(\mathbf{w})$ (where we inherit the notation of the proof of Theorem 2.1). Note that $\mathbb{E}(U_i)=R(\mathbf{w}^*)-R(\mathbf{w})$ and therefore, $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(tm\left(R(\mathbf{w})-R(\mathbf{w}^*)-\hat{R}(\mathbf{w})+\hat{R}(\mathbf{w})\right)\right)\right)\leq\exp\left(g(Ct)mt^2\mathrm{Var}_{S\sim\mathcal{D}^m}(U_i)\right).$$ Observe that $$\begin{align*}\mathrm{Var}(U_i)&\leq\mathbb{E}_{S\sim\mathcal{D}^m}\left(U_i^2\right)\\&=\mathbb{E}_{S\sim\mathcal{D}^m}\left(l_i(\mathbf{w}^*)-l_i(\mathbf{w})\right)\\&\leq K(R(\mathbf{w})-R(\mathbf{w}^*)).\end{align*}$$ Therefore, with $\lambda=tn$ we get that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(\lambda\left(R(\mathbf{w})-R(\mathbf{w}^*)-\hat{R}(\mathbf{w})+\hat{R}(\mathbf{w}^*)\right)\right)\right)\leq\exp\left(g\left(\frac{\lambda C}{m}\right)\frac{\lambda^2}{m}K\left(R(\mathbf{w})-R(\mathbf{w}^*)\right)\right)$$ which upon rearrangement gives $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(\lambda\left(1-Kg\left(\frac{\lambda C}{m}\right)\frac{\lambda}{m}\right)\left(R(\mathbf{w})-R(\mathbf{w}^*)\right)-\hat{R}(\mathbf{w})-\hat{R}(\mathbf{w}^*)\right)\right)\leq1.$$ Now integrate with respect to $\pi$ and apply Fubini's theorem along with Lemma 3.4.3 from the proof of Theorem 3.4 to get $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(\lambda\sup_{\rho\in\mathcal{M}(\mathcal{W})}\left(\left(1-Kg\left(\frac{\lambda C}{m}\right)\frac{\lambda}{m}\right)\left(R(\rho)-R(\mathbf{w}^*)\right)-\hat{R}(\rho)-\hat{R}(\mathbf{w}^*)-\mathrm{KL}(\rho,\pi)\right)\right)\right)\leq1.$$ In particular, this holds for $\rho-\hat{\rho}_{\lambda}$, and we can apply Jensen's inequality and re-arrange to yield $$\left(1-Kg\left(\frac{\lambda C}{m}\right)\right)\left(\mathbb{E}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})-R(\mathbf{w}^*)\right)\right)\leq\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\rho)-\hat{R}(\mathbf{w})+\frac{\mathrm{KL}(\hat{\rho}_{\lambda},\pi)}{\lambda}\right).$$ From now on $\lambda$ will be such that $1-Kg\left(\frac{\lambda C}{m}\right)\frac{\lambda}{m}>0$, thus $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})\right)-R(\mathbf{w}^*)\leq\frac{\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\hat{\rho}_{\lambda})-\hat{R}(\mathbf{w}^*)+\frac{\mathrm{KL}(\hat{\rho}_{\lambda},\pi)}{\lambda}\right)}{1-Kg\left(\frac{\lambda C}{m}\right)\frac{\lambda}{m}}.$$ As with $\lambda=\frac{m}{\max(2K,C)}$ it follows that $$Kg\left(\frac{\lambda C}{m}\right)\frac{\lambda}{m}\leq\frac{1}{2}$$ and so we have $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(R(\hat{\rho}_{\lambda})\right)-R(\mathbf{w}^*)\leq 2\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\hat{\rho}_{\lambda})-\hat{R}(\mathbf{w}^*)+\frac{\mathrm{KL}(\hat{\rho}_{\lambda},\pi)}{\lambda}\right).$$ As $\hat{\rho}_{\lambda}$ minimizes the quantity on the right hand side in expectation we can re-write this as $$\begin{align*}\mathbb{E}_{S\sim\mathcal{D}^m}(R(\hat{\rho}_{\lambda}))&\leq2\mathbb{E}_{S\sim\mathcal{D}^m}\left(\inf_{\rho\in\mathcal{M}(\mathcal{W})}\left(\hat{R}(\mathbf{w})-\hat{R}(\mathbf{w}^*)+\frac{\max(2K,C)\mathrm{KL}(\rho,\pi)}{m}\right)\right)\\&\leq2\inf_{\rho\in\mathcal{M}(\mathcal{W})}\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{R}(\mathbf{w})-\hat{R}(\mathbf{w}^*)+\frac{\max(2K,C)\mathrm{KL}(\rho,\pi)}{m}\right)\\&=2\inf_{\rho\in\mathcal{M}(\mathcal{W})}\mathbb{E}_{S\sim\mathcal{D}^m}\left(R(\mathbf{w})-R(\mathbf{w}^*)+\frac{\max(2K,C)\mathrm{KL}(\rho,\pi)}{m}\right),\end{align*}$$ which completes the proof.$\square$

4.2 Data Driven PAC-Bayes Bounds

A lot of work to obtain non-vacuous PAC-Bayes bounds is to develop priors that reduce the size of the KL divergence between the prior and the posterior. The idea behind the work of (Dziugaite, 2020) is to hold out some of the training data to obtain data-inspired priors. For this section, we use a PAC-Bayes bound that can be thought of as the Bayesian equivalent of Theorem 2.2, however, now we are dealing with potentially uncountable hypothesis sets.

Theorem 4.5 (McAllester, 2013) For $\lambda>\frac{1}{2}$ selected before drawing our training sample, then for all $\rho\in\mathcal{M}(\mathcal{W})$ and $\delta\in(0,1)$ we have that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\rho)\leq\frac{1}{1-\frac{1}{2\lambda}}\left(\hat{R}(\mathbf{\rho})+\frac{\lambda C}{m}\left(\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)\right)\right)\right)\geq1-\delta.$$

Proof

For the proof we define the following notation, $$\mathrm{kl}_{\gamma}(q,p)=\gamma q-\log\left(1-p+pe^{\gamma}\right),$$ for $p,q\in[0,1]$ and $\gamma\in\mathbb{R}$. One can show that $$\mathrm{kl}(q,p)=\sup_{\gamma}\left(\mathrm{kl}_{\gamma}(q,p)\right).$$

Lemma 4.5.1 For $\lambda>\frac{1}{2}$, if $\mathrm{kl}_{-\frac{1}{\gamma}}(q,p)\leq c$ then $$p\leq\frac{1}{1-\frac{1}{2\lambda}}(q+\lambda c).$$

Proof

Let $\gamma=-\frac{1}{\lambda}$ for convenience, which means that $\gamma\in(-2,0)$. Re-arranging the assumption we get that $$p\leq\frac{1-e^{\gamma q-c}}{1-e^{\gamma}}.$$ Using $e^{\gamma}\geq1+\gamma$ in the numerator and $e^{\gamma}\leq 1$ in the denominator we get $$p\leq\frac{q-\frac{c}{\gamma}}{1+\frac{1}{2}\gamma},$$ which when we substitute $\lambda$ back in completes the proof of the lemma. $\square$

Lemma 4.5.2 Let $x_1,\dots,x_n$ be realizations of a random variable $X$ with range $[0,1]$ and mean $\mu$. Let $\hat{\mu}=\frac{1}{n}\sum_{i=1}^nx_i$. Then for any fixed $\gamma$ we have that $$\mathbb{E}\left(\exp\left(n\mathrm{kl}_{\gamma}(\hat{\mu},\mu)\right)\right)\leq1.$$

Proof

Note that $\mathbb{E}\left(\exp(n\gamma\hat{\mu})\right)=\left(\mathbb{E}(\exp(\gamma X))\right)^n$ and that by the convexity of $\exp(\cdot)$ we have that $$e^{\gamma X}\leq 1-x+xe^{\gamma}.$$ Therefore, $$\mathbb{E}\left(\exp\left(n\gamma\hat{\mu}\right)\right)\leq\left(1-\mu+\mu e^{\gamma}\right)^n,$$ which implies that $$\mathbb{E}\left(\exp\left(n\left(\gamma\hat{\mu}-\log\left(1-\mu+\mu e^{\gamma}\right)\right)\right)\right)\leq 1$$ which completes the proof of the lemma. $\square$

Lemma 4.5.3 For probability distributions defined on the sample space $\mathcal{X}$ and a measurable function $f$ we have that $$\mathbb{E}_{x\in Q}(f(x))\leq\mathrm{KL}(Q,P)+\log\left(\mathbb{E}_{x\in P}\left(\exp(f(x))\right)\right).$$

Proof

$$\begin{align*}\mathbb{E}_{x\in Q}\left(f(x)\right)&=\mathbb{E}_{x\in Q}\left(\log\left(\exp(f(x))\right)\right)\\&=\mathbb{E}_{x\in Q}\left(\log\left(\frac{P(x)}{Q(x)}\right)e^{f(x)}+\log\left(\frac{Q(x)}{P(x)}\right)\right)\\&\leq\log\left(\mathbb{E}_{x\in Q}\left(\frac{P(x)}{Q(x)}e^{f(x)}\right)\right)+\mathrm{KL}(Q,P)\\&=\mathrm{KL}(Q,P)+\log\left(\mathbb{E}_{x\in P}\left(\exp(f(x))\right)\right).\end{align*}$$

We can use similar reasoning to that given in the proof of Theorem 3.12 to conclude from Lemma 4.5.2 that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)\right)\leq1$$ for fixed $\mathbf{w}\in\mathcal{W}$. Now we can take expectations over $\pi$ on both sides an apply Fubini's theorem to deduce that $$\begin{align*}1&\geq\mathbb{E}_{\mathbf{w}\sim\pi}\left(\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)\right)\right)\\&\geq\mathbb{E}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}\sim\pi}\left(\exp\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)\right)\right).\end{align*}$$ To which we can apply Markov's inequality to get that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}\sim\pi}\left(\exp\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)\right)\leq\frac{1}{\delta}\right)\geq1-\delta.$$ Letting $f(\mathbf{w})=m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)$ in Lemma 4.5.3 and using the above result we get that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}\sim\rho}\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)\leq\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)\right)\geq1-\delta.$$ By the convexity of $\mathrm{kl}_{\gamma}$ we get that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(m\mathrm{kl}_{\gamma}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\leq\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)\right)\geq1-\delta.$$ Therefore, by re-arranging and applying Lemma 4.5.1 the proof of the theorem is complete. $\square$.

Corollary 4.6 (Dziugaite, 2020) Let $\beta,\delta\in(0,1)$, $\mathcal{D}$ a probability distribution over $\mathcal{Z}$, and $\pi\in\mathcal{M}(\mathcal{W})$. Then for all $\rho\in\mathcal{M}(\mathcal{W})$ we have that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\rho)\leq\Psi_{\beta,\delta}(\rho,\pi;S)\right)\geq1-\delta,$$ where $\Psi_{\beta,\delta}(\rho,\pi;S)=\frac{1}{\beta}\hat{R}(\rho)+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)}{2\beta(1-\beta)m}.$

Proof

This is the result of the previous Theorem 4.5 with $\lambda=\frac{1}{2(1-\beta)}$ and $C=1$.

As we have done previously, we can consider the optimization problem of minimizing the bound of Corollary 4.6.

Theorem 4.7 (Dziugaite, 2020) Let $m\in\mathbb{N}$ and fix a probability kernel $\rho:\mathcal{Z}^m\to\mathcal{M}(\mathcal{W})$. Then for all $\beta,\delta\in(0,1)$ and distributions $\mathcal{D}$ defined on $\mathcal{Z}$ we that $\mathbb{E}_{S\sim\mathcal{D}^m}\left(\Psi_{\beta,\delta}(\rho(S),\pi;S)\right)$ is minimized, in $\pi$, by the oracle prior $\pi^*=\mathbb{E}_{S\sim\mathcal{D}^m}(\rho(S))$.

Proof

For a subset $J$ of $\{1,\dots,m\}$ of size $n$, we can use it to sample the training data and yield the subset $S_J$. We can then define the data-dependent oracle prior as $$\pi^*(S_J)=\inf_{\pi\in\mathcal{Z}^n\to\mathcal{M}(\mathcal{W})}\mathbb{E}(\mathrm{KL}(\rho(s),\pi(S_J)))$$ which turns out to be $\pi^*(S_J)=\mathbb{E}(\rho(S)\vert S_J)$. It can be shown that the data-dependent oracle prior minimizes the bound of Corollary 4.6 in expectation. Therefore, despite being a theoretical quantity, as it cannot be computed in practice, it motivates the construction of practical data-dependent priors as a method to tighten the bounds.

4.2.1 Implementing Data-Dependent Priors

To implement data-dependent priors we restrict the optimization problem to make it tractable. We only consider the set of Gaussian priors $\mathcal{F}$ that generate Gaussian posteriors. Neural networks are trained via SGD, and hence there is some randomness to the learning algorithm. Let $(\Omega,\mathcal{F},\nu)$ define a probability space and let us focus on the kernels $$\rho:\Omega\times\mathcal{Z}^m\to\mathcal{M}(\mathcal{W}),\quad\rho(U,S)=\mathcal{N}(\mathbf{w}_S,\mathbf{s}),$$ where $\mathbf{w}_S$ are the learned weights via SGD on the full dataset $S$. The random variable $U$ represents the randomness of the learning algorithm. As before we consider a non-negative integer $n\leq m$ and with $\alpha=\frac{n}{m}$ we define a subset $S_{\alpha}$ of size $n$ containing the first $n$ indices of $S$ processed by SGD. Let $\mathbb{E}^{S_{\alpha},U}[\cdot]$ denote the conditional expectation operator given $S_{\alpha}$ and $U$. Our aim now is to tighten the bound of Corollary 4.6 by minimizing $\mathbb{E}^{S_{\alpha},U}(\mathrm{KL}(\rho(U,S),\pi))$. To do this we further restrict the priors of consideration to those of the form $\mathcal{N}(\mathbf{w}_{\alpha},\sigma I)$ such that with $\sigma$ fixed we are left with the minimization problem $$\begin{equation*}\mathrm{argmin}_{\mathbf{w}_{\alpha}}\left(\mathbb{E}^{S_{\alpha},U}\left(\Vert\mathbf{w}_S-\mathbf{w}_{\alpha}\Vert\right)\right), \end{equation*}$$ which can be solved to yield $\mathbf{w}_{\alpha}=\mathbb{E}^{S_{\alpha},U}(\mathbf{w}_S)$. This minimizer is unknown in practice so we attempt to approximate it. We first define a so-called ghost sample, $S^G$, which is an independent sample equal in distribution to $S$. We combine a $1-\alpha$ fraction of $S^G$ with $S_{\alpha}$ to obtain the sample $S_{\alpha}^G$. Let $\mathbf{w}_{\alpha}^G$ be the mean of $\rho(U,S_{\alpha}^G)$. By construction, SGD will first process $S_{\alpha}$ then the combined portion of $S^G$ and hence $\mathbf{w}_{\alpha}^G$ and $\mathbf{w}_S$ are equal in distribution when conditioned on $S_{\alpha}$ and $U$. Therefore, $\mathbf{w}_{\alpha}^G$ is an unbiased estimator of $\mathbb{E}^{S_{\alpha},U}(\mathbf{w}_S)$. Before formalizing this process algorithmically we clarify some notation.

  • The SGD run on $S$ is the base run.
  • The SGD run on $S_{\alpha}$ is the $\alpha$-prefix run.
  • The SGD run on $S_{\alpha}^G$ is the $\alpha$-prefix+ghost run and obtains the parameters $\mathbf{w}_{\alpha}^G$.

The resulting parameters of the $\alpha$-prefix and $\alpha$-prefix+ghost run can be used as the centres of the Gaussian priors to give the tightened generalization bounds. However, sometimes the ghost sample is not attainable in practice, and hence one simply relies upon $\alpha$-prefix runs to obtain the mean of the prior. It is not clear whether $\alpha$-prefix+ghost run will always obtain a parameter that leads to a tighter generalization bound. Recall, that $\sigma$ is assumed to be fixed in the optimization process. Algorithm 7 is independent of this parameter and so it can be optimized afterwards without requiring a re-run of the optimization process.

Algorithm 6 Stochastic Gradient Descent

Require: Learning rate $\eta$
function SGD $(\mathbf{w}_0,S,b,t,\mathcal{E}=-\infty)$
$\mathbf{w}\leftarrow\mathbf{w}_0$
----for $i\leftarrow 1$ to $t$ do
--------Sample $S^\prime\in S$ with $\vert S^\prime\vert=b$
--------$\mathbf{w}\leftarrow\mathbf{w}-\eta\nabla l_{S^\prime}(\mathbf{w})$
--------if $l_S^{0\text{-}1}(\mathbf{w})\leq\mathcal{E}$ then
------------break
--------end if
----end for
end function

Algorithm 7 Obtaining Bound Using SGD Informed Prior

Require: Stopping criteria $\mathcal{E}$, Prefix fraction $\alpha$, Ghost Data $S^G$ (If available), Batch size $b$.
function GetBound $(\mathcal{E},\alpha,T,\sigma_P)$
----$S_{\alpha}\leftarrow{z_1,\dots,z_{\alpha\vert S\vert}\subset S}$
----$\mathbf{w}_{\alpha}^0\leftarrow$ SGD $\left(\mathbf{w}_0,S_{\alpha},b,\frac{\vert S_{\alpha}\vert}{b}\right)$
----$\mathbf{w}_S\leftarrow$ SGD $\left(\mathbf{w}_{\alpha}^0,S,b,\infty,\mathcal{E}\right)$ Base Run
----$\mathbf{w}_{\alpha}^G\leftarrow$ SGD $\left(\mathbf{w}_{\alpha}^0,S_{\alpha}^G,b,T,\cdot\right)$ Ghost run if data available, otherwise prefix run
----$\pi\leftarrow\mathcal{N}\left(\mathbf{w}_{\alpha}^G,\sigma I\right)$
----$\rho\leftarrow\mathcal{N}\left(\mathbf{w}_S,\sigma I\right)$
----Bound $\leftarrow\Psi_{\delta}^*(\rho,\pi;S\setminus S_{\alpha})$
----return Bound
end function