Chapter 5: Extension of PAC-Bayes Bounds
5.1 Disintegrated PAC-Bayes Bounds
The majority of the PAC-Bayes bounds we have discussed so far have been derived to hold uniformly over all posterior distributions. The intention of disintegrated PAC-Bayes bounds is to refine these results by requiring them to hold only for the posterior produced by the learning algorithm, and for a single hypothesis drawn from it. We now study the work of (Viallard, 2021), which sets out a general framework for deriving such bounds. The setup is the same as the one we have considered so far, with the added assumption that $C=1$ and the additional consideration of a deterministic learning algorithm $A:\mathcal{Z}^m\to\mathcal{M}(\mathcal{W})$ that is applied to the training sample $S$.
Definition 5.1 (Viallard, 2021) Let $P$ and $Q$ be two distributions defined on the same sample space $\mathcal{X}$. Then for any $\alpha>1$ their Rényi divergence is defined to be $$D_{\alpha}(Q,P)=\frac{1}{\alpha-1}\log\left(\mathbb{E}_{x\sim P}\left(\frac{Q(x)}{P(x)}\right)^{\alpha}\right).$$
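To make the definition concrete, the following minimal sketch (assuming NumPy and SciPy are available) compares a Monte Carlo estimate of $D_{\alpha}(Q,P)$ for two one-dimensional Gaussians with a shared variance against the known closed form $\frac{\alpha(\mu_Q-\mu_P)^2}{2\sigma^2}$; with $\alpha=2$ this closed form is exactly the identity used later in the proof of Corollary 5.3. All numerical values below are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of D_alpha(Q, P) = 1/(alpha-1) * log E_{x~P}[(Q(x)/P(x))^alpha]
# for Q = N(mu_q, sigma^2) and P = N(mu_p, sigma^2), compared to the closed form
# alpha * (mu_q - mu_p)^2 / (2 * sigma^2).
rng = np.random.default_rng(0)
alpha, sigma = 2.0, 1.0
mu_p, mu_q = 0.0, 0.5

x = rng.normal(mu_p, sigma, size=1_000_000)            # samples x ~ P
ratio = norm.pdf(x, mu_q, sigma) / norm.pdf(x, mu_p, sigma)
d_mc = np.log(np.mean(ratio ** alpha)) / (alpha - 1)   # Monte Carlo estimate
d_closed = alpha * (mu_q - mu_p) ** 2 / (2 * sigma ** 2)
print(f"Monte Carlo: {d_mc:.4f}, closed form: {d_closed:.4f}")
```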
Theorem 5.2 (Viallard, 2021) For any distribution $\mathcal{D}$ on $\mathcal{Z}$, for any parameter space $\mathcal{W}$, for any prior distribution $\pi$ on $\mathcal{W}$, for any $\phi:\mathcal{W}\times\mathcal{Z}^m\to\mathbb{R}^+$, for any $\alpha>1$, for any $\delta\in(0,1)$ and for any deterministic learning algorithm $A:\mathcal{Z}^m\to\mathcal{M}(\mathcal{W})$ the following holds $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\frac{\alpha}{\alpha-1}\log\left(\phi(\mathbf{w},S)\right)\leq\frac{2\alpha-1}{\alpha-1}\log\left(\frac{2}{\delta}\right)+D_{\alpha}(\rho_{S},\pi)+\log\left(\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S^\prime)^{\frac{\alpha}{\alpha-1}}\right)\right)\geq1-\delta,$$ where $\rho_S:=A(S)$.
Proof
First note that $\phi(\mathbf{w},S)$ is a non-negative random variable. Therefore, by Markov's inequality $$\mathbb{P}_{\mathbf{w}\sim\rho_S}\left(\phi(\mathbf{w},S)\leq\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\geq1-\frac{\delta}{2},$$ which is equivalent to $$\mathbb{E}_{\mathbf{w}\sim\rho_S}\left(\mathbb{I}\left(\phi(\mathbf{w},S)\leq\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\right)\geq1-\frac{\delta}{2}.$$ Taking expectations over $S\sim\mathcal{D}^m$ of both we obtain the equivalent statements $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\phi(\mathbf{w},S)\leq\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\geq1-\frac{\delta}{2},$$ and $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}\sim\rho_S}\left(\mathbb{I}\left(\phi(\mathbf{w},S)\leq\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\right)\right)\geq1-\frac{\delta}{2}.$$ Taking the $\log$ of the inequality inside the first of these and then multiplying by $\frac{\alpha}{\alpha-1}$ gives $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\frac{\alpha}{\alpha-1}\log\left(\phi(\mathbf{w},S)\right)\leq\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\right)\geq1-\frac{\delta}{2}.$$ Focusing on the right hand side, a change of measure from $\rho_S$ to $\pi$ gives $$\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\rho_S}\left(\phi\left(\mathbf{w}^\prime,S\right)\right)\right)=\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\left(\frac{\rho_S(\mathbf{w}^\prime)}{\pi(\mathbf{w}^\prime)}\phi(\mathbf{w}^\prime,S)\right)\right)$$ for all $\pi\in\mathcal{M}(\mathcal{W})$ with respect to which $\rho_S$ is absolutely continuous.
As $\frac{1}{\alpha}+\frac{\alpha-1}{\alpha}=1$ we can apply Hölder's inequality with conjugate exponents $\alpha$ and $\frac{\alpha}{\alpha-1}$ to get that $$\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\left(\frac{\rho_S(\mathbf{w}^\prime)}{\pi(\mathbf{w}^\prime)}\phi\left(\mathbf{w}^\prime,S\right)\right)\leq\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\left(\frac{\rho_S(\mathbf{w}^\prime)}{\pi(\mathbf{w}^\prime)}\right)^{\alpha}\right)^{\frac{1}{\alpha}}\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\left(\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\right)\right)^{\frac{\alpha-1}{\alpha}}.$$ Therefore, $$\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\left(\frac{\rho_S(\mathbf{w}^\prime)}{\pi(\mathbf{w}^\prime)}\phi\left(\mathbf{w}^\prime,S\right)\right)\right)\leq D_{\alpha}(\rho_S,\pi)+\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\right)+\log\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\right).$$ From which we deduce that $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\frac{\alpha}{\alpha-1}\log\left(\phi(\mathbf{w},S)\right)\leq D_{\alpha}(\rho_S,\pi)+\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\right)+\log\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\right)\right)\geq1-\frac{\delta}{2}.\quad(\star)$$ As $\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}$ is also a non-negative random variable we can apply Markov's inequality again to get $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\leq\frac{2}{\delta}\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S^\prime)^{\frac{\alpha}{\alpha-1}}\right)\geq1-\frac{\delta}{2}.$$ As the event on the left hand side does not depend on $\mathbf{w}\sim\rho_S$ we have that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\leq\frac{2}{\delta}\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S^\prime)^{\frac{\alpha}{\alpha-1}}\right)=\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\leq\frac{2}{\delta}\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S^\prime)^{\frac{\alpha}{\alpha-1}}\right).$$ Therefore, $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\frac{\alpha}{\alpha-1}\log\left(\frac{2}{\delta}\right)+\log\left(\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S)^{\frac{\alpha}{\alpha-1}}\right)\leq\frac{2\alpha-1}{\alpha-1}\log\left(\frac{2}{\delta}\right)+\log\left(\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi}\phi(\mathbf{w}^\prime,S^\prime)^{\frac{\alpha}{\alpha-1}}\right)\right)\geq1-\frac{\delta}{2}.$$ Combining with $(\star)$ using a union bound completes the proof. $\square$
5.1.1 Application to Neural Network Classifiers
We can specialize this bound to over-parameterized neural networks. Suppose that $\mathbf{w}\in\mathbb{R}^d$ is the weight vector of a neural network, with $d\gg m$. Assume that the network is trained for $T$ epochs and that these epochs are used to generate $T$ priors $\mathbf{P}=\{\pi_t\}_{t=1}^T$. Let the priors be of the form $\pi_t=\mathcal{N}\left(\mathbf{w}_t,\sigma^2\mathbf{I}_d\right)$, where $\mathbf{w}_t$ is the weight vector obtained after the $t^\text{th}$ epoch. We assume that the priors are obtained by applying the learning algorithm to a sample $S_{\mathrm{prior}}$ with $S_{\mathrm{prior}}\cap S=\emptyset$.
Corollary 5.3 (Viallard, 2021) For any distribution $\mathcal{D}$ on $\mathcal{Z}$, for any set $\mathcal{W}$, for any set $\mathbf{P}$ of $T$ priors on $\mathcal{W}$, for any learning algorithm $A:\mathcal{Z}^m\to\mathcal{M}(\mathcal{W})$, for any loss $l:\mathcal{W}\times\mathcal{Z}\to[0,1]$ and for any $\delta\in(0,1)$, we have for every $\pi_t\in\mathbf{P}$ that $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_{S}}\left(\mathrm{kl}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\leq\frac{1}{m}\left(\frac{\Vert\mathbf{w}-\mathbf{w}_t\Vert_2^2}{\sigma^2}+\log\left(\frac{16T\sqrt{m}}{\delta^3}\right)\right)\right)\geq1-\delta.$$
Proof
We apply the argument of Theorem 5.2 with $\phi(\mathbf{w},S)=\exp\left(\frac{\alpha-1}{\alpha}m\,\mathrm{kl}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\right)$ and $\alpha=2$, carrying out the second application of Markov's inequality with confidence $\frac{\delta}{2T}$ for each prior and taking a union bound over the $T$ priors, to deduce that for all $\pi_t\in\mathbf{P}$ we have $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(\mathrm{kl}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\leq\frac{1}{m}\left(D_2(\rho_S,\pi_t)+\log\left(\frac{8T}{\delta^3}\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi_t}\left(\exp\left(m\,\mathrm{kl}\left(\hat{R}(\mathbf{w}^\prime),R(\mathbf{w}^\prime)\right)\right)\right)\right)\right)\right)\geq1-\delta.$$ Note that the empirical risk in the exponential is computed on the sample $S^\prime$, whereas the empirical risk on the left hand side of the inequality is computed on $S$. Recall the upper bound we determined in the proof of Theorem 3.12, $$\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}^\prime\sim\pi_t}\left(\exp\left(m\,\mathrm{kl}\left(\hat{R}(\mathbf{w}^\prime),R(\mathbf{w}^\prime)\right)\right)\right)\leq2\sqrt{m}.$$ Furthermore, it is known that for $\rho_S=\mathcal{N}(\mathbf{w},\sigma^2I_d)$ and $\pi_t=\mathcal{N}(\mathbf{w}_t,\sigma^2I_d)$ we have $$D_2(\rho_S,\pi_t)=\frac{\Vert\mathbf{w}-\mathbf{w}_t\Vert_2^2}{\sigma^2}.$$ Substituting these two bounds into the display above completes the proof of the corollary. $\square$
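In practice the corollary is used by computing the right-hand side and then inverting the binary $\mathrm{kl}$ numerically. Below is a minimal sketch of this computation (assuming NumPy is available); the values of $m$, $T$, $\sigma$, $\Vert\mathbf{w}-\mathbf{w}_t\Vert_2^2$ and the empirical risk are entirely hypothetical.

```python
import numpy as np

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q, p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q_hat, bound, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat, p) <= bound, found by bisection."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers, purely for illustration.
m, T, sigma, delta = 10_000, 20, 0.05, 0.05
sq_dist = 1.5        # ||w - w_t||_2^2 for the chosen prior
emp_risk = 0.02      # empirical risk of the sampled weights

rhs = (sq_dist / sigma**2 + np.log(16 * T * np.sqrt(m) / delta**3)) / m
print("bound on the true risk:", kl_inverse(emp_risk, rhs))
```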
Corollary 5.4 (Viallard, 2021) Under the assumptions of Corollary 5.3 with $\delta\in(0,1)$ and for all $\pi_t\in\mathbf{P}$ we have that $$\begin{align*}\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}&\left(\mathrm{kl}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\leq\frac{1}{m}\left(\frac{\Vert\mathbf{w}+\mathbf{\epsilon}-\mathbf{w}_t\Vert_2^2-\Vert\mathbf{\epsilon}\Vert_2^2}{2\sigma^2}+\log\left(\frac{2T\sqrt{m}}{\delta}\right)\right)\right)\geq1-\delta,\\\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}&\left(\mathrm{kl}\left(\hat{R}(\mathbf{w}),R(\mathbf{w})\right)\leq\frac{1}{m}\left(\frac{m+1}{m}\frac{\Vert\mathbf{w}+\mathbf{\epsilon}-\mathbf{w}_t\Vert_2^2-\Vert\mathbf{\epsilon}\Vert_2^2}{2\sigma^2}+\log\left(\frac{T(m+1)}{\delta}\right)\right)\right)\geq1-\delta,\end{align*}$$ and, for all $c\in\mathbf{C}$, $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(R(\mathbf{w})\leq\frac{1-\exp\left(-c\hat{R}(\mathbf{w})-\frac{1}{m}\left(\frac{\Vert\mathbf{w}+\mathbf{\epsilon}-\mathbf{w}_t\Vert_2^2-\Vert\mathbf{\epsilon}\Vert_2^2}{2\sigma^2}+\log\left(\frac{T\vert\mathbf{C}\vert}{\delta}\right)\right)\right)}{1-\exp(-c)}\right)\geq1-\delta,$$ where $\mathbf{\epsilon}\sim\mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}_d\right)$ is Gaussian noise such that $\mathbf{w}+\mathbf{\epsilon}$ acts as the weights sampled from $\mathcal{N}(\mathbf{w},\sigma^2\mathbf{I}_d)$, and $\mathbf{C}$ is a finite set of hyper-parameters fixed a priori.
Proof
The proofs of these individual statements follow the same structure. We will only prove the last of them, with the aid of Theorem 5.4.3 proven below. The first two can be proven in a similar way using Theorem 1 $(i)$ from (Rivasplata, 2020) and Proposition 3.1 from (Blanchard, 2007), respectively.
Lemma 5.4.1 For $\rho_S=\mathcal{N}\left(\mathbf{w},\sigma^2I_d\right)$ and $\pi=\mathcal{N}\left(\mathbf{v},\sigma^2I_d\right)$, we have that $$\log\left(\frac{\rho_S(\mathbf{w+\epsilon})}{\pi(\mathbf{w+\epsilon})}\right)=\frac{1}{2\sigma^2}\left(\Vert\mathbf{w+\epsilon-v}\Vert_2^2-\Vert\mathbf{\epsilon}\Vert_2^2\right),$$ where $\mathbf{\epsilon}\sim\mathcal{N}\left(\mathbf{0},\sigma^2I_d\right)$ such that $\mathbf{w+\epsilon}$ acts as the weights sampled from $\mathcal{N}(\mathbf{w},\sigma^2\mathbf{I}_d)$.
Proof
This follows by direct computation after recalling that $$\rho_S(\mathbf{w+\epsilon})=\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^d\exp\left(-\frac{1}{2\sigma^2}\Vert\mathbf{\epsilon}\Vert_2^2\right),\text{ and }\pi(\mathbf{w+\epsilon})=\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^d\exp\left(-\frac{1}{2\sigma^2}\Vert\mathbf{w+\epsilon-v}\Vert_2^2\right),$$ since the normalizing constants cancel in the ratio. This completes the proof of the lemma. $\square$
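The identity is easy to verify numerically. A minimal sketch (assuming NumPy and SciPy are available; the dimension, variance and vectors below are arbitrary toy values):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Numerical sanity check of Lemma 5.4.1 on randomly drawn vectors.
rng = np.random.default_rng(1)
d, sigma = 5, 0.3
w, v = rng.normal(size=d), rng.normal(size=d)
eps = rng.normal(scale=sigma, size=d)

rho = multivariate_normal(mean=w, cov=sigma**2 * np.eye(d))   # rho_S = N(w, sigma^2 I)
pi = multivariate_normal(mean=v, cov=sigma**2 * np.eye(d))    # pi    = N(v, sigma^2 I)

lhs = rho.logpdf(w + eps) - pi.logpdf(w + eps)
rhs = (np.sum((w + eps - v) ** 2) - np.sum(eps ** 2)) / (2 * sigma**2)
print(np.isclose(lhs, rhs))  # expected: True
```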
Lemma 5.4.2 (Catoni, 2007) For any positive $\lambda$ and $\mathbf{w}\in\mathcal{W}$, we have that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\exp\left(\lambda\left(\Phi_{\frac{\lambda}{m}}\left(R(\mathbf{w})\right)-\hat{R}(\mathbf{w})\right)\right)\right)\leq1.$$
Proof
Define the Bernoulli random variables $\sigma_i(\mathbf{w})=\mathbb{I}\left(h_{\mathbf{w}}(x_i)\neq y_i\right)$. Using independence, the concavity of $\log$ and $\lambda>0$ we deduce that $$\begin{align*}\log\left(\mathbb{E}\left(\exp\left(-\lambda\hat{R}(\mathbf{w})\right)\right)\right)&=\sum_{i=1}^m\log\left(\mathbb{E}\left(\exp\left(-\frac{\lambda}{m}\sigma_i\right)\right)\right)\\&\leq m\log\left(\frac{1}{m}\sum_{i=1}^m\mathbb{E}\left(\exp\left(-\frac{\lambda}{m}\sigma_i\right)\right)\right).\end{align*}$$ Now note that $$R(\mathbf{w})=\frac{1}{m}\sum_{i=1}^m\mathbb{E}(\sigma_i)$$ and, because the $\sigma_i$ are Bernoulli random variables, we have that $$\frac{1}{m}\sum_{i=1}^m\mathbb{E}\left(\exp\left(-\frac{\lambda}{m}\sigma_i\right)\right)=\frac{1}{m}\sum_{i=1}^m\left((1-\mathbb{E}(\sigma_i))+\exp\left(-\frac{\lambda}{m}\right)\mathbb{E}(\sigma_i)\right)=\frac{1}{m}\sum_{i=1}^m\left(\mathbb{E}(\sigma_i)\left(\exp\left(-\frac{\lambda}{m}\right)-1\right)+1\right).$$ Therefore, $$\begin{align*}\Phi_{\frac{\lambda}{m}}(R(\mathbf{w}))&=\frac{1}{-\lambda}m\log\left(1-\left(1-\exp\left(-\frac{\lambda}{m}\right)\right)\frac{1}{m}\sum_{i=1}^m\mathbb{E}(\sigma_i)\right)\\&=\frac{1}{-\lambda}m\log\left(\frac{1}{m}\sum_{i=1}^m\left(\mathbb{E}(\sigma_i)\left(\exp\left(-\frac{\lambda}{m}\right)-1\right)+1\right)\right)\\&=\frac{1}{-\lambda}m\log\left(\frac{1}{m}\sum_{i=1}^m\mathbb{E}\left(\exp\left(-\frac{\lambda}{m}\sigma_i\right)\right)\right).\end{align*}$$ From which we conclude that $$\log\left(\mathbb{E}\left(\exp\left(-\lambda\hat{R}(\mathbf{w})\right)\right)\right)\leq-\lambda\Phi_{\frac{\lambda}{m}}(R(\mathbf{w})).$$ Exponentiating and re-arranging the terms completes the proof of the lemma. $\square$
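When the $\sigma_i$ are identically distributed the concavity step is tight, so the lemma holds with equality and is easy to check by Monte Carlo. A minimal sketch (assuming NumPy; $m$, $\lambda$ and the true risk $p$ are hypothetical), using $\Phi_{\beta}(p)=-\frac{1}{\beta}\log\left(1-(1-e^{-\beta})p\right)$ as in the proof:

```python
import numpy as np

# Monte Carlo check of Lemma 5.4.2 for i.i.d. Bernoulli losses, using
# Phi_beta(p) = -(1/beta) * log(1 - (1 - exp(-beta)) * p) as in the proof.
rng = np.random.default_rng(2)
m, lam, p = 50, 30.0, 0.3          # hypothetical sample size, lambda and true risk
beta = lam / m
Phi = -np.log(1 - (1 - np.exp(-beta)) * p) / beta

emp_risks = rng.binomial(1, p, size=(200_000, m)).mean(axis=1)  # empirical risks over many samples S
print(np.mean(np.exp(lam * (Phi - emp_risks))))  # close to 1, i.e. at most 1 up to Monte Carlo error
```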
Theorem 5.4.3 (Catoni, 2007) For any positive $\lambda$, any prior $\pi\in\mathcal{M}(\mathcal{W})$, any posterior distribution $\rho\in\mathcal{M}(\mathcal{W})$ and any $\delta\in(0,1)$, we have that $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho}\left(R(\mathbf{w})\leq\Phi^{-1}_{\frac{\lambda}{m}}\left(\hat{R}(\mathbf{w})+\frac{1}{\lambda}\log\left(\frac{1}{\delta}\frac{d\rho}{d\pi}(\mathbf{w})\right)\right)\right)\geq1-\delta.$$
Proof
To prove this we start from Lemma 5.4.2 and take the expectation over $\mathbf{w}\sim\pi$ to get that $$\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}\sim\pi}\left(\exp\left(\lambda\left(\Phi_{\frac{\lambda}{m}}\left(R(\mathbf{w})\right)-\hat{R}(\mathbf{w})\right)\right)\right)\leq1.$$ By a change of measure, for any posterior $\rho$ that is absolutely continuous with respect to $\pi$ this can be written as $$\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}\sim\rho}\left(\exp\left(\lambda\left(\Phi_{\frac{\lambda}{m}}\left(R(\mathbf{w})\right)-\hat{R}(\mathbf{w})\right)-\log\left(\frac{d\rho}{d\pi}(\mathbf{w})\right)+\log(\delta)\right)\right)\leq\delta.$$ From this we can deduce, using Markov's inequality, that $$\begin{align*}\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho}\left(R(\mathbf{w})\geq\Phi^{-1}_{\frac{\lambda}{m}}\left(\hat{R}(\mathbf{w})+\frac{1}{\lambda}\log\left(\frac{1}{\delta}\frac{d\rho}{d\pi}(\mathbf{w})\right)\right)\right)&=\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho}\left(\exp\left(\lambda\left(\Phi_{\frac{\lambda}{m}}\left(R(\mathbf{w})\right)-\hat{R}(\mathbf{w})\right)-\log\left(\frac{d\rho}{d\pi}(\mathbf{w})\right)+\log(\delta)\right)\geq e^0\right)\\&\leq\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\mathbf{w}\sim\rho}\left(\exp\left(\lambda\left(\Phi_{\frac{\lambda}{m}}\left(R(\mathbf{w})\right)-\hat{R}(\mathbf{w})\right)-\log\left(\frac{d\rho}{d\pi}(\mathbf{w})\right)+\log(\delta)\right)\right)\\&\leq\delta,\end{align*}$$ from which taking complements completes the proof of the theorem. $\square$
Apply Theorem 5.4.3 with $\lambda=cm$, $T\vert\mathbf{C}\vert$ times, each time with confidence $\frac{\delta}{T\vert\mathbf{C}\vert}$. For each prior $\pi_t\in\mathbf{P}$ and hyper-parameter $c\in\mathbf{C}$ we then have that $$\mathbb{P}_{S\sim\mathcal{D}^m,\mathbf{w}\sim\rho_S}\left(R(\mathbf{w})\leq\frac{1}{1-e^{-c}}\left(1-\exp\left(-c\hat{R}(\mathbf{w})-\frac{1}{m}\left(\log\left(\frac{\rho_S(\mathbf{w})}{\pi_t(\mathbf{w})}\right)+\log\left(\frac{T\vert\mathbf{C}\vert}{\delta}\right)\right)\right)\right)\right)\geq1-\frac{\delta}{T\vert\mathbf{C}\vert}.$$ Applying a union bound argument, and evaluating the log density ratio with Lemma 5.4.1 (writing the sampled weights as $\mathbf{w}+\mathbf{\epsilon}$), the third statement of the corollary follows, which completes the proof. $\square$
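The third bound of Corollary 5.4 is then evaluated by minimizing over the pre-specified grid $\mathbf{C}$. A minimal sketch (assuming NumPy; every numerical value below is hypothetical):

```python
import numpy as np

# Evaluate the Catoni-style bound of Corollary 5.4 over a grid C of the
# hyper-parameter c, for one fixed prior pi_t.
m, T, sigma, delta = 10_000, 20, 0.05, 0.05
emp_risk = 0.02                  # empirical risk of the sampled weights
sq_term = 1.2                    # ||w + eps - w_t||_2^2 - ||eps||_2^2

C = np.array([0.5, 1.0, 2.0, 4.0])
complexity = (sq_term / (2 * sigma**2) + np.log(T * len(C) / delta)) / m
bounds = (1 - np.exp(-C * emp_risk - complexity)) / (1 - np.exp(-C))
print("best bound:", bounds.min(), "attained at c =", C[np.argmin(bounds)])
```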
5.2 PAC-Bayes Compression Bounds
We will now see how compression ideas can be leveraged to tighten PAC-Bayes bounds. The work of (Zhou, 2019) evaluates generalization bounds by first measuring the effective compressed size of a neural network and then substituting this into the bounds. We have seen that compression techniques can efficiently reduce the effective size of a network, and so accounting for this can lead to tighter bounds. This also captures the intuition that we expect a model to overfit if it is more difficult to compress. Therefore, these updated bounds also incorporate a notion of model complexity. The work of (Zhou, 2019) utilizes a refined version of Theorem 3.11.
Theorem 5.5 (Catoni, 2007) Let $L$ be a $0$-$1$ valued loss function. Let $\pi$ be a probability measure on the parameter space, and let $\alpha>1$ and $\delta\in(0,1)$. Then, for any posterior distribution $\rho\in\mathcal{M}(\mathcal{W})$, $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\rho)\leq\inf_{\lambda>1}\Phi^{-1}_{\lambda/m}\left(\hat{R}(\rho)+\frac{\alpha}{\lambda}\left(\mathrm{KL}(\rho,\pi)-\log(\delta)+2\log\left(\frac{\log\left(\alpha^2\lambda\right)}{\log(\alpha)}\right)\right)\right)\right)\geq1-\delta.$$
Proof
We start from Theorem 3.11 and optimize the bound with respect to $\lambda$. Let us introduce the parameter $\alpha>1$ and let $\Lambda=\left\{\alpha^k:k\in\mathbb{N}\right\}$, on which we define the probability measure $\nu\left(\alpha^k\right)=\frac{1}{(k+1)(k+2)}$. Now for each $k\in\mathbb{N}$ apply Theorem 3.11 with $\lambda=\alpha^k$ and confidence $1-\frac{\delta}{(k+1)(k+2)}$. Applying a union bound argument, and using that $(k+1)(k+2)\leq(k+2)^2$ with $k+2=\frac{\log\left(\alpha^2\lambda^\prime\right)}{\log(\alpha)}$ for $\lambda^\prime=\alpha^k$, we conclude that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\rho)\leq\inf_{\lambda^\prime\in\Lambda}\Phi^{-1}_{\frac{\lambda^\prime}{m}}\left(\hat{R}(\rho)+\frac{\mathrm{KL}(\rho,\pi)+\log\left(\frac{1}{\delta}\right)+2\log\left(\frac{\log\left(\alpha^2\lambda^\prime\right)}{\log(\alpha)}\right)}{\lambda^\prime}\right)\right)\geq1-\delta.$$ We may restrict attention to $\lambda\in(1,\infty)$ (as for $\lambda\leq1$ the bound exceeds $1$), and so there is a $\lambda^\prime\in\Lambda$ such that $$\frac{\lambda}{\alpha}\leq\lambda^\prime\leq\lambda.$$ In particular $\frac{1}{\lambda^\prime}\leq\frac{\alpha}{\lambda}$ and $\log\left(\alpha^2\lambda^\prime\right)\leq\log\left(\alpha^2\lambda\right)$. Moreover, for any $y\in(0,1)$ the map $\beta\mapsto\Phi^{-1}_{\beta}(y)$ is increasing on $\mathbb{R}_+$, so replacing $\lambda^\prime$ by the larger $\lambda$ in the subscript can only increase the right hand side. Therefore, $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(R(\rho)\leq\inf_{\lambda\in(1,\infty)}\Phi^{-1}_{\frac{\lambda}{m}}\left(\hat{R}(\rho)+\frac{\alpha}{\lambda}\left(\mathrm{KL}(\rho,\pi)-\log\left(\delta\right)+2\log\left(\frac{\log\left(\alpha^2\lambda\right)}{\log(\alpha)}\right)\right)\right)\right)\geq1-\delta,$$ which completes the proof. $\square$
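In applications the infimum is approximated by a grid search over $\lambda$. A minimal sketch (assuming NumPy, and using $\Phi^{-1}_{\beta}(y)=\frac{1-e^{-\beta y}}{1-e^{-\beta}}$, the inverse of the $\Phi_{\beta}$ appearing in the proof of Lemma 5.4.2; the values of $m$, the KL term, the empirical risk, $\alpha$ and $\delta$ are hypothetical):

```python
import numpy as np

# Evaluate the right-hand side of Theorem 5.5 on a grid of lambda > 1.
def phi_inv(beta, y):
    return (1 - np.exp(-beta * y)) / (1 - np.exp(-beta))

m, kl_term, emp_risk, alpha, delta = 20_000, 800.0, 0.05, 1.1, 0.05

lambdas = np.logspace(0.1, 6, 2_000)          # grid from ~1.26 up to 1e6
penalty = (alpha / lambdas) * (
    kl_term - np.log(delta) + 2 * np.log(np.log(alpha**2 * lambdas) / np.log(alpha))
)
bounds = phi_inv(lambdas / m, emp_risk + penalty)
print("best risk bound over the grid:", bounds.min())
```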
The intention now is to motivate the choice of $\pi$ using ideas of compressibility such that $\mathrm{KL}(\rho,\pi)$ is kept small. To do this we will choose a prior $\pi$ that assigns greater probability mass to models with a shorter code length.
Theorem 5.6 (Zhou, 2019) Let $\vert\mathbf{w}\vert_c$ denote the number of bits required to represent hypothesis $h_{\mathbf{w}}$ using some pre-specified coding $c$. Let $\rho$ denote the point mass distribution at $\hat{\mathbf{w}}$ which is the compression of $\mathbf{w}$ and corresponds to the compressed model $h_{\hat{\mathbf{w}}}$. Let $M$ denote any probability measure on the positive integers. Then there exists a prior $\pi_c$ such that $$\mathrm{KL}(\rho,\pi_c)\leq\left\vert\hat{\mathbf{w}}\right\vert_c\log(2)-\log\left(M\left(\left\vert\hat{\mathbf{w}}\right\vert_c\right)\right).$$
Proof
Let $\mathcal{W}_c\subseteq\mathcal{W}$ be the set of compressed weights. Then let $\pi_c$ be a distribution on $\mathcal{W}_c$ defined by $$\pi_c(\mathbf{w})=\frac{1}{Z}M(\vert\mathbf{w}\vert_c)\cdot2^{-\vert\mathbf{w}\vert_c},\text{ where }Z=\sum_{\mathbf{w}\in\mathcal{W}_c}M(\vert\mathbf{w}\vert_c)\cdot 2^{-\vert\mathbf{w}\vert_c}.$$ As $c$ is injective on $\mathcal{W}_c$, at most $2^k$ elements of $\mathcal{W}_c$ have code length $k$, so $Z\leq\sum_{k}M(k)=1$. Therefore, $$\begin{align*}\mathrm{KL}(\rho,\pi_c)=\log\left(\frac{\rho\left(\hat{\mathbf{w}}\right)}{\pi_c\left(\hat{\mathbf{w}}\right)}\right)\rho\left(\hat{\mathbf{w}}\right)&=-\log\left(\pi_c\left(\hat{\mathbf{w}}\right)\right)\\&=\log(Z)+\left\vert\hat{\mathbf{w}}\right\vert_c\log(2)-\log\left(M\left(\left\vert\hat{\mathbf{w}}\right\vert_c\right)\right)\\&\leq\left\vert\hat{\mathbf{w}}\right\vert_c\log(2)-\log\left(M\left(\left\vert\hat{\mathbf{w}}\right\vert_c\right)\right).\end{align*}$$ This completes the proof of the theorem. $\square$
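For a point-mass posterior the resulting KL term is a simple function of the compressed size. A minimal sketch (assuming NumPy, and using the hypothetical choice $M(k)=\frac{6}{\pi^2k^2}$ as the distribution on code lengths):

```python
import numpy as np

# KL upper bound from Theorem 5.6 for a point-mass posterior on the compressed
# weights, with the (hypothetical) choice M(k) = 6 / (pi^2 * k^2).
def kl_upper_bound(code_length_bits):
    M = 6 / (np.pi**2 * code_length_bits**2)
    return code_length_bits * np.log(2) - np.log(M)

print(kl_upper_bound(20_000))   # e.g. a network compressed to 20 kilobits
```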
Remark 5.7 An example of a coding scheme $c$ is the Huffman encoding. However, such a coding scheme is agnostic to any structure of the hypothesis class that is reflected in the space $\mathcal{W}$. By exploiting structure in the hypothesis class the bound can be improved substantially.
We now formalize compression schemes to allow us to refine Theorem 5.6. Denote a compression procedure by a triple $(S,C,Q)$ where
- $S=\{s_1,\dots,s_k\}\subseteq\{1,\dots,d\}$ is the set of locations of the non-zero weights,
- $C=\{c_1,\dots,c_r\}\subseteq\mathbb{R}$ is a codebook, and
- $Q=(q_1,\dots,q_k)$, with $q_i\in\{1,\dots,r\}$, are the quantized values.
Define the corresponding weights $\mathbf{w}(S,Q,C)\in\mathbb{R}^d$ as $$w_i(S,Q,C)=\begin{cases}c_{q_j}&i=s_j\\0&\text{otherwise}.\end{cases}$$ Training a neural network is a stochastic process due to the randomness of SGD. So, to analyse the generalization error, we try to capture this randomness in the analysis by applying Gaussian noise to the weights. For this we use the posterior $\rho=\mathcal{N}\left(\mathbf{w},\sigma^2J\right)$, with $J$ a diagonal matrix.
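The decoding map from a triple to a weight vector is straightforward; the following minimal sketch implements it, with toy values for $d$, $S$, $C$ and $Q$, and zero-based indices rather than the one-based indices of the text.

```python
import numpy as np

# Decode a weight vector w(S, Q, C) from a compression triple, following the
# definition above. Indices here are zero-based, unlike in the text.
def decode_weights(d, S, C, Q):
    w = np.zeros(d)
    for s_j, q_j in zip(S, Q):
        w[s_j] = C[q_j]
    return w

d = 10
S = [1, 4, 7]                      # locations of the non-zero weights
C = [-0.5, 0.25, 1.0]              # codebook
Q = [2, 0, 1]                      # quantized values (indices into C)
print(decode_weights(d, S, C, Q))  # [0, 1, 0, 0, -0.5, 0, 0, 0.25, 0, 0]
```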
Theorem 5.8 (Zhou, 2019) Let $(S,C,Q)$ be the output of a compression scheme, and let $\rho_{S,C,Q}$ be the stochastic estimator given by the weights decoded from the triplet and variance $\sigma^2$. Let $c$ denote an arbitrary fixed coding scheme and let $M$ denote an arbitrary distribution on the positive integers. Then for any $\tau>0$, there is a prior $\pi$ such that $$\begin{align*}\mathrm{KL}(\rho_{S,C,Q},\pi)\leq&(k\lceil\log(r)\rceil+\vert S\vert_c+\vert C\vert_c)\log(2)-\log(M(k\lceil\log(r)\rceil+\vert S\vert_c+\vert C\vert_c))\\&+\sum_{i=1}^k\mathrm{KL}\left(\mathcal{N}\left(c_{q_i},\sigma^2\right),\sum_{j=1}^r\mathcal{N}\left(c_j,\tau^2\right)\right).\end{align*}$$
Proof
The following is a proof by construction, that is, we construct a prior $\pi$ with the desired property. To do this we express the prior as a mixture over all possible compressions provided by the algorithm. We first define the mixture component $$\pi_{S,Q,C}=\mathcal{N}\left(\mathbf{w}(S,Q,C),\tau^2\right).$$ We then define our prior to be a weighted mixture over all possible compressions, that is $$\pi=\frac{1}{Z}\sum_{S,Q,C}M\left(\vert S\vert_c+\vert C\vert_c+k\lceil\log(r)\rceil\right)\cdot2^{-\vert S\vert_c-\vert C\vert_c-k\lceil\log(r)\rceil}\pi_{S,Q,C},$$ where $Z\leq1$ as the coding of the triples is injective. Let $\left(\hat{S},\hat{Q},\hat{C}\right)$ be the output of our compression algorithm, so that our posterior $\rho$ is $\mathcal{N}\left(\mathbf{w}\left(\hat{S},\hat{Q},\hat{C}\right),\sigma^2\right)$. Therefore, $$\begin{align*}\mathrm{KL}(\rho,\pi)&\leq\mathrm{KL}\left(\rho,\sum_{S,Q,C}M\left(\vert S\vert_c+\vert C\vert_c+k\lceil\log(r)\rceil\right)\cdot2^{-\vert S\vert_c-\vert C\vert_c-k\lceil\log(r)\rceil}\pi_{S,Q,C}\right)\\&\leq\mathrm{KL}\left(\rho,\sum_QM\left(\left\vert \hat{S}\right\vert_c+\left\vert \hat{C}\right\vert_c+k\lceil\log\left(\hat{r}\right)\rceil\right)\cdot2^{-\left\vert \hat{S}\right\vert_c-\left\vert \hat{C}\right\vert_c-k\lceil\log\left(\hat{r}\right)\rceil}\pi_{\hat{S},Q,\hat{C}}\right)\\&=\left(\left\vert \hat{S}\right\vert_c+\left\vert \hat{C}\right\vert_c+k\lceil\log\left(\hat{r}\right)\rceil\right)\log(2)-\log\left(M\left(\left\vert \hat{S}\right\vert_c+\left\vert \hat{C}\right\vert_c+k\lceil\log\left(\hat{r}\right)\rceil\right)\right)+\mathrm{KL}\left(\rho,\sum_Q\pi_{\hat{S},Q,\hat{C}}\right).\end{align*}$$ Let $\phi_{\tau}=\mathcal{N}\left(0,\tau^2\right)$. Then, as the mixture term is independent across coordinates, we have that $$\left(\sum_Q\pi_{\hat{S},Q,\hat{C}}\right)(x)=\sum_{q^1,\dots,q^k=1}^r\prod_{i=1}^k\phi_{\tau}\left(x_i-\hat{c}_{q^i}\right)=\prod_{i=1}^k\sum_{q^i=1}^r\phi_{\tau}\left(x_i-\hat{c}_{q^i}\right).$$ Furthermore, as $\rho$ is independent over the coordinates, we get that $$\mathrm{KL}\left(\rho,\sum_Q\pi_{\hat{S},Q,\hat{C}}\right)=\sum_{i=1}^k\mathrm{KL}\left(\rho_i,\sum_{q^i=1}^r\mathcal{N}\left(\hat{c}_{q^i},\tau^2\right)\right),$$ from which the result follows, completing the proof of the theorem. $\square$
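The last term has no closed form but can be estimated per coordinate by Monte Carlo. A minimal sketch (assuming NumPy and SciPy; the codebook and the scales $\sigma$, $\tau$ are toy values, and the second argument is the unnormalised sum of Gaussians exactly as it appears in the bound):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of one per-coordinate term from Theorem 5.8, namely
# KL( N(c_{q_i}, sigma^2) , sum_j N(c_j, tau^2) ), with the sum taken literally.
rng = np.random.default_rng(3)
codebook, sigma, tau = np.array([-0.5, 0.0, 0.5]), 0.02, 0.05
c_q = codebook[2]                  # codeword this coordinate is quantized to

x = rng.normal(c_q, sigma, size=200_000)                        # x ~ N(c_q, sigma^2)
log_p = norm.logpdf(x, c_q, sigma)
log_q = np.log(norm.pdf(x[:, None], codebook[None, :], tau).sum(axis=1))
print("per-coordinate term estimate:", np.mean(log_p - log_q))
```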
Choosing the prior provided by Theorem 5.8 and applying Theorem 5.5, one obtains a PAC-Bayes generalization bound that exploits the compressibility of the network.