Chapter 6: Appendix
6.1 Extensions to Convolutional Neural Networks
In this section, we extend the ideas of Section 2.3 to convolutional neural networks (CNNs) (Arora, 2018). This extension is not trivial due to the parameter sharing that occurs in the CNN architecture. To investigate these ideas we update our notation from that of Section 2.3. In particular, we suppose that the $i^\text{th}$ layer has an image dimension of $n_1^i\times n_2^i$, where each pixel has $l^i$ channels, and the filter at layer $i$ has size $\kappa_i\times\kappa_i$ with stride $s_i$, so that the convolutional filter has dimension $l^i\times l^{i-1}\times\kappa_i\times\kappa_i$. If we apply Algorithm 3 to each copy of the filter then the number of new parameters grows proportionally to $n_1^in_2^i$, which is undesirable. On the other hand, compressing the filter once and re-using it for all patches breaks the implicit assumption that the noise generated by the compression behaves similarly to a Gaussian, since the shared filters introduce correlations. To resolve these issues Algorithm 8 generates $p$-wise independent compressed filters for each convolution location. This requires a factor of $p$ more parameters than a single compression, but if $p$ grows logarithmically with respect to the relevant parameters then the filters behave like fully independent filters.

To proceed with this idea we need to introduce some operations. For $k^\prime\leq k$ let $Y$ be a $k^\text{th}$ order tensor and $Z$ a $(k^\prime)^{\text{th}}$ order tensor whose dimensions match the last $k^\prime$ dimensions of $Y$. The product operator $\times_{k^\prime}$, when given tensors $Y$ and $Z$, returns a $(k-k^\prime)^\text{th}$ order tensor as follows $$\left(Y\times_{k^\prime} Z\right)_{i_1,\dots,i_{k-k^\prime}}=\left\langle Y_{i_1,\dots,i_{k-k^\prime}},Z\right\rangle=\left\langle\mathrm{vec}\left(Y_{i_1,\dots,i_{k-k^\prime}}\right),\mathrm{vec}(Z)\right\rangle.$$ Let $X\in\mathbb{R}^{l\times n_1\times n_2}$ be an $n_1\times n_2$ image where the pixels have $l$ channels. Denote the $\kappa\times\kappa$ sub-image starting from pixel $(i,j)$ by $X_{(i,j),\kappa}\in\mathbb{R}^{l\times\kappa\times\kappa}$. Let $A\in\mathbb{R}^{l^\prime\times l\times\kappa\times\kappa}$ be a convolutional weight tensor. The convolution operator with stride $s$ can then be defined as $$\left(A*_sX\right)_{i,j}=A\times_3X_{\left(s(i-1)+1,s(j-1)+1\right),\kappa}$$ for $1\leq i\leq\left\lfloor\frac{n_1-\kappa}{s}\right\rfloor=:n_1^\prime$ and $1\leq j\leq\left\lfloor\frac{n_2-\kappa}{s}\right\rfloor=:n_2^\prime$, so that $A*_sX\in\mathbb{R}^{l^\prime\times n_1^\prime\times n_2^\prime}$. Algorithm 8 generates $p$-wise independent filters $\hat{A}_{(a,b)}$ for each convolution location $(a,b)\in\left[n_1^\prime\right]\times\left[n_2^\prime\right]$, and so $\hat{A}*_s X$ will be used to denote the convolution operator $$\left(\hat{A}*_sX\right)_{i,j}=\hat{A}_{(i,j)}\times_3X_{\left(s(i-1)+1,s(j-1)+1\right),\kappa}$$ for $1\leq i\leq n_1^\prime$ and $1\leq j\leq n_2^\prime$. With this we see that for any layers $j\geq i\geq1$ we have $$x^{i+1}=\phi\left(A^i*_{s_i}x^i\right),\text{ and }\;x^j=M^{ij}\left(x^i\right)=J_{x^i}^{ij}\times_3x^i.$$
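To make the $\times_{k^\prime}$ product and the strided convolution operator concrete, the following is a minimal NumPy sketch of both operations exactly as defined above (the function names, the zero-based indexing and the example sizes are illustrative choices, not part of the formal development).

```python
import numpy as np

def tensor_product(Y, Z):
    """(Y x_{k'} Z): contract the last k' axes of Y against the k'-th order tensor Z."""
    return np.tensordot(Y, Z, axes=Z.ndim)

def conv_stride(A, X, s):
    """(A *_s X)_{i,j} = A x_3 X_{(s(i-1)+1, s(j-1)+1), kappa}, written with 0-based indices."""
    l_out, l_in, kappa, _ = A.shape
    _, n1, n2 = X.shape
    n1p, n2p = (n1 - kappa) // s, (n2 - kappa) // s   # n_1', n_2'
    out = np.zeros((l_out, n1p, n2p))
    for i in range(n1p):
        for j in range(n2p):
            patch = X[:, s * i:s * i + kappa, s * j:s * j + kappa]  # X_{(i,j),kappa}
            out[:, i, j] = tensor_product(A, patch)                 # contract over l, kappa, kappa
    return out

A = np.random.randn(4, 3, 5, 5)      # l' x l x kappa x kappa filter
X = np.random.randn(3, 32, 32)       # l x n1 x n2 image
print(conv_stride(A, X, s=2).shape)  # (4, 13, 13)
```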
Definition 6.1 For any two layers $i\leq j$, we define the inter-layer cushion $\mu_{i,j}$ as the largest number such that for any $(x,y)\in S$ we have that $$\mu_{i,j}\frac{1}{\sqrt{n_1^in_2^i}}\left\Vert J_{x^i}^{i,j}\right\Vert_F\left\Vert x^i\right\Vert\leq\left\Vert J_{x^i}^{i,j}\times_3x^i\right\Vert.$$ For any layer $i$ let the minimal inter-layer cushion be $\mu_{i\to}=\min_{i\leq j\leq d}\mu_{i,j}=\min\left(\frac{1}{\sqrt{l^i}},\min_{i<j\leq d}\mu_{i,j}\right).$
Definition 6.2 Let $J_x^{i,j}\in\mathbb{R}^{l^j\times n_1^j\times n_2^j\times l^i\times n_1^i\times n_2^i}$ be the Jacobian of $M^{i,j}$ at $x$. We say that the Jacobian is $\beta$ well-distributed if for any $(x,y)\in S$, any $i,j$ and any $(a,b)\in\left[n_1^i\right]\times\left[n_2^i\right]$ we have that $$\left\Vert\left[J_x^{i,j}\right]_{:,:,:,:,a,b}\right\Vert_F\leq\frac{\beta}{\sqrt{n_1^in_2^i}}\left\Vert J_x^{i,j}\right\Vert_F.$$
Algorithm 8 $\left(A,\epsilon,\eta, n_1^\prime\times n_2^\prime\right)$
Require: Convolution Tensor $A\in\mathbb{R}^{l^\prime\times l\times\kappa\times\kappa}$, error parameters $\epsilon,\eta$.
Ensure: Generate $n_1^\prime\times n_2^\prime$ different tensors $\hat{A}_{(i,j)}\left((i,j)\in\left[n_1^\prime\right]\times\left[n_2^\prime\right]\right)$ that satisfy $(\star)$.
Let $k=\frac{Q\lceil\frac{\kappa}{s}\rceil^2\left(\log\left(\frac{1}{\eta}\right)\right)^2}{\epsilon^2}$ for a large enough universal constant $Q$.
Let $p=\log\left(\frac{1}{\eta}\right)$.
Sample a uniformly random subspace $\mathcal{S}$ of $\mathbb{R}^{l^\prime\times l\times\kappa\times\kappa}$ of dimension $kp$.
for $(i,j)\in\left[n_1^\prime\right]\times\left[n_2^\prime\right]$ do
----Sample $k$ tensors $M_1,\dots,M_k\in\mathbb{R}^{l^\prime\times l\times\kappa\times\kappa}$ with $\mathrm{i.i.d.}$ $\mathcal{N}(0,1)$ entries.
----for $k^\prime=1\to k$ do
--------Let $M_{k^\prime}^\prime=\sqrt{\frac{ll^\prime\kappa^2}{kp}}\mathrm{Proj}_{\mathcal{S}}(M_{k^\prime})$.
--------Let $Z_{k^\prime}=\langle A,M_{k^\prime}^\prime\rangle M_{k^\prime}^\prime$.
----end for
----Let $\hat{A}_{(i,j)}=\frac{1}{k}\sum_{k^\prime=1}^kZ_{k^\prime}$.
end for
return $\left\{\hat{A}_{(i,j)}\right\}_{(i,j)\in\left[n_1^\prime\right]\times\left[n_2^\prime\right]}$
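Below is a minimal NumPy sketch of Algorithm 8. The universal constant is fixed to $Q=1$, the stride $s$ is passed explicitly, and the random subspace $\mathcal{S}$ is represented by an orthonormal basis obtained from a QR decomposition; these are illustrative implementation choices rather than part of the algorithm's statement, and the sketch assumes $kp$ does not exceed the ambient dimension $l^\prime l\kappa^2$.

```python
import numpy as np

def compress_conv_filter(A, eps, eta, n1p, n2p, s, Q=1.0, rng=None):
    """Sketch of Algorithm 8: one compressed filter A_hat_(i,j) per convolution location."""
    rng = np.random.default_rng() if rng is None else rng
    l_out, l_in, kappa, _ = A.shape
    D = l_out * l_in * kappa * kappa
    k = int(np.ceil(Q * np.ceil(kappa / s) ** 2 * np.log(1 / eta) ** 2 / eps ** 2))
    p = int(np.ceil(np.log(1 / eta)))

    # Orthonormal basis of a uniformly random (k p)-dimensional subspace of the filter space.
    B, _ = np.linalg.qr(rng.standard_normal((D, k * p)))
    scale = np.sqrt(l_in * l_out * kappa ** 2 / (k * p))
    a_vec = A.reshape(-1)

    A_hat = {}
    for i in range(n1p):
        for j in range(n2p):
            M = rng.standard_normal((k, D))          # k Gaussian tensors, flattened
            M_proj = scale * ((M @ B) @ B.T)         # M'_k = scale * Proj_S(M_k)
            Z = (M_proj @ a_vec)[:, None] * M_proj   # Z_k = <A, M'_k> M'_k
            A_hat[(i, j)] = Z.mean(axis=0).reshape(A.shape)
    return A_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8, 5, 5))                 # l' x l x kappa x kappa
A_hat = compress_conv_filter(A, eps=1.0, eta=0.2, n1p=3, n2p=3, s=2, rng=rng)
print(len(A_hat), A_hat[(0, 0)].shape)                # 9 filters, each of shape (8, 8, 5, 5)
```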
For any $\delta>0$ and $\epsilon\leq1$, let $G=\left\{\left(U^i,V^i\right)\right\}_{i=1}^m$ be a set of $m$ tensor pairs where $U\in\mathbb{R}^{n_u\times l^\prime\times n_1^\prime\times n_2^\prime}$ and $V\in\mathbb{R}^{l\times n_1\times n_2}$, let $\hat{A}_{(i,j)}\in\mathbb{R}^{l^\prime\times l\times\kappa\times\kappa}$ be the output of Algorithm 8 with $\eta=\frac{\delta}{m}$, and set $\Delta_{(i,j)}=\hat{A}_{(i,j)}-A$. Suppose the $U$'s are $\beta$-well-distributed. Then for any $(U,V)\in G$ we have that $$\mathbb{P}\left(\left\Vert U\times_3\left(\Delta*_s V\right)\right\Vert\leq\frac{\epsilon\beta}{\sqrt{n_1^\prime n_2^\prime}}\Vert A\Vert_F\Vert U\Vert_F\Vert V\Vert_F\right)\geq1-\delta\quad(\star).$$ Algorithm 8 is designed to generate different compressed filters $\hat{A}_{(i,j)}$ in a way that keeps the total number of parameters small, but also ensures that the $\hat{A}_{(i,j)}$'s behave similarly to the compressed filters that would be generated if Algorithm 3 were applied to each location independently.
Theorem 6.3 For any convolutional neural network $h_{\mathbf{w}}$ with $\rho_{\delta}\geq3d$, any probability $0<\delta\leq1$ and any margin $\gamma>0$, Algorithm 8 generates weights $\tilde{\mathbf{w}}$ for the network $h_{\tilde{\mathbf{w}}}$ such that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(L_0(h_{\tilde{\mathbf{w}}})\leq\hat{L}_{\gamma}(h_{\mathbf{w}})+\tilde{O}\left(\sqrt{\frac{c^2d^2\max_{(x,y)\in S}\Vert h_{\mathbf{w}}(x)\Vert_2^2\sum_{i=1}^d\frac{\beta^2\left(\left\lceil\frac{\kappa_i}{s_i}\right\rceil\right)^2}{\mu_i^2\mu_{i\to}^2}}{\gamma^2m}}\right)\right)\geq1-\delta,$$ where $\mu_i,\mu_{i\to},c,\rho_{\delta}$ and $\beta$ are the layer cushion, minimal inter-layer cushion, activation contraction, inter-layer smoothness and Jacobian well-distributedness constant respectively. Furthermore, $\kappa_i$ and $s_i$ are the filter size and stride in layer $i$.
Corollary 6.4 For any convolutional neural network $h_{\mathbf{w}}$ with $\rho_{\delta}\geq3d$, any probability $0<\delta\leq1$ and any margin $\gamma>0$, Algorithm 8 generates weights $\tilde{\mathbf{w}}$ for the network $h_{\tilde{\mathbf{w}}}$ such that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(L_0(h_{\tilde{\mathbf{w}}})\leq\hat{L}_{\gamma}(h_{\mathbf{w}})+\zeta+\tilde{O}\left(\sqrt{\frac{c^2d^2\max_{(x,y)\in S}\Vert h_{\mathbf{w}}(x)\Vert_2^2\sum_{i=1}^d\frac{\beta^2\left(\left\lceil\frac{\kappa_i}{s_i}\right\rceil\right)^2}{\mu_i^2\mu_{i\to}^2}}{\gamma^2m}}\right)\right)\geq1-\delta,$$ where $\mu_i,\mu_{i\to},c,\rho_{\delta}$ and $\beta$ are the layer cushion, minimal inter-layer cushion, activation contraction, inter-layer smoothness and Jacobian well-distributedness constant respectively, measured on a $1-\zeta$ fraction of the training set $S$. Furthermore, $\kappa_i$ and $s_i$ are the filter size and stride in layer $i$.
6.2 Current State of the Art PAC-Bayes Bounds
We have seen that PAC-Bayes bounds provide a theoretical perspective on the learning process and the consequences it has for the performance of the learned classifier. In practice, we would ideally want these bounds to be meaningful, yet when implemented naively they produce vacuous bounds that provide no information. The first non-vacuous PAC-Bayes bounds were discussed in Section 3.2 with the work of (Dziugaite, 2017), which focused on a particular setting to obtain non-vacuous bounds. Since then there have been directed efforts to improve the tightness of these bounds and extend the success to different contexts. Currently, the tightest bounds seen in practice come from the work of (Lotfi, 2022). In this section, we will discuss this work and see how it develops some of the previous work we have discussed. The work of (Lotfi, 2022) is an extension of (Zhou, 2019) and follows the same compression paradigm that was first considered by (Arora, 2018). In (Lotfi, 2022) tighter generalization bounds are achieved by first restricting to a lower-dimensional setting using a notion called intrinsic dimensionality, and then developing more aggressive quantization schemes that are adapted to the problem at hand.
6.2.1 The PAC-Bayes Foundations
Throughout this section, we will adopt the same notation as the rest of this report. Consider Theorem 2.1: the $\log(M)$ term counts the number of bits needed to specify any hypothesis $h_{\mathbf{w}}$ with $\mathbf{w}\in\mathcal{W}$, assuming each hypothesis is equally likely. If instead we have some prior belief about which hypotheses are likely, we can construct a variable-length code that uses fewer bits to specify those hypotheses. For a prior distribution $\pi$ and any $\mathbf{w}\in\mathcal{W}$, the number of bits required to represent hypothesis $h_{\mathbf{w}}$ is $\log_2\left(\frac{1}{\pi(\mathbf{w})}\right)$ when using an optimal compression code for $\pi$. Furthermore, if we consider a distribution $\rho$ over good hypotheses and we do not care which sample from $\rho$ we arrive at, we can gain some bits back. In particular, the average number of bits to code a sample from $\rho$ using $\pi$ is the cross-entropy $H(\rho,\pi)$, and since we are agnostic to the sample from $\rho$ we get back $H(\rho)$ bits. Therefore, the average number of bits is $$H(\rho,\pi)-H(\rho)=\mathrm{KL}(\rho,\pi).$$
Definition 6.5 For probability measures $\rho$ and $\pi$ on a state space $\mathcal{X}$ that are absolutely continuous with respect to some measure $\lambda$, the cross-entropy is $$H(\rho,\pi)=-\int_{\mathcal{X}}\rho(x)\log\left(\pi(x)\right)d\lambda(x),$$ where $\rho(x)$ and $\pi(x)$ denote the densities with respect to $\lambda$, and $H(\rho):=H(\rho,\rho)$.
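As a quick sanity check of the coding interpretation above, the snippet below computes $H(\rho,\pi)$, $H(\rho)$ and their difference for two arbitrary discrete distributions and confirms that it equals the KL divergence; the base-2 logarithm is used so the quantities are measured in bits, and the two three-point distributions are purely illustrative.

```python
import numpy as np

rho = np.array([0.5, 0.25, 0.25])   # "posterior" over three hypotheses
pi = np.array([0.7, 0.2, 0.1])      # prior over the same hypotheses

cross_entropy = -np.sum(rho * np.log2(pi))   # average code length when coding rho with a code built for pi
entropy = -np.sum(rho * np.log2(rho))        # bits we get back by being agnostic to the sample from rho
kl = np.sum(rho * np.log2(rho / pi))         # direct computation of KL(rho, pi)

print(cross_entropy - entropy, kl)           # the two values agree
```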
With these improvements, we can get bounds such as Theorem 3.10. For this work, we use Theorem 5.5 to obtain the generalization bounds. The prior that we will use here is known as the universal prior and explicitly penalizes the minimum compressed length of the hypothesis, $$\pi(\mathbf{w})=\frac{2^{-K(\mathbf{w})}}{Z}.$$ Then using a point-mass posterior on the single parameter $\mathbf{w}^*$ we get that $$\mathrm{KL}\left(\mathbf{I}_{\{\mathbf{w}=\mathbf{w}^*\}},\pi\right)=\log\left(\frac{1}{\pi(\mathbf{w}^*)}\right)\leq K\left(\mathbf{w}^*\right)\log(2)\leq l(\mathbf{w}^*)\log(2)+2\log\left(l(\mathbf{w}^*)\right),$$ where $l(\mathbf{w})$ is the length in bits of the program that reproduces $\mathbf{w}$. Improving the tightness of our bounds therefore comes down to reducing the compressed length $l(\mathbf{w}^*)$ of the $\mathbf{w}^*$ achieved through training. The method for model compression consists of two components: reducing the dimensionality of the problem, and developing an aggressive quantization scheme.
6.2.2 Finding Random Subspaces
A neural network parameterized by the weight vector $\mathbf{w}\in\mathbb{R}^D$ is typically optimized through gradient descent, so that the updates occur in the $D$-dimensional loss landscape. Despite $D$ being very large, the optimization process is relatively stable and converges to simple solutions. However, we can work in a reduced dimension $d<D$ (referred to as the intrinsic dimension) by considering $$\mathbf{w}=\mathbf{w}_0+P\hat{\mathbf{w}},$$ where $\mathbf{w}_0\in\mathbb{R}^D$ is the initialized weight, $P\in\mathbb{R}^{D\times d}$ is such that $P^\top P\approx I_{d\times d}$ and $\hat{\mathbf{w}}\in\mathbb{R}^d$. Now the vector $\hat{\mathbf{w}}$ is optimized, so that the updates take place on a $d$-dimensional landscape. The bottleneck of this approach is finding the smallest value of $d$ for which we still attain good performance on the problem at hand. The work lies in finding a projection $P$ that is stable under training and finding good subspaces in which to optimize. Imposing the condition $P^\top P\approx I_{d\times d}$ addresses the first concern; for the second, we consider the following constructions of $P$ (a small numerical sketch of the Kronecker product construction is given after the list).
- Kronecker Sum Projector: Construct the matrix $$P_{\oplus}=\frac{\mathbf{1}\otimes R_1+R_2\otimes\mathbf{1}}{\sqrt{2D}}$$ where $\otimes$ is the Kronecker product, $R_1,R_2\sim\mathcal{N}(0,1)^{\sqrt{D}\times d}$. Note that $P^\top_{\oplus}P_{\oplus}=I_{d\times d}+O\left(\frac{1}{\sqrt{D}}\right).$ Applying this to a vector $\hat{\mathbf{w}}\in\mathbb{R}^d$ takes $O\left(d\sqrt{D}\right)$ time.
- Kronecker Product Projector: Construct the matrix $$P_{\otimes}=\frac{Q_1\otimes Q_2}{\sqrt{D}}$$ where $Q_1,Q_2\sim\mathcal{N}(0,1)^{\sqrt{D}\times\sqrt{d}}$. Note that $P_{\otimes}^\top P_{\otimes}=I_{d\times d}+O\left(\frac{1}{D^{\frac{1}{4}}}\right).$ Applying this to a vector $\hat{\mathbf{w}}\in\mathbb{R}^d$ takes $O\left(\sqrt{dD}\right)$ time.
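The following is the promised numerical sketch of the Kronecker product projector. The sizes $D$ and $d$ are illustrative (chosen as perfect squares so that $\sqrt{D}$ and $\sqrt{d}$ are integers), the near-orthonormality check forms $P_\otimes^\top P_\otimes$ explicitly, and the fast matrix-vector product uses the standard reshaping identity for Kronecker products rather than materializing $P_\otimes$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64 ** 2, 16 ** 2                     # illustrative sizes; perfect squares
rD, rd = int(np.sqrt(D)), int(np.sqrt(d))

Q1 = rng.standard_normal((rD, rd))
Q2 = rng.standard_normal((rD, rd))
P = np.kron(Q1, Q2) / np.sqrt(D)            # P_otimes, a D x d projector

print(np.abs(P.T @ P - np.eye(d)).max())    # small deviation from the identity

# Applying P without forming it: kron(Q1, Q2) vec(W) = vec(Q1 W Q2^T) for row-major vec
w_hat = rng.standard_normal(d)
W = w_hat.reshape(rd, rd)
fast = (Q1 @ W @ Q2.T).reshape(-1) / np.sqrt(D)
print(np.allclose(fast, P @ w_hat))         # True

w0 = rng.standard_normal(D)                 # initialized weights
w = w0 + fast                               # w = w_0 + P w_hat; only w_hat is optimized
```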
6.2.3 Quantization
For the full-precision weight vector $\mathbf{w}=(w_1,\dots,w_d)\in\mathbb{R}^d$ and a vector $c=(c_1,\dots,c_L)\in\mathbb{R}^L$ of $L$ quantization levels, construct the quantized vector $\tilde{\mathbf{w}}\in\mathbb{R}^d$ with $\tilde{w}_i=c_{q(i)}$, where $q(i)=\mathrm{argmin}_k\vert w_i-c_k\vert$. The vector $c$ is learned alongside $\mathbf{w}$, where the gradients are defined as $$\frac{\partial\tilde{w}_i}{\partial w_j}=\delta_{ij},\text{ and }\;\frac{\partial\tilde{w}_i}{\partial c_k}=\mathbf{I}_{q(i)=k}.$$ The codebook $c$ is initialized to have uniform spacing between the minimum and maximum values of $\mathbf{w}$, or using $k$-means. The latter approach refers to a quantization scheme proposed in (Choi, 2017) where, for $k=L$, we partition the weights into clusters $\mathcal{C}_1,\dots,\mathcal{C}_k$ with centres $c_1,\dots,c_k$ that solve $$\mathrm{argmin}_{\mathcal{C}_1,\dots,\mathcal{C}_k}\left(\sum_{i=1}^k\sum_{w\in\mathcal{C}_i}\vert w-c_i\vert^2\right),\quad\text{ for }c_i=\frac{1}{\vert\mathcal{C}_i\vert}\sum_{w\in\mathcal{C}_i}w.$$ Next, we exploit the fact that certain quantization levels will be more likely than others by introducing a variable-length coding scheme. With each level $c_k$ we associate the probability $p_k$ of a weight being assigned to that level and apply arithmetic coding. The arithmetic coding of $\tilde{\mathbf{w}}$ takes at most $\left\lceil d\times H(p)\right\rceil$ bits, where $p$ is the discrete distribution of the $p_k$'s. Counting the total number of bits we see that $$l(\mathbf{w})\leq\left\lceil d\times H(p)\right\rceil+L\times\left(16+\left\lceil\log_2(d)\right\rceil\right)+2,$$ as $L\times\left\lceil\log_2(d)\right\rceil$ bits are required for the probabilities $p_k$ and $16L$ bits for the codebook $c$.
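The sketch below quantizes a weight vector to $L$ levels initialized with uniform spacing (the $k$-means initialization and the gradient rules are omitted, since only the coding cost is illustrated) and evaluates the bit count $\lceil d\,H(p)\rceil+L(16+\lceil\log_2(d)\rceil)+2$ from above; the sizes are arbitrary illustrative choices.

```python
import numpy as np

def quantize(w, L):
    """Nearest-level quantization with L uniformly spaced levels."""
    c = np.linspace(w.min(), w.max(), L)                      # codebook c_1, ..., c_L
    q = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)    # q(i) = argmin_k |w_i - c_k|
    return c[q], c, q

def compressed_length_bits(q, L, d):
    """Upper bound l(w) <= ceil(d H(p)) + L (16 + ceil(log2 d)) + 2."""
    counts = np.bincount(q, minlength=L)
    p = counts[counts > 0] / d
    entropy = -np.sum(p * np.log2(p))                         # H(p)
    return int(np.ceil(d * entropy)) + L * (16 + int(np.ceil(np.log2(d)))) + 2

d, L = 10_000, 8
w = np.random.default_rng(1).standard_normal(d)
w_tilde, c, q = quantize(w, L)
print(compressed_length_bits(q, L, d))                        # far fewer than the 32 * d bits of full precision
```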
6.2.4 Optimization
Note that the smaller the intrinsic dimension $d$ is, the closer the trained weight will be to the initialized weight $\mathbf{w}_0$, and so the more likely the trained weight is under our universal prior once we condition on $\mathbf{w}_0$. Recall that we must therefore condition on $\mathbf{w}_0$ when generating our prior. Similarly, if we optimize over different hyper-parameters such as $d$, $L$ or the learning rate $\eta$, we must encode these into the prior and pay a penalty for optimizing over them. To do this we simply redefine our weight vector to be $\mathbf{w}^\prime=(\mathbf{w},d,L,\eta)$, so that our prior becomes $$\pi\left(\mathbf{w}^\prime\right)=\frac{2^{-K\left(\mathbf{w}^\prime\right)}}{Z},$$ where now we have that $$K\left(\mathbf{w}^\prime\right)\leq K(\mathbf{w}\vert d,L)+K(d)+K(L)+K(\eta).$$ Typically, we optimize these hyper-parameters over finite sets, and so we can bound these terms by the ceiling of $\log_2$ of the sizes of these sets. The process we have discussed is summarized in Algorithm 9.
Algorithm 9 Compute PAC-Bayes Compression Bound
Require: Neural network $h_{\mathbf{w}}$, training data set $S=\{(x_i,y_i)\}_{i=1}^m$, Clusters $L$, Intrinsic Dimension $d$, Confidence $1-\delta$, Prior distribution $\pi$.
function $\mathrm{COMPUTEBOUND}\left(h_{\mathbf{w}},L,d,S,\delta,\pi\right)$
----$\mathbf{w}\leftarrow\mathrm{TRAINID}(h_{\mathbf{w}},d,S)$
----$\tilde{\mathbf{w}}\leftarrow\mathrm{TRAINQUANTIZE}\left(h_{\mathbf{w}},L,S\right).$
----Compute quantized empirical risk $\hat{R}(\tilde{\mathbf{w}}).$
----$\mathrm{KL}(\rho,\pi)\leftarrow\mathrm{GETKL}\left(\tilde{\mathbf{w}},\pi\right).$
----return $\mathrm{GETCATONIBOUND}\left(\hat{R}(\tilde{\mathbf{w}}),\mathrm{KL}(\rho,\pi),\delta,m\right)$.
end function
function $\mathrm{TRAINQUANTIZE}\left(h_{\mathbf{w}},L,S\right)$
----Initialize $c\leftarrow\mathrm{GETCLUSTERS}\left(\mathbf{w},L\right).$
----for $i=1\to\mathrm{quantepochs}$ do
--------$\begin{pmatrix}c\\\mathbf{w} \end{pmatrix}\leftarrow\begin{pmatrix}c-\eta\nabla_c\mathcal{L}\left(\mathbf{w},c\right)\\\mathbf{w}-\eta\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},c)\end{pmatrix}.$
----end for
----return $\tilde{\mathbf{w}}$
end function
function $\mathrm{GETKL}\left(\tilde{\mathbf{w}},\pi\right)$
----$c,\mathrm{count}\leftarrow\mathrm{GETUNIQUEVALSCOUNTS}\left(\tilde{\mathbf{w}}\right).$
----$\mathrm{messagesize}\leftarrow\mathrm{DOARITHMETICENCODING}\left(\tilde{\mathbf{w}},c,\mathrm{count}\right).$
----$\mathrm{messagesize}\leftarrow\mathrm{messagesize}+\mathrm{hyperparamsearch}$
----return $\mathrm{messagesize}+2\times\log\left(\mathrm{messagesize}\right).$
end function
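To tie Algorithm 9 back to the earlier discussion, the sketch below assembles the KL term from a message size in bits (such as the one produced by the quantization sketch above) together with the hyper-parameter penalties of this subsection, using the bound $\mathrm{KL}\leq l(\mathbf{w}^\prime)\log(2)+2\log\left(l(\mathbf{w}^\prime)\right)$; the grid sizes are illustrative and the final Catoni-style bound of Theorem 5.5 is not restated here.

```python
import numpy as np

def kl_upper_bound(message_bits, hyperparam_grid_sizes):
    """Bound KL(point mass, universal prior) by l(w') log 2 + 2 log l(w'), where l(w') adds
    ceil(log2 |grid|) bits for every hyper-parameter tuned over a finite grid."""
    total_bits = message_bits + sum(int(np.ceil(np.log2(g))) for g in hyperparam_grid_sizes)
    return total_bits * np.log(2) + 2 * np.log(total_bits)    # in nats

# e.g. a 25,000-bit message, with a search over 10 values of d, 6 of L and 4 learning rates
print(kl_upper_bound(25_000, [10, 6, 4]))
```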
6.3 Rademacher Complexity
Recall that we have the space $\mathcal{Z}$ on which a distribution $\mathcal{D}$ is defined, from which we draw an $\mathrm{i.i.d.}$ sample $S=\{(x_i,y_i)\}_{i=1}^m$. Suppose we have a class of functions $\mathcal{F}=\{f:\mathcal{Z}\to\mathbb{R}\}$.
Definition 6.6 (Balcan, 2011) The empirical Rademacher complexity of $\mathcal{F}$ is $$\hat{\mathfrak{R}}(\mathcal{F})=\mathbb{E}_{\sigma\in\{\pm1\}^m}\left(\sup_{f\in\mathcal{F}}\left(\frac{1}{m}\sum_{i=1}^m\sigma_if((x_i,y_i))\right)\right),$$ where each $\sigma_i$ is an independent random variable uniformly distributed on $\{\pm1\}$.
Definition 6.7 (Balcan, 2011) The Rademacher complexity of $\mathcal{F}$ is $$\mathfrak{R}(\mathcal{F})=\mathbb{E}_{S\sim\mathcal{D}^m}\left(\hat{\mathfrak{R}}(\mathcal{F})\right).$$
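Definition 6.6 is straightforward to estimate by Monte Carlo over the signs $\sigma$. The sketch below does this for an illustrative finite class of threshold functions $f_t(x)=\mathbb{I}(x\leq t)$ on scalar inputs; the class, the sample and the number of sign draws are arbitrary choices made only to show the computation.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(z_i) ].

    F_values has shape (num_functions, m) and holds f(z_i) for every f in the class.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, m = F_values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))        # i.i.d. Rademacher signs
    sups = (sigma @ F_values.T / m).max(axis=1)               # sup over f for each draw of sigma
    return sups.mean()

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(size=m)                                       # the sample S (labels play no role here)
thresholds = np.linspace(0, 1, 21)
F_values = (x[None, :] <= thresholds[:, None]).astype(float)  # f_t(x_i) = 1{x_i <= t}
print(empirical_rademacher(F_values, rng=rng))                # roughly of order sqrt(log|F| / m)
```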
Theorem 6.8 (Balcan, 2011) For a parameter $\delta\in(0,1)$ if $\mathcal{F}\subseteq\{f:\mathcal{Z}\to[0,1]\}$ then $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{z\sim\mathcal{D}}\left(f(z)\right)\leq\frac{1}{m}\sum_{i=1}^mf(x_i,y_i)+2\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{m}}\right)\geq1-\delta,$$ and $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{z\sim\mathcal{D}}\left(f(z)\right)\leq\frac{1}{m}\sum_{i=1}^mf(x_i,y_i)+2\hat{\mathfrak{R}}(\mathcal{F})+3\sqrt{\frac{\log\left(\frac{2}{\delta}\right)}{m}}\right)\geq1-\delta.$$
Proof
Theorem 6.8.1 (McDiarmid Inequality) Let $x_1,\dots, x_n$ be independent random variables taking values in a set $A$ and let $c_1,\dots, c_n$ be positive real constants. If $\phi:A^n\to\mathbb{R}$ satisfies $$\sup_{x_1,\dots,x_n,x_i^\prime}\left\vert\phi(x_1,\dots,x_i,\dots,x_n)-\phi\left(x_1,\dots,x_i^\prime,\dots,x_n\right)\right\vert\leq c_i,$$ for $1\leq i\leq n$, then $$\mathbb{P}\left(\phi(x_1,\dots,x_n)-\mathbb{E}\left(\phi(x_1,\dots,x_n)\right)\geq\epsilon\right)\leq\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^nc_i^2}\right).$$
Proof
For a proof of this theorem refer to (Scott(c), 2014).
Lemma 6.8.2 The function $$\phi(S)=\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h(x,y)\right)-\frac{1}{m}\sum_{i=1}^mh(x_i,y_i)\right)$$ satisfies $$\sup_{z_1,\dots,z_m,z_i^\prime\in\mathcal{Z}}\left\vert\phi(z_1,\dots,z_i,\dots,z_m)-\phi(z_1,\dots,z_i^\prime,\dots,z_m)\right\vert\leq\frac{1}{m}.$$
Proof
Let $S=\{z_1,\dots,z_m\}$ and $S^\prime=\{z_1,\dots,z_i^\prime,\dots,z_m\}$, then $$\phi(S)-\phi\left(S^\prime\right)=\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h(x,y)\right)-\frac{1}{m}\sum_{(x_j,y_j)\in S}h(x_j,y_j)\right)-\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h(x,y)\right)-\frac{1}{m}\sum_{(x_j,y_j)\in S^\prime}h(x_j,y_j)\right).$$ Let $h^*\in\mathcal{F}$ be a function that attains the first supremum, so that $$\phi(S)-\phi\left(S^\prime\right)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h^*(x,y)\right)-\frac{1}{m}\sum_{(x_j,y_j)\in S}h^*(x_j,y_j)-\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h(x,y)\right)-\frac{1}{m}\sum_{(x_j,y_j)\in S^\prime}h(x_j,y_j)\right),$$ and because the supremum over $S^\prime$ is at least its value at $h^*$ we also have that $$\begin{align*}\phi(S)-\phi\left(S^\prime\right)&\leq\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h^*(x,y)\right)-\frac{1}{m}\sum_{(x_j,y_j)\in S}h^*(x_j,y_j)-\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h^*(x,y)\right)+\frac{1}{m}\sum_{(x_j,y_j)\in S^\prime}h^*(x_j,y_j)\\&=\frac{1}{m}\sum_{(x_j,y_j)\in S^\prime}h^*(x_j,y_j)-\frac{1}{m}\sum_{(x_j,y_j)\in S}h^*(x_j,y_j).\end{align*}$$ By using the definitions of $S$ and $S^\prime$ this simplifies to $$\phi(S)-\phi\left(S^\prime\right)\leq\frac{1}{m}\left(h^*\left(x_i^\prime,y_i^\prime\right)-h^*(x_i,y_i)\right)\leq\frac{1}{m},$$ and exchanging the roles of $S$ and $S^\prime$ gives the same bound for $\phi\left(S^\prime\right)-\phi(S)$, so that $\left\vert\phi(S)-\phi\left(S^\prime\right)\right\vert\leq\frac{1}{m}$, which completes the proof of the lemma. $\square$
Lemma 6.8.2 shows that $\phi(S)=\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(h(x,y)\right)-\frac{1}{m}\sum_{i=1}^mh(x_i,y_i)\right)$ satisfies the conditions of Theorem 6.8.1 with constants $c_i=\frac{1}{m}$, therefore, $$\mathbb{P}\left(\phi(S)-\mathbb{E}_{S^\prime\sim\mathcal{D}^m}\left(\phi\left(S^\prime\right)\right)\geq t\right)\leq\exp\left(-2mt^2\right).$$ With $t=\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{m}}$ the right-hand side equals $\delta^2\leq\delta$, and since $\phi(S)\geq\mathbb{E}_{(x,y)\sim\mathcal{D}}\left(f(x,y)\right)-\frac{1}{m}\sum_{i=1}^mf(x_i,y_i)$ for every $f\in\mathcal{F}$, we deduce that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}(f(x,y))\leq\frac{1}{m}\sum_{i=1}^mf(x_i,y_i)+\mathbb{E}_{\hat{S}^\prime\sim\mathcal{D}^m}\left(\phi\left(\hat{S}^\prime\right)\right)+\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{m}}\right)\geq1-\delta.$$ Now we need to bound the expectation of $\phi(S)$ using the Rademacher complexity to complete the proof. Let $\tilde{S}=\left\{\tilde{z}_1,\dots,\tilde{z}_m\right\}$ be a sample independent of $S$ but identically distributed. As $$\mathbb{E}_{\tilde{S}}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S}}h(x,y)\Bigg\vert S\right)=\mathbb{E}_{z\sim\mathcal{D}}\left(h(z)\right),\text{ and }\;\mathbb{E}_{\tilde{S}}\left(\frac{1}{m}\sum_{(x,y)\in S}h(x,y)\Bigg\vert S\right)=\frac{1}{m}\sum_{(x,y)\in S}h(x,y)$$ we deduce that $$\begin{align*}\mathbb{E}_{S\sim\mathcal{D}^m}\left(\phi(S)\right)&=\mathbb{E}_{S\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S}}\left(h(x,y)\right)-\frac{1}{m}\sum_{(x,y)\in S}h(x,y)\Bigg\vert S\right)\right)\right).\end{align*}$$ We can apply Jensen's inequality, as $\sup$ is convex, to deduce that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S}}h(x,y)-\frac{1}{m}\sum_{(x,y)\in S}h(x,y)\Bigg\vert S\right)\right)\right)\leq\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S}}h(x,y)-\frac{1}{m}\sum_{(x,y)\in S}h(x,y)\right)\right).$$ Since $S$ and $\tilde{S}$ are independent samples from the same distribution, swapping $(x_i,y_i)$ with $(\tilde{x}_i,\tilde{y}_i)$ does not change the distribution of the supremum; introducing independent Rademacher signs $\sigma_i$ therefore leaves the expectation unchanged (using also that $-\sigma_i$ and $\sigma_i$ have the same distribution), so that $$\begin{align*}\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S}}h(x,y)-\frac{1}{m}\sum_{(x,y)\in S}h(x,y)\right)\right)&=\mathbb{E}_{\sigma\in\{\pm1\}^m}\mathbb{E}_{S\sim\mathcal{D}^m}\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S},\sigma_i\in\sigma}\sigma_ih(x,y)-\frac{1}{m}\sum_{(x,y)\in S,\sigma_i\in\sigma}\sigma_ih(x,y)\right)\right)\\&\leq\mathbb{E}_{\sigma\in\{\pm1\}^m}\mathbb{E}_{S\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\frac{1}{m}\sum_{(x,y)\in S,\sigma_i\in\sigma}\sigma_ih(x,y)\right)\right)+\mathbb{E}_{\sigma\in\{\pm1\}^m}\mathbb{E}_{\tilde{S}\sim\mathcal{D}^m}\left(\sup_{h\in\mathcal{F}}\left(\frac{1}{m}\sum_{(x,y)\in\tilde{S},\sigma_i\in\sigma}\sigma_ih(x,y)\right)\right)\\&=2\mathfrak{R}(\mathcal{F}),\end{align*}$$ which when substituted into our previous bounds completes the proof of the first statement. To obtain the second statement we note that $\hat{\mathfrak{R}}(\mathcal{F})$, viewed as a function of the sample, also satisfies the conditions of Theorem 6.8.1 with constants $c_i=\frac{1}{m}$. Therefore, a second application of Theorem 6.8.1, with a confidence level of $\frac{\delta}{2}$ used for each application, gives the desired result. $\square$
If we let $\mathcal{F}=\left\{(x,y)\mapsto\mathbb{I}\left(h_{\mathbf{w}}(x)[y]\leq\gamma+\max_{j\neq y}h_{\mathbf{w}}(x)[j]\right):\mathbf{w}\in\mathcal{W}\right\}$ then for any $\delta\in(0,1)$ and $\mathbf{w}\in\mathcal{W}$ we have that $$\mathbb{P}_{S\sim\mathcal{D}^m}\left(L_{\gamma}\left(h_{\mathbf{w}}\right)\leq\hat{L}_{\gamma}(h_{\mathbf{w}})+2\hat{\mathfrak{R}}(\mathcal{F})+3\sqrt{\frac{\log\left(\frac{2}{\delta}\right)}{m}}\right)\geq1-\delta.$$
Definition 6.9 (Rebeschini, 2022) Given a set $\mathcal{S}$ and a function $\rho:\mathcal{S}\times\mathcal{S}\to\mathbb{R}_+$, we call $(\mathcal{S},\rho)$ a pseudo-metric space if for all $x,y,z\in\mathcal{S}$ we have
- $\rho(x,y)=\rho(y,x)$,
- $\rho(x,z)\leq\rho(x,y)+\rho(y,z)$, and
- $\rho(x,x)=0$.
Definition 6.10 (Rebeschini, 2022) Let $(\mathcal{S},\rho)$ be a pseudo-metric space and let $\epsilon>0$. Then the set $\mathcal{C}\subseteq\mathcal{S}$ is an $\epsilon$-cover of $(\mathcal{S},\rho)$ if for every $x\in\mathcal{S}$ there is a $y\in\mathcal{C}$ such that $\rho(x,y)\leq\epsilon$. The set $\mathcal{C}$ is a minimal $\epsilon$-cover if there is no other $\epsilon$-cover with lower cardinality. The cardinality of any minimal $\epsilon$-cover is the $\epsilon$-covering number, denoted $N(\mathcal{S},\rho,\epsilon)$.
For a given training set $S=\{(x_i,y_i)\}_{i=1}^m$ we can consider the set $$\mathcal{G}=\{(f(x_1,y_1),\dots,f(x_m,y_m)):f\in\mathcal{F}\}\subseteq\mathbb{R}^m.$$
Theorem 6.11 (Lotz, 2020) Let $\mathcal{F}\subseteq\{f:\mathcal{Z}\to[0,1]\}$ and $S\sim\mathcal{D}^m$. Then $$\hat{\mathfrak{R}}(\mathcal{F})\leq\inf_{\epsilon>0}\left(\epsilon+\sqrt{\frac{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}{m}}\right).$$
Proof
Lemma 6.11.1 (Massart's Lemma) (Rebeschini, 2022) Let $\mathcal{T}\subseteq\mathbb{R}^n$ then we have that $$\mathfrak{R}(\mathcal{T})\leq\max_{t\in\mathcal{T}}\Vert t\Vert_2\frac{\sqrt{2\log\vert\mathcal{T}\vert}}{n}.$$
Proof (Scott(b), 2014)
For all $a>0$ we have that $$\begin{align*}\exp\left(a\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^n\sigma_it_i\right)\right)&=\exp\left(\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(a\sup_{t\in\mathcal{T}}\sum_{i=1}^n\sigma_it_i\right)\right)\\&\leq\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\exp\left(a\sup_{t\in\mathcal{T}}\sum_{i=1}^n\sigma_it_i\right)\right)\\&=\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\left(\exp\left(a\sum_{i=1}^n\sigma_it_i\right)\right)\right)\\&\leq\sum_{t\in\mathcal{T}}\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\exp\left(a\sum_{i=1}^n\sigma_it_i\right)\right),\end{align*}$$ where the first inequality uses Jensen's inequality and the second equality holds as $\exp(\cdot)$ is strictly monotonically increasing. The right-hand side is just an MGF which can be split into a product due to independence, hence $$\begin{align*}\exp\left(a\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^n\sigma_it_i\right)\right)&\leq\sum_{t\in\mathcal{T}}\prod_{i=1}^n\mathbb{E}_{\sigma_i}\left(\exp\left(a\sigma_it_i\right)\right)\\&\leq\sum_{t\in\mathcal{T}}\prod_{i=1}^n\exp\left(\frac{a^2(2t_i)^2}{8}\right),\end{align*}$$ where we get the inequality from Lemma 2.1.3. Therefore, $$\begin{align*}\exp\left(a\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^n\sigma_it_i\right)\right)&\leq\sum_{t\in\mathcal{T}}\exp\left(\frac{a^2}{2}\sum_{i=1}^nt_i^2\right)\\&\leq\sum_{t\in{\mathcal{T}}}\exp\left(\frac{a^2\max_{t\in\mathcal{T}}\Vert t\Vert^2}{2}\right)\\&=\exp\left(\frac{a^2\max_{t\in\mathcal{T}}\Vert t\Vert^2}{2}\right)\left\vert\mathcal{T}\right\vert.\end{align*}$$ Taking the logarithm of both sides and dividing by $a$ we get that $$\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\left(\sum_{i=1}^n\sigma_it_i\right)\right)\leq\frac{\log\left(\vert\mathcal{T}\vert\right)}{a}+\frac{a\max_{t\in\mathcal{T}}\Vert t\Vert^2}{2},$$ and optimizing over $a>0$ by taking $a=\frac{\sqrt{2\log\left(\vert\mathcal{T}\vert\right)}}{\max_{t\in\mathcal{T}}\Vert t\Vert}$ gives $$\mathbb{E}_{\sigma\in\{\pm1\}^n}\left(\sup_{t\in\mathcal{T}}\left(\sum_{i=1}^n\sigma_it_i\right)\right)\leq\max_{t\in\mathcal{T}}\Vert t\Vert\sqrt{2\log\left(\vert\mathcal{T}\vert\right)}.$$ Dividing by $n$ completes the proof of the lemma. $\square$
Let $T\subseteq\mathcal{G}$ be a minimal $\epsilon$-cover of $\mathcal{G}$, so that $\vert T\vert=N(\mathcal{G},\rho,\epsilon)$, and for each $g\in\mathcal{G}$ let $g^\prime\in T$ denote a cover element with $\rho(g,g^\prime)\leq\epsilon$. By Lemma 6.11.1 we have that $$\mathbb{E}_{\sigma\in\{\pm1\}^m}\left(\max_{g^\prime\in T}\frac{1}{m}\sum_{i=1}^m\sigma_ig^\prime(x_i,y_i)\right)\leq\max_{g^\prime\in T}\left\Vert\left(g^\prime(x_1,y_1),\dots,g^\prime(x_m,y_m)\right)\right\Vert_2\frac{\sqrt{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}}{m}\leq\sqrt{m}\frac{\sqrt{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}}{m}=\sqrt{\frac{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}{m}}.$$ Using this, and noting that $\hat{\mathfrak{R}}(\mathcal{F})=\hat{\mathfrak{R}}(\mathcal{G})$ by construction, we can conclude that $$\begin{align*}\hat{\mathfrak{R}}(\mathcal{G})&=\mathbb{E}_{\sigma\in\{\pm1\}^m}\left(\sup_{g\in\mathcal{G}}\left(\frac{1}{m}\sum_{i=1}^m\sigma_ig(x_i,y_i)\right)\right)\\&\leq\mathbb{E}_{\sigma\in\{\pm1\}^m}\left(\sup_{g\in\mathcal{G}}\left(\frac{1}{m}\sum_{i=1}^m\sigma_i\left(g(x_i,y_i)-g^\prime(x_i,y_i)\right)\right)\right)+\mathbb{E}_{\sigma\in\{\pm1\}^m}\left(\max_{g^\prime\in T}\left(\frac{1}{m}\sum_{i=1}^m\sigma_ig^\prime(x_i,y_i)\right)\right)\\&\leq\sup_{g\in\mathcal{G}}\left(\frac{1}{m}\sum_{i=1}^m\vert g(x_i,y_i)-g^\prime(x_i,y_i)\vert\right)+\sqrt{\frac{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}{m}}\\&\leq\sup_{g\in\mathcal{G}}\rho\left(\left(g(x_1,y_1),\dots,g(x_m,y_m)\right),\left(g^\prime(x_1,y_1),\dots,g^\prime(x_m,y_m)\right)\right)+\sqrt{\frac{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}{m}}\\&\leq\epsilon+\sqrt{\frac{2\log\left(N(\mathcal{G},\rho,\epsilon)\right)}{m}}.\end{align*}$$ Since this holds for all $\epsilon>0$, taking the infimum over $\epsilon$ completes the proof of the theorem. $\square$