An Introduction to Neural Network Generalization

In the summer of 2023 I completed an undergraduate research project under the supervision of Professor Alessio Lomuscio. The aim of the project was to investigate the generalization of deep neural networks. I started the project by surveying the different approaches taken to understand this phenomenon. I then focused my attention on statistical guarantees of network generalization via Probably Approximately Correct (PAC) bounds. Below one can find a few reports I compiled on the general literature regarding neural network generalization. The main focus of the project resulted in the report on using region testing to evaluate PAC bounds, and the guide on PAC bounds for neural networks. Below one can also find my introduction to the problem of neural network generalization as well as my conclusions from surveying the literature.

Introduction to Neural Network Generalization

Generalization is the ability to extrapolate concepts in a manner that is consistent with unseen data. For a deep neural network, this means learning representations of the training data that not only capture the information in that data but also enable good inferences on unseen data. The capacity of deep neural networks to generalize to unseen data has been a major focus of enquiry. Over the last few years, investigations have sought to understand why deep neural networks do not overfit their training samples despite having far more parameters than training samples. Some investigations approach the problem from a theoretical perspective, whilst others take empirical approaches. From the theoretical perspective, assumptions are made about the underlying network that facilitate the application of mathematical arguments to explain this phenomenon. Although this approach sheds light on the underlying processes, the assumptions are often rather strict and lead to results that are vacuous in practice. Empirical work, on the other hand, tries to make sense of the phenomenon by performing controlled experiments. The results are largely correlational and are not grounded in the way the theoretical results are. However, they provide insight into how the effect can be exploited and amplified to construct well-motivated networks with enhanced performance.

The generalization error of a deep neural network is the difference between its performance on the training data and its performance on unseen test data. Both theoretical and empirical investigations have led to quantitative and qualitative insights into the ability of deep neural networks to learn from training samples in a manner that generalizes well to unseen data. Throughout my investigations, bounds are tested and explored in the simplified setting of fully connected ReLU neural networks, with the accompanying code being available here. The bounds are applied to neural networks performing classification tasks. Each network receives training data, from which it is to learn a classification function that accurately partitions the data into their respective classes.
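To make the definition concrete, here is a minimal sketch that measures the gap for a classifier. It assumes performance is measured by the 0-1 error rate (the fraction of misclassified examples), which is one common choice and not necessarily the one used in the accompanying code. The function names are hypothetical.

```python
import numpy as np

def error_rate(predictions, labels):
    """Fraction of examples whose predicted class differs from the label."""
    return float(np.mean(np.asarray(predictions) != np.asarray(labels)))

def generalization_gap(train_pred, train_labels, test_pred, test_labels):
    """Test error minus training error (assumed 0-1 error)."""
    return error_rate(test_pred, test_labels) - error_rate(train_pred, train_labels)

# Toy example: the classifier fits the training set perfectly
# but misclassifies two of the four test examples.
gap = generalization_gap(
    train_pred=[0, 1, 1, 0], train_labels=[0, 1, 1, 0],
    test_pred=[0, 1, 0, 0], test_labels=[0, 1, 1, 1],
)
```

A gap close to zero indicates that training performance is a reliable proxy for test performance. A large positive gap is the signature of overfitting.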

Fully connected neural networks consist of stacked layers of hidden units, where each hidden unit is connected to every hidden unit in the previous layer and to every hidden unit in the subsequent layer. The parameters of the neural network dictate how connected hidden units interact. Usually, each connection between hidden units has an associated weight, and each hidden unit has an associated bias. The value held at a hidden unit is propagated to the next layer by multiplying it by the weight of each of the branching connections. The value at a hidden unit is then the value of an activation function applied to the sum of the weighted inputs from the previous layer that are incident to that unit, plus the unit's bias. Typically, the activation function is non-linear, which increases the network's ability to learn complex functions.
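The forward computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's code: it assumes the usual convention of one weight per connection and one bias per unit, ReLU activations on the hidden layers, and a linear output layer. The 2-3-2 network and its parameter values are made up for the example.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(z, 0) applied elementwise."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Propagate input x through a fully connected ReLU network.

    weights[k] has shape (units in layer k+1, units in layer k);
    biases[k] has one entry per unit in layer k+1.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)               # weighted sum plus bias, then activation
    return weights[-1] @ a + biases[-1]   # linear output layer (class scores)

# Hand-picked parameters for a tiny 2-3-2 network.
weights = [
    np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]),
    np.array([[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]),
]
biases = [np.array([0.0, -1.0, 0.5]), np.array([0.1, -0.2])]

scores = forward(np.array([1.0, 2.0]), weights, biases)
predicted_class = int(np.argmax(scores))  # class with the highest score
```

For classification, the predicted class is taken to be the index of the largest output score.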

The learning algorithm is dictated by a loss function, which is designed to quantify the quality of the network. Gradient-based methods are the primary methodology used to train networks set up in this way. The loss function defines a surface over the parameter space, whose dimension equals the number of parameters of the network. Finding the global minimum of this surface is desirable, as it corresponds to the set of parameters that maximizes a particular notion of quality. Through training, the parameters of the network are altered to try to reach the lower regions of the loss landscape. Gradient-based methods do this by observing how the loss at the training examples changes under perturbations of the parameters, which determines the direction in which the parameters should be moved to descend the loss landscape. Computing the gradient over every training example at each step is computationally expensive, so gradients are instead calculated on small batches of examples. These batches reduce computational cost but also introduce noise into the algorithm, which can help it escape local minima of the landscape and find more stable regions. This procedure is called Stochastic Gradient Descent (SGD) and has a central role in endowing neural networks with their generalization capacities.
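The batching procedure can be sketched as follows. To keep the gradient in closed form, the sketch swaps the neural network for a linear model trained with mean squared error. This is a deliberate simplification and not the setting studied in the project, but the update loop has the same shape. The function name, learning rate, batch size and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sgd_linear(X, y, lr=0.05, batch_size=8, epochs=50, seed=0):
    """Minibatch SGD on mean squared error for the linear model y ~ X @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))       # reshuffle the data every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the batch-averaged squared error with respect to w.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad                    # step down the loss landscape
    return w

# Synthetic, noiseless data generated from known parameters.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = sgd_linear(X, y)
```

Each step uses only a random subset of the data, so the gradient is a noisy estimate of the full gradient. That noise is the stochastic element referred to above.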

Throughout my investigations, I have seen ways to formally set up the problem to rigorously derive probabilistic bounds on how the learned function performs on test data, how structural properties can be exploited to compress networks whilst still maintaining performance, and how ideas such as information theory can be applied to gain insight into the underlying processes of the learning algorithm. Empirically, metrics are derived to give a heuristic for how well the network is generalizing throughout the training procedure, and they are probed to understand how gradient-based learning algorithms lead to effective representations of the training data.
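As one concrete instance of a probabilistic bound, consider the textbook PAC bound for a finite hypothesis class, obtained from Hoeffding's inequality and a union bound: with probability at least 1 - delta over a sample of m examples, every hypothesis h in the class H satisfies test_error(h) <= train_error(h) + sqrt((ln|H| + ln(1/delta)) / (2m)). This is not one of the bounds studied in the reports, only a simple stand-in. The sketch below evaluates it and shows how it becomes vacuous for a heavily parameterized model. The parameter count, the 32-bit precision and the sample size are assumptions chosen for illustration.

```python
import math

def finite_class_bound(train_error, log_num_hypotheses, m, delta=0.05):
    """Upper bound on test error that holds with probability >= 1 - delta.

    log_num_hypotheses is ln|H|, the natural log of the number of hypotheses.
    """
    slack = math.sqrt((log_num_hypotheses + math.log(1.0 / delta)) / (2.0 * m))
    return train_error + slack

# A small class of 1000 hypotheses and 50,000 training examples.
small = finite_class_bound(0.02, math.log(1000), m=50_000)

# A network with one million parameters stored at 32 bits each:
# |H| = 2 ** (32 * 1e6), so ln|H| = 32 * 1e6 * ln 2.
large = finite_class_bound(0.02, 32 * 1_000_000 * math.log(2), m=50_000)
```

In the second case the bound exceeds 1, so it says nothing about an error rate that can never be larger than 1. This is the sense in which such bounds are called vacuous.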

Conclusions on Neural Network Generalization

A combination of theoretical and empirical techniques has been employed to try to explain deep neural networks' impressive capacity to generalize. The Bayesian machine learning framework is useful for formalizing learning algorithms in a probabilistic setting. It enables the construction of probabilistic bounds that incorporate the inherent randomness of the training data. Despite giving a statistical perspective on deep neural networks, in practice the results are vacuous and require considerable fine-tuning to be useful. The framework falls short of explaining the phenomena seen in practice. It is a technique for investigating more traditional learning algorithms that were developed before the advent of deep neural networks, and it has struggled to adapt to the current state of the field.

The information-theoretic approach is motivated by the transfer of information from the dataset to the network and has yielded remarkable insights into the stages of the training process. It has provided a theoretical setting in which to investigate the training of deep neural networks. It focuses on the information contained in the data and does not take into account the architecture or the learning algorithm. In large part, the success of deep neural networks is attributed to the architecture and the learning algorithm, and hence the information-theoretic framework is perhaps insufficient to fully understand deep neural networks.

The most promising and practical approach involves investigating the gradients of the network. Computing the gradient of the network's output with respect to the parameters, or the gradient of the loss with respect to the parameters at training examples, encapsulates information about the learning algorithm and the training data. In the case of the neural tangent kernel, the gradient is that of the network's output with respect to the weights and does not involve the loss function. Therefore, it reveals what the network has learned but lacks detail on generalization. On the other hand, coherence and stiffness involve gradients of the loss function and hence can be directly linked to generalization. Using gradients appeals to the central component of training deep neural networks, stochastic gradient descent. This is how the network learns, and the learned structures are what give rise to the network's generalizing abilities. The gradient-based techniques discussed in this survey are agnostic to any properties of the network and so encounter few issues when employed in practice.
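The common ingredient of coherence and stiffness is the comparison of loss gradients taken at different examples. The sketch below computes the cosine similarity between two per-example loss gradients: a positive value means a gradient step that helps one example also helps the other. The precise definitions in the literature differ in detail, so this should be read as an illustration of the idea only. A logistic-regression model stands in for the network so that the gradient has a closed form. The function names and numerical values are hypothetical.

```python
import numpy as np

def per_example_grad(w, x, y):
    """Gradient of the logistic loss at a single example (x, y), y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))   # predicted probability of class 1
    return (p - y) * x

def gradient_cosine(g1, g2):
    """Cosine similarity between two loss gradients."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

w = np.array([0.5, -0.5])
g_a = per_example_grad(w, np.array([1.0, 0.0]), 1)
g_b = per_example_grad(w, np.array([2.0, 0.0]), 1)  # same class, similar input
g_c = per_example_grad(w, np.array([1.0, 0.0]), 0)  # same input, other class

aligned = gradient_cosine(g_a, g_b)   # gradients point the same way
opposed = gradient_cosine(g_a, g_c)   # gradients point in opposite directions
```

Because only gradients are required, the same computation applies to any differentiable model, which is what makes these techniques agnostic to the properties of the network.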

The topological perspective is a sophisticated approach to understanding how the geometrical properties of the training data evolve through the layers of the neural network. The approach reveals why certain architectural properties of the network are effective by showing how they manipulate the structure of the data. A lot of insight has been gained from trying to understand the various aspects of a deep neural network from a geometrical perspective, and the topological arguments explored in this survey provide a compelling framework for doing so. A limitation, however, is that implementing the framework in practice is computationally expensive and that it neglects the properties of the learning algorithm.

Future Work

It has been noted that gradients of the loss function at different parameter values can be effectively correlated with generalization, as they are a rich source of information on the training process of deep neural networks. However, in the form in which they are explored in this survey, they lack a strong geometrical analysis and do not explicitly involve any information on the network's architecture. Future work in the field could investigate stiffness and coherence in a topological framework to develop notions that involve the data, the learning algorithm and the architecture.