Investigation of Deep Neural Networks From Control and Optimization Perspectives

Open Access
- Author:
- Saab, Samer
- Graduate Program:
- Electrical Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- July 12, 2021
- Committee Members:
- Minghui Zhu, Co-Chair & Dissertation Advisor
- Shashi Phoha, Co-Chair of Committee
- Puneet Singla, Outside Unit & Field Member
- Nilanjan Ray Chaudhuri, Major Field Member
- Asok Ray, Major Field Member & Dissertation Advisor
- Kultegin Aydin, Program Head/Chair
- Keywords:
- Deep Learning
- Neural Networks
- Optimization
- Control Theory
- First-order Gradient Descent
- Heavy-ball Optimizer
- Recurrent Neural Networks
- State-space Representations
- Abstract:
- Deep neural networks (NNs) have set the state of the art on a diverse array of machine learning tasks. However, the models that achieve such performance typically involve an arbitrarily large number of parameters and thus require extensive, often expensive, training. Moreover, much remains to be understood about why the structure of NNs makes them such efficient function approximators. This dissertation reinterprets deep NNs, and their gradient-descent (GD) based optimization algorithms, as dynamical systems, enabling the transfer of techniques and knowledge from the well-established field of control theory. With this reinterpretation, the work in this dissertation identifies and develops efficient optimization algorithms for the training process of a NN, and aids in understanding the internal mechanisms of a NN, shedding light on what is often considered a black-box algorithm. First, the identification and development of enhanced first-order GD optimizers is addressed. A general class of first-order GD training algorithms is reformulated as a dynamical system with a closed-loop transfer function, which enables techniques from control theory to map certain control laws to their equivalent first-order GD optimizers. Root-locus analysis is then used to design suitable control laws for fast convergence, and Jury's stability conditions are used to identify the range of hyper-parameters that yields optimal convergence. It is shown that for a strictly convex quadratic cost function, a first-order compensator yields the fastest convergence rate; this first-order control law is the heavy-ball (HB) method, also known as momentum, equipped with the Polyak optimal hyper-parameters.
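As a concrete illustration of the Polyak-tuned HB method on a quadratic, the minimal NumPy sketch below (an illustration under standard assumptions, not code from the dissertation) iterates x_{k+1} = x_k − α∇f(x_k) + β(x_k − x_{k−1}) on f(x) = ½xᵀAx, with the classical Polyak hyper-parameters α = 4/(√L + √μ)² and β = ((√L − √μ)/(√L + √μ))², where μ and L are the smallest and largest eigenvalues of A:

```python
import numpy as np

def heavy_ball(A, x0, iters=200):
    """Heavy-ball (momentum) on the quadratic f(x) = 0.5 x^T A x,
    with Polyak's optimal hyper-parameters for a strictly convex A."""
    eigs = np.linalg.eigvalsh(A)              # ascending eigenvalues of symmetric A
    mu, L = eigs[0], eigs[-1]                 # strong-convexity and smoothness constants
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(iters):
        grad = A @ x                          # gradient of the quadratic
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x

A = np.diag([1.0, 10.0, 100.0])               # mu = 1, L = 100 (condition number 100)
x = heavy_ball(A, np.ones(3))                 # converges to the minimizer x* = 0
```

With these hyper-parameters the asymptotic linear rate is √β = (√L − √μ)/(√L + √μ), which for condition number 100 is about 0.82 per iteration, far faster than plain GD's (L − μ)/(L + μ) ≈ 0.98.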
The value of identifying the optimizer, and its optimal hyper-parameters, that yields an optimal convergence rate lies in the resulting reduction of training time and cost. With this in mind, two optimizers are developed. Building on the result that the HB optimizer is optimal for quadratic functions, a novel adaptive HB algorithm is developed that iteratively estimates the Polyak optimal hyper-parameters; the objective function in this case is represented by time-varying quadratic functions. In the deterministic setting, this approach establishes global linear convergence for L-smooth and strictly convex objective functions, while in the stochastic setting, convergence is established for L-smooth non-convex functions with bounded gradients. The second optimizer is a multivariate adaptive GD method with small per-iteration costs, fast convergence rates, and reduced tuning. It updates each element of the model parameters separately in a computationally efficient manner using an adaptive vector-form learning rate, computed as the absolute difference between the current and previous model parameters divided by the difference between the subgradients of the current and previous state estimates. In the deterministic setting, convergence is established for L-smooth and strongly convex cost functions, while in the stochastic setting, convergence is established for non-convex L-smooth cost functions. Both optimizers are validated in numerous experiments, yielding performance competitive with state-of-the-art adaptive optimizers on popular image-classification data sets and outperforming them on the non-convex Beale function. Both methods are shown to require minimal to no tuning.
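The element-wise adaptive learning rate described above can be sketched as follows. This is a simplified illustration of the general idea (per-coordinate ratio of parameter differences to gradient differences), not the dissertation's exact algorithm; the seeding step and the small constant `eps` guarding the division are assumptions added here:

```python
import numpy as np

def adaptive_gd(grad, x0, eta0=1e-3, iters=300, eps=1e-12):
    """Sketch of a coordinate-wise adaptive step size:
    eta_i = |x_i - x_i_prev| / (|g_i - g_i_prev| + eps), applied element-wise."""
    x_prev = x0.copy()
    g_prev = grad(x_prev)
    x = x_prev - eta0 * g_prev                # one plain GD step to seed the differences
    for _ in range(iters):
        g = grad(x)
        # vector-form learning rate: one step size per parameter element
        eta = np.abs(x - x_prev) / (np.abs(g - g_prev) + eps)
        x, x_prev, g_prev = x - eta * g, x, g
    return x

A = np.diag([1.0, 10.0, 100.0])               # badly scaled quadratic
x = adaptive_gd(lambda v: A @ v, np.ones(3))  # each coordinate learns its own step
```

On this diagonal quadratic each coordinate's step size settles near 1/A_ii, the ideal per-coordinate value, which is the kind of automatic scaling that reduces the need for manual tuning.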
Second, to better understand the inner mechanisms of a NN model, the dynamical-system reinterpretation is extended to entire NN architectures, including both static NNs and recurrent NNs (RNNs). For static feedforward NNs, it is shown that introducing k-many skip connections into a network architecture defines kth-order dynamical equations on the layer-wise transformations. State-space representations are found for general kth-order additive dense networks, in which the concatenation operation is replaced by addition, as well as for kth-order smooth networks. The effect of k-many skip connections on the inner mechanisms of a NN can thus be better understood. For example, it is shown that imposing kth-order smoothness on a network architecture with d-many nodes per layer increases the state-space dimension by a multiple of k, so the effective dimension in which the network embeds the data manifold is k · d. Moreover, this representation reveals that the neural activation functions act as a controller of the system as it moves forward through the layers; it can therefore be concluded that the GD algorithm is learning a controller that maps the data from input to output. Similarly, for RNNs, skip connections are used to obtain a state-space representation of the hidden-state data flow, to which techniques from control theory are applied for desired outcomes. A novel RNN design, called the dynamically stabilized RNN (DSRNN), is developed; it uses weighted skip connections, where the weights are learnable parameters, leading to a state-space representation. Because the stability of the hidden states largely dictates whether an RNN can overcome the problems of long-term unpredictability and vanishing or exploding gradients, a controller is designed in the form of a regularization term to stabilize the hidden states.
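The correspondence between skip connections and dynamical-system order can be illustrated with a toy additive block for k = 2. The weights and tanh activation here are hypothetical stand-ins for a real layer; the point is only that a 2-skip additive block is a second-order difference equation in x, equivalent to a first-order system on a stacked state of dimension k · d = 2d:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 0.1 * rng.standard_normal((d, d))         # hypothetical shared layer weights

def f(x):
    return np.tanh(W @ x)                     # layer-wise transformation

def layer(x_t, x_tm1):
    # 2-skip additive block: x_{t+1} = x_t + x_{t-1} + f(x_t),
    # a second-order difference equation on the layer activations
    return x_t + x_tm1 + f(x_t)

def state_step(z):
    # equivalent first-order system on the stacked state z = (x_t, x_{t-1});
    # the state-space dimension is k * d = 2d
    x_t, x_tm1 = z[:d], z[d:]
    return np.concatenate([layer(x_t, x_tm1), x_t])

x_tm1, x_t = rng.standard_normal(d), rng.standard_normal(d)
z = np.concatenate([x_t, x_tm1])
for _ in range(5):                            # iterate both forms through 5 layers
    x_t, x_tm1 = layer(x_t, x_tm1), x_t
    z = state_step(z)
# the two formulations produce identical trajectories: z[:d] equals x_t
```

The stacked state makes the k · d effective embedding dimension explicit: tracking the forward pass requires both the current and the previous layer's activations.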
The regularizer enables placement of the eigenvalues of the (linearized) transfer-function matrix, thereby acting as an internal controller for the hidden-state trajectories. The DSRNN is validated on forecasting the motion of a double pendulum, a chaotic system that exhibits long-term unpredictability, and its performance is compared against an LSTM and a vanilla RNN.
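A regularizer of this flavor might be sketched as follows. The exact form of the DSRNN penalty is not reproduced here, so treat this as a hypothetical illustration of the general idea: penalizing eigenvalues of the linearized hidden-state transition matrix that fall outside a chosen stability disc:

```python
import numpy as np

def stability_regularizer(W_h, rho=0.9):
    """Hypothetical penalty pushing the eigenvalues of the (linearized)
    hidden-state transition matrix W_h inside a disc of radius rho,
    so hidden-state trajectories neither explode nor vanish too fast."""
    mags = np.abs(np.linalg.eigvals(W_h))         # eigenvalue magnitudes
    excess = np.maximum(mags - rho, 0.0)          # only penalize unstable modes
    return float(np.sum(excess ** 2))

stable = stability_regularizer(0.5 * np.eye(3))   # all eigenvalues at 0.5: no penalty
unstable = stability_regularizer(2.0 * np.eye(3)) # eigenvalues at 2.0: penalized
```

Added to the training loss, such a term nudges GD toward recurrent weights whose linearization is stable, which is the mechanism by which a regularizer can act as an internal controller of the hidden-state dynamics.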