An oft-repeated claim is that you can't train a neural network exactly, i.e., find the global optimum of the loss function. I've heard a variety of arguments here, including that the loss function is non-convex, or that neural networks are too "complex". I don't find any of these arguments convincing. In fact, it is possible to train a neural network exactly!
I will consider a multilayer perceptron (MLP) with ReLU activation functions, trained using the mean squared error loss; I will ignore softmax layers. In this setup, it is easy to see that the output of the network is a piecewise-linear function of the input: the network is a composition of linear maps and the piecewise-linear ReLU.
The key point here is that if we let the weights vary, then the output is in fact a piecewise-polynomial function of the input and the weights. A piecewise-polynomial function can be reformulated as a system of polynomial (in)equalities (with additional variables), as we will see below for the ReLU. Hence, the forward pass of the network, for a given input, can be formulated as such a system. If we write down the forward passes over a whole dataset, we get one big system of polynomial (in)equalities.
Then, we can write down the loss function as a polynomial. We can minimize it exactly, subject to the constraints defining the forward passes, using cylindrical algebraic decomposition (CAD). CAD is a powerful method for solving systems of polynomial (in)equalities, and it is implemented in computer algebra systems like Mathematica.
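To get a feel for what CAD produces, here is a tiny example, unrelated to neural networks: decomposing the open unit disk. The point is that the solution set gets described by conditions that are "cylindrical" in the variable ordering \(x, y\): first a condition on \(x\) alone, then a condition on \(y\) whose bounds depend on \(x\).

(* CAD of the unit disk: x ranges freely in (-1, 1), and for each such x,
   y ranges in an interval whose endpoints are algebraic functions of x *)
CylindricalDecomposition[x^2 + y^2 < 1, {x, y}]
(* -1 < x < 1 && -Sqrt[1 - x^2] < y < Sqrt[1 - x^2] *)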
Suppose we have a two-layer MLP: \(f(x) = W_2 \max(0, W_1 x ) \), where \(W_1 \in \mathbb{R}^{m \times n}\), \(W_2 \in \mathbb{R}^{p \times m}\) (for simplicity, I assume no bias). We will consider a dataset of \(N\) points \((x_i, t_i)\), where \(x_i \in \mathbb{R}^n\) and \(t_i \in \mathbb{R}^p\). Define the output of the network as \(y_i = f(x_i)\). We will train the network using the mean squared error loss, ignoring the \(1/N\) factor: \(\mathcal{L}(W_1, W_2) = \sum_{i=1}^N \|y_i - t_i\|^2\).
The non-linearity introduced by the ReLU is what needs special handling here. We will use the following trick: \(z = \max(0, x)\) is equivalent to \(z \geq 0, z - x \geq 0, z(z-x) = 0\). From the last constraint, we have \(z = 0\) or \(z = x\). If \(x < 0\), then \(z = x\) would violate \(z \geq 0\), so \(z = 0\). If \(x > 0\), then \(z = 0\) would violate \(z - x \geq 0\), so \(z = x\) (and for \(x = 0\) the two cases coincide). Notably, we have reformulated the ReLU as a system of polynomial (in)equalities.
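As a sanity check, we can ask Mathematica to solve this system for \(z\); it should return exactly the two branches of the ReLU:

(* The constraint system picks out exactly z == Max[0, x] *)
Reduce[z >= 0 && z - x >= 0 && z (z - x) == 0, z, Reals]
(* expected: something equivalent to (x <= 0 && z == 0) || (x > 0 && z == x) *)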
We can now apply this trick to the forward pass of the network, since the ReLU is taken component-wise. Let's develop this concretely. Let \(i\) index data points, and \(j \in \{1, \ldots, m\}\) index hidden units. For each data point \(i\), we introduce hidden activations \(h_i \in \mathbb{R}^m\).
For the hidden activations, we apply the ReLU component-wise, so, using the trick from above, we have \(h_{i,j} \geq 0, h_{i,j} - \sum_{k=1}^n (W_1)_{j,k} (x_i)_k \geq 0, h_{i,j}(h_{i,j} - \sum_{k=1}^n (W_1)_{j,k} (x_i)_k) = 0\). Then, we apply the second linear layer: \(y_i = W_2 h_i\). We will eliminate \(y_i\) and just directly use the hidden activations in the loss function. So, the loss function is \(\sum_{i=1}^N \sum_{l=1}^p (\sum_{j=1}^m (W_2)_{l,j} h_{i,j} - t_{i,l})^2\). Overall, our optimization problem, with the \(h_{i,j}\) as auxiliary decision variables alongside the weights, is: \[ \begin{align} & \min_{W_1, W_2, h} \sum_{i=1}^N \sum_{l=1}^p \left(\sum_{j=1}^m (W_2)_{l,j} h_{i,j} - t_{i,l}\right)^2 \\ & \text{s.t.} \\ & h_{i,j} \geq 0, \quad \forall i \in \{1, \ldots, N\}, j \in \{1, \ldots, m\} \\ & h_{i,j} - \sum_{k=1}^n (W_1)_{j,k} (x_i)_k \geq 0, \quad \forall i \in \{1, \ldots, N\}, j \in \{1, \ldots, m\} \\ & h_{i,j} \left( h_{i,j} - \sum_{k=1}^n (W_1)_{j,k} (x_i)_k \right) = 0, \quad \forall i \in \{1, \ldots, N\}, j \in \{1, \ldots, m\} \end{align} \]
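To make the mechanical nature of this construction concrete, here is a minimal sketch that builds the loss, constraints, and variable list for the two-layer case from a dataset. The function name buildProblem and the variable heads ww1, ww2, hh are my own choices, not anything standard:

(* Sketch: build the two-layer problem programmatically.
   xs: list of input vectors, ts: list of target vectors, m: hidden width *)
buildProblem[xs_, ts_, m_] :=
  Module[{n = Length[First[xs]], p = Length[First[ts]], nn = Length[xs],
      W1, W2, h, pre, loss, cons},
    W1 = Array[ww1, {m, n}];            (* first-layer weights ww1[j, k] *)
    W2 = Array[ww2, {p, m}];            (* second-layer weights ww2[l, j] *)
    h = Array[hh, {nn, m}];             (* hidden activations hh[i, j] *)
    pre = Table[W1 . xs[[i]], {i, nn}]; (* pre-activations W1 x_i *)
    loss = Sum[(W2 . h[[i]] - ts[[i]]) . (W2 . h[[i]] - ts[[i]]), {i, nn}];
    cons = Flatten @ Table[
       {h[[i, j]] >= 0,
        h[[i, j]] - pre[[i, j]] >= 0,
        h[[i, j]] (h[[i, j]] - pre[[i, j]]) == 0}, {i, nn}, {j, m}];
    {loss, cons, Flatten[{W1, W2, h}]}];

Calling buildProblem[{{1}, {2}, {3}}, {{1}, {4}, {9}}, 1] and feeding the resulting loss and constraints to Minimize should reproduce the hand-written toy example below, with ww1[1, 1] playing the role of w1, and so on.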
It is easy to see that this can be mechanically generalized to more layers: each hidden layer gets its own block of auxiliary activation variables and ReLU constraints, as sketched below.
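Concretely, with notation I'm introducing just for this sketch: write the \(L\)-layer network as \(f(x) = W_L \max(0, W_{L-1} \max(0, \cdots W_1 x))\), set \(h^{(0)}_i = x_i\), and give each hidden layer its own activation variables \(h^{(l)}_i \in \mathbb{R}^{d_l}\): \[ \begin{align} & \min_{W_1, \ldots, W_L,\, h} \sum_{i=1}^N \left\| W_L h^{(L-1)}_i - t_i \right\|^2 \\ & \text{s.t.} \quad h^{(l)}_{i,j} \geq 0, \quad h^{(l)}_{i,j} - \left( W_l h^{(l-1)}_i \right)_j \geq 0, \quad h^{(l)}_{i,j} \left( h^{(l)}_{i,j} - \left( W_l h^{(l-1)}_i \right)_j \right) = 0 \end{align} \] for all data points \(i\), hidden layers \(l \in \{1, \ldots, L-1\}\), and units \(j \in \{1, \ldots, d_l\}\).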
In order to optimize this, we could use a variety of nonlinear optimization methods, but most of these rely on approximations, or on converting constraints to penalty terms. We will instead use CAD! Besides solving systems of polynomial (in)equalities, CAD can perform quantifier elimination over Boolean combinations of polynomial (in)equalities, and quantifier elimination is what lets us optimize exactly. In fact, if you pass a polynomial optimization problem to Mathematica with Minimize, it will use CAD to solve it exactly. Namely, it does the following. Suppose we wish to minimize \(f(x)\) subject to \(\text{constraints}(x)\), where \(f(x)\) and \(\text{constraints}(x)\) are polynomial. Mathematica internally converts this into the quantifier elimination problem \( \inf \{ \tau \mid \exists x, \text{constraints}(x) \land f(x) \leq \tau \}\). Eliminating the quantifier leaves conditions on \(\tau\) alone, from which we can read off the minimum value of \(f(x)\).
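Here is the same maneuver done by hand on a one-variable toy problem I made up for illustration, using Resolve to perform the quantifier elimination:

(* Minimize x^2 + 1 subject to x >= 2, via quantifier elimination:
   eliminate x, leaving a condition on the bound t alone *)
Resolve[Exists[x, x >= 2 && x^2 + 1 <= t], Reals]
(* t >= 5, so the minimum value is 5 (attained at x == 2) *)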
Let's consider an (extremely!) small example. Suppose we have a two-layer MLP with ReLU, as above. Suppose our inputs and outputs are in \(\mathbb{R}^1\), and we have one hidden unit, i.e., the weights of the neural network are just two scalars \(w_1, w_2\). Suppose our dataset is \((x_i, t_i) = (i, i^2)\), for \(i = 1, 2, 3\) (i.e., we are fitting a quadratic function). We can write down the optimization problem, and solve it in Mathematica, as follows:
(* Given data *)
x1 = 1;
x2 = 2;
x3 = 3;
t1 = 1;
t2 = 4;
t3 = 9;
(* Variables *)
vars = {w1, w2, h1, h2, h3};
(* Loss *)
loss =
(w2 h1 - t1)^2 +
(w2 h2 - t2)^2 +
(w2 h3 - t3)^2;
(* Forward-pass constraints *)
constraints = {
(* point 1 *)
h1 >= 0,
h1 - w1 x1 >= 0,
h1 (h1 - w1 x1) == 0,
(* point 2 *)
h2 >= 0,
h2 - w1 x2 >= 0,
h2 (h2 - w1 x2) == 0,
(* point 3 *)
h3 >= 0,
h3 - w1 x3 >= 0,
h3 (h3 - w1 x3) == 0
};
Minimize[{loss, constraints}, vars]
Mathematica returns the following solution (the first number is the minimum value, and the second is a list of the values of the variables):
\[
\left\{\frac{38}{7},\left\{w_1\mapsto\frac{1}{3},w_2\mapsto\frac{54}{7},h_1\mapsto\frac{1}{3},h_2\mapsto\frac{2}{3},h_3\mapsto1\right\}\right\}
\]
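We can sanity-check this by substituting the minimizer back into the loss:

(* Substitute the minimizer back into the loss; should reproduce 38/7 *)
loss /. {w1 -> 1/3, w2 -> 54/7, h1 -> 1/3, h2 -> 2/3, h3 -> 1}
(* 38/7 *)

Note also that \(h_i = w_1 x_i\) for every data point, i.e., every ReLU sits on its active branch at this optimum.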
Now, I won't mince words: this method is totally impractical for anything even a little bit larger! CAD's complexity is doubly exponential in the number of variables. We can work out the number of variables in our problem. Again, let's consider the two-layer MLP with ReLU. For \(W_1\), we have \(n \times m\) variables. For \(W_2\), we have \(m \times p\) variables. For the hidden activations, we have \(N \times m\) variables. So, in total, we have \(m (n + p + N)\) variables. As a check: in the toy example above, \(n = p = m = 1\) and \(N = 3\), giving \(1 \cdot (1 + 1 + 3) = 5\) variables, matching the five entries of vars.
We can generalize this to an \(L\)-layer MLP. Suppose the input dimension is \(d_0\), with hidden dimensions \(d_1, \ldots, d_{L-1}\), and output dimension \(d_L\). The total number of weight variables is \( \sum_{i=0}^{L-1} d_i d_{i+1} \). For hidden activations, we have \( N \sum_{l=1}^{L-1} d_l \). So, in total, we have \( \sum_{i=0}^{L-1} d_i d_{i+1} + N \sum_{l=1}^{L-1} d_l \) variables.
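Here is that count as a one-liner, a direct transcription of the formula (numVars is just a name I'm picking; ds is the width list \(\{d_0, \ldots, d_L\}\)):

(* Number of variables for an L-layer MLP with widths ds = {d0, ..., dL}
   and nData training points: weights plus hidden activations *)
numVars[ds_, nData_] :=
  Sum[ds[[i]] ds[[i + 1]], {i, Length[ds] - 1}] + nData Total[ds[[2 ;; -2]]];
numVars[{1, 1, 1}, 3]  (* the toy example above: 5 variables *)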
Now, because CAD has complexity on the order of \( 2^{2^{\text{\# vars}}} \), this procedure is totally infeasible for all but the tiniest of problems, like the one I worked out above. For scale, taking that bound at face value, even our five-variable toy example gives \(2^{2^5} = 2^{32} \approx 4 \times 10^9\).
Is there any easier way to exactly train a neural network? Likely not! It has been shown that exact training of a neural network is \(\exists \mathbb{R}\)-complete, i.e., complete for the complexity class whose defining problem is deciding the solvability of a system of polynomial (in)equalities over the reals. The reformulation I gave above shows that the problem indeed lies in \( \exists \mathbb{R}\). The paper proving this result also shows the other direction, i.e., that it is \(\exists \mathbb{R}\)-hard. Hence, exact training of a neural network is \(\exists \mathbb{R}\)-complete.