Understanding Short-Horizon Bias in Stochastic Meta-Optimization

Yuhuai Wu*, Mengye Ren*, Renjie Liao & Roger B. Grosse
University of Toronto and Vector Institute
{ywu, mren, rjliao, rgrosse}@cs.toronto.edu
*Equal contribution.
Abstract

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes. Code available at https://github.com/renmengye/meta-optim-public.

1 Introduction

The learning rate is one of the most important and frustrating hyperparameters to tune in deep learning. Too small a value causes slow progress, while too large a value causes fluctuations or even divergence. While a fixed learning rate often works well for simpler problems, good performance on the ImageNet (russakovsky2015imagenet) benchmark requires a carefully tuned schedule. A variety of decay schedules have been proposed for different architectures, including polynomial, exponential, staircase, etc. Learning rate decay is also required to achieve convergence guarantees for stochastic gradient methods under certain conditions (bottou1999online). Clever learning rate heuristics have resulted in large improvements in training efficiency (imagenet1hour; CLR). A related hyperparameter is momentum; while it is typically fixed to a reasonable value such as 0.9, careful tuning can also give significant performance gains (Sutskever13). While optimizers such as Adam (kingma2015adam) are often described as adapting coordinate-specific learning rates, they in fact also have global learning rate and momentum hyperparameters analogous to those of SGD, and tuning at least the learning rate can be important for good performance.

In light of this, it is not surprising that there have been many attempts to adapt learning rates, either online during optimization (schraudolph1999smd; schaul2013nomorepesky), or offline by fitting a learning rate schedule (maclaurin2015hypergrad). More ambitiously, others have attempted to learn an optimizer (l2l; l2o; MAML; lv2017learngd; wichrowska2017learnopt; metz2017unrollgan). All of these approaches are forms of meta-optimization, where one defines a meta-objective (typically the expected loss after some number of optimization steps) and tunes the hyperparameters to minimize this meta-objective. But because gradient-based meta-optimization can require thousands of updates, each of which unrolls the entire base-level optimization procedure, the meta-optimization is thousands of times more expensive than the base-level optimization. Therefore, the meta-objective must be defined with a much smaller time horizon (e.g. hundreds of updates) than we are ordinarily interested in for large-scale optimization. The hope is that the learned hyperparameters or optimizer will generalize well to much longer time horizons. Unfortunately, we show in this paper that this hope is not realized, because of a strong tradeoff between short-term and long-term performance, which we refer to as short-horizon bias.

Figure 1: Aggressive learning rate (red) followed by a decay schedule (yellow) wins over conservative learning rate (blue) by making more progress along the low curvature direction (x direction).

In this work, we investigate the short-horizon bias both mathematically and empirically. First, we analyze a quadratic cost function with noisy gradients based on schaul2013nomorepesky. We consider this a good proxy for neural net training because second-order optimization algorithms have been shown to train neural networks in orders-of-magnitude fewer iterations (martens_hf), suggesting that much of the difficulty of SGD training can be explained by quadratic approximations to the cost. In our noisy quadratic problem, the dynamics of SGD with momentum can be analyzed exactly, allowing us to derive the greedy-optimal (i.e. 1-step horizon) learning rate and momentum in closed form, as well as to (locally) minimize the long-horizon loss using gradient descent. We analyze the differences between the short-horizon and long-horizon schedules.

Interestingly, when the noisy quadratic problem is either deterministic or spherical, greedy schedules are optimal. However, when the problem is both stochastic and badly conditioned (as is most neural net training), the greedy schedules decay the learning rate far too quickly, leading to slow convergence towards the optimum. This is because reducing the learning rate dampens the fluctuations along high curvature directions, giving a large immediate reduction in loss. But this comes at the expense of long-run performance, because the optimizer fails to make progress along low curvature directions. This phenomenon is illustrated in Figure 1, a noisy quadratic problem in 2 dimensions, in which two learning rate schedules are compared: a small fixed learning rate (blue), versus a larger fixed learning rate (red) followed by exponential decay (yellow). The latter schedule initially has higher loss, but it makes more progress towards the optimum, such that it achieves an even smaller loss once the learning rate is decayed.
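To make this concrete, below is a minimal NumPy simulation of a two-dimensional noisy quadratic in the spirit of Figure 1. The curvatures, noise level, and schedule constants are illustrative choices of ours, not the values used to produce the figure.

import numpy as np

rng = np.random.default_rng(0)
h = np.array([10.0, 0.1])       # one high-curvature and one low-curvature direction
sigma = np.array([0.3, 0.3])    # std. dev. of the stochastic minimum along each axis
T = 500

def run(schedule):
    # Run SGD on the noisy quadratic and return the expected loss at each step.
    theta = np.array([1.0, 1.0])
    losses = []
    for t in range(T):
        c = rng.normal(0.0, sigma)          # noisy minimum for this step
        grad = h * (theta - c)              # stochastic gradient
        theta = theta - schedule(t) * grad
        losses.append(0.5 * np.sum(h * (theta ** 2 + sigma ** 2)))
    return np.array(losses)

conservative = run(lambda t: 0.02)
aggressive = run(lambda t: 0.15 if t < 250 else 0.15 * 0.97 ** (t - 250))

print("final expected loss, conservative:", conservative[-1])
print("final expected loss, aggressive then decayed:", aggressive[-1])

The aggressive-then-decayed schedule incurs larger fluctuations early on, but makes far more progress along the low-curvature direction, so its final loss is lower once the step size is decayed.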

Figure 2 shows this effect quantitatively for a noisy quadratic problem in 1000 dimensions (defined in Section 2.3). The solid lines show the loss after various numbers of steps of lookahead with a fixed learning rate; if this is used as the meta-objective, it favors small learning rates. The dashed curves show the loss if the same trajectories are followed by 50 steps with an exponentially decayed learning rate; these curves favor higher learning rates, and bear little obvious relationship to the solid ones. This illustrates the difficulty of selecting learning rates based on short-horizon information.

Figure 2: Short-horizon meta-objectives for the noisy quadratic problem. Solid: loss after k updates with fixed learning rate. Dashed: loss after k updates with fixed learning rate, followed by exponential decay.

The second part of our paper empirically investigates gradient-based meta-optimization for neural net training. We consider two idealized meta-optimization algorithms: an offline algorithm which fits a learning rate decay schedule by running optimization many times from scratch, and an online algorithm which adapts the learning rate during training. Since our interest is in studying the effect of the meta-objective itself rather than failures of meta-optimization, we give the meta-optimizers sufficient time to optimize their meta-objectives well. We show that short-horizon meta-optimizers, both online and offline, dramatically underperform a hand-tuned fixed learning rate, and sometimes cause the base-level optimization progress to slow to a crawl, even with moderately long time horizons (e.g. 100 or 1000 steps) similar to those used in prior work on gradient-based meta-optimization.

In short, we expect that any meta-objective which does not correct for short-horizon bias will probably fail when run for a much longer time horizon than it was trained on. There are applications where short-horizon meta-optimization is directly useful, such as few-shot learning (mann; ravi2017oneshot). In those settings, short-horizon bias is by definition not an issue. But much of the appeal of meta-optimization comes from the possibility of using it to speed up or simplify the training of large neural networks. In such settings, short-horizon bias is a fundamental obstacle that must be addressed for meta-optimization to be practically useful.

2 Noisy quadratic problem

In this section, we consider a toy problem which demonstrates the short-horizon bias and can be analyzed analytically. In particular, we borrow the noisy quadratic model of schaul2013nomorepesky; the true function being optimized is a quadratic, but in each iteration we observe a noisy version with the correct curvature but a perturbed minimum. This can be equivalently viewed as noisy observations of the gradient, which are intended to capture the stochasticity of a mini-batch-based optimizer. We analyze the dynamics of SGD with momentum on this example, and compare the long-horizon-optimized and greedy-optimal learning rate schedules.

2.1 Background

Approximating the cost surface of a neural network with a quadratic function has led to powerful insights and algorithms. Second-order optimization methods such as Newton-Raphson and natural gradient (amari1998) iteratively minimize a quadratic approximation to the cost function. Hessian-free (H-F) optimization (martens_hf) is an approximate natural gradient method which tries to minimize a quadratic approximation using conjugate gradient. It can often fit deep neural networks in orders-of-magnitude fewer updates than SGD, suggesting that much of the difficulty of neural net optimization is captured by quadratic models. In the setting of Bayesian neural networks, quadratic approximations to the log-likelihood motivated the Laplace approximation (MacKay1992) and variational inference (graves11; noisykfac). koh2017understanding used quadratic approximations to analyze the sensitivity of a neural network’s predictions to particular training labels, thereby yielding insight into adversarial examples.

Such quadratic approximations to the cost function have also provided insights into learning rate and momentum adaptation. In a deterministic setting, under certain conditions, second-order optimization algorithms can be run with a learning rate of 1; for this reason, H-F was able to eliminate the need to tune learning rate or momentum hyperparameters. K-FAC observed that for a deterministic quadratic cost function, greedily choosing the learning rate and momentum to minimize the error on the next step is equivalent to conjugate gradient (CG). Since CG achieves the minimum possible loss of any gradient-based optimizer on each iteration, the greedily chosen learning rates and momenta are optimal, in the sense that the greedy sequence achieves the minimum possible loss value of any sequence of learning rates and momenta. This property fails to hold in the stochastic setting, however, and as we show in this section, the greedy choice of learning rate and momentum can do considerably worse than optimal.

Our primary interest in this work is to adapt scalar learning rate and momentum hyperparameters shared across all dimensions. Some optimizers based on diagonal curvature approximations (kingma2015adam) have been motivated in terms of adapting dimension-specific learning rates, but in practice, one still needs to tune scalar learning rate and momentum hyperparameters. Even K-FAC (K-FAC), which is based on more powerful curvature approximations, has scalar learning rate and momentum hyperparameters. Our analysis applies to all of these methods since they can be viewed as performing SGD in a preconditioned space.

2.2 Analysis

2.2.1 Notations

We will primarily focus on the SGD with momentum algorithm in this paper. The update is written as follows:

𝐯^(t+1) = μ^(t) 𝐯^(t) − α^(t) ∇ℒ(𝜽^(t)),  (1)
𝜽^(t+1) = 𝜽^(t) + 𝐯^(t+1),  (2)

where ℒ is the loss function, t is the training step, and α^(t) is the learning rate. We call the gradient trace 𝐯^(t) the “velocity”, and its decay constant μ^(t) the “momentum”. We denote the ith coordinate of a vector 𝐯 as v_i. When we focus on a single dimension, we sometimes drop the dimension subscripts. We also denote A(·) = 𝔼[·]² + 𝕍[·], where 𝔼 and 𝕍 denote expectation and variance respectively.

2.2.2 Problem formulation

We now define the noisy quadratic model: in each iteration, the optimizer is given the gradient of a noisy version of a quadratic cost function, in which the curvature is correct but the minimum is sampled stochastically from a Gaussian distribution. We assume WLOG that the Hessian is diagonal, because SGD is a rotation-invariant algorithm and therefore the dynamics can be analyzed in a coordinate system corresponding to the eigenvectors of the Hessian. We make the further (nontrivial) assumption that the noise covariance is also diagonal; this amounts to assuming that the Hessian and the noise covariance are codiagonalizable. One heuristic justification for this assumption in the context of neural net optimization is that, under certain assumptions, the covariance of the gradients is proportional to the Fisher information matrix, which is close to the Hessian (martens_natural_gradient). Mathematically, the stochastic cost function is written as:

ℒ̂(𝜽) = ½ Σ_i h_i (θ_i − c_i)²,  (3)

where 𝐜 is the stochastic minimum, and each ci follows a Gaussian distribution with mean θi* and variance σi2. The expected loss is given by:

ℒ(𝜽) = 𝔼[ℒ̂(𝜽)] = ½ Σ_i h_i ((θ_i − θ_i*)² + σ_i²).  (4)

The optimum of ℒ is given by 𝜽* = 𝔼[𝐜]; we assume WLOG that 𝜽* = 𝟎. The stochastic gradient is given by ∇_{θ_i} ℒ̂ = h_i (θ_i − c_i). Since the deterministic gradient is given by ∇_{θ_i} ℒ = h_i θ_i, the stochastic gradient can be viewed as a noisy Gaussian observation of the deterministic gradient with variance h_i² σ_i². This interpretation motivates the use of this noisy quadratic problem as a model of SGD dynamics.

We treat the iterate 𝜽(t) as a random variable (where the randomness comes from the sampled 𝐜’s); the expected loss in each iteration is given by

𝔼[ℒ(𝜽^(t))] = 𝔼[ ½ Σ_i h_i ((θ_i^(t))² + σ_i²) ]  (5)
            = ½ Σ_i h_i ( 𝔼[θ_i^(t)]² + 𝕍[θ_i^(t)] + σ_i² ).  (6)

2.2.3 Optimized and Greedy-Optimal Schedules

We are interested in adapting a global learning rate α(t) and a global momentum decay parameter μ(t) for each time step t. We first derive a recursive formula for the mean and variance of the iterates at each step, and then analyze the greedy-optimal schedule for α(t) and μ(t).

Several observations allow us to compactly model the dynamics of SGD with momentum on the noisy quadratic model. First, 𝔼[ℒ(𝜽^(t))] can be expressed in terms of 𝔼[θ_i] and 𝕍[θ_i] using Eqn. 5. Second, due to the diagonality of the Hessian and the noise covariance matrix, each coordinate evolves independently of the others. Third, the means and variances of the parameters θ_i^(t) and the velocity v_i^(t) are functions of those statistics at the previous step.

Because each dimension evolves independently, we now drop the dimension subscripts. Combining these observations, we model the dynamics of SGD with momentum as a deterministic recurrence relation with sufficient statistics 𝔼[θ(t)], 𝔼[v(t)], 𝕍[θ(t)], 𝕍[v(t)], and Σθ,v(t)=Cov(θ(t),v(t)). The dynamics are as follows:

Theorem 1 (Mean and variance dynamics).

The expectations of the parameter θ and the velocity v are updated as,

𝔼[v^(t+1)] = μ^(t) 𝔼[v^(t)] − α^(t) h 𝔼[θ^(t)],
𝔼[θ^(t+1)] = 𝔼[θ^(t)] + 𝔼[v^(t+1)].
The variances of the parameter θ and the velocity v are updated as
𝕍[v^(t+1)] = (μ^(t))² 𝕍[v^(t)] + (α^(t) h)² 𝕍[θ^(t)] − 2 μ^(t) α^(t) h Σ_{θ,v}^(t) + (α^(t) h σ)²,
𝕍[θ^(t+1)] = (1 − 2 α^(t) h) 𝕍[θ^(t)] + 𝕍[v^(t+1)] + 2 μ^(t) Σ_{θ,v}^(t),
Σ_{θ,v}^(t+1) = μ^(t) Σ_{θ,v}^(t) − α^(t) h 𝕍[θ^(t)] + 𝕍[v^(t+1)].

By applying Theorem 1 recursively, we can obtain 𝔼[𝜽^(t)] and 𝕍[𝜽^(t)], and hence 𝔼[ℒ(𝜽^(t))], for every t. Therefore, using gradient-based optimization, we can fit a locally optimal learning rate and momentum schedule, i.e. a sequence of values {(α^(t), μ^(t))}_{t=1}^{T} which locally minimizes 𝔼[ℒ(𝜽^(T))] at some particular time T. We refer to this as the optimized schedule.
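For concreteness, the following NumPy sketch rolls the Theorem 1 recurrences forward for a single dimension and reports the expected loss at every step; the fixed α, μ and the initial conditions in the example are illustrative, and the multivariate loss is obtained by summing such per-dimension terms.

import numpy as np

def expected_loss_trajectory(h, sigma, alphas, mus, E0=1.0, V0=0.0):
    # Roll the Theorem 1 recurrences forward for one dimension and
    # return E[L(theta^(t))] at every step.
    E_th, E_v = E0, 0.0           # means of theta and v
    V_th, V_v, C = V0, 0.0, 0.0   # variances and the covariance Cov(theta, v)
    losses = [0.5 * h * (E_th ** 2 + V_th + sigma ** 2)]
    for a, m in zip(alphas, mus):
        E_v_new = m * E_v - a * h * E_th
        E_th_new = E_th + E_v_new
        V_v_new = (m ** 2) * V_v + (a * h) ** 2 * V_th - 2 * m * a * h * C + (a * h * sigma) ** 2
        V_th_new = (1 - 2 * a * h) * V_th + V_v_new + 2 * m * C
        C_new = m * C - a * h * V_th + V_v_new
        E_th, E_v, V_th, V_v, C = E_th_new, E_v_new, V_th_new, V_v_new, C_new
        losses.append(0.5 * h * (E_th ** 2 + V_th + sigma ** 2))
    return np.array(losses)

# Example: a fixed learning rate and momentum (illustrative values).
T = 250
curve = expected_loss_trajectory(h=1.0, sigma=1.0,
                                 alphas=np.full(T, 0.2), mus=np.full(T, 0.9))
print(curve[0], curve[-1])

Differentiating such a rollout with respect to the schedule (e.g. with an autodiff library) is what we mean by fitting the optimized schedule.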

Furthermore, there is a closed-form solution for one-step lookahead, i.e., we can solve for the optimal learning rate α^(t)* and momentum μ^(t)* that minimize 𝔼[ℒ(𝜽^(t+1))] given the statistics at time t. We call this the greedy-optimal schedule.

Theorem 2 (Greedy-optimal learning rate and momentum).

The greedy-optimal learning rate and momentum schedule is given by

α^(t)* = Σ_i ( h_i² A(θ_i^(t)) [Σ_j h_j A(v_j^(t))] − [Σ_j h_j 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] ) / Σ_i ( h_i³ [A(θ_i^(t)) + σ_i²] [Σ_j h_j A(v_j^(t))] − [Σ_j h_j² 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] ),
μ^(t)* = − Σ_i h_i (1 − α^(t)* h_i) 𝔼[θ_i^(t) v_i^(t)] / Σ_i h_i A(v_i^(t)).

Note that schaul2013nomorepesky derived the greedy optimal learning rate for SGD, and Theorem 2 extends it to the greedy optimal learning rate and momentum for SGD with momentum.
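A small sketch of how Theorem 2 can be evaluated numerically given the per-dimension statistics from Theorem 1 follows; the array layout and the example state are our own illustrative choices.

import numpy as np

def greedy_alpha_mu(h, sigma, E_th, E_v, V_th, V_v, C):
    # Greedy-optimal (alpha, mu) from Theorem 2, given per-dimension statistics
    # (arrays indexed by dimension i).
    A_th = E_th ** 2 + V_th            # A(theta_i)
    A_v = E_v ** 2 + V_v               # A(v_i)
    E_tv = E_th * E_v + C              # E[theta_i v_i]
    S_v = np.sum(h * A_v)              # sum_j h_j A(v_j)
    S_tv = np.sum(h * E_tv)            # sum_j h_j E[theta_j v_j]
    S_tv2 = np.sum(h ** 2 * E_tv)      # sum_j h_j^2 E[theta_j v_j]
    num = np.sum(h ** 2 * A_th * S_v - S_tv * h ** 2 * E_tv)
    den = np.sum(h ** 3 * (A_th + sigma ** 2) * S_v - S_tv2 * h ** 2 * E_tv)
    alpha = num / den
    mu = -np.sum(h * (1 - alpha * h) * E_tv) / np.sum(h * A_v)
    return alpha, mu

# Illustrative two-dimensional state partway through training.
h = np.array([10.0, 0.1]); sigma = np.sqrt(1.0 / h)
E_th = np.array([1.0, 1.0]); E_v = np.array([-0.5, -0.05])
V_th = np.array([0.05, 0.05]); V_v = np.array([0.1, 0.1]); C = np.array([-0.05, -0.02])
print(greedy_alpha_mu(h, sigma, E_th, E_v, V_th, V_v, C))

Alternating this computation with the Theorem 1 updates produces the greedy-optimal schedules compared in Section 2.3.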

Figure 3: Comparisons of the optimized learning rates and momenta trained by gradient descent (red), greedy learning rates and momenta (blue), and the optimized fixed learning rate and momentum (green) in both noisy (a) and deterministic (b) quadratic settings. In the deterministic case, our optimized schedule matched the greedy one, just as the theory predicts.

2.2.4 Univariate and Spherical Cases

As noted in Section 2.1, K-FAC found the greedy choice of α and μ to be optimal for gradient descent on deterministic quadratic objectives. We now show that the greedy schedule is also optimal for SGD without momentum in the case of univariate noisy quadratics, and hence also for multivariate ones with spherical Hessians and gradient covariances. In particular, the following holds for SGD without momentum on a univariate noisy quadratic:

Theorem 3 (Optimal learning rate, univariate).

For all T, the sequence of learning rates {α^(t)*}_{t=1}^{T−1} that minimizes ℒ(θ^(T)) is given by

α^(t)* = A(θ^(t)) / ( h (A(θ^(t)) + σ²) ).  (7)

Moreover, this agrees with the greedy-optimal learning rate schedule as derived by schaul2013nomorepesky.
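The sketch below shows Theorem 3 in action on a one-dimensional noisy quadratic: the optimal learning rate is iterated together with the per-dimension SGD dynamics (the constants are illustrative).

import numpy as np

def univariate_optimal_schedule(h, sigma, A0, T):
    # Iterate the Theorem 3 learning rate on a 1-D noisy quadratic (SGD, no momentum).
    # A_t = E[theta_t]^2 + V[theta_t]; the expected loss is 0.5 * h * (A_t + sigma^2).
    A, alphas = A0, []
    for _ in range(T):
        a = A / (h * (A + sigma ** 2))                     # Theorem 3 step size
        alphas.append(a)
        A = (1 - a * h) ** 2 * A + (a * h * sigma) ** 2    # univariate SGD dynamics
    return np.array(alphas), A

alphas, A_T = univariate_optimal_schedule(h=1.0, sigma=1.0, A0=10.0, T=20)
print(alphas[:5], 0.5 * (A_T + 1.0))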

If the Hessian and the gradient covariance are both spherical, then each dimension evolves identically and independently according to the univariate dynamics. Of course, one is unlikely to encounter an optimization problem where both are exactly spherical. But some approximate second-order optimizers, such as K-FAC, can be viewed as preconditioned SGD, i.e. SGD in a transformed space where the Hessian and the gradient covariance are better conditioned (K-FAC). In principle, with a good enough preconditioner, the Hessian and the gradient covariance would be close enough to spherical that a greedy choice of α and μ would perform well. It will be interesting to investigate whether any practical optimization algorithms demonstrate this behavior.

2.3 Experiments

In this section, we compare the optimized and greedy-optimal schedules on a noisy quadratic problem. We chose a 1000-dimensional quadratic cost function with the curvature distribution from Li05sharpnessin, on which CG achieves its worst-case convergence rate. We assume that the gradient noise variance matches the curvature, i.e. h_i = 𝕍[∇_{θ_i} ℒ̂] = h_i² σ_i², and hence σ_i² = 1/h_i; this choice is motivated by the observation that under certain assumptions, the Fisher information matrix is a good approximation to the Hessian matrix, but also reflects the covariance structure of the gradient noise (martens_natural_gradient). We computed the greedy-optimal schedules using Theorem 2. For the optimized schedules, we minimized the expected loss at time T=250 using Adam (kingma2015adam), with a learning rate of 0.003 and 500 steps. We set an upper bound on the learning rate which prevented the loss component for any dimension from becoming larger than its initial value; this was needed because otherwise the optimized schedule allowed the loss to temporarily grow very large, a pathological solution which would be unstable on realistic problems. We also considered a fixed learning rate and momentum, with the two hyperparameters fit using Adam. The training curves and the corresponding learning rates and momenta are shown in Figure 3(a). The optimized schedule achieved a much lower final expected loss value (4.25) than was obtained by the greedy-optimal schedule (63.86) or the fixed schedule (42.19).

We also show the sums of the losses along the 50 highest curvature directions and 50 lowest curvature directions. We find that under the optimized schedule, the losses along the high curvature directions hardly decrease initially. However, because it maintains a high learning rate, the losses along the low curvature directions decrease significantly. After 50 iterations, it begins decaying the learning rate, at which point it achieves a large drop in both the high-curvature and total losses. On the other hand, under the greedy-optimal schedule, the learning rates and momenta become small very early on, which immediately reduces the losses on the high curvature directions, and hence also the total loss. However, in the long term, since the learning rates are too small to make substantial progress along the low curvature directions, the total loss converged to a much higher value in the end. This gives valuable insight into the nature of the short-horizon bias in meta-optimization: short-horizon objectives will often encourage the learning rate and momentum to decay quickly, so as to achieve the largest gain in the short term, but at the expense of long-run performance.

It is interesting to compare this behavior with the deterministic case. We repeated the above experiment for a deterministic quadratic cost function (i.e. σi2=0) with the same Hessian; results are shown in Figure 3(b). The greedy schedule matches the optimized one, as predicted by the analysis of K-FAC. This result illustrates that stochasticity is necessary for short-horizon bias to manifest. Interestingly, the learning rate and momentum schedules in the deterministic case are nearly flat, while the optimized schedules for the stochastic case are much more complex, suggesting that stochastic optimization raises a different set of issues for hyperparameter adaptation.

3 Gradient-Based Meta-Optimization

We now turn our attention to gradient-based hyperparameter optimization. A variety of approaches have been proposed which tune hyperparameters by doing gradient descent on a meta-objective (schraudolph1999smd; maclaurin2015hypergrad; l2l). We empirically analyze an idealized version of a gradient-based meta-optimization algorithm called stochastic meta-descent (SMD)  (schraudolph1999smd). Our version of SMD is idealized in two ways: first, we drop the algorithmic tricks used in prior work, and instead allow the meta-optimizer more memory and computation than would be economical in practice. Second, we limit the representational power of our meta-model: whereas l2l aimed to learn a full optimization algorithm, we focus on the much simpler problem of adapting learning rate and momentum hyperparameters, or schedules thereof. The aim of these two simplifications is that we would like to do a good enough job of optimizing the meta-objective that any base-level optimization failures can be attributed to deficiencies in the meta-objective itself (such as short-horizon bias) rather than incomplete meta-optimization.

Despite these simplifications, we believe our experiments are relevant to practical meta-optimization algorithms which optimize the meta-objective less thoroughly. Since the goal of the meta-optimizer is to adapt two hyperparameters, it’s possible that poor meta-optimization could cause the hyperparameters to get stuck in regions that happen to perform well; indeed, we observed this phenomenon in some of our early explorations. But it would be dangerous to rely on poor meta-optimization, since improved meta-optimization methods would then lead to worse base-level performance, and tuning the meta-optimizer could become a roundabout way of tuning learning rates and momenta.

We also believe our experiments are relevant to meta-optimization methods which aim to learn entire algorithms. Even if the learned algorithms don’t have explicit learning rate parameters, it’s possible for a learning rate schedule to be encoded into an algorithm itself; for instance, Adagrad (duchi2011adagrad) implicitly uses a polynomial decay schedule because it sums rather than averages the squared derivatives in the denominator. Hence, one would need to worry about whether the meta-optimizer is implicitly fitting a learning rate schedule that’s optimized for short-term performance.
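As a small illustration of the Adagrad remark (our own example, not from the paper): with gradients of roughly constant magnitude, the accumulated sum of squares grows linearly in t, so the effective step size shrinks like 1/√t, i.e. an implicit polynomial decay schedule.

import numpy as np

lr = 0.1
accum = 0.0
for t in range(1, 1001):
    g = 1.0                                  # stand-in for a gradient of constant magnitude
    accum += g ** 2                          # Adagrad sums (rather than averages) squared gradients
    effective_lr = lr / np.sqrt(accum)
    if t in (1, 10, 100, 1000):
        print(t, effective_lr)               # roughly lr / sqrt(t): an implicit decay schedule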

3.1 Background: Stochastic Meta-Descent


Figure 4: Regular SGD in the form of a computation graph. The learning rate parameter α is part of the differentiable computations.

The high-level idea of stochastic meta-descent (SMD) (schraudolph1999smd) is to perform gradient descent on the learning rate, or any other differentiable hyperparameters. This is feasible since any gradient based optimization algorithm can be unrolled as a computation graph (see Figure 4), and automatic differentiation is readily available in most deep learning libraries.

There are two basic types of automatic differentiation (autodiff) methods: forward mode and reverse mode. In forward mode autodiff, directional derivatives are computed alongside the forward computation. In contrast, reverse mode autodiff (a.k.a. backpropagation) computes the gradients moving backwards through the computation graph. Meta-optimization using reverse mode can be computationally demanding due to memory constraints, since the parameters need to be stored at every step. maclaurin2015hypergrad got around this by cleverly exploiting approximate reversibility to minimize the memory cost. Since we are optimizing only two hyperparameters, however, forward mode autodiff can be done cheaply. Here, we provide the forward differentiation equations for obtaining the gradient with respect to the learning rate of vanilla SGD. Let 𝒖_t denote d𝜽_t/dα, let ∇_α ℒ denote dℒ_t/dα, and let H_t be the Hessian at step t. By the chain rule, we get,

∇_α ℒ = 𝒈_t^⊤ 𝒖_{t−1},  (8)
𝒖_t = 𝒖_{t−1} − 𝒈_t − α H_t 𝒖_{t−1}.  (9)

While the Hessian is infeasible to construct explicitly, the Hessian-vector product in Equation 9 can be computed efficiently using reverse-on-reverse (werbos1988backprop) or forward-on-reverse automatic differentiation (pearlmutter1994hessian), in time linear in the cost of the forward pass. See schraudolph2002smdmv for more details.
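Below is a minimal sketch of Eqs. 8 and 9 on a toy deterministic quadratic, where the Hessian-vector product can be formed explicitly; for a neural network, H𝒖 would instead be computed with the autodiff techniques mentioned above. The quadratic, step size, and number of steps are assumptions of this sketch.

import numpy as np

H = np.diag([10.0, 0.1])          # toy quadratic: L(theta) = 0.5 * theta^T H theta
theta = np.array([1.0, 1.0])
alpha = 0.05
u = np.zeros_like(theta)          # u = d theta / d alpha, accumulated in forward mode
dL_dalpha = 0.0

for t in range(100):
    g = H @ theta                 # gradient at the current iterate
    dL_dalpha = g @ u             # Eq. 8: derivative of the current loss w.r.t. alpha
    u = u - g - alpha * (H @ u)   # Eq. 9: H @ u is the Hessian-vector product
    theta = theta - alpha * g     # vanilla SGD step

print("dL/dalpha after unrolling:", dL_dalpha)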

Input: α₀, η, 𝜽, T, M
Output: α
𝜽₀ ← 𝜽;
α ← α₀;
for m ← 1 … M do
       𝒖 ← 𝟎;
       for t ← 1 … T do
             X, 𝒚 ← GetMiniBatch();
             𝒈 ← BGrad(L(X, 𝒚, 𝜽), 𝜽, 1);
             𝜽_new ← Step(𝜽, 𝒈, α);
             ∇α ← 𝒈^⊤𝒖;
             𝒖 ← FGrad(𝜽_new, [α, 𝜽], [1, 𝒖]);
             𝜽 ← 𝜽_new;
       α ← MetaStep(α, ∇α, η);
       𝜽 ← 𝜽₀
return α
Algorithm 1 Stochastic Meta-Descent

Using the gradients with respect to hyperparameters, as given in Eq. 9, we can apply gradient based meta-optimization, just like optimizing regular parameters. It is worth noting that, although SMD was originally proposed for optimizing vanilla SGD, in practice it can be applied to other optimization algorithms such as SGD with momentum or Adam (kingma2015adam). Moreover, gradient-based optimizers other than SGD can be used for the meta-optimization as well.

The basic SMD algorithm is given as Algorithm 1. Here, α is a set of hyperparameters (e.g. learning rate), and α₀ are initial hyperparameter values; 𝜽 is a set of optimization intermediate variables, such as weights and velocities; η is a set of meta-optimizer hyperparameters (e.g. meta learning rate). BGrad(y, x, dy) is the backward gradient function that computes the gradients of the loss function with respect to 𝜽, and FGrad(y, x, dx) is the forward gradient function that accumulates the gradient of 𝜽 with respect to α. Step and MetaStep optimize regular parameters and hyperparameters, respectively, for one step using gradient-based methods. Additionally, T is the lookahead window size, and M is the number of meta updates.

Simplifications from the original SMD algorithm.

The original SMD algorithm (schraudolph1999smd) fit coordinate-wise adaptive learning rates with intermediate gradients (𝒖t) accumulated throughout the process of training. Since computing separate directional derivatives for each coordinate using forward mode autodiff is computationally prohibitive, the algorithm used approximate updates. Both features introduced bias into the meta-gradients. We make several changes to the original algorithm. First, we tune only a global learning rate parameter. Second, we use exact forward mode accumulation because this is feasible for a single learning rate. Third, rather than accumulate directional derivatives during training, we compute the meta-updates on separate SGD trajectories simulated using fixed network parameters. Finally, we compute multiple meta-updates in order to ensure that the meta-objective is optimized sufficiently well. Together, these changes ensure unbiased meta-gradients, as well as careful optimization of the meta-objective, at the cost of high computational overhead. We do not recommend this approach as a practical SMD implementation, but rather as a way of understanding the biases in the meta-objective itself.
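To make the idealized procedure concrete, here is a minimal sketch of the online variant of Algorithm 1 on a one-dimensional noisy quadratic, using the forward-mode rule of Eqs. 8 and 9 (the Hessian is just h here). The lookahead length, meta learning rate, and log-space parameterization of α are illustrative assumptions, not the exact experimental settings.

import numpy as np

rng = np.random.default_rng(0)
h, sigma = 1.0, 1.0           # 1-D noisy quadratic: curvature and noise scale
theta0 = 5.0                  # fixed parameters that every lookahead starts from
log_alpha = np.log(0.5)       # meta-parameter: log learning rate
T, M, meta_lr = 5, 50, 0.05   # lookahead window, number of meta updates, meta step size

for m in range(M):
    alpha = np.exp(log_alpha)
    theta, u, d_alpha = theta0, 0.0, 0.0   # u = d theta / d alpha
    for t in range(T):
        c = rng.normal(0.0, sigma)         # noisy minimum for this lookahead step
        g = h * (theta - c)                # stochastic gradient
        d_alpha = g * u                    # Eq. 8: meta-gradient at the current step
        u = u - g - alpha * h * u          # Eq. 9: forward-mode accumulation (Hessian = h)
        theta = theta - alpha * g          # base-level SGD step
    # Meta step on log(alpha); the chain rule contributes a factor of alpha.
    log_alpha -= meta_lr * d_alpha * alpha
    # Each meta update restarts the lookahead from the same theta0; in the online
    # algorithm, theta0 itself advances by regular training between meta updates.

print("adapted learning rate after", M, "meta updates:", np.exp(log_alpha))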

3.2 Offline meta-optimization

To understand the sensitivity of the optimized hyperparameters to the horizon, we first carried out an offline experiment on a multi-layered perceptron (MLP) on MNIST (mnist). Specifically, we fit learning rate decay schedules offline by repeatedly training the network, and a single meta-gradient was obtained from each training run.

Figure 5: Meta-objective surfaces and SMD trajectories (red) optimizing initial effective learning rate and decay exponent with horizons of {100, 1k, 5k, 20k} steps. 2.5k random samples with Gaussian interpolation are used to illustrate the meta-objective surface.
Figure 6: Training curves using meta-optimized learning rate schedules with {100, 1k, 5k, 20k} step horizons.
Learnable decay schedule.

We used a parametric learning rate decay schedule known as inverse time decay (welling2011sgld): α_t = α₀ (1 + t/K)^(−β), where α₀ is the initial learning rate, t is the number of training steps, β is the learning rate decay exponent, and K is the time constant. We jointly optimized α₀ and β. We fixed μ = 0.9 and K = 5000 for simplicity.
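As a concrete reference, a small Python helper implementing this schedule (K = 5000 is the fixed value stated above; the α₀ and β values in the example call are purely illustrative):

def inverse_time_decay(alpha0, beta, t, K=5000.0):
    # Inverse time decay: alpha_t = alpha0 * (1 + t / K) ** (-beta)
    return alpha0 * (1.0 + t / K) ** (-beta)

# Illustrative values: alpha0 = 0.1, beta = 1.0
print([inverse_time_decay(0.1, 1.0, t) for t in (0, 5000, 20000)])  # [0.1, 0.05, 0.02]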

Experimental details.

The network had two layers of 100 hidden units each, with ReLU activations. Weights were initialized from a zero-mean Gaussian with standard deviation 0.1. We used a warm start from a network trained for 50 steps of SGD with momentum, using α = 0.1 and μ = 0.9. (We used a warm start because the dynamics are generally different at the very start of training.) We trained all hyperparameters in log space using the Adam optimizer, with 5k meta steps.

Figure 5 shows SMD optimization trajectories on the meta-objective surfaces, initialized with multiple random hyperparameter settings. The SMD trajectories appear to have converged to the global optimum.

Importantly, the meta-objectives with longer horizons favored a much smaller learning rate decay exponent β, leading to a more gradual decay schedule. The meta-objective surfaces were very different depending on the time horizon, and the final β value differed by over two orders of magnitude between 100 and 20k step horizons.

Figure 6 plots the training curves of a network using the optimized learning rate schedules. The resulting training loss at 20k steps with the 100 step horizon was over three orders of magnitude larger than with the 20k step horizon. In general, short horizons gave better performance initially, but were surpassed by longer horizons. The differences in error were less drastic, but we see that the 100 step network was severely undertrained, and the 1k step network achieved noticeably worse test error than the longer-horizon ones.

3.3 Online Meta-Optimization

In this section, we study whether online adaptation also suffers from short-horizon bias. Specifically, we used Algorithm 1 to adapt the learning rate and momentum hyperparameters online while a network is trained. We experimented with an MLP on MNIST and a CNN on CIFAR-10 (Krizhevsky09learningmultiple).

Experimental details.

For the MNIST experiments, we used an MLP with two hidden layers of 100 units, with ReLU activations. Weights were initialized from a zero-mean Gaussian with standard deviation 0.1. For the CIFAR-10 experiments, we used a CNN adapted from Caffe (caffe), with 3 convolutional layers of filter size 3 × 3 and depth [32, 32, 64], 2 × 2 max pooling with stride 2 after every convolution layer, followed by a fully connected hidden layer of 100 units. Meta-optimization was done with 100 steps of Adam for every 10 steps of regular training. We adapted the learning rate α and momentum μ. After 25k steps, adaptation was stopped, and we trained for another 25k steps with an exponentially decaying learning rate such that it reached 1e-4 on the last time step. We re-parameterized the learning rate with the effective learning rate α_eff = α/(1 − μ), and the momentum with 1 − μ, so that they could be optimized more smoothly in log space.
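A small sketch of this reparameterization (variable names are ours): the meta-optimizer works on the log-space quantities, and the raw hyperparameters are recovered as follows.

import numpy as np

# Meta-parameters actually optimized (in log space); the values here are illustrative.
log_alpha_eff = np.log(0.1)      # alpha_eff = alpha / (1 - mu)
log_one_minus_mu = np.log(0.1)   # corresponds to mu = 0.9

def to_alpha_mu(log_alpha_eff, log_one_minus_mu):
    # Recover the raw (alpha, mu) hyperparameters from the reparameterized form.
    one_minus_mu = np.exp(log_one_minus_mu)
    mu = 1.0 - one_minus_mu
    alpha = np.exp(log_alpha_eff) * one_minus_mu   # alpha = alpha_eff * (1 - mu)
    return alpha, mu

print(to_alpha_mu(log_alpha_eff, log_one_minus_mu))   # -> (0.01, 0.9) up to float error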

Figure 7 shows training curves both with online SMD and with hand-tuned fixed learning rate and momentum hyperparameters. We show several SMD runs initialized from widely varying hyperparameters; all the SMD runs behaved similarly, suggesting it optimized the meta-objective efficiently enough. Under SMD, learning rates were quickly decreased to very small values, leading to slow progress in the long term, consistent with the noisy quadratic and offline adaptation experiments.

Figure 7: Training curves and learning rates from online SMD with lookahead of 5 steps (blue), and hand-tuned fixed learning rate (red). Each blue curve corresponds to a different initial learning rate.
Figure 8: Online SMD with deterministic lookahead of 5 steps (blue), compared with a manually tuned fixed learning rate (red). Other settings are the same as Figure 7.

As online SMD can be too conservative in the choice of learning rate, it is natural to ask whether removing the stochasticity in the lookahead sequence can fix the problem. We therefore considered online SMD where the entire lookahead trajectory used a single mini-batch, hence removing the stochasticity. As shown in Figure 8, this deterministic lookahead scheme led to the opposite problem: the adapted learning rates were very large, leading to instability. We conclude that the stochasticity of mini-batch training cannot be simply ignored in meta-optimization.

4 Conclusion

In this paper, we analyzed the problem of short-horizon bias in meta-optimization. We presented a noisy quadratic toy problem which we analyzed mathematically, and observed that the optimal learning rate schedule differs greatly from a greedy schedule that minimizes training loss one step ahead. While the greedy schedule tends to decay the learning rate drastically to reduce the loss on high curvature directions, the optimal schedule keeps a high learning rate in order to make steady progress on low curvature directions, and eventually achieves far lower loss. We showed that this bias stems from the combination of stochasticity and ill-conditioning: when the problem is either deterministic or spherical, the greedy learning rate schedule is globally optimal; however, when the problem is both stochastic and ill-conditioned (as is most neural net training), the greedy schedule performs poorly. We empirically verified the short-horizon bias in the context of neural net training by applying gradient-based meta-optimization, both offline and online. We found the same pathological behaviors as in the noisy quadratic problem — a fast learning rate decay and poor long-run performance.

While our results suggest that meta-optimization should not be applied blindly, our noisy quadratic analysis also provides grounds for optimism: by removing ill-conditioning (by using a good preconditioner) and/or stochasticity (with large batch sizes or variance reduction techniques), it may be possible to enter the regime where short-horizon meta-optimization works well. It remains to be seen whether this is achievable with existing optimization algorithms.

Acknowledgement

YW is supported by a Google PhD Fellowship. RL is supported by Connaught International Scholarships.

References

Appendix A Proofs of Theorems

The proofs are organized as follows: we prove Theorem 1 in A.1, Theorem 2 in A.2, and Theorem 3 in A.3.

A.1 Model Dynamics

Recall that stochastic gradient descent with momentum is defined as follows:

v^(t+1) = μ^(t) v^(t) − α^(t) (h θ^(t) + h σ ξ),   ξ ∼ 𝒩(0, 1)
θ^(t+1) = θ^(t) + v^(t+1) = θ^(t) + μ^(t) v^(t) − α^(t) (h θ^(t) + h σ ξ)
        = (1 − α^(t) h) θ^(t) + μ^(t) v^(t) − α^(t) h σ ξ.

A.1.1 Dynamics of the expectation

We calculate the mean of the velocity v(t+1),

𝔼[v^(t+1)] = 𝔼[ μ^(t) v^(t) − α^(t) h θ^(t) ]
           = μ^(t) 𝔼[v^(t)] − α^(t) h 𝔼[θ^(t)].  (10)

We calculate the mean of the parameter θ(t+1),

𝔼[θ^(t+1)] = 𝔼[θ^(t)] + 𝔼[v^(t+1)].  (11)

Let’s assume the following initial conditions:

𝔼[v^(0)] = 0
𝔼[θ^(0)] = E₀.

Then Eq. (10) and Eq. (11) describe how 𝔼[θ^(t)] and 𝔼[v^(t)] change over time t.

A.1.2 Dynamics of the variance

We calculate the variance of the velocity v(t+1),

𝕍[v^(t+1)] = 𝕍[ μ^(t) v^(t) − α^(t) h θ^(t) ] + (α^(t) h σ)²
           = (μ^(t))² 𝕍[v^(t)] + (α^(t) h)² 𝕍[θ^(t)] − 2 μ^(t) α^(t) h Cov(θ^(t), v^(t)) + (α^(t) h σ)².  (12)

The variance of the parameter θ(t+1) is given by,

𝕍[θ^(t+1)] = 𝕍[θ^(t)] + 𝕍[v^(t+1)] + 2 ( μ^(t) Cov(θ^(t), v^(t)) − α^(t) h 𝕍[θ^(t)] ).  (13)

We also need to derive how the covariance of θ and v changes over time:

Cov(θ^(t+1), v^(t+1)) = Cov( θ^(t) + v^(t+1), v^(t+1) )
                      = Cov(θ^(t), v^(t+1)) + 𝕍[v^(t+1)]
                      = μ^(t) Cov(θ^(t), v^(t)) − α^(t) h 𝕍[θ^(t)] + 𝕍[v^(t+1)].  (14)

Let’s assume the following initial conditions:

𝕍[v(0)] =0
𝕍[θ(0)] =V0
Cov(θ(0),v(0)) =0.

Combining Eqs. (12)–(14), we obtain the following dynamics (for t = 0, …, T−1):

𝕍[v^(t+1)] = (μ^(t))² 𝕍[v^(t)] + (α^(t) h)² 𝕍[θ^(t)] − 2 μ^(t) α^(t) h Cov(θ^(t), v^(t)) + (α^(t) h σ)²
𝕍[θ^(t+1)] = 𝕍[θ^(t)] + 𝕍[v^(t+1)] + 2 ( μ^(t) Cov(θ^(t), v^(t)) − α^(t) h 𝕍[θ^(t)] )
Cov(θ^(t+1), v^(t+1)) = μ^(t) Cov(θ^(t), v^(t)) − α^(t) h 𝕍[θ^(t)] + 𝕍[v^(t+1)].

A.2 Greedy optimality

A.2.1 Univariate case

The expected loss at time step t+1 is,

ℒ^(t+1) = ½ h ( 𝔼[θ^(t+1)]² + 𝕍[θ^(t+1)] )
        = ½ h [ ( 𝔼[θ^(t)] + μ^(t) 𝔼[v^(t)] − α^(t) h 𝔼[θ^(t)] )² + 𝕍[θ^(t)] + (μ^(t))² 𝕍[v^(t)] + (α^(t) h)² 𝕍[θ^(t)]
                − 2 μ^(t) α^(t) h Cov(θ^(t), v^(t)) + (α^(t) h σ)² + 2 ( μ^(t) Cov(θ^(t), v^(t)) − α^(t) h 𝕍[θ^(t)] ) ]
        = ½ h [ ( (1 − α^(t) h) 𝔼[θ^(t)] + μ^(t) 𝔼[v^(t)] )² + (1 − α^(t) h)² 𝕍[θ^(t)] + (μ^(t))² 𝕍[v^(t)]
                + 2 μ^(t) (1 − α^(t) h) Cov(θ^(t), v^(t)) + (α^(t) h σ)² ]
        = ½ h [ (1 − α^(t) h)² ( 𝔼[θ^(t)]² + 𝕍[θ^(t)] ) + (μ^(t))² ( 𝔼[v^(t)]² + 𝕍[v^(t)] )
                + 2 μ^(t) (1 − α^(t) h) ( 𝔼[θ^(t)] 𝔼[v^(t)] + Cov(θ^(t), v^(t)) ) + (α^(t) h σ)² ].

For simplicity, we denote A(·) = 𝔼[·]² + 𝕍[·], and notice that 𝔼[θ^(t) v^(t)] = 𝔼[θ^(t)] 𝔼[v^(t)] + Cov(θ^(t), v^(t)); hence,

ℒ^(t+1) = ½ h [ (1 − α^(t) h)² A(θ^(t)) + (μ^(t))² A(v^(t)) + 2 μ^(t) (1 − α^(t) h) 𝔼[θ^(t) v^(t)] + (α^(t) h σ)² ].  (15)

In order to find the optimal learning rate and momentum for minimizing ℒ^(t+1), we take the derivatives with respect to α^(t) and μ^(t), and set them to 0:

∂ℒ^(t+1)/∂α^(t) = (1 − α^(t) h) A(θ^(t)) (−h) − μ^(t) h 𝔼[θ^(t) v^(t)] + α^(t) (h σ)² = 0
  ⟹ α^(t) h ( A(θ^(t)) + σ² ) = A(θ^(t)) + μ^(t) 𝔼[θ^(t) v^(t)]
∂ℒ^(t+1)/∂μ^(t) = μ^(t) A(v^(t)) + (1 − α^(t) h) 𝔼[θ^(t) v^(t)] = 0
  ⟹ μ^(t)* = − (1 − α^(t) h) 𝔼[θ^(t) v^(t)] / A(v^(t))
  ⟹ α^(t) h ( A(θ^(t)) + σ² ) = A(θ^(t)) − (1 − α^(t) h) 𝔼[θ^(t) v^(t)]² / A(v^(t))
  ⟹ α^(t)* = ( A(θ^(t)) A(v^(t)) − 𝔼[θ^(t) v^(t)]² ) / ( h ( A(v^(t)) ( A(θ^(t)) + σ² ) − 𝔼[θ^(t) v^(t)]² ) ).

A.2.2 High-dimensional case

The loss is the sum of losses along all directions:

ℒ^(t+1) = Σ_i ½ h_i [ (1 − α^(t) h_i)² A(θ_i^(t)) + (μ^(t))² A(v_i^(t)) + 2 μ^(t) (1 − α^(t) h_i) 𝔼[θ_i^(t) v_i^(t)] + (α^(t) h_i σ_i)² ]

Now we obtain optimal learning rate and momentum by setting the derivative to 0,

∂ℒ^(t+1)/∂α^(t) = Σ_i h_i [ (1 − α^(t) h_i) A(θ_i^(t)) (−h_i) − μ^(t) h_i 𝔼[θ_i^(t) v_i^(t)] + α^(t) (h_i σ_i)² ] = 0
  ⟹ α^(t) Σ_i h_i³ ( A(θ_i^(t)) + σ_i² ) = Σ_i ( h_i² A(θ_i^(t)) + μ^(t) h_i² 𝔼[θ_i^(t) v_i^(t)] )
∂ℒ^(t+1)/∂μ^(t) = Σ_i h_i [ μ^(t) A(v_i^(t)) + (1 − α^(t) h_i) 𝔼[θ_i^(t) v_i^(t)] ] = 0
  ⟹ μ^(t)* = − Σ_i h_i (1 − α^(t) h_i) 𝔼[θ_i^(t) v_i^(t)] / Σ_i h_i A(v_i^(t))
Substituting μ^(t)* into the first equation and clearing the denominator Σ_j h_j A(v_j^(t)) gives
α^(t) Σ_i ( h_i³ ( A(θ_i^(t)) + σ_i² ) [Σ_j h_j A(v_j^(t))] − [Σ_j h_j² 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] )
  = Σ_i ( h_i² A(θ_i^(t)) [Σ_j h_j A(v_j^(t))] − [Σ_j h_j 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] )
  ⟹ α^(t)* = Σ_i ( h_i² A(θ_i^(t)) [Σ_j h_j A(v_j^(t))] − [Σ_j h_j 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] ) / Σ_i ( h_i³ ( A(θ_i^(t)) + σ_i² ) [Σ_j h_j A(v_j^(t))] − [Σ_j h_j² 𝔼[θ_j^(t) v_j^(t)]] h_i² 𝔼[θ_i^(t) v_i^(t)] ).

A.3 Univariate Optimality in SGD

We now consider a dynamic programming approach to solve the problem. We formalize the optimization problem over {α^(t)} as follows. We first denote ℒ_min as the minimum expected loss at the last time step T (i.e., under the optimal learning rate schedule):

ℒ_min = min_{α^(t), α^(t+1), …, α^(T−1)} 𝔼_{ξ^(t), ξ^(t+1), …, ξ^(T−1)} [ ℒ(θ^(T)) ].

Recall that the loss can be expressed in terms of the expectation and variance of θ. Denote A^(t) = (𝔼[θ^(t)])² + 𝕍[θ^(t)]. The final loss can be expressed in terms of A_min^(T), obtained by using the optimal learning rate schedule:

ℒ_min = ½ h ( A_min^(T) + σ² ).

As in Theorem 1, we derive the dynamics for SGD without momentum:

θ^(t) = (1 − α^(t−1) h) θ^(t−1) − α^(t−1) h σ ξ^(t−1)
( 𝔼[θ^(t)], 𝕍[θ^(t)] ) = ( (1 − α^(t−1) h) 𝔼[θ^(t−1)],  (1 − α^(t−1) h)² 𝕍[θ^(t−1)] + (α^(t−1) h σ)² ).

Thus, we can find a recurrence relation for the sequence A^(t):

A^(t) = (1 − α^(t−1) h)² ( (𝔼[θ^(t−1)])² + 𝕍[θ^(t−1)] ) + (α^(t−1) h σ)² = (1 − α^(t−1) h)² A^(t−1) + (α^(t−1) h σ)².

Since A_min^(T) is a function of α^(T−1), we can obtain the optimal learning rate α^(T−1)* by taking the derivative of ℒ_min with respect to α^(T−1) and setting it to zero:

dℒ_min/dα^(T−1) = ½ h · dA_min^(T)/dα^(T−1) = 0
  ⟹ dA_min^(T)/dα^(T−1) = 0
  ⟹ α^(T−1)* = A_min^(T−1) / ( h ( A_min^(T−1) + σ² ) ).

Thus we can write A_min^(T) in terms of A_min^(T−1) and the optimal α^(T−1)*:

A_min^(T) = ( 1 − A_min^(T−1)/(A_min^(T−1) + σ²) )² A_min^(T−1) + ( A_min^(T−1) σ / (A_min^(T−1) + σ²) )²
          = ( σ² / (A_min^(T−1) + σ²) )² A_min^(T−1) + ( A_min^(T−1) σ / (A_min^(T−1) + σ²) )²
          = A_min^(T−1) σ² / ( A_min^(T−1) + σ² ).

Therefore,

ℒ_min = ½ h ( A_min^(T) + σ² )
      = ½ h ( A_min^(T−1) σ² / ( A_min^(T−1) + σ² ) + σ² ).

We now generalize the above derivation: we rewrite ℒ_min in terms of A_min^(T−k) and calculate the optimal learning rate at time step T−k.

Theorem 4.

For all T and all k with 1 ≤ k ≤ T, we have,

ℒ_min = ½ h ( A_min^(T−k) σ² / ( k A_min^(T−k) + σ² ) + σ² ).  (16)

Therefore, the optimal learning rate α^(t) at time step t is given by,

α^(t)* = A^(t) / ( h ( A^(t) + σ² ) ).  (17)
Proof.

The form of ℒ_min can be easily proven by induction on k, using the identity

( (ab/(a+b)) · b ) / ( k · (ab/(a+b)) + b ) = ab / ( (k+1) a + b ).

The optimal learning rate then follows immediately by taking the derivative of ℒ_min with respect to α^(T−k−1) and setting it to zero. Note that the subscript min is omitted from A^(t) in Eq. (17), as we assume all A^(t) are obtained using the optimal α*, and hence are minimal. ∎