Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Abstract
Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes. Code available at https://github.com/renmengye/metaoptimpublic.
1 Introduction
The learning rate is one of the most important and frustrating hyperparameters to tune in deep learning. Too small a value causes slow progress, while too large a value causes fluctuations or even divergence. While a fixed learning rate often works well for simpler problems, good performance on the ImageNet (russakovsky2015imagenet) benchmark requires a carefully tuned schedule. A variety of decay schedules have been proposed for different architectures, including polynomial, exponential, staircase, etc. Learning rate decay is also required to achieve convergence guarantees for stochastic gradient methods under certain conditions (bottou1999online). Clever learning rate heuristics have resulted in large improvements in training efficiency (imagenet1hour; CLR). A related hyperparameter is momentum; while it is typically fixed to a reasonable value such as 0.9, careful tuning can also give significant performance gains (Sutskever13). While optimizers such as Adam (kingma2015adam) are often described as adapting coordinate-specific learning rates, they in fact also have global learning rate and momentum hyperparameters analogous to those of SGD, and tuning at least the learning rate can be important to good performance.
In light of this, it is not surprising that there have been many attempts to adapt learning rates, either online during optimization (schraudolph1999smd; schaul2013nomorepesky), or offline by fitting a learning rate schedule (maclaurin2015hypergrad). More ambitiously, others have attempted to learn an optimizer (l2l; l2o; MAML; lv2017learngd; wichrowska2017learnopt; metz2017unrollgan). All of these approaches are forms of meta-optimization, where one defines a meta-objective (typically the expected loss after some number of optimization steps) and tunes the hyperparameters to minimize this meta-objective. But because gradient-based meta-optimization can require thousands of updates, each of which unrolls the entire base-level optimization procedure, the meta-optimization is thousands of times more expensive than the base-level optimization. Therefore, the meta-objective must be defined with a much smaller time horizon (e.g. hundreds of updates) than we are ordinarily interested in for large-scale optimization. The hope is that the learned hyperparameters or optimizer will generalize well to much longer time horizons. Unfortunately, as we show in this paper, this is not achieved. The reason is a strong tradeoff between short-term and long-term performance, which we refer to as short-horizon bias.
In this work, we investigate short-horizon bias both mathematically and empirically. First, we analyze a quadratic cost function with noisy gradients based on schaul2013nomorepesky. We consider this a good proxy for neural net training because second-order optimization algorithms have been shown to train neural networks in orders-of-magnitude fewer iterations (martens_hf), suggesting that much of the difficulty of SGD training can be explained by quadratic approximations to the cost. In our noisy quadratic problem, the dynamics of SGD with momentum can be analyzed exactly, allowing us to derive the greedy-optimal (i.e. 1-step horizon) learning rate and momentum in closed form, as well as to (locally) minimize the long-horizon loss using gradient descent. We analyze the differences between the short-horizon and long-horizon schedules.
Interestingly, when the noisy quadratic problem is either deterministic or spherical, greedy schedules are optimal. However, when the problem is both stochastic and badly conditioned (as is most neural net training), the greedy schedules decay the learning rate far too quickly, leading to slow convergence towards the optimum. This is because reducing the learning rate dampens the fluctuations along high curvature directions, giving a large immediate reduction in loss. But this comes at the expense of long-run performance, because the optimizer fails to make progress along low curvature directions. This phenomenon is illustrated in Figure 1, a noisy quadratic problem in 2 dimensions, in which two learning rate schedules are compared: a small fixed learning rate (blue), versus a larger fixed learning rate (red) followed by exponential decay (yellow). The latter schedule initially has higher loss, but it makes more progress towards the optimum, such that it achieves an even smaller loss once the learning rate is decayed.
Figure 2 shows this effect quantitatively for a noisy quadratic problem in 1000 dimensions (defined in Section 2.3). The solid lines show the loss after various numbers of steps of lookahead with a fixed learning rate; if this is used as the meta-objective, it favors small learning rates. The dashed curves show the loss if the same trajectories are followed by $50$ steps with an exponentially decayed learning rate; these curves favor higher learning rates, and bear little obvious relationship to the solid ones. This illustrates the difficulty of selecting learning rates based on short-horizon information.
The second part of our paper empirically investigates gradient-based meta-optimization for neural net training. We consider two idealized meta-optimization algorithms: an offline algorithm which fits a learning rate decay schedule by running optimization many times from scratch, and an online algorithm which adapts the learning rate during training. Since our interest is in studying the effect of the meta-objective itself rather than failures of meta-optimization, we give the meta-optimizers sufficient time to optimize their meta-objectives well. We show that short-horizon meta-optimizers, both online and offline, dramatically underperform a hand-tuned fixed learning rate, and sometimes cause the base-level optimization progress to slow to a crawl, even with moderately long time horizons (e.g. 100 or 1000 steps) similar to those used in prior work on gradient-based meta-optimization.
In short, we expect that any meta-objective which does not correct for short-horizon bias will probably fail when run for a much longer time horizon than it was trained on. There are applications where short-horizon meta-optimization is directly useful, such as few-shot learning (mann; ravi2017oneshot). In those settings, short-horizon bias is by definition not an issue. But much of the appeal of meta-optimization comes from the possibility of using it to speed up or simplify the training of large neural networks. In such settings, short-horizon bias is a fundamental obstacle that must be addressed for meta-optimization to be practically useful.
2 Noisy quadratic problem
In this section, we consider a toy problem which demonstrates the short-horizon bias and can be analyzed analytically. In particular, we borrow the noisy quadratic model of schaul2013nomorepesky; the true function being optimized is a quadratic, but in each iteration we observe a noisy version with the correct curvature but a perturbed minimum. This can be equivalently viewed as noisy observations of the gradient, which are intended to capture the stochasticity of a minibatch-based optimizer. We analyze the dynamics of SGD with momentum on this example, and compare the long-horizon-optimized and greedy-optimal learning rate schedules.
2.1 Background
Approximating the cost surface of a neural network with a quadratic function has led to powerful insights and algorithms. Second-order optimization methods such as Newton-Raphson and natural gradient (amari1998) iteratively minimize a quadratic approximation to the cost function. Hessian-free (HF) optimization (martens_hf) is an approximate natural gradient method which tries to minimize a quadratic approximation using conjugate gradient. It can often fit deep neural networks in orders-of-magnitude fewer updates than SGD, suggesting that much of the difficulty of neural net optimization is captured by quadratic models. In the setting of Bayesian neural networks, quadratic approximations to the log-likelihood motivated the Laplace approximation (MacKay1992) and variational inference (graves11; noisykfac). koh2017understanding used quadratic approximations to analyze the sensitivity of a neural network’s predictions to particular training labels, thereby yielding insight into adversarial examples.
Such quadratic approximations to the cost function have also provided insights into learning rate and momentum adaptation. In a deterministic setting, under certain conditions, second-order optimization algorithms can be run with a learning rate of 1; for this reason, HF was able to eliminate the need to tune learning rate or momentum hyperparameters. KFAC observed that for a deterministic quadratic cost function, greedily choosing the learning rate and momentum to minimize the error on the next step is equivalent to conjugate gradient (CG). Since CG achieves the minimum possible loss of any gradient-based optimizer on each iteration, the greedily chosen learning rates and momenta are optimal, in the sense that the greedy sequence achieves the minimum possible loss value of any sequence of learning rates and momenta. This property fails to hold in the stochastic setting, however, and as we show in this section, the greedy choice of learning rate and momentum can do considerably worse than optimal.
Our primary interest in this work is in adapting scalar learning rate and momentum hyperparameters shared across all dimensions. Some optimizers based on diagonal curvature approximations (kingma2015adam) have been motivated in terms of adapting dimension-specific learning rates, but in practice, one still needs to tune scalar learning rate and momentum hyperparameters. Even KFAC (KFAC), which is based on more powerful curvature approximations, has scalar learning rate and momentum hyperparameters. Our analysis applies to all of these methods, since they can be viewed as performing SGD in a preconditioned space.
2.2 Analysis
2.2.1 Notations
We will primarily focus on the SGD with momentum algorithm in this paper. The update is written as follows:
${\mathbf{v}}^{(t+1)}$  $={\mu}^{(t)}{\mathbf{v}}^{(t)}-{\alpha}^{(t)}{\nabla}_{{\bm{\theta}}^{(t)}}\mathcal{L},$  (1)  
${\bm{\theta}}^{(t+1)}$  $={\bm{\theta}}^{(t)}+{\mathbf{v}}^{(t+1)},$  (2) 
where $\mathcal{L}$ is the loss function, $t$ is the training step, and ${\alpha}^{(t)}$ is the learning rate. We call the gradient trace ${\mathbf{v}}^{(t)}$ “velocity”, and its decay constant ${\mu}^{(t)}$ “momentum”. We denote the $i$th coordinate of a vector $\mathbf{v}$ as ${v}_{i}$. When we focus on a single dimension, we sometimes drop the dimension subscripts. We also denote $A(\cdot )=\mathbb{E}{[\cdot ]}^{2}+\mathbb{V}[\cdot ]$, where $\mathbb{E}$ and $\mathbb{V}$ denote expectation and variance respectively.
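For reference, the update in Eqns. 1 and 2 can be written as a short NumPy sketch (the function name and the toy quadratic example are ours, for illustration only):

```python
import numpy as np

def momentum_sgd_step(theta, v, grad, alpha, mu):
    """One step of SGD with momentum (Eqns. 1-2): the velocity v is a decayed
    trace of past gradients, and the parameters move by the new velocity."""
    v_new = mu * v - alpha * grad   # Eqn. 1
    theta_new = theta + v_new       # Eqn. 2
    return theta_new, v_new

# Example on a 1-D quadratic L(theta) = 0.5 * h * theta^2, with gradient h * theta.
h = 2.0
theta, v = 1.0, 0.0
theta, v = momentum_sgd_step(theta, v, h * theta, alpha=0.1, mu=0.9)
```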
2.2.2 Problem formulation
We now define the noisy quadratic model, where in each iteration, the optimizer is given the gradient for a noisy version of a quadratic cost function, where the curvature is correct but the minimum is sampled stochastically from a Gaussian distribution. We assume WLOG that the Hessian is diagonal, because SGD is a rotation-invariant algorithm, and therefore the dynamics can be analyzed in a coordinate system corresponding to the eigenvectors of the Hessian. We make the further (nontrivial) assumption that the noise covariance is also diagonal. (This amounts to assuming that the Hessian and the noise covariance are codiagonalizable. One heuristic justification for this assumption in the context of neural net optimization is that under certain assumptions, the covariance of the gradients is proportional to the Fisher information matrix, which is close to the Hessian (martens_natural_gradient).) Mathematically, the stochastic cost function is written as:
$$\widehat{\mathcal{L}}(\bm{\theta})=\frac{1}{2}\sum _{i}{h}_{i}{({\theta}_{i}-{c}_{i})}^{2},$$  (3) 
where $\mathbf{c}$ is the stochastic minimum, and each ${c}_{i}$ follows a Gaussian distribution with mean ${\theta}_{i}^{*}$ and variance ${\sigma}_{i}^{2}$. The expected loss is given by:
$$\mathcal{L}(\bm{\theta})=\mathbb{E}\left[\widehat{\mathcal{L}}(\bm{\theta})\right]=\frac{1}{2}\sum _{i}{h}_{i}\left({({\theta}_{i}-{\theta}_{i}^{*})}^{2}+{\sigma}_{i}^{2}\right).$$  (4) 
The optimum of $\mathcal{L}$ is given by ${\bm{\theta}}^{*}=\mathbb{E}[\mathbf{c}]$; we assume WLOG that ${\bm{\theta}}^{*}=\mathbf{0}$. The stochastic gradient is given by $\frac{\partial \widehat{\mathcal{L}}}{\partial {\theta}_{i}}={h}_{i}({\theta}_{i}-{c}_{i})$. Since the deterministic gradient is given by $\frac{\partial \mathcal{L}}{\partial {\theta}_{i}}={h}_{i}{\theta}_{i}$, the stochastic gradient can be viewed as a noisy Gaussian observation of the deterministic gradient with variance ${h}_{i}^{2}{\sigma}_{i}^{2}$. This interpretation motivates the use of this noisy quadratic problem as a model of SGD dynamics.
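The model is easy to simulate. The following sketch (our own helper names, not from the paper's code) draws the stochastic minimum $\mathbf{c}$ and evaluates the stochastic gradient and the expected loss, taking $\bm{\theta}^{*}=\mathbf{0}$:

```python
import numpy as np

def stochastic_gradient(theta, h, sigma, rng):
    """Gradient of the noisy objective in Eqn. 3: sample the perturbed minimum
    c ~ N(0, sigma^2) (with theta* = 0), then return h * (theta - c).
    Coordinate-wise, this is a Gaussian observation of the true gradient
    h * theta with variance h^2 * sigma^2."""
    c = rng.normal(0.0, sigma)
    return h * (theta - c)

def expected_loss(theta, h, sigma):
    """Expected loss of Eqn. 4 with theta* = 0."""
    return 0.5 * np.sum(h * (theta**2 + sigma**2))

rng = np.random.default_rng(0)
g = stochastic_gradient(np.ones(3), np.array([1.0, 2.0, 4.0]), np.ones(3), rng)
```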
We treat the iterate ${\bm{\theta}}^{(t)}$ as a random variable (where the randomness comes from the sampled $\mathbf{c}$’s); the expected loss in each iteration is given by
$\mathbb{E}\left[\mathcal{L}({\bm{\theta}}^{(t)})\right]$  $=\mathbb{E}\left[{\displaystyle \frac{1}{2}}{\displaystyle \sum _{i}}{h}_{i}\left({({\theta}_{i}^{(t)})}^{2}+{\sigma}_{i}^{2}\right)\right]$  (5)  
$={\displaystyle \frac{1}{2}}{\displaystyle \sum _{i}}{h}_{i}\left(\mathbb{E}{\left[{\theta}_{i}^{(t)}\right]}^{2}+\mathbb{V}\left[{\theta}_{i}^{(t)}\right]+{\sigma}_{i}^{2}\right).$  (6) 
2.2.3 Optimized and Greedy-Optimal Schedules
We are interested in adapting a global learning rate ${\alpha}^{(t)}$ and a global momentum decay parameter ${\mu}^{(t)}$ for each time step $t$. We first derive a recursive formula for the mean and variance of the iterates at each step, and then analyze the greedy-optimal schedule for ${\alpha}^{(t)}$ and ${\mu}^{(t)}$.
Several observations allow us to compactly model the dynamics of SGD with momentum on the noisy quadratic model. First, $\mathbb{E}[\mathcal{L}({\bm{\theta}}^{(t)})]$ can be expressed in terms of $\mathbb{E}[{\theta}_{i}]$ and $\mathbb{V}[{\theta}_{i}]$ using Eqn. 5. Second, due to the diagonality of the Hessian and the noise covariance matrix, each coordinate evolves independently of the others. Third, the means and variances of the parameters ${\theta}_{i}^{(t)}$ and the velocity ${v}_{i}^{(t)}$ are functions of those statistics at the previous step.
Because each dimension evolves independently, we now drop the dimension subscripts. Combining these observations, we model the dynamics of SGD with momentum as a deterministic recurrence relation with sufficient statistics $\mathbb{E}[{\theta}^{(t)}]$, $\mathbb{E}[{v}^{(t)}]$, $\mathbb{V}[{\theta}^{(t)}]$, $\mathbb{V}[{v}^{(t)}]$, and ${\mathrm{\Sigma}}_{\theta ,v}^{(t)}=\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})$. The dynamics are as follows:
Theorem 1 (Mean and variance dynamics).
The expectations of the parameter $\theta $ and the velocity $v$ are updated as,
$\mathbb{E}\left[{v}^{(t+1)}\right]$  $={\mu}^{(t)}\mathbb{E}\left[{v}^{(t)}\right]-{\alpha}^{(t)}h\,\mathbb{E}\left[{\theta}^{(t)}\right],$  
$\mathbb{E}\left[{\theta}^{(t+1)}\right]$  $=\mathbb{E}\left[{\theta}^{(t)}\right]+\mathbb{E}\left[{v}^{(t+1)}\right].$  
The variances of the parameter $\theta $ and the velocity $v$ are updated as  
$\mathbb{V}\left[{v}^{(t+1)}\right]$  $={\left({\mu}^{(t)}\right)}^{2}\mathbb{V}\left[{v}^{(t)}\right]+{\left({\alpha}^{(t)}h\right)}^{2}\mathbb{V}\left[{\theta}^{(t)}\right]-2{\mu}^{(t)}{\alpha}^{(t)}h\,{\mathrm{\Sigma}}_{\theta ,v}^{(t)}+{\left({\alpha}^{(t)}h\sigma \right)}^{2},$  
$\mathbb{V}\left[{\theta}^{(t+1)}\right]$  $=\left(1-2{\alpha}^{(t)}h\right)\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right]+2{\mu}^{(t)}{\mathrm{\Sigma}}_{\theta ,v}^{(t)},$  
${\mathrm{\Sigma}}_{\theta ,v}^{(t+1)}$  $={\mu}^{(t)}{\mathrm{\Sigma}}_{\theta ,v}^{(t)}-{\alpha}^{(t)}h\,\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right].$ 
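Since the five statistics fully determine the next step's statistics, Theorem 1 can be iterated as a small deterministic recursion. A per-coordinate sketch (the function and variable names are ours, for illustration):

```python
def moment_step(m_th, m_v, v_th, v_v, cov, h, sigma, alpha, mu):
    """One application of the Theorem 1 recurrences for a single coordinate,
    propagating E[theta], E[v], V[theta], V[v], and Cov(theta, v)."""
    m_v1 = mu * m_v - alpha * h * m_th
    m_th1 = m_th + m_v1
    v_v1 = (mu**2) * v_v + (alpha * h)**2 * v_th - 2 * mu * alpha * h * cov \
           + (alpha * h * sigma)**2
    v_th1 = (1 - 2 * alpha * h) * v_th + v_v1 + 2 * mu * cov
    cov1 = mu * cov - alpha * h * v_th + v_v1
    return m_th1, m_v1, v_th1, v_v1, cov1
```

In the noiseless, momentum-free special case, this reduces to the familiar contraction $\mathbb{E}[\theta^{(t+1)}]=(1-\alpha h)\,\mathbb{E}[\theta^{(t)}]$ with all second moments remaining zero.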
By applying Theorem 1 recursively, we can obtain $\mathbb{E}[{\bm{\theta}}^{(t)}]$ and $\mathbb{V}[{\bm{\theta}}^{(t)}]$, and hence $\mathbb{E}[\mathcal{L}({\bm{\theta}}^{(t)})]$, for every $t$. Therefore, using gradient-based optimization, we can fit a locally optimal learning rate and momentum schedule, i.e. a sequence of values ${\{({\alpha}^{(t)},{\mu}^{(t)})\}}_{t=1}^{T}$ which locally minimizes $\mathbb{E}[\mathcal{L}({\bm{\theta}}^{(T)})]$ at some particular time $T$. We refer to this as the optimized schedule.
Furthermore, there is a closed-form solution for one-step lookahead, i.e., we can solve for the optimal learning rate ${\alpha}^{(t)*}$ and momentum ${\mu}^{(t)*}$ that minimize $\mathbb{E}[\mathcal{L}({\bm{\theta}}^{(t+1)})]$ given the statistics at time $t$. We call this the greedy-optimal schedule.
Theorem 2 (Greedy-optimal learning rate and momentum).
The greedy-optimal learning rate and momentum schedule is given by
$${\alpha}^{(t)*}=\frac{\sum_{i}{h}_{i}^{2}A\left({\theta}_{i}^{(t)}\right)-\left[\sum_{j}{h}_{j}A\left({v}_{j}^{(t)}\right)\right]^{-1}\left(\sum_{j}{h}_{j}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right)\sum_{i}{h}_{i}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]}{\sum_{i}{h}_{i}^{3}\left[A\left({\theta}_{i}^{(t)}\right)+{\sigma}_{i}^{2}\right]-\left[\sum_{j}{h}_{j}A\left({v}_{j}^{(t)}\right)\right]^{-1}\left(\sum_{j}{h}_{j}^{2}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right)\sum_{i}{h}_{i}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]},$$
$${\mu}^{(t)*}=-\frac{\sum_{i}{h}_{i}\left(1-{\alpha}^{(t)*}{h}_{i}\right)\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]}{\sum_{i}{h}_{i}A\left({v}_{i}^{(t)}\right)}.$$
Note that schaul2013nomorepesky derived the greedy-optimal learning rate for SGD, and Theorem 2 extends it to the greedy-optimal learning rate and momentum for SGD with momentum.
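To make the formulas concrete, here is a direct transcription of Theorem 2 (our own code and variable names, derived from the one-step expected loss), together with the one-step objective it minimizes; because that objective is exactly quadratic in $(\alpha ,\mu )$, a central finite difference confirms the pair is a stationary point:

```python
import numpy as np

def one_step_expected_loss(alpha, mu, h, sig2, A_th, A_v, C):
    """E[L(theta^{(t+1)})] for the noisy quadratic, in terms of the
    per-coordinate statistics A(theta), A(v), and C = E[theta v]."""
    return 0.5 * np.sum(h * ((1 - alpha * h)**2 * A_th + mu**2 * A_v
                             + 2 * mu * (1 - alpha * h) * C
                             + alpha**2 * h**2 * sig2 + sig2))

def greedy_alpha_mu(h, sig2, A_th, A_v, C):
    """Greedy-optimal (alpha, mu) of Theorem 2, per-coordinate arrays in."""
    s = np.sum(h * A_v)
    num = np.sum(h**2 * A_th) - np.sum(h * C) * np.sum(h**2 * C) / s
    den = np.sum(h**3 * (A_th + sig2)) - np.sum(h**2 * C)**2 / s
    alpha = num / den
    mu = -np.sum(h * (1 - alpha * h) * C) / s
    return alpha, mu
```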




[Figure 3: training curves and the corresponding learning rate and momentum schedules; (a) stochastic (noisy quadratic) case, (b) deterministic case.]
2.2.4 Univariate and Spherical Cases
As noted in Section 2.1, KFAC found the greedy choice of $\alpha $ and $\mu $ to be optimal for gradient descent on deterministic quadratic objectives. We now show that the greedy schedule is also optimal for SGD without momentum in the case of univariate noisy quadratics, and hence also for multivariate ones with spherical Hessians and gradient covariances. In particular, the following holds for SGD without momentum on a univariate noisy quadratic:
Theorem 3 (Optimal learning rate, univariate).
For all $T\in\mathbb{N}$, the sequence of learning rates ${\{{\alpha}^{(t)*}\}}_{t=1}^{T-1}$ that minimizes $\mathcal{L}({\theta}^{(T)})$ is given by
$${\alpha}^{(t)*}=\frac{A\left({\theta}^{(t)}\right)}{h\left(A\left({\theta}^{(t)}\right)+{\sigma}^{2}\right)}.$$  (7) 
Moreover, this agrees with the greedyoptimal learning rate schedule as derived by schaul2013nomorepesky.
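A minimal transcription of Eqn. 7 (the function name is ours) makes the decay mechanism visible: as $A(\theta^{(t)})$ shrinks during training, ${\alpha}^{(t)*}$ shrinks below the Newton step $1/h$, which it attains only when the noise ${\sigma}^{2}$ vanishes.

```python
def greedy_alpha_univariate(A_theta, h, sigma2):
    """Eqn. 7: the optimal (and greedy-optimal) learning rate for univariate
    SGD without momentum on the noisy quadratic, where A(x) = E[x]^2 + V[x]."""
    return A_theta / (h * (A_theta + sigma2))
```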
If the Hessian and the gradient covariance are both spherical, then each dimension evolves identically and independently according to the univariate dynamics. Of course, one is unlikely to encounter an optimization problem where both are exactly spherical. But some approximate secondorder optimizers, such as KFAC, can be viewed as preconditioned SGD, i.e. SGD in a transformed space where the Hessian and the gradient covariance are better conditioned (KFAC). In principle, with a good enough preconditioner, the Hessian and the gradient covariance would be close enough to spherical that a greedy choice of $\alpha $ and $\mu $ would perform well. It will be interesting to investigate whether any practical optimization algorithms demonstrate this behavior.
2.3 Experiments
In this section, we compare the optimized and greedy-optimal schedules on a noisy quadratic problem. We chose a 1000-dimensional quadratic cost function with the curvature distribution from Li05sharpnessin, on which CG achieves its worst-case convergence rate. We assume that ${h}_{i}=\mathbb{V}[\frac{\partial \mathcal{L}}{\partial {\theta}_{i}}]$, and hence ${\sigma}_{i}^{2}=\frac{1}{{h}_{i}}$; this choice is motivated by the observation that under certain assumptions, the Fisher information matrix is a good approximation to the Hessian matrix, but also reflects the covariance structure of the gradient noise (martens_natural_gradient). We computed the greedy-optimal schedules using Theorem 2. For the optimized schedules, we minimized the expected loss at time $T=250$ using Adam (kingma2015adam), with learning rate 0.003 and 500 steps. We set an upper bound for the learning rate which prevented the loss component for any dimension from becoming larger than its initial value; this was needed because otherwise the optimized schedule allowed the loss to temporarily grow very large, a pathological solution which would be unstable on realistic problems. We also considered a fixed learning rate and momentum, with the two hyperparameters fit using Adam. The training curves and the corresponding learning rates and momenta are shown in Figure 3(a). The optimized schedule achieved a much lower final expected loss (4.25) than was obtained by the greedy-optimal schedule (63.86) or the fixed schedule (42.19).
We also show the sums of the losses along the 50 highest curvature directions and the 50 lowest curvature directions. We find that under the optimized schedule, the losses along the high curvature directions hardly decrease initially. However, because it maintains a high learning rate, the losses along the low curvature directions decrease significantly. After 50 iterations, it begins decaying the learning rate, at which point it achieves a large drop in both the high-curvature and total losses. On the other hand, under the greedy-optimal schedule, the learning rates and momenta become small very early on, which immediately reduces the losses on the high curvature directions, and hence also the total loss. However, in the long term, since the learning rates are too small to make substantial progress along the low curvature directions, the total loss converges to a much higher value. This gives valuable insight into the nature of the short-horizon bias in meta-optimization: short-horizon objectives will often encourage the learning rate and momentum to decay quickly, so as to achieve the largest gain in the short term, but at the expense of long-run performance.
It is interesting to compare this behavior with the deterministic case. We repeated the above experiment for a deterministic quadratic cost function (i.e. ${\sigma}_{i}^{2}=0$) with the same Hessian; results are shown in Figure 3(b). The greedy schedule matches the optimized one, as predicted by the analysis of KFAC. This result illustrates that stochasticity is necessary for short-horizon bias to manifest. Interestingly, the learning rate and momentum schedules in the deterministic case are nearly flat, while the optimized schedules for the stochastic case are much more complex, suggesting that stochastic optimization raises a different set of issues for hyperparameter adaptation.
3 Gradient-Based Meta-Optimization
We now turn our attention to gradient-based hyperparameter optimization. A variety of approaches have been proposed which tune hyperparameters by doing gradient descent on a meta-objective (schraudolph1999smd; maclaurin2015hypergrad; l2l). We empirically analyze an idealized version of a gradient-based meta-optimization algorithm called stochastic meta-descent (SMD) (schraudolph1999smd). Our version of SMD is idealized in two ways: first, we drop the algorithmic tricks used in prior work, and instead allow the meta-optimizer more memory and computation than would be economical in practice. Second, we limit the representational power of our meta-model: whereas l2l aimed to learn a full optimization algorithm, we focus on the much simpler problem of adapting learning rate and momentum hyperparameters, or schedules thereof. The aim of these two simplifications is to optimize the meta-objective well enough that any base-level optimization failures can be attributed to deficiencies in the meta-objective itself (such as short-horizon bias) rather than incomplete meta-optimization.
Despite these simplifications, we believe our experiments are relevant to practical meta-optimization algorithms which optimize the meta-objective less thoroughly. Since the goal of the meta-optimizer is to adapt two hyperparameters, it’s possible that poor meta-optimization could cause the hyperparameters to get stuck in regions that happen to perform well; indeed, we observed this phenomenon in some of our early explorations. But it would be dangerous to rely on poor meta-optimization, since improved meta-optimization methods would then lead to worse base-level performance, and tuning the meta-optimizer could become a roundabout way of tuning learning rates and momenta.
We also believe our experiments are relevant to meta-optimization methods which aim to learn entire algorithms. Even if the learned algorithms don’t have explicit learning rate parameters, it’s possible for a learning rate schedule to be encoded into an algorithm itself; for instance, Adagrad (duchi2011adagrad) implicitly uses a polynomial decay schedule because it sums rather than averages the squared derivatives in the denominator. Hence, one would need to worry about whether the meta-optimizer is implicitly fitting a learning rate schedule that’s optimized for short-term performance.
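As a toy illustration of this point (our own example, not from the paper): with a constant gradient of magnitude $g$, Adagrad's accumulator after $t$ steps is $t g^{2}$, so its effective step size is $\alpha /(g\sqrt{t})$, i.e. a $1/\sqrt{t}$ decay schedule baked into the algorithm.

```python
import math

def adagrad_effective_step(alpha, g, t):
    """Effective Adagrad step size after t identical gradients of magnitude g:
    alpha / sqrt(sum of squared gradients) = alpha / (g * sqrt(t))."""
    return alpha / math.sqrt(t * g * g)
```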
3.1 Background: Stochastic Meta-Descent
The high-level idea of stochastic meta-descent (SMD) (schraudolph1999smd) is to perform gradient descent on the learning rate, or any other differentiable hyperparameters. This is feasible since any gradient-based optimization algorithm can be unrolled as a computation graph (see Figure 4), and automatic differentiation is readily available in most deep learning libraries.
There are two basic types of automatic differentiation (autodiff) methods: forward mode and reverse mode. In forward mode autodiff, directional derivatives are computed alongside the forward computation. In contrast, reverse mode autodiff (a.k.a. backpropagation) computes the gradients moving backwards through the computation graph. Meta-optimization using reverse mode can be computationally demanding due to memory constraints, since the parameters need to be stored at every step. maclaurin2015hypergrad got around this by cleverly exploiting approximate reversibility to minimize the memory cost of activations. Since we are optimizing only two hyperparameters, however, forward mode autodiff can be done cheaply. Here, we provide the forward differentiation equations for obtaining the gradient with respect to the learning rate of vanilla SGD. Let ${\bm{u}}_{t}$ denote $\frac{d{\bm{\theta}}_{t}}{d\alpha}$, ${\alpha}^{\prime}$ denote $\frac{d{\mathcal{L}}_{t}}{d\alpha}$, and ${H}_{t}$ denote the Hessian at step $t$. By the chain rule, we get,
${\alpha}^{\prime}$  $={\bm{g}}_{t}\cdot {\bm{u}}_{t1},$  (8)  
${\bm{u}}_{t}$  $={\bm{u}}_{t-1}-{\bm{g}}_{t}-\alpha {H}_{t}{\bm{u}}_{t-1}.$  (9) 
While the Hessian is infeasible to construct explicitly, the Hessian-vector product in Equation 9 can be computed efficiently using reverse-on-reverse (werbos1988backprop) or forward-on-reverse automatic differentiation (pearlmutter1994hessian), in time linear in the cost of the forward pass. See schraudolph2002smdmv for more details.
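The recurrence is easy to check on a small deterministic quadratic, where the Hessian can be formed explicitly and $H\bm{u}$ is an exact Hessian-vector product. This sketch and its names are ours; a finite-difference derivative agrees with the accumulated hypergradient:

```python
import numpy as np

def unrolled_loss_and_hypergrad(theta0, alpha, H, T):
    """Unroll T steps of vanilla SGD on L(theta) = 0.5 theta^T H theta, carrying
    u_t = d theta_t / d alpha forward alongside the parameters (Eqn. 9), and
    return the final loss and its derivative w.r.t. alpha (Eqn. 8)."""
    theta = theta0.astype(float).copy()
    u = np.zeros_like(theta)
    for _ in range(T):
        g = H @ theta                    # exact gradient of the quadratic
        u = u - g - alpha * (H @ u)      # forward accumulation (Eqn. 9)
        theta = theta - alpha * g        # vanilla SGD step
    loss = 0.5 * theta @ H @ theta
    return loss, (H @ theta) @ u         # Eqn. 8 at the final step
```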
Using the gradients with respect to hyperparameters, as given in Eqns. 8 and 9, we can apply gradient-based meta-optimization, just like optimizing regular parameters. It is worth noting that, although SMD was originally proposed for optimizing vanilla SGD, in practice it can be applied to other optimization algorithms such as SGD with momentum or Adam (kingma2015adam). Moreover, gradient-based optimizers other than SGD can be used for the meta-optimization as well.
The basic SMD algorithm is given as Algorithm 1. Here, $\alpha $ is a set of hyperparameters (e.g. the learning rate), and ${\alpha}_{0}$ are initial hyperparameter values; $\bm{\theta}$ is a set of optimization intermediate variables, such as weights and velocities; $\eta $ is a set of meta-optimizer hyperparameters (e.g. the meta-learning rate). BGrad$(y,x,dy)$ is the backward gradient function that computes the gradients of the loss function w.r.t. $\bm{\theta}$, and FGrad$(y,x,dx)$ is the forward gradient function that accumulates the gradients of $\bm{\theta}$ with respect to $\alpha $. Step and MetaStep optimize regular parameters and hyperparameters, respectively, for one step using gradient-based methods. Additionally, $T$ is the lookahead window size, and $M$ is the number of meta-updates.
Simplifications from the original SMD algorithm.
The original SMD algorithm (schraudolph1999smd) fit coordinate-wise adaptive learning rates with intermediate gradients (${\bm{u}}_{t}$) accumulated throughout the process of training. Since computing separate directional derivatives for each coordinate using forward mode autodiff is computationally prohibitive, the algorithm used approximate updates. Both features introduced bias into the meta-gradients. We make several changes to the original algorithm. First, we tune only a global learning rate parameter. Second, we use exact forward mode accumulation, which is feasible for a single learning rate. Third, rather than accumulate directional derivatives during training, we compute the meta-updates on separate SGD trajectories simulated using fixed network parameters. Finally, we compute multiple meta-updates in order to ensure that the meta-objective is optimized sufficiently well. Together, these changes ensure unbiased meta-gradients, as well as careful optimization of the meta-objective, at the cost of high computational overhead. We do not recommend this approach as a practical SMD implementation, but rather as a way of understanding the biases in the meta-objective itself.
3.2 Offline meta-optimization
To understand the sensitivity of the optimized hyperparameters to the horizon, we first carried out an offline experiment with a multilayer perceptron (MLP) on MNIST (mnist). Specifically, we fit learning rate decay schedules offline by repeatedly training the network, with a single meta-gradient obtained from each training run.
Learnable decay schedule.
We used a parametric learning rate decay schedule known as inverse time decay (welling2011sgld): ${\alpha}_{t}=\frac{{\alpha}_{0}}{{(1+\frac{t}{K})}^{\beta}}$, where ${\alpha}_{0}$ is the initial learning rate, $t$ is the number of training steps, $\beta $ is the learning rate decay exponent, and $K$ is the time constant. We jointly optimized ${\alpha}_{0}$ and $\beta $. We fixed $\mu =0.9$, $K=5000$ for simplicity.
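The schedule and the two meta-learned parameters can be written out directly (variable names ours):

```python
def inverse_time_decay(alpha0, t, beta, K=5000):
    """Inverse time decay: alpha_t = alpha0 / (1 + t/K)**beta, with the time
    constant K fixed as in the experiments and (alpha0, beta) meta-learned."""
    return alpha0 / (1.0 + t / K) ** beta
```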
Experimental details.
The network had two hidden layers of 100 units each, with ReLU activations. Weights were initialized from a zero-mean Gaussian with standard deviation 0.1. We used a warm start from a network trained for 50 steps of SGD with momentum, using $\alpha =0.1$, $\mu =0.9$. (We used a warm start because the dynamics are generally different at the very start of training.) We trained all hyperparameters in log space using the Adam optimizer, with 5k meta steps.
Figure 5 shows SMD optimization trajectories on the meta-objective surfaces, initialized from multiple random hyperparameter settings. The SMD trajectories appear to have converged to the global optimum.
Importantly, the meta-objectives with longer horizons favored a much smaller learning rate decay exponent $\beta$, leading to a more gradual decay schedule. The meta-objective surfaces were very different depending on the time horizon, and the final $\beta$ value differed by over two orders of magnitude between the 100 and 20k step horizons.
Figure 6 plots the training curves of a network trained using the optimized learning rate schedules. The resulting training loss at 20k steps under the 100-step-horizon schedule was over three orders of magnitude larger than under the 20k-step-horizon schedule. In general, short horizons gave better performance initially, but were surpassed by longer horizons. The differences in test error were less drastic, but the 100-step network was severely undertrained, and the 1k-step network achieved noticeably worse test error than the longer-horizon ones.
3.3 Online Meta-Optimization
In this section, we study whether online adaptation also suffers from short-horizon bias. Specifically, we used Algorithm 1 to adapt the learning rate and momentum hyperparameters online while a network is trained. We experimented with an MLP on MNIST and a CNN on CIFAR-10 (Krizhevsky09learningmultiple).
Experimental details.
For the MNIST experiments, we used an MLP with two hidden layers of 100 units, with ReLU activations. Weights were initialized from a zero-mean Gaussian with standard deviation 0.1. For the CIFAR-10 experiments, we used a CNN adapted from Caffe (caffe), with 3 convolutional layers of filter size $3\times 3$ and depths [32, 32, 64], with $2\times 2$ max pooling with stride 2 after every convolution layer, followed by a fully connected hidden layer of 100 units. Meta-optimization was done with 100 steps of Adam for every 10 steps of regular training. We adapted the learning rate $\alpha$ and momentum $\mu$. After 25k steps, adaptation was stopped, and we trained for another 25k steps with an exponentially decaying learning rate that reached 1e-4 on the last time step. We reparameterized the learning rate as the effective learning rate ${\alpha}_{\text{eff}}=\frac{\alpha}{1-\mu}$, and the momentum as $1-\mu$, so that they can be optimized more smoothly in log space.
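The reparameterization into log-space coordinates can be sketched as follows (our illustration; the function names are ours):

```python
import math

def reparameterize(alpha, mu):
    """Map (alpha, mu) to log-space coordinates (log alpha_eff, log(1 - mu)),
    where alpha_eff = alpha / (1 - mu) is the effective learning rate."""
    return math.log(alpha / (1.0 - mu)), math.log(1.0 - mu)

def unparameterize(log_alpha_eff, log_one_minus_mu):
    """Invert the map back to the raw (alpha, mu) hyperparameters."""
    one_minus_mu = math.exp(log_one_minus_mu)
    return math.exp(log_alpha_eff) * one_minus_mu, 1.0 - one_minus_mu
```

For example, $\alpha = 0.1$, $\mu = 0.9$ maps to ${\alpha}_{\text{eff}} = 1.0$; a meta-gradient step on these coordinates changes the effective step size and the momentum time scale independently.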
Figure 7 shows training curves both with online SMD and with hand-tuned fixed learning rate and momentum hyperparameters. We show several SMD runs initialized from widely varying hyperparameters; all the SMD runs behaved similarly, suggesting that SMD optimized the meta-objective efficiently enough. Under SMD, learning rates were quickly decreased to very small values, leading to slow progress in the long term, consistent with the noisy quadratic and offline adaptation experiments.
Since online SMD can be too conservative in its choice of learning rate, it is natural to ask whether removing the stochasticity from the lookahead sequence can fix the problem. We therefore considered online SMD in which the entire lookahead trajectory used a single minibatch, hence removing the stochasticity. As shown in Figure 8, this deterministic lookahead scheme led to the opposite problem: the adapted learning rates were very large, leading to instability. We conclude that the stochasticity of minibatch training cannot simply be ignored in meta-optimization.
4 Conclusion
In this paper, we analyzed the problem of short-horizon bias in meta-optimization. We presented a noisy quadratic toy problem which we analyzed mathematically, and observed that the optimal learning rate schedule differs greatly from a greedy schedule that minimizes the training loss one step ahead. While the greedy schedule tends to decay the learning rate drastically to reduce the loss along high-curvature directions, the optimal schedule keeps a high learning rate in order to make steady progress along low-curvature directions, and eventually achieves far lower loss. We showed that this bias stems from the combination of stochasticity and ill-conditioning: when the problem is either deterministic or spherical, the greedy learning rate schedule is globally optimal; however, when the problem is both stochastic and ill-conditioned (as is most neural net training), the greedy schedule performs poorly. We empirically verified the short-horizon bias in the context of neural net training by applying gradient-based meta-optimization, both offline and online. We found the same pathological behaviors as in the noisy quadratic problem: a fast learning rate decay and poor long-run performance.
While our results suggest that meta-optimization should not be applied blindly, our noisy quadratic analysis also provides grounds for optimism: by removing ill-conditioning (by using a good preconditioner) and/or stochasticity (with large batch sizes or variance reduction techniques), it may be possible to enter the regime where short-horizon meta-optimization works well. It remains to be seen whether this is achievable with existing optimization algorithms.
Acknowledgement
YW is supported by a Google PhD Fellowship. RL is supported by Connaught International Scholarships.
References
Appendix A Proofs of Theorems
The proofs are organized as follows: we prove Theorem 1 in A.1, Theorem 2 in A.2, and Theorem 3 in A.3.
A.1 Model Dynamics
Recall that stochastic gradient descent with momentum is defined as follows:
${v}^{(t+1)}$  $={\mu}^{(t)}{v}^{(t)}-{\alpha}^{(t)}(h{\theta}^{(t)}+h\sigma \xi ),\quad \xi \sim \mathcal{N}(0,1)$  
${\theta}^{(t+1)}$  $={\theta}^{(t)}+{v}^{(t+1)}={\theta}^{(t)}+{\mu}^{(t)}{v}^{(t)}-{\alpha}^{(t)}(h{\theta}^{(t)}+h\sigma \xi )$  
$=(1-{\alpha}^{(t)}h){\theta}^{(t)}+{\mu}^{(t)}{v}^{(t)}-{\alpha}^{(t)}h\sigma \xi .$ 
A.1.1 Dynamics of the expectation
We calculate the mean of the velocity ${v}^{(t+1)}$,
$\mathbb{E}\left[{v}^{(t+1)}\right]$  $=\mathbb{E}\left[{\mu}^{(t)}{v}^{(t)}-{\alpha}^{(t)}h{\theta}^{(t)}\right]$  
$={\mu}^{(t)}\mathbb{E}\left[{v}^{(t)}\right]-{\alpha}^{(t)}h\mathbb{E}\left[{\theta}^{(t)}\right].$  (10) 
We calculate the mean of the parameter ${\theta}^{(t+1)}$,
$\mathbb{E}\left[{\theta}^{(t+1)}\right]$  $=\mathbb{E}\left[{\theta}^{(t)}\right]+\mathbb{E}\left[{v}^{(t+1)}\right].$  (11) 
Let’s assume the following initial conditions:
$\mathbb{E}\left[{v}^{(0)}\right]$  $=0$  
$\mathbb{E}\left[{\theta}^{(0)}\right]$  $={E}_{0}.$ 
Then Eq. (10) and Eq. (11) describe how $\mathbb{E}\left[{\theta}^{(t)}\right]$ and $\mathbb{E}\left[{v}^{(t)}\right]$ change over time $t$.
A.1.2 Dynamics of the variance
We calculate the variance of the velocity ${v}^{(t+1)}$,
$\mathbb{V}\left[{v}^{(t+1)}\right]$  $=\mathbb{V}\left[{\mu}^{(t)}{v}^{(t)}-{\alpha}^{(t)}h{\theta}^{(t)}\right]+{({\alpha}^{(t)}h\sigma )}^{2}$  
$={({\mu}^{(t)})}^{2}\mathbb{V}\left[{v}^{(t)}\right]+{({\alpha}^{(t)}h)}^{2}\mathbb{V}\left[{\theta}^{(t)}\right]-2{\mu}^{(t)}{\alpha}^{(t)}h\cdot \mathrm{Cov}({\theta}^{(t)},{v}^{(t)})+{({\alpha}^{(t)}h\sigma )}^{2}.$  (12) 
The variance of the parameter ${\theta}^{(t+1)}$ is given by,
$\mathbb{V}\left[{\theta}^{(t+1)}\right]$  $=\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right]+2\left({\mu}^{(t)}\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})-{\alpha}^{(t)}h\mathbb{V}\left[{\theta}^{(t)}\right]\right).$  (13) 
We also need to derive how the covariance of $\theta $ and $v$ changes over time:
$\mathrm{Cov}({\theta}^{(t+1)},{v}^{(t+1)})$  $=\mathrm{Cov}({\theta}^{(t)}+{v}^{(t+1)},{v}^{(t+1)})$  
$=\mathrm{Cov}({\theta}^{(t)},{v}^{(t+1)})+\mathbb{V}\left[{v}^{(t+1)}\right]$  
$={\mu}^{(t)}\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})-{\alpha}^{(t)}h\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right].$  (14) 
Let’s assume the following initial conditions:
$\mathbb{V}\left[{v}^{(0)}\right]$  $=0$  
$\mathbb{V}\left[{\theta}^{(0)}\right]$  $={V}_{0}$  
$\mathrm{Cov}({\theta}^{(0)},{v}^{(0)})$  $=0.$ 
Combining Eqs. (12)-(14), we obtain the following dynamics (for $t=0,\mathrm{\dots},T-1$):
$\mathbb{V}\left[{v}^{(t+1)}\right]$  $={({\mu}^{(t)})}^{2}\mathbb{V}\left[{v}^{(t)}\right]+{({\alpha}^{(t)}h)}^{2}\mathbb{V}\left[{\theta}^{(t)}\right]-2{\mu}^{(t)}{\alpha}^{(t)}h\cdot \mathrm{Cov}({\theta}^{(t)},{v}^{(t)})+{({\alpha}^{(t)}h\sigma )}^{2}$  
$\mathbb{V}\left[{\theta}^{(t+1)}\right]$  $=\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right]+2\left({\mu}^{(t)}\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})-{\alpha}^{(t)}h\mathbb{V}\left[{\theta}^{(t)}\right]\right)$  
$\mathrm{Cov}({\theta}^{(t+1)},{v}^{(t+1)})$  $={\mu}^{(t)}\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})-{\alpha}^{(t)}h\mathbb{V}\left[{\theta}^{(t)}\right]+\mathbb{V}\left[{v}^{(t+1)}\right].$ 
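These recurrences can be checked against Monte Carlo rollouts of the stochastic update rule; below is a sketch (ours, with fixed, assumed hyperparameter values and $\mathbb{E}[{\theta}^{(0)}]=0$):

```python
import numpy as np

def analytic_variance(alpha, mu, h, sigma, V0, T):
    """Iterate the variance/covariance recurrences (Eqs. 12-14) for T steps
    with fixed alpha, mu, starting from V[v]=0, V[theta]=V0, Cov(theta,v)=0."""
    Vv, Vt, C = 0.0, V0, 0.0
    for _ in range(T):
        Vv = (mu**2 * Vv + (alpha * h)**2 * Vt
              - 2 * mu * alpha * h * C + (alpha * h * sigma)**2)  # Eq. (12)
        Vt_new = Vt + Vv + 2 * (mu * C - alpha * h * Vt)          # Eq. (13)
        C = mu * C - alpha * h * Vt + Vv                          # Eq. (14)
        Vt = Vt_new
    return Vt

def monte_carlo_variance(alpha, mu, h, sigma, V0, T, n=200_000, seed=0):
    """Empirical V[theta^(T)] from n sampled trajectories of the update rule."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, np.sqrt(V0), size=n)
    v = np.zeros(n)
    for _ in range(T):
        xi = rng.normal(size=n)
        v = mu * v - alpha * (h * theta + h * sigma * xi)
        theta = theta + v
    return theta.var()
```

With enough sampled trajectories, the empirical variance of ${\theta}^{(T)}$ should agree with the analytic recursion to within Monte Carlo error.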
A.2 Greedy optimality
A.2.1 Univariate case
The loss at time step $t$ is,
${\mathcal{L}}^{(t+1)}$  $={\displaystyle \frac{1}{2}}h\left(\mathbb{E}{\left[{\theta}^{(t+1)}\right]}^{2}+\mathbb{V}\left[{\theta}^{(t+1)}\right]\right)$  
$={\displaystyle \frac{1}{2}}h[{(\mathbb{E}\left[{\theta}^{(t)}\right]+{\mu}^{(t)}\mathbb{E}\left[{v}^{(t)}\right]-{\alpha}^{(t)}h\mathbb{E}\left[{\theta}^{(t)}\right])}^{2}+\mathbb{V}\left[{\theta}^{(t)}\right]+{({\mu}^{(t)})}^{2}\mathbb{V}\left[{v}^{(t)}\right]+{({\alpha}^{(t)}h)}^{2}\mathbb{V}\left[{\theta}^{(t)}\right]$  
$-2{\mu}^{(t)}{\alpha}^{(t)}h\cdot \mathrm{Cov}({\theta}^{(t)},{v}^{(t)})+{({\alpha}^{(t)}h\sigma )}^{2}+2({\mu}^{(t)}\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})-{\alpha}^{(t)}h\mathbb{V}\left[{\theta}^{(t)}\right])]$  
$={\displaystyle \frac{1}{2}}h[{((1-{\alpha}^{(t)}h)\mathbb{E}\left[{\theta}^{(t)}\right]+{\mu}^{(t)}\mathbb{E}\left[{v}^{(t)}\right])}^{2}+{(1-{\alpha}^{(t)}h)}^{2}\mathbb{V}\left[{\theta}^{(t)}\right]+{({\mu}^{(t)})}^{2}\mathbb{V}\left[{v}^{(t)}\right]$  
$+2{\mu}^{(t)}(1-{\alpha}^{(t)}h)\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})+{({\alpha}^{(t)}h\sigma )}^{2}]$  
$={\displaystyle \frac{1}{2}}h[{(1-{\alpha}^{(t)}h)}^{2}(\mathbb{E}{\left[{\theta}^{(t)}\right]}^{2}+\mathbb{V}\left[{\theta}^{(t)}\right])+{({\mu}^{(t)})}^{2}(\mathbb{E}{\left[{v}^{(t)}\right]}^{2}+\mathbb{V}\left[{v}^{(t)}\right])$  
$+2{\mu}^{(t)}(1-{\alpha}^{(t)}h)(\mathbb{E}\left[{\theta}^{(t)}\right]\mathbb{E}\left[{v}^{(t)}\right]+\mathrm{Cov}({\theta}^{(t)},{v}^{(t)}))+{({\alpha}^{(t)}h\sigma )}^{2}].$ 
For simplicity, we denote $A(\cdot )=\mathbb{E}{[\cdot ]}^{2}+\mathbb{V}[\cdot ]$, and notice that $\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]=\mathbb{E}\left[{\theta}^{(t)}\right]\mathbb{E}\left[{v}^{(t)}\right]+\mathrm{Cov}({\theta}^{(t)},{v}^{(t)})$, hence,
$${\mathcal{L}}^{(t+1)}=\frac{1}{2}h\left[{(1-{\alpha}^{(t)}h)}^{2}A({\theta}^{(t)})+{({\mu}^{(t)})}^{2}A({v}^{(t)})+2{\mu}^{(t)}(1-{\alpha}^{(t)}h)\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]+{({\alpha}^{(t)}h\sigma )}^{2}\right].$$  (15) 
In order to find the optimal learning rate and momentum for minimizing ${\mathcal{L}}^{(t+1)}$, we take the derivatives with respect to ${\alpha}^{(t)}$ and ${\mu}^{(t)}$ and set them to $0$:
${\nabla}_{{\alpha}^{(t)}}{\mathcal{L}}^{(t+1)}$  $=(1-{\alpha}^{(t)}h)A({\theta}^{(t)})\cdot (-h)-{\mu}^{(t)}h\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]+{\alpha}^{(t)}{(h\sigma )}^{2}=0$  
${\alpha}^{(t)}h(A({\theta}^{(t)})+{\sigma}^{2})$  $=A({\theta}^{(t)})+{\mu}^{(t)}\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]$  
${\nabla}_{{\mu}^{(t)}}{\mathcal{L}}^{(t+1)}$  $={\mu}^{(t)}A({v}^{(t)})+(1-{\alpha}^{(t)}h)\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]=0$  
${\mu}^{(t)*}$  $=-{\displaystyle \frac{(1-{\alpha}^{(t)}h)\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]}{A({v}^{(t)})}}$  
${\alpha}^{(t)}h(A({\theta}^{(t)})+{\sigma}^{2})$  $=A({\theta}^{(t)})-{\displaystyle \frac{(1-{\alpha}^{(t)}h)\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]}{A({v}^{(t)})}}\mathbb{E}\left[{\theta}^{(t)}{v}^{(t)}\right]$  
${\alpha}^{(t)*}$  $={\displaystyle \frac{A({\theta}^{(t)})A({v}^{(t)})-\mathbb{E}{\left[{\theta}^{(t)}{v}^{(t)}\right]}^{2}}{h\left(A({v}^{(t)})(A({\theta}^{(t)})+{\sigma}^{2})-\mathbb{E}{\left[{\theta}^{(t)}{v}^{(t)}\right]}^{2}\right)}}.$ 
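The closed forms above can be evaluated directly from the second moments; a sketch (ours, with assumed moment values):

```python
def greedy_hyperparams(A_theta, A_v, E_tv, h, sigma):
    """One-step-optimal (greedy) learning rate and momentum for the
    univariate noisy quadratic, from the closed forms derived above.
    A_theta = E[theta]^2 + V[theta]; A_v = E[v]^2 + V[v]; E_tv = E[theta*v].
    Assumes A_v > 0 so the momentum expression is well defined."""
    alpha = (A_theta * A_v - E_tv**2) / \
            (h * (A_v * (A_theta + sigma**2) - E_tv**2))
    mu = -(1.0 - alpha * h) * E_tv / A_v
    return alpha, mu
```

As a sanity check: with $\mathbb{E}[\theta v]=0$ and $\sigma =0$, this reduces to ${\alpha}^{*}=1/h$ and ${\mu}^{*}=0$, the exact one-step solve of the deterministic quadratic, while increasing $\sigma$ shrinks the greedy step size.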
A.2.2 High-dimensional case
The loss is the sum of losses along all directions:
$${\mathcal{L}}^{(t+1)}=\sum _{i}\frac{1}{2}{h}_{i}\left[{(1-{\alpha}^{(t)}{h}_{i})}^{2}A({\theta}_{i}^{(t)})+{({\mu}^{(t)})}^{2}A({v}_{i}^{(t)})+2{\mu}^{(t)}(1-{\alpha}^{(t)}{h}_{i})\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]+{({\alpha}^{(t)}{h}_{i}{\sigma}_{i})}^{2}\right]$$ 
We now obtain the optimal learning rate and momentum by setting the derivatives to $0$:
${\nabla}_{{\alpha}^{(t)}}{\mathcal{L}}^{(t+1)}={\displaystyle \sum _{i}}{h}_{i}\left[(1-{\alpha}^{(t)}{h}_{i})A({\theta}_{i}^{(t)})\cdot (-{h}_{i})-{\mu}^{(t)}{h}_{i}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]+{\alpha}^{(t)}{({h}_{i}{\sigma}_{i})}^{2}\right]=0$  
${\alpha}^{(t)}{\displaystyle \sum _{i}}\left({({h}_{i})}^{3}(A({\theta}_{i}^{(t)})+{({\sigma}_{i})}^{2})\right)={\displaystyle \sum _{i}}\left({({h}_{i})}^{2}A({\theta}_{i}^{(t)})+{\mu}^{(t)}{({h}_{i})}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right)$  
${\nabla}_{{\mu}^{(t)}}{\mathcal{L}}^{(t+1)}={\displaystyle \sum _{i}}\left[{h}_{i}{\mu}^{(t)}A({v}_{i}^{(t)})+{h}_{i}(1-{\alpha}^{(t)}{h}_{i})\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right]=0$  
${\mu}^{(t)*}=-{\displaystyle \frac{{\sum}_{i}{h}_{i}(1-{\alpha}^{(t)}{h}_{i})\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]}{{\sum}_{i}{h}_{i}A({v}_{i}^{(t)})}}$  
${\alpha}^{(t)}{\displaystyle \sum _{i}}\left(\left({({h}_{i})}^{3}(A({\theta}_{i}^{(t)})+{({\sigma}_{i})}^{2})\right)\left({\displaystyle \sum _{j}}{h}_{j}A({v}_{j}^{(t)})\right)-\left({\displaystyle \sum _{j}}{({h}_{j})}^{2}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right){({h}_{i})}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right)=$  
${\displaystyle \sum _{i}}\left({({h}_{i})}^{2}A({\theta}_{i}^{(t)})\left({\displaystyle \sum _{j}}{h}_{j}A({v}_{j}^{(t)})\right)-\left({\displaystyle \sum _{j}}{h}_{j}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right){({h}_{i})}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right)$  
${\alpha}^{(t)*}={\displaystyle \frac{{\sum}_{i}\left({({h}_{i})}^{2}A({\theta}_{i}^{(t)})\left({\sum}_{j}{h}_{j}A({v}_{j}^{(t)})\right)-\left({\sum}_{j}{h}_{j}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right){({h}_{i})}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right)}{{\sum}_{i}\left(\left({({h}_{i})}^{3}(A({\theta}_{i}^{(t)})+{({\sigma}_{i})}^{2})\right)\left({\sum}_{j}{h}_{j}A({v}_{j}^{(t)})\right)-\left({\sum}_{j}{({h}_{j})}^{2}\mathbb{E}\left[{\theta}_{j}^{(t)}{v}_{j}^{(t)}\right]\right){({h}_{i})}^{2}\mathbb{E}\left[{\theta}_{i}^{(t)}{v}_{i}^{(t)}\right]\right)}}.$ 
A.3 Univariate Optimality in SGD
We now consider a dynamic programming approach to solving the problem. We formalize the optimization problem over $\{{\alpha}^{(t)}\}$ as follows. We first denote ${\mathcal{L}}_{\mathrm{min}}$ as the minimum expected loss at the final time step $T$ (i.e., under the optimal learning rate schedule).
${\mathcal{L}}_{\mathrm{min}}=\underset{{\alpha}^{(t)},{\alpha}^{(t+1)},\mathrm{\dots},{\alpha}^{(T-1)}}{\mathrm{min}}{\mathbb{E}}_{{\xi}^{(t)},{\xi}^{(t+1)},\mathrm{\dots},{\xi}^{(T-1)}}\left[\mathcal{L}({\theta}^{(T)})\right].$ 
Recall that the loss can be expressed in terms of the expectation and variance of $\theta$. Denote ${A}^{(t)}={(\mathbb{E}\left[{\theta}^{(t)}\right])}^{2}+\mathbb{V}\left[{\theta}^{(t)}\right]$. The final loss can be expressed in terms of ${A}_{\mathrm{min}}^{(T)}$, the value obtained under the optimal learning rate schedule:
${\mathcal{L}}_{\mathrm{min}}$  $={\displaystyle \frac{1}{2}}h{A}_{\mathrm{min}}^{(T)}+{\sigma}^{2}.$ 
As in Theorem 1, we derive the dynamics for SGD without momentum:
${\theta}^{(t)}$  $=(1-{\alpha}^{(t-1)}h){\theta}^{(t-1)}-{\alpha}^{(t-1)}h\sigma {\xi}^{(t-1)}$  
$\Rightarrow (\mathbb{E}\left[{\theta}^{(t)}\right],\mathbb{V}\left[{\theta}^{(t)}\right])$  $=((1-{\alpha}^{(t-1)}h)\mathbb{E}\left[{\theta}^{(t-1)}\right],{(1-{\alpha}^{(t-1)}h)}^{2}\mathbb{V}\left[{\theta}^{(t-1)}\right]+{({\alpha}^{(t-1)}h\sigma )}^{2}).$ 
Thus, we can find a recurrence relation of the sequence ${A}^{(t)}$:
$${A}^{(t)}={(1-{\alpha}^{(t-1)}h)}^{2}\left({(\mathbb{E}\left[{\theta}^{(t-1)}\right])}^{2}+\mathbb{V}\left[{\theta}^{(t-1)}\right]\right)+{({\alpha}^{(t-1)}h\sigma )}^{2}={(1-{\alpha}^{(t-1)}h)}^{2}{A}^{(t-1)}+{({\alpha}^{(t-1)}h\sigma )}^{2}.$$ 
Since ${A}_{\mathrm{min}}^{(T)}$ is a function of ${\alpha}^{(T-1)}$, we can obtain the optimal learning rate ${\alpha}^{(T-1)*}$ by taking the derivative of ${\mathcal{L}}_{\mathrm{min}}$ w.r.t. ${\alpha}^{(T-1)}$ and setting it to zero:
$\frac{d{\mathcal{L}}_{\mathrm{min}}}{d{\alpha}^{(T-1)}}$  $={\displaystyle \frac{1}{2}}h{\displaystyle \frac{d{A}_{\mathrm{min}}^{(T)}}{d{\alpha}^{(T-1)}}}=0$  
$\Rightarrow $  ${\displaystyle \frac{d{A}_{\mathrm{min}}^{(T)}}{d{\alpha}^{(T-1)}}}=0$  
$\Rightarrow $  ${\alpha}^{(T-1)*}={\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}}{h({A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2})}}.$ 
Thus we can write ${A}_{\mathrm{min}}^{(T)}$ in terms of ${A}_{\mathrm{min}}^{(T-1)}$ and the optimal ${\alpha}^{(T-1)*}$:
${A}_{\mathrm{min}}^{(T)}$  $={\left(1-{\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}\right)}^{2}{A}_{\mathrm{min}}^{(T-1)}+{\left({\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}\sigma \right)}^{2}$  
$={\left({\displaystyle \frac{{\sigma}^{2}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}\right)}^{2}{A}_{\mathrm{min}}^{(T-1)}+{\left({\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}\sigma \right)}^{2}$  
$={\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}{\sigma}^{2}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}.$ 
Therefore,
${\mathcal{L}}_{\mathrm{min}}$  $={\displaystyle \frac{1}{2}}h{A}_{\mathrm{min}}^{(T)}+{\sigma}^{2}$  
$={\displaystyle \frac{1}{2}}h\left({\displaystyle \frac{{A}_{\mathrm{min}}^{(T-1)}{\sigma}^{2}}{{A}_{\mathrm{min}}^{(T-1)}+{\sigma}^{2}}}\right)+{\sigma}^{2}.$ 
We now generalize the above derivation. We first rewrite ${\mathcal{L}}_{\mathrm{min}}$ in terms of ${A}_{\mathrm{min}}^{(T-k)}$ and calculate the optimal learning rate at time step $T-k$.
Theorem 4.
For all $T\in \mathbb{N}$ and $k\in \mathbb{N}$ with $1\le k\le T$, we have,
$${\mathcal{L}}_{\mathrm{min}}=\frac{1}{2}h\left(\frac{{A}_{\mathrm{min}}^{(T-k)}{\sigma}^{2}}{k{A}_{\mathrm{min}}^{(T-k)}+{\sigma}^{2}}\right)+{\sigma}^{2}.$$  (16) 
Therefore, the optimal learning rate ${\alpha}^{(t)*}$ at time step $t$ is given as,
$${\alpha}^{(t)*}=\frac{{A}^{(t)}}{h({A}^{(t)}+{\sigma}^{2})}.$$  (17) 
Proof.
The form of ${\mathcal{L}}_{\mathrm{min}}$ can be proven by induction on $k$, using the identity
$$\frac{(\frac{ab}{a+b})b}{k(\frac{ab}{a+b})+b}=\frac{ab}{(k+1)a+b}.$$ 
The optimal learning rate then follows by taking the derivative of ${\mathcal{L}}_{\mathrm{min}}$ w.r.t. ${\alpha}^{(T-k-1)}$ and setting it to zero. Note that the subscript $\mathrm{min}$ is omitted from ${A}^{(t)}$ in Eq. (17) because we assume all ${A}^{(t)}$ are obtained using the optimal ${\alpha}^{*}$, and are hence minimal. ∎
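As a numerical sanity check on Theorem 4 (our sketch, with assumed values ${A}^{(0)}=2$, ${\sigma}^{2}=0.5$), iterating the one-step-optimal recurrence ${A}\leftarrow \frac{A{\sigma}^{2}}{A+{\sigma}^{2}}$ for $k$ steps should match the closed form $\frac{{A}^{(0)}{\sigma}^{2}}{k{A}^{(0)}+{\sigma}^{2}}$:

```python
def iterate_greedy_recurrence(A0, sigma2, k):
    """Apply the optimally-stepped recurrence A <- A*sigma^2 / (A + sigma^2)
    k times, i.e. the dynamics of A^(t) under the optimal alpha^(t)* of Eq. (17)."""
    A = A0
    for _ in range(k):
        A = A * sigma2 / (A + sigma2)
    return A

def closed_form(A0, sigma2, k):
    """Closed form A^(T) = A^(T-k)*sigma^2 / (k*A^(T-k) + sigma^2),
    as implied by the identity used in the induction."""
    return A0 * sigma2 / (k * A0 + sigma2)
```

The agreement for all $k$ is exactly the induction step: applying the one-step map to $\frac{ab}{ka+b}$ with $b={\sigma}^{2}$ yields $\frac{ab}{(k+1)a+b}$.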