Theoretical bounds on estimation error for meta-learning
Abstract
Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent success in few-shot learning and related problems is an encouraging sign that these models can be adapted to more realistic settings where the train and test distributions differ. Unfortunately, theoretical support for these algorithms is severely limited, and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower bounds on the minimax rates of convergence for algorithms that are trained on data from multiple sources and tested on novel data. Our bounds depend intuitively on the information shared between the sources of data, and characterize the difficulty of learning in this setting for arbitrary algorithms. We demonstrate these bounds on a hierarchical Bayesian model of meta-learning, computing both upper and lower bounds on parameter estimation via maximum a posteriori inference.
1 Introduction
Many practical machine learning applications deal with distributional shift from training to testing. One example is few-shot classification (Ravi and Larochelle, 2016; Vinyals et al., 2016), where new classes must be learned at test time from only a few examples per novel class. Recently, few-shot classification has seen increased success; however, the theoretical properties of this problem remain poorly understood.
In this paper we analyze the meta-learning setting, where the learner is given access to samples from a set of meta-training distributions, or tasks. At test time, the learner is exposed to only a small number of samples from some novel task. The meta-learner aims to uncover a useful inductive bias from the original samples, which allows it to learn a new task more efficiently. (Note that this definition encompasses few-shot learning.) While some progress has been made towards understanding the generalization performance of specific meta-learning algorithms (Amit and Meir, 2017; Khodak et al., 2019; Bullins et al., 2019; Denevi et al., 2019; Cao et al., 2019), little is known about the difficulty of the meta-learning problem in general. Existing work has studied generalization upper bounds for novel data distributions (Ben-David et al., 2010; Amit and Meir, 2017), yet to our knowledge, the inherent difficulty of these tasks relative to the i.i.d. case has not been characterized.
In this work, we derive novel bounds for meta-learners. We first present a general information-theoretic lower bound, Theorem 1, which we use to derive bounds in particular settings. Using this result, we derive lower bounds in terms of the number of training tasks, the data per training task, and the data available in a novel target task. Additionally, we provide a specialized analysis for the case where the space of learning tasks is only partially observed, proving that infinite training tasks or infinite data per training task are insufficient to achieve zero minimax risk (Corollary 2).
We then derive upper and lower bounds for a particular meta-learning setting. In recent work, Grant et al. (2018) recast the popular meta-learning algorithm MAML (Finn et al., 2017) in terms of inference in a Bayesian hierarchical model. Following this, we provide a theoretical analysis of a hierarchical Bayesian model for meta-linear regression. We compute sample complexity bounds for posterior inference under empirical Bayes (Robbins, 1956) in this model and compare them to our predicted lower bounds in the minimax framework. Furthermore, through asymptotic analysis of the error rate of the MAP estimator, we identify crucial features of the meta-learning environment that are necessary for novel-task generalization.
Our primary contributions can be summarized as follows:

We introduce novel lower bounds on the minimax risk of parameter estimation in meta-learning.

Through these bounds, we compare the relative utility of samples from meta-training tasks and the novel task, and emphasize the importance of the relationship between the tasks.

We provide novel upper bounds on the error rate for estimation in a hierarchical meta-linear regression problem, which we verify through an empirical evaluation.
2 Related work
An early version of this work (Lucas et al., 2019) presented a restricted version of Theorem 1. The current version includes significantly more content, including more general lower bounds and corresponding upper bounds in a hierarchical Bayesian model of meta-learning (Section 5).
Baxter (2000) introduced a formulation for inductive bias learning in which the learner is embedded in an environment of multiple tasks. The learner must find a hypothesis space that enables good generalization on average across tasks within the environment, using finite samples. In our setting, the learner is not explicitly tasked with finding a reduced hypothesis space, but instead learns using a general two-stage approach that matches the standard meta-learning paradigm (Vilalta and Drissi, 2002). In the first stage an inductive bias is extracted from the data, and in the second stage the learner produces an estimate using data from a novel task distribution. Further, we focus on bounding the minimax risk of meta-learners. Under minimax risk, an optimal learner achieves minimum error on the hardest learning problem in the environment. While the average-case risk of meta-learners is more commonly studied, recent work has turned attention towards the minimax setting (Kpotufe and Martinet, 2018; Hanneke and Kpotufe, 2019, 2020; Mousavi Kalan et al., 2020; Mehta et al., 2012). Worst-case error in meta-learning is particularly important in safety-critical systems, for example in medical diagnosis.
Mousavi Kalan et al. (2020) study the minimax risk of transfer learning. In their setting, the learner is provided with a large amount of data from a single source task and must generalize to a target task with a limited amount of data. They impose task relatedness through closeness in parameter space (whereas in our setting, we assume closeness in distribution via KL divergence). They prove only lower bounds, but notably generalize beyond the linear setting to single-layer neural networks.
There is a large volume of prior work studying upper bounds on generalization error in multi-task environments (Ben-David and Borbely, 2008; Ben-David et al., 2010; Pentina and Lampert, 2014; Amit and Meir, 2017; Mehta et al., 2012). While the approaches in these works vary, one common factor is the need to characterize task relatedness. Broadly, these approaches either assume a shared distribution for sampling tasks (Baxter, 2000; Pentina and Lampert, 2014; Amit and Meir, 2017) or a measure of distance between distributions (Ben-David and Borbely, 2008; Ben-David et al., 2010; Mohri and Medina, 2012). Our lower bounds utilize a weak form of task relatedness, assuming that the environment contains a finite set of tasks that are suitably separated in parameter space but close in KL divergence; this set of assumptions also arises often when computing i.i.d. minimax lower bounds (Loh, 2017).
One practical approach to meta-learning is learning a linear mapping on top of a learned feature space. Prototypical Networks (Snell et al., 2017), for instance, effectively learn a discriminative embedding function and perform linear classification on top of it using the novel-task data. Analyzing these approaches is challenging due to metric-learning-inspired objectives (which require non-i.i.d. sampling) and the simultaneous learning of feature mappings and top-level linear functions, though some progress has been made (Jin et al., 2009; Saunshi et al., 2019; Wang et al., 2019; Du et al., 2020). Maurer (2009), for example, explores linear models fitted over a shared linear feature map in a Hilbert space. Our results can be applied in these settings if a suitable packing of the representation space is defined.
Other approaches to meta-learning aim to parameterize learning algorithms themselves. Traditionally, this has been achieved through hyperparameter tuning (Rasmussen and Nickisch, 2010; MacKay et al., 2019), but recent fully parameterized optimizers also show promising performance in deep neural network optimization (Andrychowicz et al., 2016), few-shot learning (Ravi and Larochelle, 2016), unsupervised learning (Metz et al., 2019), and reinforcement learning (Duan et al., 2016). Yet another approach learns an initialization of the task-specific parameters, which are then adapted through regular gradient descent. Model-Agnostic Meta-Learning (MAML; Finn et al., 2017) augments the global parameters with a meta-initialization of the weight parameters. Grant et al. (2018) recast MAML as inference in a Bayesian hierarchical model. In Section 5, we consider learning in a hierarchical environment of linear models and provide both lower and upper bounds on the error of estimating the parameters of a novel linear regression problem.
Lower-bounding estimation error is a critical component of understanding learning problems (and algorithms). Accordingly, there is a large body of literature producing such lower bounds (Khas'minskii, 1979; Yang and Barron, 1999; Loh, 2017). We focus on producing lower bounds for parameter estimation using local packing sets, but expect that extending these results to density estimation or non-parametric estimation is feasible.
3 Novel task environment risk
Most existing theoretical work studying out-of-distribution generalization focuses on providing upper bounds on generalization performance (Ben-David et al., 2010; Pentina and Lampert, 2014; Amit and Meir, 2017). We begin by instead exploring the converse: what is the best performance we can hope to achieve on any given task in the environment? After introducing notation and minimax risks, we show how these ideas can be applied, using meta-linear regression as an example.
A full reference table for notation can be found in Appendix A; a short summary is given here. We consider algorithms that learn in an environment , with data domain and a space of distributions with support . In the typical i.i.d. setting, the algorithm is provided training data , consisting of i.i.d. samples from .
In the standard multi-task setting, we sample training data from a set of training tasks . We extend this to a meta-learning, or novel-task, setting by first drawing training data points from the first distributions, for a total of samples. We call this the meta-training set. We then draw a small sample of novel data, called a support set, , from .
Consider a symmetric loss function for non-decreasing and arbitrary metric . We seek to estimate the output of , a functional that maps distributions to a metric space . For example, may describe the coefficient vector of a high-dimensional hyperplane when is a space of linear models, and may be the Euclidean distance.
The i.i.d. minimax risk
Before studying the meta-learning setting, we first begin with a definition of the i.i.d. minimax risk, which measures the worst-case error of the best possible estimator,

$$\mathcal{R}(\theta(\mathcal{P}), n) \;=\; \inf_{\hat{\theta}} \; \sup_{P \in \mathcal{P}} \; \mathbb{E}_{D \sim P^{n}}\Big[\Phi\big(\rho(\hat{\theta}(D),\, \theta(P))\big)\Big]. \tag{1}$$
For notational convenience, we denote the output of by . The estimator for is denoted , and maps samples from to an estimate of .
Novel-task minimax risk
In the novel-task setting, we wish to estimate , the parameters of the novel task distribution . We consider two-stage estimators for . In the first stage, the meta-learner uses a learning algorithm , which maps the meta-training set to an estimation algorithm, . In the second stage, the learner computes , the estimate of .
The novel-task minimax risk is given by,

$$\mathcal{R}_{NT} \;=\; \inf_{\mathcal{A}} \; \sup_{P_1, \dots, P_{t+1} \in \mathcal{P}} \; \mathbb{E}\Big[\Phi\big(\rho(\hat{\theta}_{t+1},\, \theta(P_{t+1}))\big)\Big], \tag{2}$$

where the supremum is restricted to collections of tasks with bounded pairwise KL divergence. This constraint ensures a degree of relatedness between the novel and meta-training tasks.
The estimator for now depends additionally on the samples in , where only samples from are available to the learner. Thus, addresses the domain shift expected at test time in the meta-learning setting and allows the learner to use data from multiple tasks. The goal of is to learn an inductive bias from such that a good estimate is possible with only data points from . In this setting, is equivalent to the number of shots in few-shot learning.
An example with meta-linear regression
We present here a short summary based on meta-linear regression, which we will analyze in more detail in Section 5.
In Figure 1, we show observed data samples from a family of polynomial regression models. Our aim is to output an algorithm which recovers the parameters of a new polynomial function from limited observations; we choose a MAP estimator, which is described fully in Section 5. In the bottom right, we are given only 5 data points from a novel task distribution and estimate the parameters of the model with both the MLE and MAP estimators: the MLE overfits the support set, while the MAP estimator is close to the true function.
In terms of the terminology used above, the set,
is the space of polynomial regression models, parameterized by . For this problem, we take . In Figure 1, tasks are generated with , for unknown, sparse, . Thus, each model is a polynomial function with few large coefficients. The algorithm first takes samples from and computes an estimate, . This estimate of is then used to compute . Note that this approach is able to learn the correct inductive bias from the data, without requiring a carefully designed regularizer. The lower bounds we derive in Section 4 can be applied to problems of this general type, and the upper and lower bounds in Section 5 apply specifically to meta-learning for linear regression.
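The two-stage procedure described above can be sketched numerically. The following is a minimal illustration, not the paper's exact estimator: all dimensions, noise levels, and the sparse mean vector are hypothetical, the shared prior mean is estimated as an average of per-task least-squares fits, and the MAP estimate shrinks the few-shot solution toward that mean.

```python
import numpy as np

rng = np.random.default_rng(0)
deg, t, n, m = 9, 20, 40, 5      # polynomial degree, meta-training tasks, samples per task, shots
sigma, tau = 0.5, 1.0            # observation noise and task-parameter spread (assumed values)

def features(x):
    # Vandermonde features of a degree-`deg` polynomial on [-1, 1].
    return np.vander(x, deg + 1, increasing=True)

theta_bar = np.zeros(deg + 1)
theta_bar[[0, 3]] = 2.0          # sparse shared mean: few large coefficients

def sample_task(n_pts):
    # Draw task parameters around theta_bar, then noisy observations.
    theta = theta_bar + tau * rng.standard_normal(deg + 1)
    x = rng.uniform(-1.0, 1.0, n_pts)
    y = features(x) @ theta + sigma * rng.standard_normal(n_pts)
    return features(x), y, theta

# Stage 1: estimate the shared prior mean from per-task least-squares fits.
mu_hat = np.mean(
    [np.linalg.lstsq(*sample_task(n)[:2], rcond=None)[0] for _ in range(t)], axis=0
)

# Stage 2: on novel tasks with only m points, compare the MLE with a MAP
# estimate that shrinks toward mu_hat with strength lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
map_errs, mle_errs = [], []
for _ in range(200):
    Xs, ys, theta_star = sample_task(m)
    theta_map = np.linalg.solve(
        Xs.T @ Xs + lam * np.eye(deg + 1), Xs.T @ ys + lam * mu_hat
    )
    theta_mle = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # minimum-norm solution
    map_errs.append(np.linalg.norm(theta_map - theta_star))
    mle_errs.append(np.linalg.norm(theta_mle - theta_star))

map_err, mle_err = np.mean(map_errs), np.mean(mle_errs)
print(f"mean MAP error {map_err:.2f} vs mean MLE error {mle_err:.2f}")
```

On average the shrunk estimator tracks the true coefficients far better than the underdetermined MLE, mirroring the Figure 1 comparison.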
4 Information-theoretic lower bounds on novel task generalization
In this section, we first present our most general result, Theorem 1. Using it, we derive Corollary 1, which gives a lower bound in terms of the sample sizes in the training and novel tasks. Corollary 1 recovers a well-known i.i.d. lower bound (Theorem 2) when , and, importantly, highlights that novel-task data is significantly more valuable than training-task data. Additionally, we provide a specialized bound that applies when the environment is partially observed, proving that in this setting training-task data is insufficient to drive the minimax risk to zero.
In Theorem 1, we assume that contains distinct separated distributions but only tasks are visible to the learner. Intuitively, the error-rate lower bound shrinks as the amount of information shared between the training tasks and the novel task grows. All proofs are given in Appendix B.1. Recall for non-decreasing and arbitrary metric .
Theorem 1 (Minimax novel-task risk lower bound).
Let contain distinct distributions such that and for all . Let be a random ordering of the elements, and be a vector of i.i.d. samples from . Further, define to be an matrix whose columns consist of i.i.d. samples from . Then,
Note that is a property of the so-called packing set, , and may depend on the sample size, , and other properties of . For example, practical instances of this bound typically require or similar, as in Theorem 3 below. To derive this result, we lower-bound the statistical estimation error by the error of a corresponding decoding problem, in which we must predict the novel-task index given the meta-training set and . Fano's inequality provides best-case error rates for this problem.
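For reference, the form of Fano's inequality underlying this reduction can be stated as follows. This is the standard statement (see e.g. Cover and Thomas) written in generic notation rather than the paper's own symbols: $V$ is the index of the true task, drawn uniformly from a packing of $N$ elements, $D$ is everything the learner observes, and $\hat{V}$ is any decoder computed from $D$.

```latex
% Fano's inequality for the decoding problem: any estimator \hat{V}(D)
% of the uniformly distributed packing index V satisfies
\mathbb{P}\big(\hat{V} \neq V\big)
  \;\geq\; 1 - \frac{I(V; D) + \log 2}{\log N},
% so a small mutual information I(V; D) between the packing index and
% the observed data forces a large decoding (and hence estimation) error.
```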
Using Theorem 1, we derive our first bound on the novel-task minimax risk, which depends on the number of meta-training tasks () and the number of data points per training task (, ), via a local-packing argument. The following corollary implies that if the meta-training tasks in our packing are close in pairwise KL divergence, then learning a novel task from samples drawn from the meta-training tasks requires significantly more examples; in particular, learning the novel task from samples drawn from the meta-training set requires times the sample complexity of learning from the novel task directly. This matches our intuition that learning the novel task implies the ability to distinguish it from all well-separated meta-training tasks.
Corollary 1.
Assume the same setting as in Theorem 1. Then,
A tighter bound on partially observed environments
We now consider the special case of Theorem 1 when , meaning that the meta-training tasks cannot cover the full packing set. In this setting, we prove that no algorithm can generalize perfectly to tasks in unseen regions of the space with small , regardless of the number of data points observed in each meta-training task.
Corollary 2.
Assume the same setting as in Theorem 1, with . Then,
In this work, we have focused on the setting where contains an equal number of samples from each of the meta-training tasks; this is the sampling scheme shown in Figure 2. However, it is possible to extend these results to different sampling schemes for . For example, in the appendix we derive bounds with as a mixture distribution. Surprisingly, despite task identity being hidden from the learner, the asymptotic rates for these two sampling schemes match.
4.1 Measuring task relatedness
The use of local packing requires the design of an appropriate set of distributions whose corresponding parameters are separated but whose pairwise KL divergences remain small. In the multi-task setting such an assumption is intuitively reasonable: challenging tasks should require separated parameters for their ideal explanations (separation) while still satisfying some relatedness measure (small KL divergence). Importantly, these parameters can depend on the sample size and other problem-specific variables. As we will see shortly, lower bounds on minimax risk in the i.i.d. setting may also assume the same notion of relatedness for the local packing in .
Task relatedness is a necessary ingredient for upper bounds on novel-task generalization, but is typically difficult to define (see e.g. Ben-David and Borbely (2008)). Our lower bounds utilize a relatively weak notion of task relatedness, and thus may be overly pessimistic compared to the upper bounds computed in existing work. However, task relatedness of this form can be formulated in a representation space shared across tasks, and thus can be applied in settings like those explored by e.g. Du et al. (2020). Deriving lower bounds under the different task-relatedness assumptions present in the literature would make for exciting future work.
4.2 Comparison to the risk of i.i.d. learners
From the statement of Theorem 1 it is not clear how this lower bound compares to that of an i.i.d. learner with access only to the samples from . To investigate the benefit of additional meta-training tasks, we compare our derived minimax risk lower bounds to those achieved by i.i.d. learners. To do so, we revisit standard minimax lower bounds that can be found in e.g. Loh (2017).
Theorem 2 (i.i.d. minimax lower bound).
Suppose satisfy for all . Then,
We include a proof of this result in Appendix B.1, using a local packing as in our meta-learning bounds. As hoped, Corollary 1 recovers Theorem 2 when no training tasks are available. Moreover, this i.i.d. bound is in general strictly larger than the one computed in Corollary 1. Note that while this i.i.d. minimax risk is asymptotically tight for several learning problems (Loh, 2017; Raskutti et al., 2011), there is no immediate guarantee that the same holds for our meta-learning minimax bounds. We investigate the quality of these bounds by deriving comparable upper bounds in the next section.
5 Analysis of a hierarchical Bayesian model of meta-learning
Our goal is to analyze the sample complexity of meta-learning for linear regression, where samples are drawn from multiple meta-training tasks and we want to generalize to a new task with only a few data points. After introducing the setting, we compute lower bounds on the minimax risk using our results from Section 4, revealing a scaling of the meta-training sample complexity. Following the lower bound, we derive an accompanying upper bound on the risk of a MAP estimator, obtained from an empirical Bayes estimate in a hierarchical Bayesian model. Asymptotic analysis of this bound reveals that if the observed samples from the novel task vary considerably more than the task parameters, then observing more meta-training samples may significantly improve convergence in the small-sample regime. This is validated empirically in Section 6.
For , where is the total number of tasks, we define,
Each task has some design matrix and unknown parameters . For simplicity, we assume known isotropic noise models and that for all , with .
Our meta-learner will fit the data using an empirical Bayes estimate in a hierarchical Bayesian model:
We will consider the maximum a posteriori (MAP) estimator,
and will characterize its risk, , where the expectation is with respect to the sampled data only. The posterior distribution under the empirical Bayes estimate for is given in Appendix C.2. The derivation is standard but dense; we recommend that dedicated readers consult Gelman et al. (2013), or an equivalent text, for more details.
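For intuition, the posterior mean in a conjugate Gaussian linear model (which coincides with the MAP estimate) has a simple closed form. The sketch below is generic, not the paper's exact derivation: the helper `map_estimate`, the prior `N(mu, tau2 I)`, and all dimensions are illustrative assumptions. Two sanity checks follow from the formula: a near-flat prior recovers the MLE, and a near-degenerate prior collapses onto the prior mean.

```python
import numpy as np

def map_estimate(X, y, mu, tau2, sigma2):
    """Posterior mean (= MAP, by Gaussian conjugacy) of theta in the model
    y = X @ theta + eps, eps ~ N(0, sigma2 I), theta ~ N(mu, tau2 I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    return np.linalg.solve(precision, X.T @ y / sigma2 + mu / tau2)

rng = np.random.default_rng(1)
d, n_pts = 3, 50
theta = rng.standard_normal(d)
X = rng.standard_normal((n_pts, d))
y = X @ theta + 0.1 * rng.standard_normal(n_pts)

mle = np.linalg.lstsq(X, y, rcond=None)[0]
# Near-flat prior: the data term dominates and the MAP approaches the MLE.
flat = map_estimate(X, y, np.zeros(d), tau2=1e8, sigma2=0.01)
# Near-degenerate prior: the prior dominates and the MAP collapses to mu = 0.
tight = map_estimate(X, y, np.zeros(d), tau2=1e-10, sigma2=0.01)

print(np.allclose(flat, mle, atol=1e-4), np.allclose(tight, 0.0, atol=1e-4))
```

In the empirical Bayes procedure analyzed here, `mu` and `tau2` are not known but estimated from the meta-training tasks before this formula is applied to the novel task.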
5.1 Minimax lower bounds
We now compute lower bounds for parameter estimation in meta-learning over multiple linear regression tasks. We begin with a definition of the space of data-generating distributions,
where are the parameters to be learned, and is the design matrix of each linear regression task in the environment. We write , which we assume is bounded for all and (an assumption validated for random Gaussian matrices by Raskutti et al. (2011)).
Theorem 3 (Meta-linear regression lower bound).
Consider defined as above and let . If and , then,
The proof is given in Appendix B.5. We see that the size of the meta-training set has an inverse exponential scaling in the dimension, . This reflects the complexity of the space growing exponentially with dimension and the need for a matching growth in data size to cover the environment sufficiently.
5.2 Minimax upper bounds
To compute upper bounds on the estimation error, we require an additional assumption: the design matrices also have bounded minimum singular values, (see Raskutti et al. (2011) for justification). For the upper bounds, we allow the bounds on the singular values of the design matrices and the observation noise in the novel task to differ from those in the meta-training tasks. We note that we can still recover the setting assumed in the lower bounds, where all tasks match on these parameters, as a special case.
The learner observes data points from each linear regression model in . We then bound the error of estimating , for which samples are available.
The expected error rate of the MAP estimator decomposes into the posterior variance and the squared bias. A detailed derivation of these results is provided in the appendix. The bound depends on the dimensionality , the observation noise in each task , the number of tasks , the number of data points in each meta-training task , and the number of data points in the novel task .
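The decomposition referred to above is the standard bias-variance split of the expected squared error, written here in generic notation (the paper's own symbols for the novel-task parameters are not reproduced):

```latex
% Expected squared error of the MAP estimator for the novel-task
% parameter theta^*, split into variance and squared-bias terms:
\mathbb{E}\,\bigl\|\hat{\theta}_{\mathrm{MAP}} - \theta^{*}\bigr\|_2^2
  \;=\; \underbrace{\operatorname{tr}\operatorname{Cov}\!\bigl(\hat{\theta}_{\mathrm{MAP}}\bigr)}_{\text{variance}}
  \;+\; \underbrace{\bigl\|\mathbb{E}\,\hat{\theta}_{\mathrm{MAP}} - \theta^{*}\bigr\|_2^2}_{\text{squared bias}}
```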
Theorem 4 (Meta-linear regression upper bound).
Let be the maximum a posteriori estimator, . Then,
where,
Expectations are taken over the data conditioned on . Additional terms not depending on are defined in Appendix C.2.
While the bounds presented in Theorem 4 are relatively complicated, we can probe the asymptotic convergence of the MAP estimator to the true task parameters, . In the following section, we discuss some consequences of this result and its implications for our lower bounds.
5.3 Asymptotic behavior of the MAP estimator
We first notice that when is small, the risk cannot be reduced to zero by adding more meta-training data. Recent work has suggested such a relationship may be inevitable (Hanneke and Kpotufe, 2020). Our lower bound presented in Corollary 2 agrees that more samples from a small number of meta-training tasks will not reduce the error to zero. However, unlike our lower bounds based on local packing, the upper bounds presented in this section suggest that if the meta-training tasks cover the space sufficiently, then an optimal algorithm might hope to drive the error to zero with enough samples. We hypothesize that this gap is due to limitations of the standard proof techniques we utilize for the lower bounds when the number of tasks grows, and expect a sharper bound may be possible.
To emulate the few-shot learning setting, where is relatively small, we consider , with and fixed. In this case, the risk is bounded as,
where is the ratio of the observation noise to the variance in sampling , and is the condition number of the design matrices. This leads to a key takeaway: if the observed samples from vary considerably more than the parameters , then observing more samples in will significantly improve convergence towards the true parameters in the small-sample regime. Further, adding more tasks (increasing ) also improves these constant factors by removing the dependence on the condition number, .
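This noise-to-prior-variance effect is easy to reproduce in simulation. The following sketch uses hypothetical values throughout and idealizes the setting by giving the learner the true prior mean (zero) and the true noise ratio; it compares the average risk of the plain least-squares estimator against the shrinkage (MAP-style) estimator in two regimes.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_pts, trials = 5, 12, 2000

def avg_risk(sigma, tau, shrink):
    """Average squared error of estimating theta ~ N(0, tau^2 I) from
    n_pts noisy linear observations, with optional shrinkage to the prior mean."""
    errs = []
    for _ in range(trials):
        theta = tau * rng.standard_normal(d)
        X = rng.standard_normal((n_pts, d))
        y = X @ theta + sigma * rng.standard_normal(n_pts)
        lam = sigma**2 / tau**2 if shrink else 0.0  # lam = 0 gives the MLE
        est = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        errs.append(np.sum((est - theta) ** 2))
    return np.mean(errs)

# Noisy observations, tightly clustered task parameters (sigma >> tau):
# shrinking toward the prior mean reduces risk dramatically.
gain_noisy = avg_risk(2.0, 0.2, shrink=False) / avg_risk(2.0, 0.2, shrink=True)
# Clean observations, widely spread task parameters (sigma << tau):
# shrinkage barely changes the estimator.
gain_clean = avg_risk(0.2, 2.0, shrink=False) / avg_risk(0.2, 2.0, shrink=True)
print(f"risk ratio (MLE/MAP): noisy {gain_noisy:.1f}x, clean {gain_clean:.2f}x")
```

When the observation noise dominates the spread of task parameters, the prior learned from meta-training carries most of the useful information; when the tasks themselves are highly variable, the support-set data must do the work.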
6 Empirical investigations
In this section, we provide additional quantitative exploration of the upper bound studied in Section 5. The aim is to take steps towards relating the bounds to experimental results; we know of little theoretical work in meta-learning that attempts to relate its results to practical empirical datasets. Full details of the experiments in this section can be found in Appendix D.
6.1 Hierarchical Bayes polynomial regression
We first focus on the setting of polynomial regression over inputs in the range . Some examples of these functions and samples are presented in Figure 1, alongside the MAP and MLE estimates for the novel task.
Figure 2 shows the analytical expected error rate (risk) under various environment settings. We observe that even in this simple hierarchical model, the estimator exhibits complex behavior that is correctly predicted by Theorem 4. In Figure 2A, we varied the novel-task difficulty by increasing the novel-task observation noise (). We plot three curves for three different dataset-size configurations. When the novel task is much noisier than the source tasks, it is greatly beneficial to add more meta-training data (blue vs. red). And while larger made little difference when the novel task was relatively difficult (blue vs. green), the expected loss was orders of magnitude lower when the novel task became easier. In Figure 2B, we fixed the relative task difficulty and instead varied and . The x-axis now indicates the total data available to the learner. We observed that adding more tasks has a large effect in the low-data regime but, as predicted, the error has a non-zero asymptotic lower bound; eventually it is more beneficial to add more novel-task data samples.
These empirical simulations verify that our theoretical analysis is predictive of the behavior of this meta-learning algorithm, as each of these observations can be inferred from Theorem 4. While this model is simple, it captures key features of, and provides insight into, the more general meta-learning problem. We also explored a nonlinear sinusoid meta-regression problem (Finn et al., 2017), finding that our theory is largely predictive of the general trends in this setting too.
6.2 Sinusoid regression with MAML
Following the connections between MAML and hierarchical Bayes explored by Grant et al. (2018), we also explored regression on sinusoids using MAML. Our aim was to investigate how predictive our linear theory is for this highly nonlinear problem setting. As in Finn et al. (2017), we sample sinusoid functions by placing a prior over the amplitude and phase. In other work (Finn et al., 2017; Grant et al., 2018), the same prior is used for the training and testing stages; however, to better measure generalization to novel tasks, we use different prior distributions when training versus evaluating the model.
We display the risk averaged over 30 trials in Figure 3. We varied the novel-task difficulty by increasing the observation noise in the novel task. We plot separate curves for different dataset-size configurations, and observe that the empirical results align fairly well with the results derived by sampling the hierarchical model (Figure 2A). Adding more meta-training data (increasing ) is beneficial (green vs. yellow), and adding more test data points (higher ) is also beneficial (red vs. green). Here, however, these relationships did not interact with the task difficulty: the gains from increased meta-training and meta-testing data were consistent, until the task noise prevented any configuration of the model from performing the task.
7 Conclusion
Meta-learning algorithms extract an inductive bias from source tasks, making models more adaptive to unseen novel distributions. In this paper, we take initial steps towards characterizing the difficulty of meta-learning and understanding how these limitations present in practice. We have derived both lower and upper bounds on the error of meta-learners, which are particularly relevant in the few-shot learning setting where is small. Our bounds capture key features of the meta-learning problem, such as the effect of increasing the number of shots or training tasks. We have also identified a gap between our lower and upper bounds when there is a large number of training tasks, which we hypothesize is a limitation of the proof technique used to derive the lower bounds, suggesting an exciting direction for future research.
8 Acknowledgements
This work benefited greatly from the input of many other researchers. In particular, we extend our thanks to Shai BenDavid, Karolina Dziugaite, Samory Kpotufe, and Daniel Roy for discussions and feedback on the results presented in this work. We thank Ahmad Beirami and anonymous reviewers for their valuable feedback that led to significant improvements to this paper. We also thank Elliot Creager, Will Grathwohl, Mufan Li, and many of our other colleagues at the Vector Institute for feedback that greatly improved the presentation of this work. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).
References
 Metalearning by adjusting priors based on extended PACbayes theory. arXiv preprint arXiv:1711.01244. Cited by: §1, §2, §3.
 Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, pp. 3981–3989. Cited by: §2.
 A model of inductive bias learning. Journal of artificial intelligence research 12, pp. 149–198. Cited by: §2, §2.
 A theory of learning from different domains. Machine learning 79 (12), pp. 151–175. Cited by: §1, §2, §3.
 A notion of task relatedness yielding provable multipletask learning guarantees. Machine learning 73 (3), pp. 273–287. Cited by: §2, §4.1.
 Generalize across tasks: efficient algorithms for linear representation learning. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, A. Garivier and S. Kale (Eds.), Proceedings of Machine Learning Research, Vol. 98, Chicago, Illinois, pp. 235–246. External Links: Link Cited by: §1.
 A theoretical analysis of the number of shots in fewshot learning. arXiv preprint arXiv:1909.11722. Cited by: §1.
 Elements of information theory. John Wiley & Sons. Cited by: Lemma 1.
 Learningtolearn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1566–1575. Cited by: §1.
 Fewshot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434. Cited by: §2, §4.1.
Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. CoRR abs/1611.02779. Cited by: §2.
R. M. Fano (1961) Transmission of information: a statistical theory of communication. MIT Press. Cited by: Lemma 1.
C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. Cited by: §D.2, §1, §2, §6.1, §6.2.
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013) Bayesian data analysis. Chapman and Hall/CRC. Cited by: §5.
E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths (2018) Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930. Cited by: §1, §2, §6.2.
S. Hanneke and S. Kpotufe (2019) On the value of target data in transfer learning. In Advances in Neural Information Processing Systems 32, pp. 9871–9881. Cited by: §2.
S. Hanneke and S. Kpotufe (2020) A no-free-lunch theorem for multitask learning. arXiv preprint arXiv:2006.15785. Cited by: §2, §5.3.
R. Jin, S. Wang, and Y. Zhou (2009) Regularized distance metric learning: theory and algorithm. In Advances in Neural Information Processing Systems 22, pp. 862–870. Cited by: §2.
R. Z. Khas’minskii (1979) A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory of Probability & Its Applications 23 (4), pp. 794–798. Cited by: §2, Lemma 2.
M. Khodak, M. Balcan, and A. Talwalkar (2019) Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644. Cited by: §1.
S. Kpotufe and G. Martinet (2018) Marginal singularity, and the benefits of labels in covariate-shift. In Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet (Eds.), Proceedings of Machine Learning Research, Vol. 75, pp. 1882–1886. Cited by: §2.
P. Loh (2017) On lower bounds for statistical learning theory. Entropy 19 (11), pp. 617. Cited by: §2, §2, §4.2, §4.2, Lemma 3.
Information-theoretic limitations on novel task generalization. NeurIPS 2019 Workshop on Machine Learning with Guarantees. Cited by: §2.
M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. In 7th International Conference on Learning Representations. Cited by: §2.
A. Maurer (2009) Transfer bounds for linear feature learning. Machine Learning 75 (3), pp. 327–350. Cited by: §2.
N. Mehta, D. Lee, and A. G. Gray (2012) Minimax multi-task learning and a generalized loss-compositional paradigm for MTL. In Advances in Neural Information Processing Systems 25, pp. 2150–2158. Cited by: §2, §2.
L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein (2019) Meta-learning update rules for unsupervised representation learning. In 7th International Conference on Learning Representations. Cited by: §2.
M. Mohri and A. Muñoz Medina (2012) New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pp. 124–138. Cited by: §2.
M. Mousavi Kalan, Z. Fabian, S. Avestimehr, and M. Soltanolkotabi (2020) Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1959–1969. Cited by: §2, §2.
A. Pentina and C. Lampert (2014) A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999. Cited by: §2, §3.
G. Raskutti, M. J. Wainwright, and B. Yu (2011) Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory 57 (10), pp. 6976–6994. Cited by: §4.2, §5.1, §5.2.
C. E. Rasmussen and H. Nickisch (2010) Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research 11, pp. 3011–3015. Cited by: §2.
S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. In International Conference on Learning Representations. Cited by: §1, §2.
H. Robbins (1956) An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, Calif., pp. 157–163. Cited by: §1.
N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar (2019) A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pp. 5628–5637. Cited by: §2.
J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §2.
R. Vilalta and Y. Drissi (2002) A perspective view and survey of meta-learning. Artificial Intelligence Review 18 (2), pp. 77–95. Cited by: §2.
O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §1.
J. von Neumann (1937) Some matrix-inequalities and metrization of matric space. Cited by: Lemma 8.
Multitask metric learning: theory and algorithm. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 3362–3371. Cited by: §2.
Y. Yang and A. Barron (1999) Information-theoretic determination of minimax rates of convergence. Annals of Statistics 27, pp. 1564–1599. Cited by: §2.
Appendix A Notation
Description  

The domain of the data, e.g.  
The range of the data, e.g.  
The product space  
A collection of distributions over  
A (finite) subset of distributions in  
An element of  
The product distribution, whose samples correspond to independent draws from  
The product distribution, , for  
The mixture distribution, , for  
A metric space, containing parameters for each distribution  
A functional, mapping distributions in to parameters in  
An estimator  
The mutual information between random variables and .  
Denotes training datasets drawn i.i.d. from some . Typically ,  
Denotes a meta-training set drawn i.i.d. from  
, for  Indicates the set 
The norm ball of radius , centered at . 
Appendix B Lower Bound Proofs
We will make use of several standard results, which we present below.
Lemma 1.
Lemma 2.
Mutual information equality (Khas’minskii, 1979). Consider random variables , then,
Lemma 3.
Local packing lemma (Loh, 2017). Consider distributions . Let be a random variable distributed uniformly on and let be a vector of i.i.d. samples from . Then,
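The single-sample case of this local packing bound can be sanity-checked numerically: the mutual information between the uniform index and one observation is dominated by the average pairwise KL divergence. The sketch below uses small discrete distributions of our own choosing (purely illustrative, not the packing used in the proofs).

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def mutual_information(priors, conditionals):
    """I(V; Z) for V ~ priors and Z | V = j ~ conditionals[j]."""
    marginal = priors @ conditionals  # P(Z = z)
    return sum(pi * kl(pj, marginal) for pi, pj in zip(priors, conditionals))

# Three nearby discrete distributions, playing the role of a local packing.
P = np.array([
    [0.30, 0.30, 0.40],
    [0.32, 0.28, 0.40],
    [0.28, 0.32, 0.40],
])
M = len(P)
uniform = np.full(M, 1.0 / M)

mi = mutual_information(uniform, P)
avg_pairwise_kl = sum(kl(P[j], P[k]) for j in range(M) for k in range(M)) / M**2

# Single-sample case of the lemma: I(V; Z) <= average pairwise KL divergence.
assert mi <= avg_pairwise_kl + 1e-12
```

For n i.i.d. samples the right-hand side simply picks up a factor of n, since the KL divergence tensorizes over product distributions.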
We will require a novel local packing bound for the novel-task risk, which we present in Lemma 5.
B.1 IID Lower Bound
We first prove the i.i.d. result, which will serve as a guide for our novel lower bounds.
See 2
Proof.
First, notice that,
Now define the decision rule,
with ties broken arbitrarily. We proceed by bounding the expected loss. First, using Markov’s inequality,
Next, consider the case . Through the triangle inequality,
Thus, the probability that the distance is at least is at least as large as the probability that the decision rule is incorrect.
Now, using Fano’s inequality with , and (and the corresponding Markov chain ), we have,
Combining the above inequalities with the Local Packing Lemma gives the final result. ∎
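The testing reduction at the heart of this proof can be illustrated with a minimal sketch: build a 2δ-separated packing, decode any estimate to its nearest packing element, and observe that an estimate within δ of the truth must decode correctly by the triangle inequality. The packing points and noise level below are illustrative choices, not those of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2*delta-separated packing of the parameter space (here, points on a line).
delta = 0.5
packing = np.array([0.0, 2 * delta, 4 * delta, 6 * delta])

def decode(estimate):
    """Decision rule from the proof: map an estimate to its nearest packing element."""
    return int(np.argmin(np.abs(packing - estimate)))

# Triangle-inequality step: if the estimation error is strictly below delta,
# every other packing element is at distance > delta, so decoding is correct.
for true_idx, theta in enumerate(packing):
    noisy_estimate = theta + rng.uniform(-0.99 * delta, 0.99 * delta)
    assert decode(noisy_estimate) == true_idx
```

Fano's inequality then lower-bounds the probability that this decision rule errs, which in turn lower-bounds the probability of estimation error at least δ.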
B.2 Proof of Theorem 1
See 1
Proof.
As in the i.i.d case, we first bound the supremum from below with an average,
where the inner sum is over all length orderings, with .
As before, we consider the following estimator,
Using Markov’s inequality, and then following the proof of Theorem 2, we have,
with the use of Fano’s inequality, we arrive at,
Conditioned on , the elements of and are independent but not identically distributed. Thus, with the application of Lemma 2,
The result follows by combining these inequalities. ∎
Remark
In the above proof of Theorem 1, we did not need to make use of the form of the distribution of , only that the correct graph structure was observed. This grants us some flexibility, which we utilize later in Section B.4 to prove lower bounds for mixture distributions.
We now proceed with proofs of the corollaries of Section 4.
B.3 Local packing results
Lemma 4 (Meta-learning local packing).
Consider the same setting as in Theorem 1, then
Proof.
There are orderings on the first indices, given the . We introduce the following notation,
As in previous proofs, we notice that we can write,
First, note that we can upper bound , where denotes a single row in . Further,
We will use convexity of the KL divergence to upper bound this quantity. Each distribution is an average over a random selection of index orderings.
When applying convexity, all pairs of selections that match exactly lead to a KL divergence of zero, and each component of contains the same number of such pairs. Thus, we care only about selections that contain either or , so that exactly matching pairs of distributions is not possible. Further, we need only consider pairs of product distributions that differ in a single, identical position.
Each of the above described pairs of distributions has KL divergence equal to . We conclude by counting the total number of orderings producing such pairs. First, there are choices for the index of and . Then, there are total orderings of the remaining elements. Thus, we have,
∎
These results together provide an immediate proof of Corollary 1.
See 1
Proof.
Combining the results of Theorem 1, Lemma 3, and Lemma 4 with the fact that , the result follows immediately. ∎
See 2
Proof.
This result follows as an application of the data processing inequality. Notice that forms a Markov chain. Thus,
by the data processing inequality. We can compute in closed form:
The proof is completed by plugging in the i.i.d local packing bound alongside the above. ∎
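The data processing inequality invoked here can be checked numerically on a small discrete Markov chain. The transition kernels below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information between the row and column variables of a joint pmf."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Markov chain V -> X -> Y with explicit transition kernels.
p_v = np.array([0.5, 0.5])
p_x_given_v = np.array([[0.9, 0.1],   # row j: P(X | V = j)
                        [0.2, 0.8]])
p_y_given_x = np.array([[0.7, 0.3],   # row i: P(Y | X = i)
                        [0.4, 0.6]])

joint_vx = p_v[:, None] * p_x_given_v   # P(V, X)
joint_vy = joint_vx @ p_y_given_x       # P(V, Y); valid since V -> X -> Y

# Data processing inequality: I(V; Y) <= I(V; X).
assert mutual_information(joint_vy) <= mutual_information(joint_vx) + 1e-12
```

Intuitively, Y is a (stochastic) function of X alone, so it cannot carry more information about V than X does.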
B.4 Bounds using mixture distributions
In this section we introduce tools to lower bound the minimax risk when the meta-training set is sampled from a mixture over the meta-training tasks, . We note first that Theorem 1 can be reproduced exactly when . Thus, we need only provide a local packing bound for the mixture distribution. In Lemma 5 we provide such a lower bound for the special case where , so that data is sampled from a mixture over the entire environment.
Lemma 5 (Leave-one-task-out mixture local packing).
Let contain distinct distributions such that for all and let . Let be a random ordering of the elements, and define to be a vector of i.i.d. samples from . Then,
Proof.
From Lemma 3 (and some simple arithmetic) we have,
Note that by the definition of the mixture distribution,
Using the convexity of the KL divergence,
For the last step, note that the KL divergence is zero if and only if the distributions agree almost everywhere. ∎
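The convexity of the KL divergence used in this proof (in fact, joint convexity in both arguments) is easy to verify numerically for discrete distributions. The distributions below are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.5, 0.3])
q1 = np.array([0.5, 0.25, 0.25])
q2 = np.array([0.3, 0.4, 0.3])

for lam in np.linspace(0.0, 1.0, 11):
    mix_p = lam * p1 + (1 - lam) * p2
    mix_q = lam * q1 + (1 - lam) * q2
    # Joint convexity: KL(mix_p || mix_q) <= lam*KL(p1||q1) + (1-lam)*KL(p2||q2).
    assert kl(mix_p, mix_q) <= lam * kl(p1, q1) + (1 - lam) * kl(p2, q2) + 1e-12
```

In the lemma this is applied with the mixture weights of the meta-training distribution, so that the KL between mixtures is bounded by an average of pairwise KL terms, most of which vanish by exact matching.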
B.5 Proof of hierarchical linear model lower bound
Recall that the space of distributions we consider is given by,
See 3
Proof.
The proof consists of two steps: we first construct a packing of . Then, we upper bound the KL divergence between two distributions in this packing and use Corollary 1 to give the desired bound.
The maximal packing number for the unit 2norm ball can be bounded by the following,
We use a common scaling trick. First, through this bound, we can build a packing set, , with packing radius , giving . We define a new packing set of the same cardinality by taking for all (requiring ). This gives, for all ,
and similarly, .
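The scaling trick can be illustrated numerically: rescaling every element of a 2ε-packing of the unit ball by δ yields a 2εδ-packing of the radius-δ ball with the same cardinality. The greedy construction below is an illustrative sketch, not the construction used in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 3, 0.5

# Greedily build a 2*eps-separated packing of the unit Euclidean ball.
packing = []
for x in rng.normal(size=(5000, d)):
    x /= max(1.0, np.linalg.norm(x))  # project into the unit ball
    if all(np.linalg.norm(x - p) >= 2 * eps for p in packing):
        packing.append(x)

# Scaling trick: multiplying every element by delta yields a 2*eps*delta-packing
# of the radius-delta ball, with exactly the same number of elements.
delta = 0.1
scaled = [delta * p for p in packing]
for i in range(len(scaled)):
    for j in range(i + 1, len(scaled)):
        assert np.linalg.norm(scaled[i] - scaled[j]) >= 2 * eps * delta - 1e-12
```

This is why the packing cardinality bound for the unit ball transfers directly to balls of any radius, with the packing radius rescaled accordingly.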
We now proceed with bounding the KL divergences.
where . We write and , then,
The second line is derived using the Cauchy–Schwarz inequality, and the final inequality uses . We will now proceed by invoking Corollary 1 on this packing set. This will require choosing to achieve our desired rate and will in turn impose constraints on the problem dimensions to ensure the packing is valid.
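For intuition, this style of KL bound can be sketched for isotropic Gaussian likelihoods (an assumption made here purely for illustration; the theorem's exact model is not reproduced). For a fixed design matrix X and noise variance sigma^2, the KL divergence between N(X w1, sigma^2 I) and N(X w2, sigma^2 I) is ||X(w1 - w2)||^2 / (2 sigma^2), which a Cauchy–Schwarz / operator-norm step bounds by the squared maximum singular value of X times ||w1 - w2||^2 / (2 sigma^2).

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, sigma2 = 20, 5, 0.25
X = rng.normal(size=(n, d))
w1 = rng.normal(size=d)
w2 = rng.normal(size=d)

# KL between N(X w1, sigma^2 I) and N(X w2, sigma^2 I): ||X(w1 - w2)||^2 / (2 sigma^2).
kl_exact = np.sum((X @ (w1 - w2)) ** 2) / (2 * sigma2)

# Operator-norm upper bound: sigma_max(X)^2 * ||w1 - w2||^2 / (2 sigma^2).
sigma_max = np.linalg.svd(X, compute_uv=False)[0]
kl_bound = sigma_max**2 * np.sum((w1 - w2) ** 2) / (2 * sigma2)

assert kl_exact <= kl_bound + 1e-9
```

Bounds of this shape are what make the KL divergence between packing elements scale with the squared packing radius, which drives the choice of below.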
Now, using Corollary 1,
Choosing gives,
for . To enforce , we further require that,
Additionally, we may only consider packing sets with KL divergence no more than , hence we also require that,
Thus,
∎
Appendix C Hierarchical Bayesian Linear Regression Upper Bounds
C.1 Some useful linear algebra results
Let denote the maximum singular value of , and the minimum singular value of .
Lemma 6.
Singular value of a sum of two matrices. Let