Meta-Learning in Machine Learning

Wednesday, November 04, 2020

Meta-learning in machine learning refers to learning algorithms that learn from other learning algorithms. Most commonly, this means the use of machine learning algorithms that learn how to best combine the predictions from other machine learning algorithms in the field of ensemble learning. Nevertheless, meta-learning might also refer to the manual process of model selecting and algorithm tuning performed by a practitioner on a machine learning project that modern automl algorithms seek to automate. It also refers to learning across multiple related predictive modeling tasks, called multi-task learning, where meta-learning algorithms learn how to learn.


What Is Meta?

Meta typically means raising the level of abstraction one step and often refers to information about something else. For example, you are probably familiar with “meta-data,” which is data about data. Data about data is often called metadata. You store data in a file and a common example of metadata is data about the data stored in the file, such as:

The name of the file.

The size of the file.

The date the file was created.

The date the file was last modified.

The type of file.

The path to the file.

Now that we are familiar with the idea of “meta,” let’s consider the use of the term in machine learning, such as “meta-learning.”


What Is Meta-Learning?

Meta-learning refers to learning about learning. Meta-learning in machine learning most commonly refers to machine learning algorithms that learn from the output of other machine learning algorithms. In our machine learning project where we are trying to figure out (learn) what algorithm performs best on our data, we could think of a machine learning algorithm taking the place of ourselves, at least to some extent.

Machine learning algorithms learn from historical data. For example, supervised learning algorithms learn how to map examples of input patterns to examples of output patterns to address classification and regression predictive modeling problems. Algorithms are trained on historical data directly to produce a model. The model can then be used later to predict output values, such as a number or a class label, for new examples of input.

Learning Algorithms: Learn from historical data and make predictions given new examples of data. Meta-learning algorithms learn from the output of other machine learning algorithms that learn from data. This means that meta-learning requires the presence of other learning algorithms that have already been trained on data.

For example, supervised meta-learning algorithms learn how to map examples of output from other learning algorithms (such as predicted numbers or class labels) onto examples of target values for classification and regression problems. Similarly, meta-learning algorithms make predictions by taking the output from existing machine learning algorithms as input and predicting a number or class label.

Meta-Learning Algorithms: Learn from the output of learning algorithms and make a prediction given predictions made by other models. In this way, meta-learning occurs one level above machine learning. If machine learning learns how to best use information in data to make predictions, then meta-learning or meta machine learning learns how to best use the predictions from machine learning algorithms to make predictions.

Now that we are familiar with the idea of meta-learning, let’s look at some examples of meta-learning algorithms.

Meta-Algorithms, Meta-Classifiers, and Meta-Models

Meta-learning algorithms are often referred to simply as meta-algorithms or meta-learners.

Meta-Algorithm: Short-hand for a meta-learning machine learning algorithm. Similarly, meta-learning algorithms for classification tasks may be referred to as meta-classifiers and meta-learning algorithms for regression tasks may be referred to as meta-repressors.

Meta-Classifier: Meta-learning algorithm for classification predictive modeling tasks.

Meta-Regression: Meta-learning algorithm for regression predictive modeling tasks.

After a meta-learning algorithm is trained, it results in a meta-learning model, e.g. the specific rules, coefficients, or structure learned from data. The meta-learning model or meta-model can then be used to make predictions.

Meta-Model: Result of running a meta-learning algorithm. The most widely known meta-learning algorithm is called stacked generalization, or stacking for short.

Stacking is probably the most-popular meta-learning technique. Stacking is a type of ensemble learning algorithm. Ensemble learning refers to machine learning algorithms that combine the predictions for two or more predictive models. Stacking uses another machine learning model, a meta-model, to learn how to best combine the predictions of the contributing ensemble members.

In order to induce a Meta classifier, first the base classifiers are trained (stage one), and then the Meta classifier (second stage). In the prediction phase, base classifiers will output their classifications, and then the Meta-classifier(s) will make the final classification (as a function of the base classifiers). As such, the stacking ensemble algorithm is referred to as a type of meta-learning, or as a meta-learning algorithm.

Stacking: Ensemble machine learning algorithm that uses meta-learning to combine the predictions made by ensemble members.

There are also lesser-known ensemble learning algorithms that use a meta-model to learn how to combine the predictions from other machine learning models. Most notably, a mixture of experts that uses a gating model (the meta-model) to learn how to combine the predictions of expert models.

More generally, meta-models for supervised learning are almost always ensemble learning algorithms, and any ensemble learning algorithm that uses another model to combine the predictions from ensemble members may be referred to as a meta-learning algorithm.

Instead, stacking introduces the concept of a metal earner […] Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm—the metal earner—to discover how best to combine the output of the base learners.

Model Selection and Tuning as Meta-Learning

Training a machine learning algorithm on a historical dataset is a search process. The internal structure, rules, or coefficients that comprise the model are modified against some loss function. This type of search process is referred to as optimization, as we are not simply seeking a solution, but a solution that maximizes a performance metric like classification or minimizes a loss score, like prediction error.

This idea of learning as optimization is not simply a useful metaphor; it is the literal computation performed at the heart of most machine learning algorithms, either analytically (least squares) or numerically (gradient descent), or some hybrid optimization procedure.

A level above training a model, the meta-learning involves finding a data preparation procedure, learning algorithm, and learning algorithm hyperparameters (the full modeling pipeline) that result in the best score for a performance metric on the test harness.

This, too, is an optimization procedure that is typically performed by a human. As such, we could think of ourselves as meta-learners on a machine learning project. This is not the common meaning of the term, yet it is a valid usage. This would cover tasks such as model selection and algorithm hyperparameter tuning.

Automating the procedure is generally referred to as automated machine learning, shortened to “automl.”

… the user simply provides data, and the AutoML system automatically determines the approach that performs best for this particular application. Thereby, AutoML makes state-of-the-art machine learning approaches accessible to domain scientists who are interested in applying machine learning but do not have the resources to learn about the technologies behind it in detail. Automl may not be referred to as meta-learning, but automl algorithms may harness meta-learning across learning tasks, referred to as learning to learn.

Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible.

Multi-Task Learning as Meta-Learning

Learning to learn is a related field of study that is also colloquially referred as meta-learning.

If learning involves an algorithm that improves with experience on a task, then learning to learn is an algorithm that is used across multiple tasks that improves with experiences and tasks.

… an algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks.

Rather than manually developing an algorithm for each task or selecting and tuning an existing algorithm for each task, learning to learn algorithms adjust themselves based on a collection of similar tasks. Meta-learning provides an alternative paradigm where a machine learning model gains experience over multiple learning episodes – often covering a distribution of related tasks – and uses this experience to improve its future learning performance.

Algorithms that are developed for multi-task learning problems learn how to learn and may be referred to as performing meta-learning. The idea of using learning to learn or meta-learning to acquire knowledge or inductive biases has a long history.

Learning to Learn: Application learning algorithms on multi-task learning problems in which they perform meta-learning across the tasks, e.g. learning about learning on the tasks.

This includes familiar techniques such as transfer learning that are common in deep learning algorithms for computer vision. This is where a deep neural network is trained on one computer vision task and is used as the starting point, perhaps with very little modification or training for a related vision task.

Transfer learning works well when the features that are automatically extracted by the network from the input images are useful across multiple related tasks, such as the abstract features extracted from common objects in photographs.

This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.