Empowering Models to Learn How to Learn: A Deep Dive into MAML, FOMAML, and Reptile

In the ever-evolving landscape of machine learning, one of the most exciting frontiers is meta-learning: the art of teaching models not just to solve tasks, but to quickly adapt to new tasks with minimal data. In this long-form blog post, we’ll explore three influential algorithms in this domain:

  1. MAML (Model-Agnostic Meta-Learning)
  2. FOMAML (First-Order MAML)
  3. Reptile

Along the way, we’ll revisit the key questions that arose during our discussion, unpack ablation studies, and point you to the seminal papers that introduced these methods.

Table of Contents

  1. The Meta-Learning Challenge
  2. MAML: Learning a Good Initialization
    1. Algorithm Overview (Supervised)
    2. Inner Loop vs Outer Loop
    3. Ablation Studies and Insights
  3. FOMAML: First-Order Approximation
    1. Why Skip Second-Order Gradients?
    2. Outer Loop Update in FOMAML
  4. Reptile: A Simpler Alternative
    1. SGD Inner Loop
    2. Meta-Update via Vector Difference
  5. Frequently Asked Questions
  6. Papers and Further Reading
  7. Conclusion

The Meta-Learning Challenge

Traditional supervised learning trains a model on a large, fixed dataset and then evaluates it on held-out data. But in many real-world scenarios—medical diagnosis, robotics, personalization—we need models that can adapt rapidly when faced with a new task and only a handful of labeled examples. This is the realm of few-shot learning, and meta-learning offers a principled way to achieve it.

At its heart, meta-learning asks: Can we learn a learning procedure? Instead of just optimizing a model to perform well on one task, we optimize it so that fine-tuning on new tasks is quick and effective.

MAML: Learning a Good Initialization

Introduced by Finn et al. (2017) in “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks” (arXiv:1703.03400), MAML is one of the most popular meta-learning methods. It is “model-agnostic” because it can be applied to any differentiable model—classification, regression, or even reinforcement learning.

Algorithm Overview (Supervised)

1. Random Initialization: Start with parameters θ.
2. Sample Tasks: Draw a batch of tasks T_i ∼ p(T), each with its own small support set (K-shot) and query set.
3. Inner Loop (Adaptation):
   θ'_i = θ - α ∇_θ L_T_i^train(θ)
   (Few gradient steps on each task’s support set.)
4. Outer Loop (Meta-Update):
   Evaluate L_T_i^val(θ'_i) on query sets.
   θ = θ - β ∇_θ Σ_i L_T_i^val(θ'_i)
   (Backprop through inner loop; second-order gradients.)
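To make the bi-level structure above concrete, here is a minimal sketch in plain Python using 1-D quadratic task losses L_i(θ) = 0.5 (θ − c_i)², where every gradient, including the second-order term from differentiating through the inner step, is analytic. The task targets, step sizes, and loop counts are illustrative, not from the paper.

```python
# Minimal MAML sketch on toy 1-D tasks with loss L_i(theta) = 0.5*(theta - c_i)^2.
# All gradients are analytic, so the second-order term is explicit.
# alpha, beta, and the task targets are illustrative choices.

def maml_meta_step(theta, tasks, alpha=0.1, beta=0.01):
    """One outer-loop step; 'tasks' is a list of task optima c_i."""
    meta_grad = 0.0
    for c in tasks:
        # Inner loop: one gradient step on the task's support loss.
        grad_train = theta - c              # d/dtheta of 0.5*(theta - c)^2
        theta_i = theta - alpha * grad_train
        # Outer loop: backprop THROUGH the inner step.
        # dL_val/dtheta = (theta_i - c) * d(theta_i)/dtheta
        #              = (theta_i - c) * (1 - alpha)   <- second-order factor
        meta_grad += (theta_i - c) * (1.0 - alpha)
    return theta - beta * meta_grad

theta = 0.0
for _ in range(200):
    theta = maml_meta_step(theta, tasks=[-2.0, 4.0])
# theta drifts toward an initialization from which each task
# is reachable in a single inner gradient step.
```

Note the (1 − α) factor: it is the derivative of the inner update with respect to θ, exactly the term a first-order approximation discards.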

Inner Loop vs Outer Loop

Inner loop: Task-specific fine-tuning (few-shot learning).
Outer loop: Meta-optimization to improve the starting point θ for all tasks.

Ablation Studies and Insights

In the experimental evaluation of the MAML paper, the authors ran ablations comparing full MAML against a first-order approximation of the meta-gradient and against baselines such as pretraining on all tasks followed by standard fine-tuning.

On Omniglot and Mini-ImageNet, MAML consistently outperformed the pretraining baseline, while the first-order approximation performed nearly as well as the full second-order method (a finding that directly motivates FOMAML), confirming the power of learning to learn.

FOMAML: First-Order Approximation

While MAML is elegant, the need for second-order derivatives can be computationally heavy. Enter First-Order MAML (FOMAML), a simplification first evaluated by Finn et al. (2017) and analyzed in depth by Nichol et al. (2018) in “On First-Order Meta-Learning Algorithms” (arXiv:1803.02999).

Why Skip Second-Order Gradients?

MAML’s outer update backpropagates through the inner loop, which requires second-order derivatives (Hessian-vector products). This adds significant compute and memory overhead that grows with the number of inner steps. Empirically, dropping these second-order terms often costs little accuracy, making a first-order approximation an attractive trade-off.

Outer Loop Update in FOMAML

1. Inner Loop (same as MAML):
   θ'_i = θ - α ∇_θ L_T_i^train(θ)
2. Outer Loop:
   g_i = ∇_{θ'_i} L_T_i^val(θ'_i)
   θ = θ - β Σ_i g_i
   (Treat θ'_i as constant; no second-order gradients.)
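As a sketch, consider the same kind of 1-D quadratic task losses, 0.5 (θ − c_i)² (targets and step sizes illustrative): the first-order update treats θ'_i as a constant, so the outer gradient is just the task’s validation gradient with no inner-step Jacobian.

```python
# Minimal FOMAML sketch on toy 1-D tasks with loss 0.5*(theta - c)^2.
# Compared with full MAML, the (1 - alpha) Jacobian factor from the
# inner step is dropped. alpha, beta, and targets are illustrative.

def fomaml_meta_step(theta, tasks, alpha=0.1, beta=0.01):
    meta_grad = 0.0
    for c in tasks:
        theta_i = theta - alpha * (theta - c)   # inner SGD step
        meta_grad += (theta_i - c)              # first-order: no d(theta_i)/dtheta
    return theta - beta * meta_grad

theta = 0.0
for _ in range(200):
    theta = fomaml_meta_step(theta, tasks=[-2.0, 4.0])
# On this toy problem the fixed point matches full MAML's:
# theta moves toward a good shared initialization.
```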

Reptile: A Simpler Alternative

Introduced in the same Nichol et al. (2018) paper, Reptile sidesteps meta-gradients altogether by using a vector-difference update.

SGD Inner Loop

Sample task T_i. Run k steps of SGD on the support set starting from θ, yielding θ'_i.

Meta-Update via Vector Difference

θ = θ + ε (θ'_i - θ)

No losses or gradients in the outer loop: just move θ toward θ'_i (or, with a batch of tasks, toward the average of the θ'_i).
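The whole procedure fits in a few lines. Here is a sketch on toy 1-D quadratic task losses 0.5 (θ − c_i)², using the batched variant that averages the vector differences over a set of tasks; α, k, ε, and the task targets are illustrative choices.

```python
# Minimal Reptile sketch: run k inner SGD steps per task on the loss
# 0.5*(theta - c)^2, then move theta toward the average adapted
# parameters. No outer-loop losses or gradients at all.
# alpha, k, epsilon, and the task targets are illustrative.

def reptile_meta_step(theta, tasks, alpha=0.1, k=5, epsilon=0.1):
    total_delta = 0.0
    for c in tasks:
        theta_i = theta
        for _ in range(k):                      # inner loop: k SGD steps
            theta_i = theta_i - alpha * (theta_i - c)
        total_delta += theta_i - theta          # vector difference
    return theta + epsilon * total_delta / len(tasks)

theta = 0.0
for _ in range(200):
    theta = reptile_meta_step(theta, tasks=[-2.0, 4.0])
# theta converges to a point from which each task optimum
# is only a few inner SGD steps away.
```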

Frequently Asked Questions

  1. What is “1-shot” vs “5-shot” classification?
    “N-way, K-shot” means N classes with K labeled examples per class. In a 5-way, 1-shot task, you have 5 classes with 1 example each.
  2. What happens in the inner vs outer loops?
    Inner loop: Fast adaptation to a specific task.
    Outer loop: Meta-update to improve θ for future tasks.
  3. How exactly do we update θ in each algorithm?
    MAML: θ = θ - β ∇_θ L^val(θ'_i(θ)) (second-order)
    FOMAML: θ = θ - β ∇_{θ'_i} L^val(θ'_i) (first-order)
    Reptile: θ = θ + ε (θ'_i - θ)
  4. What is stochastic gradient descent (SGD)?
    Running SGD means repeatedly sampling mini-batches, computing the loss gradient, and stepping θ accordingly.
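To make that sample-gradient-step cycle concrete, here is a bare-bones SGD loop fitting a 1-D linear model y = w·x; the data, batch size, and learning rate are illustrative.

```python
# Bare-bones SGD: repeatedly sample a mini-batch, compute the loss
# gradient, and step the parameter. Data and hyperparameters are
# illustrative; the true slope is w = 3.
import random

random.seed(0)
data = [(x, 3.0 * x) for x in range(-10, 11)]    # noiseless line y = 3x
w, lr = 0.0, 0.01

for step in range(500):
    batch = random.sample(data, 4)               # sample a mini-batch
    # gradient of mean squared error 0.5*(w*x - y)^2 with respect to w
    grad = sum((w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                               # gradient step
```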

Papers and Further Reading

  1. Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv:1703.03400.
  2. Nichol, A., Achiam, J., & Schulman, J. (2018). On First-Order Meta-Learning Algorithms. arXiv:1803.02999.

Conclusion

Meta-learning represents a powerful paradigm shift: teaching models how to learn rather than what to learn. From MAML’s elegant bi-level optimization to FOMAML’s pragmatic first-order approximation and Reptile’s sheer simplicity, these methods span a spectrum that balances computational cost against adaptation performance.

By understanding these algorithms, their updates, and their trade-offs, you’ll be well-equipped to tackle few-shot and continual learning challenges in your own projects. Happy meta-learning!