Continual Learning · Author Lens · Multimodal Pretraining

Keeping Foundation Models from Forgetting Everything (Playing the Author)

Class Notes Continual Learning & Memory Models

For this one, I stepped into the Author lens and presented A Practitioner's Guide to Continual Multimodal Pretraining like a conference pitch. Full credit belongs to the original paper's research team, and this post is my summary plus class takeaways.

The TL;DR Overview

The paper is called A Practitioner's Guide to Continual Multimodal Pretraining.

It tackles a very real issue: how do you keep updating a massive AI model without it forgetting everything it already knows?

Right now, foundation models mostly get major updates (retraining from scratch on massive data, which is expensive) or patch updates (editing a single tiny fact). But in the real world, models need minor updates. For example, a vision model may need to adapt to medical X-rays one month and satellite imagery the next. You do not want to retrain the entire model every time, but you also cannot risk it forgetting previously learned concepts. That is the classic stability-plasticity tradeoff.
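To make the stability-plasticity tradeoff concrete, here is a toy sketch (my own illustration, not from the paper): a linear model is first fit to "task A," then naively fine-tuned on a conflicting "task B." The task setup, learning rate, and step counts are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(w_true):
    """Generate a toy regression task whose optimal weights are w_true."""
    X = rng.normal(size=(200, 2))
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def sgd(w, X, y, lr=0.1, steps=200):
    """Plain full-batch gradient descent on squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

Xa, ya = make_task(np.array([1.0, 0.0]))  # task A (e.g., the old domain)
Xb, yb = make_task(np.array([0.0, 1.0]))  # task B, with a conflicting optimum

w = sgd(np.zeros(2), Xa, ya)              # "pretrain" on task A
loss_a_before = mse(w, Xa, ya)
w = sgd(w, Xb, yb)                        # naive minor update on task B
loss_a_after = mse(w, Xa, ya)

# Plasticity: task B is now fit well. Stability: task A performance degraded.
print(loss_a_before, loss_a_after)
```

Naive sequential fine-tuning drives the weights toward task B's optimum, so the loss on task A rises sharply: that rise is exactly the forgetting that continual pretraining methods try to bound.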

The authors argue that existing benchmarks are not realistic enough for this setting: they often assume unlimited compute, underrepresent multimodal update behavior, and do not always track whether zero-shot capabilities degrade after adaptation.

To address this, they introduce FoMo-in-Flux (Foundation Models in Flux), a benchmark for continual multimodal pretraining. It is compelling precisely because it targets the gaps above: updates happen under realistic compute and data constraints, and evaluation tracks whether zero-shot capabilities survive each adaptation step rather than only measuring performance on the new data.

My Comments & Takeaways

A few points stood out during the presentation and the class discussion.

Something to Think About

A discussion point from class that stayed with me: if we rely heavily on merge-style anchoring to base weights for stability, do we eventually hit a plasticity ceiling?

If the base weights become too strong an anchor, they may protect old capabilities but also restrict deep adaptation to truly divergent domains. Is there a hard limit to continual minor updating, and at some point do we inevitably need to retrain from scratch?
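The anchoring idea we discussed can be sketched as weight-space interpolation between the base and updated model. This is a generic sketch of merge-style anchoring (linear weight averaging), not the paper's specific recipe; `alpha` is a hypothetical mixing coefficient.

```python
import numpy as np

def merge_to_base(base, finetuned, alpha=0.5):
    """Interpolate each parameter tensor between base and fine-tuned weights.

    alpha=0 returns the base model (maximum stability);
    alpha=1 returns the fine-tuned model (maximum plasticity).
    """
    return {name: (1 - alpha) * base[name] + alpha * finetuned[name]
            for name in base}

# Tiny stand-in "models": one named parameter tensor each.
base_weights = {"w": np.array([1.0, 0.0])}
finetuned_weights = {"w": np.array([0.0, 1.0])}

merged = merge_to_base(base_weights, finetuned_weights, alpha=0.5)
print(merged["w"])  # halfway between the two endpoints
```

The discussion question maps directly onto `alpha`: keeping it small preserves old capabilities, but if the new domain is far from the base model in weight space, no small-`alpha` merge can fully reach it, which is the plasticity ceiling in question.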