Continual Learning: A Researcher's Lens on Curriculum Learning

Stress-Testing "Pacing Over Ordering" (Playing the Researcher)

Class Notes: Continual Learning & Memory Models

This is the last post in my Continual Learning series, and this time I presented from the Researcher role.

So instead of only summarizing the paper, I had to think about what should come next: if I were continuing this line of work, what experiment would I run right after this one?

The paper is When Do Curricula Work?, and it makes a bold claim about whether training order really matters. Here is my simple breakdown of the paper, one human-learning analogy that came to mind, and the follow-up I would pitch.

The TL;DR: Pacing vs. Ordering

In human learning, we usually rely on a curriculum: we learn easy things first (algebra) before hard things (calculus). For a long time, AI researchers assumed neural networks should learn the exact same way.

But this paper basically argues that the "easy-to-hard" order does not actually matter for AI.

Their massive takeaway? Pacing is everything; ordering is basically random noise. They found that a "random curriculum" (dynamically increasing the dataset size but feeding in completely random examples) performed just as well as a carefully ordered easy-to-hard curriculum.
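The "random curriculum" idea is easy to sketch in code: a pacing function unlocks a growing fraction of the dataset over training, but the examples inside that fraction are in random order. Here is a minimal illustration, assuming a pre-shuffled dataset; the linear pacing shape and all function names are my own, not the paper's.

```python
import random

def pacing_fraction(step, total_steps, start_frac=0.1, warmup_frac=0.8):
    """Linear pacing: fraction of the dataset 'unlocked' at a given step.
    (The pacing shape is illustrative; the paper sweeps many shapes.)"""
    full_at = warmup_frac * total_steps
    if step >= full_at:
        return 1.0
    return start_frac + (1.0 - start_frac) * (step / full_at)

def random_curriculum_batch(dataset, step, total_steps, batch_size=4):
    """Sample uniformly from the unlocked prefix of a pre-shuffled dataset:
    the available size grows (pacing), but order within it is random."""
    n = max(batch_size, int(pacing_fraction(step, total_steps) * len(dataset)))
    return random.sample(dataset[:n], batch_size)
```

Swapping the random prefix for an easy-to-hard-sorted one turns this into a classic curriculum; the paper's finding is that, with the pacing held fixed, that swap barely changes the result.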

My Thoughts: The "Trial By Fire" Theory

Reading this paper got me thinking about how humans actually learn. We assume it is always easy-to-hard, but is it really?

Take riding a bike. A standard "curriculum" would be starting with training wheels. But a lot of us just jump on a normal two-wheeler, struggle, fall a few times, and eventually our brains just click and figure out the balance. It is incredibly hard at first, but then it becomes second nature.

That is basically an anti-curriculum: trial by fire.

In machine learning, there is a real argument for this. If you only give a model "easy" examples, it might get lazy and just memorize simple shortcuts. But if you throw the absolute hardest, most complex data at it from day one, it is forced to learn the deep, underlying patterns immediately. It is like learning to drive in crazy city traffic so that highway driving feels effortless later. It makes you wonder if AI models actually benefit more from the "scraped knee" approach to learning.

My Pitch: Stress-Testing the Decomposition

The paper's conclusion that "ordering does not matter" is huge, but there is a catch: they proved it with a very specific setup, namely image classification (CIFAR, Food-101) using standard CNNs (ResNet-50) and a very specific learning rate schedule (cosine decay).
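For context, cosine decay anneals the learning rate smoothly from its peak down to (near) zero over training. A quick sketch of the standard formula follows; the parameter values are placeholders, not the paper's actual hyperparameters.

```python
import math

def cosine_decay_lr(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Standard cosine annealing (no restarts): lr_max at step 0,
    lr_min at the final step."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term
```

The schedule itself is a confound worth noting: because the learning rate is largest early in training, whatever data is shown first gets the biggest updates, which is exactly why the interaction between pacing and the schedule deserves stress-testing.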

So, my proposed follow-up project is simple: take their exact methodology and stress-test it across four specific axes that they could not cover, just to see if this rule actually survives at scale.

Key takeaway: If their theory holds up everywhere, it means the field can stop wasting millions of compute hours trying to perfectly order data. But if it breaks on any of these axes, we get a concrete decision rule on exactly when to actually invest in a curriculum.

Class Discussion Highlights

During the discussion, a massive blind spot was brought up regarding the paper's methodology that completely changes how you view their results.

It was pointed out that three of the four benchmarks used to support this "ordering does not matter" claim are perfectly class-balanced. But the real world does not work like that! The question was raised: how might extreme class imbalance completely change these findings?

There was also a critique of how the authors defined "hard" examples. For their anti-curriculum tests, they prioritized "inconsistently classified" images. But it was argued that this just front-loads a bunch of noisy, junk images and mixes them with genuinely complex data at the start of training, which skews the results.
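To make that critique concrete, here is a rough sketch of how a consistency-based difficulty score could be computed from several training runs. The data layout and function names are my assumptions for illustration, not the paper's exact procedure.

```python
def consistency_scores(correct):
    """correct[run][example] is 1 if that run classified the example
    correctly, else 0. The score is the fraction of runs that got it
    right; low scores mark 'inconsistently classified' (hard) examples."""
    n_runs = len(correct)
    n_examples = len(correct[0])
    return [sum(run[j] for run in correct) / n_runs for j in range(n_examples)]

def anti_curriculum_order(correct):
    """Hardest-first ordering: least consistently classified examples
    lead. Note this is exactly where label noise would pile up, since
    a mislabeled image also scores low."""
    scores = consistency_scores(correct)
    return sorted(range(len(scores)), key=scores.__getitem__)
```

This makes the class's objection visible: the score cannot distinguish a genuinely complex image from a mislabeled one, so "hardest first" is also "noisiest first".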

It was a great reminder that even when a NeurIPS paper makes a massive, sweeping claim, there is almost always a hidden variable just waiting to be stress-tested.