This is the last post in my Continual Learning series, and this time I presented in the Researcher role.
So instead of only summarizing the paper, I had to think about what should come next: if I were continuing this line of work, what experiment would I run right after this?
The paper is When Do Curricula Work?, and it makes a bold claim about how training order really works. Here is my simple breakdown of the paper, one human-learning analogy that came to mind, and the follow-up I would pitch.
The TL;DR: Pacing vs. Ordering
In human learning, we usually rely on a curriculum: we learn easy things first (algebra) before hard things (calculus). For a long time, AI researchers assumed neural networks should learn the exact same way.
But this paper basically argues that the "easy-to-hard" order does not actually matter for AI.
- Ordering: Which examples you show first (easy vs. hard).
- Pacing: How fast you grow the training dataset over time.
Their massive takeaway? Pacing is everything; ordering is basically noise. They found that a "random curriculum" (dynamically growing the dataset but feeding in completely random examples) performed just as well as a carefully ordered easy-to-hard curriculum.
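To make the pacing/ordering split concrete, here is a minimal sketch in plain Python. The function names and the linear pacing schedule are my own illustration, not the paper's exact implementation: an "ordering" ranks the dataset, and a "pacing function" decides how many top-ranked examples are unlocked at each step.

```python
import random

def linear_pacing(step, total_steps, dataset_size, start_frac=0.1):
    """Available-set size grows linearly from start_frac of the data to all of it."""
    frac = start_frac + (1.0 - start_frac) * step / total_steps
    return max(1, int(frac * dataset_size))

def sample_batch(ordering, step, total_steps, batch_size=8):
    """Draw a batch uniformly from the currently unlocked prefix of the ordering."""
    n = linear_pacing(step, total_steps, len(ordering))
    pool = ordering[:n]
    return random.sample(pool, min(batch_size, len(pool)))

indices = list(range(1000))
easy_to_hard = indices                       # pretend indices are sorted by difficulty
random_order = random.sample(indices, 1000)  # the "random curriculum"

# Same pacing, different ordering: the paper's finding is that these two
# runs end up performing about the same.
batch_a = sample_batch(easy_to_hard, step=50, total_steps=1000)
batch_b = sample_batch(random_order, step=50, total_steps=1000)
```

The point of the sketch: swapping `easy_to_hard` for `random_order` changes only the ordering, while the pacing function (the part the paper says actually matters) stays identical.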
My Thoughts: The "Trial By Fire" Theory
Reading this paper got me thinking about how humans actually learn. We assume it is always easy-to-hard, but is it really?
Take riding a bike. A standard "curriculum" would be starting with training wheels. But a lot of us just jump on a normal two-wheeler, struggle, fall a few times, and eventually our brains just click and figure out the balance. It is incredibly hard at first, but then it becomes second nature.
That is basically an anti-curriculum: trial by fire.
In machine learning, there is a real argument for this. If you only give a model "easy" examples, it might get lazy and just memorize simple shortcuts. But if you throw the absolute hardest, most complex data at it from day one, it is forced to learn the deep, underlying patterns immediately. It is like learning to drive in crazy city traffic so that highway driving feels effortless later. It makes you wonder if AI models actually benefit more from the "scraped knee" approach to learning.
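The "trial by fire" idea maps directly onto what this literature calls an anti-curriculum: same difficulty scores, opposite sort. A toy sketch (the example names and scores are made up; in practice difficulty comes from something like a reference model's loss or classification consistency):

```python
# Hypothetical difficulty scores for three training examples.
difficulty = {"plain_digit": 0.1, "blurry_photo": 0.5, "occluded_scene": 0.9}

curriculum      = sorted(difficulty, key=difficulty.get)                # easy -> hard
anti_curriculum = sorted(difficulty, key=difficulty.get, reverse=True)  # hard -> easy
# anti_curriculum starts with "occluded_scene": the scraped-knee approach.
```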
My Pitch: Stress-Testing the Decomposition
The paper's conclusion that "ordering does not matter" is huge, but there is a catch: they proved it in a narrow setup, namely image classification (CIFAR, Food-101) with standard CNNs (ResNet-50) and a single learning rate schedule (cosine decay).
So, my proposed follow-up project is simple: take their exact methodology and stress-test it across four specific axes they did not cover, to see whether this rule actually survives at scale.
- Axis 1: Model Scale & Architecture - The original paper only used ResNet-50. Does this "pacing over ordering" rule still hold if we switch to Vision Transformers (ViTs)? We need to know whether the finding breaks when the architecture changes.
- Axis 2: Modality (Text vs. Vision) - Everything they tested was image classification. But text has a natural, hierarchical structure that images do not: a complex sentence literally requires you to understand the simpler words inside it. I want to apply their exact algorithm to language model pretraining. If ordering suddenly starts mattering here, that would suggest language has dependencies that vision simply does not.
- Axis 3: The Learning Rate Schedule - The original paper strictly used cosine decay, which drops the learning rate to essentially zero by the end of training. But think about it: if your curriculum saves the hardest examples for the very end, and the learning rate is effectively zero when the model finally sees them, it will not learn anything from them anyway! I'd rerun the tests with a more moderate decay schedule to see whether the learning rate was secretly suppressing the benefits of the curriculum.
- Axis 4: Noise + Limited Time - The authors tested noisy data and limited-time training separately. But in the real world (like training an LLM on web-scraped data for a single epoch), you have to deal with both at the exact same time. Do the benefits of a curriculum compound here, or do they actually cancel out? Nobody has tested this.
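The Axis 3 worry is easy to sanity-check numerically. A quick sketch of a standard cosine-decay schedule (the peak learning rate and step count are assumed values for illustration, not taken from the paper):

```python
import math

def cosine_decay(step, total_steps, lr_max=0.1):
    """Cosine-decay schedule: lr_max at step 0, decaying to zero at the end."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * step / total_steps))

total = 10_000
lr_start = cosine_decay(0, total)      # full learning rate: 0.1
lr_late  = cosine_decay(9_500, total)  # last 5% of training: roughly 0.0006
```

If an easy-to-hard curriculum defers the hardest examples to that final 5%, they arrive at a learning rate more than a hundred times smaller than the initial one, so the schedule alone could mask any benefit of the ordering.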
Key takeaway: If their theory holds up everywhere, it means the field can stop wasting millions of compute hours trying to perfectly order data. But if it breaks on any of these axes, we get a concrete decision rule on exactly when to actually invest in a curriculum.
Class Discussion Highlight
During the discussion, a massive blind spot in the paper's methodology came up, one that completely changes how you view the results.
It was pointed out that three of the four benchmarks used to prove this "ordering does not matter" theory are perfectly class-balanced. But the real world does not work like that! The question was raised: How might extreme class imbalance completely change these findings?
There was also a critique of how the authors defined "hard" examples. For their anti-curriculum tests, the authors prioritized "inconsistently classified" images. But it was argued that this just front-loads a bunch of noisy, junk images and mixes them up with genuinely complex data at the start of training, which totally skews the results.
It was a great reminder that even when an ICLR paper makes a massive, sweeping claim, there is almost always a hidden variable just waiting to be stress-tested.