Abstract: Training capable small language models is a central challenge, yet existing distillation
methods treat teachers as static supervision sources. I argue that effective learning depends on
how a small model learns from a larger one and on when that learning happens. I show that intermediate
teacher checkpoints reveal implicit learning trajectories, and that aligning students to these
trajectories yields provable sample-complexity benefits. Building on this, I develop GRACES,
which predicts teacher-student compatibility from gradients, and STAT, which adapts
supervision to student weaknesses. These principles extend beyond distillation to context-enhanced learning using privileged information and progressive random training. I outline a vision for autonomous supervision systems that adapt to learner characteristics without manual curriculum design, and discuss the challenges that remain.
Bio:
Abhishek is a final-year graduate student in the Computer Science department at Princeton
University, advised by Prof. Sanjeev Arora. His research focuses on understanding and
improving generalization in deep learning models, with an emphasis on principled training
algorithms that offer theoretical or interpretable guarantees. He is an Apple AI/ML Scholar and a Siebel Scholar for 2025-26. Prior to his PhD, he was a resident at the Microsoft Research India Lab and studied computer science as an undergraduate at IIT Kharagpur.