Abstract:
Training small language models requires effective distillation, yet existing methods treat
teachers as static supervision sources. I argue that effective learning depends not only on what a
model learns but also on when it learns it, principles that extend beyond traditional teacher–student setups.
First, I show that intermediate teacher checkpoints reveal implicit learning curricula,
and that aligning students to these trajectories yields provable sample-complexity
benefits. Building on this, I develop GRACES, which predicts teacher–student
compatibility from gradients, and STAT, which adapts supervision to a student’s weak
skills. I show how these ideas extend beyond distillation to progressive subnetwork
training and context-enhanced learning, pointing toward a more general theory of efficient
learning. Finally, I outline a vision for autonomous systems that can construct their own training
curricula.
Bio:
Abhishek is a final-year graduate student in the Computer Science department at Princeton
University, advised by Prof. Sanjeev Arora. His research focuses on understanding and
improving generalization in deep learning models, with an emphasis on principled training
algorithms that offer theoretical or interpretable guarantees. He is an Apple AI/ML Scholar
and a Siebel Scholar for 2025-26. Prior to his PhD, he was a resident at Microsoft Research
India Lab and studied computer science as an undergraduate at IIT Kharagpur.