ElectroSphere: Electrical Electronics Engineering Bulletin
GKD-ER: Gradient-Space Knowledge Distillation with Episodic Replay for Mitigating Catastrophic Forgetting in Continual Learning
John Tian
Abstract
Continual learning (CL) seeks to enable machine learning models to learn a sequence of tasks incrementally without substantial degradation on previously mastered tasks. This capability is central to intelligent systems that operate over extended time horizons and must adapt to evolving data distributions and environmental conditions. Application domains are broad: robotics operating in dynamic, partially unknown terrains; personalized recommendation systems that track shifting user preferences; and autonomous vehicles that face continuously varying traffic patterns and weather conditions [1-3].
However, conventional neural networks trained incrementally suffer from catastrophic forgetting, wherein parameters optimized for newer tasks overwrite or disrupt those that were previously tuned for older tasks. Such destructive interference results in a sharp loss of performance on earlier tasks, reducing the reliability and utility of the model over time. Without effective mitigation strategies, catastrophic forgetting severely limits the viability of long-lived, incrementally evolving models, often forcing practitioners to resort to expensive retraining from scratch.
We introduce GKD-ER (Gradient-Space Knowledge Distillation with Episodic Replay), a theoretically grounded and empirically validated framework that substantially reduces catastrophic forgetting. GKD-ER integrates three complementary techniques:
Gradient Projection (GP) [4]: By carefully identifying and removing gradient components that harm older tasks, GP ensures parameter updates for new tasks are orthogonal to previously learned knowledge, thus safeguarding the stability of older representations at the parameter level.
Knowledge Distillation (KD) [5,6]: By enforcing alignment between the current model’s outputs on old data and those from a reference (saved) version of the model, KD maintains consistent functional representations. This ensures that the functional mapping learned for previous tasks is preserved as new tasks are introduced, minimizing representational drift.
Episodic Replay (ER) [7,8]: By periodically revisiting a small memory buffer containing representative samples from past tasks, ER provides direct empirical anchors. These examples serve as stable checkpoints, continuously reminding the model of the previously encountered data distributions and reinforcing old decision boundaries.
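As a concrete illustration of the gradient-projection idea, the sketch below removes from the new-task gradient any component lying in the subspace spanned by stored old-task gradients. The function name and the use of an explicit stored gradient matrix are illustrative assumptions, not the exact GKD-ER procedure:

```python
import numpy as np

def project_orthogonal(grad, old_grads):
    """Project `grad` onto the orthogonal complement of the subspace
    spanned by the rows of `old_grads`, removing the components that
    would interfere with previously learned tasks.

    grad: (d,) new-task gradient; old_grads: (k, d) reference gradients,
    assumed linearly independent for the QR factorization below.
    """
    # Orthonormal basis for the old-task gradient subspace via QR.
    Q, _ = np.linalg.qr(old_grads.T)  # Q has shape (d, k)
    # Subtract the projection of `grad` onto that subspace.
    return grad - Q @ (Q.T @ grad)
```

The projected update is, by construction, orthogonal to every stored old-task gradient, so a small step along it leaves the first-order loss on old tasks unchanged.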
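The distillation term can similarly be sketched as a temperature-scaled KL divergence between the frozen reference model's outputs and the current model's outputs on old data, in the style of Hinton et al. [5]; the exact loss used by GKD-ER may differ:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, averaged over
    the batch; the T**2 factor compensates for the temperature scaling
    of the gradients."""
    p = softmax(teacher_logits, T)  # frozen reference (teacher)
    q = softmax(student_logits, T)  # current model (student)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl) * T * T)
```

The loss is zero when the current model reproduces the reference outputs exactly and grows as its predictions on old data drift away.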
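Finally, an episodic memory buffer is commonly maintained with reservoir sampling, which keeps each example seen so far with equal probability under a fixed memory budget; this is one standard implementation, assumed here for illustration:

```python
import random

class EpisodicBuffer:
    """Fixed-size memory of past examples maintained by reservoir
    sampling, so every example seen so far has equal probability of
    being retained regardless of stream length."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Replace a stored example with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, batch_size):
        """Draw a replay mini-batch (without replacement)."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(self.buffer, k)
```

During training on a new task, mini-batches sampled from the buffer are interleaved with new-task data so that old decision boundaries are continually reinforced.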
Under standard smoothness and boundedness conditions, together with a representative-replay assumption, we provide a rigorous theoretical analysis showing that GKD-ER achieves bounded forgetting. Empirically, on well-established benchmarks such as Permuted MNIST and Split MNIST, GKD-ER outperforms strong baselines (Naive, EWC, SI, and ER alone) [9,10]: it attains higher final accuracy, exhibits significantly less forgetting, and maintains stable, well-structured class-level decision boundaries across tasks.
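The reported quantities, final average accuracy and average forgetting, are typically computed from the task-accuracy matrix as sketched below; the function name and matrix layout are illustrative, not taken from the paper:

```python
def forgetting_metrics(acc):
    """Standard continual-learning metrics from an accuracy matrix,
    where acc[i][j] is the accuracy on task j measured after training
    on task i (entries with j > i are unused except in the final row).
    Returns (final average accuracy, average forgetting)."""
    T = len(acc)
    final_acc = sum(acc[T - 1]) / T
    # Forgetting on task j: best accuracy ever achieved on it during
    # training, minus its accuracy after the final task.
    forgetting = sum(
        max(acc[i][j] for i in range(j, T - 1)) - acc[T - 1][j]
        for j in range(T - 1)
    ) / (T - 1)
    return final_acc, forgetting
```

For example, if accuracy on task 1 peaks at 0.90 during training and ends at 0.70 after the final task, that task contributes 0.20 to the forgetting average.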
By harmonizing gradient-space constraints, functional-level alignment, and empirical-level anchoring, GKD-ER establishes a robust balance between stability and plasticity. This work represents a significant step towards long-lived agents capable of integrating new knowledge continuously while preserving past expertise, an essential milestone on the path from narrow artificial intelligence to truly adaptive, lifelong learning systems.