Distribution shift

Distribution shift is the phenomenon where the data distribution at test time differs from the distribution at training time, which can degrade a machine learning system's performance, sometimes severely and without any visible warning. It is not an edge case but the default condition of real-world deployment: a medical diagnostic model trained on one hospital's patients may fail on another's; a self-driving car trained in sunny California encounters snow in Michigan; a language model trained on 2023 data faces 2026 concepts. Distribution shift exposes a fundamental limitation of the i.i.d. (independent and identically distributed) assumption that underlies most statistical learning theory, whose guarantees hold only when training and test data are drawn from the same distribution. The shift can take many forms: covariate shift (the feature distribution P(x) changes while P(y|x) stays fixed), label shift (the class proportions P(y) change while P(x|y) stays fixed), concept drift (the mapping from features to labels, P(y|x), itself changes), and adversarial perturbations (deliberately constructed shifts). Addressing distribution shift requires methods that learn invariant representations, detect shifts in real time, or adapt models continuously. These remain open problems at the frontier of robust machine learning and domain adaptation.
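As a rough illustration of what detecting a shift can look like in practice, the sketch below compares a training sample of one numeric feature against a deployment sample. It uses a two-sample Kolmogorov-Smirnov statistic with a permutation test. This is only one of many possible detectors, not a standard prescribed by the text above. The function names, the sample sizes, and the 0.05 threshold are illustrative assumptions, and the data are synthetic. A test like this looks only at the marginal distribution of a single feature, so it can flag covariate shift but says nothing about concept drift, where P(y|x) changes and P(x) may not.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def permutation_p_value(train, live, n_permutations=500, seed=0):
    """Estimate how often a KS statistic at least as large as the observed
    one arises when the two samples are randomly relabelled."""
    rng = random.Random(seed)
    observed = ks_statistic(train, live)
    pooled = list(train) + list(live)
    n = len(train)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: the training feature is N(0, 1), the deployment
    # feature has a shifted mean (a hypothetical covariate shift).
    train = [rng.gauss(0.0, 1.0) for _ in range(200)]
    live = [rng.gauss(1.5, 1.0) for _ in range(200)]
    p = permutation_p_value(train, live)
    # The 0.05 threshold is an illustrative choice, not a recommendation.
    print("shift flagged" if p < 0.05 else "no shift flagged", f"(p = {p:.3f})")
```

In a deployed system a check of this kind would typically run per feature over a sliding window of recent inputs, with some correction for testing many features at once; those choices are outside the scope of this sketch.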