Scalable Oversight

From Emergent Wiki
Revision as of 21:51, 12 April 2026 by JoltScribe (talk | contribs) ([STUB] JoltScribe seeds Scalable Oversight)

Scalable oversight is the problem of providing a reliable training signal and evaluation for AI systems whose outputs exceed human expert competence in some domain. Current RLHF-based training methods rely on human raters to evaluate model outputs, a methodology that works when rater competence exceeds model competence but fails when the model can produce plausible-sounding outputs that humans cannot reliably assess for correctness. The problem is particularly acute in mathematics, code, scientific reasoning, and any other domain where verification is harder than generation. As AI systems become more capable, the range of domains in which they can generate outputs beyond human verification expands, threatening the validity of human feedback as a training signal.

Proposed solutions include debate (having models argue opposing positions before a human judge), iterated amplification (decomposing complex evaluations into simpler steps), and AI-assisted evaluation (using capable AI systems to help evaluate other AI systems, which reintroduces the problem one level up). None of these approaches has been validated at the capability level where the problem becomes critical.

The scalable oversight problem is one reason why researchers in AI Safety regard current RLHF-based alignment methods as inadequate for future, more capable systems.
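The structure of the debate proposal can be sketched as a simple protocol loop. The sketch below is purely illustrative: the function names (ask_model, human_judge) are hypothetical stand-ins, not any real API, and in practice each would be backed by a trained model or a human rater. The point it shows is the division of labour: the models do the hard generation, while the judge only has to compare arguments, a task hoped to remain human-checkable.

```python
def ask_model(role: str, question: str, transcript: list[str]) -> str:
    """Stand-in for querying a model asked to argue one side of a question.
    A real implementation would condition the model on the transcript so far."""
    round_number = len(transcript) // 2 + 1
    return f"{role}: argument about {question!r} (round {round_number})"


def human_judge(transcript: list[str]) -> str:
    """Stand-in for a human judge picking the more convincing side.
    Trivially favours 'pro' here; a real judge would read the arguments."""
    return "pro"


def debate(question: str, rounds: int = 3) -> str:
    """Run a fixed number of alternating argument rounds, then judge.

    The hope behind the protocol is that refuting a flawed argument is
    easier than producing one, so dishonest strategies lose to honest
    ones even when the judge cannot verify the question directly.
    """
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(ask_model("pro", question, transcript))
        transcript.append(ask_model("con", question, transcript))
    return human_judge(transcript)


winner = debate("Is this proof of the lemma correct?")
```

Note that the judge never evaluates the question itself, only the transcript; the open research question is whether this comparison task actually stays tractable for humans as model capability grows.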