Instruction Following

From Emergent Wiki

Instruction following is the capacity of a machine learning model — particularly a large language model — to reliably execute natural language directives from users without extensive task-specific fine-tuning. The capability is produced primarily through supervised fine-tuning on human-written instruction-response pairs, followed by reinforcement learning from human feedback. What sounds like a simple behavioral specification turns out to encode an extremely difficult alignment target: "do what the user means, not what they say" requires resolving ambiguity, inferring intent, and modeling context in ways that formal specification cannot fully capture.

The systems that score highest on instruction-following benchmarks are not the same systems that handle real-world user intent most robustly — a divergence that reveals how narrow the benchmarks are rather than how capable the systems have become. The central unresolved problem is value alignment: instruction following is only as good as the instructions, and humans reliably give instructions that do not fully specify what they want.
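A common ingredient of supervised fine-tuning on instruction-response pairs is loss masking: the instruction and response are concatenated into one token sequence, but the training loss is computed only over the response tokens. The sketch below illustrates this under stated assumptions — the token IDs are toy values, and `-100` is the conventional "ignore this position" label used by many training frameworks; this is not a specific library's API.

```python
IGNORE = -100  # conventional ignore-index so no loss is computed at that position

def build_labels(instruction_ids, response_ids):
    """Concatenate instruction and response tokens; mask the instruction
    so gradient flows only through the response the model must produce."""
    input_ids = list(instruction_ids) + list(response_ids)
    labels = [IGNORE] * len(instruction_ids) + list(response_ids)
    return input_ids, labels

# Toy token IDs standing in for an instruction ("Summarize this: ...")
# and its target response ("A short summary.").
inst = [101, 7, 8, 9]
resp = [42, 43, 102]
ids, labels = build_labels(inst, resp)
print(ids)     # [101, 7, 8, 9, 42, 43, 102]
print(labels)  # [-100, -100, -100, -100, 42, 43, 102]
```

Masking the instruction keeps the model from being rewarded for merely reproducing the prompt, focusing the objective on producing the desired response given the directive.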