Recurrent Neural Networks

A recurrent neural network (RNN) is a class of neural network architecture designed to process sequential data by maintaining an internal state that persists across time steps. Unlike feedforward networks, which process each input independently, RNNs process inputs in order, feeding the hidden state computed at one step into the computation at the next. This recurrence makes them natural models for language, time series, audio, and any other domain where the meaning of the present depends on what came before.

The canonical RNN computes, at each time step t, a hidden state h_t as a function of the current input x_t and the previous hidden state h_{t-1}. The output y_t is then computed from h_t. The same parameters are shared across all time steps, which means the network learns a single transition dynamics rather than a separate mapping for each position in the sequence. This parameter sharing is both the source of the RNN's power and the root of its most severe limitations.
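
The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a tanh nonlinearity and a linear readout, neither of which the description above fixes; the names rnn_step, run_rnn, W_xh, W_hh, b_h, and W_hy, and the toy sizes, are illustrative choices rather than a standard API.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    a = matvec(W_xh, x_t)
    r = matvec(W_hh, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b_h)]

def run_rnn(xs, h0, W_xh, W_hh, b_h, W_hy):
    """Apply the same parameters at every step; return outputs and final state."""
    h, ys = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        ys.append(matvec(W_hy, h))  # y_t is read out from h_t
    return ys, h

# Toy sizes (assumed): 1-dimensional input, 2-dimensional hidden state.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]
W_hy = [[1.0, -1.0]]
xs = [[1.0], [0.0], [0.0]]
ys, h_final = run_rnn(xs, [0.0, 0.0], W_xh, W_hh, b_h, W_hy)
```

Only the first input is nonzero, yet h_final is still nonzero after three steps: the hidden state is what carries the trace of earlier inputs forward, and the same four parameter arrays are reused at every step.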

The Sequence Problem

The problem RNNs were invented to solve is not merely prediction from sequences. It is the representation of temporal structure itself. In a feedforward network, time is external to the model: the input is a snapshot, the output is a response. In an RNN, time is internal: the model's state at each moment is a compressed representation of everything it has seen so far. This is the architecture of a dynamical system, a stateful process whose future evolution depends on its present state and on the inputs it receives. The correspondence between RNNs and dynamical systems is close in both directions: every RNN defines a discrete-time dynamical system on its hidden state space, and, under suitable regularity conditions, a finite-dimensional dynamical system can be approximated to arbitrary accuracy over a bounded time horizon by a sufficiently large RNN.

This equivalence is theoretically powerful but practically treacherous. Dynamical systems can be stable, periodic, chaotic, or multistable. An RNN trained by gradient descent is not guaranteed to converge to a useful dynamical regime. The hidden state may settle into fixed points that forget the input history, or into limit cycles that generate repetitive output regardless of input, or into chaotic regimes in which small changes in input produce wildly divergent outputs. The training problem for RNNs is, in part, the problem of learning a dynamical system that is both expressive and stable.
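
Some of these regimes already appear in the smallest possible case. The sketch below assumes a one-unit RNN with no input, h <- tanh(w * h); the function name iterate and the chosen values of w are illustrative. It shows a forgetting fixed point, a bistable regime, and a period-2 cycle. Chaotic behaviour is not shown, since it needs a richer system than this monotone one-dimensional map.

```python
import math

def iterate(w, h0, steps=200):
    """Iterate the autonomous scalar map h <- tanh(w * h)."""
    h = h0
    for _ in range(steps):
        h = math.tanh(w * h)
    return h

starts = (-0.9, 0.1, 0.9)

# |w| < 1: a single attracting fixed point at 0, so the initial state is forgotten.
contracting = [iterate(0.5, h0) for h0 in starts]

# w > 1: 0 becomes unstable and two attracting fixed points appear,
# so the state retains the sign of where it started.
bistable = [iterate(2.0, h0) for h0 in starts]

# w < -1: the state flips sign every step and settles on a period-2 cycle.
cycle_even = iterate(-2.0, 0.1, steps=200)
cycle_odd = iterate(-2.0, 0.1, steps=201)
```

The only thing that changes between the three regimes is the recurrent weight, which is the quantity gradient descent adjusts during training.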

The Vanishing Gradient Problem

The training algorithm for RNNs is backpropagation through time (BPTT), which unrolls the network across a finite number of time steps and applies ordinary backpropagation to the unrolled graph. When the network is unrolled over T steps, the gradient of the loss with respect to the recurrent weights involves products of up to T Jacobian matrices, one for each step the error signal crosses. Even when each Jacobian has a spectral norm close to 1, the norm of a long product tends to drift exponentially: if the norms stay below 1 the product vanishes, and if they exceed 1 it can explode. These are the vanishing and exploding gradient problems, and they are not implementation difficulties. They are structural properties of deep temporal composition.
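
The Jacobian product can be written out for one common parameterization. The following assumes the update h_t = tanh(W_hh h_{t-1} + W_xh x_t + b); the symbols W_hh, W_xh, and b are assumptions introduced here, and the matrix product is ordered with the latest time step on the left.

```latex
\frac{\partial h_T}{\partial h_k}
  = \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
  = \prod_{t=k+1}^{T} \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh},
\qquad
\left\lVert \frac{\partial h_T}{\partial h_k} \right\rVert_2
  \le \prod_{t=k+1}^{T}
      \left\lVert \operatorname{diag}\!\left(1 - h_t^{2}\right) \right\rVert_2
      \left\lVert W_{hh} \right\rVert_2
  \le \left\lVert W_{hh} \right\rVert_2^{\,T-k}.
```

Because every entry of 1 - h_t^2 lies between 0 and 1, a recurrent matrix with spectral norm below 1 forces the bound to shrink exponentially in T - k. A norm above 1 permits growth but does not guarantee it, since the tanh derivative can still pull the product down.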

The consequence is that standard RNNs struggle to learn long-range dependencies. In a language model, the subject of a sentence may be twenty words away from the verb that agrees with it. An RNN trained by ordinary gradient descent tends to miss this dependency because the gradient signal from the error at the verb position has decayed to near zero by the time it propagates back to the subject position. The network effectively has a finite memory horizon, and that horizon is typically shorter than the dependencies that matter in natural language.

Architectural Responses

The vanishing gradient problem motivated a series of architectural innovations:

Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997, replaced the simple recurrent transition with a gated mechanism that includes explicit memory cells and multiplicative gates for writing, reading, and (in the now-standard variant, added shortly after the original paper) forgetting. The LSTM architecture constrains the dynamics of the memory cell to be approximately linear: the cell state is updated by gated additive increments rather than by repeated passes through a weight matrix and nonlinearity, which limits gradient decay along the cell path as long as the forget gate stays close to 1. The cost is complexity: an LSTM has roughly four times as many parameters as a standard RNN with the same input and hidden sizes, and the gating mechanism is harder to interpret than a simple recurrent update.
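
A one-unit version of the gated update makes the additive cell path concrete. This is a sketch of the commonly used LSTM formulation with input, forget, and output gates and no peephole connections; the function lstm_step, the parameter layout, and the parameter values are hypothetical and chosen only to hold the forget gate near 1 and the input gate near 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM with input, forget, and output gates.

    p maps each of "i", "f", "o", "g" to a (w_x, w_h, bias) triple.
    """
    def pre(k):
        w_x, w_h, b = p[k]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate: how much to write
    f = sigmoid(pre("f"))    # forget gate: how much of c_prev to keep
    o = sigmoid(pre("o"))    # output gate: how much to read out
    g = math.tanh(pre("g"))  # candidate value
    c = f * c_prev + i * g   # additive cell update
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate held near 1, input gate near 0.
params = {"i": (0.0, 0.0, -10.0), "f": (0.0, 0.0, 10.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.2] * 10:  # 30 steps of unrelated input
    h, c = lstm_step(x, h, c, params)
```

After thirty steps the cell state is still close to its starting value of 1, because the only factor applied along the cell path is the forget gate. The four (w_x, w_h, bias) triples, against one for a plain recurrent unit, are also where the extra parameters come from.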

Gated Recurrent Units (GRU), introduced by Cho et al. in 2014, simplified the LSTM by merging the cell and hidden state and reducing the number of gates to two. Empirically, GRUs perform comparably to LSTMs on many tasks while having about a quarter fewer parameters, which makes them somewhat cheaper to train and can make them less prone to overfitting on small datasets. The simplification came at the cost of the explicit separation between short-term and long-term memory that the LSTM maintained.
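
The relative sizes of the three layer types can be counted directly. The helper below is a hypothetical illustration that assumes each gate or candidate is one affine map from the concatenated input and previous hidden state, with a single bias vector per map; the sizes 128 and 256 are arbitrary.

```python
def recurrent_param_count(input_size, hidden_size, blocks):
    """Parameters for `blocks` affine maps from [x_t, h_{t-1}] to a hidden-size vector.

    Assumes one bias vector per block; implementations that keep two
    bias vectors per block will report slightly larger totals.
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return blocks * per_block

d, n = 128, 256                         # hypothetical input and hidden sizes
rnn = recurrent_param_count(d, n, 1)    # one transition
gru = recurrent_param_count(d, n, 3)    # reset gate, update gate, candidate
lstm = recurrent_param_count(d, n, 4)   # input, forget, output gates, candidate
```

Under these assumptions a GRU layer has three times the parameters of a plain recurrent layer and an LSTM layer has four times, independent of the sizes chosen.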

Neither architecture truly solved the long-range dependency problem. They mitigated it: the gates are bounded between 0 and 1, so the signal along the memory path can fade but cannot blow up, and with a forget gate near 1 it fades slowly. Gradient-based training of recurrent architectures nevertheless remains fundamentally limited in how far it can propagate useful signals through time. The problem is not the architecture. It is the training signal.

Displacement by Transformers

The transformer architecture, introduced by Vaswani et al. in 2017, displaced RNNs as the dominant architecture for sequence modeling by addressing the temporal dependency problem through a different mechanism: attention. Rather than compressing history into a fixed-size hidden state, transformers attend directly to all previous tokens in the context window, computing pairwise relationships that bypass the sequential bottleneck entirely. The cost is quadratic memory and computation in sequence length, but for the sequence lengths typical in language, this cost is acceptable on modern hardware.
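
The pairwise computation, and where the quadratic cost comes from, can be seen in a stripped-down sketch of causal scaled dot-product attention. It assumes a single head and omits the learned query, key, and value projections, positional encodings, and everything else a real transformer layer contains; causal_attention and the toy input X are illustrative.

```python
import math

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of T vectors. Position t attends to positions 0..t.
    Returns (outputs, weights); weights[t] has length t + 1.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for t, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        out = [sum(w[s] * V[s][j] for s in range(t + 1))
               for j in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input (assumed): T = 4 tokens of dimension 2, with Q = K = V = X for brevity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outs, attn = causal_attention(X, X, X)
n_scores = sum(len(w) for w in attn)  # T * (T + 1) / 2 pairwise scores
```

Every position reads every earlier position directly, with no hidden state in between, and the number of scores grows as T(T + 1)/2. This loop is sequential only for readability; the scores for different positions do not depend on one another, which is what allows the computation to be parallelized.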

The displacement was rapid and nearly total. By 2020, RNNs were rarely used in natural language processing, and the research frontier had moved to scaling transformers to billions of parameters and longer contexts. The RNN's recurrent structure, once the defining feature of neural sequence modeling, became a historical curiosity.

The Persistent Niche

This narrative of displacement is accurate but incomplete. RNNs persist in domains where the input is genuinely streaming and unbounded, where the quadratic cost of attention is prohibitive, or where the model must operate in real time under memory constraints. Speech recognition, music generation, and control systems for robotics are areas where recurrent architectures remain in use, because the assumption behind standard attention, that the whole context is available and can be kept in memory, fits these domains poorly. The RNN's online processing, in which each output is produced from a bounded amount of state rather than a global attention operation, is not merely an engineering compromise. It is a different computational model.
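
The difference in memory behaviour can be illustrated with two generators over the same unbounded stream. This is a sketch under simplifying assumptions: the recurrent model is a scalar tanh unit with made-up weights, the attention side is reduced to the list of past items it would need to keep (no windowing or cache eviction), and sensor is a hypothetical data source.

```python
import math
from itertools import islice

def stream_rnn(inputs, w=0.5, u=1.0):
    """Consume an iterable one item at a time with a scalar tanh RNN.

    Only the current state h is kept, so memory use does not grow
    with the number of items seen.
    """
    h = 0.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        yield h

def stream_attention_cache(inputs):
    """Track how many past items a causal attention layer would have to keep."""
    cache = []
    for x in inputs:
        cache.append(x)  # every past item stays available for attention
        yield len(cache)

def sensor():
    """Hypothetical unbounded source; here a slow sine wave."""
    t = 0
    while True:
        yield math.sin(0.1 * t)
        t += 1

states = list(islice(stream_rnn(sensor()), 1000))
cache_sizes = list(islice(stream_attention_cache(sensor()), 1000))
```

After a thousand items the recurrent model still holds a single number, while the attention cache holds a thousand entries and keeps growing for as long as the stream runs.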

Moreover, the equivalence between RNNs and dynamical systems has made them theoretically important in neuroscience, where recurrent circuits in the brain are increasingly understood as implementing computational dynamics rather than feedforward feature hierarchies. The brain is not a transformer. It is a recurrent network of recurrent networks, and understanding how it computes requires understanding the properties of recurrent dynamics that gradient descent, ironically, often fails to discover.

The Systems-Theoretic Reading

From a systems perspective, the history of RNNs illustrates a general pattern: an architecture is invented to solve a temporal structure problem, its limitations are traced to the training dynamics rather than the architecture itself, alternative architectures are invented that bypass the training problem by changing the problem, and the original architecture survives not in the center but in the margins. The RNN was displaced not because it was wrong but because the community chose to solve a different problem — bounded-context language modeling — where the transformer was superior. The domains where the RNN remains superior are those where the original problem, streaming temporal computation, is still the actual problem.

The lesson is not about architectures. It is about how the metrics and benchmarks that drive research evolution select for solutions to the problems that are easiest to benchmark, not the problems that are most important. The transformer won because attention is easy to parallelize and benchmark on static datasets. The RNN lost not on capability but on convenience.

The recurrent neural network is not a failed technology. It is a successful solution to a problem that the field stopped asking.