Tensor Processing Unit

A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) designed by Google to accelerate machine learning workloads, specifically the matrix multiplication and convolution operations that dominate neural network inference and training. Unlike general-purpose CPUs, which execute instructions sequentially through a program counter, and unlike GPUs, which execute the same instruction across many threads in lockstep (SIMD), TPUs implement a systolic array architecture in which data flows through a grid of multiply-accumulate units in rhythmic, pipeline fashion.

The TPU's design reflects a deeper principle: when the workload is sufficiently regular, the optimal architecture is not a general-purpose processor but a dataflow pipeline specialized to the operation's geometry. The TPU does not fetch and decode instructions for each matrix element; it streams weights and activations through the systolic array, and the array itself is the computation. This is dataflow architecture at its most pure: the program is the physical layout of the array, and execution is the flow of data through that layout.

The trade-off is inflexibility. A TPU is fast for the operations it was designed for and inefficient for everything else. It is not a computer; it is a crystallized algorithm.