Formal language
A formal language is a set of strings over a finite alphabet, defined by precise syntactic rules that distinguish well-formed expressions from ill-formed ones. Unlike natural languages — whose grammars are extracted by linguists from messy usage — formal languages are designed: their rules are explicit, their semantics is compositional, and their properties are provable. A formal language is not a description of how people communicate; it is a mathematical object whose structure can be analyzed, restricted, and mapped onto other structures with guaranteed precision.
The study of formal languages sits at the intersection of logic, computer science, linguistics, and mathematics — four fields that treat the same object (a system of structured symbols) with different questions. Logic asks what can be expressed and what can be proved. Computer science asks what can be computed, parsed, and verified automatically. Linguistics asks whether the human language faculty is itself a formal system, and if so, which one. Mathematics asks what structural properties these languages possess, and how those properties classify them.
Syntactic Foundations: From Alphabets to Grammars
Every formal language begins with an alphabet — a finite set of symbols — and a grammar that specifies which strings over that alphabet are well-formed. The canonical classification is the Chomsky hierarchy, which partitions formal languages into four classes of increasing expressive power: regular languages (recognized by finite automata), context-free languages (recognized by pushdown automata), context-sensitive languages, and recursively enumerable languages (recognized by Turing machines). This hierarchy is not merely taxonomic. It is a map of what computational resources are required to recognize, parse, and decide properties of a language — and therefore a map of what cognitive or mechanical resources are required to process it.
Chomsky's original motivation for the hierarchy was linguistic, not computational. In Syntactic Structures (1957), he argued that the syntax of natural language cannot be described by finite-state (regular) grammars, because natural languages permit center-embedding — constructions where a clause is nested inside another clause of the same type — to arbitrary depth. Finite automata cannot count unbounded nesting; pushdown automata can. This single argument launched generative grammar and transformed linguistics from a taxonomic discipline into a formal one. Whether Chomsky's claim about the formal power of natural language has survived empirical scrutiny is debated, but the framework he introduced — treat language as a formal system, then ask which formal system — remains the default methodology of theoretical linguistics.
Semantics and the Interface Problem
Syntax without semantics is mere symbol-shuffling. A formal language acquires meaning through an interpretation: a mapping from syntactic objects to a domain of discourse, governed by model-theoretic or proof-theoretic principles. In model-theoretic semantics, a sentence is true in a model if the objects and relations assigned to its symbols stand in the configuration the sentence describes. In proof-theoretic semantics, meaning is constituted by what can be derived from the sentence within a system of inference rules.
The interface between syntax and semantics is where formal languages reveal their deepest structural properties. The Curry-Howard correspondence shows that proofs in intuitionistic logic correspond to programs in typed lambda calculus, and propositions correspond to types. This is not a metaphor. It is an isomorphism: the same formal language can be read as a system of logical derivations or as a programming language, with no loss of information. The discovery that logic and computation are two faces of the same formal structure is among the most consequential results of twentieth-century foundational research.
Formal Languages as Systems
A formal language is not merely a set of strings. It is a system: a space of possible expressions, a grammar that constrains that space, a semantics that maps expressions to meanings, and a community of interpreters (human or machine) that stabilizes the mapping through use. From this systems perspective, formal languages are simplified models of more complex semiotic systems — including natural languages, programming ecosystems, and biological signaling systems.
The information-theoretic perspective adds another layer. A formal language can be viewed as a code: a constrained subset of all possible strings that carries information precisely because the constraints are known to both sender and receiver. The entropy of a formal language — the average information content per symbol, given the grammar — is lower than the entropy of the raw alphabet, and the difference is the redundancy that makes error detection and correction possible. Formal verification exploits this redundancy: by restricting the language in which safety-critical systems are specified, one restricts the space of possible bugs to those expressible within the language — a dramatic reduction in the search space for errors.
The obsession with the Chomsky hierarchy has obscured a more interesting question: not which class a language belongs to, but how formal languages evolve under selective pressure. Programming languages are not designed in a vacuum; they are selected by programmer communities for expressiveness, verifiability, and runtime efficiency. Natural languages, if they are formal systems at all, are formal systems shaped by millennia of conversation-level selection. And biological codes — the genetic code, signaling cascades, immune recognition — are formal languages selected by evolutionary fitness. The unifying framework is not the hierarchy but the dynamics: formal languages as evolving constraint systems, subject to the same statistical mechanics of selection that governs any complex adaptive system. The Chomsky hierarchy classifies; what we need is a theory that explains why certain formal structures persist and others disappear.