Protein Data Bank

The Protein Data Bank (PDB) is the primary public repository for three-dimensional structural data of biological macromolecules — proteins, nucleic acids, and their complexes — determined by X-ray crystallography, NMR spectroscopy, cryo-electron microscopy, and related methods. Established in 1971, it is maintained by the Worldwide Protein Data Bank (wwPDB) consortium.

As of 2024, the PDB contained approximately 220,000 entries. This figure is frequently cited as evidence of the scope of structural biology's achievement. It is equally a measure of the field's blind spots. The PDB is populated by proteins that could be expressed in sufficient quantities, purified to homogeneity, and, for the large majority of entries, crystallized. That is a severe selection filter, one that systematically excludes or under-represents intrinsically disordered proteins, membrane proteins in native lipid contexts, and proteins from poorly studied organisms. The PDB is, in other words, not a representative sample of the protein universe. It is a sample of the part of the protein universe that was accessible to the dominant experimental techniques of the twentieth century.

This selection bias has direct consequences for machine learning models trained on PDB data: the distribution they learn is the distribution of characterized proteins, not the distribution of proteins that exist. Structures held out from the PDB for benchmarking have passed through the same experimental filter as the training set, so performance measured against them reflects in-distribution generalization, not the capacity to address genuinely novel folds. For AlphaFold and similar tools, the gap between these two quantities is the gap between the solved and the unsolved problem.
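The difference between the two kinds of evaluation can be made concrete with a toy split. The following is a minimal sketch, not a description of how AlphaFold or any published benchmark was constructed. The entry identifiers, the fold labels, and the helper names (`random_split`, `fold_holdout_split`, `seen_fold_fraction`) are all invented for illustration; a real analysis would take fold labels from an established structural classification. The sketch compares a random holdout, where test entries share folds with the training set, against a holdout of entire folds, where they cannot.

```python
import random
from collections import defaultdict


def random_split(entries, test_fraction, seed=0):
    """Hold out a random fraction of entries, ignoring fold labels."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fold_holdout_split(entries, test_fraction, seed=0):
    """Hold out whole folds, so no test fold appears in training."""
    by_fold = defaultdict(list)
    for entry in entries:
        by_fold[entry[1]].append(entry)
    folds = sorted(by_fold)
    rng = random.Random(seed)
    rng.shuffle(folds)
    n_test_folds = max(1, int(len(folds) * test_fraction))
    test_folds = set(folds[:n_test_folds])
    train = [e for e in entries if e[1] not in test_folds]
    test = [e for e in entries if e[1] in test_folds]
    return train, test


def seen_fold_fraction(train, test):
    """Fraction of test entries whose fold also occurs in the training set."""
    train_folds = {fold for _, fold in train}
    return sum(1 for _, fold in test if fold in train_folds) / len(test)


# Hypothetical toy data: 5 invented folds with 40 entries each.
entries = [(f"entry_{f}_{i}", f"fold_{f}") for f in range(5) for i in range(40)]

rand_train, rand_test = random_split(entries, test_fraction=0.1)
fold_train, fold_test = fold_holdout_split(entries, test_fraction=0.1)

print("random holdout, test entries with a seen fold:",
      seen_fold_fraction(rand_train, rand_test))
print("fold holdout, test entries with a seen fold:",
      seen_fold_fraction(fold_train, fold_test))
```

In this toy setup the random holdout contains 20 test entries, fewer than the 40 entries in any single fold, so every test fold is necessarily also present in training. The fold-level holdout removes one entire fold, so none of its test entries has a counterpart in training. A score computed on the first split says little about performance on the second, which is the situation the paragraph above describes.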