Protein Folding

This page is about proteins, the folding that gives them their shape, the diseases that result when folding fails, and how Folding@home turns millions of donated computers into a single research instrument to study the whole process. Some of it is technical, but the goal is a clear picture of what your computer is doing and why it helps.

How a chain becomes a machine

Proteins are long chain molecules, necklaces of amino acids, and they drive nearly every biochemical reaction in your body. They are major building blocks of your bones, muscles, hair, skin, and blood vessels. The antibodies your immune system uses to recognize invaders are proteins too. To do any of that, each one has to take on a specific 3D shape.

The primary structure of a protein: a chain of amino acids. (Wikimedia Commons / National Human Genome Research Institute, public domain)

Knowing the amino acid sequence alone is not enough. The protein has to assemble itself — to fold — into that functional shape, fast. The human genome gives us the sequence; what proteins actually do depends on the fold.

Why proteins fold

Proteins fold to reach their most "comfortable" position: the lowest-energy arrangement of atoms in their environment. Several forces drive this. Hydrophobic regions hide from water in the protein's interior, hydrogen bonds form between specific atoms, and charges attract or repel. The final shape is the one in which these competing forces balance at the lowest overall energy.

The protein folding energy landscape. The unfolded protein explores many shapes near the top; as the protein folds, the funnel walls guide it toward the lowest-energy native state at the bottom. (Wikimedia Commons / Thomas Splettstoesser, CC BY-SA 3.0)

One way to picture it: imagine a beach ball bouncing down the side of a steep mountain. The ball bounces many times as it descends, eventually stopping somewhere. Throw it down again and the path varies; if you repeat the throw thousands of times, you start to see a pattern. Most of the time the ball ends up at the very bottom. Once in a while it lands in a side depression and gets stuck — that's misfolding.

It's also a bit like parallel-parking a car on a crowded street. You usually need several adjustments to get into the space. Sometimes you back out and try again. A protein does the same. Watch a hundred similar parking attempts and you start to understand which moves work and which don't.

What makes Folding@home different from many other approaches is that we don't just want the parked car — we want the whole sequence of moves. The "how did we get here?" is where the disease answers live, and where new treatments can be designed to nudge the process in a better direction.
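The beach-ball picture can be turned into a toy simulation. The sketch below is only an illustration: the one-dimensional "mountain" in `HEIGHTS`, the temperature, and the step count are made-up values, and a Metropolis random walk stands in for real molecular dynamics. It throws the ball many times and tallies where each throw ends.

```python
import math
import random

# Toy 1D "mountain": height at each position (all values are made up).
# Index 3 is a shallow side depression; the last index is the bottom.
HEIGHTS = [10, 8, 6, 5, 6, 4, 2, 0]
SIDE_DEPRESSION = 3
BOTTOM = len(HEIGHTS) - 1


def throw_ball(rng, n_steps=100, temperature=0.5):
    """Metropolis random walk starting at the top; returns the final position."""
    pos = 0
    for _ in range(n_steps):
        new = pos + rng.choice((-1, 1))
        if new < 0 or new >= len(HEIGHTS):
            continue  # proposed move leaves the mountain: stay put
        d_e = HEIGHTS[new] - HEIGHTS[pos]
        # Downhill moves always succeed; uphill "bounces" succeed only sometimes.
        if d_e <= 0 or rng.random() < math.exp(-d_e / temperature):
            pos = new
    return pos


def tally(n_throws=2000, seed=0):
    """Repeat the throw many times and count where the ball ends up."""
    rng = random.Random(seed)
    counts = {"bottom": 0, "side_depression": 0, "elsewhere": 0}
    for _ in range(n_throws):
        end = throw_ball(rng)
        if end == BOTTOM:
            counts["bottom"] += 1
        elif end == SIDE_DEPRESSION:
            counts["side_depression"] += 1
        else:
            counts["elsewhere"] += 1
    return counts


counts = tally()
print(counts)
```

No single throw tells you much, but the tally over thousands of throws shows the pattern: with these assumed settings most throws should reach the bottom and a minority should still be caught in the side depression when the walk stops.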

What happens when proteins misfold

A protein that folds incorrectly can stick to other misfolded copies and form ordered aggregates — clumps that grow into fibrils. Once aggregation gets started, more copies pile on, and the clump propagates.

In the brain, these aggregates show up as the plaques and tangles of Alzheimer's disease, the toxic protein clumps of Huntington's and Parkinson's, and the fibrils of prion diseases like bovine spongiform encephalopathy (BSE). Misfolding is also implicated in cystic fibrosis, in an inherited form of emphysema, and in many cancers.

Amyloid β plaques accumulating outside neurons, a hallmark of Alzheimer's disease. (Wikimedia Commons / National Institute on Aging, public domain)

Why simulating this is hard

Real protein folding happens on timescales from microseconds to seconds. Computers simulate molecular motion in steps of about a femtosecond (10⁻¹⁵ seconds), and each step takes significant work. Watching a single protein fold once takes billions of steps for the fastest folders and trillions or more for millisecond folders. No single computer can do that in a reasonable time.
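A quick back-of-the-envelope calculation makes the gap concrete. The sketch below assumes a 1 fs timestep and a purely hypothetical throughput of 100 ns of simulated time per day on one machine. Both numbers are illustrative parameters, not Folding@home measurements.

```python
FS_PER_SECOND = 10**15  # femtoseconds in one second
FS_PER_NS = 10**6       # femtoseconds in one nanosecond

MICROSECOND_FS = FS_PER_SECOND // 10**6  # 10**9 fs
MILLISECOND_FS = FS_PER_SECOND // 10**3  # 10**12 fs


def steps_needed(folding_time_fs, timestep_fs=1):
    """Number of integration steps needed to cover folding_time_fs of motion."""
    return folding_time_fs // timestep_fs


def days_on_one_machine(folding_time_fs, ns_per_day=100):
    """Wall-clock days at an assumed (hypothetical) ns/day throughput."""
    return folding_time_fs / FS_PER_NS / ns_per_day


for label, t_fs in [("1 microsecond", MICROSECOND_FS),
                    ("1 millisecond", MILLISECOND_FS),
                    ("1 second", FS_PER_SECOND)]:
    print(f"{label}: {steps_needed(t_fs):.0e} steps, "
          f"{days_on_one_machine(t_fs):,.0f} days on one machine")
```

Under these assumptions, one millisecond of folding is 10¹² steps and about 10,000 days (roughly 27 years) of continuous computing on a single machine.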

Before 2010, even the fastest molecular dynamics simulations could only reach nanosecond to microsecond timescales, orders of magnitude short of the millisecond-and-longer timescales on which many proteins actually fold.

FAHViewer rendering of the SARS-CoV-2 main protease being simulated atom-by-atom by a Folding@home donor computer. Every spot is an atom; every frame is a few femtoseconds of motion. (Wikimedia Commons / Lomkimarsh, CC BY-SA 4.0)

Folding@home gets around this by running many short trajectories in parallel and stitching them back together statistically. The technique that makes the stitching mathematically rigorous is the Markov state model.

Markov state models

Because folding is statistical, a single trajectory only tells you one possible story. A Markov state model (MSM) is a map of all the distinct shapes a protein explores — the states — and the rates at which it transitions between them. Given enough trajectories, the MSM describes the protein's entire conformational landscape and predicts how it folds, breathes, and sometimes misfolds.

The crucial property: MSMs let us aggregate short, independent simulations instead of needing one impossibly long one. That's exactly the kind of work a distributed network of donor computers can do efficiently.
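To make the idea concrete, here is a minimal sketch of the counting step behind an MSM, assuming the trajectories have already been reduced to sequences of integer state labels. The function names, the toy trajectories, and the simple row-normalized estimator are illustrative assumptions. This is not MSMBuilder's API, and production tools also handle clustering, lag-time selection, and statistical error.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count observed i -> j transitions, pooled over all trajectories."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # zip stops at the shorter sequence, so this pairs frame t with t + lag.
        for a, b in zip(traj, traj[lag:]):
            counts[a][b] += 1
    return counts


def row_normalize(counts):
    """Turn counts into transition probabilities (rows sum to 1)."""
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing observations keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def stationary_distribution(matrix, n_iter=200):
    """Power iteration: repeatedly apply pi <- pi * T."""
    n = len(matrix)
    pi = [1.0] + [0.0] * (n - 1)  # start with everything in state 0
    for _ in range(n_iter):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


# Three short, independent toy trajectories over states 0, 1, 2.
TRAJECTORIES = [[0, 0, 1, 1], [1, 2, 2, 2], [2, 1, 0, 0]]

counts = count_transitions(TRAJECTORIES, n_states=3)
T = row_normalize(counts)
pi = stationary_distribution(T)
print(counts)
print(pi)
```

No single toy trajectory observes every transition, but pooling their counts gives a complete transition matrix. That pooling is what lets many short runs stand in for one long one.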

MSM showing 14 of 2000 macrostates for the NTL9 protein. Larger states are more populated; thicker arrows are more likely transitions. Unfolded states in red, native in green. (Voelz et al.)

MSM for the ACBP protein, illustrating some of the primary transitions. (Voelz et al.)

The Pande group's MSMBuilder — developed with Drs. Xuhui Huang and Gregory Bowman — is the open-source software that builds and analyzes these models. Released in 2009, it's been used in dozens of published studies of protein folding, function, and ligand binding. The work earned Bowman the American Chemical Society's Thomas Kuhn Paradigm Shift Award in 2010.

Adaptive sampling

Once we know that MSMs work, a natural question follows: which simulations are worth running next? You don't want to waste compute cycles re-exploring well-known parts of the landscape; you want to push into the parts you don't yet understand.

Picture exploring a maze with a GPS. The blind strategy is to wander aimlessly until you're tired, then look at where you've been. The smart strategy is to watch the GPS as you go: build the map incrementally, notice when you're stuck in one corner, deliberately steer toward unexplored regions.

Adaptive sampling is the smart strategy applied to MSMs. Instead of running a giant batch of simulations and analyzing afterward, the model gets built on the fly, and the running model tells the next round of simulations where to focus. The result: dramatically more useful work per donated CPU hour.
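One simple way to turn "steer toward unexplored regions" into a rule is a least-counts heuristic: start the next round of simulations from the states the running model has seen least often. The sketch below assumes that heuristic, with hypothetical state names and counts. It is one option among several, and not necessarily the selection rule any given Folding@home project uses.

```python
def choose_seeds(visit_counts, n_seeds):
    """Pick the n_seeds least-visited states as starting points for new runs."""
    # Ties are broken by state name so the result is deterministic.
    ranked = sorted(visit_counts, key=lambda state: (visit_counts[state], state))
    return ranked[:n_seeds]


def update_counts(visit_counts, new_trajectories):
    """Fold newly returned trajectories into the running tally."""
    updated = dict(visit_counts)
    for traj in new_trajectories:
        for state in traj:
            updated[state] = updated.get(state, 0) + 1
    return updated


# Hypothetical tally after a first round of simulations.
visit_counts = {"unfolded": 120, "intermediate": 45, "near_native": 3, "misfolded": 7}
seeds = choose_seeds(visit_counts, n_seeds=2)

# Pretend the second round, started from those seeds, returned these frames.
round_two = [["near_native", "near_native", "native"], ["misfolded", "intermediate"]]
visit_counts = update_counts(visit_counts, round_two)
next_seeds = choose_seeds(visit_counts, n_seeds=2)

print(seeds)
print(next_seeds)
```

Each round shifts effort away from the heavily sampled unfolded state and toward whatever is newest and rarest, which here includes a state ("native") that was only discovered in round two.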

Project, Run, Clone, Generation — what your computer does

When a researcher proposes a target, the simulation gets organized into a hierarchy you'll see in your Folding@home client.

Project — the protein under study.
Run — a simulation started from one specific initial conformation.
Clone — within a run, multiple trajectories from the same conformation but with different initial atomic velocities, so they explore different paths.
Generation — a clone's trajectory is too long for one computer to finish, so it's split into pieces handed sequentially across donor machines. Generation 0 finishes; generation 1 picks up where it left off; and so on.

That four-number tag — Project, Run, Clone, Generation, or "PRCG" — uniquely identifies the slice of work your computer is contributing. Many clones and many runs are processed in parallel; only the generations are serial. That's why work units have deadlines, and why your computer's speed matters: a late return holds up the next generation.
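The hierarchy maps naturally onto a small data structure. The sketch below is a hypothetical model of the bookkeeping, not the actual Folding@home server code: the `WorkUnit` class, the `next_assignable` helper, and the project number are all made up for illustration. It shows why different clones can be handed out at once while the generations within one clone must go in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    """One slice of work, identified by its PRCG tag."""
    project: int
    run: int
    clone: int
    gen: int


def next_assignable(trajectories, completed):
    """For each (project, run, clone), return the one generation ready to issue.

    Generation g + 1 needs the end state of generation g, so only the first
    generation that has not been returned yet can be handed to a donor.
    """
    ready = []
    for project, run, clone in trajectories:
        gen = 0
        while WorkUnit(project, run, clone, gen) in completed:
            gen += 1
        ready.append(WorkUnit(project, run, clone, gen))
    return ready


# Hypothetical project 12345 with one run and two clones.
trajectories = [(12345, 0, 0), (12345, 0, 1)]
completed = {WorkUnit(12345, 0, 0, 0), WorkUnit(12345, 0, 0, 1)}

ready = next_assignable(trajectories, completed)
for wu in ready:
    print(wu)
```

Clone 0 has returned generations 0 and 1, so generation 2 is ready. Clone 1 has returned nothing, so it is still waiting on generation 0, and a slow donor there delays every later generation of that clone without affecting clone 0.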

Does this approach actually work?

In one comparison, we put MSMs head-to-head against the Anton supercomputer — a purpose-built molecular dynamics machine. Anton produced long, continuous trajectories of folding events. We chopped the same protein into short trajectories and built an MSM from them. Our approach reproduced Anton's findings — and revealed a folding pathway Anton's traditional analysis had missed.

NTL9 (PDB structure 1cqu, shown as a cartoon ribbon), a 39-residue protein that folds on the millisecond timescale. (Wikimedia Commons / EBI MSD, public domain)

In a separate landmark result, Folding@home simulated the slow-folding NTL9 protein out to 1.52 milliseconds — a thousand times longer than what was previously achievable, and the first all-atom simulation to reach the timescale where this protein actually folds in nature.

The results

First exaflop computer in history. During the COVID-19 pandemic in March and April 2020, the Folding@home donor network passed 2.4 exaflops — the first computing system of any kind to cross the exaflop threshold, briefly faster than the world's top supercomputers combined.

Over 200 peer-reviewed papers across protein folding, cancer, neurological disease, viral targets, and computational methodology. Results have appeared in Cell, Nature, JAMA, PNAS, JACS, and the major specialty journals.

Open-source methods released to the wider field. MSMBuilder (2009) and Copernicus (2011) are both open source and used by labs that have no connection to Folding@home — the techniques developed here for distributed folding are now general infrastructure for molecular dynamics on clusters and supercomputers.

Where Folding@home came from

Vijay Pande, who founded Folding@home at Stanford in 2000. (Wikimedia Commons / Adriana Klas, CC BY 4.0)

Greg Bowman, current director of Folding@home.

Folding@home was launched by Vijay Pande at Stanford University on October 1, 2000. The original idea — that statistical aggregation of thousands of short donor-computer simulations could outperform single long runs on dedicated hardware — turned out to be exactly right, and grew into one of the world's largest computing efforts.

After fifteen years leading the project, Pande moved into venture capital, and Greg Bowman, his former Ph.D. student, took over. Bowman moved the lab to Washington University in St. Louis, and then to the University of Pennsylvania, where he leads Folding@home today.

This explainer is adapted from earlier writing by TJ Lane, Gregory Bowman, Robert McGibbon, Christian Schwantes, Vijay Pande, and Bruce Borden.