Proteins (Slide 1) are linear chains of amino acids that fold into precise three-dimensional shapes to carry out a wide range of the fundamental processes of life; they are the most structurally and chemically versatile biomolecules. They include catalysts that control most biochemical processes, signalling molecules, transporters, structural materials and motors. Proteins are assembled according to information stored in genes, and the well-defined three-dimensional structures into which they fold are essential for their function. How the one-dimensional sequence of residues in a protein determines its three-dimensional structure is one of the most important current problems in biology and biophysics.

3.1 Primary Structure

The primary structure of a protein is the sequence of amino acid residues in the polypeptide chain, written from the N-terminus to the C-terminus. The building blocks from which proteins are formed are amino acids (Slide 2). Amino acids contain both amino and carboxyl groups; those from which proteins are made are alpha-amino acids in which these two functional groups are attached to the same carbon atom. The alpha carbon atom is also linked to a hydrogen atom and a fourth group R – the side chain – which distinguishes one amino acid from another. Phenylalanine is an example: its side chain is a phenyl ring. In the physiological pH range both the carboxylic acid and the amino groups are completely ionised. Molecules that carry charged groups of opposite polarity are known as zwitterions.

With the exception of glycine, whose side chain is just another hydrogen atom, the alpha carbon is a chiral centre. There are two ways of arranging these four groups around it that cannot be superimposed – one is the mirror image of the other. Amino acids are optically active – the plane of polarised light is rotated as it passes through a pure solution of one or other isomer. The two isomers are denoted L and D (from laevo- and dextro-; the labels refer to the configuration about the alpha carbon, not to the direction in which a particular amino acid rotates polarised light). If the side chain itself does not incorporate a chiral centre then these two molecules really are mirror images of each other, known as enantiomers. If they are not mirror images because of the chirality of the side chain then they are known more generally as diastereomers.
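The statement that the two arrangements cannot be superimposed can be checked numerically. The following is a minimal sketch, assuming idealised tetrahedral bond directions (the coordinates and the function names are hypothetical, chosen only for illustration). It uses the sign of the scalar triple product of three bond directions as a handedness label: a rotation leaves the sign unchanged, while a mirror reflection reverses it. The sketch makes no claim about which sign corresponds to L or D.

```python
def cross(a, b):
    """Vector (cross) product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar (dot) product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def handedness(amino, carboxyl, side_chain):
    """Return +1, -1 or 0: the sign of the scalar triple product
    of three bond directions leaving the alpha carbon."""
    volume = dot(amino, cross(carboxyl, side_chain))
    return (volume > 0) - (volume < 0)


def mirror(v):
    """Reflect a vector in the xy plane."""
    return (v[0], v[1], -v[2])


def rotate_z_90(v):
    """Rotate a vector by 90 degrees about the z axis."""
    return (-v[1], v[0], v[2])


# Hypothetical idealised tetrahedral directions for three of the four groups;
# the fourth (the hydrogen) would point along (-1, -1, 1).
GROUPS = [(1, 1, 1), (1, -1, -1), (-1, 1, -1)]

original = handedness(*GROUPS)
rotated = handedness(*[rotate_z_90(v) for v in GROUPS])
reflected = handedness(*[mirror(v) for v in GROUPS])

print(original, rotated, reflected)
```

The rotated copy keeps the same sign as the original and the reflected copy has the opposite sign, which is the numerical counterpart of the two isomers being non-superimposable mirror images.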

All amino acids that are synthesised biochemically have the L configuration, and the biochemical machinery that assembles them into proteins recognises only the L form. It is possible to synthesise a protein from D-amino acids, and it folds up to form a mirror image of the normal protein – but it cannot function as part of a living system because most biomolecular interactions are stereospecific. Nobody knows why the L-form was ‘chosen’. It may be the result of an evolutionary accident that has become locked in.

A protein is formed by linking amino acids by peptide bonds to form a polypeptide chain – a polymer. The peptide bond is formed by a reaction between the carboxyl group on one amino acid and the amino group on the next in which a molecule of water is eliminated (Slide 3). When incorporated in a polypeptide an amino acid becomes an amino acid residue, or simply a residue.

Proteins are polypeptide chains containing typically between 50 and 2000 amino acid residues. Shorter chains are called oligopeptides, or just peptides. The mean molecular mass of an amino acid residue is about 110 Da, so the molecular masses of most proteins lie between roughly 5 kDa and 200 kDa – though they can be as big as about 33,000 residues / 3600 kDa. A dalton (Da) is the same as an atomic mass unit.
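The arithmetic behind these mass estimates is simple enough to sketch. The function below is only an illustration (its name and default argument are assumptions, not part of the lecture): it multiplies the number of residues by the mean residue mass of about 110 Da quoted above, and ignores the actual amino acid composition and the terminal groups.

```python
def estimate_mass_kda(n_residues, mean_residue_mass_da=110.0):
    """Rough molecular mass in kDa from the number of residues.

    Uses a single mean residue mass, so it is only an estimate:
    the true mass depends on the amino acid composition.
    """
    return n_residues * mean_residue_mass_da / 1000.0


for n in (50, 2000, 33000):
    print(n, "residues: about", round(estimate_mass_kda(n), 1), "kDa")
```

For 50, 2000 and 33,000 residues this gives about 5.5 kDa, 220 kDa and 3630 kDa, in line with the rounded figures in the text.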

The primary structure of a protein is the sequence of residues in the polypeptide chain. At the ends there are unreacted amino and carboxyl groups – the two ends of a polypeptide are conventionally labelled N and C accordingly, and the sequence of residues in a polypeptide is always written from N to C. Note that, as with the nucleic acids, the backbone has no inversion symmetry.

There are 20 standard amino acids; proteins in all species are built from the same 20 monomers. The building blocks of genes and of proteins, and the genetic code that relates them, are universal. Biology is astonishingly uniform.

The side chains vary in size, shape, charge, hydrogen bonding capacity, hydrophobicity and chemical reactivity. The versatility of the standard amino acids accounts for the wide range of protein structures and functions.

3.1.1 Amino Acids

Slide 4 shows the first seven amino acids (in increasing order of hydrophobicity), known as aliphatic amino acids as their side chains consist mainly of non-aromatic carbon and hydrogen groups. The side chain of glycine is no more than a hydrogen atom – two of the groups bound to the alpha carbon are the same, so glycine is not chiral. Alanine has a methyl group as its side chain. Valine, leucine and isoleucine have larger hydrocarbon side chains, which make them rather hydrophobic. Hydrophobic amino acids tend to pack together rather than contact water – this effect is important in stabilising the 3D structure of water soluble proteins. Isoleucine has an additional chiral carbon – the stereoisomeric form shown in Slide 4 is the only isomer found in proteins. Methionine has a largely aliphatic side chain that includes a thioether group (thio indicates that a compound contains a sulfur atom rather than oxygen). Proline is the only standard amino acid to have a cyclic side chain, which is bonded both to the alpha carbon and to the backbone nitrogen – this ring restricts its conformational freedom and can be important in determining protein architecture.

Slide 5 shows more amino acids. Phenylalanine, tyrosine and tryptophan have aromatic side chains. Phenylalanine has a phenyl ring – it is non-polar, and hydrophobic. Tyrosine has a hydroxyl group attached to a phenyl ring, and tryptophan has two fused rings containing an NH group. Tyrosine and tryptophan have weakly polar side chains, and are less hydrophobic than phenylalanine.

Serine and threonine have aliphatic side chains with hydroxyl groups – they are polar, and hydrophilic. Threonine, like isoleucine, has an additional chiral centre. Cysteine has almost the same structure as serine, but with a sulfhydryl group – like a hydroxyl group but with a sulfur in place of the oxygen. (A sulfhydryl group is also known as a thiol group.) Pairs of sulfhydryl groups can be linked by disulfide bonds, which play an important part in stabilising the structure of some proteins.

Slide 6 shows hydrophilic and charged amino acids. Five amino acids have charged side chains – lysine, arginine and histidine are basic, and there are two acids, aspartic and glutamic acid – sometimes called aspartate and glutamate to emphasise that they are ionised at neutral pH. Asparagine and glutamine are carboxamide derivatives of aspartic and glutamic acid. Histidine is the only standard amino acid to change charge state in the physiological pH range – it can be neutral or positively charged, depending on its environment, and is often found in the active sites of enzymes where it binds and releases protons.
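Why histidine, unlike lysine, changes charge state near neutral pH can be illustrated with the Henderson–Hasselbalch relation. The sketch below is an illustration only: the pKa values (about 6.0 for the histidine side chain and about 10.5 for lysine) are typical textbook figures assumed here, not values given in the lecture, and the pKa of a residue inside a folded protein depends on its environment.

```python
def fraction_protonated(ph, pka):
    """Fraction of a titratable group that carries its proton at a given pH.

    From the Henderson-Hasselbalch relation:
    [protonated] / [deprotonated] = 10 ** (pKa - pH).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed typical side-chain pKa values (environment dependent in real proteins).
PKA_HISTIDINE = 6.0
PKA_LYSINE = 10.5

for ph in (6.0, 7.4):
    his = fraction_protonated(ph, PKA_HISTIDINE)
    lys = fraction_protonated(ph, PKA_LYSINE)
    print(f"pH {ph}: histidine {his:.3f} protonated, lysine {lys:.4f} protonated")
```

With these assumed values lysine stays almost fully protonated (positively charged) across this range, whereas histidine goes from half protonated at pH 6.0 to only a few per cent protonated at pH 7.4, so a modest shift in its local environment is enough to switch its charge.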

3.1.2 3-dimensional Protein Structures

Slide 7 shows an example of a 3D protein structure. This is a large protein – RNA polymerase II from yeast. It consists of ten separate polypeptide chains – the shortest 70 residues, 7.7 kDa and the longest 1733 residues, about 190 kDa. RNA polymerase makes a complementary RNA copy of a strand of DNA to produce a working copy of a gene. This crystal structure is of the elongation complex of the molecule – it contains the template DNA strand, which is coloured orange, and the daughter RNA strand, coloured yellow.

There are tens of thousands of known proteins, all – more or less – made up from the same 20 amino acids. You might expect them all to have more-or-less the same properties. Denatured (unfolded) proteins are similar to each other – but proteins in their native states are not. The properties of a protein are largely determined by its three-dimensional structure – unless the polypeptide chain is properly folded the protein will not function. This structure of RNA polymerase II was deduced from X-ray diffraction measurements to 0.3 nm resolution (0.1 nm resolution is possible) – to obtain such data it is necessary to have a crystal of very high quality, which implies that every molecule is folded in the same way. Slide 7 uses a wireframe representation of the complete structure, including the side chains. Carbons are green, oxygen red and nitrogen blue (hydrogens are omitted).

Slide 8 shows other ways to draw 3D protein structures. The left hand panel shows the same experimentally determined structure of RNA polymerase II, but in a simplified ‘cartoon’ representation that is much easier to comprehend. The amino acid side chains are omitted – only the trajectory of the polypeptide backbone is shown. Conventional symbols are used to identify common secondary structures, which will be described later in the lecture – coiled ribbons are alpha helices, parallel ribbons with arrows are beta pleated sheets.

The median length for a human protein is around 360 residues, close to the median for all sequenced eukaryote genomes. The median protein length for sequenced bacterial genomes is 267 residues, and 278 for E. coli. For archaea the figure is 247 residues. There is a wide range of protein lengths: insulin is a small signalling protein with 51 residues; β-galactosidase is an enzyme with 1021 residues. Anything less than about 20 or 30 residues is classified as a peptide, rather than a protein.

The right hand panel of Slide 8 shows a dimer of a very small protein, human insulin – deduced from X-ray diffraction measurements to 0.1 nm resolution. It is a dimer because that is how it crystallised – with two molecules in a unit cell – not because it exists as a dimer naturally. Each molecule consists of two chains, one of 30 residues and one of 21. It was synthesised as a single chain, but the central section was then excised and the two parts are now held together by two disulfide bonds.

3.2 Secondary and higher order structures

The secondary structure of a protein is defined as the local configuration of the backbone and includes motifs such as α helix, β sheet and various turns. Polypeptide chains are flexible, but there are important constraints that limit the number of conformations that they can adopt. Proteins have a number of regular backbone folding patterns, including helices, pleated sheets and turns.

The peptide bond (Slide 9) has significant double bond character: the peptide unit is constrained to be planar in order to maximise electron delocalization by allowing overlap of π orbitals associated with the CO and CN bonds. The peptide unit consists of the six atoms connected by the grey rectangle, starting and ending with alpha carbons, and including the carbon and nitrogen joined by the peptide bond and the oxygen and hydrogen bonded to them. There are two possible planar configurations of the peptide unit – cis and trans – related by a 180° rotation about the peptide bond. The cis configuration is destabilised by a steric clash between the two side chains, so almost all peptide bonds in proteins are in the trans configuration. An exception is a bond with proline on the carboxy-terminal side – because the incorporation of the nitrogen atom in the cyclic side chain means that the trans configuration is also hindered.

The backbone consists of planar peptide units locked in the trans configuration that are joined at the alpha carbon atoms. At each alpha carbon there are two degrees of freedom, corresponding to rotation about the bonds to the nitrogen and carbonyl carbon atoms – which are called φ and ψ respectively (Slide 10). These torsional angles are also restricted by steric clashes, including clashes involving the side chains – the Ramachandran plot (a plot of ψ against φ, also known as a Ramachandran diagram) is used to represent allowed combinations. Shaded areas are those with minimal or no steric clashes. The lower plot distinguishes further between minimal (light shading) and no (dark shading) steric clashes.
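A torsional angle such as φ or ψ is defined by four consecutive backbone atoms: it is the angle between the plane containing the first three and the plane containing the last three. The sketch below shows one common way to compute it from atomic coordinates. The function name and the test coordinates are hypothetical (they are not real protein coordinates), and the sign is intended to follow the usual convention in which a clockwise rotation of the far bond, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle_deg(p0, p1, p2, p3):
    """Torsion (dihedral) angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)            # bond from p1 back to p0
    b1 = _sub(p2, p1)            # central bond p1 -> p2
    b2 = _sub(p3, p2)            # bond from p2 on to p3
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical planar test geometries (arbitrary units).
trans = torsion_angle_deg((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, -1, 0))
cis = torsion_angle_deg((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, 1, 0))
twisted = torsion_angle_deg((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, 0, 1))
print(trans, cis, twisted)
```

Applied to the atoms C, N, Cα, C of consecutive residues the same function would give φ, and applied to N, Cα, C, N it would give ψ. The planar zigzag and U-shaped test geometries give torsion angles of magnitude 180° and 0°, which correspond to the trans and cis arrangements of the peptide unit.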

Slide 11 shows a Ramachandran plot using experimentally measured pairs of angles (φ, ψ) that describe peptide bond configurations in 12 high-resolution X-ray structures, excluding Gly and Pro residues. (Glycine and proline are special cases – proline because it is particularly hindered, glycine because its minimal side chain – just a hydrogen atom – gives it considerably more conformational freedom than the other amino acids.) Sterically allowed configurations fall into three rather small regions associated with common secondary structures. The forbidden configuration illustrated by the model on the lower right has a clash between its side chain and the C=O group in the next peptide unit.

In addition to primary and secondary structure, defined above, protein structures have two higher levels of structure. The complete pattern of folding of the full length of a polypeptide (Slide 12) is called tertiary structure. Some proteins consist of more than one polypeptide chain (for example RNA polymerase in Slides 7 and 8; separate chains are shown in different colours in Slide 8, left). The spatial arrangement of the component chains (subunits) and the nature of their interaction is called quaternary structure.

In general, highly flexible polymers do not fold into unique structures. The entropy associated with the vast number of possible configurations of a random coil greatly outweighs any decrease in enthalpy associated with a particularly favourable packing arrangement. Proteins are remarkable in that they do fold into well-defined three-dimensional structures. This is at least partly a result of the conformational constraints imposed by the rigidity of the peptide unit and the steric clashes that restrict the bond dihedral angles φ and ψ – these limit the number of accessible structures, and thus limit the entropic penalty associated with choosing any one of them.
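The size of this entropic penalty can be estimated with a deliberately crude model. The sketch below assumes that each residue independently has the same number of accessible backbone conformations, so that a chain of N residues has m^N conformations and a molar conformational entropy of R N ln m. The function names and the numbers of states (10 for a hypothetical unconstrained chain, 3 for a constrained one) are illustrative assumptions, not measured values.

```python
import math

R = 8.314  # molar gas constant, J / (mol K)


def conformational_entropy(n_residues, states_per_residue):
    """Molar entropy (J / (mol K)) of a chain whose residues each
    independently adopt one of `states_per_residue` conformations."""
    return R * n_residues * math.log(states_per_residue)


def folding_entropy_cost_kj(n_residues, states_per_residue, temperature_k=300.0):
    """T * S in kJ/mol: the entropic penalty for selecting a single
    conformation out of all those accessible to the unfolded chain."""
    return temperature_k * conformational_entropy(n_residues, states_per_residue) / 1000.0


# Illustrative comparison for a 100-residue chain at 300 K.
flexible = folding_entropy_cost_kj(100, 10)    # hypothetical unconstrained chain
constrained = folding_entropy_cost_kj(100, 3)  # hypothetical constrained chain
print(round(flexible), "kJ/mol versus", round(constrained), "kJ/mol")
```

In this toy model, cutting the number of states per residue from 10 to 3 roughly halves the penalty (from about 570 kJ/mol to about 270 kJ/mol), which illustrates how restricting φ and ψ makes a unique folded structure less costly.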

3.2.1 Alpha Helix

Polypeptide chains can fold into regular structures, which are repeated in many proteins. The most common are the alpha helix and beta pleated sheet (see below), which are generic secondary structures held together by hydrogen bonds between the backbone peptide units. There are also less-regular but often repeated turn motifs which are used to link helices or the parallel strands of a beta sheet. The alpha helix and beta sheet were predicted by Linus Pauling and Robert Corey on the basis of model building – which inspired the model-building of Watson and Crick that led to the discovery of the structure of DNA.

The alpha helix (Slide 13) is a coiled structure stabilised by intrachain hydrogen bonds. It is a right-handed helix defined by torsion angles φ = -57° and ψ = -47°. There are 3.6 residues per turn, and the pitch is 0.54 nm. The peptide N-H bond of the nth residue forms a relatively strong hydrogen bond with the peptide carbonyl (C=O) group of the (n-4)th residue. The backbone atoms that form the core of the helix are tightly packed – the atoms are in van der Waals contact across the helix. The side chains project outwards and slightly towards the N-terminus to avoid steric interference with each other and with the backbone.
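The helix parameters quoted above fix its overall geometry: 3.6 residues per turn and a pitch of 0.54 nm imply a rise of 0.15 nm and a rotation of 100° per residue. The sketch below turns these figures into a few helper functions and lists the backbone hydrogen bonds for an ideal helix. The function names are hypothetical, residues are numbered from 1 at the N-terminus, and end effects are ignored.

```python
RESIDUES_PER_TURN = 3.6
PITCH_NM = 0.54

RISE_PER_RESIDUE_NM = PITCH_NM / RESIDUES_PER_TURN   # about 0.15 nm
ROTATION_PER_RESIDUE_DEG = 360.0 / RESIDUES_PER_TURN  # about 100 degrees


def helix_turns(n_residues):
    """Number of turns in an ideal alpha helix of n_residues."""
    return n_residues / RESIDUES_PER_TURN


def helix_length_nm(n_residues):
    """Length along the helix axis of an ideal alpha helix."""
    return n_residues * RISE_PER_RESIDUE_NM


def backbone_hydrogen_bonds(n_residues):
    """Pairs (n, n - 4): the N-H of residue n donates a hydrogen bond
    to the C=O of residue n - 4."""
    return [(n, n - 4) for n in range(5, n_residues + 1)]


n = 18
print(helix_turns(n), "turns,", helix_length_nm(n), "nm")
print(backbone_hydrogen_bonds(n))
```

For an 18-residue helix this gives about 5 turns, a length of about 2.7 nm and 14 backbone hydrogen bonds, the first between the N-H of residue 5 and the C=O of residue 1.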

Hydrogen bonding plays an important part in stabilising both DNA and alpha-helices. Important differences are:

The α helix is formed from a single polypeptide chain; the double helix is formed from two antiparallel DNA strands.

The α helix is held together by interactions between atoms of the backbone, which are on the inside – the side chains are on the outside, positioned to minimise interactions between them. The double helix is held together by hydrogen bonding and stacking between bases – the negatively charged backbones are on the outside to minimise electrostatic repulsion.

The alpha helix is an important structural motif of proteins – many proteins incorporate alpha helices – but most proteins fold into much more complex and interesting structures that biological physics must understand, model, measure and manipulate. The alpha helix is a common structural element of fibrous and globular proteins. In globular proteins an alpha helical domain is typically about 12 residues (just over 3 turns, about 1.8 nm) in length. Two or more alpha helices can intertwine to form an alpha-helical coiled coil which can be as long as 200 residues (about 55 turns, about 30 nm).

An alpha helix is often represented as a coiled ribbon, as in Slide 14, which shows the light-chain subunit of mouse ferritin, an iron storage protein consisting largely of alpha helices.

3.2.2 Beta Sheet

Beta pleated sheets are so-called because they were identified after alpha helices. They fall in the top left quadrant of the Ramachandran diagram (Slide 15). Beta sheets are formed from extended polypeptide strands running alongside each other and held together by inter-strand hydrogen bonds. The side chains lie alternately above and below the plane of the sheet. As with the alpha helix this is a generic structure held together by interactions between backbone atoms – the side chains play no direct part. This allows these structures to be generic, independent of amino-acid sequence.

As shown on Slide 15, adjacent strands can run parallel or antiparallel. If antiparallel, hydrogen bonds connect the amine and carbonyl groups on one amino acid to those of an amino acid on an adjacent strand. If the strands are parallel, then the amine and carbonyl on one residue are bonded to the amine and carbonyl from two residues which are two positions apart on an adjacent strand.

Beta sheets can mix parallel and anti-parallel adjacent strands. They have a tendency to be twisted away from perfect planarity (Slide 16). This helps them to fold onto themselves to form beta-barrels.

Some proteins are composed mainly of beta sheets. Green fluorescent protein (GFP, Slide 17) is extremely important in modern biological physics – this beta barrel contains a fluorophore, and the protein can be genetically modified to tune the colour of the fluorophore. By splicing the GFP gene to the gene of a protein of interest, the two proteins can be produced as a single polypeptide chain – they usually fold independently, so the protein of interest retains its normal function and can be traced by means of its fluorescent tag. The other protein shown in Slide 17 contains two large planar β sheets.

The next lecture will focus on the interactions that govern protein folding.