Life is self-organising. The molecules of life store information that determines their chemical and physical structure, and the complicated network of interactions between them that is the signature of living systems (Slide 1). This lecture outlines the Central Dogma of molecular biology and the molecular mechanisms that implement it. It also describes the process of molecular evolution by which self-organising systems can develop.

6.1 The Central Dogma

Physics has laws; biology has a Central Dogma, put forward by Francis Crick (a physicist). The Central Dogma of molecular biology (Slide 2) is that DNA directs its own replication and its transcription to yield RNA which, in turn, directs its translation to form proteins.

The Central Dogma sets out a hierarchy of information. DNA is the information store. It can be replicated because both strands contain a complete set of information; both strands can act as templates for the generation of a daughter double helix that is identical to the parent. When this information is to be used to direct the synthesis of a protein, the relevant section of DNA is copied into messenger RNA. This contains the same information but in a form that can be read by the protein synthesis machinery. Finally, a protein is produced by concatenating amino acids according to the sequence specified by the mRNA. These dominant processes correspond to the solid arrows and can be summarised as:

DNA makes RNA makes proteins.

Some viruses store genetic information in RNA, and use an enzyme called reverse transcriptase to make a DNA copy in order to make use of the protein synthesis machinery of the infected cell – this involves information flow in the opposite direction to the normal process. Some RNA viruses and plants have RNA-directed RNA polymerases – as the name implies, they can replicate RNA directly. There is no known process in which DNA is used directly to determine protein synthesis – but this would not invalidate the Central Dogma. What is important is the absence of certain arrows: proteins never direct the synthesis of DNA or RNA.

Slide 3 shows DNA replication by templated polymerisation. The human genome comprises 3 × 10⁹ base pairs. The spacing between base pairs in the double helix is 3.4 Å, so every cell of our bodies contains ~1 m of DNA. We have 23 pairs of chromosomes – each contains a continuous double helix made of two complementary molecules of DNA, each of order a few cm long. Chromosomal DNA is a very long polymer. Every time a cell divides the whole genome is copied, with an error rate of 1 in 10⁸ to 1 in 10¹⁰ – so of order 1 to 10 errors in total.
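The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an order-of-magnitude estimate: it assumes a single copy of the genome, and the variable names are illustrative rather than taken from the slides.

```python
# Order-of-magnitude check of the genome figures quoted in the text.
BASE_PAIRS = 3e9          # base pairs in one copy of the human genome
RISE_PER_BP_M = 3.4e-10   # spacing between base pairs: 3.4 Å, in metres

# Contour length of one copy of the genome laid end to end
length_m = BASE_PAIRS * RISE_PER_BP_M

# Expected number of copying errors per genome copy for the quoted error rates
errors_high = BASE_PAIRS * 1e-8    # error rate of 1 in 10^8
errors_low = BASE_PAIRS * 1e-10    # error rate of 1 in 10^10

print(f"length of one genome copy: {length_m:.2f} m")
print(f"expected errors per copy: {errors_low:.1f} to {errors_high:.0f}")
```

The length comes out at about 1 m, and the expected number of errors per copy spans roughly 0.3 to 30, which brackets the figure of 1 to 10 quoted above.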

DNA replication requires more than 20 enzymes. The daughter strand is formed directly on the template by addition of deoxyribonucleotide units to the 3′ end of a growing chain (note that the growing chain runs in the opposite direction to the template strand). This reaction is catalyzed by an enzyme called DNA polymerase, which only catalyzes the formation of a new phosphodiester bond if the incoming nucleoside triphosphate is complementary to the next base on the template strand – it is a template-directed enzyme. Some DNA polymerases also proofread – they have a separate nuclease activity that allows them to detect and remove mismatched nucleotides. Other enzymes (helicases) unwind double-stranded DNA as the polymerase advances, and topoisomerases cope with the twisting of the DNA that is caused by the advancing helicase.

The genome is actively protected – molecular machines continually scan DNA, comparing the complementary strands to find copying errors and looking for damage caused by UV radiation and chemical mutagens. When errors are found, they are cut out and corrected.

RNA synthesis on a DNA template is called transcription (Slide 4). This process is also catalyzed by a polymerase – in this case RNA polymerase (RNAp) – which adds nucleoside triphosphates to the free 3′ –OH group on a growing daughter strand. We saw a crystal structure of RNAp in Slide 7 of Lecture 3 of this topic (‘Protein Structures’).

As the RNA polymerase moves along the double-stranded DNA it unwinds the helix by about a turn to open a transcription bubble, which allows the polymerase access to one of the DNA strands which it uses as a template for RNA synthesis. The DNA helix rewinds behind the polymerase, stripping off the RNA daughter strand. The RNA daughter strand is complementary to the template DNA strand, and therefore has the same base sequence as the other DNA strand (known as the coding strand) – except that thymine is replaced by uracil (the corresponding RNA base).
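The relationship between the template strand, the coding strand and the RNA transcript can be sketched in code. This is a minimal illustration, not a model of the polymerase: the function names and the nine-base template are made up for the example, and the template is written 3′→5′ so that the transcript reads 5′→3′.

```python
# Watson-Crick partners used when RNA is built on a DNA template
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
# Watson-Crick partners within double-stranded DNA
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the RNA (written 5'->3') made on a DNA template strand
    that is written 3'->5', i.e. in the order the polymerase reads it."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


def coding_strand(template_3to5: str) -> str:
    """Return the coding DNA strand (5'->3') complementary to the template."""
    return "".join(DNA_TO_DNA[base] for base in template_3to5)


template = "TACGGCATT"             # hypothetical template strand, 3'->5'
mrna = transcribe(template)        # AUGCCGUAA
coding = coding_strand(template)   # ATGCCGTAA

print(mrna, coding)
```

Replacing T with U in the coding strand reproduces the transcript exactly, which is the statement made in the paragraph above.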

In order to create a sequence of messenger RNA (mRNA) that codes for the synthesis of a protein, RNA polymerase must find the start of the corresponding gene. The polymerase recognises specific promoter sites, at which it binds and initiates transcription, and continues until it reaches a terminator sequence (Slide 5). Other signalling proteins, known as activators and repressors in prokaryotes and transcription factors in eukaryotes, control the probability of initiation. The rate at which a particular protein is synthesised is mainly controlled by the frequency with which synthesis of the corresponding mRNA is initiated.

6.1.1 The genetic code

Genes code for protein synthesis. There is a one-to-one correspondence between genes and proteins – a gene is a well-defined unit of DNA whose linear base sequence codes for a linear polypeptide chain that folds up to make a protein. (Note: this is a molecular biologist’s definition of a gene. Some evolutionary biologists use a slightly different definition, in which a gene is any piece of DNA that is likely to be inherited as a unit.) Genes use 3-base codons (groups of adjacent bases in DNA) to represent amino acids: the order in which the codons appear determines the order in which amino acids are added to the chain (Slide 6). Protein synthesis is actually directed by messenger RNA, which is why the code is written with the RNA base U instead of T.

There are 4³ = 64 possible 3-base codons. Three are stop signals, the other 61 code for the 20 standard amino acids. All but two amino acids (Met and Trp) are represented by more than one codon – usually codons that represent the same amino acid differ only in the third base.
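These counts can be verified directly from the standard genetic code. The sketch below builds the codon table from the conventional compact listing (bases in the order U, C, A, G, with ‘*’ marking a stop signal); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional U, C, A, G order; '*' marks a stop
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Map each of the 4^3 codons to its one-letter amino acid code
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons per amino acid (the degeneracy of the code)
degeneracy = Counter(CODON_TABLE.values())
stops = degeneracy.pop("*")
single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1)

print(len(CODON_TABLE), stops, len(degeneracy), single_codon)
# prints: 64 3 20 ['M', 'W']
```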

The genetic code is nearly universal: it is possible to make a human protein in E. coli. This is clear evidence that all life on Earth derives from a common ancestor, presumably a unicellular organism.

Slide 7 is a summary of the coding scheme, starting with a DNA gene and ending with a protein. The lower strand of DNA is the template in this case.

There is no direct interaction between a codon on mRNA and the amino acid that it represents. A 3-base section of RNA is not structurally or chemically versatile enough to discriminate between the different amino acids. Instead, adaptors are used to separate the functions of binding to a codon and to the corresponding amino acid.

The adaptor is itself made of RNA – called transfer RNA or tRNA. There is an adaptor for each amino acid. The RNA folds up into a well-defined secondary structure, which is partly stabilised by Watson-Crick base pairing in the double helical regions (see Lecture 1 of this topic ‘The Structure of DNA and RNA’). The anticodon loop contains a 3-base sequence which base-pairs with one of the codons. The amino acid is covalently attached at the 3′ terminus – well separated from the region that interacts with the mRNA (see Lecture 1 ‘The Structure of DNA and RNA’ for an explanation of these terms and notation). The complex job of recognising a particular tRNA and loading it with the relevant amino acid is done by another enzyme called aminoacyl-tRNA synthetase – there is one of these for each amino acid.

6.1.2 Protein synthesis

Protein synthesis is catalysed by an immensely complicated molecular machine called the ribosome. This contains proteins, but is largely made of RNA – and it is the RNA component that is largely responsible for its catalytic activity. Every organism has slightly different variant ribosomes (for example the E. coli ribosome has a mass of 2.5 MDa and contains about 4500 nucleotides – it is ~ 25 nm across) and some organisms may have more than one variant. But it is important to note that all ribosomes do essentially the same job, in essentially the same way, with essentially the same structure, so we often refer to ‘the ribosome’.

The pictures on Slide 8 give an idea of the complexity of the structure. On the left is the secondary structure (intramolecular base-pairing interactions) of the largest of three main strands of RNA. On the right is the X-ray crystal structure of the whole complex. The ribosome contains three binding sites for tRNA – this structure includes tRNA adaptors in each position – they are the gold / red components.

A technical description of the ribosome structure involves some specialist nomenclature. Ribosomal fragments are named according to their rate of sedimentation in centrifugation, e.g. 50S, where S is the Svedberg, the unit of measurement. Sedimentation rate depends on size and shape, so the numbers in the fragment names do not add up (e.g. 80S comprises 60S and 40S). For prokaryotes and eukaryotes the ribosome structures are as follows:


Prokaryotes: 70S whole ribosomes, each consisting of a small (30S) and a large (50S) subunit.

30S subunit: 1 RNA subunit (16S, 1540 nucleotides) bound to 21 proteins.

50S subunit: 2 RNA subunits (5S, 120 nucleotides and 23S, 2900 nucleotides) bound to 34 proteins.


Eukaryotes: 80S whole ribosomes, each consisting of a small (40S) and a large (60S) subunit.

40S subunit: 1 RNA subunit (18S, 1900 nucleotides) bound to around 33 proteins.

60S subunit: 3 RNA subunits (5S, 120 nucleotides; 5.8S, 160 nucleotides and 28S, 4700 nucleotides) bound to around 49 proteins.

Slide 9 shows templated protein synthesis by the ribosome. There are three binding sites for tRNA:

A binds aminoacyl-tRNA, i.e. loaded tRNA;

P binds the tRNA adaptor carrying the growing polypeptide chain;

E is the exit site. The E site is not shown in this slide.

The mRNA template runs through a tunnel. The growing polypeptide chain is covalently attached to one of the tRNA adaptors at the P site. The adaptor carrying the next amino acid is bound to its codon at the A site. The ribosome catalyzes the transfer of the polypeptide chain from the P-site adaptor to the amino acid carried by the A-site adaptor, forming a new peptide bond. The ribosome then moves one codon towards the 3′ end of the mRNA. This takes the P-site adaptor to the E or exit site, moves the adaptor that now carries the polypeptide chain from the A site to the P site, and empties the A site ready for the next adaptor. Overall, the ribosome has moved 3 bases along the template in the 5′ to 3′ direction, and added one amino acid to the growing protein.
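The codon-by-codon cycle described above can be mimicked by a short loop that steps 3 bases at a time along the mRNA in the 5′ to 3′ direction. This is only a sketch of the information flow, not of the ribosome mechanism. It assumes that reading starts at the first AUG and ends at the first stop codon (initiation and termination are not covered in these notes), and the example mRNA is made up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional U, C, A, G order; '*' marks a stop
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna: str) -> str:
    """Read an mRNA 5'->3', one codon per step, and return the peptide
    in one-letter code. Assumes reading starts at the first AUG."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon found
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":             # stop codon: release the chain
            break
        peptide.append(residue)        # one amino acid added per codon
    return "".join(peptide)


print(translate("GGAUGCCGUGGUAAUU"))   # hypothetical mRNA; prints MPW
```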

A typical cell is packed with ribosomes. E. coli can contain up to 20,000 of them, accounting for ~80% of its RNA content and ~10% of its protein content. A cell devotes a great deal of resource to making proteins. One molecule of mRNA can simultaneously act as a template for many ribosomes. Slide 10 shows ribosomes from silkworm producing silk fibroin polypeptides.

6.2 Molecular Evolution

6.2.1 Prebiotic chemistry

How did it all start? In the 1920s Russian biochemist Aleksandr Oparin (1894-1980) and British biologist JBS (‘Jack’) Haldane (1892-1964) independently suggested that UV radiation from the sun or lightning discharges caused reactions in the primordial atmosphere to produce the molecules that are the building blocks of life – amino acids, nucleic acid bases and sugars. This became known as the Oparin-Haldane hypothesis.

In 1953 American chemist Stanley Miller (1930-2007) reported a practical demonstration of the action of an electric discharge on a mixture of the reducing gases CH4, NH3, H2O, and H2 that simulated what was viewed at the time as a model atmosphere for the primitive Earth (Slide 11). The result of this experiment was a substantial yield of a mixture of amino acids, together with hydroxy acids, short aliphatic acids, and urea. One of the surprising results of this experiment was that the products were not a random mixture of organic compounds; rather, a relatively small number of compounds were produced in surprisingly high yields. Moreover, with a few exceptions, the compounds were of biochemical significance.

Some meteorites contain many of the same amino acids, suggesting that similar processes can lead to the production of organic molecules elsewhere in the universe, and even that meteorite bombardment could have been a significant source of organic material.

There have been many other laboratory demonstrations that important molecules can be synthesised from simple precursors – for example adenine from hydrogen cyanide.

The Earth formed about 4.6 billion years ago, but the oldest rocks are only about 3.8 billion years old – there is no record of conditions before that. There is dispute over the nature of the primitive atmosphere, and over how organic molecules were synthesised, and how they were concentrated. However, it is generally assumed that an essential first step in evolution was the accumulation and concentration of key molecules.

6.2.2 RNA world

The most fundamental property of living things is that they can replicate themselves – so if we are trying to work out how life began it seems sensible to look for self-replicating molecules. In modern cells self-replication involves a complex system of DNA, RNA and proteins. DNA stores all the information necessary for this process – we can consider the cell as an elaborate mechanism used by DNA to replicate itself. However, although we can understand how such a system perpetuates itself, it is particularly hard to see how it could get started.

The most obvious starting point for life would be a molecule that could replicate itself without requiring external assistance – that is, a molecule that could catalyze its own synthesis from building blocks available in the environment.

Proteins seem particularly ill-adapted to self-replicate – they can be exquisitely selective and efficient catalysts, but they do not have the capacity to use the information present in the amino acid sequence to replicate themselves. DNA can act as a template for its own replication, but it needs RNA and protein catalysts to make a daughter strand – DNA has very limited chemical functionality on its own, and almost no potential as a catalyst. RNA, on the other hand, can do both things (Slide 12) – it can act as a template for its own replication, as DNA does, and it does have catalytic ability. The ribosome, which makes proteins, is largely RNA.

A ribozyme is an RNA enzyme, i.e. an RNA molecule with catalytic activity, for example one that catalyses the hydrolysis of RNA. Natural ribozymes have the ability to catalyse the ligation and cleavage of DNA and RNA. (Ligation is the joining together of two chains; cleavage is the separation of one strand into two.) The broader catalytic potential of RNA has been demonstrated by in vitro experiments in which molecules that can perform a particular function are selected from a random pool of sequences. Slide 13 shows the steps in an in vitro selection experiment to find a self-phosphorylating RNA.

These observations have led to the proposal that, at an early stage, life consisted of a self-replicating RNA system, and that DNA and proteins evolved later. This idea is called the RNA world hypothesis (Slide 14).

If a self-replicating molecule – say RNA – were versatile enough, it might evolve the capacity to help itself even further by segregating or synthesising its own building blocks. It might also catalyze the formation of auxiliary molecules – e.g. proteins – that could assist in replication. A self-replicating system of molecules must keep its components together, otherwise they lose the evolutionary advantage that they gain by cooperating. They could be tethered together or physically contained – perhaps in the space enclosed by a lipid bilayer. Vesicles containing self-replicating systems of molecules might have been the first cells.

There is not enough left of RNA world to see how it worked and, so far, no complete self-replicating system has been demonstrated in the laboratory. If RNA world did exist it was almost certainly not the first stage in the development of life – RNA and its precursors are too difficult to make without the help of enzymes. Perhaps simpler biopolymers came first, and evolved the ability to catalyze the production of RNA, which then took over. The rest of the lecture is concerned with processes for which we do have evidence.

6.2.3 Family trees of life

Classification of living organisms used to depend entirely on tracing relationships between physical characteristics. These can be attributed to common ancestry, and often evidence for those ancestors can be found in the fossil record. On this basis it is possible to construct an evolutionary tree showing lines of descent, and branch points where the ancestors of one species became distinct from those of another. This approach is necessarily rather subjective. It works well for close relatives with many well-defined characteristics, like mammals, but much less well for more distant relatives and for species, such as bacteria, with many fewer easily identifiable characteristics.

The sequencing of entire genomes has transformed this process. By comparing the DNA sequences of organisms it is possible to trace the evolutionary relationships between very different species, and even to provide quantitative measurements of the evolutionary distances between them. This type of study is known as phylogenetics, phylogeny being the evolutionary relationships between different species.

Slide 15 shows the course of evolution deduced from modern biomolecules, i.e. molecular phylogeny. This phylogenetic tree relates contemporary organisms from all three domains of life. It is based on analysis of differences between the nucleotide sequences of the 16S RNA subunit of the ribosome. (The most striking thing about this diagram is that all organisms have ribosomes: all use the same cellular machinery to translate genetic information to synthesise proteins. Furthermore, the 16S ribosomal RNA sequences are all remarkably similar.) At the bottom of the page are aligned sequences of a fraction of this 1500-base RNA from representatives of the three domains. Identical bases are indicated by red links. (To achieve this alignment it was necessary to introduce a space in the E. coli sequence, marked by the red circle, corresponding to a base that has been lost or to one that has been inserted in the other two genomes.)

Random mutations in DNA sequences occur as replication errors during cell division and as a result of chemical damage that can occur at any time. Some mutations are selectively neutral (i.e. neither advantageous nor disadvantageous). Some sections of DNA are not functionally significant – they do not code for the synthesis of a protein or for a catalytic RNA molecule, and have no regulatory function – mutations in such a region may or may not be perpetuated, as it is a matter of chance whether the mutated cell has more success than its relatives in competing for resources. Some mutations in bits of DNA that do matter are also selectively neutral – either because they transform a codon into another that codes for the same amino acid, or because the changes that they induce do not disrupt protein folding or function. Other mutations do cause functionally significant changes. If these are advantageous then they are more likely to be retained. It is much more likely that they will cause damage, in which case they are unlikely to be retained – either the organism will die, or its descendants will be disadvantaged.
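The statement that some mutations merely swap one codon for another that codes for the same amino acid can be quantified from the standard genetic code. The sketch below enumerates every single-base change to every amino-acid-coding codon and classifies the outcome. It assumes that all such changes are equally likely, which is a simplification and not a claim made in these notes; the function name is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional U, C, A, G order; '*' marks a stop
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_point_mutations() -> dict:
    """Count every single-base change to every sense codon, by outcome."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, residue in CODON_TABLE.items():
        if residue == "*":
            continue                   # only codons that encode an amino acid
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue           # not a change
                mutant = codon[:position] + base + codon[position + 1:]
                new_residue = CODON_TABLE[mutant]
                if new_residue == residue:
                    counts["synonymous"] += 1   # same amino acid
                elif new_residue == "*":
                    counts["nonsense"] += 1     # premature stop signal
                else:
                    counts["missense"] += 1     # different amino acid
    return counts


counts = classify_point_mutations()
total = sum(counts.values())
print(counts, f"synonymous fraction = {counts['synonymous'] / total:.2f}")
```

Under this equal-weighting assumption roughly a quarter of single-base changes within a codon leave the amino acid unchanged, almost all of them at the third base.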

By studying the neutral drift of evolutionarily related (homologous) proteins or RNA molecules that results from random, selectively neutral mutations, it is possible to measure evolutionary distances between organisms and thus to deduce evolutionary relationships. In contrast, parts of a protein sequence that are essential to its function can be deduced by identifying sequences that are highly conserved – any changes in such regions are strongly discriminated against by natural selection.

Cytochrome c is a nearly universal eukaryotic protein. It is part of the electron transport chain in mitochondria (see the Biological Energy lectures). Cyt c transfers electrons between two enzyme complexes – cyt c reductase and cyt c oxidase. This process was established between 1.5 and 2 billion years ago, and the machinery has changed very little since. Cyt c from any eukaryotic organism will react with cytochrome oxidase from any other eukaryote – so a protein from a pigeon will interact appropriately with a partner taken from wheat.

The table in Slide 16 compares the amino acid sequences of cyt c from 34 species. Residues marked with a red arrow are invariant – the same in all species, from human to yeast. There are 38 of these, out of 104 in the whole protein.

The colour key groups amino acids according to their characteristics. In most positions the variation is between amino acids from the same group, so-called conservative substitutions. In only 8 positions, marked *, are there 6 or more variations – these are called hypervariable residues.

The phylogenetic tree in Slide 17 is constructed by assuming that the number of differences between the cyt c amino acid sequences from different organisms is proportional to the evolutionary distance between them. Each branch point indicates the probable existence of a common ancestor of all organisms above it. Note that all modern cytochrome cs are approximately the same distance from the root of the diagram – all have evolved by a similar amount, despite different characteristic times for reproduction. This suggests that chemical damage to DNA while it is acting as a passive store of information, rather than errors in copying, accounts for the observed mutation rate.
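The raw input to a tree of this kind is a table of pairwise differences between aligned sequences. A minimal sketch of that first step is given below. The four ten-residue fragments are made up purely for illustration and are not real cytochrome c sequences; the species labels and function name are likewise hypothetical.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Made-up aligned fragments (one-letter amino acid code), for illustration only
aligned = {
    "species_A": "ACDEFGHIKL",
    "species_B": "ACDEFGHIKL",
    "species_C": "ACDEFGHIKM",
    "species_D": "ACNEFAHIRM",
}

# Pairwise difference counts: the proxy for evolutionary distance
distances = {
    (x, y): count_differences(aligned[x], aligned[y])
    for x, y in combinations(aligned, 2)
}

for pair, d in distances.items():
    print(pair, d)
```

In this toy data set species A and B are indistinguishable, C is one change away from both, and D is the most distant, so D would branch off first in a tree built from these distances.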

Slide 18 shows the same phylogenetic tree as Slide 17 but shows more clearly that mammals and insects have diverged equally far from plants since their common branch point, i.e. mammals and insects diverged from each other more recently than their common ancestor diverged from plants. Insects have shorter generation times than mammals, so if copying errors determined the rate at which accepted mutations occur then the apparent evolutionary distance between insects and plants would be greater than that between mammals and plants. However, the average number of differences in the cytochrome c sequence between mammals and plants is approximately equal to the average number of differences between insects and plants, indicating that DNA mutations accumulate at a constant rate with time. DNA mutations must therefore be predominantly due to random chemical change rather than to replication errors.

6.2.4 Calibrating phylogenetic trees

The fossil record can be used to calibrate phylogenetic trees. On Slide 19, evolutionary distance between species is plotted against the time since, according to the fossil record, the species diverged. The vertical scale has been corrected to allow for coincident mutations. Error bars give the scatter in measured sequence data. Each of these four proteins has accumulated mutations at a remarkably constant rate, but the rates at which they have evolved are very different. The unit evolutionary period is the time for sequences to diverge by 1%: this varies by a factor of about 550, from 1.1 million years for fibrinopeptides to 600 million years for histone H4. This does not imply that the mutation rates for the DNA sequences specifying these proteins are different, but that the rates at which mutations are accepted are different.
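A correction for coincident mutations is needed because a site that has changed twice still shows up as only one difference. One common way to make such a correction is the Poisson formula d = −ln(1 − p), where p is the observed fraction of differing sites and d is the estimated number of changes per site. The sketch below uses this formula as an assumption; the correction actually applied to the data in Slide 19 may be different.

```python
import math


def poisson_corrected_distance(p_observed: float) -> float:
    """Estimated changes per site from the observed fraction of differing
    sites, assuming independent (Poisson) changes at each site.
    NOTE: an assumed correction, not necessarily the one used in Slide 19."""
    if not 0.0 <= p_observed < 1.0:
        raise ValueError("observed fraction must be in the range [0, 1)")
    return -math.log(1.0 - p_observed)


for p in (0.01, 0.10, 0.50):
    d = poisson_corrected_distance(p)
    print(f"observed {p:.2f} -> corrected {d:.3f}")
```

The correction is negligible for closely related sequences and grows as the sequences diverge, so uncorrected counts would make distant divergences look too recent.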

Haemoglobin and cyt c are both relatively compact proteins that act as transporters – for oxygen and electrons respectively. Cyt c must interact with a large protein complex over much of its area to enable an electron to tunnel to or from its redox centre – most mutations would affect its functionality, so most are rejected and it evolves only very slowly. Haemoglobin has highly conserved residues near its active site, but it accepts and releases oxygen by diffusion, not by interaction with a protein partner, so mutations, especially those near the surface, are more readily tolerated. Histone H4 is a charged protein that binds DNA in the chromosome – its function clearly makes it very intolerant of variation. Histone H4 proteins from peas and cows, which diverged 1.2 billion years ago, differ by only 2 residues in 102. Fibrinopeptides are ~20-residue polypeptides that are cleaved from the protein fibrinogen to form fibrin during the blood clotting cascade. They have no further function, so there is little selective pressure on them to maintain their amino acid sequence. If we assume that fibrinopeptides accept all mutations, i.e. evolve randomly, then for haemoglobin 1/5 of random amino acid changes are innocuous enough to be accepted, and this fraction is 1/18 for cyt c and 1/550 for histone H4.
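The accepted fractions quoted above follow from ratios of unit evolutionary periods, so they can be turned back into periods and cross-checked against the pea and cow histone H4 comparison. The sketch below uses only numbers given in the text; the variable names are illustrative, and the implied periods for haemoglobin and cyt c are derived here rather than read from the slide.

```python
UEP_FIBRINOPEPTIDES_MY = 1.1   # unit evolutionary period, millions of years

# Fraction of random amino acid changes accepted, relative to fibrinopeptides
accepted_fraction = {
    "haemoglobin": 1 / 5,
    "cytochrome c": 1 / 18,
    "histone H4": 1 / 550,
}

# If fibrinopeptides accept every change, a protein that accepts a fraction f
# of changes takes 1/f times as long to diverge by 1%
implied_uep_my = {
    protein: UEP_FIBRINOPEPTIDES_MY / fraction
    for protein, fraction in accepted_fraction.items()
}

# Cross-check with the pea/cow histone H4 comparison quoted in the text
percent_difference = 100 * 2 / 102           # 2 differing residues in 102
uep_from_h4_my = 1200 / percent_difference   # divergence 1.2 billion years ago

for protein, uep in implied_uep_my.items():
    print(f"{protein}: ~{uep:.1f} million years per 1% divergence")
print(f"histone H4 from pea/cow data: ~{uep_from_h4_my:.0f} million years")
```

The two estimates for histone H4 (about 605 and about 612 million years) agree with each other and with the figure of 600 million years to within the precision of the quoted numbers.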

The fossil record can only be used to date divergences between species up to ~600 million years ago, corresponding roughly to the appearance of multicellular organisms whose fossils can be reliably distinguished. Earlier divergence dates can be estimated by extrapolating measured rates of changes of highly conserved proteins that are common to many of the major groupings of organisms. For example:

(animals, fungi) – plants, then animals – fungi: ~1 billion years ago

eukarya – archaea: ~1.8 billion years ago

(eukarya, archaea) – bacteria: 2 billion years ago

gram positive – gram negative bacteria: 1.4 billion years ago.

Slide 20 shows a quantitative map of relationships between species across all three domains of life. It is possible to construct the map because the molecular machinery of life is remarkably uniform and much of it, e.g. the basic machinery of DNA, RNA and protein synthesis, is universal.