Coronavirus molecular biology

Biology 6B has been reshaped by the coronavirus. First, the pandemic forced us to move to online instruction (which we instructors had said we’d never do for this class). Second, I’ve tried to infuse the 6B curriculum with viral biology, so this quarter’s 6B includes multiple coronavirus connections.

On this page, my goal is to describe the coronavirus genome and gene expression process. I hope to connect these concepts with the things you already know about molecular genetics, and to point out some unique features of viral gene expression and replication.

You’re already familiar with the fundamentals of information flow in gene expression:

  • DNA replication: DNA polymerase copies the nucleotide sequence of a template DNA strand to make a new complementary DNA strand.
  • Transcription: RNA polymerase copies the nucleotide sequence of a template DNA strand to make a new complementary RNA strand.
  • Translation: Ribosomes, tRNAs, and other components work together to make a polypeptide, with its amino acid sequence determined by the nucleotide sequence of a messenger RNA.

Virus life cycles must include the same basic functions: copying the genome and producing viral polypeptides. However, due to the structure of the coronavirus genome, transcription and translation occur in slightly different ways compared to how they normally work in cells.

The coronavirus genome is +strand RNA

Every virus has a genome, but they’re not all structured the same way. Some viruses have DNA genomes, and some have RNA genomes. For RNA viruses, the RNA could be double-stranded (dsRNA, like rotavirus) or single-stranded (ssRNA). Further, ssRNA viruses can be negative-strand (-ssRNA) or positive-strand (+ssRNA). The positive strand is the one that can be translated to make protein, while the negative strand can’t. Viruses that have -ssRNA genomes must first make a complementary positive-strand copy of their RNA before their genes can be translated. The coronavirus genome is positive-strand RNA (+ssRNA), so its RNA can be translated directly.

Viral Classification: The Baltimore Scheme

Image credit: Bruslind, The Viruses.

RNA serves two different purposes for coronaviruses: it is the genome that gets passed on in new copies of the virus, and it gets translated to make proteins (as mRNA). Since the coronavirus genome is +strand RNA, the host cell’s ribosomes and other translation machinery immediately begin translation of viral genes when viral RNA enters the cell. Some of the viral proteins produced by translation will make up part of the structure of new virions (copies of the virus), while others are necessary for viral replication, but won’t become part of a virion. I’ll describe some of the protein products below, under ORFs. To understand viral molecular genetics, we need to look at the processes of transcription and translation.


The SARS-CoV-2 genome is unusually large for an RNA virus, at approximately 30,000 nucleotides (30 kb, or kilobases), and it encodes 20 to 30 proteins. None of your normal mRNAs is that long, and the extreme length of the RNA presents a challenge in terms of translation. Eukaryotic translation normally allows only one polypeptide product per mRNA (unlike the polycistronic mRNAs in bacteria). Translation normally begins with ribosomal attachment at a start codon near the 5’ end of the RNA. How can this extremely long mRNA be translated to produce multiple proteins? Two viral tricks make this happen: polyproteins and subgenomic RNAs, which I’ll describe below.


Like all RNA that codes for proteins, the coronavirus RNA contains open reading frames (ORFs). As you may recall, an ORF must begin with a start codon and continue in a series of 3-nucleotide codons until it reaches a stop codon. This provides ribosomes with the essential information needed for translation. Normally, one ORF encodes one protein, but the coronavirus genome encodes numerous proteins in a small number of ORFs. 
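If it helps to see the ORF rule spelled out, here is a minimal Python sketch of it: scan each of the three reading frames, open an ORF at the first start codon, and close it at the next in-frame stop codon. The function name find_orfs, the min_codons cutoff, and the toy sequence are my own illustrative choices, not part of any real annotation tool, and real ORF finders handle many more details.

```python
# Toy ORF finder for an RNA string (illustrative sketch, not an annotation tool).
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(rna, min_codons=2):
    """Return (start, end) index pairs for ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. `min_codons` (a
    hypothetical cutoff) is the minimum number of codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None  # index of the start codon of the ORF currently open
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


toy_rna = "CCAUGGCUAAAUAGCC"  # made-up sequence: AUG GCU AAA UAG in frame 2
for start, end in find_orfs(toy_rna):
    print(start, end, toy_rna[start:end])
```

An AUG with no in-frame stop codon after it is not reported, which matches the definition above: an ORF needs both a start and a stop.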

Once coronavirus RNA enters a cell, translation begins at the first start codon, near the 5’ end of the RNA. This marks the beginning of the first open reading frame, called ORF1a/b, which encodes a very long polypeptide. This polypeptide is actually a polyprotein, meaning that it must be cut into smaller polypeptides, which then function as separate proteins. The polyprotein becomes 16 different proteins, all of which are needed for viral replication. These are called the nonstructural proteins of the coronavirus, because they’re not part of the structure of the virus.

Genome organization

Genome organization of SARS-CoV-2

This diagram shows a map of the complete genome, which is approximately 30,000 nucleotides long and encodes 30 or so different proteins. Nonstructural proteins 1-16 are produced from the ORF1ab polyprotein, while the others are translated separately.


Two of the proteins within the polyprotein are proteases, meaning that they can cut polypeptide chains. The proteases are active from the start, and they cut the polyprotein at the appropriate places to free the other individual proteins to start doing their jobs. One of the proteases also cuts away ubiquitins, preventing the viral proteins from being destroyed by the cell’s quality control process.

Ribosomal frameshift

The first ORF can be translated two different ways, called ORF1a and ORF1b. For ORF1a, translation proceeds from the first start codon to the first stop codon, as you might expect. This produces a large polyprotein (around 500 kDa in mass), which can be cut up to form 11 different functional proteins. Alternatively, the ribosome can skip the first stop codon and continue on, producing an even larger polyprotein, which includes the same proteins as the ORF1a polyprotein, along with five more from ORF1b. The mechanism for skipping the stop codon is a ribosomal frameshift.

Ribosomal frameshift

The ribosome moves along the mRNA (or the RNA is pulled through the ribosome), starting from the start codon. At a particular location, the RNA forms a pseudoknot: specific nucleotide sequences allow the RNA to fold over and base pair with itself, forming a knot-like structure. The pseudoknot makes it difficult for the mRNA to be pulled through the groove on the ribosome. Because it’s partially blocked, the ribosome sometimes takes a 1-nucleotide backwards step before proceeding, so it shifts to a new reading frame. This reading frame doesn’t include the stop codon, so translation continues on until it hits the next stop codon, thousands of nucleotides later. Thus, the ORF1ab polyprotein (translated from the RNA sequences of ORF1a and ORF1b together) is significantly larger than the ORF1a polyprotein.


  • Translation of the ORF1a polyprotein starts at the first start codon at the 5' end of the genome and continues until the first stop codon. This polyprotein cuts itself into 11 finished proteins, using its own proteases.
  • Translation of the ORF1ab polyprotein also starts at the same start codon, but skips the first stop codon due to a ribosomal frameshift. Translation continues until it reaches the next stop codon, producing a larger polyprotein which cuts itself into 16 finished proteins.
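The two outcomes above can be sketched with a toy translation function in Python. This is only an illustration under stated assumptions: the 23-nucleotide sequence is made up, and the slip_at parameter (the position where the ribosome steps back one nucleotide) is a hypothetical stand-in for the effect of the pseudoknot. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons listed in UCAG order ("*" marks a stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna, start=0, slip_at=None):
    """Translate `rna` from `start` until a stop codon.

    If `slip_at` is given, the ribosome steps back one nucleotide (a -1
    frameshift) the first time its reading position reaches that index.
    """
    protein = []
    pos = start
    while pos + 3 <= len(rna):
        if slip_at is not None and pos == slip_at:
            pos -= 1        # re-read the last nucleotide in a new frame
            slip_at = None  # slip only once
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
        pos += 3
    return "".join(protein)


# Made-up toy RNA: frame 0 reads AUG GCU AAC GGU UAA (stop).
toy_rna = "AUGGCUAACGGUUAAGCGAUUGA"
print(translate(toy_rna))             # no frameshift: stops at the first stop codon
print(translate(toy_rna, slip_at=9))  # -1 frameshift after the third codon
```

In this toy, the unshifted product is MANG and the frameshifted product is MANRLSD: both begin with the same amino acids, but the shifted ribosome reads past the first stop codon and makes a longer polypeptide, which is the relationship between the ORF1a and ORF1ab polyproteins.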


The host cell will produce copies of viral proteins, but it won’t transcribe the viral RNA. This requires a particular enzyme, an RNA-dependent RNA Polymerase (RdRp), that cells don’t normally use. (Cells have DNA-dependent RNA polymerases for RNA production and DNA-dependent DNA polymerases for DNA replication, but not RdRp). Before the viral genome gets copied, the cell must first translate the viral genes, producing the multi-protein RdRp enzyme complex needed for RNA replication (transcription).

Once RdRp is produced, it begins to produce new RNA copies of the viral RNA genome. First, negative strand copies of the +ssRNA are produced, then the negative strands can be used as templates for synthesizing more positive strands.
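A short Python sketch may make the two-step copying concrete. It assumes nothing beyond Watson-Crick base pairing; the function name copy_strand and the toy sequence are mine, chosen for illustration.

```python
# Base-pairing rules for RNA: A pairs with U, G pairs with C.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def copy_strand(template):
    """Return the complementary RNA strand, written 5' to 3'.

    The new strand is antiparallel to the template, so the complemented
    sequence is reversed to keep the usual 5'-to-3' notation.
    """
    return template.translate(COMPLEMENT)[::-1]


plus_strand = "AUGGCUAAAA"               # made-up +strand ending in a short poly-A
minus_strand = copy_strand(plus_strand)  # first round: +strand -> -strand
new_plus = copy_strand(minus_strand)     # second round: -strand -> +strand

print(minus_strand)  # begins with poly-U, complementary to the poly-A tail
print(new_plus == plus_strand)
```

Copying twice regenerates the original sequence, which is why the negative strand works as a template for making more positive strands.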

Viral RNA is transcribed in two different ways, fulfilling the RNA’s two roles. Sometimes the entire genome is copied, producing genomic RNA, which can serve as new viral genomes and also as mRNA. Other times, RdRp produces only partial copies, called subgenomic RNA, which function only as mRNA for specific viral genes. The shorter subgenomic RNAs are essential, because the ORF1ab polyproteins don’t include the structural and accessory proteins. These are encoded by separate ORFs, located toward the 3’ end of the genomic RNA.

The process of synthesizing subgenomic RNAs is called discontinuous transcription:

Discontinuous transcription of subgenomic RNA


  1. The process begins with full-length positive-strand genomic RNA. Transcription begins near the 3’ end, which, like most mRNAs, has a poly-A tail. Negative-strand synthesis starts with poly-U and continues toward the 5’ end of the genomic template. However, at specific sequence sites, the RdRp enzyme complex may “jump” ahead, skipping part of the template and landing at a predetermined spot (L in the original template). The positive strand begins with a leader sequence, and this sequence is always copied into the negative strand, but some regions in the middle are sometimes skipped.
  2. The negative strand copies, both genomic and subgenomic, can then be copied to make new positive strands, which can be used for translation. 
  3. The full-length genomic +strand copies can be packaged into virions, while the subgenomic +strands serve as mRNAs for the structural and accessory proteins that aren’t part of the original ORF1ab polyproteins.
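The net result of the steps above can be sketched in Python by treating the genome as an ordered list of labeled regions rather than nucleotides. This is a deliberately simplified model: the region list leaves out the accessory genes, the function names are hypothetical, and it only shows which regions end up in each +strand product, not how RdRp actually makes the jump.

```python
# Simplified 5'-to-3' map of the +strand genome (accessory genes omitted).
GENOME = ["leader", "ORF1a", "ORF1b", "S", "E", "M", "N"]


def subgenomic_rna(genome, first_gene):
    """Return the +strand subgenomic RNA whose first ORF is `first_gene`.

    The leader is always kept, the regions between the leader and
    `first_gene` are skipped, and everything from `first_gene` to the
    3' end is retained.
    """
    start = genome.index(first_gene)
    return [genome[0]] + genome[start:]


def translated_orf(rna):
    """Ribosomes normally translate only the ORF nearest the 5' end."""
    return rna[1]


for gene in ["S", "E", "M", "N"]:
    rna = subgenomic_rna(GENOME, gene)
    print(gene, rna, "-> translated:", translated_orf(rna))
```

Each subgenomic RNA puts a different structural gene next to the leader, so each gene gets its turn as the first ORF that a ribosome encounters.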

The genomic RNA is too long to translate into one polypeptide, and the rules of eukaryotic translation don’t encourage translation of multiple ORFs from one RNA, so the subgenomic RNA mechanism plays an essential role in producing all the necessary viral proteins.


Most RNA viruses have small genomes and high mutation rates (low fidelity in replication). The high error rate in copying the RNA can help these viruses evolve more rapidly, increasing the odds of generating a mutant virus variant that spreads and reproduces more quickly. However, coronaviruses have much larger genomes than most RNA viruses (the SARS-CoV-2 genome is about twice the size of an influenza virus genome). If the coronavirus mutation rate were as high as that of other RNA viruses, it would produce so many errors per genome replication that the virus would be at a distinct disadvantage. As it turns out, coronaviruses have a solution.
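To see why genome size matters here, a rough calculation helps. The Python sketch below assumes a per-nucleotide error rate of 1 in 10,000, which is an illustrative number I picked, not a measured value for any particular virus. It also uses 15,000 nucleotides as a round figure for a genome half the size of SARS-CoV-2, and it treats errors as independent, which is a simplification.

```python
def expected_errors(genome_length, error_rate):
    """Average number of copying errors per genome replication."""
    return genome_length * error_rate


def error_free_fraction(genome_length, error_rate):
    """Fraction of copies with no errors, assuming independent errors."""
    return (1 - error_rate) ** genome_length


ASSUMED_ERROR_RATE = 1e-4  # illustrative assumption: 1 error per 10,000 nucleotides

genomes = [("half-size genome", 15_000), ("SARS-CoV-2-size genome", 30_000)]
for name, length in genomes:
    errors = expected_errors(length, ASSUMED_ERROR_RATE)
    perfect = error_free_fraction(length, ASSUMED_ERROR_RATE)
    print(f"{name}: {errors:.1f} errors per copy, {perfect:.1%} error-free copies")
```

With these assumed numbers, doubling the genome length doubles the average number of errors per copy and cuts the share of error-free copies from roughly 22% to roughly 5%. The exact values depend entirely on the assumed error rate; the point is the trend.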

Unlike the polymerases of some RNA viruses, the coronavirus RdRp has proofreading capability. As the enzyme complex moves along the template RNA, there is always a possibility of incorporating a mismatched nucleotide. Without proofreading, that mismatched nucleotide is incorporated as a mutation. The proofreading polymerase, in contrast, is likely to stop and cut out the mismatched nucleotide and replace it with the correct one before going on. The act of cutting away the nucleotide from the end of the strand is called exonuclease activity, and when exonuclease activity is triggered by a mismatch, it’s called proofreading. Viruses with smaller genomes, like influenza viruses or HIV, do not have proofreading capability.

The larger an organism’s genome, the more important it is to increase fidelity in genome replication. Eukaryotes have much larger genomes (a human genome contains 10^9 nucleotides, compared to the 10^4 nucleotides of the SARS-CoV-2 genome), and an even greater need for fidelity in replication. Like coronaviruses, eukaryotes have proofreading polymerases, which increase fidelity approximately 100-fold by reducing errors during replication. In addition, eukaryotes have mismatch repair systems, which provide a way to cut out and replace mismatched nucleotides after replication is complete. These repair mechanisms increase replication fidelity by another 1,000-fold, so that the replication error rate of eukaryotes is lower than that of typical RNA viruses by a factor of 10^6. Without these mechanisms for high-fidelity replication, large genomes would be impossible.

Other aspects of coronavirus biology

In this article, I have focused on genome replication and gene expression, but there are many other important aspects of coronavirus biology that determine both the spread of the virus and the course of the disease in infected people. In particular, SARS-CoV-2 dramatically alters the structure of the cells it infects, and produces proteins that interfere with the host’s immune response. I won’t try to describe those events here, but I recommend the article by Hartenian et al, The molecular virology of coronaviruses, if you would like to learn more.

Coronavirus life cycle summary

Life cycle of Coronavirus

  1. Binding and viral entry. This step depends on binding between the viral spike protein and the ACE2 receptor on the host cell surface. The exact mechanisms are not completely understood.
  2. Translation of polypeptide. Transcription doesn't have to happen first, because the viral genome is + strand RNA, ready for the cell's ribosomes to translate.
  3. Autoproteolysis and co-translational cleavage of polypeptide to generate non-structural proteins. The first translation product is a polyprotein: a very long polypeptide that must be cut into smaller units that can then fold into functional proteins. Part of the polyprotein is a protease, which cleaves the polyprotein into functional units. Some of the protein units join together to form a functional RNA-dependent RNA polymerase (RdRp) complex.
  4. (-Sense) subgenomic transcription and RNA replication. The RdRp complex copies the original + strand RNA.
  5. (+Sense) subgenomic transcription and RNA replication. Parts of the cell's endomembrane system have been turned into unusual convoluted membranes, in which viral replication occurs.
  6. Translation of subgenomic mRNA into structural and accessory proteins. These membrane proteins are translated by ribosomes bound to the endoplasmic reticulum.
  7. Nucleocapsid buds into ERGIC studded with S (Spike), E (Envelope), and M (Membrane) proteins. "ERGIC" stands for Endoplasmic Reticulum Golgi Intermediate Compartments; as the name implies, it's part of the endomembrane system, intermediate between the endoplasmic reticulum and the Golgi apparatus. The nucleocapsid is the complete RNA genome, coated with capsid proteins.
  8. Formation of virion.
  9. Exocytosis.


Terms & concepts

  • Fidelity in RNA and DNA replication
  • Genomic and subgenomic RNA
  • ORFs (Open reading frames) in SARS-CoV-2
  • Polyprotein
  • Positive (+) strand and negative (-) strand RNA
  • Proofreading polymerase
  • RNA-dependent RNA polymerase (RdRp)
  • Ribosomal frameshift
  • Structural and nonstructural proteins in coronavirus
  • Viral protease
  • Virion

Polymerase terminology: DNA polymerases make DNA, while RNA polymerases make RNA. These enzymes can be categorized in terms of what they use as a template; DNA-dependent RNA polymerases use DNA as a template to produce RNA. All these polymerases go in the same direction, adding new nucleotides to the 3’ end of the strand that is being synthesized. (Translation also goes the same direction; ribosomes move from 5’ to 3’ along the mRNA strand as they synthesize polypeptides.)

Review questions

  1. Suppose you sequence a SARS-CoV-2 genome from a patient and you find a new mutation. How would you investigate the mutation to assess its importance? How would your approach be different if the viral genome you sequenced wasn’t a coronavirus, but was some completely unknown kind of virus?
  2. Describe some of the enzymatic steps that are unique to the coronavirus life cycle, and could potentially be targets for therapeutic drugs.
  3. How can the coronavirus genome encode so many different proteins in one mRNA?
  4. How is the mutation rate reduced in coronaviruses? Why is this important for coronaviruses, but not for some other RNA viruses?

References & further reading

The molecular virology of coronaviruses. Hartenian et al., 2020. JBC. An outstanding review of how coronaviruses (including SARS-CoV-2) work. This article accompanies a video lecture by one of the authors: Coronaviruses 101: Focus on Molecular Virology. (1:02). Both are clear and accessible. Between the article and the lecture, you can gain a very detailed understanding of coronaviruses. (I’m using some of the graphics from the article under its Creative Commons license.)

Structure and Genome of SARS-CoV-2 (COVID-19) with diagram. Microbe Notes. Brief summary, with some good illustrations.

The Viruses. Bruslind, Oregon State University. A chapter from an online microbiology text. Gives a good description of the categories of viruses under the Baltimore Scheme, which classifies them according to their genomes.

Programmed −1 Ribosomal Frameshifting in coronaviruses: A therapeutic target. Kelly et al., 2021. Virology. Describes how the frameshift works, and explains that, since this is a unique mechanism for coronaviruses, it’s a possible target for drugs that block viral replication.

Fidelity in RNA and DNA replication

Thinking Outside the Triangle: Replication Fidelity of the Largest RNA Viruses. Smith et al., 2014. Annual Review of Virology. Describes the mechanisms and the importance of controlling the mutation rate in viruses. “When judged by ubiquity, adaptation, and emergence of new diseases, RNA viruses are arguably the most successful biological organisms. This success has been attributed to a defect of sorts: high mutation rates (low fidelity) resulting in mutant swarms that allow rapid selection for fitness in new environments.”

What happens when your DNA is damaged? (Video) Menensi, TedEd. This video is about eukaryotic cells, not viruses, but if you’re curious about genome fidelity, this is a good place to start.

DNA Replication with a Proofreading Polymerase. (Video) New England Biolabs. Brief overview of how a DNA polymerase controls error rate; the exonuclease activity is similar to that of the RNA-dependent RNA polymerase of coronaviruses.

Drug development

A call to arms. Service, 2021. Science. This news article gives a good overview of the various approaches to treating Covid-19 with drugs that block specific events in the virus life cycle.
