Open Reading Frames in pGLO

The goal of this lab exercise is for you to gain some familiarity with working with DNA and protein sequence data.

This assignment requires you to cut and paste information among several browser windows; you'll find it easiest to complete it using a laptop or a tablet with a keyboard (not a phone).

Answer these questions in Canvas, in the Open Reading Frames Quiz.

By now you are familiar with the pGLO plasmid and how it's used in lab. In this exercise, you'll look more closely at the nucleotide sequence of pGLO to identify some protein-coding genes.

Background

Finding genes

It's not very difficult to isolate some DNA and have it sequenced. In Bio 6B, you purify a couple of different plasmids. You could easily send these off to be sequenced commercially; you could probably get one of your plasmids sequenced for under $50. Or, if we had sequencing equipment, you could do it yourself. In the case of pGLO, though, we don't need to get it sequenced because it has already been done. We can just look it up.

However, once you have a DNA sequence, how do you figure out what it means? In the case of pGLO, you know that the plasmid contains some important sequences: some are noncoding (such as the origin of replication), and some code for proteins. Here is a pGLO map from the virtual digest page:

pGLO restriction map

To make a map like this, you would have to start with the nucleotide sequence, analyze the sequence to find the particular regions you want to highlight, and then make a simplified map that shows the locations of important features but doesn't show the actual nucleotide sequence. There are three protein-coding genes shown in this map: GFP (Green Fluorescent Protein), araC (a protein that regulates the GFP operon), and AmpR (the ampicillin resistance gene, also known as β-lactamase).

In this assignment, you'll learn how to use online bioinformatics tools to analyze a nucleotide sequence and find the protein-coding genes. You'll use pGLO as an example, but the process is similar for any DNA sequence, whether it's a plasmid, the genome of a newly discovered bacterial species, or the human genome. Obtaining a nucleotide sequence is only the beginning; the greater challenge is to figure out the biological meaning of that sequence.

Open reading frames

To find protein-coding genes, you need to start by finding open reading frames. (In this section, I will use the word "gene" to mean a gene that codes for a protein; as you know, there are also genes that get transcribed into RNA but don't code for proteins.)

Suppose this is a DNA nucleotide sequence:

5' ACCGCATGTCTCGGATGAAAAGCTGGGGATAGAAGCTA 3'

Note that nucleotide sequences are always given from 5' to 3'. The actual DNA would be double-stranded, like this:

5' ACCGCATGTCTCGGATGAAAAGCTGGGGATAGAAGCTA 3'
3' TGGCGTACAGAGCCTACTTTTCGACCCCTATCTTCGAT 5'

If you want to try to find a protein-coding gene in this DNA sequence, you should start by finding the potential RNA sequence that could be transcribed from the DNA. In a plasmid or a chromosome, either strand could potentially be used as a template for transcription. In this example, if the bottom strand is transcribed, the resulting RNA sequence would be this:

5' ACCGCAUGUCUCGGAUGAAAAGCUGGGGAUAGAAGCUA 3'

Note that this is the same as the upper strand of DNA, but with U instead of T.
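The relationship between the two DNA strands and the RNA can be checked with a few lines of Python. This is only a sketch; the function names reverse_complement and transcribe are made up for this illustration. It builds the bottom strand from the top strand, "transcribes" the bottom strand, and confirms that the RNA matches the top strand with U in place of T.

```python
# Watson-Crick pairing for DNA bases.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def transcribe(template):
    """Return the RNA (5' to 3') copied from a template strand given 5' to 3'."""
    return reverse_complement(template).replace("T", "U")

top = "ACCGCATGTCTCGGATGAAAAGCTGGGGATAGAAGCTA"
bottom = reverse_complement(top)  # bottom strand, read 5' to 3'
rna = transcribe(bottom)

print(rna)                           # ACCGCAUGUCUCGGAUGAAAAGCUGGGGAUAGAAGCUA
print(rna == top.replace("T", "U"))  # True
```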

The next step is to ask whether that hypothetical messenger RNA sequence could potentially code for a protein. If the sequence is an mRNA and gets translated, the translation process will read the nucleotide sequence as codons, with three nucleotides per codon. There are three ways to read this sequence into codons:

5' ACC GCA UGU CUC GGA UGA AAA GCU GGG GAU AGA AGC UA 3'

5' AC CGC AUG UCU CGG AUG AAA AGC UGG GGA UAG AAG CUA 3'

5' A CCG CAU GUC UCG GAU GAA AAG CUG GGG AUA GAA GCU A 3'
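If you want to see the three groupings generated rather than typed by hand, here is a minimal Python sketch. The function name reading_frames is hypothetical. Each frame is defined by how many nucleotides are skipped at the start (0, 1, or 2), so the frames may print in a different order than the three lines above.

```python
def reading_frames(rna):
    """Return three lists of codons, one per reading frame.

    Frame n skips the first n nucleotides (n = 0, 1, 2). Leftover
    nucleotides at the end that do not fill a codon are dropped.
    """
    frames = []
    for offset in range(3):
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames

rna = "ACCGCAUGUCUCGGAUGAAAAGCUGGGGAUAGAAGCUA"
for offset, codons in enumerate(reading_frames(rna)):
    print(offset, " ".join(codons))
```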

These groupings are called reading frames. For each mRNA, there are three possible reading frames, but only one is used for a particular protein. The correct reading frame is determined during translation by the presence of a start codon, which is the location where an initiator tRNA binds to the mRNA during formation of the translation initiation complex. The start codon in the mRNA is usually AUG, for both eukaryotes and bacteria. Translation begins at a start codon and continues until a stop codon (UAG, UGA, or UAA) is reached. A nucleotide sequence that begins with a start codon and ends with a stop codon is called an open reading frame (ORF), and could potentially code for a protein. For the sequence shown above there is one open reading frame:

5' ACCGC AUG UCU CGG AUG AAA AGC UGG GGA UAG AAGCUA 3'

The start codon, AUG, is shown in green; the stop codon, UAG, is shown in red. Note that the start codon is not at the beginning of the mRNA sequence; there is a 5' untranslated region before the start codon as well as a 3' untranslated region after the stop codon. This is always true for mRNAs. The regions before and after the ORF are not shown grouped into codons, because they are not translated.
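The search for an ORF in a single RNA sequence can be sketched in Python as follows. This is an illustration of the idea only, not the method any particular tool uses; find_orfs is a made-up name. In each reading frame, the code starts an ORF at the first AUG it meets and closes it at the next stop codon in the same frame, so an AUG inside an ORF that is already open (like the second AUG in this example) does not start a separate ORF.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(rna):
    """Return (start, end, sequence) for each ORF in one RNA strand.

    Positions are 1-based and include the stop codon. An AUG with no
    stop codon after it in the same frame is not reported.
    """
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == "AUG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start + 1, i + 3, rna[start:i + 3]))
                start = None                   # close it at the stop codon
    return orfs

rna = "ACCGCAUGUCUCGGAUGAAAAGCUGGGGAUAGAAGCUA"
print(find_orfs(rna))
# [(6, 32, 'AUGUCUCGGAUGAAAAGCUGGGGAUAG')]
```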

Finding genes in a nucleotide sequence

All protein-coding genes are encoded by ORFs, but not all ORFs encode proteins. In principle, the ORF shown above could encode a polypeptide. However, most proteins are much longer, so this one would probably not be a real gene. Some ORFs exist just by chance, but aren't actually used by the cell to make proteins. In general you can ignore the very short ones. It makes sense to start analyzing a nucleotide sequence by finding the longest open reading frames and analyzing those first.

Starting with a DNA nucleotide sequence, the basic procedure for finding protein-coding genes goes like this:

  1. Figure out the possible RNA sequences. Since DNA is double-stranded, either strand could be used as a template, so there are two possible RNA sequences for each DNA sequence. In transcription, only a small part of the genome is copied into RNA, but you can start by pretending that the entire sequence is transcribed.
  2. Search the possible RNA sequences for ORFs as described above. (Find all the AUG start codons, and follow the reading frame until you hit a stop codon.)
  3. Determine the amino acid sequences of the hypothetical polypeptides encoded by the ORFs. This is described in the next section.
  4. Analyze the hypothetical amino acid sequences to see if they look like real proteins. This is described in the next section.

This is clearly a job for a computer. In this exercise, you'll explore some software that was designed to do this kind of analysis.
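To give a feel for what such software does, here is a rough Python sketch of steps 1 and 2. It is not how ORF Finder itself is implemented, and the names find_orf_candidates and orfs_in_strand are invented for this example. It works directly in the DNA alphabet (ATG instead of AUG), searches both strands, and lists ORFs from longest to shortest. One simplifying assumption to keep in mind: the sequence is treated as linear, so an ORF that runs across the point where a circular plasmid sequence was "cut" would be missed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_len):
    """Find ORFs (ATG to first in-frame stop) in all three frames of one strand."""
    found = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:   # length includes the stop codon
                    found.append(seq[start:i + 3])
                start = None
    return found

def find_orf_candidates(dna, min_len=300):
    """Return (strand, length, sequence) tuples, longest ORF first."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for orf in orfs_in_strand(seq, min_len):
            results.append((strand, len(orf), orf))
    return sorted(results, key=lambda item: item[1], reverse=True)

example = "ACCGCATGTCTCGGATGAAAAGCTGGGGATAGAAGCTA"
print(find_orf_candidates(example, min_len=0))
# [('+', 27, 'ATGTCTCGGATGAAAAGCTGGGGATAG')]
```

With the default minimum length of 300 nucleotides, the short example sequence gives an empty list, which matches the advice above to ignore very short ORFs.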

Keep in mind that you are looking at bacterial genes in this exercise. Analyzing eukaryotic genes could be considerably more complex due to the presence of introns.

Analyzing hypothetical polypeptide sequences

Once you've found an ORF, you can use the genetic code to translate that DNA or RNA sequence into a possible amino acid sequence:

5' AUG UCU CGG AUG AAA AGC UGG GGA UAG 3' (codons in mRNA)
   Met Ser Arg Met Lys Ser Trp Gly Stop   (3-letter amino acid codes)
    M   S   R   M   K   S   W   G  Stop   (1-letter amino acid codes)

Now you have an amino acid sequence for a hypothetical protein. (Note that the codon AUG can be a start codon or a methionine within the sequence. Also keep in mind that this hypothetical amino acid sequence is much too short to be a real protein.) I am showing both the 3-letter amino acid codes and the 1-letter codes, but the 1-letter codes are always used for sequences. How can you find out if that hypothetical amino acid sequence represents a real protein, and what protein it might be? The simplest approach is to compare the hypothetical amino acid sequence to a database of known protein sequences to see if there is a match. Even if you discovered a new protein that has never been found before, there is a good chance that it is similar to some other known proteins.
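The translation step is easy to sketch in Python. The code below builds the standard genetic code from a compact 64-letter string (codons ordered with U, C, A, G at each position; * marks a stop codon) and translates an ORF into 1-letter amino acid codes. The function name translate is just for this illustration, and the sketch assumes the input starts at the start codon and has a length that is a multiple of three.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; * = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(orf):
    """Translate an ORF (DNA or RNA letters) up to the first stop codon."""
    rna = orf.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break                      # stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGUCUCGGAUGAAAAGCUGGGGAUAG"))  # MSRMKSWG
```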

This type of comparison between two or more sequences is called an alignment. It's going to require a lot of computing power. Luckily, you can access some extremely powerful computers online for free through NCBI, the National Center for Biotechnology Information, which is part of the National Institutes of Health (NIH). The software you'll use is called BLAST, for Basic Local Alignment Search Tool.

The assignment

The page you're on now goes with an assignment in Canvas, called Open Reading Frames in pGLO (I can't make a direct link to it; you'll have to find it in Canvas). Read this page, answer the questions you see here, and put your answers into the Canvas quiz.

The basic process will go like this:

  1. Copy the pGLO DNA sequence and paste it into ORF Finder to identify open reading frames.
  2. Determine the amino acid sequences encoded by those ORFs and use BLAST to compare these hypothetical proteins to a database of known proteins.
  3. Figure out which proteins are encoded by these ORFs.

While many ORFs in a sequence don't actually code for proteins, in this example we'll focus on a few that do.

Find open reading frames in pGLO

Get the nucleotide sequence of the pGLO plasmid from the Sequence Data page (I called it "pGLO Original sequence from Bio-Rad"; open it in a new tab). Copy the entire nucleotide sequence from the gray box, including the numbers.

Go to the Open Reading Frame Finder at NCBI and paste the pGLO sequence in the box. Under "Choose Search Parameters," set the minimum ORF length to 300. A minimum ORF length of 300 nucleotides corresponds to a polypeptide sequence of 99 amino acids (3 nucleotides per codon, and the last codon is the stop codon). Most proteins are bigger than that. Check "Ignore Nested ORFs." This will ignore any reading frame that is completely inside another. It's not impossible for proteins to be coded that way, but it's not common. Leave the other options as they are. Click Submit to search for ORFs within the pGLO sequence.
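The conversion between ORF length in nucleotides and protein length in amino acids is simple arithmetic, shown here as a tiny Python sketch (the function name is made up). It assumes the ORF length includes the stop codon, as in the explanation above.

```python
def orf_length_in_amino_acids(orf_nucleotides):
    """Convert ORF length (nucleotides, stop codon included) to amino acids."""
    if orf_nucleotides % 3 != 0:
        raise ValueError("an ORF should contain a whole number of codons")
    codons = orf_nucleotides // 3
    return codons - 1  # the stop codon does not add an amino acid

print(orf_length_in_amino_acids(300))  # 99
```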

You should see something like this (an example using a different sequence):

ORF viewer

At the top, there is a map of the nucleotide sequence, showing where the ORFs are located within the sequence. The map isn't very useful in this case because it displays a circular plasmid sequence as if it's linear. At the bottom right is a list of ORFs, in order from longest to shortest. In this example, ORF3 is at the top of the list because it's the longest one. The length of ORF3 is 816 nucleotides, or 271 amino acids. The amino acid sequence of the hypothetical protein encoded by ORF3 is shown in the box at left; it begins MSHIQ... et cetera.

Now look at the Open Reading Frame Viewer for the pGLO sequence that you just entered, and start analyzing the ORFs from the pGLO plasmid.

Longest ORF (ORF 4)

Click on the top ORF in the list (if you selected the options listed above, it will be called ORF4). The following questions apply to this ORF.

Question 1: How many amino acids long is the protein encoded by the longest ORF (called ORF4) in pGLO? (The answer is to the right of "ORF4"; it says Length in aa, or amino acids.)

In the left box you'll see the predicted amino acid sequence for that ORF. Is that a real protein? Click on SmartBLAST to compare this sequence to a database of all known protein sequences. That will take you to a new SmartBLAST tab in your browser. At the top of that page there is a cladogram, showing how your query sequence is related to other similar sequences. In the section called "Best hits" you'll see a short list of protein sequences that closely match the ORF sequence from pGLO. The top hit is probably exactly the same as your ORF sequence, as shown by an Ident score of 100%; this tells you that your ORF corresponds to a real protein that has been sequenced and described. Here's an example from a different plasmid sequence:

Best hits screenshot 2

In the example above, the best hit is called "MFS transporter [Bacillus cereus]," and it is 100% identical to the query sequence that was used in the search. The ORF sequence used in that example is exactly the same as the MFS transporter.

Now answer these questions about the longest ORF from your pGLO search.

Question 2: What protein does the amino acid sequence of ORF4 most closely resemble? In your SmartBLAST result, look at the "Best hits" box and you should see that the top few hits have similar names.

Question 3: What is the molecular mass of the protein encoded by ORF4? Go back to the Open Reading Frame Viewer page and copy the amino acid sequence and then go to the sequence manipulation site at bioinformatics.org. Paste the sequence into the box and hit Submit. You'll get a result in kiloDaltons (abbreviated kD or kDa); Daltons are molecular mass units, essentially the same as the molecular mass units you use in chemistry class. Note that this molecule is a lot bigger than the formula weights that you've used in chemistry; they call them macromolecules for a reason.
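If you're curious how a molecular mass can be estimated from a sequence, here is a rough Python sketch. It adds up the average mass of each amino acid residue (the amino acid minus one water, since a water is lost for each peptide bond) and then adds one water back for the two ends of the chain. This is an approximation under those assumptions; the online tool may use slightly different mass values, so expect small differences. The function name molecular_mass_kda is hypothetical.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water for the free amino and carboxyl ends

def molecular_mass_kda(protein):
    """Estimate the molecular mass (kDa) of a protein given 1-letter codes."""
    daltons = sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER
    return daltons / 1000

# The 8-amino-acid example peptide from the Background section: about 0.98 kDa.
print(round(molecular_mass_kda("MSRMKSWG"), 2))
```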

You might wonder why it's relevant to know the molecular mass of a protein. The answer is that in the lab, we usually try to find proteins by separating them on SDS-PAGE electrophoresis gels, which separate proteins based on molecular mass. If we know the molecular mass of a protein, we can look for it on the gel.

Second-longest ORF (ORF 2)

Follow the same steps for the next ORF in the list.

Question 4: What protein does the amino acid sequence of ORF2 most closely resemble? (Use SmartBLAST to perform a search.)

Question 5: What is the molecular mass of the protein encoded by ORF2, the second-longest ORF in pGLO?

Third-longest ORF (ORF 1)

Repeat these steps again for the next ORF in the list.

Question 6: How many amino acids long is the protein encoded by ORF1?

Now try to identify that protein. For questions 2 and 4 (identifying the other ORFs), you clicked on SmartBLAST to compare the ORF amino acid sequence with known sequences in a database. If you do that for ORF1, you might get some confusing results (go ahead and try it). A better approach for this one is to go to the ORF Finder page, click on BLAST instead, and then answer the next question.

Question 7: What protein does the amino acid sequence of ORF1 most closely resemble?

The reason you get some confusing results for this one is that researchers often use this protein to make fusion proteins: two proteins fused together as a research tool. The ORF1 protein is often used as a reporter protein, fused to another protein to make it easier to study the other protein. Therefore, BLAST will come up with a very diverse group of proteins that are related to ORF1.

Question 8: What is the molecular mass of the protein encoded by ORF1, the third-longest ORF in pGLO?

In Bio 6B lab, you'd normally run a gel specifically looking for this protein; knowing the molecular mass would tell you where to look on your gel.

Leave the tab with the Open Reading Frame Viewer open; you're going to want that amino acid sequence in a minute.

A mutated pGLO sequence

Go back to the Sequence Data page and find the sequence of Non-Fluorescent mutant pGLO. This mutated version of pGLO was created in our lab by Bio 6B/Special Projects students, but it didn't turn out to be exactly what we expected.

Go to the Open Reading Frame Finder at NCBI again (open it in a new tab) and paste the mutant pGLO sequence in the box. (Use a new browser tab so you can keep the original sequence in the other tab.) Using the same parameters as before (minimum ORF length 300 nucleotides), find the ORFs in this sequence. The two longest ORFs should look familiar; they are the same as the longest ORFs you found in the unmutated sequence. Focus on the third-longest (called ORF1, starting at nucleotide 1342 and ending at 1746). The amino acid sequence begins with MASKG.

Question 9: How many amino acids long is the protein encoded by ORF1 in the Non-Fluorescent mutant pGLO? (The answer is to the right of "ORF1"; it says Length in aa, or amino acids.)

After you answer this question, compare your answer to #6. They're both the third-longest ORF in their versions of pGLO, but the length of the amino acid sequences is different. To see how these two sequences are related, go to the next step.

Leave the tab with the Open Reading Frame Viewer open; you're going to want that amino acid sequence in a minute.

Perform a protein sequence alignment

Performing an alignment simply means lining up two or more sequences (nucleotide or amino acid) so you can compare them. Go to EMBOSS Needle Pairwise Sequence Alignment from EMBL-EBI. This page has two boxes for sequences. Enter the unmutated amino acid sequence of ORF1 from pGLO in the upper box and the mutated amino acid sequence of ORF1 from pGLO Non-Fluorescent Mutant in the lower box. (Note that there's a box showing that you're entering PROTEIN sequences, the default choice.) Click "Submit" at the bottom. It may take a minute or two to see your results. Eventually you should see your two sequences lined up, showing where they match and where they don't.

Once your alignment is complete, click on "View alignment file" to get a clean look at the output. It should look something like this (the sequences in this example are not from pGLO):

Alignment screenshot

At the top is some information about the alignment; for example, in this case, 140 out of 146 amino acids are identical. The aligned amino acid sequences are at the bottom. Where the two are identical, there is a vertical line connecting them. Where the sequences differ, there's a dot or a colon in place of the line.
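The identity count reported at the top of an alignment can be reproduced by walking along the two aligned sequences one position at a time. Here is a minimal Python sketch; the function name and the two short sequences are made up for illustration, and a dash is assumed to mark a gap where one sequence has no partner.

```python
def alignment_stats(aligned_a, aligned_b):
    """Count identical positions, mismatches, and gap positions.

    Both inputs must already be aligned (same length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = mismatched = gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identical += 1
        else:
            mismatched += 1
    return identical, mismatched, gaps

# Hypothetical aligned pair: the second sequence is shorter by two residues.
print(alignment_stats("MSRMKSWG", "MSRLKS--"))  # (5, 1, 2)
```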

Now look at the alignment you just created. You should see that one amino acid sequence is significantly longer than the other (in questions #6 and #9 you gave the length of each of these amino acid sequences). In your alignment, focus on the region where the two proteins overlap and have nearly identical sequences.

Question 10: How many amino acids long is the region where the two amino acid sequences overlap?

Question 11: Within the region of sequence overlap, how many amino acid differences (mismatches) are there? (In other words, how many amino acids are not identical between the two sequences? Ignore the fact that one sequence is much longer than the other.)

What do you think was changed in the DNA sequence to make this change in the amino acid sequence? You can address that by comparing the nucleotide sequences.

Perform a DNA sequence alignment

By now I hope you've realized that the protein sequence you've been looking at is GFP, Green Fluorescent Protein. In the previous question, you compared mutated and unmutated versions of this protein. The mutated, non-fluorescent version of the protein is shorter. Now you can investigate the DNA change that resulted in the shorter amino acid sequence by performing an alignment of the mutated and unmutated nucleotide sequences. You've already searched the entire sequences of these plasmids for ORFs, and found the protein-coding genes. For this question, you can focus only on the regions of DNA that encode the GFP proteins.

>Unmutated GFP coding sequence:

ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCAGTGGAGAGGGTGAAGGTGATGCTACATACGGAAAGCTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCCAACACTTGTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCCCGTTATCCGGATCATATGAAACGGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAACGCACTATATCTTTCAAAGATGACGGGAACTACAAGACGCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATCGTATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTCGGACACAAACTCGAGTACAACTATAACTCACACAATGTATACATCACGGCAGACAAACAAAAGAATGGAATCAAAGCTAACTTCAAAATTCGCCACAACATTGAAGATGGATCCGTTCAACTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCGACACAATCTGCCCTTTCGAAAGATCCCAACGAAAAGCGTGACCACATGGTCCTTCTTGAGTTTGTAACTGCTGCTGGGATTACACATGGCATGGATGAGCTCTACAAATAA

>Non-Fluorescent Mutant GFP coding sequence:

ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCAGTGGAGAGGGTGAAGGTGATGCTACATACGGAAAGCTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCCAACACTTGTCACTACTTTCTCTCATGGTGTTCAATGCTTTTCCCGTTATCCGGATCATATGAAACGGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAACGCACTATATCTTTCAAAGATGACGGGAACTACAAGACGCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATCGTATCGAGTTAAAAGGTATTGATTTTAAAGAATATTGA

Once again, go to EMBOSS Needle Pairwise Sequence Alignment. This time, paste the two GFP coding region nucleotide sequences above into the two boxes. (You can include the sequence name, such as ">Unmutated GFP coding sequence:". As long as the line starts with >, it is recognized as a name.) Check the box that says "Enter a pair of DNA Sequences," hit Submit, and wait to see the resulting alignment. Once the alignment is complete, click on "View Alignment File" to get a cleaner look.

Keep in mind that you are now looking at protein-coding sequences.

Question 12: The first three nucleotides are identical for the two sequences. What codon is this? Keep in mind that these are DNA sequences, not RNA; you can use the DNA codon table in Wikipedia. (Campbell's table shows RNA codons.)

Note that there are some differences between the sequences (aside from the fact that one is much shorter than the other). Remember that the shorter one is a mutated version of the longer one.

Question 13: How many nucleotide differences (mismatches) are there in the region where the two sequences overlap? (Ignoring the fact that one sequence is much longer than the other.) This question is like a previous one, but now it's about the nucleotide alignment, not the amino acid alignment. Look closely; it's easy to miss a mismatch.
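Because it's easy to miss a mismatch by eye, you could double-check your count with a short script. The Python sketch below compares two sequences position by position over the region where they overlap and lists each difference. It assumes both sequences start at the same nucleotide and differ only by substitutions; if there were insertions or deletions, you would need a real alignment tool such as Needle. The function name and the two short test sequences are invented for this example.

```python
def list_mismatches(seq_a, seq_b):
    """Return (position, base_in_a, base_in_b) for each mismatch.

    Positions are 1-based. Only the overlapping region is compared,
    so any extra length in the longer sequence is ignored.
    """
    overlap = min(len(seq_a), len(seq_b))
    return [
        (i + 1, seq_a[i], seq_b[i])
        for i in range(overlap)
        if seq_a[i] != seq_b[i]
    ]

# Hypothetical sequences: one substitution at position 6, second one shorter.
print(list_mismatches("ATGTCTCGGATG", "ATGTCACGG"))  # [(6, 'T', 'A')]
```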

For the next two questions, focus on the last mismatch between the two nucleotide sequences, just before the end of the mutated sequence (nucleotide 403). This mismatch results from one of the mutations that Bio 6B students put into the plasmid, using a technique called Site-Directed Mutagenesis. Compare the last codon (the last 3 nucleotides) of the mutated sequence to the corresponding three nucleotides of the unmutated sequence.

For the next questions, you might want to refer to Campbell, 17.5: Mutations of one or a few nucleotides can affect protein structure and function. Alternatively you can find some information on mutations in Wikipedia.

Question 14: What kind of mutation is this (at nucleotide #403): substitution, insertion, or deletion? For this question, you're comparing the last three nucleotides of the mutated (shorter) nucleotide sequence to the corresponding nucleotides of the unmutated sequence.

Question 15: What kind of effect does this mutation have (at nucleotide #403): silent, missense, or nonsense?
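The three kinds of effect can be stated as a simple rule: translate the original codon and the mutated codon, then compare the results. Here is a Python sketch of that rule, using DNA codons and made-up example codons (not the ones from pGLO). The function name classify_substitution is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG order; * = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_substitution(original_codon, mutated_codon):
    """Classify the effect of a substitution within one DNA codon."""
    before = CODON_TABLE[original_codon.upper()]
    after = CODON_TABLE[mutated_codon.upper()]
    if before == after:
        return "silent"            # same amino acid (or still a stop)
    if after == "*":
        return "nonsense"          # amino acid codon became a stop codon
    if before == "*":
        return "stop codon lost"   # not one of the three categories above
    return "missense"              # one amino acid replaced by another

print(classify_substitution("GAA", "GAG"))  # silent (Glu to Glu)
print(classify_substitution("GAA", "GAC"))  # missense (Glu to Asp)
print(classify_substitution("CAA", "TAA"))  # nonsense (Gln to stop)
```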

For the last few questions, I asked you to examine only the coding sequence of the GFP gene. The mutated coding sequence is shorter than the original sequence, but the overall plasmid sequences (pGLO original and pGLO Non-Fluorescent Mutant) are the same length (which you can see on the sequence data page). So how is the coding sequence shorter if the overall plasmid sequence isn't shorter? That's what the last two questions are asking you. I hope you can answer that for yourself.

Before you leave the nucleotide alignment, look at one more mutation. At nucleotide 199, the original, unmutated sequence has T, while the mutated version has C. This is a substitution mutation, but how does it affect the amino acid sequence of the protein? In order to answer the question, you need to know the correct reading frame. Since you already know that these are coding sequences, you can simply divide them into codons starting at the beginning. The codons including nucleotide #199 will look like this:

Unmutated: TAT

Mutated: CAT

Question 16: What kind of effect does this mutation have (at nucleotide #199): silent, missense, or nonsense?

You can also see the effect of this mutation in the amino acid alignment you performed earlier:

ORF1 Mutated      51 TGKLPVPWPTLVTTFSHGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIS    100
                     ||||||||||||||||:|||||||||||||||||||||||||||||||||
ORF1 Unmutated    51 TGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIS    100

The mutation at nucleotide #199 corresponds to an amino acid mismatch at amino acid #67. The unmutated version has Y (Tyrosine), while the mutated version has H (Histidine). The GFP mutation can be abbreviated as Y67H. You certainly don't need to memorize this now, but you'll see the same terminology if you compare the various mutant strains of coronavirus. If you were given various coronavirus genomes and wanted to compare them to see what effect the mutations might have, you'd start by finding the ORFs and looking at how the mutations will affect specific proteins.
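The step from "nucleotide #199" to "amino acid #67" is just division by three, and the abbreviation follows from translating the two codons. Here is a small Python sketch of that bookkeeping. The function names are made up, and it assumes nucleotide 1 is the first base of the start codon, as in the coding sequences above.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG order; * = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codon_number(nucleotide_position):
    """Return the 1-based codon that contains a 1-based nucleotide position."""
    return (nucleotide_position - 1) // 3 + 1

def mutation_name(original_codon, mutated_codon, nucleotide_position):
    """Build an abbreviation such as Y67H from the two DNA codons."""
    number = codon_number(nucleotide_position)
    return f"{CODON_TABLE[original_codon]}{number}{CODON_TABLE[mutated_codon]}"

print(codon_number(199))                 # 67
print(mutation_name("TAT", "CAT", 199))  # Y67H
```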

Review

Some of the concepts from this page might turn up on the lab final, even though there's no "wet" lab work for this one.

Terms and concepts

  • Alignment
  • Codon
  • Mutations: insertion, deletion, substitution
  • Mutations: silent, missense, nonsense
  • Start codon
  • Stop codon
  • Reading frame
  • Open reading frame (ORF)

Review questions

  1. What is a reading frame? How many possible reading frames are there for a given DNA nucleotide sequence? How many possible reading frames are there for a given RNA nucleotide sequence?
  2. What is an open reading frame? How would you recognize an open reading frame without using software?
  3. What does ORF Finder do?
  4. If you have already identified a particular protein by SDS-PAGE, how would you identify the ORF that encodes that protein?
  5. How would you know if a particular ORF actually encodes a protein?
  6. What does BLAST do?
  7. In this exercise, you determined the molecular mass of some proteins encoded by ORFs on pGLO. In principle, you might see these proteins on your SDS-PAGE gel after HIC. You're culturing your pGLO-transformed cells with and without arabinose; would you expect these proteins to be the same in both +arabinose and -arabinose samples?
  8. In lab and in this sequence analysis exercise, you have compared several versions of the pGLO plasmid. How are the mutant versions of pGLO different from the original version of pGLO?

References

General background

Wikipedia: Genetic code; Start codon; Open reading frame.

Bioinformatics tools

NCBI Open Reading Frame Finder 

Bioinformatics.org Sequence Manipulation Suite. This page hosts a variety of browser-based tools for manipulating and analyzing sequence data.

BLAST Help. You probably don't need this, but this page gives you links to detailed explanations of all the information you see on the BLAST pages.

ExPASy Bioinformatics Resource Portal. From the Swiss Institute of Bioinformatics, another site with a range of useful tools for bioinformatics.

Benchling. A suite of bioinformatics tools. Benchling requires you to make a free account, and it takes a little while to learn the interface, but it's powerful and convenient. It's what I usually use for sequence analysis.

Sequences

pGLO

pARO191. This plasmid is closely related to pARO180, the plasmid we use in the conjugation lab. Unfortunately, the sequence of pARO180 is not available, so I rely on pARO191 as an approximation. This is the example I used on this page.

lambda

DNA sequencing services

Just in case you're curious about how to get DNA sequencing done.

MCLab. MCLab is not affiliated with McCauley, but I like the name!

Going deeper

Why is start codon selection so precise in eukaryotes?

Start Codon Recognition in Eukaryotic and Archaeal Translation Initiation: A Common Structural Core. In case you thought the process was too simple.

 
