Exercise #6 - BIOINFORMATICS

 

Genomics is the study of biological problems by using genetic information derived from whole genomic sequences rather than from studying a small number of genes (as done in conventional genetics).  All of the information for determination of physical characteristics, disease susceptibility, and certain behavioral traits resides in the nucleotide sequence making up the genome of an organism.  Through the use of automated DNA sequence analysis, it has been possible over the past 15 years to determine the nearly complete genomic sequence of a number of microbial, animal, and plant species, including man.  Two of the most remarkable discoveries of the genomics era are the observation that even complex organisms have relatively few protein-coding genes (20,000-30,000) and that many of these genes encode products that are structurally and functionally conserved among all organisms.  Genomic information has allowed investigators to begin studying groups of gene products that work together to determine particular biochemical pathways and phenotypes.  This field is called functional genomics or proteomics and will be a major focus of scientific endeavor for at least the next decade.  Since techniques exist that allow investigators to add or subtract specific genes in model organisms such as yeast, Arabidopsis, fruit fly, zebrafish, and mouse, rapid progress is being made toward understanding the function of all genes and the interactions between their protein products.

 

Bioinformatics is an emerging discipline that uses computational techniques to present and analyze genomic and proteomic information.  The vast amount of information contained in a complex eukaryotic genome (~3 × 10^9 bp) is too great for the human mind to comprehend without the assistance of a computer.  Computer databases are used to store the large amounts of sequence information that are now available and to construct linear genomes from the overlapping short sequences produced by automated sequencing of individual cloned DNA fragments.  Programs exist that allow anyone to rapidly compare an unknown DNA sequence with all of the available information contained in a large government-supported database known as GenBank.  For example, the sequence of a PCR fragment derived from a mutant form of a human gene that is associated with a particular genetic disease can be compared with GenBank sequences to identify the corresponding wild-type gene.  GenBank files contain specific contiguous DNA sequences with information detailing open reading frames that putatively code for proteins.  If the sequence is known to code for specific mRNA molecules, the start and stop points and introns are noted and the open reading frames are virtually translated into protein sequences using the genetic code.  GenBank files also often contain information about the function of the gene product(s) and links to literature sources detailing the original studies that have been done on the gene.  Separate databases exist that catalog all known protein sequences and structures.  It has been learned that many different proteins share common structural features that are the result of evolutionary conservation of functional elements.  For example, nucleotide-binding domains can often be identified within proteins encoded by newly discovered genes through similarity with other proteins that are known to bind nucleotides.  Similarly, much can be “guessed” about the structure of a protein encoded by a newly discovered gene by comparing the computed sequence with that of other proteins whose structures have been solved by X-ray crystallography.  Proteomics tools exist that can rapidly detect putative sites for protein modifications such as phosphorylation, glycosylation, and proteolytic cleavage.  Signal peptides, membrane-spanning domains, and overall protein stability can also be predicted using only primary DNA sequence information as a starting point.
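
The "virtual translation" of an open reading frame described above can be illustrated with a minimal Python sketch.  It is not how GenBank itself annotates records; it simply assumes the standard genetic code, starts at the first ATG it finds, and stops at the first in-frame stop codon.  The function name translate_orf and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(dna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns the protein as one-letter amino acid codes, or an empty
    string if the sequence contains no ATG. Codons containing
    characters other than A, C, G, T are reported as 'X'.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the reading frame
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example: ATG GCC AAA TGA -> Met, Ala, Lys, stop
print(translate_orf("ccATGGCCAAATGAtt"))
```

A real gene finder would also consider the reverse strand, all three frames on each strand, and introns; this sketch only shows the codon-by-codon lookup.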

 

In this exercise, you will use on-line genomics and proteomics tools to analyze a random sequence of yeast DNA that was cloned in Exercise #5.  Most programs used for genetic analysis are proprietary and quite expensive.  However, many individual programs exist in the public domain, and in this exercise you will use software that is available to anyone in the world at no cost.  The yeast DNA sequence that you will analyze for your lab report will be sent to you by e-mail so that you can work on your own time.  All of the sequences that will be sent out for this exercise have been pre-screened by your instructors and are known to contain at least part of a protein-coding region.  You will determine the identity of the unknown DNA fragment by comparison to the yeast genome.  The protein-coding region of the gene represented by your cloned fragment will be collected and translated to identify all open reading frames.  Finally, your sequence will be compared with the well-studied human genome in order to determine if there is a homologous sequence in man.

 

Before you begin, it might be helpful to copy the DNA sequence for the unknown gene from the e-mail message and paste it into a text editor or word processing program.  That way you can always have the sequence available for pasting into other programs. 
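
Sequences copied out of an e-mail message often pick up line breaks, spaces, position numbers, or a FASTA-style header line.  The web forms used below generally tolerate this, but if you want a clean copy, a small Python sketch like the following can tidy the text first.  The function name clean_sequence and the sample input are hypothetical, and the sketch assumes the sequence uses only the letters A, C, G, T, and N.

```python
import re


def clean_sequence(raw):
    """Return a pasted DNA sequence as one uppercase string.

    Lines starting with '>' (FASTA headers) are dropped, and spaces,
    digits, and line breaks are removed. Raises ValueError if any
    letter other than A, C, G, T, or N remains.
    """
    lines = [
        line for line in raw.splitlines()
        if not line.lstrip().startswith(">")
    ]
    sequence = re.sub(r"[^A-Za-z]", "", "".join(lines)).upper()
    unexpected = set(sequence) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return sequence


# Hypothetical pasted text with a header, position numbers, and spaces
pasted = ">clone 7\n  1 atgc gtta\n 11 ccng\n"
print(clean_sequence(pasted))
```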

 

Step I. Conduct a BLASTN similarity search to identify the cloned fragment

1).  Go to the National Center for Biotechnology Information (NCBI) website at http://www.ncbi.nlm.nih.gov/. This is the central location for the largest single collection of biological information in the world. Scores of databases, dozens of books, thousands of journal articles and billions of bases of DNA can all be searched from this page.

 

2).  Click BLAST near the top center of the page. There are several versions of BLAST, but the most popular is from NCBI.

 

3).  Click Nucleotide-nucleotide BLAST (blastn). Paste your sequence in the box labeled ‘Search’. Select the ‘nr’ database, and click the ‘BLAST’ button.

 

4).  The next page gives you many options for viewing the output of your search. For now, leave them all as they are, and click the ‘format’ button. Scroll down until you find the graph marked ‘Distribution of BLAST hits on the query sequence’. This shows you where the hits match along the query sequence.

 

5).  Print out the graph, the list of similar sequences and the first 4 sequence alignments for your lab report.

 

6).  Find the colored boxes to the right of some of the similar sequences. These links connect to some of the specialized NCBI databases. The green G box, when available, gives you general gene information. The purple box with an ‘E’ in the middle links to a Gene Expression database, the light blue ‘U’ links to the Unigene database, and so on. Click some of these boxes to find out more information about your sequence.

 

7).  Click the link to ‘Taxonomy reports’, above and to the left of the graph. This opens a new window with a list of hits broken down by lineage.

 

8).  Print out the Lineage report for your lab report.

 

Look through the BLAST FAQs and use them as needed to help answer the questions below.

Questions:

 

a)   What is the E-value for the hit with the highest degree of similarity to the query sequence? What does this value mean?

 

b)   Is your cloned sequence IDENTICAL to any sequences in GenBank? What are the differences between your sequence and the yeast sequences in GenBank? What are possible sources of discrepancies between your sequence and yeast sequences in GenBank?

 

c)   What gene(s) does your cloned sequence represent?

 

d)   What is the putative function of your gene if known?

 

e)   Does your cloned sequence share significant similarity with any non-yeast genes? Are their functions (if known) similar?

 

 

Step II. Translated BLAST

 

1).  Return to the main BLAST page and select the BLASTX search, ‘Translated query vs. protein database (blastx)’.

 

2).  Paste your sequence in the box, and click the ‘BLAST’ button.

 

3).  Click the ‘format’ button.

 

4).  The BLASTX search has translated your query sequence into all 6 possible reading frames, 3 forward and 3 reverse. Scroll down to look at the matches and you will see which translation frame is being used for each hit. For example, ‘Frame = +2’ indicates that the match was found in the second forward reading frame. The frame of the best match is shown with the first sequence alignment.  Record the name of the protein that has the highest similarity to your query (the first match).

 

5).  Print out the graph, the list of similar sequences and the first 4 sequence alignments for your lab report.

 

6).  Find the colored boxes to the right of some of the similar sequences. Click some of these boxes to find out more information about your sequence.

 

7).  Click the link to ‘Taxonomy reports’, above and to the left of the graph.  Print out the Lineage report for your lab report.
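
To make the idea of six reading frames concrete, here is a minimal Python sketch of a six-frame translation, labeled in the same style as the ‘Frame = +2’ notation in the BLASTX output.  It is only an illustration under simple assumptions (standard genetic code, frames +1 to +3 read from the sequence as given, frames -1 to -3 read from its reverse complement); it is not the BLASTX program itself, and the function names and example sequence are made up.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (uppercase)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate every complete codon; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to translations."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


# Hypothetical example in which the coding sequence lies in frame +2
for label, protein in six_frame_translation("CATGGCCAAATGA").items():
    print(label, protein)
```

In this toy example only frame +2 begins with a methionine and ends at a stop codon (shown as *), which is the kind of pattern that distinguishes a likely coding frame from the other five.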

 

Questions:

a)   Are the top 4 hits identical to those that were found with the previous search? What differences do you find?

 

b)   Which method found the most matches?

 

c)   What frame is found in the top hit?

 

d)   Click the hyperlink to the top hit to find out more about it. At the top right corner of the resulting document, notice that there is a link to Domains. Click this link. What protein domains are found in this sequence?

 

e)   Is there a human homolog to this yeast gene?  If so, retain its name for Step III.

 

Step III.  Online Mendelian Inheritance in Man (OMIM) analysis

 

If your gene alignment indicates there is a human homolog to the cloned yeast gene, you can find out whether it is a gene associated with a known human disorder by using this site.

 

1).  Go back to the NCBI home page (http://www.ncbi.nlm.nih.gov/) and open the OMIM (Online Mendelian Inheritance in Man) database from there.

 

2).  Copy and paste (or type in) the name of your protein (for example:  glucose-6-phosphate dehydrogenase) into the "search for" box at the top of the OMIM page.

 

3).  You will get a list of accession numbers with a brief title and gene locus.  Click on the first one that has your complete protein name.

 

The entry to which you will be taken gives a full description of any human diseases caused by defects in your protein, its allelic variants, and a literature review: everything you want to know about the function, in humans, of the protein product of the gene you cloned.

 

 

Questions:

 

a)   What is the gene map locus of your gene?  Are there multiple loci?

 

b)   What disease (if any) is associated with defects in the gene you cloned?

 

c)   What is the function of your protein in a human organism (if known)? Is there an understandable connection between the defect and disease phenotype?