Genomics is the study of
biological problems using genetic information derived from whole genomic
sequences rather than from a small number of genes (as in
conventional genetics). All of the
information for determination of physical characteristics, disease
susceptibility, and certain behavioral traits resides in the nucleotide sequence
making up the genome of an organism. Through the use of automated DNA sequence
analysis, it has been possible to determine the nearly complete genomic sequence
of a number of microbial, animal, and plant species including man over the past
15 years. Two of the most
remarkable discoveries of the genomics era are the observation that even complex
organisms have relatively few protein-coding genes (20,000-30,000) and that many
encode products that are structurally and functionally conserved among all
organisms. Genomic information has
allowed investigators to begin studying groups of gene products that work
together to determine particular biochemical pathways and phenotypes. This field is called functional genomics
or proteomics and will be a major focus of scientific endeavor for at least the
next decade. Since techniques exist
that allow investigators to add or subtract specific genes in model organisms
such as yeast, Arabidopsis, fruit fly, zebrafish, and mouse, rapid progress is
being made toward understanding the function of all genes and interactions
between their protein products.
Bioinformatics is an emerging discipline that
uses computational techniques to present and analyze genomic and proteomic
information. The vast amount of
information contained in a complex eukaryotic genome (~3 x 10^9 bp) is
too great for the human mind to comprehend without the assistance of a
computer. Computer databases are
used to store the large amounts of sequence information that is now available
and to construct linear genomes from overlapping short sequences that are
produced from automated sequencing of individual cloned DNA fragments. Programs exist that allow anyone to
rapidly compare an unknown DNA sequence with all of the available information
contained in a large government supported database known as GenBank. For example, the sequence of a PCR
fragment derived from a mutant form of a human gene that is associated with a
particular genetic disease can be compared with GenBank sequences to determine
the wild type gene. GenBank files
contain specific contiguous DNA sequences with information detailing open
reading frames that putatively code for proteins. If the sequence is known to code for
specific mRNA molecules, the start and stop points and introns are noted and the
open reading frames are virtually translated into protein sequences using the
genetic code. GenBank files also
often contain information about the function of the gene product(s) and links to
literature sources detailing the original studies that have been done on the
gene. Separate databases exist
that catalog all known protein sequences and structures. It has been learned that many different
proteins share common structural features that are the result of evolutionary
conservation of functional elements.
For example, nucleotide-binding domains can often be determined within
proteins encoded by newly discovered genes through similarity with other
proteins that are known to bind nucleotides. Similarly, much can be “guessed” about
the structure of a protein encoded by a newly discovered gene by comparing the
computed sequence with that of other proteins whose structures have been solved
by X-ray crystallography.
Proteomics tools exist that can rapidly detect putative sites for protein
modifications such as phosphorylation, glycosylation, and proteolytic
cleavage. Signal peptides, membrane
spanning domains, and overall protein stability can also be predicted using only
primary DNA sequence information as a starting point.
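The idea of building a longer sequence from overlapping short reads can be illustrated with a minimal sketch. It assumes error-free reads and a simple exact suffix-to-prefix match; real assembly programs must also handle sequencing errors, repeats, and reads from both strands. The function names and the two reads are invented for illustration.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads at their overlap; return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Made-up reads for illustration; real reads are hundreds of bases long.
read_a = "ATGGCGTACGTT"
read_b = "ACGTTGACCTAA"
contig = merge_reads(read_a, read_b)
print(contig)  # ATGGCGTACGTTGACCTAA (the reads share the 5 bases ACGTT)
```

The `min_overlap` threshold is there because very short matches between unrelated reads occur by chance.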
In this exercise, you
will use on-line genomics and proteomics tools to analyze a random sequence of
yeast DNA that was cloned in Exercise #5.
Most programs used for genetic analysis are proprietary and quite
expensive. However, many individual
programs exist in the public domain, and you will use software in this
exercise that is available to anyone in the world at no cost. The yeast DNA sequence that you will
analyze for your lab report will be sent to you by e-mail so that you can work
on your own time. All of the
sequences that will be sent out for this exercise have been pre-screened by your
instructors and are known to contain at least part of a protein-coding
region. You will determine the
identity of the unknown DNA fragment by comparison to the yeast genome. The protein coding region of the gene
represented by your cloned fragment will be collected and translated to identify
all open reading frames. Finally,
your sequence will be compared with the well-studied human genome in order to
determine whether there is a homologous sequence in man.
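To make the idea of "translating to identify open reading frames" concrete, here is a small sketch that scans the three forward reading frames for stretches that begin with ATG and end at a stop codon. It assumes the standard genetic code and only reports complete ORFs; because your fragment may contain only part of a coding region, an ORF that runs off either end of the sequence would be missed by this simple version. The function names and the test sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons from the start of `dna`; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=2):
    """Return (frame, start, end, protein) for each ATG...stop stretch, forward strand only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:pos])
                if len(protein) >= min_codons:
                    # frame is reported as 1, 2, or 3; start/end are 0-based slice positions
                    orfs.append((frame + 1, start, pos + 3, protein))
                start = None
    return orfs


# Invented sequence: CC ATG GCT TGG TAA CG
print(find_orfs("CCATGGCTTGGTAACG"))  # [(3, 2, 14, 'MAW')]
```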
Before you begin, it might be
helpful to copy the DNA sequence for the unknown gene from the e-mail message
and paste it into a text editor or word processing program. That way you can always have the
sequence available for pasting into other programs.
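Sequences copied from an e-mail message often pick up line numbers, spaces, or a header line. If you prefer to tidy the sequence with a script rather than by hand, something like the following sketch would work. It assumes a header line, if present, starts with ">" (FASTA style); the function name and sample text are invented.

```python
def clean_sequence(raw):
    """Strip a FASTA-style header, digits, and whitespace; return upper-case letters only."""
    lines = [line for line in raw.splitlines() if not line.lstrip().startswith(">")]
    return "".join(ch for ch in "".join(lines).upper() if ch.isalpha())


# Invented example of a pasted sequence with a header and position numbers.
pasted = ">unknown_clone\n  1 atgc gtac\n 11 ttga\n"
sequence = clean_sequence(pasted)
print(sequence)  # ATGCGTACTTGA

# Letters other than A, C, G, T (or N for an ambiguous base) deserve a second look.
unexpected = set(sequence) - set("ACGTN")
print(unexpected)  # set()
```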
Step I. Conduct a BLASTN similarity search to identify the cloned fragment
1). Go to the National Center for Biotechnology Information (NCBI) website at http://www.ncbi.nlm.nih.gov/. This is the central location for the largest single collection of biological information in the world. Scores of databases, dozens of books, thousands of journal articles and billions of bases of DNA can all be searched from this page.
2). Click BLAST near the top center of the page. There are several versions of BLAST, but the most popular is from NCBI.
3). Click Nucleotide-nucleotide BLAST (blastn). Paste your sequence in the box labeled ‘Search’. Select the ‘nr’ database, and click the ‘BLAST’ button.
4). The next page gives you many options for viewing the output of your search. For now, leave them all as they are, and click the ‘format’ button. Scroll down until you find the graph marked ‘Distribution of BLAST hits on the query sequence’. This shows you where the hits match along the query sequence.
5). Print out the graph, the list of similar sequences and the first 4 sequence alignments for your lab report.
6). Find the colored boxes to the right of some of the similar sequences. These links connect to some of the specialized NCBI databases. The green G box, when available, gives you general gene information. The purple box with an ‘E’ in the middle links to a Gene Expression database, the light blue ‘U’ links to the Unigene database, and so on. Click some of these boxes to find out more information about your sequence.
7). Click the link to ‘Taxonomy reports’, above and to the left of the graph. This opens a new window with a list of hits broken down by lineage.
8). Print out the Lineage report for your lab report.
Look through the BLAST FAQs and use them as needed to help answer the questions below.
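When you examine the sequence alignments, it can help to list exactly where the query and the database sequence differ. The sketch below assumes you have copied the two aligned rows (including any "-" gap characters) from one alignment block so that they have equal length; percent identity is computed as matches divided by alignment length. The function name and the aligned strings are made up.

```python
def compare_alignment(query, subject):
    """Return (percent_identity, differences) for two aligned rows of equal length."""
    if len(query) != len(subject) or not query:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = 0
    differences = []
    for column, (q, s) in enumerate(zip(query.upper(), subject.upper()), start=1):
        if q == s and q != "-":
            matches += 1
        else:
            # Record the 1-based alignment column and the two characters.
            differences.append((column, q, s))
    return 100.0 * matches / len(query), differences


# Invented alignment: one mismatch (column 4) and one gap in the query (column 9).
identity, diffs = compare_alignment("ATGCCGTA-GT", "ATGACGTACGT")
print(round(identity, 1))  # 81.8
print(diffs)               # [(4, 'C', 'A'), (9, '-', 'C')]
```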
Questions:
a) What is the E-value for the hit with the highest degree of similarity to the query sequence? What does this value mean?
b) Is your cloned sequence IDENTICAL to any sequences in GenBank? What are the differences between your sequence and the yeast sequences in GenBank? What are possible sources of discrepancies between your sequence and yeast sequences in GenBank?
c) What gene(s) does your cloned sequence represent?
d) What is the putative function of your gene, if known?
e) Does your cloned sequence share significant similarity with any non-yeast genes? Are their functions (if known) similar?
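As background for interpreting E-values, the sketch below uses the standard relationship between a bit score S' and the expected number of chance hits, E = m x n x 2^(-S'), where m is the query length and n is the total database length. The lengths used here are invented, and BLAST itself applies corrected "effective" lengths, so do not expect these numbers to match your report exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least `bit_score` (E-value)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 500-base query against a database of one billion bases.
for score in (30, 40, 50, 200):
    print(score, expected_hits(score, 500, 10**9))
```

Each additional bit of score halves the E-value, which is why strong hits have E-values that are vanishingly close to zero.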
Step II. Translated BLAST
1). Return to the main BLAST page and select the blastX search ‘Translated query vs. protein database (blastx).’
2). Paste your sequence in the box, and click the ‘BLAST’ button.
3). Click the ‘format’ button.
4). The BLASTX search has translated your query sequence in all 6 possible reading frames, 3 forward and 3 reverse. Scroll down to look at the matches and you will see which translation frame is being used for each hit. For example, ‘Frame = +2’ indicates that the match was found in the second forward reading frame. The top hit will also be your first sequence alignment. Record the name of the protein that has the highest similarity to your query (the first match).
5). Print out the graph, the list of similar sequences and the first 4 sequence alignments for your lab report.
6). Find the colored boxes to the right of some of the similar sequences. Click some of these boxes to find out more information about your sequence.
7). Click the link to ‘Taxonomy reports’, above and to the left of the graph. Print out the Lineage report for your lab report.
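The six reading frames that BLASTX reports can be reproduced with a short sketch. It assumes the standard genetic code and the usual convention that frames -1 to -3 are read from the reverse complement of the query; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Invented sequence; the ATG start codon sits in frame +2.
for name, protein in sorted(six_frame_translation("GATGGCTTGGTAA").items()):
    print(name, protein)
```

In this example frame +2 reads ATG GCT TGG TAA and gives "MAW*", where "*" marks the stop codon.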
Questions:
a) Are the top 4 hits identical to those that were found with the previous search? What differences do you find?
b) Which method found the most matches?
c) What frame is found in the top hit?
d) Click the hyperlink to the top hit to find out more about it. At the top right corner of the resulting document, notice that there is a link to Domains. Click this link. What protein domains are found in this sequence?
e) Is there a human homolog to this yeast gene? If so, retain its name for Step III.
Step III. Online Mendelian Inheritance in Man (OMIM) analysis
If your gene alignment indicates there is a human homolog to the cloned yeast gene, you can find out whether it is a gene associated with a known human disorder by using this site.
1). Go back to the NCBI home page (http://www.ncbi.nlm.nih.gov/) and open the OMIM database.
2). Copy and paste (or type in) the name of your protein (for example: glucose-6-phosphate dehydrogenase) into the "search for" box at the top of the OMIM page.
3). You will get a list of accession numbers, each with a brief title and gene locus. Click the first one that has your complete protein name.
The page to which you will be taken gives a full description of any human diseases caused by defects in your protein, its allelic variants, and a literature review. In short, it summarizes what is known about the function in humans of the protein product of the gene you cloned.
Questions:
a) What is the gene map locus of your gene? Are there multiple loci?
b) What disease (if any) is associated with defects in the gene you cloned?
c) What is the function of your protein in a human organism (if known)? Is there an understandable connection between the defect and disease phenotype?