BIOL 1406
PreLab 10.1
What are some of the tools being developed to study entire
genomes?
One of the goals of the Human Genome Project is to sequence the entire human
genome: all 22 autosomes, along with the X and Y chromosomes. Since
there are 3.2 billion base pairs in the human genome, this is a daunting goal.
Sequencing began in 1990, and by February 2001 the sequencing was over 90%
complete. This milestone came 4 years earlier than the target date. The only
sequences not covered at that point were highly repetitive regions such as the
centromere and telomere regions of the chromosomes, regions that resist cloning
by the host and vector systems currently used. In addition to sequencing the
human genome, scientists are also working to sequence the genomes of many
other species, particularly those of scientific, medical, or agricultural
importance. Today, the genomes of over 900 species have been at least partially sequenced.
This large international project
has not only collected enormous databanks of DNA sequence information from
the genomes of many species, it has also promoted the development of
highly automated strategies for studying nucleic acids and proteins. Much of
the success of the Human Genome Project comes from the
introduction of new “high throughput” technologies for DNA sequencing
that rely on automation and robotics. The Human Genome Project has been a catalyst for change in the way biologists approach the study of living things. Using sophisticated technology for sequencing DNA, biologists are collecting data faster than they can interpret it. In order to develop the tools needed to store, manage, and interpret the huge amount of data being collected, a new field of science called bioinformatics is being developed.
[Figure: Human chromosomes]
As an example of how this can work, when a scientist sequences a new stretch of
DNA, this information becomes meaningful when a new gene can be discovered in
the sequence. Often these genes are hard to spot, especially in eukaryotic
genomes, since the majority of DNA in these organisms does not code for
proteins. In the human genome, genes account for less than 2% of the DNA
sequence. A computer program can search the sequence data for tell-tale signs of a
gene by looking for DNA consensus sequences for
translational start and stop codons, a ribosome binding site, intron splicing
sites, and a promoter. In eukaryotes, a region of high guanine and cytosine (G
and C) content is frequently found near clusters of genes, so mapping GC content
along a chromosome can also help to locate genes in the DNA
sequence. Since gene sequences can be highly conserved between different
species, an especially powerful approach for identifying a gene in
sequence data is to search databases of DNA sequences for sequence
similarities. Due to these so-called “in silico” (computer-based) tools
and the dramatic growth of DNA sequence databases, the rate of gene discovery
has increased exponentially.
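To make two of these signals concrete, here is a minimal Python sketch that maps GC content along a sequence and scans for open reading frames (a start codon followed by an in-frame stop codon). The function names and the short test sequence are made up for illustration. Real gene-finding programs also look for promoters, splice sites, and ribosome binding sites, and they search both strands of the DNA; this sketch only reads the forward strand.

```python
# Illustrative sketch only: the function names and toy sequence are hypothetical.

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Return (start position, GC fraction) for each full window along seq."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


def find_orfs(seq, min_codons=2):
    """Return (start, end) positions of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is the index just past the stop codon, and `min_codons`
    counts the codons before the stop (including the ATG).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # try each of the three reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGGCTAAATGACC"
print(gc_content(dna))   # 0.5
print(find_orfs(dna))    # [(2, 14)], i.e. ATGGCTAAATGA
```

Plotting the output of `gc_windows` along a chromosome is one simple way to see GC-rich regions where genes may cluster.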
The power of the bioinformatic approach for the discovery of genes was
demonstrated by the completion of the yeast genomic sequencing project in 1996. Once
the genome was fully sequenced, bioinformatic tools
scanned it for genes. The genes found this way could be compared with
the large number of genes that had already been discovered through more
classical molecular genetic techniques. The results were remarkable. Before the
yeast genome was sequenced in 1996, an international collaboration of scientists
studying the genetics of this model organism had identified an impressive 2,000
genes by conventional genetic analysis. When the yeast genome sequencing was
completed, bioinformatic searches for similarities of DNA sequences from other
organisms were able to locate an additional 2,000 genes. This means in less than
one year, a single laboratory using DNA sequencing and computer searches of
sequence data could both duplicate and double the gene discovery of a 20-year
international effort.
Once a gene has been identified, many new questions can be asked: what kind of
protein does it code for and what is its function? How does it interact with
other biomolecules of the cell? Is it expressed at all times as a so-called
“housekeeping gene”, or is it a developmentally regulated gene? Is its
expression tissue-specific? Is it expressed in response to an environmental
factor? These questions are the same questions that have been asked by molecular
and cell biologists for decades, usually by studying one protein and its gene at
a time. With the enormous information coming from the genomics project, however,
biologists can ask the same questions about more complex systems. Instead of
asking about one protein at a time, biologists can now ask questions about
hundreds of proteins at a time, looking for patterns of structure and patterns
of expression. Studying proteins on a genomic scale is a new field of biology
called proteomics.
When a new gene is found to be homologous to an already characterized gene in
a database, many of these
questions about protein structure and function can sometimes be answered quickly by the
bioinformatic approach. For example, the 2,000 new genes discovered by the yeast
genomic sequencing project, discussed above, matched genes whose function had
already been determined.
Bioinformatics approaches are playing an increasing role in protein structure studies. Although the final conformation of a protein is determined by its amino acid sequence, we have yet to model the correct folding of a protein by computer from its amino acid sequence alone. There is progress, however, toward this so-called “holy grail” of proteomics. As our database of proteins whose crystal structures have been solved grows, we can more often predict protein structures from their sequence similarities. Also, analysis of sequence databases has revealed conserved protein families with high sequence homology in part, if not all, of the amino acid sequence. Computer programs can currently predict protein structures by homology modeling when the sequence homology is as low as 25%. This means that if the amino acid sequences of two proteins agree by more than 25%, a computer program can use the known structure of one to predict the secondary and tertiary structure of the other.
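The percentage agreement between two sequences can be computed directly once the sequences have been aligned. The Python sketch below is one simple way to do it, assuming the two sequences are already aligned to the same length with "-" marking gaps. The function name is hypothetical, and alignment programs differ in how they count gapped positions, so treat the exact convention here (gapped positions are skipped) as an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared positions at which two aligned sequences match.

    Positions where either sequence has a gap ("-") are skipped. This is
    one common convention, not the only one.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned peptides (made up for illustration).
identity = percent_identity("ACDE", "ACDF")
print(identity)        # 75.0
print(identity > 25)   # True: above the 25% threshold mentioned in the text
```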
Questions about how genes are regulated are rapidly being answered by a new technology called “DNA chips” or “microarrays”. These chips are designed to allow many hybridization experiments to be performed in parallel. Oligonucleotides are synthesized on a glass surface similar to a microscope slide. Borrowing from the photolithography technology used by the computer industry to etch semiconductor circuits into silicon chips, arrays of oligonucleotides can be laid out at a density of up to one million different oligonucleotides per square centimeter. By judicious selection of oligonucleotide sequences, complementary DNA for all of the genes expressed in an organism can be assigned to specific positions on a given microarray. A microarray can easily assay which genes are being expressed in a cell: the mRNA is harvested from the cell, labeled by covalent linkage to a colored molecule, and allowed to hybridize with the oligonucleotides on the microarray. The microarray is said to be “interrogated” in this way by the labeled mRNA. The genes that are actively being transcribed into mRNA by the cell are then determined by viewing the microarray under the microscope to see which oligonucleotides hybridized with the labeled mRNA. Although there are many variations in the exact way that microarrays are designed and hybridized with labeled nucleic acids, they all have one thing in common: massive amounts of data from single experiments, requiring computer-assisted analysis and archiving of the results.
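The logic of a hybridization experiment can be imitated in a few lines of Python. In this deliberately simplified sketch, each spot on the array holds one DNA oligonucleotide, and a spot counts as "lit" if some labeled mRNA contains the exact sequence that would base-pair with it. The gene names, probe sequences, and function names are hypothetical. A real microarray measures signal intensity rather than a yes/no answer, and real hybridization tolerates some mismatches.

```python
# Simplified model: perfect base-pairing only, presence/absence only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def expressed_genes(probes, mrnas):
    """Return sorted names of probes that at least one mRNA could pair with.

    probes: dict mapping a gene name to its oligonucleotide (DNA, 5' to 3')
    mrnas:  list of mRNA sequences harvested from the cell
    """
    # Write each mRNA with T in place of U so it can be compared with DNA.
    targets = [m.upper().replace("U", "T") for m in mrnas]
    hits = set()
    for gene, probe in probes.items():
        site = reverse_complement(probe)  # the sequence this probe pairs with
        if any(site in target for target in targets):
            hits.add(gene)
    return sorted(hits)


probes = {"geneA": "TTTAGCCAT", "geneB": "GGGCCCAAA"}  # hypothetical probes
sample = ["AUGGCUAAAUGA"]                              # hypothetical mRNA
print(expressed_genes(probes, sample))                 # ['geneA']
```

Even this toy version shows why computers are needed: with a million spots per square centimeter, the same comparison has to be recorded and analyzed for every spot on the chip.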
[Figure: DNA chip]
The power of the microarray and bioinformatics approach is having a major impact on medical research. For example, a biotechnology company called Sagres Discovery in Davis, California, has recently announced that it has identified over 1000 new oncogenes in the mouse genome after just one year of research using this approach.
The practical applications from database information and the new bioinformatics
tools are far-reaching. For example, with the discovery of a new oncogene and
study of its structure and function comes the possibility of a new anti-cancer
drug or strategy. The discovery of disease genes can lead to diagnostics for
inherited diseases. Plant geneticists are using the detailed information coming
from genomics to identify DNA markers to speed the breeding of new traits in our
crop plants. With the discovery of DNA sequence polymorphisms (variations in
allele frequencies within populations) come DNA fingerprinting strategies for
identity testing in forensics and the judicial system.
In this exercise, you will use a computer to access GenBank, the database repository of all DNA and protein sequences housed at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH). In Section I you will compare amino acid sequences of proteins from different organisms to study their evolutionary relatedness. In Section II you will use a DNA sequence to find a protein translational sequence (ORF for “open reading frame”) of a plant defensin and study the function of this protein by finding homologous sequences in the protein database. In Section III you will use databases of the biological literature available online through the National Library of Medicine to discover what researchers are reporting for the structures and functions of plant defensin proteins.
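As a preview of the idea behind Section II, the Python sketch below translates an open reading frame into its amino acid sequence using the standard genetic code. It is only an illustration with a made-up sequence and hypothetical names, not the tool you will use in the exercise; it also assumes the sequence contains only the letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate a DNA ORF into one-letter amino acids.

    Translation stops at the first stop codon ("*" in the table).
    Any codon containing a letter other than A, C, G, or T raises KeyError.
    """
    orf = orf.upper()
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTAAATGA"))  # MAK
```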
[Figure: DNA electrophoresis gel]
Your Turn
What is the Human Genome Project?
What is bioinformatics?
Why is it often difficult to locate genes (protein-coding sequences) within a long stretch of newly sequenced DNA?
How do scientists attempt to locate genes (protein-coding sequences) within a long stretch of newly sequenced DNA?
Once a gene has been identified, what types of questions are scientists interested in answering about it?
What is a “DNA chip” or “microarray”?
Close this browser window to return
to Blackboard and complete the practice quiz and assessment quiz.