What are some of the tools being developed to study entire genomes?

BIOL 1406

PreLab 10.1

One of the goals of the Human Genome Project is to sequence the entire human genome: all 22 autosomes (non-sex chromosomes), along with the X and Y chromosomes. Since there are 3.2 billion base pairs in the human genome, this is a daunting goal. Sequencing began in 1990, and by February 2001 it was over 90% complete. This milestone came 4 years earlier than the target date. The only sequences not covered at that point were highly repetitive regions such as the centromeres and telomeres of the chromosomes, regions that resist cloning by the host and vector systems currently used. In addition to sequencing the human genome, scientists are also working to sequence the genomes of many other species, particularly those of scientific, medical, or agricultural importance. Today, the genomes of over 900 species have been at least partially sequenced.

This large international project has not only collected enormous databanks of DNA sequence information from the genomes of dozens of species, but has also promoted the development of highly automated strategies for studying nucleic acids and proteins. Much of the success of the Human Genome Project comes from the introduction of new “high throughput” DNA sequencing technologies that rely on automation and robotics.

The Human Genome Project has been a catalyst for change in the way biologists approach the study of living things. Using the sophisticated technology for sequencing DNA, biologists are collecting data faster than they can interpret it. In order to develop the tools needed to store, manage, and interpret the huge amount of data being collected, a new field of science called bioinformatics is being developed.

Human chromosomes

As an example of how this can work, when a scientist sequences a new stretch of DNA, the information becomes meaningful once a gene can be identified in the sequence. Genes are often hard to spot, especially in eukaryotic genomes, since the majority of DNA in these organisms does not code for proteins; in the human genome, genes account for less than 2% of the DNA sequence. A computer program can search the sequence data for tell-tale signs of a gene by looking for DNA consensus sequences for translational start and stop codons, a ribosome binding site, intron splicing sites, and a promoter. In eukaryotes, a region of high guanine and cytosine (G and C) content is frequently found near clusters of genes, so mapping GC content along a chromosome can also help to locate genes in the DNA sequence. Since gene sequences can be highly conserved between different species, an especially powerful approach for identifying a gene in sequence data is to search databases of DNA sequences for sequence similarities. Due to these so-called “in silico” (computer-based) tools and the dramatic growth of DNA sequence databases, the rate of gene discovery has increased exponentially.
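The idea of mapping GC content along a sequence can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular gene-finding program: the function names, the window and step sizes, and the example sequence are all made up for this sketch.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=10, step=5):
    """Yield (start, gc_fraction) for each full window along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


# Made-up sequence: AT-rich flanks around a GC-rich stretch.
example = "ATATATTTAA" + "GCGCGGCCGC" + "TTATAATATA"

for start, gc in gc_windows(example):
    print(start, gc)
```

In this toy sequence the GC fraction rises toward the middle windows and falls again at the ends. Real analyses use much longer windows and combine GC content with the other signals described above.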

The power of the bioinformatic approach for the discovery of genes was demonstrated by the completion of the yeast genomic sequencing project in 1996. Once the genome was fully sequenced, it was scanned for genes using bioinformatic approaches. The genes found this way could be compared with the large number of genes that had already been discovered through more classical molecular genetic techniques. The results were remarkable. Before the yeast genome was sequenced in 1996, an international collaboration of scientists studying the genetics of this model organism had identified an impressive 2,000 genes by conventional genetic analysis. When the yeast genome sequencing was completed, bioinformatic searches for similarities to DNA sequences from other organisms were able to locate an additional 2,000 genes. This means that in less than one year, a single laboratory using DNA sequencing and computer searches of sequence data could both duplicate and double the gene discovery of a 20-year international effort.

Once a gene has been identified, many new questions can be asked: What kind of protein does it code for, and what is its function? How does it interact with other biomolecules of the cell? Is it expressed at all times as a so-called “housekeeping gene”, or is it a developmentally regulated gene? Is its expression tissue-specific? Is it expressed in response to an environmental factor? These are the same questions that molecular and cell biologists have asked for decades, usually by studying one protein and its gene at a time. With the enormous amount of information coming from the genomics project, however, biologists can ask the same questions about more complex systems. Instead of asking about one protein at a time, biologists can now ask questions about hundreds of proteins at a time, looking for patterns of structure and patterns of expression. Studying proteins on a genomic scale is a new field of biology called proteomics.

When the sequence of a new gene is found to be homologous to a gene in a database that has already been characterized, many of these questions about protein structure and function can sometimes be answered quickly by the bioinformatic approach. For example, the 2,000 new genes discovered by the yeast genomic sequencing project, discussed above, matched genes whose function had already been determined.

Bioinformatics approaches are playing an increasing role in protein structure studies. Although the final conformation of a protein is determined by its amino acid sequence, we have yet to correctly model the folding of a protein by computer from its amino acid sequence alone. There is progress, however, in achieving this so-called “holy grail” of proteomics. As our database of proteins whose crystal structures have been solved grows, we can more often predict protein structures from their sequence similarities. Also, we have discovered by analysis of sequence databases that there are certain conserved protein families with high sequence homology in part, if not all, of the amino acid sequence. Computer programs can currently predict protein structures by homology modeling when the sequence homology is as low as 25%. This means that if the amino acid sequences agree by more than 25%, the computer program can accurately predict the secondary and tertiary structure of the new sequence.
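To make the idea of two sequences “agreeing by 25%” concrete, here is a minimal Python sketch that computes percent identity between two sequences that have already been aligned. The function name and the example peptides are hypothetical, and real homology-modeling software first has to build the alignment itself, which this sketch does not attempt.

```python
def percent_identity(a, b):
    """Percent of aligned positions that carry the same residue.

    Assumes a and b are already aligned and have the same length.
    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Made-up aligned peptides (one-letter amino acid codes).
seq1 = "MKTAYIAK"
seq2 = "MKSAYIAR"

print(percent_identity(seq1, seq2))  # 6 of 8 positions match: 75.0
```

Under the 25% rule of thumb described above, a pair of sequences like this one would be considered similar enough to attempt homology modeling.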

Questions about how genes are regulated are rapidly being answered by a new technology called “DNA chips” or “microarrays”. These chips are designed to allow many hybridization experiments to be performed in parallel. Oligonucleotides are synthesized on a glass surface similar to a microscope slide. Borrowing from the photolithography technology that the computer industry uses to etch semiconductor circuits into silicon chips, arrays of oligonucleotides can be laid out at a density of up to one million different oligonucleotides per square centimeter. By judicious selection of oligonucleotide sequences, complementary DNA for all of the genes expressed in an organism can be assigned to specific positions on a given microarray. A microarray can easily assay which genes are being expressed in a cell by harvesting the mRNA from the cell, labeling the mRNA by covalent linkage to a colored molecule, and allowing the labeled mRNA to hybridize with the oligonucleotides on the microarray. The microarray is said to be “interrogated” in this way by the labeled mRNA. The genes that are actively being transcribed into mRNA by the cell are then determined by viewing the microarray under the microscope to see which oligonucleotides hybridized with the labeled mRNA. Although there are many variations in the exact way that microarrays are designed and hybridized with labeled nucleic acids, they all have one thing in common: massive amounts of data from single experiments, requiring computer-assisted analysis and archiving of the results.
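The logic of “interrogating” a microarray can be illustrated with a toy Python model. It assumes that a probe hybridizes only when its exact reverse complement occurs in a transcript, and it writes transcripts with DNA letters (T in place of U) for simplicity. The probe names and sequences are invented, and real hybridization depends on probe length, temperature, and partial matches that this sketch ignores.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def hybridized_probes(probes, transcripts):
    """Return names of probes whose reverse complement occurs in any transcript."""
    hits = []
    for name, oligo in probes.items():
        target = reverse_complement(oligo)
        if any(target in t for t in transcripts):
            hits.append(name)
    return hits


# Hypothetical probes, one per gene, at known positions on the chip.
probes = {"geneA": "TTGCAT", "geneB": "CCCGGG", "geneC": "AGTCAA"}

# Hypothetical labeled transcripts harvested from a cell (T written for U).
transcripts = ["GGATGCAACC", "TTTTGACTAA"]

print(hybridized_probes(probes, transcripts))  # ['geneA', 'geneC']
```

In this toy example the spots for geneA and geneC would light up, while geneB would stay dark because no transcript contains its complementary sequence.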

DNA chip

The power of the microarray and bioinformatics approach is having a major impact on medical research. For example, a biotechnology company called Sagres Discovery in Davis, California has recently announced that it has identified over 1000 new oncogenes in the mouse genome after just one year of research using this approach.

The practical applications of database information and the new bioinformatics tools are far-reaching. For example, with the discovery of a new oncogene and study of its structure and function comes the possibility of a new anti-cancer drug or strategy. The discovery of disease genes can lead to diagnostics for inherited diseases. Plant geneticists are using the detailed information coming from genomics to identify DNA markers to speed the breeding of new traits in our crop plants. With the discovery of DNA sequence polymorphisms (variations in DNA sequence among individuals in a population) come DNA fingerprinting strategies for identity testing in forensics and the judicial system.

In this exercise, you will use a computer to access GenBank, the database repository of all DNA and protein sequences housed at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH). In Section I you will compare amino acid sequences of proteins from different organisms to study their evolutionary relatedness. In Section II you will use a DNA sequence to find a protein translational sequence (ORF for “open reading frame”) of a plant defensin and study the function of this protein by finding homologous sequences in the protein database. In Section III you will use databases of the biological literature available online through the National Library of Medicine to discover what researchers are reporting for the structures and functions of plant defensin proteins.
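As a rough preview of what finding an ORF involves, the following Python sketch scans the three forward reading frames of a DNA string for a start codon (ATG) followed in frame by a stop codon, and translates what lies between them using the standard genetic code. It is a simplified illustration, not the algorithm used by the NCBI tools: the function name, the `min_codons` parameter, and the example sequence are assumptions made for this sketch, and the reverse strand is ignored.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(dna, min_codons=3):
    """Return (start_index, protein) for each ATG...stop ORF in the forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] != "ATG":
                i += 3
                continue
            protein = []
            j = i
            found_stop = False
            while j + 3 <= len(dna):
                aa = CODON_TABLE.get(dna[j:j + 3], "X")  # X marks an unknown codon
                if aa == "*":
                    found_stop = True
                    break
                protein.append(aa)
                j += 3
            if found_stop and len(protein) >= min_codons:
                orfs.append((i, "".join(protein)))
            # Resume after the stop codon, or move on if none was found.
            i = j + 3 if found_stop else i + 3
    return orfs


# Made-up sequence: ATG GCT AAA TGA embedded starting at index 2.
print(find_orfs("CCATGGCTAAATGACC"))  # [(2, 'MAK')]
```

The second ATG in the example (at index 10, in a different frame) is not reported because no stop codon follows it in frame before the sequence ends.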

DNA electrophoresis gel

Your Turn
What is the Human Genome Project?
What is bioinformatics?
Why is it often difficult to locate genes (protein coding sequences) within a long stretch of newly sequenced DNA?
How do scientists attempt to locate genes (protein coding sequences) within a long stretch of newly sequenced DNA?
Once a gene has been identified, what types of questions are scientists interested in answering about it?
What is a “DNA chip” or “microarray”?

Close this browser window to return to Blackboard and complete the practice quiz and assessment quiz.