
Development of a software program to dynamically create repetitive element databases for RepeatMasker.
Arian Smit (1), Robert Hubley (1), Todd M. Smith (2). 1. Institute for Systems Biology, 2. Geospiza, Inc., Seattle, WA.
A significant fraction of the DNA sequence of eukaryotic genomes is repetitive. Accurately identifying and classifying this fraction is worthwhile because it aids in reconstructing genomic evolution and establishing species phylogenies. Repeats must also be identified and suppressed before generalized database searches or gene prediction analyses can be conducted. RepeatMasker (www.repeatmasker.org) is widely used for this purpose and is now a standard tool in DNA sequence analysis. RepeatMasker identifies and annotates repeats by using Cross_Match or WU-BLAST to compare the query sequence against human-curated databases of repetitive element consensus sequences in RepBase Update (www.girinst.org).
When sequencing a new genome, a researcher faces the challenge that no database of repeat consensus sequences exists at the beginning of the annotation phase. Because repetitive elements are species specific, a repeat library prepared from one species generally cannot be used to identify and mask the repeats in another species. Further, as the throughput of DNA sequencing increases, human curation of repeats cannot keep pace with data production. Thus, automated techniques are needed to identify repetitive elements in new genomes. Using the output of RECON (Bao and Eddy 2002), we are developing a repeat library generation tool that will dynamically create the databases used by RepeatMasker. The eventual goal is an integrated library builder and workbench for RepeatMasker that will be used by future genome projects.
References
Bao, Z. and S.R. Eddy. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 12: 1269-1276.
2005 Plant and Animal Genomics Conference