
Assembly Improvements Using Sequence Metadata
Christie Robertson 1, Gary Montry 2, Mon-Chaio Lo 1, Joe Slagel, and
Todd M. Smith 1
1. Geospiza, Inc., 2442 NW Market St., Seattle, WA 98107 USA
2. Southwest Parallel Software, Albuquerque, NM
Assembly programs align nucleotide sequences to each other based on similarity
between the sequences. Since each assembly algorithm relies on thresholds
to determine which sequences are similar enough to align and which are not,
every algorithm will inevitably misassemble some sequences and fail to
assemble others that belong together. An algorithm that performs well on one set of
data might fail badly on another. Assembly algorithms are being challenged
by increasingly diverse biological questions, including EST clustering, genotyping,
and comparative genomics, and by problems inherent to certain data sets,
such as repetitive DNA. We are re-engineering Phrap to improve its performance
and utility by optimizing the core algorithms and developing a framework
to store, manipulate, and view sequence data. XML-formatted hints and constraints
will instruct the core alignment program on how to handle parts of the data,
or the data set as a whole, in individualized ways. We have re-engineered Phrap
so that alignments can incorporate information about mate pairs, that is, reads
sequenced from the same template, which therefore have a known order and
orientation with respect to each other. We are
also utilizing mate pair information to create larger scaffold structures,
with known gap sizes between contigs.
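
As an illustration of how mate pairs can yield a gap size between two contigs, the following is a minimal sketch and not the method used in the re-engineered Phrap, which the abstract does not describe. It assumes a forward read placed on the left contig, its mate placed in reverse orientation on the right contig, and a known expected template (insert) length. The names MateLink, gap_from_link, and estimate_gap are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MateLink:
    """One mate pair bridging two contigs (hypothetical record layout)."""
    fwd_start: int    # 0-based start of the forward read on the left contig
    rev_end: int      # 0-based last template base on the right contig (inclusive)
    insert_size: int  # expected template length in bases (assumed known)


def gap_from_link(link: MateLink, left_len: int) -> int:
    """Gap implied by a single mate pair.

    The template covers the tail of the left contig, the gap, and the head
    of the right contig, so:
        insert_size = (left_len - fwd_start) + gap + (rev_end + 1)
    A negative result suggests the two contigs overlap.
    """
    covered_left = left_len - link.fwd_start
    covered_right = link.rev_end + 1
    return link.insert_size - covered_left - covered_right


def estimate_gap(links: list[MateLink], left_len: int) -> float:
    """Average the per-link gap estimates over all bridging mate pairs."""
    if not links:
        raise ValueError("at least one bridging mate pair is required")
    return mean(gap_from_link(link, left_len) for link in links)


# Illustrative values only, not data from the abstract.
links = [
    MateLink(fwd_start=700, rev_end=499, insert_size=2000),
    MateLink(fwd_start=900, rev_end=299, insert_size=1500),
]
gap = estimate_gap(links, left_len=1000)
```

Averaging over several bridging pairs is one simple way to dampen the variation in real template lengths; a production scaffolder would also need to weight links by library and reject pairs whose orientation contradicts the proposed contig order.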
GSAC 2003 DNA Sequencing and Analysis Conference