
Adding Data to the Assembly Process
Christie Robertson1, Mon-Chaio Lo1, Gary Montry2, Sandra Porter1, Todd Smith1
1 Geospiza, Inc., 2442 NW Market St. #344, Seattle, WA, 98107, USA
2 Southwest Parallel Software, Albuquerque, NM, USA
Assembly programs, which align sequencing reads into larger contiguous sequences, are central to genomics research. Phrap is a widely used, robust assembler that has long been a standard component of many sequencing projects. But Phrap, like all assembly programs, is challenged by increasingly diverse biological questions, including EST clustering, genotyping, and comparative genomics. Another major challenge is handling problems inherent to certain datasets, such as repetitive DNA. We have re-engineered Phrap to improve its performance and to allow additional information to be incorporated at specific points in the assembly process. Data that can now be incorporated to guide assembly and prevent misassemblies include mate pairs (reads sequenced from the same template, with known order and orientation with respect to each other) and the expected percent identity between reads, among other constraints. We will discuss the implications of this functionality for tackling difficult finishing issues, such as repetitive sequences. In addition, incorporating mate pair information allows contigs to be coalesced into larger scaffold structures, giving researchers considerably more information about their assemblies. Our assembler is being incorporated into a larger assembly framework, Linea, which will allow sequence and assembly data to be stored, manipulated, and viewed.
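As a rough illustration of how mate pairs can coalesce contigs into scaffolds, the sketch below groups contigs that are linked by mate pairs whose two reads were placed in different contigs. It is not the re-engineered Phrap or Linea implementation: the function name `build_scaffolds`, its input layout, and the `min_links` threshold are all hypothetical, and the sketch ignores the order, orientation, and insert-size checks that a real scaffolder would need.

```python
from collections import defaultdict


def build_scaffolds(read_to_contig, mate_pairs, min_links=2):
    """Group contigs into scaffolds (hypothetical helper, not Phrap's method).

    read_to_contig: dict mapping read name -> name of the contig it was placed in
    mate_pairs:     iterable of (read_a, read_b) tuples from the same template
    min_links:      assumed threshold; contigs are joined only when at least
                    this many mate pairs span them
    """
    # Count mate pairs whose two reads landed in different contigs.
    link_counts = defaultdict(int)
    for read_a, read_b in mate_pairs:
        contig_a = read_to_contig.get(read_a)
        contig_b = read_to_contig.get(read_b)
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue  # unplaced read, or both mates inside one contig
        link_counts[tuple(sorted((contig_a, contig_b)))] += 1

    # Union-find over contigs: each contig starts as its own scaffold.
    parent = {contig: contig for contig in set(read_to_contig.values())}

    def find(contig):
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]  # path compression
            contig = parent[contig]
        return contig

    # Merge contigs that are supported by enough spanning mate pairs.
    for (contig_a, contig_b), count in link_counts.items():
        if count >= min_links:
            parent[find(contig_a)] = find(contig_b)

    # Collect the members of each scaffold in a deterministic order.
    scaffolds = defaultdict(list)
    for contig in parent:
        scaffolds[find(contig)].append(contig)
    return sorted(sorted(members) for members in scaffolds.values())
```

Requiring more than one spanning pair before joining two contigs is one simple way to reduce false joins caused by chimeric templates or reads misplaced in repeats; the threshold of two used as the default here is an assumption, not a value taken from the abstract.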
2004 Plant and Animal Genomics Conference