Wednesday, January 28, 2009

The Next Generation Dilemma: Large Scale Data Analysis

Next week is the AGBT genome conference in Marco Island, Florida. At the conference we will present a poster on work we have been doing with Next Gen Sequencing data analysis. In this post we present the abstract. We'll post the poster when we return from sunny Florida.

Abstract

The volumes of data that can be obtained from Next Generation DNA sequencing instruments make several new kinds of experiments possible and new questions amenable to study. The scale of subsequent analyses, however, presents a new kind of challenge. How do we get from a collection of several million short sequences of bases to genome-scale results? This process involves three stages of analysis that can be described as primary, secondary, and tertiary data analyses. At the first stage, primary data analysis, image data are converted to sequence data. In the middle stage, secondary data analysis, sequences are aligned to reference data to create application-specific data sets for each sample. In the final stage, tertiary data analysis, the data sets are compared to create experiment-specific results. Currently, the software for the primary analyses is provided by the instrument manufacturers and handled within the instrument itself, and when it comes to the tertiary analyses, many good tools already exist. However, between the primary and tertiary analyses lies a gap.

In RNA-Seq, the process of determining relative gene expression means that sequence data from multiple samples must go through the entire process of primary, secondary, and tertiary analysis. To do this work, researchers must puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA, have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present a solution to these challenges that closes the gaps between primary, secondary, and tertiary analysis by showing results from a complete workflow system that includes data collection, processing and analysis for RNA-seq.

And, if you cannot be in sunny Florida, join us in Memphis where we will help kick off the ABRF conference with a workshop on Next Generation DNA Sequencing. I'm kicking the workshop off with a talk entitled "From Reads to Data Sets, Why Next Gen is Not Like Sanger Sequencing."

Labels: , , , , , , , ,

Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post , I introduced the concept of workflows in describing the issues one needs to think about as they prepare their lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by De Novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can also be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping and it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-image-what-else, can all be samples for genetic analysis. I know a guy who uses a 22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.

Labels: , , , , , , ,