Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11 at 10:00 A.M. PST (1:00 P.M. EST), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned from these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene level expression summaries, exon usage, splice junctions, single nucleotide variants, insertions and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.

Register Today!


Wednesday, September 23, 2009

GeneSifter in Current Protocols

This month we are pleased to report Geospiza's publication of the first standard protocols for analyzing Next Generation Sequencing (NGS) data. The publication, appearing in the September issue of Current Protocols, addresses how to analyze data from both microarray and NGS experiments. The abstract and links to the paper and our press release are provided below.

Abstract

Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD).

In the chapter we cover the following protocols:
  • Basic Protocol 1: Comparing Gene Expression from Paired Sample Data Obtained from Microarray Experiments
  • Alternate Protocol 1: Compare Gene Expression from Paired Samples Obtained from Transcriptome Profiling Assays by Next-Generation DNA Sequencing
  • Basic Protocol 2: Comparing Gene Expression from Microarray Experiments with Multiple Conditions
  • Alternate Protocol 2: Compare Gene Expression from Next-Generation DNA Sequencing Data Obtained from Multiple Conditions

Links

To view the abstract, contents, figures, and literature cited online visit: Curr. Protoc. Bioinform. 27:7.14.1-7.14.34

To view the press release visit: Geospiza Team Publishes First Standard Protocol for Next Gen Data Analysis


Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn about how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.


Monday, June 22, 2009

Sneak Peek: RNA-Seq - Global Profiling of Gene Activity and Alternative Splicing

Join us June 30 at 10:00 am PDT. Eric Olson, Geospiza's VP of Product Development, will present an interesting webinar on using RNA-Seq to measure gene expression and discover alternatively spliced messages using GeneSifter Analysis Edition.

Abstract

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA-Seq data analysis process with emphasis on calculating gene and exon level expression values as well as identifying splice junctions from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and ABI’s SOLiD instruments.

To register, visit the Geospiza WebEx event page.


Tuesday, April 21, 2009

What if dbEST was an NGS Experiment? Part I: dbEST

Back in 1997, this alarming statement appeared in a paper [1]:

“Biological research is generating data at an explosive rate. Nucleotide sequence databases alone are growing at a rate of >210 million base pairs (bp)/year and it has been estimated that if the present rate of growth continues, by the end of the millennium the sequence databases will have grown to 4 billion bp!” [emphasis mine]

Imagine 4 billion bp of data - what would we do with all that?


The article was about the defunct Merck Gene Index browser, which was developed to make massive numbers of cDNA sequences, also called Expressed Sequence Tags (ESTs), available through a web-based system. The ESTs were being generated through the Merck Gene Index Project, which was one of many public and private projects focused on collecting EST and full length cDNA sequences from human and model organism samples. The goal of these projects was to create data resources of transcript sequences for studying gene expression and later finding genes in genomic sequence data. Combined, these projects cost tens of millions of dollars and spanned nearly a decade. They also produced millions of ESTs that are now stored in NCBI’s dbEST database [2].

The prediction of GenBank’s growth was close: release 115 of GenBank (December 1999) had 4.6 billion bases. With the most recent release, 9 years later, GenBank has grown to over 103 billion bases, and some would say we are just getting started with sequencing DNA [3].
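To put those two releases in perspective, the growth between them can be turned into an annual rate. The short Python sketch below uses only the figures quoted in this post (4.6 billion bases in December 1999 and roughly 103 billion nine years later) and assumes smooth compound growth, which is a simplification. The function names are mine.

```python
import math

def annual_growth_rate(start, end, years):
    """Compound annual growth rate, assuming smooth exponential growth."""
    return (end / start) ** (1.0 / years) - 1.0

def doubling_time(rate):
    """Years needed to double at a given compound annual rate."""
    return math.log(2) / math.log(1.0 + rate)

# Figures quoted in the post (bases in GenBank).
bases_1999 = 4.6e9
bases_2008 = 103e9

rate = annual_growth_rate(bases_1999, bases_2008, years=9)
print(f"Annual growth: {rate:.0%}")
print(f"Doubling time: {doubling_time(rate):.1f} years")
```

Under that assumption the database grew by roughly 40% per year, which works out to doubling about every two years.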

Today, for a few thousand dollars, a single run of an Illumina, SOLiD, or Helicos instrument can collect more data than all the EST projects combined have ever produced. This raises the question: what would the data look like if dbEST were a Next Generation Sequencing (NGS) experiment?

A Brief History of dbEST

Before we get into comparing dbEST to a Next Generation DNA Sequencing (NGS) experiment, we should discuss what dbEST is and how it came to be. In the early days of automated DNA sequencing (ca. 1990) it was realized that cDNA, reverse transcribed from mRNA, could be partially sequenced and the resulting data could be used to measure which genes are expressed in a cell or tissue. The term EST was coined to describe the fact that each sequence corresponded to an mRNA molecule, and was in effect a “tag” for that molecule [4]. EST stands for Expressed Sequence Tag.

During the early years, EST sequencing was controversial. Many proponents of the genome project felt that collecting ESTs would obviate the need for sequencing the entire genome and that Congress would end funding for the genome project before it was complete. Further controversy arose when NIH decided to patent several of the early brain ESTs. This news created an uproar in the community and led to the famous statement by one Nobel laureate that automated sequencing machines “could be run by monkeys” [5].

ESTs also led to the founding of dbEST [2], a valuable resource for quickly assessing the functional aspects of the genome and later for identifying and annotating genes within genomic sequences. Today, EST projects continue to be worthwhile endeavors for exploring new organisms before full genome sequencing can be performed.

In the 15+ years since the founding of dbEST, the database has grown from 22,537 entries to approximately 61 million (4/17/2009). The first dbEST report contained ESTs from seven organisms. Today, over 1700 organisms are represented in dbEST. The species with the highest numbers of ESTs (> 1,000,000) include human, mouse, corn, pig, Arabidopsis, cow, zebrafish, soybean, Xenopus, rice, Ciona, wheat, and rat. More than half of the species, however, have fewer than 10,000 ESTs. Since January of this year dbEST has grown by more than 2,000,000 entries.

Despite its value, dbEST, like many resources at the NCBI, requires an “expert” level of understanding to be useful. As classical clone-based cDNA sequencing gives way to more cost effective higher throughput methods like NGS, less emphasis will be placed on making this resource useful beyond maintaining the data as an archival resource that the community can access.

What this means is that when you visit the site, it does not look like much is there. You can get links to the original (closed access) papers and learn how many sequences are present for each organism. Accession numbers or gene names can be used to look up a sequence, and from other pages you can use BLAST to search the resource with a query sequence.

If you want to know more, you have to know how to look for the information and deal with it in the context in which it is presented. For example, I mentioned that dbEST has grown since January. I knew this because I looked at the list of organisms and numbers of sequences then and now and noticed that more are reported now. However, telling you which organisms' numbers have increased, or whether new organisms have been added, would require significant time and effort, either saving the different release reports or digging through the dbEST ftp site. When we return to the story, we’ll do some "ftp archaeology" and dig through dbEST records to begin characterizing the human ESTs.
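Had the release reports been saved, the comparison itself would be simple. Here is a minimal Python sketch. It assumes each report has already been reduced to organism and count pairs (the real dbEST summary layout may well differ), and the counts shown are made-up placeholders, not real dbEST numbers.

```python
def compare_releases(old, new):
    """Return (added, grown): organisms new to the later report, and
    per-organism increases for organisms present in both reports."""
    added = {org: n for org, n in new.items() if org not in old}
    grown = {org: new[org] - old[org]
             for org in new
             if org in old and new[org] > old[org]}
    return added, grown

# Placeholder counts for illustration only (not real dbEST figures).
january = {"Homo sapiens": 100, "Mus musculus": 80}
april = {"Homo sapiens": 130, "Mus musculus": 80, "Zea mays": 25}

added, grown = compare_releases(january, april)
print("New organisms:", added)
print("Increased counts:", grown)
```

The hard part, of course, is not this comparison but collecting and parsing the reports in the first place.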


References:

1. Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson A.R., Blevins R.A., 1998. The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14, 2-13.

2. Boguski M.S., Lowe T.M., Tolstoshev C.M., 1993. dbEST--database for “expressed sequence tags”. Nat Genet 4, 332-333. See also: http://www.ncbi.nlm.nih.gov/dbEST/

3. ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

4. Adams M.D., Kelley J.M., Gocayne J.D., Dubnick M., Polymeropoulos M.H., Xiao H., Merril C.R., Wu A., Olde B., Moreno R.F., 1991. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.
And http://www.genomenewsnetwork.org/resources/timeline/1991_Venter.php

5. http://www.nature.com/nature/journal/v405/n6790/full/405983b0.html


Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.
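As a rough illustration of what secondary analysis does, the Python sketch below assigns reads to reference transcripts by exact substring matching and counts the hits per transcript. Real aligners tolerate mismatches, use indexes, and take quality values into account. The sequences and gene names here are invented for the example.

```python
from collections import Counter

def count_hits(reads, references):
    """Count reads that match each reference exactly (toy secondary analysis).

    reads: iterable of read sequences
    references: dict mapping a transcript name to its sequence
    A read that matches several references is counted once for each.
    """
    counts = Counter()
    for read in reads:
        for name, seq in references.items():
            if read in seq:
                counts[name] += 1
    return counts

# Invented reference transcripts and reads.
references = {
    "geneA": "ACGTACGTTAGCCGATAGGCTA",
    "geneB": "TTGACCGGTAACCGTTAGGACC",
}
reads = ["ACGTTAGC", "GATAGGCT", "CCGGTAAC", "GGGGGGGG"]

print(count_hits(reads, references))
```

The per-transcript counts are the kind of frequency information that tertiary analysis then compares between samples.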

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid body (EB) cells were investigated. DNA libraries were produced from each cell line. Sequences were collected from the library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).
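The comparison step can be sketched in a few lines of Python. This only illustrates the general idea (scale each sample's counts to reads per million mapped reads, then take a log2 ratio) and does not describe how GeneSifter computes its results. The gene names and counts are invented, and the pseudocount of 1 is an assumption added to avoid dividing by zero.

```python
import math

def reads_per_million(counts):
    """Scale raw per-gene read counts to reads per million mapped reads."""
    total = sum(counts.values())
    return {gene: n * 1e6 / total for gene, n in counts.items()}

def log2_ratios(sample_a, sample_b, pseudocount=1.0):
    """log2(B/A) for every gene seen in either sample, after scaling."""
    a = reads_per_million(sample_a)
    b = reads_per_million(sample_b)
    genes = sorted(set(a) | set(b))
    return {g: math.log2((b.get(g, 0.0) + pseudocount) /
                         (a.get(g, 0.0) + pseudocount))
            for g in genes}

# Invented counts for two samples.
es_counts = {"gene1": 500, "gene2": 300, "gene3": 200}
eb_counts = {"gene1": 1000, "gene2": 600, "gene3": 400}

for gene, ratio in log2_ratios(es_counts, eb_counts).items():
    print(gene, round(ratio, 3))
```

In this example the second sample simply has twice as many reads in the same proportions, so every ratio comes out at zero: the scaling removes the difference in sequencing depth, and only genuine shifts in relative abundance would show up.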

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed. The question is whether the additional peaks result from isoforms (alternative polyA sites) or from incomplete restriction enzyme digests. How might this be sorted out? As with RNA-Seq, the bottom of the panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.
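Finding the spikes is the easy part. The Python sketch below tallies read start positions on one transcript and reports positions that hold at least a chosen share of the reads. The 10% threshold and the positions are assumptions made up for illustration, and nothing in it can tell an alternative polyA site from an incomplete digest.

```python
from collections import Counter

def tag_peaks(start_positions, min_fraction=0.1):
    """Return positions whose share of the reads is at least min_fraction.

    start_positions: alignment start coordinate of each read on one transcript
    """
    counts = Counter(start_positions)
    total = len(start_positions)
    return sorted(pos for pos, n in counts.items() if n / total >= min_fraction)

# Invented alignment starts on a 2,000 nt transcript: a main spike near the
# 3' end, a smaller second spike upstream, and a little scattered noise.
starts = [1850] * 80 + [1200] * 15 + [40, 310, 775, 1501, 1999]

print(tag_peaks(starts))
```

Here the main spike and the smaller upstream spike both pass the threshold, which is exactly the situation that raises the isoform versus incomplete digest question.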

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments to different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
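The read length check is easy to sketch in Python. The example below builds a length histogram and reports the fraction of reads near 22 nt. The 20 to 24 nt window is my own assumption for what counts as "clustered around 22 nt", and the reads are invented placeholders.

```python
from collections import Counter

def length_histogram(reads):
    """Map read length -> number of reads with that length."""
    return Counter(len(r) for r in reads)

def fraction_in_window(reads, low=20, high=24):
    """Fraction of reads whose length falls in [low, high] nt."""
    if not reads:
        return 0.0
    return sum(low <= len(r) <= high for r in reads) / len(reads)

# Invented reads: mostly ~22 nt, plus one short and one long fragment.
reads = ["A" * 22] * 6 + ["A" * 21] * 2 + ["A" * 15, "A" * 30]

hist = length_histogram(reads)
for length in sorted(hist):
    print(f"{length:3d} {'#' * hist[length]}")
print("Fraction 20-24 nt:", fraction_in_window(reads))
```

A library dominated by much shorter or much longer reads would stand out immediately in a plot like this.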

Section 6 presents some of the challenges of scale that accompany NGS, and how we are addressing them with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can be used to filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC Genome Browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf


Sunday, February 15, 2009

Three Themes from ABRF and AGBT Part I: The Laboratory Challenge

It's been an exciting week on the road at the AGBT and ABRF conferences. From the many presentations and discussions it is clear that the current and future next generation DNA sequencing (NGS) technologies are changing the way we think about genomics and molecular biology. It is also clear that successfully using these technologies impacts research and core laboratories in three significant areas:
  1. The Laboratory: Running successful experiments requires careful attention to detail.
  2. Bioinformatics: Every presentation called out bioinformatics as a major bottleneck. The data are hard to work with and different NGS experiments require different specialized bioinformatics workflows (pipelines).
  3. Information Technology (IT): The bioinformatics bottleneck is exacerbated by IT issues involving data storage, computation, and data transfer bandwidth.

We kicked off ABRF by participating in the Next Gen DNA Sequencing workshop on Saturday (Feb. 7). It was extremely well attended with presentations on experiences in setting up labs for Next Gen sequencing, preparing DNA libraries for sequencing, and dealing with the IT and bioinformatics.

I had the opportunity to provide the “overview” talk. In that presentation, “From Reads to Datasets, Why Next Gen is not Sanger Sequencing,” I focused on the kinds of things you can do with NGS technology, its power, and the high level issues that groups are facing today when implementing these systems. I also introduced one of our research projects on developing scalable infrastructures using HDF5 for Next Gen bioinformatics and high-performing, dynamic software interfaces. Three themes resurfaced again and again throughout the day: one must pay attention to laboratory details, bioinformatics is a bottleneck, and don't underestimate the impact of NGS systems on IT.

In this post, I'll discuss the laboratory details and visit the other themes in posts to come.

Laboratory Management

To better understand the impact of NGS on the lab, we can compare it to Sanger sequencing. In the table below, different categories ranging from the kinds of samples, to their preparation, to the data, are considered to show how NGS differs from Sanger sequencing. Sequencing samples, for example, are very different between Sanger and NGS. In Sanger sequencing, one typically works with clones or PCR amplicons. Each sample (clone or PCR product) produces a single sequence read. Overall, sequencing systems are robust, so the biggest challenge for labs has been tracking the samples as they move from tube to plate or between wells within plates.

In contrast, NGS experiments involve sequencing DNA libraries and each sample produces millions of reads. Presently, only a few samples are sequenced at a time so the sample tracking issues, when compared to Sanger, are greatly reduced. Indeed, one of the significant advantages and cost savings of NGS is to eliminate the need for cloning or PCR amplification in preparing templates to sequence.

Directly sequencing DNA libraries is a key ability and a major factor that makes NGS so powerful. It also directly contributes to the bioinformatics complexity (more on that in the next post). Each one of the millions of reads that are produced from the sample corresponds to an individual molecule, present in the DNA library. Thus, the overall quality of the data and the things you can learn are a direct function of the library.



Producing good libraries requires that you have a good handle on many factors. To begin, you will need to track RNA and DNA concentrations at different steps of the process. You also need to know the “quality” of the molecules in the sample. For example, RNA assays will give the best results when RNA is carefully prepared and free of RNases. In RNA-Seq, the best results are obtained when the RNA is fragmented prior to cDNA synthesis. To understand the quality of the starting RNA, fragmentation, and cDNA synthesis steps, tools like agarose gels or Bioanalyzer traces are used to evaluate fragment lengths and determine overall sample quality. Other assays and sequencing projects have similar processes. Throughout both conferences, it was stressed that regardless of whether you are sequencing genomes or small RNAs, performing RNA-Seq, or running other “tag and count” kinds of experiments, you need to pay attention to the details of the process. Tools like the NanoDrop or qPCR procedures need to be routinely used to measure RNA or DNA concentration. Tools like gels and the Bioanalyzer are used to measure sample quality. And, in many cases, both kinds of tools are used.

Through many conversations, it became clear that Bioanalyzer images, NanoDrop reports, and other lab data quickly accumulate during these kinds of experiments. While an NGS experiment is in progress, these data are pretty accessible and the links between data quality and the collected data are easy to see. It only takes a few weeks, however, for these lab data to disperse. They find their way into paper notebooks, or unorganized folders on multiple computers. When the results from one sample need to be compared to another, a new problem appears. It becomes harder and harder to find the lab data that correspond to each sample.

To summarize, NGS technology makes it possible to interrogate large ensembles of individual RNA or DNA molecules. Different questions can be asked by preparing the ensembles (libraries) in different ways involving complex procedures. To ensure that the resulting data are useful, the libraries need to be of high and known quality. Quality is measured with multiple tools at different points of the process to produce multiple forms of laboratory data. Traditional methods such as laboratory notebooks, files on computers, and post-it notes, however, make these data hard to find when the time comes to compare results between samples.

Fortunately, the GeneSifter Lab Edition solves these challenges. The Lab Edition of Geospiza’s software platform provides a comprehensive laboratory information management system (LIMS) for NGS and other kinds of genetic analysis assays, experiments, and projects. Using web-based interfaces, laboratories can define protocols (laboratory workflows) with any number of steps. Steps may be ordered and required to ensure that procedures are correctly followed. Within each step, the laboratory can define and collect different kinds of custom data (NanoDrop values, Bioanalyzer traces, gel images, ...). Laboratories using the GeneSifter Lab Edition can produce more reliable information because they can track the details of their library preparation and link key laboratory data to sequencing results.
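Independent of any particular LIMS, the underlying idea is simply to keep every measurement attached to the sample and the protocol step that produced it. The Python sketch below is a hypothetical illustration of that idea and is not GeneSifter's data model. The class names, step names, and values are all invented.

```python
from dataclasses import dataclass, field

@dataclass
class Measurement:
    step: str      # e.g. "RNA isolation", "fragmentation"
    tool: str      # e.g. "NanoDrop", "Bioanalyzer"
    value: float
    units: str

@dataclass
class LibraryRecord:
    sample_id: str
    measurements: list = field(default_factory=list)

    def add(self, step, tool, value, units):
        self.measurements.append(Measurement(step, tool, value, units))

    def for_tool(self, tool):
        """All measurements made with one tool, in the order recorded."""
        return [m for m in self.measurements if m.tool == tool]

# Invented sample and values.
lib = LibraryRecord("sample-001")
lib.add("RNA isolation", "NanoDrop", 250.0, "ng/ul")
lib.add("fragmentation", "Bioanalyzer", 200.0, "nt")
lib.add("cDNA synthesis", "NanoDrop", 40.0, "ng/ul")

print([m.value for m in lib.for_tool("NanoDrop")])
```

With records like these, comparing the library preparation of two samples weeks later becomes a lookup rather than a search through notebooks.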


Monday, February 2, 2009

Next Gen Laboratory Software Systems for Core Facilities

Do you have a core lab? Considering adding Next Generation DNA sequencing capacity to your lab? Then you will be interested in visiting our booth and checking out our poster at the annual Association for Biomolecular Research Facilities (ABRF) meeting next week in Memphis, TN. We'll be at booth 408, and presenting poster number V27-S1.

Poster Abstract

Throughout the past year, as next generation sequencing (NGS) technologies have emerged in the marketplace, their promise of what can be done with massive amounts of sequence data has been tempered with the reality that performing experiments and working with the data is extremely challenging. As core labs contemplate acquiring NGS technologies, they must consider how the new technologies will affect their current and future operations. The old model of collecting and delivering data is likely to change to one where the core lab becomes an active participant in advising and helping clients set up experiments and analyze the data. However, while many labs want to utilize NGS, few have the Information Technology (IT) infrastructures and procedures in place to successfully make use of these systems.

In the case of gene expression, NGS technologies are being evaluated as complementary or replacement technologies for microarrays. Assays like RNA-Seq and tag profiling, which focus on measuring relative gene expression, require that researchers and core labs puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with many steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present solutions to these challenges by showing results from a complete workflow system that includes data collection, processing, and analysis for RNA-Seq suited for the core laboratory.

In the poster we'll walk through the laboratory and data analysis issues one needs to think about to perform a two cell expression comparison with RNA-Seq. Below is a snippet from the poster. I'll post the full presentation when I return.


Wednesday, January 28, 2009

The Next Generation Dilemma: Large Scale Data Analysis

Next week is the AGBT genome conference in Marco Island, Florida. At the conference we will present a poster on work we have been doing with Next Gen Sequencing data analysis. In this post we present the abstract. We'll post the poster when we return from sunny Florida.

Abstract

The volumes of data that can be obtained from Next Generation DNA sequencing instruments make several new kinds of experiments possible and new questions amenable to study. The scale of subsequent analyses, however, presents a new kind of challenge. How do we get from a collection of several million short sequences of bases to genome-scale results? This process involves three stages of analysis that can be described as primary, secondary, and tertiary data analyses. At the first stage, primary data analysis, image data are converted to sequence data. In the middle stage, secondary data analysis, sequences are aligned to reference data to create application-specific data sets for each sample. In the final stage, tertiary data analysis, the data sets are compared to create experiment-specific results. Currently, the software for the primary analyses is provided by the instrument manufacturers and handled within the instrument itself, and when it comes to the tertiary analyses, many good tools already exist. However, between the primary and tertiary analyses lies a gap.

In RNA-Seq, the process of determining relative gene expression means that sequence data from multiple samples must go through the entire process of primary, secondary, and tertiary analysis. To do this work, researchers must puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA, have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present a solution to these challenges that closes the gaps between primary, secondary, and tertiary analysis by showing results from a complete workflow system that includes data collection, processing and analysis for RNA-seq.

And, if you cannot be in sunny Florida, join us in Memphis where we will help kick off the ABRF conference with a workshop on Next Generation DNA Sequencing. I'm kicking the workshop off with a talk entitled "From Reads to Data Sets, Why Next Gen is Not Like Sanger Sequencing."


Wednesday, January 21, 2009

The Experts Agree

It depends what you are trying to do. That is the take home message in Genome Technology’s (GT) trouble-shooting guide on picking assembly and alignment algorithms for Next-Gen sequence data.

In the guide, the GT team asked nine Next-Gen sequencing and bioinformatics experts to answer six questions:
  1. How do you choose which alignment algorithm to use?
  2. How do you optimize your alignment algorithm for both high speed and low error rate?
  3. What approach do you use to handle mismatches or alignment gaps?
  4. How do you choose which assembly algorithm to use?
  5. Do you use mate-paired reads for de novo assembly? how?
  6. What impact does the quality of raw read data have on alignment or assembly? how do your algorithms enhance this?
Even a quick look at the questions shows us that many factors need to be considered in setting up a Next-Gen sequencing lab. Questions 1 and 4 point out that aligning sequences is different from assembling them. Other questions address issues related to the size of the data sets being compared, the quality of the data being analyzed, the kinds of information that can be obtained, and the computational approaches being used for different problems.

What the experts said

First, they all agree that different problems require different approaches and have different requirements. In the first question about which aligner to use, the most common response was “for what application and which instrument?” Fundamentally, SOLiD data are different from Illumina GA data, which are different from 454 data. While the end results may all be sequences of A's, G's, C's, and T's, the data are derived in different ways because of the platform-specific twists in collecting the data (recall “Color Space, Flow Space, Sequence Space, or Outer Space”). Not only are there platform-specific methods for interpreting raw data, but multiple programs have also been developed for each instrument, each with its own strengths and weaknesses in terms of speed, sensitivity, the kinds of data it uses (color, base, or flow spaces, quality values, and paired end data), and the information that is finally produced. Hence, in addition to choosing a sequencing platform, you also have to think about the sequencing application, or the kind of experiment, that will be performed. In gene expression studies, for example, an RNA-Seq experiment has different requirements in terms of aligning the data and interpreting the output than an experiment with Tag Profiling.

Overall the trouble-shooting guide discussed 17 total algorithms, eight for alignment, and nine for assembly (two of which were for Sanger methods). Even this selection wasn't a comprehensive list. When other sites [1, 2] and articles [3] are included and proprietary methods are factored in, over 20 algorithms are available. So what to do? Which is best?

That depends

Yes, the choice of algorithm ultimately depends on what you are trying to do. While we can agree that there is no best solution, we also know that is not a helpful response. What is needed is a way to test the suitability of different algorithms for different kinds of experiments and to represent data in standard ways so that the features of specific algorithms can be evaluated. Also, as this is a new field, standard requirements for how data should be aligned, defining a correct alignment, and what kinds of information are the most informative in describing alignments are still emerging. Some of the early programs are helping to define these requirements.

One program we've used at Geospiza for identifying requirements is MAQ, a program for sequence alignment. As noted in previous blogs [MAQ attack], MAQ is a great general purpose tool. It provides comprehensive information about the data being aligned and details about alignments. MAQ works well for many applications including RNA-Seq, Tag Profiling, ChIP-Seq, and resequencing assays focused on SNP discovery. In performance tests, MAQ is slower than some of the newer programs, one of which is being developed by MAQ’s author, but MAQ is a good model for getting the right kinds of information, formatted in a decent way. Indeed, MAQ was the most cited program in the GT guide.

Let’s return to the bigger issue. That is, how can we easily compare between algorithms? For that we need a system where one can easily define a standardized dataset and reference sequence, and have a platform where a new algorithm can be added and run from a common interface. Standard reports that present features of the alignments could then be used to compare programs and parameters.
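To make the idea of a common interface concrete, here is a minimal sketch in Python. It assumes every aligner can be wrapped as a function that takes a read and a reference and returns a position or None. The two aligners are deliberately simple toys written for this illustration (they are not MAQ or any real program), and the names `ALIGNERS` and `compare_aligners` are hypothetical.

```python
def exact_aligner(read, reference):
    """Toy aligner: position of the first exact match, or None."""
    pos = reference.find(read)
    return pos if pos != -1 else None


def one_mismatch_aligner(read, reference):
    """Toy aligner: first position matching with at most one mismatch."""
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= 1:
            return pos
    return None


# Registry: adding a new algorithm means adding one entry here.
ALIGNERS = {"exact": exact_aligner, "one_mismatch": one_mismatch_aligner}


def compare_aligners(reads, reference, aligners=ALIGNERS):
    """Run every registered aligner on the same reads and reference."""
    if not reads:
        raise ValueError("need at least one read")
    report = {}
    for name, align in aligners.items():
        hits = [align(read, reference) for read in reads]
        aligned = sum(hit is not None for hit in hits)
        report[name] = {"aligned": aligned, "fraction": aligned / len(reads)}
    return report


# Standardized (made-up) test data shared by all aligners.
reference = "ACGTACGTTTGACCA"
reads = ["ACGTT", "TTGAC", "TTGTC", "GGGGG"]
report = compare_aligners(reads, reference)
for name, stats in report.items():
    print(name, stats)
```

Because every program sees the same reads and the same reference, the differences in the report reflect the algorithms and their parameters rather than the inputs. A real comparison would report much more than the fraction aligned, such as mismatch counts and mapping qualities.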

The laboratory edition of GeneSifter supports these kinds of comparisons. The distributed system architecture allows one to quickly develop control scripts to run programs and format their output in figures and tables that make comparisons possible. With this kind of system in place, the challenges move from which program to run and how to run it, to how to get the right kinds of information and best display the data. To address these issues, Geospiza’s research and development team is working on projects focused on using technologies like HDF5 to create scalable standardized data models for storing information from alignment and assembly programs. Ultimately this work will make it easy to optimize Next-Gen sequencing applications and assays and compare between assorted programs.

References
1. http://en.wikipedia.org/wiki/Sequence_alignment_software
2. http://www.massgenomics.org/2009/01/short-read-aligners-update-at-agbt.html
3. Shendure J., Ji H., 2008. Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145.


Thursday, November 20, 2008

Introducing GeneSifter

Today, Geospiza announced the acquisition of the award-winning GeneSifter microarray data analysis product. This news has significant implications for Geospiza’s current and new customers. With GeneSifter and FinchLab, Geospiza will deliver complete end to end systems for data intensive genetic analysis applications like Next Gen sequencing and microarrays.

As an example, let's consider transcriptomics or gene expression. One goal of such experiments is to compare the relative gene expression between cells to see how different genes are up or down regulated as the cells change over time or respond to some sort of treatment.

The general process, whether it involves microarrays or Next Gen sequencing, is to measure the number of RNA molecules for a given gene, either over a period of time or after different treatments. Laboratory processes create the molecules to assay, the molecules are measured, data are collected, and we process the data to produce tables of information. These tables are then compared with one another to identify genes that are differentially expressed. With the gene expression results in hand, one can delve deeper by utilizing other databases like Entrez Gene or pathway sites to learn about gene function and gain insights.
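As a rough illustration of the table comparison step, the sketch below assumes each sample's table is simply a mapping from gene name to a count, and flags genes whose counts differ at least two-fold between two samples. The gene names, counts, pseudocount, and threshold are all made up for the example; this is not GeneSifter's method.

```python
import math


def log2_ratios(control, treated, pseudocount=1.0):
    """Return {gene: log2(treated / control)} for genes in either table.

    The pseudocount avoids dividing by zero when a gene is absent
    from one of the tables.
    """
    genes = set(control) | set(treated)
    return {
        gene: math.log2((treated.get(gene, 0) + pseudocount)
                        / (control.get(gene, 0) + pseudocount))
        for gene in genes
    }


def changed_genes(ratios, min_abs_log2=1.0):
    """Genes up or down regulated at least two-fold (by default)."""
    return sorted(g for g, r in ratios.items() if abs(r) >= min_abs_log2)


# Hypothetical count tables for two samples.
control = {"geneA": 99, "geneB": 49, "geneC": 10}
treated = {"geneA": 399, "geneB": 49, "geneC": 1}

ratios = log2_ratios(control, treated)
print(changed_genes(ratios))  # ['geneA', 'geneC']
```

A real analysis would first normalize the tables and use replicates with a statistical test rather than a bare fold-change cutoff.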

From a systems perspective, you need a LIMS to define sample information and keep track of workflow steps and the data generated at the bench. You will also need to track which samples are on a slide, or lane, or well when the data are collected. You will need to store and organize the data by sample. Then, you will need to analyze the data through multiple programs in a pipelined process (filter, align ...) to produce information, like gene lists, that can be compared for each sample. You may want to review this information to see that your experiments are on track and then, if they are, you will want to compare the gene lists from different experiments to tell a story.

FinchLab, combined with Geospiza’s hosted Software as a Service (SaaS) delivery, solves challenges related to IT, LIMS, and the core data analysis. GeneSifter completes the process by delivering a software solution that lets you compare your gene lists. GeneSifter provides information about the relative gene expression between samples and links gene information to key public resources to uncover additional details.

It's an exciting time for those in the genetic analysis and genomics fields. New high throughput data collection technologies are giving scientists the ability to interrogate systems and understand biology in a whole new way. As we come to the end of 2008 and think about 2009, Geospiza is excited to think about how we will integrate and extend our products to further develop end to end systems for a wide variety of genomics applications that target basic and clinical research to help us improve human health and well being.




Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and the other 100's of papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state-of-the-art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.
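One simple example of such a normalization is scaling each sample's counts to reads per million aligned reads, so that two samples sequenced to different depths can be compared. The sketch below assumes a table is a mapping from gene to aligned read count; the data are invented, and sequencing depth is only one of the biases a real pipeline would need to handle.

```python
def reads_per_million(counts):
    """Scale raw read counts so each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no aligned reads")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical samples: same relative expression, 10x different depth.
sample_a = {"gene1": 500, "gene2": 1500}
sample_b = {"gene1": 5000, "gene2": 15000}

rpm_a = reads_per_million(sample_a)
rpm_b = reads_per_million(sample_b)
print(rpm_a)  # {'gene1': 250000.0, 'gene2': 750000.0}
print(rpm_a == rpm_b)  # True
```

After scaling, the two samples are indistinguishable, which is the desired outcome when the only difference between them is how deeply they were sequenced.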

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing, and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).


Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads, depending on the application. This scale-up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And, the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
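The sorting step can be sketched in a few lines of Python. This version assumes the barcode is the first few bases of each read and must match exactly; actual platforms may read the barcode separately and tolerate sequencing errors, so treat the layout, the barcode sequences, and the sample names as assumptions made for illustration.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, barcode_len):
    """Sort reads into per-sample bins by their leading barcode.

    reads: iterable of sequence strings whose first barcode_len bases
           are the sample tag
    barcodes: {barcode sequence: sample name}
    Returns {sample name: [read with barcode removed, ...]}; reads whose
    tag is not recognized are collected under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        bins[barcodes.get(tag, "unassigned")].append(insert)
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
reads = ["ACGTGGGAAA", "TGCATTTCCC", "ACGTCCCTTT", "NNNNAAAAAA"]

by_sample = demultiplex(reads, barcodes, barcode_len=4)
print(by_sample["sample1"])     # ['GGGAAA', 'CCCTTT']
print(by_sample["sample2"])     # ['TTTCCC']
print(by_sample["unassigned"])  # ['AAAAAA']
```

Keeping an explicit "unassigned" bin is a deliberate choice: the fraction of reads that land there is a quick quality check on the barcoding itself.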

Science:

What you can do with 100's of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed the long read advantage in that approximately 1.5% more of the yeast genome studied was covered by 454 data than with shorter read technology. And, the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.

When false positive rates of mutation detection were compared, SOLiD had zero for all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x, and zero false positives for 19x and 44x (full run of one slide) coverage, and 454 had 17, six, and one false positive for 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data, was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; but replicates were lacking. Would the results change if each library preparation and sequencing process was repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces different error profiles and the different software programs each make different assumptions about how to use the error profiles to align data and assess variation, the results should not be overinterpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on, after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can elucidate a cause and effect model and find where a transcription factor is binding and which genes it potentially controls.
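At its simplest, combining the two measurements is a set comparison. The sketch below assumes the ChIP-Seq binding sites have already been assigned to genes and the RNA-Seq data have already been reduced to a list of induced genes; the function name and gene names are hypothetical, and this is not a description of Dr. Gray's analysis.

```python
def classify_targets(bound_genes, induced_genes):
    """Cross-reference ChIP-Seq and RNA-Seq gene lists.

    bound_genes: genes with a transcription factor binding site nearby
    induced_genes: genes whose transcription is turned on
    """
    bound, induced = set(bound_genes), set(induced_genes)
    return {
        "bound_and_induced": sorted(bound & induced),  # candidate targets
        "bound_only": sorted(bound - induced),         # binding, no response
        "induced_only": sorted(induced - bound),       # other regulators?
    }


# Hypothetical gene lists from the two experiments.
bound = ["geneA", "geneB", "geneC"]
induced = ["geneB", "geneC", "geneD"]

result = classify_targets(bound, induced)
print(result["bound_and_induced"])  # ['geneB', 'geneC']
print(result["bound_only"])         # ['geneA']
print(result["induced_only"])       # ['geneD']
```

The "bound_and_induced" genes are only candidates: an overlap is consistent with the transcription factor controlling those genes, but it does not by itself prove cause and effect.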

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage of trying to purify disappearase - "the more you purify the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatments and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first challenge, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.

These are exactly the kinds of problems that Geospiza solves.


Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.


Thursday, September 4, 2008

The Ends Justify the DNA

In Next Gen experiments, libraries of DNA fragments are created in different ways, from different samples, and sequenced in a massively parallel format. The preparation of libraries is a key step in these experiments. Understanding and validating the results requires knowing how the libraries were created and where the samples came from.

Background

In the last post, I introduced the concept that nearly all Next Gen sequencing applications are fundamentally quantitative assays that utilize DNA sequences as data points.

In Sanger sequencing, the new DNA molecules are synthesized, beginning at a single starting point determined by the primer. If the sequencing primer binds to heterogeneous molecules that contain the same binding site, for example, two slightly different viruses in a mixed population, a single read from Sanger sequencing could represent a mixture of many different molecules in the population, with multiple bases at certain positions. Next Gen sequencing, on the other hand, produces single reads from single individual molecules. This difference between the two methods allows one to simultaneously collect millions of sequence reads in a massively parallel format from single samples.

An additional benefit of massively parallel sequencing is that it eliminates the need to clone DNA, or create numerous PCR products. Although this change reduces the complexity of tracking samples, it increases the need to track experiments with greater detail and think about how we work with the data, how we analyze the data, and how we validate our observations to generate hypotheses, make discoveries, and identify new kinds of systematic artifacts.

Making Libraries

To better understand the significance of what a Next Gen experiment measures, we need to understand what DNA libraries are and how they are prepared. For this discussion we'll define a DNA library as a random collection of DNA molecules (or fragments) that can be separated and identified.

Before we do any kind of Next Gen experiment, we want to know something about the kinds of results we’d expect to see from our library. To begin, let’s consider what we would see from a genomic library consisting of EcoRI restriction fragments. If the digestion is complete, EcoRI will cut DNA between the G and the A every time it encounters the sequence 5'-GAATTC-3'. Every fragment in this library would have the sequence 5'-AATT-3' at every 5’ end. Assuming the four bases occur with equal frequency, a six-base site appears on average once every 4^6 = 4096 bases, so the average length of the fragments will be 4096 bases (~4 kbp). However, the distribution of fragment lengths follows Poisson statistics [1], so the actual library will have a few very large fragments (>> 4 kbp) and numerous small fragments.

You may ask “why is this useful?”

Our EcoRI library example helps us to think about our expectations for Next Gen experimental results. That is, if we collect 10 million reads from a sample, what should we expect to see when we compare our data to reference data? We need to know what kinds of results to expect in order to determine if our data represent discoveries, or artifacts. Artifacts can be introduced during sample preparation, sample tracking, library preparation, or from the data collection instruments. If we can’t distinguish between artifacts and discoveries, the artifacts will slow us down and lead to risky publications.

In the case of our EcoRI digest, we can use our predictions to validate our results. If we collected sequences from the estimated 732,000 fragments and aligned the resulting data back to a reference genome, we would expect to see blocks of aligned reads at every one of the 732,000 restriction sites. Further, for each site there should be two blocks, one showing matches to the "forward" strand and one showing matches to the "reverse" strand.
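The arithmetic behind these estimates can be checked with a short Python sketch. It assumes the four bases are equally frequent and independent, and it assumes a genome of 3 billion bases; the genome size is not stated above, but it is the value that reproduces the estimate of roughly 732,000 fragments.

```python
SITE = "GAATTC"                # EcoRI recognition sequence
GENOME_SIZE = 3_000_000_000    # assumption: roughly human-sized genome

# Chance that a given position starts an EcoRI site: (1/4)^6 = 1/4096.
site_probability = 0.25 ** len(SITE)

# Average spacing between sites, which is the mean fragment length.
mean_fragment_length = 1 / site_probability

# Expected number of sites (and so fragments) in the genome.
expected_fragments = GENOME_SIZE * site_probability


def fraction_shorter_than(length, p=site_probability):
    """Fraction of fragments shorter than `length` bases, treating each
    base as an independent chance p of starting a site."""
    return 1 - (1 - p) ** length


print(mean_fragment_length)       # 4096.0
print(round(expected_fragments))  # 732422
print(fraction_shorter_than(1000))
```

Under these assumptions about a fifth of the fragments are shorter than 1,000 bases, which is why a library with a 4096-base average still contains numerous small fragments. Real genomes have uneven base composition, so observed counts will differ from this idealized estimate.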

We could also validate our data set by identifying the positions of EcoRI restriction sites in our reference data. What we'd likely see is that most things work perfectly. In some cases, however, we would also see alignments, but no evidence of a restriction site. In other cases, we would see a restriction site in the reference genome, but no alignments. These deviations would identify differences between the reference sequence and the sequence of the genome we used for the experiment. Those differences could result either from errors in the sequence of the reference data or from a true biological difference. In the latter case, we would examine the bases and confirm the presence of a restriction fragment length polymorphism (RFLP). From this example, we can see how we can define the expected results, and use that prediction to validate our data and determine whether our results correspond to interesting biology or experimental error.
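Locating the sites in reference data amounts to an in silico digest. The following sketch finds every EcoRI site in a sequence and reports the fragment lengths a complete digest of a linear molecule would produce. The function names and the short reference string are invented for the example; because GAATTC reads the same on both strands, searching one strand is sufficient.

```python
def find_sites(sequence, site="GAATTC"):
    """Return 0-based start positions of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site="GAATTC", cut_offset=1):
    """Fragment lengths from a complete digest of a linear sequence.

    cut_offset=1 places the cut after the G, as in G^AATTC.
    """
    cuts = [pos + cut_offset for pos in find_sites(sequence, site)]
    edges = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(edges, edges[1:])]


# Made-up reference containing two EcoRI sites.
reference = "CCGAATTCAAAAGGGAATTCTT"

print(find_sites(reference))        # [2, 14]
print(fragment_lengths(reference))  # [3, 12, 7]
```

Comparing the predicted site positions with the positions where blocks of reads actually align gives the two kinds of deviation described above: alignments with no predicted site, and predicted sites with no alignments.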

Digital Gene Expression

Of course what we expect to see in the data is a function of the kind of experiment we are trying to do. To illustrate this point I'll compare two different kinds of Next Gen experiments that are both used to measure gene expression: Tag Profiling and RNA-Seq.

In Tag Profiling, mRNA is attached to a bead, converted to cDNA, and digested with restriction enzymes. The single fragments that remain attached to the beads are isolated and ligated to adaptor molecules, each one containing a type II restriction site. The fragments are further digested with the type II restriction enzyme and ligated to a sequencing adaptor to create a library of cDNA ends with 17 unique bases, or tags. Sequencing such a library will, in theory, yield a collection of reads that represents the population of RNA molecules in the starting material. Highly expressed genes will be represented by a larger number of tagged sequences than genes expressed at lower levels.

Both Tag profiling and RNA-Seq begin with an mRNA purification step, but after that point the procedures differ. Rather than synthesize a single full-length cDNA for every transcript, RNA-Seq uses random six-base primers to initiate cDNA synthesis at many different positions in each RNA molecule. Because these primers represent every combination of six base sequences, priming with these sequences produces a collection of overlapping cDNA molecules. Starting points for DNA synthesis will be randomly distributed, giving high sequence coverage for each mRNA in the starting material. Like Tag Profiling, genes expressed at high levels will have more sequences present in the data than genes expressed at low levels. Unlike Tag Profiling, any single transcript will produce several cDNAs aligning at different locations.

When the sequence data sets for Tag Profiling and RNA-seq are compared, we can see how the different methods for preparing the DNA libraries contrast with one another. In this example, Tag Profiling [2] and RNA-seq [3] data sets were aligned to human mRNA reference sequences (RefSeq, NCBI). The data were processed with Maq [4] and results displayed in FinchLab. In both cases, relative gene expression can be estimated by the number of sequences that align. If we know the origins of the libraries, the kinds of genes and their expression can give us confidence that the results fit the expression profile we expect. For example the RNA-seq data set is from mouse brain and we see genes at the top of the list that we expect to be expressed in this kind of tissue (last figure below).

The Tag Profiling and RNA-seq data sets also show striking differences that reflect how the libraries are prepared. In each report, the second column gives information about the distribution of alignments in the reference sequence. For Tag Profiling this is reported as "Tags." The number of Tags corresponds to the number of positions along the reference sequence where the tagged sequences align. In an ideal system, we would expect one tag per molecule of RNA. Next Gen experiments, however, are very sensitive, so we can also see tags for incomplete digests. Additionally, sequencing errors and high mismatch tolerance in the alignments can sometimes place reads incorrectly and give unusually high numbers of tags. When the data are more closely examined, we do see that the distribution of alignments follows our expectations more closely. That is, we generally see a high number of reads at one site, with the other tag sites showing a low number of aligned reads.


For RNA-seq, on the other hand, we display the second column (Read Map) as an alignment graph. For RNA-seq data, we expect that the number of alignment start points will be very high and randomly distributed throughout the sequence. We can see that this expectation matches our results by examining the thumbnail plots. In the Read Map graphs, the x-axis represents the gene length and the y-axis is the base density. Presently, all graphs have their data plotted on a normalized x-axis, so the length of an mRNA sequence corresponds to the density of data points in the graph. Longer genes have points that are closer together. You can also see gaps in the plots; some are internal and many are at the 3'-end of the genes. When the alignments are examined more closely, and we incorporate our knowledge of the exon structure or polyA addition sites, we can see that many of these gaps either show potential sites for alternative splicing or data annotation issues.
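The contrast between the two library types can be summarized with two numbers per reference sequence: how many distinct positions have reads starting at them, and what fraction of reads start at the most common position. The sketch below uses made-up start positions and a hypothetical function name; it is an illustration of the expectation, not how FinchLab computes its reports.

```python
from collections import Counter


def start_site_summary(starts):
    """Summarize alignment start positions on one reference sequence.

    Returns (number of distinct start sites,
             fraction of reads at the most common site).
    """
    if not starts:
        raise ValueError("no alignments to summarize")
    counts = Counter(starts)
    top_count = counts.most_common(1)[0][1]
    return len(counts), top_count / len(starts)


# Tag Profiling-like: nearly all reads at one tag site, a few elsewhere.
tag_like = [1200] * 95 + [300] * 3 + [850] * 2

# RNA-Seq-like: start points spread along the transcript.
rnaseq_like = list(range(0, 1000, 10))

print(start_site_summary(tag_like))     # (3, 0.95)
print(start_site_summary(rnaseq_like))  # (100, 0.01)
```

A Tag Profiling data set that looks like the second case, or an RNA-Seq data set that looks like the first, would be a signal to look for artifacts before interpreting the expression values.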


In summary, Next Gen experiments use DNA sequencing to identify and count molecules, from libraries, in a massively parallel format. The preparation of the libraries allows us to define expected outcomes for the experiment and choose methods for validating the resulting data. FinchLab makes use of this information to display data in ways that make it easy to quickly observe results from millions of sequence data points. With these high-level views and links to drill down reports and external resources, FinchLab provides researchers with the tools needed to determine whether their experiments are on track to creating new insights, or if new approaches are needed to avoid artifacts.

References

[1] The distribution of restriction enzyme sites in Escherichia coli. G A Churchill, D L Daniels, and M S Waterman. Nucleic Acids Res. 1990 February 11; 18(3): 589–597.

[2] Tag Profile dataset was obtained from Illumina.

[3] Mapping and quantifying mammalian transcriptomes by RNA-Seq. A Mortazavi, BA Williams, K McCue, L Schaeffer, B Wold. Nat Methods. 2008 Jul;5(7):621-8. Epub 2008 May 30.
Data available at: http://woldlab.caltech.edu/rnaseq/

[4] Mapping short DNA sequencing reads and calling variants using mapping quality scores. H Li, J Ruan, R Durbin. Genome Res. 2008 Aug 19. [Epub ahead of print]
