Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis) images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types such as contigs or annotated tables of alignments, that are further analyzed (tertiary analysis) in comparative ways. Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure gene expression and identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that only match the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.
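As a rough illustration of this QC measure, the following Python sketch computes the fraction of reads that match only the primer library. The data layout (a dictionary from read ID to the libraries it hit), the library names, and the function name are assumptions made for this example; a real pipeline would fill this summary from aligner output.

```python
def primer_only_fraction(hits_by_read):
    """Return the fraction of reads whose only hits are to the primer library."""
    if not hits_by_read:
        return 0.0
    primer_only = sum(
        1
        for libraries in hits_by_read.values()
        if libraries and set(libraries) == {"primers"}
    )
    return primer_only / len(hits_by_read)


# Hypothetical alignment summary: read ID -> libraries the read matched.
example_hits = {
    "read1": ["primers"],             # no insert: matches primers only
    "read2": ["primers", "sample"],   # primer plus insert
    "read3": ["sample"],
    "read4": ["sample"],
}

print(primer_only_fraction(example_hits))
```

A lower value suggests a better library, since fewer reads came from constructs without inserts.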

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
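The filter idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `filter_cascade` and `exact_substring_match` names are made up for the example, the sequences are invented, and exact substring search stands in for a real aligner.

```python
def filter_cascade(reads, libraries, matches):
    """Align reads to libraries in order; matched reads are removed from later stages.

    `matches(read, sequences)` is a stand-in for a real aligner.
    Returns (read counts per library, reads that matched nothing).
    """
    counts = {}
    remaining = list(reads)
    for name, sequences in libraries:
        matched = {r for r in remaining if matches(r, sequences)}
        counts[name] = sum(1 for r in remaining if r in matched)
        # Reads that matched this library never reach the next one.
        remaining = [r for r in remaining if r not in matched]
    return counts, remaining


def exact_substring_match(read, sequences):
    """Toy matcher: a read 'aligns' if it occurs exactly in any library sequence."""
    return any(read in seq for seq in sequences)


# Hypothetical libraries, listed in the order the filters are applied.
libraries = [
    ("primers", ["ACGTACGT"]),
    ("rRNA", ["GGGTTTCCCAAA"]),
    ("genome", ["TTTTACGTGGGTTTAAAC"]),
]
reads = ["ACGT", "GGGTTT", "TTAAAC", "CCCCCC"]

counts, unmatched = filter_cascade(reads, libraries, exact_substring_match)
print(counts)
print(unmatched)
```

Note that "ACGT" and "GGGTTT" also occur in the toy genome sequence, but the cascade never records those hits because the reads were removed at earlier stages.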

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or both perfect and imperfect matches within a library, is lost. Finally, because alignment formats between programs differ and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (General Feature Format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or in files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.
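To make the last point concrete, here is a minimal Python sketch that turns the alignments overlapping one region into wiggle-style coverage lines. It does not read HDF5; the tuple layout, the `region_coverage_wig` name, and the coordinates are assumptions for illustration. Only the `variableStep` declaration followed by "position value" lines is taken from the wiggle format itself.

```python
from collections import Counter


def region_coverage_wig(alignments, chrom, start, end):
    """Build variableStep wiggle text for per-base read depth in one region.

    `alignments` is an iterable of (chrom, start, end) tuples using
    1-based, inclusive coordinates (an assumed layout for this sketch).
    """
    depth = Counter()
    for a_chrom, a_start, a_end in alignments:
        # Skip alignments on other chromosomes or outside the region.
        if a_chrom != chrom or a_end < start or a_start > end:
            continue
        for pos in range(max(a_start, start), min(a_end, end) + 1):
            depth[pos] += 1
    lines = [f"variableStep chrom={chrom}"]
    lines += [f"{pos} {depth[pos]}" for pos in sorted(depth)]
    return "\n".join(lines)


# Hypothetical alignments selected from a larger store.
example_alignments = [
    ("chr1", 100, 103),
    ("chr1", 102, 105),
    ("chr2", 100, 103),
]

wig_text = region_coverage_wig(example_alignments, "chr1", 101, 104)
print(wig_text)
```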

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to think about how to perform differential alignments and implement the process. In this case, a set of reads (obtained from cDNA) are aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, RefSeq transcripts, the human genome, and a library of exon junctions. During alignment, up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
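A small Python sketch shows why keeping every alignment is more flexible. The record layout (read ID, library, mismatch count), the library names, and both function names are assumptions for this example, and an in-memory list stands in for alignment data that would be stored in HDF5.

```python
from collections import defaultdict

# Hypothetical unfiltered alignment records: (read_id, library, mismatches).
alignments = [
    ("r1", "primers", 0),
    ("r1", "genome", 2),
    ("r2", "rRNA", 0),
    ("r3", "genome", 0),
    ("r3", "exon_junctions", 1),
    ("r4", "genome", 3),
]


def reads_per_library(records, max_mismatches=3):
    """'Biological' question: how many distinct reads hit each library?"""
    reads = defaultdict(set)
    for read_id, library, mismatches in records:
        if mismatches <= max_mismatches:
            reads[library].add(read_id)
    return {library: len(ids) for library, ids in reads.items()}


def subtractive_counts(records, library_order, max_mismatches=3):
    """Mimic a filter pipeline: count each read only in the first library it hits."""
    hits = defaultdict(set)
    for read_id, library, mismatches in records:
        if mismatches <= max_mismatches:
            hits[library].add(read_id)
    seen = set()
    counts = {}
    for library in library_order:
        new_reads = hits[library] - seen
        counts[library] = len(new_reads)
        seen |= new_reads
    return counts


print(reads_per_library(alignments))
print(reads_per_library(alignments, max_mismatches=0))
print(subtractive_counts(alignments, ["primers", "rRNA", "genome", "exon_junctions"]))
```

Both kinds of question, and the perfect-match variant, are answered from the same stored records without rerunning any alignment.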

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.


Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario, in our poster, where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


As this poster overlaps the previous poster in terms of providing information about RNA assays and analyzing the data, our main points below will focus on how GeneSifter Lab Edition solves challenges related to laboratory and business processes associated with setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services for which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected, reviewed, and the results are delivered back to clients. At the end of the day (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. When compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data from both a quality control and biological meaning context. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull down menus, and text entry fields. Fields can be required or optional, and special rules such as ranges for values can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data, like gel images, about samples. To handle that special “for lab use only” information, fields in forms can be specified as laboratory use only. Such fields are hidden from the customer’s view and are filled in later by lab personnel when the orders are processed. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated along with the problem of finding information scattered on multiple computers.

Web-forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web-page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill downs, and auto fill downs with number increments so one can easily “copy” common items into all cells of a column, or increment an ending number to all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly in flexible ways.

Second, sometimes the data to be entered are already in an Excel spreadsheet. To solve this problem, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template and the rules, previously assigned to fields when the form was created, are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time when they are easy to fix, rather than later when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
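The general idea of checking uploaded rows against per-field rules can be sketched in Python. This is not GeneSifter's validator; the rule keys (`required`, `range`), the field names, and the `validate_rows` function are all assumptions for this example, and rows are plain dictionaries of strings such as a spreadsheet reader might produce.

```python
def validate_rows(rows, rules):
    """Check each row against per-field rules and return a list of problems.

    Each problem is a (row_number, field, message) tuple; row numbers start at 1.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for field, rule in rules.items():
            value = row.get(field)
            if value is None or value == "":
                if rule.get("required", False):
                    problems.append((row_number, field, "missing required value"))
                continue
            if "range" in rule:
                low, high = rule["range"]
                try:
                    number = float(value)
                except (TypeError, ValueError):
                    problems.append((row_number, field, "not a number"))
                    continue
                if not low <= number <= high:
                    problems.append((row_number, field, f"outside {low}-{high}"))
    return problems


# Hypothetical form definition and uploaded rows.
rules = {
    "sample_name": {"required": True},
    "concentration_ng_ul": {"required": True, "range": (1, 1000)},
    "notes": {},
}
rows = [
    {"sample_name": "S1", "concentration_ng_ul": "250", "notes": ""},
    {"sample_name": "", "concentration_ng_ul": "5000", "notes": ""},
    {"sample_name": "S3", "concentration_ng_ul": "high", "notes": "rush"},
]

for problem in validate_rows(rows, rules):
    print(problem)
```

Reporting the row number and field with each problem is what makes the errors easy to fix at upload time.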

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. If a lab runs different kinds of assays they will have to manage multiple procedures that may have different requirements for ordering of steps and laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.
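One way to picture required and optional steps together with parent and sub-sample tracking is the small Python data model below. It is a sketch under stated assumptions, not a description of how GeneSifter stores anything: the `Step` and `Sample` classes, their fields, and the sample names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str
    required: bool = True


@dataclass
class Sample:
    name: str
    parent: Optional["Sample"] = None
    completed: List[str] = field(default_factory=list)

    def aliquot(self, name):
        """Create a sub-sample that remembers which sample it came from."""
        return Sample(name=name, parent=self)

    def lineage(self):
        """Return sample names from the original parent down to this sample."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


def missing_required(sample, workflow):
    """List required steps in a workflow that the sample has not completed."""
    return [s.name for s in workflow if s.required and s.name not in sample.completed]


# Hypothetical shared RNA preparation workflow with one optional step.
rna_prep = [
    Step("purify RNA"),
    Step("measure concentration"),
    Step("run Bioanalyzer", required=False),
]

rna = Sample("RNA-001")
rna.completed.append("purify RNA")

# Aliquots of the same preparation go on to different downstream workflows.
ngs_aliquot = rna.aliquot("RNA-001-NGS")
array_aliquot = rna.aliquot("RNA-001-ARRAY")

print(missing_required(rna, rna_prep))
print(ngs_aliquot.lineage())
```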

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be added by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months or years, with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container is used to describe 96 or 384-well plates, glass slides with divided areas called lanes, regions, chambers, or microarray chips. All of these formats are supported and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools at some other time; be assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this was covered in the previous post, so it will not be repeated here.

I will summarize by making the final points that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways to get scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software as a Service, solutions allow groups to start inexpensively and manage costs as the needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.


Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... and the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) and embryoid body (EB) cells were investigated. DNA libraries were produced from each cell line. Sequences were collected from each library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed, which raises a question: are the additional peaks the result of isoforms (alternative polyA sites) or incomplete restriction enzyme digests? How might this be sorted out? As with RNA-Seq, the bottom panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments in different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. As with RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
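The read length check is simple to sketch in Python. The function names, the two-nucleotide window around 22 nt, and the toy reads are assumptions for this example; the poster's histograms were built from real alignments.

```python
from collections import Counter


def length_histogram(reads):
    """Count how many reads have each length."""
    return Counter(len(read) for read in reads)


def fraction_near(histogram, center=22, window=2):
    """Fraction of reads whose length is within `window` nt of `center`."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    near = sum(n for length, n in histogram.items() if abs(length - center) <= window)
    return near / total


# Toy reads: three near 22 nt and one longer fragment.
toy_reads = ["A" * 22, "C" * 21, "G" * 23, "T" * 35]

histogram = length_histogram(toy_reads)
for length in sorted(histogram):
    print(length, "#" * histogram[length])
print(fraction_near(histogram))
```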

Section 6 presents some of the challenges of scale that accompany NGS, and how we are addressing them with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC genome browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf


Sunday, February 15, 2009

Three Themes from ABRF and AGBT Part I: The Laboratory Challenge

It's been an exciting week on the road at the AGBT and ABRF conferences. From the many presentations and discussions it is clear that the current and future next generation DNA sequencing (NGS) technologies are changing the way we think about genomics and molecular biology. It is also clear that successfully using these technologies impacts research and core laboratories in three significant areas:
  1. The Laboratory: Running successful experiments requires careful attention to detail.
  2. Bioinformatics: Every presentation called out bioinformatics as a major bottleneck. The data are hard to work with and different NGS experiments require different specialized bioinformatics workflows (pipelines).
  3. Information Technology (IT): The bioinformatics bottleneck is exacerbated by IT issues involving data storage, computation, and data transfer bandwidth.

We kicked off ABRF by participating in the Next Gen DNA Sequencing workshop on Saturday (Feb. 7). It was extremely well attended with presentations on experiences in setting up labs for Next Gen sequencing, preparing DNA libraries for sequencing, and dealing with the IT and bioinformatics.

I had the opportunity to provide the “overview” talk. In that presentation, “From Reads to Datasets, Why Next Gen is not Sanger Sequencing,” I focused on the kinds of things you can do with NGS technology, its power, and the high level issues that groups are facing today when implementing these systems. I also introduced one of our research projects on developing scalable infrastructures using HDF5 for Next Gen bioinformatics and high-performing, dynamic, software interfaces. Three themes resurfaced again and again throughout the day: one must pay attention to laboratory details, bioinformatics is a bottleneck, and don't underestimate the impact of NGS systems on IT.

In this post, I'll discuss the laboratory details and visit the other themes in posts to come.

Laboratory Management

To better understand the impact of NGS on the lab, we can compare it to Sanger sequencing. In the table below, different categories ranging from the kinds of samples, to their preparation, to the data, are considered to show how NGS differs from Sanger sequencing. Sequencing samples, for example, are very different between Sanger and NGS. In Sanger sequencing, one typically works with clones or PCR amplicons. Each sample (clone or PCR product) produces a single sequence read. Overall, sequencing systems are robust, so the biggest challenge for labs has been tracking the samples as they move from tube to plate or between wells within plates.

In contrast, NGS experiments involve sequencing DNA libraries and each sample produces millions of reads. Presently, only a few samples are sequenced at a time so the sample tracking issues, when compared to Sanger, are greatly reduced. Indeed, one of the significant advantages and cost savings of NGS is to eliminate the need for cloning or PCR amplification in preparing templates to sequence.

Directly sequencing DNA libraries is a key ability and a major factor that makes NGS so powerful. It also directly contributes to the bioinformatics complexity (more on that in the next post). Each one of the millions of reads that are produced from the sample corresponds to an individual molecule, present in the DNA library. Thus, the overall quality of the data and the things you can learn are a direct function of the library.



Producing good libraries requires that you have a good handle on many factors. To begin, you will need to track RNA and DNA concentrations at different steps of the process. You also need to know the “quality” of the molecules in the sample. For example, RNA assays will give the best results when RNA is carefully prepared and free of RNases. In RNA-Seq, the best results are obtained when the RNA is fragmented prior to cDNA synthesis. To understand the quality of the starting RNA, fragmentation, and cDNA synthesis steps, tools like agarose gels or Bioanalyzer traces are used to evaluate fragment lengths and determine overall sample quality. Other assays and sequencing projects have similar processes. Throughout both conferences, it was stressed that regardless of whether you are sequencing genomes, sequencing small RNAs, or performing RNA-Seq or other “tag and count” kinds of experiments, you need to pay attention to the details of the process. Tools like the NanoDrop or qPCR procedures need to be routinely used to measure RNA or DNA concentration. Tools like gels and the Bioanalyzer are used to measure sample quality. And, in many cases, both kinds of tools are used.

Through many conversations, it became clear that Bioanalyzer images, Nanodrop reports, and other lab data quickly accumulate during these kinds of experiments. While an NGS experiment is in progress, these data are pretty accessible and the links between data quality and the collected data are easy to see. It only takes a few weeks, however, for these lab data to disperse. They find their way into paper notebooks, or unorganized folders on multiple computers. When the results from one sample need to be compared to another, a new problem appears. It becomes harder and harder to find the lab data that correspond to each sample.

To summarize, NGS technology makes it possible to interrogate large ensembles of individual RNA or DNA molecules. Different questions can be asked by preparing the ensembles (libraries) in different ways involving complex procedures. To ensure that the resulting data are useful, the libraries need to be of high and known quality. Quality is measured with multiple tools at different points of the process to produce multiple forms of laboratory data. Traditional methods such as laboratory notebooks, files on computers, and post-it notes, however, make these data hard to find when the time comes to compare results between samples.

Fortunately, the GeneSifter Lab Edition solves these challenges. The Lab Edition of Geospiza’s software platform provides a comprehensive laboratory information management system (LIMS) for NGS and other kinds of genetic analysis assays, experiments, and projects. Using web-based interfaces, laboratories can define protocols (laboratory workflows) with any number of steps. Steps may be ordered and required to ensure that procedures are correctly followed. Within each step, the laboratory can define and collect different kinds of custom data (NanoDrop values, Bioanalyzer traces, gel images, ...). Laboratories using the GeneSifter Lab Edition can produce more reliable information because they can track the details of their library preparation and link key laboratory data to sequencing results.

Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or genes from the environment is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmented gene duplications, polyploidy and interchromosomal variation. The high redundancy samples obtained from Next Gen technology lower the cost and speed this process because less time is required for getting additional data to fill in gaps and finish the work.

The other ultra high throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting and aligning sequence reads to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw conclusions.

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was invented to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries has been informative, lower cost methods such as microarray hybridization and real-time PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of reads that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads could measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented widespread adoption.
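
To make the counting idea concrete, here is a minimal sketch in Python. The alignment records, gene names, and function names are all made up for illustration; a real pipeline would read these values from an aligner's output rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical alignment records: (read_id, reference_name, mismatch_count).
alignments = [
    ("read1", "geneA", 0),
    ("read2", "geneA", 1),
    ("read3", "geneB", 0),
    ("read4", "geneA", 0),
    ("read5", "geneC", 2),
]


def count_reads_per_reference(records):
    """Tally how many reads aligned to each reference sequence."""
    return Counter(ref for _read_id, ref, _mismatches in records)


def fraction_with_mismatches(records, reference):
    """Fraction of reads on one reference that carry at least one mismatch."""
    hits = [m for _read_id, ref, m in records if ref == reference]
    if not hits:
        return 0.0
    return sum(1 for m in hits if m > 0) / len(hits)


counts = count_reads_per_reference(alignments)
for reference, n_reads in counts.most_common():
    print(reference, n_reads, fraction_with_mismatches(alignments, reference))
```

The same table of counts serves as an expression measure for an RNA library or a rough copy number measure for a DNA library, while the mismatch fraction hints at where alleles or gene family members might be hiding.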

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR-products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents a sequence from an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs, and makes previously cost prohibitive experiments possible. It also changes how we need to think about and perform our experiments.

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-seq, small RNA, ChIP-seq, DNAse hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large numbers of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or mailroom if you believe the ads.

Attention to details counts

Even though Next Gen sequencing gives us the technical capabilities to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data, ranging from gel images to discrete data values, may be collected and used later for troubleshooting. Tracking the experimental details requires that a system be in place that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points to tables, graphics and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.

Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase, hence we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at the BOSC (Bioinformatics Open Source Conference) meeting that preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, July 19th. You can get the slides from the BOSC site. At the same time, we posted a blog on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever increasing amounts of data. Getting meaningful information from these data requires that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computation power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort in understanding idiosyncratic program behavior and output formats.

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about how these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (hierarchical data format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow. If you are interested, contact me: todd at geospiza.com.

Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:
  • One Illumina Solexa Genome Analyzer
  • One Applied Biosystems SOLiD System
  • One Illumina Bead Array station
  • Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis

This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and the 3730 DNA sequencing instrument usually involve sets of samples in 96 or 384 well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes and which use plates and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to the samples in order to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms and have the workflows for different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes and reagent kits. In some cases you'll want to know physical information, like DNA or RNA concentration, upfront. In other cases you'll determine information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, or mouse genetics, or plant research. In these settings you might want to also track information about sample source. Such information could include species, strain, tissue, treatment or many other kinds of things. If you want to explore this information later you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary can be followed, interfaces will be needed to enter this information without typing or else you'll have a problem like pseudomonas, psuedomonas, or psudomonas.
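
As a rough illustration of why a defined vocabulary helps, here is a small Python sketch. The vocabulary contents and the `validate_term` function are hypothetical and are not part of any Geospiza product; the point is only that free-typed values get checked against an agreed list before they are stored.

```python
# Hypothetical controlled vocabulary for a "genus" field.
GENUS_VOCABULARY = {"Pseudomonas", "Escherichia", "Populus"}


def validate_term(value, vocabulary):
    """Return the canonical spelling of value, or raise ValueError if it is not allowed."""
    canonical = {term.lower(): term for term in vocabulary}
    key = value.strip().lower()
    if key not in canonical:
        raise ValueError(f"{value!r} is not an allowed term")
    return canonical[key]


for typed in ["pseudomonas", "psuedomonas", "psudomonas"]:
    try:
        print(typed, "->", validate_term(typed, GENUS_VOCABULARY))
    except ValueError as error:
        print(typed, "-> rejected:", error)
```

A pick list in the entry interface achieves the same thing without any typing at all, which is the approach the paragraph above argues for.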

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, say Sanger sequencing on a 3730, the sample information and physical configurations needed differ from those needed for Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules on slides in clusters. Additionally, the information entry system has to deal with "I care" and "I don't care" kinds of data like information about sample sources, or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.
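
To sketch what "a system for creating forms" could look like as data, here is a short Python example. It is an illustration only: the class names, field kinds, and conflict rules are assumptions made for this post and do not describe how FinchLab is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class FormField:
    name: str
    kind: str = "text"    # assumed kinds: "text", "number", "choice"
    required: bool = False
    choices: tuple = ()   # only meaningful when kind == "choice"


@dataclass
class SubmissionForm:
    title: str
    container: str        # e.g. "tube" or "plate"
    fields: list = field(default_factory=list)

    def add_field(self, new_field):
        """Add a field, refusing definitions that conflict with the form."""
        if any(existing.name == new_field.name for existing in self.fields):
            raise ValueError(f"duplicate field: {new_field.name}")
        if new_field.kind == "choice" and not new_field.choices:
            raise ValueError(f"choice field {new_field.name} needs choices")
        self.fields.append(new_field)
        return self

    def missing_required(self, entry):
        """List the required fields that a submitted entry leaves empty."""
        return [f.name for f in self.fields
                if f.required and entry.get(f.name) in (None, "")]


# A hypothetical form created by the person who runs the SOLiD service.
solid_form = SubmissionForm(title="SOLiD tag profiling", container="tube")
solid_form.add_field(FormField("concentration", kind="number", required=True))
solid_form.add_field(FormField("tissue", kind="choice", choices=("leaf", "root")))

print(solid_form.missing_required({"tissue": "leaf"}))
```

Because each form is just data, a new service means a new form definition rather than a new round of programming, and the checks in `add_field` stand in for the "prevent you from doing things that are in direct conflict" behavior.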

FinchLab, built on the Finch 3 platform, is such a system.

Friday, April 25, 2008

Managing Digital Gene Expression Workflows with FinchLab

Last Wed (4/23) Illumina hosted a Geospiza presentation featuring how FinchLab supports mRNA tag profiling experiments. We had a great turnout and the presentation is posted on the Illumina web site.

In the webinar we talked about:
  • Next Gen sequencing applications
  • How the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive by looking at some features of mRNA Tag Profiling data sets with FinchLab
  • Setting up and tracking laboratory workflows with FinchLab
  • Why it is important to link the laboratory work and data analysis work
  • Setting up data analysis and reviewing results with FinchLab
  • Using hosted solutions to overcome the significant data management challenges that accompany Next Gen technologies
Over the coming weeks and months we'll explore the above points through multiple posts. In the meantime, get the presentation and enjoy.

From Sample to Results: Managing Illumina Data Workflow with FinchLab

Monday, April 21, 2008

Sneak Peek: Managing Next Gen Digital Gene Expression Workflows

This Wednesday, April 23rd, Illumina will host a webinar featuring the Geospiza FinchLab.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing how the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we will talk about the general applications of Next Gen sequencing and focus on using the Illumina Genome Analyzer to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling. Throughout the talk we will give specific examples about collecting and analyzing tag profiling data and show how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these is the value of linking laboratory and data workflows to be able to view information in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors, "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch, however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon specific primers to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced at least twice, to give us 10,686 reads. From a physical materials point of view that means 137 tubes with sample, 56 96-well plates for PCR, and 112 96-well plates for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center, this level of work is trivial, but for a small lab this is significant work and things can happen. Indeed, as not all the work is done in a single lab, the process can be more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.
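
For anyone who wants to check the math, the numbers above can be reproduced with a few lines of Python. The variable and function names are just for illustration, and the plate counts assume every well is filled, which a real plate layout may not manage.

```python
import math

samples = 137
amplicons_per_sample = 39
reads_per_amplicon = 2  # forward and reverse

pcr_reactions = samples * amplicons_per_sample
sequencing_reads = pcr_reactions * reads_per_amplicon


def plates_needed(reactions, wells_per_plate):
    """Smallest number of plates that holds all reactions, with no empty wells by design."""
    return math.ceil(reactions / wells_per_plate)


print("PCR reactions:", pcr_reactions)
print("Sequencing reads:", sequencing_reads)
for wells in (96, 384):
    print(wells, "well plates:",
          plates_needed(pcr_reactions, wells), "for PCR,",
          plates_needed(sequencing_reads, wells), "for sequencing")

# Only two complete 39-amplicon sets fit on a 96-well plate; the rest is left over.
full_sets, leftover_wells = divmod(96, amplicons_per_sample)
print("Sets per 96-well plate:", full_sets, "leftover wells:", leftover_wells)
```

The last two lines show the layout problem: two full sets of 39 amplicons fit on a 96-well plate and 18 wells are left over, so either wells go empty or a sample gets split across plates.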

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, show lab failures and biological issues. See if you can figure them out.
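
Here is a minimal Python sketch of the Q20/rL calculation and the color binning described above. It assumes Q20 means the number of bases with a quality value of 20 or more, and it treats anything below 0.30 as red so that every ratio lands in a bin; the function names and example quality values are made up.

```python
def q20_ratio(quality_values):
    """Q20/rL: the fraction of bases in a read with a quality value of 20 or higher."""
    if not quality_values:
        return 0.0
    q20_bases = sum(1 for q in quality_values if q >= 20)
    return q20_bases / len(quality_values)


def quality_color(ratio):
    """Map a Q20/rL ratio to the summary table colors (red cutoff of 0.30 is an assumption)."""
    if ratio >= 0.60:
        return "green"
    if ratio >= 0.30:
        return "blue"
    return "red"


# Hypothetical per-base quality values for three short reads.
reads = {
    "good_read": [35, 40, 38, 22, 10],
    "so_so_read": [25, 12, 8, 30, 5],
    "failed_read": [5, 6, 4, 21, 3],
}

for name, qualities in reads.items():
    ratio = q20_ratio(qualities)
    print(name, round(ratio, 2), quality_color(ratio))
```

Applying the same two functions to every cell of the sample by amplicon table gives the kind of color summary in which failed plates, failed amplicons, and sample specific problems show up as rows, columns, or blocks of red.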

Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows in describing the issues you need to think about as you prepare your lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by de novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a 22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.
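
To picture the hierarchical relationships described above, here is a small Python sketch. The class names, containers, and the `progress` function are assumptions made for illustration and do not describe FinchLab's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    kind: str           # e.g. "sequencing" or "fragment analysis"
    container: str      # e.g. "384-well plate"
    done: bool = False


@dataclass
class Template:
    name: str
    container: str      # e.g. "96-well plate"
    analyses: list = field(default_factory=list)


@dataclass
class Sample:
    name: str
    container: str      # e.g. "tube"
    templates: list = field(default_factory=list)


def progress(sample):
    """Return (completed, total) analyses across all templates of a sample."""
    all_analyses = [a for t in sample.templates for a in t.analyses]
    return sum(1 for a in all_analyses if a.done), len(all_analyses)


# One hypothetical sample with two templates and three analyses.
blood = Sample("patient-001 blood", "tube", templates=[
    Template("exon1 amplicon", "96-well plate", analyses=[
        Analysis("sequencing", "384-well plate", done=True),
        Analysis("sequencing", "384-well plate"),
    ]),
    Template("exon2 amplicon", "96-well plate", analyses=[
        Analysis("fragment analysis", "96-well plate"),
    ]),
])

print(progress(blood))
```

Walking the tree from sample to template to analysis is what lets a tracking system answer "where is my sample?" at any point in the process.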
