Sunday, November 1, 2009

GeneSifter Laboratory Edition Update

GeneSifter Laboratory Edition has been updated to version 3.13. This release has many new features and improvements that further enhance its ability to support all forms of DNA sequencing and microarray sample processing and data collection.

Geospiza Products

Geospiza's two primary products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), form a complete software system that supports many kinds of genomics and genetic analysis applications. GSLE is the LIMS (Laboratory Information Management System) that is used by core labs and service companies worldwide that offer DNA sequencing (Sanger and Next Generation), microarray analysis, fragment analysis and other forms of genotyping. GSAE is the analysis system researchers use to analyze their data and make discoveries. Both products are actively updated to keep current with the latest scientific and technological advances.

The new release of GSLE helps labs share workflows, perform barcode-based searching, view new data reports, simplify invoicing, and automate data entry through a new API (application programming interface).

Sharing Workflows

GSLE laboratory workflows make it possible for labs to define and track their protocols and data that are collected when samples are processed. Each step in a protocol can be configured to collect any kind of data, like OD values, bead counts, gel images and comments, that are used to record sample quality. In earlier versions, protocols could be downloaded as PDF files that list the steps and their data. With 3.13, a complete workflow (steps, rules, custom data) can be downloaded as an XML file that can be uploaded into another GSLE system to recreate the entire protocol with just a few clicks. This feature simplifies protocol sharing and makes it possible for labs to test procedures in one system and add them to another when they are ready for production.

Barcode Searching and Sample Organization

Sometimes a lab needs to organize separate tubes in 96-well racks for sample preparation. Assigning each tube's rack location can be an arduous process. However, if the tubes are labeled with barcode identifiers, a bed scanner can be used to make the assignments. GSLE 3.13 provides an interface to upload bed scanner data and assign tube locations in a single step. Also, new search capabilities have been added to find orders in the system using sample or primer identifiers. For example, orders can be retrieved by scanning a barcode from a tube in the search interface.
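To make the bed scanner idea concrete, here is a minimal Python sketch of the matching step. It is not GSLE code: the two-column "well,barcode" scan format, the function name `assign_rack_positions`, and the barcode values are all assumptions made up for illustration.

```python
import csv
import io


def assign_rack_positions(scan_text, known_barcodes):
    """Map scanned tube barcodes to rack wells.

    scan_text is assumed to be CSV text with one "well,barcode" pair per line
    (a hypothetical scanner export format). Returns the assignments for known
    tubes and a list of scanned barcodes the system does not recognize.
    """
    assignments = {}
    unknown = []
    for row in csv.reader(io.StringIO(scan_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        well, barcode = row[0].strip(), row[1].strip()
        if not barcode:
            continue  # empty rack position
        if barcode in known_barcodes:
            assignments[barcode] = well
        else:
            unknown.append((well, barcode))
    return assignments, unknown


# Hypothetical scan of four rack positions, one of them empty.
scan = "A01,TUBE0001\nA02,\nA03,TUBE0042\nA04,TUBE9999\n"
assigned, unrecognized = assign_rack_positions(scan, {"TUBE0001", "TUBE0042"})
```

Reporting unrecognized barcodes separately, rather than dropping them, gives the lab a chance to catch a mislabeled or misplaced tube before sample preparation starts.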


Reports and Data

Throughout GSLE, many details about data can be reviewed using predefined reports. In some cases, pages can be quite long, but only a portion of the report is interesting. GSLE now lets you collapse sections of report pages to focus on specific details. New download features have also been added to better support access to those very large NGS data files.

GSLE has always been good at identifying duplicate data in the system, but not always as good at letting you decide how duplicate data are managed. Managing duplicate data is now more flexible to better support situations where data need to be reanalyzed and reloaded.

The GSLE data model makes it possible to query the database using SQL. In 3.13, the view tables interface has been expanded so that the data stored in each table can be reviewed with a single click.

Invoices

Core labs that send invoices will benefit from changes that make it possible to download many PDF-formatted orders and invoices into a single zipped folder. Configurable automation capabilities have also been added to set invoice due dates and generate multiple invoices from a set of completed orders.

API Tools

As automation and system integration needs increase, external programs are used to enter data from other systems. GSLE 3.13 supports automated data entry through a novel self-documenting API. The API takes advantage of GSLE's built-in data validation features that are used by the system's web-based forms. At each site, the API can be turned on and off by on-site administrators and its access can be limited to specific users. This way, all system transactions are easily tracked using existing GSLE logging capabilities. In addition to data validation and access control, the API documents itself: each API-enabled form has a header that includes key codes, example documentation, and features to view and manually upload formatted data to test automation programs and help system integrators get their work done. GSLE 3.13 further supports enterprise environments with an improved API that is used to query external password authentication servers.

Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario, in our poster, where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


As this poster overlaps the previous poster in terms of providing information about RNA assays and analyzing the data, our main points below will focus on how GeneSifter Lab Edition solves challenges related to laboratory and business processes associated with setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services for which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected, reviewed, and the results are delivered back to clients. At the end of the day (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. When compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data from both a quality control and biological meaning context. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull down menus, and text entry fields. Fields can be required or optional, and special rules such as ranges for values can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data, like gel images, about samples. To handle that special “for lab use only” information, fields in forms can be specified as laboratory use only. Such fields are hidden from the customer's view, and when the orders are processed they are filled in later by lab personnel. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated along with the problem of finding information scattered on multiple computers.

Web forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill downs, and auto fill downs with number increments, so one can easily “copy” common items into all cells of a column, or increment an ending number across all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly in flexible ways.

Second, sometimes the data to be entered are already in an Excel spreadsheet. To solve this problem, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template, and the rules previously assigned to fields when the form was created are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time when they are easy to fix, rather than later when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
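For readers who like to see the mechanics, the two data entry aids described above can be sketched in a few lines of Python. This is only an illustration of the ideas, not GeneSifter's implementation: the function names, the rule keys (`required`, `min`, `max`), and the field names are assumptions invented for the example.

```python
import re


def fill_down_increment(first_value, count):
    """Fill a column with `count` values, incrementing a trailing number."""
    match = re.search(r"(\d+)$", first_value)
    if match is None:
        # No trailing number: behave like a plain fill down (copy the value).
        return [first_value] * count
    prefix = first_value[:match.start()]
    digits = match.group(1)
    start = int(digits)
    width = len(digits)  # preserve zero padding, e.g. "007" -> "008"
    return [f"{prefix}{start + i:0{width}d}" for i in range(count)]


def validate_rows(rows, rules):
    """Check uploaded rows against per-field rules and list every problem."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for field, rule in rules.items():
            value = row.get(field, "")
            if value is None or value == "":
                if rule.get("required"):
                    problems.append((row_number, field, "missing required value"))
                continue
            if "min" in rule or "max" in rule:
                try:
                    number = float(value)
                except (TypeError, ValueError):
                    problems.append((row_number, field, "not a number"))
                    continue
                if "min" in rule and number < rule["min"]:
                    problems.append((row_number, field, "below minimum"))
                if "max" in rule and number > rule["max"]:
                    problems.append((row_number, field, "above maximum"))
    return problems


# Hypothetical form rules and two uploaded spreadsheet rows.
rules = {"sample_name": {"required": True}, "od_260_280": {"min": 1.6, "max": 2.2}}
rows = [
    {"sample_name": "A1", "od_260_280": "1.9"},
    {"sample_name": "", "od_260_280": "3.4"},
]
names = fill_down_increment("RNA-007", 3)
problems = validate_rows(rows, rules)
```

Collecting every problem in one pass, instead of stopping at the first, matches the goal in the text: the client sees everything that needs fixing at upload time.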

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. If a lab runs different kinds of assays they will have to manage multiple procedures that may have different requirements for ordering of steps and laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.
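The parent and sub-sample relationship is easy to picture as a small tree. The sketch below is a generic Python illustration of that bookkeeping, not GeneSifter's data model; the `Sample` class, its method names, and the sample and workflow names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent/children
class Sample:
    name: str
    workflow: str
    parent: Optional["Sample"] = None
    children: List["Sample"] = field(default_factory=list)

    def aliquot(self, name, workflow):
        """Create a sub-sample that enters a different workflow."""
        child = Sample(name=name, workflow=workflow, parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """Return sample names from the original parent down to this sample."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# One RNA preparation, assayed on two platforms.
rna = Sample("RNA-1", "RNA preparation")
ngs = rna.aliquot("RNA-1.a", "RNA-Seq library preparation")
array = rna.aliquot("RNA-1.b", "microarray labeling")
```

Because each aliquot keeps a link to its parent, results from the two platforms can always be traced back to the same RNA preparation.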

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be added by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months and years with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container is used to describe 96 or 384-well plates, glass slides with divided areas called lanes, regions, chambers, or microarray chips. All of these formats are supported and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools at some other time; rest assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this was covered in the previous post, so it will not be repeated here.

I will summarize by making the final points that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways to get scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software as a Service, solutions allow groups to start inexpensively and manage costs as the needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.

Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid bodies (EB) were investigated. DNA libraries were produced from each cell line. Sequences were collected from each library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed. The question is whether the additional peaks are the result of isoforms (alternative polyA sites) or incomplete restriction enzyme digests. How might this be sorted out? As with RNA-Seq, the bottom panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments in different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
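The read length check is simple enough to sketch. The Python below is a generic illustration rather than the pipeline used for the poster; the function names and the window of 2 nt around the 22 nt center are assumptions chosen for the example, and the reads are made up.

```python
from collections import Counter


def read_length_histogram(reads):
    """Count how many reads there are of each length."""
    return Counter(len(read) for read in reads)


def fraction_near(histogram, center=22, window=2):
    """Fraction of reads whose length is within `window` nt of `center`."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    near = sum(
        count
        for length, count in histogram.items()
        if abs(length - center) <= window
    )
    return near / total


# Made-up reads: mostly 21-22 nt, plus a few longer fragments.
reads = ["A" * 22] * 6 + ["C" * 21] * 2 + ["G" * 35] * 2
histogram = read_length_histogram(reads)
share = fraction_near(histogram)
```

A library where most reads fall in the window around 22 nt would look like the good case described above; a broad or shifted distribution would suggest degraded RNA or other contaminating sequences.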

Section 6 presents some of the challenges of scale issues that accompany NGS, and how we are addressing these issues with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC Genome Browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf

Monday, February 23, 2009

Three Themes from AGBT and ABRF Part III: The IT Problem

The power of Next Generation DNA Sequencing (NGS) technology comes from the fact that a massive amount of data, sampling millions of individual molecules, is collected in a massively parallel format. This power also limits the potential widespread adoption of the technology because of the IT (Information Technology) challenges that result from the massive amount of data created with each sequencer run.

IT challenges form the third technical theme from the AGBT and ABRF conferences. The previous two posts underscored the need for good laboratory practices and rich bioinformatics support to make NGS experiments successful. This post discusses the experiences communicated by the early adopters of NGS technology with respect to the computing infrastructure.

Surprises

Throughout the literature and NGS presentations, the data management issues created by NGS play a central role. Recent editorials in Nature Methods [1] and Nature Biotechnology [2] speak to the problem and express researchers' frustrations in dealing with the lack of IT infrastructures. At the ABRF workshop, we had two presentations specifically focused on the IT challenges, describing two different experiences.

In the first case, the group implementing NGS had a number of surprises after the NGS system was installed and running. They learned that these systems not only require a lot of storage and computing support, they also use up a lot of bandwidth when data are transferred. The bandwidth problem led to the need for a revised network architecture to isolate the NGS data flow from other network activity.

This talk brought similar surprises to mind. In other labs, NGS “surprises” have led to groups needing to upgrade server rooms by installing backup power, air conditioning, and other equipment. Of course these surprises are manageable if you have an IT group and a server room in the first place. In some cases, groups start with even less and find that the IT costs make the NGS endeavor very expensive. Even with support and space, the IT costs for bringing in NGS can quickly grow into six figures (above $100,000) for infrastructure alone.

The second presentation was given by a group who was well prepared for NGS. Their university had made a previous commitment to building an IT infrastructure to support data intensive genomics research, so adding NGS was a step up in their view. Their experience allowed them to develop a strong implementation plan that called for a number of systems upgrades that included upgrading network hardware. While total costs were less than the six figure surprises others experienced, they did spend many tens of thousands of dollars on new file servers, CPUs, network switches, and server room upgrades.

The conclusion from both of the presentations was that if you are going to set up an NGS infrastructure three things are important: planning, planning, planning. Also, institutional support is critically important since renovations and new building may need to ramp up too. Personnel with network, systems administration, and Unix experience are also essential. Finally, as the second speaker put it, you need to encourage researchers to invest in the infrastructure. If they are not involved in the process and contributing time and money, the endeavor can quickly fail.

These talks bring me to my favorite marketing slogan, where one of Illumina’s customers put an NGS instrument in their mail room. Whenever I hear that, or see the ad, it makes me think, “yes, you can turn a mail room into a genome center, but where will you put the data center?”

There is a solution


For those thinking about NGS technology, or running an NGS experiment where the samples are submitted to a lab and the data returned, even contemplating the IT requirements can be discouraging. But it does not have to be this way. Over the past ten years, an immense infrastructure of data centers has emerged. Today, there are many options and price points available for storage, computing, and backup systems. Groups can save significant time and money using on-line services because costs scale with need. Moreover, on-line services eliminate the need for dedicated systems and data administrators, putting more money in the budget for experiments. You have a choice. Jump in and do some interesting science or work hard to have your campus facilities remodeled.

Geospiza is taking advantage of the Internet’s infrastructure to offer our clients cost effective ways to get NGS running in their lab. GeneSifter Laboratory Edition can be delivered through a SaaS (Software as a Service) model to get labs up and running quickly. Just sign up, get access, and you are ready to go. GeneSifter Analysis Edition solves the IT problem for research groups who get their sequencing done through core labs or other service providers. In these cases, you upload your data and, with a few clicks, process it and analyze the results. Because the infrastructure is built, overall costs for IT and bioinformatics are much lower, and you do not have to experience a remodeling project.

References
1. 2008. Byte-ing off more than you can chew. Nat Methods 5, 577.
2. 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.

Thursday, February 19, 2009

Three Themes from AGBT and ABRF Part II: The Bioinformatics Bottleneck

In my last post, I summarized the presentations and conversations regarding Next Gen Sequencing (NGS) challenges in terms of three themes: You have to pay attention to details in the laboratory, bioinformatics is a bottleneck, and the IT burden is significant. In that post, I discussed the issues related to the laboratory and how GeneSifter Lab Edition overcomes those challenges.

This post tackles the second theme: the bioinformatics bottleneck.

In the Sanger days, bioinformatics was really a challenge for only the highest throughput facilities like genome centers. In these labs, streamlined workflows (pipelines) were developed for the different kinds of sequencing (genomes, ESTs [expressed sequence tags, 1], SAGE [serial analysis of gene expression, 2]). Because Sanger sequencing was high cost and low throughput compared to NGS, the cost of developing the bioinformatics pipelines was low relative to the cost of collecting the data. Thus, large-scale projects such as whole genome shotgun sequencing, ESTs, or resequencing studies could be supported by a handful of pipelines that changed infrequently. In addition, small-scale Sanger projects could be handled well by desktop software and services like NCBI BLAST.

NGS breaks the Sanger paradigm. A single NGS instrument has the same throughput as an entire warehouse of Sanger instruments. To illustrate this point in a striking way, we can look at dbEST - NCBI’s database of ESTs. Today, there are approximately 59 million reads in this database, representing the total accumulation of sequencing projects over a 10 year period. Considering that one run of an Illumina GA or SOLiD can produce between 80 and 180 million reads in a week or two, we can now, in a single week, produce up to three times more ESTs than we have seen deposited over the past 10 years. These numbers also dwarf the amount of data collected from other gene expression analysis systems like microarrays and sequencing techniques like SAGE.
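The "up to three times" figure follows directly from the numbers quoted above. A few lines of Python make the arithmetic explicit; the variable names are mine, and the read counts are the approximate values given in the text.

```python
# Approximate figures quoted in the text above.
dbest_reads = 59_000_000       # reads accumulated in dbEST over about 10 years
run_low = 80_000_000           # low end of a single Illumina GA or SOLiD run
run_high = 180_000_000         # high end of a single run

# How many "dbESTs" worth of reads does one run produce?
ratio_low = run_low / dbest_reads
ratio_high = run_high / dbest_reads

print(f"One run yields {ratio_low:.1f}x to {ratio_high:.1f}x the reads in dbEST")
```

Even the low end of a single run exceeds a decade of accumulated EST reads, and the high end is roughly three times that total.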


The emergence of the bioinformatics bottleneck

The bioinformatics bottleneck is related to the fact that NGS platforms are general purpose; they collect sequence data. That’s it. Because they collect a lot of data very quickly, we can use sequences as data points for many different kinds of measurements. When we think this way, an extremely wide range of experiments can be conceived.

From sequencing complete genomes, to sampling genes in the environment, to measuring mutations in cancer, to understanding epigenomics, to measuring gene expression and the transcriptome, NGS applications are proliferating at a rapid pace. However, each experiment requires a specialized bioinformatics pipeline, and the algorithms used within a bioinformatics pipeline must be tuned for the data produced from the different sequencing platforms and questions being asked. When these considerations are combined with other issues like what reference data to use for sequence comparisons, the number of bioinformatics pipelines can grow in a combinatorial fashion.
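To get a feel for the combinatorial growth, consider a rough Python sketch that simply enumerates every combination of application, platform, and reference data. The three lists are illustrative choices of mine drawn loosely from examples in these posts, and in practice not every combination is meaningful, so treat the count as an upper bound on this toy example rather than a real pipeline inventory.

```python
from itertools import product

# Illustrative lists only; a real lab's options would differ.
applications = [
    "genome sequencing",
    "environmental sampling",
    "cancer mutation detection",
    "epigenomics",
    "transcriptome profiling",
]
platforms = ["Illumina GA", "SOLiD"]
references = ["genome assembly", "RefSeq transcripts", "miRBase"]

# Each (application, platform, reference) triple is a potential pipeline variant.
pipelines = list(product(applications, platforms, references))

print(len(pipelines), "potential pipeline variants")
```

Adding one more platform or one more reference source multiplies the total rather than adding to it, which is why a handful of fixed pipelines no longer covers the work.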

The early recommendation is that each lab wanting to do NGS work needs to have a dedicated bioinformatics professional. In more than one talk, presenters even quantified bioinformatics support in terms of FTEs (full time equivalents) per instrument. Bioinformatics is needed in both the sequencing laboratory, to develop and maintain quality control pipelines, and in the research environment, to process (align) the data, mine the output for interesting features, and perform comparative analyses between datasets.

But this won’t work

It is clear that bioinformatics is critical to understanding the data being produced. However, the current recommendation that any group planning NGS experiments should also have a dedicated bioinformatician is impractical for several reasons.

First, the model of a bioinformatician for every lab is simply not scalable. Fundamentally, there are not enough people who understand the science, programming, statistics, and other resources such as different forms of reference data, algorithms, and data types needed to make sense of NGS data. We see plenty of evidence, in the literature and presentations, that there are many outstanding people doing this work and contributing to the community; the problem is that they already have jobs!

Even if we consider that the above model is workable, hiring people takes significant time, is expensive, and ongoing costs are going to be high. These time and cost investments only become reasonable when a significant number of experiments are planned. One or two instruments will produce between 25 and 50 runs' worth of data per year. If you calculate instrument costs, reagents, salary, and overhead costs, you are quickly into many thousands of dollars per sample. Indeed, a theme expressed in the bioinformatics bottleneck is that bioinformatics is becoming the single largest ongoing cost of NGS. Add in the IT computer support (next post) and you better have a plan for running a lot more than 50 runs per year. Remember the first issue - good bioinformaticians with NGS analysis experience have jobs.

If you have access to bioinformatics support, or can hire an individual, that person will quickly become overwhelmed with work. The biggest reason is that the software infrastructures needed to quickly develop new pipelines, automate them, and deliver data in ways that can be consumed by non-programming scientists are typically lacking. The result is that scientific programming efforts generally turn into lengthy software development projects because, without an infrastructure, the numbers and kinds of experiments quickly grow beyond the capacity of a single individual.

So, what can be done?

Geospiza solves the bioinformatics challenge in multiple ways. GeneSifter Lab and Analysis editions provide a platform that delivers the complete infrastructures needed to deploy NGS data processing pipelines and deliver results through web-based interfaces. These systems include pipelines for many of the common NGS applications such as transcription analysis, small RNA detection, ChIP-Seq and other assays. The system architecture and accompanying API create a framework to quickly add new pipelines and make the results available to biologists running the experiments.

For those with access to bioinformatics help, GeneSifter will make your team more productive because developers will be freed of the burden of having to create the automation and delivery infrastructure, enabling them to focus on new scientific programming problems. For those without access to such resources, we have many pipelines ready to go. Moreover, because we have a platform and the infrastructure already built, as well as deep bioinformatics experience, we can create and deliver new analysis pipelines quickly. Finally, our product development roadmap is well-aligned with the most common NGS assays, which means you can probably do your bioinformatics analysis today!

References: 

1. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C., 1993. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4, 373-380.

2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W., 1995. Serial analysis of gene expression. Science 270, 484-487.


Monday, February 2, 2009

Next Gen Laboratory Software Systems for Core Facilities

Do you have a core lab? Considering adding Next Generation DNA sequencing capacity to your lab? Then you will be interested in visiting our booth and checking out our poster at the annual Association for Biomolecular Research Facilities (ABRF) meeting next week in Memphis, TN. We'll be at booth 408, and presenting poster number V27-S1.

Poster Abstract

Throughout the past year, as next generation sequencing (NGS) technologies have emerged in the marketplace, their promise of what can be done with massive amounts of sequence data has been tempered with the reality that performing experiments and working with the data is extremely challenging. As core labs contemplate acquiring NGS technologies, they must consider how the new technologies will affect their current and future operations. The old model of collecting and delivering data is likely to change to one where the core lab becomes an active participant in advising and helping clients set up experiments and analyze the data. However, while many labs want to utilize NGS, few have the Information Technology (IT) infrastructures and procedures in place to successfully make use of these systems.

In the case of gene expression, NGS technologies are being evaluated as complementary or replacement technologies for microarrays. Assays like RNA-Seq and tag profiling, which focus on measuring relative gene expression, require researchers and core labs to puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with many steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present solutions to these challenges by showing results from a complete workflow system that includes data collection, processing, and analysis for RNA-Seq suited for the core laboratory.

In the poster we'll walk through the laboratory and data analysis issues one needs to think about to perform a two cell expression comparison with RNA-Seq. Below is a snippet from the poster. I'll post the full presentation when I return.


Wednesday, January 21, 2009

The Experts Agree

It depends what you are trying to do. That is the take home message in Genome Technology’s (GT) trouble-shooting guide on picking assembly and alignment algorithms for Next-Gen sequence data.

In the guide, the GT team asked nine Next-Gen sequencing and bioinformatics experts to answer six questions:
  1. How do you choose which alignment algorithm to use?
  2. How do you optimize your alignment algorithm for both high speed and low error rate?
  3. What approach do you use to handle mismatches or alignment gaps?
  4. How do you choose which assembly algorithm to use?
  5. Do you use mate-paired reads for de novo assembly? how?
  6. What impact does the quality of raw read data have on alignment or assembly? how do your algorithms enhance this?
Even a quick look at the questions shows us that many factors need to be considered in setting up a Next-Gen sequencing lab. Questions 1 and 4 point out that aligning sequences is different from assembling them. Other questions address issues related to the size of the data sets being compared, the quality of the data being analyzed, the kinds of information that can be obtained, and the computational approaches being used for different problems.

What the experts said

First, they all agree that different problems require different approaches and have different requirements. In the first question about which aligner to use, the most common response was “for what application and which instrument?” Fundamentally, SOLiD data are different from Illumina GA data, which are different from 454 data. While the end results may all be sequences of A's, G's, C's, and T's, the data are derived in different ways because of the platform-specific twists in collecting the data (recall “Color Space, Flow Space, Sequence Space, or Outer Space”). Not only are there platform-specific methods for interpreting raw data, multiple programs have been developed for each instrument with their own strengths and weaknesses in terms of speed, sensitivity, the kinds of data they use (color, base, or flow spaces, quality values, and paired end data), and the information that is finally produced. Hence, in addition to choosing a sequencing platform you also have to think about the sequencing application, or the kind of experiment, that will be performed. In gene expression studies, for example, an RNA-Seq experiment has different requirements in terms of aligning the data and interpreting the output than an experiment with Tag Profiling.

Overall the trouble-shooting guide discussed 17 total algorithms, eight for alignment, and nine for assembly (two of which were for Sanger methods). Even this selection wasn't a comprehensive list. When other sites [1, 2] and articles [3] are included and proprietary methods are factored in, over 20 algorithms are available. So what to do? Which is best?

That depends

Yes, the choice of algorithm ultimately depends on what you are trying to do. While we can agree that there is no best solution, we also know that is not a helpful response. What is needed is a way to test the suitability of different algorithms for different kinds of experiments and to represent data in standard ways so that the features of specific algorithms can be evaluated. Also, as this is a new field, standard requirements for how data should be aligned, what defines a correct alignment, and what kinds of information are the most informative in describing alignments are still emerging. Some of the early programs are helping to define these requirements.

One program we've used at Geospiza for identifying requirements is MAQ, a program for sequence alignment. As noted in previous blogs [MAQ attack], MAQ is a great general purpose tool. It provides comprehensive information about the data being aligned and details about alignments. MAQ works well for many applications including RNA-Seq, Tag Profiling, ChIP-Seq, and resequencing assays focused on SNP discovery. In performance tests, MAQ is slower than some of the newer programs, one of which is being developed by MAQ’s author, but MAQ is a good model for getting the right kinds of information, formatted in a decent way. Indeed, MAQ was the most cited program in the GT guide.

Let’s return to the bigger issue. That is, how can we easily compare between algorithms? For that we need a system where one can easily define a standardized dataset and reference sequence, and have a platform where a new algorithm can be added and run from a common interface. Standard reports that present features of the alignments could then be used to compare programs and parameters.
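
As a rough illustration of what "a common interface" could look like, here is a minimal sketch. It is not how GeneSifter is implemented; the registry, the two toy aligners, and the test data are all hypothetical stand-ins. Real programs like MAQ would be wrapped by control scripts rather than written as a few lines of Python.

```python
from typing import Callable, Dict, List, Optional

# An "aligner" here is any function that maps (read, reference) to a
# 0-based start position, or None when the read cannot be placed.
Aligner = Callable[[str, str], Optional[int]]

ALIGNERS: Dict[str, Aligner] = {}


def register(name: str) -> Callable[[Aligner], Aligner]:
    """Decorator that adds an aligner to the common registry."""
    def wrap(func: Aligner) -> Aligner:
        ALIGNERS[name] = func
        return func
    return wrap


@register("exact")
def exact_match(read: str, reference: str) -> Optional[int]:
    """Toy aligner: the read must match the reference exactly."""
    pos = reference.find(read)
    return pos if pos >= 0 else None


@register("one_mismatch")
def one_mismatch(read: str, reference: str) -> Optional[int]:
    """Toy aligner: first position with at most one mismatched base."""
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= 1:
            return start
    return None


def compare(reads: List[str], reference: str) -> Dict[str, Dict[str, int]]:
    """Run every registered aligner on the same data and build one report."""
    report = {}
    for name, aligner in ALIGNERS.items():
        hits = [aligner(read, reference) for read in reads]
        aligned = sum(hit is not None for hit in hits)
        report[name] = {"aligned": aligned, "unaligned": len(reads) - aligned}
    return report


# Standardized (and entirely made-up) test data shared by all aligners.
REFERENCE = "ACGTACGGTCA"
READS = ["ACGT", "GGTC", "ACGA", "TTTT"]
report = compare(READS, REFERENCE)
for name, row in report.items():
    print(name, row)
```

Because every aligner sees the same reads and reference and reports in the same shape, adding a new program means writing one more registered function, and the comparison table updates on its own.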

The laboratory edition of GeneSifter supports these kinds of comparisons. The distributed system architecture allows one to quickly develop control scripts to run programs and format their output in figures and tables that make comparisons possible. With this kind of system in place, the challenges move from which program to run and how to run it, to how to get the right kinds of information and best display the data. To address these issues, Geospiza’s research and development team is working on projects focused on using technologies like HDF5 to create scalable standardized data models for storing information from alignment and assembly programs. Ultimately this work will make it easy to optimize Next-Gen sequencing applications and assays and compare between assorted programs.

References
1. http://en.wikipedia.org/wiki/Sequence_alignment_software,
2. http://www.massgenomics.org/2009/01/short-read-aligners-update-at-agbt.html
3. Shendure J., Ji H., 2008. Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145.


Friday, December 12, 2008

Papers, Papers, and more Papers

Next Gen Sequencing is hot, hot, hot! You can tell by the number of papers being published and the frequency with which they appear.

A few posts ago, I wrote about a couple of grant proposals that we were preparing on methods to detect rare variants in cancer and improve the tools and methods to validate datasets from quantitative assays that utilize Next Gen data, like RNA-Seq, ChIP-Seq, or Other-Seq experiments. Besides the normal challenges of getting two proposals written and uploaded to the NIH, there was an additional challenge. Nearly every day, we opened the tables-of-contents in our e-mail and found new papers highlighting Next Gen Sequencing techniques, applications, or biological discoveries made through Next Gen techniques. To date, over 200 Next Gen publications have been produced. During the last two months alone more than 30 papers have been published. Some of these (listed in the figure below) were relevant to the proposals we were drafting.

The papers highlighted many of the themes we've touched on here, including the advantages of Next Gen sequencing and the challenges of dealing with the data. As we are learning, these technologies allow us to explore the genome and genomics of systems biology at significantly higher resolutions than previously imagined. In one of the higher profile efforts, teams at the Washington University School of Medicine and Genome Center compared a leukemia genome to a normal genome using cells from the same patient. This first intra-person whole genome analysis identified acquired mutations in ten genes, eight of which were new. Interestingly, the eight genes have unknown functions and might be important some day for new therapies.

Next Gen technologies are also confirming that molecular biology is more complicated than we thought. For example, the four most recent papers in Science show us that not only is 90% of the genome actively transcribed, but many genes have both sense and anti-sense RNA expressed. It is speculated that the anti-sense transcripts have a role in regulating gene expression. Also, we are seeing that nearly every gene produces alternatively spliced transcripts. The most recent papers indicate that between 92% and 97% of transcripts are alternatively spliced. My guess is that the only genes not alternatively spliced are those lacking introns, like olfactory receptors. However, when alternative transcription starts and alternative polyadenylation sites are considered, we may see that all genes are processed in multiple ways. It will be interesting to see how the products of alternative splicing and anti-sense transcription might interact.

This work has a number of take home messages.
  1. Like astronomy, when we can see deeper we see more. Next Gen technologies are giving us the means to interrogate large collections of individual RNA or DNA molecules and speculate more on functional consequences.
  2. Our limits are our imaginations. The reported experiments have used a variety of creative approaches to study genomic variation, sample expressed molecules from different strands of DNA, and measure protein DNA/RNA interaction.
  3. Good hands do good science. As pointed out in the paper from the Sanger Center on their implementation of Next Gen sequencing, the processes are complex and technically demanding. You need to have good laboratory practices with strong informatics support for all phases (laboratory, data management, and data analysis) of the Next Gen sequencing processes.
The final point is very important and Geospiza’s lab management and data analysis products will simplify your efforts in getting Next Gen systems running to make your major investment pay off and quickly publish results.

To see how, join us for a webinar next Wednesday, Dec. 17 at 10 am PDT, for RNA Expression Analysis with Geospiza.




Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and the hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state-of-the-art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.
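
The tabulate-then-compare idea can be sketched in a few lines. Everything here is an assumption made for illustration: the gene names and read lists are invented, each read is represented simply by the name of the gene it aligned to, and reads-per-million is just one simple way to put samples of different sizes on the same scale.

```python
from collections import Counter
from typing import Dict, Iterable


def count_reads(aligned_genes: Iterable[str]) -> Counter:
    """Tabulate how many reads aligned to each gene."""
    return Counter(aligned_genes)


def per_million(counts: Counter) -> Dict[str, float]:
    """Scale raw counts to reads per million aligned reads."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def fold_change(treated: Dict[str, float],
                control: Dict[str, float]) -> Dict[str, float]:
    """Ratio of treated to control for genes seen in both samples."""
    shared = treated.keys() & control.keys()
    return {gene: treated[gene] / control[gene] for gene in sorted(shared)}


# Invented alignments: each entry is the gene one read aligned to.
control_reads = ["geneA"] * 2 + ["geneB"] * 2
treated_reads = ["geneA"] * 6 + ["geneB"] * 2

control = per_million(count_reads(control_reads))
treated = per_million(count_reads(treated_reads))
changes = fold_change(treated, control)
print(changes)
```

Note that geneB has the same raw count in both samples, yet its scaled value drops because the treated sample has twice as many reads overall. That is the kind of effect a comparison has to account for before any biological conclusion is drawn.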

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).


Wednesday, October 22, 2008

Journal Club: Focus on Next Gen Sequencing

Yesterday I received my issue of Nature Biotechnology. This month it features Next-Generation (Next Gen) Sequencing. One editorial, one profile, three news features, a commentary, two perspectives, and two reviews discuss the origins, trials, tribulations and what’s coming next in Next Gen. For now, I'll focus on the editorial.

Bioinformatics is a big big issue

“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead editorial “Prepare for the deluge.”

Reminds me of something I said a few months back.

In the editorial, Nature Biotechnology (NBT) makes a number of important points starting with how the launch of the Roche/454 pyrosequencer in 2005 could generate as much data as more than 50 ABI capillary sequencers. Since that launch, we have seen new instruments emerge that are producing ever increasing amounts of data by orders of magnitude. Or as NBT put it “The overwhelming amounts of data being produced are the equivalent of taking a drink from a fire hose.”

It's like they read our web site (we ran the image below at the beginning of the year).


The volumes of data and new ways in which it must be worked with are creating many challenges. To begin, there is the conundrum of what to keep; do you keep raw images and processed reads? Or do you just keep the reads? If you keep raw images, the costs are significant. The cost of storing all that information must be considered in the context of the likelihood of whether you will ever need to go back to these data. We call this the data life cycle.

From raw images, the next challenge is the computational infrastructure needed to process reads and obtain meaningful information. This is a complex process that involves many steps and high performance computers. NBT made the accurate and important point that the instrument manufacturers only provide the software to analyze what comes off of the machine for common applications. A great deal of bioinformatics support is needed for downstream analysis once the initial data alignments or assemblies are completed. Also, standards for comparing data between instrument platforms are lacking. This makes it difficult to compare results from different instruments.

While more is needed in terms of bioinformatics support, being able to get tools for alignment and assembly is a good starting point and NBT lauded ABI’s SOLiD community program as a step in the right direction. This kind of approach is also needed by the other instrument vendors. Presently Illumina and Roche include their tools with an instrument purchase. This is fine for the laboratory, but it makes a hard problem harder for any researchers who might be getting data sets from different labs. This could lead to threads of frustration.

As the article continued, the "overwhelmed" scale increased to dire.

NBT stated:
“What all of this means is that for the foreseeable future, next-generation sequencing platforms may remain out of the hands of labs lacking the deep pockets needed for bioinformatics support.”
They also added,
“Thus, if the next-generation platforms are to truly democratize sequencing—bringing genomics out of the main sequencing centers and into the laboratories of single investigators or small academic consortia—much more effort needs to be expended in developing cost-effective software and data management solutions.”

NBT offered some solutions, including getting the instrument vendors to develop community based solutions, and encouraging the grant funding organizations to fund bioinformatics as much as they fund sequencing.

Is Next Gen for everyone?


The NBT editors made a lot of great points, but we do not see the world in as dire terms as they do. Yes, a great challenge in getting up and running with Next Gen equipment is preparing for the informatics challenges that await. Next Gen is not Sanger. You cannot look at every read to figure out what your data mean, and you will need a serious computational infrastructure to store, organize and work with the data. Also, not mentioned in the article, but incredibly important, you will need a laboratory information management system to organize your experimental information and track the many steps needed to prepare good DNA libraries for sequencing.

And, there are solutions.

Geospiza’s FinchLab combined with our Software as a Service (SaaS) delivery, provides immediate access to the necessary software and hardware infrastructure to run these new instruments.

FinchLab delivers the software infrastructure to support laboratory workflows for all the platforms, links the resulting data to samples, and - through a growing list of data analysis pipelines and visualization interfaces - provides the necessary bioinformatics for a wide range of sequencing applications. Further, our bioinformatics approach is community-based. We are working with the best tools as they emerge and are collaborating with multiple groups to advance additional research and development.

SaaS delivers the computing infrastructure on demand. With our SaaS model, the computer infrastructure is always available and grows with your needs. You do not have to set up a large computer system, or build a new building, or risk over or under investing to deal with the data.

With FinchLab, the vision of next-generation platforms truly democratizing sequencing can be realized.



Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads depending on the application. This scale up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And, the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
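
The sorting step is easy to picture in code. This is a minimal sketch, not any vendor's software. It assumes the barcode is a fixed-length prefix of each read, which is a simplification, and the barcodes, sample names, and reads are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str],
                barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort reads into per-sample lists using their leading barcode.

    Assumes every barcode has the same length and sits at the start
    of the read. Reads with an unknown barcode go to "unassigned".
    """
    lengths = {len(code) for code in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    size = lengths.pop()

    bins = defaultdict(list)
    for read in reads:
        code = read[:size]
        if code in barcodes:
            # Strip the barcode so only the sample sequence remains.
            bins[barcodes[code]].append(read[size:])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


# Invented barcodes and reads, for illustration only.
BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2"}
READS = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCGGG", "NNNNTTTTTT"]
result = demultiplex(READS, BARCODES)
print(result)
```

Keeping an "unassigned" bin matters in practice: the fraction of reads that land there is a quick check on whether the barcodes were read cleanly.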

Science:

What you can do with 100's of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed the long read advantage in that approximately 1.5% more of the yeast genome studied was covered by 454 data than with shorter read technology. And, the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.

When false positive rates of mutation detection were compared, SOLiD had zero for all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x, and zero false positives for 19x and 44x (full run of one slide) coverage, and 454 had 17, six, and one false positive for 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data, was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; but replicates were lacking. Would the results change if each library preparation and sequencing process was repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces different error profiles and the different software programs each make different assumptions about how to use the error profiles to align data and assess variation, the results should not be overinterpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can elucidate a cause-and-effect model and find where a transcription factor is binding and which genes it potentially controls.
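
To make the idea of combining the two measurements concrete, here is a minimal sketch. It assumes each assay has already been reduced to a per-gene table: a set of genes with a transcription factor bound nearby (from ChIP-Seq) and a fold change per gene (from RNA-Seq). The gene names, the two-fold threshold, and the function name are all hypothetical and are not taken from Dr. Gray's work.

```python
def classify_targets(bound_genes, fold_change, min_fold=2.0):
    """Split genes by whether binding coincides with induced expression.

    bound_genes: set of gene names with a ChIP-Seq peak (hypothetical input)
    fold_change: dict of gene name -> RNA-Seq fold change (hypothetical input)
    min_fold:    assumed cutoff for calling a gene "induced"
    """
    induced = {gene for gene, fc in fold_change.items() if fc >= min_fold}
    return {
        "bound_and_induced": sorted(bound_genes & induced),
        "bound_only": sorted(bound_genes - induced),
        "induced_only": sorted(induced - bound_genes),
    }


# Made-up example data, for illustration only
bound = {"geneA", "geneB", "geneC"}
changes = {"geneA": 5.1, "geneB": 1.0, "geneD": 3.2}
result = classify_targets(bound, changes)
print(result["bound_and_induced"])  # ['geneA']
```

Genes in the "bound_and_induced" group are only candidates for direct control. The overlap alone does not establish cause and effect.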

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage about trying to purify disappearase - "the more you purify the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatment and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.
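
One simple example of an artifact check, sketched here under stated assumptions: reads that align to exactly the same reference, start position, and strand are often treated as candidate PCR duplicates. The function below is a hypothetical illustration of that heuristic, not a description of any particular product's quality control.

```python
from collections import Counter


def duplicate_fraction(alignments):
    """Fraction of reads that repeat an already-seen alignment.

    alignments: iterable of (reference, start, strand) tuples, one per read.
    Assumption: identical tuples are candidate PCR duplicates.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    extra = sum(n - 1 for n in counts.values())  # copies beyond the first
    return extra / total if total else 0.0


# Made-up alignments: three reads pile up at one position
reads = [("chr1", 100, "+"), ("chr1", 100, "+"),
         ("chr1", 100, "+"), ("chr1", 250, "-")]
frac = duplicate_fraction(reads)
print(frac)  # 0.5
```

A high fraction is only a flag for closer inspection. Some library types are expected to produce many reads with identical start points.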

These are exactly the kinds of problems that Geospiza solves.


Thursday, September 18, 2008

Road Trip: 454 Users Conference

Quiz: What can sequence small genomes in a single run? What can more than double or triple the EST database for any organism?
Answer: The Roche (454) Genome Sequencer FLX™ System.

Last week I had the pleasure of attending the Roche 454 users conference, where the new release (Titanium) of the 454 sequencer was highlighted. This upgrade produces more, longer reads so that more than 600 million bases can be generated in each run. When compared to previous versions, the FLX Titanium produces about five times more data. The conference was well attended and outstanding, with informative presentations on science, technology, and practical experiences.

In the morning of the first full day, Bill Farmerie, from the University of Florida, presented on how he got into DNA sequencing as a service and how he sees Next Gen sequencing changing the core lab environment. Back in 1998 he set out to establish a genomics service and talked to many groups about what to do. They told him two important things:
  1. "Don't sweat the sequencing part - this is what we are trained for."
  2. "Worry about information management - this we are not trained for."
From here, he discussed how Next Gen got started in his lab and related his experiences over the past three years and made these points:
  • The first two messages are still true. Sequencing gets solved, the problem is informatics.
  • DNA sequencing is expanding, more data are being produced faster at lower costs.
  • This is democratizing genomics - many groups now have access to high throughput technology that provides "genome center" capabilities.
  • The next bioinformatics challenge is enabling the research community, the groups with the sequencing projects, to make use of their data and information. This is not like Sanger: core labs need to deliver results with data.
  • The way to approach new problems and increase scale is to relieve bioinformatics staff of the burden of doing routine things so they can focus on developing novel applications.
  • To accomplish the above point, buy what you can and build what you have to.
Other speakers made similar points. The informatics challenge begins in the lab, but quickly becomes a major problem for the end researcher.

Bill has been following his points successfully for many years now. We started working with him on his first genomics service and continue to support his lab with Next Gen. Our relationship with Bill and his group has been a great experience.

Other highlights from the meeting included:

A talk on continuous process improvements in DNA sequencing at the Broad Institute. Danielle Perrin presented work on how the Broad tackles process optimization issues during production to increase throughput, decrease errors, or save costs. From my perspective, this presentation really stressed the importance of coupling laboratory management with data analysis.

Multiple talks on microbial genomics. A strength of the 454 platform is how it generates long reads making this a platform of choice for sequencing smaller genomes and performing metagenomic surveys. We were also introduced to the RAST (Rapid Annotation using Subsystem Technology) server, an ideal tool for working with your completed genome or metagenome data set.

Many examples of how having millions of reads makes new gene expression and variation analysis discoveries possible when compared to other platforms like microarrays. In these talks speakers were occasionally asked which is better, long 454 reads or short reads from Illumina or SOLiD? The speakers typically said you need both, they complement each other.

The Woolly Mammoth. Steven Schuster from Penn State presented his and colleagues' work on sequencing mammoth DNA and its relatedness over thousands of years. Next Gen is giving us a new "omics," Museomics.

And, of course, our poster demonstrating how FinchLab provides an end-to-end workflow solution for 454 DNA sequencing. In the poster (you have to click the image to get the BIG picture), we highlighted some new features coming out at the end of the month. These include the ability to collect custom data during lab processing, coupling Excel to FinchLab forms, and work on 454 data analysis. Now you will be able to enter the bead counts, agarose images, or whatever else you need to track lab details to make those continuous process improvements. Excel coupling makes data entry through FinchLab forms even easier. The 454 data analysis complements our work with Sanger, SOLiD, and Illumina data to make the FinchLab platform complete for any genomics lab.


Thursday, September 4, 2008

The Ends Justify the DNA

In Next Gen experiments, libraries of DNA fragments are created in different ways, from different samples, and sequenced in a massively parallel format. The preparation of libraries is a key step in these experiments. Understanding and validating the results requires knowing how the libraries were created and where the samples came from.

Background

In the last post, I introduced the concept that nearly all Next Gen sequencing applications are fundamentally quantitative assays that utilize DNA sequences as data points.

In Sanger sequencing, the new DNA molecules are synthesized, beginning at a single starting point determined by the primer. If the sequencing primer binds to heterogeneous molecules that contain the same binding site, for example, two slightly different viruses in a mixed population, a single read from Sanger sequencing could represent a mixture of many different molecules in the population, with multiple bases at certain positions. Next Gen sequencing, on the other hand, produces single reads from single individual molecules. This difference between the two methods allows one to simultaneously collect millions of sequence reads in a massively parallel format from single samples.

An additional benefit of massively parallel sequencing is that it eliminates the need to clone DNA, or create numerous PCR products. Although this change reduces the complexity of tracking samples, it increases the need to track experiments with greater detail and think about how we work with the data, how we analyze the data, and how we validate our observations to generate hypotheses, make discoveries, and identify new kinds of systematic artifacts.

Making Libraries

To better understand the significance of what a Next Gen experiment measures, we need to understand what DNA libraries are and how they are prepared. For this discussion we'll define a DNA library as a random collection of DNA molecules (or fragments) that can be separated and identified.

Before we do any kind of Next Gen experiment, we want to know something about the kinds of results we’d expect to see from our library. To begin, let’s consider what we would see from a genomic library consisting of EcoRI restriction fragments. If the digestion is complete, EcoRI will cut DNA between the G and the A every time it encounters the sequence 5'-GAATTC-3'. Every fragment in this library would have the sequence 5'-AATT-3' at every 5’ end. The average length of the fragments will be 4096 bases (~4 kbp), because a specific six-base sequence is expected once every 4^6 = 4096 bases in random sequence. However, the distribution of fragment lengths follows Poisson statistics [1], so the actual library will have a few very large fragments (>> 4 kbp) and numerous small fragments.

You may ask “why is this useful?”

Our EcoRI library example helps us to think about our expectations for Next Gen experimental results. That is, if we collect 10 million reads from a sample, what should we expect to see when we compare our data to reference data? We need to know what kinds of results to expect in order to determine if our data represent discoveries, or artifacts. Artifacts can be introduced during sample preparation, sample tracking, library preparation, or from the data collection instruments. If we can’t distinguish between artifacts and discoveries, the artifacts will slow us down and lead to risky publications.

In the case of our EcoRI digest, we can use our predictions to validate our results. If we collected sequences from the estimated 732,000 fragments and aligned the resulting data back to a reference genome, we would expect to see blocks of aligned reads at every one of the 732,000 restriction sites. Further, for each site there should be two blocks, one showing matches to the "forward" strand and one showing matches to the "reverse" strand.
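
The numbers used in this example can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes equal base frequencies, independent positions, and a genome of 3 billion bases. The genome size is my assumption, chosen because it is consistent with the 732,000 estimate.

```python
SITE = "GAATTC"              # EcoRI recognition sequence
GENOME_SIZE = 3_000_000_000  # assumed genome size in bases

# Probability that a given position starts a site, if all four
# bases are equally likely and positions are independent
p_site = 0.25 ** len(SITE)

mean_fragment = 1 / p_site              # expected spacing between sites
expected_fragments = GENOME_SIZE * p_site

print(mean_fragment)              # 4096.0
print(round(expected_fragments))  # 732422
```

Real genomes are not random sequence, so the observed counts will differ. The value of the estimate is that it tells us roughly what to expect before looking at the data.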

We could also validate our data set by identifying the positions of EcoRI restriction sites in our reference data. What we'd likely see is that most things work perfectly. In some cases, however, we would see alignments, but no evidence of a restriction site. In other cases, we would see a restriction site in the reference genome, but no alignments. These deviations would identify differences between the reference sequence and the sequence of the genome we used for the experiment. Those differences could result either from errors in the sequence of the reference data or from a true biological difference. In the latter case, we would examine the bases and confirm the presence of a restriction fragment length polymorphism (RFLP). From this example, we can see how we can define the expected results, and use that prediction to validate our data and determine whether our results correspond to interesting biology or experimental error.
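
The comparison described above can be sketched as follows. The function names, the toy reference string, and the block positions are hypothetical. The sketch looks only at the forward strand and assumes reads begin one base into the site, immediately after the cut between G and A.

```python
import re


def find_sites(reference, site="GAATTC"):
    """Return 0-based start positions of every occurrence of site."""
    return [m.start() for m in re.finditer(re.escape(site), reference)]


def compare_sites(reference, block_starts, site="GAATTC", cut_offset=1):
    """Compare predicted read starts with observed blocks of aligned reads.

    block_starts: positions where blocks of reads begin (hypothetical input).
    cut_offset:   distance from site start to the cut (1 for EcoRI, G^AATTC).
    """
    expected = {pos + cut_offset for pos in find_sites(reference, site)}
    observed = set(block_starts)
    return {
        "confirmed": sorted(expected & observed),
        "site_without_reads": sorted(expected - observed),
        "reads_without_site": sorted(observed - expected),
    }


# Toy reference with EcoRI sites at positions 2 and 12
reference = "TTGAATTCAAAAGAATTCGG"
report = compare_sites(reference, block_starts=[3, 9])
print(report)
```

In this toy case one site is confirmed, one site has no reads, and one block of reads has no site. The last two categories are the deviations worth examining as either reference errors or candidate polymorphisms.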

Digital Gene Expression

Of course what we expect to see in the data is a function of the kind of experiment we are trying to do. To illustrate this point I'll compare two different kinds of Next Gen experiments that are both used to measure gene expression: Tag Profiling and RNA-Seq.

In Tag Profiling, mRNA is attached to a bead, converted to cDNA, and digested with restriction enzymes. The single fragments that remain attached to the beads are isolated and ligated to adaptor molecules, each one containing a type II restriction site. The fragments are further digested with the type II restriction enzyme and ligated to a sequencing adaptor to create a library of cDNA ends with 17 unique bases, or tags. Sequencing such a library will, in theory, yield a collection of reads that represents the population of RNA molecules in the starting material. Highly expressed genes will be represented by a larger number of tagged sequences than genes expressed at lower levels.
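
As a rough illustration of what such a library should contain, the expected tag for a transcript can be predicted from its sequence. This sketch assumes the mRNA is held on the bead by its 3' end, so the fragment that stays attached starts at the site closest to the 3' end. The recognition sequence "GATC" is only a stand-in, since the enzymes are not named here, and the function name is hypothetical.

```python
def expected_tag(mrna, site="GATC", tag_length=17):
    """Predict the tag that follows the 3'-most restriction site.

    site:       assumed recognition sequence (a stand-in, not from the text)
    tag_length: number of unique bases in each tag (17 in the text)
    Returns None if the site is absent or too close to the 3' end.
    """
    pos = mrna.rfind(site)  # the site closest to the 3' end
    if pos == -1:
        return None
    start = pos + len(site)
    tag = mrna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Made-up transcript with two sites; only the 3'-most one yields the tag
mrna = "AAGATCCC" + "GATC" + "ACGTACGTACGTACGTA" + "TTTT"
print(expected_tag(mrna))  # ACGTACGTACGTACGTA
```

Predicting tags this way for every reference transcript gives a table of expected tags against which the observed reads can be checked.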

Both Tag Profiling and RNA-Seq begin with an mRNA purification step, but after that point the procedures differ. Rather than synthesize a single full-length cDNA for every transcript, RNA-Seq uses random six-base primers to initiate cDNA synthesis at many different positions in each RNA molecule. Because these primers represent every combination of six-base sequences, priming with these sequences produces a collection of overlapping cDNA molecules. Starting points for DNA synthesis will be randomly distributed, giving high sequence coverage for each mRNA in the starting material. Like Tag Profiling, genes expressed at high levels will have more sequences present in the data than genes expressed at low levels. Unlike Tag Profiling, any single transcript will produce several cDNAs aligning at different locations.
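
A small sketch can make the random priming picture concrete. It enumerates every possible six-base primer, then draws random start points along a transcript. Uniform priming is an idealization that I am assuming for illustration, and the function name and parameters are hypothetical.

```python
import itertools
import random

# Every combination of six bases: 4^6 possible primers
hexamers = ["".join(p) for p in itertools.product("ACGT", repeat=6)]
print(len(hexamers))  # 4096


def simulate_starts(transcript_length, n_fragments, seed=0):
    """Draw cDNA start points uniformly along a transcript (idealized)."""
    rng = random.Random(seed)
    return sorted(rng.randrange(transcript_length) for _ in range(n_fragments))


starts = simulate_starts(transcript_length=2000, n_fragments=25)
print(starts)
```

With enough fragments the start points spread across the whole transcript, which is the pattern we expect to see when real RNA-Seq reads are aligned.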

When the sequence data sets for Tag Profiling and RNA-Seq are compared, we can see how the different methods for preparing the DNA libraries contrast with one another. In this example, Tag Profiling [2] and RNA-Seq [3] data sets were aligned to human mRNA reference sequences (RefSeq, NCBI). The data were processed with Maq [4] and results displayed in FinchLab. In both cases, relative gene expression can be estimated by the number of sequences that align. If we know the origins of the libraries, the kinds of genes and their expression can give us confidence that the results fit the expression profile we expect. For example, the RNA-Seq data set is from mouse brain and we see genes at the top of the list that we expect to be expressed in this kind of tissue (last figure below).

The Tag Profiling and RNA-Seq data sets also show striking differences that reflect how the libraries are prepared. In each report, the second column gives information about the distribution of alignments in the reference sequence. For Tag Profiling this is reported as "Tags." The number of Tags corresponds to the number of positions along the reference sequence where the tagged sequences align. In an ideal system, we would expect one tag per molecule of RNA. Next Gen experiments, however, are very sensitive, so we can also see tags for incomplete digests. Additionally, sequencing errors and high mismatch tolerance in the alignments can sometimes place reads incorrectly and give unusually high numbers of tags. When the data are more closely examined, we do see that the distribution of alignments follows our expectations more closely. That is, we generally see a high number of reads at one site, with the other tag sites showing a low number of aligned reads.
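
The per-gene numbers discussed here can be computed from a list of alignments. The sketch below is a hypothetical illustration, not the FinchLab implementation. It counts reads per gene, counts distinct alignment positions (the "Tags" value), and reports how much of the signal sits at the single most common site.

```python
from collections import defaultdict


def summarize(alignments):
    """Summarize alignments given as (gene, start_position) pairs, one per read."""
    per_gene = defaultdict(lambda: defaultdict(int))
    for gene, pos in alignments:
        per_gene[gene][pos] += 1

    summary = {}
    for gene, positions in per_gene.items():
        reads = sum(positions.values())
        summary[gene] = {
            "reads": reads,             # proxy for relative expression
            "tags": len(positions),     # distinct alignment positions
            "top_site_fraction": max(positions.values()) / reads,
        }
    return summary


# Made-up data: g1 has one dominant site plus two minor ones
alignments = [("g1", 10)] * 8 + [("g1", 55), ("g1", 90)] + [("g2", 5)] * 2
summary = summarize(alignments)
print(summary["g1"])  # {'reads': 10, 'tags': 3, 'top_site_fraction': 0.8}
```

A high top-site fraction matches the expectation for Tag Profiling. A gene with many tags and no dominant site would be worth a closer look for incomplete digestion or misplaced reads.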


For RNA-Seq, on the other hand, we display the second column (Read Map) as an alignment graph. For RNA-Seq data, we expect that the number of alignment start points will be very high and randomly distributed throughout the sequence. We can see that this expectation matches our results by examining the thumbnail plots. In the Read Map graphs, the x-axis represents the gene length and the y-axis is the base density. Presently, all graphs have their data plotted on a normalized x-axis, so the length of an mRNA sequence corresponds to the density of data points in the graph. Longer genes have points that are closer together. You can also see gaps in the plots; some are internal and many are at the 3'-end of the genes. When the alignments are examined more closely, and we incorporate our knowledge of the exon structure or polyA addition sites, we can see that many of these gaps either show potential sites for alternative splicing or data annotation issues.
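
A thumbnail of this kind can be approximated by binning alignments onto a fixed number of positions, whatever the gene length. The sketch below is a simplified, hypothetical version: it bins read start points rather than per-base density, and the function names and bin count are my own choices.

```python
def read_map(starts, gene_length, bins=50):
    """Bin read start positions onto a length-normalized x-axis."""
    counts = [0] * bins
    for s in starts:
        # Scale the position to a bin index; clamp the last base into the last bin
        idx = min(s * bins // gene_length, bins - 1)
        counts[idx] += 1
    return counts


def gaps(counts):
    """Return indexes of empty bins, i.e. candidate gaps to inspect."""
    return [i for i, c in enumerate(counts) if c == 0]


# Made-up data: reads at both ends of a 100-base gene, nothing in between
counts = read_map([0, 10, 99], gene_length=100, bins=10)
print(counts)        # [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(gaps(counts))  # [2, 3, 4, 5, 6, 7, 8]
```

Because every gene is drawn with the same number of bins, genes of different lengths can be compared side by side, and empty bins point to regions to check against exon structure and annotation.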


In summary, Next Gen experiments use DNA sequencing to identify and count molecules, from libraries, in a massively parallel format. The preparation of the libraries allows us to define expected outcomes for the experiment and choose methods for validating the resulting data. FinchLab makes use of this information to display data in ways that make it easy to quickly observe results from millions of sequence data points. With these high-level views and links to drill-down reports and external resources, FinchLab provides researchers with the tools needed to determine whether their experiments are on track to create new insights, or if new approaches are needed to avoid artifacts.

References

[1] The distribution of restriction enzyme sites in Escherichia coli. G A Churchill, D L Daniels, and M S Waterman. Nucleic Acids Res. 1990 February 11; 18(3): 589–597.

[2] Tag Profile dataset was obtained from Illumina.

[3] Mapping and quantifying mammalian transcriptomes by RNA-Seq. A Mortazavi, BA Williams, K McCue, L Schaeffer, B Wold. Nat Methods. 2008 Jul;5(7):621-8. Epub 2008 May 30.
Data available at: http://woldlab.caltech.edu/rnaseq/

[4] Mapping short DNA sequencing reads and calling variants using mapping quality scores. H Li, J Ruan, R Durbin. Genome Res. 2008 Aug 19. [Epub ahead of print]
