Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and demonstrate some of BioHDF’s capabilities and advantages, I share the poster we presented at this year’s AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multi-step processes that, 3) compare NGS data to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared. 

Next, choices for managing NGS data are compared in a four category table.  These include text and binary formats. While text formats (delimited and XML) have been popular for bioinformatics, they do not scale well and binary formats are gaining in popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations. 

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and meta data. HDF5 has been used for over 20 years in other data intensive fields; it is robust, portable, and tuned for high performance computing. Thus HDF5 is well suited for NGS. Indeed, groups from academic researchers to NGS instrument vendors, and software companies are recognizing the value of HDF5.
Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams), and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one and other, in tertiary analysis, to gain scientific insights. 

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess overall randomness of a library and observe systematic nucleotide incorporation errors or biases.

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile summarizing QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs that are within one standard deviation of the mean. Next, box and whisker plots show the overall quality distribution (median, lower and upper quartile, minimum and maximum values) at each position. These plots are followed by “error” plots which show the total count of QVs below certain thresholds (red, QV < 10; green QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.

The final group of plots show overall dataset complexity, GC content (base space only), average QV/read, and %GC vs average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicates a problem like PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially increase complexity.
Section 3.  Primary data analysis gives us a picture of how well the samples were prepared or how well the instrument ran with some indication about sample quality. Secondary and tertiary analysis tell us about sample quality and more importantly, provides biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high level data analysis workflow using BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data and other software within GeneSifter is used to compare data between samples and display the data in interactive reports to examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, and identified single base differences (SNVs) and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of data, and move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that data can be observed in a context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene (middle) in the different samples. In the case of RNA-Seq, we can also obtain expression values for the individual exons with the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph, takes us to the alignment details for the samples being analyzed (bottom), in this example we have two cases and two control replicates. Like the single sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked all of the displayed genes move together so that you can explore the gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments and if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times, becomes noticeable. This fact is demonstrated in the in table listed in the second panel “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notably is HDF5’s ability to easily support multiple indexing schemes like nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, which I will save for a later post.

Finally, the poster is summarized with a number of take home points. These reiterate the fact that NGS is driving the need to use binary file formats to manage NGS and analysis results and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable and interactive web-application with less development effort than would have been needed to implement other technologies.  

If you are software developer and are interested in BioHDF please visit www.biohdf.org.  If you do not want to program and instead, want a way to easily analyze your NGS data to make new discoveries, please contact us

Labels: , , , , , , , ,

Friday, March 19, 2010

RNA Deep Sequencing - Beyond Proof of Concept

ABRF 2010 begins this weekend.  In addition to my LIMS presentation on Sunday, I will present our poster featuring data analysis of sequences from "Sex-specific and lineage-specific alternative splicing in primates" (Blekhman et. al) in GeneSifter Analysis Edition.

The poster number is RP-3. Stop by and see how we learned that not all samples are what they seem to be ...

Abstract 

Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has been largely demonstrated on individual sample analyses (1-3). The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates, and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon to be available “third” generation systems, will make it possible to simultaneously measure 100’s to1000’s of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

In this poster we present a system that is used it to verify and further characterize previously published data from a gene expression study that includes both replicates and experimental values comparing sex and lineage specific alternative splicing in primates (4). This system, developed on a high performance computing architecture, stores and organizes the data, aligns millions of reads to different reference sequences, identifies and removes artifacts, executes comparative and statistical analyses, and finally links results to pathway and ontological information for making discoveries and confirming hypotheses. The supporting infrastructure includes intuitive user interfaces for organizing data, executing analytical operations, viewing summarized reports, navigating through details in the results and can be easily operated by biologists.

1. Marioni JC, et. al. (2008) Genome Res.

2. Ramsköld D, et. al. (2009) PLoS Comput Biol.

3. Pleasance ED, et. al.(2010) Nature.

4. Blekhman R, et. al. (2009) Genome Res.

Labels: , ,

Tuesday, February 23, 2010

GeneSifter Lab Edition v3.14 - Release Notes

GeneSifter Laboratory Edition (GSLE) 3.14.0 introduces a host of new features and capabilities that make daily laboratory data management work even easier.  Read below to learn why GSLE is a leading LIMS product for all forms of DNA sequencing, microarrays, and other genetic analysis applications.

Orders and Invoices

Multi plate submissions: Order forms have been extended in several ways to further simplify how labs collect sample and project information. A new order form template lets core facilities, managing larger sequencing projects, easily receive samples and their information in a multiple plate format. New order fields specific to the plate format are included to support sample tracking and lab work.

Add data to fields: Orders forms have been further improved by adding the ability to add new values (or terms) to dropdown fields that already exist on published order forms.


Project field: Additionally, labs can add an optional project field to forms. With these improvements, labs can create forms that are easier to use and modify, as well as enable project tracking for their customers.

Sample location and sample selection: Two new features deliver help for labs that provide sample storage (biobanking) services to their clients. First, order forms can include sample location information. This is particularly useful in situations where samples are delivered in 96-well plates that are stored for later use. Second, samples already stored by the lab as purified DNA, RNA or other material (templates) can be selected from specialized search interfaces within order forms. Like all GSLE sample entry forms, these features can be included or not on a case-by-case basis depending on your specific needs. 

Invoice formatting: For labs that have the dreaded chore of sending billing data to accounting departments we have added the ability to modify the invoice number format to include additional characters that are used to distinguish which labs are sending information.

Laboratory Operations


GSLE provides the ability to create, list and follow steps in sample protocols (also called workflows). In 3.14 new features not only expand the capabilities but make it possible to further standardize procedures. 


Multiplexing: In Next Generation Sequencing (NGS) several libraries are often combined into a single lane or region of a slide to increase the number of individual samples analyzed in a sequencing run. As each library is prepared, a specific adaptor sequence is added so sequence reads corresponding to different samples can be identified by their adaptor tag. This procedure, called multiplexing or barcoding, is supported in 3.14 and allows the lab to combine samples and adaptor sequences and group the combination of libraries together (Worksets) for sample processing and instrument runs. Once data are collected, sample naming conventions, combined with adaptor sequence (Multiplex Identifier, MID) stored in sample sheets, are used to separate individual reads into files corresponding to the samples that were in the original workset.

Batch data entry: Some lab processes require that samples are manipulated in groups (batches), but laboratory data are collected for individual samples within the batch. For example, the concentrations of individual DNA samples may need to be measured in a 96-well plate. To improve how the OD values, comments, or other information are entered, workflow steps have been updated to include batch data entry forms that provide spreadsheet like data entry capabilities. Like all GSLE batch data entry forms, data can be entered easily using the form’s column highlight and easy fill controls, or uploaded from an excel spreadsheet.

Subsample processing: GSLE 3.14 also increases sample processing flexibility. As noted above, order forms can now support the ability to select samples that are already stored in the system. This feature is further extended into the laboratory by creating tools that allow many new samples to be created from a “parent” or stock samples. When new samples (templates) are created, options are provided so that each new sample can be entered into a different process. For example, you receive a tissue sample that needs several experiments performed; RNA-Seq, ChIP-Seq and resequencing. Now you can easily pick the sample and create three new sub samples defining which process will be performed on each sample with just a few clicks.

Selecting samples based on custom data: Some labs need to use custom data entered into order forms to sort and filter samples in the lab. For example, an order form may ask a researcher to enter read lengths for their NGS run. A 36 base run is much faster than a 100 base run, and on some platforms costs less. Thus, the lab will sort samples based on read length prior to the data collection event. While always possible to get this information in many GSLE displays, 3.14 adds new capabilities to use any custom data in its specialized sample picker tools.

Other Features

Customer data management: GSLE v3.14 gives labs’ customers increased ability to organize their chromatograms, fragment analysis files and microarray files as needed. Data files can be edited, relabeled, moved or deleted. Projects and folders can be created, modified or deleted to aid in data organization.

Application Programming Interface (Onsite Installations Only)

SQL-API: As automation and system integration needs increase, requirements for supporting programmatic data entry become more important. GSLE has continued to expand the self-documenting Application Programming Interface (API). We have also added an SQL API that can be used to create custom reports that are accessed via a wget style unix command.


Input API enhancements: The Input API now returns success IDs and CGI parameter names have been eliminated. The full documentation can be reviewed by contacting support@geospiza.com for the GSLE SQL API Manual or the GSLE Input API Manual. 


Next Generation Analysis Transfer Tool (Hosted Partners Only)

Simplified data transfers: A data transfer interface has been added to connect GSLE and GeneSifter Analysis Edition (GSAE). Partner Program administrators use the interface to select data files in GSLE and transfer them to their customer’s account in GSAE.

Schema Table update note


There was an update to an existing schema table;  the column "Plate_Label" is now in table om_sample_plate instead of om_order.

Labels: , , , , ,

Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system


Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE suppport DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality advanced LIMS system are significantly reduced.

Throughout 2009, 100’s of features were added to GSLE to increase support for instruments and data types, and improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automate uploading files
  • Improving NGS file and data browsing to simplify the process of searching and viewing the 1000’s of files produced in Next Gen sequencing runs
  • Making NGS data downloads, of very large gigabase files, robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.

GSAE

As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is to provide quality control. Thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare the results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform making it well suited for NGS-based gene expression assays. Over 2009 many new features were added to extend its utility to NGS data analysis with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots, unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS and microarray-based gene expression analysis. And, the addition of variation detection and visualization lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.

Labels: , , , ,

Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear, is understanding the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues about sharing 'omics data. The challenge being that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies and the central table included 12 data sharing policies that are already in effect.

Sharing data solves half of the problem, the other aspect is being able to use the data once shared. This requires that data be structured and annotated in ways that make it understandable by a wide range of research groups. Such standards typically include minimum information check lists that define specific annotations, and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know with respect to how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges with making data interoperable; the very problem standards try to address.

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researches need time-efficient data management systems, the right kinds of tools and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing we support community developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and BioConductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-effiecient data management solutions needed to move data through its complete life cycle from collection, to intermediate analysis, to publishing files in standard formats.

GeneSifter lowers researcher's economic barriers of meeting data sharing and annotation standards keep the focus on doing good science with the data.

Labels: , , , , , , ,

Wednesday, September 23, 2009

GeneSifter in Current Protocols

This month we are pleased to report Geospiza's publication of the first standard protocols for analyzing Next Generation Sequencing (NGS) data. The pulication, appearing in the September issue of Current Protocols, addresses how to analyze data from both microarray, and NGS experiments. The abstract and links to the paper and our press release are provided below.

Abstract

Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD).

In the chapter we cover the following protocols:
  • Basic Protocol 1: Comparing Gene Expression from Paired Sample Data Obtained from Microarray Experiments
  • Alternate Protocol 1: Compare Gene Expression from Paired Samples Obtained from Transcriptome Profiling Assays by Next-Generation DNA Sequencing
  • Basic Protocol 2: Comparing Gene Expression from Microarray Experiments with Multiple Conditions
  • Alternate Protocol 2: Compare Gene Expression from Next-Generation DNA Sequencing Data Obtained from Multiple Conditions

Links

To view the abstract, contents, figures, and literature cited online visit: Curr. Protoc. Bioinform. 27:7.14.1-7.14.34

To view the press release visit: Geospiza Team Publishes First Standard Protocol for Next Gen Data Analysis

Labels: , , , , , , , ,

Saturday, September 12, 2009

Sneak Peak: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn about how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.

Labels: , , , , , , ,

Tuesday, May 12, 2009

Small RNAs Get Smaller

Tiny RNAs recently joined the growing list of non-coding RNA (ncRNA) molecules [1]. Their absolute function is not understood, but they are possibly a new class of ncRNA and appear to be most associated with transcription of highly expressed genes in human, chickens and Drosophila and possibly others.

This was the conclusion of work published in the May issue of Nature Genetics. Remember when all we had to worry about was the central dogma? DNA was transcribed into RNA and RNA was translated into protein. Life was so simple.

Not really. Even as the first genetic code was being elucidated [2], the possibility of uncovering the second code, that translates nucleic acid sequences into protein sequences was being contemplated [3]. Translating RNA into protein required other kinds of RNA that became known as ribosomal (rRNA) and transfer RNA (tRNA). The RNA between DNA and protein became messenger (m)RNA. In the late seventies, introns were discovered [4,5] and soon to follow were small nuclear (sn)RNAs and “snurps” (small nuclear riboproteins). The snRNAs were further characterized as small nucleolar (snoRNA) and Cajal body-specific (scaRNA) RNAs, and a class of new molecules were investigated for their involvement in mRNA splicing.

As the mechanisms for splicing were being worked out, researchers were able to prove that RNA could also be an enzyme [6]. In this case, the intron is the enzyme responsible for splicing itself out to create the mature mRNA. At the same time, another group discovered that the catalytic unit of RNAase-P, an enzyme involved in converting precursor tRNAs into active tRNAs, is also RNA [7]. Indeed, later work revealed that rRNA in the large ribosome subunit catalyzes the peptidyl transferase reaction to join amino acids together to build proteins [8]. Not only does the central dogma require a multitude of RNA molecules to transcribe DNA into RNA and translate RNA into protein, but the RNA molecules are responsible for carrying the information needed to make proteins and supplying the enzymatic activity to do the work!

What else does RNA do?

More than we can imagine. Starting with the discovery that double stranded RNA (dsRNA) could inhibit gene expression by turning on RNA interference (RNAi) pathways [9], new RNAs were identified, micro (miRNA) and small interfering (siRNA), as essential to the RNAi pathway. miRNA and siRNA were the early members of what would become a large and growing class of RNAs now referred to as non-coding RNAs (ncRNAs).

The ncRNAs represent a next frontier in RNA research and understanding gene expression. Some ncRNAs are large, like lincRNAs (large intervening non-coding RNAs) [10], but most are small between 18 and 31 nt. Within in the small ncRNA group are piwi-interacting (piRNA), repeat associated small interfering (rasiRNA), small temporal (stRNA), and now transcription initiation (tiRNA) RNA. I like tiny RNA.

Tiny, or tiRNAs, were discovered by Next Generation Sequencing (NGS) studies. RNA libraries were prepared from specific size fractions of capped messages. The resulting libraries were sequenced on the Roche FLX Genome Sequencing system and the data were aligned to human genome build 36.1 and compared to transcription start sites (TSS) defined by RefGene (NCBI). The authors reasoned the previous deep-sequencing studies missed these RNA molecules because they tend to be disregarded as low-abundance spurious, or degradation products. However, because they can be cloned, they must have a 5’ phosphate and, when aligned to genomc sequences, the NGS reads cluster in a non-random fashion around TSSs.

GeneSifter enables small RNA research

NGS makes it possible to explore the RNA world in new ways by designing experiments to capture small RNA molecules and sequence them in a massively parallel, high throughput format. However, both the experiments and data analysis are technically challenging. Fortunately GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE) can help. In GSLE you can use the software to track RNA preparation steps and record data at different points of the process. GSAE is accompanied with data analysis pipelines designed to filter artifacts and identify known small RNAs. Post alignment clustering reports, based on coverage in a genome, can be used to further refine results an discover new RNA species as well. Moreover, you can convert the clustering reports into lists of expression values for these RNAs and compare their expression between different samples, tissues, or experimental conditions.


References
1. Taft R.J., Glazov E.A., Cloonan N., et. al., 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet 41, 572-578.

2. Watson J.D. and Crick F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738 (1953)

3. Crick F.H., Barnett L., Brenner S., Watts-Tobin R.J., 1961. General nature of the genetic code for proteins. Nature 192, 1227-1232.

4. Chow L.T., Roberts J.M., Lewis J.B., Broker T.R., 1977. A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids. Cell 11, 819-836.

5. Berk A.J., Sharp P.A., 1977. Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids. Cell 12, 721-732.

6. Zaug A.J., Cech T.R., 1982. The intervening sequence excised from the ribosomal RNA precursor of Tetrahymena contains a 5-terminal guanosine residue not encoded by the DNA. Nucleic Acids Res 10, 2823-2838.

7. Guerrier-Takada C., Gardiner K., Marsh T., Pace N., Altman S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35, 849-857.

8. Nissen P., Hansen J., Ban N., Moore P.B., Steitz T.A., 2000. The structural basis of ribosome activity in peptide bond synthesis. Science 289, 920-930.

9. Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806-811.

10. Guttman M., Amit I., Garber M., French C., Lin M.F., Feldser D., Huarte M., Zuk O., Carey B.W., Cassady J.P., Cabili M.N., Jaenisch R., Mikkelsen T.S., Jacks T., Hacohen N., Bernstein B.E., Kellis M., Regev A., Rinn J.L., Lander E.S., 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223-227.


Further Reading
ncRNA - http://nar.oxfordjournals.org/cgi/reprint/35/suppl_1/D178
snoRNA - http://en.wikipedia.org/wiki/SnoRNA
siRNA - http://en.wikipedia.org/wiki/SiRNA
miRNA - http://en.wikipedia.org/wiki/MicroRNA
piRNA - http://en.wikipedia.org/wiki/Piwi-interacting_RNA
rasiRNA - http://en.wikipedia.org/wiki/RasiRNA
stRNA - http://jcs.biologists.org/cgi/content/full/116/23/4689
Ribozymes - http://en.wikipedia.org/wiki/Ribozyme


Databases
miRBASE - http://microrna.sanger.ac.uk/sequences/
RNAdb - http://research.imb.uq.edu.au/rnadb/default.aspx

Labels: , , , , ,

Tuesday, April 28, 2009

Life in the Clouds

Today, the Applied Biosystems division of Life Technologies announced their partnership with us (Geospiza) to use Amazon Web Services (AWS) to use cloud-computing technologies to help customers manage data from advanced genomic analysis platforms.

This news is significant for several reasons.

First, as noted by our President, Rob Arnold, in Xconomy, this is the first time a leading gene sequencing company has agreed to offer the sequencing instrument, the consumable chemicals needed to run experiments, and the software needed to sort through and make sense of the data, in a single package and run the software under a SaaS model.

Second, through this news, and other activities, we are proactively addressing one of the major challenges for Next Generation Sequencing. Specifically, that the costs and time for purchasing, deploying, and maintaining IT systems to support NGS data management and analysis are simply out of reach for the vast majority of research groups that can benefit from these technologies. Presently, the groups making the greatest progress with NGS have advanced bioinformatics and IT support. However, if these technologies are going to truly meet their promise of revolutionizing genomics, the numbers of scientists utilizing NGS needs to increase. All over the country (and world) medical researchers, microbiologists, plant biologists, and other scientists have interesting samples and new ideas for which NGS experiments will provide amazing discoveries, but they can only follow through on those ideas if they can work with their data in a reasonable cost effective way.

Finally, today's news is gratifying because it validates one of Geospiza's important early technology decisions. Cloud-computing, also called "Software as a Service" (SaaS) is not new to Geospiza. When we started in 1997, we made an important decision to develop our platform as a web-based system. We've been working with web technology and Internet-based services right from the beginning. Back then we used the term ASP (Application Service Provider) instead of SaaS, but we have had clients effectively using our systems this way for many years. Our long experience with cloud computing has prepared us to meet the new challenges created by NGS and we look forward to working with Applied Biosystems on the AWS platform to extend the number of options we can provide to our customers, helping to lower their computing costs, and working to enable their science.

For more information visit:

Geospiza Cloud
Press Release

FinchTalks where SaaS is discussed:

Have we been deluged?
Three Themes from AGBT and ABRF Part III: The IT Problem
Next Gen-Omics

Focus on Next Gen Sequencing
Closing 2008

Labels: , , , ,

Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the NGS general data analysis process; its issues and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction into NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in thee phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse erythroid stem (ES) and body (EB) cells were investigated. DNA libraries were produced from each cell line. Sequences were collected from the library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is repeated twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter laboratory edition to observe differential expression (bottom of the panel).

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed. The question being, are the additional peaks the result of isoforms (alternative polyA sites) or incomplete restriction enzyme digests? How might this be sorted out? Like RNA-Seq, the bottom panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments in different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have human counter part. In total, four, samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).

Section 6 presents some of the challenges of scale issues that accompany NGS, and how we are addressing these issues with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex, different libraries of sequence data can filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the USCS genome browser, and miRBASE, show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf

Labels: , , , , , , ,

Sunday, March 1, 2009

Sneak Peak: Small RNA Analysis with Geospiza

Join us this Wednesday, March 4th at 10:00 A.M. PST (1:00 P.M. EST), for a webinar focusing on small RNA analysis. Eric Olson, our VP of Product Development and principle designer of Geospiza’s GeneSifter Analysis Edition will present our latest insights on analyzing large Next Generation Sequencing datasets to study small RNA biology.

Follow the link to register for this interesting presentation.

Abstract

Next Generation Sequencing allows whole genome analysis of small RNAs at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known small RNAs in a sample, these applications are also ideal for the identification of novel small RNAs. This presentation will provide an overview of micro RNA expression analysis from raw data to biological significance using examples from publicly available datasets and Geospiza’s GeneSifter software.



Labels: , , , , ,

Monday, February 23, 2009

Three Themes from AGBT and ABRF Part III: The IT Problem

The power of Next Generation DNA Sequencing (NGS) technology come from the fact that a massive amount of data, sampling millions of individual molecules, is collected in a massively parallel format. This power also limits the potential wide-spread adoption of the technology because of the IT (Information Technology) challenges that result from the massive amount of data created with each sequencer run.

IT challenges form the third technical theme from the AGBT and ABRF conferences. The previous two posts underscored the need for good laboratory practices and rich bioinformatics support to make NGS experiments successful. This post discusses the experiences communicated by the early adopters of NGS technology with respect to the computing infrastructure.

Surprises

Throughout the literature and NGS presentations, the data management issues created by NGS play a central role. Recent editorials in Nature Methods [1] and Nature Biotechnology [2] speak to the problem and express researchers' frustrations in dealing with the lack of IT infrastructures. At the ABRF workshop, we had two presentations specifically focused on the IT challenges, describing two different experiences.

In the first case, the group implementing NGS had a number of surprises after the NGS system was installed and running. They learned that these systems not only require a lot of storage and computing support, they also use up a lot of bandwidth when data are transferred. The bandwidth problem led to the need for a revised network architecture to isolate the NGS data flow from other network activity.

This talk brought similar surprises to mind. In other labs, NGS “surprises” have led to groups needing to upgrade server rooms by installing backup power, air conditioning, and other equipment. Of course these surprises are manageable if you have an IT group and a server room in the first place. In some cases, groups start with even less and find that the IT costs makes the NGS endeavor very expensive. Even with support and space the IT costs for bringing in NGS can quickly grow into six figures (above $100,000) for infrastructure alone.

The second presentation was given by a group who was well prepared for NGS. Their university had made a previous commitment to building an IT infrastructure to support data intensive genomics research, so adding NGS was a step up in their view. Their experience allowed them to develop a strong implementation plan that called for a number of systems upgrades that included upgrading network hardware. While total costs were less than the six figure surprises others experienced, they did spend many tens of thousands of dollars on new file servers, CPUs, network switches, and server room upgrades.

The conclusion from both of the presentations was that if you are going to set up an NGS infrastructure three things are important: planning, planning, planning. Also, institutional support is critically important since renovations and new building may need to ramp up too. Personnel with network, systems administration, and unix experience are also essential. Finally, as the second speaker put it, you need to encourage researchers to invest in the infrastructure. If they are not involved in the process and contributing time and money, the endeavor can quickly fail.

These talks bring me to my favorite marketing slogan where one of Illumina’s customers put an NGS instrument in their mail room. Whenever I hear that, or see the ad, it makes me think, “yes, you can turn a mail room into a genome center, but where will you put the data center?

There is a solution


For those thinking about NGS technology, or running an NGS experiment where the samples are submitted to a lab, and the data returned, even contemplating the IT requirements can be discouraging. But, it does not have to be this way. Over the past ten years, an immense infrastructure of data centers has emerged . Today, there are many options and price points available for storage, computing, and backup systems. Groups can save significant time and money using on-line services because costs scale with need. Moreover, on-line services eliminate the need for dedicated systems and data administrators putting more money in the budget for experiments. You have a choice. Jump in and do some interesting science or work hard to have your campus facilities remodeled.

Geospiza is taking advantage of the Internet’s infrastructure to offer our clients cost effective ways to get NGS running in their lab. GeneSifter Laboratory Edition can be delivered through a SaaS (Software as a Service) model to get labs up and running quickly. Just sign up, get access, and you are ready to go. GeneSifter Analysis Edition solves the IT problem for research groups who get their sequencing done through core labs or other service providers. In these cases, you upload you data and with a few clicks, process your data and analyze the results. Because the infrastructure is built, overall costs for IT and bioinformatics are much lower, and you do not have to experience a remodeling project.

References
1. 2008. Byte-ing off more than you can chew. Nat Methods 5, 577.
2. 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.

Labels: , , , ,

Thursday, February 19, 2009

Three Themes from AGBT and ABRF Part II: The Bioinformatics Bottleneck

In my last post, I summarized the presentations and conversations regarding Next Gen Sequencing (NGS) challenges in terms of three themes: You have to pay attention to details in the laboratory, bioinformatics is a bottleneck, and the IT burden is significant. In that post, I discussed the issues related to the laboratory and how GeneSifter Lab Edition overcomes those challenges.

This post tackles the second theme: the bioinformatics bottleneck.

In the Sanger days, bioinformatics was really a challenge for only the highest throughput facilities like genome centers. In these labs, streamlined workflows (pipelines) were developed for the different kinds of sequencing (genomes, ESTs[expressed sequence tags, 1], SAGE [serial analysis of gene expression, 2]). Because, Sanger sequencing was high cost and low throughput, compared to NGS, the cost of developing the bioinformatics pipelines was low relative to the cost of collecting the data. Thus, large-scale projects such as whole genome shotgun sequencing, ESTs, or resequencing studies, could be supported by a handful of pipelines that changed infrequently. In addition, small-scale Sanger projects could be handled well by desktop software and services like NCBI BLAST.

NGS breaks the Sanger paradigm. A single NGS instrument has the same throughput as an entire warehouse of Sanger instruments. To illustrate this point in a striking way, we can look at dbEST - NCBI’s database of ESTs. Today, there are approximately 59 million reads in this database, representing the total accumulation of sequencing projects over a 10 year period. Considering that one run of an Illumina GA or SOLiD can produce between 80 and 180 million reads in week or two we can now, in a single week, produce up to three times more ESTs than we have seen deposited over the past 10 years. These numbers also dwarf the amount of data collected from other gene expression analysis systems like microarrays and sequencing techniques like SAGE.


The emergence of the bioinformatics bottleneck

The bioinformatics bottleneck is related to the fact that NGS platforms are general purpose; they collect sequence data. That’s it. Because they collect a lot of data very quickly, we can use sequences as data points for many different kinds of measurements. When we think this way, an extremely wide range of experiments can be conceived.

From sequencing complete genomes, to sampling genes in the environment, to measuring mutations in cancer, to understanding epigenomics, to measuring gene expression and the transcriptome, NGS applications are proliferating at a rapid pace. However, each experiment requires a specialized bioinformatics pipeline and the algorithms used within a bioinformatics pipelines must be tuned for the data produced from the different sequencing platforms and questions being asked. When these considerations are combined with other issues like what reference data to use for sequence comparisons the number of bioinformatics pipelines can grow in a combinatorial fashion.

The early recommendation is that each lab wanting to do NGS work needs to have a dedicated bioinformatics professional. In more than one talk, presenters even quantified bioinformatics support in terms of FTEs (full time equivalents) per instrument. Bioinformatics is needed in both the sequencing laboratory, to develop and maintain quality control pipelines, and in the research environment, to process (align) the data, mine the output for interesting features, and perform comparative analyses between datasets.

But this won’t work

It is clear that bioinformatics is critical to understanding the data being produced. However, the current recommendation that any group planning NGS experiments should also have a dedicated bioinformatician is impractical for several reasons.

First, the model of a bioinformatician for every lab is simply not scalable. Fundamentally, there are not enough people that understand the science, programming, statistics, and other resources such as different forms of reference data, algorithms, and data types needed to make sense of NGS data. We see plenty of evidence, in the literature and presentations, that there are many outstanding people doing this work and contributing to the community, the problem is that they already have jobs!

Even if we consider that the above model is workable, hiring people takes significant time, is expensive, and ongoing costs are going to be high. These time and cost investments only become reasonable when a significant number of experiments are planned. One or two instruments will produce between 25 and 50 runs worth of data per year. If you calculate instrument costs, reagents, salary, and overhead costs, you are quickly into many thousands of dollars per sample. Indeed, a theme expressed in the bioinformatics bottleneck is that bioinformatics is becoming the single largest ongoing cost of NGS. Add in the IT computer support (next post) and you better have a plan for running a lot more than 50 runs per year. Remember the first issue - good bioinformaticians with NGS analysis experience have jobs.

If you have access to bioinformatics support, or can hire an individual, that person will quickly become overwhelmed with work. The biggest reason is that the software infrastructures needed to quickly develop new pipelines, automate them, and deliver data in ways that can be consumed by non-programming scientists are typically lacking. The result is that scientific programming efforts generally turn into lengthy software development projects because without an infrastructure, the numbers and kinds of experiments quickly grow past beyond the capacity of a single individual.

So, What can be done?

Geospiza solves the bioinformatics challenge in multiple ways. GeneSifter Lab and Analysis editions provide a platform that delivers the complete infrastructures needed to deploy NGS data processing pipelines and deliver results through web-based interfaces. These systems include pipelines for many of the common NGS applications such as transcription analysis, small RNA detection, ChIP-Seq and other assays. The system architecture and accompanying API creates a framework to quickly add new pipelines and make the results available to biologists running the experiments.

For those with access to bioinformatics help, GeneSifter will make your team more productive because developers will be freed of the burden of having to create the automation and delivery infrastructure, enabling them to focus on new scientific programming problems. For those without access to such resources, we have many pipelines ready to go. Moreover, because we have a platform and the infrastructure already built, as well as deep bioinformatics experience, we can create and deliver new analysis pipelines quickly. Finally, our product development roadmap is well-aligned with the most common NGS assays which means we you can probably do your bioinformatics analysis today!

References: 

1. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C., 1993. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4, 373-380.

2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W., 1995. Serial analysis of gene expression. Science 270, 484-487.

Labels: , , , ,

Monday, February 2, 2009

Next Gen Laboratory Software Systems for Core Facilities

Do you have a core lab? Considering adding Next Generation DNA sequencing capacity to your lab? Then you will be interested in visiting our both and checking out our poster at the annual Association for Biomolecular Research Facilities (ABRF) meeting next week in Memphis TN. We'll be at booth 408, and presenting poster number V27-S1.

Poster Abstract

Throughout the past year, as next generation sequencing (NGS) technologies have emerged in the marketplace, their promise of what can be done with massive amounts of sequence data has been tempered with the reality that performing experiments and working with the data is extremely challenging. As core labs contemplate acquiring NGS technologies, they must consider how the new technologies will affect their current and future operations. The old model of collecting and delivering data is likely to change to one where the core lab becomes an active participant in advising and helping clients set up experiments and analyze the data. However, while many labs want to utilize NGS, few have the Information Technology (IT) infrastructures and procedures in place to successfully make use of these systems.

In the case of gene expression, NGS technologies are being evaluated as complementary or replacement technologies for microarrays. Assays like RNA-Seq and tag profiling, that focus on measuring relative gene expression, require that researchers and core labs must puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with many steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA, have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present solutions to these challenges by showing results from a complete workflow system that includes data collection, processing, and analysis for RNA-seq suited for the core laboratory.

In the poster we'll walk through the laboratory and data analysis issues one needs to think about to perform a two cell expression comparison with RNA-Seq. Below is a snippet from the poster. I'll post the full presentation when I return.

Labels: , , , ,

Wednesday, January 21, 2009

The Experts Agree

It depends what you are trying to do. That is the take home message in Genome Technology’s (GT) trouble-shooting guide on picking assembly and alignment algorithms for Next-Gen sequence data.

In the guide, the GT team asked nine Next-Gen sequencing and bioinformatics experts to answer six questions:
  1. How do you choose which alignment algorithm to use?
  2. How do you optimize your alignment algorithm for both high speed and low error rate?
  3. What approach do you use to handle mismatches or alignment gaps?
  4. How do you choose which assembly algorithm to use?
  5. Do you use mate-paired reads for de novo assembly? how?
  6. What impact does the quality of raw read data have on alignment or assembly? how do your algorithms enhance this?
Even a quick look at the questions shows us that many factors need to be considered in setting up a Next-Gen sequencing lab. Questions 1 and 4 point out that aligning sequences is different from assembling them. Other questions address issues related to the size of the data sets being compared, the quality of the data being analyzed, the kinds of information that can be obtained, and the computational approaches being used for different problems.

What the experts said

First, they all agree that different problems require different approaches and have different requirements. In the first question about which aligner to use, the most common response was “for what application and which instrument?” Fundamentally, SOLiD data are different from Illumina GA data which are different from 454 data. While the end results may all be sequences of A's, G's, C's, and T's; the data are derived in different ways because of the platform-specific twists in collecting the data (recall “Color Space, Flow Space, Sequence Space, or Outer Space). Not only are there platform-specific methods for interpreting raw data, multiple programs have been developed for each instrument with their own strengths and weaknesses in terms of speed, sensitivity, the kinds of data they use (color, base, or flow spaces, quality values, and paired end data), and the information that is finally produced. Hence, in addition to choosing a sequencing platform you also have to think about the sequencing application, or the kind of experiment, that will be performed. In gene expression studies, for example, an RNA-Seq experiment has different requirements in terms of aligning the data and interpreting the output than an experiment with Tag Profiling.

Overall the trouble-shooting guide discussed 17 total algorithms, eight for alignment, and nine for assembly (two of which were for Sanger methods). Even this selection wasn't a comprehensive list. When other sites [1, 2] and articles [3] are included and proprietary methods are factored in, over 20 algorithms are available. So what to do? Which is best?

That depends

Yes, the choice of algorithm ultimately depends on what you are trying to do. While we can agree that there is no best solution, we also know that is not a helpful response. What is needed is a way to test the suitability of different algorithms for different kinds of experiments and to represent data in standard ways so that the features of specific algorithms can be evaluated. Also, as this is a new field, standard requirements for how data should be aligned, defining a correct alignment, and what kinds of information are the most informative in describing alignments are still emerging. Some of the early programs are helping to define these requirements.

One program we've used, at Geopsiza, for identifying requirements is MAQ, a program for sequence alignment. As noted in previous blogs [MAQ attack], MAQ is a great general purpose tool. It provides comprehensive information about the data being aligned and details about alignments. MAQ works well for many applications including RNA-Seq, Tag Profiling, ChIP-Seq, and resequencing assays focused on SNP discovery. In performance tests, MAQ is slower than some of the newer programs, one of which is being developed by MAQ’s author, but MAQ is a good model for getting the right kinds of information, formatted in a decent way. Indeed MAQ was the most cited program in the GT guide.

Let’s return to the bigger issue. That is, how can we easily compare between algorithms? For that we need a system where one can easily define a standardized dataset and reference sequence, and have a platform where a new algorithm can be added and run from a common interface. Standard reports that present features of the alignments could then be used to compare programs and parameters.

The laboratory edition of GeneSifter supports these kinds of comparisons. The distributed system architecture allows one to quickly develop control scripts to run programs and format their output in figures and tables that make comparisons possible. With this kind of system in place, the challenges move from which program to run and how to run it, to how to get the right kinds of information and best display the data. To address these issues, Geospiza’s research and development team is working on projects focused on using technologies like HDF5 to create scalable standardized data models for storing information from alignment and assembly programs. Ultimately this work will make it easy to optimize Next-Gen sequencing applications and assays and compare between assorted programs.

References
1. http://en.wikipedia.org/wiki/Sequence_alignment_software,
2. http://www.massgenomics.org/2009/01/short-read-aligners-update-at-agbt.html
3. Shendure J., Ji H., 2008. Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145.

Labels: , , , ,

Tuesday, January 6, 2009

From Reads to Data Sets, Why Next Gen is not like Sanger Sequencing

Time is running out! Be sure to register for the ABRF 2009 Next Generation DNA Sequencing Workshop to be held in Memphis TN Sat. Feb. 7, 2009. The day will include discussions from core lab directors and others about how to implement new sequencing technologies in a core lab environment. We'll consider how next generation sequencing differs from Sanger sequencing and devote much of the day learning about the practical impact of the differences.

An introduction:

Initially, DNA sequencing was performed to learn about sequences and structure of cloned genes. The first widely-used sequencing systems were based on the “Sanger” method. DNA was synthesized in the presence of chain terminating radioactive dideoxy-nucleotides (1). Mixtures of DNA fragments were separated by size using gel electrophoresis and the bases were identified and entered into a computer through manual techniques. Automated DNA sequencing instruments arrived later. These instruments made DNA sequencing thousands of times more efficient by detecting fluorescently labeled fragments and sending the information directly to a computer (2). For the first time, it became possible to sequence entire human genomes (3,4).

While highly successful, Sanger sequencing is cost prohibitive when it comes to deeper investigations of biological systems. Just as the questions investigated by Sanger sequencing shifted from single genes to entire genomes, the questions being asked by Next Generation techniques are changing as well. Questions related to transcription, for example, and promoter occupancy, can now be answered by using a massively parallel format to sample large collections of individual molecules. In effect, every RNA or DNA molecule might be sampled and counted. Not only are we looking at a “Next Generation” of DNA sequencing, we are looking a next generation of experimental techniques that answer different kinds of questions than those we asked before. These new technologies require fundamental changes in terms of experiment design and data analysis systems.


1) Sanger F., Nicklen S., Coulson A.R., 1977. “DNA sequencing with chain-terminating inhibitors.” Proc Natl Acad Sci U S A 74, 5463-5467.

2) Smith L.M., Sanders J.Z., Kaiser R.J., Hughes P., Dodd C., Connell C.R., Heiner C., Kent S.B., Hood L.E., 1986. “Fluorescence detection in automated DNA sequence analysis.” Nature 321, 674-679.

3) International Human Genome Sequencing Consortium, 2001. “Initial sequencing and analysis of the human genome.” Nature 409, 860-921.

4) Venter J.C., Adams M.D., Myers E.W., et. al. 2001. “The sequence of the human genome.” Science 291, 1304-1351.

Labels: , ,

Friday, December 12, 2008

Papers, Papers, and more Papers

Next Gen Sequencing is hot, hot, hot! You can tell by the numbers and frequency in which papers are being published.

A few posts ago, I wrote about a couple of grant proposals that we were preparing on methods to detect rare variants in cancer and improve the tools and methods to validate datasets from quantitative assays that utilize Next Gen data, like RNA-Seq, ChIP-Seq, or Other-Seq experiments. Besides the normal challenges of getting two proposals written and uploaded to the NIH, there was an additional challenge. Nearly everyday, we opened the tables-of-contents in our e-mail and found a new papers highlighting Next Gen Sequencing techniques, applications, or biological discoveries made through Next Gen techniques. To date, over 200 Next Gen publications have been produced. During the last two months alone more than 30 papers have been published. Some of these (listed in the figure below) were relevant to the proposals we were drafting.

The papers highlighted many of the themes we've touched on here, including the advantages of Next Gen sequencing and challenges with dealing with the data. As we are learning, these technologies allow us to explore the genome and genomics of systems biology at significantly higher resolutions than previously imagined. In one of the higher profile efforts, teams at the Washington University School of Medical and Genome Center compared a leukemia genome to a normal genome using cells from the same patient. This first intra-person whole genome analysis identified acquired mutations in ten genes, eight of which were new. Interestingly, the eight genes have unknown functions and might be important some day for new therapies.

Next Gen technologies are also confirming that molecular biology is more complicated than we thought. For example, the four most recent papers in Science show us that not only is 90% of the genome actively transcribed, but many genes have both sense and anti-sense RNA expressed. It is speculated that the anti-sense transcripts have a role in regulating gene expression. Also, we are seeing that nearly every gene produces alternatively spiced transcripts. The most recent papers indicate that between 92% and 97% of transcripts are alternatively spliced. My guess is that the only genes, not alternatively spliced are those lacking introns, like olfactory receptors. Although, when alternative transcription starts and alternative polyadenylation sites are considered, we may see that all genes are processed in multiple ways. It will be interesting to see how the products of alternative splicing and anti-sense transcription might interact.

This work has a number of take home messages.
  1. Like astronomy, when we can see deeper we see more. Next Gen technologies are giving us the means to interrogate large collections of individual RNA or DNA molecules and speculate more on functional consequences.
  2. Our limits are our imaginations. The reported experiments have used a variety of creative approaches to study genomic variation, sample expressed molecules from different strands of DNA, and measure protein DNA/RNA interaction.
  3. Good hands do good science. As pointed out in the paper from the Sanger Center on their implementation of Next Gen sequencing, the processes are complex and technically demanding. You need to have good laboratory practices with strong informatics support for all phases (laboratory, data management, and data analysis) of the Next Gen sequencing processes.
The final point is very important and Geospiza’s lab management and data analysis products will simplify your efforts in getting Next Gen systems running to make your major investment pay off and quickly publish results.

To see how, join us for a webinar next Wednesday, Dec. 17 at 10 am PDT, for RNA Expression Analysis with Geospiza.


Click on the figure to enlarge the text.

Labels: , , , , , ,

Thursday, November 20, 2008

Introducing GeneSifter

Today, Geospiza announced the acquisition of the award-winning GeneSifter microarray data analysis product. This news has significant implications for Geospiza’s current and new customers. With GeneSifter and FinchLab, Geospiza will deliver complete end to end systems for data intensive genetic analysis applications like Next Gen sequencing and microarrays.

As an example, let's consider transcriptomics or gene expression. One goal of such experiments is to compare the relative gene expression between cells to see how different genes are up or down regulated as the cells change over time or respond to some sort of treatment.

The general process, whether it involves microarrays or Next Gen sequencing, is to measure the number of RNA molecules for a given gene, either over a period of time or after different treatments. Laboratory processes create the molecules to assay, the molecules are measured, data are collected, and we process the data to produce tables of information. These tables are then compared with one another to identify genes that are differentially expressed. With the gene expression results in hand, one can delve deeper by utilizing other databases like Entrez Gene or pathway sites to learn about gene function and gain insights.

From a systems perspective, you need a LIMS to define sample information and keep track of workflow steps and the data generated at the bench. You will also need to track which samples are on a slide, or lane, or well when the data are collected. You will need to store and organize the data by sample. Then, you will need to analyze the data through multiple programs in a pipelined process (filter, align ...) to produce information, like gene lists, that can be compared for each sample. You may want to review this information to see that your experiments are on track and then, if they are, you will want to compare the gene lists from different experiments to tell a story.

FinchLab, combined with Geospiza’s hosted Software as a Service (SaaS) delivery, solves challenges related to IT, LIMS, and the core data analysis. GeneSifter completes the process by delivering a software solution that lets you compare your gene lists. GeneSifter provides information about the relative gene expression between samples and links gene information to key public resources to uncover additional details.

It's an exciting time for those in the genetic analysis and genomics fields. New high throughput data collection technologies are giving scientists the ability to interrogate systems and understand biology in a whole new way. As we come to the end of 2008 and think about 2009, Geospiza is excited to think about how we will integrate and extend our products to further develop end to end systems for a wide variety of genomics applications that target basic and clinical research to help us improve human health and well being.



Labels: , , ,